Feb 8, 2026

MUHAMMAD GHIFARY

In the world of 3D computer vision, reconstructing a detailed, animatable 3D face from a single 2D image is one of the “holy grail” problems. DECA (Detailed Expression Capture and Animation) is a reliable method that tackles this challenge with impressive results (Feng et al., 2021). It doesn’t just capture the coarse shape of the face; it also captures fine details like wrinkles that appear during specific expressions, and it does so in a way that allows the face to be re-animated.

Reconstructing detailed 3D faces from single images isn’t just a research curiosity. It has profound implications across several domains.

In this article, I’ll dissect how DECA works, exploring its architecture, the mathematical principles behind it, and the code that powers it.

The Core Idea: Coarse-to-Fine

DECA operates on a coarse-to-fine principle. It doesn’t try to solve everything at once. Instead, it breaks the problem down into two stages:

  1. Coarse Reconstruction: Estimate the underlying head shape, head pose, and facial expression using a statistical 3D face model (FLAME).
  2. Detail Reconstruction: Predict a person-specific detail map (displacement map) that adds high-frequency details (like forehead wrinkles or crow’s feet) to the coarse mesh.
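
The two stages can be sketched end to end with toy stand-ins. Everything below is a simplified illustration, not DECA’s real networks: `coarse_branch`, `flame_mesh`, and `detail_branch` are hypothetical placeholders that only mimic the shapes of the data flowing through the pipeline.

```python
import numpy as np

def coarse_branch(image):
    """Stand-in for DECA's coarse encoder: regresses low-dimensional
    FLAME parameters (shape, pose, expression) from the image."""
    rng = np.random.default_rng(0)
    return rng.normal(size=100), rng.normal(size=6), rng.normal(size=50)

def flame_mesh(shape, pose, expression, n_vertices=5023):
    """Stand-in for FLAME: maps parameters to an (N, 3) vertex array
    via a linear blendshape combination (pose handling omitted)."""
    rng = np.random.default_rng(1)
    basis = rng.normal(size=(len(shape) + len(expression), n_vertices, 3))
    coeffs = np.concatenate([shape, expression])
    return np.tensordot(coeffs, basis, axes=1)

def detail_branch(image, n_vertices=5023):
    """Stand-in for the detail decoder: one displacement value per
    vertex, applied along the surface normal."""
    return 0.001 * np.ones(n_vertices)

image = np.zeros((224, 224, 3))
shape, pose, expr = coarse_branch(image)
coarse = flame_mesh(shape, pose, expr)                 # Stage 1: coarse mesh
normals = np.tile([0.0, 0.0, 1.0], (len(coarse), 1))   # stand-in normals
detailed = coarse + detail_branch(image)[:, None] * normals  # Stage 2: detail
print(coarse.shape, detailed.shape)  # (5023, 3) (5023, 3)
```

The vertex count 5023 matches FLAME’s mesh; in the real system, the detail branch predicts a UV displacement map rather than a per-vertex vector, but the idea is the same: fine geometry is added on top of the coarse reconstruction, not estimated jointly with it.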


The FLAME Model

At the heart of DECA is the FLAME (Faces Learned with an Articulated Model and Expressions) model (Li et al., 2017). FLAME is a parametric model, meaning it generates a 3D mesh based on a set of low-dimensional parameters.

Mathematically, a 3D mesh $M$ with $N$ vertices is generated as:

$$ M(\beta, \theta, \psi) = W \left(T_P(\beta, \theta, \psi), J(\beta), \theta, \mathcal{W} \right) $$

Here $\beta$, $\theta$, and $\psi$ are the shape, pose, and expression parameters; $T_P$ is the template mesh with shape, expression, and pose-corrective blendshape offsets added; $J(\beta)$ regresses the joint locations from the shaped template; and $W$ is a standard linear blend skinning (LBS) function that rotates the vertices around the joints, smoothed by blendweights $\mathcal{W}$.
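
To make the equation concrete, here is a minimal numpy sketch of a FLAME-like forward pass. It is an assumption-laden simplification: `flame_like` and all its array shapes are illustrative, and the pose-corrective blendshapes of the real model are omitted, so $T_P$ reduces to the template plus linear shape and expression offsets.

```python
import numpy as np

def rodrigues(axis_angle):
    """Axis-angle vector -> 3x3 rotation matrix (Rodrigues' formula)."""
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-8:
        return np.eye(3)
    k = axis_angle / angle
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * K @ K

def flame_like(beta, theta, psi, template, S, E, J_reg, weights):
    """Simplified M(beta, theta, psi): blendshapes, joint regression,
    then linear blend skinning (pose correctives omitted)."""
    # T_P: template plus linear shape (S) and expression (E) offsets
    T = template + np.tensordot(beta, S, axes=1) + np.tensordot(psi, E, axes=1)
    joints = J_reg @ T                       # J(beta): joints from the shaped mesh
    rots = [rodrigues(t) for t in theta]     # one rotation per joint
    # W(...): blend per-joint rigid transforms with skinning weights
    M = np.zeros_like(T)
    for k, (R, j) in enumerate(zip(rots, joints)):
        M += weights[:, k:k+1] * ((T - j) @ R.T + j)
    return M

# Tiny toy instance: 4 vertices, 2 joints, 2 shape and 2 expression coeffs.
rng = np.random.default_rng(0)
N, K, B, Ex = 4, 2, 2, 2
template = rng.normal(size=(N, 3))
S = rng.normal(size=(B, N, 3))           # shape blendshape basis
E = rng.normal(size=(Ex, N, 3))          # expression blendshape basis
J_reg = np.full((K, N), 1.0 / N)         # each joint = mean of vertices
weights = np.full((N, K), 1.0 / K)       # uniform skinning weights
beta, psi = rng.normal(size=B), rng.normal(size=Ex)
M = flame_like(beta, np.zeros((K, 3)), psi, template, S, E, J_reg, weights)
```

With a zero pose $\theta$, every joint rotation is the identity and (since the skinning weights sum to one per vertex) the skinned mesh equals the blendshaped template, which is a useful sanity check when implementing this yourself. The real FLAME has 5023 vertices, 4 joints (neck, jaw, and two eyeballs), and learned rather than uniform blendweights.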