Face transfer with multilinear models
Summary (4 min read)
1 Introduction
- Their system can either rewrite the original footage with adjusted expressions and visemes or transfer the performance to a different face in different footage.
- This paper describes a general, controllable, and practical system for facial animation.
- In principle, given a large and varied data set, the model can generate any face, any expression, any viseme.
- Existing estimation algorithms require perfect one-to-one correspondence between all meshes, and a mesh for every possible combination of expression, viseme, and identity.
3 Multilinear Algebra
- In this section the authors review the basic multilinear-algebra concepts needed to understand their Face Transfer system.
- De Lathauwer’s dissertation [1997] provides a comprehensive treatment of this topic.
- The basic mathematical object of multilinear algebra is the tensor, a natural generalization of vectors (1st order tensors) and matrices (2nd order tensors) to multiple indices.
- Viewing the data as a set of d1-dimensional vectors stored parallel to the first axis, the authors can define the mode-1 space as the span of those vectors.
- One can obtain a better approximation by further refining the Ǔi’s and the reduced core tensor via alternating least squares [De Lathauwer 1997] (see the sketch after this list).
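To make the decomposition concrete, here is a minimal numpy sketch of a truncated N-mode SVD followed by an alternating-least-squares (HOOI-style) refinement pass. The function names (`unfold`, `mode_multiply`, `hosvd`, `refine`) and the specific update schedule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a truncated N-mode SVD (HOSVD) with ALS refinement.
import numpy as np

def unfold(T, mode):
    """Matricize tensor T along `mode` (rows indexed by that mode)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    """Mode-n product: multiply matrix M into tensor T along `mode`."""
    Tm = M @ unfold(T, mode)
    new_shape = (M.shape[0],) + tuple(np.delete(T.shape, mode))
    return np.moveaxis(Tm.reshape(new_shape), 0, mode)

def hosvd(T, ranks):
    """Truncated factor matrices U[i] and core C with the given ranks."""
    U = []
    for mode, r in enumerate(ranks):
        # Leading r left singular vectors of the mode-n unfolding.
        u, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        U.append(u[:, :r])
    C = T
    for mode, u in enumerate(U):
        C = mode_multiply(C, u.T, mode)   # project onto the mode subspaces
    return U, C

def refine(T, U, n_iters=5):
    """Alternating least squares: re-fit each U[i] with the others fixed."""
    for _ in range(n_iters):
        for mode in range(T.ndim):
            Y = T
            for m, u in enumerate(U):
                if m != mode:
                    Y = mode_multiply(Y, u.T, m)
            u, _, _ = np.linalg.svd(unfold(Y, mode), full_matrices=False)
            U[mode] = u[:, :U[mode].shape[1]]
    C = T
    for mode, u in enumerate(U):
        C = mode_multiply(C, u.T, mode)
    return U, C
```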
4.1 Face Data
- The authors demonstrate their proof-of-concept system on two separate face models: a bilinear model, and a trilinear model.
- Both were estimated from detailed 3D scans (∼ 30K vertices) acquired with 3dMD/3Q’s structured light scanner (http://www.3dmd.com/) in a process similar to regular flash photography, although their methods would apply equally to other geometric data sets such as motion capture.
- The subject pool included men, women, Caucasians, and Asians, from the mid-20s to mid-50s.
- 16 subjects were asked to perform 5 visemes in 5 different expressions (neutral, smiling, scowling, surprised, and sad).
- The resulting fourth order (4-mode) data tensor (30K vertices × 5 visemes × 5 expressions × 16 identities) was decomposed to yield a trilinear model providing 4 knobs for viseme, 4 for expression, and 16 for identity (the authors have kept the number of knobs large since their data sets were small).
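For concreteness, the shapes involved in the trilinear case are sketched below, under the assumption that the x, y, z coordinates of the ~30K vertices are flattened into mode 1; all names are illustrative.

```python
# Shape bookkeeping only (assumed layout: x,y,z coordinates flattened into mode 1).
T_shape = (3 * 30_000, 5, 5, 16)   # vertices x visemes x expressions x identities
ranks   = (4, 4, 16)               # knobs kept for viseme, expression, identity
M_shape = (T_shape[0],) + ranks    # model left unfactored along the vertex mode
print(M_shape)                     # (90000, 4, 4, 16)
```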
4.2 Correspondence
- Training meshes that are not placed in perfect correspondence can considerably muddle the question of how to displace vertices to change one attribute versus another (e.g. identity versus expression), and thus the multilinear analysis may not give a model with good separability.
- Despite recent work on mesh parameterization (e.g., [Praun and Hoppe 2003; Gotsman et al. 2003]), it took considerable experimentation to place many facial scans into detailed correspondence.
- The optimization objective, minimized with gradient descent, balances overall surface similarity, proximity of manually selected feature points on the two surfaces, and proximity of reference vertices to the nearest point on the scanned surface (a sketch of these terms follows this list).
- For the trilinear model, the remaining m-viseme scans were marked with 21 features around eyebrows and lips, rigidly aligned to upper-face geometry on the appropriate neutral scans, and then non-rigidly put into correspondence as above.
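A highly simplified sketch of such an objective is given below. The exact form of the paper's surface-similarity term is not reproduced, so the shape-preservation proxy, the weights, and the function names are all assumptions.

```python
# Sketch of the three correspondence-energy terms, evaluated for a deformed
# reference mesh X (n x 3) against a scanned surface.
import numpy as np

def objective(X, X0, scan_points, feat_idx, feat_targets,
              w_shape=1.0, w_feat=10.0, w_fit=1.0):
    # 1) "overall surface similarity" stand-in: keep the deformed reference X
    #    close to its undeformed shape X0 (simplified proxy for the real term).
    shape_term = np.sum((X - X0) ** 2)

    # 2) proximity of manually selected feature vertices to the marked
    #    feature points on the scan.
    feat_term = np.sum((X[feat_idx] - feat_targets) ** 2)

    # 3) proximity of every reference vertex to the nearest point on the scan
    #    (brute force here; a spatial data structure would be used in practice).
    d = np.linalg.norm(X[:, None, :] - scan_points[None, :, :], axis=2)
    fit_term = np.sum(d.min(axis=1) ** 2)

    return w_shape * shape_term + w_feat * feat_term + w_fit * fit_term
```

A generic gradient-based optimizer (for instance scipy.optimize.minimize over the flattened vertex positions) could drive an objective of this form.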
4.3 Face Model
- Equation (3) shows how to approximate the data tensor by mode-multiplying a smaller core tensor with a number of truncated orthogonal matrices.
- Since their goal is to output vertices as a function of attribute parameters, the authors can decompose the data tensor without factoring along the mode that corresponds to vertices (mode-1), changing Equation (3) to T ≈ M ×2 Ǔ2 ×3 Ǔ3 · · · ×N ǓN (Equation 4), where M can now be called the multilinear model of face geometry.
- Mode-multiplying M with Ǔi’s approximates the original data.
- In particular, mode-multiplying it with one row from each Ǔi reconstructs exactly one original face (the one corresponding to the attribute parameters contained in that row).
- Therefore, to generate an arbitrary interpolation (or extrapolation) of original faces, the authors can mode-multiply the model with a linear combination of rows for each Ǔi.
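Concretely, synthesizing one face amounts to contracting the model tensor with one weight vector per attribute mode, as in the sketch below; the shapes mirror the trilinear model from Section 4.1, but the small vertex count and all values are stand-ins.

```python
# Generate a face from the multilinear model by mode-multiplying with one
# weight vector per attribute (cf. Equations 4/5). Illustrative values only.
import numpy as np

V = 3 * 1_000                          # stand-in for the real 3 x 30K coordinates
M = np.random.rand(V, 4, 4, 16)        # stand-in trilinear model (mode 1 unfactored)

w_vis = np.zeros(4);  w_vis[0] = 1.0               # one viseme row
w_exp = np.array([0.5, 0.5, 0.0, 0.0])             # blend of two expression rows
w_id  = np.zeros(16); w_id[3] = 1.0                # one identity row

# f = M x2 w_vis x3 w_exp x4 w_id : one vertex-coordinate vector
face = np.einsum('vijk,i,j,k->v', M, w_vis, w_exp, w_id)
print(face.shape)                      # (3000,)
```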
4.4 Missing Data
- Building the multilinear model from a set of face scans requires capturing the full Cartesian product of different face attributes, (i.e., all expressions and visemes need to be captured for each person).
- To that end, the authors collect the linear equations that determine a particular missing value in all the modes, and solve them together (a simplified imputation sketch follows this list).
- A missing face (e.g., a missing smile) could simply be copied from another identity; because the data set includes smiles for more than one person, the authors instead copy their average.
- Filling in missing data according to this model is computationally expensive.
- Instead, the authors approximate the true likelihood p(T | M, Ǔ2, …, ǓN) with a geometric average of per-mode Gaussians.
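The sketch below conveys the flavor of imputation with a simplified, one-mode low-rank loop: missing slots are initialized with the average over the other identities and then repeatedly re-estimated from a low-rank fit of the identity-mode unfolding. The authors' actual method collects the linear equations from all modes and solves them jointly, which this stand-in does not reproduce.

```python
# Simplified missing-data imputation (not the authors' joint multi-mode solve).
import numpy as np

def fill_missing(T, missing, rank=15, n_iters=10):
    """T: (V, n_visemes, n_expressions, n_identities) data tensor;
    missing: list of (viseme, expression, identity) triples with no scan."""
    T = T.copy()
    for v, e, i in missing:                       # initialize with the average
        others = [j for j in range(T.shape[3]) if j != i]   # of other identities
        T[:, v, e, i] = T[:, v, e, others].mean(axis=-1)
    for _ in range(n_iters):
        A = T.reshape(-1, T.shape[3])             # identity-mode unfolding
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        R = ((U[:, :rank] * s[:rank]) @ Vt[:rank]).reshape(T.shape)
        for v, e, i in missing:                   # overwrite only the missing slots
            T[:, v, e, i] = R[:, v, e, i]
    return T
```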
5 Face Transfer
- One produces animations from a multilinear model by varying the attribute parameters (the elements of the wi’s) as if they were dials, and generating mesh coordinates from Equation 5.
- The N-mode SVD conveniently gives groups of dials that separately control identity, expression and viseme.
- The dials can be “tuned” to reflect deformations of interest through a linear transform of each wi (see the sketch after this list).
- A similar linear scheme was employed in [Blanz and Vetter 1999].
- To give similar power to a casual user, the authors have devised a method that automatically sets model parameters from given video data.
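One way to realize such a linear "tuning" of the weights is to fit an affine map from intuitive dial values to weight vectors by least squares. The paper states only that a linear transform of each wi is used, so the fitting recipe, the example numbers, and the variable names below are all illustrative.

```python
# Fit an affine map w(d) = d @ A + b from an intuitive dial to expression weights.
import numpy as np

# Example (made-up) pairs of dial settings and the expression weights the model
# assigned to faces exhibiting those settings.
dials   = np.array([[0.0], [0.5], [1.0]])          # 0 = neutral, 1 = full smile
weights = np.array([[0.9, 0.1, 0.0, 0.0],
                    [0.5, 0.4, 0.1, 0.0],
                    [0.1, 0.7, 0.2, 0.0]])

D = np.hstack([dials, np.ones((len(dials), 1))])   # append bias column
Ab, *_ = np.linalg.lstsq(D, weights, rcond=None)   # one least-squares solve
A, b = Ab[:-1], Ab[-1]

w_half_smile = np.array([0.5]) @ A + b             # weights for dial = 0.5
```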
5.1 Face Tracking
- To link the parameters of a multilinear model to video data, the authors use optical flow in conjunction with the weak-perspective camera model.
- Matrix Z and vector e contain spatial and temporal intensity gradient information in the surrounding region [Birchfield 1996].
- If the currently tracked attribute varies from frame to frame (such as expression does), the authors solve the set of linear systems and proceed to the next pair of neighboring frames.
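The structure of one such per-frame solve is sketched below. Z and e stack the spatial and temporal intensity-gradient information around the tracked vertices [Birchfield 1996]; J stands for the Jacobian of the projected vertex positions with respect to the currently tracked attribute weights (obtained by mode-multiplying the model with the other, fixed weight vectors and applying the weak-perspective projection). The shapes, names, and damping are assumptions; the paper's exact formulation differs in detail.

```python
# One damped Gauss-Newton style update for the attribute tracked in this frame.
import numpy as np

def update_weights(Z, e, J, w, damping=1e-3):
    """Solve the per-frame linear system for a weight increment and apply it."""
    A = Z @ J                                   # image gradients chained with model Jacobian
    H = A.T @ A + damping * np.eye(A.shape[1])  # damped normal equations
    dw = np.linalg.solve(H, A.T @ e)
    return w + dw
```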
5.2 Initialization
- The method described above, since it is based on tracking, needs to be initialized with an alignment for the first frame (the pose and all the weights of the multilinear model).
- The authors accomplish this by specifying a small number of feature points, which are then used to position the face geometry (see the sketch after this list).
- The correspondences can be either user-provided (which gives more flexibility and power) or automatically detected (which avoids user intervention).
- The authors have experimented with the automatic feature detector of Viola and Jones [2001], and found that it is robust and precise enough in locating a number of key features (eye corners, nose tip, mouth corners) to give a good approximate alignment in most cases.
- Imperfect alignment can be improved by tracking the first few frames back and forth until the model snaps into a better location.
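A minimal sketch of one part of such an initialization is given below: recovering a weak-perspective pose from a handful of 2D feature detections and the corresponding 3D vertices of the current model mesh, via a linear least-squares fit of a 2×4 affine camera. The formulation and names are assumptions; in the full system the model weights would also be refined, as described above.

```python
# Weak-perspective pose from sparse 2D-3D feature correspondences.
import numpy as np

def weak_perspective_pose(pts3d, pts2d):
    """pts3d: (n, 3) model vertices; pts2d: (n, 2) detected image locations."""
    X = np.hstack([pts3d, np.ones((len(pts3d), 1))])   # homogeneous 3D points
    # Solve X @ P.T ~= pts2d for the 2x4 projection matrix P.
    P_T, *_ = np.linalg.lstsq(X, pts2d, rcond=None)
    return P_T.T                                        # P: (2, 4)

def project(P, pts3d):
    X = np.hstack([pts3d, np.ones((len(pts3d), 1))])
    return X @ P.T                                      # (n, 2) image positions
```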
6 Results
- Multilinear models provide convenient control of facial attributes.
- Face Transfer infers the attribute parameters automatically by tracking the face in a video.
- Because their system tracks approximately a thousand vertices, the process is less sensitive to localized intensity changes (e.g., around the furrow above the lip).
- In Figure 7B, the authors use the bilinear model to change a person’s identity while retaining the expressions from the original performance.
- From left to right, the authors present the original video frame, a frame from a new video, the new geometry, and the final modified frame (without blending).
7 Discussion
- Perhaps their most remarkable empirical result is that even with a model estimated from a rather tiny data set, the authors can produce videorealistic results for new source and target subjects.
- The authors also see algorithmic opportunities to make aspects of the system more automatic and robust.
- Correspondence between scans might be improved with some of the methods shown in [Kraevoy and Sheffer 2004].
- In a production setting, the scan data would need to be expanded to contain shape and texture information for the ears, neck, and hair, so that the authors can make a larger range of head pose changes.
- Finally, the texture function lifted from video is performance specific, in that the authors made no effort to remove variations due to lighting.
8 Conclusion
- The model is multilinear, and thus has the key property of separability: different attributes, such as identity and expression, can be manipulated independently.
- Thus the authors can change the identity and expression, but keep the smile.
- What makes this multilinear model a practical tool for animation is that the authors connect it directly to video, showing how to recover a time-series of poses and attribute parameters (expressions and visemes), plus a performance-driven texture function for an actor’s face.
- In addition, the model offers a rich source of synthetic actors that can be controlled via video.
- An intriguing prospect is that one could now build a multilinear model representing a vertex×identity×expression×viseme×age data tensor—without having to capture each individual’s face at every stage of their life.