Journal ArticleDOI

Face transfer with multilinear models

TL;DR: Face Transfer is a method for mapping videorecorded performances of one individual to facial animations of another, based on a multilinear model of 3D face meshes that separably parameterizes the space of geometric variations due to different attributes.
Abstract: Face Transfer is a method for mapping videorecorded performances of one individual to facial animations of another. It extracts visemes (speech-related mouth articulations), expressions, and three-dimensional (3D) pose from monocular video or film footage. These parameters are then used to generate and drive a detailed 3D textured face mesh for a target identity, which can be seamlessly rendered back into target footage. The underlying face model automatically adjusts for how the target performs facial expressions and visemes. The performance data can be easily edited to change the visemes, expressions, pose, or even the identity of the target---the attributes are separably controllable. This supports a wide variety of video rewrite and puppetry applications. Face Transfer is based on a multilinear model of 3D face meshes that separably parameterizes the space of geometric variations due to different attributes (e.g., identity, expression, and viseme). Separability means that each of these attributes can be independently varied. A multilinear model can be estimated from a Cartesian product of examples (identities × expressions × visemes) with techniques from statistical analysis, but only after careful preprocessing of the geometric data set to secure one-to-one correspondence, to minimize cross-coupling artifacts, and to fill in any missing examples. Face Transfer offers new solutions to these problems and links the estimated model with a face-tracking algorithm to extract pose, expression, and viseme parameters.

Summary (4 min read)

1 Introduction

  • Their system can either rewrite the original footage with adjusted expressions and visemes or transfer the performance to a different face in a different footage.
  • This paper describes a general, controllable, and practical system for facial animation.
  • In principle, given a large and varied data set, the model can generate any face, any expression, any viseme.
  • Existing estimation algorithms require perfect one-to-one correspondence between all meshes, and a mesh for every possible combination of expression, viseme, and identity.

3 Multilinear Algebra

  • In this section the authors provide insight behind the basic concepts needed for understanding of their Face Transfer system.
  • De Lathauwer’s dissertation [1997] provides a comprehensive treatment of this topic.
  • The basic mathematical object of multilinear algebra is the tensor, a natural generalization of vectors (1st order tensors) and matrices (2nd order tensors) to multiple indices.
  • Viewing the data as a set of d1-dimensional vectors stored parallel to the first axis , the authors can define the mode-1 space as the span of those vectors.
  • One can obtain a better approximation with further refinement of the truncated matrices Ǔi and the reduced core tensor Creduced via alternating least squares [De Lathauwer 1997].

4.1 Face Data

  • The authors demonstrate their proof-of-concept system on two separate face models: a bilinear model, and a trilinear model.
  • Both were estimated from detailed 3D scans (∼ 30K vertices) acquired with 3dMD/3Q’s structured light scanner (http://www.3dmd.com/) in a process similar to regular flash photography, although their methods would apply equally to other geometric data sets such as motion capture.
  • The subject pool included men, women, Caucasians, and Asians, from the mid-20s to mid-50s.
  • 16 subjects were asked to perform 5 visemes in 5 different expressions (neutral, smiling, scowling, surprised, and sad).
  • The resulting fourth order (4-mode) data tensor (30K vertices × 5 visemes × 5 expressions × 16 identities) was decomposed to yield a trilinear model providing 4 knobs for viseme, 4 for expression, and 16 for identity (the authors have kept the number of knobs large since their data sets were small).

4.2 Correspondence

  • Training meshes that are not placed in perfect correspondence can considerably muddle the question of how to displace vertices to change one attribute versus another (e.g. identity versus expression), and thus the multilinear analysis may not give a model with good separability.
  • Despite rapid advances in automatic parameterization of meshes (e.g., [Praun and Hoppe 2003; Gotsman et al. 2003]), it took considerable experimentation to place many facial scans into detailed correspondence.
  • The optimization objective, minimized with gradient descent, balances overall surface similarity, proximity of manually selected feature points on the two surfaces, and proximity of reference vertices to the nearest point on the scanned surface.
  • For the trilinear model, the remaining m-viseme scans were marked with 21 features around eyebrows and lips, rigidly aligned to upper-face geometry on the appropriate neutral scans, and then non-rigidly put into correspondence as above.

4.3 Face Model

  • Equation (3) shows how to approximate the data tensor by modemultiplying a smaller core tensor with a number of truncated orthogonal matrices.
  • Since their goal is to output vertices as a function of attribute parameters, the authors can decompose the data tensor without factoring along the mode that corresponds to vertices (mode-1), changing Equation (3) to T ≈ M ×2 Ǔ2 ×3 Ǔ3 ··· ×N ǓN (Equation 4), where M can now be called the multilinear model of face geometry.
  • Mode-multiplying M with Ǔi’s approximates the original data.
  • In particular, mode-multiplying it with one row from each Ǔi reconstructs exactly one original face (the one corresponding to the attribute parameters contained in that row).
  • Therefore, to generate an arbitrary interpolation (or extrapolation) of original faces, the authors can mode-multiply the model with a linear combination of rows for each Ǔi.

4.4 Missing Data

  • Building the multilinear model from a set of face scans requires capturing the full Cartesian product of different face attributes, (i.e., all expressions and visemes need to be captured for each person).
  • To that end, the authors collect the linear equations that determine a particular missing value in all the modes, and solve them together.
  • Because their data set includes smiles for more than one person, the authors extend that approach to copy their average.
  • Filling in missing data according to this model is computationally expensive.
  • Instead, the authors approximate the true likelihood with a geometric average of Gaussians, p(T | M, {Ǔi}) ≈ ∏j=2..N qj(T, M, {Ǔi})^(1/N).

5 Face Transfer

  • One produces animations from a multilinear model by varying the attribute parameters (the elements of the wi’s) as if they were dials, and generating mesh coordinates from Equation 5.
  • The N-mode SVD conveniently gives groups of dials that separately control identity, expression and viseme.
  • The dials can be “tuned” to reflect deformations of interest through a linear transform of each wi.
  • A similar linear scheme was employed in [Blanz and Vetter 1999].
  • To give similar power to a casual user, the authors have devised a method that automatically sets model parameters from given video data.

5.1 Face Tracking

  • To link the parameters of a multilinear model to video data, the authors use optical flow in conjunction with the weak-perspective camera model.
  • Matrix Z and vector e contain spatial and temporal intensity gradient information in the surrounding region [Birchfield 1996].
  • If the currently tracked attribute varies from frame to frame (such as expression does), the authors solve the set of linear systems and proceed to the next pair of neighboring frames.

5.2 Initialization

  • The method described above, since it is based on tracking, needs to be initialized with the first frame alignment (pose and all the weights of the multilinear model).
  • The authors accomplish this by specifying a small number of feature points which are then used to position the face geometry (a weak-perspective pose sketch from such 2D-3D feature correspondences follows this list).
  • The correspondences can be either user-provided (which gives more flexibility and power) or automatically detected (which avoids user intervention).
  • The authors have experimented with the automatic feature detector developed by [Viola and Jones 2001], and found that it is robust and precise enough in locating a number of key features (eye corners, nose tip, mouth corners) to give a good approximating alignment in most cases.
  • Imperfect alignment can be improved by tracking the first few frames back and forth until the model snaps into a better location.
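
Below is a minimal sketch, under our own naming and assumptions (an illustration of the standard technique, not the authors' implementation), of how such a first-frame alignment could be computed: given the 2D locations of a few detected features and the corresponding 3D vertices of the current face model, the weak-perspective pose (scale s, the first two rows of the rotation R, and translation t) follows from a linear least-squares fit.

```python
import numpy as np

def weak_perspective_pose(X3d, x2d):
    """Estimate weak-perspective pose (s, R, t) from 2D-3D feature correspondences.

    X3d -- (n, 3) model vertices at the chosen features (eye corners, nose tip, mouth corners)
    x2d -- (n, 2) detected image locations of the same features; n >= 4, not all coplanar
    """
    n = X3d.shape[0]
    A = np.zeros((2 * n, 8))
    A[0::2, 0:3] = X3d; A[0::2, 6] = 1.0    # x_i = p1 . X_i + t_x
    A[1::2, 3:6] = X3d; A[1::2, 7] = 1.0    # y_i = p2 . X_i + t_y
    b = x2d.reshape(-1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    P, t = sol[:6].reshape(2, 3), sol[6:]   # P = s * (first two rows of R)
    s = 0.5 * (np.linalg.norm(P[0]) + np.linalg.norm(P[1]))
    U, _, Vt = np.linalg.svd(P / s, full_matrices=False)
    R = U @ Vt                              # re-orthonormalize the two rotation rows
    return s, R, t
```

The model weights themselves would then be refined by tracking, as described in Section 5.1.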

6 Results

  • Multilinear models provide a convenient control of facial attributes.
  • Face Transfer infers the attribute parameters automatically by tracking the face in a video.
  • Because their system tracks approximately a thousand vertices, the process is less sensitive to localized intensity changes (e.g., around the furrow above the lip).
  • In Figure 7B, the authors use the bilinear model to change a person’s identity while retaining the expressions from the original performance.
  • From left to right, the authors present the original video frame, a frame from a new video, the new geometry, and the final modified frame (without blending).

7 Discussion

  • Perhaps their most remarkable empirical result is that even with a model estimated from a rather tiny data set, the authors can produce videorealistic results for new source and target subjects.
  • The authors also see algorithmic opportunities to make aspects of the system more automatic and robust.
  • Correspondence between scans might be improved with some of the methods shown in [Kraevoy and Sheffer 2004].
  • In a production setting, the scan data would need to be expanded to contain shape and texture information for the ears, neck, and hair, so that the authors can make a larger range of head pose changes.
  • Finally, the texture function lifted from video is performance specific, in that the authors made no effort to remove variations due to lighting.

8 Conclusion

  • The model is multilinear, and thus has the key property of separability: different attributes, such as identity and expression, can be manipulated independently.
  • Thus the authors can change the identity and expression, but keep the smile.
  • What makes this multilinear model a practical tool for animation is that the authors connect it directly to video, showing how to recover a time-series of poses and attribute parameters (expressions and visemes), plus a performance-driven texture function for an actor’s face.
  • In addition, the model offers a rich source of synthetic actors that can be controlled via video.
  • An intriguing prospect is that one could now build a multilinear model representing a vertex×identity×expression×viseme×age data tensor—without having to capture each individual’s face at every stage of their life.


To appear in SIGGRAPH 2005.
Face Transfer with Multilinear Models
Daniel Vlasic   Matthew Brand   Hanspeter Pfister   Jovan Popović
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Mitsubishi Electric Research Laboratories
Figure 1: Face Transfer with multilinear models gives animators decoupled control over facial attributes such as identity, expression, and
viseme. In this example, we combine pose and identity from the first frame, surprised expression from the second, and a viseme (mouth
articulation for a sound midway between ”oo” and ”ee”) from the third. The resulting composite is blended back into the original frame.
Abstract
Face Transfer is a method for mapping videorecorded perfor-
mances of one individual to facial animations of another. It ex-
tracts visemes (speech-related mouth articulations), expressions,
and three-dimensional (3D) pose from monocular video or film
footage. These parameters are then used to generate and drive a
detailed 3D textured face mesh for a target identity, which can be
seamlessly rendered back into target footage. The underlying face
model automatically adjusts for how the target performs facial ex-
pressions and visemes. The performance data can be easily edited
to change the visemes, expressions, pose, or even the identity of
the target—the attributes are separably controllable. This supports
a wide variety of video rewrite and puppetry applications.
Face Transfer is based on a multilinear model of 3D face meshes
that separably parameterizes the space of geometric variations due
to different attributes (e.g., identity, expression, and viseme). Sep-
arability means that each of these attributes can be independently
varied. A m ultilinear model can be estimated from a Cartesian
product of examples (identities × expressions × visemes) with
techniques from statistical analysis, but only after careful pre-
processing of the geometric data set to secure one-to-one corre-
spondence, to minimize cross-coupling artifacts, and to fill in any
missing examples. Face Transfer offers new solutions to these prob-
lems and links the estimated model with a face-tracking algorithm
to extract pose, expression, and viseme parameters.
CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional
Graphics and Realism—Animation; I.4.9 [Image Processing and
Computer Vision]: Applications;
Keywords: Facial Animation, Computer Vision—Tracking
MIT CSAIL, The Stata Center, 32 Vassar Street, Cambridge, MA
02139, USA
1 Introduction
Performance-driven animation has a growing role in film produc-
tion because it allows actors to express content and mood naturally,
and because the resulting animations have a degree of realism that
is hard to obtain from synthesis methods [Robertson 2004]. The
search for the highest quality motions has led to complex, expen-
sive, and hard-to-use systems. This paper introduces new tech-
niques for producing compelling facial animations that are inexpen-
sive, practical, versatile, and well suited for editing performances
and retargeting to new characters.
Face Transfer extracts performances from ordinary video
footage, allowing the transfer of facial action of actors who are
unavailable for detailed measurement, instrumentation, or for re-
recording with specialized scanning equipment. Expressions,
visemes (speech-related mouth articulations), and head motions are
extracted automatically, along with a performance-driven texture
function. With this information in hand, our system can either
rewrite the original footage with adjusted expressions and visemes
or transfer the performance to a different face in a different footage.
Multilinear models are ideally suited for this application because
they can describe face variations with separable attributes that can
be estimated from video automatically. In this paper, we estimate
such a model from a data set of three-dimensional (3D) face scans
that vary according to expression, viseme, and identity. The multi-
linear model decouples the three attributes (i.e., identity or viseme
can be varied while expression remains constant) and encodes them
consistently. Thus the attribute vector that encodes a smile for one
person encodes a smile for every face spanned by the model, re-
gardless of identity or viseme. Yet the model captures the fact that
every person smiles in a slightly different way. Separability and
consistency are the key properties that enable the transfer of a per-
formance from one face to another without a change in content.
Contributions. This paper describes a general, controllable, and
practical system for facial animation. It estimates a multilinear
model of human faces by examining geometric variations between
3D face scans. In principle, given a large and varied data set, the
model can generate any face, any expression, any viseme. As proof
of concept, we estimate the model from a couple of geometric data
sets: one with 15 identities and 10 expressions, and another with
16 identities, 5 expressions, and 5 visemes. Existing estimation
algorithms require perfect one-to-one correspondence between all
meshes, and a mesh for every possible combination of expression,
viseme, and identity. Because acquiring the full Cartesian product
of meshes and putting them into dense correspondence is extremely
difficult, this paper introduces methods for populating the Cartesian
product from a sparse sampling of faces, and for placing unstruc-
tured face scans into correspondence with minimal cross-coupling
artifacts.
By linking the multilinear model to optical flow, we obtain a
single-camera tracker that estimates performance parameters and
detailed 3D geometry from video recordings. The model defines a
mapping from performance parameters back to 3D shape, thus we
can arbitrarily mix pose, identity, expressions, and visemes from
two or more videos and render the result back into a target video.
As a result, the system provides an intuitive interface for both an-
imators (via separably controllable attributes) and performers (via
acting). And because it does not require performers to wear visible
facial markers or to be recorded by special face-scanning equip-
ment, it is an inexpensive and easy-to-use facial animation system.
2 Related Work
Realistic facial animation remains a fundamental challenge in com-
puter graphics. Beginning with Parke’s pioneering work [1974],
desire for improved realism has driven researchers to extend geo-
metric models [Parke 1982] with physical models of facial anatomy
[Waters 1987; Lee et al. 1995] and to combine them with non-linear
finite element methods [Koch et al. 1996] in systems that could be
used for planning facial surgeries. In parallel, Williams presented a
compelling argument [1990] in favor of performance-driven facial
animation, which anticipated techniques for tracking head motions
and facial expressions in video [Li et al. 1993; Essa et al. 1996;
DeCarlo and Metaxas 1996; Pighin et al. 1999]. A more expensive
alternative could use a 3D scanning technique [Zhang et al. 2004],
if the performance can be re-recorded with such a system.
Much of the ensuing work on face estimation and tracking re-
lied on the observation that variation in faces is well approximated
by a linear subspace of low dimension [Sirovich and Kirby 1987].
These techniques estimate either linear coefficients for known basis
shapes [Bascle and Blake 1998; Brand and Bhotika 2001] or both
the basis shapes and the coefficients, simultaneously [Bregler et al.
2000; Torresani et al. 2001]. In computer graphics, the combina-
tion of accurate 3D geometry with linear texture models [Pighin
et al. 1998; Blanz and Vetter 1999] produced striking results. In
addition, Blanz and Vetter [1999] presented a process for estimat-
ing the shape of a face in a single photograph, and a set of controls
for intuitive manipulation of appearance attributes (thin/fat, femi-
nine/masculine).
These and other estimation techniques share a common chal-
lenge of decoupling the attributes responsible for observed varia-
tions. As an early example, Pentland and Sclaroff estimate geom-
etry of deformable objects by decoupling linear elastic equations
into orthogonal vibration modes [1991]. In this case, modal analy-
sis uses eigen decomposition to compute the independent vibration
modes. Similar factorizations are also relied upon to separate vari-
ations due to pose and lighting, pose and expression, identity and
lighting, or style and content in general [Freeman and Tenenbaum
1997; Bregler et al. 2000; DeCarlo and Metaxas 2000; Georghiades
et al. 2001; Cao et al. 2003].
A technical limitation of these formulations is that each pair
of factors must be considered in isolation; they cannot easily de-
couple variations due to a combination of more than two factors.
The extension of such two-mode analysis to more modes of vari-
ation was first introduced by Tucker [1966] and later formalized
and improved on by Kroonenberg and de Leeuw [1980]. These
techniques were successfully applied to multilinear analysis of im-
ages [Vasilescu and Terzopoulos 2002; Vasilescu and Terzopoulos
2004].
This paper describes multilinear analysis of three-dimensional
(3D) data sets and generalizes face-tracking techniques to create a
unique performance-driven system for animation of any face, any
expression, and any viseme. In consideration of similar needs,
Bregler and colleagues introduced a two-dimensional method for
transferring mouth shapes from one performance to another [1997].
The method is ideal for film dubbing—a problem that could also
be solved without performance by first learning the mouth shapes
on a canonical data set and then generating new shapes for differ-
ent texts [Ezzat and Poggio 2000]. These methods are difficult to
use for general performance-driven animation because they cannot
change emotions of a face. Although the problem can be resolved
by decoupling emotion and content via two-mode analysis [Chuang
et al. 2002], all three techniques are view specific, which presents
difficulties when view, illumination, or both have to change.
Our Face Transfer learns a model of 3D facial geometry vari-
ations in order to infer a particular face shape from 2D images.
Previous work combines identity and expression spaces by copy-
ing deformations from one subject onto the geometry of other faces
[DeCarlo and Metaxas 2000; Blanz et al. 2003; Chai et al. 2003].
Expression cloning [Noh and Neumann 2001; Sumner and Popović
2004] improves on this process but does not account for actor-
specific idiosyncrasies that can be revealed by statistical analysis
of the entire data set (i.e., the mesh vertex displacements that pro-
duce a smile should depend on who is smiling and on what they
are saying at the same time). Other powerful models of human
faces have been explored [Wang et al. 2004] at the cost of mak-
ing the estimation and transfer of model parameters more difficult.
This paper describes a method that incorporates all such informa-
tion through multilinear analysis, which naturally accommodates
variations along multiple attributes.
3 Multilinear Algebra
Multilinear algebra is a higher order generalization of linear al-
gebra. In this section we provide insight behind the basic con-
cepts needed for understanding of our Face Transfer system. De
Lathauwer’s dissertation [1997] provides a comprehensive treat-
ment of this topic. Concise overviews have also been published in
the graphics and vision literature [Vasilescu and Terzopoulos 2002;
Vasilescu and Terzopoulos 2004].
Tensors. The basic mathematical object of multilinear algebra is the tensor, a natural generalization of vectors (1st-order tensors) and matrices (2nd-order tensors) to multiple indices. An Nth-order tensor can be thought of as a block of data indexed by N indices: $\mathcal{T} = (t_{i_1 i_2 \ldots i_N})$. Figure 2 shows a 3rd-order (or 3-mode) tensor with a total of $d_1 \times d_2 \times d_3$ elements. Different modes usually correspond to particular attributes of the data (e.g., expression, identity, etc.).
Mode Spaces. A matrix has two characteristic spaces, row and column space; a tensor has one for each mode, hence we call them mode spaces. The $d_1 \times d_2 \times d_3$ 3-tensor in Figure 2 has three mode spaces. Viewing the data as a set of $d_1$-dimensional vectors stored parallel to the first axis (Figure 2b), we can define the mode-1 space as the span of those vectors. Similarly, mode-2 space is defined as the span of the vectors stored parallel to the second axis, each of size $d_2$ (Figure 2c). Finally, mode-3 space is spanned by vectors in the third mode, of dimensionality $d_3$ (Figure 2d). Multilinear algebra revolves around the analysis and manipulation of these spaces.
Figure 2: In (a) we show a 3rd-order (3-mode) tensor $\mathcal{T}$ whose modes have $d_1$, $d_2$, and $d_3$ elements respectively. Depending on how we look at the data within the tensor, we can identify three mode spaces. By viewing the data as vectors parallel to the first mode (b), we define mode-1 space as the span of those vectors. Similarly, mode-2 space is spanned by vectors parallel to the second mode (c), and mode-3 space by vectors in the third mode (d).
Mode-n Product. The most obvious way of manipulating mode spaces is via linear transformation, officially referred to as the mode-n product. It is defined between a tensor $\mathcal{T}$ and a matrix $M$ for a specific mode $n$, and is written as a multiplication with a subscript: $\mathcal{T} \times_n M$. This notation indicates a linear transformation of vectors in $\mathcal{T}$'s mode-$n$ space by the matrix $M$. Concretely, $\mathcal{T} \times_2 M$ would replace each mode-2 vector $v$ (Figure 2c) with a transformed vector $Mv$.
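To make the mode-n product concrete, here is a minimal NumPy sketch (our own helper names, not code from the paper): the tensor is unfolded along mode n so that its mode-n vectors become columns, the matrix is applied, and the result is folded back.

```python
import numpy as np

def unfold(T, n):
    """Mode-n unfolding: rows are indexed by mode n, columns are the mode-n vectors."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold: rebuild a tensor of the given shape from its mode-n unfolding."""
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full), 0, n)

def mode_n_product(T, M, n):
    """T x_n M: replace every mode-n vector v of T with the transformed vector Mv."""
    shape = list(T.shape)
    shape[n] = M.shape[0]
    return fold(M @ unfold(T, n), n, shape)

# Example: a d1 x d2 x d3 tensor whose mode-2 vectors are mapped from 5 to 3 dimensions.
T = np.random.rand(4, 5, 6)
M = np.random.rand(3, 5)
print(mode_n_product(T, M, 1).shape)   # (4, 3, 6)
```
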
Tensor Decomposition. One particularly useful linear transfor-
mation of mode data is the N-mode singular value decomposition
(N-mode SVD). It rotates the mode spaces of a data tensor T pro-
ducing a core tensor C , whose variance monotonically decreases
from first to last element in each mode (analogous to matrix SVD).
This enables us to truncate the insignificant components and get a
reduced model of our data.
Mathematically, N-mode SVD can be expressed with mode products:

$$\mathcal{T} \times_1 U_1^\top \times_2 U_2^\top \times_3 U_3^\top \cdots \times_N U_N^\top = \mathcal{C} \quad (1)$$
$$\Rightarrow \quad \mathcal{T} = \mathcal{C} \times_1 U_1 \times_2 U_2 \times_3 U_3 \cdots \times_N U_N, \quad (2)$$

where $\mathcal{T}$ is the data tensor, $\mathcal{C}$ is the core tensor, and the $U_i$'s (or more precisely their transposes) rotate the mode spaces. Each $U_i$ is an orthonormal matrix whose columns contain left singular vectors of the $i$th mode space, and can be computed via regular SVD of those spaces [De Lathauwer 1997]. Since variance is concentrated in one corner of the core tensor, data can be approximated by

$$\mathcal{T} \simeq \mathcal{C}_{\text{reduced}} \times_1 \check{U}_1 \times_2 \check{U}_2 \times_3 \check{U}_3 \cdots \times_N \check{U}_N, \quad (3)$$

where the $\check{U}_i$'s are truncated versions of the $U_i$'s with the last few columns removed. This truncation generally yields high-quality approximations but it is not optimal—one of several matrix-SVD properties that do not generalize in multilinear algebra. One can obtain a better approximation with further refinement of the $\check{U}_i$'s and $\mathcal{C}_{\text{reduced}}$ via alternating least squares [De Lathauwer 1997].
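As a rough illustration of the decomposition just described, the following sketch (our own naming; a simplified HOSVD without the alternating-least-squares refinement) computes each $U_i$ from the SVD of the mode-$i$ unfolding, forms the core tensor by mode-multiplying with the transposes as in Equation (1), and reconstructs an approximation as in Equation (3).

```python
import numpy as np

def mode_product(T, M, n):
    """Mode-n product T x_n M (equivalent to the unfold/fold version sketched earlier)."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, n)), 0, n)

def n_mode_svd(T, ranks):
    """Truncated N-mode SVD: one (truncated) orthonormal matrix per mode plus the core."""
    Us = []
    for n, r in enumerate(ranks):
        # Left singular vectors of the mode-n unfolding span the mode-n space.
        unfolding = np.moveaxis(T, n, 0).reshape(T.shape[n], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        Us.append(U[:, :r])
    core = T
    for n, U in enumerate(Us):
        core = mode_product(core, U.T, n)      # project the data onto the mode spaces
    return core, Us

# Approximate reconstruction, Equation (3).
T = np.random.rand(10, 5, 6)
core, Us = n_mode_svd(T, ranks=(4, 3, 3))
T_approx = core
for n, U in enumerate(Us):
    T_approx = mode_product(T_approx, U, n)
print(np.linalg.norm(T - T_approx) / np.linalg.norm(T))   # relative approximation error
```
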
4 Multilinear Face Model
To construct the multilinear face model, we first acquire a range
of 3D face scans, put them in full correspondence, appropriately
arrange them into a data tensor (Figure 3), and use the N-mode
SVD to compute a model that captures the face geometry and its
variation due to attributes such as identity and expression.
Figure 3: Data tensor for a bilinear model that varies with iden-
tity and expression; the first mode contains vertices, while the sec-
ond and third modes correspond to expression and identity respec-
tively. The data is arranged so that each slice along the second mode
contains the same expression (in different identities) and each slice
along the third mode contains the same identity (in different expres-
sions). In our trilinear experiments we have added a fourth mode,
where scans in each slice share the same viseme.
4.1 Face Data
We demonstrate our proof-of-concept system on two separate face
models: a bilinear model, and a trilinear model. Both were es-
timated from detailed 3D scans (∼30K vertices) acquired with
3dMD/3Q’s structured light scanner (http://www.3dmd.com/) in
a process similar to regular flash photography, although our meth-
ods would apply equally to other geometric data sets such as motion
capture. As a preprocess, the scans were smoothed using the bilat-
eral filter [Jones et al. 2003] to eliminate some of the capture noise.
The subject pool included men, women, Caucasians, and Asians,
from the mid-20s to mid-50s.
Bilinear model. 15 subjects were scanned performing the same
10 facial expressions. The expressions were picked for their famil-
iarity as well as distinctiveness, and include neutral, smile, frown,
surprise, anger, and others. The scans were assembled into a third
order (3-mode) data tensor (30K vertices × 10 expressions × 15
identities). After N-mode SVD reduction, the resulting bilinear
model offers 6 knobs for manipulating expression and 9 for identity.
Trilinear model. 16 subjects were asked to perform 5 visemes in
5 different expressions (neutral, smiling, scowling, surprised, and
sad). The visemes correspond to the boldfaced sounds in man, car,
eel, too, and she. Principal components analysis of detailed speech
motion capture indicated that these five expressions broadly span
the space of lip shapes, and should give a good approximate basis
for all other visemes—with the possible exception of exaggerated
fricatives. The resulting fourth order (4-mode) data tensor (30K
vertices × 5 visemes × 5 expressions × 16 identities) was decom-
posed to yield a trilinear model providing 4 knobs for viseme, 4 for
expression, and 16 for identity (we have kept the number of knobs
large since our data sets were small).
4.2 Correspondence
Training meshes that are not placed in perfect correspondence can
considerably muddle the question of how to displace vertices to
change one attribute versus another (e.g. identity versus expres-
sion), and thus the multilinear analysis may not give a model with
good separability. We show here how to put a set of unstructured
face scans into correspondence suitable for multilinear analysis.
Despite rapid advances in automatic parameterization of meshes
(e.g., [Praun and Hoppe 2003; Gotsman et al. 2003]), it took consid-
erable experimentation to place many facial scans into detailed cor-
respondence. The principal complicating factors are that the scans
do not have congruent mesh boundaries, and the problem of match-
ing widely varied lip deformations does not appear to be well served
by conformal maps or local isometric constraints. This made it nec-
essary to mark a small number of feature points in order to bootstrap
correspondence-finding across large deformations.
We developed a protocol for a template-fitting procedure [Allen
et al. 2003; Sumner and Popović 2004], which seeks a minimal
deformation of a parameterized template mesh that fits the surface
implied by the scan. The optimization objective, minimized with
gradient descent, balances overall surface similarity, proximity of
manually selected feature points on the two surfaces, and proxim-
ity of reference vertices to the nearest point on the scanned sur-
face. We manually specified 42 reference points on a reference fa-
cial mesh and on a neutral (m-viseme) scan. After rigidly aligning
the template and the scan with Procrustes’ alignment, we deformed
the template mesh into the scan: at first, weighing the marked
correspondences heavily and afterwards emphasizing vertex prox-
imity. For the trilinear model, the remaining m-viseme (closed-
mouth) scans were marked with 21 features around eyebrows and
lips, rigidly aligned to upper-face geometry on the appropriate neu-
tral scans, and then non-rigidly put into correspondence as above.
Finally, all other viseme scans were similarly put into correspon-
dence with the appropriate closed-mouth scan, using the 18 features
marked around the lips.
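For concreteness, one plausible form for such an objective, written in our own notation (the paper does not spell out the formula in this excerpt), is a weighted sum of three terms matching the description above:

$$E(\mathbf{v}_1,\ldots,\mathbf{v}_n) \;=\; \alpha\, E_{\mathrm{similarity}} \;+\; \beta \sum_{k} \big\|\mathbf{v}_{m_k} - \mathbf{p}_k\big\|^2 \;+\; \gamma \sum_{i} \big\|\mathbf{v}_i - \mathrm{closest}_S(\mathbf{v}_i)\big\|^2,$$

where the $\mathbf{v}_i$ are the deformed template vertices, $E_{\mathrm{similarity}}$ measures overall surface similarity between the deformed template and the scan $S$ (for example, by penalizing dissimilar deformations of neighboring vertices), the $\mathbf{p}_k$ are the manually selected feature points matched to template vertices $\mathbf{v}_{m_k}$, and the last term pulls reference vertices toward their nearest points on the scanned surface. Shifting weight from $\beta$ to $\gamma$ over the course of the optimization corresponds to the schedule described above (marked correspondences first, then vertex proximity).
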
4.3 Face Model
Equation (3) shows how to approximate the data tensor by mode-
multiplying a smaller core tensor with a number of truncated or-
thogonal matrices. Since our goal is to output vertices as a function
of attribute parameters, we can decompose the data tensor with-
out factoring along the mode that corresponds to vertices (mode-1),
changing Equation (3) to:
$$\mathcal{T} \simeq \mathcal{M} \times_2 \check{U}_2 \times_3 \check{U}_3 \cdots \times_N \check{U}_N, \quad (4)$$

where $\mathcal{M}$ can now be called the multilinear model of face geometry. Mode-multiplying $\mathcal{M}$ with the $\check{U}_i$'s approximates the original data. In particular, mode-multiplying it with one row from each $\check{U}_i$ reconstructs exactly one original face (the one corresponding to the attribute parameters contained in that row). Therefore, to generate an arbitrary interpolation (or extrapolation) of original faces, we can mode-multiply the model with a linear combination of rows for each $\check{U}_i$. We can write

$$\mathbf{f} = \mathcal{M} \times_2 \mathbf{w}_2^\top \times_3 \mathbf{w}_3^\top \cdots \times_N \mathbf{w}_N^\top, \quad (5)$$

where $\mathbf{w}_i$ is a column vector of parameters (weights) for the attribute corresponding to the $i$th mode, and $\mathbf{f}$ is a column vector of vertices describing the resulting face.
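As a minimal sketch of Equation (5), in our own notation and with toy dimensions (not the paper's code), synthesizing a face is simply a sequence of mode products of the model tensor with one weight vector per attribute mode; rows of the corresponding $\check{U}_i$, or blends of rows, are natural choices for those weights. We assume here that mode 1 stacks the x, y, z coordinates of every vertex.

```python
import numpy as np

def mode_product(T, M, n):
    """Mode-n product (same helper as in the earlier sketches)."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, n)), 0, n)

def synthesize_face(model, weights):
    """Equation (5): contract the model with one weight vector per attribute mode.

    model   -- core tensor M, shape (3 * num_vertices, d_2, ..., d_N); mode 1 holds vertices
    weights -- [w_2, ..., w_N], e.g. an expression weight vector then an identity weight vector
    """
    f = model
    for n, w in enumerate(weights, start=1):
        f = mode_product(f, w.reshape(1, -1), n)   # apply w_i^T along mode i
    return f.reshape(-1, 3)                        # (num_vertices, 3) mesh coordinates

# Toy bilinear example: 5 vertices, 6 expression knobs, 9 identity knobs.
M = np.random.rand(15, 6, 9)
w_expression = np.random.rand(6)    # e.g. a linear combination of rows of U_expression
w_identity = np.random.rand(9)      # e.g. one row of U_identity
print(synthesize_face(M, [w_expression, w_identity]).shape)   # (5, 3)
```
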
4.4 Missing Data
Building the multilinear model from a set of face scans requires
capturing the full Cartesian product of different face attributes, (i.e.,
all expressions and visemes need to be captured for each person).
Producing a full data tensor is not always practical for large data
sets. For example, a certain person might have trouble performing
some expressions on cue, or a researcher might add a new expres-
sion to the database but be unable to reach all the previous subjects. In
our case, data corruption and subsequent unavailability of a subject
led to an incomplete tensor. The problem becomes more evident if
we add age as one of the attributes, where we cannot expect to scan
each individual throughout their entire lives. In all these cases, we
would still like to include a person’s successful scans in the model,
and fill in the missing ones with the most likely candidates. This
process is known as imputation.
There are many possible schemes for estimating a model from
incomplete data. A naive imputation would find a complete sub-
tensor, use it to estimate a smaller model, use that to predict a miss-
ing face, use that to augment the data set, and repeat. In a more
sophisticated Bayesian setting, we would treat the missing data as
hidden variables to be MAP estimated (imputed) or marginalized
out. Both approaches require many iterations over a huge data set;
Bayesian methods are particularly expensive and generally require
approximations for tractability. With MAP estimation and naive
imputation, the results can be highly dependent on the order of op-
erations. Because it fails to exploit all available constraints, the
naive imputative scheme generally produces inferior results.
Here we use an imputative scheme that exploits more available
constraints than the naive one, producing better results. The main
intuition, which we formalize below, is that any optimization crite-
ria can be linearized in a particular tensor mode, where it yields a
matrix factorization problem with missing values. Then we lever-
age existing factorization schemes for incomplete matrices, where
known values contribute a set of linear constraints on the missing
values. These constraints are then combined and solved in the least-
squares sense.
Description. Our algorithm consists of two steps. First, for each
mode we assemble an incomplete matrix whose columns are the
corresponding mode vectors. We then seek a subspace decompo-
sition that best reconstructs the known values of that matrix. The
decomposition and the known values provide a set of linear con-
straints for the missing values. This can be done with off-the-shelf
imputative matrix factorizations (e.g., PPCA [Tipping and Bishop
1999], SPCA [Roweis 1997], or ISVD [Brand 2002]). Typically
these algorithms estimate a low-rank subspace from the complete
vectors of the mode and use that to predict missing values in the
incomplete columns (and/or update the subspace). In our experi-
ments we used the standard PPCA formulation for filling in miss-
ing values, which reduces to a system of linear equations that relate
unknown values to known values through the estimated mean and
covariance of the vectors in the mode space. Second, the linear con-
straints are combined through the missing elements, because they
are shared across all groups of modal vectors and must be filled
in with consistent values. To that end, we collect the linear equa-
tions that determine a particular missing value in all the modes, and
solve them together. For example, if two missing values co-occur in
some mode vector, then they must be jointly estimated. We update
the mean and covariance for each decomposition and repeat the two
steps until convergence.
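A much-simplified sketch of this two-step loop is given below, assuming missing entries are marked as NaN. For clarity it uses a plain truncated-SVD estimate per mode and averages the per-mode predictions, whereas the method above uses PPCA and combines the per-mode linear constraints in one least-squares solve; the helper names are ours.

```python
import numpy as np

def unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def fold(M, n, shape):
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full), 0, n)

def impute(T, rank=3, iters=20):
    """Fill NaN entries of the data tensor T by alternating per-mode low-rank estimates."""
    missing = np.isnan(T)
    filled = np.where(missing, np.nanmean(T), T)       # crude start: the grand mean
    for _ in range(iters):
        estimates = []
        for n in range(T.ndim):
            X = unfold(filled, n)                      # columns are mode-n vectors
            mu = X.mean(axis=1, keepdims=True)
            U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
            low_rank = mu + (U[:, :rank] * s[:rank]) @ Vt[:rank]
            estimates.append(fold(low_rank, n, T.shape))
        # The per-mode estimates must agree on the shared missing entries; here we
        # simply average them (the paper instead solves the constraints jointly).
        filled[missing] = np.mean(estimates, axis=0)[missing]
    return filled
```
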
Evaluation. Figure 4 contrasts the results of this method with
faces predicted by our generalization of the simple method pro-
posed by Blanz and colleagues [2003]. In their formulation the
same displacement vectors that make one person smile are copied
over onto every other identity. Because our data set includes smiles
for more than one person, we extend that approach to copy their av-
erage. In this particular example, 15% of real faces were held out
of the trilinear data set and predicted by our imputation scheme and
the simple averaging scheme. Note how the multilinear prediction
is closer to the truth in most examples, even predicting some indi-
vidual idiosyncrasies in puckers and smiles. The simple averaging
scheme, however, seems to do a better job at keeping the lips sealed
for closed-mouth faces (bottom row of Figure 4). We could obtain
better results by preferentially weighting detail around the mouth.
Figure 4: From top to bottom: prediction of held-out faces with our imputation scheme (on the trilinear model), the actual face, and a simple averaging scheme.
In our earlier trilinear experiments, we found that ISVD-based
imputations predicted how faces vary from the mean with less than
9% relative error (Frobenius norm of the total error divided by the
norm of the held-out face variations) for up to 50% missing data. In
general, the predictions are drawn towards the mean of the known
data. Closed-mouth expressions, which are under-represented in
our data and thus lie far from the mean, were not predicted as well
as other expressions. That can be fixed by reweighting the data.
Tests performed on synthetic data indicate that the quality of im-
putation increases as the data set grows in size, even if significant
portions of it are missing. The reason is that if the data is truly
low-dimensional in each of the modes, the missing samples will fall
within the span and density of the known ones.
Probabilistic Interpretation. The above algorithm fills in miss-
ing data by approximating the true multilinear distribution. The
form of this approximation is made precise by a probabilistic inter-
pretation, which starts from a multilinear generative model
$$\mathcal{T} = \mathcal{M} \times_2 \check{U}_2 \times_3 \check{U}_3 \cdots \times_N \check{U}_N + \nu,$$

where $\mathcal{T}$ and $\mathcal{M}$ are the data and model tensors, $\check{U}_i$ is the $i$-th modal subspace, and $\nu$ is a Gaussian noise source. Filling in missing data according to this model is computationally expensive. Instead, we approximate the true likelihood with a geometric average of Gaussians

$$p(\mathcal{T} \mid \mathcal{M}, \{\check{U}_i\}_{i=2}^{N}) \approx \prod_{j=2}^{N} q_j(\mathcal{T}, \mathcal{M}, \{\check{U}_i\}_{i=2}^{N})^{1/N}.$$

Each Gaussian $q_j(\mathcal{T}, \mathcal{M}, \{\check{U}_i\}_{i=2}^{N}) \doteq \mathcal{N}(\mathbf{T}_j \mid \check{U}_j J_j, \sigma_j^2)$ is found by fixing $\{\check{U}_i\}_{i \neq j}$ and turning the tensor Equation (4) into matrix form: $\mathbf{T}_j = \check{U}_j J_j$. Here, columns of $\mathbf{T}_j$ are the mode-$j$ vectors of $\mathcal{T}$, and the columns of $J_j$ are the mode-$j$ vectors of $\mathcal{M} \times_2 \check{U}_2 \cdots \times_{j-1} \check{U}_{j-1} \times_{j+1} \check{U}_{j+1} \cdots \times_N \check{U}_N$. The resulting likelihood becomes:

$$p(\mathcal{T} \mid \mathcal{M}, \{\check{U}_i\}_{i=2}^{N}) \approx \prod_{j=2}^{N} \mathcal{N}(\mathbf{T}_j \mid \check{U}_j J_j, \sigma_j^2)^{1/N},$$

which can be maximized efficiently.

Taking logarithms and discarding constant factors such as $N$ and $\sigma_j$, we seek to minimize the sum-squared error

$$\sum_{j=2}^{N} \big\| \mathbf{T}_j - \check{U}_j J_j \big\|_F^2.$$

Each term of the summation presents a matrix factorization problem with missing values, where $\check{U}_j$ and $J_j$ are treated as unknown factors of the incomplete matrix $\mathbf{T}_j$, and are solved for using PPCA as described above.
5 Face Transfer
One produces animations from a multilinear model by varying the
attribute parameters (the elements of the $\mathbf{w}_i$'s) as if they were dials, and generating mesh coordinates from Equation 5. The N-mode SVD conveniently gives groups of dials that separately control identity, expression and viseme. Within each group, the dials do not correspond to semantically meaningful deformations (such as smile or frown), but rather reflect the deformations that account for most variance. However, the dials can be “tuned” to reflect deformations of interest through a linear transform of each $\mathbf{w}_i$. This
approach was successfully applied in [Allen et al. 2003] to make
their body shape dials correspond to height and weight. A similar
linear scheme was employed in [Blanz and Vetter 1999]. In general,
dial-based systems are currently used on most of the deformable
models in production, but only skilled animators can create believ-
able animations (or even stills) with them. To give similar power
to a casual user, we have devised a method that automatically sets
model parameters from given video data. With this tool, a user can
enact a performance in front of a camera, and have it automatically
transferred to the model.
5.1 Face Tracking
To link the parameters of a multilinear model to video data, we
use optical flow in conjunction with the weak-perspective cam-
era model. Using the symmetric Kanade-Lucas-Tomasi formula-
tion [Birchfield 1996], we express the frame-to-frame motion of a
tracked point with a linear system:
$$Z\mathbf{d} = Z(\mathbf{p} - \mathbf{p}_0) = \mathbf{e}. \quad (6)$$

Here, the 2-vector $\mathbf{d}$ describes the image-space motion of the point, also expressed as the difference between the point's true location $\mathbf{p}$ and its current best guess $\mathbf{p}_0$ (if we have no guess, then $\mathbf{p}_0$ is the location from the previous frame). Matrix $Z$ and vector $\mathbf{e}$ contain spatial and temporal intensity gradient information in the surrounding region [Birchfield 1996].
Using a weak-perspective imaging model, the point position p
can be expanded in terms of rigid head-motion parameters and non-
rigid facial shape parameters, which are constrained by the multi-
linear model:
$$Z(sR\mathbf{f}_i + \mathbf{t} - \mathbf{p}_0) = \mathbf{e}, \quad (7)$$

where the rigid parameters consist of scale factor $s$, the first two rows of a 3D rotation matrix $R$, and the image-space translation $\mathbf{t}$. The 3D shape $\mathbf{f}$ comes from the multilinear model through Equation (5), with $\mathbf{f}_i$ indicating the $i$th 3D vertex being tracked.
Solving for the pose and all the multilinear weights from a pair
of frames using Equation (7) is not a well-constrained problem. To
simplify the computation, we use a coordinate-descent method: we
let only one of the face attributes vary at a time by fixing all the oth-
ers to their current guesses. This transforms the multilinear problem
into a linear one, as described below, which we solve with standard
techniques that simultaneously compute the rigid pose along with
the linear weights from a pair of frames [Bascle and Blake 1998;
Brand and Bhotika 2001].
When we fix all but one attribute of the multilinear model,
thereby making f linear, Equation (7) turns into
$$Z(sR\,\mathcal{M}_{m,i}\,\mathbf{w}_m + \mathbf{t} - \mathbf{p}_0) = \mathbf{e}, \quad (8)$$
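The following is a rough sketch (our own simplification, not the paper's solver) of the resulting inner linear step: with the pose (s, R, t) and all but one attribute held fixed, stacking Equation (8) over the tracked points gives an ordinary least-squares problem for the free weight vector w_m. The full tracker also solves for the rigid pose in the same pass [Bascle and Blake 1998; Brand and Bhotika 2001], which this sketch omits.

```python
import numpy as np

def solve_attribute_weights(Z, e, M_m, s, R, t, p0):
    """Least-squares solve of Equation (8) for one attribute's weights, pose held fixed.

    Z   -- (num_points, 2, 2) per-point spatial gradient matrices
    e   -- (num_points, 2)    per-point temporal gradient terms
    M_m -- (num_points, 3, k) per-vertex linear basis once the other attributes are fixed
    s, R, t, p0 -- scale, 2x3 rotation rows, 2-vector translation, (num_points, 2) guesses
    """
    A_rows, b_rows = [], []
    for Zi, ei, Mi, p0i in zip(Z, e, M_m, p0):
        A_rows.append(Zi @ (s * R @ Mi))        # how the free weights move this point
        b_rows.append(ei - Zi @ (t - p0i))      # residual once the fixed pose is applied
    A = np.vstack(A_rows)
    b = np.concatenate(b_rows)
    w_m, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w_m
```
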

Citations
Journal ArticleDOI
TL;DR: This survey provides an overview of higher-order tensor decompositions, their applications, and available software.
Abstract: This survey provides an overview of higher-order tensor decompositions, their applications, and available software. A tensor is a multidimensional or $N$-way array. Decompositions of higher-order tensors (i.e., $N$-way arrays with $N \geq 3$) have applications in psycho-metrics, chemometrics, signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, neuroscience, graph analysis, and elsewhere. Two particular tensor decompositions can be considered to be higher-order extensions of the matrix singular value decomposition: CANDECOMP/PARAFAC (CP) decomposes a tensor as a sum of rank-one tensors, and the Tucker decomposition is a higher-order form of principal component analysis. There are many other tensor decompositions, including INDSCAL, PARAFAC2, CANDELINC, DEDICOM, and PARATUCK2 as well as nonnegative variants of all of the above. The N-way Toolbox, Tensor Toolbox, and Multilinear Engine are examples of software packages for working with tensors.

9,227 citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: A novel approach for real-time facial reenactment of a monocular target video sequence (e.g., Youtube video) that addresses the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling and re-render the manipulated output video in a photo-realistic fashion.
Abstract: We present a novel approach for real-time facial reenactment of a monocular target video sequence (e.g., Youtube video). The source sequence is also a monocular video stream, captured live with a commodity webcam. Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. To this end, we first address the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, we track facial expressions of both source and target video using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, we convincingly re-render the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. We demonstrate our method in a live setup, where Youtube videos are reenacted in real time.

1,011 citations


Cites background or result from "Face transfer with multilinear mode..."

  • ...[30] perform facial reenactment by tracking a face template, which is rerendered under different expression parameters on top of the target; the mouth interior is directly copied from the source video....
  • ...It is important to note that we maintain the appearance of the target mouth shape; in contrast, existing methods either copy the source mouth region onto the target [30, 11] or a generic teeth proxy is rendered [14, 29], both of which leads to inconsistent results....
  • ...This leads to much more realistic results than either copying the source mouth region [30, 11] or using a generic 3D teeth proxy [14, 29]....

Journal ArticleDOI
TL;DR: There is a much richer matching collection of expressions, enabling depiction of most human facial actions, in FaceWarehouse, a database of 3D facial expressions for visual computing applications.
Abstract: We present FaceWarehouse, a database of 3D facial expressions for visual computing applications. We use Kinect, an off-the-shelf RGBD camera, to capture 150 individuals aged 7-80 from various ethnic backgrounds. For each person, we captured the RGBD data of her different expressions, including the neutral expression and 19 other expressions such as mouth-opening, smile, kiss, etc. For every RGBD raw data record, a set of facial feature points on the color image such as eye corners, mouth contour, and the nose tip are automatically localized, and manually adjusted if better accuracy is required. We then deform a template facial mesh to fit the depth data as closely as possible while matching the feature points on the color image to their corresponding points on the mesh. Starting from these fitted face meshes, we construct a set of individual-specific expression blendshapes for each person. These meshes with consistent topology are assembled as a rank-3 tensor to build a bilinear face model with two attributes: identity and expression. Compared with previous 3D facial databases, for every person in our database, there is a much richer matching collection of expressions, enabling depiction of most human facial actions. We demonstrate the potential of FaceWarehouse for visual computing with four applications: facial image manipulation, face component transfer, real-time performance-based facial image animation, and facial animation retargeting from video to image.

952 citations

Journal ArticleDOI
TL;DR: Given audio of President Barack Obama, a high quality video of him speaking with accurate lip sync is synthesized, composited into a target video clip, and a recurrent neural network learns the mapping from raw audio features to mouth shapes to produce photorealistic results.
Abstract: Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.

763 citations

Journal ArticleDOI
TL;DR: Faces Learned with an Articulated Model and Expressions is low-dimensional but more expressive than the FaceWarehouse model and the Basel Face Model and is compared to these models by fitting them to static 3D scans and 4D sequences using the same optimization method.
Abstract: The field of 3D face modeling has a large gap between high-end and low-end methods. At the high end, the best facial animation is indistinguishable from real humans, but this comes at the cost of extensive manual labor. At the low end, face capture from consumer depth sensors relies on 3D face models that are not expressive enough to capture the variability in natural facial shape and expression. We seek a middle ground by learning a facial model from thousands of accurately aligned 3D scans. Our FLAME model (Faces Learned with an Articulated Model and Expressions) is designed to work with existing graphics software and be easy to fit to data. FLAME uses a linear shape space trained from 3800 scans of human heads. FLAME combines this linear shape space with an articulated jaw, neck, and eyeballs, pose-dependent corrective blendshapes, and additional global expression blendshapes. The pose and expression dependent articulations are learned from 4D face sequences in the D3DFACS dataset along with additional 4D sequences. We accurately register a template mesh to the scan sequences and make the D3DFACS registrations available for research purposes. In total the model is trained from over 33, 000 scans. FLAME is low-dimensional but more expressive than the FaceWarehouse model and the Basel Face Model. We compare FLAME to these models by fitting them to static 3D scans and 4D sequences using the same optimization method. FLAME is significantly more accurate and is available for research purposes (http://flame.is.tue.mpg.de).

629 citations

References
Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations

Journal ArticleDOI
TL;DR: A generative appearance-based method for recognizing human faces under variation in lighting and viewpoint that exploits the fact that the set of images of an object in fixed pose but under all possible illumination conditions, is a convex cone in the space of images.
Abstract: We present a generative appearance-based method for recognizing human faces under variation in lighting and viewpoint. Our method exploits the fact that the set of images of an object in fixed pose, but under all possible illumination conditions, is a convex cone in the space of images. Using a small number of training images of each face taken with different lighting directions, the shape and albedo of the face can be reconstructed. In turn, this reconstruction serves as a generative model that can be used to render (or synthesize) images of the face under novel poses and illumination conditions. The pose space is then sampled and, for each pose, the corresponding illumination cone is approximated by a low-dimensional linear subspace whose basis vectors are estimated using the generative model. Our recognition algorithm assigns to a test image the identity of the closest approximated illumination cone. Test results show that the method performs almost without error, except on the most extreme lighting directions.

5,027 citations

Proceedings ArticleDOI
01 Jul 1999
TL;DR: A new technique for modeling textured 3D faces by transforming the shape and texture of the examples into a vector space representation, which regulates the naturalness of modeled faces avoiding faces with an “unlikely” appearance.
Abstract: In this paper, a new technique for modeling textured 3D faces is introduced. 3D faces can either be generated automatically from one or more photographs, or modeled directly through an intuitive user interface. Users are assisted in two key problems of computer aided face modeling. First, new face images or new 3D face models can be registered automatically by computing dense one-to-one correspondence to an internal face model. Second, the approach regulates the naturalness of modeled faces avoiding faces with an “unlikely” appearance. Starting from an example set of 3D face models, we derive a morphable face model by transforming the shape and texture of the examples into a vector space representation. New faces and expressions can be modeled by forming linear combinations of the prototypes. Shape and texture constraints derived from the statistics of our example faces are used to guide manual modeling or automated matching algorithms. We show 3D face reconstructions from single images and their applications for photo-realistic image manipulations. We also demonstrate face manipulations according to complex parameters such as gender, fullness of a face or its distinctiveness.

4,514 citations

Journal ArticleDOI
TL;DR: The model for three-mode factor analysis is discussed in terms of newer applications of mathematical processes including a type of matrix process termed the Kronecker product and the definition of combination variables.
Abstract: The model for three-mode factor analysis is discussed in terms of newer applications of mathematical processes including a type of matrix process termed the Kronecker product and the definition of combination variables. Three methods of analysis to a type of extension of principal components analysis are discussed. Methods II and III are applicable to analysis of data collected for a large sample of individuals. An extension of the model is described in which allowance is made for unique variance for each combination variable when the data are collected for a large sample of individuals.

3,810 citations

Journal ArticleDOI
TL;DR: In this paper, the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis.
Abstract: Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss the advantages conveyed by the definition of a probability density function for PCA.

3,362 citations

Frequently Asked Questions (1)
Q1. What have the authors contributed in "Face transfer with multilinear models"?

Face Transfer, the method this paper presents, maps videorecorded performances of one individual to facial animations of another.