Journal ArticleDOI

Face transfer with multilinear models

TL;DR: Face Transfer is a method for mapping videorecorded performances of one individual to facial animations of another, based on a multilinear model of 3D face meshes that separably parameterizes the space of geometric variations due to different attributes.
Abstract: Face Transfer is a method for mapping videorecorded performances of one individual to facial animations of another. It extracts visemes (speech-related mouth articulations), expressions, and three-dimensional (3D) pose from monocular video or film footage. These parameters are then used to generate and drive a detailed 3D textured face mesh for a target identity, which can be seamlessly rendered back into target footage. The underlying face model automatically adjusts for how the target performs facial expressions and visemes. The performance data can be easily edited to change the visemes, expressions, pose, or even the identity of the target---the attributes are separably controllable. This supports a wide variety of video rewrite and puppetry applications. Face Transfer is based on a multilinear model of 3D face meshes that separably parameterizes the space of geometric variations due to different attributes (e.g., identity, expression, and viseme). Separability means that each of these attributes can be independently varied. A multilinear model can be estimated from a Cartesian product of examples (identities × expressions × visemes) with techniques from statistical analysis, but only after careful preprocessing of the geometric data set to secure one-to-one correspondence, to minimize cross-coupling artifacts, and to fill in any missing examples. Face Transfer offers new solutions to these problems and links the estimated model with a face-tracking algorithm to extract pose, expression, and viseme parameters.

Summary (4 min read)

1 Introduction

  • Their system can either rewrite the original footage with adjusted expressions and visemes or transfer the performance to a different face in a different footage.
  • This paper describes a general, controllable, and practical system for facial animation.
  • In principle, given a large and varied data set, the model can generate any face, any expression, any viseme.
  • Existing estimation algorithms require perfect one-to-one correspondence between all meshes, and a mesh for every possible combination of expression, viseme, and identity.

3 Multilinear Algebra

  • In this section the authors provide insight behind the basic concepts needed for understanding of their Face Transfer system.
  • De Lathauwer’s dissertation [1997] provides a comprehensive treatment of this topic.
  • The basic mathematical object of multilinear algebra is the tensor, a natural generalization of vectors (1st order tensors) and matrices (2nd order tensors) to multiple indices.
  • Viewing the data as a set of d1-dimensional vectors stored parallel to the first axis , the authors can define the mode-1 space as the span of those vectors.
  • One can obtain a better approximation with further refinement of the truncated matrices Ǔi and the reduced core tensor Creduced via alternating least squares [De Lathauwer 1997].

4.1 Face Data

  • The authors demonstrate their proof-of-concept system on two separate face models: a bilinear model, and a trilinear model.
  • Both were estimated from detailed 3D scans (∼ 30K vertices) acquired with 3dMD/3Q’s structured light scanner (http://www.3dmd.com/) in a process similar to regular flash photography, although their methods would apply equally to other geometric data sets such as motion capture.
  • The subject pool included men, women, Caucasians, and Asians, from the mid-20s to mid-50s.
  • 16 subjects were asked to perform 5 visemes in 5 different expressions (neutral, smiling, scowling, surprised, and sad).
  • The resulting fourth order (4-mode) data tensor (30K vertices × 5 visemes × 5 expressions × 16 identities) was decomposed to yield a trilinear model providing 4 knobs for viseme, 4 for expression, and 16 for identity (the authors have kept the number of knobs large since their data sets were small).

4.2 Correspondence

  • Training meshes that are not placed in perfect correspondence can considerably muddle the question of how to displace vertices to change one attribute versus another (e.g. identity versus expression), and thus the multilinear analysis may not give a model with good separability.
  • Despite rapid advances in automatic parameterization of meshes (e.g., [Praun and Hoppe 2003; Gotsman et al. 2003]), it took considerable experimentation to place many facial scans into detailed correspondence.
  • The optimization objective, minimized with gradient descent, balances overall surface similarity, proximity of manually selected feature points on the two surfaces, and proximity of reference vertices to the nearest point on the scanned surface.
  • For the trilinear model, the remaining m-viseme scans were marked with 21 features around eyebrows and lips, rigidly aligned to upper-face geometry on the appropriate neutral scans, and then non-rigidly put into correspondence as above.

4.3 Face Model

  • Equation (3) shows how to approximate the data tensor by modemultiplying a smaller core tensor with a number of truncated orthogonal matrices.
  • Since their goal is to output vertices as a function of attribute parameters, the authors can decompose the data tensor without factoring along the mode that corresponds to vertices (mode-1), changing Equation (3) to T ≈ M ×2 Ǔ2 ×3 Ǔ3 ··· ×N ǓN (Equation 4), where M can now be called the multilinear model of face geometry.
  • Mode-multiplying M with Ǔi’s approximates the original data.
  • In particular, mode-multiplying it with one row from each Ǔi reconstructs exactly one original face (the one corresponding to the attribute parameters contained in that row).
  • Therefore, to generate an arbitrary interpolation (or extrapolation) of original faces, the authors can mode-multiply the model with a linear combination of rows for each Ǔi.

4.4 Missing Data

  • Building the multilinear model from a set of face scans requires capturing the full Cartesian product of different face attributes, (i.e., all expressions and visemes need to be captured for each person).
  • To that end, the authors collect the linear equations that determine a particular missing value in all the modes, and solve them together.
  • Because their data set includes smiles for more than one person, the authors extend that approach to copy their average.
  • Filling in missing data according to this model is computationally expensive.
  • Instead, the authors approximate the true likelihood with a geometric average of Gaussians, p(T | M, {Ǔi}) ≈ ∏j=2..N qj(T, M, {Ǔi})^(1/N).

5 Face Transfer

  • One produces animations from a multilinear model by varying the attribute parameters (the elements of the wi’s) as if they were dials, and generating mesh coordinates from Equation 5.
  • The N-mode SVD conveniently gives groups of dials that separately control identity, expression and viseme.
  • The dials can be “tuned” to reflect deformations of interest through a linear transform of each wi.
  • A similar linear scheme was employed in [Blanz and Vetter 1999].
  • To give similar power to a casual user, the authors have devised a method that automatically sets model parameters from given video data.

5.1 Face Tracking

  • To link the parameters of a multilinear model to video data, the authors use optical flow in conjunction with the weak-perspective camera model.
  • Matrix Z and vector e contain spatial and temporal intensity gradient information in the surrounding region [Birchfield 1996].
  • If the currently tracked attribute varies from frame to frame (such as expression does), the authors solve the set of linear systems and proceed to the next pair of neighboring frames.

5.2 Initialization

  • The method described above, since it is based on tracking, needs to be initialized with the first frame alignment (pose and all the weights of the multilinear model).
  • The authors accomplish this by specifying a small number of feature points which are then used to position the face geometry (a weak-perspective pose sketch from such 2D-3D feature correspondences follows this list).
  • The correspondences can be either user-provided (which gives more flexibility and power) or automatically detected (which avoids user intervention).
  • The authors have experimented with the automatic feature detector developed by [Viola and Jones 2001], and found that it is robust and precise enough in locating a number of key features (eye corners, nose tip, mouth corners) to give a good approximating alignment in most cases.
  • Imperfect alignment can be improved by tracking the first few frames back and forth until the model snaps into a better location.
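
Below is a minimal sketch, under our own naming and assumptions (an illustration of the standard technique, not the authors' implementation), of how such a first-frame alignment could be computed: given the 2D locations of a few detected features and the corresponding 3D vertices of the current face model, the weak-perspective pose (scale s, the first two rows of the rotation R, and translation t) follows from a linear least-squares fit.

```python
import numpy as np

def weak_perspective_pose(X3d, x2d):
    """Estimate weak-perspective pose (s, R, t) from 2D-3D feature correspondences.

    X3d -- (n, 3) model vertices at the chosen features (eye corners, nose tip, mouth corners)
    x2d -- (n, 2) detected image locations of the same features; n >= 4, not all coplanar
    """
    n = X3d.shape[0]
    A = np.zeros((2 * n, 8))
    A[0::2, 0:3] = X3d; A[0::2, 6] = 1.0    # x_i = p1 . X_i + t_x
    A[1::2, 3:6] = X3d; A[1::2, 7] = 1.0    # y_i = p2 . X_i + t_y
    b = x2d.reshape(-1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    P, t = sol[:6].reshape(2, 3), sol[6:]   # P = s * (first two rows of R)
    s = 0.5 * (np.linalg.norm(P[0]) + np.linalg.norm(P[1]))
    U, _, Vt = np.linalg.svd(P / s, full_matrices=False)
    R = U @ Vt                              # re-orthonormalize the two rotation rows
    return s, R, t
```

The model weights themselves would then be refined by tracking, as described in Section 5.1.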

6 Results

  • Multilinear models provide a convenient control of facial attributes.
  • Face Transfer infers the attribute parameters automatically by tracking the face in a video.
  • Because their system tracks approximately a thousand vertices, the process is less sensitive to localized intensity changes (e.g., around the furrow above the lip).
  • In Figure 7B, the authors use the bilinear model to change a person’s identity while retaining the expressions from the original performance.
  • From left to right, the authors present the original video frame, a frame from a new video, the new geometry, and the final modified frame (without blending).

7 Discussion

  • Perhaps their most remarkable empirical result is that even with a model estimated from a rather tiny data set, the authors can produce videorealistic results for new source and target subjects.
  • The authors also see algorithmic opportunities to make aspects of the system more automatic and robust.
  • Correspondence between scans might be improved with some of the methods shown in [Kraevoy and Sheffer 2004].
  • In a production setting, the scan data would need to be expanded to contain shape and texture information for the ears, neck, and hair, so that the authors can make a larger range of head pose changes.
  • Finally, the texture function lifted from video is performance specific, in that the authors made no effort to remove variations due to lighting.

8 Conclusion

  • The model is multilinear, and thus has the key property of separability: different attributes, such as identity and expression, can be manipulated independently.
  • Thus the authors can change the identity and expression, but keep the smile.
  • What makes this multilinear model a practical tool for animation is that the authors connect it directly to video, showing how to recover a time-series of poses and attribute parameters (expressions and visemes), plus a performance-driven texture function for an actor’s face.
  • In addition, the model offers a rich source of synthetic actors that can be controlled via video.
  • An intriguing prospect is that one could now build a multilinear model representing a vertex×identity×expression×viseme×age data tensor—without having to capture each individual’s face at every stage of their life.


To appear in SIGGRAPH 2005.
Face Transfer with Multilinear Models
Daniel Vlasic   Matthew Brand   Hanspeter Pfister   Jovan Popović
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Mitsubishi Electric Research Laboratories
Figure 1: Face Transfer with multilinear models gives animators decoupled control over facial attributes such as identity, expression, and
viseme. In this example, we combine pose and identity from the first frame, surprised expression from the second, and a viseme (mouth
articulation for a sound midway between ”oo” and ”ee”) from the third. The resulting composite is blended back into the original frame.
Abstract
Face Transfer is a method for mapping videorecorded perfor-
mances of one individual to facial animations of another. It ex-
tracts visemes (speech-related mouth articulations), expressions,
and three-dimensional (3D) pose from monocular video or film
footage. These parameters are then used to generate and drive a
detailed 3D textured face mesh for a target identity, which can be
seamlessly rendered back into target footage. The underlying face
model automatically adjusts for how the target performs facial ex-
pressions and visemes. The performance data can be easily edited
to change the visemes, expressions, pose, or even the identity of
the target—the attributes are separably controllable. This supports
a wide variety of video rewrite and puppetry applications.
Face Transfer is based on a multilinear model of 3D face meshes
that separably parameterizes the space of geometric variations due
to different attributes (e.g., identity, expression, and viseme). Sep-
arability means that each of these attributes can be independently
varied. A m ultilinear model can be estimated from a Cartesian
product of examples (identities × expressions × visemes) with
techniques from statistical analysis, but only after careful pre-
processing of the geometric data set to secure one-to-one corre-
spondence, to minimize cross-coupling artifacts, and to fill in any
missing examples. Face Transfer offers new solutions to these prob-
lems and links the estimated model with a face-tracking algorithm
to extract pose, expression, and viseme parameters.
CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional
Graphics and Realism—Animation; I.4.9 [Image Processing and
Computer Vision]: Applications;
Keywords: Facial Animation, Computer Vision—Tracking
MIT CSAIL, The Stata Center, 32 Vassar Street, Cambridge, MA
02139, USA
1 Introduction
Performance-driven animation has a growing role in film produc-
tion because it allows actors to express content and mood naturally,
and because the resulting animations have a degree of realism that
is hard to obtain from synthesis methods [Robertson 2004]. The
search for the highest quality motions has led to complex, expen-
sive, and hard-to-use systems. This paper introduces new tech-
niques for producing compelling facial animations that are inexpen-
sive, practical, versatile, and well suited for editing performances
and retargeting to new characters.
Face Transfer extracts performances from ordinary video
footage, allowing the transfer of facial action of actors who are
unavailable for detailed measurement, instrumentation, or for re-
recording with specialized scanning equipment. Expressions,
visemes (speech-related mouth articulations), and head motions are
extracted automatically, along with a performance-driven texture
function. With this information in hand, our system can either
rewrite the original footage with adjusted expressions and visemes
or transfer the performance to a different face in a different footage.
Multilinear models are ideally suited for this application because
they can describe face variations with separable attributes that can
be estimated from video automatically. In this paper, we estimate
such a model from a data set of three-dimensional (3D) face scans
that vary according to expression, viseme, and identity. The multi-
linear model decouples the three attributes (i.e., identity or viseme
can be varied while expression remains constant) and encodes them
consistently. Thus the attribute vector that encodes a smile for one
person encodes a smile for every face spanned by the model, re-
gardless of identity or viseme. Yet the model captures the fact that
every person smiles in a slightly different way. Separability and
consistency are the key properties that enable the transfer of a per-
formance from one face to another without a change in content.
Contributions. This paper describes a general, controllable, and
practical system for facial animation. It estimates a multilinear
model of human faces by examining geometric variations between
3D face scans. In principle, given a large and varied data set, the
model can generate any face, any expression, any viseme. As proof
of concept, we estimate the model from a couple of geometric data
sets: one with 15 identities and 10 expressions, and another with
16 identities, 5 expressions, and 5 visemes. Existing estimation
algorithms require perfect one-to-one correspondence between all
meshes, and a mesh for every possible combination of expression,
viseme, and identity. Because acquiring the full Cartesian product
of meshes and putting them into dense correspondence is extremely
difficult, this paper introduces methods for populating the Cartesian
product from a sparse sampling of faces, and for placing unstruc-
tured face scans into correspondence with minimal cross-coupling
artifacts.
By linking the multilinear model to optical flow, we obtain a
single-camera tracker that estimates performance parameters and
detailed 3D geometry from video recordings. The model defines a
mapping from performance parameters back to 3D shape, thus we
can arbitrarily mix pose, identity, expressions, and visemes from
two or more videos and render the result back into a target video.
As a result, the system provides an intuitive interface for both an-
imators (via separably controllable attributes) and performers (via
acting). And because it does not require performers to wear visible
facial markers or to be recorded by special face-scanning equip-
ment, it is an inexpensive and easy-to-use facial animation system.
2 Related Work
Realistic facial animation remains a fundamental challenge in com-
puter graphics. Beginning with Parke’s pioneering work [1974],
desire for improved realism has driven researchers to extend geo-
metric models [Parke 1982] with physical models of facial anatomy
[Waters 1987; Lee et al. 1995] and to combine them with non-linear
finite element methods [Koch et al. 1996] in systems that could be
used for planning facial surgeries. In parallel, Williams presented a
compelling argument [1990] in favor of performance-driven facial
animation, which anticipated techniques for tracking head motions
and facial expressions in video [Li et al. 1993; Essa et al. 1996;
DeCarlo and Metaxas 1996; Pighin et al. 1999]. A more expensive
alternative could use a 3D scanning technique [Zhang et al. 2004],
if the performance can be re-recorded with such a system.
Much of the ensuing work on face estimation and tracking re-
lied on the observation that variation in faces is well approximated
by a linear subspace of low dimension [Sirovich and Kirby 1987].
These techniques estimate either linear coefficients for known basis
shapes [Bascle and Blake 1998; Brand and Bhotika 2001] or both
the basis shapes and the coefficients, simultaneously [Bregler et al.
2000; Torresani et al. 2001]. In computer graphics, the combina-
tion of accurate 3D geometry with linear texture models [Pighin
et al. 1998; Blanz and Vetter 1999] produced striking results. In
addition, Blanz and Vetter [1999] presented a process for estimat-
ing the shape of a face in a single photograph, and a set of controls
for intuitive manipulation of appearance attributes (thin/fat, femi-
nine/masculine).
These and other estimation techniques share a common chal-
lenge of decoupling the attributes responsible for observed varia-
tions. As an early example, Pentland and Sclaroff estimate geom-
etry of deformable objects by decoupling linear elastic equations
into orthogonal vibration modes [1991]. In this case, modal analy-
sis uses eigen decomposition to compute the independent vibration
modes. Similar factorizations are also relied upon to separate vari-
ations due to pose and lighting, pose and expression, identity and
lighting, or style and content in general [Freeman and Tenenbaum
1997; Bregler et al. 2000; DeCarlo and Metaxas 2000; Georghiades
et al. 2001; Cao et al. 2003].
A technical limitation of these formulations is that each pair
of factors must be considered in isolation; they cannot easily de-
couple variations due to a combination of more than two factors.
The extension of such two-mode analysis to more modes of vari-
ation was first introduced by Tucker [1966] and later formalized
and improved on by Kroonenberg and de Leeuw [1980]. These
techniques were successfully applied to multilinear analysis of im-
ages [Vasilescu and Terzopoulos 2002; Vasilescu and Terzopoulos
2004].
This paper describes multilinear analysis of three-dimensional
(3D) data sets and generalizes face-tracking techniques to create a
unique performance-driven system for animation of any face, any
expression, and any viseme. In consideration of similar needs,
Bregler and colleagues introduced a two-dimensional method for
transferring mouth shapes from one performance to another [1997].
The method is ideal for film dubbing—a problem that could also
be solved without performance by first learning the mouth shapes
on a canonical data set and then generating new shapes for differ-
ent texts [Ezzat and Poggio 2000]. These methods are difficult to
use for general performance-driven animation because they cannot
change emotions of a face. Although the problem can be resolved
by decoupling emotion and content via two-mode analysis [Chuang
et al. 2002], all three techniques are view specific, which presents
difficulties when view, illumination, or both have to change.
Our Face Transfer learns a model of 3D facial geometry vari-
ations in order to infer a particular face shape from 2D images.
Previous work combines identity and expression spaces by copy-
ing deformations from one subject onto the geometry of other faces
[DeCarlo and Metaxas 2000; Blanz et al. 2003; Chai et al. 2003].
Expression cloning [Noh and Neumann 2001; Sumner and Popović
2004] improves on this process but does not account for actor-
specific idiosyncrasies that can be revealed by statistical analysis
of the entire data set (i.e., the mesh vertex displacements that pro-
duce a smile should depend on who is smiling and on what they
are saying at the same time). Other powerful models of human
faces have been explored [Wang et al. 2004] at the cost of mak-
ing the estimation and transfer of model parameters more difficult.
This paper describes a method that incorporates all such informa-
tion through multilinear analysis, which naturally accommodates
variations along multiple attributes.
3 Multilinear Algebra
Multilinear algebra is a higher order generalization of linear al-
gebra. In this section we provide insight behind the basic con-
cepts needed for understanding of our Face Transfer system. De
Lathauwer’s dissertation [1997] provides a comprehensive treat-
ment of this topic. Concise overviews have also been published in
the graphics and vision literature [Vasilescu and Terzopoulos 2002;
Vasilescu and Terzopoulos 2004].
Tensors. The basic mathematical object of multilinear algebra is the tensor, a natural generalization of vectors (1st-order tensors) and matrices (2nd-order tensors) to multiple indices. An Nth-order tensor can be thought of as a block of data indexed by N indices: $\mathcal{T} = (t_{i_1 i_2 \ldots i_N})$. Figure 2 shows a 3rd-order (or 3-mode) tensor with a total of $d_1 \times d_2 \times d_3$ elements. Different modes usually correspond to particular attributes of the data (e.g., expression, identity, etc.).
Mode Spaces. A matrix has two characteristic spaces, row and column space; a tensor has one for each mode, hence we call them mode spaces. The $d_1 \times d_2 \times d_3$ 3-tensor in Figure 2 has three mode spaces. Viewing the data as a set of $d_1$-dimensional vectors stored parallel to the first axis (Figure 2b), we can define the mode-1 space as the span of those vectors. Similarly, mode-2 space is defined as the span of the vectors stored parallel to the second axis, each of size $d_2$ (Figure 2c). Finally, mode-3 space is spanned by vectors in the third mode, of dimensionality $d_3$ (Figure 2d). Multilinear algebra revolves around the analysis and manipulation of these spaces.
Figure 2: In (a) we show a 3rd-order (3-mode) tensor $\mathcal{T}$ whose modes have $d_1$, $d_2$, and $d_3$ elements respectively. Depending on how we look at the data within the tensor, we can identify three mode spaces. By viewing the data as vectors parallel to the first mode (b), we define mode-1 space as the span of those vectors. Similarly, mode-2 space is spanned by vectors parallel to the second mode (c), and mode-3 space by vectors in the third mode (d).
Mode-n Product. The most obvious way of manipulating mode spaces is via linear transformation, officially referred to as the mode-n product. It is defined between a tensor $\mathcal{T}$ and a matrix $M$ for a specific mode $n$, and is written as a multiplication with a subscript: $\mathcal{T} \times_n M$. This notation indicates a linear transformation of vectors in $\mathcal{T}$'s mode-$n$ space by the matrix $M$. Concretely, $\mathcal{T} \times_2 M$ would replace each mode-2 vector $v$ (Figure 2c) with a transformed vector $Mv$.
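To make the mode-n product concrete, here is a minimal NumPy sketch (our own helper names, not code from the paper): the tensor is unfolded along mode n so that its mode-n vectors become columns, the matrix is applied, and the result is folded back.

```python
import numpy as np

def unfold(T, n):
    """Mode-n unfolding: rows are indexed by mode n, columns are the mode-n vectors."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold: rebuild a tensor of the given shape from its mode-n unfolding."""
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full), 0, n)

def mode_n_product(T, M, n):
    """T x_n M: replace every mode-n vector v of T with the transformed vector Mv."""
    shape = list(T.shape)
    shape[n] = M.shape[0]
    return fold(M @ unfold(T, n), n, shape)

# Example: a d1 x d2 x d3 tensor whose mode-2 vectors are mapped from 5 to 3 dimensions.
T = np.random.rand(4, 5, 6)
M = np.random.rand(3, 5)
print(mode_n_product(T, M, 1).shape)   # (4, 3, 6)
```
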
Tensor Decomposition. One particularly useful linear transfor-
mation of mode data is the N-mode singular value decomposition
(N-mode SVD). It rotates the mode spaces of a data tensor T pro-
ducing a core tensor C , whose variance monotonically decreases
from first to last element in each mode (analogous to matrix SVD).
This enables us to truncate the insignificant components and get a
reduced model of our data.
Mathematically, N-mode SVD can be expressed with mode products:

$$\mathcal{T} \times_1 U_1^\top \times_2 U_2^\top \times_3 U_3^\top \cdots \times_N U_N^\top = \mathcal{C} \quad (1)$$
$$\Rightarrow \quad \mathcal{T} = \mathcal{C} \times_1 U_1 \times_2 U_2 \times_3 U_3 \cdots \times_N U_N, \quad (2)$$

where $\mathcal{T}$ is the data tensor, $\mathcal{C}$ is the core tensor, and the $U_i$'s (or more precisely their transposes) rotate the mode spaces. Each $U_i$ is an orthonormal matrix whose columns contain left singular vectors of the $i$th mode space, and can be computed via regular SVD of those spaces [De Lathauwer 1997]. Since variance is concentrated in one corner of the core tensor, data can be approximated by

$$\mathcal{T} \simeq \mathcal{C}_{\text{reduced}} \times_1 \check{U}_1 \times_2 \check{U}_2 \times_3 \check{U}_3 \cdots \times_N \check{U}_N, \quad (3)$$

where the $\check{U}_i$'s are truncated versions of the $U_i$'s with the last few columns removed. This truncation generally yields high-quality approximations but it is not optimal—one of several matrix-SVD properties that do not generalize in multilinear algebra. One can obtain a better approximation with further refinement of the $\check{U}_i$'s and $\mathcal{C}_{\text{reduced}}$ via alternating least squares [De Lathauwer 1997].
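As a rough illustration of the decomposition just described, the following sketch (our own naming; a simplified HOSVD without the alternating-least-squares refinement) computes each $U_i$ from the SVD of the mode-$i$ unfolding, forms the core tensor by mode-multiplying with the transposes as in Equation (1), and reconstructs an approximation as in Equation (3).

```python
import numpy as np

def mode_product(T, M, n):
    """Mode-n product T x_n M (equivalent to the unfold/fold version sketched earlier)."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, n)), 0, n)

def n_mode_svd(T, ranks):
    """Truncated N-mode SVD: one (truncated) orthonormal matrix per mode plus the core."""
    Us = []
    for n, r in enumerate(ranks):
        # Left singular vectors of the mode-n unfolding span the mode-n space.
        unfolding = np.moveaxis(T, n, 0).reshape(T.shape[n], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        Us.append(U[:, :r])
    core = T
    for n, U in enumerate(Us):
        core = mode_product(core, U.T, n)      # project the data onto the mode spaces
    return core, Us

# Approximate reconstruction, Equation (3).
T = np.random.rand(10, 5, 6)
core, Us = n_mode_svd(T, ranks=(4, 3, 3))
T_approx = core
for n, U in enumerate(Us):
    T_approx = mode_product(T_approx, U, n)
print(np.linalg.norm(T - T_approx) / np.linalg.norm(T))   # relative approximation error
```
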
4 Multilinear Face Model
To construct the multilinear face model, we first acquire a range
of 3D face scans, put them in full correspondence, appropriately
arrange them into a data tensor (Figure 3), and use the N-mode
SVD to compute a model that captures the face geometry and its
variation due to attributes such as identity and expression.
Figure 3: Data tensor for a bilinear model that varies with iden-
tity and expression; the first mode contains vertices, while the sec-
ond and third modes correspond to expression and identity respec-
tively. The data is arranged so that each slice along the second mode
contains the same expression (in different identities) and each slice
along the third mode contains the same identity (in different expres-
sions). In our trilinear experiments we have added a fourth mode,
where scans in each slice share the same viseme.
4.1 Face Data
We demonstrate our proof-of-concept system on two separate face
models: a bilinear model, and a trilinear model. Both were es-
timated from detailed 3D scans (∼30K vertices) acquired with
3dMD/3Q’s structured light scanner (http://www.3dmd.com/) in
a process similar to regular flash photography, although our meth-
ods would apply equally to other geometric data sets such as motion
capture. As a preprocess, the scans were smoothed using the bilat-
eral filter [Jones et al. 2003] to eliminate some of the capture noise.
The subject pool included men, women, Caucasians, and Asians,
from the mid-20s to mid-50s.
Bilinear model. 15 subjects were scanned performing the same
10 facial expressions. The expressions were picked for their famil-
iarity as well as distinctiveness, and include neutral, smile, frown,
surprise, anger, and others. The scans were assembled into a third
order (3-mode) data tensor (30K vertices × 10 expressions × 15
identities). After N-mode SVD reduction, the resulting bilinear
model offers 6 knobs for manipulating expression and 9 for identity.
Trilinear model. 16 subjects were asked to perform 5 visemes in
5 different expressions (neutral, smiling, scowling, surprised, and
sad). The visemes correspond to the boldfaced sounds in man, car,
eel, too, and she. Principal components analysis of detailed speech
motion capture indicated that these five expressions broadly span
the space of lip shapes, and should give a good approximate basis
for all other visemes—with the possible exception of exaggerated
fricatives. The resulting fourth order (4-mode) data tensor (30K
vertices × 5 visemes × 5 expressions × 16 identities) was decom-
posed to yield a trilinear model providing 4 knobs for viseme, 4 for
expression, and 16 for identity (we have kept the number of knobs
large since our data sets were small).
4.2 Correspondence
Training meshes that are not placed in perfect correspondence can
considerably muddle the question of how to displace vertices to
change one attribute versus another (e.g. identity versus expres-
sion), and thus the multilinear analysis may not give a model with
good separability. We show here how to put a set of unstructured
face scans into correspondence suitable for multilinear analysis.
Despite rapid advances in automatic parameterization of meshes
(e.g., [Praun and Hoppe 2003; Gotsman et al. 2003]), it took consid-
erable experimentation to place many facial scans into detailed cor-
respondence. The principal complicating factors are that the scans
do not have congruent mesh boundaries, and the problem of match-
ing widely varied lip deformations does not appear to be well served
by conformal maps or local isometric constraints. This made it nec-
essary to mark a small number of feature points in order to bootstrap
correspondence-finding across large deformations.
We developed a protocol for a template-fitting procedure [Allen
et al. 2003; Sumner and Popović 2004], which seeks a minimal
deformation of a parameterized template mesh that fits the surface
implied by the scan. The optimization objective, minimized with
gradient descent, balances overall surface similarity, proximity of
manually selected feature points on the two surfaces, and proxim-
ity of reference vertices to the nearest point on the scanned sur-
face. We manually specified 42 reference points on a reference fa-
cial mesh and on a neutral (m-viseme) scan. After rigidly aligning
the template and the scan with Procrustes’ alignment, we deformed
the template mesh into the scan: at first, weighing the marked
correspondences heavily and afterwards emphasizing vertex prox-
imity. For the trilinear model, the remaining m-viseme (closed-
mouth) scans were marked with 21 features around eyebrows and
lips, rigidly aligned to upper-face geometry on the appropriate neu-
tral scans, and then non-rigidly put into correspondence as above.
Finally, all other viseme scans were similarly put into correspon-
dence with the appropriate closed-mouth scan, using the 18 features
marked around the lips.
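For concreteness, one plausible form for such an objective, written in our own notation (the paper does not spell out the formula in this excerpt), is a weighted sum of three terms matching the description above:

$$E(\mathbf{v}_1,\ldots,\mathbf{v}_n) \;=\; \alpha\, E_{\mathrm{similarity}} \;+\; \beta \sum_{k} \big\|\mathbf{v}_{m_k} - \mathbf{p}_k\big\|^2 \;+\; \gamma \sum_{i} \big\|\mathbf{v}_i - \mathrm{closest}_S(\mathbf{v}_i)\big\|^2,$$

where the $\mathbf{v}_i$ are the deformed template vertices, $E_{\mathrm{similarity}}$ measures overall surface similarity between the deformed template and the scan $S$ (for example, by penalizing dissimilar deformations of neighboring vertices), the $\mathbf{p}_k$ are the manually selected feature points matched to template vertices $\mathbf{v}_{m_k}$, and the last term pulls reference vertices toward their nearest points on the scanned surface. Shifting weight from $\beta$ to $\gamma$ over the course of the optimization corresponds to the schedule described above (marked correspondences first, then vertex proximity).
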
4.3 Face Model
Equation (3) shows how to approximate the data tensor by mode-
multiplying a smaller core tensor with a number of truncated or-
thogonal matrices. Since our goal is to output vertices as a function
of attribute parameters, we can decompose the data tensor with-
out factoring along the mode that corresponds to vertices (mode-1),
changing Equation (3) to:
$$\mathcal{T} \simeq \mathcal{M} \times_2 \check{U}_2 \times_3 \check{U}_3 \cdots \times_N \check{U}_N, \quad (4)$$

where $\mathcal{M}$ can now be called the multilinear model of face geometry. Mode-multiplying $\mathcal{M}$ with the $\check{U}_i$'s approximates the original data. In particular, mode-multiplying it with one row from each $\check{U}_i$ reconstructs exactly one original face (the one corresponding to the attribute parameters contained in that row). Therefore, to generate an arbitrary interpolation (or extrapolation) of original faces, we can mode-multiply the model with a linear combination of rows for each $\check{U}_i$. We can write

$$\mathbf{f} = \mathcal{M} \times_2 \mathbf{w}_2^\top \times_3 \mathbf{w}_3^\top \cdots \times_N \mathbf{w}_N^\top, \quad (5)$$

where $\mathbf{w}_i$ is a column vector of parameters (weights) for the attribute corresponding to the $i$th mode, and $\mathbf{f}$ is a column vector of vertices describing the resulting face.
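As a minimal sketch of Equation (5), in our own notation and with toy dimensions (not the paper's code), synthesizing a face is simply a sequence of mode products of the model tensor with one weight vector per attribute mode; rows of the corresponding $\check{U}_i$, or blends of rows, are natural choices for those weights. We assume here that mode 1 stacks the x, y, z coordinates of every vertex.

```python
import numpy as np

def mode_product(T, M, n):
    """Mode-n product (same helper as in the earlier sketches)."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, n)), 0, n)

def synthesize_face(model, weights):
    """Equation (5): contract the model with one weight vector per attribute mode.

    model   -- core tensor M, shape (3 * num_vertices, d_2, ..., d_N); mode 1 holds vertices
    weights -- [w_2, ..., w_N], e.g. an expression weight vector then an identity weight vector
    """
    f = model
    for n, w in enumerate(weights, start=1):
        f = mode_product(f, w.reshape(1, -1), n)   # apply w_i^T along mode i
    return f.reshape(-1, 3)                        # (num_vertices, 3) mesh coordinates

# Toy bilinear example: 5 vertices, 6 expression knobs, 9 identity knobs.
M = np.random.rand(15, 6, 9)
w_expression = np.random.rand(6)    # e.g. a linear combination of rows of U_expression
w_identity = np.random.rand(9)      # e.g. one row of U_identity
print(synthesize_face(M, [w_expression, w_identity]).shape)   # (5, 3)
```
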
4.4 Missing Data
Building the multilinear model from a set of face scans requires
capturing the full Cartesian product of different face attributes, (i.e.,
all expressions and visemes need to be captured for each person).
Producing a full data tensor is not always practical for large data
sets. For example, a certain person might have trouble performing
some expressions on cue, or a researcher might add a new expres-
sion to the database but be unable to reach all the previous subjects. In
our case, data corruption and subsequent unavailability of a subject
led to an incomplete tensor. The problem becomes more evident if
we add age as one of the attributes, where we cannot expect to scan
each individual throughout their entire lives. In all these cases, we
would still like to include a person’s successful scans in the model,
and fill in the missing ones with the most likely candidates. This
process is known as imputation.
There are many possible schemes for estimating a model from
incomplete data. A naive imputation would find a complete sub-
tensor, use it to estimate a smaller model, use that to predict a miss-
ing face, use that to augment the data set, and repeat. In a more
sophisticated Bayesian setting, we would treat the missing data as
hidden variables to be MAP estimated (imputed) or marginalized
out. Both approaches require many iterations over a huge data set;
Bayesian methods are particularly expensive and generally require
approximations for tractability. With MAP estimation and naive
imputation, the results can be highly dependent on the order of op-
erations. Because it fails to exploit all available constraints, the
naive imputative scheme generally produces inferior results.
Here we use an imputative scheme that exploits more available
constraints than the naive one, producing better results. The main
intuition, which we formalize below, is that any optimization crite-
ria can be linearized in a particular tensor mode, where it yields a
matrix factorization problem with missing values. Then we lever-
age existing factorization schemes for incomplete matrices, where
known values contribute a set of linear constraints on the missing
values. These constraints are then combined and solved in the least-
squares sense.
Description. Our algorithm consists of two steps. First, for each
mode we assemble an incomplete matrix whose columns are the
corresponding mode vectors. We then seek a subspace decompo-
sition that best reconstructs the known values of that matrix. The
decomposition and the known values provide a set of linear con-
straints for the missing values. This can be done with off-the-shelf
imputative matrix factorizations (e.g., PPCA [Tipping and Bishop
1999], SPCA [Roweis 1997], or ISVD [Brand 2002]). Typically
these algorithms estimate a low-rank subspace from the complete
vectors of the mode and use that to predict missing values in the
incomplete columns (and/or update the subspace). In our experi-
ments we used the standard PPCA formulation for filling in miss-
ing values, which reduces to a system of linear equations that relate
unknown values to known values through the estimated mean and
covariance of the vectors in the mode space. Second, the linear con-
straints are combined through the missing elements, because they
are shared across all groups of modal vectors and must be filled
in with consistent values. To that end, we collect the linear equa-
tions that determine a particular missing value in all the modes, and
solve them together. For example, if two missing values co-occur in
some mode vector, then they must be jointly estimated. We update
the mean and covariance for each decomposition and repeat the two
steps until convergence.
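A much-simplified sketch of this two-step loop is given below, assuming missing entries are marked as NaN. For clarity it uses a plain truncated-SVD estimate per mode and averages the per-mode predictions, whereas the method above uses PPCA and combines the per-mode linear constraints in one least-squares solve; the helper names are ours.

```python
import numpy as np

def unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def fold(M, n, shape):
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full), 0, n)

def impute(T, rank=3, iters=20):
    """Fill NaN entries of the data tensor T by alternating per-mode low-rank estimates."""
    missing = np.isnan(T)
    filled = np.where(missing, np.nanmean(T), T)       # crude start: the grand mean
    for _ in range(iters):
        estimates = []
        for n in range(T.ndim):
            X = unfold(filled, n)                      # columns are mode-n vectors
            mu = X.mean(axis=1, keepdims=True)
            U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
            low_rank = mu + (U[:, :rank] * s[:rank]) @ Vt[:rank]
            estimates.append(fold(low_rank, n, T.shape))
        # The per-mode estimates must agree on the shared missing entries; here we
        # simply average them (the paper instead solves the constraints jointly).
        filled[missing] = np.mean(estimates, axis=0)[missing]
    return filled
```
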
Evaluation. Figure 4 contrasts the results of this method with
faces predicted by our generalization of the simple method pro-
posed by Blanz and colleagues [2003]. In their formulation the
same displacement vectors that make one person smile are copied
over onto every other identity. Because our data set includes smiles
for more than one person, we extend that approach to copy their av-
erage. In this particular example, 15% of real faces were held out
of the trilinear data set and predicted by our imputation scheme and
the simple averaging scheme. Note how the multilinear prediction
is closer to the truth in most examples, even predicting some indi-
vidual idiosyncrasies in puckers and smiles. The simple averaging
scheme, however, seems to do a better job at keeping the lips sealed
for closed-mouth faces (bottom row of Figure 4). We could obtain
better results by preferentially weighting detail around the mouth.
Figure 4: From top to bottom: prediction of held-out faces with our imputation scheme (on the trilinear model), the actual face, and a simple averaging scheme.
In our earlier trilinear experiments, we found that ISVD-based
imputations predicted how faces vary from the mean with less than
9% relative error (Frobenius norm of the total error divided by the
norm of the held-out face variations) for up to 50% missing data. In
general, the predictions are drawn towards the mean of the known
data. Closed-mouth expressions, which are under-represented in
our data and thus lie far from the mean, were not predicted as well
as other expressions. That can be fixed by reweighting the data.
Tests performed on synthetic data indicate that the quality of im-
putation increases as the data set grows in size, even if significant
portions of it are missing. The reason is that if the data is truly
low-dimensional in each of the modes, the missing samples will fall
within the span and density of the known ones.
Probabilistic Interpretation. The above algorithm fills in miss-
ing data by approximating the true multilinear distribution. The
form of this approximation is made precise by a probabilistic inter-
pretation, which starts from a multilinear generative model
$$\mathcal{T} = \mathcal{M} \times_2 \check{U}_2 \times_3 \check{U}_3 \cdots \times_N \check{U}_N + \nu,$$

where $\mathcal{T}$ and $\mathcal{M}$ are the data and model tensors, $\check{U}_i$ is the $i$-th modal subspace, and $\nu$ is a Gaussian noise source. Filling in missing data according to this model is computationally expensive. Instead, we approximate the true likelihood with a geometric average of Gaussians

$$p(\mathcal{T} \mid \mathcal{M}, \{\check{U}_i\}_{i=2}^{N}) \approx \prod_{j=2}^{N} q_j(\mathcal{T}, \mathcal{M}, \{\check{U}_i\}_{i=2}^{N})^{1/N}.$$

Each Gaussian $q_j(\mathcal{T}, \mathcal{M}, \{\check{U}_i\}_{i=2}^{N}) \doteq \mathcal{N}(\mathbf{T}_j \mid \check{U}_j J_j, \sigma_j^2)$ is found by fixing $\{\check{U}_i\}_{i \neq j}$ and turning the tensor Equation (4) into matrix form: $\mathbf{T}_j = \check{U}_j J_j$. Here, columns of $\mathbf{T}_j$ are the mode-$j$ vectors of $\mathcal{T}$, and the columns of $J_j$ are the mode-$j$ vectors of $\mathcal{M} \times_2 \check{U}_2 \cdots \times_{j-1} \check{U}_{j-1} \times_{j+1} \check{U}_{j+1} \cdots \times_N \check{U}_N$. The resulting likelihood becomes:

$$p(\mathcal{T} \mid \mathcal{M}, \{\check{U}_i\}_{i=2}^{N}) \approx \prod_{j=2}^{N} \mathcal{N}(\mathbf{T}_j \mid \check{U}_j J_j, \sigma_j^2)^{1/N},$$

which can be maximized efficiently.

Taking logarithms and discarding constant factors such as $N$ and $\sigma_j$, we seek to minimize the sum-squared error

$$\sum_{j=2}^{N} \big\| \mathbf{T}_j - \check{U}_j J_j \big\|_F^2.$$

Each term of the summation presents a matrix factorization problem with missing values, where $\check{U}_j$ and $J_j$ are treated as unknown factors of the incomplete matrix $\mathbf{T}_j$, and are solved for using PPCA as described above.
5 Face Transfer
One produces animations from a multilinear model by varying the
attribute parameters (the elements of the $\mathbf{w}_i$'s) as if they were dials, and generating mesh coordinates from Equation 5. The N-mode SVD conveniently gives groups of dials that separately control identity, expression and viseme. Within each group, the dials do not correspond to semantically meaningful deformations (such as smile or frown), but rather reflect the deformations that account for most variance. However, the dials can be “tuned” to reflect deformations of interest through a linear transform of each $\mathbf{w}_i$. This
approach was successfully applied in [Allen et al. 2003] to make
their body shape dials correspond to height and weight. A similar
linear scheme was employed in [Blanz and Vetter 1999]. In general,
dial-based systems are currently used on most of the deformable
models in production, but only skilled animators can create believ-
able animations (or even stills) with them. To give similar power
to a casual user, we have devised a method that automatically sets
model parameters from given video data. With this tool, a user can
enact a performance in front of a camera, and have it automatically
transferred to the model.
5.1 Face Tracking
To link the parameters of a multilinear model to video data, we
use optical flow in conjunction with the weak-perspective cam-
era model. Using the symmetric Kanade-Lucas-Tomasi formula-
tion [Birchfield 1996], we express the frame-to-frame motion of a
tracked point with a linear system:
$$Z\mathbf{d} = Z(\mathbf{p} - \mathbf{p}_0) = \mathbf{e}. \quad (6)$$

Here, the 2-vector $\mathbf{d}$ describes the image-space motion of the point, also expressed as the difference between the point's true location $\mathbf{p}$ and its current best guess $\mathbf{p}_0$ (if we have no guess, then $\mathbf{p}_0$ is the location from the previous frame). Matrix $Z$ and vector $\mathbf{e}$ contain spatial and temporal intensity gradient information in the surrounding region [Birchfield 1996].
Using a weak-perspective imaging model, the point position p
can be expanded in terms of rigid head-motion parameters and non-
rigid facial shape parameters, which are constrained by the multi-
linear model:
$$Z(sR\mathbf{f}_i + \mathbf{t} - \mathbf{p}_0) = \mathbf{e}, \quad (7)$$

where the rigid parameters consist of scale factor $s$, the first two rows of a 3D rotation matrix $R$, and the image-space translation $\mathbf{t}$. The 3D shape $\mathbf{f}$ comes from the multilinear model through Equation (5), with $\mathbf{f}_i$ indicating the $i$th 3D vertex being tracked.
Solving for the pose and all the multilinear weights from a pair
of frames using Equation (7) is not a well-constrained problem. To
simplify the computation, we use a coordinate-descent method: we
let only one of the face attributes vary at a time by fixing all the oth-
ers to their current guesses. This transforms the multilinear problem
into a linear one, as described below, which we solve with standard
techniques that simultaneously compute the rigid pose along with
the linear weights from a pair of frames [Bascle and Blake 1998;
Brand and Bhotika 2001].
When we fix all but one attribute of the multilinear model,
thereby making f linear, Equation (7) turns into
$$Z(sR\,\mathcal{M}_{m,i}\,\mathbf{w}_m + \mathbf{t} - \mathbf{p}_0) = \mathbf{e}, \quad (8)$$
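The following is a rough sketch (our own simplification, not the paper's solver) of the resulting inner linear step: with the pose (s, R, t) and all but one attribute held fixed, stacking Equation (8) over the tracked points gives an ordinary least-squares problem for the free weight vector w_m. The full tracker also solves for the rigid pose in the same pass [Bascle and Blake 1998; Brand and Bhotika 2001], which this sketch omits.

```python
import numpy as np

def solve_attribute_weights(Z, e, M_m, s, R, t, p0):
    """Least-squares solve of Equation (8) for one attribute's weights, pose held fixed.

    Z   -- (num_points, 2, 2) per-point spatial gradient matrices
    e   -- (num_points, 2)    per-point temporal gradient terms
    M_m -- (num_points, 3, k) per-vertex linear basis once the other attributes are fixed
    s, R, t, p0 -- scale, 2x3 rotation rows, 2-vector translation, (num_points, 2) guesses
    """
    A_rows, b_rows = [], []
    for Zi, ei, Mi, p0i in zip(Z, e, M_m, p0):
        A_rows.append(Zi @ (s * R @ Mi))        # how the free weights move this point
        b_rows.append(ei - Zi @ (t - p0i))      # residual once the fixed pose is applied
    A = np.vstack(A_rows)
    b = np.concatenate(b_rows)
    w_m, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w_m
```
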

Citations
Journal ArticleDOI
TL;DR: This survey provides an overview of higher-order tensor decompositions, their applications, and available software.
Abstract: This survey provides an overview of higher-order tensor decompositions, their applications, and available software. A tensor is a multidimensional or $N$-way array. Decompositions of higher-order tensors (i.e., $N$-way arrays with $N \geq 3$) have applications in psycho-metrics, chemometrics, signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, neuroscience, graph analysis, and elsewhere. Two particular tensor decompositions can be considered to be higher-order extensions of the matrix singular value decomposition: CANDECOMP/PARAFAC (CP) decomposes a tensor as a sum of rank-one tensors, and the Tucker decomposition is a higher-order form of principal component analysis. There are many other tensor decompositions, including INDSCAL, PARAFAC2, CANDELINC, DEDICOM, and PARATUCK2 as well as nonnegative variants of all of the above. The N-way Toolbox, Tensor Toolbox, and Multilinear Engine are examples of software packages for working with tensors.

9,227 citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: A novel approach for real-time facial reenactment of a monocular target video sequence (e.g., Youtube video) that addresses the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling and re-render the manipulated output video in a photo-realistic fashion.
Abstract: We present a novel approach for real-time facial reenactment of a monocular target video sequence (e.g., Youtube video). The source sequence is also a monocular video stream, captured live with a commodity webcam. Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. To this end, we first address the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, we track facial expressions of both source and target video using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, we convincingly re-render the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. We demonstrate our method in a live setup, where Youtube videos are reenacted in real time.

1,011 citations


Cites background or result from "Face transfer with multilinear mode..."

  • ...[30] perform facial reenactment by tracking a face template, which is rerendered under different expression parameters on top of the target; the mouth interior is directly copied from the source video....
  • ...It is important to note that we maintain the appearance of the target mouth shape; in contrast, existing methods either copy the source mouth region onto the target [30, 11] or a generic teeth proxy is rendered [14, 29], both of which leads to inconsistent results....
  • ...This leads to much more realistic results than either copying the source mouth region [30, 11] or using a generic 3D teeth proxy [14, 29]....

Journal ArticleDOI
TL;DR: There is a much richer matching collection of expressions, enabling depiction of most human facial actions, in FaceWarehouse, a database of 3D facial expressions for visual computing applications.
Abstract: We present FaceWarehouse, a database of 3D facial expressions for visual computing applications. We use Kinect, an off-the-shelf RGBD camera, to capture 150 individuals aged 7-80 from various ethnic backgrounds. For each person, we captured the RGBD data of her different expressions, including the neutral expression and 19 other expressions such as mouth-opening, smile, kiss, etc. For every RGBD raw data record, a set of facial feature points on the color image such as eye corners, mouth contour, and the nose tip are automatically localized, and manually adjusted if better accuracy is required. We then deform a template facial mesh to fit the depth data as closely as possible while matching the feature points on the color image to their corresponding points on the mesh. Starting from these fitted face meshes, we construct a set of individual-specific expression blendshapes for each person. These meshes with consistent topology are assembled as a rank-3 tensor to build a bilinear face model with two attributes: identity and expression. Compared with previous 3D facial databases, for every person in our database, there is a much richer matching collection of expressions, enabling depiction of most human facial actions. We demonstrate the potential of FaceWarehouse for visual computing with four applications: facial image manipulation, face component transfer, real-time performance-based facial image animation, and facial animation retargeting from video to image.

952 citations

Journal ArticleDOI
TL;DR: Given audio of President Barack Obama, a high quality video of him speaking with accurate lip sync is synthesized, composited into a target video clip, and a recurrent neural network learns the mapping from raw audio features to mouth shapes to produce photorealistic results.
Abstract: Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.

763 citations

Journal ArticleDOI
TL;DR: Faces Learned with an Articulated Model and Expressions is low-dimensional but more expressive than the FaceWarehouse model and the Basel Face Model and is compared to these models by fitting them to static 3D scans and 4D sequences using the same optimization method.
Abstract: The field of 3D face modeling has a large gap between high-end and low-end methods. At the high end, the best facial animation is indistinguishable from real humans, but this comes at the cost of extensive manual labor. At the low end, face capture from consumer depth sensors relies on 3D face models that are not expressive enough to capture the variability in natural facial shape and expression. We seek a middle ground by learning a facial model from thousands of accurately aligned 3D scans. Our FLAME model (Faces Learned with an Articulated Model and Expressions) is designed to work with existing graphics software and be easy to fit to data. FLAME uses a linear shape space trained from 3800 scans of human heads. FLAME combines this linear shape space with an articulated jaw, neck, and eyeballs, pose-dependent corrective blendshapes, and additional global expression blendshapes. The pose and expression dependent articulations are learned from 4D face sequences in the D3DFACS dataset along with additional 4D sequences. We accurately register a template mesh to the scan sequences and make the D3DFACS registrations available for research purposes. In total the model is trained from over 33, 000 scans. FLAME is low-dimensional but more expressive than the FaceWarehouse model and the Basel Face Model. We compare FLAME to these models by fitting them to static 3D scans and 4D sequences using the same optimization method. FLAME is significantly more accurate and is available for research purposes (http://flame.is.tue.mpg.de).

629 citations

References
Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations

Journal ArticleDOI
TL;DR: A generative appearance-based method for recognizing human faces under variation in lighting and viewpoint that exploits the fact that the set of images of an object in fixed pose but under all possible illumination conditions, is a convex cone in the space of images.
Abstract: We present a generative appearance-based method for recognizing human faces under variation in lighting and viewpoint. Our method exploits the fact that the set of images of an object in fixed pose, but under all possible illumination conditions, is a convex cone in the space of images. Using a small number of training images of each face taken with different lighting directions, the shape and albedo of the face can be reconstructed. In turn, this reconstruction serves as a generative model that can be used to render (or synthesize) images of the face under novel poses and illumination conditions. The pose space is then sampled and, for each pose, the corresponding illumination cone is approximated by a low-dimensional linear subspace whose basis vectors are estimated using the generative model. Our recognition algorithm assigns to a test image the identity of the closest approximated illumination cone. Test results show that the method performs almost without error, except on the most extreme lighting directions.

5,027 citations

Proceedings ArticleDOI
01 Jul 1999
TL;DR: A new technique for modeling textured 3D faces by transforming the shape and texture of the examples into a vector space representation, which regulates the naturalness of modeled faces avoiding faces with an “unlikely” appearance.
Abstract: In this paper, a new technique for modeling textured 3D faces is introduced. 3D faces can either be generated automatically from one or more photographs, or modeled directly through an intuitive user interface. Users are assisted in two key problems of computer aided face modeling. First, new face images or new 3D face models can be registered automatically by computing dense one-to-one correspondence to an internal face model. Second, the approach regulates the naturalness of modeled faces avoiding faces with an “unlikely” appearance. Starting from an example set of 3D face models, we derive a morphable face model by transforming the shape and texture of the examples into a vector space representation. New faces and expressions can be modeled by forming linear combinations of the prototypes. Shape and texture constraints derived from the statistics of our example faces are used to guide manual modeling or automated matching algorithms. We show 3D face reconstructions from single images and their applications for photo-realistic image manipulations. We also demonstrate face manipulations according to complex parameters such as gender, fullness of a face or its distinctiveness.

4,514 citations

Journal ArticleDOI
TL;DR: The model for three-mode factor analysis is discussed in terms of newer applications of mathematical processes including a type of matrix process termed the Kronecker product and the definition of combination variables.
Abstract: The model for three-mode factor analysis is discussed in terms of newer applications of mathematical processes including a type of matrix process termed the Kronecker product and the definition of combination variables. Three methods of analysis to a type of extension of principal components analysis are discussed. Methods II and III are applicable to analysis of data collected for a large sample of individuals. An extension of the model is described in which allowance is made for unique variance for each combination variable when the data are collected for a large sample of individuals.

3,810 citations

Journal ArticleDOI
TL;DR: In this paper, the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis.
Abstract: Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss the advantages conveyed by the definition of a probability density function for PCA.

3,362 citations

Frequently Asked Questions (1)
Q1. What have the authors contributed in "Face transfer with multilinear models"?

Face Transfer, the method this paper presents, maps videorecorded performances of one individual to facial animations of another.