
Face Recognition Based on
Fitting a 3D Morphable Model
Volker Blanz and Thomas Vetter, Member, IEEE
Abstract—This paper presents a method for face recognition across variations in pose, ranging from frontal to profile views, and
across a wide range of illuminations, including cast shadows and specular reflections. To account for these variations, the algorithm
simulates the process of image formation in 3D space, using computer graphics, and it estimates 3D shape and texture of faces from
single images. The estimate is achieved by fitting a statistical, morphable model of 3D faces to images. The model is learned from a set
of textured 3D scans of heads. We describe the construction of the morphable model, an algorithm to fit the model to images, and a
framework for face identification. In this framework, faces are represented by model parameters for 3D shape and texture. We present
results obtained with 4,488 images from the publicly available CMU-PIE database and 1,940 images from the FERET database.
Index Terms—Face recognition, shape estimation, deformable model, 3D faces, pose invariance, illumination invariance.
1 INTRODUCTION
In face recognition from images, the gray-level or color
values provided to the recognition system depend not
only on the identity of the person, but also on parameters
such as head pose and illumination. Variations in pose and
illumination, which may produce changes larger than the
differences between different people’s images, are the main
challenge for face recognition [39]. The goal of recognition
algorithms is to separate the characteristics of a face, which
are determined by the intrinsic shape and color (texture) of
the facial surface, from the random conditions of image
generation. Unlike pixel noise, these conditions may be
described consistently across the entire image by a
relatively small set of extrinsic parameters, such as camera
and scene geometry, illumination direction and intensity.
Methods in face recognition fall into two fundamental
strategies: One approach is to treat these parameters as
separate variables and model their functional role explicitly.
The other approach does not formally distinguish between
intrinsic and extrinsic parameters, and the fact that extrinsic
parameters are not diagnostic for faces is only captured
statistically.
The latter strategy is taken in algorithms that analyze
intensity images directly using statistical methods or neural
networks (for an overview, see Section 3.2 in [39]).
To obtain a separate parameter for orientation, some
methods parameterize the manifold formed by different
views of an individual within the eigenspace of images [16],
or define separate view-based eigenspaces [28]. Another
way of capturing the viewpoint dependency is to represent
faces by eigen-lightfields [17].
Two-dimensional face models represent gray values
and their image locations independently [3], [4], [18], [23],
[13], [22]. These models, however, do not distinguish
between rotation angle and shape, and only some of them
separate illumination from texture [18]. Since large rota-
tions cannot be generated easily by the 2D warping used
in these algorithms due to occlusions, multiple view-based
2D models have to be combined [36], [11]. Another
approach that separates the image locations of facial
features from their appearance uses an approximation of
how features deform during rotations [26].
Complete separation of shape and orientation is
achieved by fitting a deformable 3D model to images. Some
algorithms match a small number of feature vertices to
image positions, and interpolate deformations of the surface
in between [21]. Others use restricted, but class-specific
deformations, which can be defined manually [24], or
learned from images [10], from nontextured [1] or textured
3D scans of heads [8].
In order to separate texture (albedo) from illumination
conditions, some algorithms, which are derived from shape-
from-shading, use models of illumination that explicitly
consider illumination direction and intensity for Lamber-
tian [15], [38] or non-Lambertian shading [34]. After
analyzing images with shape-from-shading, some algo-
rithms use a 3D head model to synthesize images at novel
orientations [15], [38].
The face recognition system presented in this paper
combines deformable 3D models with a computer graphics
simulation of projection and illumination. This makes
intrinsic shape and texture fully independent of extrinsic
parameters [8], [7]. Given a single image of a person, the
algorithm automatically estimates 3D shape, texture, and all
relevant 3D scene parameters. In our framework, rotations
in depth or changes of illumination are very simple
operations, and all poses and illuminations are covered by
a single model. Illumination is not restricted to Lambertian
reflection, but takes into account specular reflections and
. V. Blanz is with the Max-Planck-Institut für Informatik, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany.
E-mail: blanz@mpi-sb.mpg.de.
. T. Vetter is with the University of Basel, Departement Informatik, Bernoullistrasse 16, 4057 Basel, Switzerland.
E-mail: thomas.vetter@unibas.ch.
Manuscript received 9 Aug. 2002; accepted 10 Mar. 2003.
Recommended for acceptance by P. Belhumeur.
For information on obtaining reprints of this article, please send e-mail to:
tpami@computer.org, and reference IEEECS Log Number 117108.
0162-8828/03/$17.00 © 2003 IEEE. Published by the IEEE Computer Society.

cast shadows, which have considerable influence on the
appearance of human skin.
Our approach is based on a morphable model of 3D faces
that captures the class-specific properties of faces. These
properties are learned automatically from a data set of
3D scans. The morphable model represents shapes and
textures of faces as vectors in a high-dimensional face space,
and involves a probability density function of natural faces
within face space.
Unlike previous systems [8], [7], the algorithm presented
in this paper estimates all 3D scene parameters automati-
cally, including head position and orientation, focal length
of the camera, and illumination direction. This is achieved
by a new initialization procedure that also increases
robustness and reliability of the system considerably. The
new initialization uses image coordinates of between six
and eight feature points. Currently, most face recognition
algorithms require either some initialization, or they are,
unlike our system, restricted to front views or to faces that
are cut out from images.
In this paper, we give a comprehensive description of the
algorithms involved in 1) constructing the morphable
model from 3D scans (Section 3), 2) fitting the model to
images for 3D shape reconstruction (Section 4), which
includes a novel algorithm for parameter optimization
(Appendix B), and 3) measuring similarity of faces for
recognition (Section 5). Recognition results for the image
databases of CMU-PIE [33] and FERET [29] are presented in
Section 5. We start in Section 2 by describing two general
strategies for face recognition with 3D morphable models.
2 PARADIGMS FOR MODEL-BASED RECOGNITION
In face recognition, the set of images that shows all
individuals who are known to the system is often referred
to as gallery [39], [30]. In this paper, one gallery image per
person is provided to the system. Recognition is then
performed on novel probe images. We consider two
particular recognition tasks: For identification, the system
reports which person from the gallery is shown on the
probe image. For verification, a person claims to be a
particular member of the gallery. The system decides if the
probe and the gallery image show the same person (cf. [30]).
Fitting the 3D morphable model to images can be used in
two ways for recognition across different viewing conditions:
Paradigm 1. After fitting the model, recognition can be
based on model coefficients, which represent intrinsic shape
and texture of faces, and are independent of the imaging
conditions. For identification, all gallery images are ana-
lyzed by the fitting algorithm, and the shape and texture
coefficients are stored (Fig. 1). Given a probe image, the
fitting algorithm computes coefficients which are then
compared with all gallery data in order to find the nearest
neighbor. Paradigm 1 is the approach taken in this paper
(Section 5).
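Paradigm 1 thus reduces identification to a nearest-neighbor search in coefficient space. The following minimal sketch illustrates only that comparison step, assuming the fitting algorithm has already produced one stacked coefficient vector per image; the function name and the plain Euclidean distance are illustrative choices, not the paper's actual similarity measures (those are discussed in Section 5).

```python
import numpy as np

def identify(probe_coeffs: np.ndarray, gallery_coeffs: np.ndarray) -> int:
    """Nearest-neighbor identification in model coefficient space.

    probe_coeffs:   (d,) stacked shape/texture coefficients of the probe
    gallery_coeffs: (g, d) one row of stored coefficients per gallery person
    Returns the index of the gallery person closest to the probe.
    """
    distances = np.linalg.norm(gallery_coeffs - probe_coeffs, axis=1)
    return int(np.argmin(distances))
```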
Paradigm 2. Three-dimensional face reconstruction can
also be employed to generate synthetic views from gallery
or probe images [3], [35], [15], [38]. The synthetic views are
then transferred to a second, viewpoint-dependent recogni-
tion system. This paradigm has been evaluated with 10 face
recognition systems in the Face Recognition Vendor Test
2002 [30]: For 9 out of 10 systems, our morphable model and
fitting procedure (Sections 3 and 4) improved performance
on nonfrontal faces substantially.
In many applications, synthetic views have to meet
standard imaging conditions, which may be defined by the
properties of the recognition algorithm, by the way the
gallery images are taken (mug shots), or by a fixed camera
setup for probe images. Standard conditions can be
estimated from an example image by our system (Fig. 2).
If more than one image is required for the second system or
no standard conditions are defined, it may be useful to
synthesize a set of different views of each person.
3 A MORPHABLE MODEL OF 3D FACES
The morphable face model is based on a vector space representation of faces [36] that is constructed such that any convex combination¹ of shape and texture vectors S_i and T_i of a set of examples describes a realistic human face:

$$S = \sum_{i=1}^{m} a_i S_i, \qquad T = \sum_{i=1}^{m} b_i T_i. \tag{1}$$

Continuous changes in the model parameters a_i generate
a smooth transition such that each point of the initial
surface moves toward a point on the final surface. Just as in
morphing, artifacts in intermediate states of the morph are
avoided only if the initial and final points are correspond-
ing structures in the face, such as the tip of the nose.
Therefore, dense point-to-point correspondence is crucial
for defining shape and texture vectors. We describe an
automated method to establish this correspondence in
Section 3.2, and give a definition of S and T in Section 3.3.
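Once dense correspondence makes all example vectors commensurate, generating a face from (1) is a single linear combination. A minimal numpy sketch, under the assumption that the m example shapes and textures are stacked as rows of two arrays:

```python
import numpy as np

def morph(shapes: np.ndarray, textures: np.ndarray,
          a: np.ndarray, b: np.ndarray):
    """Convex combination of example faces, cf. (1).

    shapes:   (m, 3n) shape vectors S_i, one example per row
    textures: (m, 3n) texture vectors T_i
    a, b:     (m,) coefficients; each sums to 1 (footnote 1) so that
              overall size and brightness are preserved
    """
    assert np.isclose(a.sum(), 1.0) and np.isclose(b.sum(), 1.0)
    return a @ shapes, b @ textures
```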
3.1 Database of Three-Dimensional Laser Scans
The morphable model was derived from 3D scans of
100 males and 100 females, aged between 18 and 45 years.
One person is Asian, all others are Caucasian. Applied to
image databases that cover a much larger ethnic variety
Fig. 1. Derived from a database of laser scans, the 3D morphable face model is used to encode gallery and probe images. For identification, the model coefficients α_i, β_i of the probe image are compared with the stored coefficients of all gallery images.
1. To avoid changes in overall size and brightness, a_i and b_i should sum to 1. The additional constraints a_i, b_i ∈ [0, 1] imposed on convex combinations will be replaced by a probabilistic criterion in Section 3.4.

(Section 5), the model seemed to generalize well beyond
ethnic boundaries. Still, a more diverse set of examples
would certainly improve performance.
Recorded with a Cyberware™ 3030PS laser scanner, the scans represent face shape in cylindrical coordinates relative to a vertical axis centered with respect to the head. In 512 angular steps φ covering 360° and 512 vertical steps h at a spacing of 0.615 mm, the device measures radius r, along with red, green, and blue components of surface texture R, G, B. We combine radius and texture data:

$$I(h, \phi) = \big(r(h,\phi),\, R(h,\phi),\, G(h,\phi),\, B(h,\phi)\big)^T, \quad h, \phi \in \{0, \ldots, 511\}. \tag{2}$$
Preprocessing of raw scans involves:
1. filling holes and removing spikes in the surface with
an interactive tool,
2. automated 3D alignment of the faces with the
method of 3D-3D Absolute Orientation [19],
3. semiautomatic trimming along the edge of a bathing
cap, and
4. a vertical, planar cut behind the ears and a
horizontal cut at the neck, to remove the back of
the head, and the shoulders.
3.2 Correspondence Based on Optic Flow
The core step of building a morphable face model is to establish dense point-to-point correspondence between each face and a reference face. The representation in cylindrical coordinates provides a parameterization of the two-dimensional manifold of the facial surface by parameters h and φ. Correspondence is given by a dense vector field v(h, φ) = (Δh(h, φ), Δφ(h, φ))^T such that each point I_1(h, φ) on the first scan corresponds to the point I_2(h + Δh, φ + Δφ) on the second scan. We employ a modified optic flow algorithm to determine this vector field. The following two sections describe the original algorithm and our modifications.
Optic Flow on Gray-Level Images. Many optic flow algorithms (e.g., [20], [25], [2]) are based on the assumption that objects in image sequences I(x, y, t) retain their brightnesses as they move across the image at a velocity (v_x, v_y)^T. This implies

$$\frac{dI}{dt} = v_x \frac{\partial I}{\partial x} + v_y \frac{\partial I}{\partial y} + \frac{\partial I}{\partial t} = 0. \tag{3}$$
For pairs of images I_1, I_2 taken at two discrete moments, the velocities v_x, v_y and the temporal derivative ∂I/∂t in (3) are approximated by finite differences Δx, Δy, and ΔI = I_2 − I_1. If the images are not from a temporal sequence, but show two different objects, corresponding points can no longer be assumed to have equal brightnesses. Still, optic flow algorithms may be applied successfully.
A unique solution for both components of v = (v_x, v_y)^T from (3) can be obtained if v is assumed to be constant on each neighborhood R(x_0, y_0), and the following expression [25], [2] is minimized in each point (x_0, y_0):

$$E(x_0, y_0) = \sum_{x,y \in R(x_0,y_0)} \left[ v_x \frac{\partial I(x,y)}{\partial x} + v_y \frac{\partial I(x,y)}{\partial y} + \Delta I(x,y) \right]^2. \tag{4}$$
We use a 5 × 5 pixel neighborhood R(x_0, y_0). In each point (x_0, y_0), v(x_0, y_0) can be found by solving a 2 × 2 linear system (Appendix A).
In order to deal with large displacements v, the
algorithm of Bergen and Hingorani [2] employs a coarse-
to-fine strategy using Gaussian pyramids of downsampled
images: With the gradient-based method described above,
the algorithm computes the flow field on the lowest level of
resolution and refines it on each subsequent level.
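In code, the per-point minimization of (4) amounts to assembling and solving those 2 × 2 normal equations. A single-level sketch, assuming grayscale float images; the coarse-to-fine pyramid of [2] would wrap this in a loop over downsampled copies, warping I_2 by the current flow estimate at each level.

```python
import numpy as np

def flow_at(I1: np.ndarray, I2: np.ndarray, x0: int, y0: int, r: int = 2):
    """Estimate v = (v_x, v_y) at (x0, y0) by minimizing (4) over a
    (2r+1) x (2r+1) neighborhood; r = 2 gives the paper's 5x5 window."""
    Iy, Ix = np.gradient(I1)                      # spatial derivatives
    win = np.s_[y0 - r:y0 + r + 1, x0 - r:x0 + r + 1]
    gx, gy = Ix[win].ravel(), Iy[win].ravel()
    dI = (I2 - I1)[win].ravel()                   # finite difference of dI/dt
    A = np.array([[gx @ gx, gx @ gy],             # 2x2 system (Appendix A)
                  [gx @ gy, gy @ gy]])
    rhs = -np.array([gx @ dI, gy @ dI])
    return np.linalg.solve(A, rhs)                # singular in textureless regions
```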
Generalization to three-dimensional surfaces. For processing 3D laser scans I(h, φ), (4) is replaced by
Fig. 2. In 3D model fitting, light direction and intensity are estimated automatically, and cast shadows are taken into account. The figure shows
original PIE images (top), reconstructions rendered into the originals (second row), and the same reconstructions rendered with standard illumination
(third row) taken from the top right image.

$$E = \sum_{h,\phi \in R} \left\| v_h \frac{\partial I(h,\phi)}{\partial h} + v_\phi \frac{\partial I(h,\phi)}{\partial \phi} + \Delta I \right\|^2, \tag{5}$$

with a norm

$$\|\Delta I\|^2 = w_r\, \Delta r^2 + w_R\, \Delta R^2 + w_G\, \Delta G^2 + w_B\, \Delta B^2. \tag{6}$$
Weights w_r, w_R, w_G, and w_B compensate for different variations within the radius data and the red, green, and blue texture components, and control the overall weighting of shape versus texture information. The weights are chosen heuristically. The minimum of (5) is again given by a 2 × 2 linear system (Appendix A).
Correspondence between scans of different individuals,
who may differ in overall brightness and size, is improved
by using Laplacian pyramids (band-pass filtering) rather
than Gaussian pyramids (low-pass filtering). Additional
quantities, such as Gaussian curvature, mean curvature, or
surface normals, may be incorporated in I(h, φ). To obtain
reliable results even in regions of the face with no salient
structures, a specifically designed smoothing and interpola-
tion algorithm (Appendix A.1) is added to the matching
procedure on each level of resolution.
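Under the norm (6), the only change to the 2 × 2 system above is that the sums additionally run over the four weighted channels. A sketch of that accumulation, assuming each channel of I(h, φ) is a 2D float array; the weight values passed in are placeholders, since the paper chooses them heuristically.

```python
import numpy as np

def flow_at_3d(ref_chans, tgt_chans, weights, h0, p0, r=2):
    """Weighted multichannel flow at (h0, p0), cf. (5) and (6).

    ref_chans, tgt_chans: sequences of 2D arrays (r, R, G, B) over (h, phi)
    weights:              (w_r, w_R, w_G, w_B)
    """
    A, rhs = np.zeros((2, 2)), np.zeros(2)
    win = np.s_[h0 - r:h0 + r + 1, p0 - r:p0 + r + 1]
    for C1, C2, w in zip(ref_chans, tgt_chans, weights):
        gh, gp = np.gradient(C1)                  # derivatives in h and phi
        gh, gp = gh[win].ravel(), gp[win].ravel()
        dC = (C2 - C1)[win].ravel()
        A += w * np.array([[gh @ gh, gh @ gp],
                           [gh @ gp, gp @ gp]])
        rhs -= w * np.array([gh @ dC, gp @ dC])
    return np.linalg.solve(A, rhs)                # (v_h, v_phi)
```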
3.3 Definition of Face Vectors
The definition of shape and texture vectors is based on a reference face I_0, which can be any three-dimensional face model. Our reference face is a triangular mesh with 75,972 vertices derived from a laser scan. Let the vertices k ∈ {1, ..., n} of this mesh be located at (h_k, φ_k, r(h_k, φ_k)) in cylindrical and at (x_k, y_k, z_k) in Cartesian coordinates and have colors (R_k, G_k, B_k). Reference shape and texture vectors are then defined by

$$S_0 = (x_1, y_1, z_1, x_2, \ldots, x_n, y_n, z_n)^T, \tag{7}$$
$$T_0 = (R_1, G_1, B_1, R_2, \ldots, R_n, G_n, B_n)^T. \tag{8}$$
To encode a novel scan I (Fig. 3, bottom), we compute the flow field from I_0 to I, and convert I(h′, φ′) to Cartesian coordinates x(h′, φ′), y(h′, φ′), z(h′, φ′). Coordinates (x_k, y_k, z_k) and color values (R_k, G_k, B_k) for the shape and texture vectors S and T are then sampled at h′_k = h_k + Δh(h_k, φ_k), φ′_k = φ_k + Δφ(h_k, φ_k).
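A sketch of this sampling step, assuming the scan has already been converted channel-wise to Cartesian maps x, y, z and colors R, G, B over the (h, φ) grid, and that dh, dp hold the flow field components from the reference to the scan. Nearest-grid-point lookup stands in here for interpolation.

```python
import numpy as np

def sample_face_vectors(maps: dict, dh: np.ndarray, dp: np.ndarray,
                        ref_h: np.ndarray, ref_p: np.ndarray):
    """Build S and T for a novel scan, cf. (7) and (8).

    maps:   2D arrays 'x','y','z','R','G','B' over the (h, phi) grid
    dh, dp: flow field components Delta-h and Delta-phi
    ref_h, ref_p: integer grid positions (h_k, phi_k) of the reference
                  mesh vertices
    """
    H, P = dh.shape
    hk = np.clip(np.rint(ref_h + dh[ref_h, ref_p]).astype(int), 0, H - 1)
    pk = np.clip(np.rint(ref_p + dp[ref_h, ref_p]).astype(int), 0, P - 1)
    S = np.stack([maps[c][hk, pk] for c in 'xyz'], axis=1).ravel()
    T = np.stack([maps[c][hk, pk] for c in 'RGB'], axis=1).ravel()
    return S, T
```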
3.4 Principal Component Analysis
We perform a Principal Component Analysis (PCA, see [12]) on the set of shape and texture vectors S_i and T_i of example faces i = 1, ..., m. Ignoring the correlation between
shape and texture data, we analyze shape and texture
separately.
For shape, we subtract the average s̄ = (1/m) Σ_{i=1}^m S_i from each shape vector, a_i = S_i − s̄, and define a data matrix A = (a_1, a_2, ..., a_m).
The essential step of PCA is to compute the eigenvectors s_1, s_2, ... of the covariance matrix C = (1/m) A A^T = (1/m) Σ_{i=1}^m a_i a_i^T, which can be achieved by a Singular Value Decomposition [31] of A. The eigenvalues of C, σ²_{S,1} ≥ σ²_{S,2} ≥ ..., are the variances of the data along each eigenvector. By the same procedure, we obtain texture eigenvectors t_i and variances σ²_{T,i}. Results are visualized in Fig. 4. The eigenvectors form an orthogonal basis,

$$S = \bar{s} + \sum_{i=1}^{m-1} \alpha_i\, s_i, \qquad T = \bar{t} + \sum_{i=1}^{m-1} \beta_i\, t_i, \tag{9}$$
and PCA provides an estimate of the probability density within face space:

$$p_S(S) \sim e^{-\frac{1}{2}\sum_i \alpha_i^2/\sigma_{S,i}^2}, \qquad p_T(T) \sim e^{-\frac{1}{2}\sum_i \beta_i^2/\sigma_{T,i}^2}. \tag{10}$$
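A compact sketch of this PCA via SVD, recovering the shape eigenvectors, the variances σ²_{S,i}, and the exponent of the prior (10); texture is handled identically. It assumes the example vectors arrive as columns, matching the data matrix A above.

```python
import numpy as np

def face_pca(S: np.ndarray):
    """PCA of shape vectors. S: (3n, m), one example face per column.

    Returns the average face, eigenvectors s_i (columns of U), and
    variances sigma^2_{S,i}; only the first m-1 variances are nonzero."""
    s_bar = S.mean(axis=1, keepdims=True)
    A = S - s_bar                                  # data matrix of deviations
    U, sv, _ = np.linalg.svd(A, full_matrices=False)
    return s_bar.ravel(), U, sv**2 / S.shape[1]    # eigenvalues of (1/m) A A^T

def log_prior(alpha: np.ndarray, var: np.ndarray) -> float:
    """Exponent of p_S in (10), up to a constant; pass only the
    components with nonzero variance."""
    return -0.5 * np.sum(alpha**2 / var)
```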
3.5 Segments
From a given set of examples, a larger variety of different
faces can be generated if linear combinations of shape and
texture are formed separately for different regions of the
face. In our system, these regions are the eyes, nose, mouth,
and the surrounding area [8]. Once manually defined on the
reference face, the segmentation applies to the entire
morphable model.
For continuous transitions between the segments, we apply a modification of the image blending technique of [9]: x, y, z coordinates and colors R, G, B are stored in arrays x(h, φ), ... based on the mapping i ↦ (h_i, φ_i) of the reference face. The blending technique interpolates x, y, z and R, G, B across an overlap in the (h, φ)-domain, which is large for low spatial frequencies and small for high frequencies.
Fig. 3. For 3D laser scans parameterized by cylindrical coordinates (h, φ), the flow field that maps each point of the reference face (top) to the corresponding point of the example (bottom) is used to form shape and texture vectors S and T.
Fig. 4. The average and the first two principal components of a data set of 200 3D face scans, visualized by adding ±3σ_{S,i} s_i and ±3σ_{T,i} t_i to the average face.

4 MODEL-BASED IMAGE ANALYSIS
The goal of model-based image analysis is to represent a novel face in an image by model coefficients α_i and β_i (9) and provide a reconstruction of 3D shape. Moreover, it automatically estimates all relevant parameters of the three-dimensional scene, such as pose, focal length of the camera, light intensity, color, and direction.
In an analysis-by-synthesis loop, the algorithm finds model parameters and scene parameters such that the model, rendered by computer graphics algorithms, produces an image as similar as possible to the input image I_input (Fig. 5).² The iterative optimization starts from the average face and standard rendering conditions (front view, frontal illumination, cf. Fig. 6).
For initialization, the system currently requires image coordinates of about seven facial feature points, such as the corners of the eyes or the tip of the nose (Fig. 6). With an interactive tool, the user defines these points j = 1, ..., 7 by alternately clicking on a point of the reference head to select a vertex k_j of the morphable model and on the corresponding point (q_{x,j}, q_{y,j}) in the image. Depending on what part of the face is visible in the image, different vertices k_j may be selected for each image. Some salient features in images, such as the contour line of the cheek, cannot be attributed to a single vertex of the model, but depend on the particular viewpoint and shape of the face. The user can define such points in the image and label them as contours. During the fitting procedure, the algorithm determines potential contour points of the 3D model based on the angle between surface normal and viewing direction and selects the closest contour point of the model as k_j in each iteration.
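The contour test itself reduces to a dot-product threshold on the visible surface. A sketch, assuming unit surface normals and viewing directions per vertex in camera coordinates; the threshold eps is an illustrative value, not taken from the paper.

```python
import numpy as np

def contour_candidates(normals: np.ndarray, view_dirs: np.ndarray,
                       eps: float = 0.05) -> np.ndarray:
    """Vertices whose normal is nearly perpendicular to the viewing
    direction lie on the occluding contour. Inputs: (n, 3) unit vectors."""
    return np.where(np.abs(np.sum(normals * view_dirs, axis=1)) < eps)[0]

def nearest_contour_vertex(candidates: np.ndarray, proj_xy: np.ndarray,
                           clicked_xy: np.ndarray) -> int:
    """Per iteration, pick the model contour point whose image-plane
    projection is closest to a user-labeled contour point."""
    d = np.linalg.norm(proj_xy[candidates] - clicked_xy, axis=1)
    return int(candidates[np.argmin(d)])
```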
The following section summarizes the image synthesis
from the model, and Section 4.2 describes the analysis-by-
synthesis loop for parameter estimation.
4.1 Image Synthesis
The three-dimensional positions and the color values of the model's vertices are given by the coefficients α_i and β_i and (9). Rendering an image includes the following steps.
4.1.1 Image Positions of Vertices
A rigid transformation maps the object-centered coordinates x_k = (x_k, y_k, z_k)^T of each vertex k to a position relative to the camera:

$$(w_{x,k}, w_{y,k}, w_{z,k})^T = R_\gamma R_\theta R_\phi\, \mathbf{x}_k + t_w. \tag{11}$$

The angles φ and θ control in-depth rotations around the vertical and horizontal axis, and γ defines a rotation around the camera axis. t_w is a spatial shift.
A perspective projection then maps vertex k to image plane coordinates (p_{x,k}, p_{y,k}):

$$p_{x,k} = P_x + f\, \frac{w_{x,k}}{w_{z,k}}, \qquad p_{y,k} = P_y - f\, \frac{w_{y,k}}{w_{z,k}}. \tag{12}$$

f is the focal length of the camera, which is located in the origin, and (P_x, P_y) defines the image-plane position of the optical axis (principal point).
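A direct transcription of (11) and (12), with one consistent choice of rotation axes (φ about the vertical axis, θ about the horizontal axis, γ about the camera axis); angles are in radians and vertices are stored as rows.

```python
import numpy as np

def project_vertices(X, phi, theta, gamma, t_w, f, Px, Py):
    """Rigid transformation (11) followed by perspective projection (12).

    X: (n, 3) object-centered vertex positions; returns (n, 2) image points."""
    c, s = np.cos, np.sin
    R_phi = np.array([[ c(phi), 0, s(phi)],          # about vertical axis
                      [ 0,      1, 0     ],
                      [-s(phi), 0, c(phi)]])
    R_theta = np.array([[1, 0,         0        ],   # about horizontal axis
                        [0, c(theta), -s(theta)],
                        [0, s(theta),  c(theta)]])
    R_gamma = np.array([[c(gamma), -s(gamma), 0],    # about camera axis
                        [s(gamma),  c(gamma), 0],
                        [0,         0,        1]])
    W = X @ (R_gamma @ R_theta @ R_phi).T + t_w      # camera coordinates, (11)
    px = Px + f * W[:, 0] / W[:, 2]                  # image coordinates, (12)
    py = Py - f * W[:, 1] / W[:, 2]
    return np.stack([px, py], axis=1)
```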
4.1.2 Illumination and Color
Shading of surfaces depends on the direction of the surface normals n. The normal vector to a triangle k_1 k_2 k_3 of the face mesh is given by a vector product of the edges, (x_{k_1} − x_{k_2}) × (x_{k_1} − x_{k_3}), which is normalized to unit length and rotated along with the head (11). For fitting the model to an image, it is sufficient to consider the centers of triangles only, most of which are about 0.2 mm² in size. The
2. Fig. 5 is illustrated with linear combinations of example faces
according to (1) rather than principal components (9) for visualization.
Fig. 5. The goal of the fitting process is to find shape and texture coefficients α_i and β_i describing a three-dimensional face model such that rendering R_ρ produces an image I_model that is as similar as possible to I_input.
Fig. 6. Face reconstruction from a single image (top, left) and a set of
feature points (top, center): Starting from standard pose and illumination
(top, right), the algorithm computes a rigid transformation and a slight
deformation to fit the features. Subsequently, illumination is estimated.
Shape, texture, transformation, and illumination are then optimized for
the entire face and refined for each segment (second row). From the
reconstructed face, novel views can be generated (bottom row).
