
Regressing Robust and Discriminative 3D Morphable Models with a Very Deep Neural Network

Anh Tuấn Trần¹, Tal Hassner²,³, Iacopo Masi¹, and Gérard Medioni¹

¹Institute for Robotics and Intelligent Systems, USC, CA, USA
²Information Sciences Institute, USC, CA, USA
³The Open University of Israel, Israel
Abstract
The 3D shapes of faces are well known to be discriminative. Yet despite this, they are rarely used for face recognition and always under controlled viewing conditions. We claim that this is a symptom of a serious but often overlooked problem with existing methods for single view 3D face reconstruction: when applied "in the wild", their 3D estimates are either unstable and change for different photos of the same subject or they are over-regularized and generic. In response, we describe a robust method for regressing discriminative 3D morphable face models (3DMM). We use a convolutional neural network (CNN) to regress 3DMM shape and texture parameters directly from an input photo. We overcome the shortage of training data required for this purpose by offering a method for generating huge numbers of labeled examples. The 3D estimates produced by our CNN surpass state of the art accuracy on the MICC data set. Coupled with a 3D-3D face matching pipeline, we show the first competitive face recognition results on the LFW, YTF and IJB-A benchmarks using 3D face shapes as representations, rather than the opaque deep feature vectors used by other modern systems.
1. Introduction
Single view 3D face shape estimation methods were originally proposed as a means for face recognition [4, 7, 28]. This makes sense because 3D shapes are discriminative (different people have different face shapes) yet invariant to lighting, texture changes and more. Indeed, previous work showed that when available, high resolution 3D face scans are excellent face representations which can even be used to distinguish between the faces of identical twins [9].
Curiously, however, despite their widespread use, single view face reconstruction methods are rarely employed by modern face recognition systems. The highly successful 3D Morphable Models (3DMM), for example, were only ever used for recognition in limited, controlled viewing conditions [4, 7, 11, 17, 28]. To our knowledge, there are no reports of successfully using single view face shape estimation (3DMM or any other method) to recognize faces in challenging unconstrained, in the wild settings.

Figure 1: Unconstrained, single view, 3D face shape reconstruction. (a) Input images of the same subject with disruptive poses and occlusions. (b-e) 3D reconstructions using (b) single-view 3DMM [33], (c) flow based method [13], (d) 3DDFA [47], (e) our proposed approach. (b-c) present different 3D shapes for the same subject and (d) appears generic, whereas our method (e) is robust, producing similar discriminative 3D shapes for different views.
An important reason why this may be so is that these methods can be unstable in unconstrained viewing conditions. We later verify this quantitatively, but it can also be seen in Fig. 1, which presents 3D shapes estimated from three unconstrained photos by three different methods (Fig. 1 (b-d)). Clearly, though the same subject appears in all photos, shapes produced by the same method are either very different (b,c) or highly regularized and generic (d). It is therefore unsurprising that these shapes are poor representations for recognition. It also explains why some recent methods use coarse, simple 3D shape approximations only as proxies when rendering faces to new views, rather than as face representations [13, 15, 25, 26, 39].
Contrary to previous work, we show that robust and discriminative 3D face shapes can, in fact, be estimated from single, unconstrained images (Fig. 1 (e)). We propose estimating 3D facial shapes using a very deep convolutional neural network (CNN) to regress 3DMM shape and texture parameters directly from single face photos. We identify the shortage of labeled training data as an obstacle to using data-hungry CNNs for this purpose. We address this problem with a novel means for generating a huge labeled training set of unconstrained faces and their 3DMM representations. Coupled with additional technical novelties, we obtain a method which is fast, robust and accurate.
The accuracy of our estimated shapes is verified on the MICC data set [1] and quantitatively shown to surpass the accuracy of other 3D reconstruction methods. We further show that our estimated shapes are robust and discriminative by presenting face recognition results on the Labeled Faces in the Wild (LFW) [18], YouTube Faces (YTF) [42] and IJB-A [23] benchmarks. To our knowledge, this is the first time single image 3D face shapes are successfully used to represent faces from modern, unconstrained face recognition benchmarks. Finally, to promote reproduction of our results, we publicly release our code and models.¹

¹ Please see www.openu.ac.il/home/hassner/projects/CNN3DMM for updates.
2. Related work
Over the years, many attempts were made to estimate the 3D surface of a face appearing in a single view. Before listing them, it is important to mention recent multi image methods which use image sets for reconstruction (e.g., [24, 30, 34, 35, 38]). Although these methods produce accurate 3D reconstructions, they require many images from multiple sources to produce a single 3D face shape, whereas we reconstruct faces from single images.

Methods for single view 3D face reconstruction can broadly be categorized into the following types.
Statistical shape representations, such as the widely popular 3DMM [5, 6, 11, 28, 32, 40, 45], use many aligned 3D face shapes to learn a distribution of 3D faces, represented as a high dimensional subspace. Each point on this subspace is a parameter vector representing facial geometry and sometimes expression and texture. Reconstruction is performed by searching for a point on this subspace that represents a face similar to the one in the input image. These methods do not attempt to produce discriminative facial geometries and indeed, as mentioned earlier, were only used for face recognition under controlled settings.
The very recent method of [31] also uses a CNN to regress 3DMM parameters for face photos. They too recognize the absence of training data as a major concern. Contrary to us, they propose synthesizing training faces with known geometry by sampling from the 3DMM distribution. This approach produces synthetic looking photos which can easily cause overfitting problems when training large networks [26]. They were therefore able to train only a shallow residual network (seven layers, compared to our 101) and their estimated shapes were not shown to be more robust or discriminative than those of other methods.
Scene assumption methods. In order to obtain correct reconstructions, some methods make strong assumptions on the scene and the viewing conditions in the input image. Shape from shading methods [21], for example, make assumptions on the light sources, facial reflectance and more. Others instead use facial symmetry [12]. The assumptions they and others make often do not hold in practice, limiting the application of these methods to controlled settings.
Example based methods, beginning with the work of [14] and more recently [13, 39], modify the 3D surface of example face shapes, fitting them to the face appearing in the input photo. These methods favor robustness to challenging viewing conditions over detailed reconstructions. They were thus used in face recognition only to synthesize new views from unseen poses.
Landmark fitting methods. Finally, some reconstruction techniques fit a 3D surface to detected facial landmarks rather than to face intensities directly. These include methods designed for videos (e.g., [19, 36]) and the CNN based approaches of [20, 47]. These focus more on landmark detection than 3D shape estimation and so do not attempt to produce detailed and discriminative facial geometries.
3. Regressing 3DMM parameters with a CNN
We propose to regress 3DMM face shape parameters directly from an input photo using a very deep CNN. Ostensibly, CNNs are ideal for this task: after all, they are being successfully applied to many related computer vision tasks. But despite their success, apart from [31], we are unaware of published reports of using CNNs for 3DMM parameter regression.

We believe CNNs were not used here because this is a regression problem where both the input photo and the output 3DMM shape parameters are high dimensional. Solving such problems requires deep networks, and these need massive amounts of training data. Unfortunately, existing unconstrained face sets with ground truth 3D shapes are far too small for this purpose, and obtaining large quantities of 3D face scans is labor intensive and impractical.
We therefore instead leverage three key observations.

1. As discussed in Sec. 2, accurate 3D estimates can be obtained by using multiple images of the same face.

2. Unlike the limited availability of ground truth 3D face shapes, there is certainly no shortage of challenging face sets containing multiple photos per subject.

3. Highly effective deep networks are available for the related task of extracting robust and discriminative face representations for face recognition.
From (1), we have a reasonable way of producing 3D face shape estimates for training, as surrogates for ground truth shapes: by using a robust method for multi-view 3DMM estimation. Getting multiple photos for enough subjects is very easy (2). This abundance of examples further allows balancing any reconstruction errors with potentially limitless subjects to train on. Finally, (3), a state of the art CNN for face recognition may be fine-tuned to this problem. It should already be tuned for unconstrained facial appearance variations and trained to produce similar, discriminative outputs for different images of the same face.
3.1. Generating training data
To generate training data, we use a simple yet effective multi image 3DMM estimation method, loosely based on the one recently proposed by [30]. We run it on the unconstrained faces in the CASIA WebFace dataset [46]. These multi image 3DMM estimates are then used as ground truth 3D face shapes when training our CNN 3DMM regressor.

Multi image 3DMM reconstruction is performed by first estimating 3DMM parameters from the 500k single images in CASIA. 3DMM estimates for images of the same subject are then aggregated into a single 3DMM per subject (10k subjects). This process is described next (see also Fig. 2).
The 3DMM representation. Our system uses the popular Basel Face Model (BFM) [28]. It is a publicly available 3DMM representation and one of the state of the art methods for single view 3D face modeling.

A face is modeled by decoupling its shape and texture, giving the following two independent generative models:

S = \hat{s} + W_S \alpha, \qquad T = \hat{t} + W_T \beta. \qquad (1)
Here, the vectors ŝ and t̂ are the mean face shape and texture, computed over the aligned facial 3D scans in the Basel Faces collection and represented by the concatenated 3D coordinates of the 3D point clouds and the concatenated RGB values of their textures. The matrices W_S and W_T are the principal components, computed from the same aligned facial scans. Finally, α and β are each 99D parameter vectors, representing shape and texture respectively.
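As a concrete illustration of Eq. (1), the following minimal sketch combines the BFM bases in NumPy. It assumes the mean vectors and principal component matrices have already been loaded into arrays; the function and variable names here are ours, not part of the BFM distribution.

```python
import numpy as np

def generate_face(mean_shape, pc_shape, alpha, mean_tex, pc_tex, beta):
    """Eq. (1): S = s_hat + W_S @ alpha, T = t_hat + W_T @ beta."""
    S = mean_shape + pc_shape @ alpha   # (3N,) concatenated XYZ coordinates
    T = mean_tex + pc_tex @ beta        # (3N,) concatenated RGB values
    return S.reshape(-1, 3), T.reshape(-1, 3)

# A random face can be sampled by drawing alpha, beta from the model's
# Gaussian prior, e.g. alpha = sigma_shape * np.random.randn(99), where
# sigma_shape holds the (assumed) per-component standard deviations.
```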
Single image 3DMM fitting. Fitting a 3DMM to each training image is performed with a slightly modified version of the two standard methods of [8] and [33]. Given an image I, we estimate parameter vectors α* and β* which represent a face similar to the one in I (Eq. (1)). Unlike previous work, we begin processing by applying the CLNF [22] state of the art facial landmark detector. It provides K = 68 facial landmarks p_k ∈ ℝ², k ∈ 1..K, and a confidence score value w (which we use later on).
Landmarks are used to obtain an initial estimate for the pose of the input face in the reference 3DMM coordinate system. Pose is represented by six degrees of freedom, for rotation, r = [r_α, r_β, r_γ], and translation, t = [t_X, t_Y, t_Z], and estimated similar to [13]. 3DMM fitting then proceeds by optimizing over the shape, texture, pose, illumination, and color model following [8]. We found that CLNF makes occasional localization errors. To introduce more stability, our optimization also uses the edge-based cost of [33]. For more details on this optimization, we refer to [8] and [33].
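To make the pose initialization step concrete, the sketch below recovers the six pose parameters from the 2D landmarks and their corresponding vertices on the mean 3DMM shape with a standard PnP solver. This is only an assumed, illustrative realization of the step described above (the actual estimation follows [13], which we do not reproduce here); the camera matrix `K_cam` and the helper name are hypothetical.

```python
import cv2
import numpy as np

def initial_pose(landmarks_2d, landmarks_3d, K_cam):
    """Estimate r = [r_alpha, r_beta, r_gamma], t = [t_X, t_Y, t_Z] via PnP."""
    ok, rvec, tvec = cv2.solvePnP(
        landmarks_3d.astype(np.float64),  # (68, 3) vertices on the mean face
        landmarks_2d.astype(np.float64),  # (68, 2) CLNF detections
        K_cam.astype(np.float64),         # 3x3 intrinsic camera matrix
        None)                             # no lens distortion assumed
    return rvec.ravel(), tvec.ravel()
```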
Once the optimization converges, we take the shape and texture parameters, α* and β*, from the last iteration as our single image 3DMM estimate for the input image I. Importantly, though this process is known to be computationally expensive, it is applied in our pipeline only in preprocessing, once for every training image. We later show our CNN regressor to be much faster.
Multi image 3DMM fitting. Although a number of multi image 3D face shape estimation methods were proposed in the past, we found the following simple approach, inspired by the very recent work of [30], to be particularly effective.

Specifically, we pool the shape and texture 3DMM parameters γ_i = [α_i, β_i], i ∈ 1..N, across all the N single view estimates belonging to the same subject. Pooling is performed by element wise weighted averaging of the N 3DMM vectors, resulting in a single 3DMM estimate for that subject, γ̂. That is,

\hat{\gamma} = \sum_{i=1}^{N} w_i \cdot \gamma_i, \qquad \text{with} \qquad \sum_{i=1}^{N} w_i = 1, \qquad (2)
where the w_i are normalized per-image confidences provided by the CLNF facial landmark detector.

Note that unlike [30], we do not use a rank-list based on distances of normals as a quality measure to pool 3DMM parameters, instead taking the landmark detection confidence measure for these weights. Following this process, each CASIA subject is associated with a single, pooled 3DMM parameter vector γ̂. For ease of notation, henceforth we drop the hat when denoting pooled features, assuming all training set 3DMM parameters were pooled.
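In code, the pooling of Eq. (2) is a one-liner over the per-image estimates. The sketch below assumes the single-image fits are stacked in an (N, 198) array (99 shape plus 99 texture coefficients each) together with their CLNF confidences; the names are illustrative.

```python
import numpy as np

def pool_3dmm(gammas, confidences):
    """Eq. (2): confidence-weighted average of per-image 3DMM vectors."""
    w = np.asarray(confidences, dtype=np.float64)
    w = w / w.sum()                              # normalize weights to sum to 1
    return (w[:, None] * np.asarray(gammas)).sum(axis=0)
```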
3.2. Learning to regress pooled 3DMM
Following the process described in Sec. 3.1, each subject in our data set is associated with a number of images and a single, pooled 3DMM. We now use this data to learn a function which, ideally, regresses the same pooled 3DMM feature vector for different photos of the same subject.

To this end, we use a state of the art CNN, trained for face recognition. We use the very deep ResNet architecture [16] with 101 layers, recently trained for face recognition by [26]. We modify its last fully-connected layer to output the 198D 3DMM feature vector γ. The network is then fine-tuned on CASIA images using the pooled 3DMM estimates as target values; different images of the same subject are presented to the CNN with the same target 3DMM shape. We note that we also tried the VGG-Face CNN of [27] with 16 layers. Its results were similar to those obtained by the ResNet architecture, though somewhat lower.

Figure 2: Overview of our process. (a) Large quantities of unconstrained photos are used to fit a single 3DMM for each subject. (b) This is done by first fitting single image 3DMM shape and texture parameters to each image separately. Then, all 3DMM estimates for the same subject are pooled together for a single estimate per subject. (c) These pooled estimates are used in place of expensive ground truth face scans to train a very deep CNN to regress 3DMM parameters directly.
The asymmetric Euclidean loss. Training our network requires some care when defining its loss function. 3DMM vectors, by construction, belong to a multivariate Gaussian distribution with its mean on the origin, representing the mean face (Sec. 3.1). Consequently, during training, using the standard Euclidean loss to minimize distances between estimated and target 3DMM vectors will favor estimates closer to the origin: these will have a higher probability of being closer to their target values than those further away. In practice, we found that a network trained with the Euclidean loss tends to output less detailed faces (Fig. 3).

To counter this bias towards a mean face shape, we introduce an asymmetric Euclidean loss. It is designed to encourage the network to favor estimates further away from the origin by decoupling under-estimation errors (errors on the side of the 3DMM target closer to the origin) from over-estimation errors (where the estimate is further out from the origin than the target). It is defined by:

L(\gamma_p, \gamma) = \lambda_1 \cdot \underbrace{\lVert \gamma^{+} - \gamma^{\max} \rVert_2^2}_{\text{over-estimate}} + \lambda_2 \cdot \underbrace{\lVert \gamma_p^{+} - \gamma^{\max} \rVert_2^2}_{\text{under-estimate}}, \qquad (3)

using the element-wise operators:

\gamma^{+} \doteq \mathrm{abs}(\gamma) = \mathrm{sign}(\gamma) \cdot \gamma; \qquad \gamma_p^{+} \doteq \mathrm{sign}(\gamma) \cdot \gamma_p, \qquad (4)

\gamma^{\max} \doteq \max(\gamma^{+}, \gamma_p^{+}). \qquad (5)
Here, γ is the target pooled 3DMM value, γ_p is the regressed 3DMM output, and λ_{1,2} control the trade-off between the over- and under-estimation errors. When both equal 1, this reduces to the traditional Euclidean loss. In practice, we set λ_1 = 1, λ_2 = 3, thus changing the behavior of the training process, allowing it to escape under-fitting faster and encouraging the network to produce more detailed, realistic 3D face models (Fig. 3).

Figure 3: Effect of our loss function: (left) input image, (a) generic model, (b) regressed shape and texture with a regular ℓ₂ loss and (c) our proposed asymmetric ℓ₂ loss.
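Since the operators in Eqs. (4)-(5) are all element-wise, the loss is straightforward to implement. Below is a minimal PyTorch sketch under our reading of the equations; the function name is ours.

```python
import torch

def asymmetric_l2_loss(pred, target, lam1=1.0, lam2=3.0):
    """Eqs. (3)-(5): pred is gamma_p, target is the pooled gamma."""
    sign = torch.sign(target)
    gamma_plus = sign * target                       # abs(target), Eq. (4)
    gamma_p_plus = sign * pred                       # pred, flipped into target's orthant
    gamma_max = torch.max(gamma_plus, gamma_p_plus)  # element-wise max, Eq. (5)
    over = ((gamma_plus - gamma_max) ** 2).sum()     # over-estimation term
    under = ((gamma_p_plus - gamma_max) ** 2).sum()  # under-estimation term
    return lam1 * over + lam2 * under

# With lam1 = lam2 = 1 this reduces to the ordinary squared Euclidean
# loss; lam2 = 3 penalizes under-shooting toward the mean face harder.
```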
Network hyperparameters. Eq. (3) is solved using Stochastic Gradient Descent (SGD) with a mini-batch of size 144, momentum set to 0.9 and ℓ₂ regularization over the weights with a weight decay of 0.0005. When performing back-propagation, we learn the inner product layer (fc) after pool5 faster, setting its learning rate to 0.01, since it is trained from scratch for the regression problem. Other network weights are updated with a learning rate an order of magnitude lower. When the validation loss saturates, we decrease learning rates by an order of magnitude, until the validation loss stops decreasing.
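This schedule translates directly into per-layer learning rates. The following is a hypothetical PyTorch rendition (the paper does not specify a framework here); the torchvision loader and module names are our assumptions.

```python
import torch
from torchvision.models import resnet101

# ResNet-101 with its final fc layer replaced to regress the 198D gamma.
model = resnet101(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 198)

optimizer = torch.optim.SGD(
    [
        # The new regression head is trained from scratch, so it gets
        # the faster learning rate of 0.01.
        {'params': model.fc.parameters(), 'lr': 0.01},
        # All other (fine-tuned) weights: an order of magnitude lower.
        {'params': [p for n, p in model.named_parameters()
                    if not n.startswith('fc.')], 'lr': 0.001},
    ],
    momentum=0.9, weight_decay=0.0005)  # mini-batch size 144 in the paper
```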
Discussion: Render-free 3DMM estimator. It is important to note that by choosing to use a CNN to regress 3DMM parameters, we obtain a function that is render-free. That is, 3DMM parameters are regressed directly from the input image, without an optimization process which renders the face and compares it to the photo, as do existing methods for 3DMM estimation (including our method for generating training data in Sec. 3.1). By using a CNN, we therefore hope to gain not only improved accuracy, but also much faster 3DMM estimation speeds.
3.3. Parameter based 3D-3D recognition
The CNN we train in Sec. 3.2 represents a function f : I ↦ γ_p, giving us 3DMM parameters γ_p for an input image I. We later use our 3DMM estimates in face recognition benchmarks, to test how robust and discriminative they are. We next describe the method used for that purpose to evaluate the similarity of two face shapes and textures, to determine if they represent the same subject.
3D-3D recognition with a single image. We perform face recognition using the 3DMM parameters regressed by our network: by using the 3DMM parameters γ_p as face descriptors. Because different benchmarks often exhibit specific appearance biases, we apply Principal Component Analysis (PCA), learned from the training splits of the test benchmark, to adapt our estimated parameter vectors to the benchmark. Signed, element wise square rooting of these vectors is then used to further improve representation power [29]. Finally, the similarity of two faces, s(γ_{p1}, γ_{p2}), is evaluated by computing their cosine score:

s(\gamma_{p1}, \gamma_{p2}) = \frac{\gamma_{p1} \cdot \gamma_{p2}^{T}}{\lVert \gamma_{p1} \rVert \cdot \lVert \gamma_{p2} \rVert}. \qquad (6)
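Put together, the single-image matching pipeline is only a few lines. The sketch below uses scikit-learn's PCA as one plausible stand-in for the benchmark-adaptation step; the function names and the choice of library are ours.

```python
import numpy as np
from sklearn.decomposition import PCA

def adapt(train_gammas, gammas):
    """Benchmark adaptation: PCA fit on the training split, then
    signed element-wise square rooting [29]."""
    pca = PCA().fit(train_gammas)
    x = pca.transform(gammas)
    return np.sign(x) * np.sqrt(np.abs(x))

def cosine_score(g1, g2):
    """Eq. (6): cosine similarity of two adapted descriptors."""
    return float(g1 @ g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))
```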
3D-3D recognition with multiple images. In some scenarios, a subject is represented by a set of images rather than just one. This is the case in the YTF benchmark [42], where videos are used, each containing multiple frames, and in the recent IJB-A [23], which uses templates containing heterogeneous visual data (images, videos and possibly more).

We use the same pipeline for image sets as for single images. Here, however, 3DMM parameters for different images or frames are first pooled using Eq. (2). Unlike the process applied in Sec. 3.1, all images here have equal weights, as we do not run landmark detection prior to 3DMM fitting with our CNN (see below). When using templates with both videos and images, following [26], we first pool the 3DMM estimates for the frames in each video separately, obtaining one 3DMM per video. We then pool these 3DMMs with those of the other images in the same template, as sketched below.
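A minimal, self-contained sketch of this template pooling, with the equal weights just described (names are illustrative):

```python
import numpy as np

def pool_template(video_frame_gammas, image_gammas):
    """Average frames within each video first, then average the
    resulting per-video 3DMMs together with the still-image 3DMMs."""
    per_video = [np.mean(frames, axis=0) for frames in video_frame_gammas]
    pooled = per_video + [np.asarray(g) for g in image_gammas]
    return np.mean(pooled, axis=0)
```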
Face alignment. Facial landmark detection and face alignment are known to improve recognition accuracy (e.g., [43, 15]). In fact, the recent, related work of [17] manually assigned landmarks before using their 3DMM fitting method for recognition on controlled images. We, however, did not align faces beyond using the bounding boxes provided with these data sets. We found our method robust to misalignments and so spared the runtime alignment would require.
4. Experimental results
We test our proposed method, comparing the accuracy of its estimated 3D shapes, its speed and its ability to represent faces for recognition with those of existing methods. Importantly, we are unaware of any previous work on single view 3D face shape estimation which reported as many quantitative tests as we do, in terms of the number of benchmarks used, the number of baseline methods compared with and the level of difficulty of the photos used in these tests.
Method           | 3DRMSE   | RMSE     | log10 (x10^4) | Rel (x10^4) | Sec.
-----------------|----------|----------|---------------|-------------|------
Generic          | 1.88±.52 | 3.48±.76 | 28±7          | 65±16       | n/a
3DMM [33]        | 1.75±.42 | 3.64±.94 | 29±8          | 68±18       | 120
Flow-based [13]  | 1.83±.39 | 3.29±.70 | 27±6          | 62±14       | 13.3
Us               | 1.57±.33 | 3.18±.77 | 26±6          | 59±14       | .088
Generic+pool     | 1.88±.52 | 3.48±.76 | 28±7          | 65±16       | n/a
3DMM [33]+pool*  | 1.60±.46 | 3.31±.98 | 27±9          | 62±20       | 120
3DDFA [47]+pool  | 1.83±.58 | 3.45±.85 | 28±7          | 65±17       | .146
[19]             | 1.84±.32 | 3.73±.62 | 30±5          | 68±11       | .372
[2]+pool         | 1.84±.58 | 3.45±.85 | 28±6          | 65±13       | 52.3
Us+pool          | 1.53±.29 | 3.14±.70 | 25±6          | 58±13       | .088

Table 1: 3D estimation accuracy and per-image speed (Sec., in seconds) on the MICC dataset. Top: single view methods; bottom: multi frame. See text for details on measures. 3DRMSE is in real-world mm. * denotes the method used to produce the training data in Sec. 3.1. Lower values are better.
Figure 4: Qualitative comparison of surface errors, visualized as heat maps with real world mm errors on MICC face videos and their ground truth 3D shapes. Left to right, top to bottom: frame from input; 3D ground-truth; generic face; estimates for flow-based method [13], Huber et al. [19], 3DDFA [47], Bas et al. [2], 3DMM+pool [33], us+pool.
Specifically, we evaluate the accuracy of our estimated 3D shapes using videos and photos and their corresponding scanned, ground truth 3D shapes from the MICC Florence Faces dataset [1] (Sec. 4.1). To test how discriminative and robust our shapes are when estimated from unconstrained images, we perform single image and multi image face recognition using the LFW [18], YTF [42] and the new IARPA JANUS Benchmark-A (IJB-A) [23] (Sec. 4.3). Finally, we also provide qualitative results in Sec. 4.4.

As baseline 3D reconstruction methods we used standard 3DMM fitting [33] (implemented by us), the flow-based method of [13], the edge based method of [2], the multi resolution, multi-view approach of [19] and the 3DDFA of [47], all tested with their authors' implementations.
4.1. 3D shape reconstruction accuracy
The MICC dataset [1] contains challenging face videos of 53 subjects. The videos span the range of controlled to challenging unconstrained outdoor settings. For each of the subjects in these videos, the data set also contains a ground-truth 3D model acquired using a high-precision structured-light scanning system. This allows comparing our 3D
