
Analyzing and Reducing the Damage of Dataset Bias
to Face Recognition with Synthetic Data
Adam Kortylewski Bernhard Egger Andreas Schneider
Thomas Gerig Andreas Morel-Forster Thomas Vetter
Department of Mathematics and Computer Science
University of Basel
Abstract
It is well known that deep learning approaches to face
recognition suffer from various biases in the available train-
ing data. In this work, we demonstrate the large potential
of synthetic data for analyzing and reducing the negative
effects of dataset bias on deep face recognition systems. In
particular, we explore two complementary application areas
for synthetic face images: 1) Using fully annotated synthetic
face images we can study the face recognition rate as a
function of interpretable parameters such as face pose. This
enables us to systematically analyze the effect of different
types of dataset biases on the generalization ability of neu-
ral network architectures. Our analysis reveals that deeper
neural network architectures can generalize better to un-
seen face poses. Furthermore, our study shows that current
neural network architectures cannot disentangle face pose
and facial identity, which limits their generalization ability.
2) We pre-train neural networks with large-scale synthetic
data that is highly variable in face pose and the number of
facial identities. After a subsequent fine-tuning with real-
world data, we observe that the damage of dataset bias in
the real-world data is largely reduced. Furthermore, we
demonstrate that the size of real-world datasets can be re-
duced by 75% while maintaining competitive face recogni-
tion performance. The data and software used in this work
are publicly available¹.

¹ https://github.com/unibas-gravis/parametric-face-image-generator
1. Introduction
Deep face recognition systems [22, 21, 19] have
achieved remarkable performances on challenging datasets,
due to advances in deep learning [18] and the availability
of large-scale training data [10, 13, 25]. However, training
datasets for face recognition are biased regarding nuisance
variables, such as the face pose or the illumination condi-
tions, because they were mostly collected from the web. It
is well known that such biases have severe negative effects
on the generalization performance of machine learning sys-
tems [24, 14, 23, 17]. Therefore, the face recognition com-
munity faces two fundamental problems: 1) It is difficult to
systematically analyze the effects of dataset bias on the gen-
eralization performance, since a fine-grained annotation of
nuisance variables is practically infeasible on large-scale
datasets. 2) Deep face recognition systems do not gener-
alize well across benchmarks, due to the severe sampling
biases in public datasets (as illustrated in Section 4). This
causes well-known problems such as a lack of diversity and
fairness in face recognition [15]. It is unclear how such
damage from dataset bias can be undone.
We propose to overcome both problems by leveraging
synthetic face images which are generated with a paramet-
ric 3D Morphable Face Model [3, 7]. In particular, we in-
troduce a data generator which creates synthetic face im-
ages with precise annotation of parameters that define the
facial identity, such as shape and texture, but also of nui-
sance parameters, such as light, camera and head pose. In
our experiments, we explore two application areas for syn-
thetic images in the context of face recognition:
Systematic analysis of the damage from dataset
bias. We use fully annotated synthetic face images to
study the face recognition rate as a function of nui-
sance variables such as face pose. This enables us
to systematically study the effect of different types of
dataset biases on the generalization ability of neural
network architectures.
Pre-training with synthetic data. We generate large-
scale synthetic data for pre-training DCNNs and sub-
sequently fine-tune them with real-world data. The
parametric nature of the generator enables us to design
the distribution of nuisances in the synthetic data such
that it is highly variable in nuisance parameters that are
well known to be biased in real-world datasets (such as
pose and facial identity).

Based on our extensive experimental evaluation we gain
several novel insights about the effects of dataset bias on the
generalization ability of DCNNs at the task of face recog-
nition: i) It is well known that DCNNs with the VGG-16
architecture generalize better than those with the AlexNet ar-
chitecture at face recognition tasks. Using the presented
methodology we reveal that VGG-16 outperforms AlexNet,
because it generalizes much better to unseen face poses,
although it has significantly more parameters (Section 3.2).
ii) In a real world scenario, not all identities in the training
data share the same distribution of face poses. We simulate
this setting and observe that DCNNs cannot disentangle the
facial identity from the face pose, which limits their abil-
ity to generalize from biased data (Section 3.3). iii) Using
synthetic face images for pre-training, we can enhance the
generalization performance of deep neural networks consis-
tently across several benchmark datasets (Section 4.3). iv)
The amount of real-world data needed to achieve competi-
tive performance is reduced considerably (Section 4.3) after
pre-training with synthetic data. This offers a means to
concentrate data collection efforts on fewer but higher-
quality samples in terms of variability.
Curiously, despite the success of 3D Morphable Face
Models at facial image generation, we are not aware of any
previous work that uses this effective and easily accessible
approach to analyze and enhance face recognition systems.
2. Face Image Generator
We use a fully parametric generator for the synthesis of
face images with detailed annotation of the most relevant
nuisance transformations. Our generator is based on a 3D
Morphable Model [3] of face shape, color and expression.
In particular, we use the Basel Face Model 2017 (BFM-
2017) [7] which is learned from 200 neutral face scans and
160 expression deformations. Natural-looking, three-dimen-
sional faces with expressions can be generated by sampling
from the statistical distribution of the model. In order to
achieve a natural illumination in the synthetic face images,
we sample the spherical harmonics illumination parameters
from the Basel Illumination Prior (BIP) [5]. Using com-
puter graphics we generate a 2D image from a 3D face sam-
pled from the model. We use a non-parametric background
model that chooses random background textures from the
data provided in the describable texture database [4]. The
face image generator is built on the scalismo-faces software
framework [20]. The advantage of using 3DMMs for data
synthesis over related generative face models such as
GANs [2, 8] is that the 3DMM provides full control over
disentangled parameters that change the facial identity in
the terms of shape and albedo texture as well as pose, il-
lumination and facial expression. The proposed generator
enables us to generate an unlimited number of face images with
detailed labeling of the most relevant sources of image vari-
ation. Example images synthesized from the generator are
illustrated in Figure 1.

Figure 1: Synthetic face images sampled from our data gen-
erator. The facial identity in each row is the same. The top
row illustrates the precise control over image parameters,
where only the yaw pose is changed while all other nuisance
parameters are fixed (as used in Section 3). The bottom row
illustrates synthetic faces generated by randomly sampling
all nuisance variables (as used in Section 4).

Using the fine-grained annotation of
the synthetic data enables us to systematically analyze dif-
ferent DCNN architectures on a common ground at the task
of face recognition in the next section. Subsequently, we
study how the generalization performance is affected when
large-scale synthetic data is used for pre-training in Section
4.
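To make the generation procedure concrete, the following Python sketch
outlines one draw from the generator. All object and function names
(bfm, bip, renderer) are hypothetical placeholders, since the actual
implementation is built on the Scala-based scalismo-faces framework [20].

    import random

    def generate_sample(bfm, bip, backgrounds, renderer, yaw_range=(-1.571, 1.571)):
        """One draw from the face image generator (hypothetical API)."""
        # Facial identity and expression: coefficients of the BFM-2017 3DMM,
        # sampled from the model's statistical distribution.
        shape = bfm.sample_shape()            # hypothetical call
        color = bfm.sample_color()            # hypothetical call
        expression = bfm.sample_expression()  # hypothetical call

        # Nuisance parameters; all of them are recorded as labels.
        yaw = random.uniform(*yaw_range)         # head pose
        illumination = bip.sample()              # Basel Illumination Prior [5]
        background = random.choice(backgrounds)  # describable textures [4]

        image = renderer(shape, color, expression, yaw, illumination, background)
        labels = {"shape": shape, "color": color, "expression": expression,
                  "yaw": yaw, "illumination": illumination}
        return image, labels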
3. Analyzing the Damage of Dataset Bias
The fine-grained control over the image variation in the
training and test data enables us to decompose the total
recognition rate (TRR) as a function along the axis of nui-
sance transformations. With this tool at hand, we study how
biases in the training data, in particular missing viewpoints
of a face, affect the generalization of DCNNs to unseen data
at test time.
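Since every synthetic test image carries the yaw angle it was rendered
with, this decomposition is straightforward to compute. A minimal sketch,
assuming test results are given as (predicted identity, true identity, yaw)
records:

    from collections import defaultdict

    def recognition_rate_by_pose(records, bin_width):
        """Decompose the total recognition rate (TRR) along the yaw axis.

        records: iterable of (predicted_id, true_id, yaw_in_radians) tuples.
        Returns a dict mapping each yaw bin to its recognition rate.
        """
        correct, total = defaultdict(int), defaultdict(int)
        for predicted_id, true_id, yaw in records:
            b = round(yaw / bin_width) * bin_width  # snap to the sampling grid
            total[b] += 1
            correct[b] += int(predicted_id == true_id)
        return {b: correct[b] / total[b] for b in total}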
3.1. Experimental Setup
Figure 2 schematically illustrates our experimental
setup. We generate synthetic images of different facial iden-
tities and transform them along the axes of the nuisance
transformations that we want to study (Figure 2 (I)). In
this work we focus on studying the effects of biases in the
face pose only. We simulate strong background variations,
which are common in real world data, by sampling random
textures from our empirical background model. All other
nuisance parameters are fixed. We illustrate samples of the
face image generator with the nuisance transformations that
we consider in our experiments in Figure 2. After splitting
the synthetic data into a training and test set, we bias the
training data, e.g. by removing certain face poses (Figure 2
(II)). Subsequently, we train different DCNN architectures
on the biased training data (Figure 2 (III)) and evaluate
how well the DCNNs generalize to the unbiased test data.
The fully parametric nature of the synthetic data allows us
to evaluate the recognition rate as a function of the biased
nuisance transformation (Figure 2 (IV)).

Figure 2: Experimental setup for our empirical analysis of the effect of biased training data on the generalization ability of
different DCNN architectures. (I) We generate synthetic identities with a 3D Morphable Face Model and render them in
different face poses. We simulate background variation by overlaying the faces on different textures. (II) We bias the training
data by removing certain viewpoints from the training set. (III) We train common DCNN architectures on the biased training
data. (IV) The annotation of the test data makes it possible to analyze the recognition rate as a function of the face pose. It
provides fine-grained information about the generalization ability of the different DCNN architectures.
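In code, biasing the training data (step II) amounts to filtering the
generated samples by their annotated yaw angle. A minimal sketch, assuming
each sample carries the label dictionary produced by the generator sketch
in Section 2:

    def bias_training_set(samples, yaw_min, yaw_max):
        """Remove all views outside [yaw_min, yaw_max] (step II in Figure 2)."""
        return [(image, labels) for image, labels in samples
                if yaw_min <= labels["yaw"] <= yaw_max]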
In our experiments, we focus on comparing DCNNs with
a significantly diverging performance at face recognition
(AlexNet and VGG-16), as our methodology makes it pos-
sible to study why exactly one model performs better than
the other. We test these networks at the task of face classifi-
cation. Thus, the task is to recognize a face from an image,
for which the identity is known at training time. Another
common way of performing face recognition is to use the
neural representation of the penultimate layer and to per-
form recognition via nearest neighbor in this feature space
[19]. However, we focus on diagnosing the performance of
DCNNs on the task that they were explicitly optimized on.
Parameter Settings. The size of the images is set to
227 × 227 pixels. We train the DCNNs with stochastic gra-
dient descent (SGD) and backpropagation with the Caffe
deep learning framework [12] via the Nvidia DIGITS train-
ing system. Every DCNN is trained from scratch for 30
epochs with a base learning rate of l = 0.001 which is mul-
tiplied every 10 epochs by γ = 0.1. We use L2 regulariza-
tion with a weight regularization parameter of λ = l/100. If
not stated otherwise, the data is uniformly sampled across
the pose and illumination axes in the specified ranges. The
training data consists of 30 different identities, which we
obtain by randomly sampling the shape and appearance pa-
rameter of the 3DMM. The images in the test set always
reflect an unbiased sampling of the nuisance transformation
that we want to study. For the yaw pose, we sample the
parameter space at intervals of π/32 radians and for the
direction of light at intervals of π/16 radians. Each face
image is overlaid on 50
different background textures in the training as well as in
the test set.
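The resulting learning rate schedule and yaw sampling grid can be written
out explicitly. In the sketch below, the grid extent of [-π/2, π/2] is an
assumption that matches the pose ranges used in this section:

    import math

    base_lr, gamma, step_epochs, epochs = 0.001, 0.1, 10, 30
    weight_decay = base_lr / 100  # lambda = l / 100 = 1e-5

    # Step schedule: the learning rate is multiplied by gamma every 10 epochs,
    # giving 1e-3 (epochs 0-9), 1e-4 (epochs 10-19) and 1e-5 (epochs 20-29).
    schedule = [base_lr * gamma ** (epoch // step_epochs) for epoch in range(epochs)]

    # Yaw pose sampled at intervals of pi/32 radians (extent is an assumption).
    yaw_grid = [i * math.pi / 32 for i in range(-16, 17)]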
3.2. Common bias over all facial identities
In this section, we limit the range of nuisance transfor-
mations in the training data and analyze if DCNNs can gen-
eralize to the unobserved nuisance transformations. We ap-
ply the same bias to all identities in the training set (see
example in Figure 5a).
EXP-1: Bias in the range of the yaw pose. In the fol-
lowing experiments, we limit the range of the yaw pose in
the training data. The light direction is fixed to be frontal.
Figure 3a illustrates the recognition performance as a func-
tion of the yaw pose, when faces in the training set are
restricted to a yaw pose range of [-45°, 45°]. Both DC-
NNs achieve high recognition rates for the observed yaw
poses. However, the recognition performance drops signif-
icantly when faces are outside of the observed pose range.
The same generalization pattern can be observed when re-
stricting the faces at training time to a yaw pose range of
[-90°, 0°] (Figure 3b). In both experiments, the VGG-16
network achieves higher overall recognition rates, because
it generalizes better to larger unseen yaw poses.
Figure 3: Effect of restricting the range of yaw poses
at training time. (a) Yaw pose restricted to the range
[-45°, 45°]. AlexNet TRR: 77.6%; VGG-16 TRR: 85.9%.
(b) Yaw pose restricted to the range [-90°, 0°]. AlexNet
TRR: 81.8%; VGG-16 TRR: 86.9%. In both setups the
DCNNs cannot recognize faces well from previously un-
observed views. VGG-16 achieves a higher TRR due to the
better generalization to large unseen yaw poses.

EXP-2: Sparse sampling of the yaw pose. In Fig-
ure 4 we illustrate the effect of sampling the training data
more sparsely along the axis of the yaw pose. We first bias
the training set to yaw poses of -45° and 45°. VGG-16
achieves a TRR of 70.5% at test time, whereas AlexNet
only achieves 51.8%. Figure 4a illustrates how these TRRs
decompose as a function of the yaw pose. VGG-16 achieves
consistently higher recognition rates across all poses. Most
significantly, it is more than twice as good as AlexNet at
recognizing frontal faces. If we add frontal faces at train-
ing time (Figure 4b) VGG-16 achieves a TRR of 81.9%,
whereas AlexNet achieves 69.3%. Remarkably, VGG-
16 is now able to recognize all faces correctly across the
full range of [-45°, 45°], whereas the recognition rates
of AlexNet still drop significantly for poses in between
[-45°, 0°] and [0°, 45°]. Thus, the architecture of VGG-16
enables the DCNN to generalize well from only a few well-
distributed example views to other unseen views, although
it has more parameters than AlexNet.
Figure 4: Effect of sparsely sampling the yaw pose of faces
at training time. (a) Yaw pose sampled at -45° and 45°
(AlexNet TRR: 51.8%; VGG-16 TRR: 70.5%); VGG-16
generalizes much better to frontal poses than AlexNet. (b)
Yaw pose sampled at -45°, 0° and 45° (AlexNet TRR:
69.3%; VGG-16 TRR: 81.9%); VGG-16 generalizes per-
fectly across the full range [-45°, 45°], whereas AlexNet
still cannot generalize in between the sampled poses.

3.3. Disentanglement bias across facial identities

In the previous section, we have observed that DCNNs
generalize well as soon as a nuisance transformation is suf-
ficiently represented for each identity in the training. When
this was not the case, the generalization performance de-
creased significantly. In this section, we study if DCNNs
are capable of generalizing when the nuisance transformation is
densely reflected in the training data across multiple identi-
ties. In particular, each face identity in the training is varied
in a certain interval of the yaw pose. However, across all
identities the full yaw pose variation is reflected. In Fig-
ure 5b we schematically illustrate how this setup compares
to the one from the previous Section 3.2 (Figure 5a). We
call this type of bias disentanglement bias, since if DCNNs
are capable of disentangling the image variation induced by
the yaw pose from the face identity, then they would be able
to generalize well.
EXP-3: Disentanglement of pose variation. In this ex-
periment, half of the identities in the training set vary in the
yaw pose range of [-90°, 0°]. We refer to those identities
as the set Left-identities. The other half of the faces varies
in the range [0°, 90°] (Right-identities, Figure 5b).

Figure 5: Different types of biases illustrated on the exam-
ple of yaw pose. Faces with red background are part of the
training set. (a) The same bias is applied to all the identities
in the training set. Thus, the pose variation space is only
partially observed. We use this setup in Section 3.2. (b)
For each half of the identities an alternating half of the pose
transformation is applied. Thus, the full pose transforma-
tion space is reflected in the data (Section 3.3).

Figure 6
illustrates the recognition performance of DCNNs trained
on the full training set. We evaluate the Left-identities and
Right-identities separately (Figure 6a & Figure 6b). We ob-
serve that the DCNNs improve only slightly compared to
the setup where the yaw pose range is restricted to [-90°, 0°]
for all identities (dotted curves). Thus, both DCNNs cannot
benefit from the additional information in the training set.
We conclude that this phenomenon occurs because they are
not able to disentangle the image variation induced by the
pose variation and the identity change.
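For reference, the EXP-3 training-set construction can be sketched as
follows; that each label dictionary records the sample's identity under the
key "identity" is an assumption made for illustration:

    import math

    def disentanglement_bias(samples, identities):
        """Half the identities keep yaw in [-90°, 0°] (Left-identities),
        the other half in [0°, 90°] (Right-identities)."""
        left = set(identities[: len(identities) // 2])
        biased = []
        for image, labels in samples:
            if labels["identity"] in left:
                lo, hi = -math.pi / 2, 0.0
            else:
                lo, hi = 0.0, math.pi / 2
            if lo <= labels["yaw"] <= hi:
                biased.append((image, labels))
        return biased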
3.4. Discussion - Analysis with Synthetic Data
Our experiments in this section demonstrate that the full
control over the image variation makes it possible to decom-
pose the recognition score as a function of nuisance trans-
formations. This enabled us to systematically analyze and
compare DCNNs at the task of face recognition. In our ex-
periments we observed the following phenomena:
Deeper networks generalize better to unseen head
poses. A major reason why VGG-16 outperforms AlexNet
at face recognition is that it can generalize better to faces in
previously unseen face poses (Section 3.2).
Deep networks cannot disentangle face pose from fa-
cial identity. A major limitation of the analyzed DCNN ar-
chitectures is that they have severe difficulties generalizing
when facial identities do not share the same pose variation
(Section 3.3). Thus, deep networks cannot cleanly disentangle
the image variation caused by changes in the face pose from
the one induced by changes in the facial identity.
Figure 6: Testing disentanglement ability of DCNNs. Dot-
ted lines: DCNNs trained on a biased yaw pose (illustrated
in Figure 5a). Solid lines: Disentanglement setup (illus-
trated in Figure 5b). (a) Left-Identities with biased yaw pose
of [-90°, 0°]. (b) Right-Identities with biased yaw pose of
[0°, 90°]. DCNNs cannot make use of the additional infor-
mation about the pose transformation which is present in
the data in the disentanglement setup.
4. Reducing the Damage of Dataset Bias
In this section, we study the impact on the generalization
performance when using large-scale synthetic data for pre-
training of deep face recognition systems.
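At a high level, this is a standard two-stage scheme; the sketch below
states it explicitly, with train standing in for any supervised training
loop and the data arguments as hypothetical placeholders:

    def pretrain_then_finetune(model, synthetic_data, real_data, train):
        # Stage 1: pre-train on large-scale synthetic data that is highly
        # variable in face pose and in the number of facial identities.
        train(model, synthetic_data)
        # Stage 2: fine-tune the same weights on the (possibly biased and
        # much smaller) real-world dataset.
        train(model, real_data)
        return model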
4.1. Experimental Setup
Our face recognition experiments are based on the pub-
licly available OpenFace framework [1]. For face detection
and alignment we use a publicly available multi-task CNN²
[26]. In case the face detection fails, we use the face boxes
as defined in the individual datasets³. We train the FaceNet-
NN4 architecture that was originally proposed by Schroff et
al. [21] with the vanilla setting, as provided in the OpenFace
framework. The aligned images are scaled to 96×96 pixels.

² https://github.com/kpzhang93/MTCNN_face_detection_alignment
³ For LFW and IJB-A these face boxes are provided in the dataset; for
Multi-PIE we use the annotations provided in [6].
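The detection-and-alignment step, including its fallback, can be
summarized as follows. This is a minimal sketch rather than the OpenFace
implementation; detect and align stand in for the MTCNN detector and the
alignment routine:

    def preprocess(image, dataset_face_box, detect, align, size=96):
        """Detect, align, and crop one face for FaceNet-NN4 training.

        detect: face detector, e.g. an MTCNN wrapper returning a box or None.
        align:  alignment function (image, box, size) -> cropped face image.
        """
        box = detect(image)
        if box is None:
            # Fall back to the face box shipped with the dataset (LFW, IJB-A)
            # or to the Multi-PIE annotations of [6].
            box = dataset_face_box
        return align(image, box, size)  # scaled to size x size pixels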

References

V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, 1999.

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.

TL;DR: A system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure offace similarity, and achieves state-of-the-art face recognition performance using only 128-bytes perface.