
Apparent Age Estimation from Face Images Combining General and Children-Specialized Deep Learning Models

Grigory Antipov¹·², Moez Baccouche¹, Sid-Ahmed Berrani¹, Jean-Luc Dugelay²
¹ Orange Labs France Telecom, 4 rue Clos Courtel, 35512 Cesson-Sévigné, France
² Eurecom, 450 route des Chappes, 06410 Biot, France
{grigory.antipov,moez.baccouche,sidahmed.berrani}@orange.com, jean-luc.dugelay@eurecom.fr
Abstract
This work describes our solution in the second edition of the ChaLearn LAP competition on Apparent Age Estimation. Starting from a pretrained version of the VGG-16 convolutional neural network for face recognition, we train it on the huge IMDB-Wiki dataset for biological age estimation and then fine-tune it for apparent age estimation using the relatively small competition dataset. We show that the precise age estimation of children is the cornerstone of the competition. Therefore, we integrate a separate "children" VGG-16 network for apparent age estimation of children between 0 and 12 years old in our final solution. The "children" network is fine-tuned from the "general" one. We employ different age encoding strategies for training "general" and "children" networks: the soft one (label distribution encoding) for the "general" network and the strict one (0/1 classification encoding) for the "children" network. Finally, we highlight the importance of the state-of-the-art face detection and face alignment for the final apparent age estimation. Our resulting solution wins the 1st place in the competition, significantly outperforming the runner-up.
1. Introduction
Historically one of the most challenging topics in facial analysis [13], automatic age estimation from face images has numerous practical applications, such as demographic statistics collection, customer profiling, search optimization in large databases and assistance to biometric systems. There are multiple reasons why automatic age estimation is a very challenging task. Chief among them are the uncontrolled nature of the ageing process, the significant variance among faces within the same age range, and the strong dependence of ageing traits on the individual.
Recently, deep neural networks have significantly boosted many computer vision domains, including unconstrained face recognition [26, 19, 24] and facial gender recognition [2]. However, progress in unconstrained facial age estimation has been much slower, owing to the difficulty of collecting and labelling the large datasets that are essential for training deep networks.
The vast majority of existing age estimation studies deal with the problem of estimating a person's biological age (i.e. the objective age defined as the time elapsed since the person's birth date). However, in 2015, the first ChaLearn Looking at People (LAP) competition on apparent age estimation (i.e. the subjective age estimated from a person's visual appearance) was conducted [6]. The organizers collected a dataset of face images and developed a web service where people could annotate these images with an apparent age. More than 100 teams participated in the competition, and the 5 best approaches were based on deep Convolutional Neural Networks (CNNs).
In 2016, the second edition of the ChaLearn LAP Apparent Age Estimation (AAE) competition was organized [7]. We participated in this competition and won 1st place, outperforming all other participants by a significant margin. Our final solution is mainly inspired by the solution of the previous year's winners [21]. We improve the approach of [21] by using: (1) a combination of a "general" apparent age estimation model with soft age encoding and a "children" model with 0/1 age encoding, and (2) precise face alignment prior to age estimation. In this paper, we detail our winning solution in the ChaLearn LAP AAE competition and motivate the selected design choices.
The rest of the paper is organized as follows. In Section 2, we present related work on biological and apparent age estimation and on existing age encoding strategies. In Section 3, we describe the external image datasets which we have used for training in addition to the competition dataset. In Section 4, we detail our data preprocessing and age estimation approaches. In Section 5, we highlight the importance of certain design choices in our solution by experimenting on the validation dataset of the competition. In Section 6, we present the final results of the competition, and we summarize our contributions and conclusions in Section 7.

Figure 1. Apparent age distribution in the ChaLearn LAP AAE competition datasets (training+validation): (a) 2015, (b) 2016.
2. Related work
2.1. Biological age estimation
As already mentioned in Section 1, existing age estimation studies mainly focus on biological age estimation. There are 2 publicly available datasets which are mostly used in this context: the FG-NET dataset [1] and the MORPH-II dataset [20]. FG-NET contains about 1,000 images, obtained mainly from scanning old photos. MORPH-II is bigger, containing about 55,000 images; it was collected by American law enforcement services.
The most used metric for evaluating systems of automatic biological age estimation is the Mean Absolute Error (MAE). MAE is simply defined as the mean of the absolute differences between the predicted ages $\hat{x}$ and the real (biological) ages $x$: $MAE = \frac{1}{N}\sum_{i=1}^{N} |\hat{x}_i - x_i|$.
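For concreteness, a minimal NumPy sketch of this metric (the function name and the example values are ours, not taken from any evaluation toolkit):

```python
import numpy as np

def mean_absolute_error(predicted_ages, true_ages):
    """MAE between predicted ages and ground-truth (biological) ages."""
    predicted_ages = np.asarray(predicted_ages, dtype=float)
    true_ages = np.asarray(true_ages, dtype=float)
    return float(np.mean(np.abs(predicted_ages - true_ages)))

# Example: predictions off by 2, 0 and 4 years give an MAE of 2.0.
print(mean_absolute_error([25, 40, 61], [23, 40, 65]))  # 2.0
```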
The problem of biological age estimation has been studied for a long time. The very first works (notably [15] (1999)) focused mainly on cranio-facial development theory, using geometrical ratios between different face regions to identify a person's biological age. Age estimation was treated as a classification problem with coarse classes (babies, young adults, adults and seniors). Later studies approached biological age estimation from face images in the conventional computer vision manner: designing feature representations for input images and training regression functions or classifiers on the obtained representations.
In that context, designing features that describe the ageing pattern proved to be of particular importance. For example, in 2007, [9] proposed to model the ageing pattern, defined as the sequence of a particular individual's face images sorted in time order, by constructing a representative subspace: AGES (AGeing pattErn Subspace). The authors obtained MAEs of 6.8 and 8.8 on FG-NET and MORPH-II, respectively. In 2009, [12] investigated the possibility of applying Biologically Inspired Features (BIF) to age estimation and proposed the "STD" operator for encoding the ageing subtlety of faces, obtaining an MAE of 4.8 on the FG-NET dataset. This result was further improved in 2011 by [10], who combined BIF with Kernel Partial Least Squares regression (KPLS) and reached an MAE of 4.2 on FG-NET.
Finally, the recent development of deep learning methods (where the feature design and age estimation stages are combined into a single neural model) has further improved automatic age estimation quality. [29] (2014) is one of the first works to apply CNNs to age estimation: the authors employed several shallow multiscale CNNs on different face regions and obtained an MAE of 3.6 on the MORPH-II dataset. The more recent work of [28] (2015) is also based on CNNs; its authors proposed a ranking encoding for age and gender and reported the state-of-the-art MAE of 3.5 on MORPH-II.
2.2. Apparent age estimation
Although the two are strongly correlated, a person's apparent age can be very different from his or her biological age [6]. The first edition of the ChaLearn LAP AAE competition [6] boosted research in apparent age estimation by making public the first dataset with apparent age annotations, containing 4691 images. In the second edition of the competition [7], this dataset has been extended to 7591 images (4113 images for training, 1500 for validation and 1978 for test). Not only has the number of images increased, but the age distribution has also changed with respect to the first edition of the competition (see Figure 1). In particular, the percentage of children images has significantly increased in the second edition.

Each image of the competition dataset is annotated with a mean age µ and a corresponding standard deviation σ (these statistics are calculated based on at least 10 human votes per image). The metric selected by the competition organizers to evaluate apparent age estimation systems is quite different from the MAE used for biological age estimation. The competition metric ε is defined as the size of the tail of the normal distribution with mean µ and standard deviation σ with respect to the predicted value $\hat{x}$: $\epsilon = 1 - e^{-\frac{(\hat{x}-\mu)^2}{2\sigma^2}}$. Therefore, apparent age estimation errors on examples with a small standard deviation (i.e. on examples on which human votes are close to each other) are penalized more strongly than the same errors on examples with a high standard deviation (i.e. on examples on which human votes disagree with each other).
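To make the behaviour of this metric concrete, here is a minimal NumPy sketch of the per-image ε-score (the function name and the example values are ours); the score reported in the competition is the mean of ε over all test images.

```python
import numpy as np

def apparent_age_error(pred, mu, sigma):
    """Per-image competition error: 1 - exp(-(pred - mu)^2 / (2 * sigma^2)).

    pred  -- predicted apparent age
    mu    -- mean of the human votes for the image
    sigma -- standard deviation of the human votes
    """
    return 1.0 - np.exp(-((pred - mu) ** 2) / (2.0 * sigma ** 2))

# The same absolute error of 2 years is penalized much more when annotators agree:
print(apparent_age_error(pred=10.0, mu=8.0, sigma=1.0))   # ~0.86 (small sigma, typical for children)
print(apparent_age_error(pred=32.0, mu=30.0, sigma=5.0))  # ~0.08 (large sigma, typical for adults)
```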
Below, we present the 3 winning entries of the first edition of the competition. All of them are based on CNNs.
[21] are the winners of the first edition of the competition. Their approach is based on pretraining the VGG-16 CNN [23] on the ImageNet dataset [22], training this network on the IMDB-Wiki dataset (collected and made public by the authors) for the biological age estimation task and, finally, fine-tuning it for the apparent age estimation task on the competition data. The authors trained their CNN for classification with 101 classes (ages between 0 and 100 years old) and used the expected value over the 101 output neurons as the age estimate at the test phase. The resulting ε is 0.2650. [16] are the runners-up of the competition. The authors used the GoogLeNet CNN [25] as their base model: they pretrained it for the face recognition task on the CASIA WebFace dataset [30], then trained it on the CACD [4], WebFaceAge [18] and MORPH-II datasets for the biological age estimation task, and finally fine-tuned it on the competition data for the apparent age estimation task. They combined CNNs trained for age regression and for age classification with distributed labelling. As a result, they obtained an ε of 0.2707. The third result in the competition was achieved by [32]. Their approach is very similar to that of [16]: pretraining of the GoogLeNet CNN on the CASIA WebFace dataset, training for biological age estimation on publicly available age datasets, and final fine-tuning for apparent age estimation on the competition data. However, the particularity of the solution of [32] is the use of a cascade approach for age classification: first a coarse classification into one of 10 age groups and then a fine-grained intra-group regression. The final result of [32] is an ε of 0.2948.
Summarizing the approaches of the 3 winners of the first edition of the ChaLearn LAP AAE competition, the following common strategies can be highlighted:
1. All 3 winners use deep CNN architectures (either VGG-16 or GoogLeNet) pretrained on large image datasets (either ImageNet or CASIA-WebFace).
2. All 3 winners employ the same pipeline for training their CNN: firstly, training on large datasets for biological age estimation and secondly, fine-tuning on the competition dataset for apparent age estimation.
Given the success of these 2 strategies in the first edition of the competition, we also follow them in our solution for the second edition.
2.3. Age label encodings
In the literature, there are 3 commonly used age label encodings for automatic age estimation systems. These encodings are presented below:
1. Real number encoding. This is a pure regression approach. In real number encoding, the age labels are encoded just as real numbers.
2. 0/1 classification encoding. This is a pure classification approach. In 0/1 classification encoding, we predefine a certain number of classes (for example, 100 classes for ages between 0 and 99 years old), and the age labels are encoded as binary vectors containing a single non-zero value corresponding to the class to which a given example belongs.
3. Label distribution encoding. Label distribution encoding [8] can be seen as the soft version of 0/1 classification encoding. As in 0/1 classification encoding, we predefine a certain number of classes, but the age labels are encoded not with binary vectors but with real-valued vectors representing the probability distributions of belonging to the corresponding classes. More precisely, assuming that we encode an age $x \in \mathbb{R}$ with a label vector $L$ of length $N$ ($N$ classes), the label vector is defined as $L_i = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(i-x)^2}{2\sigma^2}},\ i = 1, \ldots, N$, where $\sigma$ is a predefined parameter. In other words, in order to encode an age $x$, we fit a normal distribution with an expected value of $x$ and a standard deviation of $\sigma$. The advantage of label distribution encoding over 0/1 classification encoding is that, apart from storing the class to which a given example belongs, a label vector also stores information about the neighbouring classes (i.e. neighbouring ages). This additional information can be useful during training. In particular, label distribution encoding provides a machine learning model with the information that, for example, it is better to predict 20 years old instead of 21 years old than 100 years old instead of 21 years old. This information is missing in 0/1 classification encoding. Finally, it is worth noting that 0/1 classification encoding is the extreme case of label distribution encoding when $\sigma \to 0$ (both encodings are illustrated in the sketch after this list).
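As an illustration of the two encodings used in our solution, the sketch below builds both label vectors for a single age with NumPy. The number of classes and the value of σ are placeholders (the text above does not fix σ), and classes are indexed directly by age here, whereas the formula above indexes them from 1 to N.

```python
import numpy as np

def zero_one_encoding(age, n_classes=100):
    """0/1 classification encoding: a one-hot vector over the age classes."""
    label = np.zeros(n_classes)
    label[int(round(age))] = 1.0
    return label

def label_distribution_encoding(age, n_classes=100, sigma=2.0):
    """Label distribution encoding: a Gaussian centred on the true age,
    L_i = 1/(sigma*sqrt(2*pi)) * exp(-(i - age)^2 / (2*sigma^2)).
    sigma is a placeholder value; as sigma -> 0 this tends to the 0/1 encoding."""
    classes = np.arange(n_classes)
    return np.exp(-((classes - age) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

hard = zero_one_encoding(21)            # only index 21 is non-zero
soft = label_distribution_encoding(21)  # indices around 21 also get probability mass
```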
3. External data
In this section, we present the datasets which we have
used for biological age estimation training in our work.
IMDB-Wiki Inspired by the success of the 1st place winners of the first edition of the ChaLearn LAP AAE competition [21], we have decided to use the IMDB-Wiki dataset, which they collected and used for biological age estimation training. The authors made this dataset public in 2016.
The dataset consists of 523,051 images collected from 2 sources: IMDb (the Internet Movie Database, www.imdb.com; 460,723 images) and Wikipedia (www.wikipedia.org; 62,328 images). The distribution of ages in the IMDB-Wiki dataset is presented in Figure 2.
Figure 2. Biological age distribution in the IMDB-Wiki dataset.
Because each image contains a celebrity (whose identity, gender and birth date are known) and a timestamp, the authors managed to automatically annotate all images in the IMDB-Wiki dataset with biological ages.
However, for the majority of images from the IMDB-Wiki dataset, the provided annotations are not directly usable. The problem comes from the fact that many images contain more than one person. Even assuming that all faces in an image are detected automatically, it is not obvious how to automatically select the face to which the given annotation corresponds. To circumvent this problem, we have pursued the following 2 approaches:
1. We have used those images for which the “Head Hunter” face detector [17] has detected only one face (a similar approach was employed by [21]). In this case, we can be sure that the detected face corresponds to the provided age annotation. This approach has resulted in 182,019 images.
2. We have developed a simple web interface for the manual annotation of the remaining images. Given an input image and a corresponding annotation (the person's identity, gender and age), a user simply has to select the face in the image to which the given annotation corresponds. By crowdsourcing the annotation process via the described interface, we have managed to annotate 68,548 images (26 people participated in the annotation campaign, which lasted 4 days).
Thus, in total, 250,367 images from the IMDB-Wiki dataset have been used in our experiments. In order to avoid ambiguity with the whole IMDB-Wiki dataset, below we refer to this subset of 250,367 images as the “cleaned” IMDB-Wiki dataset.
Collected dataset with images of children As can be seen in Figure 2, there are very few images of pre-teenage children (i.e. 12 years old and younger) in the IMDB-Wiki dataset. Therefore, an age estimation model trained on this dataset is likely to perform poorly on children. This was not a major problem in the first edition of the ChaLearn LAP AAE competition, given that there were very few children in the competition dataset (see Figure 1(a)). However, it becomes very important in the second edition of the competition, where children account for almost 10% of all images (see Figure 1(b)).
It should also be noted that, according to the competition dataset annotations, the average standard deviation of human votes for images of children (between 0 and 12 years old) is about 1, while the average standard deviation for all other images is about 5. Thus, according to the competition data, humans estimate a child's age almost 5 times more precisely than an adult's. As mentioned in Section 2.2, the competition metric ε is defined so that the same absolute age estimation error is penalized more for images with a small standard deviation of human votes. The above observation shows the importance of predicting the ages of children with very high precision, and hence the need for training images of children with precise biological age annotations. Therefore, we have manually collected a private dataset of 5723 children images in the 0-12 age category using Internet search engines.
4. Proposed solution
The ChaLearn LAP AAE competition is an “end-to-end” competition, meaning that, given raw real-life images as input (from Wikipedia, social networks, etc.), participants have to output the corresponding apparent age estimations.

The required image preprocessing (e.g. face detection and face alignment) is considered part of the challenge. Therefore, our solution is split into 2 logical steps: image preprocessing and apparent age estimation itself. In this section, we present these steps one by one.
4.1. Image preprocessing
Face detection We have used the open source “Head Hunter” face detector [17]; in particular, we have employed the fast implementation by [19]. In order to detect faces regardless of image orientation, we rotate each input image at all angles in the range [-90°, 90°] with a step of 5°. We then pass to the face alignment step the rotated version of the input image which gives the strongest output of the face detector. If no face is detected in any of the rotated versions of the input image, the initial image is upscaled and the algorithm is repeated until a face is detected; 2 upscaling operations have been enough to detect at least one face in every image of the competition dataset. As recommended in [21] (and confirmed by our own experiments), we extend the face area detected by the “Head Hunter” face detector, taking an extra 40% of its width to the left and to the right and 40% of its height above and below.
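The rotation-and-upscaling scan described above can be summarized by the following sketch. It is only an outline under our own assumptions: detect_faces stands in for the “Head Hunter” detector (assumed to return a list of (box, score) pairs), and the rotation keeps the original canvas size for brevity, whereas a full implementation would enlarge the canvas to avoid cropping the face.

```python
import cv2
import numpy as np

def rotate(image, angle_deg):
    """Rotate an image around its center (borders filled with black)."""
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    return cv2.warpAffine(image, m, (w, h))

def detect_best_face(image, detect_faces, max_upscales=2):
    """Scan rotations in [-90, 90] deg (step 5) and keep the strongest detection.

    detect_faces is a placeholder for a face detector such as "Head Hunter":
    it is assumed to return a list of (box, score) tuples for an image.
    If nothing is found at any rotation, the image is upscaled and the scan repeats.
    """
    for _ in range(max_upscales + 1):
        best = None
        for angle in range(-90, 91, 5):
            rotated = rotate(image, angle)
            for box, score in detect_faces(rotated):
                if best is None or score > best[2]:
                    best = (rotated, box, score)
        if best is not None:
            return best  # rotated image, detected box, detector score
        image = cv2.resize(image, None, fx=2.0, fy=2.0)  # upscale and retry
    return None

def expand_box(box, img_w, img_h, margin=0.4):
    """Extend a detected (x, y, w, h) box by 40% of its size on each side."""
    x, y, w, h = box
    x0 = max(0, int(x - margin * w)); y0 = max(0, int(y - margin * h))
    x1 = min(img_w, int(x + w + margin * w)); y1 = min(img_h, int(y + h + margin * h))
    return x0, y0, x1, y1
```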
Face alignment We have integrated the state-of-the-art face alignment solution of [27] in our image preprocessing pipeline. The solution of [27] is based on multi-view facial landmark detection. There are 5 landmark detection models: a frontal model, 2 profile models and 2 half-profile models, each tuned to work on the corresponding facial pose. Face alignment follows face detection and requires running all 5 landmark models on the detected face. Each model reports a confidence score which indicates how well the corresponding landmarks are detected in the given face. We then select the model with the highest confidence score and perform an affine transformation from the detected landmarks to the predefined optimal positions of these landmarks for the detected facial pose.
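A condensed sketch of this selection-and-warping step is given below, using OpenCV for the affine estimation. The landmark models and the per-pose landmark templates are placeholders standing in for the five pose-specific detectors of [27]; each model is assumed to return (landmarks, confidence) for a face crop, and the output size is an assumption.

```python
import cv2
import numpy as np

def align_face(face_img, landmark_models, templates, out_size=(224, 224)):
    """Pick the most confident pose-specific landmark model and warp the face
    so that its landmarks land on the predefined template positions."""
    # Run every pose-specific model; each is assumed to return (landmarks, confidence).
    results = [(name, model(face_img)) for name, model in landmark_models.items()]
    best_name, (landmarks, _conf) = max(results, key=lambda r: r[1][1])

    # Affine transform from the detected landmarks to the template for that pose.
    src = np.asarray(landmarks, dtype=np.float32)
    dst = np.asarray(templates[best_name], dtype=np.float32)
    matrix, _inliers = cv2.estimateAffine2D(src, dst)
    return cv2.warpAffine(face_img, matrix, out_size)
```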
We have also tried to use an older commercial solution for face detection and face alignment, based on [31] and [3] respectively. Our experiments presented in Section 5 compare the 2 approaches and clearly demonstrate the merits of the open-source state-of-the-art solutions.
4.2. Apparent age estimation
Following the winning solution from the previous edition of the ChaLearn LAP AAE competition [21], we also employ a 2-step strategy of CNN training for apparent age estimation: firstly, we train our CNNs for biological age estimation on external datasets, and secondly, we fine-tune them for apparent age estimation on the competition data.
However, there are several key novelties in our approach
with respect to the approach of [21]. We highlight these
novelties below:
1. As mentioned in Section 3, the precision of apparent age estimation on children images has a very strong influence on the final score in the second edition of the ChaLearn LAP AAE competition. Therefore, we have trained a separate model for estimating the apparent ages of children (0-12 years old) using the external data described in Section 3. The gain from integrating this separate CNN in the final solution is quantitatively evaluated in Section 5.
2. We combine the 2 age label encoding strategies presented in Section 2.3. On the one hand, we employ label distribution age encoding for training the “general” CNNs, which allows our neural networks to better capture the concept of an apparent age (which is a range of values rather than a precise real value). On the other hand, we employ 0/1 classification encoding for the “children” CNNs because, for children, the possible range of apparent age values is very narrow and it is therefore meaningful to encode each year as a completely separate class. (Here and below, we refer to the CNNs which estimate all ages between 0 and 99 years old as the “general” ones, and to the CNNs which estimate only the ages of children between 0 and 12 years old as the “children” ones.) Our experiments have shown that using this combined age label encoding strategy is advantageous with respect to using only distributed age encoding or only 0/1 classification encoding for both “general” and “children” CNNs; an illustrative sketch of how the two kinds of outputs can be decoded follows this list.
3. Our experiments in Section 5 demonstrate that the quality of image preprocessing has a very strong impact on the final ε-score. Therefore, we employ the state-of-the-art open source solution from [27] for face alignment in our final approach.
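The exact way the two CNNs are combined at test time is evaluated in Section 5; purely as an illustration of the decoding step, the sketch below shows one plausible scheme under our own assumptions: the expected value over the output probabilities of the “general” network (expected-value decoding as in [21]), the most probable class of the “children” network, and a simple threshold-based switch between the two. The switch rule and all names are assumptions, not necessarily the exact scheme used in our final solution.

```python
import numpy as np

def decode_general(probs):
    """Expected age over the output classes of the "general" CNN
    trained with label distribution encoding (expected-value decoding as in [21])."""
    return float(np.dot(np.arange(len(probs)), probs))

def decode_children(probs):
    """Most probable class (0-12) of the "children" CNN trained with 0/1 encoding."""
    return float(np.argmax(probs))

def estimate_apparent_age(general_probs, children_probs, child_threshold=12):
    """Illustrative fusion (an assumption, not necessarily the paper's exact rule):
    defer to the specialized "children" CNN when the general prediction
    falls inside its 0-12 range."""
    age = decode_general(general_probs)
    if age <= child_threshold:
        return decode_children(children_probs)
    return age
```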
4.2.1 Training pipeline
The complete training pipeline of all apparent age estimation CNNs is presented in Figure 3. Starting with the pretrained VGG-16 CNN from [19], we train a “general” CNN for biological age estimation of all ages between 0 and 99 years old on the “cleaned” IMDB-Wiki dataset, using the label distribution age encoding. From the obtained network, we fine-tune a “children” CNN for biological age estimation of children between 0 and 12 years old; this time, the 0/1 classification age encoding is used. The next step is the fine-tuning of the 2 resulting CNNs (the “general” one and the “children” one) for apparent age estimation. In the case of the “general” CNN, we combine all training and validation images from the competition dataset (5613 images in total)

References
K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
C. Szegedy et al. Going Deeper with Convolutions. CVPR, 2015.
O. Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
Y. Jia et al. Caffe: Convolutional Architecture for Fast Feature Embedding. ACM Multimedia, 2014.