
Apparent Age Estimation from Face Images Combining General and Children-Specialized Deep Learning Models

Grigory Antipov¹·², Moez Baccouche¹, Sid-Ahmed Berrani¹, Jean-Luc Dugelay²
¹ Orange Labs France Telecom, 4 rue Clos Courtel, 35512 Cesson-Sévigné, France
² Eurecom, 450 route des Chappes, 06410 Biot, France
{grigory.antipov,moez.baccouche,sidahmed.berrani}@orange.com, jean-luc.dugelay@eurecom.fr
Abstract
This work describes our solution in the second edition of the ChaLearn LAP competition on Apparent Age Estimation. Starting from a pretrained version of the VGG-16 convolutional neural network for face recognition, we train it on the huge IMDB-Wiki dataset for biological age estimation and then fine-tune it for apparent age estimation using the relatively small competition dataset. We show that the precise age estimation of children is the cornerstone of the competition. Therefore, we integrate a separate "children" VGG-16 network for apparent age estimation of children between 0 and 12 years old in our final solution. The "children" network is fine-tuned from the "general" one. We employ different age encoding strategies for training "general" and "children" networks: the soft one (label distribution encoding) for the "general" network and the strict one (0/1 classification encoding) for the "children" network. Finally, we highlight the importance of the state-of-the-art face detection and face alignment for the final apparent age estimation. Our resulting solution wins the 1st place in the competition, significantly outperforming the runner-up.
1. Introduction
Historically one of the most challenging topics in facial analysis [13], automatic age estimation from face images has numerous practical applications, such as demographic statistics collection, customer profiling, search optimization in large databases and assistance to biometric systems. There are multiple reasons why automatic age estimation is a very challenging task. Chief among them are the uncontrolled nature of the ageing process, the significant variance among faces within the same age range, and the strong dependence of ageing traits on the individual.
Recently, deep neural networks have significantly boosted many computer vision domains, including unconstrained face recognition [26, 19, 24] and facial gender recognition [2]. However, progress in unconstrained facial age estimation has been much slower, owing to the difficulty of collecting and labelling the large datasets that are essential for training deep networks.
The vast majority of existing age estimation studies deal with the problem of estimating a person's biological age (i.e. the objective age defined as the time elapsed since the person's birth date). However, in 2015, the first ChaLearn Looking at People (LAP) competition on apparent age estimation (i.e. the subjective age estimated from a person's visual appearance) was conducted [6]. The organizers collected a dataset of face images and developed a web service where people could annotate these images with an apparent age. More than 100 teams participated in the competition, and the 5 best approaches were based on deep Convolutional Neural Networks (CNNs).
In 2016, the second edition of the ChaLearn LAP Apparent Age Estimation (AAE) competition was organized [7]. We participated in this competition and won 1st place, outperforming all other participants by a significant margin. Our final solution is mainly inspired by the solution of the previous year's winners [21]. We improve the approach of [21] by using: (1) a combination of a "general" apparent age estimation model with soft age encoding and a "children" model with 0/1 age encoding, and (2) precise face alignment prior to age estimation. In this paper, we detail our winning solution in the ChaLearn LAP AAE competition and motivate the selected design choices.
The rest of the paper is organized as follows. In Section 2, we present related work on biological and apparent age estimation and on existing age encoding strategies. In Section 3, we describe the external image datasets which we have used for training in addition to the competition dataset. In Section 4, we detail our data preprocessing and age estimation approaches. In Section 5, we highlight the importance of certain design choices in our solution by experimenting on the validation dataset of the competition. In Section 6, we present the final results of the competition, and we summarize our contributions and conclusions in Section 7.

Figure 1. Apparent age distribution in the ChaLearn LAP AAE competition datasets (training+validation): (a) 2015, (b) 2016.
2. Related work
2.1. Biological age estimation
As already mentioned in Section 1, existing age estimation studies mainly focus on biological age estimation. There are 2 publicly available datasets which are mostly used in this context: the FG-NET dataset [1] and the MORPH-II dataset [20]. FG-NET contains about 1,000 images, obtained mainly from scanning old photos. MORPH-II is bigger, containing about 55,000 images; it was collected by American law enforcement services.
The most used metric for evaluating systems of automatic biological age estimation is the Mean Absolute Error (MAE). MAE is simply defined as the mean of the absolute differences between the predicted ages $\hat{x}$ and the real (biological) ages $x$: $MAE = \frac{1}{N}\sum_{i=1}^{N} |\hat{x}_i - x_i|$.
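For concreteness, a minimal NumPy sketch of this metric (the function name and the example values are ours, not taken from any evaluation toolkit):

```python
import numpy as np

def mean_absolute_error(predicted_ages, true_ages):
    """MAE between predicted ages and ground-truth (biological) ages."""
    predicted_ages = np.asarray(predicted_ages, dtype=float)
    true_ages = np.asarray(true_ages, dtype=float)
    return float(np.mean(np.abs(predicted_ages - true_ages)))

# Example: predictions off by 2, 0 and 4 years give an MAE of 2.0.
print(mean_absolute_error([25, 40, 61], [23, 40, 65]))  # 2.0
```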
The problem of biological age estimation has been studied for a long time. The very first works (notably [15] (1999)) focused mainly on cranio-facial development theory, using geometrical ratios between different face regions to identify a person's biological age. Age estimation was treated as a classification problem with coarse classes (babies, young adults, adults and seniors). Later studies approached biological age estimation from face images in the conventional computer vision manner: designing feature representations for input images and training regression functions or classifiers on the obtained representations.
In that context, designing features that describe the ageing pattern proved to be of particular importance. For example, in 2007, [9] proposed to model the ageing pattern, defined as the sequence of a particular individual's face images sorted in time order, by constructing a representative subspace: AGES (AGeing pattErn Subspace). The authors obtained MAEs of 6.8 and 8.8 on FG-NET and MORPH-II, respectively. In 2009, [12] investigated the possibility of applying Biologically Inspired Features (BIF) to age estimation and proposed the "STD" operator for encoding the ageing subtlety of faces, obtaining an MAE of 4.8 on the FG-NET dataset. This result was further improved in 2011 by [10], who combined BIF with Kernel Partial Least Squares regression (KPLS) and reached an MAE of 4.2 on FG-NET.
Finally, the recent development of deep learning methods (where the feature design and age estimation stages are combined into a single neural model) has further improved automatic age estimation quality. [29] (2014) is one of the first works to apply CNNs to age estimation: the authors employed several shallow multiscale CNNs on different face regions and obtained an MAE of 3.6 on the MORPH-II dataset. The more recent work of [28] (2015) is also based on CNNs; its authors proposed a ranking encoding for age and gender and reported the state-of-the-art MAE of 3.5 on MORPH-II.
2.2. Apparent age estimation
Although the two are strongly correlated, a person's apparent age can be very different from his or her biological age [6]. The first edition of the ChaLearn LAP AAE competition [6] boosted research in apparent age estimation by making public the first dataset with apparent age annotations, containing 4691 images. In the second edition of the competition [7], this dataset has been extended to 7591 images (4113 images for training, 1500 for validation and 1978 for test). Not only has the number of images increased, but the age distribution has also changed with respect to the first edition of the competition (see Figure 1). In particular, the percentage of children images has significantly increased in the second edition.

Each image of the competition dataset is annotated with a mean age µ and a corresponding standard deviation σ (these statistics are calculated based on at least 10 human votes per image). The metric selected by the competition organizers to evaluate apparent age estimation systems is quite different from the MAE used for biological age estimation. The competition metric ε is defined as the size of the tail of the normal distribution with mean µ and standard deviation σ with respect to the predicted value $\hat{x}$: $\epsilon = 1 - e^{-\frac{(\hat{x}-\mu)^2}{2\sigma^2}}$. Therefore, apparent age estimation errors on examples with a small standard deviation (i.e. on examples on which human votes are close to each other) are penalized more strongly than the same errors on examples with a high standard deviation (i.e. on examples on which human votes disagree with each other).
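To make the behaviour of this metric concrete, here is a minimal NumPy sketch of the per-image ε-score (the function name and the example values are ours); the score reported in the competition is the mean of ε over all test images.

```python
import numpy as np

def apparent_age_error(pred, mu, sigma):
    """Per-image competition error: 1 - exp(-(pred - mu)^2 / (2 * sigma^2)).

    pred  -- predicted apparent age
    mu    -- mean of the human votes for the image
    sigma -- standard deviation of the human votes
    """
    return 1.0 - np.exp(-((pred - mu) ** 2) / (2.0 * sigma ** 2))

# The same absolute error of 2 years is penalized much more when annotators agree:
print(apparent_age_error(pred=10.0, mu=8.0, sigma=1.0))   # ~0.86 (small sigma, typical for children)
print(apparent_age_error(pred=32.0, mu=30.0, sigma=5.0))  # ~0.08 (large sigma, typical for adults)
```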
Below, we present the 3 winning entries of the first edition of the competition. All of them are based on CNNs.
[21] are the winners of the first edition of the competition. Their approach is based on pretraining the VGG-16 CNN [23] on the ImageNet dataset [22], training this network on the IMDB-Wiki dataset (collected and made public by the authors) for the biological age estimation task and, finally, fine-tuning it for the apparent age estimation task on the competition data. The authors trained their CNN for classification with 101 classes (ages between 0 and 100 years old) and used the expected value over the 101 output neurons as the age estimate at the test phase. The resulting ε is 0.2650. [16] are the runners-up of the competition. The authors used the GoogLeNet CNN [25] as their base model: they pretrained it for the face recognition task on the CASIA WebFace dataset [30], then trained it on the CACD [4], WebFaceAge [18] and MORPH-II datasets for the biological age estimation task, and finally fine-tuned it on the competition data for the apparent age estimation task. They combined CNNs trained for age regression and for age classification with distributed labelling. As a result, they obtained an ε of 0.2707. The third result in the competition was achieved by [32]. Their approach is very similar to that of [16]: pretraining of the GoogLeNet CNN on the CASIA WebFace dataset, training for biological age estimation on publicly available age datasets, and final fine-tuning for apparent age estimation on the competition data. However, the particularity of the solution of [32] is the use of a cascade approach for age classification: first a coarse classification into one of 10 age groups and then a fine-grained intra-group regression. The final result of [32] is an ε of 0.2948.
Summarizing the approaches of the 3 winners of the first edition of the ChaLearn LAP AAE competition, the following common strategies can be highlighted:
1. All 3 winners use deep CNN architectures (either VGG-16 or GoogLeNet) pretrained on large image datasets (either ImageNet or CASIA-WebFace).
2. All 3 winners employ the same pipeline for training their CNN: firstly, training on large datasets for biological age estimation and secondly, fine-tuning on the competition dataset for apparent age estimation.
Given the success of these 2 strategies in the first edition of the competition, we also follow them in our solution for the second edition.
2.3. Age label encodings
In the literature, there are 3 commonly used age label encodings for automatic age estimation systems. These encodings are presented below:
1. Real number encoding. This is a pure regression approach. In real number encoding, the age labels are encoded just as real numbers.
2. 0/1 classification encoding. This is a pure classification approach. In 0/1 classification encoding, we predefine a certain number of classes (for example, 100 classes for ages between 0 and 99 years old), and the age labels are encoded as binary vectors containing a single non-zero value corresponding to the class to which a given example belongs.
3. Label distribution encoding. Label distribution encoding [8] can be seen as the soft version of 0/1 classification encoding. As in 0/1 classification encoding, we predefine a certain number of classes, but the age labels are encoded not with binary vectors but with real-valued vectors representing the probability distributions of belonging to the corresponding classes. More precisely, assuming that we encode an age $x \in \mathbb{R}$ with a label vector $L$ of length $N$ ($N$ classes), the label vector is defined as $L_i = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(i-x)^2}{2\sigma^2}},\ i = 1, \ldots, N$, where $\sigma$ is a predefined parameter. In other words, in order to encode an age $x$, we fit a normal distribution with an expected value of $x$ and a standard deviation of $\sigma$. The advantage of label distribution encoding over 0/1 classification encoding is that, apart from storing the class to which a given example belongs, a label vector also stores information about the neighbouring classes (i.e. neighbouring ages). This additional information can be useful during training. In particular, label distribution encoding provides a machine learning model with the information that, for example, it is better to predict 20 years old instead of 21 years old than 100 years old instead of 21 years old. This information is missing in 0/1 classification encoding. Finally, it is worth noting that 0/1 classification encoding is the extreme case of label distribution encoding when $\sigma \to 0$ (both encodings are illustrated in the sketch after this list).
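As an illustration of the two encodings used in our solution, the sketch below builds both label vectors for a single age with NumPy. The number of classes and the value of σ are placeholders (the text above does not fix σ), and classes are indexed directly by age here, whereas the formula above indexes them from 1 to N.

```python
import numpy as np

def zero_one_encoding(age, n_classes=100):
    """0/1 classification encoding: a one-hot vector over the age classes."""
    label = np.zeros(n_classes)
    label[int(round(age))] = 1.0
    return label

def label_distribution_encoding(age, n_classes=100, sigma=2.0):
    """Label distribution encoding: a Gaussian centred on the true age,
    L_i = 1/(sigma*sqrt(2*pi)) * exp(-(i - age)^2 / (2*sigma^2)).
    sigma is a placeholder value; as sigma -> 0 this tends to the 0/1 encoding."""
    classes = np.arange(n_classes)
    return np.exp(-((classes - age) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

hard = zero_one_encoding(21)            # only index 21 is non-zero
soft = label_distribution_encoding(21)  # indices around 21 also get probability mass
```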
3. External data
In this section, we present the datasets which we have
used for biological age estimation training in our work.
IMDB-Wiki Inspired by the success of the 1st place winners of the first edition of the ChaLearn LAP AAE competition [21], we have decided to use the IMDB-Wiki dataset, which they collected and used for biological age estimation training. The authors made this dataset public in 2016.
The dataset consists of 523,051 images collected from 2 sources: IMDb (the Internet Movie Database, www.imdb.com; 460,723 images) and Wikipedia (www.wikipedia.org; 62,328 images). The distribution of ages in the IMDB-Wiki dataset is presented in Figure 2.
Figure 2. Biological age distribution in the IMDB-Wiki dataset.
Because each image contains a celebrity (whose identity, gender and birth date are known) and a timestamp, the authors managed to automatically annotate all images in the IMDB-Wiki dataset with biological ages.
However, for the majority of images from the IMDB-Wiki dataset, the provided annotations are not directly usable. The problem comes from the fact that many images contain more than one person. Even assuming that all faces in an image are detected automatically, it is not obvious how to automatically select the face to which the given annotation corresponds. To circumvent this problem, we have pursued the following 2 approaches:
1. We have used those images for which the “Head Hunter” face detector [17] has detected only one face (a similar approach was employed by [21]). In this case, we can be sure that the detected face corresponds to the provided age annotation. This approach has resulted in 182,019 images.
2. We have developed a simple web interface for the manual annotation of the remaining images. Given an input image and a corresponding annotation (the person's identity, gender and age), a user simply has to select the face in the image to which the given annotation corresponds. By crowdsourcing the annotation process via the described interface, we have managed to annotate 68,548 images (26 people participated in the annotation campaign, which lasted 4 days).
Thus, in total, 250,367 images from the IMDB-Wiki dataset have been used in our experiments. In order to avoid ambiguity with the whole IMDB-Wiki dataset, below we refer to this subset of 250,367 images as the “cleaned” IMDB-Wiki dataset.
Collected dataset with images of children As can be seen in Figure 2, there are very few images of pre-teenage children (i.e. 12 years old and younger) in the IMDB-Wiki dataset. Therefore, an age estimation model trained on this dataset is likely to perform poorly on children. This was not a major problem in the first edition of the ChaLearn LAP AAE competition, given that there were very few children in the competition dataset (see Figure 1(a)). However, it becomes very important in the second edition of the competition, where children account for almost 10% of all images (see Figure 1(b)).
It should also be noted that, according to the competition dataset annotations, the average standard deviation of human votes for images of children (between 0 and 12 years old) is about 1, while the average standard deviation for all other images is about 5. Thus, according to the competition data, humans estimate a child's age almost 5 times more precisely than an adult's. As mentioned in Section 2.2, the competition metric ε is defined so that the same absolute age estimation error is penalized more for images with a small standard deviation of human votes. The above observation shows the importance of predicting the ages of children with very high precision, and hence the need for training images of children with precise biological age annotations. Therefore, we have manually collected a private dataset of 5723 children images in the 0-12 age category using Internet search engines.
4. Proposed solution
The ChaLearn LAP AAE competition is an “end-to-end” competition, meaning that, given raw real-life images as input (from Wikipedia, social networks, etc.), participants have to output the corresponding apparent age estimations.

The required image preprocessing (e.g. face detection and face alignment) is considered part of the challenge. Therefore, our solution is split into 2 logical steps: image preprocessing and apparent age estimation itself. In this section, we present these steps one by one.
4.1. Image preprocessing
Face detection We have used the open source “Head Hunter” face detector [17]; in particular, we have employed the fast implementation by [19]. In order to detect faces regardless of image orientation, we rotate each input image at all angles in the range [-90°, 90°] with a step of 5°. We then pass to the face alignment step the rotated version of the input image which gives the strongest output of the face detector. If no face is detected in any of the rotated versions of the input image, the initial image is upscaled and the algorithm is repeated until a face is detected; 2 upscaling operations have been enough to detect at least one face in every image of the competition dataset. As recommended in [21] (and confirmed by our own experiments), we extend the face area detected by the “Head Hunter” face detector, taking an extra 40% of its width to the left and to the right and 40% of its height above and below.
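The rotation-and-upscaling scan described above can be summarized by the following sketch. It is only an outline under our own assumptions: detect_faces stands in for the “Head Hunter” detector (assumed to return a list of (box, score) pairs), and the rotation keeps the original canvas size for brevity, whereas a full implementation would enlarge the canvas to avoid cropping the face.

```python
import cv2
import numpy as np

def rotate(image, angle_deg):
    """Rotate an image around its center (borders filled with black)."""
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    return cv2.warpAffine(image, m, (w, h))

def detect_best_face(image, detect_faces, max_upscales=2):
    """Scan rotations in [-90, 90] deg (step 5) and keep the strongest detection.

    detect_faces is a placeholder for a face detector such as "Head Hunter":
    it is assumed to return a list of (box, score) tuples for an image.
    If nothing is found at any rotation, the image is upscaled and the scan repeats.
    """
    for _ in range(max_upscales + 1):
        best = None
        for angle in range(-90, 91, 5):
            rotated = rotate(image, angle)
            for box, score in detect_faces(rotated):
                if best is None or score > best[2]:
                    best = (rotated, box, score)
        if best is not None:
            return best  # rotated image, detected box, detector score
        image = cv2.resize(image, None, fx=2.0, fy=2.0)  # upscale and retry
    return None

def expand_box(box, img_w, img_h, margin=0.4):
    """Extend a detected (x, y, w, h) box by 40% of its size on each side."""
    x, y, w, h = box
    x0 = max(0, int(x - margin * w)); y0 = max(0, int(y - margin * h))
    x1 = min(img_w, int(x + w + margin * w)); y1 = min(img_h, int(y + h + margin * h))
    return x0, y0, x1, y1
```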
Face alignment We have integrated the state-of-the-art face alignment solution of [27] in our image preprocessing pipeline. The solution of [27] is based on multi-view facial landmark detection. There are 5 landmark detection models: a frontal model, 2 profile models and 2 half-profile models, each tuned to work on the corresponding facial pose. Face alignment follows face detection and requires running all 5 landmark models on the detected face. Each model reports a confidence score which indicates how well the corresponding landmarks are detected in the given face. We then select the model with the highest confidence score and perform an affine transformation from the detected landmarks to the predefined optimal positions of these landmarks for the detected facial pose.
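A condensed sketch of this selection-and-warping step is given below, using OpenCV for the affine estimation. The landmark models and the per-pose landmark templates are placeholders standing in for the five pose-specific detectors of [27]; each model is assumed to return (landmarks, confidence) for a face crop, and the output size is an assumption.

```python
import cv2
import numpy as np

def align_face(face_img, landmark_models, templates, out_size=(224, 224)):
    """Pick the most confident pose-specific landmark model and warp the face
    so that its landmarks land on the predefined template positions."""
    # Run every pose-specific model; each is assumed to return (landmarks, confidence).
    results = [(name, model(face_img)) for name, model in landmark_models.items()]
    best_name, (landmarks, _conf) = max(results, key=lambda r: r[1][1])

    # Affine transform from the detected landmarks to the template for that pose.
    src = np.asarray(landmarks, dtype=np.float32)
    dst = np.asarray(templates[best_name], dtype=np.float32)
    matrix, _inliers = cv2.estimateAffine2D(src, dst)
    return cv2.warpAffine(face_img, matrix, out_size)
```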
We have also tried to use an older commercial solution for face detection and face alignment, based on [31] and [3] respectively. Our experiments presented in Section 5 compare the 2 approaches and clearly demonstrate the merits of the open-source state-of-the-art solutions.
4.2. Apparent age estimation
Following the winning solution from the previous edition of the ChaLearn LAP AAE competition [21], we also employ a 2-step strategy of CNN training for apparent age estimation: firstly, we train our CNNs for biological age estimation on external datasets, and secondly, we fine-tune them for apparent age estimation on the competition data.
However, there are several key novelties in our approach
with respect to the approach of [21]. We highlight these
novelties below:
1. As mentioned in Section 3, the precision of apparent age estimation on children images has a very strong influence on the final score in the second edition of the ChaLearn LAP AAE competition. Therefore, we have trained a separate model for estimating the apparent ages of children (0-12 years old) using the external data described in Section 3. The gain from integrating this separate CNN in the final solution is quantitatively evaluated in Section 5.
2. We combine the 2 age label encoding strategies presented in Section 2.3. On the one hand, we employ label distribution age encoding for training the “general” CNNs, which allows our neural networks to better capture the concept of an apparent age (which is a range of values rather than a precise real value). On the other hand, we employ 0/1 classification encoding for the “children” CNNs because, for children, the possible range of apparent age values is very narrow and it is therefore meaningful to encode each year as a completely separate class. (Here and below, we refer to the CNNs which estimate all ages between 0 and 99 years old as the “general” ones, and to the CNNs which estimate only the ages of children between 0 and 12 years old as the “children” ones.) Our experiments have shown that using this combined age label encoding strategy is advantageous with respect to using only distributed age encoding or only 0/1 classification encoding for both “general” and “children” CNNs; an illustrative sketch of how the two kinds of outputs can be decoded follows this list.
3. Our experiments in Section 5 demonstrate that the quality of image preprocessing has a very strong impact on the final ε-score. Therefore, we employ the state-of-the-art open source solution from [27] for face alignment in our final approach.
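The exact way the two CNNs are combined at test time is evaluated in Section 5; purely as an illustration of the decoding step, the sketch below shows one plausible scheme under our own assumptions: the expected value over the output probabilities of the “general” network (expected-value decoding as in [21]), the most probable class of the “children” network, and a simple threshold-based switch between the two. The switch rule and all names are assumptions, not necessarily the exact scheme used in our final solution.

```python
import numpy as np

def decode_general(probs):
    """Expected age over the output classes of the "general" CNN
    trained with label distribution encoding (expected-value decoding as in [21])."""
    return float(np.dot(np.arange(len(probs)), probs))

def decode_children(probs):
    """Most probable class (0-12) of the "children" CNN trained with 0/1 encoding."""
    return float(np.argmax(probs))

def estimate_apparent_age(general_probs, children_probs, child_threshold=12):
    """Illustrative fusion (an assumption, not necessarily the paper's exact rule):
    defer to the specialized "children" CNN when the general prediction
    falls inside its 0-12 range."""
    age = decode_general(general_probs)
    if age <= child_threshold:
        return decode_children(children_probs)
    return age
```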
4.2.1 Training pipeline
The complete training pipeline of all apparent age estimation CNNs is presented in Figure 3. Starting with the pretrained VGG-16 CNN from [19], we train a “general” CNN for biological age estimation of all ages between 0 and 99 years old on the “cleaned” IMDB-Wiki dataset, using the label distribution age encoding. From the obtained network, we fine-tune a “children” CNN for biological age estimation of children between 0 and 12 years old; this time, the 0/1 classification age encoding is used. The next step is the fine-tuning of the 2 resulting CNNs (the “general” one and the “children” one) for apparent age estimation. In the case of the “general” CNN, we combine all training and validation images from the competition dataset (5613 images in total)

References
K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
C. Szegedy et al. Going Deeper with Convolutions. CVPR, 2015.
O. Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
Y. Jia et al. Caffe: Convolutional Architecture for Fast Feature Embedding. ACM Multimedia, 2014.