Deep Learning Face Representation from Predicting 10,000 Classes
Yi Sun¹   Xiaogang Wang²   Xiaoou Tang¹,³
¹Department of Information Engineering, The Chinese University of Hong Kong
²Department of Electronic Engineering, The Chinese University of Hong Kong
³Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
sy011@ie.cuhk.edu.hk   xgwang@ee.cuhk.edu.hk   xtang@ie.cuhk.edu.hk
Abstract
This paper proposes to learn a set of high-level feature
representations through deep learning, referred to as Deep
hidden IDentity features (DeepID), for face verification.
We argue that DeepID can be effectively learned through
challenging multi-class face identification tasks, whilst they
can be generalized to other tasks (such as verification) and
new identities unseen in the training set. Moreover, the
generalization capability of DeepID increases as more face
classes are to be predicted at training. DeepID features
are taken from the last hidden layer neuron activations of
deep convolutional networks (ConvNets). When learned
as classifiers to recognize about 10,000 face identities in
the training set and configured to keep reducing the neuron
numbers along the feature extraction hierarchy, these deep
ConvNets gradually form compact identity-related features
in the top layers with only a small number of hidden
neurons. The proposed features are extracted from various
face regions to form complementary and over-complete
representations. Any state-of-the-art classifiers can be
learned based on these high-level representations for face
verification. 97.45% verification accuracy on LFW is
achieved with only weakly aligned faces.
1. Introduction
Face verification in unconstrained conditions has been
studied extensively in recent years [21, 15, 7, 34, 17, 26,
18, 8, 2, 9, 3, 29, 6] due to its practical applications
and the release of LFW [19], a widely used benchmark
dataset for face verification algorithms. The current best-
performing face verification algorithms typically represent
faces with over-complete low-level features, followed by
shallow models [9, 29, 6]. Recently, deep models such as
ConvNets [24] have been proved effective for extracting
high-level visual features [11, 20, 14] and are used for
face verification [18, 5, 31, 32, 36].

Figure 1. An illustration of the feature extraction process. Arrows
indicate forward propagation directions. The number of neurons in
each layer of the multiple deep ConvNets is labeled beside each
layer. The DeepID features are taken from the last hidden layer
of each ConvNet and predict a large number of identity classes.
Feature numbers continue to reduce along the feature extraction
cascade until the DeepID layer.

Huang et al. [18] learned a generative deep model without
supervision. Cai et al. [5] learned deep nonlinear metrics. In [31], the
deep models are supervised by the binary face verification
target. In contrast, in this paper we propose to learn high-
level face identity features with deep models through face
identification, i.e., classifying a training image into one
of n identities (n ≈ 10,000 in this work). This high-
dimensional prediction task is much more challenging than
face verification; however, it leads to good generalization
of the learned feature representations. Although learned
through identification, these features are shown to be
effective for face verification and new faces unseen in the
training set.
We propose an effective way to learn high-level over-
complete features with deep ConvNets. A high-level
illustration of our feature extraction process is shown in
Figure 1. The ConvNets are learned to classify all the
faces available for training by their identities, with the last
hidden layer neuron activations as features (referred to as

Deep hidden IDentity features or DeepID). Each ConvNet
takes a face patch as input and extracts local low-level
features in the bottom layers. Feature numbers continue to
reduce along the feature extraction cascade while gradually
more global and high-level features are formed in the
top layers. A highly compact 160-dimensional DeepID vector
is acquired at the end of the cascade; it contains rich
identity information and directly predicts a much larger
number (e.g., 10,000) of identity classes. Classifying all the identities
simultaneously instead of training binary classifiers as in
[21, 2, 3] is based on two considerations. First, it is
much more difficult to classify a training sample into one
of many classes than to perform binary classification. This
challenging task can make full use of the large learning
capacity of neural networks to extract effective features
for face recognition. Second, it implicitly adds a strong
regularization to ConvNets, which helps to form shared
hidden representations that can classify all the identities
well. Therefore, the learned high-level features have good
generalization ability and do not over-fit to a small subset
of training faces. We constrain the DeepID dimension to be
significantly smaller than the number of identity classes it
predicts, which is key to learning highly compact and
discriminative features.
We further concatenate the DeepID extracted from various
face regions to form complementary and over-complete rep-
resentations. The learned features can be well generalized
to new identities in test, which are not seen in training,
and can be readily integrated with any state-of-the-art face
classifiers (e.g., Joint Bayesian [8]) for face verification.
Our method achieves 97.45% face verification accuracy
on LFW using only weakly aligned faces, which is almost
as good as human performance of 97.53%. We also observe
that as the number of training identities increases, the
verification performance steadily improves. Although the
prediction task at the training stage becomes more
challenging, the discrimination and generalization ability of
the learned features increases. This leaves the door wide open
for future improvement of accuracy with more training data.
2. Related work
Many face verification methods represent faces by high-
dimensional over-complete face descriptors, followed by
shallow models. Cao et al. [7] encoded each face image into
26K learning-based (LE) descriptors, and then calculated
the L2 distance between the LE descriptors after PCA. Chen
et al. [9] extracted 100K LBP descriptors at dense facial
landmarks with multiple scales and used Joint Bayesian [8]
for verification after PCA. Simonyan et al. [29] computed
1.7M SIFT descriptors densely in scale and space, encoded
the dense SIFT features into Fisher vectors, and learned lin-
ear projection for discriminative dimensionality reduction.
Huang et al. [17] combined 1.2M CMD [33] and SLBP
[1] descriptors, and learned sparse Mahalanobis metrics for
face verification.
Some previous studies have further learned identity-
related features based on low-level features. Kumar et al.
[21] trained attribute and simile classifiers to detect facial
attributes and measure face similarities to a set of reference
people. Berg and Belhumeur [2, 3] trained classifiers to
distinguish the faces from two different people. Features
are outputs of the learned classifiers. They used SVM
classifiers, which are shallow structures, and their learned
features are still relatively low-level. In contrast, we classify
all the identities from the training set simultaneously. More-
over, we use the last hidden layer activations as features
instead of the classifier outputs. In our ConvNets, the
neuron number of the last hidden layer is much smaller
than that of the output, which forces the last hidden layer
to learn shared hidden representations for faces of different
people in order to well classify all of them, resulting
in highly discriminative and compact features with good
generalization ability.
A few deep models have been used for face verification
or identification. Chopra et al. [10] used a Siamese network
[4] for deep metric learning. The Siamese network extracts
features separately from two compared inputs with two
identical sub-networks, taking the distance between the
outputs of the two sub-networks as dissimilarity. [10]
used deep ConvNets as the sub-networks. In contrast
to the Siamese network in which feature extraction and
recognition are jointly learned with the face verification
target, we conduct feature extraction and recognition in
two steps, with the first feature extraction step learned with
the target of face identification, which is a much stronger
supervision signal than verification. Huang et al. [18]
generatively learned features with CDBNs [25], then used
ITML [13] and linear SVM for face verification. Cai et al.
[5] also learned deep metrics under the Siamese network
framework as [10], but used a two-level ISA network [23]
as the sub-networks instead. Zhu et al. [35, 36] learned deep
neural networks to transform faces in arbitrary poses and
illumination to frontal faces with normal illumination, and
then used the last hidden layer features or the transformed
faces for face recognition. Sun et al. [31] used multiple deep
ConvNets to learn high-level face similarity features and
trained classification RBM [22] for face verification. Their
features are jointly extracted from a pair of faces instead of
from a single face.
3. Learning DeepID for face verification
3.1. Deep ConvNets
Our deep ConvNets contain four convolutional layers
(with max-pooling) to extract features hierarchically, fol-
lowed by the fully-connected DeepID layer and the softmax
output layer indicating identity classes. The input is 39 ×
31 × k for rectangular patches, and 31 × 31 × k for square
patches, where k = 3 for color patches and k = 1 for
gray patches. Figure 2 shows the detailed structure of the
ConvNet which takes 39 × 31 × 1 input and predicts n (e.g.,
n = 10,000) identity classes. When the input sizes change,
the height and width of maps in the following layers will
change accordingly. The dimension of the DeepID layer is
fixed to 160, while the dimension of the output layer varies
according to the number of classes it predicts. Feature
numbers continue to reduce along the feature extraction
hierarchy until the last hidden layer (the DeepID layer),
where highly compact and predictive features are formed,
which predict a much larger number of identity classes with
only a few features.

Figure 2. ConvNet structure. The length, width, and height of
each cuboid denote the map number and the dimensions of each
map for all input, convolutional, and max-pooling layers. The
small cuboids and squares inside denote the 3D convolution kernel
sizes and the 2D pooling region sizes of the convolutional and
max-pooling layers, respectively. Neuron numbers of the last two
fully-connected layers are marked beside each layer.
The convolution operation is expressed as

$$y^{j(r)} = \max\!\Big(0,\; b^{j(r)} + \sum_i k^{ij(r)} \ast x^{i(r)}\Big), \qquad (1)$$
where $x^i$ and $y^j$ are the i-th input map and the j-th output
map, respectively, $k^{ij}$ is the convolution kernel between
the i-th input map and the j-th output map, $\ast$ denotes
convolution, and $b^j$ is the bias of the j-th output map. We use
the ReLU nonlinearity ($y = \max(0, x)$) for hidden neurons,
which is shown to have better fitting ability than the
sigmoid function [20]. Weights in higher convolutional
layers of our ConvNets are locally shared to learn different
mid- or high-level features in different regions [18]; $r$
in Equation (1) indicates a local region within which weights are
shared. In the third convolutional layer, weights are locally
shared in every 2 × 2 region, while weights in the fourth
convolutional layer are totally unshared. Max-pooling is
formulated as
$$y^i_{j,k} = \max_{0 \le m,\, n < s}\big\{\, x^i_{j \cdot s + m,\; k \cdot s + n} \,\big\}, \qquad (2)$$

where each neuron in the i-th output map $y^i$ pools over an
$s \times s$ non-overlapping local region in the i-th input map $x^i$.
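To make Equations (1) and (2) concrete, here is a minimal NumPy sketch of one output map's convolution with ReLU and of non-overlapping max-pooling. The loop-based cross-correlation and all names are ours, and the local weight sharing across regions (the r index in Equation (1)) is omitted for brevity.

```python
import numpy as np

def conv_relu(x, k, b):
    """Eq. (1) for one output map: x is the stack of input maps (C, H, W),
    k is one kernel (C, kh, kw), b is a scalar bias. Valid cross-correlation
    summed over all input maps, followed by ReLU."""
    c, h, w = x.shape
    _, kh, kw = k.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[:, i:i + kh, j:j + kw] * k) + b
    return np.maximum(0.0, out)  # ReLU nonlinearity, y = max(0, x)

def max_pool(x, s):
    """Eq. (2): non-overlapping s x s max-pooling over one (H, W) map."""
    h2, w2 = x.shape[0] // s, x.shape[1] // s
    return x[:h2 * s, :w2 * s].reshape(h2, s, w2, s).max(axis=(1, 3))
```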
Figure 3. Top: ten face regions of medium scales. The five regions
on the top left are global regions taken from the weakly aligned
faces; the other five on the top right are local regions centered
around the five facial landmarks (two eye centers, nose tip, and two
mouth corners). Bottom: three scales of two particular patches.
The last hidden layer of DeepID is fully connected to
both the third and fourth convolutional layers (after max-
pooling) such that it sees multi-scale features [28] (features
in the fourth convolutional layer are more global than
those in the third one). This is critical to feature learning
because after successive down-sampling along the cascade,
the fourth convolutional layer contains too few neurons
and becomes the bottleneck for information propagation.
Adding the bypassing connections between the third con-
volutional layer (referred to as the skipping layer) and the
last hidden layer reduces the possible information loss in
the fourth convolutional layer. The last hidden layer computes

$$y_j = \max\!\Big(0,\; \sum_i x^1_i \cdot w^1_{i,j} + \sum_i x^2_i \cdot w^2_{i,j} + b_j\Big), \qquad (3)$$

where $x^1, w^1$ and $x^2, w^2$ denote the neurons and weights
connected to the third and fourth convolutional layers,
respectively. It linearly combines the features of the previous
two convolutional layers, followed by the ReLU non-linearity.
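A one-line sketch of Equation (3), assuming the pooled third-layer and fourth-layer maps have already been flattened into vectors; the shapes and names here are illustrative, not the paper's.

```python
import numpy as np

def deepid_layer(x1, x2, w1, w2, b):
    """Eq. (3): the 160-d DeepID layer linearly combines the flattened
    third-layer features x1 (via w1) and fourth-layer features x2 (via w2),
    adds a bias, and applies ReLU. w1: (len(x1), 160), w2: (len(x2), 160),
    b: (160,)."""
    return np.maximum(0.0, x1 @ w1 + x2 @ w2 + b)
```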
The ConvNet output is an n-way softmax predicting the
probability distribution over n different identities.
$$y_i = \frac{\exp(y'_i)}{\sum_{j=1}^{n} \exp(y'_j)}, \qquad (4)$$

where $y'_j = \sum_{i=1}^{160} x_i \cdot w_{i,j} + b_j$ linearly combines
the 160 DeepID features $x_i$ as the input of neuron $j$, and $y_j$
is its output. The ConvNet is learned by minimizing $-\log y_t$,
where $t$ is the index of the target class. Stochastic gradient
descent is used, with gradients calculated by back-propagation.
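The forward pass of Equation (4) and the training objective can be sketched as follows (NumPy, illustrative only; the backward pass and SGD update are omitted):

```python
import numpy as np

def softmax_output(deepid, w, b):
    """Eq. (4): n-way softmax over identity classes.
    deepid: (160,), w: (160, n), b: (n,)."""
    logits = deepid @ w + b      # y'_j for each class j
    logits -= logits.max()       # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def identification_loss(p, t):
    """The ConvNet is trained to minimize -log y_t for target class t."""
    return -np.log(p[t])
```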

3.2. Feature extraction
We detect five facial landmarks, including the two eye
centers, the nose tip, and the two mouth corners, with the
facial point detection method proposed by Sun et al. [30].
Faces are globally aligned by similarity transformation
according to the two eye centers and the mid-point of the
two mouth corners. Features are extracted from 60 face
patches with ten regions, three scales, and RGB or gray
channels. Figure 3 shows the ten face regions and the
three scales of two particular face regions. We trained
60 ConvNets, each of which extracts two 160-dimensional
DeepID vectors from a particular patch and its horizontally
flipped counterpart. A special case is patches around the
two eye centers and the two mouth corners, which are not
flipped themselves, but the patches symmetric with them
(for example, the flipped counterpart of the patch centered
on the left eye is derived by flipping the patch centered
on the right eye). The total length of the DeepID vector is 19,200
(160 × 2 × 60), which is ready for the final face verification.
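The assembly of the 19,200-dimensional vector could look like the sketch below, where `crop_patch` and the trained `convnets` are hypothetical stand-ins for the paper's pipeline; `crop_patch` is assumed to handle the special case above, returning the symmetric patch for landmark-centered crops when flip=True.

```python
import numpy as np

def extract_features(face_img, convnets, crop_patch):
    """Hypothetical assembly of the 19,200-d DeepID vector:
    60 patches x (patch + horizontal flip) x 160 dims."""
    feats = []
    for p in range(60):
        for flip in (False, True):
            patch = crop_patch(face_img, p, flip)
            feats.append(convnets[p](patch))  # 160-d DeepID per patch
    return np.concatenate(feats)              # 60 * 2 * 160 = 19,200 dims
```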
3.3. Face verification
We use the Joint Bayesian [8] technique for face ver-
ification based on the DeepID. Joint Bayesian has been
highly successful for face verification [9, 6]. It represents
the extracted facial features x (after subtracting the mean)
by the sum of two independent Gaussian variables
$$x = \mu + \epsilon, \qquad (5)$$

where $\mu \sim N(0, S_\mu)$ represents the face identity and
$\epsilon \sim N(0, S_\epsilon)$ the intra-personal variations.
Joint Bayesian models the joint probability of two faces given
the intra- or extra-personal variation hypothesis,
$P(x_1, x_2 \mid H_I)$ and $P(x_1, x_2 \mid H_E)$. It is readily
shown from Equation (5) that these two probabilities are also
Gaussian, with covariances

$$\Sigma_I = \begin{pmatrix} S_\mu + S_\epsilon & S_\mu \\ S_\mu & S_\mu + S_\epsilon \end{pmatrix} \qquad (6)$$

and

$$\Sigma_E = \begin{pmatrix} S_\mu + S_\epsilon & 0 \\ 0 & S_\mu + S_\epsilon \end{pmatrix}, \qquad (7)$$

respectively. $S_\mu$ and $S_\epsilon$ can be learned from data
with the EM algorithm. At test time, it calculates the log
likelihood ratio

$$r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)}, \qquad (8)$$
which has closed-form solutions and is efficient.
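For illustration, the log likelihood ratio of Equation (8) can also be evaluated directly from the block covariances of Equations (6) and (7). This direct NumPy version is a sketch for clarity; the paper's closed form avoids building the 2d × 2d matrices and is what makes testing efficient.

```python
import numpy as np

def log_likelihood_ratio(x1, x2, S_mu, S_eps):
    """Eq. (8) via Eqs. (6)-(7): log P(x1,x2|H_I) - log P(x1,x2|H_E) for
    zero-mean Gaussians (features are mean-subtracted). S_mu and S_eps are
    the (d, d) covariances learned by EM."""
    d = len(x1)
    x = np.concatenate([x1, x2])
    A = S_mu + S_eps
    sigma_I = np.block([[A, S_mu], [S_mu, A]])   # Eq. (6)
    sigma_E = np.block([[A, np.zeros((d, d))],
                        [np.zeros((d, d)), A]])  # Eq. (7)

    def log_gauss(v, S):
        sign, logdet = np.linalg.slogdet(S)
        return -0.5 * (v @ np.linalg.solve(S, v) + logdet
                       + len(v) * np.log(2 * np.pi))

    return log_gauss(x, sigma_I) - log_gauss(x, sigma_E)
```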
We also train a neural network for verification and com-
pare it to Joint Bayesian to see if other models can also learn
from the extracted features and how much the features and a
good face verification model contribute to the performance,
respectively. The neural network contains one input layer
taking the DeepID, one locally-connected layer, one fully-
connected layer, and a single output neuron indicating
face similarities. The input features are divided into 60
groups, each of which contains 640 features extracted from
a particular patch pair with a particular ConvNet. Features
in the same group are highly correlated. Neurons in the
locally-connected layer only connect to a single group of
features to learn their local relations and reduce the feature
dimension at the same time. The second hidden layer is
fully-connected to the first hidden layer to learn global
relations. The single output neuron is fully connected to the
second hidden layer. The hidden neurons are ReLUs and
the output neuron is sigmoid. An illustration of the neural
network structure is shown in Figure 4. It has 38,400 input
neurons, taking the 19,200 DeepID features from each of the
two faces in comparison, and 4,800 neurons in each of the
following two hidden layers, with every 80 neurons in the
first hidden layer locally connected to one of the 60 groups
of input neurons.

Figure 4. The structure of the neural network used for face
verification. The layer type and dimension are labeled beside
each layer. The solid neurons form a subnetwork.
Dropout learning [16] is used for all the hidden neu-
rons. The input neurons cannot be dropped because the
learned features are compact and distributed representa-
tions (representing a large number of identities with very
few neurons) and have to collaborate with each other to
represent the identities well. On the other hand, learning
high-dimensional features without dropout is difficult due
to gradient diffusion. To solve this problem, we first train 60
subnetworks, each with features of a single group as input.
A particular subnetwork is illustrated in Figure 4. We then
use the first-layer weights of the subnetworks to initialize
those of the original network, and tune the second and third
layers of the original network with the first-layer weights
fixed.
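Our reading of the Figure 4 architecture can be sketched in PyTorch as below. The dropout rate and the use of one Linear module per group to emulate the locally-connected layer are our assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn

class VerificationNet(nn.Module):
    """Sketch of the Figure 4 network: 60 groups of 640 inputs, each locally
    connected to 80 first-layer ReLUs (4,800 total), one fully-connected
    4,800-d ReLU layer, and a single sigmoid similarity output."""
    def __init__(self, groups=60, group_in=640, group_out=80):
        super().__init__()
        # one Linear per group implements the locally-connected layer
        self.local = nn.ModuleList(
            nn.Linear(group_in, group_out) for _ in range(groups))
        self.fc = nn.Linear(groups * group_out, groups * group_out)
        self.out = nn.Linear(groups * group_out, 1)
        self.drop = nn.Dropout(0.5)  # rate assumed; hidden neurons only

    def forward(self, x):            # x: (batch, 38400)
        parts = x.split(x.size(1) // len(self.local), dim=1)
        h1 = torch.cat([torch.relu(l(p))
                        for l, p in zip(self.local, parts)], dim=1)
        h2 = torch.relu(self.fc(self.drop(h1)))
        return torch.sigmoid(self.out(self.drop(h2)))
```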
4. Experiments
We evaluate our algorithm on LFW, which reveals the
state-of-the-art of face verification in the wild. Though
LFW contains 5749 people, only 85 have more than 15
images, and 4069 people have only one image. It is
inadequate to train identity classifiers with so few images
per person. Instead, we trained our model on CelebFaces

[31] and tested on LFW (Sections 4.1-4.3). CelebFaces
contains 87,628 face images of 5436 celebrities from the
Internet, with about 16 images per person on average.
People in LFW and CelebFaces are mutually exclusive.
We randomly choose 80% (4349) of the people from Celeb-
Faces to learn the DeepID, and use the remaining 20% of
the people to learn the face verification model (Joint
Bayesian or neural networks). For feature learning, the
ConvNets are supervised to classify the 4349 people
simultaneously, each from a particular kind of face patch
and its flipped counterpart. We randomly select 10% of the images of each
training person to generate the validation data. After each
training epoch, we observe the top-1 validation set error
rates and select the model that provides the lowest one.
For face verification, the feature dimension is reduced
to 150 by PCA before learning the Joint Bayesian model;
performance is almost unchanged over a wide range of
dimensions.
At test time, each face pair is classified by comparing the
Joint Bayesian log likelihood ratio to a threshold optimized
on the training data.
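A sketch of this evaluation protocol follows; `train_feats`, `train_pairs`, `train_labels`, and the EM-learned `S_mu`, `S_eps` are placeholders, and `log_likelihood_ratio` is the function sketched earlier.

```python
import numpy as np
from sklearn.decomposition import PCA

# Reduce the 19,200-d DeepID to 150 dims before Joint Bayesian
# (train_feats: features of the held-out 20% of CelebFaces identities).
pca = PCA(n_components=150).fit(train_feats)

# Pick the decision threshold on training pairs: a pair is "same identity"
# when the Joint Bayesian log likelihood ratio exceeds the threshold.
ratios = np.array([log_likelihood_ratio(pca.transform([a])[0],
                                        pca.transform([b])[0],
                                        S_mu, S_eps)
                   for a, b in train_pairs])
candidates = np.sort(ratios)
accs = [np.mean((ratios > t) == train_labels) for t in candidates]
threshold = candidates[int(np.argmax(accs))]
```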
To evaluate the performance of our approach at an even
larger training scale in Section 4.4, we extend CelebFaces
to the CelebFaces+ dataset, which contains 202,599 face
images of 10,177 celebrities. Again, people in LFW
and CelebFaces+ are mutually exclusive. The ConvNet
structure and the feature extraction process described in the
previous section remain unchanged.
4.1. Multi-scale ConvNets
We verify the effectiveness of directly connecting neu-
rons in the third convolutional layer (after max-pooling)
to the last hidden layer (the DeepID layer), such that it
sees both the third and fourth convolutional layer features,
forming the so-called multi-scale ConvNets. It also results
in reducing feature numbers from the convolutional layers
to the DeepID layer (shown in Figure 1), which helps the
latter to learn higher-level features in order to well represent
the face identities with fewer neurons. Figure 5 compares
the top-1 validation set error rates of the 60 ConvNets
learned to classify the 4349 classes of identities, either with
or without the skipping layer. The lower error rates indicate
the better hidden features learned. Allowing the DeepID to
pool over multi-scale features reduces validation errors by
an average of 4.72%. It also improves the final
face verification accuracy from 95.35% to 96.05% when
concatenating the DeepID from the 60 ConvNets and using
Joint Bayesian for face verification.

Figure 5. Top-1 validation set error rates of the 60 ConvNets
trained on the 60 different patches. The blue and red markers show
error rates of the conventional ConvNets (without the skipping
layer) and the multi-scale ConvNets, respectively.
4.2. Learning effective features
Classifying a large number of identities simultaneously
is key to learning discriminative and compact hidden
features. To verify this, we increase the number of identity
classes used for training exponentially from 136 to 4349
(and the output neuron numbers correspondingly), while
fixing the neuron
numbers in all previous layers (the DeepID is kept to be
160 dimensional). We observe the classification ability of
ConvNets (measured by the top-1 validation set error rates)
and the effectiveness of the learned hidden representations
for face verification (measured by the test set verification
accuracy) with the increasing identity classes. The input is a
single patch covering the whole face in this experiment. As
shown in Figure 6, both Joint Bayesian and the neural
network improve roughly linearly in verification accuracy
each time the number of identity classes doubles. The
improvement is significant. When
identity classes increase 32 times from 136 to 4349, the
accuracy increases by 10.13% and 8.42% for Joint Bayesian
and neural networks, respectively, or 2.03% and 1.68% on
average, respectively, whenever the identity classes double.
At the same time, the validation set error rates drop, even
when the predicted classes are tens of times more than
the last hidden layer neurons, as shown in Figure 7. This
phenomenon indicates that ConvNets can learn from classi-
fying each identity and form shared hidden representations
that can classify all the identities well. More identity
classes help to learn better hidden representations that can
distinguish more people (discriminative) without increasing
the feature length (compact). The linear increasing of
test accuracy with respect to the exponentially increasing
training data indicates that our features would be further
improved if even more identities are available. Examples of
the 160-dimensional DeepID learned from the 4349 training
identities and extracted from LFW test pairs are shown in
Figure 8. We find that faces of the same identity tend to
have more neurons activated in common (positive values at
the same positions) than faces of different identities, which
indicates that the learned features encode identity information.
We also test the 4349-dimensional classifier outputs as
features for face verification. Joint Bayesian only achieves
approximately 66% accuracy on these features, while the
neural network fails, classifying all the face pairs as
positive or negative. With so many classes and so few
samples per class, the classifier outputs are diverse and
unreliable, and therefore cannot be used as features.
References

ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al., NIPS 2012)
TL;DR: Achieves state-of-the-art image classification with a deep ConvNet consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

Gradient-based learning applied to document recognition (LeCun et al., Proceedings of the IEEE, 1998)
TL;DR: Shows that gradient-based learning can synthesize complex decision surfaces that classify high-dimensional patterns such as handwritten characters, and introduces graph transformer networks (GTNs) for document recognition.

Improving neural networks by preventing co-adaptation of feature detectors (Hinton et al., arXiv preprint, 2012)
TL;DR: Randomly omits half of the feature detectors on each training case to prevent complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors.

DeepFace: Closing the Gap to Human-Level Performance in Face Verification (Taigman et al., CVPR 2014)
TL;DR: Revisits both the alignment step and the representation step by employing explicit 3D face modeling to apply a piecewise affine transformation, and derives a face representation from a nine-layer deep neural network.

Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments (Huang et al., technical report, 2007)
TL;DR: The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life, exhibiting "natural" variability in factors such as pose, lighting, race, accessories, occlusions, and background.
Frequently Asked Questions (14)
Q1. What are the contributions in "Deep learning face representation from predicting 10,000 classes" ?

This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Deep hidden IDentity features ( DeepID ), for face verification. The authors argue that DeepID can be effectively learned through challenging multi-class face identification tasks, whilst they can be generalized to other tasks ( such as verification ) and new identities unseen in the training set. 

Combining the 60 patches increases the accuracy by 4.53% and 5.27% over the best single patch for Joint Bayesian and neural networks, respectively.

The transfer-learning Joint Bayesian model based on the DeepID features achieves 97.45% test accuracy on LFW, which is on par with the human-level performance of 97.53%.