Deep Learning Face Representation from Predicting 10,000 Classes
Yi Sun¹   Xiaogang Wang²   Xiaoou Tang¹,³
¹Department of Information Engineering, The Chinese University of Hong Kong
²Department of Electronic Engineering, The Chinese University of Hong Kong
³Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
sy011@ie.cuhk.edu.hk   xgwang@ee.cuhk.edu.hk   xtang@ie.cuhk.edu.hk
Abstract
This paper proposes to learn a set of high-level feature
representations through deep learning, referred to as Deep
hidden IDentity features (DeepID), for face verification.
We argue that DeepID can be effectively learned through
challenging multi-class face identification tasks, whilst they
can be generalized to other tasks (such as verification) and
new identities unseen in the training set. Moreover, the
generalization capability of DeepID increases as more face
classes are to be predicted at training. DeepID features
are taken from the last hidden layer neuron activations of
deep convolutional networks (ConvNets). When learned
as classifiers to recognize about 10,000 face identities in
the training set and configured to keep reducing the neuron
numbers along the feature extraction hierarchy, these deep
ConvNets gradually form compact identity-related features
in the top layers with only a small number of hidden
neurons. The proposed features are extracted from various
face regions to form complementary and over-complete
representations. Any state-of-the-art classifiers can be
learned based on these high-level representations for face
verification. 97.45% verification accuracy on LFW is
achieved with only weakly aligned faces.
1. Introduction
Face verification in unconstrained conditions has been
studied extensively in recent years [21, 15, 7, 34, 17, 26,
18, 8, 2, 9, 3, 29, 6] due to its practical applications
and the release of LFW [19], a widely used benchmark
dataset for face verification algorithms. The current best-
performing face verification algorithms typically represent
faces with over-complete low-level features, followed by
shallow models [9, 29, 6]. Recently, deep models such as
ConvNets [24] have been proved effective for extracting
high-level visual features [11, 20, 14] and are used for
face verification [18, 5, 31, 32, 36].

Figure 1. An illustration of the feature extraction process. Arrows
indicate forward propagation directions. The number of neurons in
each layer of the multiple deep ConvNets is labeled beside each
layer. The DeepID features are taken from the last hidden layer
of each ConvNet and predict a large number of identity classes.
Feature numbers continue to reduce along the feature extraction
cascade until the DeepID layer.

Huang et al. [18] learned a generative deep model without
supervision. Cai et al. [5] learned deep nonlinear metrics. In [31], the
deep models are supervised by the binary face verification
target. In contrast, in this paper we propose to learn high-
level face identity features with deep models through face
identification, i.e., classifying a training image into one
of n identities (n ≈ 10,000 in this work). This high-
dimensional prediction task is much more challenging than
face verification; however, it leads to good generalization
of the learned feature representations. Although learned
through identification, these features are shown to be
effective for face verification and new faces unseen in the
training set.
We propose an effective way to learn high-level over-
complete features with deep ConvNets. A high-level
illustration of our feature extraction process is shown in
Figure 1. The ConvNets are learned to classify all the
faces available for training by their identities, with the last
hidden layer neuron activations as features (referred to as

Deep hidden IDentity features or DeepID). Each ConvNet
takes a face patch as input and extracts local low-level
features in the bottom layers. Feature numbers continue to
reduce along the feature extraction cascade while gradually
more global and high-level features are formed in the
top layers. A highly compact 160-dimensional DeepID vector
is acquired at the end of the cascade; it contains rich
identity information and directly predicts a much larger
number (e.g., 10,000) of identity classes. Classifying all the identities
simultaneously instead of training binary classifiers as in
[21, 2, 3] is based on two considerations. First, it is
much more difficult to classify a training sample into one
of many classes than to perform binary classification. This
challenging task can make full use of the large learning
capacity of neural networks to extract effective features
for face recognition. Second, it implicitly adds a strong
regularization to ConvNets, which helps to form shared
hidden representations that can classify all the identities
well. Therefore, the learned high-level features have good
generalization ability and do not over-fit to a small subset
of training faces. We constrain the DeepID dimension to be
significantly smaller than the number of identity classes it
predicts, which is key to learning highly compact and
discriminative features.
We further concatenate the DeepID extracted from various
face regions to form complementary and over-complete rep-
resentations. The learned features can be well generalized
to new identities in test, which are not seen in training,
and can be readily integrated with any state-of-the-art face
classifiers (e.g., Joint Bayesian [8]) for face verification.
Our method achieves 97.45% face verification accuracy
on LFW using only weakly aligned faces, which is almost
as good as human performance of 97.53%. We also observe
that as the number of training identities increases, the
verification performance steadily improves. Although the
prediction task at the training stage becomes more
challenging, the discrimination and generalization ability of
the learned features increases. This leaves the door wide open
for future improvement of accuracy with more training data.
2. Related work
Many face verification methods represent faces by high-
dimensional over-complete face descriptors, followed by
shallow models. Cao et al. [7] encoded each face image into
26K learning-based (LE) descriptors, and then calculated
the L2 distance between the LE descriptors after PCA. Chen
et al. [9] extracted 100K LBP descriptors at dense facial
landmarks with multiple scales and used Joint Bayesian [8]
for verification after PCA. Simonyan et al. [29] computed
1.7M SIFT descriptors densely in scale and space, encoded
the dense SIFT features into Fisher vectors, and learned lin-
ear projection for discriminative dimensionality reduction.
Huang et al. [17] combined 1.2M CMD [33] and SLBP
[1] descriptors, and learned sparse Mahalanobis metrics for
face verification.
Some previous studies have further learned identity-
related features based on low-level features. Kumar et al.
[21] trained attribute and simile classifiers to detect facial
attributes and measure face similarities to a set of reference
people. Berg and Belhumeur [2, 3] trained classifiers to
distinguish the faces from two different people. Features
are outputs of the learned classifiers. They used SVM
classifiers, which are shallow structures, and their learned
features are still relatively low-level. In contrast, we classify
all the identities from the training set simultaneously. More-
over, we use the last hidden layer activations as features
instead of the classifier outputs. In our ConvNets, the
neuron number of the last hidden layer is much smaller
than that of the output, which forces the last hidden layer
to learn shared hidden representations for faces of different
people in order to well classify all of them, resulting
in highly discriminative and compact features with good
generalization ability.
A few deep models have been used for face verification
or identification. Chopra et al. [10] used a Siamese network
[4] for deep metric learning. The Siamese network extracts
features separately from two compared inputs with two
identical sub-networks, taking the distance between the
outputs of the two sub-networks as dissimilarity. [10]
used deep ConvNets as the sub-networks. In contrast
to the Siamese network in which feature extraction and
recognition are jointly learned with the face verification
target, we conduct feature extraction and recognition in
two steps, with the first feature extraction step learned with
the target of face identification, which is a much stronger
supervision signal than verification. Huang et al. [18]
generatively learned features with CDBNs [25], then used
ITML [13] and linear SVM for face verification. Cai et al.
[5] also learned deep metrics under the Siamese network
framework as [10], but used a two-level ISA network [23]
as the sub-networks instead. Zhu et al. [35, 36] learned deep
neural networks to transform faces in arbitrary poses and
illumination to frontal faces with normal illumination, and
then used the last hidden layer features or the transformed
faces for face recognition. Sun et al. [31] used multiple deep
ConvNets to learn high-level face similarity features and
trained classification RBM [22] for face verification. Their
features are jointly extracted from a pair of faces instead of
from a single face.
3. Learning DeepID for face verification
3.1. Deep ConvNets
Our deep ConvNets contain four convolutional layers
(with max-pooling) to extract features hierarchically, fol-
lowed by the fully-connected DeepID layer and the softmax
output layer indicating identity classes. The input is 39 ×
31 × k for rectangular patches, and 31 × 31 × k for square
patches, where k = 3 for color patches and k = 1 for
gray patches. Figure 2 shows the detailed structure of the
ConvNet which takes 39 × 31 × 1 input and predicts n (e.g.,
n = 10,000) identity classes. When the input sizes change,
the height and width of maps in the following layers will
change accordingly. The dimension of the DeepID layer is
fixed to 160, while the dimension of the output layer varies
according to the number of classes it predicts. Feature
numbers continue to reduce along the feature extraction
hierarchy until the last hidden layer (the DeepID layer),
where highly compact and predictive features are formed,
which predict a much larger number of identity classes with
only a few features.

Figure 2. ConvNet structure. The length, width, and height of
each cuboid denote the map number and the dimensions of each
map for all input, convolutional, and max-pooling layers. The
small cuboids and squares inside denote the 3D convolution kernel
sizes and the 2D pooling region sizes of the convolutional and
max-pooling layers, respectively. Neuron numbers of the last two
fully-connected layers are marked beside each layer.
The convolution operation is expressed as

$$y^{j(r)} = \max\!\Big(0,\; b^{j(r)} + \sum_i k^{ij(r)} \ast x^{i(r)}\Big), \qquad (1)$$
where $x^i$ and $y^j$ are the i-th input map and the j-th output
map, respectively, $k^{ij}$ is the convolution kernel between
the i-th input map and the j-th output map, $\ast$ denotes
convolution, and $b^j$ is the bias of the j-th output map. We use
the ReLU nonlinearity ($y = \max(0, x)$) for hidden neurons,
which is shown to have better fitting ability than the
sigmoid function [20]. Weights in higher convolutional
layers of our ConvNets are locally shared to learn different
mid- or high-level features in different regions [18]; $r$
in Equation (1) indicates a local region within which weights are
shared. In the third convolutional layer, weights are locally
shared in every 2 × 2 region, while weights in the fourth
convolutional layer are totally unshared. Max-pooling is
formulated as
$$y^i_{j,k} = \max_{0 \le m,\, n < s}\big\{\, x^i_{j \cdot s + m,\; k \cdot s + n} \,\big\}, \qquad (2)$$

where each neuron in the i-th output map $y^i$ pools over an
$s \times s$ non-overlapping local region in the i-th input map $x^i$.
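To make Equations (1) and (2) concrete, here is a minimal NumPy sketch of one output map's convolution with ReLU and of non-overlapping max-pooling. The loop-based cross-correlation and all names are ours, and the local weight sharing across regions (the r index in Equation (1)) is omitted for brevity.

```python
import numpy as np

def conv_relu(x, k, b):
    """Eq. (1) for one output map: x is the stack of input maps (C, H, W),
    k is one kernel (C, kh, kw), b is a scalar bias. Valid cross-correlation
    summed over all input maps, followed by ReLU."""
    c, h, w = x.shape
    _, kh, kw = k.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[:, i:i + kh, j:j + kw] * k) + b
    return np.maximum(0.0, out)  # ReLU nonlinearity, y = max(0, x)

def max_pool(x, s):
    """Eq. (2): non-overlapping s x s max-pooling over one (H, W) map."""
    h2, w2 = x.shape[0] // s, x.shape[1] // s
    return x[:h2 * s, :w2 * s].reshape(h2, s, w2, s).max(axis=(1, 3))
```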
Figure 3. Top: ten face regions of medium scales. The five regions
on the top left are global regions taken from the weakly aligned
faces; the other five on the top right are local regions centered
around the five facial landmarks (two eye centers, nose tip, and two
mouth corners). Bottom: three scales of two particular patches.
The last hidden layer of DeepID is fully connected to
both the third and fourth convolutional layers (after max-
pooling) such that it sees multi-scale features [28] (features
in the fourth convolutional layer are more global than
those in the third one). This is critical to feature learning
because after successive down-sampling along the cascade,
the fourth convolutional layer contains too few neurons
and becomes the bottleneck for information propagation.
Adding the bypassing connections between the third con-
volutional layer (referred to as the skipping layer) and the
last hidden layer reduces the possible information loss in
the fourth convolutional layer. The last hidden layer computes

$$y_j = \max\!\Big(0,\; \sum_i x^1_i \cdot w^1_{i,j} + \sum_i x^2_i \cdot w^2_{i,j} + b_j\Big), \qquad (3)$$

where $x^1, w^1$ and $x^2, w^2$ denote the neurons and weights
connected to the third and fourth convolutional layers,
respectively. It linearly combines the features of the previous
two convolutional layers, followed by the ReLU non-linearity.
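A one-line sketch of Equation (3), assuming the pooled third-layer and fourth-layer maps have already been flattened into vectors; the shapes and names here are illustrative, not the paper's.

```python
import numpy as np

def deepid_layer(x1, x2, w1, w2, b):
    """Eq. (3): the 160-d DeepID layer linearly combines the flattened
    third-layer features x1 (via w1) and fourth-layer features x2 (via w2),
    adds a bias, and applies ReLU. w1: (len(x1), 160), w2: (len(x2), 160),
    b: (160,)."""
    return np.maximum(0.0, x1 @ w1 + x2 @ w2 + b)
```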
The ConvNet output is an n-way softmax predicting the
probability distribution over n different identities.
$$y_i = \frac{\exp(y'_i)}{\sum_{j=1}^{n} \exp(y'_j)}, \qquad (4)$$

where $y'_j = \sum_{i=1}^{160} x_i \cdot w_{i,j} + b_j$ linearly combines
the 160 DeepID features $x_i$ as the input of neuron $j$, and $y_j$
is its output. The ConvNet is learned by minimizing $-\log y_t$,
where $t$ is the index of the target class. Stochastic gradient
descent is used, with gradients calculated by back-propagation.
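The forward pass of Equation (4) and the training objective can be sketched as follows (NumPy, illustrative only; the backward pass and SGD update are omitted):

```python
import numpy as np

def softmax_output(deepid, w, b):
    """Eq. (4): n-way softmax over identity classes.
    deepid: (160,), w: (160, n), b: (n,)."""
    logits = deepid @ w + b      # y'_j for each class j
    logits -= logits.max()       # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def identification_loss(p, t):
    """The ConvNet is trained to minimize -log y_t for target class t."""
    return -np.log(p[t])
```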

3.2. Feature extraction
We detect five facial landmarks, including the two eye
centers, the nose tip, and the two mouth corners, with the
facial point detection method proposed by Sun et al. [30].
Faces are globally aligned by similarity transformation
according to the two eye centers and the mid-point of the
two mouth corners. Features are extracted from 60 face
patches with ten regions, three scales, and RGB or gray
channels. Figure 3 shows the ten face regions and the
three scales of two particular face regions. We trained
60 ConvNets, each of which extracts two 160-dimensional
DeepID vectors from a particular patch and its horizontally
flipped counterpart. A special case is patches around the
two eye centers and the two mouth corners, which are not
flipped themselves, but the patches symmetric with them
(for example, the flipped counterpart of the patch centered
on the left eye is derived by flipping the patch centered
on the right eye). The total length of the DeepID vector is 19,200
(160 × 2 × 60), which is ready for the final face verification.
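The assembly of the 19,200-dimensional vector could look like the sketch below, where `crop_patch` and the trained `convnets` are hypothetical stand-ins for the paper's pipeline; `crop_patch` is assumed to handle the special case above, returning the symmetric patch for landmark-centered crops when flip=True.

```python
import numpy as np

def extract_features(face_img, convnets, crop_patch):
    """Hypothetical assembly of the 19,200-d DeepID vector:
    60 patches x (patch + horizontal flip) x 160 dims."""
    feats = []
    for p in range(60):
        for flip in (False, True):
            patch = crop_patch(face_img, p, flip)
            feats.append(convnets[p](patch))  # 160-d DeepID per patch
    return np.concatenate(feats)              # 60 * 2 * 160 = 19,200 dims
```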
3.3. Face verification
We use the Joint Bayesian [8] technique for face ver-
ification based on the DeepID. Joint Bayesian has been
highly successful for face verification [9, 6]. It represents
the extracted facial features x (after subtracting the mean)
by the sum of two independent Gaussian variables
$$x = \mu + \epsilon, \qquad (5)$$

where $\mu \sim N(0, S_\mu)$ represents the face identity and
$\epsilon \sim N(0, S_\epsilon)$ the intra-personal variations.
Joint Bayesian models the joint probability of two faces given
the intra- or extra-personal variation hypothesis,
$P(x_1, x_2 \mid H_I)$ and $P(x_1, x_2 \mid H_E)$. It is readily
shown from Equation (5) that these two probabilities are also
Gaussian, with covariances

$$\Sigma_I = \begin{pmatrix} S_\mu + S_\epsilon & S_\mu \\ S_\mu & S_\mu + S_\epsilon \end{pmatrix} \qquad (6)$$

and

$$\Sigma_E = \begin{pmatrix} S_\mu + S_\epsilon & 0 \\ 0 & S_\mu + S_\epsilon \end{pmatrix}, \qquad (7)$$

respectively. $S_\mu$ and $S_\epsilon$ can be learned from data
with the EM algorithm. At test time, it calculates the log
likelihood ratio

$$r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)}, \qquad (8)$$
which has closed-form solutions and is efficient.
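For illustration, the log likelihood ratio of Equation (8) can also be evaluated directly from the block covariances of Equations (6) and (7). This direct NumPy version is a sketch for clarity; the paper's closed form avoids building the 2d × 2d matrices and is what makes testing efficient.

```python
import numpy as np

def log_likelihood_ratio(x1, x2, S_mu, S_eps):
    """Eq. (8) via Eqs. (6)-(7): log P(x1,x2|H_I) - log P(x1,x2|H_E) for
    zero-mean Gaussians (features are mean-subtracted). S_mu and S_eps are
    the (d, d) covariances learned by EM."""
    d = len(x1)
    x = np.concatenate([x1, x2])
    A = S_mu + S_eps
    sigma_I = np.block([[A, S_mu], [S_mu, A]])   # Eq. (6)
    sigma_E = np.block([[A, np.zeros((d, d))],
                        [np.zeros((d, d)), A]])  # Eq. (7)

    def log_gauss(v, S):
        sign, logdet = np.linalg.slogdet(S)
        return -0.5 * (v @ np.linalg.solve(S, v) + logdet
                       + len(v) * np.log(2 * np.pi))

    return log_gauss(x, sigma_I) - log_gauss(x, sigma_E)
```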
We also train a neural network for verification and com-
pare it to Joint Bayesian to see if other models can also learn
from the extracted features and how much the features and a
good face verification model contribute to the performance,
respectively. The neural network contains one input layer
taking the DeepID, one locally-connected layer, one fully-
connected layer, and a single output neuron indicating
face similarities. The input features are divided into 60
groups, each of which contains 640 features extracted from
a particular patch pair with a particular ConvNet. Features
in the same group are highly correlated. Neurons in the
locally-connected layer only connect to a single group of
features to learn their local relations and reduce the feature
dimension at the same time. The second hidden layer is
fully-connected to the first hidden layer to learn global
relations. The single output neuron is fully connected to the
second hidden layer. The hidden neurons are ReLUs and
the output neuron is sigmoid. An illustration of the neural
network structure is shown in Figure 4. It has 38,400 input
neurons, taking the 19,200 DeepID features from each of the
two faces in comparison, and 4,800 neurons in each of the
following two hidden layers, with every 80 neurons in the
first hidden layer locally connected to one of the 60 groups
of input neurons.

Figure 4. The structure of the neural network used for face
verification. The layer type and dimension are labeled beside
each layer. The solid neurons form a subnetwork.
Dropout learning [16] is used for all the hidden neu-
rons. The input neurons cannot be dropped because the
learned features are compact and distributed representa-
tions (representing a large number of identities with very
few neurons) and have to collaborate with each other to
represent the identities well. On the other hand, learning
high-dimensional features without dropout is difficult due
to gradient diffusion. To solve this problem, we first train 60
subnetworks, each with features of a single group as input.
A particular subnetwork is illustrated in Figure 4. We then
use the first-layer weights of the subnetworks to initialize
those of the original network, and tune the second and third
layers of the original network with the first-layer weights
fixed.
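Our reading of the Figure 4 architecture can be sketched in PyTorch as below. The dropout rate and the use of one Linear module per group to emulate the locally-connected layer are our assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn

class VerificationNet(nn.Module):
    """Sketch of the Figure 4 network: 60 groups of 640 inputs, each locally
    connected to 80 first-layer ReLUs (4,800 total), one fully-connected
    4,800-d ReLU layer, and a single sigmoid similarity output."""
    def __init__(self, groups=60, group_in=640, group_out=80):
        super().__init__()
        # one Linear per group implements the locally-connected layer
        self.local = nn.ModuleList(
            nn.Linear(group_in, group_out) for _ in range(groups))
        self.fc = nn.Linear(groups * group_out, groups * group_out)
        self.out = nn.Linear(groups * group_out, 1)
        self.drop = nn.Dropout(0.5)  # rate assumed; hidden neurons only

    def forward(self, x):            # x: (batch, 38400)
        parts = x.split(x.size(1) // len(self.local), dim=1)
        h1 = torch.cat([torch.relu(l(p))
                        for l, p in zip(self.local, parts)], dim=1)
        h2 = torch.relu(self.fc(self.drop(h1)))
        return torch.sigmoid(self.out(self.drop(h2)))
```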
4. Experiments
We evaluate our algorithm on LFW, which reveals the
state-of-the-art of face verification in the wild. Though
LFW contains 5749 people, only 85 have more than 15
images, and 4069 people have only one image. It is
inadequate to train identity classifiers with so few images
per person. Instead, we trained our model on CelebFaces

[31] and tested on LFW (Sections 4.1-4.3). CelebFaces
contains 87,628 face images of 5436 celebrities from the
Internet, with about 16 images per person on average.
People in LFW and CelebFaces are mutually exclusive.
We randomly choose 80% (4349) of the people from Celeb-
Faces to learn the DeepID, and use the remaining 20% of
the people to learn the face verification model (Joint
Bayesian or neural networks). For feature learning, the
ConvNets are supervised to classify the 4349 people
simultaneously, each from a particular kind of face patch
and its flipped counterpart. We randomly select 10% of the images of each
training person to generate the validation data. After each
training epoch, we observe the top-1 validation set error
rates and select the model that provides the lowest one.
For face verification, the feature dimension is reduced
to 150 by PCA before learning the Joint Bayesian model;
performance is almost unchanged over a wide range of
dimensions.
At test time, each face pair is classified by comparing the
Joint Bayesian log likelihood ratio to a threshold optimized
on the training data.
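A sketch of this evaluation protocol follows; `train_feats`, `train_pairs`, `train_labels`, and the EM-learned `S_mu`, `S_eps` are placeholders, and `log_likelihood_ratio` is the function sketched earlier.

```python
import numpy as np
from sklearn.decomposition import PCA

# Reduce the 19,200-d DeepID to 150 dims before Joint Bayesian
# (train_feats: features of the held-out 20% of CelebFaces identities).
pca = PCA(n_components=150).fit(train_feats)

# Pick the decision threshold on training pairs: a pair is "same identity"
# when the Joint Bayesian log likelihood ratio exceeds the threshold.
ratios = np.array([log_likelihood_ratio(pca.transform([a])[0],
                                        pca.transform([b])[0],
                                        S_mu, S_eps)
                   for a, b in train_pairs])
candidates = np.sort(ratios)
accs = [np.mean((ratios > t) == train_labels) for t in candidates]
threshold = candidates[int(np.argmax(accs))]
```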
To evaluate the performance of our approach at an even
larger training scale in Section 4.4, we extend CelebFaces
to the CelebFaces+ dataset, which contains 202,599 face
images of 10,177 celebrities. Again, people in LFW
and CelebFaces+ are mutually exclusive. The ConvNet
structure and the feature extraction process described in the
previous section remain unchanged.
4.1. Multi-scale ConvNets
We verify the effectiveness of directly connecting neu-
rons in the third convolutional layer (after max-pooling)
to the last hidden layer (the DeepID layer), such that it
sees both the third and fourth convolutional layer features,
forming the so-called multi-scale ConvNets. It also results
in reducing feature numbers from the convolutional layers
to the DeepID layer (shown in Figure 1), which helps the
latter to learn higher-level features in order to well represent
the face identities with fewer neurons. Figure 5 compares
the top-1 validation set error rates of the 60 ConvNets
learned to classify the 4349 classes of identities, either with
or without the skipping layer. The lower error rates indicate
the better hidden features learned. Allowing the DeepID to
pool over multi-scale features reduces validation errors by
an average of 4.72%. It also improves the final
face verification accuracy from 95.35% to 96.05% when
concatenating the DeepID from the 60 ConvNets and using
Joint Bayesian for face verification.

Figure 5. Top-1 validation set error rates of the 60 ConvNets
trained on the 60 different patches. The blue and red markers show
error rates of the conventional ConvNets (without the skipping
layer) and the multi-scale ConvNets, respectively.
4.2. Learning effective features
Classifying a large number of identities simultaneously
is key to learning discriminative and compact hidden
features. To verify this, we increase the number of identity
classes used for training exponentially from 136 to 4349
(and the output neuron numbers correspondingly), while
fixing the neuron
numbers in all previous layers (the DeepID is kept to be
160 dimensional). We observe the classification ability of
ConvNets (measured by the top-1 validation set error rates)
and the effectiveness of the learned hidden representations
for face verification (measured by the test set verification
accuracy) with the increasing identity classes. The input is a
single patch covering the whole face in this experiment. As
shown in Figure 6, both Joint Bayesian and the neural
network improve roughly linearly in verification accuracy
each time the number of identity classes doubles. The
improvement is significant. When
identity classes increase 32 times from 136 to 4349, the
accuracy increases by 10.13% and 8.42% for Joint Bayesian
and neural networks, respectively, or 2.03% and 1.68% on
average, respectively, whenever the identity classes double.
At the same time, the validation set error rates drop, even
when the predicted classes are tens of times more than
the last hidden layer neurons, as shown in Figure 7. This
phenomenon indicates that ConvNets can learn from classi-
fying each identity and form shared hidden representations
that can classify all the identities well. More identity
classes help to learn better hidden representations that can
distinguish more people (discriminative) without increasing
the feature length (compact). The linear increasing of
test accuracy with respect to the exponentially increasing
training data indicates that our features would be further
improved if even more identities are available. Examples of
the 160-dimensional DeepID learned from the 4349 training
identities and extracted from LFW test pairs are shown in
Figure 8. We find that faces of the same identity tend to
have more neurons activated in common (positive values at
the same positions) than faces of different identities, which
indicates that the learned features encode identity information.
We also test the 4349-dimensional classifier outputs as
features for face verification. Joint Bayesian only achieves
approximately 66% accuracy on these features, while the
neural network fails, classifying all the face pairs as
positive or negative. With so many classes and so few
samples per class, the classifier outputs are diverse and
unreliable, and therefore cannot be used as features.
References

ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al., NIPS 2012)
TL;DR: Achieves state-of-the-art image classification with a deep ConvNet consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

Gradient-based learning applied to document recognition (LeCun et al., Proceedings of the IEEE, 1998)
TL;DR: Shows that gradient-based learning can synthesize complex decision surfaces that classify high-dimensional patterns such as handwritten characters, and introduces graph transformer networks (GTNs) for document recognition.

Improving neural networks by preventing co-adaptation of feature detectors (Hinton et al., arXiv preprint, 2012)
TL;DR: Randomly omits half of the feature detectors on each training case to prevent complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors.

DeepFace: Closing the Gap to Human-Level Performance in Face Verification (Taigman et al., CVPR 2014)
TL;DR: Revisits both the alignment step and the representation step by employing explicit 3D face modeling to apply a piecewise affine transformation, and derives a face representation from a nine-layer deep neural network.

Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments (Huang et al., technical report, 2007)
TL;DR: The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life, exhibiting "natural" variability in factors such as pose, lighting, race, accessories, occlusions, and background.
Frequently Asked Questions (14)
Q1. What are the contributions in "Deep learning face representation from predicting 10,000 classes" ?

This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Deep hidden IDentity features ( DeepID ), for face verification. The authors argue that DeepID can be effectively learned through challenging multi-class face identification tasks, whilst they can be generalized to other tasks ( such as verification ) and new identities unseen in the training set. 

Combining the 60 patches increases the accuracy by 4.53% and 5.27% over the best single patch for Joint Bayesian and neural networks, respectively.

The transfer-learning Joint Bayesian model based on the DeepID features achieves 97.45% test accuracy on LFW, which is on par with the human-level performance of 97.53%.