
A Discriminative Feature Learning Approach for Deep Face Recognition

TLDR
This paper proposes a new supervision signal, called center loss, for face recognition task, which simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers.
Abstract
Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state of the art. In most available CNNs, the softmax loss function is used as the supervision signal to train the deep model. To enhance the discriminative power of the deeply learned features, this paper proposes a new supervision signal, called center loss, for the face recognition task. Specifically, the center loss simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers. More importantly, we prove that the proposed center loss function is trainable and easy to optimize in CNNs. With the joint supervision of softmax loss and center loss, we can train robust CNNs to obtain deep features that pursue the two key learning objectives, inter-class dispersion and intra-class compactness, as far as possible; both are essential to face recognition. It is encouraging to see that our CNNs (with such joint supervision) achieve state-of-the-art accuracy on several important face recognition benchmarks: Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and the MegaFace Challenge. In particular, our new approach achieves the best results on MegaFace (the largest public-domain face benchmark) under the small-training-set protocol (fewer than 500,000 images and fewer than 20,000 persons), significantly improving on previous results and setting a new state of the art for both face recognition and face verification tasks.


A Discriminative Feature Learning Approach
for Deep Face Recognition
Yandong Wen¹, Kaipeng Zhang¹, Zhifeng Li¹(✉), and Yu Qiao¹,²
¹ Shenzhen Key Lab of Computer Vision and Pattern Recognition,
Shenzhen Institutes of Advanced Technology, CAS, Shenzhen, China
yandongw@andrew.cmu.edu, {kp.zhang,zhifeng.li,yu.qiao}@siat.ac.cn
² The Chinese University of Hong Kong, Sha Tin, Hong Kong
Keywords: Convolutional neural networks · Face recognition · Discriminative
feature learning · Center loss
© Springer International Publishing AG 2016
B. Leibe et al. (Eds.): ECCV 2016, Part VII, LNCS 9911, pp. 499–515, 2016.
DOI: 10.1007/978-3-319-46478-7_31
1 Introduction
Convolutional neural networks (CNNs) have achieved great success in the vision
community, significantly improving the state of the art in classification
problems such as object [11,12,18,28,33], scene [41,42], and action [3,16,36]
recognition. This success mainly benefits from large-scale training data [8,26]
and the end-to-end learning framework. The most commonly used CNNs perform
feature learning and label prediction, mapping the input data to deep features
(the output of the last hidden layer), then to the predicted labels, as shown
in Fig. 1.
[Figure 1 diagram. Labels: Input (Face Images), Convolutional Feature Learning,
Deeply learned Features (Separable Features vs. Discriminative Features),
Label Prediction (Classify), Predicted Labels, Loss Function.]
Fig. 1. The typical framework of convolutional neural networks.
In generic object, scene, or action recognition, the classes of the possible
testing samples are within the training set, which is also referred to as
closed-set identification. Therefore, the predicted labels dominate the
performance, and softmax loss is able to directly address the classification
problems. In this way, the label prediction (the last fully connected layer)
acts like a linear classifier, and the deeply learned features are prone to be
separable.
For the face recognition task, the deeply learned features need to be not only
separable but also discriminative. Since it is impractical to pre-collect all
the possible testing identities for training, the label prediction in CNNs is
not always applicable. The deeply learned features are required to be
discriminative and generalized enough for identifying new unseen classes
without label prediction. Discriminative power characterizes features in terms
of both compact intra-class variations and separable inter-class differences,
as shown in Fig. 1. Discriminative features can be well classified by nearest
neighbor (NN) [7] or k-nearest neighbor (k-NN) [9] algorithms, which do not
necessarily depend on the label prediction. However, the softmax loss only
encourages the separability of features. The resulting features are not
sufficiently effective for face recognition.
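As a rough illustration of identification without label prediction, the sketch below assigns a probe feature to the identity of its nearest gallery feature. The function name, the gallery layout, and the choice of squared Euclidean distance are assumptions made for this example only; they are not taken from the paper.

```python
def nearest_neighbor_identity(probe, gallery):
    """Return the identity whose gallery feature is closest to the probe.

    probe:   one deep feature vector (list of floats)
    gallery: list of (identity, feature) pairs; hypothetical layout
    """
    best_identity, best_dist = None, float("inf")
    for identity, feature in gallery:
        # Squared Euclidean distance (an assumed metric for this sketch)
        dist = sum((p - f) ** 2 for p, f in zip(probe, feature))
        if dist < best_dist:
            best_identity, best_dist = identity, dist
    return best_identity


# Toy gallery with two made-up identities and 2-D features
gallery = [("id_a", [0.0, 0.0]), ("id_b", [5.0, 5.0])]
match = nearest_neighbor_identity([1.0, 0.5], gallery)
```

The identities in the gallery need not appear in the training set, which is why this kind of matching depends on how compact and well separated the features are rather than on the classifier layer.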
Constructing a highly efficient loss function for discriminative feature
learning in CNNs is non-trivial, because stochastic gradient descent (SGD) [19]
optimizes the CNNs based on mini-batches, which cannot reflect the global
distribution of deep features very well. Due to the huge scale of the training
set, it is impractical to input all the training samples in every iteration. As
alternative approaches, contrastive loss [10,29] and triplet loss [27]
respectively construct loss functions for image pairs and triplets. However,
compared to the number of image samples, the number of training pairs or
triplets grows dramatically. This inevitably results in slow convergence and
instability. Carefully selecting the image pairs or triplets may partially
alleviate the problem, but it significantly increases the computational
complexity and makes the training procedure inconvenient.
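To make the growth concrete, here is a rough count under simplifying assumptions: every unordered pair and every unordered triple of N training images is treated as a candidate, ignoring identity labels (the valid positive and negative combinations are only a subset of these). The value N = 5 × 10^5 is purely illustrative, borrowed from the upper bound of the small-training-set protocol mentioned in the abstract.

```latex
\binom{N}{2} = \frac{N(N-1)}{2} \approx 1.25 \times 10^{11},
\qquad
\binom{N}{3} = \frac{N(N-1)(N-2)}{6} \approx 2.08 \times 10^{16}
\qquad \text{for } N = 5 \times 10^{5}.
```

Even at this scale the candidate pairs outnumber the images by more than five orders of magnitude, which is consistent with the need for careful pair or triplet selection.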

In this paper, we propose a new loss function, namely center loss, to
efficiently enhance the discriminative power of the deeply learned features in
neural networks. Specifically, we learn a center (a vector with the same
dimension as a feature) for the deep features of each class. In the course of
training, we simultaneously update the center and minimize the distances
between the deep features and their corresponding class centers. The CNNs are
trained under the joint supervision of the softmax loss and center loss, with a
hyperparameter to balance the two supervision signals. Intuitively, the softmax
loss forces the deep features of different classes to stay apart. The center
loss efficiently pulls the deep features of the same class to their centers.
With the joint supervision, not only are the inter-class feature differences
enlarged, but the intra-class feature variations are also reduced. Hence the
discriminative power of the deeply learned features can be highly enhanced. Our
main contributions are summarized as follows.
- We propose a new loss function (called center loss) to minimize the
  intra-class distances of the deep features. To the best of our knowledge,
  this is the first attempt to use such a loss function to help supervise the
  learning of CNNs. With the joint supervision of the center loss and the
  softmax loss, highly discriminative features can be obtained for robust face
  recognition, as supported by our experimental results.
- We show that the proposed loss function is very easy to implement in CNNs.
  Our CNN models are trainable and can be directly optimized by standard SGD.
- We present extensive experiments on the datasets of the MegaFace Challenge
  [23] (the largest public-domain face database, with 1 million faces for
  recognition) and set a new state of the art under the evaluation protocol of
  small training set. We also verify the excellent performance of our new
  approach on the Labeled Faces in the Wild (LFW) [15] and YouTube Faces (YTF)
  [38] datasets.
2 Related Work
Face recognition via deep learning has achieved a series of breakthroughs in
recent years [25,27,29,30,34,37]. The idea of mapping a pair of face images to
a distance starts from [6]. They train siamese networks to drive the similarity
metric to be small for positive pairs and large for negative pairs. Hu et al.
[13] learn a nonlinear transformation and yield a discriminative deep metric
with a margin between positive and negative face image pairs. These approaches
require image pairs as input.
Very recently, [31,34] supervise the learning process in CNNs with a
challenging identification signal (the softmax loss function), which brings
richer identity-related information to the deeply learned features. After that,
a joint identification-verification supervision signal is adopted in [29,37],
leading to more discriminative features. [32] enhances the supervision by
adding a fully connected layer and loss functions to each convolutional layer.
The effectiveness of triplet loss has been demonstrated in [21,25,27]. With the
deep embedding, the distance between an anchor and a positive is minimized,
while the distance between an anchor and a negative is maximized until the
margin is met. They achieve state-of-the-art performance on the LFW and YTF
datasets.
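The triplet idea described above can be sketched as a hinge on the difference between the two distances. This is a minimal sketch, not the implementation of the cited works: the use of squared Euclidean distance and the margin value 0.2 are assumptions chosen for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: zero once the negative is farther than the positive by the margin.

    margin=0.2 is an arbitrary illustrative value.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


anchor, positive = [0.0, 0.0], [1.0, 0.0]
satisfied = triplet_loss(anchor, positive, negative=[3.0, 0.0])   # margin already met
violated = triplet_loss(anchor, positive, negative=[0.5, 0.0])    # negative is too close
```

Triplets whose margin is already met contribute zero loss and zero gradient, which is one reason triplet selection matters in practice.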
3 The Proposed Approach
In this section, we elaborate on our approach. We first use a toy example to
intuitively show the distributions of the deeply learned features. Inspired by
the distribution, we propose the center loss to improve the discriminative
power of the deeply learned features, followed by some discussion.
3.1 A Toy Example
In this section, a toy example on the MNIST [20] dataset is presented. We
modify LeNets [19] into a deeper and wider network, but reduce the output
number of the last hidden layer to 2 (meaning that the dimension of the deep
features is 2), so that we can directly plot the features on a 2-D surface for
visualization. More details of the network architecture are given in Table 1.
The softmax loss function is presented as follows.
$$L_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}} \qquad (1)$$
In Eq. 1, $x_i \in \mathbb{R}^d$ denotes the $i$th deep feature, belonging to
the $y_i$th class, and $d$ is the feature dimension. $W_j \in \mathbb{R}^d$
denotes the $j$th column of the weights $W \in \mathbb{R}^{d \times n}$ in the
last fully connected layer, and $b \in \mathbb{R}^n$ is the bias term. The size
of the mini-batch and the number of classes are $m$ and $n$, respectively. We
omit the biases to simplify the analysis. (In fact, the performance is nearly
the same.)
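As a minimal sketch of Eq. 1, the function below evaluates the softmax loss on a mini-batch using plain Python lists. The function name and the list-based data layout are assumptions for illustration, and the log-sum-exp rearrangement is a standard numerical-stability trick rather than something stated in the text.

```python
import math


def softmax_loss(features, labels, weights, biases=None):
    """Sum over the mini-batch of -log(softmax probability of the true class).

    features: m feature vectors x_i, each of length d
    labels:   m class indices y_i in [0, n)
    weights:  n weight vectors W_j, each of length d (the columns of W)
    biases:   n bias terms b_j, or None to omit them as the text does
    """
    if biases is None:
        biases = [0.0] * len(weights)
    total = 0.0
    for x, y in zip(features, labels):
        # logits_j = W_j^T x_i + b_j
        logits = [sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
                  for w, b in zip(weights, biases)]
        # log of the denominator, with the maximum subtracted for stability
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_norm - logits[y]
    return total


# With all-zero weights every class is equally likely, so each sample costs log(n)
uniform = softmax_loss([[1.0, 2.0], [3.0, 4.0]], [0, 1], [[0.0, 0.0]] * 3)
```

Note that the loss depends on the features only through the logits, so it rewards features that fall on the correct side of the decision boundaries without directly constraining how tightly each class clusters.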
The resulting 2-D deep features are plotted in Fig. 2 to illustrate the
distribution. Since the last fully connected layer acts like a linear
classifier, the deep features of different classes are distinguished by
decision boundaries. From Fig. 2 we can observe that: (i) under the supervision
of softmax loss, the deeply learned features are separable, and (ii) the deep
features are not discriminative enough, since they still show significant
intra-class variations. Consequently, it is not suitable to directly use these
features for recognition.
Table 1. The CNNs architecture we use in toy example, called LeNets++. Some of
the convolution layers are followed by max pooling. (5, 32)/1,2 × 2 denotes 2
cascaded convolution layers with 32 filters of size 5 × 5, where the stride and
padding are 1 and 2 respectively. 2/2,0 denotes the max-pooling layers with
grid of 2 × 2, where the stride and padding are 2 and 0 respectively. In
LeNets++, we use the Parametric Rectified Linear Unit (PReLU) [12] as the
nonlinear unit.
Layer | Stage 1 Conv | Stage 1 Pool | Stage 2 Conv | Stage 2 Pool | Stage 3 Conv | Stage 3 Pool | Stage 4 FC
LeNets | (5, 20)/1,0 | 2/2,0 | (5, 50)/1,0 | 2/2,0 | - | - | 500
LeNets++ | (5, 32)/1,2 × 2 | 2/2,0 | (5, 64)/1,2 × 2 | 2/2,0 | (5, 128)/1,2 × 2 | 2/2,0 | 2
[Figure 2: panels (a) and (b); legend lists digit classes 0 to 9.]
Fig. 2. The distribution of deeply learned features in (a) training set (b)
testing set, both under the supervision of softmax loss, where we use 50K/10K
train/test splits. The points with different colors denote features from
different classes. Best viewed in color. (Color figure online)
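To help read the LeNets++ row of Table 1, the sketch below traces the feature-map side length through its three conv/pool stages. It assumes a 28 × 28 MNIST input and output sizes rounded down; some frameworks round pooling output sizes up, which would change the last value. The function names are hypothetical.

```python
def output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer, rounding down."""
    return (size + 2 * padding - kernel) // stride + 1


def lenets_pp_sizes(input_size=28):
    """Feature-map side length at the input and after each LeNets++ stage."""
    sizes = [input_size]
    size = input_size
    for _stage in range(3):
        for _conv in range(2):
            # two cascaded 5x5 convolutions, stride 1, padding 2 (size preserved)
            size = output_size(size, kernel=5, stride=1, padding=2)
        # 2x2 max pooling, stride 2, padding 0
        size = output_size(size, kernel=2, stride=2, padding=0)
        sizes.append(size)
    return sizes


sizes = lenets_pp_sizes()
flattened = 128 * sizes[-1] ** 2   # 128 filters in the last conv stage
print(sizes, flattened)
```

Under these assumptions the side lengths are 28, 14, 7, and 3, so the final fully connected layer maps 1152 values to the 2-D feature. With pooling sizes rounded up instead, the last side length would be 4.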
3.2 The Center Loss
So how can we develop an effective loss function to improve the discriminative
power of the deeply learned features? Intuitively, minimizing the intra-class
variations while keeping the features of different classes separable is the
key. To this end, we propose the center loss function, as formulated in Eq. 2.
$$L_C = \frac{1}{2} \sum_{i=1}^{m} \left\| x_i - c_{y_i} \right\|_2^2 \qquad (2)$$
Here $c_{y_i} \in \mathbb{R}^d$ denotes the $y_i$th class center of the deep
features. The formulation effectively characterizes the intra-class variations.
Ideally, $c_{y_i}$ should be updated as the deep features change. In other
words, we would need to take the entire training set into account and average
the features of every class in each iteration, which is inefficient, even
impractical. Therefore, the center loss cannot be used directly. This is
possibly the reason that such a center loss has never been used in CNNs until
now.
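Eq. 2 can be sketched directly, here with the centers held fixed. The gradient with respect to each feature follows from differentiating Eq. 2 and is simply the offset from the feature to its class center. Function names and the list-based layout are assumptions for illustration.

```python
def center_loss(features, labels, centers):
    """Half the summed squared Euclidean distance of each feature to its class center."""
    total = 0.0
    for x, y in zip(features, labels):
        c = centers[y]
        total += sum((x_k - c_k) ** 2 for x_k, c_k in zip(x, c))
    return 0.5 * total


def center_loss_gradient(features, labels, centers):
    """Gradient of the center loss with respect to each feature: x_i - c_{y_i}."""
    return [[x_k - c_k for x_k, c_k in zip(x, centers[y])]
            for x, y in zip(features, labels)]


features = [[1.0, 0.0], [0.0, 2.0]]
labels = [0, 1]
centers = [[0.0, 0.0], [0.0, 0.0]]
loss = center_loss(features, labels, centers)            # 0.5 * (1 + 4)
grads = center_loss_gradient(features, labels, centers)
```

A gradient step on the features therefore moves each one straight toward its own class center, which is the pulling effect described in the introduction.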
To address this problem, we make two necessary modifications. First, instead
of updating the centers with respect to the entire training set, we perform the
update based on the mini-batch. In each iteration, the centers are computed by
averaging the features of the corresponding classes (in this case, some of the
centers may not update). Second, to avoid large perturbations caused by a few
mislabelled samples, we use a scalar α to control the learning rate of the
centers.
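The two modifications can be sketched as follows. The exact update rule is not stated in this passage, so the form used here (move each center a fraction α toward the mean of its class within the mini-batch) is an assumption, as are the function name and the value α = 0.5 in the example.

```python
def update_centers(centers, features, labels, alpha):
    """Return new centers, each moved a fraction alpha toward its mini-batch class mean.

    Classes that do not appear in the mini-batch keep their previous center.
    """
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + x_k for s, x_k in zip(sums[y], x)]
        counts[y] += 1

    new_centers = [list(c) for c in centers]
    for y, total in sums.items():
        mean = [s / counts[y] for s in total]
        # alpha acts as the learning rate of the center (assumed update form)
        new_centers[y] = [c_k - alpha * (c_k - m_k)
                          for c_k, m_k in zip(centers[y], mean)]
    return new_centers


centers = [[0.0, 0.0], [5.0, 5.0]]
batch_features = [[2.0, 0.0], [4.0, 0.0]]   # both samples belong to class 0
batch_labels = [0, 0]
updated = update_centers(centers, batch_features, batch_labels, alpha=0.5)
```

In this toy batch only class 0 is present, so its center moves halfway toward the batch mean (3, 0) while the class 1 center is left untouched. A smaller α damps the effect of any single mini-batch, including one that contains mislabelled samples.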
