A Discriminative Feature Learning Approach for Deep Face Recognition

doi:10.1007/978-3-319-46478-7_31

Yandong Wen^1, Kaipeng Zhang^1, Zhifeng Li^1 (✉), and Yu Qiao^{1,2}

^1 Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, CAS, Shenzhen, China
yandongw@andrew.cmu.edu, {kp.zhang,zhifeng.li,yu.qiao}@siat.ac.cn

^2 The Chinese University of Hong Kong, Sha Tin, Hong Kong

Abstract. Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state of the art. In most available CNNs, the softmax loss function is used as the supervision signal to train the deep model. To enhance the discriminative power of the deeply learned features, this paper proposes a new supervision signal, called center loss, for the face recognition task. Specifically, the center loss simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers. More importantly, we prove that the proposed center loss function is trainable and easy to optimize in CNNs. With the joint supervision of softmax loss and center loss, we can train robust CNNs to obtain deep features that meet the two key learning objectives, inter-class dispersion and intra-class compactness, as far as possible; both are essential to face recognition. It is encouraging to see that our CNNs (with such joint supervision) achieve state-of-the-art accuracy on several important face recognition benchmarks: Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and MegaFace Challenge. In particular, our new approach achieves the best results on MegaFace (the largest public domain face benchmark) under the small-training-set protocol (fewer than 500,000 images and fewer than 20,000 persons), significantly improving on previous results and setting a new state of the art for both face recognition and face verification tasks.

Keywords: Convolutional neural networks · Face recognition · Discriminative feature learning · Center loss

1 Introduction

Convolutional neural networks (CNNs) have achieved great success in the vision community, significantly improving the state of the art in classification problems such as object [11,12,18,28,33], scene [41,42], and action [3,16,36] recognition. This success mainly benefits from large-scale training data [8,26] and the end-to-end learning framework. The most commonly used CNNs perform feature learning and label prediction, mapping the input data to deep features (the output of the last hidden layer), then to the predicted labels, as shown in Fig. 1.

[Figure 1 panel labels: Input (Face Images); Convolutional Feature Learning; Deeply learned Features (Separable Features, Discriminative Features); Label Prediction (Classify); Predicted Labels; Loss Function.]

Fig. 1. The typical framework of convolutional neural networks.

© Springer International Publishing AG 2016
B. Leibe et al. (Eds.): ECCV 2016, Part VII, LNCS 9911, pp. 499–515, 2016.
DOI: 10.1007/978-3-319-46478-7_31

In generic object, scene, or action recognition, the classes of the possible testing samples are within the training set, which is also referred to as closed-set identification. Therefore, the predicted labels dominate the performance, and softmax loss is able to address the classification problem directly. In this case, the label prediction (the last fully connected layer) acts like a linear classifier, and the deeply learned features tend to be separable.

For the face recognition task, the deeply learned features need to be not only separable but also discriminative. Since it is impractical to pre-collect all possible testing identities for training, the label prediction in CNNs is not always applicable. The deeply learned features must be discriminative and general enough to identify new, unseen classes without label prediction. Discriminative power characterizes features by both compact intra-class variations and separable inter-class differences, as shown in Fig. 1. Discriminative features can be well classified by nearest neighbor (NN) [7] or k-nearest neighbor (k-NN) [9] algorithms, which do not necessarily depend on the label prediction. However, the softmax loss only encourages the separability of features. The resulting features are not sufficiently effective for face recognition.

Constructing a highly efficient loss function for discriminative feature learning in CNNs is non-trivial, because stochastic gradient descent (SGD) [19] optimizes CNNs on mini-batches, which cannot reflect the global distribution of deep features very well. Due to the huge scale of the training set, it is impractical to input all the training samples in every iteration. As alternative approaches, contrastive loss [10,29] and triplet loss [27] construct loss functions for image pairs and triplets, respectively. However, compared with the number of image samples, the number of training pairs or triplets grows dramatically, which inevitably results in slow convergence and instability. Carefully selecting the image pairs or triplets may partially alleviate the problem, but it significantly increases the computational complexity and makes the training procedure inconvenient.
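To make the growth in pairs and triplets concrete, the following is a small counting sketch. The function name `count_pairs_and_triplets` and the toy class sizes are hypothetical and chosen only for illustration; the count treats every unordered pair of distinct images as a candidate pair and every (anchor, positive, negative) combination as a candidate triplet.

```python
from math import comb


def count_pairs_and_triplets(class_sizes):
    """Count candidate training pairs and triplets for a labelled image set.

    class_sizes: number of images per identity (hypothetical toy values).
    Returns (number of images, number of pairs, number of triplets).
    """
    n = sum(class_sizes)
    # Every unordered pair of distinct images is a candidate pair.
    pairs = comb(n, 2)
    # A triplet takes an anchor and a different positive from the same
    # class, plus a negative from any other class.
    triplets = sum(s * (s - 1) * (n - s) for s in class_sizes)
    return n, pairs, triplets


# Toy setting: 10 identities with 100 images each.
n, pairs, triplets = count_pairs_and_triplets([100] * 10)
print(n, pairs, triplets)  # 1000 499500 89100000
```

Even in this toy setting, 1,000 images give roughly half a million candidate pairs and about 89 million candidate triplets, which illustrates why sample selection becomes a practical concern.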


In this paper, we propose a new loss function, namely center loss, to efficiently enhance the discriminative power of the deeply learned features in neural networks. Specifically, we learn a center (a vector with the same dimension as a feature) for the deep features of each class. In the course of training, we simultaneously update the centers and minimize the distances between the deep features and their corresponding class centers. The CNNs are trained under the joint supervision of the softmax loss and the center loss, with a hyperparameter to balance the two supervision signals. Intuitively, the softmax loss forces the deep features of different classes to stay apart, while the center loss efficiently pulls the deep features of the same class toward their center. With the joint supervision, not only are the inter-class feature differences enlarged, but the intra-class feature variations are also reduced. Hence the discriminative power of the deeply learned features can be greatly enhanced. Our main contributions are summarized as follows.

– We propose a new loss function (called center loss) to minimize the intra-class distances of the deep features. To the best of our knowledge, this is the first attempt to use such a loss function to help supervise the learning of CNNs. With the joint supervision of the center loss and the softmax loss, highly discriminative features can be obtained for robust face recognition, as supported by our experimental results.

– We show that the proposed loss function is very easy to implement in CNNs. Our CNN models are trainable and can be directly optimized by standard SGD.

– We present extensive experiments on the datasets of MegaFace Challenge [23] (the largest public domain face database, with 1 million faces for recognition) and set a new state of the art under the small-training-set evaluation protocol. We also verify the excellent performance of our new approach on the Labeled Faces in the Wild (LFW) [15] and YouTube Faces (YTF) [38] datasets.

2 Related Work

Face recognition via deep learning has achieved a series of breakthroughs in recent years [25,27,29,30,34,37]. The idea of mapping a pair of face images to a distance starts from [6], which trains siamese networks to drive the similarity metric to be small for positive pairs and large for negative pairs. Hu et al. [13] learn a nonlinear transformation and yield a discriminative deep metric with a margin between positive and negative face image pairs. These approaches require image pairs as input.

Very recently, [31,34] supervise the learning process in CNNs with a challenging identification signal (the softmax loss function), which brings richer identity-related information to the deeply learned features. After that, a joint identification-verification supervision signal is adopted in [29,37], leading to more discriminative features. [32] enhances the supervision by adding a fully connected layer and loss functions to each convolutional layer. The effectiveness of triplet loss has been demonstrated in [21,25,27]. With the deep embedding, the distance between an anchor and a positive is minimized, while the distance between an anchor and a negative is maximized until the margin is met. These methods achieve state-of-the-art performance on the LFW and YTF datasets.
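The triplet objective described above can be sketched as a hinge on squared Euclidean distances. This is a minimal illustration, not the exact formulation of any cited work: the function names, the use of squared distances, and the margin value of 0.2 are assumptions made for the example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (an assumed, illustrative value).
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# An easy triplet: the negative is already far away, so the loss is zero.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]))
# A hard triplet: the negative is close to the anchor, so the loss is positive.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [0.2, 0.0]))
```

Triplets that already satisfy the margin contribute nothing to the gradient, which is one reason triplet selection matters in practice.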

3 The Proposed Approach

In this section, we elaborate on our approach. We first use a toy example to intuitively show the distributions of the deeply learned features. Inspired by these distributions, we propose the center loss to improve the discriminative power of the deeply learned features, followed by some discussion.

3.1 A Toy Example

In this section, a toy example on the MNIST [20] dataset is presented. We modify LeNets [19] into a deeper and wider network, but reduce the output number of the last hidden layer to 2 (that is, the dimension of the deep features is 2), so that we can directly plot the features on a 2-D surface for visualization. More details of the network architecture are given in Table 1. The softmax loss function is presented as follows.

$$\mathcal{L}_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}} \tag{1}$$

In Eq. 1, $x_i \in \mathbb{R}^d$ denotes the $i$th deep feature, belonging to the $y_i$th class, and $d$ is the feature dimension. $W_j \in \mathbb{R}^d$ denotes the $j$th column of the weights $W \in \mathbb{R}^{d \times n}$ in the last fully connected layer, and $b \in \mathbb{R}^n$ is the bias term. The size of the mini-batch and the number of classes are $m$ and $n$, respectively. We omit the biases to simplify the analysis (in fact, doing so makes almost no difference to the performance).
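As a concrete reading of Eq. 1, the following sketch evaluates the softmax loss for one mini-batch in plain Python. The function name `softmax_loss`, the list-of-lists layout of `W`, and the zero-based class indices are assumptions of this example; the max-subtraction step is a standard numerical-stability device and does not change the value of the loss.

```python
import math


def softmax_loss(features, labels, W, b):
    """Softmax loss of Eq. 1, summed over a mini-batch.

    features: m feature vectors, each of length d.
    labels:   m class indices in [0, n).
    W:        d x n weights; column j belongs to class j.
    b:        n biases.
    """
    n = len(b)
    total = 0.0
    for x, y in zip(features, labels):
        # Logit for class j is W_j^T x + b_j.
        logits = [sum(W[k][j] * x[k] for k in range(len(x))) + b[j]
                  for j in range(n)]
        # Log of the denominator, computed stably by factoring out the max.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Subtract the log-probability of the true class.
        total -= logits[y] - log_norm
    return total


# With all-zero weights every class is equally likely, so each of the
# two samples contributes log(3) to the loss.
W_zero = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(softmax_loss([[1.0, 2.0], [3.0, 4.0]], [0, 2], W_zero, [0.0, 0.0, 0.0]))
```

The loop makes explicit that the loss only depends on the features through the linear logits, which is why the last layer behaves like a linear classifier.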

The resulting 2-D deep features are plotted in Fig. 2 to illustrate the distribution. Since the last fully connected layer acts like a linear classifier, the deep features of different classes are distinguished by decision boundaries. From Fig. 2 we can observe that: (i) under the supervision of softmax loss, the deeply learned features are separable, and (ii) the deep features are not discriminative enough, since they still show significant intra-class variations. Consequently, it is not suitable to use these features directly for recognition.

Table 1. The CNN architectures used in the toy example; our network is called LeNets++. Some of the convolution layers are followed by max pooling. (5, 32)/1,2 × 2 denotes 2 cascaded convolution layers with 32 filters of size 5 × 5, where the stride and padding are 1 and 2 respectively. 2/2,0 denotes a max-pooling layer with a grid of 2 × 2, where the stride and padding are 2 and 0 respectively. In LeNets++, we use the Parametric Rectified Linear Unit (PReLU) [12] as the nonlinear unit.

| Network  | Stage 1 Conv    | Stage 1 Pool | Stage 2 Conv    | Stage 2 Pool | Stage 3 Conv     | Stage 3 Pool | Stage 4 FC |
|----------|-----------------|--------------|-----------------|--------------|------------------|--------------|------------|
| LeNets   | (5, 20)/1,0     | 2/2,0        | (5, 50)/1,0     | 2/2,0        | -                | -            | 500        |
| LeNets++ | (5, 32)/1,2 × 2 | 2/2,0        | (5, 64)/1,2 × 2 | 2/2,0        | (5, 128)/1,2 × 2 | 2/2,0        | 2          |

[Figure 2: two scatter plots, (a) and (b), with a legend for the digit classes 0 to 9.]

Fig. 2. The distribution of deeply learned features in (a) the training set and (b) the testing set, both under the supervision of softmax loss, where we use 50K/10K train/test splits. The points with different colors denote features from different classes. Best viewed in color. (Color figure online)

3.2 The Center Loss

So, how can we develop an effective loss function to improve the discriminative power of the deeply learned features? Intuitively, the key is to minimize the intra-class variations while keeping the features of different classes separable. To this end, we propose the center loss function, as formulated in Eq. 2.

$$\mathcal{L}_C = \frac{1}{2} \sum_{i=1}^{m} \left\| x_i - c_{y_i} \right\|_2^2 \tag{2}$$

Here $c_{y_i} \in \mathbb{R}^d$ denotes the center of the deep features of the $y_i$th class. The formulation effectively characterizes the intra-class variations. Ideally, $c_{y_i}$ should be updated as the deep features change. In other words, we would need to take the entire training set into account and average the features of every class in each iteration, which is inefficient and even impractical. Therefore, the center loss cannot be used directly. This is possibly the reason that such a center loss has never been used in CNNs until now.
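For readers who prefer code, Eq. 2 can be evaluated on a mini-batch as in the sketch below. The function name `center_loss` and the dictionary that maps a class index to its center are assumptions of this example; the centers are simply taken as given here.

```python
def center_loss(features, labels, centers):
    """Center loss of Eq. 2 for one mini-batch.

    features: m feature vectors, each of length d.
    labels:   m class indices.
    centers:  mapping from class index to a center vector of length d.
    """
    total = 0.0
    for x, y in zip(features, labels):
        c = centers[y]
        # Squared Euclidean distance from the feature to its class center.
        total += sum((xk - ck) ** 2 for xk, ck in zip(x, c))
    return 0.5 * total


# Two features at squared distances 1 and 4 from their centers.
centers = {0: [0.0, 0.0], 1: [0.0, 0.0]}
print(center_loss([[1.0, 0.0], [0.0, 2.0]], [0, 1], centers))  # 2.5
```

Differentiating Eq. 2 with respect to a single feature gives $x_i - c_{y_i}$, so each feature is pulled straight toward its own class center.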

To address this problem, we make two necessary modifications. First, instead of updating the centers with respect to the entire training set, we perform the update based on the mini-batch. In each iteration, the centers are computed by averaging the features of the corresponding classes (in this case, some of the centers may not be updated). Second, to avoid large perturbations caused by a few mislabelled samples, we use a scalar α to control the learning rate of the centers.
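The two modifications can be sketched as follows. This passage does not spell out the exact update rule, so the sketch assumes one plausible form: each center takes a step of size α toward the mean of its class's features in the current mini-batch. The function name `update_centers`, the dictionary layout, and the default α of 0.5 are illustrative assumptions.

```python
def update_centers(centers, features, labels, alpha=0.5):
    """Move each class center toward its mini-batch feature mean.

    Assumed rule (not taken from the text): c <- c + alpha * (mean - c).
    Classes that do not appear in the mini-batch keep their old center.
    Returns a new mapping and leaves `centers` unmodified.
    """
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, xk in enumerate(x):
            acc[k] += xk
        counts[y] = counts.get(y, 0) + 1

    updated = {j: list(c) for j, c in centers.items()}
    for j, total in sums.items():
        mean = [t / counts[j] for t in total]
        # alpha damps the step, limiting the effect of a few bad samples.
        updated[j] = [c + alpha * (mk - c) for c, mk in zip(centers[j], mean)]
    return updated


centers = {0: [0.0, 0.0], 1: [5.0, 5.0]}
new_centers = update_centers(centers, [[2.0, 0.0], [4.0, 0.0]], [0, 0])
print(new_centers)  # {0: [1.5, 0.0], 1: [5.0, 5.0]}
```

In this example class 1 is absent from the mini-batch, so its center is left unchanged, while the center of class 0 moves halfway toward the batch mean [3.0, 0.0].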
