Deep Learning Identity-Preserving Face Space
Zhenyao Zhu^1,∗  Ping Luo^1,3,∗  Xiaogang Wang^2  Xiaoou Tang^1,3
^1 Department of Information Engineering, The Chinese University of Hong Kong
^2 Department of Electronic Engineering, The Chinese University of Hong Kong
^3 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
zz012@ie.cuhk.edu.hk  pluo.lhi@gmail.com  xgwang@ee.cuhk.edu.hk  xtang@ie.cuhk.edu.hk
Abstract
Face recognition with large pose and illumination variations is a challenging problem in computer vision. This paper addresses this challenge by proposing a new learning-based face representation: the face identity-preserving (FIP) features. Unlike conventional face descriptors, the FIP features can significantly reduce intra-identity variances, while maintaining discriminativeness between identities. Moreover, the FIP features extracted from an image under any pose and illumination can be used to reconstruct its face image in the canonical view. This property makes it possible to improve the performance of traditional descriptors, such as LBP [2] and Gabor [31], which can be extracted from our reconstructed images in the canonical view to eliminate variations. In order to learn the FIP features, we carefully design a deep network that combines the feature extraction layers and the reconstruction layer. The former encodes a face image into the FIP features, while the latter transforms them to an image in the canonical view. Extensive experiments on the large MultiPIE face database [7] demonstrate that it significantly outperforms the state-of-the-art face recognition methods.
1. Introduction

In many practical applications, pose and illumination changes become the bottleneck for face recognition [36]. Many existing works have been proposed to account for such variations. Pose-invariant methods can generally be separated into two categories: 2D-based [17, 5, 23] and 3D-based [18, 3]. In the first category, poses are either handled by 2D image matching or by encoding a test image using some bases or exemplars.

∗ indicates equal contribution.
This work is supported by the General Research Fund sponsored by the Research Grants Council of the Hong Kong SAR (Project No. CUHK 416312 and CUHK 416510) and the Guangdong Innovative Research Team Program (No. 201001D0104648280).
Figure 1. Three face images under different poses and illuminations of two identities are shown in (a). The FIP features extracted from these images are also visualized. The FIP features of the same identity are similar, although the original images are captured under different poses and illuminations. These examples indicate that the FIP features are sparse and identity-preserving (blue indicates zero value). (b) shows some images of the two identities, including the original image (left) and the reconstructed image in the canonical view (right) from the FIP features. The reconstructed images remove the pose and illumination variations and retain the intrinsic face structures of the identities. Best viewed in color.
For example, Carlos et al. [5] used stereo matching to compute the similarity between two faces. Li et al. [17] represented a test face as a linear combination of training images, and utilized the linear regression coefficients as features for face recognition. 3D-based methods usually capture 3D face data or estimate 3D models from 2D input, and try to match them to a 2D probe face image. Such methods make it possible to synthesize any view of the probe face, which makes them generally more robust to pose variation. For instance, Li et al. [18] first generated a virtual view for the probe face by using a set of 3D displacement fields sampled from a 3D face database, and then matched the synthesized face with the gallery faces. Similarly, Asthana et al. [3] matched the 3D model to a 2D image using the view-based active appearance model.

Figure 2. The LBP (a), LE (b), CRBM (c), and FIP (d) features of 50 identities, each of which has 6 images under different poses and illuminations, projected into two dimensions using multidimensional scaling (MDS). Images of the same identity are visualized in the same color. It shows that FIP has the best representative power. Best viewed in color.
The illumination-invariant methods [26, 17] typically make assumptions about how illumination affects the face images, and use these assumptions to model and remove the illumination effect. For example, Wagner et al. [26] designed a projector-based system to capture images of each subject in the gallery under a few illuminations, which can be linearly combined to generate images under arbitrary illuminations. With this augmented gallery, they adopted sparse coding to perform face recognition.

The above methods have certain limitations. For example, capturing 3D data requires additional cost and resources [18]. Inferring 3D models from 2D data is an ill-posed problem [23]. As the statistical illumination models [26] are often learned in controlled environments, they do not generalize well to practical applications.
In this paper, unlike previous works that either build physical models or make statistical assumptions, we propose a novel face representation, the face identity-preserving (FIP) features, which are directly extracted from face images with arbitrary poses and illuminations. This new representation can significantly remove pose and illumination variations, while maintaining the discriminativeness across identities, as shown in Fig. 1 (a). Furthermore, unlike traditional face descriptors, e.g. LBP [2], Gabor [31], and LE [4], which cannot recover the original images, the FIP features can reconstruct face images in the frontal pose and with neutral illumination (we call it the canonical view) of the same identity, as shown in Fig. 1 (b). With this attractive property, conventional descriptors and learning algorithms can utilize our reconstructed face images in the canonical view as input, so as to eliminate the negative effects of poses and illuminations.
Specifically, we present a new deep network to learn the FIP features. It takes face images of an identity with arbitrary pose and illumination variations as input, and reconstructs a face of the same identity in the canonical view as the target (see Fig. 3). First, input images are encoded through the feature extraction layers, which have three locally connected layers and two pooling layers stacked alternately. Each layer captures face features at a different scale. As shown in Fig. 3, the first locally connected layer outputs 32 feature maps. Each map has a large number of high responses outside the face region, which mainly capture pose information, and some high responses inside the face region, which capture face structures (red indicates large response and blue indicates no response). On the output feature maps of the second locally connected layer, high responses outside the face region have been significantly reduced, which indicates that it discards most pose variations while retaining the face structures. The third locally connected layer outputs the FIP features, which are sparse and identity-preserving.
Second, the FIP features recover the face image in the canonical view using a fully-connected reconstruction layer. As there are a large number of parameters, our network is hard to train with traditional training methods [14, 12]. We propose a new training strategy, which contains two steps: parameter initialization and parameter update. First, we initialize the parameters based on least square dictionary learning. We then update all the parameters by back-propagating the summed squared reconstruction error between the reconstructed image and the ground truth.

Existing deep learning methods for face recognition generally fall into two categories: (1) unsupervised learning of features with deep models, followed by discriminative methods (e.g. SVM) for classification [21, 10, 15]; (2) directly using class labels as the supervision of deep models [6, 24]. In the first category, features related to identity, poses, and lightings are coupled when learned by deep models, and it is then too late to rely on SVM to separate them. Our supervised model makes it possible to discard pose and lighting features from the very bottom layer. In the second category, a '0/1' class label is a much weaker supervision than ours, which uses a face image (with thousands of pixels) in the canonical view as supervision. We require the deep model to fully reconstruct the face in the canonical view rather than simply predicting class labels, and this strong regularization is more effective in avoiding overfitting. This design is suitable for face recognition, where a canonical view exists. Different from convolutional neural networks, whose filters share weights, our filters are localized and do not share weights, since we assume that different face regions should employ different features.

This work makes three key contributions. (1) We propose a new deep network that combines the feature extraction layers and the reconstruction layer. Its architecture is carefully designed to learn the FIP features. These features can eliminate pose and illumination variations while maintaining discriminativeness between different identities.

n
2
=24×24×32 n
2
=24×24×32
5×5 Locally
Connected and
Pooling
Fully
Connected
W
1
, V
1
W
3
W
4
FIP
W
2
, V
2
Feature Extraction Layers Reconstruction Layer
x
0
x
1
x
2
x
3
y
y
5×5 Locally
Connected and
Pooling
5×5 Locally
Connected
n
0
=96×96
n
0
=96×96
n
1
=48×48×32
24
24
24
24
48
48
Figure 3. Architecture of the deep network. It combines the feature extraction layers and reconstruction layer. The feature extraction layers include three
locally connected layers and two pooling layers. They encode an input face x
0
into FIP features x
3
. x
1
, x
2
are the output feature maps of the first and
second locally connected layers. FIP features can be used to recover the face image y in the canonical view. y is the ground truth. Best viewed in color.
(2) Unlike conventional face descriptors, the FIP features can be used to reconstruct a face image in the canonical view. We also demonstrate significant improvement of existing methods when they are applied to our reconstructed face images. (3) Unlike existing works that need to know the pose of a probe face in order to build pose-specific models, our method can extract the FIP features without any knowledge of pose and illumination. The FIP features outperform the state-of-the-art methods, including both 2D-based and 3D-based methods, on the MultiPIE database [7].
2. Related Work

This section reviews related works on learning-based face descriptors and deep models for feature learning.

Learning-based descriptors. Cao et al. [4] devised an unsupervised feature learning method (LE) with random-projection trees and PCA trees, and adopted PCA to obtain a compact face descriptor. Zhang et al. [35] extended [4] by introducing an inter-modality encoding method, which can match face images in two modalities, e.g. photos and sketches, significantly outperforming traditional methods [25, 30]. There are also studies that learn the filters and patterns of existing handcrafted descriptors. For example, Guo et al. [8] proposed a supervised learning approach with the Fisher separation criterion to learn the patterns of LBP [2]. Zhen et al. [16] adopted a strategy similar to LDA to learn the filters of LBP. Our FIP features are learned with a multi-layer deep model in a supervised manner, and have more discriminative and representative power than the above works. We illustrate the feature space of FIP (Fig. 2 (d)) compared with LBP [2] (Fig. 2 (a)) and LE [4] (Fig. 2 (b)), which shows that the FIP space better maintains both the intra-identity consistency and the inter-identity discriminativeness.

Deep models. Deep models learn representations by stacking many hidden layers, which are trained layer-wise in an unsupervised manner. For example, the deep belief network [9] (DBN) and the deep Boltzmann machine [22] (DBM) stack many layers of restricted Boltzmann machines (RBM) and can extract different levels of features. Recently, Huang et al. [10] introduced the convolutional restricted Boltzmann machine (CRBM), which incorporates local filters into the RBM. Their learned filters can preserve the local structures of data. Sun et al. [24] proposed a hybrid Convolutional Neural Network-Restricted Boltzmann Machine (CNN-RBM) model to learn relational features for comparing face similarity. Unlike DBN and DBM, which employ fully connected layers, our deep network combines both locally and fully connected layers, which enables it to extract both local and global information. The locally connected architecture of our deep network is similar to CRBM [10], but we learn the network with a supervised scheme and require the FIP features to recover the frontal face image. Therefore, our method is more robust to pose and illumination variations, as shown in Fig. 2 (d).
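The 2D visualization in Fig. 2 is produced by embedding the high-dimensional feature vectors with multidimensional scaling. As a rough, hypothetical sketch of that kind of plot (not the authors' code), the snippet below uses scikit-learn's MDS on placeholder feature vectors for 50 identities with 6 images each; `features` and `identity_labels` are assumed stand-ins.

```python
# Minimal sketch: project per-image feature vectors (e.g. LBP, LE, CRBM, or FIP)
# into 2D with multidimensional scaling and color the points by identity.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

rng = np.random.RandomState(0)
features = rng.rand(300, 512)                   # placeholder: 50 identities x 6 images
identity_labels = np.repeat(np.arange(50), 6)   # placeholder identity of each image

embedding = MDS(n_components=2, random_state=0).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=identity_labels, cmap="tab20", s=10)
plt.title("2D MDS embedding, colored by identity")
plt.show()
```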
3. Network Architecture

Fig. 3 shows the architecture of our deep model. The input is a face image x^0 under an arbitrary pose and illumination, and the output is a frontal face image y under neutral illumination. They both have n_0 = 96 × 96 = 9216 dimensions. The feature extraction layers have three locally connected layers and two pooling layers, which encode x^0 into the FIP features x^3.
In the first layer, x^0 is transformed to 32 feature maps through a weight matrix W^1 that contains 32 sub-matrices, W^1 = [W^1_1; W^1_2; ...; W^1_32], W^1_i ∈ R^{n_0, n_0},¹ each of which is sparse so as to retain the locally connected structure [13]. Intuitively, each row of W^1_i represents a small filter centered at a pixel of x^0, so that all of the elements in this row equal zero except for the elements belonging to the filter. As our weights are not shared, the non-zero values of these rows are not the same.² Therefore, the weight matrix W^1 results in 32 feature maps {x^1_i}^{32}_{i=1}, each of which has n_0 dimensions. Then, a matrix V^1, where V^1_{ij} ∈ {0, 1} encodes the 2D topography of the pooling layer [13], down-samples each of these feature maps to 48 × 48, in order to reduce the number of parameters that need to be learned and to obtain more robust features. Each x^1_i can be computed as³

x^1_i = V^1 σ(W^1_i x^0),    (1)

where σ(x) = max(0, x) is the rectified linear function [19], which is feature-intensity-invariant and therefore robust to shape and illumination variations. x^1 is obtained by concatenating all the x^1_i ∈ R^{48×48} together, giving a large feature map with n_1 = 48 × 48 × 32 dimensions.
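To make the locally connected structure and Eq. (1) concrete, here is a small self-contained numpy sketch (not the authors' code) at toy size: it builds one sparse sub-matrix W^1_i with 5×5 filters whose weights are not shared across positions, applies the rectified linear function, and sum-pools with a fixed binary matrix V^1. The 2×2 non-overlapping pooling regions are an assumption; the paper only states that each map is down-sampled by half per side.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)          # sigma(x) = max(0, x)
rng = np.random.RandomState(0)

side = 16                                    # toy size; the paper uses 96 x 96 inputs
n0 = side * side

def build_local_W(filter_size=5):
    """One sub-matrix W1_i: each row holds a 5x5 filter centered at one pixel,
    with its own (non-shared) weights; all other entries stay zero."""
    Wi = np.zeros((n0, n0))
    r = filter_size // 2
    for y in range(side):
        for x in range(side):
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    if 0 <= y + dy < side and 0 <= x + dx < side:
                        Wi[y * side + x, (y + dy) * side + (x + dx)] = rng.randn() * 0.01
    return Wi

def build_pooling_V():
    """Fixed binary matrix V1 that sum-pools non-overlapping 2x2 regions,
    halving each side (96 -> 48 in the paper; 16 -> 8 in this toy)."""
    out = side // 2
    V = np.zeros((out * out, n0))
    for y in range(out):
        for x in range(out):
            for dy in range(2):
                for dx in range(2):
                    V[y * out + x, (2 * y + dy) * side + (2 * x + dx)] = 1.0
    return V

x0 = rng.rand(n0)                                        # vectorized input face image
x1_i = build_pooling_V() @ relu(build_local_W() @ x0)    # Eq. (1)
print(x1_i.shape)                                        # (64,) i.e. 8 x 8
```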
In the second layer, each x^1_i is transformed to x^2_i by the 32 sub-matrices {W^2_i}^{32}_{i=1}, W^2_i ∈ R^{48×48, 48×48}:

x^2_i = Σ^{32}_{j=1} V^2 σ(W^2_j x^1_i),    (2)

where x^2_i is down-sampled by V^2 to 24 × 24 dimensions. Eq. 2 means that each small feature map in the first layer is multiplied by 32 sub-matrices and then summed together. Here, each sub-matrix has a sparse structure, as discussed above. We can reformulate Eq. 2 in matrix form as

x^2 = V^2 σ(W^2 x^1),    (3)

where W^2 = [W^2'_1; ...; W^2'_32], W^2'_i ∈ R^{48×48, n_1}, and x^1 = [x^1_1; ...; x^1_32] ∈ R^{n_1}, respectively. W^2'_i is simply obtained by repeating W^2_i 32 times. Thus, x^2 has n_2 = 24 × 24 × 32 dimensions.
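A literal sketch of Eq. (2) at toy size (the paper's first-layer maps are 48×48, pooled to 24×24): one first-layer map is multiplied by all 32 sub-matrices, rectified, pooled by the fixed binary matrix V^2, and summed. Dense random matrices stand in for the sparse, locally connected W^2_j, and 2×2 sum-pooling is assumed.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
rng = np.random.RandomState(0)

side1, n_maps = 16, 32        # toy size; the paper's first-layer maps are 48 x 48
d1 = side1 * side1
out = side1 // 2

# Stand-ins for the 32 (sparse, locally connected) sub-matrices W2_j and the
# fixed binary pooling matrix V2 (here: 2x2 sum pooling).
W2 = [rng.randn(d1, d1) * 0.01 for _ in range(n_maps)]
V2 = np.zeros((out * out, d1))
for y in range(out):
    for x in range(out):
        for dy in range(2):
            for dx in range(2):
                V2[y * out + x, (2 * y + dy) * side1 + (2 * x + dx)] = 1.0

x1_i = rng.rand(d1)           # one first-layer feature map

# Eq. (2): x2_i = sum_{j=1..32} V2 sigma(W2_j x1_i)
x2_i = sum(V2 @ relu(W2_j @ x1_i) for W2_j in W2)
print(x2_i.shape)             # (64,) i.e. the map is down-sampled to 8 x 8
```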
In the third layer, x^2 is transformed to x^3, i.e. the FIP features, in the same way as in the second layer but without pooling. Thus, x^3 is the same size as x^2:

x^3 = σ(W^3 x^2),    (4)

where W^3 = [W^3_1; ...; W^3_32], W^3_i ∈ R^{24×24, n_2}, and x^2 = [x^2_1; ...; x^2_32] ∈ R^{n_2}, respectively.

¹ In our notation, X ∈ R^{a,b} means that X is a two-dimensional matrix with a rows and b columns, while x ∈ R^{a×b} means that x is a vector with a×b dimensions. Also, [x; y] means that we concatenate vectors or matrices x and y column-wise, while [x y] means that we concatenate them row-wise.
² For a convolutional neural network such as [14], the non-zero values are the same for each row.
³ Note that in the conventional deep model [9] there is a bias term b, so that the output is σ(Wx + b). Since Wx + b can be written as W̃x̃, we drop the bias term b for simplification.
Finally, the reconstruction layer transforms the FIP features x^3 to the frontal face image ȳ through a weight matrix W^4 ∈ R^{n_0, n_2}:

ȳ = σ(W^4 x^3).    (5)
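Putting Eqs. (1)-(5) together, the following sketch runs the whole forward pass at toy size, just to make the dimensionalities explicit (the paper uses n_0 = 9216, n_1 = 48·48·32, n_2 = 24·24·32). Dense random matrices stand in for the sparse, locally connected W^1-W^3, and V^1, V^2 are assumed to be fixed binary 2×2 sum-pooling matrices.

```python
import numpy as np

rng = np.random.RandomState(0)
relu = lambda z: np.maximum(0.0, z)

def pooling_matrix(in_side, n_maps):
    """Fixed binary matrix that 2x2 sum-pools each of n_maps (in_side x in_side) maps."""
    out_side = in_side // 2
    V = np.zeros((n_maps * out_side * out_side, n_maps * in_side * in_side))
    for m in range(n_maps):
        for y in range(out_side):
            for x in range(out_side):
                row = m * out_side * out_side + y * out_side + x
                for dy in range(2):
                    for dx in range(2):
                        col = m * in_side * in_side + (2 * y + dy) * in_side + (2 * x + dx)
                        V[row, col] = 1.0
    return V

# Toy sizes (the paper uses side = 96, i.e. n0 = 9216, n1 = 48*48*32, n2 = 24*24*32).
side, n_maps = 16, 32
n0 = side * side
n1 = n_maps * (side // 2) ** 2
n2 = n_maps * (side // 4) ** 2

# Dense random stand-ins; the real W1..W3 are sparse / locally connected.
W1 = rng.randn(n_maps * n0, n0) * 0.01        # 32 sub-matrices stacked row-wise
W2 = rng.randn(n1, n1) * 0.01
W3 = rng.randn(n2, n2) * 0.01
W4 = rng.randn(n0, n2) * 0.01
V1 = pooling_matrix(side, n_maps)
V2 = pooling_matrix(side // 2, n_maps)

x0 = rng.rand(n0)                             # input face image (vectorized)
x1 = V1 @ relu(W1 @ x0)                       # Eq. (1) for all 32 maps at once, plus pooling
x2 = V2 @ relu(W2 @ x1)                       # Eq. (3)
x3 = relu(W3 @ x2)                            # Eq. (4): FIP features
y_hat = relu(W4 @ x3)                         # Eq. (5): reconstruction in the canonical view
print(x1.shape, x2.shape, x3.shape, y_hat.shape)
```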
4. Training

Training our deep network requires estimating all the weight matrices {W^i} introduced above, which is challenging because of the millions of parameters. Therefore, we first initialize the weights and then update them all. V^1 and V^2 are manually defined [13] and fixed.
4.1. Parameter Initialization

We cannot employ RBMs [9] to pre-train the weight matrices in an unsupervised manner, because our input and output data are in different spaces. Therefore, we devise a supervised method based on least square dictionary learning. As shown in Fig. 3, X^3 = {x^3_i}^m_{i=1} is a set of FIP features and Y = {y_i}^m_{i=1} is a set of target images, where m denotes the number of training examples. Our objective is to minimize the reconstruction error

arg min_{W^1, W^2, W^3, W^4} || Y − σ(W^4 X^3) ||²_F,    (6)

where || · ||_F is the Frobenius norm. Optimizing Eq. 6 is not trivial because of its nonlinearity. However, we can initialize the weight matrices layer-wisely as

arg min_{W^1} || Y − O W^1 X^0 ||²_F,    (7)

arg min_{W^2} || Y − P W^2 X^1 ||²_F,    (8)

arg min_{W^3} || Y − Q W^3 X^2 ||²_F,    (9)

arg min_{W^4} || Y − W^4 X^3 ||²_F.    (10)
In Eq. 7, X^0 = {x^0_i}^m_{i=1} is a set of input images. W^1 has been introduced in Sec. 3, so that W^1 X^0 results in 32 feature maps for each input. O is a fixed binary matrix that sums together the pixels at the same position of these feature maps, which makes O W^1 X^0 the same size as Y. In Eq. 8, X^1 = {x^1_i}^m_{i=1} is a set of outputs of the first locally connected layer before pooling, and P is also a fixed binary matrix, which sums together the corresponding pixels and rescales the result to the same size as Y. Q and X^2 in Eq. 9 are defined in the same way.

Intuitively, we first directly use X^0 to approximate Y with a linear transform W^1, without pooling. Once W^1 has been initialized, X^1 = V^1 σ(W^1 X^0) is used to approximate Y again with another linear transform, W^2. We repeat this process until all the matrices have been initialized. A similar strategy has been adopted by [33], which learns different levels of representations with a convolutional architecture. All of the above equations have closed-form solutions; for example, W^1 = (O^T O)^{-1} (O^T Y X^{0T}) (X^0 X^{0T})^{-1}. The other matrices can be computed in the same way.
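A toy numpy sketch of the layer-wise initialization in Eqs. (7)-(8) (not the authors' code): dense random stand-ins are used for the data, O and P are simple pixel-summing matrices, and pseudo-inverses replace the inverses in the closed form above (they coincide when the normal matrices are invertible, which a binary summing matrix generally does not guarantee).

```python
import numpy as np

rng = np.random.RandomState(0)
relu = lambda z: np.maximum(0.0, z)

# Toy sizes: n0-pixel images, k feature maps, m training pairs.
n0, k, m = 64, 4, 50
X0 = rng.rand(n0, m)                  # input images (columns are examples)
Y = rng.rand(n0, m)                   # target canonical-view images

# O sums the pixels at the same position across the k feature maps,
# so that O @ (W1 @ X0) has the same size as Y.
O = np.tile(np.eye(n0), (1, k))       # shape (n0, k*n0)

def least_squares(A, B, C):
    """Minimize ||B - A W C||_F over W; pseudo-inverses are used because
    A^T A (and C C^T) may be rank-deficient for binary summing matrices."""
    return np.linalg.pinv(A) @ B @ np.linalg.pinv(C)

# Eq. (7): initialize W1 so that O W1 X0 approximates Y.
W1 = least_squares(O, Y, X0)          # shape (k*n0, n0)

# Eq. (8): with W1 fixed, the first-layer outputs X1 (pooling omitted in this toy)
# approximate Y through another binary summing matrix P and a new transform W2.
X1 = relu(W1 @ X0)                    # shape (k*n0, m)
P = np.tile(np.eye(n0), (1, k))       # stand-in for P
W2 = least_squares(P, Y, X1)

# Eqs. (9)-(10) repeat the same pattern for W3 and W4 (W4 needs no summing matrix):
# W4 = Y @ np.linalg.pinv(X3).
print(W1.shape, W2.shape)
```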
4.2. Parameter Update

We update all the weight matrices after the initialization by minimizing the loss function of the reconstruction error

E(X^0; W) = || Y − Ȳ ||²_F,    (11)

where W = {W^1, ..., W^4}, and X^0 = {x^0_i}, Y = {y_i}, and Ȳ = {ȳ_i} are the set of input images, the set of target images, and the set of reconstructed images, respectively. We update W using stochastic gradient descent, in which the update rule of W^i, i = 1...4, in the k-th iteration is

Δ_{k+1} = 0.9 · Δ_k − 0.004 · ε · W^i_k − ε · ∂E/∂W^i_k,    (12)

W^i_{k+1} = Δ_{k+1} + W^i_k,    (13)

where Δ is the momentum variable [20], ε is the learning rate, and ∂E/∂W^i = x^{i−1}(e^i)^T is the derivative, computed as the outer product of the back-propagation error e^i and the feature x^{i−1} of the previous layer.
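A minimal sketch of one momentum step per Eqs. (12)-(13) for a single weight matrix; `grad_E` is a placeholder for ∂E/∂W^i, and 0.9 and 0.004 are the momentum and weight-decay coefficients from Eq. (12).

```python
import numpy as np

def sgd_momentum_step(W, delta, grad_E, lr):
    """One iteration of Eqs. (12)-(13) for a single weight matrix W^i.

    delta_{k+1} = 0.9 * delta_k - 0.004 * lr * W_k - lr * dE/dW_k
    W_{k+1}     = delta_{k+1} + W_k
    """
    delta = 0.9 * delta - 0.004 * lr * W - lr * grad_E
    W = W + delta
    return W, delta

# Toy usage with a placeholder gradient; in the paper dE/dW^i = x^{i-1} (e^i)^T.
rng = np.random.RandomState(0)
W = rng.randn(20, 10) * 0.01
delta = np.zeros_like(W)
for _ in range(100):
    grad_E = rng.randn(*W.shape)          # placeholder gradient
    W, delta = sgd_momentum_step(W, delta, grad_E, lr=1e-3)
```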
In our deep network, there are three different expressions of e^i. First, for the transformation layer, e^4 is computed based on the derivative of the rectified linear function [19]:

e^4_j = [y − ȳ]_j if δ^4_j > 0, and e^4_j = 0 if δ^4_j ≤ 0,    (14)

where δ^4_j = [W^4 x^3]_j and [·]_j denotes the j-th element of a vector.

Similarly, the back-propagation error e^3 is computed as

e^3_j = [(W^4)^T e^4]_j if δ^3_j > 0, and e^3_j = 0 if δ^3_j ≤ 0,    (15)

where δ^3_j = [W^3 x^2]_j.
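The two error terms above can be computed by gating with the sign of the pre-activations δ^4 and δ^3, as in Eqs. (14)-(15); a small self-contained sketch with random toy tensors:

```python
import numpy as np

def backprop_errors(W3, W4, x2, x3, y, y_hat):
    """Back-propagation errors of Eqs. (14)-(15).

    e4_j = [y - y_hat]_j   if [W4 x3]_j > 0, else 0
    e3_j = [W4^T e4]_j     if [W3 x2]_j > 0, else 0
    """
    delta4 = W4 @ x3
    e4 = np.where(delta4 > 0, y - y_hat, 0.0)
    delta3 = W3 @ x2
    e3 = np.where(delta3 > 0, W4.T @ e4, 0.0)
    return e4, e3

# Toy usage with random tensors.
rng = np.random.RandomState(0)
n0, n2 = 16, 8
W4, W3 = rng.randn(n0, n2), rng.randn(n2, n2)
x2, x3, y, y_hat = rng.rand(n2), rng.rand(n2), rng.rand(n0), rng.rand(n0)
e4, e3 = backprop_errors(W3, W4, x2, x3, y, y_hat)
grad_W4 = np.outer(e4, x3)    # outer-product gradient, same shape as W4
```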
We compute e^1 and e^2 in the same way as e^3, since these layers adopt the same activation function. There is a slight difference due to down-sampling: for these two layers, we must up-sample the corresponding back-propagation error e so that it has the same dimensions as the input feature. This strategy was introduced in [14]. We also need to enforce the weight matrices to keep their locally connected structures after each gradient step, as introduced in [12]. We implement this by setting the corresponding matrix elements to zero if they are supposed to have no connections.
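One way to realize this constraint is to precompute a 0/1 connectivity mask with the same shape as the weight matrix and multiply it in after every gradient step; a minimal sketch under that assumption (the tridiagonal pattern below is only a toy connectivity example):

```python
import numpy as np

def enforce_local_structure(W, mask):
    """Zero out entries that are not part of any local filter.

    `mask` is a 0/1 matrix with the same shape as W, with ones only at the
    positions that belong to a filter (the locally connected pattern).
    """
    return W * mask

# Toy usage: a 4x4 weight matrix restricted to a tridiagonal connectivity pattern.
rng = np.random.RandomState(0)
W = rng.randn(4, 4)
mask = (np.abs(np.subtract.outer(np.arange(4), np.arange(4))) <= 1).astype(float)
W = enforce_local_structure(W, mask)
```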
5. Experiments

We conduct two sets of experiments. Sec. 5.1 compares with state-of-the-art methods and learning-based descriptors. Sec. 5.2 demonstrates that classical face recognition methods can be significantly improved when applied to our reconstructed face images in the canonical view.

Dataset. To extensively evaluate our method under different poses and illuminations, we select the MultiPIE face database [7], which contains 754,204 images of 337 identities. Each identity has images captured under 15 poses and 20 illuminations. These images were captured in four sessions during different periods. Like the previous methods [3, 18, 17], we evaluate our algorithm on a subset of the MultiPIE database, where each identity has images from all four sessions under seven poses with yaw angles from −45° to +45°, and 20 illuminations marked as ID 00-19 in MultiPIE. This subset has 128,940 images.
5.1. Face Recognition

The existing works conduct experiments on MultiPIE with three different settings: Setting-I was introduced in [3, 18, 34]; Setting-II and Setting-III were introduced in [17]. We describe these settings below.

Setting-I and Setting-II only adopt images with different poses under neutral illumination (marked as ID 07). They evaluate robustness to pose variations. For Setting-I, the images of the first 200 identities in all four sessions are chosen for training, and the images of the remaining 137 identities for test. During test, one frontal image (i.e. 0°) of each identity in the test set is selected into the gallery, so there are 137 gallery images in total. The remaining images from −45° to +45°, except 0°, are selected as probes. For Setting-II, only the images in session one are used, which contains only 249 identities. The images of the first 100 identities are used for training, and the images of the remaining 149 identities for test. During test, one frontal image of each identity in the test set is selected into the gallery. The remaining images from −45° to +45°, except 0°, are selected as probes.

Setting-III also adopts images in session one for training and test, but it utilizes the images under all 7 poses and 20 illuminations. This evaluates the robustness when both pose and illumination variations are present. The selection of probes and gallery is the same as in Setting-II.
We evaluate both the FIP features and the reconstructed images using the above three settings. Face images are roughly aligned according to the positions of the eyes, rescaled to 96×96, and converted to grayscale. The mean value over the training set is subtracted from each pixel. For each identity, we use the images with 6 poses ranging from −45° to +45° (excluding 0°) and 19 illuminations marked as ID 00-19 (excluding 07) as input to train our deep network. The reconstruction target is the image of the same identity in the canonical view, i.e. the frontal pose (0°) under neutral illumination (ID 07).
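A rough sketch of the preprocessing described above (eye-based alignment is omitted; file paths, the scaling to [0, 1], and the mean image are placeholders, not details given in the paper):

```python
import numpy as np
from PIL import Image

def preprocess(path, mean_image):
    """Grayscale, resize to 96x96, subtract the per-pixel training-set mean."""
    img = Image.open(path).convert("L").resize((96, 96))
    x = np.asarray(img, dtype=np.float32) / 255.0
    return (x - mean_image).reshape(-1)          # vectorized n0 = 9216 input

# mean_image would be computed over the training images, e.g.:
# mean_image = np.mean([np.asarray(Image.open(p).convert("L").resize((96, 96)),
#                                  dtype=np.float32) / 255.0 for p in train_paths], axis=0)
```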
