Deep Learning Identity-Preserving Face Space
Zhenyao Zhu^1,∗  Ping Luo^1,3,∗  Xiaogang Wang^2  Xiaoou Tang^1,3
^1 Department of Information Engineering, The Chinese University of Hong Kong
^2 Department of Electronic Engineering, The Chinese University of Hong Kong
^3 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
zz012@ie.cuhk.edu.hk  pluo.lhi@gmail.com  xgwang@ee.cuhk.edu.hk  xtang@ie.cuhk.edu.hk
Abstract
Face recognition with large pose and illumination variations is a challenging problem in computer vision. This paper addresses this challenge by proposing a new learning-based face representation: the face identity-preserving (FIP) features. Unlike conventional face descriptors, the FIP features can significantly reduce intra-identity variances, while maintaining discriminativeness between identities. Moreover, the FIP features extracted from an image under any pose and illumination can be used to reconstruct its face image in the canonical view. This property makes it possible to improve the performance of traditional descriptors, such as LBP [2] and Gabor [31], which can be extracted from our reconstructed images in the canonical view to eliminate variations. In order to learn the FIP features, we carefully design a deep network that combines the feature extraction layers and the reconstruction layer. The former encodes a face image into the FIP features, while the latter transforms them to an image in the canonical view. Extensive experiments on the large MultiPIE face database [7] demonstrate that it significantly outperforms the state-of-the-art face recognition methods.
1. Introduction

In many practical applications, pose and illumination changes become the bottleneck for face recognition [36]. Many existing works have been proposed to account for such variations. Pose-invariant methods can generally be separated into two categories: 2D-based [17, 5, 23] and 3D-based [18, 3]. In the first category, poses are either handled by 2D image matching or by encoding a test image using some bases or exemplars.

∗ indicates equal contribution.
This work is supported by the General Research Fund sponsored by the Research Grants Council of the Hong Kong SAR (Project No. CUHK 416312 and CUHK 416510) and the Guangdong Innovative Research Team Program (No. 201001D0104648280).
Figure 1. Three face images under different poses and illuminations of two identities are shown in (a). The FIP features extracted from these images are also visualized. The FIP features of the same identity are similar, although the original images are captured under different poses and illuminations. These examples indicate that the FIP features are sparse and identity-preserving (blue indicates zero value). (b) shows some images of the two identities, including the original image (left) and the reconstructed image in the canonical view (right) from the FIP features. The reconstructed images remove the pose and illumination variations and retain the intrinsic face structures of the identities. Best viewed in color.
For example, Carlos et al. [5] used stereo matching to compute the similarity between two faces. Li et al. [17] represented a test face as a linear combination of training images, and utilized the linear regression coefficients as features for face recognition. 3D-based methods usually capture 3D face data or estimate 3D models from 2D input, and try to match them to a 2D probe face image. Such methods make it possible to synthesize any view of the probe face, which makes them generally more robust to pose variation. For instance, Li et al. [18] first generated a virtual view for the probe face by using a set of 3D displacement fields sampled from a 3D face database, and then matched the synthesized face with the gallery faces. Similarly, Asthana et al. [3] matched the 3D model to a 2D image using the view-based active appearance model.

Figure 2. The LBP (a), LE (b), CRBM (c), and FIP (d) features of 50 identities, each of which has 6 images under different poses and illuminations, projected into two dimensions using multidimensional scaling (MDS). Images of the same identity are visualized in the same color. It shows that FIP has the best representative power. Best viewed in color.
The illumination-invariant methods [26, 17] typically make assumptions about how illumination affects the face images, and use these assumptions to model and remove the illumination effect. For example, Wagner et al. [26] designed a projector-based system to capture images of each subject in the gallery under a few illuminations, which can be linearly combined to generate images under arbitrary illuminations. With this augmented gallery, they adopted sparse coding to perform face recognition.

The above methods have certain limitations. For example, capturing 3D data requires additional cost and resources [18]. Inferring 3D models from 2D data is an ill-posed problem [23]. As the statistical illumination models [26] are often learned in controlled environments, they do not generalize well to practical applications.
In this paper, unlike previous works that either build physical models or make statistical assumptions, we propose a novel face representation, the face identity-preserving (FIP) features, which are directly extracted from face images with arbitrary poses and illuminations. This new representation can significantly remove pose and illumination variations, while maintaining the discriminativeness across identities, as shown in Fig. 1 (a). Furthermore, unlike traditional face descriptors, e.g. LBP [2], Gabor [31], and LE [4], which cannot recover the original images, the FIP features can reconstruct face images in the frontal pose and with neutral illumination (we call it the canonical view) of the same identity, as shown in Fig. 1 (b). With this attractive property, conventional descriptors and learning algorithms can utilize our reconstructed face images in the canonical view as input, so as to eliminate the negative effects of poses and illuminations.
Specifically, we present a new deep network to learn the FIP features. It takes face images of an identity with arbitrary pose and illumination variations as input, and reconstructs a face of the same identity in the canonical view as the target (see Fig. 3). First, input images are encoded through the feature extraction layers, which have three locally connected layers and two pooling layers stacked alternately. Each layer captures face features at a different scale. As shown in Fig. 3, the first locally connected layer outputs 32 feature maps. Each map has a large number of high responses outside the face region, which mainly capture pose information, and some high responses inside the face region, which capture face structures (red indicates large response and blue indicates no response). On the output feature maps of the second locally connected layer, high responses outside the face region have been significantly reduced, which indicates that it discards most pose variations while retaining the face structures. The third locally connected layer outputs the FIP features, which are sparse and identity-preserving.
Second, the FIP features recover the face image in the canonical view using a fully-connected reconstruction layer. As there are a large number of parameters, our network is hard to train with traditional training methods [14, 12]. We propose a new training strategy, which contains two steps: parameter initialization and parameter update. First, we initialize the parameters based on least square dictionary learning. We then update all the parameters by back-propagating the summed squared reconstruction error between the reconstructed image and the ground truth.

Existing deep learning methods for face recognition generally fall into two categories: (1) unsupervised learning of features with deep models, followed by discriminative methods (e.g. SVM) for classification [21, 10, 15]; (2) directly using class labels as the supervision of deep models [6, 24]. In the first category, features related to identity, poses, and lightings are coupled when learned by deep models, and it is then too late to rely on SVM to separate them. Our supervised model makes it possible to discard pose and lighting features from the very bottom layer. In the second category, a '0/1' class label is a much weaker supervision than ours, which uses a face image (with thousands of pixels) in the canonical view as supervision. We require the deep model to fully reconstruct the face in the canonical view rather than simply predicting class labels, and this strong regularization is more effective in avoiding overfitting. This design is suitable for face recognition, where a canonical view exists. Different from convolutional neural networks, whose filters share weights, our filters are localized and do not share weights, since we assume that different face regions should employ different features.

This work makes three key contributions. (1) We propose a new deep network that combines the feature extraction layers and the reconstruction layer. Its architecture is carefully designed to learn the FIP features. These features can eliminate pose and illumination variations while maintaining discriminativeness between different identities.

n
2
=24×24×32 n
2
=24×24×32
5×5 Locally
Connected and
Pooling
Fully
Connected
W
1
, V
1
W
3
W
4
FIP
W
2
, V
2
Feature Extraction Layers Reconstruction Layer
x
0
x
1
x
2
x
3
y
y
5×5 Locally
Connected and
Pooling
5×5 Locally
Connected
n
0
=96×96
n
0
=96×96
n
1
=48×48×32
24
24
24
24
48
48
Figure 3. Architecture of the deep network. It combines the feature extraction layers and reconstruction layer. The feature extraction layers include three
locally connected layers and two pooling layers. They encode an input face x
0
into FIP features x
3
. x
1
, x
2
are the output feature maps of the first and
second locally connected layers. FIP features can be used to recover the face image y in the canonical view. y is the ground truth. Best viewed in color.
(2) Unlike conventional face descriptors, the FIP features can be used to reconstruct a face image in the canonical view. We also demonstrate significant improvement of existing methods when they are applied to our reconstructed face images. (3) Unlike existing works that need to know the pose of a probe face in order to build pose-specific models, our method can extract the FIP features without any knowledge of pose and illumination. The FIP features outperform the state-of-the-art methods, including both 2D-based and 3D-based methods, on the MultiPIE database [7].
2. Related Work

This section reviews related works on learning-based face descriptors and deep models for feature learning.

Learning-based descriptors. Cao et al. [4] devised an unsupervised feature learning method (LE) with random-projection trees and PCA trees, and adopted PCA to obtain a compact face descriptor. Zhang et al. [35] extended [4] by introducing an inter-modality encoding method, which can match face images in two modalities, e.g. photos and sketches, significantly outperforming traditional methods [25, 30]. There are also studies that learn the filters and patterns of existing handcrafted descriptors. For example, Guo et al. [8] proposed a supervised learning approach with the Fisher separation criterion to learn the patterns of LBP [2]. Zhen et al. [16] adopted a strategy similar to LDA to learn the filters of LBP. Our FIP features are learned with a multi-layer deep model in a supervised manner, and have more discriminative and representative power than the above works. We illustrate the feature space of FIP (Fig. 2 (d)) compared with LBP [2] (Fig. 2 (a)) and LE [4] (Fig. 2 (b)), which shows that the FIP space better maintains both the intra-identity consistency and the inter-identity discriminativeness.

Deep models. Deep models learn representations by stacking many hidden layers, which are trained layer-wise in an unsupervised manner. For example, the deep belief network [9] (DBN) and the deep Boltzmann machine [22] (DBM) stack many layers of restricted Boltzmann machines (RBM) and can extract different levels of features. Recently, Huang et al. [10] introduced the convolutional restricted Boltzmann machine (CRBM), which incorporates local filters into the RBM. Their learned filters can preserve the local structures of data. Sun et al. [24] proposed a hybrid Convolutional Neural Network-Restricted Boltzmann Machine (CNN-RBM) model to learn relational features for comparing face similarity. Unlike DBN and DBM, which employ fully connected layers, our deep network combines both locally and fully connected layers, which enables it to extract both local and global information. The locally connected architecture of our deep network is similar to CRBM [10], but we learn the network with a supervised scheme and require the FIP features to recover the frontal face image. Therefore, our method is more robust to pose and illumination variations, as shown in Fig. 2 (d).
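The 2D visualization in Fig. 2 is produced by embedding the high-dimensional feature vectors with multidimensional scaling. As a rough, hypothetical sketch of that kind of plot (not the authors' code), the snippet below uses scikit-learn's MDS on placeholder feature vectors for 50 identities with 6 images each; `features` and `identity_labels` are assumed stand-ins.

```python
# Minimal sketch: project per-image feature vectors (e.g. LBP, LE, CRBM, or FIP)
# into 2D with multidimensional scaling and color the points by identity.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

rng = np.random.RandomState(0)
features = rng.rand(300, 512)                   # placeholder: 50 identities x 6 images
identity_labels = np.repeat(np.arange(50), 6)   # placeholder identity of each image

embedding = MDS(n_components=2, random_state=0).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=identity_labels, cmap="tab20", s=10)
plt.title("2D MDS embedding, colored by identity")
plt.show()
```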
3. Network Architecture

Fig. 3 shows the architecture of our deep model. The input is a face image x^0 under an arbitrary pose and illumination, and the output is a frontal face image y under neutral illumination. They both have n_0 = 96 × 96 = 9216 dimensions. The feature extraction layers have three locally connected layers and two pooling layers, which encode x^0 into the FIP features x^3.
In the first layer, x^0 is transformed to 32 feature maps through a weight matrix W^1 that contains 32 sub-matrices, W^1 = [W^1_1; W^1_2; ...; W^1_32], W^1_i ∈ R^{n_0, n_0},¹ each of which is sparse so as to retain the locally connected structure [13]. Intuitively, each row of W^1_i represents a small filter centered at a pixel of x^0, so that all of the elements in this row equal zero except for the elements belonging to the filter. As our weights are not shared, the non-zero values of these rows are not the same.² Therefore, the weight matrix W^1 results in 32 feature maps {x^1_i}^{32}_{i=1}, each of which has n_0 dimensions. Then, a matrix V^1, where V^1_{ij} ∈ {0, 1} encodes the 2D topography of the pooling layer [13], down-samples each of these feature maps to 48 × 48, in order to reduce the number of parameters that need to be learned and to obtain more robust features. Each x^1_i can be computed as³

x^1_i = V^1 σ(W^1_i x^0),    (1)

where σ(x) = max(0, x) is the rectified linear function [19], which is feature-intensity-invariant and therefore robust to shape and illumination variations. x^1 is obtained by concatenating all the x^1_i ∈ R^{48×48} together, giving a large feature map with n_1 = 48 × 48 × 32 dimensions.
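To make the locally connected structure and Eq. (1) concrete, here is a small self-contained numpy sketch (not the authors' code) at toy size: it builds one sparse sub-matrix W^1_i with 5×5 filters whose weights are not shared across positions, applies the rectified linear function, and sum-pools with a fixed binary matrix V^1. The 2×2 non-overlapping pooling regions are an assumption; the paper only states that each map is down-sampled by half per side.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)          # sigma(x) = max(0, x)
rng = np.random.RandomState(0)

side = 16                                    # toy size; the paper uses 96 x 96 inputs
n0 = side * side

def build_local_W(filter_size=5):
    """One sub-matrix W1_i: each row holds a 5x5 filter centered at one pixel,
    with its own (non-shared) weights; all other entries stay zero."""
    Wi = np.zeros((n0, n0))
    r = filter_size // 2
    for y in range(side):
        for x in range(side):
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    if 0 <= y + dy < side and 0 <= x + dx < side:
                        Wi[y * side + x, (y + dy) * side + (x + dx)] = rng.randn() * 0.01
    return Wi

def build_pooling_V():
    """Fixed binary matrix V1 that sum-pools non-overlapping 2x2 regions,
    halving each side (96 -> 48 in the paper; 16 -> 8 in this toy)."""
    out = side // 2
    V = np.zeros((out * out, n0))
    for y in range(out):
        for x in range(out):
            for dy in range(2):
                for dx in range(2):
                    V[y * out + x, (2 * y + dy) * side + (2 * x + dx)] = 1.0
    return V

x0 = rng.rand(n0)                                        # vectorized input face image
x1_i = build_pooling_V() @ relu(build_local_W() @ x0)    # Eq. (1)
print(x1_i.shape)                                        # (64,) i.e. 8 x 8
```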
In the second layer, each x^1_i is transformed to x^2_i by the 32 sub-matrices {W^2_i}^{32}_{i=1}, W^2_i ∈ R^{48×48, 48×48}:

x^2_i = Σ^{32}_{j=1} V^2 σ(W^2_j x^1_i),    (2)

where x^2_i is down-sampled by V^2 to 24 × 24 dimensions. Eq. 2 means that each small feature map in the first layer is multiplied by 32 sub-matrices and then summed together. Here, each sub-matrix has a sparse structure, as discussed above. We can reformulate Eq. 2 in matrix form as

x^2 = V^2 σ(W^2 x^1),    (3)

where W^2 = [W^2'_1; ...; W^2'_32], W^2'_i ∈ R^{48×48, n_1}, and x^1 = [x^1_1; ...; x^1_32] ∈ R^{n_1}, respectively. W^2'_i is simply obtained by repeating W^2_i 32 times. Thus, x^2 has n_2 = 24 × 24 × 32 dimensions.
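A literal sketch of Eq. (2) at toy size (the paper's first-layer maps are 48×48, pooled to 24×24): one first-layer map is multiplied by all 32 sub-matrices, rectified, pooled by the fixed binary matrix V^2, and summed. Dense random matrices stand in for the sparse, locally connected W^2_j, and 2×2 sum-pooling is assumed.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
rng = np.random.RandomState(0)

side1, n_maps = 16, 32        # toy size; the paper's first-layer maps are 48 x 48
d1 = side1 * side1
out = side1 // 2

# Stand-ins for the 32 (sparse, locally connected) sub-matrices W2_j and the
# fixed binary pooling matrix V2 (here: 2x2 sum pooling).
W2 = [rng.randn(d1, d1) * 0.01 for _ in range(n_maps)]
V2 = np.zeros((out * out, d1))
for y in range(out):
    for x in range(out):
        for dy in range(2):
            for dx in range(2):
                V2[y * out + x, (2 * y + dy) * side1 + (2 * x + dx)] = 1.0

x1_i = rng.rand(d1)           # one first-layer feature map

# Eq. (2): x2_i = sum_{j=1..32} V2 sigma(W2_j x1_i)
x2_i = sum(V2 @ relu(W2_j @ x1_i) for W2_j in W2)
print(x2_i.shape)             # (64,) i.e. the map is down-sampled to 8 x 8
```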
In the third layer, x^2 is transformed to x^3, i.e. the FIP features, in the same way as in the second layer but without pooling. Thus, x^3 is the same size as x^2:

x^3 = σ(W^3 x^2),    (4)

where W^3 = [W^3_1; ...; W^3_32], W^3_i ∈ R^{24×24, n_2}, and x^2 = [x^2_1; ...; x^2_32] ∈ R^{n_2}, respectively.

¹ In our notation, X ∈ R^{a,b} means that X is a two-dimensional matrix with a rows and b columns, while x ∈ R^{a×b} means that x is a vector with a×b dimensions. Also, [x; y] means that we concatenate vectors or matrices x and y column-wise, while [x y] means that we concatenate them row-wise.
² For a convolutional neural network such as [14], the non-zero values are the same for each row.
³ Note that in the conventional deep model [9] there is a bias term b, so that the output is σ(Wx + b). Since Wx + b can be written as W̃x̃, we drop the bias term b for simplification.
Finally, the reconstruction layer transforms the FIP features x^3 to the frontal face image ȳ through a weight matrix W^4 ∈ R^{n_0, n_2}:

ȳ = σ(W^4 x^3).    (5)
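Putting Eqs. (1)-(5) together, the following sketch runs the whole forward pass at toy size, just to make the dimensionalities explicit (the paper uses n_0 = 9216, n_1 = 48·48·32, n_2 = 24·24·32). Dense random matrices stand in for the sparse, locally connected W^1-W^3, and V^1, V^2 are assumed to be fixed binary 2×2 sum-pooling matrices.

```python
import numpy as np

rng = np.random.RandomState(0)
relu = lambda z: np.maximum(0.0, z)

def pooling_matrix(in_side, n_maps):
    """Fixed binary matrix that 2x2 sum-pools each of n_maps (in_side x in_side) maps."""
    out_side = in_side // 2
    V = np.zeros((n_maps * out_side * out_side, n_maps * in_side * in_side))
    for m in range(n_maps):
        for y in range(out_side):
            for x in range(out_side):
                row = m * out_side * out_side + y * out_side + x
                for dy in range(2):
                    for dx in range(2):
                        col = m * in_side * in_side + (2 * y + dy) * in_side + (2 * x + dx)
                        V[row, col] = 1.0
    return V

# Toy sizes (the paper uses side = 96, i.e. n0 = 9216, n1 = 48*48*32, n2 = 24*24*32).
side, n_maps = 16, 32
n0 = side * side
n1 = n_maps * (side // 2) ** 2
n2 = n_maps * (side // 4) ** 2

# Dense random stand-ins; the real W1..W3 are sparse / locally connected.
W1 = rng.randn(n_maps * n0, n0) * 0.01        # 32 sub-matrices stacked row-wise
W2 = rng.randn(n1, n1) * 0.01
W3 = rng.randn(n2, n2) * 0.01
W4 = rng.randn(n0, n2) * 0.01
V1 = pooling_matrix(side, n_maps)
V2 = pooling_matrix(side // 2, n_maps)

x0 = rng.rand(n0)                             # input face image (vectorized)
x1 = V1 @ relu(W1 @ x0)                       # Eq. (1) for all 32 maps at once, plus pooling
x2 = V2 @ relu(W2 @ x1)                       # Eq. (3)
x3 = relu(W3 @ x2)                            # Eq. (4): FIP features
y_hat = relu(W4 @ x3)                         # Eq. (5): reconstruction in the canonical view
print(x1.shape, x2.shape, x3.shape, y_hat.shape)
```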
4. Training

Training our deep network requires estimating all the weight matrices {W^i} introduced above, which is challenging because of the millions of parameters. Therefore, we first initialize the weights and then update them all. V^1 and V^2 are manually defined [13] and fixed.
4.1. Parameter Initialization

We cannot employ RBMs [9] to pre-train the weight matrices in an unsupervised manner, because our input and output data are in different spaces. Therefore, we devise a supervised method based on least square dictionary learning. As shown in Fig. 3, X^3 = {x^3_i}^m_{i=1} is a set of FIP features and Y = {y_i}^m_{i=1} is a set of target images, where m denotes the number of training examples. Our objective is to minimize the reconstruction error

arg min_{W^1, W^2, W^3, W^4} || Y − σ(W^4 X^3) ||²_F,    (6)

where || · ||_F is the Frobenius norm. Optimizing Eq. 6 is not trivial because of its nonlinearity. However, we can initialize the weight matrices layer-wisely as

arg min_{W^1} || Y − O W^1 X^0 ||²_F,    (7)

arg min_{W^2} || Y − P W^2 X^1 ||²_F,    (8)

arg min_{W^3} || Y − Q W^3 X^2 ||²_F,    (9)

arg min_{W^4} || Y − W^4 X^3 ||²_F.    (10)
In Eq. 7, X^0 = {x^0_i}^m_{i=1} is a set of input images. W^1 has been introduced in Sec. 3, so that W^1 X^0 results in 32 feature maps for each input. O is a fixed binary matrix that sums together the pixels at the same position of these feature maps, which makes O W^1 X^0 the same size as Y. In Eq. 8, X^1 = {x^1_i}^m_{i=1} is a set of outputs of the first locally connected layer before pooling, and P is also a fixed binary matrix, which sums together the corresponding pixels and rescales the result to the same size as Y. Q and X^2 in Eq. 9 are defined in the same way.

Intuitively, we first directly use X^0 to approximate Y with a linear transform W^1, without pooling. Once W^1 has been initialized, X^1 = V^1 σ(W^1 X^0) is used to approximate Y again with another linear transform, W^2. We repeat this process until all the matrices have been initialized. A similar strategy has been adopted by [33], which learns different levels of representations with a convolutional architecture. All of the above equations have closed-form solutions; for example, W^1 = (O^T O)^{-1} (O^T Y X^{0T}) (X^0 X^{0T})^{-1}. The other matrices can be computed in the same way.
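A toy numpy sketch of the layer-wise initialization in Eqs. (7)-(8) (not the authors' code): dense random stand-ins are used for the data, O and P are simple pixel-summing matrices, and pseudo-inverses replace the inverses in the closed form above (they coincide when the normal matrices are invertible, which a binary summing matrix generally does not guarantee).

```python
import numpy as np

rng = np.random.RandomState(0)
relu = lambda z: np.maximum(0.0, z)

# Toy sizes: n0-pixel images, k feature maps, m training pairs.
n0, k, m = 64, 4, 50
X0 = rng.rand(n0, m)                  # input images (columns are examples)
Y = rng.rand(n0, m)                   # target canonical-view images

# O sums the pixels at the same position across the k feature maps,
# so that O @ (W1 @ X0) has the same size as Y.
O = np.tile(np.eye(n0), (1, k))       # shape (n0, k*n0)

def least_squares(A, B, C):
    """Minimize ||B - A W C||_F over W; pseudo-inverses are used because
    A^T A (and C C^T) may be rank-deficient for binary summing matrices."""
    return np.linalg.pinv(A) @ B @ np.linalg.pinv(C)

# Eq. (7): initialize W1 so that O W1 X0 approximates Y.
W1 = least_squares(O, Y, X0)          # shape (k*n0, n0)

# Eq. (8): with W1 fixed, the first-layer outputs X1 (pooling omitted in this toy)
# approximate Y through another binary summing matrix P and a new transform W2.
X1 = relu(W1 @ X0)                    # shape (k*n0, m)
P = np.tile(np.eye(n0), (1, k))       # stand-in for P
W2 = least_squares(P, Y, X1)

# Eqs. (9)-(10) repeat the same pattern for W3 and W4 (W4 needs no summing matrix):
# W4 = Y @ np.linalg.pinv(X3).
print(W1.shape, W2.shape)
```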
4.2. Parameter Update

We update all the weight matrices after the initialization by minimizing the loss function of the reconstruction error

E(X^0; W) = || Y − Ȳ ||²_F,    (11)

where W = {W^1, ..., W^4}, and X^0 = {x^0_i}, Y = {y_i}, and Ȳ = {ȳ_i} are the set of input images, the set of target images, and the set of reconstructed images, respectively. We update W using stochastic gradient descent, in which the update rule of W^i, i = 1...4, in the k-th iteration is

Δ_{k+1} = 0.9 · Δ_k − 0.004 · ε · W^i_k − ε · ∂E/∂W^i_k,    (12)

W^i_{k+1} = Δ_{k+1} + W^i_k,    (13)

where Δ is the momentum variable [20], ε is the learning rate, and ∂E/∂W^i = x^{i−1}(e^i)^T is the derivative, computed as the outer product of the back-propagation error e^i and the feature x^{i−1} of the previous layer.
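A minimal sketch of one momentum step per Eqs. (12)-(13) for a single weight matrix; `grad_E` is a placeholder for ∂E/∂W^i, and 0.9 and 0.004 are the momentum and weight-decay coefficients from Eq. (12).

```python
import numpy as np

def sgd_momentum_step(W, delta, grad_E, lr):
    """One iteration of Eqs. (12)-(13) for a single weight matrix W^i.

    delta_{k+1} = 0.9 * delta_k - 0.004 * lr * W_k - lr * dE/dW_k
    W_{k+1}     = delta_{k+1} + W_k
    """
    delta = 0.9 * delta - 0.004 * lr * W - lr * grad_E
    W = W + delta
    return W, delta

# Toy usage with a placeholder gradient; in the paper dE/dW^i = x^{i-1} (e^i)^T.
rng = np.random.RandomState(0)
W = rng.randn(20, 10) * 0.01
delta = np.zeros_like(W)
for _ in range(100):
    grad_E = rng.randn(*W.shape)          # placeholder gradient
    W, delta = sgd_momentum_step(W, delta, grad_E, lr=1e-3)
```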
In our deep network, there are three different expressions of e^i. First, for the transformation layer, e^4 is computed based on the derivative of the rectified linear function [19]:

e^4_j = [y − ȳ]_j if δ^4_j > 0, and e^4_j = 0 if δ^4_j ≤ 0,    (14)

where δ^4_j = [W^4 x^3]_j and [·]_j denotes the j-th element of a vector.

Similarly, the back-propagation error e^3 is computed as

e^3_j = [(W^4)^T e^4]_j if δ^3_j > 0, and e^3_j = 0 if δ^3_j ≤ 0,    (15)

where δ^3_j = [W^3 x^2]_j.
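The two error terms above can be computed by gating with the sign of the pre-activations δ^4 and δ^3, as in Eqs. (14)-(15); a small self-contained sketch with random toy tensors:

```python
import numpy as np

def backprop_errors(W3, W4, x2, x3, y, y_hat):
    """Back-propagation errors of Eqs. (14)-(15).

    e4_j = [y - y_hat]_j   if [W4 x3]_j > 0, else 0
    e3_j = [W4^T e4]_j     if [W3 x2]_j > 0, else 0
    """
    delta4 = W4 @ x3
    e4 = np.where(delta4 > 0, y - y_hat, 0.0)
    delta3 = W3 @ x2
    e3 = np.where(delta3 > 0, W4.T @ e4, 0.0)
    return e4, e3

# Toy usage with random tensors.
rng = np.random.RandomState(0)
n0, n2 = 16, 8
W4, W3 = rng.randn(n0, n2), rng.randn(n2, n2)
x2, x3, y, y_hat = rng.rand(n2), rng.rand(n2), rng.rand(n0), rng.rand(n0)
e4, e3 = backprop_errors(W3, W4, x2, x3, y, y_hat)
grad_W4 = np.outer(e4, x3)    # outer-product gradient, same shape as W4
```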
We compute e^1 and e^2 in the same way as e^3, since these layers adopt the same activation function. There is a slight difference due to down-sampling: for these two layers, we must up-sample the corresponding back-propagation error e so that it has the same dimensions as the input feature. This strategy was introduced in [14]. We also need to enforce the weight matrices to keep their locally connected structures after each gradient step, as introduced in [12]. We implement this by setting the corresponding matrix elements to zero if they are supposed to have no connections.
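One way to realize this constraint is to precompute a 0/1 connectivity mask with the same shape as the weight matrix and multiply it in after every gradient step; a minimal sketch under that assumption (the tridiagonal pattern below is only a toy connectivity example):

```python
import numpy as np

def enforce_local_structure(W, mask):
    """Zero out entries that are not part of any local filter.

    `mask` is a 0/1 matrix with the same shape as W, with ones only at the
    positions that belong to a filter (the locally connected pattern).
    """
    return W * mask

# Toy usage: a 4x4 weight matrix restricted to a tridiagonal connectivity pattern.
rng = np.random.RandomState(0)
W = rng.randn(4, 4)
mask = (np.abs(np.subtract.outer(np.arange(4), np.arange(4))) <= 1).astype(float)
W = enforce_local_structure(W, mask)
```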
5. Experiments

We conduct two sets of experiments. Sec. 5.1 compares with state-of-the-art methods and learning-based descriptors. Sec. 5.2 demonstrates that classical face recognition methods can be significantly improved when applied to our reconstructed face images in the canonical view.

Dataset. To extensively evaluate our method under different poses and illuminations, we select the MultiPIE face database [7], which contains 754,204 images of 337 identities. Each identity has images captured under 15 poses and 20 illuminations. These images were captured in four sessions during different periods. Like the previous methods [3, 18, 17], we evaluate our algorithm on a subset of the MultiPIE database, where each identity has images from all four sessions under seven poses with yaw angles from −45° to +45°, and 20 illuminations marked as ID 00-19 in MultiPIE. This subset has 128,940 images.
5.1. Face Recognition

The existing works conduct experiments on MultiPIE with three different settings: Setting-I was introduced in [3, 18, 34]; Setting-II and Setting-III were introduced in [17]. We describe these settings below.

Setting-I and Setting-II only adopt images with different poses under neutral illumination (marked as ID 07). They evaluate robustness to pose variations. For Setting-I, the images of the first 200 identities in all four sessions are chosen for training, and the images of the remaining 137 identities for test. During test, one frontal image (i.e. 0°) of each identity in the test set is selected into the gallery, so there are 137 gallery images in total. The remaining images from −45° to +45°, except 0°, are selected as probes. For Setting-II, only the images in session one are used, which contains only 249 identities. The images of the first 100 identities are used for training, and the images of the remaining 149 identities for test. During test, one frontal image of each identity in the test set is selected into the gallery. The remaining images from −45° to +45°, except 0°, are selected as probes.

Setting-III also adopts images in session one for training and test, but it utilizes the images under all 7 poses and 20 illuminations. This evaluates the robustness when both pose and illumination variations are present. The selection of probes and gallery is the same as in Setting-II.
We evaluate both the FIP features and the reconstructed images using the above three settings. Face images are roughly aligned according to the positions of the eyes, rescaled to 96×96, and converted to grayscale. The mean value over the training set is subtracted from each pixel. For each identity, we use the images with 6 poses ranging from −45° to +45° (excluding 0°) and 19 illuminations marked as ID 00-19 (excluding 07) as input to train our deep network. The reconstruction target is the image of the same identity in the canonical view, i.e. the frontal pose (0°) under neutral illumination (ID 07).
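A rough sketch of the preprocessing described above (eye-based alignment is omitted; file paths, the scaling to [0, 1], and the mean image are placeholders, not details given in the paper):

```python
import numpy as np
from PIL import Image

def preprocess(path, mean_image):
    """Grayscale, resize to 96x96, subtract the per-pixel training-set mean."""
    img = Image.open(path).convert("L").resize((96, 96))
    x = np.asarray(img, dtype=np.float32) / 255.0
    return (x - mean_image).reshape(-1)          # vectorized n0 = 9216 input

# mean_image would be computed over the training images, e.g.:
# mean_image = np.mean([np.asarray(Image.open(p).convert("L").resize((96, 96)),
#                                  dtype=np.float32) / 255.0 for p in train_paths], axis=0)
```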
