
Hybrid Deep Learning for Face Verification
Yi Sun¹    Xiaogang Wang²,³    Xiaoou Tang¹,³
¹Department of Information Engineering, The Chinese University of Hong Kong
²Department of Electronic Engineering, The Chinese University of Hong Kong
³Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
sy011@ie.cuhk.edu.hk xgwang@ee.cuhk.edu.hk xtang@ie.cuhk.edu.hk
Abstract
This paper proposes a hybrid convolutional network
(ConvNet)-Restricted Boltzmann Machine (RBM) model for
face verification in wild conditions. A key contribution
of this work is to directly learn relational visual features,
which indicate identity similarities, from raw pixels of face
pairs with a hybrid deep network. The deep ConvNets
in our model mimic the primary visual cortex to jointly
extract local relational visual features from two face images
compared with the learned filter pairs. These relational
features are further processed through multiple layers to
extract high-level and global features. Multiple groups of
ConvNets are constructed in order to achieve robustness
and characterize face similarities from different aspects.
The top-layer RBM performs inference from complementary
high-level features extracted from different ConvNet groups
with a two-level average pooling hierarchy. The entire
hybrid deep network is jointly fine-tuned to optimize for the
task of face verification. Our model achieves competitive
face verification performance on the LFW dataset.
1. Introduction
Face recognition has been extensively studied in recent
decades [29, 28, 30, 1, 16, 5, 33, 12, 6, 3, 7, 25, 34].
This paper addresses the key challenge of computing the
similarity of two face images given their large intra-
personal variations in poses, illuminations, expressions,
ages, makeups, and occlusions. It becomes more difficult
when faces to be compared are acquired in the wild.
We focus on the task of face verification, which aims to
determine whether two face images belong to the same
identity.
Existing methods generally address the problem in two
steps: feature extraction and recognition. In the feature
extraction stage, a variety of hand-crafted features are used
[10, 22, 20, 6]. Although some learning-based feature
extraction approaches have been proposed, their optimization targets
Figure 1: The hybrid ConvNet-RBM model. Solid and hol-
low arrows show forward and back propagation directions.
are not directly related to face identity [5, 13]. There-
fore, the features extracted encode intra-personal variations.
More importantly, existing approaches extract features from
each image separately and compare them at later stages
[8, 16, 3, 4]. Some important correlations between the two
compared images have been lost at the feature extraction
stage.
At the recognition stage, classifiers such as SVM are
used to classify two face images as having the same identity
or not [5, 24, 13], or other models are employed to compute
the similarities of two face images [10, 22, 12, 6, 7, 25].
The purpose of these models is to separate inter-personal
variations and intra-personal variations. However, all of
these models have been shown to have shallow structures
[2]. To handle large-scale data with complex distributions,
large amounts of over-complete features may need to be
extracted from the face [12, 7, 25]. Moreover, since the feature
extraction stage and the recognition stage are separate, they
cannot be jointly optimized. Once useful information is lost
in feature extraction, it cannot be recovered in recognition.
On the other hand, without the guidance of recognition, the
best way to design feature descriptors to capture identity
information is not clear.
All of the issues discussed above motivate us to learn a
hybrid deep network to compute face similarities. A high-
level illustration of our model is shown in Figure 1. Our
model has several unique features, as outlined below.
(1) It directly learns visual features from raw pixels
under the supervision of face identities. Instead of
extracting features from each face image separately, the
model jointly extracts relational visual features from two
face images in comparison. In our model, such relational
features are first locally extracted with the automatically
learned filter pairs (pairs of filters convolving with the
two face images respectively as shown in Figure 1), and
then further processed through multiple layers of the deep
convolutional networks (ConvNets) to extract high-level
and global features. The extracted features are effective for
computing the identity similarities of face images.
(2) Considering the regular structures of faces, the deep
ConvNets in our model locally share weights in higher
convolutional layers, such that different mid- or high-level
features are extracted from different face regions, which is
contrary to conventional ConvNet structures [18], and can
greatly improve their fitting and generalization capabilities.
(3) The deep and wide architecture of our hybrid network
can handle large-scale face data with complex distributions.
The deep ConvNets in our network have four convolutional
layers (followed by max-pooling) and two fully-connected
layers. In addition, multiple groups of ConvNets are
constructed to achieve good robustness and characterize
face similarities from different aspects. Predictions from
multiple ConvNet groups are pooled hierarchically and then
associated by the top-layer RBM for the final inference.
(4) The feature extraction and recognition stages are
unified under a single network architecture. The parameters
of the entire pipeline (weights and biases in all the layers)
are jointly optimized for the target of face verification.
2. Related work
All existing methods for face verification start by extract-
ing features from two faces in comparison separately. A
variety of low-level features are commonly used [27, 10,
22, 33, 20, 6], including the hand-crafted features like LBP
[23] and its variants [32], SIFT [21], Gabor [31] and the
learned LE features [5]. Some methods generated mid-
level features [24, 13] with variants of convolutional deep
belief networks (CDBN) [19] or ConvNets [18]. They
are not learned with the supervision of identity matching.
Thus variations other than identity are encoded in the
features, such as poses, illumination, and expressions,
which constitute the main impediment to face recognition.
Many face recognition models are shallow structures,
and need high-dimensional over-complete feature
representations to learn the complex mappings from pairs of
noisy features to face similarities [12, 7, 25]; otherwise,
the models may suffer from inferior performance. Many
methods [5, 24, 13] used linear SVM to make the same-or-
different verification decisions. Li et al.[20] and Chen et
al.[6, 7] factorized the face images as identity variations
plus variations within the same identity, and modeled each
factor as a Gaussian distribution to obtain closed-form solutions.
Huang et al. [12] and Simonyan et al. [25] learn linear
transformations via metric learning.
Some methods further learn high-level features based on
low-level hand-crafted features [16, 3, 4]. They are outputs
of classifiers that are trained to distinguish faces of different
people. All these methods extract features from a single
face separately, and the comparison of two face images
is deferred to the later recognition stage. Some identity
information may have been lost in the feature extraction
stage, and it cannot be retrieved in the recognition stage,
since the two stages are separated in the existing methods.
To avoid the potential information loss and make a reliable
decision, a large number of high-level feature extractors
may need to be trained [3, 4].
There are a few methods that also used deep models
for face verification [8, 24, 13], but extracted features
independently from each face. Thus relations between the
two faces are not modeled at their feature extraction stages.
In [34], face images under various poses and lighting
conditions were transformed to a canonical view with a
convolutional neural network, and features are then extracted
from the transformed images. In contrast, we deal with face
pairs directly by extracting relational visual features from
the two compared faces. The top layer RBM in our model
is similar to that of the deep belief net (DBN) proposed
by Hinton and Osindero [11]. However, we use ConvNets
instead of a stack of RBMs in the lower layers to take the
local correlation in images into consideration. Averaging
the results of multiple ConvNets has been shown to be an
effective way of improving performance [9, 15], while we
will show that our hybrid structure is significantly better
than the simple averaging scheme. Moreover, unlike most
existing face recognition pipelines, in which each stage is
optimized independently, our hybrid ConvNet-RBM model
is jointly optimized after pre-training each part separately,
which further enhances its performance.
3. The hybrid ConvNet-RBM model
3.1. Architecture overview
We detect the two eye centers and mouth center with the
facial point detection method proposed by Sun et al.[26].
Faces are aligned by similarity transformation according to

Figure 2: Architecture of the hybrid ConvNet-RBM model.
Neuron (or feature) number is marked beside each layer.
Figure 3: The structure of one ConvNet. The map numbers
and dimensions of the input layer and all the convolutional
and max-pooling layers are illustrated as the length, width,
and height of cuboids. The 3D convolution kernel sizes
of the convolutional layers and the pooling region sizes
of the max-pooling layers are shown as the small cuboids
and squares inside the large cuboids of maps respectively.
Neuron numbers of other layers are marked beside each
layer.
the three points. Figure 2 is an overview of our hybrid
ConvNet-RBM model, which is a cascade of deep ConvNet
groups, two levels of average pooling, and a Classification
RBM.
The lower part of our hybrid model contains 12 groups,
each of which contains five ConvNets. Figure 3 shows the
structure of one ConvNet. Each ConvNet takes a pair of
aligned face regions as input. Its four convolutional layers
(followed by max-pooling) extract the relational features
hierarchically. Finally, the extracted features pass a fully
connected layer and are fully connected to a single neuron
in layer L0 (shown in Figure 2), which indicates whether
the two regions belong to the same person. The input
region pairs for ConvNets in different groups differ in terms
of region ranges and color channels (shown in Figure 4)
to make their predictions complementary. When the size
of the input regions changes in different groups, the map
sizes in the following layers of the ConvNets will change
accordingly. Although ConvNets in the same group take
the same kind of region pair as input, they are different
in that they are trained with different bootstraps of the
training data (Section 4.1). Each input region pair generates
eight modes by exchanging the two regions and horizontally
flipping each region (shown in Figure 5). When the eight
modes (shown as M1-M8 in Figure 2) are input to the same
Figure 4: Twelve face regions used in our network. P1 -
P4 are global regions covering the whole face, of size 39 ×
31. P1 and P2 (P3 and P4) differ slightly in the ranges of
regions. P5 - P12 are local regions covering different face
parts, of size 31 × 47. P1, P2, and P5 - P8 are in color. P3,
P4, and P9 - P12 are in gray values.
Figure 5: 8 possible modes for a pair of face regions.
ConvNet, eight outputs are generated. Layer L0 contains
the outputs of all the 5 × 12 ConvNets and therefore has
8 × 5 × 12 neurons. The purpose of bootstrapping and data
augmentation is to achieve robustness of predictions.
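To make the eight-mode augmentation concrete, the sketch below enumerates the modes for one region pair: two orderings of the regions times two horizontal-flip states for each region gives 2 × 2 × 2 = 8. The function name and H × W × C array layout are assumptions for illustration, not code from the paper.

```python
def eight_modes(region_a, region_b):
    """Enumerate the 8 input modes of a face-region pair:
    2 orderings x 2 flip states of the first region x 2 of the second.
    Regions are assumed to be H x W x C numpy arrays."""
    modes = []
    for first, second in ((region_a, region_b), (region_b, region_a)):
        for f in (first, first[:, ::-1]):        # horizontal flip reverses width
            for s in (second, second[:, ::-1]):
                modes.append((f, s))
    return modes  # M1-M8, all fed to the same ConvNet
```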
The group prediction is given by two levels of average
pooling of ConvNet predictions. Layer L1 (with 5 × 12
neurons) is formed by averaging the eight predictions of the
same ConvNet from eight different input modes. Layer L2
(with 12 neurons) is formed by averaging the five neurons in
L1 associated with the same group. The prediction variance
is greatly reduced after average pooling.
The top layer of our model in Figure 2 is a Classification
RBM [17]. It merges the 12 group outputs in L2 to
give the final prediction. The RBM has two outputs that
indicate the probability distribution over the two classes;
that is, whether the two faces belong to the same person. The large
number of deep ConvNets means that our model has a high
capacity. Directly optimizing the whole network would
lead to severe over-fitting. Therefore, we first train each
ConvNet separately. Then, by fixing all the ConvNets, the
RBM is trained. All the ConvNets and the RBM are trained
under supervision with the aim of predicting whether two
faces in comparison belong to the same person. These
two steps initialize the model to be near a good local
minimum. Finally, the whole network is fine-tuned by back-
propagating errors from the top-layer RBM to all the lower-
layer ConvNets.
3.2. Deep ConvNets
A pair of gray regions forms two input maps of a
ConvNet (Figure 5), while a pair of color regions forms six

input maps, replacing each gray map with three maps from
RGB channels. The input regions are stacked into multiple
maps instead of being concatenated to form one map, which
enables the ConvNet to model the relations between the two
regions from the first convolutional stage.
Our deep ConvNets contain four convolutional layers
(followed by max-pooling). The operation in each convolutional layer can be expressed as

$$y_j^r = \max\!\left(0,\; b_j^r + \sum_i k_{ij}^r \ast x_i^r\right), \qquad (1)$$

where $\ast$ denotes convolution, $x_i$ and $y_j$ are the $i$-th input map and the $j$-th output map respectively, $k_{ij}$ is the convolution kernel (filter) connecting the $i$-th input map and the $j$-th output map, and $b_j$ is the bias for the $j$-th output map. $\max(0, \cdot)$ is the non-linear activation function, applied element-wise. Neurons with such non-linearities are called rectified linear units [15]. Moreover, weights of neurons (including convolution kernels and biases) in the same map in higher convolutional layers are locally shared; the superscript $r$ indicates a local region within which weights are shared. Since faces are structured objects, locally
sharing weights in higher layers allows the network to learn
different high-level features at different locations. We find
that sharing in this way can significantly improve the fitting
and generalization abilities of the network. The idea of
locally sharing weights was proposed by Huang et al.[13].
However, their model is much shallower than ours and the
gained improvement is small.
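As a concrete reading of Eq. (1), the sketch below applies one convolutional layer to the stacked input maps with numpy/scipy. The locally shared regions $r$ are omitted for brevity, and all names are assumptions for illustration rather than the authors' code.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(x_maps, kernels, biases):
    """One convolutional layer per Eq. (1), with global weight sharing only:
    y_j = max(0, b_j + sum_i k_ij * x_i).
    x_maps: list of 2-D input maps (channels of both faces stacked);
    kernels[i][j]: filter connecting input map i to output map j."""
    n_in, n_out = len(kernels), len(kernels[0])
    out = []
    for j in range(n_out):
        acc = biases[j]
        for i in range(n_in):
            # 'valid' cross-correlation, the usual ConvNet "convolution"
            acc = acc + correlate2d(x_maps[i], kernels[i][j], mode="valid")
        out.append(np.maximum(0.0, acc))     # rectified linear activation
    return out
```

Because the maps of both faces enter the same sum over $i$, even the first layer can combine the two regions, which is exactly why the inputs are stacked rather than concatenated side by side.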
Since each stage extracts features from all the maps in
the previous stage, relations between the two face regions
are modeled; see Figure 6 for examples. As the network
goes deeper, more global and higher-level relations between
the two regions are modeled. These high-level relational
features make it possible for the top layer neurons in
ConvNets to predict the high-level concept of whether the
two input regions come from the same person. The network output is a two-way softmax,

$$y_i = \frac{\exp(x_i)}{\sum_{j=1}^{2} \exp(x_j)} \quad \text{for } i = 1, 2,$$

where $x_i$ is the total input to output neuron $i$, and $y_i$ is its output. It represents a probability distribution over the two classes (being the same person or not). Such a probability distribution makes it valid to directly average multiple ConvNet outputs without scaling. The ConvNets are trained by minimizing $-\log y_t$, where $t \in \{1, 2\}$
denotes the target class. The loss is minimized by stochastic
gradient descent, where the gradient is calculated by back-
propagation.
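For completeness, here is a minimal numeric sketch of the two-way softmax and the $-\log y_t$ training loss (classes indexed 0/1 instead of the paper's 1/2):

```python
import numpy as np

def two_way_softmax(x):
    """x: length-2 vector of total inputs to the two output neurons."""
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()

def verification_loss(x, t):
    """Negative log-likelihood -log(y_t); t in {0, 1} is the target class."""
    return -np.log(two_way_softmax(x)[t])
```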
3.3. Classification RBM
The Classification RBM models the joint distribution between its output neurons $y$ (one out of $C$ classes), input neurons $x$ (binary), and hidden neurons $h$ (binary), as
Figure 6: Examples of the learned 4 × 4 filter pairs
of the first convolutional layer of ConvNets taking color
(line 1) and gray (line 2) input region pairs, respectively.
The upper and lower filters in each pair convolve with
the two face regions in comparison, respectively, and the
results are added. For filter pairs in which one filter
varies greatly while the other remains near uniform (column
1, 2), features are extracted from the two input regions
separately. For those pairs in which both filters vary greatly,
some kind of relations between the two input regions are
extracted. Among the latter, some pairs extract simple
relations such as addition (column 5) or subtraction (column
6), while others extract more complex relations (columns 6,
7). Interestingly, we find that filters in some filter pairs are
nearly the same as those in some others, except that the
order of the two filters are inversed (columns 1-4). This
makes sense since face similarities should be invariant with
the order of the two face regions in comparison.
$$p(y, x, h) \propto e^{-E(y, x, h)}, \quad E(y, x, h) = -h^{\top} W x - h^{\top} U y - b^{\top} x - c^{\top} h - d^{\top} y.$$

Given input $x$, the conditional probability of its output $y$ can be explicitly expressed as

$$p(y_c \mid x) = \frac{e^{d_c} \prod_j \left(1 + e^{\,c_j + U_{jc} + \sum_k W_{jk} x_k}\right)}{\sum_i e^{d_i} \prod_j \left(1 + e^{\,c_j + U_{ji} + \sum_k W_{jk} x_k}\right)}, \qquad (2)$$
where $c$ indicates the $c$-th class. We discriminatively train the Classification RBM by minimizing the negative log probability of the target class $t$ given input $x$; that is, minimizing $-\log p(y_t \mid x)$. The target can be optimized by computing the exact gradient $\frac{\partial \log p(y_t \mid x)}{\partial \theta}$, where $\theta \in \{W, U, b, c, d\}$ are the RBM parameters to be learned.
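Eq. (2) can be evaluated in closed form; the sketch below does so in the log domain with softplus terms for numerical stability. Note that the visible bias $b$ cancels in the conditional, so it does not appear. Shapes and names are assumptions for illustration.

```python
import numpy as np

def rbm_class_probs(x, W, U, c, d):
    """p(y | x) for a Classification RBM, following Eq. (2).
    x: (n_visible,) input; W: (n_hidden, n_visible);
    U: (n_hidden, n_classes); c: (n_hidden,); d: (n_classes,)."""
    wx = W @ x + c                     # c_j + sum_k W_jk x_k for each hidden j
    # log[e^{d_c} * prod_j (1 + e^{wx_j + U_jc})] = d_c + sum_j softplus(...)
    log_scores = d + np.logaddexp(0.0, wx[:, None] + U).sum(axis=0)
    log_scores -= log_scores.max()     # stable softmax normalization
    p = np.exp(log_scores)
    return p / p.sum()
```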
3.4. Fine-tuning the entire network
Let $N$ and $M$ be the number of groups and the number of ConvNets in each group, respectively, and let $C_m^n(\cdot)$ be the input-output mapping for the $m$-th ConvNet in the $n$-th group. Since the two outputs of a ConvNet represent a probability distribution (summing to 1), when one output is known, the other contains no additional information. So the hybrid model (and the mapping) keeps only the first output of each ConvNet. Let $\{I_k^n\}_{k=1}^{K}$ be the $K$ possible input modes formed by a pair of face regions of group $n$.

Then the $n$-th ConvNet group prediction can be expressed as

$$x^n = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{K} \sum_{k=1}^{K} C_m^n(I_k^n), \qquad (3)$$

where the inner and outer sums are over different input modes (level 1 pooling) and different ConvNets (level 2 pooling), respectively. Given the $N$ group predictions $\{x^n\}_{n=1}^{N}$, the final prediction by the RBM is $\max_{c \in \{1, 2\}} \{p(y_c \mid x)\}$, where $p(y_c \mid x)$ is defined in Eq. (2). After separately training each ConvNet and the RBM to derive a good initialization, error is back-propagated from the RBM to all groups of ConvNets and the whole model is fine-tuned. Let $L(x) = -\log p(y_t \mid x)$ be the RBM loss function, and let $\alpha_m^n$ be the parameters of the $m$-th ConvNet in the $n$-th group. The gradient of the loss w.r.t. $\alpha_m^n$ is

$$\frac{\partial L}{\partial \alpha_m^n} = \frac{\partial L}{\partial x^n} \cdot \frac{\partial x^n}{\partial \alpha_m^n} = \frac{1}{MK} \frac{\partial L}{\partial x^n} \sum_{k=1}^{K} \frac{\partial C_m^n(I_k^n)}{\partial \alpha_m^n}. \qquad (4)$$

$\frac{\partial L}{\partial x^n}$ can be calculated from the closed-form expression of $p(y_t \mid x)$ (Eq. (2)), and $\frac{\partial C_m^n(I_k^n)}{\partial \alpha_m^n}$ can be calculated using the back-propagation algorithm in the ConvNet.
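Putting Eqs. (2) and (3) together, the inference pass of the hybrid model might look like the sketch below, where each ConvNet is abstracted as a callable returning its first softmax output (an assumption for illustration, not the authors' code; `rbm_class_probs` is the sketch from Section 3.3):

```python
import numpy as np

def group_prediction(convnets, input_modes):
    """Eq. (3): level-1 pooling over the K input modes, then
    level-2 pooling over the M ConvNets of one group."""
    return float(np.mean([np.mean([net(mode) for mode in input_modes])
                          for net in convnets]))

def hybrid_forward(groups, modes_per_group, rbm_params):
    """Pool every group into x^n, then classify with the top-layer RBM."""
    x = np.array([group_prediction(nets, modes)
                  for nets, modes in zip(groups, modes_per_group)])
    p = rbm_class_probs(x, *rbm_params)    # Eq. (2), sketched in Section 3.3
    return int(np.argmax(p)), p            # predicted class and distribution
```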
4. Experiments
We evaluate our algorithm on LFW [14], which has been
used extensively to evaluate algorithms of face verification
in the wild. We conduct evaluation under two different
settings: (1) 10-fold cross validation under the unrestricted
protocol of LFW without using extra data to train the
model, and (2) cross-dataset validation in which external
data exclusive to LFW is used for training. The former
shows the performance with a limited amount of training
data, while the latter shows the generalization ability across
different datasets. Section 4.1 explains the experimental
settings in detail, section 4.2 validates various aspects of
model design, and section 4.3 compares our results with
state-of-the-art results in the literature.
4.1. Experiment settings
LFW is divided into 10 folds of mutually exclusive
people sets. For the unrestricted setting, performance is
evaluated using the 10-fold cross-validation. Each time
one fold is used for testing and the other nine for training.
Results averaged over the 10 folds are reported. The 600
testing pairs in each fold are predefined by LFW and fixed,
whereas training pairs can be generated using the identity
information in the other nine folds and the number is not
limited. This is referred to as the LFW training setting.
For the cross-dataset setting, we use outside data ex-
clusive to LFW for training. PubFig [16] and WDRef [6]
are two large datasets other than LFW with faces in the
wild. However, PubFig only contains 200 people, thus
the identity variation is quite limited, while the images
in WDRef are not publicly available. Accordingly, we
created a new dataset, called the Celebrity Faces dataset
(CelebFaces). It contains 87,628 face images of 5,436
celebrities from the web, and was assembled by first
collecting the celebrity names that do not exist in LFW to
avoid any overlap, then searching for the face images for
each name on the web. To conduct cross-dataset testing, the
model is trained on CelebFaces and tested on the predefined
6,000 test pairs in LFW. We will refer to this setting as the
CelebFaces training setting.
For both settings, we randomly choose 80% people from
the training data to train the deep ConvNets, and use the
remaining 20% people to train the top-layer RBM and
fine-tune the entire model. The positive training pairs are
randomly formed such that on average each face image
appears in k = 6 (3) positive pairs for the LFW (CelebFaces)
dataset, unless a person does not have enough training im-
ages. Given a fixed number of training images, generating
more training pairs provides little additional benefit. Negative
training pairs are also randomly generated and their number
is the same as the number of positive training pairs. In this
way, we generate approximately 40,000 (240,000) training
pairs for the ConvNets and 8,000 (50,000) training pairs
for the RBM and fine-tuning for the LFW (CelebFaces) training
dataset. This random process for generating training data
is repeated for each ConvNet so that multiple different
ConvNets are trained in each group.
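The paper gives no pseudocode for this sampling; one plausible reading follows, where the cap of roughly k positive pairs per image and all names are assumptions for illustration.

```python
import random
from itertools import combinations

def generate_pairs(images_by_id, k):
    """images_by_id: dict mapping identity -> list of images.
    Each positive pair consumes two images, so capping a person at
    k * n_images / 2 pairs puts each image in about k positive pairs."""
    positives = []
    for imgs in images_by_id.values():
        pairs = list(combinations(imgs, 2))
        random.shuffle(pairs)
        positives.extend(pairs[: k * len(imgs) // 2])
    ids = list(images_by_id)
    negatives = []
    while len(negatives) < len(positives):
        a, b = random.sample(ids, 2)       # two different identities
        negatives.append((random.choice(images_by_id[a]),
                          random.choice(images_by_id[b])))
    return positives, negatives
```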
A separate validation dataset is needed during training to
avoid overfitting. After each training epoch¹, we observe
the errors on the validation dataset and select the model
that provides the lowest validation error. We randomly
select 100 people from the training set to generate
the validation data. The free parameters in training (the
learning rate and its decreasing rate) are selected using view
1 of LFW² and are fixed in all the experiments. We report
both the average accuracy and the ROC curve. The average
accuracy is defined as the percentage of correctly classified
face pairs. We assign each face pair to the class with the higher
probability, without further learning a threshold for the
final classification.
4.2. Investigation on model design
Local weight sharing. Our ConvNets locally share
weights in the last two convolutional layers. In the second-to-last
convolutional layer, maps are evenly divided into
2 × 2 regions, and weights are shared among neurons in
each region. In the last convolutional layer, weights are
independent for each neuron. We compare our ConvNets
¹ One training epoch is a single pass of all the training samples.
² View 1 is provided by LFW for algorithm development and parameter selection without over-fitting the test data [14].

Citations
Book ChapterDOI

A Discriminative Feature Learning Approach for Deep Face Recognition

TL;DR: This paper proposes a new supervision signal, called center loss, for face recognition task, which simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers.
Proceedings ArticleDOI

Deep Learning Face Representation from Predicting 10,000 Classes

TL;DR: It is argued that DeepID can be effectively learned through challenging multi-class face identification tasks, whilst they can be generalized to other tasks (such as verification) and new identities unseen in the training set.
Proceedings Article

Deep Learning Face Representation by Joint Identification-Verification

TL;DR: This paper shows that the face identification-verification task can be well solved with deep learning and using both face identification and verification signals as supervision, and the error rate has been significantly reduced.
Proceedings Article

Convolutional Neural Network Architectures for Matching Natural Language Sentences

TL;DR: Convolutional neural network models for matching two sentences are proposed, by adapting the convolutional strategy in vision and speech and nicely represent the hierarchical structures of sentences with their layer-by-layer composition and pooling.
Proceedings ArticleDOI

Deeply learned face representations are sparse, selective, and robust

TL;DR: DeepID2+ improves performance by increasing the dimension of hidden representations and adding supervision to early convolutional layers, achieving state-of-the-art results on the LFW and YouTube Faces benchmarks.
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network with five convolutional layers, some followed by max-pooling layers, and three fully-connected layers ending in a 1000-way softmax achieved state-of-the-art ImageNet classification performance.
Journal ArticleDOI

Distinctive Image Features from Scale-Invariant Keypoints

TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Journal ArticleDOI

Gradient-based learning applied to document recognition

TL;DR: Reviews gradient-based learning and proposes graph transformer networks (GTNs), showing that multilayer networks trained with back-propagation can synthesize complex decision surfaces that classify high-dimensional patterns such as handwritten characters.
Journal ArticleDOI

A fast learning algorithm for deep belief nets

TL;DR: A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
Journal ArticleDOI

Multiresolution gray-scale and rotation invariant texture classification with local binary patterns

TL;DR: A generalized gray-scale and rotation invariant operator presentation that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution and presents a method for combining multiple operators for multiresolution analysis.