
Hybrid Deep Learning for Face Verification
Yi Sun¹    Xiaogang Wang²,³    Xiaoou Tang¹,³
¹Department of Information Engineering, The Chinese University of Hong Kong
²Department of Electronic Engineering, The Chinese University of Hong Kong
³Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
sy011@ie.cuhk.edu.hk xgwang@ee.cuhk.edu.hk xtang@ie.cuhk.edu.hk
Abstract
This paper proposes a hybrid convolutional network
(ConvNet)-Restricted Boltzmann Machine (RBM) model for
face verification in wild conditions. A key contribution
of this work is to directly learn relational visual features,
which indicate identity similarities, from raw pixels of face
pairs with a hybrid deep network. The deep ConvNets
in our model mimic the primary visual cortex to jointly
extract local relational visual features from two face images
compared with the learned filter pairs. These relational
features are further processed through multiple layers to
extract high-level and global features. Multiple groups of
ConvNets are constructed in order to achieve robustness
and characterize face similarities from different aspects.
The top-layer RBM performs inference from complementary
high-level features extracted from different ConvNet groups
with a two-level average pooling hierarchy. The entire
hybrid deep network is jointly fine-tuned to optimize for the
task of face verification. Our model achieves competitive
face verification performance on the LFW dataset.
1. Introduction
Face recognition has been extensively studied in recent
decades [29, 28, 30, 1, 16, 5, 33, 12, 6, 3, 7, 25, 34].
This paper addresses the key challenge of computing the
similarity of two face images given their large intra-
personal variations in poses, illuminations, expressions,
ages, makeups, and occlusions. It becomes more difficult
when faces to be compared are acquired in the wild.
We focus on the task of face verification, which aims to
determine whether two face images belong to the same
identity.
Existing methods generally address the problem in two
steps: feature extraction and recognition. In the feature
extraction stage, a variety of hand-crafted features are used
[10, 22, 20, 6]. Although some learning-based feature
extraction approaches have been proposed, their optimization targets
Figure 1: The hybrid ConvNet-RBM model. Solid and hol-
low arrows show forward and back propagation directions.
are not directly related to face identity [5, 13]. There-
fore, the features extracted encode intra-personal variations.
More importantly, existing approaches extract features from
each image separately and compare them at later stages
[8, 16, 3, 4]. Some important correlations between the two
compared images have been lost at the feature extraction
stage.
At the recognition stage, classifiers such as SVM are
used to classify two face images as having the same identity
or not [5, 24, 13], or other models are employed to compute
the similarities of two face images [10, 22, 12, 6, 7, 25].
The purpose of these models is to separate inter-personal
variations and intra-personal variations. However, all of
these models have been shown to have shallow structures
[2]. To handle large-scale data with complex distributions,
large amounts of over-complete features may need to be
extracted from the face [12, 7, 25]. Moreover, since the feature
extraction stage and the recognition stage are separate, they
cannot be jointly optimized. Once useful information is lost
in feature extraction, it cannot be recovered in recognition.
On the other hand, without the guidance of recognition, the
best way to design feature descriptors to capture identity
information is not clear.
All of the issues discussed above motivate us to learn a
hybrid deep network to compute face similarities. A high-
level illustration of our model is shown in Figure 1. Our
model has several unique features, as outlined below.
(1) It directly learns visual features from raw pixels
under the supervision of face identities. Instead of
extracting features from each face image separately, the
model jointly extracts relational visual features from two
face images in comparison. In our model, such relational
features are first locally extracted with the automatically
learned filter pairs (pairs of filters convolving with the
two face images respectively as shown in Figure 1), and
then further processed through multiple layers of the deep
convolutional networks (ConvNets) to extract high-level
and global features. The extracted features are effective for
computing the identity similarities of face images.
(2) Considering the regular structures of faces, the deep
ConvNets in our model locally share weights in higher
convolutional layers, such that different mid- or high-level
features are extracted from different face regions, which is
contrary to conventional ConvNet structures [18], and can
greatly improve their fitting and generalization capabilities.
(3) The deep and wide architecture of our hybrid network
can handle large-scale face data with complex distributions.
The deep ConvNets in our network have four convolutional
layers (followed by max-pooling) and two fully-connected
layers. In addition, multiple groups of ConvNets are
constructed to achieve good robustness and characterize
face similarities from different aspects. Predictions from
multiple ConvNet groups are pooled hierarchically and then
associated by the top-layer RBM for the final inference.
(4) The feature extraction and recognition stages are
unified under a single network architecture. The parameters
of the entire pipeline (weights and biases in all the layers)
are jointly optimized for the target of face verification.
2. Related work
All existing methods for face verification start by extract-
ing features from two faces in comparison separately. A
variety of low-level features are commonly used [27, 10,
22, 33, 20, 6], including the hand-crafted features like LBP
[23] and its variants [32], SIFT [21], Gabor [31] and the
learned LE features [5]. Some methods generated mid-
level features [24, 13] with variants of convolutional deep
belief networks (CDBN) [19] or ConvNets [18]. They
are not learned with the supervision of identity matching.
Thus variations other than identity are encoded in the
features, such as poses, illumination, and expressions,
which constitute the main impediment to face recognition.
Many face recognition models are shallow structures,
and need high-dimensional over-complete feature
representations to learn the complex mappings from pairs of
noisy features to face similarities [12, 7, 25]; otherwise,
the models may suffer from inferior performance. Many
methods [5, 24, 13] used linear SVM to make the same-or-
different verification decisions. Li et al.[20] and Chen et
al.[6, 7] factorized the face images as identity variations
plus variations within the same identity, and modeled each
factor as a Gaussian distribution to obtain closed-form solutions.
Huang et al. [12] and Simonyan et al. [25] learn linear
transformations via metric learning.
Some methods further learn high-level features based on
low-level hand-crafted features [16, 3, 4]. They are outputs
of classifiers that are trained to distinguish faces of different
people. All these methods extract features from a single
face separately, and the comparison of two face images
is deferred to the later recognition stage. Some identity
information may have been lost in the feature extraction
stage, and it cannot be retrieved in the recognition stage,
since the two stages are separated in the existing methods.
To avoid the potential information loss and make a reliable
decision, a large number of high-level feature extractors
may need to be trained [3, 4].
There are a few methods that also used deep models
for face verification [8, 24, 13], but extracted features
independently from each face. Thus relations between the
two faces are not modeled at their feature extraction stages.
In [34], face images under various poses and lighting
conditions were transformed to a canonical view with a
convolutional neural network, and features are then extracted
from the transformed images. In contrast, we deal with face
pairs directly by extracting relational visual features from
the two compared faces. The top layer RBM in our model
is similar to that of the deep belief net (DBN) proposed
by Hinton and Osindero [11]. However, we use ConvNets
instead of a stack of RBMs in the lower layers to take the
local correlation in images into consideration. Averaging
the results of multiple ConvNets has been shown to be an
effective way of improving performance [9, 15], while we
will show that our hybrid structure is significantly better
than the simple averaging scheme. Moreover, unlike most
existing face recognition pipelines, in which each stage is
optimized independently, our hybrid ConvNet-RBM model
is jointly optimized after pre-training each part separately,
which further enhances its performance.
3. The hybrid ConvNet-RBM model
3.1. Architecture overview
We detect the two eye centers and mouth center with the
facial point detection method proposed by Sun et al.[26].
Faces are aligned by similarity transformation according to

Figure 2: Architecture of the hybrid ConvNet-RBM model.
Neuron (or feature) number is marked beside each layer.
Figure 3: The structure of one ConvNet. The map numbers
and dimensions of the input layer and all the convolutional
and max-pooling layers are illustrated as the length, width,
and height of cuboids. The 3D convolution kernel sizes
of the convolutional layers and the pooling region sizes
of the max-pooling layers are shown as the small cuboids
and squares inside the large cuboids of maps respectively.
Neuron numbers of other layers are marked beside each
layer.
the three points. Figure 2 is an overview of our hybrid
ConvNet-RBM model, which is a cascade of deep ConvNet
groups, two levels of average pooling, and a Classification
RBM.
The lower part of our hybrid model contains 12 groups,
each of which contains five ConvNets. Figure 3 shows the
structure of one ConvNet. Each ConvNet takes a pair of
aligned face regions as input. Its four convolutional layers
(followed by max-pooling) extract the relational features
hierarchically. Finally, the extracted features pass a fully
connected layer and are fully connected to a single neuron
in layer L0 (shown in Figure 2), which indicates whether
the two regions belong to the same person. The input
region pairs for ConvNets in different groups differ in terms
of region ranges and color channels (shown in Figure 4)
to make their predictions complementary. When the size
of the input regions changes in different groups, the map
sizes in the following layers of the ConvNets will change
accordingly. Although ConvNets in the same group take
the same kind of region pair as input, they are different
in that they are trained with different bootstraps of the
training data (Section 4.1). Each input region pair generates
eight modes by exchanging the two regions and horizontally
flipping each region (shown in Figure 5). When the eight
modes (shown as M1-M8 in Figure 2) are input to the same
Figure 4: Twelve face regions used in our network. P1 -
P4 are global regions covering the whole face, of size 39 ×
31. P1 and P2 (P3 and P4) differ slightly in the ranges of
regions. P5 - P12 are local regions covering different face
parts, of size 31 × 47. P1, P2, and P5 - P8 are in color. P3,
P4, and P9 - P12 are in gray values.
Figure 5: 8 possible modes for a pair of face regions.
ConvNet, eight outputs are generated. Layer L0 contains
the outputs of all the 5 × 12 ConvNets and therefore has
8 × 5 × 12 neurons. The purpose of bootstrapping and data
augmentation is to achieve robustness of predictions.
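To make the eight-mode augmentation concrete, the sketch below enumerates the modes for one region pair: two orderings of the regions times two horizontal-flip states for each region gives 2 × 2 × 2 = 8. The function name and H × W × C array layout are assumptions for illustration, not code from the paper.

```python
def eight_modes(region_a, region_b):
    """Enumerate the 8 input modes of a face-region pair:
    2 orderings x 2 flip states of the first region x 2 of the second.
    Regions are assumed to be H x W x C numpy arrays."""
    modes = []
    for first, second in ((region_a, region_b), (region_b, region_a)):
        for f in (first, first[:, ::-1]):        # horizontal flip reverses width
            for s in (second, second[:, ::-1]):
                modes.append((f, s))
    return modes  # M1-M8, all fed to the same ConvNet
```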
The group prediction is given by two levels of average
pooling of ConvNet predictions. Layer L1 (with 5 × 12
neurons) is formed by averaging the eight predictions of the
same ConvNet from eight different input modes. Layer L2
(with 12 neurons) is formed by averaging the five neurons in
L1 associated with the same group. The prediction variance
is greatly reduced after average pooling.
The top layer of our model in Figure 2 is a Classification
RBM [17]. It merges the 12 group outputs in L2 to
give the final prediction. The RBM has two outputs that
indicate the probability distribution over the two classes;
that is, whether the two faces belong to the same person. The large
number of deep ConvNets means that our model has a high
capacity. Directly optimizing the whole network would
lead to severe over-fitting. Therefore, we first train each
ConvNet separately. Then, by fixing all the ConvNets, the
RBM is trained. All the ConvNets and the RBM are trained
under supervision with the aim of predicting whether two
faces in comparison belong to the same person. These
two steps initialize the model to be near a good local
minimum. Finally, the whole network is fine-tuned by back-
propagating errors from the top-layer RBM to all the lower-
layer ConvNets.
3.2. Deep ConvNets
A pair of gray regions forms two input maps of a
ConvNet (Figure 5), while a pair of color regions forms six

input maps, replacing each gray map with three maps from
RGB channels. The input regions are stacked into multiple
maps instead of being concatenated to form one map, which
enables the ConvNet to model the relations between the two
regions from the first convolutional stage.
Our deep ConvNets contain four convolutional layers
(followed by max-pooling). The operation in each convolutional layer can be expressed as

$$y_j^r = \max\!\left(0,\; b_j^r + \sum_i k_{ij}^r \ast x_i^r\right), \qquad (1)$$

where $\ast$ denotes convolution, $x_i$ and $y_j$ are the $i$-th input map and the $j$-th output map respectively, $k_{ij}$ is the convolution kernel (filter) connecting the $i$-th input map and the $j$-th output map, and $b_j$ is the bias for the $j$-th output map. $\max(0, \cdot)$ is the non-linear activation function, applied element-wise. Neurons with such non-linearities are called rectified linear units [15]. Moreover, weights of neurons (including convolution kernels and biases) in the same map in higher convolutional layers are locally shared; the superscript $r$ indicates a local region within which weights are shared. Since faces are structured objects, locally
sharing weights in higher layers allows the network to learn
different high-level features at different locations. We find
that sharing in this way can significantly improve the fitting
and generalization abilities of the network. The idea of
locally sharing weights was proposed by Huang et al.[13].
However, their model is much shallower than ours and the
gained improvement is small.
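As a concrete reading of Eq. (1), the sketch below applies one convolutional layer to the stacked input maps with numpy/scipy. The locally shared regions $r$ are omitted for brevity, and all names are assumptions for illustration rather than the authors' code.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(x_maps, kernels, biases):
    """One convolutional layer per Eq. (1), with global weight sharing only:
    y_j = max(0, b_j + sum_i k_ij * x_i).
    x_maps: list of 2-D input maps (channels of both faces stacked);
    kernels[i][j]: filter connecting input map i to output map j."""
    n_in, n_out = len(kernels), len(kernels[0])
    out = []
    for j in range(n_out):
        acc = biases[j]
        for i in range(n_in):
            # 'valid' cross-correlation, the usual ConvNet "convolution"
            acc = acc + correlate2d(x_maps[i], kernels[i][j], mode="valid")
        out.append(np.maximum(0.0, acc))     # rectified linear activation
    return out
```

Because the maps of both faces enter the same sum over $i$, even the first layer can combine the two regions, which is exactly why the inputs are stacked rather than concatenated side by side.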
Since each stage extracts features from all the maps in
the previous stage, relations between the two face regions
are modeled; see Figure 6 for examples. As the network
goes deeper, more global and higher-level relations between
the two regions are modeled. These high-level relational
features make it possible for the top layer neurons in
ConvNets to predict the high-level concept of whether the
two input regions come from the same person. The network output is a two-way softmax,

$$y_i = \frac{\exp(x_i)}{\sum_{j=1}^{2} \exp(x_j)} \quad \text{for } i = 1, 2,$$

where $x_i$ is the total input to output neuron $i$, and $y_i$ is its output. It represents a probability distribution over the two classes (being the same person or not). Such a probability distribution makes it valid to directly average multiple ConvNet outputs without scaling. The ConvNets are trained by minimizing $-\log y_t$, where $t \in \{1, 2\}$
denotes the target class. The loss is minimized by stochastic
gradient descent, where the gradient is calculated by back-
propagation.
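For completeness, here is a minimal numeric sketch of the two-way softmax and the $-\log y_t$ training loss (classes indexed 0/1 instead of the paper's 1/2):

```python
import numpy as np

def two_way_softmax(x):
    """x: length-2 vector of total inputs to the two output neurons."""
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()

def verification_loss(x, t):
    """Negative log-likelihood -log(y_t); t in {0, 1} is the target class."""
    return -np.log(two_way_softmax(x)[t])
```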
3.3. Classification RBM
The Classification RBM models the joint distribution between its output neurons $y$ (one out of $C$ classes), input neurons $x$ (binary), and hidden neurons $h$ (binary), as
Figure 6: Examples of the learned 4 × 4 filter pairs
of the first convolutional layer of ConvNets taking color
(line 1) and gray (line 2) input region pairs, respectively.
The upper and lower filters in each pair convolve with
the two face regions in comparison, respectively, and the
results are added. For filter pairs in which one filter
varies greatly while the other remains near uniform (column
1, 2), features are extracted from the two input regions
separately. For those pairs in which both filters vary greatly,
some kind of relations between the two input regions are
extracted. Among the latter, some pairs extract simple
relations such as addition (column 5) or subtraction (column
6), while others extract more complex relations (columns 6,
7). Interestingly, we find that filters in some filter pairs are
nearly the same as those in some others, except that the
order of the two filters are inversed (columns 1-4). This
makes sense since face similarities should be invariant with
the order of the two face regions in comparison.
$$p(y, x, h) \propto e^{-E(y, x, h)}, \quad E(y, x, h) = -h^{\top} W x - h^{\top} U y - b^{\top} x - c^{\top} h - d^{\top} y.$$

Given input $x$, the conditional probability of its output $y$ can be explicitly expressed as

$$p(y_c \mid x) = \frac{e^{d_c} \prod_j \left(1 + e^{\,c_j + U_{jc} + \sum_k W_{jk} x_k}\right)}{\sum_i e^{d_i} \prod_j \left(1 + e^{\,c_j + U_{ji} + \sum_k W_{jk} x_k}\right)}, \qquad (2)$$
where $c$ indicates the $c$-th class. We discriminatively train the Classification RBM by minimizing the negative log probability of the target class $t$ given input $x$; that is, minimizing $-\log p(y_t \mid x)$. The target can be optimized by computing the exact gradient $\frac{\partial \log p(y_t \mid x)}{\partial \theta}$, where $\theta \in \{W, U, b, c, d\}$ are the RBM parameters to be learned.
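Eq. (2) can be evaluated in closed form; the sketch below does so in the log domain with softplus terms for numerical stability. Note that the visible bias $b$ cancels in the conditional, so it does not appear. Shapes and names are assumptions for illustration.

```python
import numpy as np

def rbm_class_probs(x, W, U, c, d):
    """p(y | x) for a Classification RBM, following Eq. (2).
    x: (n_visible,) input; W: (n_hidden, n_visible);
    U: (n_hidden, n_classes); c: (n_hidden,); d: (n_classes,)."""
    wx = W @ x + c                     # c_j + sum_k W_jk x_k for each hidden j
    # log[e^{d_c} * prod_j (1 + e^{wx_j + U_jc})] = d_c + sum_j softplus(...)
    log_scores = d + np.logaddexp(0.0, wx[:, None] + U).sum(axis=0)
    log_scores -= log_scores.max()     # stable softmax normalization
    p = np.exp(log_scores)
    return p / p.sum()
```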
3.4. Fine-tuning the entire network
Let $N$ and $M$ be the number of groups and the number of ConvNets in each group, respectively, and let $C_m^n(\cdot)$ be the input-output mapping for the $m$-th ConvNet in the $n$-th group. Since the two outputs of a ConvNet represent a probability distribution (summing to 1), when one output is known, the other contains no additional information. So the hybrid model (and the mapping) keeps only the first output of each ConvNet. Let $\{I_k^n\}_{k=1}^{K}$ be the $K$ possible input modes formed by a pair of face regions of group $n$.

Then the $n$-th ConvNet group prediction can be expressed as

$$x^n = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{K} \sum_{k=1}^{K} C_m^n(I_k^n), \qquad (3)$$

where the inner and outer sums are over different input modes (level 1 pooling) and different ConvNets (level 2 pooling), respectively. Given the $N$ group predictions $\{x^n\}_{n=1}^{N}$, the final prediction by the RBM is $\max_{c \in \{1, 2\}} \{p(y_c \mid x)\}$, where $p(y_c \mid x)$ is defined in Eq. (2). After separately training each ConvNet and the RBM to derive a good initialization, error is back-propagated from the RBM to all groups of ConvNets and the whole model is fine-tuned. Let $L(x) = -\log p(y_t \mid x)$ be the RBM loss function, and let $\alpha_m^n$ be the parameters of the $m$-th ConvNet in the $n$-th group. The gradient of the loss w.r.t. $\alpha_m^n$ is

$$\frac{\partial L}{\partial \alpha_m^n} = \frac{\partial L}{\partial x^n} \cdot \frac{\partial x^n}{\partial \alpha_m^n} = \frac{1}{MK} \frac{\partial L}{\partial x^n} \sum_{k=1}^{K} \frac{\partial C_m^n(I_k^n)}{\partial \alpha_m^n}. \qquad (4)$$

$\frac{\partial L}{\partial x^n}$ can be calculated from the closed-form expression of $p(y_t \mid x)$ (Eq. (2)), and $\frac{\partial C_m^n(I_k^n)}{\partial \alpha_m^n}$ can be calculated using the back-propagation algorithm in the ConvNet.
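Putting Eqs. (2) and (3) together, the inference pass of the hybrid model might look like the sketch below, where each ConvNet is abstracted as a callable returning its first softmax output (an assumption for illustration, not the authors' code; `rbm_class_probs` is the sketch from Section 3.3):

```python
import numpy as np

def group_prediction(convnets, input_modes):
    """Eq. (3): level-1 pooling over the K input modes, then
    level-2 pooling over the M ConvNets of one group."""
    return float(np.mean([np.mean([net(mode) for mode in input_modes])
                          for net in convnets]))

def hybrid_forward(groups, modes_per_group, rbm_params):
    """Pool every group into x^n, then classify with the top-layer RBM."""
    x = np.array([group_prediction(nets, modes)
                  for nets, modes in zip(groups, modes_per_group)])
    p = rbm_class_probs(x, *rbm_params)    # Eq. (2), sketched in Section 3.3
    return int(np.argmax(p)), p            # predicted class and distribution
```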
4. Experiments
We evaluate our algorithm on LFW [14], which has been
used extensively to evaluate algorithms of face verification
in the wild. We conduct evaluation under two different
settings: (1) 10-fold cross validation under the unrestricted
protocol of LFW without using extra data to train the
model, and (2) cross-dataset validation in which external
data exclusive to LFW is used for training. The former
shows the performance with a limited amount of training
data, while the latter shows the generalization ability across
different datasets. Section 4.1 explains the experimental
settings in detail, section 4.2 validates various aspects of
model design, and section 4.3 compares our results with
state-of-the-art results in the literature.
4.1. Experiment settings
LFW is divided into 10 folds of mutually exclusive
people sets. For the unrestricted setting, performance is
evaluated using the 10-fold cross-validation. Each time
one fold is used for testing and the other nine for training.
Results averaged over the 10 folds are reported. The 600
testing pairs in each fold are predefined by LFW and fixed,
whereas training pairs can be generated using the identity
information in the other nine folds and the number is not
limited. This is referred to as the LFW training setting.
For the cross-dataset setting, we use outside data ex-
clusive to LFW for training. PubFig [16] and WDRef [6]
are two large datasets other than LFW with faces in the
wild. However, PubFig only contains 200 people, thus
the identity variation is quite limited, while the images
in WDRef are not publicly available. Accordingly, we
created a new dataset, called the Celebrity Faces dataset
(CelebFaces). It contains 87,628 face images of 5,436
celebrities from the web, and was assembled by first
collecting the celebrity names that do not exist in LFW to
avoid any overlap, then searching for the face images for
each name on the web. To conduct cross-dataset testing, the
model is trained on CelebFaces and tested on the predefined
6,000 test pairs in LFW. We will refer to this setting as the
CelebFaces training setting.
For both settings, we randomly choose 80% people from
the training data to train the deep ConvNets, and use the
remaining 20% people to train the top-layer RBM and
fine-tune the entire model. The positive training pairs are
randomly formed such that on average each face image
appears in k = 6 (3) positive pairs for the LFW (CelebFaces)
dataset, unless a person does not have enough training im-
ages. Given a fixed number of training images, generating
more training pairs provides little additional benefit. Negative
training pairs are also randomly generated and their number
is the same as the number of positive training pairs. In this
way, we generate approximately 40,000 (240,000) training
pairs for the ConvNets and 8,000 (50,000) training pairs
for the RBM and fine-tuning for the LFW (CelebFaces) training
dataset. This random process for generating training data
is repeated for each ConvNet so that multiple different
ConvNets are trained in each group.
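The paper gives no pseudocode for this sampling; one plausible reading follows, where the cap of roughly k positive pairs per image and all names are assumptions for illustration.

```python
import random
from itertools import combinations

def generate_pairs(images_by_id, k):
    """images_by_id: dict mapping identity -> list of images.
    Each positive pair consumes two images, so capping a person at
    k * n_images / 2 pairs puts each image in about k positive pairs."""
    positives = []
    for imgs in images_by_id.values():
        pairs = list(combinations(imgs, 2))
        random.shuffle(pairs)
        positives.extend(pairs[: k * len(imgs) // 2])
    ids = list(images_by_id)
    negatives = []
    while len(negatives) < len(positives):
        a, b = random.sample(ids, 2)       # two different identities
        negatives.append((random.choice(images_by_id[a]),
                          random.choice(images_by_id[b])))
    return positives, negatives
```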
A separate validation dataset is needed during training to
avoid overfitting. After each training epoch¹, we observe
the errors on the validation dataset and select the model
that provides the lowest validation error. We randomly
select 100 people from the training set to generate
the validation data. The free parameters in training (the
learning rate and its decreasing rate) are selected using view
1 of LFW² and are fixed in all the experiments. We report
both the average accuracy and the ROC curve. The average
accuracy is defined as the percentage of correctly classified
face pairs. We assign each face pair to the class with the higher
probability, without further learning a threshold for the
final classification.
4.2. Investigation on model design
Local weight sharing. Our ConvNets locally share
weights in the last two convolutional layers. In the second-to-last
convolutional layer, maps are evenly divided into
2 × 2 regions, and weights are shared among neurons in
each region. In the last convolutional layer, weights are
independent for each neuron. We compare our ConvNets
¹ One training epoch is a single pass of all the training samples.
² View 1 is provided by LFW for algorithm development and parameter selection without over-fitting the test data [14].

Citations
Book ChapterDOI

A Discriminative Feature Learning Approach for Deep Face Recognition

TL;DR: This paper proposes a new supervision signal, called center loss, for face recognition task, which simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers.
Proceedings ArticleDOI

Deep Learning Face Representation from Predicting 10,000 Classes

TL;DR: It is argued that DeepID can be effectively learned through challenging multi-class face identification tasks, whilst they can be generalized to other tasks (such as verification) and new identities unseen in the training set.
Proceedings Article

Deep Learning Face Representation by Joint Identification-Verification

TL;DR: This paper shows that the face identification-verification task can be well solved with deep learning and using both face identification and verification signals as supervision, and the error rate has been significantly reduced.
Proceedings Article

Convolutional Neural Network Architectures for Matching Natural Language Sentences

TL;DR: Convolutional neural network models for matching two sentences are proposed, by adapting the convolutional strategy in vision and speech and nicely represent the hierarchical structures of sentences with their layer-by-layer composition and pooling.
Proceedings ArticleDOI

Deeply learned face representations are sparse, selective, and robust

TL;DR: DeepID2+ improves performance by increasing the dimension of hidden representations and adding supervision to early convolutional layers, achieving state-of-the-art results on the LFW and YouTube Faces benchmarks.
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network with five convolutional layers, some followed by max-pooling layers, and three fully-connected layers ending in a 1000-way softmax achieved state-of-the-art ImageNet classification performance.
Journal ArticleDOI

Distinctive Image Features from Scale-Invariant Keypoints

TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Journal ArticleDOI

Gradient-based learning applied to document recognition

TL;DR: Reviews gradient-based learning and proposes graph transformer networks (GTNs), showing that multilayer networks trained with back-propagation can synthesize complex decision surfaces that classify high-dimensional patterns such as handwritten characters.
Journal ArticleDOI

A fast learning algorithm for deep belief nets

TL;DR: A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
Journal ArticleDOI

Multiresolution gray-scale and rotation invariant texture classification with local binary patterns

TL;DR: A generalized gray-scale and rotation invariant operator presentation that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution and presents a method for combining multiple operators for multiresolution analysis.