Using a deformation field model for localizing faces and facial points under weak supervision

Marco Pedersoli, Tinne Tuytelaars, Luc Van Gool
KU Leuven, ESAT/PSI - iMinds
ETH Zürich, CVL/D-ITET
firstname.lastname@esat.kuleuven.be, vangool@vision.ee.ethz.ch
Abstract

Face detection and facial point localization are interconnected tasks. Recently it has been shown that solving these two tasks jointly with a mixture of trees of parts (MTP) leads to state-of-the-art results. However, MTP, like most other methods for facial point localization proposed so far, requires a complete annotation of the training data at the facial point level. This annotation is used to predefine the structure of the trees and to place the parts correctly. In this work we extend the mixtures from trees to more general loopy graphs. In this way we can learn, in a weakly supervised manner (using only the face location and orientation), a powerful deformable detector that implicitly aligns its parts to the detected face in the image. By attaching reference points to the correct parts of our detector we can then localize the facial points. In terms of detection our method clearly outperforms the state of the art, even when competing with methods that use facial point annotations during training. Additionally, without any facial point annotation at the level of individual training images, our method localizes facial points with an accuracy similar to fully supervised approaches.
1. Introduction
Even if the problem of detecting faces seems practically solved, this is true only for well-aligned frontal faces. In the general case the problem is still challenging, because perspective deformations due to the different poses/orientations of the face/camera generate a huge space of variability that can only be covered by a huge number of samples. At the same time, interest is rapidly moving to facial analysis, which requires the precise localization of facial points. In terms of annotation this typically requires a much more time-consuming procedure, where each facial point of each example needs to be annotated.

[Figure 1. Overview of our method (panels: Training Images, Def. Field Models, Model Annotation, Test Images). Knowing only the location and orientation of faces at training time suffices to learn a model that can not only detect faces, but also localize their facial points.]
Recent methods for object detection [7] have shown that aligning the object can often produce better detection accuracy, or similar detection accuracy with a reduced set of training data. In particular, for faces, Zhu et al. [33] have shown that using a mixture of trees of parts (MTP) connected with spring-like deformation costs can lead, with a limited number of samples, to performance comparable to commercial face detectors generally trained with millions of samples [22]. Furthermore, as parts are placed on facial landmarks, the same model can be used for facial point localization as well as pose estimation.
In this work, we extend that approach. Instead of modeling faces as trees of parts, we model them as a densely and uniformly distributed set of parts connected with pairwise connections, forming a graph. The immediate advantage of this representation compared to the tree of parts is that the model does not need to know where the facial points are, because parts are placed uniformly over the entire face. The aligned locations of the parts are estimated during learning as latent variables.

This deformation model has loops, and therefore its optimization is in general too expensive for detection. However, the inference procedure proposed in [18] makes detection computationally feasible for up to 100 object parts. The model is well suited for perspective-like deformations and is general enough to automatically and jointly learn the face appearance, align the faces, and learn the pairwise deformation costs without any facial point annotation. With this approach we show improved detection capabilities with less supervision (and using just a few images). Additionally, we show that the resulting alignment is good enough to be used for unsupervised facial point localization. In practice, without any additional learning we can manually select any number of facial points on the model representation (Fig. 1) and then use the model to localize those facial points on a new face.
The paper is structured as follows. In Section 2 we discuss how our work relates to previous work. Then in Section 3 we define our deformation model, how to learn it, and how to localize facial points. Finally, Section 4 reports on experiments comparing our model with the state of the art, while in Section 5 we draw conclusions.
2. Related work
Face detection has been broadly studied. Here, due to lack of space, we limit ourselves to methods that are closely connected to ours. For a complete review of face detection and facial point localization we refer to recent surveys [9, 30, 31].

Viola and Jones [26] introduced the first detector able to correctly detect most of the faces in an image with a limited number of false positives. However, when considering faces in unconstrained environments, with any possible orientation, shadow, and occlusion, the real performance of the method is still quite low. To detect faces with different orientations, Chang et al. [10] learn different models for different discrete orientations and then, at test time, use a hierarchical cascade structure for a fast selection of the correct model. This improves detection but needs a large amount of training data, because each model must be trained with samples at the correct orientation. In contrast, our model can adapt to the local and global deformations of a face: the same sample is reused for different face orientations, which reduces the computational cost, increases the number of samples per model, and thus yields better performance.
A classical approach to facial point localization is Active Appearance Models (AAMs) [2, 17, 14], and more recently Constrained Local Models (CLMs) [19]. Unfortunately, to work properly those models need a good initialization, otherwise they get stuck in poor local minima. Instead, in our case, as we jointly perform face detection and landmark localization, there is no need for an initialization procedure. Other recent approaches are based on boosted regressors [25] and conditional regression forests [4]. They have shown promising results, but they need a large amount of annotated training images.

Our approach is similar to elastic graph matching methods [13, 16, 27], where the facial landmarks are connected in a deformable graph. However, our model does not need annotated facial points, because a dense grid of points is placed around the face and the best deformation for these points is learned during training.
In terms of features, our approach is similar to deformable part models (DPM) [7], because we use a dense scan of parts over HOG features [3] at multiple scales. Everingham et al. [6] use a pictorial structure model to build a facial point detector, while in [24] a DPM is trained with structured output support vector machines to directly optimize the localization of the facial points. However, in contrast to previous methods, the parts of our model are not limited to a star or tree structure; instead, they form a graph.
Our work was inspired by the work of Zhu et al. [33] and its extension [28], where it is shown that a mixture of trees of parts placed on annotated facial points obtains state-of-the-art results for face detection, pose estimation, and facial point localization. Here we show that this way of tackling the problem has further unexplored potential. By replacing the tree mixtures with graph mixtures (which can still be optimized in a reasonable time), we can automatically learn the structure of the faces and align them without the need for annotated facial points. Another recent approach to face detection and facial point estimation is based on learning exemplars and using them to detect faces and transfer their annotations [21]. Again, the number of training faces needed to obtain good performance is quite high.
Unsupervised alignment is also a well-studied topic [11, 15, 23]. Congealing, initially proposed by Learned-Miller [15] and then further extended and improved [11], enforces data alignment by finding the warping parameters that minimize a cost function on the ensemble of training images. Tong et al. [23] applied the algorithm to localizing facial points with a limited number of annotated images, while Zhu et al. [32] applied a similar alignment on a dense deformable map. However, these techniques do not consider negative examples and expect the face to be already coarsely aligned, i.e., previously detected. Instead, we show that joining face detection and facial point localization is beneficial for both tasks. As far as we know, this is the first work performing unsupervised facial point localization as a side effect of deformable face detection. In our case the face deformation is a latent variable, and we implicitly align the training images to the detector model because the alignment minimizes the recognition loss.

3. Model
We build our model on the deformation field model (DFM) proposed for object detection in [18]. In this section we first review the DFM and then show how to adapt it to detecting faces. Given an image $I$ and a set of learned weights $w_m$, we define the score generated by the mixture $m$ of our model as a combination of appearance and deformation:

$$S(I, L, H_m, w_m) = A(I, L, w_m) - D(L, H_m, w_m), \qquad (1)$$

where $L = \{l_i : i \in P\}$ with $l_i = (l_i^x, l_i^y)$ representing the location of part $i$, and $H_m = \{h_i : i \in P\}$ with $h_i = (h_i^x, h_i^y)$ representing the anchor point of part $i$. The appearance score is produced by a set of parts $P$ placed on the image $I$:

$$A(I, L, w_m) = \sum_{i \in P} \left\langle w_{m,i}^A, \Phi^A(I, l_i) \right\rangle, \qquad (2)$$

where $w_{m,i}^A$ are the weights for mixture $m$ associated to the features $\Phi^A$ (e.g. HOG features) extracted at location $l_i$. The deformation cost penalizes the relative displacement of connected parts:

$$D(L, H_m, w_m) = \sum_{ij \in E} \left\langle w_{m,ij}^D, \Phi^D(l_i - h_i, l_j - h_j) \right\rangle, \qquad (3)$$

where $E$ are the edges of the graph connecting neighboring parts and $w_{m,ij}^D$ are the weights for mixture $m$ associated to the deformation features $\Phi^D(d_i, d_j) = \left(|d_i^x - d_j^x|,\ |d_i^y - d_j^y|,\ (d_i^x - d_j^x)^2,\ (d_i^y - d_j^y)^2\right)$.
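To make Eqs. (1)-(3) concrete, the following minimal NumPy sketch (our illustration, not the authors' code) evaluates the score of one mixture for a fixed part placement on a 4-connected grid; the per-part appearance score maps, part anchors, and deformation weights are assumed to be given.

```python
import numpy as np

def deformation_features(di, dj):
    """Phi^D of Eq. (3): absolute and squared differences of the
    displacements d_i = l_i - h_i of two connected parts."""
    dx, dy = di[0] - dj[0], di[1] - dj[1]
    return np.array([abs(dx), abs(dy), dx ** 2, dy ** 2])

def score(app_maps, w_def, L, H, edges):
    """Score S of Eq. (1) for one mixture and a fixed placement L.

    app_maps : list of 2D arrays; app_maps[i][y, x] is the appearance
               score <w^A_i, Phi^A(I, (x, y))> of part i at (x, y),
               precomputed by filtering the HOG features with w^A_i.
    w_def    : dict mapping edge (i, j) to its 4 deformation weights.
    L, H     : (n_parts, 2) integer arrays of part locations/anchors.
    edges    : (i, j) pairs of the 4-connected part grid.
    """
    A = sum(app_maps[i][y, x] for i, (x, y) in enumerate(L))
    D = sum(w_def[(i, j)] @ deformation_features(L[i] - H[i], L[j] - H[j])
            for i, j in edges)
    return A - D  # appearance minus deformation cost
```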
As we use a 4-connected grid model, the corresponding graph has loops, and standard dynamic programming optimization cannot be used. Instead, we consider the problem as an instance of CRF optimization, where nodes are the object parts, node labels are the locations of the parts in the image, and edges are the pairwise connections between parts. As proposed in [18], for each mixture and for each scale we globally maximize Eq. (1) using alpha expansion [1], which is fast. To detect multiple instances in the same image, we iteratively re-run the algorithm while penalizing the locations of the previous detections. For loopy connected deformation models, this iterative optimization is much faster than a sliding-window approach, while providing similar accuracy. The computational cost of the algorithm is linear in the number of locations in the image and empirically also linear in the number of parts. For a complete analysis of the speed and quality of this deformation model we refer to [18].
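The multi-instance procedure can be summarized by the following schematic loop (a sketch under our own naming: `run_inference` stands in for one alpha-expansion maximization of Eq. (1), and `penalize` for adding a large negative value to the appearance maps at the detected part locations; both are assumed given).

```python
def detect_all(run_inference, penalize, threshold=0.0, max_instances=20):
    """Iterative multi-instance detection: run (approximate) global
    inference, keep the detection if it scores above threshold,
    penalize its part locations, and repeat."""
    detections = []
    for _ in range(max_instances):
        s, placement = run_inference()   # best remaining placement
        if s < threshold:
            break
        detections.append((s, placement))
        penalize(placement)              # suppress this instance
    return detections
```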
3.1. DFM for faces
To effectively use the DFM, we adapt the algorithm to the specific problem of detecting faces and facial points. In order to properly localize facial points we have to use a relatively high number of parts, so that each point can be localized by a different part. At the same time, we want to detect small faces, so the global model should have a relatively low resolution. However, a low-resolution model with many parts would not give good results in terms of facial point localization, because coarse parts are not discriminative enough to properly localize the facial features. Thus, whereas in the original DFM parts are placed on a regular grid side by side, here we introduce an overlap of 50%, which allows for bigger parts without increasing the model resolution. More in detail, in all our experiments we use parts of 4 × 4 HOG cells with 2 cells of overlap. For selecting the model resolution (and consequently the number of parts) we use two conditions: (i) the maximum number of parts is bounded by the maximum computational cost, which is linear in the number of parts; (ii) at least 85% of the training data must be representable at the given resolution.
Also, as faces have quite a rigid structure¹, to avoid unlikely configurations we modify the deformation features as $\Phi^F(d_i, d_j) = \left(cl(d_i^x - d_j^x),\ cl(d_i^y - d_j^y),\ (d_i^x - d_j^x)^2,\ (d_i^y - d_j^y)^2\right)$, where $cl(\cdot)$ is a non-linear function defined as:

$$cl(d) = \begin{cases} +\infty & \text{if } d < -\mu \\ d & \text{otherwise,} \end{cases} \qquad (4)$$

where $\mu$ is the size of a part. It forbids parts to cross over each other, thus enforcing more regularity in the deformation structure².

¹ In the sense that the topological location of the different parts is always the same, but their distances can change; this is why a deformation model is useful.
² Although $cl(d)$ is not symmetric, it can still be optimized with alpha expansion, as shown in [1].
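A direct transcription of the modified features (a sketch; `mu` is the part size, in the same units as the displacements):

```python
import numpy as np

def cl(d, mu):
    """Eq. (4): an infinite cost forbids a part from crossing over
    its neighbor (relative displacement below -mu)."""
    return np.inf if d < -mu else d

def deformation_features_faces(di, dj, mu=4.0):
    """Phi^F: face-specific deformation features, with the linear
    terms clipped by cl(). mu = 4 matches 4x4-cell parts."""
    dx, dy = di[0] - dj[0], di[1] - dj[1]
    return np.array([cl(dx, mu), cl(dy, mu), dx ** 2, dy ** 2])
```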
Finally, to select among the different mixtures and overlapping hypotheses, we run a non-maximum suppression that discards all bounding boxes that overlap more than 30% with a higher-scoring one.
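For reference, greedy non-maximum suppression with a 30% threshold can be sketched as follows (assuming intersection-over-union as the overlap measure; boxes are (x1, y1, x2, y2)):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, max_overlap=0.3):
    """Keep the highest-scoring boxes; drop any box overlapping a
    kept box by more than max_overlap."""
    keep = []
    for i in np.argsort(scores)[::-1]:
        if all(iou(boxes[i], boxes[j]) <= max_overlap for j in keep):
            keep.append(i)
    return keep
```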
3.2. Learning
Given a set of positive and negative images, the bounding boxes $B$ of the faces, and their pose (yaw), we want to learn a vector of weights $w^*$ such that:

$$w^* = \arg\min_w \Big\{ \tfrac{1}{2} \max_m |w_m|^2 + C \sum_{n=1}^{M} \sum_{k=1}^{K} \max\big(0,\ 1 + \max_{L,m} S(I_{n,k}, L, H_m, w_m)\big) + C \sum_{n=1}^{|B|} \max\big(0,\ 1 - \max_{L,m} S(B_n, L, H_m, w_m)\big) \Big\}. \qquad (5)$$
This minimization is an instance of the latent SVM problem [7]. The locations of the object parts $L$ and the mixture $m$ are the latent variables. $C$ is the trade-off between loss and regularization; in all our experiments we fix $C$ to 0.001. The regularization is maximized over mixtures $m$ to enforce comparable scores for each mixture. For negative examples, as we are interested in ranking detections, we select the $K$ best detections generated from each of the $M$ negative images. For positive examples we collect the cropped region $B_n \in B$ around each bounding box.
As opposed to binary SVMs, here the problem is not symmetric: due to the maximization over the latent variables, the loss for the negative samples is convex, while the loss for the positive samples is concave. This is solved using an iterative procedure. Given an initial $w$, we find the latent values $L$ and $m$ for the positive samples. Then, fixing those, we find a new $w$ by optimizing the convex problem.

In the ideal case, when we can optimally maximize the score of Eq. (1), the loss of the positive samples can only decrease at each iteration and, hence, the algorithm converges [29]. Unfortunately, the alpha expansion algorithm provides only a weak bound on the quality of the solution [1]. As suggested in [18], to preserve convergence we maintain a buffer with the previously assigned values of the latent variables. When a new assignment is made, we keep it only if it produces a lower loss (higher score); otherwise the old assignment is restored.
The optimization is performed using stochastic gradient descent [20]. As the number of negative samples is exponential, to use a limited amount of memory we use negative mining as proposed in [7]. During learning, the weights associated to the deformation costs $w_m^D$ are forced to be positive to avoid unwanted configurations. With stochastic gradient descent we can impose positiveness on the weights by simply re-projecting, at each update of $w$, all negative deformation weights to zero.
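The positivity constraint amounts to a projection after each stochastic gradient update, along these lines (a sketch; `deform_idx`, indexing the deformation-weight entries of the concatenated weight vector, is an assumed input):

```python
import numpy as np

def sgd_step(w, grad, lr, deform_idx):
    """One (sub)gradient step followed by re-projection: deformation
    weights are kept non-negative by zeroing negative entries."""
    w = w - lr * grad
    w[deform_idx] = np.maximum(w[deform_idx], 0.0)
    return w
```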
3.3. Initialization
As the problem defined in Eq. (5) is not convex, the final quality of the model highly depends on the quality of the initialization of the latent variables. Initially, to avoid wrong configurations, the latent variable values are restricted to few configurations. Then the restrictions are slowly relaxed and a refined model can be learned. More in detail:
Split into mixtures: we uniformly split the range of yaw angles of the faces in the dataset based on the number of mixtures that we want to build. For roll and pitch we assume that those rotations can be accounted for by the deformation model, so they do not need a separate mixture. For instance, for a 2-mixture model we split the yaw angle into 0-45 and 45-90 degrees. We consider only the absolute value of the angles, because examples facing left can be placed in the same mixture as examples facing right (see below). Then, for each mixture $m$ we crop the corresponding bounding boxes $B_m$ of the faces and rescale them to a fixed scale.
Left-right alignment: we align left- and right-facing examples to train a single mixture with more positive samples. At test time we then run inference with the learned model as well as with its horizontally flipped version, so that we can detect faces facing both sides. To this end, we define an alignment energy as:

$$\sum_{n=1}^{|B_m|} \Big| \Phi^A(B_n^m, L^*) - \frac{1}{|B_m|} \sum_{n=1}^{|B_m|} \Phi^A(B_n^m, L^*) \Big|^2, \qquad (6)$$

which measures the norm of the variance of each cell over all the samples of a given mixture. $L^*$ is the resting configuration of the parts, when there is no deformation cost. We minimize this energy by selecting random samples and flipping them horizontally: if the energy with the flipped sample is lower than before, the sample is kept flipped; otherwise the old configuration is restored. We repeat this procedure for 10 times the number of samples in the mixture (see the sketch below).
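The flipping procedure can be sketched as follows (our illustration; `flip_features` is a hypothetical helper that mirrors a feature map horizontally, which for HOG also permutes the orientation bins). Minimizing the summed variance of the ensemble is equivalent, up to a constant factor, to minimizing Eq. (6).

```python
import numpy as np

def align_left_right(features, flip_features, seed=0):
    """Greedy minimization of the alignment energy of Eq. (6):
    flip a random sample; keep the flip only if the ensemble
    variance decreases.

    features : (n_samples, ...) appearance features extracted at
               the resting configuration L*.
    """
    rng = np.random.default_rng(seed)
    feats = features.copy()
    n = len(feats)
    energy = lambda f: np.var(f, axis=0).sum()
    e = energy(feats)
    for _ in range(10 * n):            # 10x the number of samples
        i = rng.integers(n)
        old = feats[i].copy()
        feats[i] = flip_features(feats[i])
        e_new = energy(feats)
        if e_new < e:
            e = e_new                  # keep the flip
        else:
            feats[i] = old             # restore old configuration
    return feats
```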
Initial appearance model: with the samples separated by yaw angle and correctly aligned ($B_m$ for each mixture $m$) and a set of random cropped regions $R$ from images not containing faces, we train a first appearance model based on standard SVM optimization with the latent variables fixed:

$$w_m^* = \arg\min_w \Big\{ \tfrac{1}{2} |w_m|^2 + C \sum_{n=1}^{|R|} \max\big(0,\ 1 + \langle w_m^A, \Phi^A(R_n, L^*) \rangle\big) + C \sum_{n=1}^{|B_m|} \max\big(0,\ 1 - \langle w_m^A, \Phi^A(B_n^m, L^*) \rangle\big) \Big\}. \qquad (7)$$
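Since Eq. (7) is a standard linear SVM on features extracted at the fixed resting configuration $L^*$, the initial model could be obtained, for instance, with an off-the-shelf solver (a sketch using scikit-learn rather than the authors' own optimizer):

```python
import numpy as np
from sklearn.svm import LinearSVC

def initial_appearance_model(pos_feats, neg_feats, C=0.001):
    """Train the initial appearance weights w^A_m of Eq. (7) as a
    plain linear SVM with hinge loss on fixed-configuration
    features (one row per cropped positive/negative region)."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)),
                        -np.ones(len(neg_feats))])
    svm = LinearSVC(C=C, loss="hinge", fit_intercept=False)
    svm.fit(X, y)
    return svm.coef_.ravel()
```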
Initial deformation model: we initialize the deformation weights connecting parts $i$ and $j$ as $w_{m,ij}^D = |w_{m,i}^A| + |w_{m,j}^A|$, so that they are comparable to the corresponding appearance weights. In most cases this initial configuration does not allow for any deformation. However, it still allows global displacements of the mixture model, which rigidly align the mixture to the face, as in our deformation model a global displacement is not penalized. Then, during training, the deformation weights are regularized, and after some latent SVM iterations they become small enough to allow for deformations.
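One reading of this initialization (our assumption: taking $|w_{m,i}^A|$ as the summed magnitude of part $i$'s appearance weights, replicated over the four deformation features):

```python
import numpy as np

def initial_deformation_weights(w_app, edges):
    """w^D_{ij} = |w^A_i| + |w^A_j|: deformation costs start on the
    same scale as the appearance weights, initially allowing only
    rigid (global) displacements."""
    mag = [np.abs(w).sum() for w in w_app]   # per-part magnitude
    return {(i, j): np.full(4, mag[i] + mag[j]) for i, j in edges}
```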
Complete model: we concatenate all the initial appearance weights $w_m^A$ with the corresponding deformation weights $w_m^D$ to form the complete $w$ that is used to initialize Eq. (5) and thus start the latent SVM optimization.
Initialization without pose: in the experiments we also tried to learn the model without using the facial pose. In this case we first perform the left-right alignment, as previously explained, on the entire training data. Afterwards, we perform k-means clustering on the extracted features, with k equal to the number of desired mixtures. Each cluster then represents one mixture, and the following initialization steps are the same as before.
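A sketch of this pose-free split using scikit-learn (our illustration; `features` are the left-right-aligned appearance features, flattened per sample):

```python
import numpy as np
from sklearn.cluster import KMeans

def split_into_mixtures(features, n_mixtures, seed=0):
    """Cluster aligned training samples by appearance; each cluster
    becomes one mixture of the model."""
    X = features.reshape(len(features), -1)
    km = KMeans(n_clusters=n_mixtures, random_state=seed, n_init=10)
    return km.fit_predict(X)   # mixture label per sample
```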
3.4. Facial point localization
Once training is completed, we obtain a set of deformable templates representing the different views of a face. In Fig. 2 we show the positive weights associated to the HOG features learned for 3 different viewpoints. From this visualization we can easily recognize the face structure, and therefore we can manually annotate the facial points that we want to localize in a new image.
Then, for each point we can find which part of the grid it belongs to and anchor it to the corresponding part. In this way, when applying the detector to a new image, the facial point will follow the location of the part it is anchored to. As during learning we trade off appearance and deformation, on test images we can also expect the parts to distort to adapt to the current image.

If a facial point is placed at the edge between two parts, its actual placement on the image could vary considerably depending on which part we decide to attach it to. To avoid this problem, each facial point is attached to the 4 closest parts. The final point location on the image is then the bi-linear interpolation of the location of the point on the four parts. This procedure reduces the quantization effect due to the finite size of the parts. In practice the interpolation always gave better results in our experiments, and we use it throughout.
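At test time the anchored point is recovered by the interpolation just described; a minimal sketch (the bilinear weights are fixed once on the model grid):

```python
def localize_facial_point(part_ids, bilinear_w, part_locations):
    """Image location of a facial point attached to its 4 closest
    parts: the weighted mean of the detected part locations.

    part_ids       : indices of the 4 closest parts on the grid.
    bilinear_w     : 4 bilinear weights, summing to 1.
    part_locations : (n_parts, 2) detected part centers in the image.
    """
    return sum(w * part_locations[p]
               for p, w in zip(part_ids, bilinear_w))
```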
3.5. Evaluating the Facial Alignment
The previous procedure is useful for estimating the location of facial points in real applications, like facial emotion recognition or person identification. However, it is based on a subjective localization of the facial points on the object model. A more direct way to estimate how good a model is at aligning faces is to re-project, for each image, the ground truth annotations onto the object model. If most of the points fire at the same location on the object model, the alignment is good. In this sense, for each facial point we evaluate the standard deviation of the re-projections over the annotated faces.

Again, the re-projection is computed using bi-linear interpolation: for each annotated facial point, the 4 closest parts are detected and the location of the point on the model is their weighted mean. An example of annotated facial point re-projection is shown in Fig. 2.
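The alignment measure of this subsection could be computed along these lines (a sketch under our own conventions; how the per-axis deviations are combined into a single number is our assumption):

```python
import numpy as np

def alignment_spread(reprojections, face_size):
    """Average standard deviation of the re-projected ground-truth
    points on the model, as a fraction of the face size.

    reprojections : (n_images, n_points, 2) model-space coordinates
                    obtained by bilinear re-projection.
    """
    std_xy = reprojections.std(axis=0)          # (n_points, 2)
    per_point = np.linalg.norm(std_xy, axis=1)  # combine x and y
    return per_point.mean() / face_size
```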
4. Experiments
4.1. Datasets
We train our method using 900 samples from two well-known datasets: MultiPIE [8] and Labeled Faces in the Wild (LFW) [12]. We use 900 samples for training to enable a fair comparison with previous methods [33] that used the same number of samples. In contrast to the other methods, we use only the location of the bounding box and the pose (yaw) of the face, but not the facial point locations. We use MultiPIE, a collection of images taken in a controlled environment, to perform an analysis of the model parameters. Afterwards, for a comparison with other state-of-the-art methods, we train on LFW, which contains unconstrained images and better represents the "in-the-wild" data distribution. For both datasets, we collect negative samples from the negative images of the INRIA dataset [3], which do not contain faces. Testing is performed on Annotated Faces in the Wild (AFW), proposed in [33].
4.2. Number of Mixtures
In Table 1 we evaluate the effect of changing the number of mixtures of the model, using a grid of 10 × 10 parts with a part size of 4 × 4 HOG cells and 2 cells of overlap. We train our model on MultiPIE considering 300 frontal views and 600 lateral views spanning from ±15 to ±90 degrees, as in [33]. For each configuration we evaluate on AFW the detection average precision (AP) and the average standard deviation of the projection of the facial points on the model, as explained in Sec. 3.5. For the average precision, we consider a detection as correct if its overlap with the ground truth bounding box is more than 50%, as in [5]. We use the average standard deviation of the re-projection of the annotated facial points (as a percentage of the face size) to estimate the alignment capability of the model on the test samples. A high standard deviation means that the localization is poor, while a small one indicates that the model aligns well with the test images.

From the table we can see that increasing the number of mixtures (up to 8) leads to better facial point estimation (lower standard deviation). However, increasing the number of mixtures reduces the number of samples per mixture, and thus with more than 8 mixtures the facial point localization becomes worse. For detection, we can see that 6 mixtures already give near-optimal performance. As the computational time is linear in the number of mixtures, for the next experiments we select the configuration with 6 components, which has a quite good AP and facial

References (partial)

Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE TPAMI, 23(11):1222-1239, 2001.
N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE TPAMI, 32(9):1627-1645, 2010.
P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 57(2):137-154, 2004.
W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399-458, 2003.