Using a deformation field model for localizing faces and facial points under weak supervision

Marco Pedersoli, Tinne Tuytelaars, Luc Van Gool
KU Leuven, ESAT/PSI - iMinds
ETH Zürich, CVL/D-ITET
firstname.lastname@esat.kuleuven.be, vangool@vision.ee.ethz.ch
Abstract

Face detection and facial point localization are interconnected tasks. Recently it has been shown that solving these two tasks jointly with a mixture of trees of parts (MTP) leads to state-of-the-art results. However, MTP, like most other methods for facial point localization proposed so far, requires a complete annotation of the training data at the facial point level. This annotation is used to predefine the structure of the trees and to place the parts correctly. In this work we extend the mixtures from trees to more general loopy graphs. In this way we can learn, in a weakly supervised manner (using only the face location and orientation), a powerful deformable detector that implicitly aligns its parts to the detected face in the image. By attaching reference points to the correct parts of our detector we can then localize the facial points. In terms of detection our method clearly outperforms the state of the art, even when competing with methods that use facial point annotations during training. Additionally, without any facial point annotation at the level of individual training images, our method localizes facial points with an accuracy similar to fully supervised approaches.
1. Introduction
Even if the problem of detecting faces seems practically solved, this is true only for well-aligned frontal faces. In the general case the problem is still challenging, because perspective deformations due to the different poses/orientations of the face/camera generate a huge space of variability that can only be covered by a huge number of samples. At the same time, interest is rapidly moving to facial analysis, which requires the precise localization of facial points. In terms of annotation this typically requires a much more time-consuming procedure, where each facial point of each example needs to be annotated.

[Figure 1. Overview of our method (panels: Training Images, Def. Field Models, Model Annotation, Test Images). Knowing only the location and orientation of faces at training time suffices to learn a model that can not only detect faces, but also localize their facial points.]
Recent methods for object detection [7] have shown that aligning the object can often produce better detection accuracy, or similar detection accuracy with a reduced set of training data. In particular, for faces, Zhu et al. [33] have shown that using a mixture of trees of parts (MTP) connected with spring-like deformation costs can lead, with a limited number of samples, to performance comparable to commercial face detectors generally trained with millions of samples [22]. Furthermore, as parts are placed on facial landmarks, the same model can be used for facial point localization as well as pose estimation.
In this work, we extend that approach. Instead of modeling faces as trees of parts, we model them as a densely and uniformly distributed set of parts connected with pairwise connections, forming a graph. The immediate advantage of this representation compared to the tree of parts is that the model does not need to know where the facial points are, because parts are placed uniformly over the entire face. The aligned locations of the parts are estimated during learning as latent variables.

This deformation model has loops, and therefore its optimization is in general too expensive for detection. However, the inference procedure proposed in [18] makes detection computationally feasible for up to 100 object parts. The model is well suited for perspective-like deformations and is general enough to automatically and jointly learn the face appearance, align the faces, and learn the pairwise deformation costs without any facial point annotation. With this approach we show improved detection capabilities with less supervision (and using just a few images). Additionally, we show that the resulting alignment is good enough to be used for unsupervised facial point localization. In practice, without any additional learning we can manually select any number of facial points on the model representation (Fig. 1) and then use the model to localize those facial points on a new face.
The paper is structured as follows. In Section 2 we discuss how our work relates to previous work. Then in Section 3 we define our deformation model, how to learn it, and how to localize facial points. Finally, Section 4 reports on experiments comparing our model with the state of the art, while in Section 5 we draw conclusions.
2. Related work
Face detection has been broadly studied. Here, due to lack of space, we limit ourselves to methods that are closely connected to ours. For a complete review of face detection and facial point localization we refer to recent surveys [9, 30, 31].

Viola and Jones [26] introduced the first detector able to correctly detect most of the faces in an image with a limited number of false positives. However, when considering faces in unconstrained environments, with any possible orientation, shadow, and occlusion, the real performance of the method is still quite low. To detect faces with different orientations, Chang et al. [10] learn different models for different discrete orientations and then, at test time, use a hierarchical cascade structure for a fast selection of the correct model. This improves detection but needs a large amount of training data, because each model must be trained with samples at the correct orientation. In contrast, our model can adapt to the local and global deformations of a face: the same sample is reused for different face orientations, which reduces the computational cost, increases the number of samples per model, and thus yields better performance.
A classical approach to facial point localization is Active Appearance Models (AAMs) [2, 17, 14], and more recently Constrained Local Models (CLMs) [19]. Unfortunately, to work properly those models need a good initialization, otherwise they get stuck in poor local minima. Instead, in our case, as we jointly perform face detection and landmark localization, there is no need for an initialization procedure. Other recent approaches are based on boosted regressors [25] and conditional regression forests [4]. They have shown promising results, but they need a large amount of annotated training images.

Our approach is similar to elastic graph matching methods [13, 16, 27], where the facial landmarks are connected in a deformable graph. However, our model does not need annotated facial points, because a dense grid of points is placed around the face and the best deformation for these points is learned during training.
In terms of features, our approach is similar to deformable part models (DPM) [7], because we use a dense scan of parts over HOG features [3] at multiple scales. Everingham et al. [6] use a pictorial structure model to build a facial point detector, while in [24] a DPM is trained with structured output support vector machines to directly optimize the localization of the facial points. However, in contrast to previous methods, the parts of our model are not limited to a star or tree structure; instead, they form a graph.
Our work was inspired by the work of Zhu et al. [33] and its extension [28], where it is shown that a mixture of trees of parts placed on annotated facial points obtains state-of-the-art results for face detection, pose estimation, and facial point localization. Here we show that this way of tackling the problem has further unexplored potential. By replacing the tree mixtures with graph mixtures (which can still be optimized in a reasonable time), we can automatically learn the structure of the faces and align them without the need for annotated facial points. Another recent approach to face detection and facial point estimation is based on learning exemplars and using them to detect faces and transfer their annotations [21]. Again, the number of training faces needed to obtain good performance is quite high.
Unsupervised alignment is also a well-studied topic [11, 15, 23]. Congealing, initially proposed by Learned-Miller [15] and then further extended and improved [11], enforces data alignment by finding the warping parameters that minimize a cost function on the ensemble of training images. Tong et al. [23] applied the algorithm to localizing facial points with a limited number of annotated images, while Zhu et al. [32] applied a similar alignment on a dense deformable map. However, these techniques do not consider negative examples and expect the face to be already coarsely aligned, i.e., previously detected. Instead, we show that joining face detection and facial point localization is beneficial for both tasks. As far as we know, this is the first work performing unsupervised facial point localization as a side effect of deformable face detection. In our case the face deformation is a latent variable, and we implicitly align the training images to the detector model because the alignment minimizes the recognition loss.

3. Model
We build our model on the deformation field model (DFM) proposed for object detection in [18]. In this section we first review the DFM and then show how to adapt it to detecting faces. Given an image $I$ and a set of learned weights $w_m$, we define the score generated by the mixture $m$ of our model as a combination of appearance and deformation:

$$S(I, L, H_m, w_m) = A(I, L, w_m) - D(L, H_m, w_m), \qquad (1)$$

where $L = \{l_i : i \in P\}$ with $l_i = (l_i^x, l_i^y)$ representing the location of part $i$, and $H_m = \{h_i : i \in P\}$ with $h_i = (h_i^x, h_i^y)$ representing the anchor point of part $i$. The appearance score is produced by a set of parts $P$ placed on the image $I$:

$$A(I, L, w_m) = \sum_{i \in P} \left\langle w_{m,i}^A, \Phi^A(I, l_i) \right\rangle, \qquad (2)$$

where $w_{m,i}^A$ are the weights for mixture $m$ associated to the features $\Phi^A$ (e.g. HOG features) extracted at location $l_i$. The deformation cost penalizes the relative displacement of connected parts:

$$D(L, H_m, w_m) = \sum_{ij \in E} \left\langle w_{m,ij}^D, \Phi^D(l_i - h_i, l_j - h_j) \right\rangle, \qquad (3)$$

where $E$ are the edges of the graph connecting neighboring parts and $w_{m,ij}^D$ are the weights for mixture $m$ associated to the deformation features $\Phi^D(d_i, d_j) = \left(|d_i^x - d_j^x|,\ |d_i^y - d_j^y|,\ (d_i^x - d_j^x)^2,\ (d_i^y - d_j^y)^2\right)$.
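To make Eqs. (1)-(3) concrete, the following minimal NumPy sketch (our illustration, not the authors' code) evaluates the score of one mixture for a fixed part placement on a 4-connected grid; the per-part appearance score maps, part anchors, and deformation weights are assumed to be given.

```python
import numpy as np

def deformation_features(di, dj):
    """Phi^D of Eq. (3): absolute and squared differences of the
    displacements d_i = l_i - h_i of two connected parts."""
    dx, dy = di[0] - dj[0], di[1] - dj[1]
    return np.array([abs(dx), abs(dy), dx ** 2, dy ** 2])

def score(app_maps, w_def, L, H, edges):
    """Score S of Eq. (1) for one mixture and a fixed placement L.

    app_maps : list of 2D arrays; app_maps[i][y, x] is the appearance
               score <w^A_i, Phi^A(I, (x, y))> of part i at (x, y),
               precomputed by filtering the HOG features with w^A_i.
    w_def    : dict mapping edge (i, j) to its 4 deformation weights.
    L, H     : (n_parts, 2) integer arrays of part locations/anchors.
    edges    : (i, j) pairs of the 4-connected part grid.
    """
    A = sum(app_maps[i][y, x] for i, (x, y) in enumerate(L))
    D = sum(w_def[(i, j)] @ deformation_features(L[i] - H[i], L[j] - H[j])
            for i, j in edges)
    return A - D  # appearance minus deformation cost
```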
As we use a 4-connected grid model, the corresponding graph has loops, and standard dynamic programming optimization cannot be used. Instead, we consider the problem as an instance of CRF optimization, where nodes are the object parts, node labels are the locations of the parts in the image, and edges are the pairwise connections between parts. As proposed in [18], for each mixture and for each scale we globally maximize Eq. (1) using alpha expansion [1], which is fast. To detect multiple instances in the same image, we iteratively re-run the algorithm while penalizing the locations of the previous detections. For loopy connected deformation models, this iterative optimization is much faster than a sliding-window approach, while providing similar accuracy. The computational cost of the algorithm is linear in the number of locations in the image and empirically also linear in the number of parts. For a complete analysis of the speed and quality of this deformation model we refer to [18].
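The multi-instance procedure can be summarized by the following schematic loop (a sketch under our own naming: `run_inference` stands in for one alpha-expansion maximization of Eq. (1), and `penalize` for adding a large negative value to the appearance maps at the detected part locations; both are assumed given).

```python
def detect_all(run_inference, penalize, threshold=0.0, max_instances=20):
    """Iterative multi-instance detection: run (approximate) global
    inference, keep the detection if it scores above threshold,
    penalize its part locations, and repeat."""
    detections = []
    for _ in range(max_instances):
        s, placement = run_inference()   # best remaining placement
        if s < threshold:
            break
        detections.append((s, placement))
        penalize(placement)              # suppress this instance
    return detections
```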
3.1. DFM for faces
To effectively use the DFM, we adapt the algorithm to the specific problem of detecting faces and facial points. In order to properly localize facial points we have to use a relatively high number of parts, so that each point can be localized by a different part. At the same time, we want to detect small faces, so the global model should have a relatively low resolution. However, a low-resolution model with many parts would not give good results in terms of facial point localization, because coarse parts are not discriminative enough to properly localize the facial features. Thus, whereas in the original DFM parts are placed on a regular grid side by side, here we introduce an overlap of 50%, which allows for bigger parts without increasing the model resolution. More in detail, in all our experiments we use parts of 4 × 4 HOG cells with 2 cells of overlap. For selecting the model resolution (and consequently the number of parts) we use two conditions: (i) the maximum number of parts is bounded by the maximum computational cost, which is linear in the number of parts; (ii) at least 85% of the training data must be representable at the given resolution.
Also, as faces have quite a rigid structure¹, to avoid unlikely configurations we modify the deformation features as $\Phi^F(d_i, d_j) = \left(cl(d_i^x - d_j^x),\ cl(d_i^y - d_j^y),\ (d_i^x - d_j^x)^2,\ (d_i^y - d_j^y)^2\right)$, where $cl(\cdot)$ is a non-linear function defined as:

$$cl(d) = \begin{cases} +\infty & \text{if } d < -\mu \\ d & \text{otherwise,} \end{cases} \qquad (4)$$

where $\mu$ is the size of a part. It forbids parts to cross over each other, thus enforcing more regularity in the deformation structure².

¹ In the sense that the topological location of the different parts is always the same, but their distances can change; this is why a deformation model is useful.
² Although $cl(d)$ is not symmetric, it can still be optimized with alpha expansion, as shown in [1].
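A direct transcription of the modified features (a sketch; `mu` is the part size, in the same units as the displacements):

```python
import numpy as np

def cl(d, mu):
    """Eq. (4): an infinite cost forbids a part from crossing over
    its neighbor (relative displacement below -mu)."""
    return np.inf if d < -mu else d

def deformation_features_faces(di, dj, mu=4.0):
    """Phi^F: face-specific deformation features, with the linear
    terms clipped by cl(). mu = 4 matches 4x4-cell parts."""
    dx, dy = di[0] - dj[0], di[1] - dj[1]
    return np.array([cl(dx, mu), cl(dy, mu), dx ** 2, dy ** 2])
```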
Finally, to select among the different mixtures and overlapping hypotheses, we run a non-maximum suppression that discards all bounding boxes that overlap more than 30% with a higher-scoring one.
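For reference, greedy non-maximum suppression with a 30% threshold can be sketched as follows (assuming intersection-over-union as the overlap measure; boxes are (x1, y1, x2, y2)):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, max_overlap=0.3):
    """Keep the highest-scoring boxes; drop any box overlapping a
    kept box by more than max_overlap."""
    keep = []
    for i in np.argsort(scores)[::-1]:
        if all(iou(boxes[i], boxes[j]) <= max_overlap for j in keep):
            keep.append(i)
    return keep
```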
3.2. Learning
Given a set of positive and negative images, the bounding boxes $B$ of the faces, and their pose (yaw), we want to learn a vector of weights $w^*$ such that:

$$w^* = \arg\min_w \Big\{ \tfrac{1}{2} \max_m |w_m|^2 + C \sum_{n=1}^{M} \sum_{k=1}^{K} \max\big(0,\ 1 + \max_{L,m} S(I_{n,k}, L, H_m, w_m)\big) + C \sum_{n=1}^{|B|} \max\big(0,\ 1 - \max_{L,m} S(B_n, L, H_m, w_m)\big) \Big\}. \qquad (5)$$
This minimization is an instance of the latent SVM problem [7]. The locations of the object parts $L$ and the mixture $m$ are the latent variables. $C$ is the trade-off between loss and regularization; in all our experiments we fix $C$ to 0.001. The regularization is maximized over mixtures $m$ to enforce comparable scores for each mixture. For negative examples, as we are interested in ranking detections, we select the $K$ best detections generated from each of the $M$ negative images. For positive examples we collect the cropped region $B_n \in B$ around each bounding box.
As opposed to binary SVMs, here the problem is not symmetric: due to the maximization over the latent variables, the loss for the negative samples is convex, while the loss for the positive samples is concave. This is solved using an iterative procedure. Given an initial $w$, we find the latent values $L$ and $m$ for the positive samples. Then, fixing those, we find a new $w$ by optimizing the convex problem.

In the ideal case, when we can optimally maximize the score of Eq. (1), the loss of the positive samples can only decrease at each iteration and, hence, the algorithm converges [29]. Unfortunately, the alpha expansion algorithm provides only a weak bound on the quality of the solution [1]. As suggested in [18], to preserve convergence we maintain a buffer with the previously assigned values of the latent variables. When a new assignment is made, we keep it only if it produces a lower loss (higher score); otherwise the old assignment is restored.
The optimization is performed using stochastic gradient descent [20]. As the number of negative samples is exponential, to use a limited amount of memory we use negative mining as proposed in [7]. During learning, the weights associated to the deformation costs $w_m^D$ are forced to be positive to avoid unwanted configurations. With stochastic gradient descent we can impose positiveness on the weights by simply re-projecting, at each update of $w$, all negative deformation weights to zero.
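The positivity constraint amounts to a projection after each stochastic gradient update, along these lines (a sketch; `deform_idx`, indexing the deformation-weight entries of the concatenated weight vector, is an assumed input):

```python
import numpy as np

def sgd_step(w, grad, lr, deform_idx):
    """One (sub)gradient step followed by re-projection: deformation
    weights are kept non-negative by zeroing negative entries."""
    w = w - lr * grad
    w[deform_idx] = np.maximum(w[deform_idx], 0.0)
    return w
```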
3.3. Initialization
As the problem defined in Eq. (5) is not convex, the final quality of the model highly depends on the quality of the initialization of the latent variables. Initially, to avoid wrong configurations, the latent variable values are restricted to few configurations. Then the restrictions are slowly relaxed and a refined model can be learned. More in detail:
Split into mixtures: we uniformly split the range of yaw angles of the faces in the dataset based on the number of mixtures that we want to build. For roll and pitch we assume that those rotations can be accounted for by the deformation model, so they do not need a separate mixture. For instance, for a 2-mixture model we split the yaw angle into 0-45 and 45-90 degrees. We consider only the absolute value of the angles, because examples facing left can be placed in the same mixture as examples facing right (see below). Then, for each mixture $m$ we crop the corresponding bounding boxes $B_m$ of the faces and rescale them to a fixed scale.
Left-right alignment: we align left- and right-facing examples to train a single mixture with more positive samples. At test time we then run inference with the learned model as well as with its horizontally flipped version, so that we can detect faces facing both sides. To this end, we define an alignment energy as:

$$\sum_{n=1}^{|B_m|} \Big| \Phi^A(B_n^m, L^*) - \frac{1}{|B_m|} \sum_{n=1}^{|B_m|} \Phi^A(B_n^m, L^*) \Big|^2, \qquad (6)$$

which measures the norm of the variance of each cell over all the samples of a given mixture. $L^*$ is the resting configuration of the parts, when there is no deformation cost. We minimize this energy by selecting random samples and flipping them horizontally: if the energy with the flipped sample is lower than before, the sample is kept flipped; otherwise the old configuration is restored. We repeat this procedure for 10 times the number of samples in the mixture (see the sketch below).
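The flipping procedure can be sketched as follows (our illustration; `flip_features` is a hypothetical helper that mirrors a feature map horizontally, which for HOG also permutes the orientation bins). Minimizing the summed variance of the ensemble is equivalent, up to a constant factor, to minimizing Eq. (6).

```python
import numpy as np

def align_left_right(features, flip_features, seed=0):
    """Greedy minimization of the alignment energy of Eq. (6):
    flip a random sample; keep the flip only if the ensemble
    variance decreases.

    features : (n_samples, ...) appearance features extracted at
               the resting configuration L*.
    """
    rng = np.random.default_rng(seed)
    feats = features.copy()
    n = len(feats)
    energy = lambda f: np.var(f, axis=0).sum()
    e = energy(feats)
    for _ in range(10 * n):            # 10x the number of samples
        i = rng.integers(n)
        old = feats[i].copy()
        feats[i] = flip_features(feats[i])
        e_new = energy(feats)
        if e_new < e:
            e = e_new                  # keep the flip
        else:
            feats[i] = old             # restore old configuration
    return feats
```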
Initial appearance model: with the samples separated by yaw angle and correctly aligned ($B_m$ for each mixture $m$) and a set of random cropped regions $R$ from images not containing faces, we train a first appearance model based on standard SVM optimization with the latent variables fixed:

$$w_m^* = \arg\min_w \Big\{ \tfrac{1}{2} |w_m|^2 + C \sum_{n=1}^{|R|} \max\big(0,\ 1 + \langle w_m^A, \Phi^A(R_n, L^*) \rangle\big) + C \sum_{n=1}^{|B_m|} \max\big(0,\ 1 - \langle w_m^A, \Phi^A(B_n^m, L^*) \rangle\big) \Big\}. \qquad (7)$$
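Since Eq. (7) is a standard linear SVM on features extracted at the fixed resting configuration $L^*$, the initial model could be obtained, for instance, with an off-the-shelf solver (a sketch using scikit-learn rather than the authors' own optimizer):

```python
import numpy as np
from sklearn.svm import LinearSVC

def initial_appearance_model(pos_feats, neg_feats, C=0.001):
    """Train the initial appearance weights w^A_m of Eq. (7) as a
    plain linear SVM with hinge loss on fixed-configuration
    features (one row per cropped positive/negative region)."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)),
                        -np.ones(len(neg_feats))])
    svm = LinearSVC(C=C, loss="hinge", fit_intercept=False)
    svm.fit(X, y)
    return svm.coef_.ravel()
```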
Initial deformation model: we initialize the deformation weights connecting parts $i$ and $j$ as $w_{m,ij}^D = |w_{m,i}^A| + |w_{m,j}^A|$, so that they are comparable to the corresponding appearance weights. In most cases this initial configuration does not allow for any deformation. However, it still allows global displacements of the mixture model, which rigidly align the mixture to the face, as in our deformation model a global displacement is not penalized. Then, during training, the deformation weights are regularized, and after some latent SVM iterations they become small enough to allow for deformations.
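One reading of this initialization (our assumption: taking $|w_{m,i}^A|$ as the summed magnitude of part $i$'s appearance weights, replicated over the four deformation features):

```python
import numpy as np

def initial_deformation_weights(w_app, edges):
    """w^D_{ij} = |w^A_i| + |w^A_j|: deformation costs start on the
    same scale as the appearance weights, initially allowing only
    rigid (global) displacements."""
    mag = [np.abs(w).sum() for w in w_app]   # per-part magnitude
    return {(i, j): np.full(4, mag[i] + mag[j]) for i, j in edges}
```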
Complete model: we concatenate all the initial appearance weights $w_m^A$ with the corresponding deformation weights $w_m^D$ to form the complete $w$ that is used to initialize Eq. (5) and thus start the latent SVM optimization.
Initialization without pose: in the experiments we also tried to learn the model without using the facial pose. In this case we first perform the left-right alignment, as previously explained, on the entire training data. Afterwards, we perform k-means clustering on the extracted features, with k equal to the number of desired mixtures. Each cluster then represents one mixture, and the following initialization steps are the same as before.
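A sketch of this pose-free split using scikit-learn (our illustration; `features` are the left-right-aligned appearance features, flattened per sample):

```python
import numpy as np
from sklearn.cluster import KMeans

def split_into_mixtures(features, n_mixtures, seed=0):
    """Cluster aligned training samples by appearance; each cluster
    becomes one mixture of the model."""
    X = features.reshape(len(features), -1)
    km = KMeans(n_clusters=n_mixtures, random_state=seed, n_init=10)
    return km.fit_predict(X)   # mixture label per sample
```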
3.4. Facial point localization
Once training is completed, we obtain a set of deformable templates representing the different views of a face. In Fig. 2 we show the positive weights associated to the HOG features learned for 3 different viewpoints. From this visualization we can easily recognize the face structure, and therefore we can manually annotate the facial points that we want to localize in a new image.
Then, for each point we can find which part of the grid it belongs to and anchor it to the corresponding part. In this way, when applying the detector to a new image, the facial point will follow the location of the part it is anchored to. As during learning we trade off appearance and deformation, on test images we can also expect the parts to distort to adapt to the current image.

If a facial point is placed at the edge between two parts, its actual placement on the image could vary considerably depending on which part we decide to attach it to. To avoid this problem, each facial point is attached to the 4 closest parts. The final point location on the image is then the bi-linear interpolation of the location of the point on the four parts. This procedure reduces the quantization effect due to the finite size of the parts. In practice the interpolation always gave better results in our experiments, and we use it throughout.
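At test time the anchored point is recovered by the interpolation just described; a minimal sketch (the bilinear weights are fixed once on the model grid):

```python
def localize_facial_point(part_ids, bilinear_w, part_locations):
    """Image location of a facial point attached to its 4 closest
    parts: the weighted mean of the detected part locations.

    part_ids       : indices of the 4 closest parts on the grid.
    bilinear_w     : 4 bilinear weights, summing to 1.
    part_locations : (n_parts, 2) detected part centers in the image.
    """
    return sum(w * part_locations[p]
               for p, w in zip(part_ids, bilinear_w))
```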
3.5. Evaluating the Facial Alignment
The previous procedure is useful for estimating the location of facial points in real applications, like facial emotion recognition or person identification. However, it is based on a subjective localization of the facial points on the object model. A more direct way to estimate how good a model is at aligning faces is to re-project, for each image, the ground truth annotations onto the object model. If most of the points fire at the same location on the object model, the alignment is good. In this sense, for each facial point we evaluate the standard deviation of the re-projections over the annotated faces.

Again, the re-projection is computed using bi-linear interpolation: for each annotated facial point, the 4 closest parts are detected and the location of the point on the model is their weighted mean. An example of annotated facial point re-projection is shown in Fig. 2.
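The alignment measure of this subsection could be computed along these lines (a sketch under our own conventions; how the per-axis deviations are combined into a single number is our assumption):

```python
import numpy as np

def alignment_spread(reprojections, face_size):
    """Average standard deviation of the re-projected ground-truth
    points on the model, as a fraction of the face size.

    reprojections : (n_images, n_points, 2) model-space coordinates
                    obtained by bilinear re-projection.
    """
    std_xy = reprojections.std(axis=0)          # (n_points, 2)
    per_point = np.linalg.norm(std_xy, axis=1)  # combine x and y
    return per_point.mean() / face_size
```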
4. Experiments
4.1. Datasets
We train our method using 900 samples from two well-known datasets: MultiPIE [8] and Labeled Faces in the Wild (LFW) [12]. We use 900 samples for training to enable a fair comparison with previous methods [33] that used the same number of samples. In contrast to the other methods, we use only the location of the bounding box and the pose (yaw) of the face, but not the facial point locations. We use MultiPIE, a collection of images taken in a controlled environment, to perform an analysis of the model parameters. Afterwards, for a comparison with other state-of-the-art methods, we train on LFW, which contains unconstrained images and better represents the "in-the-wild" data distribution. For both datasets, we collect negative samples from the negative images of the INRIA dataset [3], which do not contain faces. Testing is performed on Annotated Faces in the Wild (AFW), proposed in [33].
4.2. Number of Mixtures
In Table 1 we evaluate the effect of changing the number of mixtures of the model, using a grid of 10 × 10 parts with a part size of 4 × 4 HOG cells and 2 cells of overlap. We train our model on MultiPIE considering 300 frontal views and 600 lateral views spanning from ±15 to ±90 degrees, as in [33]. For each configuration we evaluate on AFW the detection average precision (AP) and the average standard deviation of the projection of the facial points on the model, as explained in Sec. 3.5. For the average precision, we consider a detection as correct if its overlap with the ground truth bounding box is more than 50%, as in [5]. We use the average standard deviation of the re-projection of the annotated facial points (as a percentage of the face size) to estimate the alignment capability of the model on the test samples. A high standard deviation means that the localization is poor, while a small one indicates that the model aligns well with the test images.

From the table we can see that increasing the number of mixtures (up to 8) leads to better facial point estimation (lower standard deviation). However, increasing the number of mixtures reduces the number of samples per mixture, and thus with more than 8 mixtures the facial point localization becomes worse. For detection, we can see that 6 mixtures already give near-optimal performance. As the computational time is linear in the number of mixtures, for the next experiments we select the configuration with 6 components, which has a quite good AP and facial

References (partial)

Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE TPAMI, 23(11):1222-1239, 2001.
N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE TPAMI, 32(9):1627-1645, 2010.
P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 57(2):137-154, 2004.
W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399-458, 2003.