One Millisecond Face Alignment with an Ensemble of Regression Trees

doi:10.1109/CVPR.2014.241

Vahid Kazemi and Josephine Sullivan

KTH, Royal Institute of Technology

Computer Vision and Active Perception Lab

Teknikringen 14, Stockholm, Sweden

{vahidk,sullivan}@csc.kth.se

Abstract

This paper addresses the problem of Face Alignment for

a single image. We show how an ensemble of regression

trees can be used to estimate the face’s landmark positions

directly from a sparse subset of pixel intensities, achieving

super-realtime performance with high quality predictions.

We present a general framework based on gradient boosting

for learning an ensemble of regression trees that optimizes

the sum of square error loss and naturally handles missing

or partially labelled data. We show how using appropriate

priors exploiting the structure of image data helps with efficient feature selection. Different regularization strategies and their importance in combating overfitting are also investigated. In addition, we analyse the effect of the quantity of

training data on the accuracy of the predictions and explore

the effect of data augmentation using synthesized data.

1. Introduction

In this paper we present a new algorithm that performs

face alignment in milliseconds and achieves accuracy supe-

rior or comparable to state-of-the-art methods on standard

datasets. The speed gains over previous methods are a consequence of identifying the essential components of prior

face alignment algorithms and then incorporating them in

a streamlined formulation into a cascade of high capacity

regression functions learnt via gradient boosting.

We show, as others have [8, 2], that face alignment can

be solved with a cascade of regression functions. In our case

each regression function in the cascade efﬁciently estimates

the shape from an initial estimate and the intensities of a

sparse set of pixels indexed relative to this initial estimate.

Our work builds on the large amount of research over the

last decade that has resulted in signiﬁcant progress for face

alignment [9, 4, 13, 7, 15, 1, 16, 18, 3, 6, 19]. In particular,

we incorporate into our learnt regression functions two key

elements that are present in several of the successful algo-

rithms cited and we detail these elements now.

Figure 1. Selected results on the HELEN dataset. An ensemble of randomized regression trees is used to detect 194 landmarks on a face from a single image in a millisecond.

The ﬁrst revolves around the indexing of pixel intensi-

ties relative to the current estimate of the shape. The ex-

tracted features in the vector representation of a face image

can greatly vary due to both shape deformation and nui-

sance factors such as changes in illumination conditions.

This makes accurate shape estimation using these features

difﬁcult. The dilemma is that we need reliable features to

accurately predict the shape, and on the other hand we need

an accurate estimate of the shape to extract reliable features.

Previous work [4, 9, 5, 8], as well as this work, uses an iterative approach (the cascade) to deal with this problem.

Instead of regressing the shape parameters based on fea-

tures extracted in the global coordinate system of the image,

the image is transformed to a normalized coordinate system

based on a current estimate of the shape, and then the fea-

tures are extracted to predict an update vector for the shape

parameters. This process is usually repeated several times

until convergence.

The second considers how to combat the difﬁculty of the

inference/prediction problem. At test time, an alignment al-

gorithm has to estimate the shape, a high dimensional vec-

tor, that best agrees with the image data and our model of

shape. The problem is non-convex with many local optima.

Successful algorithms [4, 9] handle this problem by assum-

ing the estimated shape must lie in a linear subspace, which

can be discovered, for example, by ﬁnding the principal

components of the training shapes. This assumption greatly

reduces the number of potential shapes considered during

inference and can help to avoid local optima. Recent work

[8, 11, 2] uses the fact that a certain class of regressors are

guaranteed to produce predictions that lie in a linear sub-

space deﬁned by the training shapes and there is no need

for additional constraints. Crucially, our regression func-

tions have these two elements.

Allied to these two factors is our efﬁcient regression

function learning. We optimize an appropriate loss func-

tion and perform feature selection in a data-driven manner.

In particular, we learn each regressor via gradient boosting

[10] with a squared error loss function, the same loss func-

tion we want to minimize at test time. The sparse pixel set,

used as the regressor’s input, is selected via a combination

of the gradient boosting algorithm and a prior probability on

the distance between pairs of input pixels. The prior distri-

bution allows the boosting algorithm to efﬁciently explore

a large number of relevant features. The result is a cascade

of regressors that can localize the facial landmarks when

initialized with the mean face pose.

The major contributions of this paper are

1. A novel method for alignment based on an ensemble of regression trees that performs shape invariant feature selection while minimizing the same loss function during training as we want to minimize at test time.

2. We present a natural extension of our method that han-

dles missing or uncertain labels.

3. Quantitative and qualitative results are presented that

conﬁrm that our method produces high quality predic-

tions while being much more efﬁcient than the best

previous method (Figure 1).

4. The effects of the quantity of training data, the use of partially labeled data and the use of synthesized data on the quality of predictions are analyzed.

2. Method

This paper presents an algorithm to precisely estimate

the position of facial landmarks in a computationally efﬁ-

cient way. Similar to previous works [8, 2] our proposed

method utilizes a cascade of regressors. In the rest of this

section we describe the details of the form of the individual

components of the cascade and how we perform training.

2.1. The cascade of regressors

To begin we introduce some notation. Let $x_i \in \mathbb{R}^2$ be the x, y-coordinates of the ith facial landmark in an image I. Then the vector $S = (x_1^T, x_2^T, \ldots, x_p^T)^T \in \mathbb{R}^{2p}$ denotes the coordinates of all the p facial landmarks in I. Frequently, in this paper we refer to the vector S as the shape. We use $\hat{S}^{(t)}$ to denote our current estimate of S. Each regressor, $r_t(\cdot, \cdot)$, in the cascade predicts an update vector from the image and $\hat{S}^{(t)}$ that is added to the current shape estimate $\hat{S}^{(t)}$ to improve the estimate:

$$\hat{S}^{(t+1)} = \hat{S}^{(t)} + r_t(I, \hat{S}^{(t)}) \tag{1}$$

The critical point of the cascade is that the regressor $r_t$ makes its predictions based on features, such as pixel intensity values, computed from I and indexed relative to the current shape estimate $\hat{S}^{(t)}$. This introduces some form of

geometric invariance into the process and as the cascade

proceeds one can be more certain that a precise semantic

location on the face is being indexed. Later we describe

how this indexing is performed.

Note that the range of outputs spanned by the ensemble is guaranteed to lie in a linear subspace of the training data if the initial estimate $\hat{S}^{(0)}$ belongs to this space. We therefore do

not need to enforce additional constraints on the predictions

which greatly simpliﬁes our method. The initial shape can

simply be chosen as the mean shape of the training data

centered and scaled according to the bounding box output

of a generic face detector.
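The update rule of equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the shape is stored as a flat list of 2p floats, each regressor is any callable taking the image and the current shape, and `toy_regressor` is a hypothetical stand-in for a learnt $r_t$ (it simply moves half of the way towards a known target).

```python
from typing import Callable, List, Sequence

Shape = List[float]  # flattened (x1, y1, ..., xp, yp)
Regressor = Callable[[object, Shape], Shape]


def run_cascade(image, initial_shape: Sequence[float],
                regressors: Sequence[Regressor]) -> Shape:
    """Apply equation (1) once per cascade level and return the final shape."""
    shape = list(initial_shape)  # copy so the caller's initial shape is untouched
    for r_t in regressors:
        update = r_t(image, shape)  # update vector predicted from I and S^(t)
        shape = [s + d for s, d in zip(shape, update)]
    return shape


def toy_regressor(target: Sequence[float], step: float = 0.5) -> Regressor:
    """Hypothetical stand-in for a learnt r_t (not the paper's regressor)."""
    def r_t(image, shape):
        return [step * (t - s) for t, s in zip(target, shape)]
    return r_t


target = [10.0, 20.0, 30.0, 40.0]
mean_shape = [0.0, 0.0, 0.0, 0.0]
cascade = [toy_regressor(target) for _ in range(10)]  # T = 10 levels
estimate = run_cascade(None, mean_shape, cascade)
```

With this toy regressor the remaining error halves at every level, which mirrors the qualitative behaviour of a cascade: each level only has to correct what the previous levels left over.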

To train each $r_t$ we use the gradient tree boosting algorithm with a sum of square error loss as described in [10].

We now give the explicit details of this process.

2.2. Learning each regressor in the cascade

Assume we have training data $(I_1, S_1), \ldots, (I_n, S_n)$ where each $I_i$ is a face image and $S_i$ its shape vector. To learn the first regression function $r_0$ in the cascade we create from our training data triplets of a face image, an initial shape estimate and the target update step, that is, $(I_{\pi_i}, \hat{S}_i^{(0)}, \Delta S_i^{(0)})$ where

$$\pi_i \in \{1, \ldots, n\} \tag{2}$$

$$\hat{S}_i^{(0)} \in \{S_1, \ldots, S_n\} \setminus S_{\pi_i} \quad \text{and} \tag{3}$$

$$\Delta S_i^{(0)} = S_{\pi_i} - \hat{S}_i^{(0)} \tag{4}$$

for i = 1, . . . , N. We set the total number of these triplets to N = nR where R is the number of initializations used per image $I_i$. Each initial shape estimate for an image is sampled uniformly from $\{S_1, \ldots, S_n\}$ without replacement.

From this data we learn the regression function $r_0$ (see algorithm 1), using gradient tree boosting with a sum of square error loss. The set of training triplets is then updated to provide the training data, $(I_{\pi_i}, \hat{S}_i^{(1)}, \Delta S_i^{(1)})$, for the next regressor $r_1$ in the cascade by setting (with t = 0)

$$\hat{S}_i^{(t+1)} = \hat{S}_i^{(t)} + r_t(I_{\pi_i}, \hat{S}_i^{(t)}) \tag{5}$$

$$\Delta S_i^{(t+1)} = S_{\pi_i} - \hat{S}_i^{(t+1)} \tag{6}$$

This process is iterated until a cascade of T regressors $r_0, r_1, \ldots, r_{T-1}$ is learnt which, when combined, give a sufficient level of accuracy.
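The construction of the training triplets in equations (2) to (6) can be sketched as follows. This is one possible reading, not the authors' code: images are referred to by their index $\pi_i$ (0-based here), each image receives R distinct initial shapes drawn from the other training shapes, and the helper names `make_triplets` and `advance` are hypothetical.

```python
import random


def make_triplets(shapes, R, rng):
    """Build (pi_i, S_hat^(0), Delta S^(0)) triplets, equations (2)-(4).

    Assumes R <= len(shapes) - 1 so that R distinct initial shapes exist.
    """
    n = len(shapes)
    triplets = []
    for pi in range(n):
        others = [j for j in range(n) if j != pi]      # excludes S_pi, eq. (3)
        for j in rng.sample(others, R):                # without replacement
            s_hat = list(shapes[j])
            delta = [a - b for a, b in zip(shapes[pi], s_hat)]  # eq. (4)
            triplets.append((pi, s_hat, delta))
    return triplets


def advance(triplets, shapes, images, r_t):
    """Move every triplet to the next cascade level, equations (5)-(6)."""
    updated = []
    for pi, s_hat, _ in triplets:
        step = r_t(images[pi], s_hat)
        s_next = [s + u for s, u in zip(s_hat, step)]           # eq. (5)
        delta = [a - b for a, b in zip(shapes[pi], s_next)]     # eq. (6)
        updated.append((pi, s_next, delta))
    return updated


shapes = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [3.0, 9.0]]  # toy one-landmark shapes
images = [None] * len(shapes)                              # placeholders for I_1..I_n
triplets = make_triplets(shapes, R=2, rng=random.Random(0))
shifted = advance(triplets, shapes, images, lambda img, s: [1.0] * len(s))
```

Here N = nR = 8 triplets are produced, and the constant toy regressor shows that the target $\Delta S$ shrinks by exactly the amount the regressor has already corrected.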

As stated, each regressor $r_t$ is learned using the gradient tree boosting algorithm. It should be remembered that

a square error loss is used and the residuals computed in

the innermost loop correspond to the gradient of this loss

function evaluated at each training sample. Included in

the statement of the algorithm is a learning rate parame-

ter 0 < ν ≤ 1 also known as the shrinkage factor. Set-

ting ν < 1 helps combat over-ﬁtting and usually results in

regressors which generalize much better than those learnt

with ν = 1 [10].

Algorithm 1 Learning $r_t$ in the cascade

Have training data $\{(I_{\pi_i}, \hat{S}_i^{(t)}, \Delta S_i^{(t)})\}_{i=1}^{N}$ and the learning rate (shrinkage factor) $0 < \nu < 1$

1. Initialise

$$f_0(I, \hat{S}^{(t)}) = \arg\min_{\gamma \in \mathbb{R}^{2p}} \sum_{i=1}^{N} \| \Delta S_i^{(t)} - \gamma \|^2$$

2. for k = 1, . . . , K:

(a) Set for i = 1, . . . , N

$$r_{ik} = \Delta S_i^{(t)} - f_{k-1}(I_{\pi_i}, \hat{S}_i^{(t)})$$

(b) Fit a regression tree to the targets $r_{ik}$ giving a weak regression function $g_k(I, \hat{S}^{(t)})$.

(c) Update

$$f_k(I, \hat{S}^{(t)}) = f_{k-1}(I, \hat{S}^{(t)}) + \nu \, g_k(I, \hat{S}^{(t)})$$

3. Output $r_t(I, \hat{S}^{(t)}) = f_K(I, \hat{S}^{(t)})$
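The structure of Algorithm 1 can be illustrated with a small, self-contained Python sketch. It makes several simplifying assumptions that are not in the paper: each training example is summarised by a single scalar feature instead of an image and shape, and the weak learner `fit_stump` is a depth-one tree on that feature (it needs at least two distinct feature values). Only the boosting loop itself follows the steps above.

```python
def mean_vec(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_stump(features, targets):
    """Stand-in weak learner: the single threshold with the lowest squared error."""
    best = None
    for tau in sorted(set(features))[:-1]:
        left = [t for f, t in zip(features, targets) if f <= tau]
        right = [t for f, t in zip(features, targets) if f > tau]
        mu_l, mu_r = mean_vec(left), mean_vec(right)
        err = sum((a - b) ** 2 for t in left for a, b in zip(t, mu_l))
        err += sum((a - b) ** 2 for t in right for a, b in zip(t, mu_r))
        if best is None or err < best[0]:
            best = (err, tau, mu_l, mu_r)
    _, tau, mu_l, mu_r = best
    return lambda f: mu_l if f <= tau else mu_r


def boost(features, deltas, K=50, nu=0.1):
    """Gradient boosting with squared error loss, following steps 1-3."""
    f0 = mean_vec(deltas)                          # step 1: constant minimiser
    preds = [list(f0) for _ in deltas]
    learners = []
    for _ in range(K):                             # step 2
        residuals = [[d - p for d, p in zip(dv, pv)]
                     for dv, pv in zip(deltas, preds)]           # 2(a)
        g_k = fit_stump(features, residuals)                     # 2(b)
        learners.append(g_k)
        preds = [[p + nu * g for p, g in zip(pv, g_k(f))]
                 for pv, f in zip(preds, features)]              # 2(c)

    def r_t(f):                                    # step 3: f_K
        out = list(f0)
        for g_k in learners:
            out = [o + nu * g for o, g in zip(out, g_k(f))]
        return out
    return r_t


features = [0.0, 1.0, 2.0, 3.0]
deltas = [[0.0, 0.0], [0.0, 0.0], [4.0, 2.0], [4.0, 2.0]]
r_t = boost(features, deltas)
```

Because of the shrinkage factor, each round removes only a fraction $\nu$ of the remaining residual, so many weak learners are needed before the fit becomes tight.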

2.3. Tree based regressor

The core of each regression function $r_t$ consists of the tree based regressors fit to the residual targets during the gradient

boosting algorithm. We now review the most important im-

plementation details for training each regression tree.

2.3.1 Shape invariant split tests

At each split node in the regression tree we make a decision

based on thresholding the difference between the intensities

of two pixels. The pixels used in the test are at positions u

and v when deﬁned in the coordinate system of the mean

shape. For a face image with an arbitrary shape, we would

like to index the points that have the same position rela-

tive to its shape as u and v have to the mean shape. To

achieve this, the image can be warped to the mean shape

based on the current shape estimate before extracting the

features. Since we only use a very sparse representation of

the image, it is much more efﬁcient to warp the location

of points as opposed to the whole image. Furthermore, a

crude approximation of warping can be done using only a

global similarity transform in addition to local translations

as suggested by [2].

The precise details are as follows. Let $k_u$ be the index of the facial landmark in the mean shape that is closest to u and define its offset from u as

$$\delta x_u = u - \bar{x}_{k_u}$$

Then for a shape $S_i$ defined in image $I_i$, the position in $I_i$ that is qualitatively similar to u in the mean shape image is given by

$$u' = x_{i,k_u} + \frac{1}{s_i} R_i^T \delta x_u \tag{7}$$

where $s_i$ and $R_i$ are the scale and rotation matrix of the similarity transform which transforms $S_i$ to $\bar{S}$, the mean shape. The scale and rotation are found to minimize

$$\sum_{j=1}^{p} \| \bar{x}_j - (s_i R_i x_{i,j} + t_i) \|^2 \tag{8}$$

the sum of squares between the mean shape's facial landmark points, the $\bar{x}_j$'s, and those of the warped shape. $v'$ is similarly defined. Formally each split is a decision involving 3 parameters $\theta = (\tau, u, v)$ and is applied to each training and test example as

$$h(I_{\pi_i}, \hat{S}_i^{(t)}, \theta) = \begin{cases} 1 & I_{\pi_i}(u') - I_{\pi_i}(v') > \tau \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

where $u'$ and $v'$ are defined using the scale and rotation matrix which best warp $\hat{S}_i^{(t)}$ to $\bar{S}$ according to equation (7).

In practice the assignments and local translations are de-

termined during the training phase. Calculating the similar-

ity transform, at test time the most computationally expen-

sive part of this process, is only done once at each level of

the cascade.
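Equations (7) to (9) can be sketched compactly by representing 2D points as Python complex numbers, an implementation choice made here for brevity and not taken from the paper. A similarity transform is then $z \mapsto a z + b$ with $a = s e^{i\phi}$, and the factor $\frac{1}{s} R^T$ in equation (7) becomes division by $a$. The function names and the nearest-pixel lookup are assumptions of this sketch.

```python
def fit_similarity(src, dst):
    """Least-squares a, b such that dst_j ~ a * src_j + b (minimises eq. (8))."""
    n = len(src)
    src_c = sum(src) / n
    dst_c = sum(dst) / n
    num = sum((w - dst_c) * (z - src_c).conjugate() for z, w in zip(src, dst))
    den = sum(abs(z - src_c) ** 2 for z in src)
    a = num / den                    # a = s * exp(i * phi): scale and rotation
    return a, dst_c - a * src_c      # b plays the role of the translation t_i


def warp_point(u, mean_shape, shape, a):
    """Equation (7): map u from mean-shape coordinates into the image."""
    k_u = min(range(len(mean_shape)), key=lambda k: abs(mean_shape[k] - u))
    delta = u - mean_shape[k_u]      # offset from the closest mean landmark
    return shape[k_u] + delta / a    # dividing by a applies (1/s) R^T


def split_test(image, u_p, v_p, tau):
    """Equation (9) on a row-major image, using nearest-pixel lookup."""
    def px(z):
        return image[int(round(z.imag))][int(round(z.real))]
    return 1 if px(u_p) - px(v_p) > tau else 0


mean_shape = [0 + 0j, 2 + 0j, 0 + 2j]
# Toy current shape: the mean shape rotated by 90 degrees, scaled by 2, shifted.
shape = [2j * z + (5 + 5j) for z in mean_shape]
a, b = fit_similarity(shape, mean_shape)       # warps the shape to the mean
u_prime = warp_point(0.5 + 0j, mean_shape, shape, a)
v_prime = warp_point(0 + 0j, mean_shape, shape, a)
grid = [[10 * y + x for x in range(10)] for y in range(10)]
decision = split_test(grid, u_prime, v_prime, tau=5)
```

Only the two test points are warped, never the whole image, which is the efficiency argument made above. The transform is fitted once per cascade level and reused for every split at that level.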

2.3.2 Choosing the node splits

For each regression tree, we approximate the underlying

function with a piecewise constant function where a con-

stant vector is ﬁt to each leaf node. To train the regression

tree we randomly generate a set of candidate splits, that is

θ's, at each node. We then greedily choose the $\theta^*$, from these candidates, which minimizes the sum of square error. If Q is the set of the indices of the training examples at a node, this corresponds to minimizing

$$E(Q, \theta) = \sum_{s \in \{l, r\}} \sum_{i \in Q_{\theta,s}} \| r_i - \mu_{\theta,s} \|^2 \tag{10}$$

where $Q_{\theta,l}$ is the set of indices of the examples that are sent to the left node due to the decision induced by θ, $r_i$ is the vector of all the residuals computed for image i in the gradient boosting algorithm and

$$\mu_{\theta,s} = \frac{1}{|Q_{\theta,s}|} \sum_{i \in Q_{\theta,s}} r_i, \quad \text{for } s \in \{l, r\} \tag{11}$$

The optimal split can be found very efficiently because if one rearranges equation (10) and omits the terms not dependent on θ then one can see that

$$\arg\min_{\theta} E(Q, \theta) = \arg\max_{\theta} \sum_{s \in \{l, r\}} |Q_{\theta,s}| \, \mu_{\theta,s}^T \mu_{\theta,s}$$

Here we only need to compute $\mu_{\theta,l}$ when evaluating different θ's, as $\mu_{\theta,r}$ can be calculated from the average of the targets at the parent node, µ, and $\mu_{\theta,l}$ as follows

$$\mu_{\theta,r} = \frac{|Q| \mu - |Q_{\theta,l}| \mu_{\theta,l}}{|Q_{\theta,r}|}$$
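The equivalence between minimising equation (10) and maximising $\sum_s |Q_{\theta,s}| \mu_{\theta,s}^T \mu_{\theta,s}$ can be checked numerically. In the sketch below each candidate split is represented simply as a list of booleans saying which examples go left, which is an assumption made to keep the example independent of images. The right-hand sum is derived from the parent sum, as in the last formula above.

```python
def split_score(residuals, goes_left):
    """Sum over s of |Q_s| mu_s^T mu_s, using only the left-hand sums."""
    dim = len(residuals[0])
    total = [sum(r[d] for r in residuals) for d in range(dim)]   # |Q| * mu
    left = [r for r, g in zip(residuals, goes_left) if g]
    n_l, n_r = len(left), len(residuals) - len(left)
    if n_l == 0 or n_r == 0:
        return float("-inf")          # degenerate split, never selected
    sum_l = [sum(r[d] for r in left) for d in range(dim)]        # |Q_l| * mu_l
    sum_r = [t - l for t, l in zip(total, sum_l)]                # |Q_r| * mu_r
    return sum(x * x for x in sum_l) / n_l + sum(x * x for x in sum_r) / n_r


def sse(residuals, goes_left):
    """Direct evaluation of equation (10); both sides must be non-empty."""
    total = 0.0
    for side in (True, False):
        group = [r for r, g in zip(residuals, goes_left) if g == side]
        mu = [sum(col) / len(group) for col in zip(*group)]
        total += sum((x - m) ** 2 for r in group for x, m in zip(r, mu))
    return total


def best_split(residuals, candidates):
    """Index of the candidate that maximises the score (minimises eq. (10))."""
    return max(range(len(candidates)),
               key=lambda c: split_score(residuals, candidates[c]))


residuals = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
candidates = [[True, False, True, False], [True, True, False, False]]
chosen = best_split(residuals, candidates)
energy = sum(x * x for r in residuals for x in r)   # term independent of theta
```

For every non-degenerate candidate, `sse` and `split_score` add up to the same constant `energy`, which is the omitted term that does not depend on θ.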

2.3.3 Feature selection

The decision at each node is based on thresholding the dif-

ference of intensity values at a pair of pixels. This is a rather

simple test, but it is much more powerful than single in-

tensity thresholding because of its relative insensitivity to

changes in global lighting. Unfortunately, the drawback of

using pixel differences is the number of potential split (fea-

ture) candidates is quadratic in the number of pixels in the

mean image. This makes it difﬁcult to ﬁnd good θ’s with-

out searching over a very large number of them. However,

this limiting factor can be eased, to some extent, by taking

the structure of image data into account. We introduce an

exponential prior

$$P(u, v) \propto e^{-\lambda \| u - v \|} \tag{12}$$

over the distance between the pixels used in a split to en-

courage closer pixel pairs to be chosen.

We found using this simple prior reduces the prediction

error on a number of face datasets. Figure 4 compares the

features selected with and without this prior, where the size

of the feature pool is ﬁxed to 20 in both cases.
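One simple way to draw pixel pairs according to the prior of equation (12) is rejection sampling: pick two pool locations uniformly and accept the pair with probability $e^{-\lambda \|u - v\|}$. The paper does not state how its sampling is implemented, so this is an assumed procedure. The 100 by 100 coordinate frame and the function names are also hypothetical.

```python
import math
import random


def sample_pixel_pair(pool, lam, rng):
    """Draw (u, v) from the pool with probability proportional to exp(-lam * |u - v|)."""
    while True:
        u, v = rng.sample(pool, 2)                      # uniform proposal, u != v
        if rng.random() < math.exp(-lam * math.dist(u, v)):
            return u, v                                 # accepted under the prior


def mean_pair_distance(pool, lam, rng, trials=300):
    """Average distance between the two pixels over repeated draws."""
    total = 0.0
    for _ in range(trials):
        u, v = sample_pixel_pair(pool, lam, rng)
        total += math.dist(u, v)
    return total / trials


rng = random.Random(0)
# P = 400 candidate locations in an assumed 100 x 100 mean-shape frame.
pool = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(400)]
uniform_mean = mean_pair_distance(pool, 0.0, rng)       # lam = 0 is the uniform prior
prior_mean = mean_pair_distance(pool, 0.1, rng)         # lam = 0.1
```

Setting `lam` to zero accepts every proposal and recovers the uniform prior, so the same function covers both settings that are compared.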

2.4. Handling missing labels

The objective of equation (10) can be easily extended to handle the case where some of the landmarks are not labeled in some of the training images (or we have a measure of uncertainty for each landmark). Introduce variables $w_{i,j} \in [0, 1]$ for each training image i and each landmark j. Setting $w_{i,j}$ to 0 indicates that the landmark j is not labeled in the ith image while setting it to 1 indicates that it is. Then equation (10) can be updated to

$$E(Q, \theta) = \sum_{s \in \{l, r\}} \sum_{i \in Q_{\theta,s}} (r_i - \mu_{\theta,s})^T W_i (r_i - \mu_{\theta,s})$$

where $W_i$ is a diagonal matrix with the vector $(w_{i1}, w_{i1}, w_{i2}, w_{i2}, \ldots, w_{ip}, w_{ip})^T$ on its diagonal and

$$\mu_{\theta,s} = \Big( \sum_{i \in Q_{\theta,s}} W_i \Big)^{-1} \sum_{i \in Q_{\theta,s}} W_i r_i, \quad \text{for } s \in \{l, r\} \tag{13}$$

The gradient boosting algorithm must also be modified to account for these weight factors. This can be done simply by initializing the ensemble model with the weighted average of targets, and fitting regression trees to the weighted residuals in algorithm 1 as follows

$$r_{ik} = W_i (\Delta S_i^{(t)} - f_{k-1}(I_{\pi_i}, \hat{S}_i^{(t)})) \tag{14}$$
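Because every $W_i$ is diagonal, the matrix inverse in equation (13) reduces to a per-coordinate division. The sketch below illustrates this. Returning 0 for a coordinate that no example in the node has labelled is an assumption of this sketch, since the sum of the $W_i$ is singular in that case and the paper does not say how it is handled.

```python
def expand(landmark_weights):
    """Diagonal of W_i: each landmark weight repeated for its x and y entry."""
    return [w for w in landmark_weights for _ in range(2)]


def weighted_mean(residuals, weights):
    """Equation (13) for diagonal W_i, computed coordinate by coordinate."""
    dim = len(residuals[0])
    mu = []
    for d in range(dim):
        w_sum = sum(w[d] for w in weights)
        wr_sum = sum(w[d] * r[d] for r, w in zip(residuals, weights))
        # Assumption: fall back to 0 when nobody labelled this coordinate.
        mu.append(wr_sum / w_sum if w_sum > 0 else 0.0)
    return mu


def weighted_error(residuals, weights, mu):
    """One side of the weighted objective: sum of (r - mu)^T W (r - mu)."""
    return sum(w[d] * (r[d] - mu[d]) ** 2
               for r, w in zip(residuals, weights)
               for d in range(len(mu)))


# Two examples with two landmarks each; the second landmark of the second
# example is unlabelled, so its residual entries (99, 99) must be ignored.
residuals = [[1.0, 1.0, 10.0, 10.0], [3.0, 3.0, 99.0, 99.0]]
weights = [expand([1.0, 1.0]), expand([1.0, 0.0])]
mu = weighted_mean(residuals, weights)
err = weighted_error(residuals, weights, mu)
```

The placeholder values of the unlabelled landmark affect neither the leaf mean nor the split error, which is the behaviour the weights are meant to provide.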

3. Experiments

Baselines: To accurately benchmark the performance of our proposed method, an ensemble of regression trees (ERT), we created two more baselines. The first is based

on randomized ferns with random feature selection (EF)

and the other is a more advanced version of this with cor-

relation based feature selection (EF+CB) which is our re-

implementation of [2]. All the parameters are ﬁxed for all

three approaches.

EF uses a straightforward implementation of random-

ized ferns as the weak regressors within the ensemble and

is the fastest to train. We use the same shrinkage method as

suggested by [2] to regularize the ferns.

EF+CB uses a correlation based feature selection method that projects the target outputs, the $r_i$'s, onto a random direction, w, and chooses the pairs of features (u, v) such that $I_i(u') - I_i(v')$ has the highest sample correlation over the training data with the projected targets $w^T r_i$.

Parameters: Unless specified, all the experiments are performed with the following fixed parameter settings. The number of strong regressors, $r_t$, in the cascade is T = 10 and each $r_t$ comprises K = 500 weak regressors $g_k$. The depth of the trees (or ferns) used to represent $g_k$ is set to F = 5. At each level of the cascade P = 400 pixel locations are sampled from the image. To train the weak regressors, we randomly sample a pair of these P pixel locations according to our prior and choose a random threshold to create a potential split as described in equation (9). The best split is then found by repeating this process S = 20 times, and choosing the one that optimizes our objective. To create the training data to learn our model we use R = 20 different initializations for each training example.

Figure 2. Panels: (a) T = 0, (b) T = 1, (c) T = 2, (d) T = 3, (e) T = 10, (f) Ground truth. Landmark estimates at different levels of the cascade initialized with the mean shape centered at the output of a basic Viola & Jones [17] face detector. After the first level of the cascade, the error is already greatly reduced.

Performance: The runtime complexity of the algorithm on a single image is constant, O(TKF). The training time depends linearly on the number of training examples, O(NDTKFS), where N is the number of training examples and D is the dimension of the targets. In practice with a

single CPU our algorithm takes about an hour to train on

the HELEN[12] dataset and at runtime it only takes about

one millisecond per image.
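As a rough worked example, the default parameters reported above can be plugged into the two complexity expressions. The figures below are counts of the terms inside the O(·) expressions, not measured operation counts. Taking D = 2p with p = 194 and n = 2000 HELEN training images is an assumption based on the dataset description.

```python
# Default settings from the Parameters paragraph.
T, K, F = 10, 500, 5        # cascade levels, trees per level, tree depth
S, R = 20, 20               # candidate splits per node, initialisations per image

# Assumed from the HELEN description: 2000 training images, 194 landmarks.
n, p = 2000, 194

split_tests_per_image = T * K * F     # pixel-difference tests at run time
N = n * R                             # number of training triplets
D = 2 * p                             # dimension of each target vector
training_term = N * D * T * K * F * S

print(f"run-time split tests per image: {split_tests_per_image:,}")
print(f"training triplets N = nR:       {N:,}")
print(f"training complexity term:       {training_term:.2e}")
```

The run-time count does not depend on the image size or on the number of landmarks in the tests themselves, which is consistent with the constant complexity stated above.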

Database: Most of the experimental results reported are

for the HELEN[12] face database which we found to be the

most challenging publicly available dataset. It consists of a

total of 2330 images, each of which is annotated with 194

landmarks. As suggested by the authors we use 2000 im-

ages for training data and the rest for testing.

We also report ﬁnal results on the popular LFPW[1]

database which consists of 1432 images. Unfortunately, we

could only download 778 training images and 216 valid test

images which makes our results not directly comparable to

those previously reported on this dataset.

Comparison: Table 1 is a summary of our results com-

pared to previous algorithms. In addition to our baselines,

we have also compared our results with two variations of

Active Shape Models, STASM[14] and CompASM[12].

Method   [14]   [12]   EF     EF+CB   EF+CB (5)   EF+CB (10)   ERT

Error    .111   .091   .069   .062    .059        .055         .049

Table 1. A summary of the results of different algorithms on the

HELEN dataset. The error is the average normalized distance

of each landmark to its ground truth position. The distances are

normalized by dividing by the interocular distance. The number

within the bracket represents the number of times the regression

algorithm was run with a random initialization. If no number is

displayed then the method was initialized with the mean shape. In

the case of multiple estimations the median of the estimates was

chosen as the ﬁnal estimate for the landmark.
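The error measure and the median rule described in the caption of Table 1 can be sketched as follows. Two details are assumptions of this sketch: the interocular distance is computed from two eye positions supplied by the caller, and the median over multiple runs is taken per coordinate. The caption does not specify either.

```python
import math
import statistics


def normalized_error(pred, truth, left_eye, right_eye):
    """Mean landmark distance divided by the interocular distance."""
    iod = math.dist(left_eye, right_eye)
    total = sum(math.dist(p, t) for p, t in zip(pred, truth))
    return total / (len(truth) * iod)


def median_shape(estimates):
    """Per-coordinate median over several estimated shapes of the same face."""
    n_landmarks = len(estimates[0])
    return [tuple(statistics.median(e[j][d] for e in estimates)
                  for d in range(2))
            for j in range(n_landmarks)]


truth = [(0.0, 0.0), (10.0, 0.0)]
pred = [(0.0, 1.0), (10.0, 1.0)]
error = normalized_error(pred, truth, left_eye=(0.0, 0.0), right_eye=(10.0, 0.0))

runs = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(2.0, 0.0), (10.0, 2.0)],
    [(1.0, 5.0), (10.0, 1.0)],
]
combined = median_shape(runs)
```

In this toy case every landmark is one pixel off and the eyes are ten pixels apart, so the normalized error is 0.1. The median discards the outlying y-coordinate of the third run.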

The ensemble of regression trees described in this pa-

per signiﬁcantly improves the results over the ensemble of

ferns. Figure 3 shows the average error at different levels

of the cascade which shows that ERT can reduce the error

much faster than other baselines. Note that we have also

provided the results of running EF+CB multiple times and

taking the median of ﬁnal predictions. The results show that

a similar error rate to EF+CB can be achieved by our method

with an order of magnitude less computation.

We have also provided results for the widely used

LFPW[1] dataset (Table 2). With our EF+CB baseline

we could not replicate the numbers reported by [2]. (This

could be due to the fact that we could not obtain the whole

dataset.) Nevertheless our method surpasses most of the

previously reported results on this dataset taking only a frac-

tion of the computational time needed by any other method.

Method   [1]    [2]    EF     EF+CB   EF+CB (5)   EF+CB (10)   ERT

Error    .040   .034   .051   .046    .043        .041         .038

Table 2. A comparison of the different methods when applied to

the LFPW dataset. Please see the caption for table 1 for an expla-

nation of the numbers.

Feature Selection: Table 3 shows the effect of using equation (12) as a prior on the distance between pixels used in a split instead of a uniform prior on the final results. The parameter λ determines the effective maximum distance between the two pixels in our features and was set to 0.1 in our experiments. Selecting this parameter by cross validation when learning each strong regressor, $r_t$, in the cascade could potentially lead to a more significant improvement.

Figure 4 is a visualization of the selected pairs of features

when the different priors are used.

Prior    Uniform   Exponential

Error    .053      .049

Table 3. The effect of using different priors for feature selection

on the ﬁnal average error. An exponential prior is applied on the

Euclidean distance between the two pixels deﬁning a feature, see

equation (12).

Regularization: When using the gradient boosting algo-

rithm one needs to be careful to avoid overﬁtting. To obtain

lower test errors it is necessary to perform some form of

regularization. The simplest approach is shrinkage. This
