Face Alignment by Coarse-to-Fine Shape Searching
Shizhan Zhu^{1,2}   Cheng Li^{2}   Chen Change Loy^{1,3}   Xiaoou Tang^{1,3}
^1 Department of Information Engineering, The Chinese University of Hong Kong
^2 SenseTime Group
^3 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
zs014@ie.cuhk.edu.hk, chengli@sensetime.com, ccloy@ie.cuhk.edu.hk, xtang@ie.cuhk.edu.hk
Abstract

We present a novel face alignment framework based on coarse-to-fine shape searching. Unlike the conventional cascaded regression approaches that start with an initial shape and refine the shape in a cascaded manner, our approach begins with a coarse search over a shape space that contains diverse shapes, and employs the coarse solution to constrain the subsequent finer search of shapes. The unique stage-by-stage progressive and adaptive search i) prevents the final solution from being trapped in local optima due to poor initialisation, a common problem encountered by cascaded regression approaches; and ii) improves the robustness in coping with large pose variations. The framework demonstrates real-time performance and state-of-the-art results on various benchmarks including the challenging 300-W dataset.
1. Introduction

Face alignment aims at locating facial key points automatically. It is essential to many facial analysis tasks, e.g. face verification and recognition [11], expression recognition [2], or facial attribute analysis [16]. Among the many different approaches to face alignment, cascaded pose regression [8, 10, 29, 37] has emerged as one of the most popular and state-of-the-art methods. The algorithm typically starts from an initial shape, e.g. the mean shape of the training samples, and refines the shape through sequentially trained regressors.

In this study, we reconsider the face alignment problem from a different viewpoint by taking a coarse-to-fine shape searching approach (Fig. 1(a)). The algorithm begins with a coarse search in a shape space that encompasses a large number of candidate shapes. The coarse searching stage identifies a sub-region within the shape space for further searching in subsequent finer stages, and simultaneously discards unpromising shape space sub-regions. Subsequent finer stages progressively and adaptively shrink the plausible region and converge the space to a small region where the final shape is estimated. In practice, only three stages are required.

In comparison to the conventional cascaded regression approaches, the coarse-to-fine framework is attractive in two aspects:

1) Initialisation independent: A widely acknowledged shortcoming of the cascaded regression approach is its dependence on initialisation [32]. In particular, if the initialised shape is far from the target shape, it is unlikely that the discrepancy will be completely rectified by subsequent iterations in the cascade. As a consequence, the final solution may be trapped in local optima (Fig. 1(c)). Existing methods often circumvent this problem by adopting heuristic assumptions or strategies (see Sec. 2 for details), which mitigate the problem to a certain extent, but do not fully resolve the issue. The proposed coarse-to-fine framework relaxes the need for shape initialisation. It starts its first stage by exploring the whole shape space, without locking itself onto a specific single initialisation point. This frees the alignment process from being affected by poor initialisation, leading to more robust face alignment.

2) Robust to large pose variation: The early stages of the coarse-to-fine search are formulated to simultaneously accommodate and consider diverse pose variations, e.g. different degrees of head pose and face contour. The search then progressively focuses the processing on a dedicated shape sub-region to estimate the best shape. Experimental results show that this searching mechanism is more robust in coping with large pose variations in comparison to the cascaded regression approach.

Since searching through the shape space is challenging with respect to speed, we propose a hybrid feature setting to achieve practical speed. Owing to the unique error tolerance of the coarse-to-fine searching mechanism, our framework is capable of exploiting the advantages and characteristics of different features. For instance, we have the flexibility to employ a less accurate but computationally efficient feature, e.g. BRIEF [9], at the early stages, and use a more accurate but

[Figure 1 illustration: (a) sub-region searching over the shape space (regression, sampling, unpromising sub-regions discarded across Stages 1-3); (b) steps of the proposed coarse-to-fine search: given P_R^{(l-1)}, estimate the sub-region center x̄^{(l)} and then P_R^{(l)}, for l = 1, 2, 3 (error: 12.04); (c) steps of cascaded regression (baseline) starting from the mean shape (error: 23.05).]
Figure 1. (a) A diagram that illustrates the coarse-to-fine shape searching method for estimating the target shape. (b) to (c) Comparison of the steps between the proposed coarse-to-fine search and cascaded regression. Landmarks on the nose and mouth are trapped in local optima in cascaded regression due to poor initialisation, and later cascaded iterations seldom contribute much to rectifying the shape. The proposed method overcomes these problems through coarse-to-fine shape searching.
relatively slow feature, e.g. SIFT [23], at a later stage. Such a setting allows the proposed framework to achieve improved computational efficiency, whilst still maintaining a high accuracy rate without using accurate features in all stages. Our MATLAB implementation achieves 25 fps real-time performance on a single core of an i5-4590. It is worth pointing out that impressive alignment speed (more than 1000 fps even for 194 landmarks) has been achieved by Ren et al. [29] and Kazemi et al. [20]. Though it is beyond the scope of this work to explore learning-based shape-indexed features, we believe the proposed shape searching framework could benefit from such high-speed features.

Experimental results demonstrate that the coarse-to-fine shape searching framework is a compelling alternative to the popular cascaded regression approaches. Our method outperforms existing methods on various benchmark datasets including the challenging 300-W dataset [30]. Our code is available on the project page mmlab.ie.cuhk.edu.hk/projects/CFSS.html.
2. Related work

A number of methods have been proposed for face alignment, including the classic active appearance models [12, 22, 24] and constrained local models [13, 35, 31, 15].

Face alignment by cascaded regression: There are a few successful methods that adopt the concept of cascaded pose regression [17]. The supervised descent method (SDM) [37] is proposed to solve a nonlinear least squares optimisation problem; the non-linear SIFT [23] feature and linear regressors are applied. Feature learning based methods, e.g. Cao et al. [10] and Burgos-Artizzu et al. [8], regress selected discriminative pixel-difference features with random ferns [27]. Ren et al. [29] learn local binary features with a random forest [6], achieving very fast performance.

All the aforementioned methods assume the initial shape is provided in some form, typically a mean shape [37, 29]. The mean shape is used under the assumption that the test samples are distributed close to the mean pose of the training samples. This assumption does not always hold, especially for faces with large pose variations. Cao et al. [10] propose to run the algorithm several times using different initialisations and take as the final output the median of all predictions. Burgos-Artizzu et al. [8] improve the strategy with a smart restart method, but it requires cross-validation to determine a threshold and the number of runs. In general, these strategies mitigate the problem to some extent, but still do not fully eliminate the dependence on shape initialisation. Zhang et al. [38] propose to obtain the initialisation by predicting a rough estimate from the global image patch, still followed by sequentially trained auto-encoder regression networks. Our method instead solves the initialisation problem by optimising a shape sub-region. We will show in Sec. 4 that our proposed searching method is robust to large pose variations and outperforms previous methods.

Coarse-to-fine methods: The coarse-to-fine approach has been widely used to address various image processing and computer vision problems such as face detection [18], shape detection [1] and optical flow [7]. Some existing face alignment methods also adopt a coarse-to-fine approach, but with a significantly different notion from our shape searching framework. Sun et al. [33] first obtain a coarse estimate of the landmark locations and apply cascaded deep models to refine the position of the landmarks of each facial part. Zhang et al. [38] define coarse-to-fine as applying a cascade of auto-encoder networks on images of increasing resolution.
3. Coarse-to-fine shape searching

Conventional cascaded regression methods refine a shape via sequentially regressing local appearance patterns indexed by the current estimated shape. In particular,

x_{k+1} = x_k + r_k(φ(I; x_k)),   (1)

where the 2n-dimensional shape vector x_k represents the current estimate of the (x, y) coordinates of the n landmarks after the k-th iteration. The local appearance pattern indexed by the shape x on the face image I is denoted as φ(I; x), and r_k is the k-th learned regressor. For simplicity we always omit I in Eq. 1.
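The cascaded update of Eq. 1 can be sketched in a few lines; the feature map and the linear regressors below are illustrative stand-ins, not the learned components from the paper:

```python
import numpy as np

def cascaded_regression(x0, phi, regressors):
    """Apply the cascaded update of Eq. 1: x_{k+1} = x_k + r_k(phi(x_k)).

    x0         : (2n,) initial shape vector, e.g. the mean training shape
    phi        : callable mapping a shape to its (d,) appearance feature
    regressors : list of (W, b) pairs; each linear regressor maps the
                 feature to a (2n,) shape increment
    """
    x = x0.copy()
    for W, b in regressors:
        x = x + W @ phi(x) + b        # one additive refinement step
    return x

# Toy run: a hypothetical "feature" that is simply the residual to a
# target, and regressors that each recover half of that residual.
n2 = 4
target = np.ones(n2)
phi = lambda x: target - x
regs = [(0.5 * np.eye(n2), np.zeros(n2))] * 2
out = cascaded_regression(np.zeros(n2), phi, regs)   # approaches the target
```

Two such iterations recover three quarters of the initial discrepancy in this toy setup, illustrating why a poor starting point far from the target may never be fully rectified within a fixed cascade.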
The estimation by cascaded regression can be easily trapped in local optima given a poor shape initialisation, since the method refines a shape by optimising a single shape vector x (Fig. 1(c)). In our approach, we overcome this problem through a coarse-to-fine shape search within a shape space (Fig. 1(a) and (b)).
3.1. Overview of coarse-to-fine shape searching

Formally, we form a 2n-dimensional shape space. We denote the N candidate shapes in the space as S = {s_1, s_2, ..., s_N} (N ≫ 2n). The candidate shapes in S are obtained from the training set, pre-processed by Procrustes analysis [19]. S is fixed throughout the whole shape searching process.

Given a face image, face alignment is performed through l = 1, ..., L stages of shape searching, as depicted in Fig. 1(a). In each l-th stage, we aim to find a finer shape sub-region, represented by (x̄^{(l)}, P_R^{(l)}), where x̄^{(l)} denotes the center of the estimated shape sub-region, and P_R^{(l)} represents the probability distribution that defines the scope of the estimated sub-region around the center. As the searching progresses through stages, e.g. from Stage 1 to 2, the algorithm adaptively determines the values of x̄ and P_R, leading to a finer shape sub-region for the next searching stage, with a closer estimate of the target shape. The process continues until convergence, and the center of the last, finest sub-region is the final shape estimate.

In each stage, we first determine the sub-region center x̄ based on the given sub-region for this stage, and then estimate the finer sub-region used for further searching. A larger/coarser region is expected at earlier stages, whilst a smaller/finer region is expected at later stages. In the first searching stage, the given 'sub-region' P_R^{(l=0)} is set to be a uniform distribution over all candidate shapes, i.e. the searching region is the full set S. In the subsequent stages, the given sub-region is the estimated P_R^{(l-1)} from the preceding stage.

As an overview of the whole approach, we list the major training steps in Algorithm 1, and introduce the learning method in Sec. 3.2 and Sec. 3.3. The testing procedure is similar, excluding the learning steps. More precisely, the learning steps involve learning the regressors in each stage (Eq. 2 and Step 5 in Algorithm 1) and the parameters for estimating the probabilistic distribution (Eq. 8 and 10, Step 14 in Algorithm 1).

Algorithm 1 Training of coarse-to-fine shape searching
 1: procedure TRAINING(candidate shapes S, training set {I_i; x*_i}, i = 1, ..., N)
 2:   Set P_R^{(0)} to be the uniform distribution over S
 3:   for l = 1, 2, ..., L do
 4:     Sample candidate shapes x^{ij}_0 according to P_R^{(l-1)}
 5:     Learn K_l regressors {r_k}, k = 1, ..., K_l, with {x^{ij}_0, x*_i}, i = 1, ..., N, j = 1, ..., N_l
 6:     Get regressed shapes x^{ij}_{K_l} based on the K_l regressors
 7:     Set initial weights to be equal: w^i(0) = e/N_l
 8:     Construct G^i and edge weights according to Eq. 4
 9:     for t = 0, 1, ..., T-1 do
10:       Update w^i(t+1) according to Eq. 6
11:     end for
12:     Compute sub-region center x̄^{(l)}_i via Eq. 3
13:     if l < L then
14:       Learn the distribution with {x̄^{(l)}_i, x*_i}, i = 1, ..., N
15:       Set the probabilistic distribution P_R^{(l)} via Eq. 7
16:     end if
17:   end for
18: end procedure
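The search loop of Algorithm 1 at test time can be sketched as follows; the sampling, fusion, and distribution-update steps are simplified stand-ins (plain averaging and a generic Gaussian-style reweighting) for the learned components described in Secs. 3.2 and 3.3:

```python
import numpy as np

def coarse_to_fine_search(S, stages):
    """Test-time skeleton of the coarse-to-fine search (Algorithm 1
    without its learning steps). Candidates are fused by plain averaging
    (i.e. uniform weights w_ij in Eq. 3) and the distribution update is
    a hand-rolled reweighting standing in for the learned posterior of
    Eq. 7, shrinking the scope at each stage.

    S      : (N, 2n) array of candidate shapes
    stages : list of dicts, each giving the per-stage sample count 'N_l'
    """
    N = len(S)
    p = np.full(N, 1.0 / N)                    # P_R^(0): uniform over S
    center = S.mean(axis=0)
    for l, stage in enumerate(stages):
        idx = np.random.choice(N, size=stage['N_l'], p=p)
        candidates = S[idx]                    # sample from P_R^(l-1)
        center = candidates.mean(axis=0)       # stand-in for Steps 5-12
        d2 = ((S - center) ** 2).sum(axis=1)   # squared distance to center
        p = np.exp(-d2 / 0.5 ** (l + 1))       # tighter scope each stage
        p /= p.sum()
    return center
```

The key structural point is that each stage re-samples candidates from the whole fixed set S under the current distribution, rather than tracking a single shape vector through the cascade.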
3.2. Learn to estimate sub-region center x̄ given P_R

To learn to compute the sub-region center x̄^{(l)} for the l-th searching stage, three specific steps are conducted.

Step 1: In contrast to cascaded regression, which employs a single initial shape (typically the mean shape) for regression, we explore a larger area in the shape space, guided by the probabilistic distribution P_R^{(l-1)}. In particular, for each training sample, we randomly draw N_l initial shapes from S based on P_R^{(l-1)}. We denote the N_l initial shapes of the i-th training sample as x^{ij}_0, with i = 1, ..., N representing the index of the training sample, and j = 1, ..., N_l denoting the index of the randomly drawn shapes.

Step 2: This step aims to regress each initial shape x^{ij}_0 to a shape closer to the ground truth shape x*_i. Specifically, we learn K_l regressors in a sequential manner with iterations k = 0, ..., K_l - 1, i.e.

r_k = argmin_r Σ_{i=1}^{N} Σ_{j=1}^{N_l} || x*_i - x^{ij}_k - r(φ(x^{ij}_k)) ||_2^2 + Φ(r),
x^{ij}_{k+1} = x^{ij}_k + r_k(φ(x^{ij}_k)),   k = 0, ..., K_l - 1,   (2)
where Φ(r) denotes the ℓ2 regularisation term on each parameter of the model r. It is worth pointing out that K_l is smaller than the number of regression iterations typically needed in cascaded regression. This is because i) owing to the error tolerance of coarse-to-fine searching, the regressed shapes at early stages need not be accurate, and ii) at later stages the initial candidate shapes x^{ij}_0 tend to be similar to the target shape, so fewer iterations are needed for convergence.
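Since r is linear and Φ(r) is an ℓ2 penalty, each iteration of Eq. 2 reduces to a ridge regression with a closed-form solution. A sketch, with a hypothetical feature map φ passed in as a callable:

```python
import numpy as np

def learn_stage_regressors(X0, X_star, phi, K, lam=1.0):
    """Learn the K_l sequential linear regressors of Eq. 2 by ridge
    regression (closed form), treating all (i, j) candidates as rows.

    X0     : (m, 2n) initial candidate shapes (m = N * N_l rows)
    X_star : (m, 2n) ground-truth shape paired with each candidate
    phi    : callable mapping (m, 2n) shapes to (m, d) features
    K      : number of iterations K_l
    lam    : weight of the l2 regulariser Phi(r)
    """
    X = X0.copy()
    regressors = []
    for _ in range(K):
        F = phi(X)                    # shape-indexed features
        Y = X_star - X                # regression target: residual to x*_i
        # Ridge solution: W = (F^T F + lam I)^{-1} F^T Y
        W = np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ Y)
        X = X + F @ W                 # additive update, mirroring Eq. 2
        regressors.append(W)
    return regressors, X
```

Each regressor is fitted to the residuals left by its predecessors, so the candidate shapes move towards their ground truths over the K_l iterations.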
Step 3: After we learn the regressors and obtain the set of regressed shapes {x^{ij}_{K_l}}, j = 1, ..., N_l, we wish to learn a weight vector w^i = (w_{i1}, ..., w_{iN_l})^T to linearly combine all the regressed shapes for collectively estimating the sub-region center x̄^{(l)}_i for the i-th training sample:

x̄^{(l)}_i = Σ_{j=1}^{N_l} w_{ij} x^{ij}_{K_l}.   (3)
A straightforward way to obtain x̄^{(l)}_i is to average all the regressed shapes by fixing w_{ij} = 1/N_l. However, this simple method is found to be susceptible to even a small number of erroneous regressed shapes caused by local optima. In order to suppress their influence in computing the sub-region center, we adopt the dominant set approach [28] for estimating w^i. Intuitively, a high weight is assigned to regressed shapes that form a cohesive cluster, whilst a low weight is given to outliers. This amounts to finding a maximal clique in an undirected graph. Note that this step is purely unsupervised.

More precisely, we construct an undirected graph G^i = (V^i, E^i), where the vertices are the regressed shapes V^i = {x^{ij}_{K_l}}, j = 1, ..., N_l, and each edge in the edge set E^i is weighted by an affinity defined as

a_pq = sim(x^{ip}_{K_l}, x^{iq}_{K_l}) = { exp(-β || x^{ip}_{K_l} - x^{iq}_{K_l} ||_2^2)  if p ≠ q;  0  if p = q }.   (4)
Representing all the elements a_pq in a matrix forms an affinity matrix A. Note that we set the diagonal elements of A to zero to avoid self-loops. Following [28], we find the weight vector w^i by optimising the following problem:

max_{w^i} (w^i)^T A w^i   s.t. w^i ∈ Δ^{N_l},   (5)

where we denote the simplex Δ^n = {x ∈ R^n | x ≥ 0, e^T x = 1}, with e = (1, 1, ..., 1)^T. An efficient way to optimise Eq. 5 is the continuous optimisation technique known as replicator dynamics [28, 36]:

w^i(t+1) = w^i(t) ∘ (A w^i(t)) / (w^i(t)^T A w^i(t)),   (6)

where t = 0, 1, ..., T-1, and the symbol ∘ denotes element-wise multiplication. Intuitively, in each weighting iteration t, each vertex votes all its weight for the other vertices according to the affinity between the two vertices. After optimising Eq. 6 for T iterations, we obtain w^i(t = T) and plug the weight vector into Eq. 3 to estimate the sub-region center.
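Step 3's fusion (Eqs. 3-6) can be sketched end to end: build the affinity matrix of Eq. 4, run the replicator dynamics of Eq. 6 from uniform weights, and take the weighted combination of Eq. 3. The toy data at the end is purely illustrative:

```python
import numpy as np

def dominant_set_center(shapes, beta=1.0, T=10):
    """Fuse regressed shapes into a sub-region center (Eqs. 3-6).

    shapes : (N_l, 2n) regressed shapes x^{ij}_{K_l} for one sample
    Returns the weighted center of Eq. 3 and the weight vector from
    the dominant-set replicator dynamics of Eq. 6.
    """
    Nl = len(shapes)
    # Affinity matrix of Eq. 4, with zero diagonal to avoid self-loops
    d2 = ((shapes[:, None, :] - shapes[None, :, :]) ** 2).sum(-1)
    A = np.exp(-beta * d2)
    np.fill_diagonal(A, 0.0)
    # Replicator dynamics (Eq. 6), starting from uniform weights e/N_l
    w = np.full(Nl, 1.0 / Nl)
    for _ in range(T):
        w = w * (A @ w) / (w @ A @ w)
    return w @ shapes, w              # Eq. 3: weighted shape combination

# Toy check: nine shapes form a tight cluster near the origin and one
# outlier sits far away -- the outlier should receive negligible weight.
rng = np.random.default_rng(0)
pts = rng.normal(scale=0.01, size=(9, 4))
pts = np.vstack([pts, np.full((1, 4), 5.0)])
center, w = dominant_set_center(pts)
```

Because the numerator of each update sums to the denominator, the weights stay on the simplex throughout, and mass drains away from low-affinity outliers towards the cohesive cluster.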
3.3. Learn to estimate probabilistic distribution P_R given x̄

We then learn to estimate the probabilistic distribution P_R^{(l)} based on the estimated sub-region center x̄^{(l)}. We aim to determine the probabilistic distribution P_R^{(l)}(s | x̄^{(l)}) = P(s - x̄^{(l)} | φ(x̄^{(l)})), where s ∈ S and Σ_{s∈S} P_R^{(l)}(s | x̄^{(l)}) = 1. For clarity, we drop the superscripts (l) from x̄^{(l)} and P_R^{(l)}. We model the probabilistic distribution P_R^{(l)} as

P(s - x̄ | φ(x̄)) = P(s - x̄) P(φ(x̄) | s - x̄) / Σ_{y∈S} P(y - x̄) P(φ(x̄) | y - x̄).   (7)
The denominator is a normalising factor. Thus, when estimating the posterior probability of each shape s in S, we focus on the two factors P(s - x̄) and P(φ(x̄) | s - x̄).

The factor P(s - x̄), referred to as the shape adjustment prior, is modelled as

P(s - x̄) ∝ exp(-(1/2) (s - x̄)^T Σ^{-1} (s - x̄)).   (8)

The covariance matrix Σ is learned from the pairs {x̄_i, x*_i}, i = 1, ..., N, on the training data, where x* denotes the ground truth shape¹. In practice, Σ is restricted to be diagonal and we decorrelate the shape residual by principal component analysis. This shape adjustment prior aims to approximately delineate the searching scope near x̄, and typically the distribution is more concentrated at later searching stages.
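A minimal sketch of evaluating the prior of Eq. 8 over the candidate set, assuming the diagonal covariance entries have already been learned (the PCA decorrelation step used in the paper is omitted here):

```python
import numpy as np

def shape_adjustment_prior(S, x_bar, sigma2):
    """Evaluate the shape adjustment prior of Eq. 8 over the candidate
    set S, with the covariance restricted to a diagonal whose entries
    are supplied in sigma2.

    S      : (N, 2n) candidate shapes
    x_bar  : (2n,) estimated sub-region center
    sigma2 : (2n,) per-dimension variances learned from training pairs
    """
    r = S - x_bar                            # shape residuals s - x_bar
    logp = -0.5 * (r ** 2 / sigma2).sum(axis=1)
    p = np.exp(logp - logp.max())            # numerically stable exponent
    return p / p.sum()                       # normalised over S
```

Candidates near the current center receive high prior mass; shrinking the variances between stages concentrates the search, matching the coarse-to-fine behaviour described above.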
The other factor, P(φ(x̄) | s - x̄), is referred to as the feature similarity likelihood. Following [5], we divide this factor into different facial parts:

P(φ(x̄) | s - x̄) = Π_j P(φ(x̄^{(j)}) | s^{(j)} - x̄^{(j)}),   (9)

where j represents the facial part index. The probabilistic independence comes from our conditioning on the given

¹ We assume E(x* - x̄) = 0.

exemplar candidate shapes s and
¯
x, and throughout our ap-
proach, all intermediate estimated poses are strictly shapes.
Again by applying Baye’s rule, we can rewrite Eq. 9 into
P (φ(
¯
x)|s
¯
x) =
Q
j
P (φ(
¯
x
(j)
))
Q
j
P (s
(j)
)
Y
j
P (s
(j)
¯
x
(j)
|φ(
¯
x
(j)
))
Y
j
P (s
(j)
¯
x
(j)
|φ(
¯
x
(j)
)),
(10)
which could be learned via discriminative mapping for each
facial part. This feature similarity likelihood aims to guide
shapes moving towards more plausible shape region, by
separately considering local appearance from each facial
part.
By combining the two factors, we form the probabilis-
tic estimate for the shape space and could sample candidate
shapes for next stage. Such probabilistic sampling enables
us to estimate current shape error and refine current esti-
mate via local appearance, while at the same time the shape
constraints are still strictly encoded.
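Combining the two factors into the posterior of Eq. 7 and sampling next-stage candidates can be sketched as follows; the prior and likelihood are passed in as precomputed per-candidate scores, whereas in the paper the likelihood comes from the learned per-part models of Eq. 10:

```python
import numpy as np

def sample_next_stage(S, prior, likelihood, n_samples, rng=None):
    """Form the posterior of Eq. 7 over the fixed candidate set S and
    draw initial shapes for the next searching stage.

    S          : (N, 2n) candidate shapes
    prior      : (N,) shape adjustment prior scores (Eq. 8)
    likelihood : (N,) feature similarity likelihood scores (Eq. 10)
    """
    rng = rng or np.random.default_rng()
    post = prior * likelihood                 # numerator of Eq. 7
    post = post / post.sum()                  # normalising denominator
    idx = rng.choice(len(S), size=n_samples, p=post)
    return S[idx], post
```

Because sampling is restricted to the fixed candidate set S, every drawn initialisation is a valid shape, which is how the shape constraints stay strictly encoded.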
3.4. Shape searching with hybrid features

In the conventional cascaded regression framework, one often selects a particular feature for regression, e.g. SIFT in [37]. The selection of features involves a tradeoff between alignment accuracy and speed. It can be observed from Fig. 2 that different features (e.g. HoG [14], SIFT [23], LBP [26], SURF [4], BRIEF [9]) exhibit different characteristics in accuracy and speed. It is clear that if one adheres to the SIFT feature throughout the whole regression procedure, the best performance of our method can be obtained. However, the run-time efficiency is much lower than that of the BRIEF feature.

Our coarse-to-fine shape searching framework is capable of exploiting different types of features at different stages, taking advantage of their specific characteristics, i.e. speed and accuracy. Based on the feature characteristics observed in Fig. 2, we can operate the coarse-to-fine framework in two different feature settings by switching features between searching stages:

CFSS - The SIFT feature is used in all stages to obtain the best accuracy of our approach.

CFSS-Practical - Since our framework only seeks a coarse shape sub-region in the early stages, relatively weaker features with much faster speed (e.g. BRIEF) are a better choice for the early stages, and SIFT is only used in the last stage for refinement. In our 3-stage implementation, we use the BRIEF feature in the first two stages, and SIFT in the last stage.

In the experiments we will demonstrate that CFSS-Practical performs competitively to CFSS, despite using the less accurate BRIEF feature for the first two stages. The
[Figure 2 plots: (a) regression curves, averaged output error vs. averaged initialisation error, for SURF, HoG, LBP, BRIEF and SIFT; (b) time cost per frame (ms) for each feature.]
Figure 2. We evaluate each feature's accuracy and speed using a validation set extracted from the training set. (a) We simulate different initial conditions with different initialisation errors to evaluate the averaged output error of cascaded regression. We ensure that the result has converged for each initialisation condition. (b) Comparison of the speed of various features measured under the same quantity of regression tasks.
CFSS enjoys such feature-switching flexibility thanks to the error tolerance of the searching framework. In particular, CFSS allows for less accurate shape sub-regions in the earlier searching stages, since subsequent stages can rapidly converge to the desired shape space location for target shape estimation.
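The two settings amount to a per-stage feature schedule; a minimal sketch, where the extractor names 'brief' and 'sift' are placeholders for feature implementations provided elsewhere (e.g. via an image-processing library):

```python
# Hypothetical per-stage feature schedule for the two settings described
# above, for the 3-stage implementation.
FEATURE_SCHEDULE = {
    'CFSS':           ['sift', 'sift', 'sift'],    # accuracy-first
    'CFSS-Practical': ['brief', 'brief', 'sift'],  # fast early stages
}

def feature_for_stage(setting, l):
    """Return the feature name used at searching stage l (1-indexed)."""
    return FEATURE_SCHEDULE[setting][l - 1]
```

Swapping a schedule entry is all it takes to trade accuracy for speed at a given stage, which is the flexibility the error tolerance affords.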
3.5. Time complexity analysis

The most time-consuming module is feature extraction, which directly influences the time complexity. We assume the complexity of feature extraction is O(F). The complexity of CFSS is thus O(F(L - 1 + Σ_{l=1}^{L} N_l K_l)). By applying the hybrid feature setting, the complexity reduces to O(F N_L K_L), since only the last searching stage utilises the more accurate feature, and the time spent on the fast feature contributes only a small fraction of the whole processing time. As shown in Sec. 4.2, the efficiency of the searching approach is of the same order of magnitude as the cascaded regression method, but with much more accurate prediction.
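Plugging the default parameters of Sec. 3.6 (L = 3, N_l = 15, K_l = 3) into these expressions, and treating the big-O arguments as exact call counts purely for illustration:

```python
def feature_calls(L, N, K, hybrid=False):
    """Illustrative count of expensive feature-extraction calls.

    Full setting:   L - 1 + sum_l N_l * K_l   (cf. O(F(L-1 + sum N_l K_l)))
    Hybrid setting: N_L * K_L                 (cf. O(F N_L K_L))
    N, K : per-stage lists of N_l and K_l values.
    """
    if hybrid:
        return N[-1] * K[-1]                   # expensive feature at last stage only
    return (L - 1) + sum(n * k for n, k in zip(N, K))

full_calls = feature_calls(3, [15, 15, 15], [3, 3, 3])                  # 2 + 135 = 137
hybrid_calls = feature_calls(3, [15, 15, 15], [3, 3, 3], hybrid=True)   # 45
```

Under these (illustrative) counts, the hybrid setting pays for roughly a third of the expensive feature extractions of the full setting, consistent with the claim that the fast feature contributes only a small fraction of the total cost.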
3.6. Implementation details

In practice, we use L = 3 searching stages in CFSS. Increasing the number of stages only leads to marginal improvement. The number of regressors, K_l, and the number of initial shapes, N_l, are set without optimisation. In general, we found that setting K_l = 3 and N_l = 15 works well for CFSS; only marginal improvement is obtained with larger values of K_l and N_l. For CFSS-Practical, we gain further run-time efficiency by reducing the regression iterations K_l and decreasing N_l without sacrificing too much accuracy. We choose K_l in the range of 1 to 2, and N_l in the range of 5 to 10. We observe that the alignment accuracy is not sensitive to these parameters. We set T = 10 in Eq. 6. β (in
References
[6] L. Breiman. Random Forests. Machine Learning, 2001.
[14] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005.
[23] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
[26] T. Ojala, M. Pietikäinen, and T. Mäenpää. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. TPAMI, 2002.