(Open Access) Stacked Deformable Part Model with Shape Regression for Object Part Localization (2014) | Junjie Yan

Stacked Deformable Part Model with Shape Regression for Object Part Localization

Junjie Yan, Zhen Lei, Yang Yang, and Stan Z. Li



Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, China

{jjyan,zlei,yang.yang,szli}@nlpr.ia.ac.cn

Abstract. This paper explores the localization of pre-defined semantic object parts, which is much more challenging than traditional object detection and very important for applications such as face recognition, HCI and fine-grained object recognition. To address this problem, we make two critical improvements over the widely used deformable part model (DPM). The first is that we use appearance-based shape regression to globally estimate the anchor location of each part and then locally refine each part according to the estimated anchor location under the constraint of DPM. The DPM with shape regression (SR-DPM) is more flexible than the traditional DPM because it relaxes the fixed anchor location of each part. It enjoys the same efficient dynamic programming inference as the traditional DPM and can be discriminatively trained via a coordinate descent procedure. The second is that we propose to stack multiple SR-DPMs, where each layer uses the output of the previous SR-DPM as its input to progressively refine the result. This provides an analogy to deep neural networks while benefiting from hand-crafted features and models. The proposed methods are applied to human pose estimation, face alignment and general object part localization tasks and achieve state-of-the-art performance.

1 Introduction

This paper focuses on localizing object parts from a monocular image. For the human and face categories, this problem is often called “human pose estimation” or “face alignment”. Accurate part localization serves as the basis of many high-level applications. For example, a recent work [9] shows that directly extracting features around reliable face parts (landmarks) achieves leading face recognition performance. As surveyed in [28], human part localization can help with action recognition and human-computer interaction. For general objects, reliable part localization contributes to fine-grained object recognition, as shown in [46,6]. However, this problem is very challenging due to variations at the subject level (e.g., a human can take many different poses and wear different clothes), the category level (e.g., adult and baby) and the image level (e.g., illumination and cluttered background).



Corresponding author.

D. Fleet et al. (Eds.): ECCV 2014, Part II, LNCS 8690, pp. 568–583, 2014.

© Springer International Publishing Switzerland 2014


Human pose estimation and face alignment have been extensively explored for decades and have achieved much progress. The critical issue is how to model the versatile spatial deformation and plausible appearance variation. The seminal work [21] exploits the pictorial structure (PS) from [23], which uses a Gaussian distribution to capture the deformation of each part and constrains the relative position of interrelated parts via a tree structure. PS has been improved by strong appearance representations (e.g., [17,25,29,30]), discriminative classifiers (e.g., [25,44]) and powerful structures (e.g., [41,39,42,36,38,40]), and it has become the leading method for localizing human parts on challenging benchmarks. DPM [20], one of the representative works in this category, uses structural SVM training and HOG features in a pictorial structure for object detection, and it was later extended by [44] to human pose estimation.

PS [21] and its widely used extension DPM [20,44], however, cannot capture global information and have limited flexibility, because the deformation is constrained by the fixed anchor location. To overcome this limitation of DPM, we propose a novel approach that incorporates shape regression into DPM, namely SR-DPM. Specifically, the shape regression estimates part locations using the appearance information globally. We set the regressed shape as the anchor locations in DPM and allow the parts to deform around them to satisfy local appearance consistency. Compared to the traditional DPM, SR-DPM has a higher degree of freedom and can model both global and local variations. Because shape regression and DPM can benefit from each other, we build an objective function to learn them jointly. It is a non-convex optimization problem, and we design a coordinate descent procedure to solve it.

In addition, we show that stacking SR-DPMs can further improve the performance. Complex shape variations are often beyond the representation capacity of a single DPM or SR-DPM. To fully exploit the data, we propose the stacked SR-DPM (S-SR-DPM), where each SR-DPM uses the output of the previous SR-DPMs as its input and progressively refines the result. Note that the SR-DPMs in different layers use different parameters. The S-SR-DPM provides a natural analogy to the deep convolutional neural network (DCNN) in increasing representation capacity [5]. Compared with the end-to-end learning in DCNN, the S-SR-DPM takes advantage of well-designed hand-crafted pipelines and can achieve good performance with much less training data.
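
The layer-by-layer refinement described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `run_stacked` and `make_toy_layer` are hypothetical names, and each toy layer simply moves the shape a fixed fraction toward a target stored in a toy "image" dictionary, standing in for a trained SR-DPM with its own parameters. The only point shown is the control flow, in which every layer consumes the previous layer's output.

```python
def run_stacked(layers, initial_shape, image):
    """Apply refinement layers in order; each layer sees the previous output."""
    shape = list(initial_shape)
    history = [shape]
    for layer in layers:
        shape = layer(shape, image)
        history.append(shape)
    return shape, history


def make_toy_layer(step):
    """Hypothetical stand-in for one trained SR-DPM layer (assumption).

    The layer moves every coordinate a fraction `step` of the way toward
    the target shape; `step` plays the role of layer-specific parameters.
    """
    def layer(shape, image):
        target = image["target"]  # toy 'image' that just stores the true shape
        return [s + step * (t - s) for s, t in zip(shape, target)]
    return layer


image = {"target": [4.0, 8.0]}
# Different layers use different parameters, as in the stacked model.
layers = [make_toy_layer(0.25), make_toy_layer(0.5), make_toy_layer(1.0)]
final_shape, history = run_stacked(layers, [0.0, 0.0], image)
print(history)
```

In a real system each layer would be a separately trained SR-DPM, and the intermediate shapes in `history` would be the progressively refined part locations.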

Previous works usually consider part localization only for a specific category (e.g., human or face). In this paper we show wide applications of our method on humans, faces and general objects. For human pose estimation, we conduct experiments on the challenging LSP dataset [25]. For face alignment, we use LFPW [4] as the testbed. In terms of general objects, we use the annotations [3] of animals from Pascal VOC [19]. We compare our method with different state-of-the-art methods on these three tasks and achieve the leading performance.

The rest of the paper is organized as follows. Section 2 reviews the related work. The proposed SR-DPM and its stacked form are described in Section 3 and Section 4. We show experiments in Section 5 and finally conclude the paper in Section 6.


2 Related Work

Many works on human pose estimation are based on the pictorial structure, in either a generative or a discriminative manner. The pictorial structure [21] uses a Gaussian model to capture the deformation of each part and links parts by a tree structure. Inference in the pictorial structure is very efficient thanks to dynamic programming and the distance transform [21]. The pictorial structure was later exploited in the deformable part model (DPM) [20] with HOG features and latent-SVM learning, and it achieved great success in Pascal VOC object detection. [44] extends DPM to articulated human pose estimation by adding part subtypes and using part annotations in learning. [3] demonstrates the advantage of fully supervised learning of parts over the latent learning in [20] for general objects. [40] shows that automatically learning the tree is better than using hand-crafted physical connections. [29] uses Poselets [7] to capture mid-level cues that latently capture high-order dependencies for the pictorial structure. Many recent works improve PS with more part levels, more global models and more part models [39,42,36,38,26,33,15,31]. A very recent work [30] combines different appearance cues under the pictorial structure framework and achieves the current leading performance.

Although it is similar to the human pose estimation problem, the face alignment field often uses very different methods, mainly because the human face has stronger spatial constraints than the human body. The most popular models include the active shape model (ASM [11]), the active appearance model (AAM [10]) and their extensions. Different from the Gaussian deformation of each local part in PS, ASM/AAM captures the shape deformation globally with a PCA constraint. The global PCA constraint, however, has been shown to be very sensitive and was later extended to the constrained local model (CLM [12,34,4,2]) by imposing a shape constraint on the appearance of local parts. [47] exploits the DPM developed in [44] for joint face detection and alignment. [45] further improves that work with optimized mixtures and a two-step cascaded deformable shape model. Very recently, face alignment has been treated as a regression problem [8,14,43,37], which directly learns the mapping from appearance to shape and achieves the leading performance on face alignment benchmarks and challenges (e.g., 300-W [32]). These methods, however, are sensitive to initialization, which makes them unsuitable for the more difficult human and object part localization.

We stack multiple SR-DPMs, which is related to a very recent work [35]. In [35], multiple Fisher vector coding layers are stacked to reach performance similar to that of a deep neural network on the image classification task. In [16], boosting is used to estimate the shape with pose-indexed features, where the features are re-computed at the latest estimate of the landmark locations. In [43], linear regressions are stacked for face alignment.

Compared with previous works, the main contributions of this work are summarized as follows:

– We propose SR-DPM to incorporate DPM with shape regression and show how to learn them jointly. The SR-DPM is much more flexible than DPM in handling real-world object deformation.

– We stack multiple SR-DPMs to increase the representation capacity, where each layer progressively refines the part locations. As shown empirically in the experiments, the stacked SR-DPM is critical for better performance.

– To the best of our knowledge, this is the first work to simultaneously achieve state-of-the-art performance on human pose estimation, face alignment and general object part localization.

3 Deformable Part Model with Shape Regression

The DPM is composed of the root filter $\beta_0$ and some parts. Each part has an appearance filter $\beta_i$ and a deformation term $d_i$. Given an object part configuration specified by $S = [x_1, y_1, \cdots, x_N, y_N]^T$ and object location $(x_0, y_0)$, the DPM favors some special part configurations by:

$$s(S, I) = \beta_0^T \phi_a(x_0, y_0, I) + \sum_{i=1}^{N} \left( \beta_i^T \phi_a(x_i, y_i, I) - d_i^T \phi_d(x_i, y_i) \right), \tag{1}$$

where $\phi_a(x_i, y_i, I)$ is the HOG feature of the $i$-th part, and $\phi_d(x_i, y_i)$ is the separable quadratic function that represents the deformation. $\phi_d(x_i, y_i)$ is defined based on the relative location between $(x_i, y_i)$ and its anchor location $(a_i^x, a_i^y)$, which is fixed once $(x_0, y_0)$ is specified. It is straightforward to add mixture parts [44] or mixture components [20], but we leave them out to simplify the notation.

For each sliding window in localization, only the root location $(x_0, y_0)$ is known in advance, and each part location is inferred by maximizing the part appearance score minus the deformation cost associated with the displacement from the anchor location. Since parts are directly attached to the root, their locations are inferred independently given the fixed root by:

$$\max_{(x_i, y_i)} \left( \beta_i^T \phi_a(x_i, y_i, I) - d_i^T \phi_d(x_i, y_i) \right), \tag{2}$$

where $(x_i, y_i)$ traverses all possible locations of the part. This maximization can be solved efficiently by the distance transform, as used in [21,44].
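
As a rough illustration of the per-part maximization in Eq. 2, the following Python sketch searches every location exhaustively. This brute-force search is a swapped-in simplification, not the distance transform used in the paper; it returns the same maximizer but is slower. The function name `best_part_location`, the precomputed `response` grid (standing in for the appearance score of the part filter at each location), and the deformation feature layout `[dx, dy, dx^2, dy^2]` are assumptions made for the example.

```python
def best_part_location(response, anchor, d):
    """Return (score, (x, y)) maximizing appearance score minus deformation cost.

    response: 2-D list; response[y][x] stands in for the part's appearance
              score at location (x, y) (assumed precomputed).
    anchor:   (ax, ay), the part's anchor location.
    d:        four weights for the assumed separable quadratic features
              [dx, dy, dx^2, dy^2].
    """
    ax, ay = anchor
    best_score, best_xy = float("-inf"), None
    for y, row in enumerate(response):
        for x, r in enumerate(row):
            dx, dy = x - ax, y - ay
            cost = d[0] * dx + d[1] * dy + d[2] * dx * dx + d[3] * dy * dy
            score = r - cost
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_score, best_xy


response = [
    [0.0, 0.2, 0.1, 0.0],
    [0.1, 0.5, 0.3, 2.0],
    [0.0, 0.4, 0.2, 0.1],
]
# Weak deformation penalty: the strong response two pixels away wins.
weak_score, weak_loc = best_part_location(response, (1, 1), (0.0, 0.0, 0.3, 0.3))
# Strong deformation penalty: the part stays at its anchor.
strong_score, strong_loc = best_part_location(response, (1, 1), (0.0, 0.0, 1.0, 1.0))
print(weak_loc, strong_loc)
```

The two calls show the trade-off the deformation term controls: with small quadratic weights the part may move to a distant strong response, while large weights keep it close to the anchor.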

Our improvement comes from the anchor location of each part. In DPM, the anchor location of each part is defined relative to the position of either the root [20] or its parent part [44]. This limits the flexibility, since each part can only have a small deformation around its fixed anchor location. Additionally, the star structure cannot capture global information, such as the high-order spatial dependencies among the left arm, right arm, left leg and right leg.

In this paper, we propose to use regression to estimate the anchor locations directly from the image appearance, in order to capture global information and increase flexibility. After that, we allow each part to deform around these adaptive anchor locations under the constraint of DPM. Let us use $\hat{A} = [a_1^x, a_1^y, \cdots, a_N^x, a_N^y]^T$ to specify the estimated anchor part locations. Suppose the initial shape is $A_0$ and the ground-truth shape is $A^*$. We always want each $(x_i, y_i)$ to be related to all the parts initialized by $S_0$ (which is the mean shape) to capture the global information. This function can be very complex, and in this paper we use a simple linear function to approximate it:

$$\hat{A} = f(A_0, I) = A_0 + W^T \Phi(A_0, I), \tag{3}$$

where $\Phi(A_0, I)$ is the local appearance feature extracted around all parts. In this paper, we define it as the HOG feature [13] from the implementation in [20]. We concatenate the feature vectors of all parts specified by $A_0$ into a long vector, which has $N n_h$ values, where $n_h$ is the length of the HOG vector for a part. The dimension of the corresponding regression matrix $W$ is $N n_h \times 2N$. In Eq. 3, each new part location is estimated based on all the initial part locations, so Eq. 3 encodes global information that previously could not be captured by pictorial structure based models. No parametric shape prior, such as the global shape PCA in ASM or the local part Gaussian deformation in the pictorial structure, is assumed in Eq. 3. This is an advantage especially for real-world objects, whose spatial deformation can be very complex and cannot be described well by a simple parametric prior.
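
The linear update of Eq. 3 can be sketched in a few lines of Python. This is an illustrative toy, not the authors' code: `regress_shape` and `toy_features` are hypothetical names, the feature extractor returns one made-up value per part (so the per-part feature length is 1) instead of real HOG descriptors, and the matrix `W` is hand-picked rather than learned. It only shows the shapes involved and that every output coordinate can depend on the features of all parts.

```python
def regress_shape(initial_shape, W, extract_features):
    """Compute A_hat = A_0 + W^T Phi(A_0, I) with plain Python lists.

    initial_shape:    [x_1, y_1, ..., x_N, y_N], length 2N.
    W:                regression matrix with one row per feature value
                      and 2N columns.
    extract_features: callable returning the concatenated feature vector
                      for a shape (the image is assumed captured inside it).
    """
    phi = extract_features(initial_shape)
    n_out = len(initial_shape)
    if len(W) != len(phi) or any(len(row) != n_out for row in W):
        raise ValueError("W must have len(phi) rows and 2N columns")
    # (W^T phi)_j = sum_k W[k][j] * phi[k]
    update = [sum(W[k][j] * phi[k] for k in range(len(phi))) for j in range(n_out)]
    return [a + u for a, u in zip(initial_shape, update)]


def toy_features(shape):
    """Hypothetical feature extractor: one fixed value per part (assumption)."""
    return [0.5, -1.0]


A0 = [1.0, 1.0, 3.0, 1.0]          # two parts: (1, 1) and (3, 1)
W = [
    [2.0, 0.0, 0.0, 1.0],          # weights on the feature of part 1
    [0.0, 1.0, -1.0, 0.0],         # weights on the feature of part 2
]
A_hat = regress_shape(A0, W, toy_features)
print(A_hat)
```

Note that the second part's y coordinate is updated using the first part's feature, which is the kind of cross-part (global) dependency a star-structured model with fixed anchors does not express.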

The above shape regression, however, is not enough for object part localization. The reason is that it cannot measure the confidence of the estimated part locations, which is very important for sliding-window-based scanning. Additionally, the global shape regression matrix does not explicitly consider the appearance consistency of the regressed part locations. To this end, we further use the deformable part model to incorporate shape regression, by replacing the fixed anchor location with the shape regression output:

$$s(S, I) = \beta_0^T \phi_a(x_0, y_0, I) + \sum_{i=1}^{N} \left( \beta_i^T \phi_a(x_i, y_i, I) - d_i^T \phi_d(x_i, y_i, a_i^x, a_i^y) \right) \tag{4}$$

where $\hat{A} = [a_1^x, a_1^y, \cdots, a_N^x, a_N^y]^T = A_0 + W^T \Phi(A_0, I)$.

For each sliding window in localization, we find the $S$ that maximizes the confidence score defined above and take it as the estimated shape configuration of the sliding window. The deformable part model with shape regression (SR-DPM) provides the flexibility to capture large variations, but it also brings challenges, since the regression matrix $W$ and the deformable part model parameter $\beta$ are both unknown. In the following, we present the objective function for joint learning and describe the optimization method.
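
To make the SR-DPM confidence score concrete, here is a small Python sketch that evaluates it for one candidate configuration. It is written under stated assumptions rather than taken from the paper: the names `deformation_features` and `sr_dpm_score` are hypothetical, appearance scores are assumed to be precomputed per-part grids, the deformation features use the layout `[dx, dy, dx^2, dy^2]`, and the anchors are given as integers for readability although regressed anchors would generally be real-valued.

```python
def deformation_features(x, y, ax, ay):
    """Assumed separable quadratic features of the offset from the anchor."""
    dx, dy = x - ax, y - ay
    return (dx, dy, dx * dx, dy * dy)


def sr_dpm_score(root_score, part_responses, shape, anchors, deform_weights):
    """Root score plus, per part, appearance score minus deformation cost.

    part_responses[i][y][x]: precomputed appearance score of part i.
    shape, anchors:          flat lists [x_1, y_1, ..., x_N, y_N]; the
                             anchors are the shape regression output.
    deform_weights[i]:       four deformation weights of part i.
    """
    total = root_score
    for i, response in enumerate(part_responses):
        x, y = shape[2 * i], shape[2 * i + 1]
        ax, ay = anchors[2 * i], anchors[2 * i + 1]
        phi_d = deformation_features(x, y, ax, ay)
        cost = sum(w * f for w, f in zip(deform_weights[i], phi_d))
        total += response[y][x] - cost
    return total


part_responses = [
    [[0.0, 0.5, 0.0], [0.0, 0.0, 0.0]],   # part 1 response grid
    [[0.0, 0.0, 0.0], [0.0, 0.2, 1.0]],   # part 2 response grid
]
shape = [1, 0, 2, 1]      # part 1 at (1, 0), part 2 at (2, 1)
anchors = [1, 0, 1, 1]    # regressed anchors (toy values)
weights = [(0.0, 0.0, 0.25, 0.25), (0.0, 0.0, 0.25, 0.25)]
score = sr_dpm_score(1.0, part_responses, shape, anchors, weights)
print(score)
```

Because the anchors are fixed once the regression has been evaluated, each part's term can still be maximized independently, which is why the efficient inference of the standard DPM carries over.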

3.1 Model Learning

The objective function for model learning is motivated by the original DPM used in object detection, which is defined as:

$$\arg\min_{\beta, S} \; \|\beta\|^2 + C \sum_{m=1}^{M} \max\left(0, 1 - y_m \cdot s(S_m)\right), \tag{5}$$

where the first term is used for regularization and the second term is the hinge loss that penalizes detection errors. $M$ is the number of training samples, and S
