Stacked Deformable Part Model with Shape Regression for Object Part Localization

Abstract
This paper explores the localization of pre-defined semantic object parts, which is much more challenging than traditional object detection and is important for applications such as face recognition, HCI and fine-grained object recognition. To address this problem, we make two critical improvements over the widely used deformable part model (DPM). First, we use appearance-based shape regression to globally estimate the anchor location of each part, and then locally refine each part according to the estimated anchor location under the constraint of DPM. The DPM with shape regression (SR-DPM) is more flexible than the traditional DPM because it relaxes the fixed anchor location of each part. It retains the efficient dynamic programming inference of the traditional DPM and can be discriminatively trained via a coordinate descent procedure. Second, we propose to stack multiple SR-DPMs, where each layer uses the output of the previous SR-DPM as input to progressively refine the result. This provides an analogy to deep neural networks while benefiting from hand-crafted features and models. The proposed methods are applied to human pose estimation, face alignment and general object part localization tasks and achieve state-of-the-art performance.

Junjie Yan, Zhen Lei, Yang Yang, and Stan Z. Li
Center for Biometrics and Security Research & National Laboratory
of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, China
{jjyan,zlei,yang.yang,szli}@nlpr.ia.ac.cn
1 Introduction
This paper focuses on localizing object parts from a monocular image. For the
human and face categories, this problem is often called “human pose estimation”
or “face alignment”. Accurate part localization serves as the basis of many
high level applications. For example, a recent work [9] shows that directly
extracting features around reliable face parts (landmarks) achieves leading face
recognition performance. As surveyed in [28], human part localization can help
with action recognition and human computer interaction. For general objects,
reliable part localization contributes to fine-grained object recognition, as
shown in [46,6]. However, this problem is very challenging due to variations at
the subject level (e.g., a human can take many different poses and wear
different clothes), the category level (e.g., adult and baby) and the image
level (e.g., illumination and cluttered background).
Corresponding author.
D. Fleet et al. (Eds.): ECCV 2014, Part II, LNCS 8690, pp. 568–583, 2014.
© Springer International Publishing Switzerland 2014

Human pose estimation and face alignment have been extensively explored for
decades, and much progress has been achieved. The critical issue is how to model
the versatile spatial deformation and plausible appearance variation. The
seminal work [21] exploits the pictorial structure (PS) from [23], which uses a
Gaussian distribution to capture the deformation of each part and constrains the
relative position of interrelated parts via a tree structure. PS has been
improved by stronger appearance representations (e.g., [17,25,29,30]),
discriminative classifiers (e.g., [25,44]) and more powerful structures (e.g.,
[41,39,42,36,38,40]), and it has become the leading method for localizing human
parts on challenging benchmarks. DPM [20], one of the representative works in
this category, uses structural SVM training and HOG features in a pictorial
structure for object detection, and it was later extended by [44] to human pose
estimation.
PS [21] and its widely used extension DPM [20,44], however, cannot capture
global information and have limited flexibility, because the deformation is
constrained by a fixed anchor location. To overcome this limitation of DPM, we
propose a novel approach that incorporates shape regression into DPM, namely
SR-DPM. Specifically, the shape regression estimates part locations using the
appearance information globally. We set the regressed shape as the anchor
locations in DPM and allow the parts to deform around them to satisfy local
appearance consistency. Compared to the traditional DPM, SR-DPM has more degrees
of freedom with which to model both global and local variations. Because shape
regression and DPM can benefit from each other, we build an objective function
to learn them jointly. This is a non-convex optimization problem, and we design
a coordinate descent procedure to solve it.
In addition, we show that stacking SR-DPMs can further improve the performance.
Complex shape variations are often beyond the representation capacity of a
single DPM or SR-DPM. To fully explore the data, we propose the stacked SR-DPM
(S-SR-DPM), where each SR-DPM uses the output of the previous SR-DPMs as its
input and progressively refines the result. Note that the SR-DPMs in different
layers use different parameters. The S-SR-DPM provides a natural analogy to the
deep convolutional neural network (DCNN) in increasing representation capacity
[5]. Compared with the end-to-end learning in DCNN, the S-SR-DPM takes advantage
of well designed hand-crafted pipelines and can achieve good performance with
much less training data.
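The stacking idea can be sketched in a few lines of Python. This is only an illustration of the data flow, in which each layer takes the previous layer's output as its input; it is not the trained model. The names `make_toy_layer` and `run_stack` are hypothetical, and the toy layer pulls the shape toward a known target, which a real SR-DPM layer would not have access to (it would estimate the update from image features).

```python
from typing import Callable, List, Sequence

Shape = List[float]
Layer = Callable[[Shape], Shape]


def make_toy_layer(target: Sequence[float], step: float) -> Layer:
    """Hypothetical stand-in for one trained SR-DPM layer.

    It moves the shape a fraction `step` of the way toward `target`.
    A real layer would estimate the update from image features instead.
    """
    def layer(shape: Shape) -> Shape:
        return [s + step * (t - s) for s, t in zip(shape, target)]
    return layer


def run_stack(initial_shape: Sequence[float], layers: Sequence[Layer]) -> Shape:
    """Feed each layer the output of the previous one."""
    shape = list(initial_shape)
    for layer in layers:
        shape = layer(shape)
    return shape


# Two layers with different parameters, applied to a one-part shape [x, y].
layers = [make_toy_layer([10.0, 20.0], 0.5), make_toy_layer([10.0, 20.0], 0.25)]
refined = run_stack([0.0, 0.0], layers)
```

Each layer holds its own parameters (here only the step size), mirroring the remark that SR-DPMs in different layers use different parameters.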
Previous works usually consider part localization for only one specific category
(e.g., human or face). In this paper we show wide applications of our method
on humans, faces and general objects. For human pose estimation, we conduct
experiments on the challenging LSP [25]. For face alignment, we use LFPW [4]
as the testbed. In terms of general objects, we use the annotations [3] of
animals from Pascal VOC [19]. We compare our method with different
state-of-the-art methods on these three tasks and achieve the leading
performance.
The rest of the paper is organized as follows. Section 2 reviews the related
work. The proposed SR-DPM and its stacked form are described in section 3
and section 4. We show experiments in section 5 and finally conclude the paper
in section 6.

2 Related Work
Many works on human pose estimation are based on the pictorial structure in
either a generative or a discriminative manner. The pictorial structure [21]
uses a Gaussian model to capture the deformation of each part and links parts by
a tree structure. Inference in the pictorial structure is very efficient due to
dynamic programming and the distance transform [21]. The pictorial structure was
later exploited in the deformable part model (DPM) [20] with HOG features and
latent-SVM learning, and it achieved great success in Pascal VOC object
detection. [44] extends DPM to articulated human pose estimation by adding part
subtypes and using part annotations in learning. [3] demonstrates the advantage
of fully supervised learning of parts over the latent learning in [20] for
general objects. [40] shows that automatically learning the tree is better than
using hand-crafted physical connections. [29] uses Poselets [7] to capture
mid-level cues that latently capture high-order dependencies for the pictorial
structure. Many recent works improve PS with more part levels, more global
models and more part models [39,42,36,38,26,33,15,31]. A very recent work [30]
combines different appearance cues under the pictorial structure framework and
achieves the current leading performance.
Although face alignment is similar to the human pose estimation problem, the
field often uses very different methods, mainly because the spatial constraint
of the human face is stronger than that of the human body. The most popular
models include the active shape model (ASM [11]), the active appearance model
(AAM [10]) and their extensions. Different from the Gaussian deformation of each
local part in PS, ASM/AAM captures the shape deformation globally with a PCA
constraint. The global PCA constraint, however, has been shown to be very
sensitive and was later extended to the constrained local model (CLM
[12,34,4,2]) by a shape constraint on the appearance of local parts. [47]
exploits the DPM developed in [44] for joint face detection and alignment. [45]
further improves the work with optimized mixtures and a two-step cascaded
deformable shape model. Very recently, face alignment has been treated as a
regression problem [8,14,43,37], which directly learns the mapping from
appearance to shape and achieves the leading performance on face alignment
benchmarks and challenges (e.g., 300-W [32]). These methods, however, are
sensitive to initialization, which makes them unsuitable for the more difficult
human and object part localization.
We stack multiple SR-DPMs, which is related to a very recent work [35]. In
[35], multiple Fisher vector coding layers are stacked to obtain performance
similar to that of a deep neural network on the image classification task. In
[16], boosting is used to estimate the shape with pose-indexed features, where
the features are re-computed at the latest estimate of the landmark locations.
In [43], linear regressors are stacked for face alignment.
Compared with previous works, the main contributions of this work are summarized
as follows:
- We propose SR-DPM to incorporate DPM with shape regression and show
  how to jointly learn them. The SR-DPM is much more flexible than DPM
  in handling real world object deformation.
- We stack multiple SR-DPMs to increase the representation capacity, where
  each layer progressively refines the part locations. As shown empirically in
  experiments, the stacked SR-DPM is critical for better performance.
- To the best of our knowledge, it is the first work to simultaneously achieve
  state-of-the-art performance on human pose estimation, face alignment and
  general object part localization.
3 Deformable Part Model with Shape Regression
The DPM is composed of the root filter $\beta_0$ and some parts. Each part has
an appearance filter $\beta_i$ and a deformation term $d_i$. Given an object
part configuration specified by $S = [x_1, y_1, \cdots, x_N, y_N]^T$ and object
location $(x_0, y_0)$, the DPM favors some special part configurations by:

$$s(S, I) = \beta_0^T \phi_a(x_0, y_0, I) + \sum_{i=1}^{N} \left( \beta_i^T \phi_a(x_i, y_i, I) - d_i^T \phi_d(x_i, y_i, a_i^x, a_i^y) \right), \tag{1}$$

where $\phi_a(x_i, y_i, I)$ is the HOG feature of the i-th part, and
$\phi_d(x_i, y_i, a_i^x, a_i^y)$ is the separable quadratic function that
represents the deformation. $\phi_d(x_i, y_i, a_i^x, a_i^y)$ is defined based on
the relative location between $(x_i, y_i)$ and its anchor location
$(a_i^x, a_i^y)$, which is fixed after the specification of $(x_0, y_0)$. It is
straightforward to add mixture parts [44] or mixture components [20], but we
leave them out to simplify the notation.
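As a rough illustration of Eq. 1, the following Python sketch scores one given part configuration. It assumes that the appearance responses $\beta_i^T \phi_a$ have already been computed as scalars, and it takes the deformation feature to be $[dx, dy, dx^2, dy^2]$, a common choice for a separable quadratic that is not spelled out in the text. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def deformation_feature(loc: Point, anchor: Point) -> List[float]:
    """Separable quadratic feature of the displacement from the anchor.

    The form [dx, dy, dx^2, dy^2] is an assumption made for this sketch.
    """
    dx = loc[0] - anchor[0]
    dy = loc[1] - anchor[1]
    return [dx, dy, dx * dx, dy * dy]


def dpm_score(root_response: float,
              part_responses: Sequence[float],
              part_locs: Sequence[Point],
              anchors: Sequence[Point],
              deform_weights: Sequence[Sequence[float]]) -> float:
    """Eq. 1: root appearance plus, per part, appearance minus deformation cost."""
    score = root_response
    for resp, loc, anchor, d in zip(part_responses, part_locs, anchors, deform_weights):
        cost = sum(w * f for w, f in zip(d, deformation_feature(loc, anchor)))
        score += resp - cost
    return score


# Two parts: the first is displaced by (1, 0), the second by (0, 2).
example_score = dpm_score(
    root_response=1.0,
    part_responses=[2.0, 3.0],
    part_locs=[(1.0, 0.0), (0.0, 2.0)],
    anchors=[(0.0, 0.0), (0.0, 0.0)],
    deform_weights=[[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 0.5, 0.5]],
)
```

In this toy call the first part pays a deformation cost of 1 and the second a cost of 2, so the displaced parts lower the total score relative to the raw responses.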
For each sliding window in localization, only the root location $(x_0, y_0)$ is
known in advance, and each part location is inferred by maximizing the part
appearance score minus the deformation cost associated with its displacement
from the anchor location. Since parts are directly attached to the root, their
locations are inferred independently given the fixed root by:

$$\max_{x_i, y_i} \left( \beta_i^T \phi_a(x_i, y_i, I) - d_i^T \phi_d(x_i, y_i, a_i^x, a_i^y) \right), \tag{2}$$

where $(x_i, y_i)$ traverses all possible locations of the part. The procedure
can be efficiently solved by the distance transform as used in [21,44].
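The maximization in Eq. 2 can be written as a brute-force search, shown below as a minimal Python sketch. It is not the distance transform itself, which reaches the same maximum far more efficiently; the exhaustive loop is only meant to make the objective concrete. The response map layout (`response[y][x]`), the deformation feature $[dx, dy, dx^2, dy^2]$ and the function name are assumptions.

```python
from typing import Sequence, Tuple


def best_part_location(response: Sequence[Sequence[float]],
                       anchor: Tuple[int, int],
                       d: Sequence[float]) -> Tuple[float, Tuple[int, int]]:
    """Exhaustively maximize appearance response minus deformation cost (Eq. 2).

    response[y][x] holds the filter response beta_i^T phi_a at (x, y);
    d = (d1, d2, d3, d4) weights the assumed feature [dx, dy, dx^2, dy^2].
    """
    ax, ay = anchor
    best_score = float("-inf")
    best_loc = (ax, ay)
    for y, row in enumerate(response):
        for x, r in enumerate(row):
            dx, dy = x - ax, y - ay
            cost = d[0] * dx + d[1] * dy + d[2] * dx * dx + d[3] * dy * dy
            if r - cost > best_score:
                best_score, best_loc = r - cost, (x, y)
    return best_score, best_loc


# A strong response one pixel to the right of the anchor at (1, 1).
response_map = [[0.0, 0.0, 0.0],
                [0.0, 1.0, 5.0],
                [0.0, 0.0, 0.0]]
loose = best_part_location(response_map, (1, 1), (0.0, 0.0, 1.0, 1.0))
stiff = best_part_location(response_map, (1, 1), (0.0, 0.0, 10.0, 10.0))
```

With a small quadratic penalty the part moves to the stronger response; with a large penalty it stays at the anchor, which is the limited flexibility around a fixed anchor that the following paragraphs discuss.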
Our improvement comes from the anchor location of each part. In DPM, the
anchor location of each part is defined relative to the position of either
the root [20] or its parent part [44]. This limits the flexibility, since each
part can only have a small deformation around its fixed anchor location.
Additionally, the star structure cannot capture global information, such as the
high-order spatial dependencies among the left arm, right arm, left leg and
right leg.
In this paper, we propose to use regression to estimate the anchor locations
directly from the image appearance, in order to capture global information and
increase the flexibility. After that, we allow each part to deform around these
adaptive anchor locations under the constraint of DPM. Let us use
$\hat{A} = [\hat{a}_1^x, \hat{a}_1^y, \cdots, \hat{a}_N^x, \hat{a}_N^y]^T$ to
specify the estimated anchor part locations. Suppose the initial shape is $A_0$
and the ground-truth shape is $A^{*}$. We want each $(x_i, y_i)$ to depend on
all the parts initialized by $S_0$ (which is the mean shape), in order to
capture the global information. This function can be very complex, and in this
paper we use a simple linear function to approximate it:

$$\hat{A} = f(A_0, I) = A_0 + W^T \Phi(A_0, I), \tag{3}$$
where $\Phi(A_0, I)$ is the local appearance feature extracted around all parts.
In this paper, we define it as the HOG feature [13] from the implementation in
[20]. We concatenate the feature vectors of all parts specified by $A_0$ into
one long vector with $N n_d$ values, where $n_d$ is the length of the HOG vector
for a part. The dimension of the corresponding regression matrix $W$ is
$N n_d \times 2N$. In Eq. 3, each new part location is estimated based on all
the initial part locations, thus Eq. 3 encodes global information that
previously could not be captured in pictorial structure based models. No
parametric shape prior, such as the global shape PCA in ASM or the local part
Gaussian deformation in pictorial structure, is assumed in Eq. 3. This is an
advantage especially for real world objects, whose spatial deformation can be
very complex and is not described well by a simple parametric prior.
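The linear update of Eq. 3 and the stated dimensions can be checked with a short Python sketch. The feature vector is passed in precomputed (HOG extraction is outside the scope of the example), and the function name and the shape layout $[x_1, y_1, \cdots, x_N, y_N]$ as a flat list are assumptions for illustration.

```python
from typing import List, Sequence


def regress_shape(a0: Sequence[float],
                  w: Sequence[Sequence[float]],
                  phi: Sequence[float]) -> List[float]:
    """Eq. 3: A_hat = A0 + W^T phi.

    a0  : initial shape, length 2N, laid out as [x1, y1, ..., xN, yN]
    w   : regression matrix with N*n_d rows and 2N columns
    phi : concatenated per-part features, length N*n_d
    """
    if len(w) != len(phi):
        raise ValueError("W needs one row per feature value")
    if any(len(row) != len(a0) for row in w):
        raise ValueError("W needs one column per shape coordinate")
    # (W^T phi)[j] = sum_k W[k][j] * phi[k]
    update = [sum(w[k][j] * phi[k] for k in range(len(phi)))
              for j in range(len(a0))]
    return [a + u for a, u in zip(a0, update)]


# N = 1 part with n_d = 2 feature values, so W is 2 x 2.
a_hat = regress_shape([10.0, 20.0], [[1.0, 0.0], [0.0, 2.0]], [3.0, 4.0])
```

Because every column of $W$ multiplies the whole concatenated feature vector, each output coordinate can depend on the appearance around all parts, which is the global coupling described above.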
The above shape regression, however, is not enough for object part localization.
The reason is that it cannot measure the confidence of the estimated part
locations, which is very important for sliding window based scanning.
Additionally, the global shape regression matrix does not explicitly consider
the appearance consistency of the regressed part locations. To this end, we
further use the deformable part model to incorporate shape regression, by
replacing the fixed anchor location with the shape regression output $\hat{A}$:

$$s(S, I) = \beta_0^T \phi_a(x_0, y_0, I) + \sum_{i=1}^{N} \left( \beta_i^T \phi_a(x_i, y_i, I) - d_i^T \phi_d(x_i, y_i, \hat{a}_i^x, \hat{a}_i^y) \right), \tag{4}$$

where $\hat{A} = [\hat{a}_1^x, \hat{a}_1^y, \cdots, \hat{a}_N^x, \hat{a}_N^y]^T = A_0 + W^T \Phi(A_0, I)$.
For each sliding window in localization, we find the $S$ that maximizes the
confidence score defined above, and take it as the estimated shape configuration
of the sliding window. The deformable part model with shape regression (SR-DPM)
provides the flexibility to capture large variations, but it also brings
challenges, since the regression matrix $W$ and the deformable part model
parameter $\beta$ are both unknown. In the following part, we present the
objective function for joint learning and show the optimization method.
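One way to read Eq. 4 at inference time, assuming that $W$, $A_0$ and $\beta$ are already fixed (this decomposition is a reading of the equations above and is not stated explicitly in the text): since $\hat{A}$ depends only on $A_0$ and $I$ and not on $S$, the maximization over $S$ for a given root location splits into independent per-part problems of the same form as Eq. 2, with the regressed anchors in place of the fixed ones:

```latex
\max_{S} s(S, I)
  = \beta_0^T \phi_a(x_0, y_0, I)
  + \sum_{i=1}^{N} \max_{x_i, y_i}
    \left( \beta_i^T \phi_a(x_i, y_i, I)
         - d_i^T \phi_d(x_i, y_i, \hat{a}_i^x, \hat{a}_i^y) \right)
```

Under this reading, the regression is evaluated once per window and each inner maximization can still be handled by the distance transform, which is consistent with the claim that SR-DPM keeps the efficient inference of the traditional DPM.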
3.1 Model Learning
The objective function for model learning is motivated by the original DPM used
in object detection, which is defined as:

$$\arg\min_{\beta, S_m} \frac{1}{2} \|\beta\|^2 + C \sum_{m=1}^{M} \max(0, 1 - y_m \cdot s(S_m, I_m)), \tag{5}$$

where the first term is used for regularization and the second term is the hinge
loss to punish error in detection. $M$ is the number of training samples, and
$S_m$
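To make the two terms of Eq. 5 concrete, the following Python sketch evaluates the objective for fixed part configurations, that is, with the scores $s(S_m, I_m)$ already computed. It does not perform the minimization, and the function name and argument layout are assumptions for illustration.

```python
from typing import Sequence


def dpm_objective(beta: Sequence[float],
                  scores: Sequence[float],
                  labels: Sequence[int],
                  c: float) -> float:
    """Eq. 5 for fixed configurations: 0.5 * ||beta||^2 + C * sum of hinge losses.

    scores[m] is s(S_m, I_m) and labels[m] is y_m in {+1, -1}.
    """
    reg = 0.5 * sum(b * b for b in beta)
    hinge = sum(max(0.0, 1.0 - y * s) for y, s in zip(labels, scores))
    return reg + c * hinge


# One confident positive, one weak positive and one negative inside the margin.
value = dpm_objective(beta=[1.0, 2.0],
                      scores=[2.0, 0.5, -0.5],
                      labels=[1, 1, -1],
                      c=2.0)
```

Only samples that fall inside the margin contribute to the second term; the confident positive in the toy call adds nothing to the loss.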
