
Articulated Human Detection with Flexible Mixtures of Parts

Yi Yang, Deva Ramanan
- 01 Dec 2013 - 
- Vol. 35, Iss: 12, pp 2878-2890

UC Irvine
UC Irvine Previously Published Works
Title
Articulated human detection with flexible mixtures of parts.
Permalink
https://escholarship.org/uc/item/7sk1s10g
Journal
IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12)
ISSN
0162-8828
Authors
Yang, Yi
Ramanan, Deva
Publication Date
2013-12-01
DOI
10.1109/tpami.2012.261
Copyright Information
This work is made available under the terms of a Creative Commons Attribution License, available at https://creativecommons.org/licenses/by/4.0/
Peer reviewed

Articulated Human Detection with
Flexible Mixtures of Parts
Yi Yang, Member, IEEE, and Deva Ramanan, Member, IEEE
Abstract—We describe a method for articulated human detection and human pose estimation in static images based on a new
representation of deformable part models. Rather than modeling articulation using a family of warped (rotated and foreshortened)
templates, we use a mixture of small, nonoriented parts. We describe a general, flexible mixture model that jointly captures spatial
relations between part locations and co-occurrence relations between part mixtures, augmenting standard pictorial structure models
that encode just spatial relations. Our models have several notable properties: 1) They efficiently model articulation by sharing
computation across similar warps, 2) they efficiently model an exponentially large set of global mixtures through composition of local
mixtures, and 3) they capture the dependency of global geometry on local appearance (parts look different at different locations). When
relations are tree structured, our models can be efficiently optimized with dynamic programming. We learn all parameters, including
local appearances, spatial relations, and co-occurrence relations (which encode local rigidity) with a structured SVM solver. Because
our model is efficient enough to be used as a detector that searches over scales and image locations, we introduce novel criteria for
evaluating pose estimation and human detection, both separately and jointly. We show that currently used evaluation criteria may
conflate these two issues. Most previous approaches model limbs with rigid and articulated templates that are trained independently of
each other, while we present an extensive diagnostic evaluation that suggests that flexible structure and joint training are crucial for
strong performance. We present experimental results on standard benchmarks that suggest our approach is the state-of-the-art
system for pose estimation, improving past work on the challenging Parse and Buffy datasets while being orders of magnitude faster.
Index Terms—Pose estimation, object detection, articulated shapes, deformable part models
1 INTRODUCTION
Articulated pose estimation is a fundamental task in computer vision. A working technology would immediately impact many key vision tasks such as image understanding and activity recognition. An influential approach is the pictorial structure framework [1], [2], which decomposes the appearance of objects into local part templates, together with geometric constraints on pairs of parts, often visualized as springs. When parts are parameterized by pixel location and orientation, the resulting structure can model articulation. This has been the dominant approach for human pose estimation. In contrast, traditional models for object recognition use parts parameterized solely by locations, which simplifies both inference and learning. Such models have been shown to be very successful for object detection [3], [4]. In this work, we introduce a novel, unified representation for both models which produces state-of-the-art results for the tasks of detecting articulated people and estimating their poses.
Representations for articulated pose: Full-body pose estimation is difficult because of the many degrees of freedom to be estimated. Moreover, limbs vary greatly in appearance due to changes in clothing and body shape, as well as changes in viewpoint manifested in in-plane rotations and foreshortening. These difficulties complicate inference, as one must typically search images with a large number of warped (rotated and foreshortened) templates. We address these problems by introducing a simple representation for modeling a family of warped templates: a mixture of pictorial structures with small, nonoriented parts (Fig. 1). Our approach is significantly faster than an articulated model because we exploit dynamic programming to share computation across similar warps during matching. Our approach can also outperform articulated models because we capture the effect of global geometry on local appearance; an elbow looks different when positioned above the head or beside the torso. One reason for this is that elbows rotate and foreshorten. However, appearance changes also arise from other geometric factors, such as partial occlusions and interactions with clothing. Our models capture such often-ignored dependencies because local mixtures depend on the spatial arrangement of parts.
Representations for objects: Part models are also common in general object recognition. Because translating parts do not deform too much in practice, one often resorts to global mixture models to capture large appearance changes [4]. Rather, we compose together local part mixtures to model an exponentially large set of global mixtures. Not all such combinations are equally likely; we learn a prior over which local mixtures can co-occur. This allows our model to learn notions of local rigidity; for example, two parts on the same rigid limb must co-occur with a consistently oriented edge structure. An open challenge is that of learning such complex object representations from data. We find that supervision is a key ingredient for learning structured relational models; one can use limb orientation as a supervisory signal to annotate part mixture labels in training data.

2878 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 35, NO. 12, DECEMBER 2013

The authors are with the Department of Computer Science, University of California at Irvine, Irvine, CA 92697. E-mail: {yyang8, dramanan}@ics.uci.edu.
Manuscript received 16 Apr. 2012; revised 27 July 2012; accepted 1 Dec. 2012; published online 11 Dec. 2012.
Recommended for acceptance by P. Felzenszwalb, D. Forsyth, P. Fua, and T.E. Boult.
For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMISI-2012-04-0298.
Digital Object Identifier no. 10.1109/TPAMI.2012.261.
0162-8828/13/$31.00 © 2013 IEEE Published by the IEEE Computer Society
Efficiency: For computational reasons, most prior work on pose estimation assumes that people are prelocalized with a detector that provides the rough pixel location and scale of each person. Our model is fast enough to search over all locations and scales, and so we both detect and estimate human poses without any preprocessing. Our model requires roughly 1 second to process a typical benchmark image, allowing for the possibility of real-time performance with further speedups (such as cascaded [5] or parallelized implementations). We have released open-source code [6] which appears to be in use within the community.
Evaluation: The most popular evaluation criteria for pose
estimation are the percentage of correctly localized parts
(PCP) criteria introduced in [7]. Though these criteria were
crucial and influential in spurring quantitative evaluation,
they were somewhat ambiguously specified in [7], resulting
in possibly conflicting implementations.
One point of confusion is that PCP, as originally
specified, assume humans are predetected on test images.
This assumption may be unrealistic because it is hard to
build detectors for highly articulated poses (for the same
reason it is hard to correctly estimate their configurations).
Another point of confusion is that there appear to be two
interpretations of the definition of correctly localized parts
criteria introduced in [7]. We will give a detailed description of these issues in Section 7.
Unfortunately, these subtle confusions lead to significant
differences in terms of final performance results. We show
that there may exist a negative correlation between
body-part detection accuracy and PCP as implemented in
the toolkit released by [8]. We then introduce new
evaluation criteria for pose estimation and body-part
detection that are self-consistent. We evaluate all different
types of PCP criteria and our new criteria on two standard
benchmark datasets [7], [9].
Overview: An earlier version of this manuscript appeared
in [10]. This version includes a slightly refined model, additional diagnostic experiments, and an in-depth discussion of evaluation criteria. After discussing related work,
we motivate our approach in Section 3, describe our model
in Section 4, describe algorithms for inference in Section 5,
and describe methods for learning parameters from
training data in Section 6. We then show experimental
results and diagnostic experiments on our benchmark data
sets in Section 7.
2 RELATED WORK
Pose estimation has typically been addressed in the video
domain, dating back to the classic model-based approaches
of O'Rourke and Badler [11], Hogg [12], and Rohr [13]. Recent
work has examined the problem for static images, assuming
that such techniques will be needed to initialize video-based
articulated trackers. We refer the reader to the recent survey
article [14] for a full review of contemporary approaches.
Spatial structure: One area of research is the encoding of
spatial structure, often described through the formalism of
probabilistic graphical models. Tree-structured graphical
models allow for efficient inference [1], [15], but are plagued
by double counting; given a parent torso, two legs are
localized independently and often respond to the same
image region. Loopy constraints address this limitation but
require approximate inference strategies such as sampling
[1], [16], [17], loopy belief propagation [18], or iterative
approximations [19]. Recent work has suggested that
branch-and-bound algorithms with tree-based lower
bounds can globally solve such problems [20], [21]. Another
approach to eliminating double counting is the use of
stronger pose priors [22]. However, such methods may
overfit to the statistics of a particular dataset, as warned by
[18], [23]. We find that simple tree models, when trained contextually with part models in a discriminative framework, are fairly effective.
Learning: An alternate family of techniques has explored the tradeoff between generative and discriminative models. Approaches include conditional random fields [24],
margin-based learning [25], and boosted detectors [26],
[27], [21]. Most previous approaches train limb detectors
independently, in part due to the computational burdens
of inference. Our representation is efficient enough to be
learned jointly; we show in our experimental results that
joint learning is crucial for accurate performance. A small part trained by itself is too weak to provide a strong signal, but a collection of patches trained contextually is rather discriminative.
Image features: An important issue for computer vision
tasks is feature description. Past work has explored the use
of superpixels [28], contours [26], [29], [30], foreground/
background color models [9], [7], edge-based descriptors
[31], [32], and gradient descriptors [27], [33]. We use
oriented gradient descriptors [34] that allow for fast
computation, but our approach could be combined with
other descriptors. Recent work has integrated our models
with steerable image descriptors for highly efficient pose
estimation [35].
Large versus small parts: In recent history, researchers
have begun exploring large-scale, nonarticulated parts that
span multiple limbs on the body (“Poselets”) [3]. Such
models were originally developed for human detection, but
[36] extends them to pose estimation. Large-scale parts can
be integrated into a hierarchical, coarse-to-fine representa-
tion [37], [38]. The underlying intuition behind such
approaches stems from the observation that it is hard to build accurate limb detectors because they are nondescript in appearance (i.e., limbs are defined by parallel lines that may commonly occur in clutter). This motivates the use of larger parts with more context. We demonstrate that jointly training small parts has the same contextual effect.

Fig. 1. Our flexible mixture-of-parts model (middle) differs from classic approaches (left) that model articulation by warping a single template to different orientation and foreshortening states (top right). Instead, we approximate small warps by translating patches connected with a spring (bottom right). For a large warp, we use a different set of patches and a different spring. Hence, our model captures the dependence of local part appearance on geometry (i.e., elbows in different spatial arrangements look different).
Object detection: In terms of object detection, our work is
most similar to pictorial structure models that reason about
mixtures of parts [39], [1], [4], [15]. We show that our model
generalizes such representations in Section 4.1. Our local
mixture model can also be seen as an AND-OR grammar
where a pose is derived by AND’ing across all parts and
OR’ing across all local mixtures [4], [40].
3 MOTIVATION
Our model is an approximation for capturing a continuous
family of warps. The classic approach of using a finite set of
articulated templates is also an approximation. In this
section, we present a straightforward theoretical analysis of
both. For simplicity, we restrict ourselves to affine warps,
though a similar derivation holds for any smooth warping
function, including perspective warps (Fig. 2).
Let us write $x$ for a 2D pixel position in a template and $w(x) = (I + \epsilon_A)x + b$ for its new position under a small affine warp $A = I + \epsilon_A$ and any translation $b$. We use $\epsilon_A$ to parameterize the deviation of the warp from an identity warp. Define $s(x) = w(x) - x$ to be the shift of position $x$. The shift of a nearby position $x + \Delta x$ can be written as

$$s(x + \Delta x) = w(x + \Delta x) - (x + \Delta x) = (I + \epsilon_A)(x + \Delta x) + b - x - \Delta x = s(x) + \epsilon_A \Delta x.$$

Both pixels $x$ and $x + \Delta x$ shift by the same amount (and can be modeled as a single part) if the product $\epsilon_A \Delta x$ is small, which is true if $\epsilon_A$ has small determinant or $\Delta x$ has small norm. Classic articulated models use a large family of discretized articulations, where each discrete template only needs to explain a small range of rotations and foreshortening (e.g., a small-determinant $\epsilon_A$). We take the opposite approach, making $\Delta x$ small by using small parts. Since we want the norm of $\Delta x$ to be small, this suggests that circular parts would work best, but we use square parts as a discrete approximation. In the extreme case, one could define a set of single-pixel parts. Such a representation is indeed the most flexible, but becomes difficult to train given our learning formulation described below.
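The identity $s(x + \Delta x) = s(x) + \epsilon_A \Delta x$ above is easy to check numerically. The sketch below is our own illustration (the matrix `eps_A`, the translation `b`, and the sample points are made-up values, not from the paper); it confirms that two nearby pixels shift by almost the same amount under a small affine warp, with the difference given exactly by $\epsilon_A \Delta x$:

```python
import numpy as np

# A small deviation from the identity warp, plus a translation (our toy values).
eps_A = np.array([[0.02, -0.05],
                  [0.04,  0.03]])
b = np.array([7.0, -3.0])

def shift(x):
    """s(x) = w(x) - x, where w(x) = (I + eps_A) x + b."""
    return eps_A @ x + b

x  = np.array([100.0, 50.0])
dx = np.array([3.0, -2.0])   # a nearby pixel inside the same small part

# The derivation says s(x + dx) - s(x) = eps_A @ dx exactly.
diff = shift(x + dx) - shift(x)
assert np.allclose(diff, eps_A @ dx)
print(diff)  # small residual: both pixels move almost identically
```

Shrinking either `eps_A` (many discretized articulated templates) or `dx` (small parts, as in this paper) makes `diff` negligible, which is exactly the two approximation strategies contrasted in the text.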
4 MODEL
Let us write $I$ for an image, $l_i = (x, y)$ for the pixel location of part $i$, and $t_i$ for the mixture component of part $i$. We write $i \in \{1, \ldots, K\}$, $l_i \in \{1, \ldots, L\}$, and $t_i \in \{1, \ldots, T\}$. We call $t_i$ the "type" of part $i$. Our motivating examples of types include orientations of a part (e.g., a vertical versus horizontally oriented hand), but types may span out-of-plane rotations (front-view head versus side-view head) or even semantic classes (an open versus closed hand). For notational convenience, we define the lack of subscript to indicate a set spanned by that subscript (e.g., $t = \{t_1, \ldots, t_K\}$). For simplicity, we define our model at a fixed scale; at test time, we detect people of different sizes by searching over an image pyramid.
Co-occurrence model: To score a configuration of parts, we first define a compatibility function for part types that factors into a sum of local and pairwise scores:

$$S(t) = \sum_{i \in V} b_i^{t_i} + \sum_{ij \in E} b_{ij}^{t_i, t_j}. \qquad (1)$$

The parameter $b_i^{t_i}$ favors particular type assignments for part $i$, while the pairwise parameter $b_{ij}^{t_i, t_j}$ favors particular co-occurrences of part types. For example, if part types correspond to orientations and parts $i$ and $j$ are on the same rigid limb, then $b_{ij}^{t_i, t_j}$ would favor consistent orientation assignments. Specifically, $b_{ij}^{t_i, t_j}$ should be a large positive number for consistent orientations $t_i$ and $t_j$, and a large negative number for inconsistent orientations $t_i$ and $t_j$.
Rigidity: We write $G = (V, E)$ for a (tree-structured) $K$-node relational graph whose edges specify which pairs of parts are constrained to have consistent relations. Such a graph can still encode relations between distant parts through transitivity. For example, our model can force a collection of parts to share the same orientation so long as the parts form a connected subtree of $G = (V, E)$. We use this property to model multiple parts on the torso. Since co-occurrence parameters are learned, our model learns which collections of parts should be rigid.
We can now write the full score associated with a configuration of part types and positions:

$$S(I, l, t) = S(t) + \sum_{i \in V} w_i^{t_i} \cdot \phi(I, l_i) + \sum_{ij \in E} w_{ij}^{t_i, t_j} \cdot \psi(l_i - l_j), \qquad (2)$$

where $\phi(I, l_i)$ is a feature vector (e.g., a HOG descriptor [34]) extracted from pixel location $l_i$ in image $I$. We write $\psi(l_i - l_j) = [\,dx \;\; dx^2 \;\; dy \;\; dy^2\,]^T$, where $dx = x_i - x_j$ and $dy = y_i - y_j$ give the relative location of part $i$ with respect to $j$. Notably, this relative location is defined with respect to the pixel grid and not the orientation of part $i$ (as in classic articulated pictorial structures [1]).
Fig. 2. We show that four small, translating parts can approximate
nonaffine (e.g., perspective) warps.

Appearance model: The first sum in (2) is an appearance model that computes the local score of placing a template $w_i^{t_i}$ for part $i$, tuned for type $t_i$, at location $l_i$.
Deformation model: The second term can be interpreted as a "switching" spring model that controls the relative placement of parts $i$ and $j$ by switching between a collection of springs. Each spring is tailored for a particular pair of types $(t_i, t_j)$, and is parameterized by its rest location and rigidity, which are encoded by $w_{ij}^{t_i, t_j}$. Our switching spring model encodes the dependence of local appearance on geometry, since different pairs of local mixtures are constrained to use different springs. Together with the co-occurrence term, it specifies an image-independent "prior" over part locations and types.
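To make (1) and (2) concrete, here is a toy scoring sketch. This is our own illustration, not the authors' released code: the two-part model, the random biases, and the tabulated appearance scores (`app1`, `app2`, standing in for the template responses $w_i^{t_i} \cdot \phi(I, l_i)$) are all made up for demonstration.

```python
import numpy as np

T = 2                                  # types per part
rng = np.random.default_rng(0)

# Unary type biases b_i^{t_i} and pairwise co-occurrence biases b_ij^{t_i,t_j} (eq. 1).
b1  = rng.normal(size=T)
b2  = rng.normal(size=T)
b12 = rng.normal(size=(T, T))

# Stand-ins for the appearance terms w_i^{t_i} . phi(I, l_i), one score per type.
app1 = {0: 1.2, 1: 0.4}
app2 = {0: 0.9, 1: 1.5}

# Type-dependent spring parameters w_ij^{t_i,t_j} over psi = [dx, dx^2, dy, dy^2].
w12 = rng.normal(size=(T, T, 4))

def psi(li, lj):
    dx, dy = li[0] - lj[0], li[1] - lj[1]
    return np.array([dx, dx * dx, dy, dy * dy])

def score(l1, t1, l2, t2):
    S_t   = b1[t1] + b2[t2] + b12[t1, t2]   # co-occurrence model S(t), eq. (1)
    S_app = app1[t1] + app2[t2]             # appearance terms
    S_def = w12[t1, t2] @ psi(l1, l2)       # switching spring for this type pair
    return S_t + S_app + S_def

print(score((10, 20), 0, (10, 28), 1))
```

Note how the spring term `S_def` changes with `(t1, t2)`: that is the "switching" behavior described above, coupling local appearance (types) to geometry (springs).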
4.1 Special Cases
We now describe various special cases of our model. The
first three correspond to special cases that have previously
occurred in the literature, while the last refers to a special
case we implement in our experiments.
Stretchable human models: Sapp et al. [41] describe a human part model that consists of a single part at each joint. This is equivalent to our model with $K = 14$ parts, each with a single mixture ($T = 1$). Similarly to us, Sapp et al. [41] argue that a joint-centric representation efficiently captures foreshortening and articulation effects. However, our local mixture models (for $T > 1$) also capture the dependence of global geometry on local appearance; elbows look different when positioned above the head or beside the torso. We compare to such a model in our diagnostic experiments.
Semantic part models: Epshtein and Ullman [39] argue that
part appearances should capture semantic classes and not
visual classes; this can be done with a type model. Consider
a face model with eye and mouth parts. One may want to
model different types of eyes (open and closed) and mouths
(smiling and frowning). The spatial relationship between
the two does not likely depend on their type, but open eyes
may tend to co-occur with smiling mouths. This can be
obtained as a special case of our model by using a single
spring for all types of a particular pair of parts:
$$w_{ij}^{t_i, t_j} = w_{ij}. \qquad (3)$$
Mixtures of deformable parts: Felzenszwalb et al. [4] define a mixture of models, where each model is a star-based pictorial structure. This can be achieved by restricting the co-occurrence model to allow for only globally consistent types:

$$b_{ij}^{t_i, t_j} = \begin{cases} 0 & \text{if } t_i = t_j \\ -\infty & \text{otherwise.} \end{cases} \qquad (4)$$
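The restriction in (4) can be written down directly as a $T \times T$ table. This small sketch (ours, under the stated assumption that types are indexed 0..T-1) shows why it collapses the exponentially many type combinations to just $T$ global mixtures: any edge with disagreeing types scores $-\infty$.

```python
import numpy as np

T = 3
# Eq. (4): zero bias when types agree along an edge, -inf otherwise,
# so only the T globally consistent type assignments survive.
b_ij = np.where(np.eye(T, dtype=bool), 0.0, -np.inf)
print(b_ij)
```

With this table plugged into (1), every tree of $K$ parts admits exactly $T$ finite-scoring type configurations, recovering the global mixtures of [4].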
Articulation: In our experiments, we explore a simplified version of (2) with a reduced set of springs:

$$w_{ij}^{t_i, t_j} = w_{ij}^{t_i}. \qquad (5)$$
The above simplification states that the relative location of a part with respect to its parent depends on the part's own type, but not its parent's type. For example, let $i$ be a hand part, $j$ its
parent elbow part, and assume part types capture
orientation. The above relational model states that a
sideways-oriented hand should tend to lie next to the
elbow, while a downward-oriented hand should lie below
the elbow, regardless of the orientation of the upper arm.
5 INFERENCE
Inference corresponds to maximizing $S(I, l, t)$ from (2) over $l$ and $t$. When the relational graph $G = (V, E)$ is a tree, this can be done efficiently with dynamic programming. To illustrate inference, let us rewrite (2) by defining $z_i = (l_i, t_i)$ to denote both the discrete pixel location and discrete mixture type of part $i$:

$$S(I, z) = \sum_{i \in V} \phi_i(I, z_i) + \sum_{ij \in E} \psi_{ij}(z_i, z_j),$$

where

$$\phi_i(I, z_i) = w_i^{t_i} \cdot \phi(I, l_i) + b_i^{t_i}$$
$$\psi_{ij}(z_i, z_j) = w_{ij}^{t_i, t_j} \cdot \psi(l_i - l_j) + b_{ij}^{t_i, t_j}.$$

From this perspective, it is clear that our final model is a discrete, pairwise Markov random field. When $G = (V, E)$ is tree structured, one can compute $\max_z S(I, z)$ with dynamic programming.
To be precise, we iterate over all parts starting from the leaves and moving "upstream" to the root part. We define $\mathrm{kids}(i)$ to be the set of children of part $i$, which is the empty set for leaf parts. We compute the message part $i$ passes to its parent $j$ as follows:

$$\mathrm{score}_i(z_i) = \phi_i(I, z_i) + \sum_{k \in \mathrm{kids}(i)} m_k(z_i) \qquad (6)$$
$$m_i(z_j) = \max_{z_i} \left[ \mathrm{score}_i(z_i) + \psi_{ij}(z_i, z_j) \right]. \qquad (7)$$

Equation (6) computes the local score of part $i$, at all pixel locations $l_i$ and for all possible types $t_i$, by collecting messages from the children of $i$. Equation (7) computes, for every location and possible type of part $j$, the best scoring location and type of its child part $i$. Once messages are passed to the root part ($i = 1$), $\mathrm{score}_1(z_1)$ represents the best scoring configuration for each root position and type. One can use these root scores to generate multiple detections in image $I$ by thresholding them and applying nonmaximum suppression (NMS). By keeping track of the argmax indices, one can backtrack to find the location and type of each part in each maximal configuration. To find multiple detections anchored at the same root, one can use N-best extensions of dynamic programming [42].
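A minimal dynamic-programming sketch of (6) and (7) follows. This is our own toy implementation, not the released code: it flattens each $z_i = (l_i, t_i)$ into one discrete state, uses random potentials on a 3-part chain (names like `unary`, `pair`, and `parent` are ours), and runs the brute-force inner maximization rather than the distance transforms discussed next.

```python
import itertools
import numpy as np

# Toy chain of K=3 parts; each z_i = (l_i, t_i) is flattened into one of S states.
K, S = 3, 6                              # S plays the role of L*T
rng = np.random.default_rng(1)
unary = rng.normal(size=(K, S))          # phi_i(I, z_i)
pair = rng.normal(size=(K, S, S))        # psi_{i,parent(i)}(z_i, z_j); pair[0] unused
parent = {1: 0, 2: 1}                    # part 0 is the root

def max_sum():
    score = unary.copy()
    # Leaves upstream to root: eq. (6) accumulates child messages at each parent,
    # eq. (7) maxes the child state out for every parent state.
    for i in sorted(parent, reverse=True):
        j = parent[i]
        msg = (score[i][:, None] + pair[i]).max(axis=0)   # m_i(z_j), eq. (7)
        score[j] += msg                                    # eq. (6) at the parent
    return score[0].max()

# Verify against exhaustive search over all S^K configurations.
brute = max(unary[0, z0] + unary[1, z1] + unary[2, z2]
            + pair[1, z1, z0] + pair[2, z2, z1]
            for z0, z1, z2 in itertools.product(range(S), repeat=3))
assert np.isclose(max_sum(), brute)
print(max_sum())
```

Keeping the argmax indices inside `max(axis=0)` (via `argmax`) would allow the backtracking step described above to recover the maximizing location and type of every part.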
Computation: The computationally taxing portion of dynamic programming is (7). We rewrite this step in detail:

$$m_i(t_j, l_j) = \max_{t_i} \left[ b_{ij}^{t_i, t_j} + \max_{l_i} \left( \mathrm{score}_i(t_i, l_i) + w_{ij}^{t_i, t_j} \cdot \psi(l_i - l_j) \right) \right]. \qquad (8)$$

One has to loop over $L \cdot T$ possible parent locations and types, and compute a max over $L \cdot T$ possible child locations and types, making the computation $O(L^2 T^2)$ for each part. When $\psi(l_i - l_j)$ is a quadratic function (as is the case for us), the inner maximization in (8) can be efficiently computed for each combination of $t_i$ and $t_j$ in $O(L)$ with a max-convolution or distance transform [1]. Since one has to
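The $O(L)$ inner maximization rests on the generalized distance transform of [1]. Below is our own 1D sketch of the min-version (the max case follows by negating scores, and the linear spring terms can be folded in by completing the square); it computes the lower envelope of parabolas in a single pass:

```python
import math

def dt1d(f):
    """d[p] = min_q f[q] + (p - q)^2, via the lower-envelope algorithm of
    Felzenszwalb and Huttenlocher. Runs in O(n) for n = len(f)."""
    n = len(f)
    v = [0] * n                           # locations of parabolas in the envelope
    z = [-math.inf] + [math.inf] * n      # range boundaries of each parabola
    k = 0
    for q in range(1, n):
        # Intersection of the parabola rooted at q with the rightmost one kept.
        s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        while s <= z[k]:                  # new parabola dominates: pop and retry
            k -= 1
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = math.inf
    d, k = [0.0] * n, 0
    for p in range(n):                    # read off the envelope left to right
        while z[k + 1] < p:
            k += 1
        d[p] = (p - v[k]) ** 2 + f[v[k]]
    return d

f = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]
assert dt1d(f) == [min(f[q] + (p - q) ** 2 for q in range(len(f)))
                   for p in range(len(f))]
```

Applied once per row and once per column, this gives the 2D transform over all $L$ pixel locations, reducing the per-type-pair cost of (8) from $O(L^2)$ to $O(L)$.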

References (selected)
N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes (VOC) Challenge," Int'l J. Computer Vision, 2010.
P.F. Felzenszwalb, R.B. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part-Based Models," IEEE Trans. Pattern Analysis and Machine Intelligence, 2010.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A Library for Large Linear Classification," J. Machine Learning Research, 2008.
P.F. Felzenszwalb and D.P. Huttenlocher, "Pictorial Structures for Object Recognition," Int'l J. Computer Vision, 2005.