
Articulated Human Detection with Flexible Mixtures of Parts

Yi Yang, Deva Ramanan
- 01 Dec 2013 - 
- Vol. 35, Iss: 12, pp 2878-2890

UC Irvine
UC Irvine Previously Published Works
Title
Articulated human detection with flexible mixtures of parts.
Permalink
https://escholarship.org/uc/item/7sk1s10g
Journal
IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12)
ISSN
0162-8828
Authors
Yang, Yi
Ramanan, Deva
Publication Date
2013-12-01
DOI
10.1109/tpami.2012.261
Copyright Information
This work is made available under the terms of a Creative Commons Attribution License, available at https://creativecommons.org/licenses/by/4.0/
Peer reviewed

Articulated Human Detection with
Flexible Mixtures of Parts
Yi Yang, Member, IEEE, and Deva Ramanan, Member, IEEE
Abstract—We describe a method for articulated human detection and human pose estimation in static images based on a new
representation of deformable part models. Rather than modeling articulation using a family of warped (rotated and foreshortened)
templates, we use a mixture of small, nonoriented parts. We describe a general, flexible mixture model that jointly captures spatial
relations between part locations and co-occurrence relations between part mixtures, augmenting standard pictorial structure models
that encode just spatial relations. Our models have several notable properties: 1) They efficiently model articulation by sharing
computation across similar warps, 2) they efficiently model an exponentially large set of global mixtures through composition of local
mixtures, and 3) they capture the dependency of global geometry on local appearance (parts look different at different locations). When
relations are tree structured, our models can be efficiently optimized with dynamic programming. We learn all parameters, including
local appearances, spatial relations, and co-occurrence relations (which encode local rigidity) with a structured SVM solver. Because
our model is efficient enough to be used as a detector that searches over scales and image locations, we introduce novel criteria for
evaluating pose estimation and human detection, both separately and jointly. We show that currently used evaluation criteria may
conflate these two issues. Most previous approaches model limbs with rigid and articulated templates that are trained independently of
each other, while we present an extensive diagnostic evaluation that suggests that flexible structure and joint training are crucial for
strong performance. We present experimental results on standard benchmarks that suggest our approach is the state-of-the-art
system for pose estimation, improving past work on the challenging Parse and Buffy datasets while being orders of magnitude faster.
Index Terms—Pose estimation, object detection, articulated shapes, deformable part models
1 INTRODUCTION
Articulated pose estimation is a fundamental task in computer vision. A working technology would immediately impact many key vision tasks such as image understanding and activity recognition. An influential approach is the pictorial structure framework [1], [2], which decomposes the appearance of objects into local part templates, together with geometric constraints on pairs of parts, often visualized as springs. When parts are parameterized by pixel location and orientation, the resulting structure can model articulation. This has been the dominant approach for human pose estimation. In contrast, traditional models for object recognition use parts parameterized solely by locations, which simplifies both inference and learning. Such models have been shown to be very successful for object detection [3], [4]. In this work, we introduce a novel, unified representation for both models which produces state-of-the-art results for the tasks of detecting articulated people and estimating their poses.
Representations for articulated pose: Full-body pose estimation is difficult because of the many degrees of freedom to be estimated. Moreover, limbs vary greatly in appearance due to changes in clothing and body shape, as well as changes in viewpoint manifested in in-plane rotations and foreshortening. These difficulties complicate inference, as one must typically search images with a large number of warped (rotated and foreshortened) templates. We address these problems by introducing a simple representation for modeling a family of warped templates: a mixture of pictorial structures with small, nonoriented parts (Fig. 1). Our approach is significantly faster than an articulated model because we exploit dynamic programming to share computation across similar warps during matching. Our approach can also outperform articulated models because we capture the effect of global geometry on local appearance; an elbow looks different when positioned above the head or beside the torso. One reason for this is that elbows rotate and foreshorten. However, appearance changes also arise from other geometric factors, such as partial occlusions and interactions with clothing. Our models capture such often-ignored dependencies because local mixtures depend on the spatial arrangement of parts.
Representations for objects: Part models are also common in general object recognition. Because translating parts do not deform too much in practice, one often resorts to global mixture models to capture large appearance changes [4]. Rather, we compose together local part mixtures to model an exponentially large set of global mixtures. Not all such combinations are equally likely; we learn a prior over which local mixtures can co-occur. This allows our model to learn notions of local rigidity; for example, two parts on the same rigid limb must co-occur with a consistently oriented edge structure. An open challenge is that of learning such complex object representations from data. We find that supervision is a key ingredient for learning structured relational models; one can use limb orientation as a supervisory signal to annotate part mixture labels in training data.

2878 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 35, NO. 12, DECEMBER 2013

The authors are with the Department of Computer Science, University of California at Irvine, Irvine, CA 92697. E-mail: {yyang8, dramanan}@ics.uci.edu.
Manuscript received 16 Apr. 2012; revised 27 July 2012; accepted 1 Dec. 2012; published online 11 Dec. 2012.
Recommended for acceptance by P. Felzenszwalb, D. Forsyth, P. Fua, and T.E. Boult.
For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMISI-2012-04-0298.
Digital Object Identifier no. 10.1109/TPAMI.2012.261.
0162-8828/13/$31.00 © 2013 IEEE Published by the IEEE Computer Society
Efficiency: For computational reasons, most prior work on pose estimation assumes that people are prelocalized with a detector that provides the rough pixel location and scale of each person. Our model is fast enough to search over all locations and scales, and so we both detect and estimate human poses without any preprocessing. Our model requires roughly 1 second to process a typical benchmark image, allowing for the possibility of real-time performance with further speedups (such as cascaded [5] or parallelized implementations). We have released open-source code [6] which appears to be in use within the community.
Evaluation: The most popular evaluation criteria for pose
estimation are the percentage of correctly localized parts
(PCP) criteria introduced in [7]. Though these criteria were
crucial and influential in spurring quantitative evaluation,
they were somewhat ambiguously specified in [7], resulting
in possibly conflicting implementations.
One point of confusion is that PCP, as originally
specified, assume humans are predetected on test images.
This assumption may be unrealistic because it is hard to
build detectors for highly articulated poses (for the same
reason it is hard to correctly estimate their configurations).
Another point of confusion is that there appear to be two
interpretations of the definition of correctly localized parts
criteria introduced in [7]. We will give a detailed description of these issues in Section 7.
Unfortunately, these subtle confusions lead to significant
differences in terms of final performance results. We show
that there may exist a negative correlation between
body-part detection accuracy and PCP as implemented in
the toolkit released by [8]. We then introduce new
evaluation criteria for pose estimation and body-part
detection that are self-consistent. We evaluate all different
types of PCP criteria and our new criteria on two standard
benchmark datasets [7], [9].
Overview: An earlier version of this manuscript appeared
in [10]. This version includes a slightly refined model, additional diagnostic experiments, and an in-depth discussion of evaluation criteria. After discussing related work,
we motivate our approach in Section 3, describe our model
in Section 4, describe algorithms for inference in Section 5,
and describe methods for learning parameters from
training data in Section 6. We then show experimental
results and diagnostic experiments on our benchmark data
sets in Section 7.
2 RELATED WORK
Pose estimation has typically been addressed in the video
domain, dating back to the classic model-based approaches
of O'Rourke and Badler [11], Hogg [12], and Rohr [13]. Recent
work has examined the problem for static images, assuming
that such techniques will be needed to initialize video-based
articulated trackers. We refer the reader to the recent survey
article [14] for a full review of contemporary approaches.
Spatial structure: One area of research is the encoding of
spatial structure, often described through the formalism of
probabilistic graphical models. Tree-structured graphical
models allow for efficient inference [1], [15], but are plagued
by double counting; given a parent torso, two legs are
localized independently and often respond to the same
image region. Loopy constraints address this limitation but
require approximate inference strategies such as sampling
[1], [16], [17], loopy belief propagation [18], or iterative
approximations [19]. Recent work has suggested that
branch-and-bound algorithms with tree-based lower
bounds can globally solve such problems [20], [21]. Another
approach to eliminating double counting is the use of
stronger pose priors [22]. However, such methods may
overfit to the statistics of a particular dataset, as warned by
[18], [23]. We find that simple tree models, when trained contextually with part models in a discriminative framework, are fairly effective.
Learning: An alternate family of techniques has explored the tradeoff between generative and discriminative models. Approaches include conditional random fields [24],
margin-based learning [25], and boosted detectors [26],
[27], [21]. Most previous approaches train limb detectors
independently, in part due to the computational burdens
of inference. Our representation is efficient enough to be
learned jointly; we show in our experimental results that
joint learning is crucial for accurate performance. A small part trained by itself is too weak to provide a strong signal, but a collection of patches trained contextually is rather discriminative.
Image features: An important issue for computer vision
tasks is feature description. Past work has explored the use
of superpixels [28], contours [26], [29], [30], foreground/
background color models [9], [7], edge-based descriptors
[31], [32], and gradient descriptors [27], [33]. We use
oriented gradient descriptors [34] that allow for fast
computation, but our approach could be combined with
other descriptors. Recent work has integrated our models
with steerable image descriptors for highly efficient pose
estimation [35].
Large versus small parts: In recent history, researchers
have begun exploring large-scale, nonarticulated parts that
span multiple limbs on the body (“Poselets”) [3]. Such
models were originally developed for human detection, but
[36] extends them to pose estimation. Large-scale parts can
be integrated into a hierarchical, coarse-to-fine representa-
tion [37], [38]. The underlying intuition behind such
approaches stems from the observation that it is hard to build accurate limb detectors because they are nondescript in appearance (i.e., limbs are defined by parallel lines that may commonly occur in clutter). This motivates the use of larger parts with more context. We demonstrate that jointly training small parts has the same contextual effect.

Fig. 1. Our flexible mixture-of-parts model (middle) differs from classic approaches (left) that model articulation by warping a single template to different orientation and foreshortening states (top right). Instead, we approximate small warps by translating patches connected with a spring (bottom right). For a large warp, we use a different set of patches and a different spring. Hence, our model captures the dependence of local part appearance on geometry (i.e., elbows in different spatial arrangements look different).
Object detection: In terms of object detection, our work is
most similar to pictorial structure models that reason about
mixtures of parts [39], [1], [4], [15]. We show that our model
generalizes such representations in Section 4.1. Our local
mixture model can also be seen as an AND-OR grammar
where a pose is derived by AND’ing across all parts and
OR’ing across all local mixtures [4], [40].
3 MOTIVATION
Our model is an approximation for capturing a continuous
family of warps. The classic approach of using a finite set of
articulated templates is also an approximation. In this
section, we present a straightforward theoretical analysis of
both. For simplicity, we restrict ourselves to affine warps,
though a similar derivation holds for any smooth warping
function, including perspective warps (Fig. 2).
Let us write $x$ for a 2D pixel position in a template and $w(x) = (I + \epsilon_A)x + b$ for its new position under a small affine warp $A = I + \epsilon_A$ and any translation $b$. We use $\epsilon_A$ to parameterize the deviation of the warp from an identity warp. Define $s(x) = w(x) - x$ to be the shift of position $x$. The shift of a nearby position $x + \Delta x$ can be written as

$$s(x + \Delta x) = w(x + \Delta x) - (x + \Delta x) = (I + \epsilon_A)(x + \Delta x) + b - x - \Delta x = s(x) + \epsilon_A \Delta x.$$

Both pixels $x$ and $x + \Delta x$ shift by the same amount (and can be modeled as a single part) if the product $\epsilon_A \Delta x$ is small, which is true if $\epsilon_A$ has small determinant or $\Delta x$ has small norm. Classic articulated models use a large family of discretized articulations, where each discrete template only needs to explain a small range of rotations and foreshortening (e.g., a small-determinant $\epsilon_A$). We take the opposite approach, making $\Delta x$ small by using small parts. Since we want the norm of $\Delta x$ to be small, this suggests that circular parts would work best, but we use square parts as a discrete approximation. In the extreme case, one could define a set of single-pixel parts. Such a representation is indeed the most flexible, but becomes difficult to train given our learning formulation described below.
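The identity $s(x + \Delta x) = s(x) + \epsilon_A \Delta x$ above is easy to check numerically. The sketch below is our own illustration (the matrix `eps_A`, the translation `b`, and the sample points are made-up values, not from the paper); it confirms that two nearby pixels shift by almost the same amount under a small affine warp, with the difference given exactly by $\epsilon_A \Delta x$:

```python
import numpy as np

# A small deviation from the identity warp, plus a translation (our toy values).
eps_A = np.array([[0.02, -0.05],
                  [0.04,  0.03]])
b = np.array([7.0, -3.0])

def shift(x):
    """s(x) = w(x) - x, where w(x) = (I + eps_A) x + b."""
    return eps_A @ x + b

x  = np.array([100.0, 50.0])
dx = np.array([3.0, -2.0])   # a nearby pixel inside the same small part

# The derivation says s(x + dx) - s(x) = eps_A @ dx exactly.
diff = shift(x + dx) - shift(x)
assert np.allclose(diff, eps_A @ dx)
print(diff)  # small residual: both pixels move almost identically
```

Shrinking either `eps_A` (many discretized articulated templates) or `dx` (small parts, as in this paper) makes `diff` negligible, which is exactly the two approximation strategies contrasted in the text.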
4 MODEL
Let us write $I$ for an image, $l_i = (x, y)$ for the pixel location of part $i$, and $t_i$ for the mixture component of part $i$. We write $i \in \{1, \ldots, K\}$, $l_i \in \{1, \ldots, L\}$, and $t_i \in \{1, \ldots, T\}$. We call $t_i$ the "type" of part $i$. Our motivating examples of types include orientations of a part (e.g., a vertical versus horizontally oriented hand), but types may span out-of-plane rotations (front-view head versus side-view head) or even semantic classes (an open versus closed hand). For notational convenience, we define the lack of subscript to indicate a set spanned by that subscript (e.g., $t = \{t_1, \ldots, t_K\}$). For simplicity, we define our model at a fixed scale; at test time, we detect people of different sizes by searching over an image pyramid.
Co-occurrence model: To score a configuration of parts, we first define a compatibility function for part types that factors into a sum of local and pairwise scores:

$$S(t) = \sum_{i \in V} b_i^{t_i} + \sum_{ij \in E} b_{ij}^{t_i, t_j}. \qquad (1)$$

The parameter $b_i^{t_i}$ favors particular type assignments for part $i$, while the pairwise parameter $b_{ij}^{t_i, t_j}$ favors particular co-occurrences of part types. For example, if part types correspond to orientations and parts $i$ and $j$ are on the same rigid limb, then $b_{ij}^{t_i, t_j}$ would favor consistent orientation assignments. Specifically, $b_{ij}^{t_i, t_j}$ should be a large positive number for consistent orientations $t_i$ and $t_j$, and a large negative number for inconsistent orientations $t_i$ and $t_j$.
Rigidity: We write $G = (V, E)$ for a (tree-structured) $K$-node relational graph whose edges specify which pairs of parts are constrained to have consistent relations. Such a graph can still encode relations between distant parts through transitivity. For example, our model can force a collection of parts to share the same orientation so long as the parts form a connected subtree of $G = (V, E)$. We use this property to model multiple parts on the torso. Since co-occurrence parameters are learned, our model learns which collections of parts should be rigid.
We can now write the full score associated with a configuration of part types and positions:

$$S(I, l, t) = S(t) + \sum_{i \in V} w_i^{t_i} \cdot \phi(I, l_i) + \sum_{ij \in E} w_{ij}^{t_i, t_j} \cdot \psi(l_i - l_j), \qquad (2)$$

where $\phi(I, l_i)$ is a feature vector (e.g., a HOG descriptor [34]) extracted from pixel location $l_i$ in image $I$. We write $\psi(l_i - l_j) = [\,dx \;\; dx^2 \;\; dy \;\; dy^2\,]^T$, where $dx = x_i - x_j$ and $dy = y_i - y_j$ give the relative location of part $i$ with respect to $j$. Notably, this relative location is defined with respect to the pixel grid and not the orientation of part $i$ (as in classic articulated pictorial structures [1]).
Fig. 2. We show that four small, translating parts can approximate
nonaffine (e.g., perspective) warps.

Appearance model: The first sum in (2) is an appearance model that computes the local score of placing a template $w_i^{t_i}$ for part $i$, tuned for type $t_i$, at location $l_i$.
Deformation model: The second term can be interpreted as a "switching" spring model that controls the relative placement of parts $i$ and $j$ by switching between a collection of springs. Each spring is tailored for a particular pair of types $(t_i, t_j)$, and is parameterized by its rest location and rigidity, which are encoded by $w_{ij}^{t_i, t_j}$. Our switching spring model encodes the dependence of local appearance on geometry, since different pairs of local mixtures are constrained to use different springs. Together with the co-occurrence term, it specifies an image-independent "prior" over part locations and types.
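To make (1) and (2) concrete, here is a toy scoring sketch. This is our own illustration, not the authors' released code: the two-part model, the random biases, and the tabulated appearance scores (`app1`, `app2`, standing in for the template responses $w_i^{t_i} \cdot \phi(I, l_i)$) are all made up for demonstration.

```python
import numpy as np

T = 2                                  # types per part
rng = np.random.default_rng(0)

# Unary type biases b_i^{t_i} and pairwise co-occurrence biases b_ij^{t_i,t_j} (eq. 1).
b1  = rng.normal(size=T)
b2  = rng.normal(size=T)
b12 = rng.normal(size=(T, T))

# Stand-ins for the appearance terms w_i^{t_i} . phi(I, l_i), one score per type.
app1 = {0: 1.2, 1: 0.4}
app2 = {0: 0.9, 1: 1.5}

# Type-dependent spring parameters w_ij^{t_i,t_j} over psi = [dx, dx^2, dy, dy^2].
w12 = rng.normal(size=(T, T, 4))

def psi(li, lj):
    dx, dy = li[0] - lj[0], li[1] - lj[1]
    return np.array([dx, dx * dx, dy, dy * dy])

def score(l1, t1, l2, t2):
    S_t   = b1[t1] + b2[t2] + b12[t1, t2]   # co-occurrence model S(t), eq. (1)
    S_app = app1[t1] + app2[t2]             # appearance terms
    S_def = w12[t1, t2] @ psi(l1, l2)       # switching spring for this type pair
    return S_t + S_app + S_def

print(score((10, 20), 0, (10, 28), 1))
```

Note how the spring term `S_def` changes with `(t1, t2)`: that is the "switching" behavior described above, coupling local appearance (types) to geometry (springs).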
4.1 Special Cases
We now describe various special cases of our model. The
first three correspond to special cases that have previously
occurred in the literature, while the last refers to a special
case we implement in our experiments.
Stretchable human models: Sapp et al. [41] describe a human part model that consists of a single part at each joint. This is equivalent to our model with $K = 14$ parts, each with a single mixture ($T = 1$). Similarly to us, Sapp et al. [41] argue that a joint-centric representation efficiently captures foreshortening and articulation effects. However, our local mixture models (for $T > 1$) also capture the dependence of global geometry on local appearance; elbows look different when positioned above the head or beside the torso. We compare to such a model in our diagnostic experiments.
Semantic part models: Epshtein and Ullman [39] argue that
part appearances should capture semantic classes and not
visual classes; this can be done with a type model. Consider
a face model with eye and mouth parts. One may want to
model different types of eyes (open and closed) and mouths
(smiling and frowning). The spatial relationship between
the two does not likely depend on their type, but open eyes
may tend to co-occur with smiling mouths. This can be
obtained as a special case of our model by using a single
spring for all types of a particular pair of parts:
$$w_{ij}^{t_i, t_j} = w_{ij}. \qquad (3)$$
Mixtures of deformable parts: Felzenszwalb et al. [4] define a mixture of models, where each model is a star-based pictorial structure. This can be achieved by restricting the co-occurrence model to allow for only globally consistent types:

$$b_{ij}^{t_i, t_j} = \begin{cases} 0 & \text{if } t_i = t_j \\ -\infty & \text{otherwise.} \end{cases} \qquad (4)$$
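The restriction in (4) can be written down directly as a $T \times T$ table. This small sketch (ours, under the stated assumption that types are indexed 0..T-1) shows why it collapses the exponentially many type combinations to just $T$ global mixtures: any edge with disagreeing types scores $-\infty$.

```python
import numpy as np

T = 3
# Eq. (4): zero bias when types agree along an edge, -inf otherwise,
# so only the T globally consistent type assignments survive.
b_ij = np.where(np.eye(T, dtype=bool), 0.0, -np.inf)
print(b_ij)
```

With this table plugged into (1), every tree of $K$ parts admits exactly $T$ finite-scoring type configurations, recovering the global mixtures of [4].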
Articulation: In our experiments, we explore a simplified version of (2) with a reduced set of springs:

$$w_{ij}^{t_i, t_j} = w_{ij}^{t_i}. \qquad (5)$$
The above simplification states that the relative location of a part with respect to its parent depends on the part's own type, but not its parent's type. For example, let $i$ be a hand part, $j$ its
parent elbow part, and assume part types capture
orientation. The above relational model states that a
sideways-oriented hand should tend to lie next to the
elbow, while a downward-oriented hand should lie below
the elbow, regardless of the orientation of the upper arm.
5 INFERENCE
Inference corresponds to maximizing $S(I, l, t)$ from (2) over $l$ and $t$. When the relational graph $G = (V, E)$ is a tree, this can be done efficiently with dynamic programming. To illustrate inference, let us rewrite (2) by defining $z_i = (l_i, t_i)$ to denote both the discrete pixel location and discrete mixture type of part $i$:

$$S(I, z) = \sum_{i \in V} \phi_i(I, z_i) + \sum_{ij \in E} \psi_{ij}(z_i, z_j),$$

where

$$\phi_i(I, z_i) = w_i^{t_i} \cdot \phi(I, l_i) + b_i^{t_i}$$
$$\psi_{ij}(z_i, z_j) = w_{ij}^{t_i, t_j} \cdot \psi(l_i - l_j) + b_{ij}^{t_i, t_j}.$$

From this perspective, it is clear that our final model is a discrete, pairwise Markov random field. When $G = (V, E)$ is tree structured, one can compute $\max_z S(I, z)$ with dynamic programming.
To be precise, we iterate over all parts starting from the leaves and moving "upstream" to the root part. We define $\mathrm{kids}(i)$ to be the set of children of part $i$, which is the empty set for leaf parts. We compute the message part $i$ passes to its parent $j$ as follows:

$$\mathrm{score}_i(z_i) = \phi_i(I, z_i) + \sum_{k \in \mathrm{kids}(i)} m_k(z_i) \qquad (6)$$
$$m_i(z_j) = \max_{z_i} \left[ \mathrm{score}_i(z_i) + \psi_{ij}(z_i, z_j) \right]. \qquad (7)$$

Equation (6) computes the local score of part $i$, at all pixel locations $l_i$ and for all possible types $t_i$, by collecting messages from the children of $i$. Equation (7) computes, for every location and possible type of part $j$, the best scoring location and type of its child part $i$. Once messages are passed to the root part ($i = 1$), $\mathrm{score}_1(z_1)$ represents the best scoring configuration for each root position and type. One can use these root scores to generate multiple detections in image $I$ by thresholding them and applying nonmaximum suppression (NMS). By keeping track of the argmax indices, one can backtrack to find the location and type of each part in each maximal configuration. To find multiple detections anchored at the same root, one can use N-best extensions of dynamic programming [42].
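A minimal dynamic-programming sketch of (6) and (7) follows. This is our own toy implementation, not the released code: it flattens each $z_i = (l_i, t_i)$ into one discrete state, uses random potentials on a 3-part chain (names like `unary`, `pair`, and `parent` are ours), and runs the brute-force inner maximization rather than the distance transforms discussed next.

```python
import itertools
import numpy as np

# Toy chain of K=3 parts; each z_i = (l_i, t_i) is flattened into one of S states.
K, S = 3, 6                              # S plays the role of L*T
rng = np.random.default_rng(1)
unary = rng.normal(size=(K, S))          # phi_i(I, z_i)
pair = rng.normal(size=(K, S, S))        # psi_{i,parent(i)}(z_i, z_j); pair[0] unused
parent = {1: 0, 2: 1}                    # part 0 is the root

def max_sum():
    score = unary.copy()
    # Leaves upstream to root: eq. (6) accumulates child messages at each parent,
    # eq. (7) maxes the child state out for every parent state.
    for i in sorted(parent, reverse=True):
        j = parent[i]
        msg = (score[i][:, None] + pair[i]).max(axis=0)   # m_i(z_j), eq. (7)
        score[j] += msg                                    # eq. (6) at the parent
    return score[0].max()

# Verify against exhaustive search over all S^K configurations.
brute = max(unary[0, z0] + unary[1, z1] + unary[2, z2]
            + pair[1, z1, z0] + pair[2, z2, z1]
            for z0, z1, z2 in itertools.product(range(S), repeat=3))
assert np.isclose(max_sum(), brute)
print(max_sum())
```

Keeping the argmax indices inside `max(axis=0)` (via `argmax`) would allow the backtracking step described above to recover the maximizing location and type of every part.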
Computation: The computationally taxing portion of dynamic programming is (7). We rewrite this step in detail:

$$m_i(t_j, l_j) = \max_{t_i} \left[ b_{ij}^{t_i, t_j} + \max_{l_i} \left( \mathrm{score}_i(t_i, l_i) + w_{ij}^{t_i, t_j} \cdot \psi(l_i - l_j) \right) \right]. \qquad (8)$$

One has to loop over $L \cdot T$ possible parent locations and types, and compute a max over $L \cdot T$ possible child locations and types, making the computation $O(L^2 T^2)$ for each part. When $\psi(l_i - l_j)$ is a quadratic function (as is the case for us), the inner maximization in (8) can be efficiently computed for each combination of $t_i$ and $t_j$ in $O(L)$ with a max-convolution or distance transform [1]. Since one has to
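The $O(L)$ inner maximization rests on the generalized distance transform of [1]. Below is our own 1D sketch of the min-version (the max case follows by negating scores, and the linear spring terms can be folded in by completing the square); it computes the lower envelope of parabolas in a single pass:

```python
import math

def dt1d(f):
    """d[p] = min_q f[q] + (p - q)^2, via the lower-envelope algorithm of
    Felzenszwalb and Huttenlocher. Runs in O(n) for n = len(f)."""
    n = len(f)
    v = [0] * n                           # locations of parabolas in the envelope
    z = [-math.inf] + [math.inf] * n      # range boundaries of each parabola
    k = 0
    for q in range(1, n):
        # Intersection of the parabola rooted at q with the rightmost one kept.
        s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        while s <= z[k]:                  # new parabola dominates: pop and retry
            k -= 1
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = math.inf
    d, k = [0.0] * n, 0
    for p in range(n):                    # read off the envelope left to right
        while z[k + 1] < p:
            k += 1
        d[p] = (p - v[k]) ** 2 + f[v[k]]
    return d

f = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]
assert dt1d(f) == [min(f[q] + (p - q) ** 2 for q in range(len(f)))
                   for p in range(len(f))]
```

Applied once per row and once per column, this gives the 2D transform over all $L$ pixel locations, reducing the per-type-pair cost of (8) from $O(L^2)$ to $O(L)$.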

References (selected)
N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes (VOC) Challenge," Int'l J. Computer Vision, 2010.
P.F. Felzenszwalb, R.B. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part-Based Models," IEEE Trans. Pattern Analysis and Machine Intelligence, 2010.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A Library for Large Linear Classification," J. Machine Learning Research, 2008.
P.F. Felzenszwalb and D.P. Huttenlocher, "Pictorial Structures for Object Recognition," Int'l J. Computer Vision, 2005.