
HAL Id: inria-00548593
https://hal.inria.fr/inria-00548593
Submitted on 20 Dec 2010
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
A local basis representation for estimating human pose
from cluttered images
Ankur Agarwal, Bill Triggs
To cite this version:
Ankur Agarwal, Bill Triggs. A local basis representation for estimating human pose from cluttered images. Asian Conference on Computer Vision (ACCV '06), Jan 2006, Hyderabad, India. pp. 50–59. doi:10.1007/11612032_6. inria-00548593

To appear in Proceedings of the 7th Asian Conference on Computer Vision, 2006.
A Local Basis Representation for Estimating Human
Pose from Cluttered Images
Ankur Agarwal and Bill Triggs
GRAVIR-INRIA-CNRS, 655 avenue de l’Europe, Montbonnot 38330, France
{Ankur.Agarwal, Bill.Triggs}@inrialpes.fr
http://lear.inrialpes.fr
Abstract. Recovering the pose of a person from single images is a challenging problem. This paper discusses a bottom-up approach that uses local image features to estimate human upper body pose from single images in cluttered backgrounds. The method covers the image window with a dense grid of local gradient orientation histograms, followed by non-negative matrix factorization to learn a set of bases that correspond to local features on the human body, enabling selective encoding of human-like features in the presence of background clutter. Pose is then recovered by direct regression. This approach allows us to key on gradient patterns, such as shoulder contours and bent elbows, that are characteristic of humans and carry important pose information, unlike current regression-based methods that either use weak limb detectors or require prior segmentation to work. The system is trained on a database of images with labelled poses. We show that it estimates pose with performance comparable to current example-based methods, but, unlike them, it works in the presence of natural backgrounds, without any prior segmentation.
1. Introduction
The ability to identify objects or their parts in the presence of cluttered backgrounds
is critical to the success of many computer vision algorithms, but finding descriptors
that can distinguish objects of interest from the background is often very difficult. We
address this problem in the context of understanding human body pose from general
images. Images of people are seen everywhere. A system that was capable of reliably
estimating the configuration of a person’s limbs from images would have applications
spanning from human-computer interaction to activity recognition to annotating video content. In this paper, we focus on recognizing upper body gestures. Human arm gestures often convey a great deal of information, e.g. during communication, and automated inference and interpretation of these gestures could allow for a deeper understanding of a person's behaviour.
Current methods for human pose inference usually rely on background subtraction
to isolate the subject. This limits their applicability to fixed environments. Model-based
approaches use a manual/heuristic initialization of pose as a starting point to optimize
over image likelihoods, or to track through subsequent frames in a video sequence. The
application of such methods to 3D pose recovery requires camera parameter estimates

and realistic human body models. We prefer to take a bottom-up approach to the prob-
lem, considering pose inference from general images in terms of two interdependent
sub-problems: (i) identifying/localizing the human parts of interest in the image, and
(ii) estimating 3D pose from them. We combine methods that are currently used mainly
for object and pedestrian detection with recent advances in example-based pose esti-
mation from human silhouettes or segmented images, implicitly using the knowledge
contained in human body configurations to learn to localize body parts in the presence
of cluttered backgrounds and to infer 3D pose.
Our approach to modeling human body parts is based on using SIFT-like histograms
[5] computed on a uniform grid of overlapping patches on an image to encode the image
content as an array of 128-d feature vectors. This scheme encodes local image content in
terms of gradient patterns invariant to illumination changes, while still retaining spatial
position information. It allows us to key on gradient patterns such as head/shoulder con-
tours or bent elbows that are characteristic of humans and that contain important pose
information, in contrast to limb based representations that either key on skin colour and
face detection (e.g. [11]), or learn individual limb detectors and then apply kinematic
tree based constraints [16,20].
As the human body is highly articulated, it is a complicated object to detect, par-
ticularly at the resolution of individual body parts. Although explicit kinematic tree
based structures can be an effective tool in this regard, we avoid such assumptions,
instead learning characteristic spatial configurations directly from images. Our patch
based representation allows us to work on the scale of small body parts, and besides
providing spatial information for each of these parts, enables us to mix and match part
combinations for modeling generic appearance.
Previous work: There are currently only a few bottom-up approaches to the estimation of human pose from images and video. Many of these methods use combinations of weak limb detectors to detect the presence of a person [16,9], but are not capable of
deducing 3D poses accurately enough to infer actions and gestures. Similarly, in [15],
loose 2D configurations of body parts are used to coarsely track people in video by
filtering potential limb-like objects based on motion and color statistics.
Most methods for precise pose estimation adopt top-down approaches in the sense
that they try to minimize projection errors of kinematic models, either using numerical
optimization [21] or by generating large numbers of pose hypotheses [11]. With suitable
initialization or sufficiently fine sampling such methods can produce accurate results,
but the computational cost is high. Efficient matching methods such as [6] fall back to
the assumption of having pre-segmented images. [20] discusses an interesting approach
that combines weak responses from bottom-up limb detectors based on a statistical
model of image likelihoods with a full articulated body model using belief propaga-
tion. However, this approach uses background subtraction and it also relies on multiple
calibrated cameras.
A recent work that addresses upper body pose from single images in clutter is [11].
This is based on the use of heuristic image cues including a clothes model and skin
color detection; and relies on generating and testing large numbers of pose hypotheses
using a 3D body model. Here we adopt an example based approach inspired by [19]
and [1]. Both of these approaches infer pose from edge feature representations of the

Fig. 1. An overview of our method of pose estimation from cluttered images. (a) original image; (b) a grid of fixed points where the descriptors are computed (each descriptor block covers an array of 4×4 cells, giving a 50% overlap with its neighbouring blocks); (c) SIFT descriptors computed at these points, the intensity of each line representing the weight of the corresponding orientation bin in that cell; (d) suppression of background using a sparse set of learned (NMF) bases encoding human-like parts; (e) final pose obtained by regression.
input image using a model learned from a number of labeled training examples (image-
pose pairs). However, both require clean backgrounds for their representations. Here
we develop a more general approach that works with cluttered backgrounds. Our im-
age representation is based on local appearance descriptors extracted from a uniformly
spaced grid of image patches. This notion, in the form of superpixels, or image sites,
has previously been used in several different contexts, e.g. [4,13,17]. We also take in-
spiration from the image coding and object localization methods described in [22,14].
2. Regression based approach
Example based methods often have problems when working in high dimensional spaces
as it is difficult to create or incorporate enough examples to densely cover the space.
This is particularly true for human pose estimation which must recover many articular
degrees of freedom from a complex image signal. The sparsity of examples is usually
tackled by smoothly interpolating between nearby examples. Learning a single smooth
inference model in the form of a regressor was suggested in [1]. This has the advantage
of directly recovering pose parameters from image observations, which obviates the
need to attach explicit meanings or attributions to image features (e.g. labels designating
the body parts seen). However it requires a robust and discriminative representation of
the input image. Following [1], we take a regression based approach, extending it to deal
with the presence of cluttered image background. Encoding pose by the 3D locations
of 8 key upper body joint centres, we regress a 24-d output pose vector y on a set of
image features x:
y = A φ(x) + ε     (1)
where φ(x) is a vector of basis functions, A is a matrix of weight vectors, and ε is a residual error vector. The matrix A is estimated by minimizing the least squares error while applying a regularization term to control overfitting.
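This regularized least-squares estimate has a closed form. A minimal sketch, assuming a precomputed design matrix Phi (one feature vector φ(x) per row), a matrix Y of training pose vectors, and a hypothetical ridge weight lam:

```python
import numpy as np

def fit_ridge_regressor(Phi, Y, lam=1e-2):
    """Estimate the weight matrix A in y = A phi(x) + eps by regularized
    least squares: A^T = (Phi^T Phi + lam I)^{-1} Phi^T Y."""
    d = Phi.shape[1]
    # Solve the regularized normal equations for A^T.
    A_T = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)
    return A_T.T  # shape (24, d): maps features to the 24-d pose vector

def predict_pose(A, phi_x):
    return A @ phi_x

# Toy usage: 100 training examples, 50-d features, 24-d pose vectors.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 50))
A_true = rng.standard_normal((24, 50))
Y = Phi @ A_true.T + 0.01 * rng.standard_normal((100, 24))
A = fit_ridge_regressor(Phi, Y, lam=1e-6)
```

The regularizer lam trades bias against variance: larger values shrink A and control overfitting when training examples are sparse relative to the feature dimension.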
The method turns out to be relatively insensitive to the choice of regression methods.
Here we work with a classical single-valued regressor as frontal upper body gestures
have relatively few multimodality problems in comparison to the full body case, but the
multimodal multi-valued regression method of [2] could also be used if necessary. Our

main focus is on exploring suitable image representations and mechanisms for dealing
with background clutter.
3. Image Features
Image information can be encoded in many different ways. Given the variability of
clothing and the fact that we want to be able to use black and white images, we do
not use colour information. Silhouette shape and body contours have proven effective
in cases where segmentations are available, but with current segmentation algorithms
they do not extend reliably to images with cluttered backgrounds [12]. Furthermore,
more local, part-based representations are likely to be able to adapt better to the highly
non-rigid structure of the human body. To allow the method to key on important body
contours, we based our representation on local image gradients. For effective encoding,
we use histograms of gradient orientations in small spatial cells. The relative coarse-
ness of the spatial coding provides some robustness to small position variations, while
still capturing the essential spatial position and limb orientation information. Note that
owing to loose clothing, the positions of limb contours do not in any case have a very
precise relation to the pose, whereas orientation of body edges is a much more reliable
cue. Hence a SIFT-like representation is appropriate. We compute these histograms in
the same way as SIFT descriptors [5], quantizing gradient orientations into discrete
values in small spatial cells and normalizing these distributions over local blocks of
cells to achieve insensitivity to illumination changes. To retain the information about
image location that is indispensable for pose estimation, the descriptors are computed
at fixed grid locations in the image window. Figure 1(c) shows the features extracted
from a sample image. We denote the descriptor vectors at each of these L locations as v_l, l ∈ {1, …, L}, and represent the complete image as a large vector x, a concatenation of the individual descriptors: x ≡ (v_1^⊤, v_2^⊤, …, v_L^⊤)^⊤.
An alternative approach that failed to provide convincing results in our experiments
is a bag of features style of representation. In the absence of reliable salient points on
the human body, we computed SIFT descriptors at all edge points in the image and
added spatial information by appending image coordinates to the descriptor vector. For
effective pose estimation, though, it seems that coding location precisely is extremely
important and extracting descriptors on a fixed grid of locations is preferable.
3.1. Similarity based encoding
Representations based on collections of local parts are commonly used in object recog-
nition [18,3,7]. A common scheme is to identify a representative set of parts as a vo-
cabulary for representing new images. In an analogous manner, the human body can
be represented as a collection of limbs and other key body parts in particular configura-
tions. To test this, we independently clustered patches at each image location to identify
representative configurations of the body parts that are seen in these locations. Each im-
age patch was then represented by softly vector quantizing the SIFT descriptor by vot-
ing into each of its corresponding k-means centers, i.e. as a sparse vector of similarity
weights computed from each cluster center. Results from this and other representations
are summarized in figure 4 and discussed in the experimental section.
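A minimal sketch of this soft vector quantization step for one grid location, assuming that location's k-means centers have already been learned; the Gaussian similarity weighting and the keep-top-3 sparsification are illustrative choices, not necessarily the paper's exact scheme:

```python
import numpy as np

def soft_vq_code(desc, centers, sigma=1.0, keep=3):
    """Encode one descriptor as a sparse vector of similarity weights to
    the k-means centers learned for its grid location: compute a Gaussian
    similarity to each center, keep only the `keep` strongest responses,
    and normalize the surviving weights to sum to one."""
    d2 = ((centers - desc) ** 2).sum(axis=1)   # squared distance to each center
    w = np.exp(-d2 / (2.0 * sigma ** 2))       # similarity weights
    idx = np.argsort(w)[:-keep]                # indices of all but the top `keep`
    w[idx] = 0.0                               # sparsify
    s = w.sum()
    return w / s if s > 0 else w

# Toy usage: four 2-d "centers"; the descriptor lies near the first three.
centers = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.]])
code = soft_vq_code(np.array([0.1, 0.1]), centers)
```

Voting softly into several centers, rather than hard-assigning to the nearest one, makes the encoding degrade gracefully when a body part falls between representative configurations.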

References (a subset of the paper's bibliography, as recovered from the page):
D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005.
D. D. Lee and H. S. Seung. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature, 1999.
P. O. Hoyer. Non-negative Matrix Factorization with Sparseness Constraints. JMLR, 2004.
Future work: as regards immediate extensions, the method will be trained on a larger database of common gestures and extended to incorporate motion information for tracking full body motion in cluttered backgrounds.