
HAL Id: inria-00548593
https://hal.inria.fr/inria-00548593
Submitted on 20 Dec 2010
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
A local basis representation for estimating human pose
from cluttered images
Ankur Agarwal, Bill Triggs
To cite this version:
Ankur Agarwal, Bill Triggs. A local basis representation for estimating human pose from cluttered images. Asian Conference on Computer Vision (ACCV '06), Jan 2006, Hyderabad, India. pp. 50–59. doi:10.1007/11612032_6. inria-00548593

To appear in Proceedings of the 7th Asian Conference on Computer Vision, 2006.
A Local Basis Representation for Estimating Human
Pose from Cluttered Images
Ankur Agarwal and Bill Triggs
GRAVIR-INRIA-CNRS, 655 avenue de l’Europe, Montbonnot 38330, France
{Ankur.Agarwal, Bill.Triggs}@inrialpes.fr
http://lear.inrialpes.fr
Abstract. Recovering the pose of a person from single images is a challenging problem. This paper discusses a bottom-up approach that uses local image features to estimate human upper body pose from single images in cluttered backgrounds. The method covers the image window with a dense grid of local gradient orientation histograms, followed by non-negative matrix factorization to learn a set of bases that correspond to local features on the human body, enabling selective encoding of human-like features in the presence of background clutter. Pose is then recovered by direct regression. This approach allows us to key on gradient patterns, such as shoulder contours and bent elbows, that are characteristic of humans and carry important pose information, unlike current regression-based methods that either use weak limb detectors or require prior segmentation to work. The system is trained on a database of images with labelled poses. We show that it estimates pose with performance comparable to current example-based methods, but, unlike them, it works in the presence of natural backgrounds, without any prior segmentation.
1. Introduction
The ability to identify objects or their parts in the presence of cluttered backgrounds
is critical to the success of many computer vision algorithms, but finding descriptors
that can distinguish objects of interest from the background is often very difficult. We
address this problem in the context of understanding human body pose from general
images. Images of people are seen everywhere. A system that was capable of reliably
estimating the configuration of a person’s limbs from images would have applications
spanning from human-computer interaction to activity recognition to annotating video content. In this paper, we focus on recognizing upper body gestures. Human arm gestures often convey a great deal of information, e.g. during communication, and automated inference and interpretation of these gestures could allow for a deeper understanding of a person's behaviour.
Current methods for human pose inference usually rely on background subtraction
to isolate the subject. This limits their applicability to fixed environments. Model-based
approaches use a manual/heuristic initialization of pose as a starting point to optimize
over image likelihoods, or to track through subsequent frames in a video sequence. The
application of such methods to 3D pose recovery requires camera parameter estimates

and realistic human body models. We prefer to take a bottom-up approach to the prob-
lem, considering pose inference from general images in terms of two interdependent
sub-problems: (i) identifying/localizing the human parts of interest in the image, and
(ii) estimating 3D pose from them. We combine methods that are currently used mainly
for object and pedestrian detection with recent advances in example-based pose esti-
mation from human silhouettes or segmented images, implicitly using the knowledge
contained in human body configurations to learn to localize body parts in the presence
of cluttered backgrounds and to infer 3D pose.
Our approach to modeling human body parts is based on using SIFT-like histograms
[5] computed on a uniform grid of overlapping patches on an image to encode the image
content as an array of 128-d feature vectors. This scheme encodes local image content in
terms of gradient patterns invariant to illumination changes, while still retaining spatial
position information. It allows us to key on gradient patterns such as head/shoulder con-
tours or bent elbows that are characteristic of humans and that contain important pose
information, in contrast to limb based representations that either key on skin colour and
face detection (e.g. [11]), or learn individual limb detectors and then apply kinematic
tree based constraints [16,20].
As the human body is highly articulated, it is a complicated object to detect, par-
ticularly at the resolution of individual body parts. Although explicit kinematic tree
based structures can be an effective tool in this regard, we avoid such assumptions,
instead learning characteristic spatial configurations directly from images. Our patch
based representation allows us to work on the scale of small body parts, and besides
providing spatial information for each of these parts, enables us to mix and match part
combinations for modeling generic appearance.
Previous work: There are currently only a few bottom-up approaches to the estimation of human pose from images and video. Many of these methods use combinations of weak limb detectors to detect the presence of a person [16,9], but are not capable of
deducing 3D poses accurately enough to infer actions and gestures. Similarly, in [15],
loose 2D configurations of body parts are used to coarsely track people in video by
filtering potential limb-like objects based on motion and color statistics.
Most methods for precise pose estimation adopt top-down approaches in the sense
that they try to minimize projection errors of kinematic models, either using numerical
optimization [21] or by generating large numbers of pose hypotheses [11]. With suitable
initialization or sufficiently fine sampling such methods can produce accurate results,
but the computational cost is high. Efficient matching methods such as [6] fall back to
the assumption of having pre-segmented images. [20] discusses an interesting approach
that combines weak responses from bottom-up limb detectors based on a statistical
model of image likelihoods with a full articulated body model using belief propaga-
tion. However, this approach uses background subtraction and it also relies on multiple
calibrated cameras.
A recent work that addresses upper body pose from single images in clutter is [11].
This is based on the use of heuristic image cues including a clothes model and skin
color detection; and relies on generating and testing large numbers of pose hypotheses
using a 3D body model. Here we adopt an example based approach inspired by [19]
and [1]. Both of these approaches infer pose from edge feature representations of the

Fig. 1. An overview of our method of pose estimation from cluttered images. (a) original image; (b) a grid of fixed points where the descriptors are computed (each descriptor block covers an array of 4×4 cells, giving a 50% overlap with its neighbouring blocks); (c) SIFT descriptors computed at these points, the intensity of each line representing the weight of the corresponding orientation bin in that cell; (d) suppression of background using a sparse set of learned (NMF) bases encoding human-like parts; (e) final pose obtained by regression.
input image using a model learned from a number of labeled training examples (image-
pose pairs). However, both require clean backgrounds for their representations. Here
we develop a more general approach that works with cluttered backgrounds. Our im-
age representation is based on local appearance descriptors extracted from a uniformly
spaced grid of image patches. This notion, in the form of superpixels, or image sites,
has previously been used in several different contexts, e.g. [4,13,17]. We also take in-
spiration from the image coding and object localization methods described in [22,14].
2. Regression based approach
Example based methods often have problems when working in high dimensional spaces
as it is difficult to create or incorporate enough examples to densely cover the space.
This is particularly true for human pose estimation which must recover many articular
degrees of freedom from a complex image signal. The sparsity of examples is usually
tackled by smoothly interpolating between nearby examples. Learning a single smooth
inference model in the form of a regressor was suggested in [1]. This has the advantage
of directly recovering pose parameters from image observations, which obviates the
need to attach explicit meanings or attributions to image features (e.g. labels designating
the body parts seen). However it requires a robust and discriminative representation of
the input image. Following [1], we take a regression based approach, extending it to deal
with the presence of cluttered image background. Encoding pose by the 3D locations
of 8 key upper body joint centres, we regress a 24-d output pose vector y on a set of
image features x:
y = A φ(x) + ε     (1)
where φ(x) is a vector of basis functions, A is a matrix of weight vectors, and ε is a residual error vector. The matrix A is estimated by minimizing the least squares error while applying a regularization term to control overfitting.
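This regularized least-squares estimate has a closed form. A minimal sketch, assuming a precomputed design matrix Phi (one feature vector φ(x) per row), a matrix Y of training pose vectors, and a hypothetical ridge weight lam:

```python
import numpy as np

def fit_ridge_regressor(Phi, Y, lam=1e-2):
    """Estimate the weight matrix A in y = A phi(x) + eps by regularized
    least squares: A^T = (Phi^T Phi + lam I)^{-1} Phi^T Y."""
    d = Phi.shape[1]
    # Solve the regularized normal equations for A^T.
    A_T = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)
    return A_T.T  # shape (24, d): maps features to the 24-d pose vector

def predict_pose(A, phi_x):
    return A @ phi_x

# Toy usage: 100 training examples, 50-d features, 24-d pose vectors.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 50))
A_true = rng.standard_normal((24, 50))
Y = Phi @ A_true.T + 0.01 * rng.standard_normal((100, 24))
A = fit_ridge_regressor(Phi, Y, lam=1e-6)
```

The regularizer lam trades bias against variance: larger values shrink A and control overfitting when training examples are sparse relative to the feature dimension.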
The method turns out to be relatively insensitive to the choice of regression methods.
Here we work with a classical single-valued regressor as frontal upper body gestures
have relatively few multimodality problems in comparison to the full body case, but the
multimodal multi-valued regression method of [2] could also be used if necessary. Our

main focus is on exploring suitable image representations and mechanisms for dealing
with background clutter.
3. Image Features
Image information can be encoded in many different ways. Given the variability of
clothing and the fact that we want to be able to use black and white images, we do
not use colour information. Silhouette shape and body contours have proven effective
in cases where segmentations are available, but with current segmentation algorithms
they do not extend reliably to images with cluttered backgrounds [12]. Furthermore,
more local, part-based representations are likely to be able to adapt better to the highly
non-rigid structure of the human body. To allow the method to key on important body
contours, we based our representation on local image gradients. For effective encoding,
we use histograms of gradient orientations in small spatial cells. The relative coarse-
ness of the spatial coding provides some robustness to small position variations, while
still capturing the essential spatial position and limb orientation information. Note that
owing to loose clothing, the positions of limb contours do not in any case have a very
precise relation to the pose, whereas orientation of body edges is a much more reliable
cue. Hence a SIFT-like representation is appropriate. We compute these histograms in
the same way as SIFT descriptors [5], quantizing gradient orientations into discrete
values in small spatial cells and normalizing these distributions over local blocks of
cells to achieve insensitivity to illumination changes. To retain the information about
image location that is indispensable for pose estimation, the descriptors are computed
at fixed grid locations in the image window. Figure 1(c) shows the features extracted
from a sample image. We denote the descriptor vectors at each of these L locations as v_l, l ∈ {1, …, L}, and represent the complete image as a large vector x, a concatenation of the individual descriptors: x ≡ (v_1^⊤, v_2^⊤, …, v_L^⊤)^⊤.
An alternative approach that failed to provide convincing results in our experiments
is a bag of features style of representation. In the absence of reliable salient points on
the human body, we computed SIFT descriptors at all edge points in the image and
added spatial information by appending image coordinates to the descriptor vector. For
effective pose estimation, though, it seems that coding location precisely is extremely
important and extracting descriptors on a fixed grid of locations is preferable.
3.1. Similarity based encoding
Representations based on collections of local parts are commonly used in object recog-
nition [18,3,7]. A common scheme is to identify a representative set of parts as a vo-
cabulary for representing new images. In an analogous manner, the human body can
be represented as a collection of limbs and other key body parts in particular configura-
tions. To test this, we independently clustered patches at each image location to identify
representative configurations of the body parts that are seen in these locations. Each im-
age patch was then represented by softly vector quantizing the SIFT descriptor by vot-
ing into each of its corresponding k-means centers, i.e. as a sparse vector of similarity
weights computed from each cluster center. Results from this and other representations
are summarized in figure 4 and discussed in the experimental section.
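A minimal sketch of this soft vector quantization step for one grid location, assuming that location's k-means centers have already been learned; the Gaussian similarity weighting and the keep-top-3 sparsification are illustrative choices, not necessarily the paper's exact scheme:

```python
import numpy as np

def soft_vq_code(desc, centers, sigma=1.0, keep=3):
    """Encode one descriptor as a sparse vector of similarity weights to
    the k-means centers learned for its grid location: compute a Gaussian
    similarity to each center, keep only the `keep` strongest responses,
    and normalize the surviving weights to sum to one."""
    d2 = ((centers - desc) ** 2).sum(axis=1)   # squared distance to each center
    w = np.exp(-d2 / (2.0 * sigma ** 2))       # similarity weights
    idx = np.argsort(w)[:-keep]                # indices of all but the top `keep`
    w[idx] = 0.0                               # sparsify
    s = w.sum()
    return w / s if s > 0 else w

# Toy usage: four 2-d "centers"; the descriptor lies near the first three.
centers = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.]])
code = soft_vq_code(np.array([0.1, 0.1]), centers)
```

Voting softly into several centers, rather than hard-assigning to the nearest one, makes the encoding degrade gracefully when a body part falls between representative configurations.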

References (a subset of the paper's bibliography, as recovered from the page):
D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005.
D. D. Lee and H. S. Seung. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature, 1999.
P. O. Hoyer. Non-negative Matrix Factorization with Sparseness Constraints. JMLR, 2004.
Future work: as regards immediate extensions, the method will be trained on a larger database of common gestures and extended to incorporate motion information for tracking full body motion in cluttered backgrounds.