Proceedings ArticleDOI

Real-time human pose recognition in parts from single depth images

20 Jun 2011-pp 1297-1304
TL;DR: This work takes an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem, and generates confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.
Abstract: We propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. We take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Our large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc. Finally we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The system runs at 200 frames per second on consumer hardware. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state of the art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

Summary (4 min read)

1. Introduction

  • Robust interactive human body tracking has applications including gaming, human-computer interaction, security, telepresence, and even health-care.
  • In particular, until the launch of Kinect [21], none ran at interactive rates on consumer hardware while handling a full range of human body shapes and sizes undergoing general body motions.
  • Reprojecting the inferred parts into world space, the authors localize spatial modes of each part distribution and thus generate (possibly several) confidence-weighted proposals for the 3D locations of each skeletal joint.
  • The authors' experiments also carry several insights: (i) synthetic depth training data is an excellent proxy for real data; (ii) scaling up the learning problem with varied synthetic data is important for high accuracy; and (iii) their parts-based approach generalizes better than even an oracular exact nearest neighbor.
  • Felzenszwalb & Huttenlocher [11] apply pictorial structures to estimate pose efficiently.

2. Data

  • Pose estimation research has often focused on techniques to overcome lack of training data [25], because of two problems.
  • First, generating realistic intensity images using computer graphics techniques [33, 27, 26] is hampered by the huge color and texture variability induced by clothing, hair, and skin, often meaning that the data are reduced to 2D silhouettes [1].
  • The second limitation is that synthetic body pose images are of necessity fed by motion-capture data.
  • Although techniques exist to simulate human motion (e.g. [38]) they do not yet produce the range of volitional motions of a human subject.
  • The authors believe this dataset to considerably advance the state of the art in both scale and variety, and demonstrate the importance of such a large dataset in their evaluation.

2.1. Depth imaging

  • Depth imaging technology has advanced dramatically over the last few years, finally reaching a consumer price point with the launch of Kinect [21].
  • Pixels in a depth image indicate calibrated depth in the scene, rather than a measure of intensity or color.
  • The authors employ the Kinect camera which gives a 640x480 image at 30 frames per second with depth resolution of a few centimeters.
  • Depth cameras offer several advantages over traditional intensity sensors, working in low light levels, giving a calibrated scale estimate, being color and texture invariant, and resolving silhouette ambiguities in pose.
  • But most importantly for their approach, it is straightforward to synthesize realistic depth images of people and thus build a large training dataset cheaply.

2.2. Motion capture data

  • The human body is capable of an enormous range of poses which are difficult to simulate.
  • Instead, the authors capture a large database of motion capture of human actions.
  • The authors' aim was to span the wide variety of poses people would make in an entertainment scenario.
  • Often, changes in pose from one mocap frame to the next are so small as to be insignificant.
  • The authors thus discard many similar, redundant poses from the initial mocap data using ‘furthest neighbor’ clustering [15], where the distance between poses $p_1$ and $p_2$ is defined as $\max_j \lVert p_1^j - p_2^j \rVert_2$, the maximum Euclidean distance over body joints j.

2.3. Generating synthetic data

  • The authors build a randomized rendering pipeline from which they can sample fully labeled training images.
  • The authors' goals in building this pipeline were twofold: realism and variety.
  • For the learned model to work well, the samples must closely resemble real camera images, and contain good coverage of the appearance variations the authors hope to recognize at test time.
  • While depth/scale and translation variations are handled explicitly in their features (see below), other invariances cannot be encoded efficiently.
  • Further slight random variation in height and weight give extra coverage of body shapes.

3.1. Body part labeling

  • A key contribution of this work is their intermediate body part representation.
  • Some of these parts are defined to directly localize particular skeletal joints of interest, while others fill the gaps or could be used in combination to predict other joints.
  • The authors' intermediate representation transforms the problem into one that can readily be solved by efficient classification algorithms; the authors show in Sec. 4.3 that the penalty paid for this transformation is small.
  • The pairs of depth and body part images are used as fully labeled data for learning the classifier (see below).
  • In an upper body tracking scenario, all the lower body parts could be merged.

3.2. Depth image features

  • The authors employ simple depth comparison features, inspired by those in [20].
  • The features are thus 3D translation invariant (modulo perspective effects).
  • Feature fθ1 looks upwards: Eq. 1 gives a large positive response for pixels x near the top of the body, but a value close to zero for pixels x lower down the body.
  • The design of these features was strongly motivated by their computational efficiency: no preprocessing is needed; each feature need only read at most 3 image pixels and perform at most 5 arithmetic operations; and the features can be straightforwardly implemented on the GPU.

3.3. Randomized decision forests

  • At the leaf node reached in tree t, a learned distribution Pt(c|I,x) over body part labels c is stored.
  • A random subset of 2000 example pixels from each image is chosen to ensure a roughly even distribution across body parts.
  • Each tree is trained using the following algorithm [20]: 1. Randomly propose a set of splitting candidates φ = (θ, τ) (feature parameters θ and thresholds τ ).
  • To keep the training times down the authors employ a distributed implementation.

3.4. Joint position proposals

  • Body part recognition as described above infers per-pixel information.
  • These proposals are the final output of their algorithm, and could be used by a tracking algorithm to selfinitialize and recover from failure.
  • Depending on the definition of body parts, the posterior P (c|I,x) can be pre-accumulated over a small set of parts.
  • Mean shift is used to find modes in this density efficiently.
  • A final confidence estimate is given as a sum of the pixel weights reaching each mode.

4. Experiments

  • In this section the authors describe the experiments performed to evaluate their method.
  • For their synthetic test set, the authors synthesize 5000 depth images, together with the ground truth body part labels and joint positions.
  • The authors quantify both classification and joint prediction accuracy.
  • Any joint proposals outside D meters also count as false positives.
  • The authors set D = 0.1m below, approximately the accuracy of the hand-labeled real test data ground truth.
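The exact joint prediction metric is not fully spelled out in this summary; the sketch below is an illustrative reading of it, assuming confidence-ranked proposals where the first proposal within D = 0.1m of the ground truth counts as a true positive and all other proposals count as false positives. The function and variable names are hypothetical, not taken from the paper.

```python
import numpy as np

def joint_average_precision(proposals, confidences, gt_position, D=0.1):
    """Illustrative average precision for one joint in one test image.

    proposals:   (N, 3) array of 3D joint proposals in meters
    confidences: (N,) confidence scores
    gt_position: (3,) ground-truth joint position
    D:           true-positive distance threshold (0.1m in the paper)
    """
    order = np.argsort(-np.asarray(confidences))      # most confident first
    labels = []
    matched = False
    for i in order:
        within = np.linalg.norm(np.asarray(proposals[i]) - gt_position) <= D
        if within and not matched:
            labels.append(1)       # first proposal within D: true positive
            matched = True
        else:
            labels.append(0)       # duplicates and far proposals: false positives
    labels = np.array(labels)
    if labels.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(labels) / (np.arange(len(labels)) + 1)
    return float((precision_at_k * labels).sum() / labels.sum())
```

Averaging such a quantity over joints and test images would give a mean average precision of the kind reported in Sec. 4.3.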

4.1. Qualitative results

  • Fig. 5 shows example inferences of their algorithm.
  • Note high accuracy of both classification and joint prediction across large variations in body and camera pose, depth in scene, cropping, and body size and shape (e.g. small child vs. heavy adult).
  • The bottom row shows some failure modes of the body part classification.
  • The first example shows a failure to distinguish subtle changes in the depth image such as the crossed arms.
  • Often (as with the second and third failure examples) the most likely body part is incorrect, but there is still sufficient correct probability mass in distribution P (c|I,x) that an accurate proposal can still be generated.

4.2. Classification accuracy

  • The authors investigate the effect of several training parameters on classification accuracy.
  • The authors also show in Fig. 6(a) the quality of their approach on synthetic silhouette images, where the features in Eq. 1 are either given scale (as the mean depth) or not (a fixed constant depth).
  • Using only 15k images the authors observe overfitting beginning around depth 17, but the enlarged 900k training set avoids this.
  • The authors compare the actual performance of their system (red) with the best achievable result (blue) given the ground truth body part labels.
  • Accuracy increases with the maximum probe offset, though levels off around 129 pixel meters.

4.3. Joint prediction accuracy

  • In Fig. 7 the authors show average precision results on the synthetic test set, achieving 0.731 mAP.
  • The authors compare an idealized setup that is given the ground truth body part labels to the real setup using inferred body parts.
  • Nearest neighbor chamfer matching is also drastically slower (2 fps) than their algorithm.
  • The authors of [13] provided their test data and results for direct comparison.
  • To evaluate the full 360◦ rotation scenario, the authors trained a forest on 900k images containing full rotations and tested on 5k synthetic full rotation images (with held out poses).

5. Discussion

  • The authors have seen how accurate proposals for the 3D locations of body joints can be estimated in super real-time from single depth images.
  • Detecting modes in a density function gives the final set of confidence-weighted 3D joint proposals.
  • Whether a similarly efficient approach could directly regress joint positions is also an open question.
  • Perhaps a global estimate of latent variables such as coarse person orientation could be used to condition the body part inference and remove ambiguities in local pose estimates.




Real-Time Human Pose Recognition in Parts from Single Depth Images
Jamie Shotton Andrew Fitzgibbon Mat Cook Toby Sharp Mark Finocchio
Richard Moore Alex Kipman Andrew Blake
Microsoft Research Cambridge & Xbox Incubation
Abstract
We propose a new method to quickly and accurately pre-
dict 3D positions of body joints from a single depth image,
using no temporal information. We take an object recog-
nition approach, designing an intermediate body parts rep-
resentation that maps the difficult pose estimation problem
into a simpler per-pixel classification problem. Our large
and highly varied training dataset allows the classifier to
estimate body parts invariant to pose, body shape, clothing,
etc. Finally we generate confidence-scored 3D proposals of
several body joints by reprojecting the classification result
and finding local modes.
The system runs at 200 frames per second on consumer
hardware. Our evaluation shows high accuracy on both
synthetic and real test sets, and investigates the effect of sev-
eral training parameters. We achieve state of the art accu-
racy in our comparison with related work and demonstrate
improved generalization over exact whole-skeleton nearest
neighbor matching.
1. Introduction
Robust interactive human body tracking has applica-
tions including gaming, human-computer interaction, secu-
rity, telepresence, and even health-care. The task has re-
cently been greatly simplified by the introduction of real-
time depth cameras [16, 19, 44, 37, 28, 13]. However, even
the best existing systems still exhibit limitations. In partic-
ular, until the launch of Kinect [21], none ran at interactive
rates on consumer hardware while handling a full range of
human body shapes and sizes undergoing general body mo-
tions. Some systems achieve high speeds by tracking from
frame to frame but struggle to re-initialize quickly and so
are not robust. In this paper, we focus on pose recognition
in parts: detecting from a single depth image a small set of
3D position candidates for each skeletal joint. Our focus on
per-frame initialization and recovery is designed to comple-
ment any appropriate tracking algorithm [7, 39, 16, 42, 13]
that might further incorporate temporal and kinematic co-
herence. The algorithm presented here forms a core com-
ponent of the Kinect gaming platform [21].
Illustrated in Fig. 1 and inspired by recent object recog-
nition work that divides objects into parts (e.g. [12, 43]),
our approach is driven by two key design goals: computa-
tional efficiency and robustness. A single input depth image
is segmented into a dense probabilistic body part labeling,
with the parts defined to be spatially localized near skeletal
Figure 1. Overview. From a single input depth image, a per-pixel
body part distribution is inferred. (Colors indicate the most likely
part labels at each pixel, and correspond in the joint proposals).
Local modes of this signal are estimated to give high-quality pro-
posals for the 3D locations of body joints, even for multiple users.
joints of interest. Reprojecting the inferred parts into world
space, we localize spatial modes of each part distribution
and thus generate (possibly several) confidence-weighted
proposals for the 3D locations of each skeletal joint.
We treat the segmentation into body parts as a per-pixel
classification task (no pairwise terms or CRF have proved
necessary). Evaluating each pixel separately avoids a com-
binatorial search over the different body joints, although
within a single part there are of course still dramatic dif-
ferences in the contextual appearance. For training data,
we generate realistic synthetic depth images of humans of
many shapes and sizes in highly varied poses sampled from
a large motion capture database. We train a deep ran-
domized decision forest classifier which avoids overfitting
by using hundreds of thousands of training images. Sim-
ple, discriminative depth comparison image features yield
3D translation invariance while maintaining high computa-
tional efficiency. For further speed, the classifier can be run
in parallel on each pixel on a GPU [34]. Finally, spatial
modes of the inferred per-pixel distributions are computed
using mean shift [10] resulting in the 3D joint proposals.
An optimized implementation of our algorithm runs in
under 5ms per frame (200 frames per second) on the Xbox
360 GPU, at least one order of magnitude faster than exist-
ing approaches. It works frame-by-frame across dramati-
cally differing body shapes and sizes, and the learned dis-
criminative approach naturally handles self-occlusions and

poses cropped by the image frame. We evaluate on both real
and synthetic depth images, containing challenging poses of
a varied set of subjects. Even without exploiting temporal
or kinematic constraints, the 3D joint proposals are both ac-
curate and stable. We investigate the effect of several train-
ing parameters and show how very deep trees can still avoid
overfitting due to the large training set. We demonstrate
that our part proposals generalize at least as well as exact
nearest-neighbor in both an idealized and realistic setting,
and show a substantial improvement over the state of the
art. Further, results on silhouette images suggest more gen-
eral applicability of our approach.
Our main contribution is to treat pose estimation as ob-
ject recognition using a novel intermediate body parts rep-
resentation designed to spatially localize joints of interest
at low computational cost and high accuracy. Our experi-
ments also carry several insights: (i) synthetic depth train-
ing data is an excellent proxy for real data; (ii) scaling up
the learning problem with varied synthetic data is important
for high accuracy; and (iii) our parts-based approach gener-
alizes better than even an oracular exact nearest neighbor.
Related Work. Human pose estimation has generated a
vast literature (surveyed in [22, 29]). The recent availability
of depth cameras has spurred further progress [16, 19, 28].
Grest et al. [16] use Iterated Closest Point to track a skele-
ton of a known size and starting position. Anguelov et al.
[3] segment puppets in 3D range scan data into head, limbs,
torso, and background using spin images and a MRF. In
[44], Zhu & Fujimura build heuristic detectors for coarse
upper body parts (head, torso, arms) using a linear program-
ming relaxation, but require a T-pose initialization to size
the model. Siddiqui & Medioni [37] hand craft head, hand,
and forearm detectors, and show data-driven MCMC model
fitting outperforms ICP. Kalogerakis et al. [18] classify and
segment vertices in a full closed 3D mesh into different
parts, but do not deal with occlusions and are sensitive to
mesh topology. Most similar to our approach, Plagemann
et al. [28] build a 3D mesh to find geodesic extrema inter-
est points which are classified into 3 parts: head, hand, and
foot. Their method provides both a location and orientation
estimate of these parts, but does not distinguish left from
right and the use of interest points limits the choice of parts.
Advances have also been made using conventional in-
tensity cameras, though typically at much higher computa-
tional cost. Bregler & Malik [7] track humans using twists
and exponential maps from a known initial pose. Ioffe &
Forsyth [17] group parallel edges as candidate body seg-
ments and prune combinations of segments using a pro-
jected classifier. Mori & Malik [24] use the shape con-
text descriptor to match exemplars. Ramanan & Forsyth
[31] find candidate body segments as pairs of parallel lines,
clustering appearances across frames. Shakhnarovich et al.
[33] estimate upper body pose, interpolating k-NN poses
matched by parameter sensitive hashing. Agarwal & Triggs
[1] learn a regression from kernelized image silhouettes fea-
tures to pose. Sigal et al. [39] use eigen-appearance tem-
plate detectors for head, upper arms and lower legs pro-
posals. Felzenszwalb & Huttenlocher [11] apply pictorial
structures to estimate pose efficiently. Navaratnam et al.
[25] use the marginal statistics of unlabeled data to im-
prove pose estimation. Urtasun & Darrel [41] proposed a
local mixture of Gaussian Processes to regress human pose.
Auto-context was used in [40] to obtain a coarse body part
labeling but this was not defined to localize joints and clas-
sifying each frame took about 40 seconds. Rogez et al. [32]
train randomized decision forests on a hierarchy of classes
defined on a torus of cyclic human motion patterns and cam-
era angles. Wang & Popović [42] track a hand clothed in a
colored glove. Our system could be seen as automatically
inferring the colors of a virtual colored suit from a depth
image. Bourdev & Malik [6] present ‘poselets’ that form
tight clusters in both 3D pose and 2D image appearance,
detectable using SVMs.
2. Data
Pose estimation research has often focused on techniques
to overcome lack of training data [25], because of two prob-
lems. First, generating realistic intensity images using com-
puter graphics techniques [33, 27, 26] is hampered by the
huge color and texture variability induced by clothing, hair,
and skin, often meaning that the data are reduced to 2D sil-
houettes [1]. Although depth cameras significantly reduce
this difficulty, considerable variation in body and clothing
shape remains. The second limitation is that synthetic body
pose images are of necessity fed by motion-capture (mocap)
data. Although techniques exist to simulate human motion
(e.g. [38]) they do not yet produce the range of volitional
motions of a human subject.
In this section we review depth imaging and show how
we use real mocap data, retargetted to a variety of base char-
acter models, to synthesize a large, varied dataset. We be-
lieve this dataset to considerably advance the state of the art
in both scale and variety, and demonstrate the importance
of such a large dataset in our evaluation.
2.1. Depth imaging
Depth imaging technology has advanced dramatically
over the last few years, finally reaching a consumer price
point with the launch of Kinect [21]. Pixels in a depth image
indicate calibrated depth in the scene, rather than a measure
of intensity or color. We employ the Kinect camera which
gives a 640x480 image at 30 frames per second with depth
resolution of a few centimeters.
Depth cameras offer several advantages over traditional
intensity sensors, working in low light levels, giving a cali-
brated scale estimate, being color and texture invariant, and
resolving silhouette ambiguities in pose. They also greatly

Figure 2. Synthetic and real data. Pairs of depth image and ground truth body parts. Note wide variety in pose, shape, clothing, and crop.
simplify the task of background subtraction which we as-
sume in this work. But most importantly for our approach,
it is straightforward to synthesize realistic depth images of
people and thus build a large training dataset cheaply.
2.2. Motion capture data
The human body is capable of an enormous range of
poses which are difficult to simulate. Instead, we capture a
large database of motion capture (mocap) of human actions.
Our aim was to span the wide variety of poses people would
make in an entertainment scenario. The database consists of
approximately 500k frames in a few hundred sequences of
driving, dancing, kicking, running, navigating menus, etc.
We expect our semi-local body part classifier to gener-
alize somewhat to unseen poses. In particular, we need not
record all possible combinations of the different limbs; in
practice, a wide range of poses proves sufficient. Further,
we need not record mocap with variation in rotation about
the vertical axis, mirroring left-right, scene position, body
shape and size, or camera pose, all of which can be added
in (semi-)automatically.
Since the classifier uses no temporal information, we
are interested only in static poses and not motion. Often,
changes in pose from one mocap frame to the next are so
small as to be insignificant. We thus discard many similar,
redundant poses from the initial mocap data using ‘furthest
neighbor’ clustering [15] where the distance between poses
$p_1$ and $p_2$ is defined as $\max_j \lVert p_1^j - p_2^j \rVert_2$, the maximum Euclidean distance over body joints $j$. We use a subset of 100k poses such that no two poses are closer than 5cm.
We have found it necessary to iterate the process of mo-
tion capture, sampling from our model, training the classi-
fier, and testing joint prediction accuracy in order to refine
the mocap database with regions of pose space that had been
previously missed out. Our early experiments employed
the CMU mocap database [9] which gave acceptable results
though covered far less of pose space.
2.3. Generating synthetic data
We build a randomized rendering pipeline from which
we can sample fully labeled training images. Our goals in
building this pipeline were twofold: realism and variety. For
the learned model to work well, the samples must closely
resemble real camera images, and contain good coverage of
the appearance variations we hope to recognize at test time.
While depth/scale and translation variations are handled ex-
plicitly in our features (see below), other invariances cannot
be encoded efficiently. Instead we learn invariance from the
data to camera pose, body pose, and body size and shape.
The synthesis pipeline first randomly samples a set of
parameters, and then uses standard computer graphics tech-
niques to render depth and (see below) body part images
from texture mapped 3D meshes. The mocap is retargetted
to each of 15 base meshes spanning the range of body
shapes and sizes, using [4]. Further slight random vari-
ation in height and weight give extra coverage of body
shapes. Other randomized parameters include the mocap
frame, camera pose, camera noise, clothing and hairstyle.
We provide more details of these variations in the supple-
mentary material. Fig. 2 compares the varied output of the
pipeline to hand-labeled real camera images.
3. Body Part Inference and Joint Proposals
In this section we describe our intermediate body parts
representation, detail the discriminative depth image fea-
tures, review decision forests and their application to body
part recognition, and finally discuss how a mode finding al-
gorithm is used to generate joint position proposals.
3.1. Body part labeling
A key contribution of this work is our intermediate body
part representation. We define several localized body part
labels that densely cover the body, as color-coded in Fig. 2.
Some of these parts are defined to directly localize partic-
ular skeletal joints of interest, while others fill the gaps or
could be used in combination to predict other joints. Our in-
termediate representation transforms the problem into one
that can readily be solved by efficient classification algo-
rithms; we show in Sec. 4.3 that the penalty paid for this
transformation is small.
The parts are specified in a texture map that is retargetted
to skin the various characters during rendering. The pairs of
depth and body part images are used as fully labeled data for
learning the classifier (see below). For the experiments in
this paper, we use 31 body parts: LU/RU/LW/RW head, neck,
L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R
hand, LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee,
L/R ankle, L/R foot (Left, Right, Upper, loWer). Distinct

Figure 3. Depth image features. The yellow crosses indicate the
pixel x being classified. The red circles indicate the offset pixels
as defined in Eq. 1. In (a), the two example features give a large
depth difference response. In (b), the same two features at new
image locations give a much smaller response.
parts for left and right allow the classifier to disambiguate
the left and right sides of the body.
Of course, the precise definition of these parts could be
changed to suit a particular application. For example, in an
upper body tracking scenario, all the lower body parts could
be merged. Parts should be sufficiently small to accurately
localize body joints, but not too numerous as to waste ca-
pacity of the classifier.
3.2. Depth image features
We employ simple depth comparison features, inspired
by those in [20]. At a given pixel x, the features compute
$$f_\theta(I, \mathbf{x}) = d_I\!\left(\mathbf{x} + \frac{\mathbf{u}}{d_I(\mathbf{x})}\right) - d_I\!\left(\mathbf{x} + \frac{\mathbf{v}}{d_I(\mathbf{x})}\right), \qquad (1)$$
where $d_I(\mathbf{x})$ is the depth at pixel $\mathbf{x}$ in image $I$, and parameters $\theta = (\mathbf{u}, \mathbf{v})$ describe offsets $\mathbf{u}$ and $\mathbf{v}$. The normalization of the offsets by $\frac{1}{d_I(\mathbf{x})}$ ensures the features are depth invariant: at a given point on the body, a fixed world space offset will result whether the pixel is close or far from the camera. The features are thus 3D translation invariant (modulo perspective effects). If an offset pixel lies on the background or outside the bounds of the image, the depth probe $d_I(\mathbf{x}')$ is given a large positive constant value.
Fig. 3 illustrates two features at different pixel locations $\mathbf{x}$. Feature $f_{\theta_1}$ looks upwards: Eq. 1 will give a large positive response for pixels $\mathbf{x}$ near the top of the body, but a value close to zero for pixels $\mathbf{x}$ lower down the body. Feature $f_{\theta_2}$ may instead help find thin vertical structures such as the arm.
Individually these features provide only a weak signal
about which part of the body the pixel belongs to, but in
combination in a decision forest they are sufficient to accu-
rately disambiguate all trained parts. The design of these
features was strongly motivated by their computational effi-
ciency: no preprocessing is needed; each feature need only
read at most 3 image pixels and perform at most 5 arithmetic
operations; and the features can be straightforwardly imple-
mented on the GPU. Given a larger computational budget,
one could employ potentially more powerful features based
on, for example, depth integrals over regions, curvature, or
local descriptors e.g. [5].
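As a concrete illustration of Eq. 1, here is a minimal NumPy sketch of the feature computation. It assumes the depth map is a dense (H, W) array in meters with background pixels set to a large constant, and that the offsets u, v are expressed in pixel-meters so that dividing by the depth at x yields a pixel displacement; all names are illustrative rather than taken from the paper.

```python
import numpy as np

BACKGROUND_DEPTH = 1e6   # large positive constant for background / out-of-image probes

def probe(depth, x, offset):
    """Depth at pixel x + offset, or the large constant if the probe leaves the image."""
    r = int(round(x[0] + offset[0]))
    c = int(round(x[1] + offset[1]))
    h, w = depth.shape
    if r < 0 or r >= h or c < 0 or c >= w:
        return BACKGROUND_DEPTH
    return depth[r, c]

def depth_feature(depth, x, u, v):
    """Depth comparison feature of Eq. 1 at pixel x = (row, col).

    u and v are 2D offsets in pixel-meters; dividing them by the depth at x
    turns them into pixel displacements, which makes the feature depth invariant.
    """
    d_x = depth[x[0], x[1]]
    return probe(depth, x, (u[0] / d_x, u[1] / d_x)) - probe(depth, x, (v[0] / d_x, v[1] / d_x))
```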
Figure 4. Randomized Decision Forests. A forest is an ensemble
of trees. Each tree consists of split nodes (blue) and leaf nodes
(green). The red arrows indicate the different paths that might be
taken by different trees for a particular input.
3.3. Randomized decision forests
Randomized decision trees and forests [35, 30, 2, 8] have
proven fast and effective multi-class classifiers for many
tasks [20, 23, 36], and can be implemented efficiently on the
GPU [34]. As illustrated in Fig. 4, a forest is an ensemble
of T decision trees, each consisting of split and leaf nodes.
Each split node consists of a feature $f_\theta$ and a threshold $\tau$. To classify pixel $\mathbf{x}$ in image $I$, one starts at the root and repeatedly evaluates Eq. 1, branching left or right according to the comparison to threshold $\tau$. At the leaf node reached in tree $t$, a learned distribution $P_t(c|I,\mathbf{x})$ over body part labels $c$ is stored. The distributions are averaged together for all trees in the forest to give the final classification
$$P(c|I,\mathbf{x}) = \frac{1}{T}\sum_{t=1}^{T} P_t(c|I,\mathbf{x}). \qquad (2)$$
Training. Each tree is trained on a different set of randomly
synthesized images. A random subset of 2000 example pix-
els from each image is chosen to ensure a roughly even dis-
tribution across body parts. Each tree is trained using the
following algorithm [20]:
1. Randomly propose a set of splitting candidates $\phi = (\theta, \tau)$ (feature parameters $\theta$ and thresholds $\tau$).
2. Partition the set of examples $Q = \{(I, \mathbf{x})\}$ into left and right subsets by each $\phi$:
$$Q_l(\phi) = \{\, (I, \mathbf{x}) \mid f_\theta(I, \mathbf{x}) < \tau \,\} \qquad (3)$$
$$Q_r(\phi) = Q \setminus Q_l(\phi) \qquad (4)$$
3. Compute the $\phi$ giving the largest gain in information:
$$\phi^\star = \operatorname*{argmax}_{\phi} \; G(\phi) \qquad (5)$$
$$G(\phi) = H(Q) - \sum_{s \in \{l, r\}} \frac{|Q_s(\phi)|}{|Q|} H(Q_s(\phi)) \qquad (6)$$
where Shannon entropy $H(Q)$ is computed on the normalized histogram of body part labels $l_I(\mathbf{x})$ for all $(I, \mathbf{x}) \in Q$.
4. If the largest gain $G(\phi^\star)$ is sufficient, and the depth in the tree is below a maximum, then recurse for left and right subsets $Q_l(\phi^\star)$ and $Q_r(\phi^\star)$.

Figure 5. Example inferences. Synthetic (top row); real (middle); failure modes (bottom). Left column: ground truth for a neutral pose as
a reference. In each example we see the depth image, the inferred most likely body part labels, and the joint proposals shown as front, right,
and top views (overlaid on a depth point cloud). Only the most confident proposal for each joint above a fixed, shared threshold is shown.
To keep the training times down we employ a distributed
implementation. Training 3 trees to depth 20 from 1 million
images takes about a day on a 1000 core cluster.
3.4. Joint position proposals
Body part recognition as described above infers per-pixel
information. This information must now be pooled across
pixels to generate reliable proposals for the positions of 3D
skeletal joints. These proposals are the final output of our
algorithm, and could be used by a tracking algorithm to self-
initialize and recover from failure.
A simple option is to accumulate the global 3D centers
of probability mass for each part, using the known cali-
brated depth. However, outlying pixels severely degrade
the quality of such a global estimate. Instead we employ a
local mode-finding approach based on mean shift [10] with
a weighted Gaussian kernel.
We define a density estimator per body part as
$$f_c(\hat{\mathbf{x}}) \propto \sum_{i=1}^{N} w_{ic} \exp\!\left( -\left\lVert \frac{\hat{\mathbf{x}} - \hat{\mathbf{x}}_i}{b_c} \right\rVert^2 \right), \qquad (7)$$
where $\hat{\mathbf{x}}$ is a coordinate in 3D world space, $N$ is the number of image pixels, $w_{ic}$ is a pixel weighting, $\hat{\mathbf{x}}_i$ is the reprojection of image pixel $\mathbf{x}_i$ into world space given depth $d_I(\mathbf{x}_i)$, and $b_c$ is a learned per-part bandwidth. The pixel weighting $w_{ic}$ considers both the inferred body part probability at the pixel and the world surface area of the pixel:
$$w_{ic} = P(c|I, \mathbf{x}_i) \cdot d_I(\mathbf{x}_i)^2. \qquad (8)$$
This ensures density estimates are depth invariant and gave
a small but significant improvement in joint prediction ac-
curacy. Depending on the definition of body parts, the pos-
terior P (c|I, x) can be pre-accumulated over a small set of
parts. For example, in our experiments the four body parts
covering the head are merged to localize the head joint.
Mean shift is used to find modes in this density efficiently. All pixels above a learned probability threshold $\lambda_c$ are used as starting points for part $c$. A final confidence estimate is given as a sum of the pixel weights reaching each mode. This proved more reliable than taking the modal density estimate.
The detected modes lie on the surface of the body. Each mode is therefore pushed back into the scene by a learned z offset $\zeta_c$ to produce a final joint position proposal. This simple, efficient approach works well in practice. The bandwidths $b_c$, probability threshold $\lambda_c$, and surface-to-interior z offset $\zeta_c$ are optimized per-part on a hold-out validation set of 5000 images by grid search. (As an indication, this resulted in mean bandwidth 0.065m, probability threshold 0.14, and z offset 0.039m.)
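To make the mode-finding step concrete, here is a rough sketch of Eqs. 7-8 and the confidence scoring for a single body part. It assumes pixels have already been reprojected to 3D world coordinates, that the camera axis is the third coordinate (so the z offset is applied there), and it merges modes that converge within one bandwidth; the names and the mode-grouping heuristic are assumptions, not the paper's implementation.

```python
import numpy as np

def pixel_weights(part_probs, depths):
    """Eq. 8: w_ic = P(c | I, x_i) * d_I(x_i)^2."""
    return part_probs * depths ** 2

def mean_shift_mode(points, weights, start, bandwidth, iters=20):
    """Weighted Gaussian mean shift ascent on the Eq. 7 density from one start point."""
    x = np.array(start, dtype=float)
    for _ in range(iters):
        k = weights * np.exp(-np.sum(((points - x) / bandwidth) ** 2, axis=1))
        if k.sum() < 1e-12:
            break
        x = (k[:, None] * points).sum(axis=0) / k.sum()
    return x

def propose_joint(points, part_probs, depths, bandwidth, prob_threshold, z_offset):
    """One proposal for one part: mean shift from high-probability pixels, score each
    mode by the summed weight of pixels reaching it, push the best mode into the body."""
    w = pixel_weights(part_probs, depths)
    start_idx = np.flatnonzero(part_probs > prob_threshold)
    modes, scores = [], []
    for i in start_idx:
        m = mean_shift_mode(points, w, points[i], bandwidth)
        for k, existing in enumerate(modes):
            if np.linalg.norm(existing - m) < bandwidth:   # same basin of attraction
                scores[k] += w[i]
                break
        else:
            modes.append(m)
            scores.append(w[i])
    if not modes:
        return None, 0.0
    best = int(np.argmax(scores))
    proposal = modes[best] + np.array([0.0, 0.0, z_offset])   # push mode back along z
    return proposal, float(scores[best])
```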
4. Experiments
In this section we describe the experiments performed to
evaluate our method. We show both qualitative and quan-
titative results on several challenging datasets, and com-
pare with both nearest-neighbor approaches and the state
of the art [13]. We provide further results in the supple-
mentary material. Unless otherwise specified, parameters
below were set as: 3 trees, 20 deep, 300k training images
per tree, 2000 training example pixels per image, 2000 can-
didate features θ, and 50 candidate thresholds τ per feature.
Test data. We use challenging synthetic and real depth im-
ages to evaluate our approach. For our synthetic test set,
we synthesize 5000 depth images, together with the ground
truth body part labels and joint positions. The original mo-
cap poses used to generate these images are held out from
the training data. Our real test set consists of 8808 frames of
real depth images over 15 different subjects, hand-labeled
with dense body parts and 7 upper body joint positions. We
also evaluate on the real depth data from [13]. The results
suggest that effects seen on synthetic data are mirrored in
the real data, and further that our synthetic test set is by far
the ‘hardest’ due to the extreme variability in pose and body
shape. For most experiments we limit the rotation of the user to ±120° in both training and synthetic test data since the user is facing the camera (0°) in our main entertainment scenario, though we also evaluate the full 360° scenario.
Error metrics. We quantify both classification and joint
prediction accuracy. For classification, we report the av-
erage per-class accuracy, i.e. the average of the diagonal of
the confusion matrix between the ground truth part label and
the most likely inferred part label. This metric weights each body part equally despite their varying sizes.

Citations
Journal ArticleDOI
TL;DR: The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS) as mentioned in this paper was organized in conjunction with the MICCAI 2012 and 2013 conferences, and twenty state-of-the-art tumor segmentation algorithms were applied to a set of 65 multi-contrast MR scans of low and high grade glioma patients.
Abstract: In this paper we report the set-up and results of the Multimodal Brain Tumor Image Segmentation Benchmark (BRATS) organized in conjunction with the MICCAI 2012 and 2013 conferences. Twenty state-of-the-art tumor segmentation algorithms were applied to a set of 65 multi-contrast MR scans of low- and high-grade glioma patients—manually annotated by up to four raters—and to 65 comparable scans generated using tumor image simulation software. Quantitative evaluations revealed considerable disagreement between the human raters in segmenting various tumor sub-regions (Dice scores in the range 74%–85%), illustrating the difficulty of this task. We found that different algorithms worked best for different sub-regions (reaching performance comparable to human inter-rater variability), but that no single algorithm ranked in the top for all sub-regions simultaneously. Fusing several good algorithms using a hierarchical majority vote yielded segmentations that consistently ranked above all individual algorithms, indicating remaining opportunities for further methodological improvements. The BRATS image data and manual annotations continue to be publicly available through an online evaluation system as an ongoing benchmarking resource.

3,699 citations

Proceedings Article
27 Apr 2018
TL;DR: Wang et al. as discussed by the authors proposed a novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data.
Abstract: Dynamics of human body skeletons convey significant information for human action recognition. Conventional approaches for modeling skeletons usually rely on hand-crafted parts or traversal rules, thus resulting in limited expressive power and difficulties of generalization. In this work, we propose a novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data. This formulation not only leads to greater expressive power but also stronger generalization capability. On two large datasets, Kinetics and NTU-RGBD, it achieves substantial improvements over mainstream methods.

2,681 citations

Journal ArticleDOI
Zhengyou Zhang
TL;DR: While the Kinect sensor incorporates several advanced sensing hardware, this article focuses on the vision aspect of the sensor and its impact beyond the gaming industry.
Abstract: Recent advances in 3D depth cameras such as Microsoft Kinect sensors (www.xbox.com/en-US/kinect) have created many opportunities for multimedia computing. The Kinect sensor lets the computer directly sense the third dimension (depth) of the players and the environment. It also understands when users talk, knows who they are when they walk up to it, and can interpret their movements and translate them into a format that developers can use to build new experiences. While the Kinect sensor incorporates several advanced sensing hardware, this article focuses on the vision aspect of the Kinect sensor and its impact beyond the gaming industry.

2,294 citations

Journal ArticleDOI
TL;DR: A new dataset, Human3.6M, of 3.6 Million accurate 3D Human poses, acquired by recording the performance of 5 female and 6 male subjects, under 4 different viewpoints, is introduced for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms.
Abstract: We introduce a new dataset, Human3.6M, of 3.6 Million accurate 3D Human poses, acquired by recording the performance of 5 female and 6 male subjects, under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms. Besides increasing the size of the datasets in the current state-of-the-art by several orders of magnitude, we also aim to complement such datasets with a diverse set of motions and poses encountered as part of typical human activities (taking photos, talking on the phone, posing, greeting, eating, etc.), with additional synchronized image, human motion capture, and time of flight (depth) data, and with accurate 3D body scans of all the subject actors involved. We also provide controlled mixed reality evaluation scenarios where 3D human models are animated using motion capture and inserted using correct 3D geometry, in complex real environments, viewed with moving cameras, and under occlusion. Finally, we provide a set of large-scale statistical models and detailed evaluation baselines for the dataset illustrating its diversity and the scope for improvement by future work in the research community. Our experiments show that our best large-scale model can leverage our full training set to obtain a 20% improvement in performance compared to a training set of the scale of the largest existing public dataset for this problem. Yet the potential for improvement by leveraging higher capacity, more complex models with our large dataset, is substantially vaster and should stimulate future research. The dataset together with code for the associated large-scale learning models, features, visualization tools, as well as the evaluation server, is available online at http://vision.imar.ro/human3.6m .

2,209 citations

Journal ArticleDOI
TL;DR: The state of the art in HAR based on wearable sensors is surveyed and a two-level taxonomy in accordance to the learning approach and the response time is proposed.
Abstract: Providing accurate and opportune information on people's activities and behaviors is one of the most important tasks in pervasive computing. Innumerable applications can be visualized, for instance, in medical, security, entertainment, and tactical scenarios. Despite human activity recognition (HAR) being an active field for more than a decade, there are still key aspects that, if addressed, would constitute a significant turn in the way people interact with mobile devices. This paper surveys the state of the art in HAR based on wearable sensors. A general architecture is first presented along with a description of the main components of any HAR system. We also propose a two-level taxonomy in accordance to the learning approach (either supervised or semi-supervised) and the response time (either offline or online). Then, the principal issues and challenges are discussed, as well as the main solutions to each one of them. Twenty eight systems are qualitatively evaluated in terms of recognition performance, energy consumption, obtrusiveness, and flexibility, among others. Finally, we present some open problems and ideas that, due to their high relevance, should be addressed in future research.

2,184 citations

References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, aaa, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations

Journal ArticleDOI
TL;DR: In this paper, an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail, is described, and a reported shortcoming of the basic algorithm is discussed.
Abstract: The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.

17,177 citations

Journal ArticleDOI
TL;DR: It is proved the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and, thus, its utility in detecting the modes of the density.
Abstract: A general non-parametric technique is proposed for the analysis of a complex multimodal feature space and to delineate arbitrarily shaped clusters in it. The basic computational module of the technique is an old pattern recognition procedure: the mean shift. For discrete data, we prove the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and, thus, its utility in detecting the modes of the density. The relation of the mean shift procedure to the Nadaraya-Watson estimator from kernel regression and the robust M-estimators; of location is also established. Algorithms for two low-level vision tasks discontinuity-preserving smoothing and image segmentation - are described as applications. In these algorithms, the only user-set parameter is the resolution of the analysis, and either gray-level or color images are accepted as input. Extensive experimental results illustrate their excellent performance.

11,727 citations


"Real-time human pose recognition in..." refers methods in this paper

  • ...Finally, spatial modes of the inferred per-pixel distributions are computed using mean shift [ 10 ] resulting in the 3D joint proposals....

    [...]

  • ...Instead we employ a local mode-finding approach based on mean shift [ 10 ] with a weighted Gaussian kernel....

    [...]

Journal ArticleDOI
TL;DR: This paper presents work on computing shape models that are computationally fast and invariant basic transformations like translation, scaling and rotation, and proposes shape detection using a feature called shape context, which is descriptive of the shape of the object.
Abstract: We present a novel approach to measuring similarity between shapes and exploit it for object recognition. In our framework, the measurement of similarity is preceded by: (1) solving for correspondences between points on the two shapes; (2) using the correspondences to estimate an aligning transform. In order to solve the correspondence problem, we attach a descriptor, the shape context, to each point. The shape context at a reference point captures the distribution of the remaining points relative to it, thus offering a globally discriminative characterization. Corresponding points on two similar shapes will have similar shape contexts, enabling us to solve for correspondences as an optimal assignment problem. Given the point correspondences, we estimate the transformation that best aligns the two shapes; regularized thin-plate splines provide a flexible class of transformation maps for this purpose. The dissimilarity between the two shapes is computed as a sum of matching errors between corresponding points, together with a term measuring the magnitude of the aligning transform. We treat recognition in a nearest-neighbor classification framework as the problem of finding the stored prototype shape that is maximally similar to that in the image. Results are presented for silhouettes, trademarks, handwritten digits, and the COIL data set.

6,693 citations

Journal ArticleDOI
TL;DR: This survey reviews recent trends in video-based human capture and analysis, as well as discussing open problems for future research to achieve automatic visual analysis of human movement.

2,738 citations

Frequently Asked Questions (19)
Q1. What are the contributions in "Real-time human pose recognition in parts from single depth images"?

The authors propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. The authors take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Finally the authors generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The authors achieve state of the art accuracy in their comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching. 

As future work, the authors plan further study of the variability in the source mocap data, the properties of the generative model underlying the synthesis pipeline, and the particular part definitions. 

Depth cameras offer several advantages over traditional intensity sensors, working in low light levels, giving a calibrated scale estimate, being color and texture invariant, and resolving silhouette ambiguities in pose. 

Using a highly varied synthetic training set allowed us to train very deep decision forests using simple depth-invariant features without overfitting, learning invariance to both pose and shape. 

For their synthetic test set, the authors synthesize 5000 depth images, together with the ground truth body part labels and joint positions. 

The authors train a deep randomized decision forest classifier which avoids overfitting by using hundreds of thousands of training images. 

The database consists of approximately 500k frames in a few hundred sequences of driving, dancing, kicking, running, navigating menus, etc. 

generating realistic intensity images using computer graphics techniques [33, 27, 26] is hampered by the huge color and texture variability induced by clothing, hair, and skin, often meaning that the data are reduced to 2D silhouettes [1]. 

The authors have found it necessary to iterate the process of motion capture, sampling from their model, training the classifier, and testing joint prediction accuracy in order to refine the mocap database with regions of pose space that had been previously missed out. 

The authors believe this dataset to considerably advance the state of the art in both scale and variety, and demonstrate the importance of such a large dataset in their evaluation. 

Illustrated in Fig. 1 and inspired by recent object recognition work that divides objects into parts (e.g. [12, 43]), their approach is driven by two key design goals: computational efficiency and robustness. 

Auto-context was used in [40] to obtain a coarse body part labeling but this was not defined to localize joints and classifying each frame took about 40 seconds. 

For the corresponding joint prediction using a 2D metric with a 10 pixel true positive threshold, the authors got 0.539 mAP with scale and 0.465 mAP without. 

The authors thus discard many similar, redundant poses from the initial mocap data using ‘furthest neighbor’ clustering [15], where the distance between poses $p_1$ and $p_2$ is defined as $\max_j \lVert p_1^j - p_2^j \rVert_2$, the maximum Euclidean distance over body joints j. 

The fourth example shows a failure to generalize well to an unseen pose, but the confidence gates bad proposals, maintaining high precision at the expense of recall. 

The results suggest that effects seen on synthetic data are mirrored in the real data, and further that their synthetic test set is by far the ‘hardest’ due to the extreme variability in pose and body shape. 

In Fig. 6(a) the authors show how test accuracy increases approximately logarithmically with the number of randomly generated training images, though starts to tail off around 100k images. 

Often (as with the second and third failure examples) the most likely body part is incorrect, but there is still sufficient correct probability mass in distribution P (c|I,x) that an accurate proposal can still be generated. 

Their approach can propose joint positions for multiple people in the image, since the per-pixel classifier generalizes well even without explicit training for this scenario.