Proceedings ArticleDOI

Real-time human pose recognition in parts from single depth images

20 Jun 2011-pp 1297-1304
TL;DR: This work takes an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem, and generates confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.
Abstract: We propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. We take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Our large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc. Finally we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The system runs at 200 frames per second on consumer hardware. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state of the art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

Summary (4 min read)

1. Introduction

  • Robust interactive human body tracking has applications including gaming, human-computer interaction, security, telepresence, and even health-care.
  • In particular, until the launch of Kinect [21], none ran at interactive rates on consumer hardware while handling a full range of human body shapes and sizes undergoing general body motions.
  • Reprojecting the inferred parts into world space, the authors localize spatial modes of each part distribution and thus generate (possibly several) confidence-weighted proposals for the 3D locations of each skeletal joint.
  • The authors' experiments also carry several insights: (i) synthetic depth training data is an excellent proxy for real data; (ii) scaling up the learning problem with varied synthetic data is important for high accuracy; and (iii) their parts-based approach generalizes better than even an oracular exact nearest neighbor.
  • Felzenszwalb & Huttenlocher [11] apply pictorial structures to estimate pose efficiently.

2. Data

  • Pose estimation research has often focused on techniques to overcome lack of training data [25], because of two problems.
  • First, generating realistic intensity images using computer graphics techniques [33, 27, 26] is hampered by the huge color and texture variability induced by clothing, hair, and skin, often meaning that the data are reduced to 2D silhouettes [1].
  • The second limitation is that synthetic body pose images are of necessity fed by motion-capture data.
  • Although techniques exist to simulate human motion (e.g. [38]) they do not yet produce the range of volitional motions of a human subject.
  • The authors believe this dataset to considerably advance the state of the art in both scale and variety, and demonstrate the importance of such a large dataset in their evaluation.

2.1. Depth imaging

  • Depth imaging technology has advanced dramatically over the last few years, finally reaching a consumer price point with the launch of Kinect [21].
  • Pixels in a depth image indicate calibrated depth in the scene, rather than a measure of intensity or color.
  • The authors employ the Kinect camera which gives a 640x480 image at 30 frames per second with depth resolution of a few centimeters.
  • Depth cameras offer several advantages over traditional intensity sensors, working in low light levels, giving a calibrated scale estimate, being color and texture invariant, and resolving silhouette ambiguities in pose.
  • But most importantly for their approach, it is straightforward to synthesize realistic depth images of people and thus build a large training dataset cheaply.

2.2. Motion capture data

  • The human body is capable of an enormous range of poses which are difficult to simulate.
  • Instead, the authors capture a large database of motion capture of human actions.
  • The authors' aim was to span the wide variety of poses people would make in an entertainment scenario.
  • Often, changes in pose from one mocap frame to the next are so small as to be insignificant.
  • The authors thus discard many similar, redundant poses from the initial mocap data using ‘furthest neighbor’ clustering [15], where the distance between poses $p_1$ and $p_2$ is defined as $\max_j \lVert p_1^j - p_2^j \rVert_2$, the maximum Euclidean distance over body joints j.

2.3. Generating synthetic data

  • The authors build a randomized rendering pipeline from which they can sample fully labeled training images.
  • The authors' goals in building this pipeline were twofold: realism and variety.
  • For the learned model to work well, the samples must closely resemble real camera images, and contain good coverage of the appearance variations the authors hope to recognize at test time.
  • While depth/scale and translation variations are handled explicitly in their features (see below), other invariances cannot be encoded efficiently.
  • Further slight random variation in height and weight give extra coverage of body shapes.

3.1. Body part labeling

  • A key contribution of this work is their intermediate body part representation.
  • Some of these parts are defined to directly localize particular skeletal joints of interest, while others fill the gaps or could be used in combination to predict other joints.
  • The authors' intermediate representation transforms the problem into one that can readily be solved by efficient classification algorithms; the authors show in Sec. 4.3 that the penalty paid for this transformation is small.
  • The pairs of depth and body part images are used as fully labeled data for learning the classifier (see below).
  • In an upper body tracking scenario, all the lower body parts could be merged.

3.2. Depth image features

  • The authors employ simple depth comparison features, inspired by those in [20].
  • The features are thus 3D translation invariant (modulo perspective effects).
  • Feature fθ1 looks upwards: Eq. 1 gives a large positive response for pixels x near the top of the body, but a value close to zero for pixels x lower down the body.
  • The design of these features was strongly motivated by their computational efficiency: no preprocessing is needed; each feature need only read at most 3 image pixels and perform at most 5 arithmetic operations; and the features can be straightforwardly implemented on the GPU.

3.3. Randomized decision forests

  • At the leaf node reached in tree t, a learned distribution Pt(c|I,x) over body part labels c is stored.
  • A random subset of 2000 example pixels from each image is chosen to ensure a roughly even distribution across body parts.
  • Each tree is trained using the following algorithm [20]: 1. Randomly propose a set of splitting candidates φ = (θ, τ) (feature parameters θ and thresholds τ ).
  • To keep the training times down the authors employ a distributed implementation.

3.4. Joint position proposals

  • Body part recognition as described above infers per-pixel information.
  • These proposals are the final output of their algorithm, and could be used by a tracking algorithm to selfinitialize and recover from failure.
  • Depending on the definition of body parts, the posterior P (c|I,x) can be pre-accumulated over a small set of parts.
  • Mean shift is used to find modes in this density efficiently.
  • A final confidence estimate is given as a sum of the pixel weights reaching each mode.

4. Experiments

  • In this section the authors describe the experiments performed to evaluate their method.
  • For their synthetic test set, the authors synthesize 5000 depth images, together with the ground truth body part labels and joint positions.
  • The authors quantify both classification and joint prediction accuracy.
  • Any joint proposals outside D meters also count as false positives.
  • The authors set D = 0.1m below, approximately the accuracy of the hand-labeled real test data ground truth.
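The exact joint prediction metric is not fully spelled out in this summary; the sketch below is an illustrative reading of it, assuming confidence-ranked proposals where the first proposal within D = 0.1m of the ground truth counts as a true positive and all other proposals count as false positives. The function and variable names are hypothetical, not taken from the paper.

```python
import numpy as np

def joint_average_precision(proposals, confidences, gt_position, D=0.1):
    """Illustrative average precision for one joint in one test image.

    proposals:   (N, 3) array of 3D joint proposals in meters
    confidences: (N,) confidence scores
    gt_position: (3,) ground-truth joint position
    D:           true-positive distance threshold (0.1m in the paper)
    """
    order = np.argsort(-np.asarray(confidences))      # most confident first
    labels = []
    matched = False
    for i in order:
        within = np.linalg.norm(np.asarray(proposals[i]) - gt_position) <= D
        if within and not matched:
            labels.append(1)       # first proposal within D: true positive
            matched = True
        else:
            labels.append(0)       # duplicates and far proposals: false positives
    labels = np.array(labels)
    if labels.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(labels) / (np.arange(len(labels)) + 1)
    return float((precision_at_k * labels).sum() / labels.sum())
```

Averaging such a quantity over joints and test images would give a mean average precision of the kind reported in Sec. 4.3.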

4.1. Qualitative results

  • Fig. 5 shows example inferences of their algorithm.
  • Note high accuracy of both classification and joint prediction across large variations in body and camera pose, depth in scene, cropping, and body size and shape (e.g. small child vs. heavy adult).
  • The bottom row shows some failure modes of the body part classification.
  • The first example shows a failure to distinguish subtle changes in the depth image such as the crossed arms.
  • Often (as with the second and third failure examples) the most likely body part is incorrect, but there is still sufficient correct probability mass in distribution P (c|I,x) that an accurate proposal can still be generated.

4.2. Classification accuracy

  • The authors investigate the effect of several training parameters on classification accuracy.
  • The authors also show in Fig. 6(a) the quality of their approach on synthetic silhouette images, where the features in Eq. 1 are either given scale (as the mean depth) or not (a fixed constant depth).
  • Using only 15k images the authors observe overfitting beginning around depth 17, but the enlarged 900k training set avoids this.
  • The authors compare the actual performance of their system (red) with the best achievable result (blue) given the ground truth body part labels.
  • Accuracy increases with the maximum probe offset, though levels off around 129 pixel meters.

4.3. Joint prediction accuracy

  • In Fig. 7 the authors show average precision results on the synthetic test set, achieving 0.731 mAP.
  • The authors compare an idealized setup that is given the ground truth body part labels to the real setup using inferred body parts.
  • Nearest neighbor chamfer matching is also drastically slower (2 fps) than their algorithm.
  • The authors of [13] provided their test data and results for direct comparison.
  • To evaluate the full 360◦ rotation scenario, the authors trained a forest on 900k images containing full rotations and tested on 5k synthetic full rotation images (with held out poses).

5. Discussion

  • The authors have seen how accurate proposals for the 3D locations of body joints can be estimated in super real-time from single depth images.
  • Detecting modes in a density function gives the final set of confidence-weighted 3D joint proposals.
  • Whether a similarly efficient approach could directly regress joint positions is also an open question.
  • Perhaps a global estimate of latent variables such as coarse person orientation could be used to condition the body part inference and remove ambiguities in local pose estimates.




Real-Time Human Pose Recognition in Parts from Single Depth Images
Jamie Shotton Andrew Fitzgibbon Mat Cook Toby Sharp Mark Finocchio
Richard Moore Alex Kipman Andrew Blake
Microsoft Research Cambridge & Xbox Incubation
Abstract
We propose a new method to quickly and accurately pre-
dict 3D positions of body joints from a single depth image,
using no temporal information. We take an object recog-
nition approach, designing an intermediate body parts rep-
resentation that maps the difficult pose estimation problem
into a simpler per-pixel classification problem. Our large
and highly varied training dataset allows the classifier to
estimate body parts invariant to pose, body shape, clothing,
etc. Finally we generate confidence-scored 3D proposals of
several body joints by reprojecting the classification result
and finding local modes.
The system runs at 200 frames per second on consumer
hardware. Our evaluation shows high accuracy on both
synthetic and real test sets, and investigates the effect of sev-
eral training parameters. We achieve state of the art accu-
racy in our comparison with related work and demonstrate
improved generalization over exact whole-skeleton nearest
neighbor matching.
1. Introduction
Robust interactive human body tracking has applica-
tions including gaming, human-computer interaction, secu-
rity, telepresence, and even health-care. The task has re-
cently been greatly simplified by the introduction of real-
time depth cameras [16, 19, 44, 37, 28, 13]. However, even
the best existing systems still exhibit limitations. In partic-
ular, until the launch of Kinect [21], none ran at interactive
rates on consumer hardware while handling a full range of
human body shapes and sizes undergoing general body mo-
tions. Some systems achieve high speeds by tracking from
frame to frame but struggle to re-initialize quickly and so
are not robust. In this paper, we focus on pose recognition
in parts: detecting from a single depth image a small set of
3D position candidates for each skeletal joint. Our focus on
per-frame initialization and recovery is designed to comple-
ment any appropriate tracking algorithm [7, 39, 16, 42, 13]
that might further incorporate temporal and kinematic co-
herence. The algorithm presented here forms a core com-
ponent of the Kinect gaming platform [21].
Illustrated in Fig. 1 and inspired by recent object recog-
nition work that divides objects into parts (e.g. [12, 43]),
our approach is driven by two key design goals: computa-
tional efficiency and robustness. A single input depth image
is segmented into a dense probabilistic body part labeling,
with the parts defined to be spatially localized near skeletal
Figure 1. Overview. From a single input depth image, a per-pixel
body part distribution is inferred. (Colors indicate the most likely
part labels at each pixel, and correspond in the joint proposals).
Local modes of this signal are estimated to give high-quality pro-
posals for the 3D locations of body joints, even for multiple users.
joints of interest. Reprojecting the inferred parts into world
space, we localize spatial modes of each part distribution
and thus generate (possibly several) confidence-weighted
proposals for the 3D locations of each skeletal joint.
We treat the segmentation into body parts as a per-pixel
classification task (no pairwise terms or CRF have proved
necessary). Evaluating each pixel separately avoids a com-
binatorial search over the different body joints, although
within a single part there are of course still dramatic dif-
ferences in the contextual appearance. For training data,
we generate realistic synthetic depth images of humans of
many shapes and sizes in highly varied poses sampled from
a large motion capture database. We train a deep ran-
domized decision forest classifier which avoids overfitting
by using hundreds of thousands of training images. Sim-
ple, discriminative depth comparison image features yield
3D translation invariance while maintaining high computa-
tional efficiency. For further speed, the classifier can be run
in parallel on each pixel on a GPU [34]. Finally, spatial
modes of the inferred per-pixel distributions are computed
using mean shift [10] resulting in the 3D joint proposals.
An optimized implementation of our algorithm runs in
under 5ms per frame (200 frames per second) on the Xbox
360 GPU, at least one order of magnitude faster than exist-
ing approaches. It works frame-by-frame across dramati-
cally differing body shapes and sizes, and the learned dis-
criminative approach naturally handles self-occlusions and

poses cropped by the image frame. We evaluate on both real
and synthetic depth images, containing challenging poses of
a varied set of subjects. Even without exploiting temporal
or kinematic constraints, the 3D joint proposals are both ac-
curate and stable. We investigate the effect of several train-
ing parameters and show how very deep trees can still avoid
overfitting due to the large training set. We demonstrate
that our part proposals generalize at least as well as exact
nearest-neighbor in both an idealized and realistic setting,
and show a substantial improvement over the state of the
art. Further, results on silhouette images suggest more gen-
eral applicability of our approach.
Our main contribution is to treat pose estimation as ob-
ject recognition using a novel intermediate body parts rep-
resentation designed to spatially localize joints of interest
at low computational cost and high accuracy. Our experi-
ments also carry several insights: (i) synthetic depth train-
ing data is an excellent proxy for real data; (ii) scaling up
the learning problem with varied synthetic data is important
for high accuracy; and (iii) our parts-based approach gener-
alizes better than even an oracular exact nearest neighbor.
Related Work. Human pose estimation has generated a
vast literature (surveyed in [22, 29]). The recent availability
of depth cameras has spurred further progress [16, 19, 28].
Grest et al. [16] use Iterated Closest Point to track a skele-
ton of a known size and starting position. Anguelov et al.
[3] segment puppets in 3D range scan data into head, limbs,
torso, and background using spin images and a MRF. In
[44], Zhu & Fujimura build heuristic detectors for coarse
upper body parts (head, torso, arms) using a linear program-
ming relaxation, but require a T-pose initialization to size
the model. Siddiqui & Medioni [37] hand craft head, hand,
and forearm detectors, and show data-driven MCMC model
fitting outperforms ICP. Kalogerakis et al. [18] classify and
segment vertices in a full closed 3D mesh into different
parts, but do not deal with occlusions and are sensitive to
mesh topology. Most similar to our approach, Plagemann
et al. [28] build a 3D mesh to find geodesic extrema inter-
est points which are classified into 3 parts: head, hand, and
foot. Their method provides both a location and orientation
estimate of these parts, but does not distinguish left from
right and the use of interest points limits the choice of parts.
Advances have also been made using conventional in-
tensity cameras, though typically at much higher computa-
tional cost. Bregler & Malik [7] track humans using twists
and exponential maps from a known initial pose. Ioffe &
Forsyth [17] group parallel edges as candidate body seg-
ments and prune combinations of segments using a pro-
jected classifier. Mori & Malik [24] use the shape con-
text descriptor to match exemplars. Ramanan & Forsyth
[31] find candidate body segments as pairs of parallel lines,
clustering appearances across frames. Shakhnarovich et al.
[33] estimate upper body pose, interpolating k-NN poses
matched by parameter sensitive hashing. Agarwal & Triggs
[1] learn a regression from kernelized image silhouettes fea-
tures to pose. Sigal et al. [39] use eigen-appearance tem-
plate detectors for head, upper arms and lower legs pro-
posals. Felzenszwalb & Huttenlocher [11] apply pictorial
structures to estimate pose efficiently. Navaratnam et al.
[25] use the marginal statistics of unlabeled data to im-
prove pose estimation. Urtasun & Darrel [41] proposed a
local mixture of Gaussian Processes to regress human pose.
Auto-context was used in [40] to obtain a coarse body part
labeling but this was not defined to localize joints and clas-
sifying each frame took about 40 seconds. Rogez et al. [32]
train randomized decision forests on a hierarchy of classes
defined on a torus of cyclic human motion patterns and cam-
era angles. Wang & Popović [42] track a hand clothed in a
colored glove. Our system could be seen as automatically
inferring the colors of a virtual colored suit from a depth
image. Bourdev & Malik [6] present ‘poselets’ that form
tight clusters in both 3D pose and 2D image appearance,
detectable using SVMs.
2. Data
Pose estimation research has often focused on techniques
to overcome lack of training data [25], because of two prob-
lems. First, generating realistic intensity images using com-
puter graphics techniques [33, 27, 26] is hampered by the
huge color and texture variability induced by clothing, hair,
and skin, often meaning that the data are reduced to 2D sil-
houettes [1]. Although depth cameras significantly reduce
this difficulty, considerable variation in body and clothing
shape remains. The second limitation is that synthetic body
pose images are of necessity fed by motion-capture (mocap)
data. Although techniques exist to simulate human motion
(e.g. [38]) they do not yet produce the range of volitional
motions of a human subject.
In this section we review depth imaging and show how
we use real mocap data, retargetted to a variety of base char-
acter models, to synthesize a large, varied dataset. We be-
lieve this dataset to considerably advance the state of the art
in both scale and variety, and demonstrate the importance
of such a large dataset in our evaluation.
2.1. Depth imaging
Depth imaging technology has advanced dramatically
over the last few years, finally reaching a consumer price
point with the launch of Kinect [21]. Pixels in a depth image
indicate calibrated depth in the scene, rather than a measure
of intensity or color. We employ the Kinect camera which
gives a 640x480 image at 30 frames per second with depth
resolution of a few centimeters.
Depth cameras offer several advantages over traditional
intensity sensors, working in low light levels, giving a cali-
brated scale estimate, being color and texture invariant, and
resolving silhouette ambiguities in pose. They also greatly

Figure 2. Synthetic and real data. Pairs of depth image and ground truth body parts. Note wide variety in pose, shape, clothing, and crop.
simplify the task of background subtraction which we as-
sume in this work. But most importantly for our approach,
it is straightforward to synthesize realistic depth images of
people and thus build a large training dataset cheaply.
2.2. Motion capture data
The human body is capable of an enormous range of
poses which are difficult to simulate. Instead, we capture a
large database of motion capture (mocap) of human actions.
Our aim was to span the wide variety of poses people would
make in an entertainment scenario. The database consists of
approximately 500k frames in a few hundred sequences of
driving, dancing, kicking, running, navigating menus, etc.
We expect our semi-local body part classifier to gener-
alize somewhat to unseen poses. In particular, we need not
record all possible combinations of the different limbs; in
practice, a wide range of poses proves sufficient. Further,
we need not record mocap with variation in rotation about
the vertical axis, mirroring left-right, scene position, body
shape and size, or camera pose, all of which can be added
in (semi-)automatically.
Since the classifier uses no temporal information, we
are interested only in static poses and not motion. Often,
changes in pose from one mocap frame to the next are so
small as to be insignificant. We thus discard many similar,
redundant poses from the initial mocap data using ‘furthest
neighbor’ clustering [15] where the distance between poses
$p_1$ and $p_2$ is defined as $\max_j \lVert p_1^j - p_2^j \rVert_2$, the maximum Euclidean distance over body joints $j$. We use a subset of 100k poses such that no two poses are closer than 5cm.
We have found it necessary to iterate the process of mo-
tion capture, sampling from our model, training the classi-
fier, and testing joint prediction accuracy in order to refine
the mocap database with regions of pose space that had been
previously missed out. Our early experiments employed
the CMU mocap database [9] which gave acceptable results
though covered far less of pose space.
2.3. Generating synthetic data
We build a randomized rendering pipeline from which
we can sample fully labeled training images. Our goals in
building this pipeline were twofold: realism and variety. For
the learned model to work well, the samples must closely
resemble real camera images, and contain good coverage of
the appearance variations we hope to recognize at test time.
While depth/scale and translation variations are handled ex-
plicitly in our features (see below), other invariances cannot
be encoded efficiently. Instead we learn invariance from the
data to camera pose, body pose, and body size and shape.
The synthesis pipeline first randomly samples a set of
parameters, and then uses standard computer graphics tech-
niques to render depth and (see below) body part images
from texture mapped 3D meshes. The mocap is retargetted
to each of 15 base meshes spanning the range of body
shapes and sizes, using [4]. Further slight random vari-
ation in height and weight give extra coverage of body
shapes. Other randomized parameters include the mocap
frame, camera pose, camera noise, clothing and hairstyle.
We provide more details of these variations in the supple-
mentary material. Fig. 2 compares the varied output of the
pipeline to hand-labeled real camera images.
3. Body Part Inference and Joint Proposals
In this section we describe our intermediate body parts
representation, detail the discriminative depth image fea-
tures, review decision forests and their application to body
part recognition, and finally discuss how a mode finding al-
gorithm is used to generate joint position proposals.
3.1. Body part labeling
A key contribution of this work is our intermediate body
part representation. We define several localized body part
labels that densely cover the body, as color-coded in Fig. 2.
Some of these parts are defined to directly localize partic-
ular skeletal joints of interest, while others fill the gaps or
could be used in combination to predict other joints. Our in-
termediate representation transforms the problem into one
that can readily be solved by efficient classification algo-
rithms; we show in Sec. 4.3 that the penalty paid for this
transformation is small.
The parts are specified in a texture map that is retargetted
to skin the various characters during rendering. The pairs of
depth and body part images are used as fully labeled data for
learning the classifier (see below). For the experiments in
this paper, we use 31 body parts: LU/RU/LW/RW head, neck,
L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R
hand, LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee,
L/R ankle, L/R foot (Left, Right, Upper, loWer). Distinct

Figure 3. Depth image features. The yellow crosses indicate the
pixel x being classified. The red circles indicate the offset pixels
as defined in Eq. 1. In (a), the two example features give a large
depth difference response. In (b), the same two features at new
image locations give a much smaller response.
parts for left and right allow the classifier to disambiguate
the left and right sides of the body.
Of course, the precise definition of these parts could be
changed to suit a particular application. For example, in an
upper body tracking scenario, all the lower body parts could
be merged. Parts should be sufficiently small to accurately
localize body joints, but not too numerous as to waste ca-
pacity of the classifier.
3.2. Depth image features
We employ simple depth comparison features, inspired
by those in [20]. At a given pixel x, the features compute
$$f_\theta(I, \mathbf{x}) = d_I\!\left(\mathbf{x} + \frac{\mathbf{u}}{d_I(\mathbf{x})}\right) - d_I\!\left(\mathbf{x} + \frac{\mathbf{v}}{d_I(\mathbf{x})}\right), \qquad (1)$$
where $d_I(\mathbf{x})$ is the depth at pixel $\mathbf{x}$ in image $I$, and parameters $\theta = (\mathbf{u}, \mathbf{v})$ describe offsets $\mathbf{u}$ and $\mathbf{v}$. The normalization of the offsets by $\frac{1}{d_I(\mathbf{x})}$ ensures the features are depth invariant: at a given point on the body, a fixed world space offset will result whether the pixel is close or far from the camera. The features are thus 3D translation invariant (modulo perspective effects). If an offset pixel lies on the background or outside the bounds of the image, the depth probe $d_I(\mathbf{x}')$ is given a large positive constant value.
Fig. 3 illustrates two features at different pixel locations $\mathbf{x}$. Feature $f_{\theta_1}$ looks upwards: Eq. 1 will give a large positive response for pixels $\mathbf{x}$ near the top of the body, but a value close to zero for pixels $\mathbf{x}$ lower down the body. Feature $f_{\theta_2}$ may instead help find thin vertical structures such as the arm.
Individually these features provide only a weak signal
about which part of the body the pixel belongs to, but in
combination in a decision forest they are sufficient to accu-
rately disambiguate all trained parts. The design of these
features was strongly motivated by their computational effi-
ciency: no preprocessing is needed; each feature need only
read at most 3 image pixels and perform at most 5 arithmetic
operations; and the features can be straightforwardly imple-
mented on the GPU. Given a larger computational budget,
one could employ potentially more powerful features based
on, for example, depth integrals over regions, curvature, or
local descriptors e.g. [5].
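As a concrete illustration of Eq. 1, here is a minimal NumPy sketch of the feature computation. It assumes the depth map is a dense (H, W) array in meters with background pixels set to a large constant, and that the offsets u, v are expressed in pixel-meters so that dividing by the depth at x yields a pixel displacement; all names are illustrative rather than taken from the paper.

```python
import numpy as np

BACKGROUND_DEPTH = 1e6   # large positive constant for background / out-of-image probes

def probe(depth, x, offset):
    """Depth at pixel x + offset, or the large constant if the probe leaves the image."""
    r = int(round(x[0] + offset[0]))
    c = int(round(x[1] + offset[1]))
    h, w = depth.shape
    if r < 0 or r >= h or c < 0 or c >= w:
        return BACKGROUND_DEPTH
    return depth[r, c]

def depth_feature(depth, x, u, v):
    """Depth comparison feature of Eq. 1 at pixel x = (row, col).

    u and v are 2D offsets in pixel-meters; dividing them by the depth at x
    turns them into pixel displacements, which makes the feature depth invariant.
    """
    d_x = depth[x[0], x[1]]
    return probe(depth, x, (u[0] / d_x, u[1] / d_x)) - probe(depth, x, (v[0] / d_x, v[1] / d_x))
```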
Figure 4. Randomized Decision Forests. A forest is an ensemble
of trees. Each tree consists of split nodes (blue) and leaf nodes
(green). The red arrows indicate the different paths that might be
taken by different trees for a particular input.
3.3. Randomized decision forests
Randomized decision trees and forests [35, 30, 2, 8] have
proven fast and effective multi-class classifiers for many
tasks [20, 23, 36], and can be implemented efficiently on the
GPU [34]. As illustrated in Fig. 4, a forest is an ensemble
of T decision trees, each consisting of split and leaf nodes.
Each split node consists of a feature $f_\theta$ and a threshold $\tau$. To classify pixel $\mathbf{x}$ in image $I$, one starts at the root and repeatedly evaluates Eq. 1, branching left or right according to the comparison to threshold $\tau$. At the leaf node reached in tree $t$, a learned distribution $P_t(c|I,\mathbf{x})$ over body part labels $c$ is stored. The distributions are averaged together for all trees in the forest to give the final classification
$$P(c|I,\mathbf{x}) = \frac{1}{T}\sum_{t=1}^{T} P_t(c|I,\mathbf{x}). \qquad (2)$$
Training. Each tree is trained on a different set of randomly
synthesized images. A random subset of 2000 example pix-
els from each image is chosen to ensure a roughly even dis-
tribution across body parts. Each tree is trained using the
following algorithm [20]:
1. Randomly propose a set of splitting candidates $\phi = (\theta, \tau)$ (feature parameters $\theta$ and thresholds $\tau$).
2. Partition the set of examples $Q = \{(I, \mathbf{x})\}$ into left and right subsets by each $\phi$:
$$Q_l(\phi) = \{\, (I, \mathbf{x}) \mid f_\theta(I, \mathbf{x}) < \tau \,\} \qquad (3)$$
$$Q_r(\phi) = Q \setminus Q_l(\phi) \qquad (4)$$
3. Compute the $\phi$ giving the largest gain in information:
$$\phi^\star = \operatorname*{argmax}_{\phi} \; G(\phi) \qquad (5)$$
$$G(\phi) = H(Q) - \sum_{s \in \{l, r\}} \frac{|Q_s(\phi)|}{|Q|} H(Q_s(\phi)) \qquad (6)$$
where Shannon entropy $H(Q)$ is computed on the normalized histogram of body part labels $l_I(\mathbf{x})$ for all $(I, \mathbf{x}) \in Q$.
4. If the largest gain $G(\phi^\star)$ is sufficient, and the depth in the tree is below a maximum, then recurse for left and right subsets $Q_l(\phi^\star)$ and $Q_r(\phi^\star)$.

Figure 5. Example inferences. Synthetic (top row); real (middle); failure modes (bottom). Left column: ground truth for a neutral pose as
a reference. In each example we see the depth image, the inferred most likely body part labels, and the joint proposals shown as front, right,
and top views (overlaid on a depth point cloud). Only the most confident proposal for each joint above a fixed, shared threshold is shown.
To keep the training times down we employ a distributed
implementation. Training 3 trees to depth 20 from 1 million
images takes about a day on a 1000 core cluster.
3.4. Joint position proposals
Body part recognition as described above infers per-pixel
information. This information must now be pooled across
pixels to generate reliable proposals for the positions of 3D
skeletal joints. These proposals are the final output of our
algorithm, and could be used by a tracking algorithm to self-
initialize and recover from failure.
A simple option is to accumulate the global 3D centers
of probability mass for each part, using the known cali-
brated depth. However, outlying pixels severely degrade
the quality of such a global estimate. Instead we employ a
local mode-finding approach based on mean shift [10] with
a weighted Gaussian kernel.
We define a density estimator per body part as
$$f_c(\hat{\mathbf{x}}) \propto \sum_{i=1}^{N} w_{ic} \exp\!\left( -\left\lVert \frac{\hat{\mathbf{x}} - \hat{\mathbf{x}}_i}{b_c} \right\rVert^2 \right), \qquad (7)$$
where $\hat{\mathbf{x}}$ is a coordinate in 3D world space, $N$ is the number of image pixels, $w_{ic}$ is a pixel weighting, $\hat{\mathbf{x}}_i$ is the reprojection of image pixel $\mathbf{x}_i$ into world space given depth $d_I(\mathbf{x}_i)$, and $b_c$ is a learned per-part bandwidth. The pixel weighting $w_{ic}$ considers both the inferred body part probability at the pixel and the world surface area of the pixel:
$$w_{ic} = P(c|I, \mathbf{x}_i) \cdot d_I(\mathbf{x}_i)^2. \qquad (8)$$
This ensures density estimates are depth invariant and gave
a small but significant improvement in joint prediction ac-
curacy. Depending on the definition of body parts, the pos-
terior P (c|I, x) can be pre-accumulated over a small set of
parts. For example, in our experiments the four body parts
covering the head are merged to localize the head joint.
Mean shift is used to find modes in this density efficiently. All pixels above a learned probability threshold $\lambda_c$ are used as starting points for part $c$. A final confidence estimate is given as a sum of the pixel weights reaching each mode. This proved more reliable than taking the modal density estimate.
The detected modes lie on the surface of the body. Each mode is therefore pushed back into the scene by a learned z offset $\zeta_c$ to produce a final joint position proposal. This simple, efficient approach works well in practice. The bandwidths $b_c$, probability threshold $\lambda_c$, and surface-to-interior z offset $\zeta_c$ are optimized per-part on a hold-out validation set of 5000 images by grid search. (As an indication, this resulted in mean bandwidth 0.065m, probability threshold 0.14, and z offset 0.039m.)
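To make the mode-finding step concrete, here is a rough sketch of Eqs. 7-8 and the confidence scoring for a single body part. It assumes pixels have already been reprojected to 3D world coordinates, that the camera axis is the third coordinate (so the z offset is applied there), and it merges modes that converge within one bandwidth; the names and the mode-grouping heuristic are assumptions, not the paper's implementation.

```python
import numpy as np

def pixel_weights(part_probs, depths):
    """Eq. 8: w_ic = P(c | I, x_i) * d_I(x_i)^2."""
    return part_probs * depths ** 2

def mean_shift_mode(points, weights, start, bandwidth, iters=20):
    """Weighted Gaussian mean shift ascent on the Eq. 7 density from one start point."""
    x = np.array(start, dtype=float)
    for _ in range(iters):
        k = weights * np.exp(-np.sum(((points - x) / bandwidth) ** 2, axis=1))
        if k.sum() < 1e-12:
            break
        x = (k[:, None] * points).sum(axis=0) / k.sum()
    return x

def propose_joint(points, part_probs, depths, bandwidth, prob_threshold, z_offset):
    """One proposal for one part: mean shift from high-probability pixels, score each
    mode by the summed weight of pixels reaching it, push the best mode into the body."""
    w = pixel_weights(part_probs, depths)
    start_idx = np.flatnonzero(part_probs > prob_threshold)
    modes, scores = [], []
    for i in start_idx:
        m = mean_shift_mode(points, w, points[i], bandwidth)
        for k, existing in enumerate(modes):
            if np.linalg.norm(existing - m) < bandwidth:   # same basin of attraction
                scores[k] += w[i]
                break
        else:
            modes.append(m)
            scores.append(w[i])
    if not modes:
        return None, 0.0
    best = int(np.argmax(scores))
    proposal = modes[best] + np.array([0.0, 0.0, z_offset])   # push mode back along z
    return proposal, float(scores[best])
```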
4. Experiments
In this section we describe the experiments performed to
evaluate our method. We show both qualitative and quan-
titative results on several challenging datasets, and com-
pare with both nearest-neighbor approaches and the state
of the art [13]. We provide further results in the supple-
mentary material. Unless otherwise specified, parameters
below were set as: 3 trees, 20 deep, 300k training images
per tree, 2000 training example pixels per image, 2000 can-
didate features θ, and 50 candidate thresholds τ per feature.
Test data. We use challenging synthetic and real depth im-
ages to evaluate our approach. For our synthetic test set,
we synthesize 5000 depth images, together with the ground
truth body part labels and joint positions. The original mo-
cap poses used to generate these images are held out from
the training data. Our real test set consists of 8808 frames of
real depth images over 15 different subjects, hand-labeled
with dense body parts and 7 upper body joint positions. We
also evaluate on the real depth data from [13]. The results
suggest that effects seen on synthetic data are mirrored in
the real data, and further that our synthetic test set is by far
the ‘hardest’ due to the extreme variability in pose and body
shape. For most experiments we limit the rotation of the user to ±120° in both training and synthetic test data since the user is facing the camera (0°) in our main entertainment scenario, though we also evaluate the full 360° scenario.
Error metrics. We quantify both classification and joint
prediction accuracy. For classification, we report the av-
erage per-class accuracy, i.e. the average of the diagonal of
the confusion matrix between the ground truth part label and
the most likely inferred part label. This metric weights each body part equally despite their varying sizes.

Citations
Journal ArticleDOI
TL;DR: The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS) as mentioned in this paper was organized in conjunction with the MICCAI 2012 and 2013 conferences, and twenty state-of-the-art tumor segmentation algorithms were applied to a set of 65 multi-contrast MR scans of low and high grade glioma patients.
Abstract: In this paper we report the set-up and results of the Multimodal Brain Tumor Image Segmentation Benchmark (BRATS) organized in conjunction with the MICCAI 2012 and 2013 conferences. Twenty state-of-the-art tumor segmentation algorithms were applied to a set of 65 multi-contrast MR scans of low- and high-grade glioma patients—manually annotated by up to four raters—and to 65 comparable scans generated using tumor image simulation software. Quantitative evaluations revealed considerable disagreement between the human raters in segmenting various tumor sub-regions (Dice scores in the range 74%–85%), illustrating the difficulty of this task. We found that different algorithms worked best for different sub-regions (reaching performance comparable to human inter-rater variability), but that no single algorithm ranked in the top for all sub-regions simultaneously. Fusing several good algorithms using a hierarchical majority vote yielded segmentations that consistently ranked above all individual algorithms, indicating remaining opportunities for further methodological improvements. The BRATS image data and manual annotations continue to be publicly available through an online evaluation system as an ongoing benchmarking resource.

3,699 citations

Proceedings Article
27 Apr 2018
TL;DR: Wang et al. as discussed by the authors proposed a novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data.
Abstract: Dynamics of human body skeletons convey significant information for human action recognition. Conventional approaches for modeling skeletons usually rely on hand-crafted parts or traversal rules, thus resulting in limited expressive power and difficulties of generalization. In this work, we propose a novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data. This formulation not only leads to greater expressive power but also stronger generalization capability. On two large datasets, Kinetics and NTU-RGBD, it achieves substantial improvements over mainstream methods.

2,681 citations

Journal ArticleDOI
Zhengyou Zhang
TL;DR: While the Kinect sensor incorporates several advanced sensing hardware, this article focuses on the vision aspect of the sensor and its impact beyond the gaming industry.
Abstract: Recent advances in 3D depth cameras such as Microsoft Kinect sensors (www.xbox.com/en-US/kinect) have created many opportunities for multimedia computing. The Kinect sensor lets the computer directly sense the third dimension (depth) of the players and the environment. It also understands when users talk, knows who they are when they walk up to it, and can interpret their movements and translate them into a format that developers can use to build new experiences. While the Kinect sensor incorporates several advanced sensing hardware, this article focuses on the vision aspect of the Kinect sensor and its impact beyond the gaming industry.

2,294 citations

Journal ArticleDOI
TL;DR: A new dataset, Human3.6M, of 3.6 Million accurate 3D Human poses, acquired by recording the performance of 5 female and 6 male subjects, under 4 different viewpoints, is introduced for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms.
Abstract: We introduce a new dataset, Human3.6M, of 3.6 Million accurate 3D Human poses, acquired by recording the performance of 5 female and 6 male subjects, under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms. Besides increasing the size of the datasets in the current state-of-the-art by several orders of magnitude, we also aim to complement such datasets with a diverse set of motions and poses encountered as part of typical human activities (taking photos, talking on the phone, posing, greeting, eating, etc.), with additional synchronized image, human motion capture, and time of flight (depth) data, and with accurate 3D body scans of all the subject actors involved. We also provide controlled mixed reality evaluation scenarios where 3D human models are animated using motion capture and inserted using correct 3D geometry, in complex real environments, viewed with moving cameras, and under occlusion. Finally, we provide a set of large-scale statistical models and detailed evaluation baselines for the dataset illustrating its diversity and the scope for improvement by future work in the research community. Our experiments show that our best large-scale model can leverage our full training set to obtain a 20% improvement in performance compared to a training set of the scale of the largest existing public dataset for this problem. Yet the potential for improvement by leveraging higher capacity, more complex models with our large dataset, is substantially vaster and should stimulate future research. The dataset together with code for the associated large-scale learning models, features, visualization tools, as well as the evaluation server, is available online at http://vision.imar.ro/human3.6m .

2,209 citations

Journal ArticleDOI
TL;DR: The state of the art in HAR based on wearable sensors is surveyed and a two-level taxonomy in accordance to the learning approach and the response time is proposed.
Abstract: Providing accurate and opportune information on people's activities and behaviors is one of the most important tasks in pervasive computing. Innumerable applications can be visualized, for instance, in medical, security, entertainment, and tactical scenarios. Despite human activity recognition (HAR) being an active field for more than a decade, there are still key aspects that, if addressed, would constitute a significant turn in the way people interact with mobile devices. This paper surveys the state of the art in HAR based on wearable sensors. A general architecture is first presented along with a description of the main components of any HAR system. We also propose a two-level taxonomy in accordance to the learning approach (either supervised or semi-supervised) and the response time (either offline or online). Then, the principal issues and challenges are discussed, as well as the main solutions to each one of them. Twenty eight systems are qualitatively evaluated in terms of recognition performance, energy consumption, obtrusiveness, and flexibility, among others. Finally, we present some open problems and ideas that, due to their high relevance, should be addressed in future research.

2,184 citations

References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, aaa, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations

Journal ArticleDOI
TL;DR: In this paper, an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail, is described, and a reported shortcoming of the basic algorithm is discussed.
Abstract: The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.

17,177 citations

Journal ArticleDOI
TL;DR: It is proved the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and, thus, its utility in detecting the modes of the density.
Abstract: A general non-parametric technique is proposed for the analysis of a complex multimodal feature space and to delineate arbitrarily shaped clusters in it. The basic computational module of the technique is an old pattern recognition procedure: the mean shift. For discrete data, we prove the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and, thus, its utility in detecting the modes of the density. The relation of the mean shift procedure to the Nadaraya-Watson estimator from kernel regression and the robust M-estimators; of location is also established. Algorithms for two low-level vision tasks discontinuity-preserving smoothing and image segmentation - are described as applications. In these algorithms, the only user-set parameter is the resolution of the analysis, and either gray-level or color images are accepted as input. Extensive experimental results illustrate their excellent performance.

11,727 citations


"Real-time human pose recognition in..." refers methods in this paper

  • ...Finally, spatial modes of the inferred per-pixel distributions are computed using mean shift [ 10 ] resulting in the 3D joint proposals....

    [...]

  • ...Instead we employ a local mode-finding approach based on mean shift [ 10 ] with a weighted Gaussian kernel....

    [...]

Journal ArticleDOI
TL;DR: This paper presents work on computing shape models that are computationally fast and invariant basic transformations like translation, scaling and rotation, and proposes shape detection using a feature called shape context, which is descriptive of the shape of the object.
Abstract: We present a novel approach to measuring similarity between shapes and exploit it for object recognition. In our framework, the measurement of similarity is preceded by: (1) solving for correspondences between points on the two shapes; (2) using the correspondences to estimate an aligning transform. In order to solve the correspondence problem, we attach a descriptor, the shape context, to each point. The shape context at a reference point captures the distribution of the remaining points relative to it, thus offering a globally discriminative characterization. Corresponding points on two similar shapes will have similar shape contexts, enabling us to solve for correspondences as an optimal assignment problem. Given the point correspondences, we estimate the transformation that best aligns the two shapes; regularized thin-plate splines provide a flexible class of transformation maps for this purpose. The dissimilarity between the two shapes is computed as a sum of matching errors between corresponding points, together with a term measuring the magnitude of the aligning transform. We treat recognition in a nearest-neighbor classification framework as the problem of finding the stored prototype shape that is maximally similar to that in the image. Results are presented for silhouettes, trademarks, handwritten digits, and the COIL data set.

6,693 citations

Journal ArticleDOI
TL;DR: This survey reviews recent trends in video-based human capture and analysis, as well as discussing open problems for future research to achieve automatic visual analysis of human movement.

2,738 citations

Frequently Asked Questions (19)
Q1. What are the contributions in "Real-time human pose recognition in parts from single depth images"?

The authors propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. The authors take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Finally the authors generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The authors achieve state of the art accuracy in their comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching. 

As future work, the authors plan further study of the variability in the source mocap data, the properties of the generative model underlying the synthesis pipeline, and the particular part definitions. 

Depth cameras offer several advantages over traditional intensity sensors, working in low light levels, giving a calibrated scale estimate, being color and texture invariant, and resolving silhouette ambiguities in pose. 

Using a highly varied synthetic training set allowed us to train very deep decision forests using simple depth-invariant features without overfitting, learning invariance to both pose and shape. 

For their synthetic test set, the authors synthesize 5000 depth images, together with the ground truth body part labels and joint positions. 

The authors train a deep randomized decision forest classifier which avoids overfitting by using hundreds of thousands of training images. 

The database consists of approximately 500k frames in a few hundred sequences of driving, dancing, kicking, running, navigating menus, etc. 

generating realistic intensity images using computer graphics techniques [33, 27, 26] is hampered by the huge color and texture variability induced by clothing, hair, and skin, often meaning that the data are reduced to 2D silhouettes [1]. 

The authors have found it necessary to iterate the process of motion capture, sampling from their model, training the classifier, and testing joint prediction accuracy in order to refine the mocap database with regions of pose space that had been previously missed out. 

The authors believe this dataset to considerably advance the state of the art in both scale and variety, and demonstrate the importance of such a large dataset in their evaluation. 

Illustrated in Fig. 1 and inspired by recent object recognition work that divides objects into parts (e.g. [12, 43]), their approach is driven by two key design goals: computational efficiency and robustness. 

Auto-context was used in [40] to obtain a coarse body part labeling but this was not defined to localize joints and classifying each frame took about 40 seconds. 

For the corresponding joint prediction using a 2D metric with a 10 pixel true positive threshold, the authors got 0.539 mAP with scale and 0.465 mAP without. 

The authors thus discard many similar, redundant poses from the initial mocap data using ‘furthest neighbor’ clustering [15], where the distance between poses $p_1$ and $p_2$ is defined as $\max_j \lVert p_1^j - p_2^j \rVert_2$, the maximum Euclidean distance over body joints j. 

The fourth example shows a failure to generalize well to an unseen pose, but the confidence gates bad proposals, maintaining high precision at the expense of recall. 

The results suggest that effects seen on synthetic data are mirrored in the real data, and further that their synthetic test set is by far the ‘hardest’ due to the extreme variability in pose and body shape. 

In Fig. 6(a) the authors show how test accuracy increases approximately logarithmically with the number of randomly generated training images, though starts to tail off around 100k images. 

Often (as with the second and third failure examples) the most likely body part is incorrect, but there is still sufficient correct probability mass in distribution P (c|I,x) that an accurate proposal can still be generated. 

Their approach can propose joint positions for multiple people in the image, since the per-pixel classifier generalizes well even without explicit training for this scenario.