Real-time human pose recognition in parts from single depth images
Summary
1. Introduction
- Robust interactive human body tracking has applications including gaming, human-computer interaction, security, telepresence, and even health-care.
- In particular, until the launch of Kinect [21], no prior system ran at interactive rates on consumer hardware while handling a full range of human body shapes and sizes undergoing general body motions.
- Reprojecting the inferred parts into world space, the authors localize spatial modes of each part distribution and thus generate (possibly several) confidence-weighted proposals for the 3D locations of each skeletal joint.
- The authors' experiments also yield several insights: (i) synthetic depth training data is an excellent proxy for real data; (ii) scaling up the learning problem with varied synthetic data is important for high accuracy; and (iii) their parts-based approach generalizes better than even an oracular exact nearest neighbor.
- Felzenszwalb & Huttenlocher [11] apply pictorial structures to estimate pose efficiently.
2. Data
- Pose estimation research has often focused on techniques to overcome lack of training data [25], because of two problems.
- First, generating realistic intensity images using computer graphics techniques [33, 27, 26] is hampered by the huge color and texture variability induced by clothing, hair, and skin, often meaning that the data are reduced to 2D silhouettes [1].
- The second limitation is that synthetic body pose images are of necessity fed by motion-capture data.
- Although techniques exist to simulate human motion (e.g. [38]) they do not yet produce the range of volitional motions of a human subject.
- The authors believe this dataset to considerably advance the state of the art in both scale and variety, and demonstrate the importance of such a large dataset in their evaluation.
2.1. Depth imaging
- Depth imaging technology has advanced dramatically over the last few years, finally reaching a consumer price point with the launch of Kinect [21].
- Pixels in a depth image indicate calibrated depth in the scene, rather than a measure of intensity or color.
- The authors employ the Kinect camera which gives a 640x480 image at 30 frames per second with depth resolution of a few centimeters.
- Depth cameras offer several advantages over traditional intensity sensors, working in low light levels, giving a calibrated scale estimate, being color and texture invariant, and resolving silhouette ambiguities in pose.
- But most importantly for their approach, it is straightforward to synthesize realistic depth images of people and thus build a large training dataset cheaply.
2.2. Motion capture data
- The human body is capable of an enormous range of poses which are difficult to simulate.
- Instead, the authors capture a large database of motion capture of human actions.
- The authors' aim was to span the wide variety of poses people would make in an entertainment scenario.
- Often, changes in pose from one mocap frame to the next are so small as to be insignificant.
- The authors thus discard many similar, redundant poses from the initial mocap data using 'furthest neighbor' clustering [15], where the distance between poses p1 and p2 is defined as max_j ‖p_j^1 − p_j^2‖_2, the maximum Euclidean distance over body joints j.
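The 'furthest neighbor' subsampling described above can be sketched as a greedy furthest-point selection under the max-over-joints distance. This is a minimal illustration, not the authors' code; the function names and the pose representation (an array of 3D joint positions) are assumptions:

```python
import numpy as np

def pose_distance(p1, p2):
    """Distance between two poses: the maximum Euclidean distance
    over corresponding body joints (arrays of shape (num_joints, 3))."""
    return np.max(np.linalg.norm(p1 - p2, axis=1))

def furthest_neighbor_subsample(poses, max_poses):
    """Greedy furthest-point subsampling: repeatedly keep the pose
    furthest (under pose_distance) from all poses kept so far, which
    discards near-duplicate mocap frames while preserving variety."""
    kept = [0]  # seed with the first pose
    # distance of every pose to its nearest kept pose so far
    dists = np.array([pose_distance(poses[0], p) for p in poses])
    while len(kept) < max_poses:
        idx = int(np.argmax(dists))
        kept.append(idx)
        new_d = np.array([pose_distance(poses[idx], p) for p in poses])
        dists = np.minimum(dists, new_d)
    return kept
```

The greedy selection gives the usual 2-approximation guarantee for furthest-point clustering, so the retained poses cover pose space roughly evenly.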
2.3. Generating synthetic data
- The authors build a randomized rendering pipeline from which they can sample fully labeled training images.
- The authors' goals in building this pipeline were twofold: realism and variety.
- For the learned model to work well, the samples must closely resemble real camera images, and contain good coverage of the appearance variations the authors hope to recognize at test time.
- While depth/scale and translation variations are handled explicitly in their features (see below), other invariances cannot be encoded efficiently.
- Further slight random variation in height and weight give extra coverage of body shapes.
3.1. Body part labeling
- A key contribution of this work is their intermediate body part representation.
- Some of these parts are defined to directly localize particular skeletal joints of interest, while others fill the gaps or could be used in combination to predict other joints.
- The authors' intermediate representation transforms the problem into one that can readily be solved by efficient classification algorithms; the authors show in Sec. 4.3 that the penalty paid for this transformation is small.
- The pairs of depth and body part images are used as fully labeled data for learning the classifier (see below).
- In an upper body tracking scenario, all the lower body parts could be merged.
3.2. Depth image features
- The authors employ simple depth comparison features, inspired by those in [20].
- The features are thus 3D translation invariant (modulo perspective effects).
- Eq. 1 will give a large positive response for pixels x near the top of the body, but a value close to zero for pixels x lower down the body; in this sense, feature fθ1 'looks upwards'.
- The design of these features was strongly motivated by their computational efficiency: no preprocessing is needed; each feature need only read at most 3 image pixels and perform at most 5 arithmetic operations; and the features can be straightforwardly implemented on the GPU.
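The depth comparison feature of Eq. 1 can be sketched as below. This is a hedged illustration, not the paper's implementation: the depth image is assumed to be a NumPy array indexed (row, col), and the large-constant depth for background or off-image probes follows the paper's convention, but the names and values here are assumptions:

```python
import numpy as np

BACKGROUND_DEPTH = 1e6  # large constant depth for background / off-image probes

def depth_at(depth, px):
    """Probe the depth image; off-image probes return the large constant."""
    y, x = px
    h, w = depth.shape
    if 0 <= y < h and 0 <= x < w:
        return depth[y, x]
    return BACKGROUND_DEPTH

def feature(depth, x, u, v):
    """Depth comparison feature (Eq. 1):
    f_theta(I, x) = d(x + u / d(x)) - d(x + v / d(x)).
    Normalizing the pixel offsets u, v by the depth at x makes the
    feature 3D translation invariant (modulo perspective effects)."""
    d_x = depth[x]
    px_u = (int(x[0] + u[0] / d_x), int(x[1] + u[1] / d_x))
    px_v = (int(x[0] + v[0] / d_x), int(x[1] + v[1] / d_x))
    return depth_at(depth, px_u) - depth_at(depth, px_v)
```

With u pointing upwards and v = 0, the feature fires strongly at the top of the body (the upward probe lands on background) and near zero lower down, matching the fθ1 example in the text. Each evaluation reads at most 3 pixels, consistent with the efficiency claim.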
3.3. Randomized decision forests
- At the leaf node reached in tree t, a learned distribution Pt(c|I,x) over body part labels c is stored.
- A random subset of 2000 example pixels from each image is chosen to ensure a roughly even distribution across body parts.
- Each tree is trained using the following algorithm [20]: 1. Randomly propose a set of splitting candidates φ = (θ, τ) (feature parameters θ and thresholds τ).
- To keep the training times down the authors employ a distributed implementation.
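Inference with a trained forest is simple: each tree routes a pixel through split nodes by thresholding a depth feature, and the leaf distributions Pt(c|I,x) are averaged over trees. A minimal sketch, assuming a generic feature function; the class names and structure are illustrative, not the authors' implementation:

```python
import numpy as np

class SplitNode:
    def __init__(self, theta, tau, left, right):
        self.theta, self.tau = theta, tau      # feature parameters and threshold
        self.left, self.right = left, right

class Leaf:
    def __init__(self, dist):
        self.dist = dist  # learned distribution P_t(c | I, x) over part labels

def tree_posterior(node, feature_fn, image, x):
    """Walk one tree: branch on f_theta(I, x) < tau until a leaf is reached."""
    while isinstance(node, SplitNode):
        node = node.left if feature_fn(image, x, node.theta) < node.tau else node.right
    return node.dist

def forest_posterior(trees, feature_fn, image, x):
    """Average the leaf distributions over all trees to get P(c | I, x)."""
    dists = [tree_posterior(t, feature_fn, image, x) for t in trees]
    return np.mean(dists, axis=0)
```

Because every pixel is classified independently, this inference step parallelizes trivially across pixels, which is what makes GPU implementation straightforward.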
3.4. Joint position proposals
- Body part recognition as described above infers per-pixel information.
- These proposals are the final output of their algorithm, and could be used by a tracking algorithm to self-initialize and recover from failure.
- Depending on the definition of body parts, the posterior P (c|I,x) can be pre-accumulated over a small set of parts.
- Mean shift is used to find modes in this density efficiently.
- A final confidence estimate is given as a sum of the pixel weights reaching each mode.
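The mode-finding step can be sketched as a weighted-Gaussian-kernel mean shift over the reprojected 3D points, with per-point weights derived from the part posterior. This is a minimal sketch under stated assumptions; the bandwidth, weighting scheme, and function names are illustrative, not the paper's exact procedure:

```python
import numpy as np

def mean_shift_mode(points, weights, start, bandwidth, iters=50, tol=1e-5):
    """One mean-shift ascent with a weighted Gaussian kernel:
    repeatedly move to the kernel-weighted mean of the 3D points
    until the update is below tol, returning a density mode."""
    m = np.asarray(start, dtype=float)
    for _ in range(iters):
        d2 = np.sum((points - m) ** 2, axis=1)
        k = weights * np.exp(-d2 / (2.0 * bandwidth ** 2))
        new_m = (k[:, None] * points).sum(axis=0) / k.sum()
        if np.linalg.norm(new_m - m) < tol:
            break
        m = new_m
    return m
```

Starting ascents from many high-posterior pixels and de-duplicating the converged points yields the (possibly several) confidence-weighted joint proposals; summing the kernel weights of the pixels reaching each mode gives the confidence estimate mentioned above.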
4. Experiments
- In this section the authors describe the experiments performed to evaluate their method.
- For their synthetic test set, the authors synthesize 5000 depth images, together with the ground truth body part labels and joint positions.
- The authors quantify both classification and joint prediction accuracy.
- Any joint proposals outside D meters also count as false positives.
- The authors set D = 0.1m below, approximately the accuracy of the hand-labeled real test data ground truth.
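The true-positive rule above can be sketched for a single joint as follows. This is a hedged illustration of a first-match scoring convention, assuming confidence-sorted proposals; the exact matching protocol in the paper may differ:

```python
import numpy as np

def score_proposals(proposals, gt, D=0.1):
    """Score (confidence, position) proposals for one joint against the
    ground-truth 3D position gt: the highest-confidence proposal within
    D meters is a true positive; every other proposal is a false positive."""
    tp, fp, matched = 0, 0, False
    for conf, pos in sorted(proposals, key=lambda p: -p[0]):
        close = np.linalg.norm(np.asarray(pos) - np.asarray(gt)) <= D
        if close and not matched:
            tp += 1
            matched = True
        else:
            fp += 1
    return tp, fp
```

Accumulating such counts across joints and frames at successive confidence thresholds gives the precision-recall curves from which average precision, and then mAP, is computed.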
4.1. Qualitative results
- Fig. 5 shows example inferences of their algorithm.
- Note the high accuracy of both classification and joint prediction across large variations in body and camera pose, depth in scene, cropping, and body size and shape (e.g. small child vs. heavy adult).
- The bottom row shows some failure modes of the body part classification.
- The first example shows a failure to distinguish subtle changes in the depth image such as the crossed arms.
- Often (as with the second and third failure examples) the most likely body part is incorrect, but there is still sufficient correct probability mass in distribution P (c|I,x) that an accurate proposal can still be generated.
4.2. Classification accuracy
- The authors investigate the effect of several training parameters on classification accuracy.
- The authors also show in Fig. 6(a) the quality of their approach on synthetic silhouette images, where the features in Eq. 1 are either given scale (as the mean depth) or not (a fixed constant depth).
- Using only 15k images the authors observe overfitting beginning around depth 17, but the enlarged 900k training set avoids this.
- The authors compare the actual performance of their system (red) with the best achievable result (blue) given the ground truth body part labels.
- Accuracy increases with the maximum probe offset, though levels off around 129 pixel-meters.
4.3. Joint prediction accuracy
- In Fig. 7 the authors show average precision results on the synthetic test set, achieving 0.731 mAP.
- The authors compare an idealized setup that is given the ground truth body part labels to the real setup using inferred body parts.
- The speed of nearest neighbor chamfer matching is also drastically slower (2 fps) than their algorithm.
- The authors of [13] provided their test data and results for direct comparison.
- To evaluate the full 360° rotation scenario, the authors trained a forest on 900k images containing full rotations and tested on 5k synthetic full rotation images (with held out poses).
5. Discussion
- The authors have seen how accurate proposals for the 3D locations of body joints can be estimated in super real-time from single depth images.
- Detecting modes in a density function gives the final set of confidence-weighted 3D joint proposals.
- Whether a similarly efficient approach can directly regress joint positions is also an open question.
- Perhaps a global estimate of latent variables such as coarse person orientation could be used to condition the body part inference and remove ambiguities in local pose estimates.
Frequently Asked Questions (19)
Q2. What future works have the authors mentioned in the paper "Real-time human pose recognition in parts from single depth images" ?
As future work, the authors plan further study of the variability in the source mocap data, the properties of the generative model underlying the synthesis pipeline, and the particular part definitions.
Q3. What are the advantages of depth cameras?
Depth cameras offer several advantages over traditional intensity sensors, working in low light levels, giving a calibrated scale estimate, being color and texture invariant, and resolving silhouette ambiguities in pose.
Q4. How did the authors train the simulated forest?
Using a highly varied synthetic training set allowed the authors to train very deep decision forests using simple depth-invariant features without overfitting, learning invariance to both pose and shape.
Q5. How many depth images are used in their synthetic test set?
For their synthetic test set, the authors synthesize 5000 depth images, together with the ground truth body part labels and joint positions.
Q6. How do the authors train a deep decision forest?
The authors train a deep randomized decision forest classifier which avoids overfitting by using hundreds of thousands of training images.
Q7. How many frames and sequences does the mocap database contain?
The database consists of approximately 500k frames in a few hundred sequences of driving, dancing, kicking, running, navigating menus, etc.
Q8. What are the main problems of generating realistic intensity images using computer graphics techniques?
Generating realistic intensity images using computer graphics techniques [33, 27, 26] is hampered by the huge color and texture variability induced by clothing, hair, and skin, often meaning that the data are reduced to 2D silhouettes [1].
Q9. Why did the authors iterate when building the mocap database?
The authors have found it necessary to iterate the process of motion capture, sampling from their model, training the classifier, and testing joint prediction accuracy in order to refine the mocap database with regions of pose space that had been previously missed out.
Q10. What is the significance of this dataset?
The authors believe this dataset to considerably advance the state of the art in both scale and variety, and demonstrate the importance of such a large dataset in their evaluation.
Q11. What are the key design goals of this paper?
Illustrated in Fig. 1 and inspired by recent object recognition work that divides objects into parts (e.g. [12, 43]), their approach is driven by two key design goals: computational efficiency and robustness.
Q12. How long did it take to obtain a coarse body part labeling?
Auto-context was used in [40] to obtain a coarse body part labeling but this was not defined to localize joints and classifying each frame took about 40 seconds.
Q13. What mAP did the authors achieve with and without scale?
For the corresponding joint prediction using a 2D metric with a 10 pixel true positive threshold, the authors got 0.539 mAP with scale and 0.465 mAP without.
Q14. How do the authors discard poses from the initial mocap data?
The authors thus discard many similar, redundant poses from the initial mocap data using 'furthest neighbor' clustering [15], where the distance between poses p1 and p2 is defined as max_j ‖p_j^1 − p_j^2‖_2, the maximum Euclidean distance over body joints j.
Q15. What is the example of a failure to generalize to an unseen pose?
The fourth example shows a failure to generalize well to an unseen pose, but the confidence gates bad proposals, maintaining high precision at the expense of recall.
Q16. What do the results on the synthetic test set suggest?
The results suggest that effects seen on synthetic data are mirrored in the real data, and further that their synthetic test set is by far the ‘hardest’ due to the extreme variability in pose and body shape.
Q17. How does test accuracy vary with the number of training images?
In Fig. 6(a) the authors show how test accuracy increases approximately logarithmically with the number of randomly generated training images, though starts to tail off around 100k images.
Q18. What happens when the most likely body part is incorrect?
Often (as with the second and third failure examples) the most likely body part is incorrect, but there is still sufficient correct probability mass in distribution P (c|I,x) that an accurate proposal can still be generated.
Q19. How can the authors predict joint positions for multiple people in the image?
Their approach can propose joint positions for multiple people in the image, since the per-pixel classifier generalizes well even without explicit training for this scenario.