A local basis representation for estimating human pose from cluttered images
Frequently Asked Questions (13)
Q2. What are the contributions in "A local basis representation for estimating human pose from cluttered images"?
This paper presents a bottom-up approach that uses local image features to estimate human upper-body pose from single images with cluttered backgrounds. The method encodes the image window as a dense grid of local gradient orientation histograms, then applies non-negative matrix factorization to learn a set of bases that correspond to local features on the human body, enabling selective encoding of human-like features in the presence of background clutter. This lets the method key on gradient patterns, such as shoulder contours and bent elbows, that are characteristic of humans and carry important pose information, unlike current regression-based methods, which either use weak limb detectors or require prior segmentation to work. The authors show that it estimates pose at performance levels similar to current example-based methods but, unlike them, works in the presence of natural backgrounds without any prior segmentation.
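The first stage described above, a dense grid of gradient orientation histograms, can be sketched as follows. This is a simplified stand-in rather than the authors' SIFT-based descriptor: the function name `grid_orientation_histograms`, the cell size, and the bin count are all assumptions chosen for illustration.

```python
import numpy as np

def grid_orientation_histograms(img, cell=8, n_bins=8):
    """Magnitude-weighted orientation histogram for each cell of a regular grid.

    Simplified illustration only: no Gaussian weighting, no interpolation
    between bins, and no block normalization.
    """
    gy, gx = np.gradient(img.astype(float))      # derivatives along rows, columns
    mag = np.hypot(gx, gy)                       # gradient magnitude
    ang = np.arctan2(gy, gx) % (2 * np.pi)       # orientation in [0, 2*pi)
    bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)

    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((rows, cols, n_bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bins[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=n_bins)
    return hist

# Toy image: dark left half, bright right half (a single vertical edge).
img = np.zeros((32, 32))
img[:, 16:] = 1.0
hist = grid_orientation_histograms(img)
x = hist.ravel()   # one non-negative feature vector for the whole window
```

Because every entry is a sum of gradient magnitudes, the resulting vector is non-negative, which is what makes a non-negative factorization applicable to it.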
Q3. What is the effect of sparsity on the performance of the NMF?
Varying the sparsity of the basis vectors W has very little effect on the performance, while varying the sparsity of the coefficients H gives results spanning the range of performances from k-means to unconstrained NMF.
Q4. What is the advantage of using NMF to represent images?
Besides capturing the local edges representative of human contours, the NMF bases allow us to compactly code each 128-d SIFT descriptor directly by its corresponding vector h of basis coefficients.
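As a rough illustration of coding a descriptor by its basis coefficients, the sketch below factors a matrix of non-negative descriptors with the standard multiplicative-update rules for the Frobenius objective, then codes new descriptors against the fixed bases. The data are random stand-ins for 128-d SIFT descriptors, and the rank of 30, the iteration counts, and the function names `nmf` and `encode` are assumptions, not values taken from the paper.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor non-negative V (d x n) as W (d x rank) @ H (rank x n)."""
    rng = np.random.default_rng(seed)
    d, n = V.shape
    W = rng.random((d, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep both factors non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def encode(W, v, n_iter=200, eps=1e-9):
    """Code descriptors v (d x m) as coefficients h (rank x m) with W held fixed."""
    h = np.full((W.shape[1], v.shape[1]), 1.0 / W.shape[1])
    WtV, WtW = W.T @ v, W.T @ W
    for _ in range(n_iter):
        h *= WtV / (WtW @ h + eps)
    return h

# Random non-negative stand-ins for 128-d descriptors (hypothetical data).
rng = np.random.default_rng(1)
V = rng.random((128, 300))
W, H = nmf(V, rank=30)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

h = encode(W, V[:, :5])   # each 128-d descriptor becomes a 30-d coefficient vector
```

The compactness mentioned above shows up in the shapes: each 128-d column of `V` is replaced by a column of `h` whose length equals the number of bases.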
Q5. What is the effect of sparsity prior on H?
As the sparsity prior on H is increased to a maximum, NMF is forced to use only a few basis vectors for each training example, in the extreme case giving a solution very similar to k-means.
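The link to k-means can be made explicit with a short derivation. The notation below (descriptor columns $v_i$, coefficient columns $h_i$, basis columns $w_k$, selected index $k(i)$) is introduced here for illustration, and the exact form of the sparsity prior used by the authors is not given in this text.

```latex
% NMF approximates each descriptor as a non-negative combination of bases:
\[
  V \approx W H, \qquad W \ge 0,\; H \ge 0,
  \qquad\text{i.e.}\qquad
  v_i \approx \sum_{k} H_{k i}\, w_k .
\]
% In the sparsest case each column h_i has a single non-zero entry, at index k(i):
\[
  v_i \approx H_{k(i)\, i}\; w_{k(i)} .
\]
% k-means makes the same kind of hard assignment, with the coefficient fixed to 1
% and w_k playing the role of the cluster center c_k:
\[
  v_i \approx c_{k(i)}, \qquad k(i) = \arg\min_{k} \lVert v_i - c_k \rVert^2 .
\]
```

In both cases each descriptor is explained by exactly one prototype, which is why the two encodings behave similarly at that extreme.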
Q6. How is the error in the cluttered-image experiment split across coordinates?
Of the 10.88 cm of error obtained in the experiment on cluttered images, 9.65 cm comes from x and y, while 12.97 cm comes from errors in z.
Q7. What does the relative coarseness of the spatial coding provide?
The relative coarseness of the spatial coding provides some robustness to small position variations, while still capturing the essential spatial position and limb orientation information.
Q8. Why does a linear regressor perform so well?
A linear regressor on the vector x performs very well despite the clutter. An examination of the elements of the weight matrix A reveals that this is due to automatic downweighting of descriptor elements that usually contain only background.
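The downweighting effect can be reproduced on synthetic data. The sketch below fits a ridge-regularized linear map y = A x in closed form and compares the learned weights on pose-dependent features with those on pure clutter. All dimensions, the noise levels, and the regularization strength `lam` are assumptions, and the data are randomly generated rather than image descriptors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_fg, d_bg, d_pose = 500, 10, 10, 3   # hypothetical sizes

# Synthetic pose vectors y and a hidden linear map from pose to "foreground" features.
pose = rng.normal(size=(n, d_pose))
M = rng.normal(size=(d_pose, d_fg))
x_fg = pose @ M + 0.1 * rng.normal(size=(n, d_fg))   # features that depend on pose
x_bg = rng.random((n, d_bg))                         # clutter, independent of pose
X = np.hstack([x_fg, x_bg])                          # one descriptor vector x per row

# Ridge regression in closed form: A = Y^T X (X^T X + lam I)^{-1}, so that y ~ A x.
lam = 1.0
A = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ pose).T   # (d_pose, d)

# Average weight magnitude per input element, split by feature type.
col_norm = np.linalg.norm(A, axis=0)
fg_weight = col_norm[:d_fg].mean()
bg_weight = col_norm[d_fg:].mean()
print(f"mean |weight| on pose features: {fg_weight:.3f}, on clutter: {bg_weight:.3f}")
```

No explicit feature selection is needed here: elements that carry no information about the pose simply receive small weights in the least-squares solution.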
Q9. How were image patches represented when using k-means?
Each image patch was represented by softly vector quantizing its SIFT descriptor, voting into each of its corresponding k-means centers, i.e. as a sparse vector of similarity weights computed from each cluster center.
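A minimal sketch of soft vector quantization follows. The text does not specify the similarity function or how sparsity is obtained, so the Gaussian kernel, the bandwidth `sigma`, and the `top_k` truncation below are assumptions, as is the function name `soft_vq`.

```python
import numpy as np

def soft_vq(descriptors, centers, sigma=1.0, top_k=3):
    """Return one row of similarity weights per descriptor (rows sum to 1)."""
    # Squared Euclidean distance from every descriptor to every center: (n, k).
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Gaussian similarity; subtracting the row minimum avoids underflow and
    # cancels out in the normalization below.
    w = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / (2 * sigma ** 2))
    # Keep only the top_k largest weights per row (more only in case of ties).
    if top_k < centers.shape[0]:
        thresh = np.partition(w, -top_k, axis=1)[:, -top_k][:, None]
        w = np.where(w >= thresh, w, 0.0)
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical data: 8 cluster centers in 128-d, and two descriptors that sit
# close to centers 2 and 5 respectively.
rng = np.random.default_rng(0)
centers = rng.random((8, 128))
descriptors = centers[[2, 5]] + 0.01 * rng.normal(size=(2, 128))
weights = soft_vq(descriptors, centers)
```

With `top_k=1` this reduces to hard assignment to the nearest center, which is the plain k-means coding.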
Q10. How did the authors compute errors in the x and y coordinates?
To see the effect of depth ambiguities on these results, the authors computed errors separately in the x and y coordinates, which correspond to the image plane, and in z, which corresponds to depth.
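One way to split an error figure by axis is sketched below. The exact error measure used by the authors is not stated here, so the RMS-per-coordinate convention, the function name `axis_errors`, the joint count, and the synthetic data are all assumptions.

```python
import numpy as np

def axis_errors(pred, true):
    """RMS error per coordinate, overall and split into image plane (x, y) and depth (z).

    pred, true: arrays of shape (n_samples, n_joints, 3), in the same units.
    """
    diff = pred - true

    def rms(a):
        return float(np.sqrt(np.mean(a ** 2)))

    return {"all": rms(diff), "xy": rms(diff[..., :2]), "z": rms(diff[..., 2])}

# Synthetic example: predictions that are noisier in depth than in the image plane.
rng = np.random.default_rng(0)
true = rng.normal(size=(100, 8, 3)) * 20.0
noise = rng.normal(size=true.shape) * np.array([5.0, 5.0, 10.0])
errs = axis_errors(true + noise, true)
```

Under this particular convention the three figures are tied together by all² = (2·xy² + z²) / 3, so the overall error always lies between the image-plane error and the depth error.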
Q11. What is the performance of the regressor?
The best performance, as expected, is obtained by training and testing on clean, background-free images, irrespective of the descriptor encoding used.
Q12. When can methods based on a 3D body model produce accurate results?
With suitable initialization or sufficiently fine sampling, methods that fit a 3D body model can produce accurate results, but the computational cost is high.
Q13. What is the approach to pose inference?
The authors prefer to take a bottom-up approach to the problem, considering pose inference from general images in terms of two interdependent sub-problems: (i) identifying/localizing the human parts of interest in the image, and (ii) estimating 3D pose from them.