Proceedings ArticleDOI

Monocular 3D Human Pose Estimation by Predicting Depth on Joints

TL;DR: The empirical evaluation on the Human3.6M and HHOI datasets demonstrates the advantage of combining the global 2D skeleton and local image patches for depth prediction, and the superior quantitative and qualitative performance relative to state-of-the-art methods.
Abstract: This paper aims at estimating full-body 3D human poses from monocular images, where the biggest challenge is the inherent ambiguity introduced by lifting the 2D pose into 3D space. We propose a novel framework that reduces this ambiguity by predicting the depth of human joints from 2D joint locations and body part images. Our approach is built on a two-level hierarchy of Long Short-Term Memory (LSTM) networks that can be trained end-to-end. The first level consists of two components: 1) a skeleton-LSTM, which learns depth information from global human skeleton features; and 2) a patch-LSTM, which utilizes the local image evidence around joint locations. Both networks have a tree structure defined on the kinematic relations of the human skeleton, so information at different joints is broadcast through the whole skeleton in a top-down fashion. The two networks are first pre-trained separately on different data sources and then aggregated in the second level for final depth prediction. The empirical evaluation on the Human3.6M and HHOI datasets demonstrates the advantage of combining the global 2D skeleton and local image patches for depth prediction, and our superior quantitative and qualitative performance relative to state-of-the-art methods.
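
As a rough illustration of the architecture the abstract describes, the sketch below (our own, not the authors' code) runs a shared LSTM cell top-down over a kinematic tree, passing each joint's hidden state to its children and regressing one depth value per joint. The joint names, tree layout, and feature sizes are illustrative assumptions; the patch-LSTM would run the same way over per-joint image-patch features, with the second level fusing the two sets of hidden states.

    import torch
    import torch.nn as nn

    KINEMATIC_TREE = {  # child -> parent; the root ("pelvis") has no parent
        "spine": "pelvis", "head": "spine",
        "l_shoulder": "spine", "l_elbow": "l_shoulder", "l_wrist": "l_elbow",
        "r_shoulder": "spine", "r_elbow": "r_shoulder", "r_wrist": "r_elbow",
        "l_hip": "pelvis", "l_knee": "l_hip", "l_ankle": "l_knee",
        "r_hip": "pelvis", "r_knee": "r_hip", "r_ankle": "r_knee",
    }

    class SkeletonLSTM(nn.Module):
        def __init__(self, feat_dim=34, hidden=128):
            super().__init__()
            self.cell = nn.LSTMCell(feat_dim, hidden)  # shared across joints
            self.depth_head = nn.Linear(hidden, 1)     # one depth per joint
            self.hidden = hidden

        def forward(self, joint_feats):
            # joint_feats: {joint_name: (batch, feat_dim)} 2D-skeleton features
            batch = joint_feats["pelvis"].size(0)
            zero = joint_feats["pelvis"].new_zeros(batch, self.hidden)
            states, depths = {}, {}
            for joint in ["pelvis"] + list(KINEMATIC_TREE):  # parents first
                h_p, c_p = states.get(KINEMATIC_TREE.get(joint), (zero, zero))
                h, c = self.cell(joint_feats[joint], (h_p, c_p))  # top-down pass
                states[joint], depths[joint] = (h, c), self.depth_head(h)
            return depths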
Citations
Journal ArticleDOI
TL;DR: Wang et al. propose a deep video saliency network consisting of two modules, capturing spatial and temporal saliency information respectively, which directly produces spatiotemporal saliency inference without time-consuming optical flow computation.
Abstract: This paper proposes a deep learning model to efficiently detect salient regions in videos. It addresses two important issues: 1) deep video saliency model training in the absence of sufficiently large, pixel-wise annotated video data and 2) fast video saliency training and detection. The proposed deep video saliency network consists of two modules, for capturing the spatial and temporal saliency information, respectively. The dynamic saliency model, explicitly incorporating saliency estimates from the static saliency model, directly produces spatiotemporal saliency inference without time-consuming optical flow computation. We further propose a novel data augmentation technique that simulates video training data from existing annotated image data sets, which enables our network to learn diverse saliency information and prevents overfitting with the limited number of training videos. Leveraging our synthetic video data (150K video sequences) and real videos, our deep video saliency model successfully learns both spatial and temporal saliency cues, thus producing accurate spatiotemporal saliency estimates. We advance the state of the art on the densely annotated video segmentation data set (MAE of 0.06) and the Freiburg-Berkeley Motion Segmentation data set (MAE of 0.07), and do so with much improved speed (2 fps with all steps).
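
A minimal sketch of the two-module design described above (an illustration, not the paper's code): a static network predicts per-frame saliency, and a dynamic network takes a pair of consecutive frames plus the static estimate as extra input channels, so no optical flow is computed. The layer sizes are placeholder assumptions.

    import torch
    import torch.nn as nn

    static_net = nn.Sequential(   # stand-in for the spatial saliency module
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 1), nn.Sigmoid(),
    )
    dynamic_net = nn.Sequential(  # stand-in for the temporal saliency module
        nn.Conv2d(3 + 3 + 1, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 1), nn.Sigmoid(),
    )

    frame_t = torch.rand(1, 3, 224, 224)
    frame_t1 = torch.rand(1, 3, 224, 224)
    static_sal = static_net(frame_t)                       # per-frame estimate
    x = torch.cat([frame_t, frame_t1, static_sal], dim=1)  # 7 input channels
    spatiotemporal_sal = dynamic_net(x)                    # final saliency map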

550 citations

Book ChapterDOI
08 Sep 2018
TL;DR: A simple integral operation relates and unifies the heat map representation and joint regression, avoiding the non-differentiable post-processing and quantization error of heat-map-based human pose estimation.
Abstract: State-of-the-art human pose estimation methods are based on heat map representation. In spite of the good performance, the representation has a few issues in nature, such as non-differentiable post-processing and quantization error. This work shows that a simple integral operation relates and unifies the heat map representation and joint regression, thus avoiding the above issues. It is differentiable, efficient, and compatible with any heat map based methods. Its effectiveness is convincingly validated via comprehensive ablation experiments under various settings, specifically on 3D pose estimation, for the first time.
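
The integral operation itself is compact enough to sketch: softmax turns each joint's heat map into a probability mass, and the predicted location is its expectation, which is differentiable and free of quantization error. A minimal version (our paraphrase, not the released code):

    import torch

    def integral_regression(heatmaps):
        # heatmaps: (batch, joints, H, W) unnormalized per-joint scores
        b, j, h, w = heatmaps.shape
        p = torch.softmax(heatmaps.reshape(b, j, -1), -1).reshape(b, j, h, w)
        xs = torch.arange(w, dtype=p.dtype)         # pixel coordinate grids
        ys = torch.arange(h, dtype=p.dtype)
        x = (p.sum(dim=2) * xs).sum(dim=-1)         # E[x]: (batch, joints)
        y = (p.sum(dim=3) * ys).sum(dim=-1)         # E[y]: (batch, joints)
        return torch.stack([x, y], dim=-1)          # sub-pixel (x, y) per joint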

536 citations

Proceedings ArticleDOI
26 Mar 2018
TL;DR: An adversarial learning framework is proposed that distills the 3D human pose structures learned from a fully annotated dataset to in-the-wild images with only 2D pose annotations; a geometric descriptor, which computes the pairwise relative locations and distances between body joints, is designed as a new information source for the discriminator.
Abstract: Recently, remarkable advances have been achieved in 3D human pose estimation from monocular images because of the powerful Deep Convolutional Neural Networks (DCNNs). Despite their success on large-scale datasets collected in the constrained lab environment, it is difficult to obtain the 3D pose annotations for in-the-wild images. Therefore, 3D human pose estimation in the wild is still a challenge. In this paper, we propose an adversarial learning framework, which distills the 3D human pose structures learned from the fully annotated dataset to in-the-wild images with only 2D pose annotations. Instead of defining hard-coded rules to constrain the pose estimation results, we design a novel multi-source discriminator to distinguish the predicted 3D poses from the ground-truth, which helps to enforce the pose estimator to generate anthropometrically valid poses even with images in the wild. We also observe that a carefully designed information source for the discriminator is essential to boost the performance. Thus, we design a geometric descriptor, which computes the pairwise relative locations and distances between body joints, as a new information source for the discriminator. The efficacy of our adversarial learning framework with the new geometric descriptor has been demonstrated through extensive experiments on widely used public benchmarks. Our approach significantly improves the performance compared with previous state-of-the-art approaches.
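
The geometric descriptor lends itself to a short sketch (our reading of the description, not the authors' code): for every pair of joints, compute the relative offset and the Euclidean distance, and hand the flattened result to the discriminator alongside its other inputs.

    import torch

    def geometric_descriptor(pose3d):
        # pose3d: (J, 3) predicted or ground-truth 3D joint positions
        rel = pose3d.unsqueeze(0) - pose3d.unsqueeze(1)  # (J, J, 3) offsets
        dist = rel.norm(dim=-1, keepdim=True)            # (J, J, 1) distances
        return torch.cat([rel, dist], dim=-1).flatten()  # (J*J*4,) vector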

350 citations


Cites background from "Monocular 3D Human Pose Estimation ..."

  • ...Two-stage approaches first estimate 2D poses and then lift 2D poses to 3D poses [58, 4, 2, 51, 28, 40, 25, 57, 30]....


Proceedings Article
27 Apr 2018
TL;DR: This paper proposes a pose grammar for 3D human pose estimation that takes 2D pose as input, learns a generalized 2D-3D mapping function, and enforces high-level constraints over human poses.
Abstract: In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation. Our model directly takes 2D pose as input and learns a generalized 2D-3D mapping function. The proposed model consists of a base network which efficiently captures pose-aligned features and a hierarchy of Bi-directional RNNs (BRNN) on the top to explicitly incorporate a set of knowledge regarding human body configuration (i.e., kinematics, symmetry, motor coordination). The proposed model thus enforces high-level constraints over human poses. In learning, we develop a pose sample simulator to augment training samples in virtual camera views, which further improves our model generalizability. We validate our method on public 3D human pose benchmarks and propose a new evaluation protocol working on cross-view setting to verify the generalization capability of different methods. We empirically observe that most state-of-the-art methods encounter difficulty under such setting while our method can well handle such challenges.
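
The pose sample simulator can be illustrated with a small sketch (an assumption about the mechanics, not the authors' code): rotate a root-centered mocap pose about the vertical axis and re-project it with a pinhole camera to obtain a new 2D-3D training pair. The focal length and camera distance here are arbitrary placeholders.

    import numpy as np

    def virtual_view_sample(pose3d, yaw, focal=1000.0, cam_dist=4000.0):
        # pose3d: (J, 3) root-centered pose in millimetres, y axis vertical
        c, s = np.cos(yaw), np.sin(yaw)
        rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # yaw rotation
        p3 = pose3d @ rot.T                                 # pose in new view
        p3_cam = p3 + np.array([0.0, 0.0, cam_dist])        # move before camera
        p2 = focal * p3_cam[:, :2] / p3_cam[:, 2:3]         # pinhole projection
        return p2, p3                                       # (2D input, 3D target)

    # Example: augment one pose with 8 evenly spaced virtual viewpoints.
    pose = np.random.randn(17, 3) * 300
    views = [virtual_view_sample(pose, a)
             for a in np.linspace(0, 2 * np.pi, 8, endpoint=False)]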

330 citations


Cites background or methods from "Monocular 3D Human Pose Estimation ..."

  • ...%) of previous 2D-3D reconstruction models (Pavlakos et al. 2017; Nie, Wei, and Zhu 2017; Zhou et al. 2017; Martinez et al. 2017), which demonstrates the blind spot of previous evaluation protocols and the over-fitting problem of those models....


  • ...Furthermore, we re-train models proposed in (Nie, Wei, and Zhu 2017; Martinez et al. 2017) to validate the generalization of our PSS. Results also show a performance boost for their methods, which confirms the proposed PSS is a generalized technique....


  • ...Notably, (Nie, Wei, and Zhu 2017) represented the human body as a set of simplified kinematic grammars and learned their relations with an LSTM....


  • ...…methods (Ionescu et al. 2014; Tekin et al. 2016b; Du et al. 2016; Chen and Ramanan 2016; Sanzari, Ntouskos, and Pirri 2016; Rogez and Schmid 2016; Bogo et al. 2016; Pavlakos et al. 2017; Nie, Wei, and Zhu 2017; Zhou et al. 2017; Martinez et al. 2017) and report quantitative comparisons in Table 1....


Proceedings ArticleDOI
01 Sep 2018
TL;DR: This work proposes a new single-shot method for multi-person 3D pose estimation in general scenes from a monocular RGB camera, using novel occlusion-robust pose-maps (ORPM) which enable full-body pose inference even under strong partial occlusions by other people and objects in the scene.
Abstract: We propose a new single-shot method for multi-person 3D pose estimation in general scenes from a monocular RGB camera. Our approach uses novel occlusion-robust pose-maps (ORPM), which enable full-body pose inference even under strong partial occlusions by other people and objects in the scene. ORPM outputs a fixed number of maps which encode the 3D joint locations of all people in the scene. Body part associations [8] allow us to infer 3D pose for an arbitrary number of people without explicit bounding box prediction. To train our approach we introduce MuCo-3DHP, the first large-scale training data set showing real images of sophisticated multi-person interactions and occlusions. We synthesize a large corpus of multi-person images by compositing images of individual people (with ground truth from multi-view performance capture). We evaluate our method on our new challenging 3D-annotated multi-person test set MuPoTS-3D, where we achieve state-of-the-art performance. To further stimulate research in multi-person 3D pose estimation, we will make our new datasets and associated code publicly available for research purposes.
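
At readout time, the ORPM idea reduces to indexing fixed-size coordinate maps at a person's 2D joint pixels. The sketch below is a strong simplification of our own (it omits the redundancy and occlusion handling that give ORPM its name):

    import numpy as np

    def read_pose_from_maps(loc_maps, joints2d):
        # loc_maps: (J, 3, H, W) per-joint maps storing x/y/z at every pixel
        # joints2d: (J, 2) one person's 2D joint pixels from part association
        return np.stack([loc_maps[j, :, v, u]            # (J, 3) 3D pose
                         for j, (u, v) in enumerate(joints2d)])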

320 citations


Cites background from "Monocular 3D Human Pose Estimation ..."

  • ...Some works split the problem in two: first estimate 2D joints and then lift them to 3D [58, 54, 9, 66, 35, 69, 1, 52, 20, 32, 6, 26, 38, 57, 2], e....


References
Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
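
The core design choice, stacks of 3x3 convolutions with channels doubling after each pooling stage, is easy to reproduce from the description. A sketch of the VGG-16-style feature extractor (written from the abstract, not the released models):

    import torch.nn as nn

    def vgg_block(in_ch, out_ch, convs):
        layers = []
        for i in range(convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.MaxPool2d(2))  # halve resolution after each stage
        return nn.Sequential(*layers)

    features = nn.Sequential(  # 13 of VGG-16's 16 weight layers; 3 FC layers follow
        vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
        vgg_block(256, 512, 3), vgg_block(512, 512, 3),
    )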

55,235 citations


"Monocular 3D Human Pose Estimation ..." refers methods in this paper

  • ...We apply the VGG model [28] to predict the depth from the cropped image of the whole person instead of body parts....


Proceedings ArticleDOI
05 Jul 2008
TL;DR: This work describes a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense using a language model.
Abstract: We describe a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense (grammatically and semantically) using a language model. The entire network is trained jointly on all these tasks using weight-sharing, an instance of multitask learning. All the tasks use labeled data except the language model which is learnt from unlabeled text and represents a novel form of semi-supervised learning for the shared tasks. We show how both multitask learning and semi-supervised learning improve the generalization of the shared tasks, resulting in state-of-the-art-performance.
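
The weight-sharing scheme, which the pose paper reuses to train on 2D pose data and Mocap data in one network, boils down to one shared trunk with task-specific heads. A minimal sketch with illustrative dimensions (not the original NLP architecture):

    import torch.nn as nn

    shared = nn.Sequential(nn.Linear(64, 128), nn.ReLU())  # shared trunk
    heads = nn.ModuleDict({
        "task_a": nn.Linear(128, 10),  # e.g. labels available in dataset A
        "task_b": nn.Linear(128, 1),   # e.g. regression target in dataset B
    })

    def forward(x, task):
        # Batches from either dataset update `shared` through their own head.
        return heads[task](shared(x))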

5,759 citations


"Monocular 3D Human Pose Estimation ..." refers methods in this paper

  • ...Although the 2D pose datasets do not have depth, we apply multi-task learning [7] to combine them with the Mocap dataset in the same network....


Proceedings ArticleDOI
21 Jul 2017
TL;DR: Part Affinity Fields (PAFs) provide a nonparametric representation for learning to associate body parts with individuals in an image, achieving state-of-the-art performance on the MPII Multi-Person benchmark.
Abstract: We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image. The architecture is designed to jointly learn part locations and their association via two branches of the same sequential prediction process. Our method placed first in the inaugural COCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.
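
At parsing time a PAF is used to score candidate limbs: integrate the field's alignment with the limb direction along the segment joining two detected keypoints. A simplified sketch (not the released code, which also resolves the resulting bipartite matching):

    import numpy as np

    def limb_score(paf_x, paf_y, p1, p2, samples=10):
        # paf_x, paf_y: (H, W) affinity field for one limb type
        # p1, p2: (x, y) candidate endpoint keypoints
        p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
        v = p2 - p1
        if np.linalg.norm(v) < 1e-6:
            return 0.0
        v = v / np.linalg.norm(v)                    # unit limb direction
        pts = [p1 + t * (p2 - p1) for t in np.linspace(0, 1, samples)]
        vals = [paf_x[int(y), int(x)] * v[0] + paf_y[int(y), int(x)] * v[1]
                for x, y in pts]                     # field-direction agreement
        return float(np.mean(vals))                  # high when field follows limb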

3,958 citations

Book ChapterDOI
08 Oct 2016
TL;DR: This work introduces a novel convolutional network architecture for human pose estimation, described as a “stacked hourglass” network after the successive steps of pooling and upsampling used to produce a final set of predictions.
Abstract: This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
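
A single hourglass module can be sketched as a recursion: pool, process, recurse to the coarser scale, upsample, and merge with a skip branch kept at the current scale. The version below is a bare-bones illustration (channel counts and depth are assumptions; input height and width must be divisible by 2**depth):

    import torch.nn as nn
    import torch.nn.functional as F

    class Hourglass(nn.Module):
        def __init__(self, depth=4, ch=256):
            super().__init__()
            self.skip = nn.Conv2d(ch, ch, 3, padding=1)  # branch at this scale
            self.down = nn.Conv2d(ch, ch, 3, padding=1)  # after pooling
            self.up = nn.Conv2d(ch, ch, 3, padding=1)    # before upsampling
            self.inner = (Hourglass(depth - 1, ch) if depth > 1
                          else nn.Conv2d(ch, ch, 3, padding=1))

        def forward(self, x):
            y = self.down(F.max_pool2d(x, 2))              # bottom-up: pool
            y = self.inner(y)                              # recurse to coarser scale
            y = F.interpolate(self.up(y), scale_factor=2)  # top-down: upsample
            return self.skip(x) + y                        # consolidate scales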

3,865 citations


"Monocular 3D Human Pose Estimation ..." refers background in this paper

  • ...One inspiration of our work is the huge progress of 2D human pose estimation made by recent works based on deep architectures [33, 32, 17, 37, 3]....
