Proceedings ArticleDOI

Monocular 3D Human Pose Estimation by Predicting Depth on Joints

TLDR
The empirical evaluation on the Human3.6M and HHOI datasets demonstrates the advantage of combining the global 2D skeleton and local image patches for depth prediction, and the superior quantitative and qualitative performance relative to state-of-the-art methods.
Abstract
This paper aims at estimating full-body 3D human poses from monocular images, where the biggest challenge is the inherent ambiguity introduced by lifting the 2D pose into 3D space. We propose a novel framework that reduces this ambiguity by predicting the depth of human joints from 2D human joint locations and body part images. Our approach is built on a two-level hierarchy of Long Short-Term Memory (LSTM) networks which can be trained end-to-end. The first level consists of two components: 1) a skeleton-LSTM which learns depth information from global human skeleton features; 2) a patch-LSTM which utilizes the local image evidence around joint locations. Both networks have a tree structure defined on the kinematic relations of the human skeleton, so information at different joints is broadcast through the whole skeleton in a top-down fashion. The two networks are first pre-trained separately on different data sources and then aggregated in the second level for final depth prediction. The empirical evaluation on the Human3.6M and HHOI datasets demonstrates the advantage of combining the global 2D skeleton and local image patches for depth prediction, and our superior quantitative and qualitative performance relative to state-of-the-art methods.
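The top-down broadcast the abstract describes can be sketched in a few lines: each joint runs an LSTM cell whose recurrent input is its parent's hidden and cell state, so the root's information propagates down the kinematic tree, and a linear readout maps each joint's hidden state to a scalar depth. This is a minimal, single-network NumPy sketch; the joint set, feature size, hidden size, and all weights are illustrative assumptions, not the paper's actual architecture or parameters.

```python
import numpy as np

# Hypothetical kinematic tree for illustration: joint name -> parent name.
PARENT = {
    "pelvis": None,
    "spine": "pelvis", "neck": "spine", "head": "neck",
    "l_hip": "pelvis", "l_knee": "l_hip", "l_ankle": "l_knee",
    "r_hip": "pelvis", "r_knee": "r_hip", "r_ankle": "r_knee",
}
D, H = 4, 8  # per-joint feature size and hidden size (illustrative)

rng = np.random.default_rng(0)
W = rng.standard_normal((4 * H, D)) * 0.1   # input weights, gates [i, f, o, g]
U = rng.standard_normal((4 * H, H)) * 0.1   # recurrent (parent-state) weights
b = np.zeros(4 * H)
w_out = rng.standard_normal(H) * 0.1        # hidden state -> scalar joint depth

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev):
    """One LSTM step whose recurrent state comes from the parent joint."""
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    return sigmoid(o) * np.tanh(c), c

def predict_depths(joint_feats):
    """Top-down traversal: the root starts from zero state, every other joint
    conditions on its parent's (h, c), broadcasting information down the tree."""
    h, c, depth = {}, {}, {}
    def visit(j):
        p = PARENT[j]
        h_p = h[p] if p is not None else np.zeros(H)
        c_p = c[p] if p is not None else np.zeros(H)
        h[j], c[j] = lstm_cell(joint_feats[j], h_p, c_p)
        depth[j] = float(w_out @ h[j])
        for child in (k for k, par in PARENT.items() if par == j):
            visit(child)
    visit("pelvis")
    return depth

feats = {j: rng.standard_normal(D) for j in PARENT}  # stand-in joint features
depths = predict_depths(feats)
```

In the paper this structure is instantiated twice (skeleton features and image-patch features) and the two are fused by a second-level network; the sketch shows only the shared tree-recurrence idea.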


Citations
Proceedings ArticleDOI

Compressed Volumetric Heatmaps for Multi-Person 3D Pose Estimation

TL;DR: A novel approach for bottom-up multi-person 3D human pose estimation from monocular RGB images, devising a simple and effective compression method to drastically reduce the size of the volumetric heatmap representation.
Posted Content

Ordinal Depth Supervision for 3D Human Pose Estimation

TL;DR: In this article, the authors propose to use a weaker supervision signal provided by the ordinal depths of human joints, which can be acquired by human annotators for a wide range of images and poses.
Journal ArticleDOI

3D Human Pose Machines with Self-Supervised Learning

TL;DR: Zhang et al. propose a self-supervised correction mechanism to learn the intrinsic structures of human poses from abundant images, which involves two dual learning tasks, i.e., 2D-to-3D pose transformation and 3D-to-2D pose projection, serving as a bridge between 3D and 2D human poses in a form of free self-supervision for accurate 3D human pose estimation.
Posted Content

Self-Supervised Learning of 3D Human Pose using Multi-view Geometry

TL;DR: EpipolarPose is presented, a self-supervised learning method for 3D human pose estimation which does not need any 3D ground-truth data or camera extrinsics, together with a new performance measure, Pose Structure Score (PSS), a scale-invariant, structure-aware measure of the structural plausibility of a pose with respect to its ground truth.
Posted Content

DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare

TL;DR: A novel end-to-end framework for jointly estimating 3D human pose and body shape from a monocular RGB image; in addition, a large-scale synthetic dataset is constructed from web-crawled MoCap sequences, 3D scans and animations.
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings ArticleDOI

A unified architecture for natural language processing: deep neural networks with multitask learning

TL;DR: This work describes a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense using a language model.
Proceedings ArticleDOI

Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields

TL;DR: This work uses Part Affinity Fields (PAFs), a nonparametric representation, to learn to associate body parts with individuals in the image, achieving state-of-the-art performance on the MPII Multi-Person benchmark.
Book ChapterDOI

Stacked Hourglass Networks for Human Pose Estimation

TL;DR: This work introduces a novel convolutional network architecture for the task of human pose estimation that is described as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.