Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose
Georgios Pavlakos¹, Xiaowei Zhou¹, Konstantinos G. Derpanis², Kostas Daniilidis¹
¹University of Pennsylvania, ²Ryerson University
Abstract
This paper addresses the challenge of 3D human pose estimation from a single color image. Despite the general success of the end-to-end learning paradigm, top performing approaches employ a two-step solution consisting of a Convolutional Network (ConvNet) for 2D joint localization and a subsequent optimization step to recover 3D pose. In this paper, we identify the representation of 3D pose as a critical issue with current ConvNet approaches and make two important contributions towards validating the value of end-to-end learning for this task. First, we propose a fine discretization of the 3D space around the subject and train a ConvNet to predict per voxel likelihoods for each joint. This creates a natural representation for 3D pose and greatly improves performance over the direct regression of joint coordinates. Second, to further improve upon initial estimates, we employ a coarse-to-fine prediction scheme. This step addresses the large dimensionality increase and enables iterative refinement and repeated processing of the image features. The proposed approach outperforms all state-of-the-art methods on standard benchmarks, achieving a relative error reduction greater than 30% on average. Additionally, we investigate using our volumetric representation in a related architecture, which is suboptimal compared to our end-to-end approach but is of practical interest, since it enables training when no image with corresponding 3D groundtruth is available and allows us to present compelling results for in-the-wild images.
1. Introduction
Estimating the full-body 3D pose of a human from a single monocular image is an open challenge, which has garnered significant attention since the early days of computer vision [18]. Given its ill-posed nature, researchers have generally approached 3D human pose estimation in simplified settings, such as assuming background subtraction is feasible [1], relying on groundtruth 2D joint locations to estimate 3D pose [26, 43], employing additional camera views [7, 15], and capitalizing on temporal consistency to improve upon single frame predictions [38, 3]. This diversity of assumptions and additional information sources exemplifies the challenge presented by the task.

Figure 1: Illustration of our volumetric representation for 3D human pose. We discretize the space around the subject and use a ConvNet to predict per voxel likelihoods for each joint from a single color image. (Panels: Image, ConvNet, Volumetric Output.)
With the introduction of more powerful discriminative approaches, such as Convolutional Networks (ConvNets), many of these restrictive assumptions have been relaxed. End-to-end learning approaches attempt to estimate 3D pose directly from a single image by addressing it as coordinate regression [19, 35], nearest neighbor matching between images and poses [20], or classification over a set of pose classes [27]. Yet to date, these approaches have been outperformed by more traditional two-step pipelines, e.g., [45, 6]. In these cases, ConvNets are used only for 2D joint localization, and 3D poses are generated during a post-processing optimization step. Combining accurate 2D joint localization with strong and expressive 3D priors has proven to be very effective. In this work, we show that ConvNets are able to provide much richer information than simply 2D joint locations.
To fully exploit the potential of ConvNets in the context of 3D human pose, we propose the following items, and justify them empirically. First, we cast 3D pose estimation as a keypoint localization problem in a discretized 3D space. Instead of directly regressing the coordinates of the joints (e.g., [19, 35]), we train a ConvNet to predict per voxel likelihoods for each joint in this volume. This volumetric representation, illustrated in Figure 1, is much more sensible for the 3D nature of our problem and improves learning.

Effectively, for every joint, the volumetric supervision provides the network with groundtruth for each voxel in the 3D space. This provides much richer information than a set of world coordinates. The empirical results also validate the superiority of our proposed form of supervision.
Second, to deal with the increased dimensionality of the volumetric representation, we propose a coarse-to-fine prediction scheme. As demonstrated in the 2D pose case, intermediate supervision and iterative estimation are particularly effective strategies [40, 8, 21]. For our volumetric representation though, naively stacking an increasing number of components and refining the estimates is not an effective solution, as shown empirically. Instead, we gradually increase the resolution of the supervision volume for the most challenging z-dimension (depth) during the processing. This coarse-to-fine supervision, illustrated schematically in Figure 2, allows for more accurate estimates after each step. We empirically demonstrate the advantage of this practice over naively stacking more components together.
Our proposed approach achieves state-of-the-art results on standard benchmarks, outperforming both ConvNet-only and hybrid approaches that post-process the 2D output of a ConvNet. Additionally, we investigate using our volumetric representation within a related architecture that decouples 2D joint localization and 3D joint reconstruction. In particular, we use two separate networks (the output of one serves as the input to the other) and two non-corresponding data sources, i.e., 2D labeled imagery to train the first component and an independent 3D data source (e.g., MoCap) to train the second one separately. While this architecture has practical benefits (e.g., predicting 3D pose for in-the-wild images), we show empirically that it underperforms compared to our end-to-end approach when images with corresponding 3D groundtruth are available for training. This finding further underlines the benefit of predicting 3D pose directly from an image, whenever this is possible, instead of using 2D joint localization as an intermediate step.
In summary, we make the following four contributions:
- we are the first to cast 3D human pose estimation as a 3D keypoint localization problem in a voxel space using the end-to-end learning paradigm;
- we propose a coarse-to-fine prediction scheme to deal with the large dimensionality of our representation and enable iterative processing to realize further benefits;
- our proposed approach achieves state-of-the-art results on standard benchmarks, surpassing both ConvNet-only and hybrid approaches that employ ConvNets for 2D pose estimation, with a relative error reduction that exceeds 30% on average;
- we show the practical use of our volumetric representation in cases when end-to-end training is not an option and present compelling results on in-the-wild images.
2. Related work
The literature on 3D human pose estimation is vast, with approaches addressing the problem in a variety of settings. Here, we survey works that are most relevant to ours, with a focus on ConvNet-based approaches; we refer the reader to a recent survey [29] for a more complete literature review.

The majority of recent ConvNet-only approaches cast 3D pose estimation as a coordinate regression task, with the target output being the spatial x, y, z coordinates of the human joints with respect to a known root joint, such as the pelvis. Li and Chan [19] pretrain their network with maps for 2D joint classification. Tekin et al. [35] include a pretrained autoencoder within the network to enforce structural constraints on the output. Ghezelghieh et al. [13] employ viewpoint prediction as a side task to provide the network with global joint configuration information. Zhou et al. [44] embed a kinematic model to guarantee the validity of the regressed pose. Park et al. [22] concatenate the 2D joint predictions with image features to improve 3D joint localization. Tekin et al. [36] include temporal information in the joint predictions by extracting spatiotemporal features from a sequence of frames. In contrast to all these approaches, we adopt a volumetric representation of the human pose and regress the per voxel likelihood for each joint separately. This proves to have significant advantages for the network performance and provides a richer output compared to the low-dimensional vector of joint coordinates.
An alternative approach to the classical regression paradigm is proposed by Li et al. [20]. During training, they learn a common embedding between color images and 3D poses. At test time, the test image is coupled with each candidate pose and forwarded through the network; the input image is assigned the candidate pose with the maximum network score. This is a form of nearest neighbor classification, which is highly inefficient due to the requirement of multiple forward network passes. On the other hand, Rogez and Schmid [27] cast pose estimation as a classification problem. Given a predefined set of pose classes, each image is assigned to the class with the highest score. This guarantees a valid global pose prediction, but the approach is constrained by the poses in the original classes and thus returns only a rough pose estimate. In contrast to the inefficient nearest neighbor approach and the coarse classification approach, our volume regression allows for much more accurate 3D joint localization, while also being efficient.
Despite the interest in end-to-end learning, ConvNet-only approaches underperform those that employ a ConvNet for the 2D localization of joints, and produce 3D pose with a subsequent optimization step. Zhou et al. [45] utilize a standard 2D pose ConvNet to localize the joints and retrieve the 3D pose using an optimization scheme over a sequence of monocular images. Similarly, Du et al. [10] include height-maps of the human body to improve 2D joint localization. Bogo et al. [6] use the joints predicted by a 2D ConvNet and fit a statistical body shape model to recover the full shape of the human body. In contrast, our approach achieves state-of-the-art results with a single network. Furthermore, it provides a rich 3D output, amenable to post-processing, such as pictorial structures optimization to constrain limb lengths, or temporal filtering.

Figure 2: Illustration of our coarse-to-fine volumetric approach for 3D human pose estimation from a single image. The input is a single color image and the output is a dense 3D volume with separate per voxel likelihoods for each joint. The network consists of multiple fully convolutional components [21], which are supervised in a coarse-to-fine fashion to deal with the large dimensionality of our representation. 3D heatmaps are synthesized for supervision by increasing the resolution for the most challenging z-dimension (depth) after each component. The dashed lines indicate that the intermediate heatmaps are fused with image features to produce the input for the next fully convolutional component. For presentation simplicity, the illustrated heatmaps correspond to the location of only one joint.
Another issue that has been addressed in the context of using ConvNets for 3D human pose is the scarcity of training data. Chen et al. [9] use a graphics renderer to create images with known groundtruth. Similarly, Ghezelghieh et al. [13] augment the training set with synthesized examples. A collage approach is proposed by Rogez and Schmid [27], where parts from in-the-wild images are combined to create additional images with known 3D poses. However, there is no guarantee that the statistics of the synthetic examples match those of real images. To investigate the data scarcity issue, we take inspiration from the 3D Interpreter Network [41], which decouples the 3D pose estimation task into 2D localization and 3D reconstruction within a single ConvNet. In contrast, rather than using a predefined linear basis for 3D reconstruction, we predict 3D joint locations directly with our volumetric representation. This demonstrates the practical use of our volumetric representation even when end-to-end training is not an option.
Finally, while we do not compare explicitly with multi-view pose estimation work (e.g., [12, 31, 4, 11]), it is interesting to note that the representation of 3D human pose in a discretized 3D space has also been previously adopted in multi-view settings [7, 15, 23], where it was used to accommodate predictions from different viewpoints. For single view pose estimation, it has been considered in the context of random forests [16]. This approach suffered from long execution times (around three minutes) and required an additional refinement step using a pictorial structures model. In stark contrast, our network can provide complete volume predictions with a single forward pass in a few milliseconds, needs no additional refinement (although it is still a possibility) to provide state-of-the-art results, and is integrated within a coarse-to-fine prediction scheme to deal with excessive dimensionality.
3. Technical approach
The following subsections summarize our technical approach. Section 3.1 describes the proposed volumetric representation for 3D human pose and discusses its merits. Next, Section 3.2 describes our coarse-to-fine prediction approach that addresses the high dimensional nature of our output representation. Finally, Section 3.3 describes the use of our volumetric representation within a related decoupled architecture and discusses its relative merits compared to our coarse-to-fine volumetric prediction approach.
3.1. Volumetric representation for 3D human pose
The problem of 3D human pose estimation using ConvNets has been primarily approached as a coordinate regression problem. In this case, the target of the network is a 3N-dimensional vector comprised of the concatenation of the x, y, z coordinates of the N joints of the human body. For training, an L2 regression loss is employed:

$$L = \sum_{n=1}^{N} \left\| x^n_{gt} - x^n_{pr} \right\|_2^2, \quad (1)$$
where $x^n_{gt}$ is the groundtruth and $x^n_{pr}$ is the predicted location for joint $n$. The location of each joint is expressed globally, with respect to a root joint, or locally, with respect to its parent joint in the kinematic tree. The second formulation has some benefits, as also discussed by Li et al. [19] (e.g., it is easier to learn to predict small, local deviations), but still suffers from the fact that small errors can easily propagate hierarchically to children joints of the kinematic tree. In general, despite its simplicity, the coordinate regression approach makes the problem highly non-linear and presents problems for the learning procedure. These issues have previously been identified in the context of 2D human pose [37, 24].
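To make the two coordinate parameterizations concrete, the following NumPy sketch evaluates the loss of Eq. (1) for both the global and the parent-relative encodings. The kinematic tree in the `parent` array is a hypothetical example for illustration only; the paper does not specify its joint ordering.

```python
import numpy as np

N = 16  # number of body joints

def l2_loss(x_pr, x_gt):
    """Eq. (1): sum over joints of squared L2 distances ((N, 3) arrays)."""
    return np.sum(np.linalg.norm(x_gt - x_pr, axis=1) ** 2)

def to_parent_relative(x_global, parent):
    """Re-express each joint relative to its parent in the kinematic tree.
    parent[n] is the index of joint n's parent; the root is its own parent,
    so it maps to the zero vector."""
    return x_global - x_global[parent]

# hypothetical kinematic tree and poses, purely for illustration
parent = np.array([0, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14])
x_gt = np.random.randn(N, 3)
x_pr = x_gt + 0.05 * np.random.randn(N, 3)
print(l2_loss(x_pr, x_gt))
print(l2_loss(to_parent_relative(x_pr, parent), to_parent_relative(x_gt, parent)))
```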
To improve learning, we propose a volumetric representation for 3D human pose. The volume around the subject is discretized uniformly in each dimension. For each joint, we create a volume of size w × h × d. Let $p^n_{(i,j,k)}$ denote the predicted likelihood of joint $n$ being in voxel (i, j, k). To train this network, the supervision is also provided in volumetric form. The target for each joint is a volume with a 3D Gaussian centered around the groundtruth position $x^n_{gt} = (x, y, z)$ of the joint in the 3D grid:

$$G_{(i,j,k)}(x^n_{gt}) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{(x-i)^2 + (y-j)^2 + (z-k)^2}{2\sigma^2}}, \quad (2)$$
where the value σ = 2 is used for our experiments. For training, we use the mean squared error loss:

$$L = \sum_{n=1}^{N} \sum_{i,j,k} \left\| G_{(i,j,k)}(x^n_{gt}) - p^n_{(i,j,k)} \right\|^2. \quad (3)$$
In theory, the output of the network is four dimensional, i.e., w × h × d × N, but in practice we organize it in channels, thus our output is three dimensional, i.e., w × h × dN. The voxel with the maximum response in each 3D grid is selected as the joint's 3D location.
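As a concrete illustration of Eqs. (2) and (3) and of the decoding step, the sketch below builds the per joint Gaussian target volume, evaluates the squared-error loss, and recovers a joint location as the argmax voxel. Grid sizes and σ follow the values stated in the text; the rest is a minimal NumPy rendering, not the authors' actual implementation.

```python
import numpy as np

def gaussian_target(xyz_gt, w=64, h=64, d=64, sigma=2.0):
    """Eq. (2): 3D Gaussian target centered on the groundtruth voxel (x, y, z)."""
    x, y, z = xyz_gt
    i, j, k = np.meshgrid(np.arange(w), np.arange(h), np.arange(d), indexing="ij")
    sq_dist = (x - i) ** 2 + (y - j) ** 2 + (z - k) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)

def volumetric_loss(preds, targets):
    """Eq. (3): squared error between predicted and target volumes, summed
    over all voxels and all N joints (arrays of shape (N, w, h, d))."""
    return np.sum((targets - preds) ** 2)

def decode(volume):
    """Pick the voxel with the maximum response as the joint's 3D location."""
    return np.unravel_index(np.argmax(volume), volume.shape)

# hypothetical usage for a single joint
target = gaussian_target((32.0, 40.0, 20.0))
assert decode(target) == (32, 40, 20)
```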
A major advantage of the volumetric representation is that it casts the highly non-linear problem of direct 3D coordinate regression to a more manageable form of prediction in a discretized space. In this case, the predictions do not necessarily commit to a unique location for each joint; instead, an estimate of the confidence is provided for each voxel. This makes it easier for the network to learn the target mapping. A similar argument has been previously put forth in the 2D pose case, validating the benefit of predicting per pixel likelihoods instead of pixel coordinates [37, 24]. In terms of the network architecture, an important benefit of the volumetric representation is that it enables the use of a fully convolutional network for prediction. Here, we adopt the hourglass design [21]. This leads to fewer network parameters than using fully connected layers for coordinate regression or pose classification. Finally, in terms of the predicted output, besides being more accurate, our network predictions in the form of dense 3D heatmaps are useful for subsequent post-processing applications. For example, structural constraints can be enforced with the use of a 3D Pictorial Structures model, e.g., [7, 23]. Another option is to use the dense predictions in a filtering framework in cases where multiple input frames are available.
3.2. Coarse-to-fine prediction
A design choice that has been particularly effective in the case of 2D human pose is the iterative processing of the network output [8, 40, 21]. Instead of using a single component with a single output, the network is forced to produce predictions in multiple processing stages. These intermediate predictions are gradually refined to produce more accurate estimates. Additionally, the use of intermediate supervision on the "earlier" outputs allows for a richer gradient signal, which has been demonstrated empirically as an effective learning practice [17, 34].
Inspired by the success of iterative refinement in the context of 2D pose, we also consider a gradual refinement scheme. Empirically, we found that naively stacking multiple components yielded diminishing returns because of the large dimensionality of our representation. In fact, for the highest 3D resolution of 64 × 64 × 64 with 16 joints, we would need to estimate the likelihood for more than 4 million voxels. To deal with this curse of dimensionality, we propose to use a coarse-to-fine prediction scheme. In particular, the first steps are supervised with lower resolution targets for the (most challenging and technically unobserved) z-dimension. Precisely, we use targets of size 64 × 64 × d per joint, where d typically takes values from the set {1, 2, 4, 8, 16, 32, 64}. An illustration of this supervision approach is given in Figure 2.
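One simple way to realize such lower-resolution z supervision, sketched below, is to pool a full-resolution 64 × 64 × 64 target over groups of consecutive z-slices. Whether the authors pool the fine target or synthesize coarse Gaussians directly is not stated here, so this is only one plausible reading, and the per-stage schedule is a hypothetical example.

```python
import numpy as np

def coarsen_z(target, d):
    """Pool a (64, 64, 64) target down to (64, 64, d) by summing groups of
    consecutive z-slices; d must divide the full z-resolution. For d = 1 the
    result collapses to an ordinary 2D heatmap."""
    w, h, z = target.shape
    assert z % d == 0, "d must divide the z-resolution"
    return target.reshape(w, h, d, z // d).sum(axis=3)

# hypothetical coarse-to-fine schedule for a 4-component network
schedule = [1, 2, 4, 64]
# targets_per_stage = [coarsen_z(full_res_target, d) for d in schedule]
```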
This strategy makes training more effective, and allows us to benefit from stacking multiple components together without suffering from overfitting or dimensionality issues. Intuitively, easier versions of the task are presented to the network during the early stages of processing, and the complexity increases gradually. This postpones the harder decisions until the very end of the processing, when all the available information has been processed and consolidated.
3.3. Decoupled architecture with volumetric target
To further show the versatility of the proposed volumetric representation, we also employ it in a scenario where end-to-end training is not an option. This is usually the case for in-the-wild images, where accurate, large-scale acquisition of 3D groundtruth is not feasible. Inspired by the 3D Interpreter Network [41], we decouple 3D pose estimation into two sequential steps: predicting 2D keypoint heatmaps, followed by an inference step of the 3D joint positions with our volumetric representation. The first step can be trained with 2D labeled in-the-wild imagery, while the second step requires only 3D data (e.g., MoCap). Independently, each of these sources is abundantly available.

This training strategy is useful for practical scenarios, and we present compelling results for in-the-wild images (Sec. 4.6). However, it remains suboptimal compared to our end-to-end approach when images with corresponding 3D groundtruth are available for training.

Figure 3: Schematic comparison of a decoupled architecture (a) versus our coarse-to-fine architecture (b) with intermediate supervision at the coarsest level (2D heatmaps). Blue blocks indicate 3D heatmaps, while green blocks indicate 2D heatmaps. Decoupled architecture: the 2D heatmaps are provided directly as input to the second part of the network, which effectively operates as a 2D-to-3D reconstruction component. Note, no image features are processed in the second component, only information about 2D joint locations. Coarse-to-fine architecture: we use 2D heatmaps as intermediate supervision, which are then combined with image features, effectively carrying information both from the image and the 2D locations of the joints.

Figure 3 provides an illustration of each architecture in a simplified setting with two hourglasses. It can be seen that the decoupled case is related to our coarse-to-fine architecture when the resolution of the intermediate supervision is set to d = 1, resulting in 2D heatmaps. A crucial difference between the two architectures is that our coarse-to-fine approach combines the produced 2D heatmaps with intermediate image features. This way, the rest of the network can process information both about the image and the 2D joint locations. On the other hand, a decoupled network processes the 2D heatmaps directly and attempts to reconstruct 3D locations without further aid from image-based evidence. In cases where the heatmaps are grossly erroneous, the 3D predictions can be led astray. In Sec. 4.4, we show empirically that when images with corresponding 3D groundtruth are available, our coarse-to-fine architecture outperforms the decoupled one.
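A minimal PyTorch-style sketch of the fusion step that separates the two architectures (the dashed lines in Figure 2) is given below. The 1×1-convolution remapping mirrors the stacked hourglass design [21]; the channel counts are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HeatmapFeatureFusion(nn.Module):
    """Sketch of the fusion marked by dashed lines in Figure 2: intermediate
    heatmaps are remapped by a 1x1 convolution and merged back into the image
    features that feed the next hourglass component. Channel counts are
    illustrative assumptions, not the paper's exact configuration."""

    def __init__(self, feat_channels=256, heatmap_channels=16):
        super().__init__()
        self.remap = nn.Conv2d(heatmap_channels, feat_channels, kernel_size=1)

    def forward(self, features, heatmaps):
        # Coarse-to-fine: the next component sees image features AND joint
        # evidence. A decoupled network would instead pass `heatmaps` alone.
        return features + self.remap(heatmaps)

# hypothetical usage: 16 joint heatmaps at 64x64, 256 feature channels
fuse = HeatmapFeatureFusion()
out = fuse(torch.randn(1, 256, 64, 64), torch.randn(1, 16, 64, 64))
```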
4. Empirical evaluation
4.1. Datasets
We present extensive quantitative evaluation of our coarse-to-fine volumetric approach on three standard benchmarks for 3D human pose: Human3.6M [14], HumanEva-I [30], and KTH Football II [15]. Additionally, qualitative results are presented on the MPII human pose dataset [2], since no 3D groundtruth is available for it.
Human3.6M: It contains video of 11 subjects performing a variety of actions, such as "Walking", "Sitting", and "Phoning". We follow the same evaluation protocol as prior work [20, 45]. In particular, subjects S1, S5, S6, S7, and S8 were used for training, while subjects S9 and S11 were used for testing. The original videos were downsampled from 50fps to 10fps. We employed all camera views and trained a single model for all actions, instead of training action-specific models [20, 45].
HumanEva-I: It is a smaller dataset compared to Human3.6M, with fewer subjects and actions. Following the standard protocol [16, 42], we evaluated on "Walking" and "Jogging" from subjects S1, S2, and S3. The training sequences of these subjects and actions were used for training, and the corresponding validation sequences for testing. As done with the Human3.6M evaluation, we train a single model using the frames for all users and actions.
KTH Football II: The images are taken from a professional football match, and 3D groundtruth is provided only for a very small number of them. The limited available groundtruth is not very accurate, since it was generated by combining manual 2D annotations from multiple views. In this case, image-to-3D training is not a practical option. Instead, we report results using our volumetric representation within the decoupled architecture described in Sec. 3.3. More specifically, we train the first network component (image to 2D heatmaps) using images from this dataset, which provide 2D groundtruth. For the second network component (2D heatmaps to 3D heatmaps), we use all the training MoCap data from the Human3.6M dataset. As others [7, 36], we report results using "Sequence 1" from "Player 2" and frames taken from "Camera 1".
MPII: It is a large scale 2D pose dataset containing in-the-wild imagery. It provides 2D annotations but no 3D groundtruth. Like KTH, direct image-to-3D training is not a practical option with this dataset. Instead, we use the decoupled architecture with our volumetric representation. Since we cannot quantify the performance here, we only provide qualitative results.
4.2. Evaluation metrics
For Human3.6M, most approaches report the per joint 3D error, which is the average Euclidean distance of the estimated joints to the groundtruth. This is computed after aligning the root joints (here, the pelvis) of the estimated and groundtruth 3D pose. An alternative metric, which is used by some methods to report results on Human3.6M and HumanEva-I, is the reconstruction error. It is defined as the per joint 3D error up to a similarity transformation. Effectively, the estimated 3D pose is aligned to the groundtruth by the Procrustes method. Finally, for KTH, the percentage of correctly estimated parts in 3D (3D PCP [7]) is reported. Again, the root joints (here we use the center of the chest) are aligned to resolve the depth ambiguity.
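Both metrics have straightforward implementations; the NumPy sketch below computes the per joint 3D error after root alignment and the reconstruction error after Procrustes (similarity) alignment. It follows the standard Kabsch-style formulation and is an illustrative rendering, not the official evaluation code.

```python
import numpy as np

def per_joint_error(pred, gt, root=0):
    """Average Euclidean distance over joints after aligning the root joints
    (pred and gt are (N, 3) arrays of 3D joint locations)."""
    return np.mean(np.linalg.norm((pred - pred[root]) - (gt - gt[root]), axis=1))

def reconstruction_error(pred, gt):
    """Per joint 3D error up to a similarity transform: align pred to gt with
    the Procrustes (Kabsch) method, then measure the remaining distance."""
    p = pred - pred.mean(axis=0)
    g = gt - gt.mean(axis=0)
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # correct an improper rotation (reflection)
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + gt.mean(axis=0)
    return np.mean(np.linalg.norm(aligned - gt, axis=1))
```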

References
- Going deeper with convolutions. Introduces the Inception deep convolutional architecture, which set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
- Stacked Hourglass Networks for Human Pose Estimation. Introduces a novel convolutional architecture for human pose estimation, the "stacked hourglass" network, based on successive steps of pooling and upsampling that produce a final set of predictions.
- Convolutional Pose Machines. Incorporates a convolutional network into the pose machine framework to learn image features and image-dependent spatial models for pose estimation, implicitly modeling long-range dependencies between variables in structured prediction tasks such as articulated pose estimation.
- 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. Introduces the "MPII Human Pose" benchmark, which makes a significant advance in terms of diversity and difficulty, a contribution required for future developments in human body models.
- Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. Introduces Human3.6M, a dataset of 3.6 million accurate 3D human poses, acquired by recording the performance of 5 female and 6 male subjects under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms.