Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose
Georgios Pavlakos¹, Xiaowei Zhou¹, Konstantinos G. Derpanis², Kostas Daniilidis¹
¹University of Pennsylvania, ²Ryerson University
Abstract
This paper addresses the challenge of 3D human pose estimation from a single color image. Despite the general success of the end-to-end learning paradigm, top performing approaches employ a two-step solution consisting of a Convolutional Network (ConvNet) for 2D joint localization and a subsequent optimization step to recover 3D pose. In this paper, we identify the representation of 3D pose as a critical issue with current ConvNet approaches and make two important contributions towards validating the value of end-to-end learning for this task. First, we propose a fine discretization of the 3D space around the subject and train a ConvNet to predict per voxel likelihoods for each joint. This creates a natural representation for 3D pose and greatly improves performance over the direct regression of joint coordinates. Second, to further improve upon initial estimates, we employ a coarse-to-fine prediction scheme. This step addresses the large dimensionality increase and enables iterative refinement and repeated processing of the image features. The proposed approach outperforms all state-of-the-art methods on standard benchmarks, achieving a relative error reduction greater than 30% on average. Additionally, we investigate using our volumetric representation in a related architecture, which is suboptimal compared to our end-to-end approach but is of practical interest, since it enables training when no image with corresponding 3D groundtruth is available and allows us to present compelling results for in-the-wild images.
1. Introduction
Estimating the full-body 3D pose of a human from a single monocular image is an open challenge, which has garnered significant attention since the early days of computer vision [18]. Given its ill-posed nature, researchers have generally approached 3D human pose estimation in simplified settings, such as assuming background subtraction is feasible [1], relying on groundtruth 2D joint locations to estimate 3D pose [26, 43], employing additional camera views [7, 15], and capitalizing on temporal consistency to improve upon single frame predictions [38, 3]. This diversity of assumptions and additional information sources exemplifies the challenge presented by the task.

Figure 1: Illustration of our volumetric representation for 3D human pose. We discretize the space around the subject and use a ConvNet to predict per voxel likelihoods for each joint from a single color image. (Panels: Image, ConvNet, Volumetric Output.)
With the introduction of more powerful discriminative approaches, such as Convolutional Networks (ConvNets), many of these restrictive assumptions have been relaxed. End-to-end learning approaches attempt to estimate 3D pose directly from a single image by addressing it as coordinate regression [19, 35], nearest neighbor matching between images and poses [20], or classification over a set of pose classes [27]. Yet to date, these approaches have been outperformed by more traditional two-step pipelines, e.g., [45, 6]. In these cases, ConvNets are used only for 2D joint localization, and 3D poses are generated during a post-processing optimization step. Combining accurate 2D joint localization with strong and expressive 3D priors has proven to be very effective. In this work, we show that ConvNets are able to provide much richer information than simply 2D joint locations.
To fully exploit the potential of ConvNets in the context of 3D human pose, we propose the following items, and justify them empirically. First, we cast 3D pose estimation as a keypoint localization problem in a discretized 3D space. Instead of directly regressing the coordinates of the joints (e.g., [19, 35]), we train a ConvNet to predict per voxel likelihoods for each joint in this volume. This volumetric representation, illustrated in Figure 1, is much more sensible for the 3D nature of our problem and improves learning.

Effectively, for every joint, the volumetric supervision provides the network with groundtruth for each voxel in the 3D space. This provides much richer information than a set of world coordinates. The empirical results also validate the superiority of our proposed form of supervision.
Second, to deal with the increased dimensionality of the volumetric representation, we propose a coarse-to-fine prediction scheme. As demonstrated in the 2D pose case, intermediate supervision and iterative estimation are particularly effective strategies [40, 8, 21]. For our volumetric representation though, naively stacking an increasing number of components and refining the estimates is not an effective solution, as shown empirically. Instead, we gradually increase the resolution of the supervision volume for the most challenging z-dimension (depth) during the processing. This coarse-to-fine supervision, illustrated schematically in Figure 2, allows for more accurate estimates after each step. We empirically demonstrate the advantage of this practice over naively stacking more components together.
Our proposed approach achieves state-of-the-art results on standard benchmarks, outperforming both ConvNet-only and hybrid approaches that post-process the 2D output of a ConvNet. Additionally, we investigate using our volumetric representation within a related architecture that decouples 2D joint localization and 3D joint reconstruction. In particular, we use two separate networks (the output of one serves as the input to the other) and two non-corresponding data sources, i.e., 2D labeled imagery to train the first component and an independent 3D data source (e.g., MoCap) to train the second one separately. While this architecture has practical benefits (e.g., predicting 3D pose for in-the-wild images), we show empirically that it underperforms compared to our end-to-end approach when images with corresponding 3D groundtruth are available for training. This finding further underlines the benefit of predicting 3D pose directly from an image, whenever this is possible, instead of using 2D joint localization as an intermediate step.
In summary, we make the following four contributions:
- we are the first to cast 3D human pose estimation as a 3D keypoint localization problem in a voxel space using the end-to-end learning paradigm;
- we propose a coarse-to-fine prediction scheme to deal with the large dimensionality of our representation and enable iterative processing to realize further benefits;
- our proposed approach achieves state-of-the-art results on standard benchmarks, surpassing both ConvNet-only and hybrid approaches that employ ConvNets for 2D pose estimation, with a relative error reduction that exceeds 30% on average;
- we show the practical use of our volumetric representation in cases when end-to-end training is not an option and present compelling results on in-the-wild images.
2. Related work
The literature on 3D human pose estimation is vast, with approaches addressing the problem in a variety of settings. Here, we survey works that are most relevant to ours, with a focus on ConvNet-based approaches; we refer the reader to a recent survey [29] for a more complete literature review.

The majority of recent ConvNet-only approaches cast 3D pose estimation as a coordinate regression task, with the target output being the spatial x, y, z coordinates of the human joints with respect to a known root joint, such as the pelvis. Li and Chan [19] pretrain their network with maps for 2D joint classification. Tekin et al. [35] include a pretrained autoencoder within the network to enforce structural constraints on the output. Ghezelghieh et al. [13] employ viewpoint prediction as a side task to provide the network with global joint configuration information. Zhou et al. [44] embed a kinematic model to guarantee the validity of the regressed pose. Park et al. [22] concatenate the 2D joint predictions with image features to improve 3D joint localization. Tekin et al. [36] include temporal information in the joint predictions by extracting spatiotemporal features from a sequence of frames. In contrast to all these approaches, we adopt a volumetric representation of the human pose and regress the per voxel likelihood for each joint separately. This proves to have significant advantages for the network performance and provides a richer output compared to the low-dimensional vector of joint coordinates.
An alternative approach to the classical regression paradigm is proposed by Li et al. [20]. During training, they learn a common embedding between color images and 3D poses. At test time, the test image is coupled with each candidate pose and forwarded through the network; the input image is assigned the candidate pose with the maximum network score. This is a form of nearest neighbor classification, which is highly inefficient due to the requirement of multiple forward network passes. On the other hand, Rogez and Schmid [27] cast pose estimation as a classification problem. Given a predefined set of pose classes, each image is assigned to the class with the highest score. This guarantees a valid global pose prediction, but the approach is constrained by the poses in the original classes and thus returns only a rough pose estimate. In contrast to the inefficient nearest neighbor approach and the coarse classification approach, our volume regression allows for much more accurate 3D joint localization, while also being efficient.
Despite the interest in end-to-end learning, ConvNet-only approaches underperform those that employ a ConvNet for the 2D localization of joints, and produce 3D pose with a subsequent optimization step. Zhou et al. [45] utilize a standard 2D pose ConvNet to localize the joints and retrieve the 3D pose using an optimization scheme over a sequence of monocular images. Similarly, Du et al. [10] include height-maps of the human body to improve 2D joint localization. Bogo et al. [6] use the joints predicted by a 2D ConvNet and fit a statistical body shape model to recover the full shape of the human body. In contrast, our approach achieves state-of-the-art results with a single network. Furthermore, it provides a rich 3D output, amenable to post-processing, such as pictorial structures optimization to constrain limb lengths, or temporal filtering.

Figure 2: Illustration of our coarse-to-fine volumetric approach for 3D human pose estimation from a single image. The input is a single color image and the output is a dense 3D volume with separate per voxel likelihoods for each joint. The network consists of multiple fully convolutional components [21], which are supervised in a coarse-to-fine fashion to deal with the large dimensionality of our representation. 3D heatmaps are synthesized for supervision by increasing the resolution for the most challenging z-dimension (depth) after each component. The dashed lines indicate that the intermediate heatmaps are fused with image features to produce the input for the next fully convolutional component. For presentation simplicity, the illustrated heatmaps correspond to the location of only one joint.
Another issue that has been addressed in the context of using ConvNets for 3D human pose is the scarcity of training data. Chen et al. [9] use a graphics renderer to create images with known groundtruth. Similarly, Ghezelghieh et al. [13] augment the training set with synthesized examples. A collage approach is proposed by Rogez and Schmid [27], where parts from in-the-wild images are combined to create additional images with known 3D poses. However, there is no guarantee that the statistics of the synthetic examples match those of real images. To investigate the data scarcity issue, we take inspiration from the 3D Interpreter Network [41], which decouples the 3D pose estimation task into 2D localization and 3D reconstruction within a single ConvNet. In contrast, rather than using a predefined linear basis for 3D reconstruction, we predict 3D joint locations directly with our volumetric representation. This demonstrates the practical use of our volumetric representation even when end-to-end training is not an option.
Finally, while we do not compare explicitly with multi-view pose estimation work (e.g., [12, 31, 4, 11]), it is interesting to note that the representation of 3D human pose in a discretized 3D space has also been previously adopted in multi-view settings [7, 15, 23], where it was used to accommodate predictions from different viewpoints. For single view pose estimation, it has been considered in the context of random forests [16]. This approach suffered from long execution times (around three minutes) and required an additional refinement step using a pictorial structures model. In stark contrast, our network can provide complete volume predictions with a single forward pass in a few milliseconds, needs no additional refinement (although it is still a possibility) to provide state-of-the-art results, and is integrated within a coarse-to-fine prediction scheme to deal with excessive dimensionality.
3. Technical approach
The following subsections summarize our technical approach. Section 3.1 describes the proposed volumetric representation for 3D human pose and discusses its merits. Next, Section 3.2 describes our coarse-to-fine prediction approach that addresses the high dimensional nature of our output representation. Finally, Section 3.3 describes the use of our volumetric representation within a related decoupled architecture and discusses its relative merits compared to our coarse-to-fine volumetric prediction approach.
3.1. Volumetric representation for 3D human pose
The problem of 3D human pose estimation using ConvNets has been primarily approached as a coordinate regression problem. In this case, the target of the network is a 3N-dimensional vector comprised of the concatenation of the x, y, z coordinates of the N joints of the human body. For training, an L2 regression loss is employed:

$$L = \sum_{n=1}^{N} \left\| x^n_{gt} - x^n_{pr} \right\|_2^2, \quad (1)$$
where $x^n_{gt}$ is the groundtruth and $x^n_{pr}$ is the predicted location for joint $n$. The location of each joint is expressed globally, with respect to a root joint, or locally, with respect to its parent joint in the kinematic tree. The second formulation has some benefits, as also discussed by Li et al. [19] (e.g., it is easier to learn to predict small, local deviations), but still suffers from the fact that small errors can easily propagate hierarchically to children joints of the kinematic tree. In general, despite its simplicity, the coordinate regression approach makes the problem highly non-linear and presents problems for the learning procedure. These issues have previously been identified in the context of 2D human pose [37, 24].
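To make the two coordinate parameterizations concrete, the following NumPy sketch evaluates the loss of Eq. (1) for both the global and the parent-relative encodings. The kinematic tree in the `parent` array is a hypothetical example for illustration only; the paper does not specify its joint ordering.

```python
import numpy as np

N = 16  # number of body joints

def l2_loss(x_pr, x_gt):
    """Eq. (1): sum over joints of squared L2 distances ((N, 3) arrays)."""
    return np.sum(np.linalg.norm(x_gt - x_pr, axis=1) ** 2)

def to_parent_relative(x_global, parent):
    """Re-express each joint relative to its parent in the kinematic tree.
    parent[n] is the index of joint n's parent; the root is its own parent,
    so it maps to the zero vector."""
    return x_global - x_global[parent]

# hypothetical kinematic tree and poses, purely for illustration
parent = np.array([0, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14])
x_gt = np.random.randn(N, 3)
x_pr = x_gt + 0.05 * np.random.randn(N, 3)
print(l2_loss(x_pr, x_gt))
print(l2_loss(to_parent_relative(x_pr, parent), to_parent_relative(x_gt, parent)))
```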
To improve learning, we propose a volumetric representation for 3D human pose. The volume around the subject is discretized uniformly in each dimension. For each joint, we create a volume of size w × h × d. Let $p^n_{(i,j,k)}$ denote the predicted likelihood of joint $n$ being in voxel (i, j, k). To train this network, the supervision is also provided in volumetric form. The target for each joint is a volume with a 3D Gaussian centered around the groundtruth position $x^n_{gt} = (x, y, z)$ of the joint in the 3D grid:

$$G_{(i,j,k)}(x^n_{gt}) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{(x-i)^2 + (y-j)^2 + (z-k)^2}{2\sigma^2}}, \quad (2)$$
where the value σ = 2 is used for our experiments. For training, we use the mean squared error loss:

$$L = \sum_{n=1}^{N} \sum_{i,j,k} \left\| G_{(i,j,k)}(x^n_{gt}) - p^n_{(i,j,k)} \right\|^2. \quad (3)$$
In theory, the output of the network is four dimensional, i.e., w × h × d × N, but in practice we organize it in channels, thus our output is three dimensional, i.e., w × h × dN. The voxel with the maximum response in each 3D grid is selected as the joint's 3D location.
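As a concrete illustration of Eqs. (2) and (3) and of the decoding step, the sketch below builds the per joint Gaussian target volume, evaluates the squared-error loss, and recovers a joint location as the argmax voxel. Grid sizes and σ follow the values stated in the text; the rest is a minimal NumPy rendering, not the authors' actual implementation.

```python
import numpy as np

def gaussian_target(xyz_gt, w=64, h=64, d=64, sigma=2.0):
    """Eq. (2): 3D Gaussian target centered on the groundtruth voxel (x, y, z)."""
    x, y, z = xyz_gt
    i, j, k = np.meshgrid(np.arange(w), np.arange(h), np.arange(d), indexing="ij")
    sq_dist = (x - i) ** 2 + (y - j) ** 2 + (z - k) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)

def volumetric_loss(preds, targets):
    """Eq. (3): squared error between predicted and target volumes, summed
    over all voxels and all N joints (arrays of shape (N, w, h, d))."""
    return np.sum((targets - preds) ** 2)

def decode(volume):
    """Pick the voxel with the maximum response as the joint's 3D location."""
    return np.unravel_index(np.argmax(volume), volume.shape)

# hypothetical usage for a single joint
target = gaussian_target((32.0, 40.0, 20.0))
assert decode(target) == (32, 40, 20)
```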
A major advantage of the volumetric representation is that it casts the highly non-linear problem of direct 3D coordinate regression to a more manageable form of prediction in a discretized space. In this case, the predictions do not necessarily commit to a unique location for each joint; instead, an estimate of the confidence is provided for each voxel. This makes it easier for the network to learn the target mapping. A similar argument has been previously put forth in the 2D pose case, validating the benefit of predicting per pixel likelihoods instead of pixel coordinates [37, 24]. In terms of the network architecture, an important benefit of the volumetric representation is that it enables the use of a fully convolutional network for prediction. Here, we adopt the hourglass design [21]. This leads to fewer network parameters than using fully connected layers for coordinate regression or pose classification. Finally, in terms of the predicted output, besides being more accurate, our network predictions in the form of dense 3D heatmaps are useful for subsequent post-processing applications. For example, structural constraints can be enforced with the use of a 3D Pictorial Structures model, e.g., [7, 23]. Another option is to use the dense predictions in a filtering framework in cases where multiple input frames are available.
3.2. Coarse-to-fine prediction
A design choice that has been particularly effective in the case of 2D human pose is the iterative processing of the network output [8, 40, 21]. Instead of using a single component with a single output, the network is forced to produce predictions in multiple processing stages. These intermediate predictions are gradually refined to produce more accurate estimates. Additionally, the use of intermediate supervision on the "earlier" outputs allows for a richer gradient signal, which has been demonstrated empirically as an effective learning practice [17, 34].
Inspired by the success of iterative refinement in the context of 2D pose, we also consider a gradual refinement scheme. Empirically, we found that naively stacking multiple components yielded diminishing returns because of the large dimensionality of our representation. In fact, for the highest 3D resolution of 64 × 64 × 64 with 16 joints, we would need to estimate the likelihood for more than 4 million voxels. To deal with this curse of dimensionality, we propose to use a coarse-to-fine prediction scheme. In particular, the first steps are supervised with lower resolution targets for the (most challenging and technically unobserved) z-dimension. Precisely, we use targets of size 64 × 64 × d per joint, where d typically takes values from the set {1, 2, 4, 8, 16, 32, 64}. An illustration of this supervision approach is given in Figure 2.
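One simple way to realize such lower-resolution z supervision, sketched below, is to pool a full-resolution 64 × 64 × 64 target over groups of consecutive z-slices. Whether the authors pool the fine target or synthesize coarse Gaussians directly is not stated here, so this is only one plausible reading, and the per-stage schedule is a hypothetical example.

```python
import numpy as np

def coarsen_z(target, d):
    """Pool a (64, 64, 64) target down to (64, 64, d) by summing groups of
    consecutive z-slices; d must divide the full z-resolution. For d = 1 the
    result collapses to an ordinary 2D heatmap."""
    w, h, z = target.shape
    assert z % d == 0, "d must divide the z-resolution"
    return target.reshape(w, h, d, z // d).sum(axis=3)

# hypothetical coarse-to-fine schedule for a 4-component network
schedule = [1, 2, 4, 64]
# targets_per_stage = [coarsen_z(full_res_target, d) for d in schedule]
```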
This strategy makes training more effective, and allows us to benefit from stacking multiple components together without suffering from overfitting or dimensionality issues. Intuitively, easier versions of the task are presented to the network during the early stages of processing, and the complexity increases gradually. This postpones the harder decisions until the very end of the processing, when all the available information has been processed and consolidated.
3.3. Decoupled architecture with volumetric target
To further show the versatility of the proposed volumetric representation, we also employ it in a scenario where end-to-end training is not an option. This is usually the case for in-the-wild images, where accurate, large-scale acquisition of 3D groundtruth is not feasible. Inspired by the 3D Interpreter Network [41], we decouple 3D pose estimation into two sequential steps: predicting 2D keypoint heatmaps, followed by an inference step of the 3D joint positions with our volumetric representation. The first step can be trained with 2D labeled in-the-wild imagery, while the second step requires only 3D data (e.g., MoCap). Independently, each of these sources is abundantly available.

This training strategy is useful for practical scenarios, and we present compelling results for in-the-wild images (Sec. 4.6). However, it remains suboptimal compared to our end-to-end approach when images with corresponding 3D groundtruth are available for training.

Figure 3: Schematic comparison of a decoupled architecture (a) versus our coarse-to-fine architecture (b) with intermediate supervision at the coarsest level (2D heatmaps). Blue blocks indicate 3D heatmaps, while green blocks indicate 2D heatmaps. Decoupled architecture: the 2D heatmaps are provided directly as input to the second part of the network, which effectively operates as a 2D-to-3D reconstruction component. Note, no image features are processed in the second component, only information about 2D joint locations. Coarse-to-fine architecture: we use 2D heatmaps as intermediate supervision, which are then combined with image features, effectively carrying information both from the image and the 2D locations of the joints.

Figure 3 provides an illustration of each architecture in a simplified setting with two hourglasses. It can be seen that the decoupled case is related to our coarse-to-fine architecture when the resolution of the intermediate supervision is set to d = 1, resulting in 2D heatmaps. A crucial difference between the two architectures is that our coarse-to-fine approach combines the produced 2D heatmaps with intermediate image features. This way, the rest of the network can process information both about the image and the 2D joint locations. On the other hand, a decoupled network processes the 2D heatmaps directly and attempts to reconstruct 3D locations without further aid from image-based evidence. In cases where the heatmaps are grossly erroneous, the 3D predictions can be led astray. In Sec. 4.4, we show empirically that when images with corresponding 3D groundtruth are available, our coarse-to-fine architecture outperforms the decoupled one.
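A minimal PyTorch-style sketch of the fusion step that separates the two architectures (the dashed lines in Figure 2) is given below. The 1×1-convolution remapping mirrors the stacked hourglass design [21]; the channel counts are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HeatmapFeatureFusion(nn.Module):
    """Sketch of the fusion marked by dashed lines in Figure 2: intermediate
    heatmaps are remapped by a 1x1 convolution and merged back into the image
    features that feed the next hourglass component. Channel counts are
    illustrative assumptions, not the paper's exact configuration."""

    def __init__(self, feat_channels=256, heatmap_channels=16):
        super().__init__()
        self.remap = nn.Conv2d(heatmap_channels, feat_channels, kernel_size=1)

    def forward(self, features, heatmaps):
        # Coarse-to-fine: the next component sees image features AND joint
        # evidence. A decoupled network would instead pass `heatmaps` alone.
        return features + self.remap(heatmaps)

# hypothetical usage: 16 joint heatmaps at 64x64, 256 feature channels
fuse = HeatmapFeatureFusion()
out = fuse(torch.randn(1, 256, 64, 64), torch.randn(1, 16, 64, 64))
```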
4. Empirical evaluation
4.1. Datasets
We present extensive quantitative evaluation of our coarse-to-fine volumetric approach on three standard benchmarks for 3D human pose: Human3.6M [14], HumanEva-I [30], and KTH Football II [15]. Additionally, qualitative results are presented on the MPII human pose dataset [2], since no 3D groundtruth is available for it.
Human3.6M: It contains video of 11 subjects performing a variety of actions, such as "Walking", "Sitting", and "Phoning". We follow the same evaluation protocol as prior work [20, 45]. In particular, subjects S1, S5, S6, S7, and S8 were used for training, while subjects S9 and S11 were used for testing. The original videos were downsampled from 50fps to 10fps. We employed all camera views and trained a single model for all actions, instead of training action-specific models [20, 45].
HumanEva-I: It is a smaller dataset compared to Human3.6M, with fewer subjects and actions. Following the standard protocol [16, 42], we evaluated on "Walking" and "Jogging" from subjects S1, S2, and S3. The training sequences of these subjects and actions were used for training, and the corresponding validation sequences for testing. As done with the Human3.6M evaluation, we train a single model using the frames for all users and actions.
KTH Football II: The images are taken from a professional football match, and 3D groundtruth is provided only for a very small number of them. The limited available groundtruth is not very accurate, since it was generated by combining manual 2D annotations from multiple views. In this case, image-to-3D training is not a practical option. Instead, we report results using our volumetric representation within the decoupled architecture described in Sec. 3.3. More specifically, we train the first network component (image to 2D heatmaps) using images from this dataset, which provide 2D groundtruth. For the second network component (2D heatmaps to 3D heatmaps), we use all the training MoCap data from the Human3.6M dataset. As others [7, 36], we report results using "Sequence 1" from "Player 2" and frames taken from "Camera 1".
MPII: It is a large scale 2D pose dataset containing in-the-wild imagery. It provides 2D annotations but no 3D groundtruth. Like KTH, direct image-to-3D training is not a practical option with this dataset. Instead, we use the decoupled architecture with our volumetric representation. Since we cannot quantify the performance here, we only provide qualitative results.
4.2. Evaluation metrics
For Human3.6M, most approaches report the per joint 3D error, which is the average Euclidean distance of the estimated joints to the groundtruth. This is computed after aligning the root joints (here, the pelvis) of the estimated and groundtruth 3D pose. An alternative metric, which is used by some methods to report results on Human3.6M and HumanEva-I, is the reconstruction error. It is defined as the per joint 3D error up to a similarity transformation. Effectively, the estimated 3D pose is aligned to the groundtruth by the Procrustes method. Finally, for KTH, the percentage of correctly estimated parts in 3D (3D PCP [7]) is reported. Again, the root joints (here we use the center of the chest) are aligned to resolve the depth ambiguity.
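Both metrics have straightforward implementations; the NumPy sketch below computes the per joint 3D error after root alignment and the reconstruction error after Procrustes (similarity) alignment. It follows the standard Kabsch-style formulation and is an illustrative rendering, not the official evaluation code.

```python
import numpy as np

def per_joint_error(pred, gt, root=0):
    """Average Euclidean distance over joints after aligning the root joints
    (pred and gt are (N, 3) arrays of 3D joint locations)."""
    return np.mean(np.linalg.norm((pred - pred[root]) - (gt - gt[root]), axis=1))

def reconstruction_error(pred, gt):
    """Per joint 3D error up to a similarity transform: align pred to gt with
    the Procrustes (Kabsch) method, then measure the remaining distance."""
    p = pred - pred.mean(axis=0)
    g = gt - gt.mean(axis=0)
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # correct an improper rotation (reflection)
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + gt.mean(axis=0)
    return np.mean(np.linalg.norm(aligned - gt, axis=1))
```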

References
- Going deeper with convolutions. Introduces the Inception deep convolutional architecture, which set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
- Stacked Hourglass Networks for Human Pose Estimation. Introduces a novel convolutional architecture for human pose estimation, the "stacked hourglass" network, based on successive steps of pooling and upsampling that produce a final set of predictions.
- Convolutional Pose Machines. Incorporates a convolutional network into the pose machine framework to learn image features and image-dependent spatial models for pose estimation, implicitly modeling long-range dependencies between variables in structured prediction tasks such as articulated pose estimation.
- 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. Introduces the "MPII Human Pose" benchmark, which makes a significant advance in terms of diversity and difficulty, a contribution required for future developments in human body models.
- Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. Introduces Human3.6M, a dataset of 3.6 million accurate 3D human poses, acquired by recording the performance of 5 female and 6 male subjects under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms.