Unite the People: Closing the Loop Between 3D and 2D Human Representations

Christoph Lassner¹,²  classner@tue.mpg.de
Javier Romero³,*  javier.romero@bodylabs.com
Martin Kiefel²  mkiefel@tue.mpg.de
Federica Bogo⁴,*  febogo@microsoft.com
Michael J. Black²  black@tue.mpg.de
Peter V. Gehler⁵,*  pgehler@tue.mpg.de

¹ Bernstein Center for Computational Neuroscience, Tübingen, Germany
² MPI for Intelligent Systems, Tübingen, Germany
³ Body Labs Inc., New York, United States
⁴ Microsoft, Cambridge, UK
⁵ University of Würzburg, Germany
Abstract

3D models provide a common ground for different representations of human bodies. In turn, robust 2D estimation has proven to be a powerful tool to obtain 3D fits "in-the-wild". However, depending on the level of detail, it can be hard to impossible to acquire labeled data for training 2D estimators on a large scale. We propose a hybrid approach to this problem: with an extended version of the recently introduced SMPLify method, we obtain high quality 3D body model fits for multiple human pose datasets. Human annotators solely sort good and bad fits. This procedure leads to an initial dataset, UP-3D, with rich annotations. With a comprehensive set of experiments, we show how this data can be used to train discriminative models that produce results with an unprecedented level of detail: our models predict 31 segments and 91 landmark locations on the body. Using the 91 landmark pose estimator, we present state-of-the-art results for 3D human pose and shape estimation using an order of magnitude less training data and without assumptions about gender or pose in the fitting procedure. We show that UP-3D can be enhanced with these improved fits to grow in quantity and quality, which makes the system deployable on a large scale. The data, code and models are available for research purposes.
1. Introduction

Teaching computers to recognize and understand humans in images and videos is a fundamental task of computer vision. Different applications require different trade-offs between fidelity of the representation and inference complexity. This has led to a wide range of parameterizations for human bodies and corresponding prediction methods, ranging from bounding boxes to detailed 3D models.

* This work was performed while J. Romero and F. Bogo were with the MPI-IS²; P. V. Gehler with the BCCN¹ and MPI-IS².
Figure 1: Lower row: validated 3D body model fits on various datasets form our initial dataset, UP-3D, and provide labels for multiple tasks. Top row: we perform experiments on semantic body part segmentation, pose estimation and 3D fitting. Improved 3D fits can extend the initial dataset.
Learning-based algorithms, especially convolutional neural networks (CNNs), are the leading methods to cope with the complexity of human appearance. Their representational power has led to increasingly robust algorithms for bounding box detection [10], keypoint detection [19, 32, 42] and body part segmentation [7, 15, 43]. However, they are usually applied in isolation on separate datasets and independently from the goal of precise 3D body estimation. In this paper we aim to overcome this separation and "unite the people" of different datasets and for multiple tasks. With this strategy, we attack the main problem of learning-based approaches for complex body representations: the lack of data. While it is feasible to annotate a small number of keypoints in images (e.g., 14 in the case of the MPII-HumanPose dataset [1]), scaling to larger numbers quickly becomes impractical and prone to annotation inconsistency. The same is true for semantic segmentation annotations: most datasets provide labels for only a few body parts.

In this paper, we aim to develop a self-improving, scalable method that obtains high-quality 3D body model fits for 2D images (see Fig. 1 for an illustration). To form an initial dataset of 3D body fits, we use an improved version of the recently developed SMPLify method [4] that elevates 2D keypoints to a full body model of pose and shape. A more robust initialization and an additional fitting objective allow us to apply it on the ground truth keypoints of the standard human pose datasets; human annotators solely sort good and bad fits.

This semi-automatic scheme has several advantages. The required annotation time is greatly reduced (Sec. 3.3). By projecting surfaces (Sec. 4.1) or keypoints (Sec. 4.2) from the fits to the original images, we obtain consistent labels while retaining generalization performance. The rich representation and the flexible fitting process make it easy to integrate datasets with different label sets, e.g., a different set of keypoint locations.

Predictions from our 91 keypoint model improve the 3D model fitting method that generated the annotations for training the keypoint model in the first place. We report state-of-the-art results on the HumanEva and Human3.6M datasets (Sec. 4.3). Further, using the 3D body fits, we develop a random forest method for 3D pose estimation that runs orders of magnitude faster than SMPLify (Sec. 4.4).

The improved predictions from the 91 landmark model increase the ratio of high quality 3D fits on the LSP dataset by 9.3% when compared to the fits using 14 keypoint ground truth locations (Sec. 5). This ability for self-improvement, together with the possibility to easily integrate new data into the pool, makes the presented system deployable on a large scale. Data, code and models are available for research purposes on the project homepage at http://up.is.tuebingen.mpg.de/.
2. Related Work

Acquiring human pose annotations in 3D is a long-standing problem with several attempts from the computer vision as well as the 3D human pose community.

The classical 2D representation of humans is 2D keypoints [1, 6, 23, 38, 39]. While 2D keypoint prediction has seen considerable progress in recent years and could be considered close to being solved [19, 32, 42], 3D pose estimation from single images remains a challenge [4, 36, 44].

Bourdev and Malik [5] enhanced the H3D dataset from 20 keypoint annotations for 1,240 people in 2D with relative 3D information as well as 11 annotated body part segments. In contrast, the HumanEva [41] and Human3.6M [21] datasets provide very accurate 3D labels: they are both recorded in motion capture environments. Both datasets have high fidelity but contain only a very limited level of diversity in background and person appearance. We evaluate 3D human pose estimation performance on both. Recent approaches target 3D pose ground truth from natural scenes, but either rely on vision systems prone to failure [11] or inertial suits that modify the appearance of the body and are prone to motion drift [44].

Body representations beyond 3D skeletons have a long history in the computer vision community [17, 30, 31, 35]. More recently, these representations have gained new popularity in approaches that fit detailed surfaces of a body model to images [4, 14, 16, 25, 44]. These representations are more tightly connected to the physical reality of the human body and the image formation process.

One of the classic problems related to representations of the extent of the body is body part segmentation. Fine-grained part segmentation has been added to the public parts of the VOC dataset [12] by Chen et al. [8]. Annotations for 24 human body parts, and also part segments for all VOC object classes where applicable, are available. Even though hard to compare, we provide results on the dataset. The Freiburg Sitting People dataset [33] consists of 200 images with 14-part segmentations and is tailored towards sitting poses. The ideas by Shotton et al. [40] for 2.5D data inspired our body part representation. Relatively simple methods have proven to achieve good performance in segmentation tasks with "easy" backgrounds like Human80k, a subset of Human3.6M [20].

Following previous work on cardboard people [24] and contour people [13], an attempt to work towards an intermediate-level person representation is the JHMDB dataset and the related labeling tool [22]. It relies on 'puppets' to ease the annotation task, while providing a higher level of detail than solely joint locations.

The attempt to unify representations for human bodies has been made mainly in the context of human kinematics [2, 29], where a rich representation for 3D motion capture marker sets is used to transfer captures to different targets. The setup of markers to capture not only human motion but also shape has been explored by Loper et al. [28] for motion capture scenarios. While they optimized the placement of markers for a 12 camera setup, we must ensure that the markers disambiguate pose and shape from a single view. Hence, we use a denser set of markers.
3. Building the Initial Dataset

Our motivation to use a common 3D representation is to (1) map many possible representations from a variety of datasets to it, and (2) generate detailed and consistent labels for supervised model training from it.

We argue that the use of a full human body model with a prior on shape and pose is necessary: without the visualization possibilities and regularization, it may be impossible to create sufficiently accurate annotations for small body parts. However, so far, no dataset is available that provides human body model fits on a large variety of images.

To fill this gap, we build on a set of human pose datasets with annotated keypoints. SMPLify [4] presented promising results for automatically translating these into 3D body model fits. This helps us keep the human involvement to a minimum. With strongly increasing working times and levels of label noise for increasingly complex tasks, this may be a critical decision for creating a large dataset of 3D body models.
3.1. Improving Body Shape Estimation

In [4], the authors fit the pose and shape parameters of the SMPL [26] body model to 2D keypoints by minimizing an objective function composed of a data term and several penalty terms that represent priors over pose and shape. However, the connection length between two keypoints is the only indicator that can be used to estimate body shape. Our aim is to match the shape of the body model as accurately as possible to the images, hence we must incorporate a shape objective in the fitting.

The best evidence for the extent of a 3D body projected on a 2D image is encoded by its silhouette. We define the silhouette to be the set of all pixels belonging to a body's projection. Hence, we add a term to the original SMPLify objective to prefer solutions for which the image silhouette, $S$, and the model silhouette, $\hat{S}$, match.

Let $M(\vec{\theta}, \vec{\beta}, \vec{\gamma})$ be a 3D mesh generated by a SMPL body model with pose $\vec{\theta}$, shape $\vec{\beta}$, and global translation $\vec{\gamma}$. Let $\Pi(\cdot, K)$ be a function that takes a 3D mesh and projects it into the image plane given camera parameters $K$, such that $\hat{S}(\vec{\theta}, \vec{\beta}, \vec{\gamma}) = \Pi(M(\vec{\theta}, \vec{\beta}, \vec{\gamma}), K)$ represents the silhouette pixels of the model in the image. We compute the bi-directional distance between $S$ and $\hat{S}(\cdot)$:

$$E_S(\vec{\theta}, \vec{\beta}, \vec{\gamma}; S, K) = \sum_{\vec{x} \in \hat{S}(\vec{\theta}, \vec{\beta}, \vec{\gamma})} \operatorname{dist}(\vec{x}, S)^2 + \sum_{\vec{x} \in S} \operatorname{dist}(\vec{x}, \hat{S}(\vec{\theta}, \vec{\beta}, \vec{\gamma})), \qquad (1)$$

where $\operatorname{dist}(\vec{x}, S)$ denotes the absolute distance from a point $\vec{x}$ to the closest point belonging to the silhouette $S$.

The first term in Eq. (1) computes the distance from points of the projected model to a given silhouette, while the second term computes the distance from points in the silhouette to the model. We find that the second term is noisier and use the plain L1 distance to measure its contribution to the energy function, while we use the squared L2 distance to measure the contribution of the first. We optimize the overall objective, including this additional term, using OpenDR [27], just as in [4].
Whereas it would be possible to use an automatic segmentation method to provide foreground silhouettes, we decided to involve human annotators for reliability. We also asked for a six body part segmentation that we will use in Sec. 4 for evaluation. We built an interactive annotation tool on top of the Opensurfaces package [3] to work with Amazon Mechanical Turk (AMT). To obtain image-consistent silhouette borders, we use the interactive GrabCut algorithm [37]. Workers spent more than 1,200 hours on creating the labels for the LSP [23] datasets as well as the single-person part of the MPII-HumanPose [1] dataset (see Tab. 1). Average annotation time increases by more than a factor of two from foreground labels to six body part labels. This provides a hint of how long annotation for a 31 body part representation could take. Examples of six part segmentation labels are provided in Fig. 2.

Dataset             Foreground               6 Body Parts             AMT hours logged
LSP [23]            1000 train, 1000 test    1000 train, 1000 test    361h foreground,
LSP-extended [23]   10000 train              0                        131h parts
MPII-HPDB [1]       13030 train, 2622 test   0                        729h

Table 1: Logged AMT labelling times. The average foreground labeling task was solved in 108s on the LSP and 168s on the MPII datasets, respectively. Annotating the segmentation for six body parts took on average more than twice as long as annotating the foreground segmentation: 236s.

Figure 2: Examples of six part segmentation ground truth. White areas mark inconsistencies with the foreground segmentation and are ignored.
3.2. Handling Noisy Ground Truth Keypoints

The SMPLify method is especially vulnerable to missing annotations of the four torso joints: it uses their locations for an initial depth guess, and convergence deteriorates if this guess is of poor quality.

Finding a good depth initialization is particularly hard due to the foreshortening effect of the perspective projection. However, since we know that only a shortening but no lengthening effect can occur, we can find a more reliable person size estimate $\hat{\theta}$ for a skeleton model with $k$ connections:

$$\hat{\theta} = x_i \cdot \arg\max_y f_i(y), \qquad i = \arg\max_{j=1,\ldots,k} x_j, \qquad (2)$$

where $f_i$ is the distribution over ratios of person size to the length of connection $x_i$. Since this is a skewed distribution, we use a corrected mean to find the solution of the arg max function and obtain a person size estimate. This turns out to be a simple, yet robust estimator.
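In code, the estimator amounts to picking the longest observed skeleton connection and scaling it by the typical person-size-to-connection-length ratio for that connection. A minimal sketch under the assumption that per-connection ratio samples have been collected beforehand; the paper does not specify its "corrected mean", so the trimmed mean below is a stand-in:

```python
import numpy as np

def estimate_person_size(connection_lengths, ratio_samples, trim=0.1):
    """Person size estimate in the spirit of Eq. (2).

    connection_lengths: length-k array of observed 2D connection lengths x_j.
    ratio_samples: list of k arrays with samples of the person-size /
                   connection-length ratio for each connection (the f_i).
    trim: fraction trimmed from each tail; a stand-in for the paper's
          unspecified 'corrected mean'.
    """
    # Perspective projection can only shorten connections, so the longest
    # observed connection carries the most reliable size information.
    i = int(np.argmax(connection_lengths))
    samples = np.sort(np.asarray(ratio_samples[i]))
    cut = int(len(samples) * trim)
    trimmed = samples[cut:len(samples) - cut] if cut > 0 else samples
    # Scale the most reliable connection by the corrected typical ratio.
    return float(connection_lengths[i] * trimmed.mean())
```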

Dataset:    LSP [23]   LSP-extended [23]   MPII-HP [1]   FashionPose [9]
Accepted:   45%        12%                 25%           23%

Table 2: Percentages of accepted fits per dataset. The addition of the FashionPose dataset is discussed in Sec. 4.2.
3.3. Exploring the Data

With the foreground segmentation data and the adjustments described in the preceding sections, we fit the SMPL model to a total of 27,652 images of the LSP, LSP-extended, and MPII-HumanPose datasets. We use only people marked with the 'single person' flag in MPII-HumanPose to avoid instance segmentation problems. We honor the train/test splits of the datasets and keep images from their test sets in our new, joined test set.

In the next step, human annotators¹ selected the fits where rotation and location of body parts largely match the image evidence. For this task, we provide the original image as well as four perspectives of renderings of the body. Optionally, annotators can overlay rendering and image. These visualizations help to identify fitting errors quickly and reduce the labeling time to 12s per image. The process uncovered many erroneously labeled keypoints, where mistakes in the 3D fit were clear to spot but not obvious in the 2D representation. We excluded head and foot rotation as criteria for the sorting process: there is usually not sufficient information in the original 14 keypoints to estimate them correctly. The resulting ratios of accepted fits can be found in Tab. 2.

Even with the proposed, more robust initialization term, the ratio of accepted fits on the LSP-extended dataset remains the lowest. It has the highest number of missing keypoints of the four datasets, and at the same time the most extreme viewpoints and poses. On the other hand, the rather high ratio of usable fits on the LSP dataset can be explained by its clean and complete annotations.

The validated fits form our initial dataset with 5,569 training images (of which we use a held-out validation set of 1,112 images in our experiments) and 1,208 test images. We denote this dataset as UPI-3D (UnitedPeople in 3D, with an added 'I' for "Initial"). To be able to clearly reference the different label types in the following sections, we add an 'h' to the dataset name when referring to labels from human annotators.
Consistency of Human Labels. The set of curated 3D fits allows us to assess the distribution of the human-provided labels by projecting them to the UPI-3D bodies. We did this for both keypoints and body part segments. Visualizations can be found in Fig. 3.

¹ For this task, we did not rely on AMT workers, but only on a few experts in close collaboration, to maintain consistency.

While keypoint locations in Fig. 3a that fall in completely non-matching areas of the body can be explained by self-occlusion, there is a high variance in keypoint locations around joints. It must be taken into account that the keypoints are projected to the body surface, and depending on person shape and body part orientation some variation can be expected. Nevertheless, even for this reduced set of images with very good 3D fits, high variance areas, e.g., around the hip joints, indicate labeling noise.

The visualization in Fig. 3b shows the density of part types for the six part segmentation with the segments head, torso, left and right arms, and left and right legs. While the head and the lower parts of the extremities show distinct colors, the areas converging to brown represent a mixture of part annotations. The brown tone on the torso is a clear indicator of the frequent occlusion by the arms. The area around the hips shows a smooth transition from torso to leg color, hinting again at varying annotation styles.
4. Label Generation and Learning

In a comprehensive series of experiments, we analyze the quality of labels generated from UPI-3D. We focus on labels for well-established tasks, but highlight that the generation possibilities are not limited to them: all types of data that can be extracted from the body model can be used as labels for supervised training. In our experiments, we move from surface (segmentation) prediction, over 2D pose estimation, to 3D pose and shape estimation, and finally to a method for predicting 3D body pose and shape directly from 2D landmark positions.
4.1. Semantic Body Part Segmentation

We segment the SMPL mesh into 31 regions, following the segmentation into semantic parts introduced in [40] (for a visualization, see Fig. 3d). We note that the Kinect tracker works on 2.5D data while our detectors only receive 2D data as input. We deliberately did not make any of our methods for data collection or prediction dependent on 2.5D data, to retain generality. This way, we can use them on outdoor images and regular 2D photo datasets. The segmentation dataset UPI-S31 is obtained by projecting the segmented 3D mesh posed on the 6,777 images of UPI-3D.

Following [7], we optimize a multiscale ResNet101 on a pixel-wise cross entropy loss. We train the network on size-normalized, cutout images, which could in a production system be provided by a person detector. Following best practices for CNN training, we use a validation set to determine the optimal number of training iterations and the person size, which is around 500 pixels. This high resolution allows the CNN to reliably predict small body parts. In this challenging setup, we achieve an intersection over union (IoU) score of 0.4432 and an accuracy of 0.9331. Qualitative results on five datasets are shown in Fig. 4a.
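To make the training setup concrete, the following is a minimal PyTorch sketch of pixel-wise cross-entropy training for 32 classes (31 body parts plus background). The backbone, optimizer settings, and the ignore label are illustrative assumptions; the authors use a multiscale ResNet101 variant following [7], which is not reproduced here.

```python
import torch
import torch.nn as nn
import torchvision

# 31 body part classes plus background; backbone chosen for illustration.
num_classes = 32
model = torchvision.models.segmentation.fcn_resnet101(num_classes=num_classes)
criterion = nn.CrossEntropyLoss(ignore_index=255)  # 255: unlabeled pixels (assumed convention)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """images: (B, 3, H, W) float tensor of size-normalized crops;
    labels: (B, H, W) long tensor of per-pixel part indices."""
    optimizer.zero_grad()
    logits = model(images)["out"]     # (B, num_classes, H, W)
    loss = criterion(logits, labels)  # cross entropy evaluated at every pixel
    loss.backward()
    optimizer.step()
    return loss.item()
```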

Figure 3: Density of human annotations on high quality body model fits for (a) keypoints and (b) six part segmentation in front and back views. Areas of the bodies are colored with (1) hue according to part label, and (2) saturation according to frequency of the label. Keypoints on completely 'wrong' body parts are due to self-occlusion. The high concentration of 'head' labels in the nose region originates from the FashionPose dataset, where the 'head' keypoint is placed on the nose. The segmentation data originates solely from the six part segmentation labels on the LSP dataset. (Must be viewed in color.) (c) Placement of the 91 landmarks (left: front, right: back). (d) Segmentation for generating the 31 part labels.
The overall performance is compelling: even the small segments around the joints are recovered reliably. Left and right sides of the subjects are identified correctly, and the four parts of the head provide an estimate of head orientation. The average IoU score is dominated by the small segments, such as the wrists.

The VOC part dataset is a hard match for our predictor: instead of providing instances of people, it consists of entire scenes, and many people are visible at small scale. To provide a comparison, we use the instance annotations from the VOC-Part dataset, cut out samples and reduce the granularity of our segmentation to match the widely used six part representation. Because of the low resolution of many displayed people and extreme perspectives with, e.g., only a face visible, the predictor often only predicts the background class on images not matching our training scheme. Still, we achieve an IoU score of 0.3185 and 0.7208 accuracy over the entire dataset without finetuning.

Additional examples from the LSP, MPII-HumanPose, FashionPose, Fashionista, VOC, HumanEva and Human3.6M datasets are shown in the supplementary material available on the project homepage². The model has not been trained on any of the latter four, but the results indicate good generalization behavior. We include a video to visualize stability across consecutive frames.

² http://up.is.tuebingen.mpg.de/
4.2. Human Pose Estimation

With the 3D body fits, we can generate consistent keypoints not only on the human skeleton but also on the body surface. For the experiments in the rest of this paper, we designed a 91-landmark³ set to analyze a dense keypoint set.

³ We use the term 'landmark' to refer to keypoints on the mesh surface, to emphasize the difference to the term 'joints', used so far for keypoints located inside of the body.

We distributed the landmarks according to two criteria: disambiguation of body part configuration and estimation of body shape. The former requires placement of markers around joints to get a good estimate of their configuration. To satisfy the latter, we place landmarks at regular intervals around the body to get an estimate of spatial extent independent of the viewpoint. We visualize our selection in Fig. 3c and example predictions in Fig. 4b.

In the visualization of predictions, we show a subset of the 91 landmarks and only partially connect the displayed ones for better interpretability. The core 14 keypoints describing the human skeleton are part of our selection to describe the fundamental pose and maintain comparability with existing methods.
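Generating such landmark labels boils down to projecting fixed vertices of the fitted mesh into the image. A minimal pinhole-camera sketch; `landmark_vertex_ids` (the 91 chosen vertex indices) and the camera parameters are assumed inputs, not values from the paper:

```python
import numpy as np

def project_landmarks(vertices, landmark_vertex_ids, K, R, t):
    """Project selected 3D mesh vertices to 2D image landmarks.

    vertices: (N, 3) array of SMPL mesh vertices in world coordinates.
    landmark_vertex_ids: indices of the 91 vertices chosen as landmarks.
    K: (3, 3) camera intrinsics; R: (3, 3) rotation; t: (3,) translation.
    """
    points = vertices[landmark_vertex_ids]   # (91, 3) selected landmarks
    cam = points @ R.T + t                   # world -> camera frame
    uvw = cam @ K.T                          # pinhole projection (homogeneous)
    return uvw[:, :2] / uvw[:, 2:3]          # (91, 2) pixel coordinates
```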
We use a state-of-the-art DeeperCut CNN [19] for our pose-related experiments, but believe that using other models such as Convolutional Pose Machines [42] or Stacked Hourglass Networks [32] would lead to similar findings.

To assess the influence of the quality of our data and the difference of the loss function for 91 and 14 keypoints, we train multiple CNNs: (1) using all human labels, but on our (smaller) dataset, for 14 keypoints (UPI-P14h), and (2) on the dense 91 landmarks from projections of the SMPL mesh (UPI-P91). Again, models are trained on size-normalized crops with cross-validated parameters. We include the performance of the original DeeperCut CNN, which has been trained on the full LSP, LSP-extended and MPII-HumanPose datasets (in total more than 52,000 people), in the comparison with the models trained on our data (in total 5,569 people). The results are summarized in Tab. 3. Even though the size of the dataset is reduced by nearly an order of magnitude, we maintain high performance compared to the original DeeperCut CNN. Comparing the two models trained on the same amount of data, we find that the model trained on the 91 landmarks from

References

- The Pascal Visual Object Classes (VOC) Challenge.
- SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size.
- "GrabCut": interactive foreground extraction using iterated graph cuts.
- Stacked Hourglass Networks for Human Pose Estimation.
- Pedestrian Detection: An Evaluation of the State of the Art.