
Unite the People: Closing the Loop Between 3D and 2D Human Representations

Christoph Lassner¹,²  classner@tue.mpg.de
Javier Romero³,*  javier.romero@bodylabs.com
Martin Kiefel²  mkiefel@tue.mpg.de
Federica Bogo⁴,*  febogo@microsoft.com
Michael J. Black²  black@tue.mpg.de
Peter V. Gehler⁵,*  pgehler@tue.mpg.de

¹ Bernstein Center for Computational Neuroscience, Tübingen, Germany
² MPI for Intelligent Systems, Tübingen, Germany
³ Body Labs Inc., New York, United States
⁴ Microsoft, Cambridge, UK
⁵ University of Würzburg, Germany
Abstract
3D models provide a common ground for different repre-
sentations of human bodies. In turn, robust 2D estimation
has proven to be a powerful tool to obtain 3D fits “in-the-
wild”. However, depending on the level of detail, it can be
hard to impossible to acquire labeled data for training 2D
estimators on large scale. We propose a hybrid approach to
this problem: with an extended version of the recently in-
troduced SMPLify method, we obtain high quality 3D body
model fits for multiple human pose datasets. Human anno-
tators solely sort good and bad fits. This procedure leads
to an initial dataset, UP-3D, with rich annotations. With a
comprehensive set of experiments, we show how this data
can be used to train discriminative models that produce re-
sults with an unprecedented level of detail: our models pre-
dict 31 segments and 91 landmark locations on the body.
Using the 91 landmark pose estimator, we present state-of-the-art results for 3D human pose and shape estimation using an order of magnitude less training data and without
assumptions about gender or pose in the fitting procedure.
We show that UP-3D can be enhanced with these improved
fits to grow in quantity and quality, which makes the system
deployable on large scale. The data, code and models are
available for research purposes.
1. Introduction
Teaching computers to recognize and understand hu-
mans in images and videos is a fundamental task of com-
puter vision. Different applications require different trade-
offs between fidelity of the representation and inference
complexity. This led to a wide range of parameterizations
for human bodies and corresponding prediction methods
ranging from bounding boxes to detailed 3D models.
* This work was performed while J. Romero and F. Bogo were with the MPI-IS²; P. V. Gehler with the BCCN¹ and MPI-IS².
[Figure 1 overview: label generation from Leeds Sports Pose / extended, MPII HPDB and FashionPose into the United People dataset, which feeds experiments on 31 parts, 91 landmarks, 3D fits, direct 3D prediction and 3D fit improvement.]
Figure 1: Lower row: validated 3D body model fits on various datasets form our initial dataset, UP-3D, and provide labels for multiple tasks. Top row: we perform experiments on semantic body part segmentation, pose estimation and 3D fitting. Improved 3D fits can extend the initial dataset.
Learning-based algorithms, especially convolutional
neural networks (CNNs), are the leading methods to cope
with the complexity of human appearance. Their represen-
tational power has led to increasingly robust algorithms for
bounding box detection [10], keypoint detection [19, 32, 42] and body part segmentation [7, 15, 43]. However, they are
usually applied in isolation on separate datasets and inde-
pendent from the goal of precise 3D body estimation. In
this paper we aim to overcome this separation and “unite
the people” of different datasets and for multiple tasks.
With this strategy, we attack the main problem of learning-
based approaches for complex body representations: the
lack of data. While it is feasible to annotate a small number
of keypoints in images (e.g., 14 in the case of the MPII-
HumanPose dataset [1]), scaling to larger numbers quickly
becomes impractical and prone to annotation inconsistency.
The same is true for semantic segmentation annotations:
most datasets provide labels for only a few body parts.

In this paper, we aim to develop a self-improving, scal-
able method that obtains high-quality 3D body model fits
for 2D images (see Fig. 1 for an illustration). To form an initial dataset of 3D body fits, we use an improved version of the recently developed SMPLify method [4] that elevates
2D keypoints to a full body model of pose and shape. A
more robust initialization and an additional fitting objective
allow us to apply it on the ground truth keypoints of the
standard human pose datasets; human annotators solely sort
good and bad fits.
This semi-automatic scheme has several advantages.
The required annotation time is greatly reduced (Sec. 3.3). By projecting surfaces (Sec. 4.1) or keypoints (Sec. 4.2)
from the fits to the original images, we obtain consistent
labels while retaining generalization performance. The rich
representation and the flexible fitting process make it easy
to integrate datasets with different label sets, e.g., a different
set of keypoint locations.
Predictions from our 91 keypoint model improve the
3D model fitting method that generated the annotations for
training the keypoint model in the first place. We report
state-of-the-art results on the HumanEva and Human3.6M datasets (Sec. 4.3). Further, using the 3D body fits, we de-
velop a random forest method for 3D pose estimation that
runs orders of magnitude faster than SMPLify (Sec. 4.4).
The improved predictions from the 91 landmark model
increase the ratio of high quality 3D fits on the LSP
dataset by 9.3% when compared to the fits using 14 key-
point ground truth locations (Sec. 5). This ability for self-
improvement together with the possibility to easily integrate
new data into the pool make the presented system deploy-
able on large scale. Data, code and models are available
for research purposes on the project homepage at
http://up.is.tuebingen.mpg.de/.
2. Related Work
Acquiring human pose annotations in 3D is a long-
standing problem with several attempts from the computer
vision as well as the 3D human pose community.
The classical 2D representation of humans is 2D keypoints [1, 6, 23, 38, 39]. While 2D keypoint prediction has
seen considerable progress in the last years and could be
considered close to being solved [19, 32, 42], 3D pose esti-
mation from single images remains a challenge [4, 36, 44].
Bourdev and Malik [5] enhanced the H3D dataset from
20 keypoint annotations for 1,240 people in 2D with relative
3D information as well as 11 annotated body part segments.
In contrast, the HumanEva [41] and Human3.6M [21]
datasets provide very accurate 3D labels: they are both
recorded in motion capture environments. Both datasets
have high fidelity but contain only a very limited level of
diversity in background and person appearance. We eval-
uate the 3D human pose estimation performance on both.
Recent approaches target 3D pose ground truth from natu-
ral scenes, but either rely on vision systems prone to fail-
ure [11] or inertial suits that modify the appearance of the body and are prone to motion drift [44].
Body representations beyond 3D skeletons have a long
history in the computer vision community [17, 30, 31, 35].
More recently, these representations have taken new pop-
ularity in approaches that fit detailed surfaces of a body
model to images [4, 14, 16, 25, 44]. These representations
are more tightly connected to the physical reality of the hu-
man body and the image formation process.
One of the classic problems related to representations
of the extent of the body is body part segmentation. Fine-
grained part segmentation has been added to the public parts
of the VOC dataset [12] by Chen et al. [8]. Annotations
for 24 human body parts and also part segments for all
VOC object classes, where applicable, are available. Even
though hard to compare, we provide results on the dataset.
The Freiburg Sitting People dataset [33] consists of 200 im-
ages with 14 part segmentation and is tailored towards sit-
ting poses. The ideas by Shotton et al. [40] for 2.5D data
inspired our body part representation. Relatively simple
methods have proven to achieve good performance in seg-
mentation tasks with “easy” backgrounds like Human80k, a
subset of Human3.6M [20].
Following previous work on cardboard people [24] and contour people [13], an attempt to work towards an
intermediate-level person representation is the JHMDB
dataset and the related labeling tool [22]. It relies on ‘pup-
pets’ to ease the annotation task, while providing a higher
level of detail than solely joint locations.
The attempt to unify representations for human bodies
has been made mainly in the context of human kinemat-
ics [2, 29]. In their work, a rich representation for 3D mo-
tion capture marker sets is used to transfer captures to dif-
ferent targets. The setup of markers to capture not only hu-
man motion but also shape has been explored by Loper et
al. [28] for motion capture scenarios. While they optimized
the placement of markers for a 12 camera setup, we must
ensure that the markers disambiguate pose and shape from
a single view. Hence, we use a denser set of markers.
3. Building the Initial Dataset
Our motivation to use a common 3D representation is
to (1) map many possible representations from a variety of
datasets to it, and (2) generate detailed and consistent labels
for supervised model training from it.
We argue that the use of a full human body model with a
prior on shape and pose is necessary: without the visualiza-
tion possibilities and regularization, it may be impossible to
create sufficiently accurate annotations for small body parts.
However, so far, no dataset is available that provides human
body model fits on a large variety of images.

To fill this gap, we build on a set of human pose datasets
with annotated keypoints. SMPLify [4] presented promis-
ing results for automatically translating these into 3D body
model fits. This helps us to keep the human involvement to a
minimum. With strongly increasing working times and lev-
els of label noise for increasingly complex tasks, this may
be a critical decision to create a large dataset of 3D body
models.
3.1. Improving Body Shape Estimation
In [4], the authors fit the pose and shape parameters of the SMPL [26] body model to 2D keypoints by minimiz-
ing an objective function composed of a data term and sev-
eral penalty terms that represent priors over pose and shape.
However, the connection length between two keypoints is
the only indicator that can be used to estimate body shape.
Our aim is to match the shape of the body model as accu-
rately as possible to the images, hence we must incorporate
a shape objective in the fitting.
The best evidence for the extent of a 3D body projected
on a 2D image is encoded by its silhouette. We define the
silhouette to be the set of all pixels belonging to a body’s
projection. Hence, we add a term to the original SMPLify
objective to prefer solutions for which the image silhouette,
$S$, and the model silhouette, $\hat{S}$, match.
Let $M(\vec{\theta}, \vec{\beta}, \vec{\gamma})$ be a 3D mesh generated by a SMPL body model with pose $\vec{\theta}$, shape $\vec{\beta}$, and global translation $\vec{\gamma}$. Let $\Pi(\cdot, K)$ be a function that takes a 3D mesh and projects it into the image plane given camera parameters $K$, such that $\hat{S}(\vec{\theta}, \vec{\beta}, \vec{\gamma}) = \Pi(M(\vec{\theta}, \vec{\beta}, \vec{\gamma}))$ represents the silhouette pixels of the model in the image. We compute the bi-directional distance between $S$ and $\hat{S}(\cdot)$:
$$E_S(\vec{\theta}, \vec{\beta}, \vec{\gamma}; S, K) = \sum_{\vec{x} \in \hat{S}(\vec{\theta}, \vec{\beta}, \vec{\gamma})} \operatorname{dist}(\vec{x}, S)^2 + \sum_{\vec{x} \in S} \operatorname{dist}(\vec{x}, \hat{S}(\vec{\theta}, \vec{\beta}, \vec{\gamma})), \qquad (1)$$
where $\operatorname{dist}(\vec{x}, S)$ denotes the absolute distance from a point $\vec{x}$ to the closest point belonging to the silhouette $S$.
The first term in Eq. (1) computes the distance from
points of the projected model to a given silhouette, while
the second term computes the distance from points in the
silhouette to the model. We find that the second term is
noisier and use the plain L1 distance to measure its contri-
bution to the energy function while we use the squared L2
distance to measure the contribution of the first. We op-
timize the overall objective including this additional term
using OpenDR [27], just as in [4].
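As an illustration of the bi-directional term in Eq. (1), the sketch below evaluates it on binary masks with distance transforms; the function and variable names are hypothetical, and the actual objective is optimized with the differentiable renderer OpenDR rather than a NumPy evaluation like this.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def silhouette_energy(model_sil, image_sil):
    """Evaluate Eq. (1) for boolean masks of the model silhouette S_hat
    (rendered from the current SMPL fit) and the image silhouette S."""
    # distance_transform_edt assigns each nonzero pixel its distance to the
    # nearest zero pixel, so inverting a mask yields dist(x, silhouette).
    dist_to_image = distance_transform_edt(~image_sil)
    dist_to_model = distance_transform_edt(~model_sil)
    # First term: squared L2 distances from model silhouette pixels to S.
    model_to_image = np.sum(dist_to_image[model_sil] ** 2)
    # Second term: plain (L1) distances from image silhouette pixels to
    # S_hat, the direction the authors found to be noisier.
    image_to_model = np.sum(dist_to_model[image_sil])
    return model_to_image + image_to_model
```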
Whereas it would be possible to use an automatic seg-
mentation method to provide foreground silhouettes, we
decided to involve human annotators for reliability. We
also asked for a six body part segmentation that we will use
Dataset             Foreground                6 Body Parts             AMT hours logged
LSP [23]            1000 train, 1000 test     1000 train, 1000 test    361h foreground,
LSP-extended [23]   10000 train               0                        131h parts
MPII-HPDB [1]       13030 train, 2622 test    0                        729h

Table 1: Logged AMT labelling times. The average foreground labeling task was solved in 108s on the LSP and 168s on the MPII datasets respectively. Annotating the segmentation for six body parts took on average more than twice as long as annotating foreground segmentation: 236s.
Figure 2: Examples for six part segmentation ground truth.
White areas mark inconsistencies with the foreground seg-
mentation and are ignored.
in Sec. 4 for evaluation. We built an interactive annotation tool on top of the Opensurfaces package [3] to work
with Amazon Mechanical Turk (AMT). To obtain image-
consistent silhouette borders, we use the interactive Grab-
cut algorithm [37]. Workers spent more than 1,200 hours
on creating the labels for the LSP [23] datasets as well as the single-person part of the MPII-HumanPose [1] dataset (see Tab. 1). There is an increase in average annotation time
of more than a factor of two comparing annotation for fore-
ground labels and six body part labels. This provides a hint
on how long annotation for a 31 body part representation
could take. Examples for six part segmentation labels are
provided in Fig. 2.
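To sketch how the interactive GrabCut step [37] snaps a coarse annotator mask to image-consistent silhouette borders, one can call OpenCV's implementation as below; the batch interface and mask conventions here are assumptions, since the actual AMT tool built on the Opensurfaces package is interactive.

```python
import cv2
import numpy as np

def refine_silhouette(image_bgr, rough_mask, iterations=5):
    """Refine a coarse foreground annotation with GrabCut [37].
    rough_mask is nonzero where the annotator marked foreground."""
    # Initialize every pixel with a 'probable' label derived from the strokes.
    mask = np.where(rough_mask > 0, cv2.GC_PR_FGD, cv2.GC_PR_BGD).astype(np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)  # internal GMM state
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_MASK)
    # Pixels labeled (probable) foreground form the refined silhouette.
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```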
3.2. Handling Noisy Ground Truth Keypoints
The SMPLify method is especially vulnerable to missing
annotations of the four torso joints: it uses their locations for
an initial depth guess, and convergence deteriorates if this
guess is of poor quality.
Finding a good depth initialization is particularly hard
due to the foreshortening effect of the perspective projec-
tion. However, since we know that only a shortening but
no lengthening effect can occur, we can find a more reliable
person size estimate $\hat{\theta}$ for a skeleton model with $k$ connections:

$$\hat{\theta} = x_i \cdot \arg\max_y f_i(y), \qquad i = \arg\max_{j=1,\dots,k} x_j, \qquad (2)$$

where $f_i$ is the distribution over ratios of person size to the length of connection $x_i$. Since this is a skewed distribution,
we use a corrected mean to find the solution of the arg max
function and obtain a person size estimate. This turns out to
be a simple, yet robust estimator.
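The following sketch illustrates the estimator of Eq. (2); the data layout and the tail-trimmed "corrected mean" are assumptions of this example, as the paper only states that a corrected mean of the skewed ratio distribution is used.

```python
import numpy as np

def estimate_person_size(connection_lengths, ratio_samples):
    """connection_lengths: observed 2D lengths x_j of the k skeleton connections.
    ratio_samples[i]: samples of person size divided by the length of
    connection i, i.e. samples from the distribution f_i."""
    # Foreshortening can only shorten a projected connection, never lengthen
    # it, so the longest observed connection is the most reliable cue.
    i = int(np.argmax(connection_lengths))
    ratios = np.asarray(ratio_samples[i], dtype=float)
    # A 'corrected mean' for the skewed distribution; trimming the tails is
    # one possible correction and purely an assumption of this sketch.
    lo, hi = np.percentile(ratios, [5, 95])
    corrected_mean = ratios[(ratios >= lo) & (ratios <= hi)].mean()
    return connection_lengths[i] * corrected_mean
```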

                 LSP [23]   LSP-extended [23]   MPII-HP [1]   FashionPose [9]
Accepted fits    45%        12%                 25%           23%

Table 2: Percentages of accepted fits per dataset. The addition of the FashionPose dataset is discussed in Sec. 4.2.
3.3. Exploring the Data
With the foreground segmentation data and the ad-
justments described in the preceding sections, we fit the
SMPL model to a total of 27,652 images of the LSP,
LSP-extended, and MPII-HumanPose datasets. We use
only people marked with the ‘single person’ flag in MPII-
HumanPose to avoid instance segmentation problems. We
honor the train/test splits of the datasets and keep images
from their test sets in our new, joined test set.
In the next step, human annotators¹ selected the fits
where rotation and location of body parts largely match the
image evidence. For this task, we provide the original im-
age, as well as four perspectives of renderings of the body.
Optionally, annotators can overlay rendering and image.
These visualizations help to identify fitting errors quickly
and reduce the labeling time to 12s per image. The pro-
cess uncovered many erroneously labeled keypoints, where
mistakes in the 3D fit were clear to spot, but not obvious in
the 2D representation. We excluded head and foot rotation
as criteria for the sorting process. There is usually not suf-
ficient information in the original 14 keypoints to estimate
them correctly. The resulting ratios of accepted fits can be
found in Tab. 2.
Even with the proposed, more robust initialization term,
the ratio of accepted fits on the LSP-extended dataset re-
mains the lowest. It has the highest number of missing key-
points of the four datasets, and at the same time the most
extreme viewpoints and poses. On the other hand, the rather
high ratio of usable fits on the LSP dataset can be explained
with the clean and complete annotations.
The validated fits form our initial dataset with 5,569
training images (of which we use a held-out validation set
of 1,112 images in our experiments) and 1,208 test images.
We denote this dataset as UPI-3D (UnitedPeople in 3D with
an added ‘I’ for “Initial”). To be able to clearly reference the
different label types in the following sections, we add an ‘h’
to the dataset name when referring to labels from human an-
notators.
Consistency of Human Labels The set of curated 3D fits
allows us to assess the distribution of the human-provided
labels by projecting them to the UPI-3D bodies. We did this
for both keypoints and body part segments. Visualizations can be found in Fig. 3.
¹ For this task, we did not rely on AMT workers, but only on a few experts in close collaboration to maintain consistency.
While keypoint locations in Fig. 3a in completely non-
matching areas of the body can be explained by self-
occlusion, there is a high variance in keypoint locations
around joints. It must be taken into account that the key-
points are projected to the body surface, and depending
on person shape and body part orientation some variation
can be expected. Nevertheless, even for this reduced set of
images with very good 3D fits, high variance areas, e.g.,
around the hip joints, indicate labeling noise.
The visualization in Fig. 3b shows the density of part
types for six part segmentation with the segments head,
torso, left and right arms and left and right legs. While the
head and lower parts of the extremities resemble distinct
colors, the areas converging to brown represent a mixture
of part annotations. The brown tone on the torso is a clear
indicator for the frequent occlusion by the arms. The area
around the hips shows a smooth transition from torso
to leg color, hinting again at varying annotation styles.
4. Label Generation and Learning
In a comprehensive series of experiments, we analyze the
quality of labels generated from UPI-3D. We focus on labels
for well-established tasks, but highlight that the generation
possibilities are not limited to them: all types of data that
can be extracted from the body model can be used as labels
for supervised training. In our experiments, we move from
surface (segmentation) prediction, over 2D pose estimation, to 3D pose and shape estimation, and finally to a method for predicting 3D body pose and shape directly from 2D landmark positions.
4.1. Semantic Body Part Segmentation
We segment the SMPL mesh into 31 regions, following
the segmentation into semantic parts introduced in [40] (for a visualization, see Fig. 3d). We note that the Kinect tracker
works on 2.5D data while our detectors only receive 2D data
as input. We deliberately did not make any of our methods
for data collection or prediction dependent on 2.5D data to
retain generality. This way, we can use it on outdoor images
and regular 2D photo datasets. The Segmentation dataset
UPI-S31 is obtained by projecting the segmented 3D mesh
posed on the 6,777 images of UPI-3D.
Following [7], we optimize a multiscale ResNet101 on
a pixel-wise cross entropy loss. We train the network on
size-normalized, cutout images, which could in a produc-
tion system be provided by a person detector. Following
best practices for CNN training, we use a validation set to
determine the optimal number of training iterations and the
person size, which is around 500 pixels. This high resolu-
tion allows the CNN to reliably predict small body parts.
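A minimal, single-scale PyTorch stand-in for this training setup is given below; the paper's model is a multiscale ResNet101 following [7], so the architecture, the 32-class layout (31 parts plus background) and the hyperparameters here are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import fcn_resnet101

NUM_CLASSES = 32  # 31 body parts + background (assumed label layout)
model = fcn_resnet101(num_classes=NUM_CLASSES)
criterion = nn.CrossEntropyLoss(ignore_index=255)  # ignore unlabeled pixels
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, part_labels):
    """images: (B, 3, H, W) size-normalized crops; part_labels: (B, H, W)."""
    model.train()
    optimizer.zero_grad()
    logits = model(images)["out"]          # (B, NUM_CLASSES, H, W)
    loss = criterion(logits, part_labels)  # pixel-wise cross entropy
    loss.backward()
    optimizer.step()
    return loss.item()
```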
In this challenging setup, we achieve an intersection over
union (IoU) score of 0.4432 and an accuracy of 0.9331.
Qualitative results on five datasets are shown in Fig. 4a.

Figure 3: Density of human annotations on high quality body model fits for (a) keypoints and (b) six part segmentation in
front and back views. Areas of the bodies are colored with (1) hue according to part label, and (2) saturation according to
frequency of the label. Keypoints on completely ‘wrong’ body parts are due to self-occlusion. The high concentration of
‘head’ labels in the nose region originates from the FashionPose dataset, where the ‘head’ keypoint is placed on the nose.
The segmentation data originates solely from the six part segmentation labels on the LSP dataset. (Must be viewed in color.)
(c) Placement of the 91 landmarks (left: front, right: back). (d) Segmentation for generating the 31 part labels.
The overall performance is compelling: even the small
segments around the joints are recovered reliably. Left and
right sides of the subjects are identified correctly, and the
four parts of the head provide an estimate of head orien-
tation. The average IoU score is dominated by the small
segments, such as the wrists.
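For reference, the mean intersection-over-union reported above can be computed as in the following sketch; the exact evaluation protocol (class set, treatment of unlabeled pixels) is an assumption.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=32, ignore_label=255):
    """pred, gt: integer label maps of identical shape."""
    valid = gt != ignore_label
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))
```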
The VOC part dataset is a hard match for our predictor:
instead of providing instances of people, it consists of en-
tire scenes, and many people are visible at small scale. To
provide a comparison, we use the instance annotations from
the VOC-Part dataset, cut out samples and reduce the gran-
ularity of our segmentation to match the widely used six
part representation. Because of the low resolution of many
displayed people and extreme perspectives with, e.g., only
a face visible, the predictor often only predicts the back-
ground class on images not matching our training scheme.
Still, we achieve an IoU score of 0.3185 and 0.7208 accu-
racy over the entire dataset without finetuning.
Additional examples from the LSP, MPII-HumanPose,
FashionPose, Fashionista, VOC, HumanEva and Hu-
man3.6M datasets are shown in the supplementary mate-
rial available on the project homepage². The model has not
been trained on any of the latter four, but the results indi-
cate good generalization behavior. We include a video to
visualize stability across consecutive frames.
4.2. Human Pose Estimation
With the 3D body fits, we can not only generate consis-
tent keypoints on the human skeleton but also on the body
surface. For the experiments in the rest of this paper, we de-
signed a 91-landmark³ set to analyze a dense keypoint set.
² http://up.is.tuebingen.mpg.de/
³ We use the term ‘landmark’ to refer to keypoints on the mesh surface to emphasize the difference to the term ‘joints’ used so far for keypoints located inside of the body.
We distributed the landmarks according to two criteria:
disambiguation of body part configuration and estimation
of body shape. The former requires placement of markers
around joints to get a good estimation of their configuration.
To satisfy the latter, we place landmarks in regular intervals
around the body to get an estimate of spatial extent indepen-
dent of the viewpoint. We visualize our selection in Fig. 3c and example predictions in Fig. 4b.
In the visualization of predictions, we show a subset of
the 91 landmarks and only partially connect the displayed
ones for better interpretability. The core 14 keypoints de-
scribing the human skeleton are part of our selection to
describe the fundamental pose and maintain comparability
with existing methods.
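Such surface landmark labels can be generated by projecting a fixed set of mesh vertices from each accepted 3D fit into the image; the sketch below assumes a pinhole camera and a hypothetical list of 91 chosen vertex indices, which is not necessarily how the released code organizes this.

```python
import numpy as np

def project_landmarks(vertices, landmark_vertex_ids, K, R, t):
    """vertices: (N, 3) SMPL mesh vertices; landmark_vertex_ids: indices of
    the 91 selected surface vertices; K: (3, 3) intrinsics; R, t: extrinsics."""
    pts = vertices[landmark_vertex_ids]   # (91, 3) points on the body surface
    cam = R @ pts.T + t.reshape(3, 1)     # transform into camera coordinates
    uvw = K @ cam                         # homogeneous image coordinates
    return (uvw[:2] / uvw[2]).T           # (91, 2) pixel locations as labels
```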
We use a state-of-the-art DeeperCut CNN [19] for our pose-related experiments, but believe that using other models such as Convolutional Pose Machines [42] or Stacked Hourglass Networks [32] would lead to similar findings.
To assess the influence of the quality of our data and
the difference of the loss function for 91 and 14 key-
points, we train multiple CNNs: (1) using all human la-
bels but on our (smaller) dataset for 14 keypoints (UPI-
P14h) and (2) on the dense 91 landmarks from projections
of the SMPL mesh (UPI-P91). Again, models are trained on
size-normalized crops with cross-validated parameters. We
include the performance of the original DeeperCut CNN,
which has been trained on the full LSP, LSP-extended and
MPII-HumanPose datasets (in total more than 52,000 peo-
ple) in the comparison with the models being trained on our
data (in total 5,569 people). The results are summarized
in Tab. 3. Even though the size of the dataset is reduced
by nearly an order of magnitude, we maintain high perfor-
mance compared to the original DeeperCut CNN. Compar-
ing the two models trained on the same amount of data,
we find that the model trained on the 91 landmarks from

Citations
Proceedings ArticleDOI
18 Jun 2018
TL;DR: This work introduces an adversary trained to tell whether human body shape and pose parameters are real or not using a large database of 3D human meshes, and produces a richer and more useful mesh representation that is parameterized by shape and 3D joint angles.
Abstract: We describe Human Mesh Recovery (HMR), an end-to-end framework for reconstructing a full 3D mesh of a human body from a single RGB image. In contrast to most current methods that compute 2D or 3D joint locations, we produce a richer and more useful mesh representation that is parameterized by shape and 3D joint angles. The main objective is to minimize the reprojection loss of keypoints, which allows our model to be trained using in-the-wild images that only have ground truth 2D annotations. However, the reprojection loss alone is highly underconstrained. In this work we address this problem by introducing an adversary trained to tell whether human body shape and pose parameters are real or not using a large database of 3D human meshes. We show that HMR can be trained with and without using any paired 2D-to-3D supervision. We do not rely on intermediate 2D keypoint detections and infer 3D pose and shape parameters directly from image pixels. Our model runs in real-time given a bounding box containing the person. We demonstrate our approach on various images in-the-wild and out-perform previous optimization-based methods that output 3D meshes and show competitive results on tasks such as 3D joint location estimation and part segmentation.

1,462 citations


Cites methods or result from "Unite the People: Closing the Loop ..."

  • ...We report the segmentation accuracy and average F1 score over all parts including the background as done in [20]....
  • ...[20] take curated results from SMPLify to train 91 keypoint detectors corresponding to traditional body joints and points on the surface....
  • ...We follow [5, 20] and use a regressor to obtain the 14 joints of Human3....
  • ...We also evaluate our approach on the auxiliary task of human body segmentation on the 1000 test images of LSP [17] labeled by [20]....
  • ...Existing methods for recovering 3D human mesh today focus on a multi-stage approach [5, 20]....

Proceedings ArticleDOI
01 Feb 2018
TL;DR: This work establishes dense correspondences between an RGB image and a surface-based representation of the human body, a task referred to as dense human pose estimation, and improves accuracy through cascading, obtaining a system that delivers highly-accurate results at multiple frames per second on a single gpu.
Abstract: In this work we establish dense correspondences between an RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation. We gather dense correspondences for 50K persons appearing in the COCO dataset by introducing an efficient annotation pipeline. We then use our dataset to train CNN-based systems that deliver dense correspondence 'in the wild', namely in the presence of background, occlusions and scale variations. We improve our training set's effectiveness by training an inpainting network that can fill in missing ground truth values and report improvements with respect to the best results that would be achievable in the past. We experiment with fully-convolutional networks and region-based models and observe a superiority of the latter. We further improve accuracy through cascading, obtaining a system that delivers highly-accurate results at multiple frames per second on a single gpu. Supplementary materials, data, code, and videos are provided on the project page http://densepose.org.

987 citations


Cites background or methods from "Unite the People: Closing the Loop ..."

  • ...We use the code provided by [23] with both DeeperCut pose estimation landmark detector [18] for 14-landmark results and with the 91-landmark alternative proposed in [23]....
  • ...A semi-automated method is used for the ‘Unite the People’ (UP) dataset of [23], where human annotators verified the results of fitting the SMPL 3D deformable model [28] to 2D images....
  • ...Surface-level supervision was only recently introduced for synthetic images in [45], while in [23] a dataset of 8515 images is annotated with keypoints and semi-automated fits of 3D models to images....
  • ...The works of [23, 45] can be used as surrogates, but as we show in Sec....
  • ...However, model fitting often fails in the presence of occlusions, or extreme poses, and is never guaranteed to be entirely successful – for instance, even after rejecting a large fraction of the fitting results, the feet are still often misaligned in [23]....

Proceedings ArticleDOI
13 May 2019
TL;DR: Pixel-aligned Implicit Function (PIFu) as mentioned in this paper aligns pixels of 2D images with the global context of their corresponding 3D object to produce highresolution surfaces including largely unseen regions such as the back of a person.
Abstract: We introduce Pixel-aligned Implicit Function (PIFu), an implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object. Using PIFu, we propose an end-to-end deep learning method for digitizing highly detailed clothed humans that can infer both 3D surface and texture from a single image, and optionally, multiple input images. Highly intricate shapes, such as hairstyles, clothing, as well as their variations and deformations can be digitized in a unified way. Compared to existing representations used for 3D deep learning, PIFu produces high-resolution surfaces including largely unseen regions such as the back of a person. In particular, it is memory efficient unlike the voxel representation, can handle arbitrary topology, and the resulting surface is spatially aligned with the input image. Furthermore, while previous techniques are designed to process either a single image or multiple views, PIFu extends naturally to arbitrary number of views. We demonstrate high-resolution and robust reconstructions on real world images from the DeepFashion dataset, which contains a variety of challenging clothing types. Our method achieves state-of-the-art performance on a public benchmark and outperforms the prior work for clothed human digitization from a single image.

907 citations

Proceedings ArticleDOI
01 Oct 2019
TL;DR: SPIN as discussed by the authors uses a deep network to initialize an iterative optimization routine that fits the body model to 2D joints within the training loop, and the fitted estimate is subsequently used to supervise the network.
Abstract: Model-based human pose estimation is currently approached through two different paradigms. Optimization-based methods fit a parametric body model to 2D observations in an iterative manner, leading to accurate image-model alignments, but are often slow and sensitive to the initialization. In contrast, regression-based methods, that use a deep network to directly estimate the model parameters from pixels, tend to provide reasonable, but not pixel accurate, results while requiring huge amounts of supervision. In this work, instead of investigating which approach is better, our key insight is that the two paradigms can form a strong collaboration. A reasonable, directly regressed estimate from the network can initialize the iterative optimization making the fitting faster and more accurate. Similarly, a pixel accurate fit from iterative optimization can act as strong supervision for the network. This is the core of our proposed approach SPIN (SMPL oPtimization IN the loop). The deep network initializes an iterative optimization routine that fits the body model to 2D joints within the training loop, and the fitted estimate is subsequently used to supervise the network. Our approach is self-improving by nature, since better network estimates can lead the optimization to better solutions, while more accurate optimization fits provide better supervision for the network. We demonstrate the effectiveness of our approach in different settings, where 3D ground truth is scarce, or not available, and we consistently outperform the state-of-the-art model-based pose estimation approaches by significant margins. The project website with videos, results, and code can be found at https://seas.upenn.edu/~nkolot/projects/spin.

725 citations

Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work defines a novel temporal network architecture with a self-attention mechanism and shows that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels.
Abstract: Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose "Video Inference for Body Pose and Shape Estimation'' (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a novel temporal network architecture with a self-attention mechanism and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance. Code and pretrained models are available at https://github.com/mkocabas/VIBE

687 citations


Cites background or methods from "Unite the People: Closing the Loop ..."

  • ...[39] uses silhouettes along with keypoints for the fitting algorithm....
  • ...Tremendous progress has been made on estimating 3D human pose and shape from a single image [11, 21, 25, 29, 36, 37, 39, 48, 51]....
  • ...Due to the lack of in-the-wild 3D ground-truth labels, these methods use weak supervision signals obtained from a 2D keypoint re-projection loss [29, 60, 62], use body/part segmentation as an intermediate representation [48, 51], or employ a human in the loop [39]....

References
Journal ArticleDOI
TL;DR: The state-of-the-art in evaluated methods for both classification and detection are reviewed, whether the methods are statistically different, what they are learning from the images, and what the methods find easy or confuse.
Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.

15,935 citations


"Unite the People: Closing the Loop ..." refers methods in this paper

  • ...Finegrained part segmentation has been added to the public parts of the VOC dataset [12] by Chen et al....

    [...]

Posted Content
TL;DR: This work proposes a small DNN architecture called SqueezeNet, which achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters and is able to compress to less than 0.5MB (510x smaller than AlexNet).
Abstract: Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet). The SqueezeNet architecture is available for download here: this https URL

5,904 citations


"Unite the People: Closing the Loop ..." refers methods in this paper

  • ...rom an image in 0.378s. The pose-predicting CNN is the computational bottleneck. Because our findings are not specific to a CNN model, we believe that by using a speed-optimized CNN, such as SqueezeNet [18], and further optimizations of the direct predictor, the proposed method could reach realtime speed. 5. Closing the Loop With the improved results for 3D fitting, which helped to create the dataset of ...

    [...]

Journal ArticleDOI
01 Aug 2004
TL;DR: A more powerful, iterative version of the optimisation of the graph-cut approach is developed and the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result.
Abstract: The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for "border matting" has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.

5,670 citations

Book ChapterDOI
08 Oct 2016
TL;DR: This work introduces a novel convolutional network architecture for the task of human pose estimation that is described as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.
Abstract: This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.

3,865 citations


"Unite the People: Closing the Loop ..." refers background or methods in this paper

  • ...While 2D keypoint prediction has seen considerable progress in the last years and could be considered close to being solved [19, 32, 42], 3D pose estimation from single images remains a challenge [4, 36, 44]....

    [...]

  • ...We use a state-of-the-art DeeperCut CNN [19] for our pose-related experiments, but believe that using other models such as Convolutional Pose Machines [42] or Stacked Hourglass Networks [32] would lead to similar findings....

    [...]

  • ...Their representational power has led to increasingly robust algorithms for bounding box detection [10], keypoint detection [19, 32, 42] and body part segmentation [7, 15, 43]....

    [...]

  • ...We use a state-of-the-art DeeperCut CNN [19] for our pose-related experiments, but believe that using other models such as Convolutional Pose Machines [43] or Stacked Hourglass Networks [33] would lead to similar findings....

    [...]

Journal ArticleDOI
TL;DR: An extensive evaluation of the state of the art in a unified framework of monocular pedestrian detection using sixteen pretrained state-of-the-art detectors across six data sets and proposes a refined per-frame evaluation methodology.
Abstract: Pedestrian detection is a key problem in computer vision, with several applications that have the potential to positively impact quality of life. In recent years, the number of approaches to detecting pedestrians in monocular images has grown steadily. However, multiple data sets and widely varying evaluation protocols are used, making direct comparisons difficult. To address these shortcomings, we perform an extensive evaluation of the state of the art in a unified framework. We make three primary contributions: 1) We put together a large, well-annotated, and realistic monocular pedestrian detection data set and study the statistics of the size, position, and occlusion patterns of pedestrians in urban scenes, 2) we propose a refined per-frame evaluation methodology that allows us to carry out probing and informative comparisons, including measuring performance in relation to scale and occlusion, and 3) we evaluate the performance of sixteen pretrained state-of-the-art detectors across six data sets. Our study allows us to assess the state of the art and provides a framework for gauging future efforts. Our experiments show that despite significant progress, performance still has much room for improvement. In particular, detection is disappointing at low resolutions and for partially occluded pedestrians.

3,170 citations


"Unite the People: Closing the Loop ..." refers background in this paper

  • ...Their representational power has led to increasingly robust algorithms for bounding box detection [10], keypoint detection [19, 32, 42] and body part segmentation [7, 15, 43]....

    [...]