Unite the People: Closing the Loop Between 3D and 2D Human Representations

Christoph Lassner¹,²  classner@tue.mpg.de
Javier Romero³,*  javier.romero@bodylabs.com
Martin Kiefel²  mkiefel@tue.mpg.de
Federica Bogo⁴,*  febogo@microsoft.com
Michael J. Black²  black@tue.mpg.de
Peter V. Gehler⁵,*  pgehler@tue.mpg.de

¹ Bernstein Center for Computational Neuroscience, Tübingen, Germany
² MPI for Intelligent Systems, Tübingen, Germany
³ Body Labs Inc., New York, United States
⁴ Microsoft, Cambridge, UK
⁵ University of Würzburg, Germany
Abstract

3D models provide a common ground for different representations of human bodies. In turn, robust 2D estimation has proven to be a powerful tool to obtain 3D fits "in-the-wild". However, depending on the level of detail, it can be hard to impossible to acquire labeled data for training 2D estimators on a large scale. We propose a hybrid approach to this problem: with an extended version of the recently introduced SMPLify method, we obtain high quality 3D body model fits for multiple human pose datasets. Human annotators solely sort good and bad fits. This procedure leads to an initial dataset, UP-3D, with rich annotations. With a comprehensive set of experiments, we show how this data can be used to train discriminative models that produce results with an unprecedented level of detail: our models predict 31 segments and 91 landmark locations on the body. Using the 91 landmark pose estimator, we present state-of-the-art results for 3D human pose and shape estimation using an order of magnitude less training data and without assumptions about gender or pose in the fitting procedure. We show that UP-3D can be enhanced with these improved fits to grow in quantity and quality, which makes the system deployable on a large scale. The data, code and models are available for research purposes.
1. Introduction

Teaching computers to recognize and understand humans in images and videos is a fundamental task of computer vision. Different applications require different trade-offs between fidelity of the representation and inference complexity. This has led to a wide range of parameterizations for human bodies and corresponding prediction methods, ranging from bounding boxes to detailed 3D models.

* This work was performed while J. Romero and F. Bogo were with the MPI-IS²; P. V. Gehler with the BCCN¹ and MPI-IS².
Figure 1: Lower row: validated 3D body model fits on various datasets form our initial dataset, UP-3D, and provide labels for multiple tasks. Top row: we perform experiments on semantic body part segmentation, pose estimation and 3D fitting. Improved 3D fits can extend the initial dataset.
Learning-based algorithms, especially convolutional neural networks (CNNs), are the leading methods to cope with the complexity of human appearance. Their representational power has led to increasingly robust algorithms for bounding box detection [10], keypoint detection [19, 32, 42] and body part segmentation [7, 15, 43]. However, they are usually applied in isolation on separate datasets and independently from the goal of precise 3D body estimation. In this paper we aim to overcome this separation and "unite the people" of different datasets and for multiple tasks. With this strategy, we attack the main problem of learning-based approaches for complex body representations: the lack of data. While it is feasible to annotate a small number of keypoints in images (e.g., 14 in the case of the MPII-HumanPose dataset [1]), scaling to larger numbers quickly becomes impractical and prone to annotation inconsistency. The same is true for semantic segmentation annotations: most datasets provide labels for only a few body parts.

In this paper, we aim to develop a self-improving, scalable method that obtains high-quality 3D body model fits for 2D images (see Fig. 1 for an illustration). To form an initial dataset of 3D body fits, we use an improved version of the recently developed SMPLify method [4] that elevates 2D keypoints to a full body model of pose and shape. A more robust initialization and an additional fitting objective allow us to apply it on the ground truth keypoints of the standard human pose datasets; human annotators solely sort good and bad fits.

This semi-automatic scheme has several advantages. The required annotation time is greatly reduced (Sec. 3.3). By projecting surfaces (Sec. 4.1) or keypoints (Sec. 4.2) from the fits to the original images, we obtain consistent labels while retaining generalization performance. The rich representation and the flexible fitting process make it easy to integrate datasets with different label sets, e.g., a different set of keypoint locations.

Predictions from our 91 keypoint model improve the 3D model fitting method that generated the annotations for training the keypoint model in the first place. We report state-of-the-art results on the HumanEva and Human3.6M datasets (Sec. 4.3). Further, using the 3D body fits, we develop a random forest method for 3D pose estimation that runs orders of magnitude faster than SMPLify (Sec. 4.4).

The improved predictions from the 91 landmark model increase the ratio of high quality 3D fits on the LSP dataset by 9.3% when compared to the fits using 14 keypoint ground truth locations (Sec. 5). This ability for self-improvement, together with the possibility to easily integrate new data into the pool, makes the presented system deployable on a large scale. Data, code and models are available for research purposes on the project homepage at http://up.is.tuebingen.mpg.de/.
2. Related Work

Acquiring human pose annotations in 3D is a long-standing problem with several attempts from the computer vision as well as the 3D human pose community.

The classical 2D representation of humans is 2D keypoints [1, 6, 23, 38, 39]. While 2D keypoint prediction has seen considerable progress in recent years and could be considered close to being solved [19, 32, 42], 3D pose estimation from single images remains a challenge [4, 36, 44].

Bourdev and Malik [5] enhanced the H3D dataset from 20 keypoint annotations for 1,240 people in 2D with relative 3D information as well as 11 annotated body part segments. In contrast, the HumanEva [41] and Human3.6M [21] datasets provide very accurate 3D labels: they are both recorded in motion capture environments. Both datasets have high fidelity but contain only a very limited level of diversity in background and person appearance. We evaluate 3D human pose estimation performance on both. Recent approaches target 3D pose ground truth from natural scenes, but either rely on vision systems prone to failure [11] or inertial suits that modify the appearance of the body and are prone to motion drift [44].

Body representations beyond 3D skeletons have a long history in the computer vision community [17, 30, 31, 35]. More recently, these representations have gained new popularity in approaches that fit detailed surfaces of a body model to images [4, 14, 16, 25, 44]. These representations are more tightly connected to the physical reality of the human body and the image formation process.

One of the classic problems related to representations of the extent of the body is body part segmentation. Fine-grained part segmentation has been added to the public parts of the VOC dataset [12] by Chen et al. [8]. Annotations for 24 human body parts, and also part segments for all VOC object classes where applicable, are available. Even though hard to compare, we provide results on the dataset. The Freiburg Sitting People dataset [33] consists of 200 images with 14-part segmentations and is tailored towards sitting poses. The ideas by Shotton et al. [40] for 2.5D data inspired our body part representation. Relatively simple methods have proven to achieve good performance in segmentation tasks with "easy" backgrounds like Human80k, a subset of Human3.6M [20].

Following previous work on cardboard people [24] and contour people [13], an attempt to work towards an intermediate-level person representation is the JHMDB dataset and the related labeling tool [22]. It relies on 'puppets' to ease the annotation task, while providing a higher level of detail than solely joint locations.

The attempt to unify representations for human bodies has been made mainly in the context of human kinematics [2, 29], where a rich representation for 3D motion capture marker sets is used to transfer captures to different targets. The setup of markers to capture not only human motion but also shape has been explored by Loper et al. [28] for motion capture scenarios. While they optimized the placement of markers for a 12 camera setup, we must ensure that the markers disambiguate pose and shape from a single view. Hence, we use a denser set of markers.
3. Building the Initial Dataset

Our motivation to use a common 3D representation is to (1) map many possible representations from a variety of datasets to it, and (2) generate detailed and consistent labels for supervised model training from it.

We argue that the use of a full human body model with a prior on shape and pose is necessary: without the visualization possibilities and regularization, it may be impossible to create sufficiently accurate annotations for small body parts. However, so far, no dataset is available that provides human body model fits on a large variety of images.

To fill this gap, we build on a set of human pose datasets with annotated keypoints. SMPLify [4] presented promising results for automatically translating these into 3D body model fits. This helps us keep the human involvement to a minimum. With strongly increasing working times and levels of label noise for increasingly complex tasks, this may be a critical decision for creating a large dataset of 3D body models.
3.1. Improving Body Shape Estimation

In [4], the authors fit the pose and shape parameters of the SMPL [26] body model to 2D keypoints by minimizing an objective function composed of a data term and several penalty terms that represent priors over pose and shape. However, the connection length between two keypoints is the only indicator that can be used to estimate body shape. Our aim is to match the shape of the body model as accurately as possible to the images, hence we must incorporate a shape objective in the fitting.

The best evidence for the extent of a 3D body projected on a 2D image is encoded by its silhouette. We define the silhouette to be the set of all pixels belonging to a body's projection. Hence, we add a term to the original SMPLify objective to prefer solutions for which the image silhouette, $S$, and the model silhouette, $\hat{S}$, match.

Let $M(\vec{\theta}, \vec{\beta}, \vec{\gamma})$ be a 3D mesh generated by a SMPL body model with pose $\vec{\theta}$, shape $\vec{\beta}$, and global translation $\vec{\gamma}$. Let $\Pi(\cdot, K)$ be a function that takes a 3D mesh and projects it into the image plane given camera parameters $K$, such that $\hat{S}(\vec{\theta}, \vec{\beta}, \vec{\gamma}) = \Pi(M(\vec{\theta}, \vec{\beta}, \vec{\gamma}), K)$ represents the silhouette pixels of the model in the image. We compute the bi-directional distance between $S$ and $\hat{S}(\cdot)$:

$$E_S(\vec{\theta}, \vec{\beta}, \vec{\gamma}; S, K) = \sum_{\vec{x} \in \hat{S}(\vec{\theta}, \vec{\beta}, \vec{\gamma})} \operatorname{dist}(\vec{x}, S)^2 + \sum_{\vec{x} \in S} \operatorname{dist}(\vec{x}, \hat{S}(\vec{\theta}, \vec{\beta}, \vec{\gamma})), \qquad (1)$$

where $\operatorname{dist}(\vec{x}, S)$ denotes the absolute distance from a point $\vec{x}$ to the closest point belonging to the silhouette $S$.

The first term in Eq. (1) computes the distance from points of the projected model to a given silhouette, while the second term computes the distance from points in the silhouette to the model. We find that the second term is noisier and use the plain L1 distance to measure its contribution to the energy function, while we use the squared L2 distance to measure the contribution of the first. We optimize the overall objective, including this additional term, using OpenDR [27], just as in [4].
Whereas it would be possible to use an automatic segmentation method to provide foreground silhouettes, we decided to involve human annotators for reliability. We also asked for a six body part segmentation that we will use in Sec. 4 for evaluation. We built an interactive annotation tool on top of the Opensurfaces package [3] to work with Amazon Mechanical Turk (AMT). To obtain image-consistent silhouette borders, we use the interactive GrabCut algorithm [37]. Workers spent more than 1,200 hours on creating the labels for the LSP [23] datasets as well as the single-person part of the MPII-HumanPose [1] dataset (see Tab. 1). Average annotation time increases by more than a factor of two from foreground labels to six body part labels. This provides a hint of how long annotation for a 31 body part representation could take. Examples of six part segmentation labels are provided in Fig. 2.

Dataset             Foreground               6 Body Parts             AMT hours logged
LSP [23]            1000 train, 1000 test    1000 train, 1000 test    361h foreground,
LSP-extended [23]   10000 train              0                        131h parts
MPII-HPDB [1]       13030 train, 2622 test   0                        729h

Table 1: Logged AMT labelling times. The average foreground labeling task was solved in 108s on the LSP and 168s on the MPII datasets, respectively. Annotating the segmentation for six body parts took on average more than twice as long as annotating the foreground segmentation: 236s.

Figure 2: Examples of six part segmentation ground truth. White areas mark inconsistencies with the foreground segmentation and are ignored.
3.2. Handling Noisy Ground Truth Keypoints

The SMPLify method is especially vulnerable to missing annotations of the four torso joints: it uses their locations for an initial depth guess, and convergence deteriorates if this guess is of poor quality.

Finding a good depth initialization is particularly hard due to the foreshortening effect of the perspective projection. However, since we know that only a shortening but no lengthening effect can occur, we can find a more reliable person size estimate $\hat{\theta}$ for a skeleton model with $k$ connections:

$$\hat{\theta} = x_i \cdot \arg\max_y f_i(y), \qquad i = \arg\max_{j=1,\ldots,k} x_j, \qquad (2)$$

where $f_i$ is the distribution over ratios of person size to the length of connection $x_i$. Since this is a skewed distribution, we use a corrected mean to find the solution of the arg max function and obtain a person size estimate. This turns out to be a simple, yet robust estimator.
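In code, the estimator amounts to picking the longest observed skeleton connection and scaling it by the typical person-size-to-connection-length ratio for that connection. A minimal sketch under the assumption that per-connection ratio samples have been collected beforehand; the paper does not specify its "corrected mean", so the trimmed mean below is a stand-in:

```python
import numpy as np

def estimate_person_size(connection_lengths, ratio_samples, trim=0.1):
    """Person size estimate in the spirit of Eq. (2).

    connection_lengths: length-k array of observed 2D connection lengths x_j.
    ratio_samples: list of k arrays with samples of the person-size /
                   connection-length ratio for each connection (the f_i).
    trim: fraction trimmed from each tail; a stand-in for the paper's
          unspecified 'corrected mean'.
    """
    # Perspective projection can only shorten connections, so the longest
    # observed connection carries the most reliable size information.
    i = int(np.argmax(connection_lengths))
    samples = np.sort(np.asarray(ratio_samples[i]))
    cut = int(len(samples) * trim)
    trimmed = samples[cut:len(samples) - cut] if cut > 0 else samples
    # Scale the most reliable connection by the corrected typical ratio.
    return float(connection_lengths[i] * trimmed.mean())
```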

Dataset:    LSP [23]   LSP-extended [23]   MPII-HP [1]   FashionPose [9]
Accepted:   45%        12%                 25%           23%

Table 2: Percentages of accepted fits per dataset. The addition of the FashionPose dataset is discussed in Sec. 4.2.
3.3. Exploring the Data

With the foreground segmentation data and the adjustments described in the preceding sections, we fit the SMPL model to a total of 27,652 images of the LSP, LSP-extended, and MPII-HumanPose datasets. We use only people marked with the 'single person' flag in MPII-HumanPose to avoid instance segmentation problems. We honor the train/test splits of the datasets and keep images from their test sets in our new, joined test set.

In the next step, human annotators¹ selected the fits where rotation and location of body parts largely match the image evidence. For this task, we provide the original image as well as four perspectives of renderings of the body. Optionally, annotators can overlay rendering and image. These visualizations help to identify fitting errors quickly and reduce the labeling time to 12s per image. The process uncovered many erroneously labeled keypoints, where mistakes in the 3D fit were clear to spot but not obvious in the 2D representation. We excluded head and foot rotation as criteria for the sorting process: there is usually not sufficient information in the original 14 keypoints to estimate them correctly. The resulting ratios of accepted fits can be found in Tab. 2.

Even with the proposed, more robust initialization term, the ratio of accepted fits on the LSP-extended dataset remains the lowest. It has the highest number of missing keypoints of the four datasets, and at the same time the most extreme viewpoints and poses. On the other hand, the rather high ratio of usable fits on the LSP dataset can be explained by its clean and complete annotations.

The validated fits form our initial dataset with 5,569 training images (of which we use a held-out validation set of 1,112 images in our experiments) and 1,208 test images. We denote this dataset as UPI-3D (UnitedPeople in 3D, with an added 'I' for "Initial"). To be able to clearly reference the different label types in the following sections, we add an 'h' to the dataset name when referring to labels from human annotators.
Consistency of Human Labels. The set of curated 3D fits allows us to assess the distribution of the human-provided labels by projecting them to the UPI-3D bodies. We did this for both keypoints and body part segments. Visualizations can be found in Fig. 3.

¹ For this task, we did not rely on AMT workers, but only on a few experts in close collaboration, to maintain consistency.

While keypoint locations in Fig. 3a that fall in completely non-matching areas of the body can be explained by self-occlusion, there is a high variance in keypoint locations around joints. It must be taken into account that the keypoints are projected to the body surface, and depending on person shape and body part orientation some variation can be expected. Nevertheless, even for this reduced set of images with very good 3D fits, high variance areas, e.g., around the hip joints, indicate labeling noise.

The visualization in Fig. 3b shows the density of part types for the six part segmentation with the segments head, torso, left and right arms, and left and right legs. While the head and the lower parts of the extremities show distinct colors, the areas converging to brown represent a mixture of part annotations. The brown tone on the torso is a clear indicator of the frequent occlusion by the arms. The area around the hips shows a smooth transition from torso to leg color, hinting again at varying annotation styles.
4. Label Generation and Learning

In a comprehensive series of experiments, we analyze the quality of labels generated from UPI-3D. We focus on labels for well-established tasks, but highlight that the generation possibilities are not limited to them: all types of data that can be extracted from the body model can be used as labels for supervised training. In our experiments, we move from surface (segmentation) prediction, over 2D pose estimation, to 3D pose and shape estimation, and finally to a method for predicting 3D body pose and shape directly from 2D landmark positions.
4.1. Semantic Body Part Segmentation

We segment the SMPL mesh into 31 regions, following the segmentation into semantic parts introduced in [40] (for a visualization, see Fig. 3d). We note that the Kinect tracker works on 2.5D data while our detectors only receive 2D data as input. We deliberately did not make any of our methods for data collection or prediction dependent on 2.5D data, to retain generality. This way, we can use them on outdoor images and regular 2D photo datasets. The segmentation dataset UPI-S31 is obtained by projecting the segmented 3D mesh posed on the 6,777 images of UPI-3D.

Following [7], we optimize a multiscale ResNet101 on a pixel-wise cross entropy loss. We train the network on size-normalized, cutout images, which could in a production system be provided by a person detector. Following best practices for CNN training, we use a validation set to determine the optimal number of training iterations and the person size, which is around 500 pixels. This high resolution allows the CNN to reliably predict small body parts. In this challenging setup, we achieve an intersection over union (IoU) score of 0.4432 and an accuracy of 0.9331. Qualitative results on five datasets are shown in Fig. 4a.
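To make the training setup concrete, the following is a minimal PyTorch sketch of pixel-wise cross-entropy training for 32 classes (31 body parts plus background). The backbone, optimizer settings, and the ignore label are illustrative assumptions; the authors use a multiscale ResNet101 variant following [7], which is not reproduced here.

```python
import torch
import torch.nn as nn
import torchvision

# 31 body part classes plus background; backbone chosen for illustration.
num_classes = 32
model = torchvision.models.segmentation.fcn_resnet101(num_classes=num_classes)
criterion = nn.CrossEntropyLoss(ignore_index=255)  # 255: unlabeled pixels (assumed convention)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """images: (B, 3, H, W) float tensor of size-normalized crops;
    labels: (B, H, W) long tensor of per-pixel part indices."""
    optimizer.zero_grad()
    logits = model(images)["out"]     # (B, num_classes, H, W)
    loss = criterion(logits, labels)  # cross entropy evaluated at every pixel
    loss.backward()
    optimizer.step()
    return loss.item()
```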

Figure 3: Density of human annotations on high quality body model fits for (a) keypoints and (b) six part segmentation in front and back views. Areas of the bodies are colored with (1) hue according to part label, and (2) saturation according to frequency of the label. Keypoints on completely 'wrong' body parts are due to self-occlusion. The high concentration of 'head' labels in the nose region originates from the FashionPose dataset, where the 'head' keypoint is placed on the nose. The segmentation data originates solely from the six part segmentation labels on the LSP dataset. (Must be viewed in color.) (c) Placement of the 91 landmarks (left: front, right: back). (d) Segmentation for generating the 31 part labels.
The overall performance is compelling: even the small segments around the joints are recovered reliably. Left and right sides of the subjects are identified correctly, and the four parts of the head provide an estimate of head orientation. The average IoU score is dominated by the small segments, such as the wrists.

The VOC part dataset is a hard match for our predictor: instead of providing instances of people, it consists of entire scenes, and many people are visible at small scale. To provide a comparison, we use the instance annotations from the VOC-Part dataset, cut out samples and reduce the granularity of our segmentation to match the widely used six part representation. Because of the low resolution of many displayed people and extreme perspectives with, e.g., only a face visible, the predictor often only predicts the background class on images not matching our training scheme. Still, we achieve an IoU score of 0.3185 and 0.7208 accuracy over the entire dataset without finetuning.

Additional examples from the LSP, MPII-HumanPose, FashionPose, Fashionista, VOC, HumanEva and Human3.6M datasets are shown in the supplementary material available on the project homepage². The model has not been trained on any of the latter four, but the results indicate good generalization behavior. We include a video to visualize stability across consecutive frames.

² http://up.is.tuebingen.mpg.de/
4.2. Human Pose Estimation

With the 3D body fits, we can generate consistent keypoints not only on the human skeleton but also on the body surface. For the experiments in the rest of this paper, we designed a 91-landmark³ set to analyze a dense keypoint set.

³ We use the term 'landmark' to refer to keypoints on the mesh surface, to emphasize the difference to the term 'joints', used so far for keypoints located inside of the body.

We distributed the landmarks according to two criteria: disambiguation of body part configuration and estimation of body shape. The former requires placement of markers around joints to get a good estimate of their configuration. To satisfy the latter, we place landmarks at regular intervals around the body to get an estimate of spatial extent independent of the viewpoint. We visualize our selection in Fig. 3c and example predictions in Fig. 4b.

In the visualization of predictions, we show a subset of the 91 landmarks and only partially connect the displayed ones for better interpretability. The core 14 keypoints describing the human skeleton are part of our selection to describe the fundamental pose and maintain comparability with existing methods.
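Generating such landmark labels boils down to projecting fixed vertices of the fitted mesh into the image. A minimal pinhole-camera sketch; `landmark_vertex_ids` (the 91 chosen vertex indices) and the camera parameters are assumed inputs, not values from the paper:

```python
import numpy as np

def project_landmarks(vertices, landmark_vertex_ids, K, R, t):
    """Project selected 3D mesh vertices to 2D image landmarks.

    vertices: (N, 3) array of SMPL mesh vertices in world coordinates.
    landmark_vertex_ids: indices of the 91 vertices chosen as landmarks.
    K: (3, 3) camera intrinsics; R: (3, 3) rotation; t: (3,) translation.
    """
    points = vertices[landmark_vertex_ids]   # (91, 3) selected landmarks
    cam = points @ R.T + t                   # world -> camera frame
    uvw = cam @ K.T                          # pinhole projection (homogeneous)
    return uvw[:, :2] / uvw[:, 2:3]          # (91, 2) pixel coordinates
```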
We use a state-of-the-art DeeperCut CNN [19] for our pose-related experiments, but believe that using other models such as Convolutional Pose Machines [42] or Stacked Hourglass Networks [32] would lead to similar findings.

To assess the influence of the quality of our data and the difference of the loss function for 91 and 14 keypoints, we train multiple CNNs: (1) using all human labels, but on our (smaller) dataset, for 14 keypoints (UPI-P14h), and (2) on the dense 91 landmarks from projections of the SMPL mesh (UPI-P91). Again, models are trained on size-normalized crops with cross-validated parameters. We include the performance of the original DeeperCut CNN, which has been trained on the full LSP, LSP-extended and MPII-HumanPose datasets (in total more than 52,000 people), in the comparison with the models trained on our data (in total 5,569 people). The results are summarized in Tab. 3. Even though the size of the dataset is reduced by nearly an order of magnitude, we maintain high performance compared to the original DeeperCut CNN. Comparing the two models trained on the same amount of data, we find that the model trained on the 91 landmarks from

References

- The Pascal Visual Object Classes (VOC) Challenge.
- SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size.
- "GrabCut": interactive foreground extraction using iterated graph cuts.
- Stacked Hourglass Networks for Human Pose Estimation.
- Pedestrian Detection: An Evaluation of the State of the Art.