Real-time Body Tracking with One Depth Camera and Inertial Sensors
Thomas Helten    Meinard Müller    Hans-Peter Seidel    Christian Theobalt
Saarland University and MPI Informatik
International Audio Laboratories Erlangen
{thelten,theobalt}@mpi-inf.mpg.de    meinard.mueller@audiolabs-erlangen.de
Abstract
In recent years, the availability of inexpensive depth cameras, such as the Microsoft Kinect, has boosted the research in monocular full body skeletal pose tracking. Unfortunately, existing trackers often fail to capture poses where a single camera provides insufficient data, such as non-frontal poses, and all other poses with body part occlusions. In this paper, we present a novel sensor fusion approach for real-time full body tracking that succeeds in such difficult situations. It takes inspiration from previous tracking solutions, and combines a generative tracker and a discriminative tracker retrieving closest poses in a database. In contrast to previous work, both trackers employ data from a low number of inexpensive body-worn inertial sensors. These sensors provide reliable and complementary information when the monocular depth information alone is not sufficient. We also contribute new algorithmic solutions to best fuse depth and inertial data in both trackers. One is a new visibility model to determine the global body pose, occlusions, and usable depth correspondences, and to decide which data modality to use for discriminative tracking. We further contribute a new inertial-based pose retrieval and an adapted late fusion step to calculate the final body pose.
1. Introduction
In recent years, the advent of new and inexpensive cameras that measure 2.5D depth images has triggered extensive research in monocular human pose tracking. Most of the trackers introduced so far can be classified into three families: discriminative approaches, generative approaches, and approaches combining both strategies. While discriminative trackers detect cues in the depth image and derive a pose hypothesis from them using a retrieval strategy, generative trackers optimize for the parameters of a human model to best explain the observed depth image. Combining discriminative and generative approaches, hybrid trackers have shown good results for fast motions in real-time scenarios where tracked actors face the camera more or less frontally. However, noise in the depth data and the ambiguous representation of human poses in depth images are still a challenge and often lead to tracking errors, even if all body parts are actually exposed to the camera. In addition, if large parts of the body are occluded from view, tracking of the full pose is not possible. Using multiple depth cameras can partially remedy the problem [19], but does not eradicate occlusion problems, and is not always practical in home user scenarios. Depth data alone may thus not be sufficient to capture poses accurately in such challenging scenarios. In this paper, we show that fusing a depth tracker with an additional sensor modality, which provides information complementary to the 2.5D depth video, can overcome these limitations. In particular, we use the orientation data obtained from a sparse set of inexpensive inertial measurement devices fixed to the arms, legs, the trunk, and the head of the tracked person. We include this additional information as stabilizing evidence in a hybrid tracker that combines generative and discriminative pose computation. Our approach enables us to track fast and dynamic motions, including non-frontal poses and poses with significant self-occlusions, accurately and in real-time.

(This work was funded by the ERC Starting Grant "CapReal".)
Contributions. Our method is the first to adaptively fuse inertial and depth information in a combined generative and discriminative monocular pose estimation framework. To enable this, we contribute a novel visibility model for determining which parts of the body are visible to the depth camera. This model tells which data modality is reliable and can be used to infer the pose, and enables us to more robustly infer the global body orientation even in challenging poses, see Sect. 4. Our second contribution is a generative tracker that fuses optical and inertial cues depending on body part visibility, and finds pose parameters via optimization, see Sect. 5. As a third contribution, we introduce two separate retrieval schemes for handling optical and inertial cues when retrieving database poses during discriminative tracking, see Sect. 6. The final pose is found in a late fusion step which uses the results of both trackers mentioned above, see Sect. 7. We evaluate our proposed tracker on an extensive dataset including calibrated depth images, inertial sensor data, as well as ground-truth data obtained with a traditional marker-based mocap system, see Sect. 8. This dataset is publicly available at http://resources.mpi-inf.mpg.de/InertialDepthTracker. We also show qualitatively and quantitatively that our method accurately captures poses even under strong occlusion where other trackers fail.

Figure 1. Three typical failure cases of a current real-time tracker combining generative and discriminative pose estimation [1] (left: input depth image; middle: recovered pose of the body model with catastrophic pose errors; right: significantly better result using our approach): (a) occluded body parts, (b) non-frontal poses, and (c) both at the same time.
2. Related Work
Marker-less pose estimation from multi-view video has been a long-standing problem in computer vision, and nowadays mature solutions exist, see [11] for an overview. Recently, so-called depth cameras that measure 2.5D geometry information in real-time have emerged [6, 21]. Many monocular tracking algorithms use this depth data for human pose estimation. They can be classified into discriminative approaches, generative approaches and hybrid approaches, reviewed in the following. A discriminative strategy based on body part detectors that also estimated body part orientations on depth images was presented in [9]. Body part detectors and a mapping to a kinematic skeleton are used in [22] to track full-body poses at interactive frame rates. The approach [13] uses regression forests based on depth features to estimate the joint positions of the tracked person without the need for a kinematic model of its skeleton. Later, [4] further increased the accuracy by also detecting some occluded joints in non-frontal poses. Finally, also using depth features and regression forests, [16] generate correspondences between body parts and a pose- and size-parametrized human model that is optimized in real-time using a one-shot optimization approach. While showing good results on a single-frame basis, these approaches cannot deduce the true poses of body parts that are invisible to the camera.

By using kinematic body models with simple shape primitives, the pose of an actor can be found using a generative strategy. The body model is fitted to depth data or to a combination of depth and image features [5, 8]. [2] propose a generative depth-based tracker using a modified energy function that incorporates empty-space information as well as inter-penetration constraints. An approach that uses multiple depth cameras for pose estimation, which reduces the occlusion problem, is presented in [19]. The approach is not real-time capable, though. With all these depth-based methods, real-time pose estimation is still a challenge, tracking may drift, and with the exception of [19] the employed shape models are rather coarse, which impairs pose estimation accuracy.

Salzmann et al. [12] combine generative and discriminative approaches with the goal to reconstruct general 3D deformable surfaces. Soon after that, [3] showed a hybrid approach specialized to reconstruct human 3D pose from depth images, using the body part detectors proposed by [9] as a regularizing component. Further accuracy improvements were achieved by [1, 20] using regularizing poses from a pre-recorded database as input to the generative tracker. Here, [1] was the first approach running at real-time frame rates of more than 50 fps, whereas Ye et al.'s method [20] is an offline approach. Other real-time algorithms were proposed, e.g., by [17], who use a body-part detector similar to [13] to augment a generative tracker. However, none of these hybrid approaches is able to give a meaningful pose hypothesis for non-visible body parts in case of occlusions.

Methods that reconstruct motions based on inertial sensors only have been proposed, e.g., in [7, 15]. Here, either densely placed sensors or large databases containing motions are used. Also, reconstructing the global position is not possible.

Only a few vision algorithms so far use fusion with complementary sensor systems for full-body tracking. One approach combining 3D inertial information and multi-view markerless motion capture was presented in [10]. Here, the orientation data of five inertial sensors was used as an additional energy term to stabilize the local pose optimization. Another example is [23], where information from densely placed inertial sensors is fused with global position estimation using a laser-range-scanner-equipped robot accompanying the tracked person.
3. Hybrid Inertial Tracker - An Overview
Recent hybrid (generative + discriminative) monocular tracking algorithms, e.g. [1, 17], can track human skeletons in real-time from a single depth camera, as long as the body is mostly front-facing. However, even in frontal poses, tracking may fail due to complex self-occlusions, limbs close to the body, and other ambiguities. It certainly fails if large sections of the body are completely invisible to the camera, such as in lateral postures, see Fig. 1c. Our new hybrid depth-based tracker succeeds in such cases by incorporating additional inertial sensor data for tracking stabilization. While our concepts are in general applicable to a wide range of generative, discriminative and hybrid approaches, we modify the hybrid depth-based tracker by Baak et al. [1] to demonstrate our concepts. This tracker uses discriminative features detected in the depth data, so-called geodesic extrema E_I, to query a database containing pre-recorded full body poses. These poses are then used to initialize a generative tracker that optimizes skeletal pose parameters X of a mesh-based human body model M_X ⊂ R^3 to best explain the 3D point cloud M_I ⊂ R^3 of the observed depth image I. In a late fusion step, the tracker decides between two pose hypotheses: one obtained using the database pose as initialization, and one obtained using the previously tracked pose as initialization. Baak et al.'s approach makes two assumptions: the person to be tracked is facing the depth camera, and all body parts are visible to the depth camera. It therefore fails in the difficult poses mentioned earlier (see Fig. 1 for some examples).

In our new hybrid approach, we overcome these limitations by modifying every step of the original algorithm to benefit from depth and inertial data together. In particular, we introduce a visibility model to decide which data modality is best used in each pose estimation step, and develop a discriminative tracker combining both kinds of data. We also empower generative tracking to use both kinds of data for reliable pose inference, and develop a new late fusion step using both modalities.
Body Model. Similar to [1], we use a body model comprising a surface mesh M_X of 6,449 vertices, whose deformation is controlled by an embedded kinematic skeleton of 62 joints and 42 degrees of freedom via surface skinning. Currently, the model is manually adapted to the actor, but automatic shape adaptation is feasible, see e.g. [18]. Furthermore, let B_all := {larm, rarm, lleg, rleg, body} be a set of body parts representing the left and right arm, the left and right leg, and the rest of the body. We then define five disjoint subsets M_X^b, b ∈ B_all, containing all vertices of M_X belonging to body part b.
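As a minimal illustration (not the authors' data structures), these disjoint vertex subsets can be represented as index arrays keyed by body part; the per-vertex part labels used below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical per-vertex part labels for the 6,449-vertex mesh M_X; in the
# paper these assignments follow from the skinning of the kinematic skeleton.
B_ALL = ("larm", "rarm", "lleg", "rleg", "body")

def partition_vertices(vertex_part_labels):
    """Split mesh vertex indices into the five disjoint sets M_X^b, b in B_all."""
    labels = np.asarray(vertex_part_labels)
    return {b: np.flatnonzero(labels == b) for b in B_ALL}

# Toy example: random labels stand in for the real per-vertex assignment.
rng = np.random.default_rng(0)
parts = partition_vertices(rng.choice(B_ALL, size=6449))
assert sum(len(v) for v in parts.values()) == 6449  # disjoint and covering
```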
Figure 2. Relationship between the different IMU coordinate systems (sensor local, sensor global, camera global) and the orientations q_{X,root}(t_0), q_{S,root}(t_0), q_{X,root}(t), q_{S,root}(t), and Δq(t).
Sensors. As depth camera we use a Microsoft Kinect running at 30 fps, but in Sect. 8 we also show that our approach works on time-of-flight camera data. As additional sensors, we use inertial measurement units (IMUs), which are able to determine their relative orientation with respect to a global coordinate system, irrespective of visibility from a camera. IMUs are nowadays manufactured cheaply and compactly, and integrated into many hand-held devices, such as smart phones and game consoles. In this paper, we use six Xsens MTx IMUs, attached to the trunk (s_root), the forearms (s_larm, s_rarm), the lower legs (s_lleg, s_rleg), and the head (s_head), see Fig. 4a. The sensor s_root gives us information about the global body orientation, while the sensors on the arms and legs give cues about the configuration of the extremities. Finally, the head sensor is important to resolve some of the ambiguities in sparse inertial features; for instance, it helps us to discriminate upright from crouched full-body poses. The sensors' orientations are described as the transformations from the sensors' local coordinate systems to a global coordinate system and are denoted by q_root, q_larm, q_rarm, q_lleg, q_rleg, and q_head. In our implementation, we use unit quaternions to represent these transformations, as they best suit our processing steps.

For ease of explanation, we introduce the concept of a virtual sensor, which provides a simulated orientation reading of an IMU for a given pose X of our kinematic skeleton. Furthermore, the transformation between the virtual sensor's coordinate system and the depth camera's global coordinate system can be calculated. For clarity, we add X or S to the index: e.g., q_{S,root} denotes the measured orientation of the real sensor attached to the trunk, while q_{X,root} represents the reading of the virtual sensor for a given pose X. Note that, while the exact placement of the sensors relative to the bones is not critical, it needs to be roughly the same for corresponding real and virtual sensors. Further calibration of the sensors is not required. The orientation of a sensor at time t is denoted as q_root(t).
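A rough sketch of the virtual-sensor idea under assumed conventions (this is not the authors' implementation): the simulated orientation of a virtual sensor is obtained by composing the pose-dependent joint rotations along the kinematic chain with a fixed sensor-mounting offset.

```python
from scipy.spatial.transform import Rotation as R

def virtual_sensor_orientation(chain_rotations, sensor_to_bone):
    """Simulate the reading q_{X,b} of a virtual sensor for a given pose X.

    chain_rotations: pose-dependent joint rotations from the skeleton root down
        to the bone carrying the sensor (each a scipy Rotation).
    sensor_to_bone: fixed Rotation mapping sensor-frame vectors into the bone
        frame; as noted above, it only needs to roughly match the real mounting.
    Returns the Rotation mapping sensor-local vectors to camera-global ones.
    """
    q = R.identity()
    for joint_rotation in chain_rotations:  # compose root -> ... -> bone
        q = q * joint_rotation
    return q * sensor_to_bone
```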
4. Visibility Model
Our visibility model enables us to reliably detect the global body pose and the visibility of body parts in the depth camera. This information is then used to establish reliable correspondences between the depth image and the body model during generative tracking, even under occlusion. Furthermore, it enables us to decide whether inertial or optical data are more reliable for pose retrieval.
Global body position and orientation. In [1], the authors use plane fitting to a heuristically chosen subset of depth data to compute the body orientation and the translation of the depth centroid. Their approach fails if the person is not roughly facing the camera or if body parts are occluding the torso. Inertial sensors are able to measure their orientation in space independent of occlusions and lack of data in the depth channel. We thus use the orientation of the sensor s_root to get a good estimate of the body's front direction f within the camera's global coordinate system, even in difficult non-frontal poses, Fig. 3b. However, inertial sensors measure their orientation with respect to some global sensor coordinate system that in general is not identical to the camera's global coordinate system, see also Fig. 2. For that reason, we calculate the transformation q_{X,root}(t) in a similar fashion as described in [10], using relative transformations

Δq(t) := q̄_{S,root}(t_0) q_{S,root}(t)

with respect to an initial orientation at time t_0. Here, q̄ denotes the inverse transformation of q, while q_2 q_1 expresses that transformation q_2 is executed after transformation q_1. The transformations q_{S,root}(t_0) and q_{S,root}(t) can be directly obtained from the sensor's measurements. The desired transformation from the sensor's coordinate system to the camera's global coordinate system at time t is now q_{X,root}(t) = q_{X,root}(t_0) Δq(t). Note that q_{X,root}(t_0) cannot be measured. Instead, we calculate it using virtual sensors and an initial pose X(t_0) at time t_0. For this first frame, we determine the front direction f(t_0) as described in [1] and then use our tracker to compute X(t_0). In all other frames, the front facing direction is defined as

f(t) := q_{X,root}(t) q̄_{X,root}(t_0) [f(t_0)].   (1)

Here, q[v] means that the transformation q is applied to the vector v, Fig. 3b.
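The heading propagation of Eq. (1) can be sketched as follows. This is a non-authoritative illustration: it assumes scipy's rotation handling, and that the initial front direction f(t_0) and initial pose X(t_0) have already been obtained as described above.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def front_direction(q_S_root_t0, q_S_root_t, q_X_root_t0, f_t0):
    """Eq. (1): propagate the front direction f from frame t0 to frame t.

    q_S_root_t0, q_S_root_t : measured trunk-IMU orientations (sensor-local
        -> global sensor frame) at times t0 and t.
    q_X_root_t0 : virtual-sensor orientation (sensor-local -> camera-global)
        computed from the tracked initial pose X(t0).
    f_t0 : front direction at t0 in camera coordinates, found as in [1].
    """
    delta_q = q_S_root_t0.inv() * q_S_root_t   # Δq(t) = q̄_{S,root}(t0) q_{S,root}(t)
    q_X_root_t = q_X_root_t0 * delta_q         # sensor-local -> camera-global at t
    return (q_X_root_t * q_X_root_t0.inv()).apply(f_t0)

# Toy usage with arbitrary orientations:
f0 = np.array([0.0, 0.0, 1.0])
q0 = R.from_euler("y", 10, degrees=True)
qt = R.from_euler("y", 60, degrees=True)
qx0 = R.from_euler("y", 5, degrees=True)
print(front_direction(q0, qt, qx0, f0))
```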
Body part visibility. The second important piece of information supplied by our visibility model is which parts of the model are visible to the depth camera. To infer body part visibility, we compute all vertices C_X ⊆ M_X of the body mesh that the depth camera sees in pose X. To this end, we resort to rendering the model and fast OpenGL visibility testing. The visibility of a body part b is then defined as

V_b := |M_X^b ∩ C_X| / |M_X^b|.   (2)
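A minimal sketch of Eq. (2) and of the thresholding into B_vis discussed below; obtaining the rendered visible-vertex set C_X via OpenGL depth testing is outside the scope of this snippet, so it is passed in as a precomputed index set (an assumption for illustration).

```python
import numpy as np

def body_part_visibility(part_vertices, visible_vertices):
    """Eq. (2): fraction of a body part's mesh vertices seen by the camera.

    part_vertices   : dict mapping body part b to vertex indices M_X^b.
    visible_vertices: set of vertex indices C_X reported visible by the
                      rendering-based visibility test of the model in pose X.
    """
    visible = np.asarray(sorted(visible_vertices))
    return {b: np.isin(idx, visible).sum() / len(idx)
            for b, idx in part_vertices.items()}

def visible_parts(visibility, tau=0.10):
    """B_vis: parts whose visibility reaches the threshold (up to 10% in Sect. 4)."""
    return {b for b, v in visibility.items() if v >= tau}
```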
Figure 3. Tracking of the frame at 5.0 s of sequence D_6 from our evaluation dataset. The views are rotated around the tracked person; the offset w.r.t. the depth camera (0°, 45°, 90°, 45°) is depicted at the bottom of each subfigure. (a) Input depth data. (b) Output of the visibility model; note that the right arm is not visible. (c) Correspondences used by the generative tracker; note that there are no correspondences with the right arm. The pose-parametrized mesh was moved to the left for better visibility. (d) Final fused pose.
The set of visible body parts is denoted as B_vis := {b ∈ B_all : V_b ≥ τ_3}. Note that the accuracy of B_vis depends on M_X resembling the actual pose assumed by the person in the depth image as closely as possible, which is not known before pose estimation. For this reason, we choose the pose X = X_DB obtained by the discriminative tracker, which yields better results than using the pose X(t−1) from the previous step (see Sect. 6). To account for its possible deviation from the "real" pose and to avoid false positives in the set B_vis, we introduce the threshold τ_3 > 0. In the tested scenarios, values of τ_3 up to 10% have shown a good trade-off between rejecting false positives and not rejecting too many body parts that are actually visible.

In the rendering process, a virtual depth image I_X is also created, from which we calculate the first M = 50 geodesic extrema in the same way as for the real depth image I, see [1]. Finally, we denote the vertices that generated the extrema's depth points with C_X^M.
5. Generative Pose Estimation
Similar to [1], generative tracking optimizes skeletal pose parameters by minimizing the distance between corresponding points on the model and in the depth data. Baak et al. fix C_X manually and never update it during tracking. For every point in C_X they find the closest point in the depth point cloud M_I, and minimize the sum of distances between model and data points by local optimization in the joint angles. Obviously, this leads to wrong correspondences if the person strikes a pose in which large parts of the body are occluded.

In our approach, we also use a local optimization scheme to find a pose X that best aligns the model M_X to the point cloud M_I. In contrast to prior work, it also considers which parts of the body are visible and can actually contribute to explaining a good alignment into the depth image. Furthermore, we define subsets X_b, b ∈ B_all, of all pose parameters in X that affect the corresponding point sets M_X^b, and we define the set of active pose parameters X_act := ∪_{b ∈ B_vis} X_b.

Figure 4. (a) Placement of the sensors on the body and the normalized orientation (e.g. q̂_larm(t)) w.r.t. s_root. (b) Body part directions d_rarm, d_larm, d_rleg, d_lleg, d_head used as inertial features for indexing the database. (c) Two poses that cannot be distinguished using inertial features. (d) The same two poses look different when using optical features.
Finally, the energy function is given as

d(M_X, M_I) := d_{M_X→M_I} + d_{M_I→M_X},   (3)

d_{M_X→M_I} := (1/M) Σ_{v ∈ C_X^M} min_{p ∈ M_I} ||p − v||_2,   (4)

d_{M_I→M_X} := (1/N) Σ_{e ∈ E_I^N} min_{v ∈ M_X} ||e − v||_2.   (5)

Here, E_I^N represents the first N = 50 geodesic extrema in I, while C_X^M is a subset of C_X containing M = 50 visible vertices, see Sect. 4 for details. A visualization of the resulting correspondences can be seen in Fig. 3c. As opposed to Baak et al., we minimize d(M_X, M_I) using a gradient descent solver similar to the one used in [14], and employ analytic derivatives.
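The symmetric point-set distance of Eqs. (3)-(5) can be sketched with kd-trees for the nearest-neighbour terms. This is only an illustration of the energy itself; the actual method minimizes it over the active pose parameters X_act with a gradient-descent solver and analytic derivatives, which is not reproduced here.

```python
import numpy as np
from scipy.spatial import cKDTree

def model_data_distance(model_vertices, visible_model_vertices,
                        depth_points, geodesic_extrema):
    """d(M_X, M_I) = d_{M_X -> M_I} + d_{M_I -> M_X}, Eqs. (3)-(5).

    model_vertices         : all vertices of the posed mesh M_X (V x 3).
    visible_model_vertices : the M visible vertices C_X^M (M x 3).
    depth_points           : the depth point cloud M_I (P x 3).
    geodesic_extrema       : the first N geodesic extrema E_I^N of I (N x 3).
    """
    # Eq. (4): mean distance from each visible model vertex to its closest depth point.
    d_model_to_data = cKDTree(depth_points).query(visible_model_vertices)[0].mean()
    # Eq. (5): mean distance from each geodesic extremum to the closest model vertex.
    d_data_to_model = cKDTree(model_vertices).query(geodesic_extrema)[0].mean()
    # Eq. (3): symmetric sum of both terms.
    return d_model_to_data + d_data_to_model
```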
6. Discriminative Pose Estimation
In hybrid tracking, discriminative tracking complements generative tracking by continuously re-initializing the pose optimization when generative tracking converges to an erroneous pose optimum (see also Sect. 7). We present a new discriminative pose estimation approach that retrieves poses from a database with 50,000 poses obtained from motion sequences recorded using a marker-based mocap system. It adaptively relies on optical features for pose look-up, and on new inertial features, depending on the visibility and thus reliability of each sensor type. In combination, this enables tracking of poses with strong occlusions, and it stabilizes pose estimation in front-facing poses.
Optical database lookup. In order to retrieve a pose X_I^DB matching the one in the depth image from the database, Baak et al. [1] use geodesic extrema computed on the depth map as index. In their original work, they expect that the first five geodesic extrema E_I^5 from the depth image I are roughly co-located with the positions of the body extrema (head, hands and feet). The geodesic extrema also need to be correctly labeled. Furthermore, the poses in their database are normalized w.r.t. the global body orientation, which reduces the database size. As a consequence, queries into the database also need to be pose-normalized. We use Baak et al.'s geodesic extrema for optical lookup, but use our more robust way of estimating f(t) for normalization, see Sect. 4. Our method thus fares better even in poses where all geodesic extrema are found, but the pose is lateral to the camera.
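The orientation normalization of the optical query is not spelled out above. Purely as an illustration, and under the assumptions that the camera's up axis is y and that normalization amounts to removing the heading encoded in f(t), it could look like this:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def normalize_query(extrema_positions, front_direction, up=np.array([0.0, 1.0, 0.0])):
    """Rotate the geodesic-extrema query into a heading-normalized frame.

    Assumes the database poses were normalized by removing the rotation of the
    body's front direction f(t) about the camera's up axis (taken to be +y here).
    """
    f = front_direction - np.dot(front_direction, up) * up  # project onto ground plane
    f = f / np.linalg.norm(f)
    heading = np.arctan2(f[0], f[2])                         # angle about the up axis
    undo_heading = R.from_rotvec(-heading * up)
    return undo_heading.apply(extrema_positions)
```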
Inertial database lookup. In poses where not all body extrema are visible, or where they are too close to the torso, the geodesic extrema become unreliable for database lookup. In such cases, we resort to IMU data, in particular their orientations relative to the coordinate system of the sensor s_root, see Fig. 4a. Similar to the optical features based on geodesic extrema, these normalized orientations q̂_b(t) := q̄_root(t) q_b(t), b ∈ B = {larm, rarm, lleg, rleg, head}, are invariant to the tracked person's global orientation but capture the relative orientation of the various parts of the person's body. However, using these normalized orientations directly as an index has one disadvantage: many orientation representations need special similarity metrics that are often incompatible with fast indexing structures, such as kd-trees. Instead, we use a vector d_b ∈ R^3 that points in the direction of the bone of a body part, see Fig. 4b. In our setup, these directions are co-aligned with the sensors' local X-axis for all sensors except for the sensor s_head, where the direction is co-aligned with the local Y-axis. The normalized directions d̂_b(t) := q̂_b(t)[d_b] are then stacked to serve as the inertial feature-based query to the database. The retrieved pose is denoted as X_S^DB.
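A sketch of the inertial query under the conventions described above (sensor set and bone axes); the kd-tree usage at the end is an illustrative assumption about how pre-computed database features might be indexed.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.transform import Rotation as R

BODY_PARTS = ("larm", "rarm", "lleg", "rleg", "head")

def inertial_feature(q_root, q_parts):
    """Stack the normalized bone directions d̂_b(t) = q̂_b(t)[d_b] into one query vector.

    q_root : Rotation of the trunk sensor s_root (sensor-local -> global).
    q_parts: dict of Rotations for the remaining five sensors.
    """
    features = []
    for b in BODY_PARTS:
        # Local bone axis d_b: sensor X-axis for the limbs, Y-axis for the head sensor.
        d_b = np.array([0.0, 1.0, 0.0]) if b == "head" else np.array([1.0, 0.0, 0.0])
        q_hat = q_root.inv() * q_parts[b]    # q̂_b(t) = q̄_root(t) q_b(t)
        features.append(q_hat.apply(d_b))
    return np.concatenate(features)          # 15-dimensional query vector

# Indexing a (hypothetical) database of pre-computed features:
# db_features = np.stack([inertial_feature(q_root_i, q_parts_i) for ...])
# tree = cKDTree(db_features)
# _, nearest_pose_id = tree.query(inertial_feature(q_root_now, q_parts_now))
```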
Selecting optical or inertial lookup. At first sight, it may seem that inertial features alone are sufficient to look up poses from the database, because they are independent of visibility issues. However, with our sparse set of six IMUs, the inertial data alone are often not discriminative enough to exactly characterize body poses. Some very different poses may induce the same inertial readings and are thus ambiguous, see also Fig. 4c. Of course, adding more IMUs to the body would remedy the problem, but it would starkly impair usability and, as we show in the following, is not necessary. Optical geodesic extrema features are very accurate and discriminative of a pose, given that they are reliably found, which is not the case for all extrema in difficult, non-frontal, strongly occluded poses, see Fig. 4d. Therefore, we introduce two reliability measures to assess the reliability of optical features for retrieval, and use the inertial features only as a fall-back modality for retrieval in case the optical features are unreliable.
