Real-time Body Tracking with One Depth Camera and Inertial Sensors
Thomas Helten    Meinard Müller    Hans-Peter Seidel    Christian Theobalt
Saarland University and MPI Informatik
International Audio Laboratories Erlangen
{thelten,theobalt}@mpi-inf.mpg.de    meinard.mueller@audiolabs-erlangen.de
Abstract
In recent years, the availability of inexpensive depth cameras, such as the Microsoft Kinect, has boosted the research in monocular full body skeletal pose tracking. Unfortunately, existing trackers often fail to capture poses where a single camera provides insufficient data, such as non-frontal poses, and all other poses with body part occlusions. In this paper, we present a novel sensor fusion approach for real-time full body tracking that succeeds in such difficult situations. It takes inspiration from previous tracking solutions, and combines a generative tracker and a discriminative tracker retrieving closest poses in a database. In contrast to previous work, both trackers employ data from a low number of inexpensive body-worn inertial sensors. These sensors provide reliable and complementary information when the monocular depth information alone is not sufficient. We also contribute new algorithmic solutions to best fuse depth and inertial data in both trackers. One is a new visibility model to determine the global body pose, occlusions, and usable depth correspondences, and to decide which data modality to use for discriminative tracking. We further contribute a new inertial-based pose retrieval and an adapted late fusion step to calculate the final body pose.
1. Introduction
In recent years, the advent of new and inexpensive cameras that measure 2.5D depth images has triggered extensive research in monocular human pose tracking. Most of the trackers introduced so far can be classified into three families: discriminative approaches, generative approaches, and approaches combining both strategies. While discriminative trackers detect cues in the depth image and derive a pose hypothesis from them using a retrieval strategy, generative trackers optimize for the parameters of a human model to best explain the observed depth image. Combining discriminative and generative approaches, hybrid trackers have shown good results for fast motions in real-time scenarios where tracked actors face the camera more or less frontally. However, noise in the depth data and the ambiguous representation of human poses in depth images are still a challenge and often lead to tracking errors, even if all body parts are actually exposed to the camera. In addition, if large parts of the body are occluded from view, tracking of the full pose is not possible. Using multiple depth cameras can partially remedy the problem [19], but does not eradicate occlusion problems, and is not always practical in home user scenarios. Depth data alone may thus not be sufficient to capture poses accurately in such challenging scenarios. In this paper, we show that fusing a depth tracker with an additional sensor modality, which provides information complementary to the 2.5D depth video, can overcome these limitations. In particular, we use the orientation data obtained from a sparse set of inexpensive inertial measurement devices fixed to the arms, legs, the trunk, and the head of the tracked person. We include this additional information as stabilizing evidence in a hybrid tracker that combines generative and discriminative pose computation. Our approach enables us to track fast and dynamic motions, including non-frontal poses and poses with significant self-occlusions, accurately and in real-time.

(This work was funded by the ERC Starting Grant "CapReal".)
Contributions. Our method is the first to adaptively fuse inertial and depth information in a combined generative and discriminative monocular pose estimation framework. To enable this, we contribute a novel visibility model for determining which parts of the body are visible to the depth camera. This model tells which data modality is reliable and can be used to infer the pose, and enables us to more robustly infer the global body orientation even in challenging poses, see Sect. 4. Our second contribution is a generative tracker that fuses optical and inertial cues depending on body part visibility, and finds pose parameters via optimization, see Sect. 5. As a third contribution, we introduce two separate retrieval schemes for handling optical and inertial cues when retrieving database poses during discriminative tracking, see Sect. 6. The final pose is found in a late fusion step which uses the results of both trackers mentioned above, see Sect. 7. We evaluate our proposed tracker on an extensive dataset including calibrated depth images, inertial sensor data, as well as ground-truth data obtained with a traditional marker-based mocap system, see Sect. 8. This dataset is publicly available at http://resources.mpi-inf.mpg.de/InertialDepthTracker. We also show qualitatively and quantitatively that our method accurately captures poses even under strong occlusion where other trackers fail.

Figure 1. Three typical failure cases of a current real-time tracker combining generative and discriminative pose estimation [1] (left: input depth image; middle: recovered pose of the body model with catastrophic pose errors; right: significantly better result using our approach): (a) occluded body parts, (b) non-frontal poses, and (c) both at the same time.
2. Related Work
Marker-less pose estimation from multi-view video has been a long-standing problem in computer vision, and nowadays mature solutions exist, see [11] for an overview. Recently, so-called depth cameras that measure 2.5D geometry information in real-time have emerged [6, 21]. Many monocular tracking algorithms use this depth data for human pose estimation. They can be classified into discriminative approaches, generative approaches and hybrid approaches, reviewed in the following. A discriminative strategy based on body part detectors that also estimated body part orientations on depth images was presented in [9]. Body part detectors and a mapping to a kinematic skeleton are used in [22] to track full-body poses at interactive frame rates. The approach [13] uses regression forests based on depth features to estimate the joint positions of the tracked person without the need for a kinematic model of its skeleton. Later, [4] further increased the accuracy by also detecting some occluded joints in non-frontal poses. Finally, also using depth features and regression forests, [16] generate correspondences between body parts and a pose- and size-parametrized human model that is optimized in real-time using a one-shot optimization approach. While showing good results on a single-frame basis, these approaches cannot deduce the true poses of body parts that are invisible to the camera.

By using kinematic body models with simple shape primitives, the pose of an actor can be found using a generative strategy. The body model is fitted to depth data or to a combination of depth and image features [5, 8]. [2] propose a generative depth-based tracker using a modified energy function that incorporates empty-space information as well as inter-penetration constraints. An approach that uses multiple depth cameras for pose estimation, which reduces the occlusion problem, is presented in [19]. The approach is not real-time capable, though. With all these depth-based methods, real-time pose estimation is still a challenge, tracking may drift, and with the exception of [19] the employed shape models are rather coarse, which impairs pose estimation accuracy.

Salzmann et al. [12] combine generative and discriminative approaches with the goal to reconstruct general 3D deformable surfaces. Soon after that, [3] showed a hybrid approach specialized to reconstruct human 3D pose from depth images, using the body part detectors proposed by [9] as a regularizing component. Further accuracy improvements were achieved by [1, 20] using regularizing poses from a pre-recorded database as input to the generative tracker. Here, [1] was the first approach running at real-time frame rates of more than 50 fps, whereas Ye et al.'s method [20] is an offline approach. Other real-time algorithms were proposed, e.g., by [17], who use a body-part detector similar to [13] to augment a generative tracker. However, none of these hybrid approaches is able to give a meaningful pose hypothesis for non-visible body parts in case of occlusions.

Methods that reconstruct motions based on inertial sensors only have been proposed, e.g., in [7, 15]. Here, either densely placed sensors or large databases containing motions are used. Also, reconstructing the global position is not possible.

Only a few vision algorithms so far use fusion with complementary sensor systems for full-body tracking. One approach combining 3D inertial information and multi-view markerless motion capture was presented in [10]. Here, the orientation data of five inertial sensors was used as an additional energy term to stabilize the local pose optimization. Another example is [23], where information from densely placed inertial sensors is fused with global position estimation using a laser-range-scanner-equipped robot accompanying the tracked person.
3. Hybrid Inertial Tracker - An Overview
Recent hybrid (generative + discriminative) monocular tracking algorithms, e.g. [1, 17], can track human skeletons in real-time from a single depth camera, as long as the body is mostly front-facing. However, even in frontal poses, tracking may fail due to complex self-occlusions, limbs close to the body, and other ambiguities. It certainly fails if large sections of the body are completely invisible to the camera, such as in lateral postures, see Fig. 1c. Our new hybrid depth-based tracker succeeds in such cases by incorporating additional inertial sensor data for tracking stabilization. While our concepts are in general applicable to a wide range of generative, discriminative and hybrid approaches, we modify the hybrid depth-based tracker by Baak et al. [1] to demonstrate our concepts. This tracker uses discriminative features detected in the depth data, so-called geodesic extrema E_I, to query a database containing pre-recorded full body poses. These poses are then used to initialize a generative tracker that optimizes skeletal pose parameters X of a mesh-based human body model M_X ⊂ R^3 to best explain the 3D point cloud M_I ⊂ R^3 of the observed depth image I. In a late fusion step, the tracker decides between two pose hypotheses: one obtained using the database pose as initialization, and one obtained using the previously tracked pose as initialization. Baak et al.'s approach makes two assumptions: the person to be tracked is facing the depth camera, and all body parts are visible to the depth camera. It therefore fails in the difficult poses mentioned earlier (see Fig. 1 for some examples).

In our new hybrid approach, we overcome these limitations by modifying every step of the original algorithm to benefit from depth and inertial data together. In particular, we introduce a visibility model to decide which data modality is best used in each pose estimation step, and develop a discriminative tracker combining both kinds of data. We also empower generative tracking to use both kinds of data for reliable pose inference, and develop a new late fusion step using both modalities.
Body Model. Similar to [1], we use a body model comprising a surface mesh M_X of 6,449 vertices, whose deformation is controlled by an embedded kinematic skeleton of 62 joints and 42 degrees of freedom via surface skinning. Currently, the model is manually adapted to the actor, but automatic shape adaptation is feasible, see e.g. [18]. Furthermore, let B_all := {larm, rarm, lleg, rleg, body} be a set of body parts representing the left and right arm, the left and right leg, and the rest of the body. We then define five disjoint subsets M_X^b, b ∈ B_all, containing all vertices of M_X belonging to body part b.
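As a minimal illustration (not the authors' data structures), these disjoint vertex subsets can be represented as index arrays keyed by body part; the per-vertex part labels used below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical per-vertex part labels for the 6,449-vertex mesh M_X; in the
# paper these assignments follow from the skinning of the kinematic skeleton.
B_ALL = ("larm", "rarm", "lleg", "rleg", "body")

def partition_vertices(vertex_part_labels):
    """Split mesh vertex indices into the five disjoint sets M_X^b, b in B_all."""
    labels = np.asarray(vertex_part_labels)
    return {b: np.flatnonzero(labels == b) for b in B_ALL}

# Toy example: random labels stand in for the real per-vertex assignment.
rng = np.random.default_rng(0)
parts = partition_vertices(rng.choice(B_ALL, size=6449))
assert sum(len(v) for v in parts.values()) == 6449  # disjoint and covering
```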
Figure 2. Relationship between the different IMU coordinate systems (sensor local, sensor global, camera global) and the orientations q_{X,root}(t_0), q_{S,root}(t_0), q_{X,root}(t), q_{S,root}(t), and Δq(t).
Sensors. As depth camera we use a Microsoft Kinect running at 30 fps, but in Sect. 8 we also show that our approach works on time-of-flight camera data. As additional sensors, we use inertial measurement units (IMUs), which are able to determine their relative orientation with respect to a global coordinate system, irrespective of visibility from a camera. IMUs are nowadays manufactured cheaply and compactly, and integrated into many hand-held devices, such as smart phones and game consoles. In this paper, we use six Xsens MTx IMUs, attached to the trunk (s_root), the forearms (s_larm, s_rarm), the lower legs (s_lleg, s_rleg), and the head (s_head), see Fig. 4a. The sensor s_root gives us information about the global body orientation, while the sensors on the arms and legs give cues about the configuration of the extremities. Finally, the head sensor is important to resolve some of the ambiguities in sparse inertial features; for instance, it helps us to discriminate upright from crouched full-body poses. The sensors' orientations are described as the transformations from the sensors' local coordinate systems to a global coordinate system and are denoted by q_root, q_larm, q_rarm, q_lleg, q_rleg, and q_head. In our implementation, we use unit quaternions to represent these transformations, as they best suit our processing steps.

For ease of explanation, we introduce the concept of a virtual sensor, which provides a simulated orientation reading of an IMU for a given pose X of our kinematic skeleton. Furthermore, the transformation between the virtual sensor's coordinate system and the depth camera's global coordinate system can be calculated. For clarity, we add X or S to the index: e.g., q_{S,root} denotes the measured orientation of the real sensor attached to the trunk, while q_{X,root} represents the reading of the virtual sensor for a given pose X. Note that, while the exact placement of the sensors relative to the bones is not critical, it needs to be roughly the same for corresponding real and virtual sensors. Further calibration of the sensors is not required. The orientation of a sensor at time t is denoted as q_root(t).
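A rough sketch of the virtual-sensor idea under assumed conventions (this is not the authors' implementation): the simulated orientation of a virtual sensor is obtained by composing the pose-dependent joint rotations along the kinematic chain with a fixed sensor-mounting offset.

```python
from scipy.spatial.transform import Rotation as R

def virtual_sensor_orientation(chain_rotations, sensor_to_bone):
    """Simulate the reading q_{X,b} of a virtual sensor for a given pose X.

    chain_rotations: pose-dependent joint rotations from the skeleton root down
        to the bone carrying the sensor (each a scipy Rotation).
    sensor_to_bone: fixed Rotation mapping sensor-frame vectors into the bone
        frame; as noted above, it only needs to roughly match the real mounting.
    Returns the Rotation mapping sensor-local vectors to camera-global ones.
    """
    q = R.identity()
    for joint_rotation in chain_rotations:  # compose root -> ... -> bone
        q = q * joint_rotation
    return q * sensor_to_bone
```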
4. Visibility Model
Our visibility model enables us to reliably detect the global body pose and the visibility of body parts in the depth camera. This information is then used to establish reliable correspondences between the depth image and the body model during generative tracking, even under occlusion. Furthermore, it enables us to decide whether inertial or optical data are more reliable for pose retrieval.
Global body position and orientation. In [1], the authors use plane fitting to a heuristically chosen subset of depth data to compute the body orientation and the translation of the depth centroid. Their approach fails if the person is not roughly facing the camera or if body parts are occluding the torso. Inertial sensors are able to measure their orientation in space independent of occlusions and lack of data in the depth channel. We thus use the orientation of the sensor s_root to get a good estimate of the body's front direction f within the camera's global coordinate system, even in difficult non-frontal poses, Fig. 3b. However, inertial sensors measure their orientation with respect to some global sensor coordinate system that in general is not identical to the camera's global coordinate system, see also Fig. 2. For that reason, we calculate the transformation q_{X,root}(t) in a similar fashion as described in [10], using relative transformations

Δq(t) := q̄_{S,root}(t_0) q_{S,root}(t)

with respect to an initial orientation at time t_0. Here, q̄ denotes the inverse transformation of q, while q_2 q_1 expresses that transformation q_2 is executed after transformation q_1. The transformations q_{S,root}(t_0) and q_{S,root}(t) can be directly obtained from the sensor's measurements. The desired transformation from the sensor's coordinate system to the camera's global coordinate system at time t is now q_{X,root}(t) = q_{X,root}(t_0) Δq(t). Note that q_{X,root}(t_0) cannot be measured. Instead, we calculate it using virtual sensors and an initial pose X(t_0) at time t_0. For this first frame, we determine the front direction f(t_0) as described in [1] and then use our tracker to compute X(t_0). In all other frames, the front facing direction is defined as

f(t) := q_{X,root}(t) q̄_{X,root}(t_0) [f(t_0)].   (1)

Here, q[v] means that the transformation q is applied to the vector v, Fig. 3b.
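The heading propagation of Eq. (1) can be sketched as follows. This is a non-authoritative illustration: it assumes scipy's rotation handling, and that the initial front direction f(t_0) and initial pose X(t_0) have already been obtained as described above.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def front_direction(q_S_root_t0, q_S_root_t, q_X_root_t0, f_t0):
    """Eq. (1): propagate the front direction f from frame t0 to frame t.

    q_S_root_t0, q_S_root_t : measured trunk-IMU orientations (sensor-local
        -> global sensor frame) at times t0 and t.
    q_X_root_t0 : virtual-sensor orientation (sensor-local -> camera-global)
        computed from the tracked initial pose X(t0).
    f_t0 : front direction at t0 in camera coordinates, found as in [1].
    """
    delta_q = q_S_root_t0.inv() * q_S_root_t   # Δq(t) = q̄_{S,root}(t0) q_{S,root}(t)
    q_X_root_t = q_X_root_t0 * delta_q         # sensor-local -> camera-global at t
    return (q_X_root_t * q_X_root_t0.inv()).apply(f_t0)

# Toy usage with arbitrary orientations:
f0 = np.array([0.0, 0.0, 1.0])
q0 = R.from_euler("y", 10, degrees=True)
qt = R.from_euler("y", 60, degrees=True)
qx0 = R.from_euler("y", 5, degrees=True)
print(front_direction(q0, qt, qx0, f0))
```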
Body part visibility. The second important piece of information supplied by our visibility model is which parts of the model are visible to the depth camera. To infer body part visibility, we compute all vertices C_X ⊆ M_X of the body mesh that the depth camera sees in pose X. To this end, we resort to rendering the model and fast OpenGL visibility testing. The visibility of a body part b is then defined as

V_b := |M_X^b ∩ C_X| / |M_X^b|.   (2)
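A minimal sketch of Eq. (2) and of the thresholding into B_vis discussed below; obtaining the rendered visible-vertex set C_X via OpenGL depth testing is outside the scope of this snippet, so it is passed in as a precomputed index set (an assumption for illustration).

```python
import numpy as np

def body_part_visibility(part_vertices, visible_vertices):
    """Eq. (2): fraction of a body part's mesh vertices seen by the camera.

    part_vertices   : dict mapping body part b to vertex indices M_X^b.
    visible_vertices: set of vertex indices C_X reported visible by the
                      rendering-based visibility test of the model in pose X.
    """
    visible = np.asarray(sorted(visible_vertices))
    return {b: np.isin(idx, visible).sum() / len(idx)
            for b, idx in part_vertices.items()}

def visible_parts(visibility, tau=0.10):
    """B_vis: parts whose visibility reaches the threshold (up to 10% in Sect. 4)."""
    return {b for b, v in visibility.items() if v >= tau}
```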
Figure 3. Tracking of the frame at 5.0 s of sequence D_6 from our evaluation dataset. The views are rotated around the tracked person; the offset w.r.t. the depth camera (0°, 45°, 90°, 45°) is depicted at the bottom of each subfigure. (a) Input depth data. (b) Output of the visibility model; note that the right arm is not visible. (c) Correspondences used by the generative tracker; note that there are no correspondences with the right arm. The pose-parametrized mesh was moved to the left for better visibility. (d) Final fused pose.
The set of visible body parts is denoted as B_vis := {b ∈ B_all : V_b ≥ τ_3}. Note that the accuracy of B_vis depends on M_X resembling the actual pose assumed by the person in the depth image as closely as possible, which is not known before pose estimation. For this reason, we choose the pose X = X_DB obtained by the discriminative tracker, which yields better results than using the pose X(t−1) from the previous step (see Sect. 6). To account for its possible deviation from the "real" pose and to avoid false positives in the set B_vis, we introduce the threshold τ_3 > 0. In the tested scenarios, values of τ_3 up to 10% have shown a good trade-off between rejecting false positives and not rejecting too many body parts that are actually visible.

In the rendering process, a virtual depth image I_X is also created, from which we calculate the first M = 50 geodesic extrema in the same way as for the real depth image I, see [1]. Finally, we denote the vertices that generated the extrema's depth points with C_X^M.
5. Generative Pose Estimation
Similar to [1], generative tracking optimizes skeletal pose parameters by minimizing the distance between corresponding points on the model and in the depth data. Baak et al. fix C_X manually and never update it during tracking. For every point in C_X they find the closest point in the depth point cloud M_I, and minimize the sum of distances between model and data points by local optimization in the joint angles. Obviously, this leads to wrong correspondences if the person strikes a pose in which large parts of the body are occluded.

In our approach, we also use a local optimization scheme to find a pose X that best aligns the model M_X to the point cloud M_I. In contrast to prior work, it also considers which parts of the body are visible and can actually contribute to explaining a good alignment into the depth image. Furthermore, we define subsets X_b, b ∈ B_all, of all pose parameters in X that affect the corresponding point sets M_X^b, and we define the set of active pose parameters X_act := ∪_{b ∈ B_vis} X_b.

Figure 4. (a) Placement of the sensors on the body and the normalized orientation (e.g. q̂_larm(t)) w.r.t. s_root. (b) Body part directions d_rarm, d_larm, d_rleg, d_lleg, d_head used as inertial features for indexing the database. (c) Two poses that cannot be distinguished using inertial features. (d) The same two poses look different when using optical features.
Finally, the energy function is given as

d(M_X, M_I) := d_{M_X→M_I} + d_{M_I→M_X},   (3)

d_{M_X→M_I} := (1/M) Σ_{v ∈ C_X^M} min_{p ∈ M_I} ||p − v||_2,   (4)

d_{M_I→M_X} := (1/N) Σ_{e ∈ E_I^N} min_{v ∈ M_X} ||e − v||_2.   (5)

Here, E_I^N represents the first N = 50 geodesic extrema in I, while C_X^M is a subset of C_X containing M = 50 visible vertices, see Sect. 4 for details. A visualization of the resulting correspondences can be seen in Fig. 3c. As opposed to Baak et al., we minimize d(M_X, M_I) using a gradient descent solver similar to the one used in [14], and employ analytic derivatives.
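The symmetric point-set distance of Eqs. (3)-(5) can be sketched with kd-trees for the nearest-neighbour terms. This is only an illustration of the energy itself; the actual method minimizes it over the active pose parameters X_act with a gradient-descent solver and analytic derivatives, which is not reproduced here.

```python
import numpy as np
from scipy.spatial import cKDTree

def model_data_distance(model_vertices, visible_model_vertices,
                        depth_points, geodesic_extrema):
    """d(M_X, M_I) = d_{M_X -> M_I} + d_{M_I -> M_X}, Eqs. (3)-(5).

    model_vertices         : all vertices of the posed mesh M_X (V x 3).
    visible_model_vertices : the M visible vertices C_X^M (M x 3).
    depth_points           : the depth point cloud M_I (P x 3).
    geodesic_extrema       : the first N geodesic extrema E_I^N of I (N x 3).
    """
    # Eq. (4): mean distance from each visible model vertex to its closest depth point.
    d_model_to_data = cKDTree(depth_points).query(visible_model_vertices)[0].mean()
    # Eq. (5): mean distance from each geodesic extremum to the closest model vertex.
    d_data_to_model = cKDTree(model_vertices).query(geodesic_extrema)[0].mean()
    # Eq. (3): symmetric sum of both terms.
    return d_model_to_data + d_data_to_model
```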
6. Discriminative Pose Estimation
In hybrid tracking, discriminative tracking complements generative tracking by continuously re-initializing the pose optimization when generative tracking converges to an erroneous pose optimum (see also Sect. 7). We present a new discriminative pose estimation approach that retrieves poses from a database with 50,000 poses obtained from motion sequences recorded using a marker-based mocap system. It adaptively relies on optical features for pose look-up, and on new inertial features, depending on the visibility and thus reliability of each sensor type. In combination, this enables tracking of poses with strong occlusions, and it stabilizes pose estimation in front-facing poses.
Optical database lookup. In order to retrieve a pose X_I^DB matching the one in the depth image from the database, Baak et al. [1] use geodesic extrema computed on the depth map as index. In their original work, they expect that the first five geodesic extrema E_I^5 from the depth image I are roughly co-located with the positions of the body extrema (head, hands and feet). The geodesic extrema also need to be correctly labeled. Furthermore, the poses in their database are normalized w.r.t. the global body orientation, which reduces the database size. As a consequence, queries into the database also need to be pose-normalized. We use Baak et al.'s geodesic extrema for optical lookup, but use our more robust way of estimating f(t) for normalization, see Sect. 4. Our method thus fares better even in poses where all geodesic extrema are found, but the pose is lateral to the camera.
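The orientation normalization of the optical query is not spelled out above. Purely as an illustration, and under the assumptions that the camera's up axis is y and that normalization amounts to removing the heading encoded in f(t), it could look like this:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def normalize_query(extrema_positions, front_direction, up=np.array([0.0, 1.0, 0.0])):
    """Rotate the geodesic-extrema query into a heading-normalized frame.

    Assumes the database poses were normalized by removing the rotation of the
    body's front direction f(t) about the camera's up axis (taken to be +y here).
    """
    f = front_direction - np.dot(front_direction, up) * up  # project onto ground plane
    f = f / np.linalg.norm(f)
    heading = np.arctan2(f[0], f[2])                         # angle about the up axis
    undo_heading = R.from_rotvec(-heading * up)
    return undo_heading.apply(extrema_positions)
```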
Inertial database lookup. In poses where not all body extrema are visible, or where they are too close to the torso, the geodesic extrema become unreliable for database lookup. In such cases, we resort to IMU data, in particular their orientations relative to the coordinate system of the sensor s_root, see Fig. 4a. Similar to the optical features based on geodesic extrema, these normalized orientations q̂_b(t) := q̄_root(t) q_b(t), b ∈ B = {larm, rarm, lleg, rleg, head}, are invariant to the tracked person's global orientation but capture the relative orientation of the various parts of the person's body. However, using these normalized orientations directly as an index has one disadvantage: many orientation representations need special similarity metrics that are often incompatible with fast indexing structures, such as kd-trees. Instead, we use a vector d_b ∈ R^3 that points in the direction of the bone of a body part, see Fig. 4b. In our setup, these directions are co-aligned with the sensors' local X-axis for all sensors except for the sensor s_head, where the direction is co-aligned with the local Y-axis. The normalized directions d̂_b(t) := q̂_b(t)[d_b] are then stacked to serve as the inertial feature-based query to the database. The retrieved pose is denoted as X_S^DB.
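A sketch of the inertial query under the conventions described above (sensor set and bone axes); the kd-tree usage at the end is an illustrative assumption about how pre-computed database features might be indexed.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.transform import Rotation as R

BODY_PARTS = ("larm", "rarm", "lleg", "rleg", "head")

def inertial_feature(q_root, q_parts):
    """Stack the normalized bone directions d̂_b(t) = q̂_b(t)[d_b] into one query vector.

    q_root : Rotation of the trunk sensor s_root (sensor-local -> global).
    q_parts: dict of Rotations for the remaining five sensors.
    """
    features = []
    for b in BODY_PARTS:
        # Local bone axis d_b: sensor X-axis for the limbs, Y-axis for the head sensor.
        d_b = np.array([0.0, 1.0, 0.0]) if b == "head" else np.array([1.0, 0.0, 0.0])
        q_hat = q_root.inv() * q_parts[b]    # q̂_b(t) = q̄_root(t) q_b(t)
        features.append(q_hat.apply(d_b))
    return np.concatenate(features)          # 15-dimensional query vector

# Indexing a (hypothetical) database of pre-computed features:
# db_features = np.stack([inertial_feature(q_root_i, q_parts_i) for ...])
# tree = cKDTree(db_features)
# _, nearest_pose_id = tree.query(inertial_feature(q_root_now, q_parts_now))
```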
Selecting optical or inertial lookup. At first sight, it may seem that inertial features alone are sufficient to look up poses from the database, because they are independent of visibility issues. However, with our sparse set of six IMUs, the inertial data alone are often not discriminative enough to exactly characterize body poses. Some very different poses may induce the same inertial readings and are thus ambiguous, see also Fig. 4c. Of course, adding more IMUs to the body would remedy the problem, but it would starkly impair usability and, as we show in the following, is not necessary. Optical geodesic extrema features are very accurate and discriminative of a pose, given that they are reliably found, which is not the case for all extrema in difficult, non-frontal, strongly occluded poses, see Fig. 4d. Therefore, we introduce two reliability measures to assess the reliability of optical features for retrieval, and use the inertial features only as a fall-back modality for retrieval in case the optical features are unreliable.
