Joint Self-Localization and Tracking of Generic Objects in 3D Range Data

Frank Moosmann¹ and Christoph Stiller¹
Abstract—Both the estimation of the trajectory of a sensor and the detection and tracking of moving objects are essential tasks for autonomous robots. This work proposes a new algorithm that treats both problems jointly. The sole input is a sequence of dense 3D measurements as returned by multi-layer laser scanners or time-of-flight cameras. A major characteristic of the proposed approach is its applicability to any type of environment, since specific object models are not used at any algorithm stage. More specifically, precise localization in non-flat environments is possible, as well as the detection and tracking of e.g. trams or recumbent bicycles. Moreover, 3D shape estimation of moving objects is inherent to the proposed method. A thorough evaluation is conducted on a vehicular platform with a mounted Velodyne HDL-64E laser scanner.
I. INTRODUCTION

Two main tasks can be identified for the perception system of a robot: precise self-localization, often performed simultaneously with mapping (SLAM), and the detection and tracking of moving objects (DATMO). While most methods from the literature treat the two tasks as independent, a joint estimation scheme is introduced in this contribution.
A. Self-Localization

The problem of localization is usually understood as the estimation of the robot's pose, i.e. position and orientation. The frame of reference thereby varies. Some approaches seek a global estimate using GPS or global landmarks. Others refer to the relative motion of the robot, specifying the pose w.r.t. the starting point; the latter is the goal of this work.
The most widespread algorithms for range sensors follow the principle of simultaneous localization and mapping (SLAM) [23]. Though the last decade showed a trend towards probabilistic techniques, the computational complexity with 3D data in outdoor environments notably shifts the used method types in favor of scan-matching [18], [11], [3], [16].

Most SLAM methods only estimate the motion of the vehicle w.r.t. a static scene and usually average out objects with different motion. For low outlier ratios these registration methods provide good results. A high portion of moving objects, however, might cause these methods to fail. Only few SLAM methods try to simultaneously detect and track moving objects [26], [25]. Unfortunately, their computational efficiency and robustness in the 3D real world has not yet been shown.
¹ Both authors are with the Institute of Measurement and Control, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany. frank.moosmann at kit.edu
Figure 1. Result of the proposed method: The mapped static environment
colored by altitude (left) and tracked moving objects highlighted with a unique
coloring in the sensor data (right).
B. Multi-Target Tracking

The problem of multi-target tracking is usually understood as the task of detecting a set of objects in the environment and characterizing them by their position, orientation, extent, and velocity. Existing solutions frequently decompose the problem into two independent stages. The first stage detects objects independently for each point in time. State-of-the-art methods mostly train classifiers for the detection of specific object classes like cars or humans [19]. Only few methods employ generic segmentation methods to detect any kind of object that sticks out well from the background [22]. The second stage associates the detections over time in order to obtain continuous tracks, i.e. estimates of the objects' intrinsic state like e.g. position and velocity. Possible generic solutions for this stage are given in [1]. When using dense data, the association of measurements can be ambiguous, especially when several detections per object exist. To overcome ambiguities, solutions like fuzzy segmentation [21], segment matching [8], or appearance learning [10] have been proposed.

This two-stage approach has been applied to various kinds of sensors, from 2D laser scanners [17] over 3D laser scanners [12], [19] up to time-of-flight cameras [9]. Its major drawback is the dependence on a reliable and repeatable object detector. To the best of our knowledge, no approach exists that can robustly track arbitrary objects.
A completely different methodology is track-before-detect [7]. Sensor data is quantized, e.g. at fixed image columns [20] or at fixed intervals in the horizontal plane [4]. Although results seem very promising, finding a good grouping of the tracked partitions, which corresponds to the detection, is still an open issue.

PointCloud
Prediction
Prediction
Registration&Update
Merging
Tracklets Tracks
PointCloud/
RangeImage
Object
Hypotheses
Pre-processing
&Features
Object
Detection
t 1
t
t + 1
t1
t1
T
t1
t2
T
t1
tm
T
t1
T
t
t1
e
T
t
t2
e
T
t
tm
e
T
t
e
T
t
t1
T
t
t2
T
t
tm
T
t
T
t
t
T
t
t1
T
t
t2
T
t
T
Figure 2. Overview of the proposed method.
an open issue.
One step further is the idea to optimize the partitioning of the data (which can here be regarded as object detection) and the motion estimation together. However, the proposed solutions [13], [24] are computationally too complex to be applied in real-time on ordinary computers within the next years.

Hence, all successful object tracking methods seem to be either 2D or model based, which requires manual model construction and model selection through classification.

This work proposes a novel idea for the joint solution of both problems. The combination of a dynamic data partitioning with track-before-detect techniques allows tracking arbitrary objects. By treating the static scene as an object, mapping is applied to both moving objects and the static scene in a unique way. Experiments conducted in a vehicular environment show the applicability to 3D environments, with both tracking and self-localization performed with full 6 degrees of freedom.
II. PROPOSED METHOD

Throughout this work, a left superscript, as in ᵗx, denotes the current time index, and a left subscript, as in ₜx, the measurement time. For clarity, these are only specified where necessary. All computations are made w.r.t. the sensor coordinate system. No fixed world coordinate frame is used.
A. Overview

Input to the algorithm at each time t is a set of range measurements represented as 3D points ₜP = {(x, y, z)ᵀ}. This point cloud is preprocessed, features are calculated, and object hypotheses are generated. Each object hypothesis ₜS is turned into a tracklet ₜTᵗ; hence, the set of tracklets ₜ𝒯ᵗ = {ₜTᵗ} is created. The only exception is the initialization in the very first frame: object detection is skipped and one single (static-scene) track is created from all measurements. The track(let)s are predicted and updated across m frames and finally merged with the existing tracks. Note that the registration step uses the unsegmented point cloud as reference, which is in contrast to most existing tracking methods. Output of the algorithm is a set of tracks which includes the track of the static scene. Hence, the sensor motion w.r.t. a fixed world coordinate frame can be deduced as the inverse static-scene motion.

Figure 3. Each segment, indicated by a unique color, is turned into an object hypothesis and verified by tracking across m frames.
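The per-frame cycle of Fig. 2 can be summarized in a short skeleton. The following Python sketch is illustrative only, not the authors' implementation; all step functions stand in for Secs. II-B to II-E and are passed in as callables.

```python
# Minimal sketch of one algorithm cycle (Fig. 2). The callables
# preprocess / detect / predict_update / merge represent Secs. II-B to II-E.
def run_cycle(cloud, tracklets, tracks, t, m,
              preprocess, detect, predict_update, merge):
    feats = preprocess(cloud)                  # smoothing, normals, flatness
    if not tracks:                             # very first frame: detection skipped,
        tracks.append(("static-scene", t))     # one track from all measurements
        return
    for seg in detect(cloud, feats):           # generic object hypotheses
        tracklets.append({"born": t, "segment": seg})
    for tr in tracklets + tracks:              # note: registration uses the full,
        predict_update(tr, cloud)              # unsegmented cloud as reference
    for tr in [x for x in tracklets if t - x["born"] >= m]:
        tracklets.remove(tr)
        merge(tr, tracks)                      # track management, Sec. II-E
```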
B. Pre-processing and Features

The input point cloud is smoothed and two features are calculated for each point p_i ∈ P: a normal vector n_i = (n_x, n_y, n_z)ᵀ with ‖n_i‖ = 1, representing a local surface plane, and a so-called flatness value f_i ∈ [0, 1] which characterizes how appropriate the approximation by a surface plane is. The exact calculations are taken from [16], where the normal vectors N = {n_i} are denoted as N and the flatness values F = {f_i} are denoted as C.
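The exact computation is given in [16]; purely as an illustration, a common PCA-based approximation of normals and flatness could look as follows. The neighborhood size k and the flatness definition below are assumptions, not the paper's.

```python
import numpy as np
from scipy.spatial import cKDTree

def normals_and_flatness(points, k=10):
    """points: (N, 3) array. Returns unit normals (N, 3) and flatness (N,) in [0, 1]."""
    _, idx = cKDTree(points).query(points, k=k)
    normals = np.empty_like(points)
    flatness = np.empty(len(points))
    for i, nbrs in enumerate(idx):
        w, v = np.linalg.eigh(np.cov(points[nbrs].T))  # eigenvalues ascending
        normals[i] = v[:, 0]                 # normal = direction of least variance
        flatness[i] = 1.0 - w[0] / (w.sum() + 1e-12)   # close to 1 for planar patches
    return normals, flatness
```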
C. Object Detection

The aim of this work is to track any kind of object that is moving. As a consequence, object-class-specific detectors cannot be used. Better suited are segmentation methods that split the set of input points P, represented by the set of indices S = {i}, into segments S_g ⊆ S with ∪_g S_g = S and ∀g, h, g ≠ h: S_g ∩ S_h = ∅, where each segment corresponds to one object hypothesis. Any meaningful segmentation method can be employed within the proposed tracking framework; here the so-called local convexity criterion is used, which was introduced in [15] and improved in [14], also see Fig. 3.
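For illustration, a strongly simplified variant of such a segmentation can be written as region growing on a neighborhood graph, linking points whose local surfaces are approximately convex w.r.t. each other. The actual criterion of [15], [14] is more elaborate; k and eps below are assumed parameters.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def segment_locally_convex(points, normals, k=8, eps=0.02):
    """Returns labels (N,) so that S_g = {i : labels[i] == g} partitions the cloud."""
    _, idx = cKDTree(points).query(points, k=k + 1)
    rows, cols = [], []
    for i, nbrs in enumerate(idx):
        for j in nbrs[1:]:                    # skip the query point itself
            d = points[j] - points[i]
            # link i-j if each point lies at or below the other's tangent
            # plane (up to eps), i.e. the surface is locally convex
            if normals[i] @ d <= eps and normals[j] @ (-d) <= eps:
                rows.append(i)
                cols.append(j)
    n = len(points)
    graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    _, labels = connected_components(graph, directed=False)
    return labels
```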
D. Tracking

A tracklet T_g is created from each object hypothesis S_g with a minimum size and can be regarded as an object hypothesis in the time domain. A local object coordinate system O_g is introduced, as depicted in Fig. 4. It is specified by a pose vector

ρ_g = (φ, θ, ψ, x, y, z)ᵀ    (1)

e
x
e
y
e
z
e
x
e
y
e
z
ρ
Figure 4. The pose ρ of the state vector defines the position and the
orientation of a track coordinate system (top) w.r.t. the scanner coordinate
system (bottom). The track appearance is stored as point cloud (violet) with
normal vectors and atness values (both not shown) relative to the track
coordinate system.
which defines its orientation and position w.r.t. the sensor coordinate system S. The pose and its derivative constitute the state of the tracklet:

x_g = (ρ_gᵀ, ρ̇_gᵀ)ᵀ = (φ, θ, ψ, x, y, z, φ̇, θ̇, ψ̇, ẋ, ẏ, ż)ᵀ    (2)
The 3D points P_g ⊆ P, the normals N_g, and the flatness values F_g constitute the appearance of the tracklet. They are stored relative to the object coordinate system O_g, see Fig. 4. In total, a tracklet is defined by its state and appearance:

T_g = (x_g, P_g^{O_g}, N_g^{O_g}, F_g)    (3)
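In code, a track(let) according to Eq. (3) might be represented as follows. This is a sketch; the field names and the covariance field are illustrative additions, not from the paper.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Tracklet:
    """Eq. (3): state x_g plus appearance stored in the object frame O_g."""
    x: np.ndarray                               # 12-D state of Eq. (2)
    P_app: np.ndarray                           # (N, 3) appearance points in O_g
    N_app: np.ndarray                           # (N, 3) normal vectors in O_g
    F_app: np.ndarray                           # (N,) flatness values
    cov: np.ndarray = field(default_factory=lambda: np.eye(12))  # state covariance
```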
It is worth noting that our method for track estimation thus includes the 3D reconstruction of the shape of moving objects.

In the first m frames the appearance of a tracklet is kept constant. On the contrary, the state is re-estimated for each new incoming frame within the Prediction and Update steps of Fig. 2. This makes the appearance move along with the coordinate frame defined by the state. A Kalman filter with constant velocity model¹ is employed upon the state vector, which can express any rigid motion. Prediction corresponds exactly to the prediction step of the Kalman filter. Registration and update is performed as in [14] by aligning the track's appearance point cloud with the full input point cloud by means of the ICP algorithm. The predicted pose thereby serves as the initial pose of this iterative algorithm. In case the average flatness value of the tracklet exceeds some threshold, the point-to-plane ICP [6] is used, otherwise the point-to-point variant [2]. The measurement covariance for the Kalman filter update is calculated with the method of [5]. One special treatment is made for the static scene track: instead of registering the track appearance against the input data, the input data is registered against the track appearance as in [16]. This makes the approach faster and more robust and allows for sensor motion compensation.

Note that up to this point, no associations are made yet between the tracklets, since registration is performed with the full input data. Relations are established only in the track management stage described next.
¹ More specific and possibly non-linear models could of course be used for specific object classes to extend and hence improve the method.
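A minimal sketch of the prediction step and the ICP-variant choice described above; the process noise Q, the time step dt, and the flatness threshold are assumptions, as the paper does not specify them.

```python
import numpy as np

def predict_cv(x, cov, Q, dt):
    """Kalman prediction with a constant-velocity model on the 12-D state:
    the pose advances by dt times its derivative."""
    F = np.eye(12)
    F[:6, 6:] = dt * np.eye(6)
    return F @ x, F @ cov @ F.T + Q

def choose_icp_variant(flatness, threshold=0.7):
    """Point-to-plane ICP [6] for predominantly flat tracklets,
    the point-to-point variant [2] otherwise."""
    return "point-to-plane" if flatness.mean() > threshold else "point-to-point"
```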
Figure 5. Input points as virtual range image, colored by distance.
Figure 6. Associations between new tracklets ₜ𝒯ᵗ (upper image) and tracks ᵗ𝒯 (lower image) are established by overlaying their projections and counting the number of pixels they overlap. Shown are the association strengths for moving objects (in the lower figure); the edges that are not labeled are associations with the static track (gray).
E. Track Management

This section essentially describes the Merging step in Fig. 2, which also handles the transition of tracklets to tracks. Both describe moving objects by their state and appearance. The difference is conceptual only: tracklets are track hypotheses that, after successful verification, can become tracks. As a consequence, tracks are predicted and updated exactly like tracklets.

The merge step takes as input the current set of tracks ᵗ𝒯 and the set of tracklets ₜ₋ₘ𝒯ᵗ that was (independently) registered across m frames, and produces an updated set of tracks ᵗ𝒯. First, any track that moved out of the field of view is removed from ᵗ𝒯. Then, each tracklet is compared with the existing tracks and one of three actions is taken:

1) The tracklet is kept and added to the set of tracks if the tracklet was successfully registered over the last m frames and if it represents an object with a motion different to all existing tracks.
2) The tracklet is merged with the track in case a track on the same object already exists.
3) The tracklet is discarded if none of the above two cases is true.
In all three cases, tracklets are inherently associated and compared with existing tracks. Fig. 6 illustrates the efficient method used to determine these associations: the appearances of all existing tracks and tracklets are projected to two virtual range images, and the number of overlapping pixels determines the association strength a_gh between tracklet ₜ₋ₘTᵗ_g and track ᵗT_h.
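These association strengths can be computed cheaply from two label images, e.g. as below. This is a sketch of the projection-overlay idea of Fig. 6; the generation of the label images, i.e. projecting each appearance into a virtual range image, is omitted.

```python
import numpy as np

def association_strengths(tracklet_img, track_img, n_tracklets, n_tracks):
    """Label images hold the tracklet/track id projecting to each pixel, -1 if empty.
    Returns a with a[g, h] = number of pixels where tracklet g overlaps track h."""
    valid = (tracklet_img >= 0) & (track_img >= 0)
    a = np.zeros((n_tracklets, n_tracks), dtype=int)
    np.add.at(a, (tracklet_img[valid], track_img[valid]), 1)  # scatter-add counts
    return a
```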
To decide upon the three cases, the tracklet ₜ₋ₘTᵗ_g is characterized by a feature vector (see Sec. IV). Several characteristics are therefore calculated. Among others is the motion histogram m = (m_1, m_2, m_3, m_4)ᵀ. For the tracklet that moved from ₜ₋ₘρ_g to ᵗρ_g, it summarizes how many appearance points moved perpendicular to their normal vector (m_1), aslant to it (m_2 and m_3), and along the normal vector (m_4). This effectively characterizes how reliable motion estimation is, since motion perpendicular to the normal vector is, generally, unreliable.
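As an illustration, such a histogram could be computed from the angle between each point's displacement and its normal. The displacement model (per-point displacement vectors) and the bin edges below are assumptions; the paper does not state them.

```python
import numpy as np

def motion_histogram(normals, displacements):
    """normals, displacements: (N, 3). Returns m = (m1, m2, m3, m4): counts of
    points moving perpendicular to their normal (m1), aslant (m2, m3),
    or along it (m4)."""
    d = displacements / (np.linalg.norm(displacements, axis=1, keepdims=True) + 1e-12)
    cos_ang = np.abs(np.einsum("ij,ij->i", normals, d))   # |cos| of angle to normal
    m, _ = np.histogram(cos_ang, bins=[0.0, 0.25, 0.5, 0.75, 1.0 + 1e-9])
    return m
```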
Furthermore, the tracklet is compared to each associated track ᵗT_h with association strength a_gh > 0. Therefore, the motion of the associated track ᵗT_h within the last m frames is applied to the tracklet ₜ₋ₘTᵗ_g:

ᵗρ″_{g,h} = ₜ₋ₘρ_g + (ᵗρ_h − ₜ₋ₘρ_h)    (4)
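Treating poses as plain 6-vectors, as the additive form of Eq. (4) suggests, the transfer reads:

```python
import numpy as np

def transfer_pose(rho_g_prev, rho_h_now, rho_h_prev):
    """Eq. (4): apply the motion of associated track h over the last m frames
    to the tracklet's old pose (all arguments are 6-D pose vectors)."""
    return np.asarray(rho_g_prev) + (np.asarray(rho_h_now) - np.asarray(rho_h_prev))
```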
In case both the tracklet and the track referred to the same object and tracking was successful, ᵗρ″_{g,h} should be very similar to ᵗρ_g. The ICP energy e_g(ᵗρ″_{g,h}) is calculated using both the Euclidean point-to-point distance [2] and the projective point-to-plane distance [6]. These errors are denoted e_{g,h,2} and e_{g,h,P} in the following, as opposed to e_{g,2} and e_{g,P}, the errors for the original pose ᵗρ_g. Based on these errors, the associated tracks causing minimum error can be determined, as well as the track with maximum association strength:

h_2 = argmin_{h: a_{g,h} > 0} {e_{g,h,2}}    (5)
h_P = argmin_{h: a_{g,h} > 0} {e_{g,h,P}}    (6)
h_a = argmax_h {a_{g,h}}    (7)
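Given the association strengths and the two error types per candidate, the selections of Eqs. (5)-(7) are straightforward; the dict-based data layout in this sketch is an assumption.

```python
def select_candidates(a_g, e2_g, eP_g):
    """a_g, e2_g, eP_g: dicts keyed by track id h for one tracklet g.
    Returns (h_2, h_P, h_a) of Eqs. (5)-(7)."""
    assoc = [h for h, a in a_g.items() if a > 0]
    h_2 = min(assoc, key=lambda h: e2_g[h])    # Eq. (5): min point-to-point error
    h_P = min(assoc, key=lambda h: eP_g[h])    # Eq. (6): min point-to-plane error
    h_a = max(a_g, key=a_g.get)                # Eq. (7): max association strength
    return h_2, h_P, h_a
```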
Note that h_2, h_P, and h_a are not necessarily different. The features are gathered within a 52-dimensional feature vector f_g, detailed in the appendix. A multi-class support vector machine (SVM) with RBF kernel is used to classify the feature vector in order to decide upon the three cases.
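The paper does not state the SVM implementation or hyper-parameters. A hedged sketch using scikit-learn, with class weights mimicking the "decision variants" discussed in Sec. III, could look like this:

```python
from sklearn.svm import SVC

KEEP, MERGE, IGNORE = 0, 1, 2

def fit_track_decision_svm(F_train, y_train, ignore_weight=2.0):
    """F_train: (n, 52) feature vectors f_g; y_train: labels in {KEEP, MERGE, IGNORE}.
    ignore_weight > 1 pulls the classifier towards discarding tracklets
    (higher precision, lower recall), akin to decision variant A."""
    clf = SVC(kernel="rbf",
              class_weight={KEEP: 1.0, MERGE: 1.0, IGNORE: ignore_weight})
    return clf.fit(F_train, y_train)
```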
In case a tracklet is kept as a new track, the tracklet's appearance is removed from all associated tracks and the tracklet is added to the set of tracks. This implicitly handles track splits.
In case a tracklet is to be merged with an existing track, the corresponding track still has to be determined. This is performed by calculating a score s_gh for each associated track h and choosing the track with the highest score. The score is calculated as a linear combination of a second feature vector:

s_gh = (1, f_ghᵀ) · w    (8)
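Eq. (8) is a plain linear score over the feature vector with a leading bias term:

```python
import numpy as np

def merge_score(f_gh, w):
    """Eq. (8): s_gh = (1, f_gh^T) . w, i.e. a bias plus a weighted feature sum."""
    return np.concatenate(([1.0], np.asarray(f_gh))) @ np.asarray(w)

# the merge partner is the associated track with the highest score, e.g.:
# h_best = max(associated, key=lambda h: merge_score(f[h], w))
```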
The feature vector f_gh is similar to f_g and is detailed in the appendix. The parameter vector w is determined by optimization on a labeled training set. When merging, the state of the associated track remains unchanged. Only the appearance of the tracklet is added to the track. The algorithm for accumulating the appearance is taken from [16].
Table I
CLASSIFICATION RESULTS ON A LABELED DATA SET FOR TWO DIFFERENT PARAMETER SETTINGS OF THE CLASSIFIER.

                    Decision variant A          Decision variant B
                 Keep    Merge   Ignore      Keep    Merge   Ignore
Keep               23       14       89       103       13       10
Merge               0    12208      612       196    12516      108
Ignore              1      208     3614       200      567     3056
Accuracy               94.49%                      93.48%
There, flat areas are contracted to yield sharper surface representations. This so-called moving object mapping (MOM) not only makes the results nicer, it also improves the registration result.
Compared to [16], one further step is added to process the appearances. This is particularly relevant for non-rigid objects in order to avoid tracking inaccuracies. In the projection step illustrated in Fig. 6, each appearance point that yields a closer range value than the range value at the corresponding pixel of the current sensor data (see Fig. 5) is removed from the track. As a consequence, the appearance can adapt to non-rigid objects like pedestrians.
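The pruning rule can be phrased per pixel: a stored appearance point is dropped when it would occlude the current measurement. A sketch, assuming per-pixel range values are already available from the projection of Fig. 6; the tolerance margin is an assumed parameter.

```python
import numpy as np

def appearance_keep_mask(app_range, sensor_range, margin=0.1):
    """app_range: ranges of projected appearance points; sensor_range: currently
    measured ranges at the same pixels (Fig. 5). Returns a boolean mask that
    drops points lying in front of the measurement by more than `margin` meters."""
    return app_range >= sensor_range - margin
```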
III. RESULTS

The proposed algorithm is evaluated on data captured with a Velodyne HDL-64E laser scanner. The sensor, a 64-beam laser scanner, is mounted on top of a car and yields a 360° view of the environment, as illustrated in Fig. 9. We set m = 3 throughout all experiments.

The first stage of evaluation concerns the localization precision. Since in static scenes the proposed algorithm for localization equals the algorithm presented in [16], the results are transferable. Two scenarios were evaluated that both represent loops in a non-flat urban environment. These loops can be used to evaluate drift, i.e. the localization imprecision that increases with traveled distance. On average, a position error of 2.66 m after a 1 km drive was determined. This value can be regarded as very low and is about an order of magnitude lower than for common camera-based techniques. More details and discussions are given in [16].
The evaluation of object detection and tracking proceeds in several stages. First, the classifier for track management is evaluated. This classifier decides for each tracklet, i.e. track hypothesis, whether it is to be merged with an existing track, kept as a new track, or ignored. Four-fold cross-validation is applied on a dataset that was set up and labeled manually. The classification accuracy reaches the values listed in Table I. The two decision variants correspond to different weightings of the classes during training. With this weighting, the SVM can be pulled towards favoring certain decisions. Variant A favors the ignorance of tracklets, which yields a higher precision but lower recall of the tracker. On the contrary, variant B yields a lower precision but higher recall. Many alternative variants exist, most of them with a classification accuracy between 90% and 95%. For further experiments, variant A is selected.

In order to evaluate the quality of tracking, an experiment was conducted in real traffic using a second car, denoted as target car.

Figure 8. Moving Object Mapping: Appearance of a car accumulated over time (from left to right). Initial points are depicted with double size.
Figure 7. Tracking quality assessed by using a second car, denoted target car. [Plot: speed error / (m/s) and speed / (m/s) over time / s.] Shown are the speed profiles of both cars (true target speed, true sensor car speed) and the speed error as the difference between the estimated speed (by the tracker) and the true target speed (measured by DGPS/IMU) for different tracking strategies (with MOM, first appearance, replace appearance). Missing values indicate a temporary failure of the tracking method.
Table II
TRACKING STATISTICS FOR THE SPEED COMPARISON EXPERIMENT OF FIG. 7, GENERATED WITHOUT THE OUTER 10% QUANTILES.

                        nb. of          speed error in m/s
                        tracks    median     mean    std-deviation
with MOM                   2      -0.84     -0.96        ±1.16
first appearance           4      -0.82     -1.04        ±1.29
replace appearance        12     -10.62     -7.69       ±10.27
This target car starts in front of the sensor car, accelerates, and gets overtaken by the sensor car after 26 s. The speed profiles (measured by DGPS/IMU) as well as the speed errors are depicted in Fig. 7; some characteristic values are listed in Table II. Evident is the advantage of MOM over using only the appearance of one frame. The speed error is within an acceptable range and the track gets lost only once. Especially during the overtaking maneuver, the car is continuously tracked because the appearance smoothly adapts to the new viewpoints. This adaptation is well illustrated in Fig. 8. Constantly using the first appearance leads to three track losses and a slightly higher speed error. Replacing the appearance each frame leads to the worst results: as this technique causes the track to drift, speed errors are high and the track gets lost 11 times.

Additional experiments were conducted around intersections with many moving objects. A video and the data are available at www.mrt.kit.edu/z/publ/download/velodynetracking/.
Figure 10. Track lengths on the sequence illustrated in Fig. 9 (total 50 s). [Histogram: count over track-length / s.]
Various types were successfully tracked: pedestrians (with rolling case), cyclists, cars, vans, trucks, trams. Fig. 9 shows some tracking results at a big intersection in the city of Karlsruhe. Most moving objects are detected immediately, some slowly moving pedestrians with a short delay. Most tracks are stable, i.e. tracking is successful until the object moves out of view. This is shown by Fig. 10, which lists the distribution of track lengths across the sequence. Cars that move in parallel to the sensor car are tracked for the whole time of movement, i.e. 30 seconds. Most other objects are tracked for several seconds, even in areas where the objects are partly occluded.
IV. CONCLUSIONS

A novel approach was presented for self-localization and mapping combined with moving object tracking in dense range data. Tracking and mapping were applied to both object hypotheses and the static scene identically. Thus, 3D shape estimation of moving objects is inherent to the proposed method. A classification-based track management was introduced for track verification, merging, and splitting. The applicability of the method was shown for a vehicular platform in a crowded city environment. But these are not the limits of the approach: since object models were kept generic and tracking is performed in full 3D, the approach is applicable to other sensors and in other application areas, too.
APPENDIX

Let log_p(x) := log(1 + max{0, x}).
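In code, this is a clipped log1p:

```python
import math

def log_p(x):
    """log_p(x) := log(1 + max{0, x})."""
    return math.log1p(max(0.0, x))
```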
The 52-dimensional feature vector f_g is composed as follows: f[1] ∈ {0, 1} is 1 if the last measurement was successful and 0 otherwise. f[2] ∈ {0, 1, 2, 3} is

References

P. J. Besl and N. D. McKay, "A method for registration of 3-D shapes," IEEE Trans. Pattern Analysis and Machine Intelligence, 1992.
Y. Chen and G. Medioni, "Object modeling by registration of multiple range images," Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 1991.
C.-C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte, "Simultaneous localization, mapping and moving object tracking," Int. J. Robotics Research, 2007.