Vision meets Robotics: The KITTI Dataset
Andreas Geiger, Philip Lenz, Christoph Stiller and Raquel Urtasun
Abstract—We present a novel dataset captured from a VW
station wagon for use in mobile robotics and autonomous driving
research. In total, we recorded 6 hours of traffic scenarios at
10-100 Hz using a variety of sensor modalities such as high-
resolution color and grayscale stereo cameras, a Velodyne 3D
laser scanner and a high-precision GPS/IMU inertial navigation
system. The scenarios are diverse, capturing real-world traffic
situations and range from freeways over rural areas to inner-
city scenes with many static and dynamic objects. Our data is
calibrated, synchronized and timestamped, and we provide the
rectified and raw image sequences. Our dataset also contains
object labels in the form of 3D tracklets and we provide online
benchmarks for stereo, optical flow, object detection and other
tasks. This paper describes our recording platform, the data
format and the utilities that we provide.
Index Terms—dataset, autonomous driving, mobile robotics,
field robotics, computer vision, cameras, laser, GPS, benchmarks,
stereo, optical flow, SLAM, object detection, tracking, KITTI
I. INTRODUCTION
The KITTI dataset has been recorded from a moving plat-
form (Fig. 1) while driving in and around Karlsruhe, Germany
(Fig. 2). It includes camera images, laser scans, high-precision
GPS measurements and IMU accelerations from a combined
GPS/IMU system. The main purpose of this dataset is to
push forward the development of computer vision and robotic
algorithms targeted to autonomous driving [1]–[7]. While our
introductory paper [8] mainly focuses on the benchmarks,
their creation and use for evaluating state-of-the-art computer
vision methods, here we complement this information by
providing technical details on the raw data itself. We give
precise instructions on how to access the data and comment
on sensor limitations and common pitfalls. The dataset can
be downloaded from http://www.cvlibs.net/datasets/kitti. For
a review on related work, we refer the reader to [8].
II. SENSOR SETUP
Our sensor setup is illustrated in Fig. 3:
2 × PointGray Flea2 grayscale cameras (FL2-14S3M-C), 1.4 Megapixels, 1/2" Sony ICX267 CCD, global shutter
2 × PointGray Flea2 color cameras (FL2-14S3C-C), 1.4 Megapixels, 1/2" Sony ICX267 CCD, global shutter
4 × Edmund Optics lenses, 4 mm, opening angle ≈ 90°, vertical opening angle of region of interest (ROI) ≈ 35°
1 × Velodyne HDL-64E rotating 3D laser scanner, 10 Hz, 64 beams, 0.09 degree angular resolution, 2 cm distance accuracy, collecting 1.3 million points/second, field of view: 360° horizontal, 26.8° vertical, range: 120 m
1 × OXTS RT3003 inertial and GPS navigation system, 6 axis, 100 Hz, L1/L2 RTK, resolution: 0.02 m / 0.1°
A. Geiger, P. Lenz and C. Stiller are with the Department of Measurement
and Control Systems, Karlsruhe Institute of Technology, Germany. Email:
{geiger,lenz,stiller}@kit.edu
R. Urtasun is with the Toyota Technological Institute at Chicago, USA.
Email: rurtasun@ttic.edu
Fig. 1. Recording Platform. Our VW Passat station wagon is equipped
with four video cameras (two color and two grayscale cameras), a rotating
3D laser scanner and a combined GPS/IMU inertial navigation system.
Note that the color cameras lose effective resolution due to the Bayer pattern interpolation process and are less sensitive to light than the grayscale cameras. This is why we use two stereo camera rigs, one for grayscale and one for color. The baseline of
both stereo camera rigs is approximately 54 cm. The trunk
of our vehicle houses a PC with two six-core Intel XEON
X5650 processors and a shock-absorbed RAID 5 hard disk
storage with a capacity of 4 Terabytes. Our computer runs
Ubuntu Linux (64 bit) and a real-time database [9] to store
the incoming data streams.
III. DATASET
The raw data described in this paper can be accessed from
http://www.cvlibs.net/datasets/kitti and contains 25% of our
overall recordings. This is because we have primarily put data with 3D tracklet annotations online, though we will make more data available upon request. Furthermore, we
have removed all sequences which are part of our benchmark
test sets. The raw data set is divided into the categories ’Road’,
’City’, ’Residential’, ’Campus’ and ’Person’. Example frames
are illustrated in Fig. 5. For each sequence, we provide the raw
data, object annotations in the form of 3D bounding box tracklets
and a calibration file, as illustrated in Fig. 4. Our recordings
have taken place on the 26th, 28th, 29th, 30th of September
and on the 3rd of October 2011 during daytime. The total size
of the provided data is 180 GB.

Fig. 2. Recording Zone. This figure shows the GPS traces of our recordings
in the metropolitan area of Karlsruhe, Germany. Colors encode the GPS signal
quality: Red tracks have been recorded with highest precision using RTK
corrections, blue denotes the absence of correction signals. The black runs
have been excluded from our data set as no GPS signal has been available.
A. Data Description
All sensor readings of a sequence are zipped into a single
file named date_drive.zip, where date and drive are
placeholders for the recording date and the sequence number.
The directory structure is shown in Fig. 4. Besides the raw
recordings (’raw data’), we also provide post-processed data
(’synced data’), i.e., rectified and synchronized video streams,
on the dataset website.
Timestamps are stored in timestamps.txt and per-
frame sensor readings are provided in the corresponding data
sub-folders. Each line in timestamps.txt is composed
of the date and time in hours, minutes and seconds. As the Velodyne laser scanner has a 'rolling shutter', three timestamp files are provided for this sensor: one for the start position of a spin (timestamps_start.txt), one for the end position of a spin (timestamps_end.txt), and one for the time at which the laser scanner is facing forward and triggering the cameras (timestamps.txt). The data format in which
each sensor stream is stored is as follows:
a) Images: Both color and grayscale images are stored with lossless compression using 8-bit PNG files. The engine
hood and the sky region have been cropped. To simplify
working with the data, we also provide rectified images. The
size of the images after rectification depends on the calibration
parameters and is 0.5 Mpx on average. The original images
before rectification are available as well.
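As an illustration, the following minimal C++ sketch loads one rectified grayscale frame and prints its size. It assumes OpenCV is installed, and the path is merely a placeholder following the directory layout of Fig. 4:

// Minimal sketch: load one 8-bit rectified PNG (path is a placeholder, cf. Fig. 4).
#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
  cv::Mat img = cv::imread("date/date_drive/image_00/data/frame_number.png",
                           cv::IMREAD_GRAYSCALE);
  if (img.empty()) { std::cerr << "could not read image" << std::endl; return 1; }
  // The rectified image size depends on the calibration of the respective day
  // (roughly 0.5 Mpx on average).
  std::cout << img.cols << " x " << img.rows << std::endl;
  return 0;
}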
b) OXTS (GPS/IMU): For each frame, we store 30 differ-
ent GPS/IMU values in a text file: The geographic coordinates
including altitude, global orientation, velocities, accelerations,
angular rates, accuracies and satellite information.
Fig. 3. Sensor Setup. This figure illustrates the dimensions and mounting
positions of the sensors (red) with respect to the vehicle body. Heights above
ground are marked in green and measured with respect to the road surface.
Transformations between sensors are shown in blue.
date/
  date_drive/
    date_drive.zip
      image_0x/          (x = {0, .., 3})
        data/
          frame_number.png
        timestamps.txt
      oxts/
        data/
          frame_number.txt
        dataformat.txt
        timestamps.txt
      velodyne_points/
        data/
          frame_number.bin
        timestamps.txt
        timestamps_start.txt
        timestamps_end.txt
    date_drive_tracklets.zip
      tracklet_labels.xml
  date_calib.zip
    calib_cam_to_cam.txt
    calib_imu_to_velo.txt
    calib_velo_to_cam.txt
Fig. 4. Structure of the provided zip files and their location within a global file structure that stores all KITTI sequences. Here, 'date' and 'drive' are placeholders, and 'image_0x' refers to the four video camera streams.
Accelerations and angular rates are both specified using two coordinate systems, one which is attached to the vehicle body (x, y, z) and one that is mapped to the tangent plane of the earth surface at that location (f, l, u). From time to time we encountered short (about 1 second) communication outages with the OXTS device, for which we interpolated all values linearly and set the last 3 entries to '-1' to indicate the missing information.
More details are provided in dataformat.txt. Conversion
utilities are provided in the development kit.
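The following C++ sketch reads one per-frame OXTS file into its 30 numeric fields. The path is a placeholder, and the assumption that the first three fields are latitude, longitude and altitude should be verified against dataformat.txt:

// Sketch: parse one OXTS frame file (30 whitespace-separated values).
#include <fstream>
#include <iostream>
#include <vector>

int main() {
  std::ifstream in("oxts/data/frame_number.txt");   // placeholder path, cf. Fig. 4
  std::vector<double> v;
  double value;
  while (in >> value) v.push_back(value);
  if (v.size() != 30) { std::cerr << "unexpected field count: " << v.size() << std::endl; return 1; }
  // Frames interpolated over short OXTS outages carry '-1' in the last three entries.
  bool interpolated = (v[27] == -1 && v[28] == -1 && v[29] == -1);
  // Assumed field order (check dataformat.txt): v[0] = lat, v[1] = lon, v[2] = alt.
  std::cout << "lat=" << v[0] << " lon=" << v[1] << " alt=" << v[2]
            << (interpolated ? " (interpolated)" : "") << std::endl;
  return 0;
}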
c) Velodyne: For efficiency, the Velodyne scans are
stored as floating point binaries that are easy to parse using
the C++ or MATLAB code provided. Each point is stored with
its (x, y, z) coordinate and an additional reflectance value (r).
While the number of points per scan is not constant, on average each file/frame has a size of 1.9 MB, which corresponds to 120,000 3D points and reflectance values. Note that the
Velodyne laser scanner rotates continuously around its vertical
axis (counter-clockwise), which can be taken into account
using the timestamp files.
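A minimal C++ sketch for parsing such a scan, assuming only what is stated above (consecutive 32-bit floats in x, y, z, r order; the path is a placeholder):

// Sketch: read a Velodyne scan stored as consecutive float quadruples (x, y, z, reflectance).
#include <cstdio>
#include <vector>

struct VeloPoint { float x, y, z, r; };   // 4 x 32-bit float, tightly packed

int main() {
  std::FILE* f = std::fopen("velodyne_points/data/frame_number.bin", "rb");  // placeholder path
  if (!f) return 1;
  std::vector<VeloPoint> points;
  VeloPoint p;
  while (std::fread(&p, sizeof(float), 4, f) == 4) points.push_back(p);
  std::fclose(f);
  std::printf("read %zu points\n", points.size());
  return 0;
}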
B. Annotations
For each dynamic object within the reference camera’s field
of view, we provide annotations in the form of 3D bounding
box tracklets, represented in Velodyne coordinates. We define
the classes ’Car’, ’Van’, ’Truck’, ’Pedestrian’, ’Person (sit-
ting)’, ’Cyclist’, ’Tram’ and ’Misc’ (e.g., Trailers, Segways).
The tracklets are stored in date_drive_tracklets.xml.

Fig. 6. Development kit. Working with tracklets (top), Velodyne point clouds
(bottom) and their projections onto the image plane is demonstrated in the
MATLAB development kit which is available from the KITTI website.
Each object is assigned a class and its 3D size (height, width,
length). For each frame, we provide the object’s translation
and rotation in 3D, as illustrated in Fig. 7. Note that we
only provide the yaw angle, while the other two angles
are assumed to be close to zero. Furthermore, the level of
occlusion and truncation is specified. The development kit
contains C++/MATLAB code for reading and writing tracklets using the boost::serialization library (http://www.boost.org).
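To illustrate the geometry only (not the XML schema or the devkit API), the following C++ sketch computes the eight corners of one such box in Velodyne coordinates from its size (h, w, l), translation and yaw rz, using the convention of Fig. 7 that the object origin lies at the bottom of the box; all numeric values are illustrative:

// Sketch: 3D bounding box corners from a tracklet pose (illustrative values only).
#include <cmath>
#include <cstdio>

int main() {
  double h = 1.5, w = 1.6, l = 4.0;        // object size in meters (example values)
  double tx = 10.0, ty = -2.0, tz = -1.7;  // translation in Velodyne coordinates (example values)
  double rz = 0.3;                         // yaw around the Velodyne z-axis
  double xs[2] = {-l / 2, l / 2}, ys[2] = {-w / 2, w / 2}, zs[2] = {0.0, h};
  double c = std::cos(rz), s = std::sin(rz);
  for (double x0 : xs)
    for (double y0 : ys)
      for (double z0 : zs) {
        // Rotate about z by the yaw angle, then translate into Velodyne coordinates.
        double x = c * x0 - s * y0 + tx;
        double y = s * x0 + c * y0 + ty;
        double z = z0 + tz;
        std::printf("corner: %6.2f %6.2f %6.2f\n", x, y, z);
      }
  return 0;
}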
To give further insights into the properties of our dataset,
we provide statistics for all sequences that contain annotated
objects. The total number of objects and the object orientations
for the two predominant classes ’Car’ and ’Pedestrian’ are
shown in Fig. 8. For each object class, the number of object
labels per image and the length of the captured sequences are
shown in Fig. 9. The egomotion of our platform recorded by
the GPS/IMU system as well as statistics about the sequence
length and the number of objects are shown in Fig. 10 for the
whole dataset and in Fig. 11 per street category.
C. Development Kit
The raw data development kit provided on the KITTI website (http://www.cvlibs.net/datasets/kitti/raw_data.php) contains MATLAB demonstration code with C++ wrappers and a readme.txt file which gives further details. Here, we will briefly discuss the most important features. Before running the scripts, the mex wrapper readTrackletsMex.cpp for reading tracklets into MATLAB structures and cell arrays needs to be built using the script make.m. It wraps the file tracklets.h from the cpp folder, which holds the tracklet object for serialization. This file can also be directly interfaced with when working in a C++ environment.
Fig. 7. Object Coordinates. This figure illustrates the coordinate system
of the annotated 3D bounding boxes with respect to the coordinate system of
the 3D Velodyne laser scanner. In z-direction, the object coordinate system is
located at the bottom of the object (contact point with the supporting surface).
The script run_demoTracklets.m demonstrates how
3D bounding box tracklets can be read from the XML files
and projected onto the image plane of the cameras. The
projection of 3D Velodyne point clouds into the image plane
is demonstrated in run_demoVelodyne.m. See Fig. 6 for
an illustration.
The script run_demoVehiclePath.m shows how to
read and display the 3D vehicle trajectory using the GPS/IMU
data. It makes use of convertOxtsToPose(), which takes
as input GPS/IMU measurements and outputs the 6D pose of
the vehicle in Euclidean space. For this conversion we make
use of the Mercator projection [10]
x = s · r · π · lon / 180                                      (1)

y = s · r · log( tan( π · (90 + lat) / 360 ) )                 (2)

with earth radius r ≈ 6378137 meters, scale s = cos(lat_0 · π / 180), and (lat, lon) the geographic coordinates. Here, lat_0 denotes the latitude of the first frame's coordinates and uniquely determines the Mercator scale.
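A C++ re-implementation of Eqs. (1)-(2) may look as follows; this is a sketch, not the devkit's convertOxtsToPose(), and the coordinates are example values:

// Sketch of the lat/lon to Mercator conversion of Eqs. (1)-(2).
#include <cmath>
#include <cstdio>

void latLonToMercator(double lat, double lon, double scale, double* x, double* y) {
  const double pi = 3.14159265358979323846;
  const double r = 6378137.0;                                       // earth radius in meters
  *x = scale * r * (pi * lon / 180.0);                              // Eq. (1)
  *y = scale * r * std::log(std::tan(pi * (90.0 + lat) / 360.0));   // Eq. (2)
}

int main() {
  const double pi = 3.14159265358979323846;
  double lat0 = 49.0;                             // latitude of the first frame (example value)
  double scale = std::cos(lat0 * pi / 180.0);
  double x, y;
  latLonToMercator(49.01, 8.43, scale, &x, &y);   // example coordinates near Karlsruhe
  std::printf("x = %.2f m, y = %.2f m\n", x, y);
  return 0;
}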
The function loadCalibrationCamToCam() can be
used to read the intrinsic and extrinsic calibration parameters
of the four video sensors. The other 3D rigid body transfor-
mations can be parsed with loadCalibrationRigid().
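As an aside, the calibration text files can also be parsed without MATLAB. The sketch below assumes that each line has the form "NAME: value value ..." (this layout should be verified against the downloaded files) and collects the numeric values per key:

// Sketch: generic parser for the calibration text files ("NAME: v1 v2 ..." per line).
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

std::map<std::string, std::vector<double>> loadCalibFile(const std::string& path) {
  std::map<std::string, std::vector<double>> calib;
  std::ifstream in(path);
  std::string line;
  while (std::getline(in, line)) {
    std::size_t colon = line.find(':');
    if (colon == std::string::npos) continue;          // skip lines without a key
    std::string key = line.substr(0, colon);
    std::istringstream values(line.substr(colon + 1));
    double v;
    while (values >> v) calib[key].push_back(v);       // non-numeric fields yield an empty list
  }
  return calib;
}

int main() {
  auto calib = loadCalibFile("calib_cam_to_cam.txt");  // placeholder path
  for (const auto& kv : calib)
    std::cout << kv.first << ": " << kv.second.size() << " values" << std::endl;
  return 0;
}

Since the calibration parameters are stored in row-major order, a 3 × 4 matrix such as P_rect^(i) would simply be the first twelve values under its key, read row by row.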
D. Benchmarks
In addition to the raw data, our KITTI website hosts
evaluation benchmarks for several computer vision and robotic
tasks such as stereo, optical flow, visual odometry, SLAM, 3D
object detection and 3D object tracking. For details about the
benchmarks and evaluation metrics we refer the reader to [8].
IV. SENSOR CALIBRATION
We took care that all sensors are carefully synchronized
and calibrated. To avoid drift over time, we calibrated the
sensors every day after our recordings. Note that even though
the sensor setup hasn’t been altered in between, numerical
differences are possible. The coordinate systems are defined
as illustrated in Fig. 1 and Fig. 3, i.e.:
Camera: x = right, y = down, z = forward
Velodyne: x = forward, y = left, z = up
GPS/IMU: x = forward, y = left, z = up

Fig. 5. Examples from the KITTI dataset ('City', 'Residential', 'Road', 'Campus' and 'Person' categories). This figure demonstrates the diversity in our dataset. The left color camera image is shown.
Notation: In the following, we write scalars in lower-case letters (a), vectors in bold lower-case (a) and matrices using bold-face capitals (A). 3D rigid-body transformations which take points from coordinate system a to coordinate system b will be denoted by T_a^b, with T for 'transformation'.
A. Synchronization
In order to synchronize the sensors, we use the timestamps
of the Velodyne 3D laser scanner as a reference and consider
each spin as a frame. We mounted a reed contact at the bottom
of the continuously rotating scanner, triggering the cameras
when facing forward. This minimizes the differences in the
range and image observations caused by dynamic objects.
Unfortunately, the GPS/IMU system cannot be synchronized
that way. Instead, as it provides updates at 100 Hz, we
collect the information with the closest timestamp to the laser
scanner timestamp for a particular frame, resulting in a worst-
case time difference of 5 ms between a GPS/IMU and a
camera/Velodyne data package. Note that all timestamps are
provided such that positioning information at any time can be
easily obtained via interpolation. All timestamps have been
recorded on our host computer using the system clock.
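A sketch of this nearest-timestamp association is given below, with the timestamps assumed to be already converted to seconds (e.g., parsed from the date/time lines of timestamps.txt); the values are toy numbers:

// Sketch: match each camera/Velodyne frame to the OXTS packet with the closest timestamp.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

std::size_t nearestIndex(const std::vector<double>& oxts_t, double frame_t) {
  // oxts_t is sorted (100 Hz stream); binary-search the closest entry.
  auto it = std::lower_bound(oxts_t.begin(), oxts_t.end(), frame_t);
  if (it == oxts_t.begin()) return 0;
  if (it == oxts_t.end()) return oxts_t.size() - 1;
  std::size_t hi = it - oxts_t.begin(), lo = hi - 1;
  return (std::fabs(oxts_t[hi] - frame_t) < std::fabs(oxts_t[lo] - frame_t)) ? hi : lo;
}

int main() {
  std::vector<double> oxts_t  = {0.000, 0.010, 0.020, 0.030, 0.040};  // toy 100 Hz timestamps
  std::vector<double> frame_t = {0.004, 0.037};                       // toy 10 Hz frame times
  for (double t : frame_t)
    std::printf("frame at %.3f s -> oxts index %zu\n", t, nearestIndex(oxts_t, t));
  return 0;
}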
B. Camera Calibration
For calibrating the cameras intrinsically and extrinsically,
we use the approach proposed in [11]. Note that all camera
centers are aligned, i.e., they lie on the same x/y-plane. This
is important as it allows us to rectify all images jointly.
The calibration parameters for each day are stored in
row-major order in calib_cam_to_cam.txt using the
following notation:
s^(i) ∈ N^2 ............ original image size (1392 × 512)
K^(i) ∈ R^{3×3} ........ calibration matrices (unrectified)
d^(i) ∈ R^5 ............ distortion coefficients (unrectified)
R^(i) ∈ R^{3×3} ........ rotation from camera 0 to camera i
t^(i) ∈ R^{1×3} ........ translation from camera 0 to camera i
s_rect^(i) ∈ N^2 ....... image size after rectification
R_rect^(i) ∈ R^{3×3} ... rectifying rotation matrix
P_rect^(i) ∈ R^{3×4} ... projection matrix after rectification
Here, i ∈ {0, 1, 2, 3} is the camera index, where 0 represents
the left grayscale, 1 the right grayscale, 2 the left color and
3 the right color camera. Note that the variable definitions
are compliant with the OpenCV library, which we used for
warping the images. When working with the synchronized
and rectified datasets, only the variables with rect-subscript are
relevant. Note that due to the pincushion distortion effect the
images have been cropped such that the size of the rectified
images is smaller than the original size of 1392 × 512 pixels.
The projection of a 3D point x = (x, y, z, 1)^T in rectified (rotated) camera coordinates to a point y = (u, v, 1)^T in the i'th camera image is given as

y = P_rect^(i) x                                               (3)

with

             [ f_u^(i)   0         c_u^(i)   -f_u^(i) b_x^(i) ]
P_rect^(i) = [ 0         f_v^(i)   c_v^(i)    0               ]   (4)
             [ 0         0         1          0               ]

the i'th projection matrix. Here, b_x^(i) denotes the baseline (in meters) with respect to reference camera 0. Note that in order to project a 3D point x in reference camera coordinates to a point y on the i'th image plane, the rectifying rotation matrix of the reference camera R_rect^(0) must be considered as well:

y = P_rect^(i) R_rect^(0) x                                    (5)

Here, R_rect^(0) has been expanded into a 4 × 4 matrix by appending a fourth zero-row and column, and setting R_rect^(0)(4, 4) = 1.
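The following C++ sketch illustrates Eq. (5) with plain arrays; the intrinsic values are placeholders, and in practice P_rect^(i) and R_rect^(0) would be read from calib_cam_to_cam.txt:

// Sketch: project a homogeneous 3D point from reference camera coordinates into image i.
#include <cstdio>

int main() {
  // Placeholder projection matrix: f_u = f_v = 700, c_u = 600, c_v = 180, zero baseline term.
  double P[3][4]     = {{700, 0, 600, 0}, {0, 700, 180, 0}, {0, 0, 1, 0}};
  // Placeholder rectifying rotation, expanded to 4x4 (identity for illustration).
  double Rrect[4][4] = {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}};
  double x[4] = {2.0, 1.0, 10.0, 1.0};   // 3D point (meters), homogeneous

  double xr[4] = {0, 0, 0, 0};           // x rotated into rectified coordinates
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j) xr[i] += Rrect[i][j] * x[j];

  double y[3] = {0, 0, 0};               // homogeneous image coordinates
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 4; ++j) y[i] += P[i][j] * xr[j];

  std::printf("u = %.1f, v = %.1f\n", y[0] / y[2], y[1] / y[2]);  // normalize to pixels
  return 0;
}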
C. Velodyne and IMU Calibration
We have registered the Velodyne laser scanner with respect
to the reference camera coordinate system (camera 0) by
initializing the rigid body transformation using [11]. Next, we
optimized an error criterion based on the Euclidean distance of
50 manually selected correspondences and a robust measure on
the disparity error with respect to the 3 top performing stereo
methods in the KITTI stereo benchmark [8]. The optimization
was carried out using Metropolis-Hastings sampling.
The rigid body transformation from Velodyne coordinates to
camera coordinates is given in calib_velo_to_cam.txt:
R_velo^cam ∈ R^{3×3} .... rotation matrix: velodyne → camera
t_velo^cam ∈ R^{1×3} .... translation vector: velodyne → camera

Using

T_velo^cam = [ R_velo^cam   t_velo^cam ]
             [ 0            1          ]                       (6)

a 3D point x in Velodyne coordinates gets projected to a point y in the i'th camera image as

y = P_rect^(i) R_rect^(0) T_velo^cam x                         (7)
For registering the IMU/GPS with respect to the Velodyne laser scanner, we first recorded a sequence with an '∞'-loop and registered the (untwisted) point clouds using the Point-to-Plane ICP algorithm. Given two trajectories, this problem corresponds to the well-known hand-eye calibration problem, which can be solved using standard tools [12]. The rotation matrix R_imu^velo and the translation vector t_imu^velo are stored in calib_imu_to_velo.txt. A 3D point x in IMU/GPS coordinates gets projected to a point y in the i'th image as

y = P_rect^(i) R_rect^(0) T_velo^cam T_imu^velo x              (8)
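A sketch of how the rigid-body part of Eqs. (6)-(8) can be assembled from the stored rotation matrices and translation vectors follows; the numeric values are illustrative only, and the real entries come from calib_velo_to_cam.txt and calib_imu_to_velo.txt:

// Sketch: build a 4x4 rigid-body transform (Eq. (6)) and apply it to a homogeneous point.
#include <cstdio>

// Assemble T = [R t; 0 1] from a 3x3 rotation and a 3-vector translation.
void makeTransform(const double R[3][3], const double t[3], double T[4][4]) {
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j) T[i][j] = (i == j) ? 1.0 : 0.0;
  for (int i = 0; i < 3; ++i) {
    for (int j = 0; j < 3; ++j) T[i][j] = R[i][j];
    T[i][3] = t[i];
  }
}

// y = A * x for a 4x4 matrix and a homogeneous 4-vector.
void apply(const double A[4][4], const double x[4], double y[4]) {
  for (int i = 0; i < 4; ++i) {
    y[i] = 0.0;
    for (int j = 0; j < 4; ++j) y[i] += A[i][j] * x[j];
  }
}

int main() {
  // Illustrative rotation mapping Velodyne axes (x fwd, y left, z up) to camera axes
  // (x right, y down, z fwd), plus an illustrative translation.
  double R[3][3] = {{0, -1, 0}, {0, 0, -1}, {1, 0, 0}};
  double t[3] = {0.0, -0.08, -0.27};
  double T_cam_velo[4][4];
  makeTransform(R, t, T_cam_velo);

  double x_velo[4] = {10.0, 1.0, -1.0, 1.0};   // point in Velodyne coordinates
  double x_cam[4];
  apply(T_cam_velo, x_velo, x_cam);            // first step of Eqs. (7)/(8)
  // Applying R_rect^(0) and P_rect^(i) (cf. Section IV-B) then yields the pixel coordinates.
  std::printf("camera coordinates: %.2f %.2f %.2f\n", x_cam[0], x_cam[1], x_cam[2]);
  return 0;
}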
V. SUMMARY AND FUTURE WORK
In this paper, we have presented a calibrated, synchronized
and rectified autonomous driving dataset capturing a wide
range of interesting scenarios. We believe that this dataset
will be highly useful in many areas of robotics and com-
puter vision. In the future we plan on expanding the set of
available sequences by adding additional 3D object labels for currently unlabeled sequences and by recording new sequences, for example in difficult lighting situations such as at night, in tunnels, or in the presence of fog or rain.
Fig. 8. Object Occurrence and Orientation Statistics of our Dataset.
This figure shows the different types of objects occurring in our sequences
(top) and the orientation histograms (bottom) for the two most predominant
categories ’Car’ and ’Pedestrian’.
Furthermore, we
plan on extending our benchmark suite by novel challenges.
In particular, we will provide pixel-accurate semantic labels
for many of the sequences.
REFERENCES
[1] G. Singh and J. Kosecka, “Acquiring semantics induced topology in urban environments,” in ICRA, 2012.
[2] R. Paul and P. Newman, “FAB-MAP 3D: Topological mapping with spatial and visual appearance,” in ICRA, 2010.
[3] C. Wojek, S. Walk, S. Roth, K. Schindler, and B. Schiele, “Monocular visual scene understanding: Understanding multi-object traffic scenes,” PAMI, 2012.
[4] D. Pfeiffer and U. Franke, “Efficient representation of traffic scenes by means of dynamic stixels,” in IV, 2010.
[5] A. Geiger, M. Lauer, and R. Urtasun, “A generative model for 3d urban scene understanding from movable platforms,” in CVPR, 2011.
[6] A. Geiger, C. Wojek, and R. Urtasun, “Joint 3d estimation of objects and scene layout,” in NIPS, 2011.
[7] M. A. Brubaker, A. Geiger, and R. Urtasun, “Lost! Leveraging the crowd for probabilistic visual self-localization,” in CVPR, 2013.
[8] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in CVPR, 2012.
[9] M. Goebl and G. Faerber, “A real-time-capable hard- and software architecture for joint image and knowledge processing in cognitive automobiles,” in IV, 2007.
[10] P. Osborne, “The Mercator projections,” 2008. [Online]. Available: http://mercator.myzen.co.uk/mercator.pdf
[11] A. Geiger, F. Moosmann, O. Car, and B. Schuster, “A toolbox for automatic calibration of range and camera sensors using a single shot,” in ICRA, 2012.
[12] R. Horaud and F. Dornaika, “Hand-eye calibration,” IJRR, vol. 14, no. 3, pp. 195–210, 1995.
