Monocular Visual Odometry in Urban Environments Using an Omnidirectional Camera
Jean-Philippe Tardif Yanis Pavlidis Kostas Daniilidis
GRASP Laboratory, University of Pennsylvania,
3401 Walnut Street, Suite 300C, Philadelphia, PA 19104-6228
{tardifj, pavlidis}@seas.upenn.edu, kostas@cis.upenn.edu
Abstract— We present a system for Monocular Simultaneous
Localization and Mapping (Mono-SLAM) relying solely on
video input. Our algorithm makes it possible to precisely
estimate the camera trajectory without relying on any motion
model. The estimation is fully incremental: at a given time
frame, only the current location is estimated while the previous
camera positions are never modified. In particular, we do not
perform any simultaneous iterative optimization of the camera
positions and estimated 3D structure (local bundle adjustment).
The key aspect of the system is a fast and simple pose
estimation algorithm that uses information not only from the
estimated 3D map, but also from the epipolar constraint. We
show that the latter leads to a much more stable estimation
of the camera trajectory than the conventional approach. We
perform high precision camera trajectory estimation in urban
scenes with a large amount of clutter. Using an omnidirectional
camera placed on a vehicle, we cover the longest distance ever
reported, up to 2.5 kilometers.
I. INTRODUCTION
Robot localization without any map knowledge can be
effectively solved with combinations of expensive GPS and
IMU sensors if we can assume that GPS cannot be jammed
and works well in urban canyons. Given accurate robot poses
from GPS/IMU, one can quickly establish a quasi-dense 3D
map of the environment if provided with a full 2.5D laser
scanning system and if the system can detect moving objects
around. This is how most vehicles navigated autonomously in the
recent DARPA Challenges. However, for
massive capture of urban environments or fast deployment
of small robots, we are faced with the challenge of how to
estimate accurate pose based on images and in particular
whether this can work for a very long route in the order of
kilometers. We know that the best results can be obtained
by combining sensors, for example cameras with IMUs.
Because we address applications where video data is useful
for many other purposes, we start from the assumption that
we have cameras and push this assumption to the extreme:
cameras will be the only sensor used.
We choose an omnidirectional camera array without over-
lap between the cameras, which is the reason why we
talk about monocular vision. We will concentrate on the
localization problem, which we call visual odometry. We
simultaneously establish a metric map of 3D landmarks. In
this work, no recognition technique is used for loop closing.
The main challenge for monocular visual odometry and
for mono-SLAM is to minimize the drift in the trajectory as
well as the map distortion in very long routes. We believe that
instabilities in the estimation of 3D motion can be weakened
or even eliminated with a very large field of view. Not only
does two-view estimation become more robust, but landmarks also
remain in the field of view for longer temporal windows.
Having a good sensor like a camera with almost spherical
field of view, we believe that the main priority is not the prop-
agation of the uncertainty but the correctness of the matches.
Having thousands of landmarks in each omnidirectional view
makes the correspondence problem hard. This is why we
use an established RANSAC based solution for two-view
motion estimation. Based on triangulation from consecutive
frames we establish maps of landmarks. An incoming view
has to be registered with the existing map. Here is the main
methodological contribution in this paper: given the last
image and a current 3D map of landmarks, we decouple the
rotation estimation from the translation in order to estimate
the pose of a new image.
This is the main difference to the state of the art in visual
odometry [30]: instead of applying classic three-point based
pose estimation, we compute the rotation from the epipolar
geometry between the last and the new image and the re-
maining translation from the 3D map. We cascade RANSAC
for these two steps, and the main effect is that the rotation
error decreases significantly, as do the translation error and
in particular the heading direction. The intuition behind this
decoupling is two-fold. Firstly, contrary to pose estimation
from the 3D points, the epipolar constraint estimation is not
subject to error propagation. Secondly, there is effectively
a wider overlap of the fields of view in epipolar geometry
than the effective angle that the current 3D map spans.
The difference is points far away that have not been
reconstructed. Such points (like in the vanishing direction of
the road) which are a burden according to [8], do contribute
to the epipolar estimation and thus to a better relative rotation
estimation.
We have been able to perform 2.5km long visual odometry
with the smallest drift with respect to the results in the
literature. This is verified by comparing it to GPS.
The main advantages of our approach can be summarized
as follows:
- The data association problem is solved robustly, making the algorithm unaffected by third-party motions such as surrounding vehicles and humans occupying a significant part of the field of view;
- We do not apply any motion model and do not make any assumption about uncertainty in the prediction or in the measurements;
- At no stage do we need to apply an expensive batch approach like bundle adjustment;
- The propagation of the global scale is robust and the underestimation of the length of the trajectory is small;
- The orientation of the vehicle is estimated accurately.
We believe that considering the length of the route in our
results and the complexity of the environment, we surpass the
state of the art by using our new pose estimation algorithm
and by employing an omnidirectional camera.
II. RELATED WORK
Progress and challenges in multi-sensor SLAM have been
nicely summarized in the tutorial by Durrant-Whyte [11], in
Thrun’s mapping survey [37], and in Frese’s survey [15].
Here, we will concentrate on summarizing purely vision
based SLAM systems, and in particular monocular SLAM
systems. We will describe related work as follows: We will
start with multiple frame structure from motion approaches
from computer vision, characterized mainly by their appli-
cation in short motion ranges. Then, we describe approaches
capable of handling long-range sequences.
Since the eighties, several approaches have been intro-
duced to extract the motion and the structure of an object
rather than a scene from multiple video-frames either in
recursive or batch filtering modes [3], [5]. Object features
were visible over a long temporal window and the object
projection occupied a significant part of the field of view.
Systems employing motion models effectively smoothen the
trajectory of the camera and constrain the search area for
feature correspondence across frames. We list [6] as the latest
approach in this category, and as the first to introduce
inverse depth in the state vector, thereby converting a
nonlinear measurement equation into an identity. In all these
approaches, the global scale is fixed for the first frame while
the global reference frame is defined from three points in the
first frame. Among the very early vision-based approaches,
we have to mention [10] who used a camera over several
kilometers for lane following but not for global localization
or mapping. In the nineties, several approaches emerged with
the goal of 3D modeling from video starting with approaches
from Oxford [2], and culminating in state-of-the-art systems
by Pollefeys et al. [31] and Heyden and Kahl [19], whose
main goal was a dense reconstruction of a 3D scene from
uncalibrated views. Multiple view methods are now textbook
material [17], [26].
We continue with approaches free of motion models.
Nister et al. [30] were the first in recent history to produce
a real-time monocular SLAM without motion model or any
assumptions about the scene. They use the 5-point algorithm
[28] with preemptive RANSAC for three consecutive frames
and subsequently estimate the pose with a 3-point algorithm
[16]. Points are triangulated between the farthest view-
points from which they are visible. A “firewall” prohibits
the inclusion of frames for triangulation past a particular
frame in history. Royer et al. [33] track Harris corners
with normalized cross-correlation. Similar to [30], local 3D
motion is computed from each incoming image triplet using
RANSAC for inlier detection. The last image of the triplet
is used in combination with the current estimate of the 3D
structure to compute the pose of the most recent view. A
hierarchical bundle adjustment is applied to subdivisions of
the sequence with at least one overlapping frame among
them. The established map is used for localization in a
second pass. In similar spirit is the work by Kaess and
Dellaert [20] who use an array of outwards looking cameras
but they test on a short-range sequence. Silpa-Anan and
Hartley [34] use the same camera as we do but concentrate
on the loop closing task.
All approaches mentioned in the last paragraph have the
following in common with our approach: they do not use any
motion model, a requirement for any filtering or Bayesian
approach. Among filtering approaches, Davison’s real-time
MonoSLAM system [8], [9], as well as Eade and Drummond's
monocular system [12], are the most competitive.
Similar to [6], both groups use inverse depth parametrization
with the advantages of being able to estimate depths of
very far features and being less dependent on the initial
estimates. A motion model via a combination of a particle
filter and an EKF is also used by Karlsson et al. [21] in their
trademarked vSLAM system, tested in a tour of a two-
bedroom apartment. Corke et al. [7] use a catadioptric vision
system for visual odometry with Iterated EKF and report that
tracking fails after 30m (300 frames).
Regarding localization but from a sparse set of views, we
refer to Teller’s work [36] who estimates all pairwise epipolar
geometries from an unordered set of omnidirectional images
and then fixes the translation scales with a spectral approach
[4]. An interesting fusion is the adjustment of a purely
visually obtained long range map [23] where visual odometry
is adjusted with a rough map. We will not refer to approaches
using an IMU [27], odometry, making assumptions about the
terrain or using laser scanners for mapping. We refer though
as state of the art to the longest sequence (>10km) that has
been used by Konolige et al. [22] who achieved an error
of only 10m after 10km using only a stereo camera and an
IMU. Their results show that sparse bundle adjustment can
improve pure stereo vision by halving the error while the use
of an IMU can decrease the error by a factor of 20.
III. OVERVIEW
Monocular Visual SLAM in urban environments with a
camera mounted on a vehicle is a particularly challenging
task. Difficulties indeed arise at many levels:
- Many outliers are present (other moving vehicles, oscillating trees, pedestrians) and must not be used by the algorithm;
- Frequently, very few image landmarks are present, for instance when passing by a park or a parking lot;
- Occlusions, especially from the trees, make it almost impossible to track landmarks for a long period of time.
Under these circumstances, the use of an omnidirectional
camera is almost unavoidable as large parts of the field of
view are sometimes useless. We combine five cameras for a
total of roughly 4 mega-pixels. Because their optical centers
are virtually aligned, this can be seen as a high-resolution
omnidirectional camera. In particular, no depth information

can be recovered without motion. To avoid confusion in
the rest of the paper, we will refer to those cameras and
their acquired images as a single omnidirectional camera
and single image. This high resolution comes at the cost
of a rather low frame-rate: a maximum of 10 frames per
second. Our system can handle this kind of frame-rate; in
fact, we could achieve very good results using only 3 frames
per second (see §V).
The system we present follows the approach of Nister et
al. [30] with some important differences, the most important
of which is the decoupled estimation of the rotation and
translation using epipolar geometry. Furthermore, we use
SIFT features instead of corners. Similarly to them, motion
estimation is completely incremental and no sophisticated
optimization step is performed.
IV. DETAILED DESCRIPTION
In this section, we detail each step of our algorithm
summarized in figure 1.
A. Landmark detection
The traditional approach for landmark detection is the use
of image corners such as Harris. It has been shown that these
can be computed very efficiently and matched by normalized
cross-correlation as long as the image deformation remains
low. Recently, Barfoot demonstrated that the Scale Invariant
Feature Transform (SIFT) detector [24] could provide more
robust image matches in the context of SLAM [1], [13].
We tested both the Fast Corner Detector [32] and the
SIFT detector and found the latter to perform better. This
is most likely because of the low frame-rate of our camera.
Furthermore, the presence of trees, sometimes covering a
large portion of the image, yields large numbers of corners
of poor quality. This effect was not as important in the
case of the SIFT detector. We use the implementation of the
SIFT detector by Vedaldi and Fulkerson, available online at
http://vision.ucla.edu/~vedaldi/code/vlfeat/vlfeat.html, and set
the peak threshold to one, which is lower than the default value
suggested in Lowe's original implementation.
On average, we obtain between 500 and 1000 landmarks per image.
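For concreteness, the detection step could be sketched as follows. This is only an illustrative sketch, not the authors' code: it assumes OpenCV's SIFT implementation rather than the VLFeat library used in the paper, and the contrastThreshold value shown is merely a loose analogue of VLFeat's peak threshold.

```python
import cv2

def detect_landmarks(image_bgr):
    """Detect SIFT keypoints and descriptors in one sub-image of the rig.

    A lowered contrast threshold keeps weaker blobs, mirroring the idea of
    using a peak threshold below Lowe's default (the value here is an
    assumption for illustration, not the authors' setting).
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create(contrastThreshold=0.01)  # OpenCV default is 0.04
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors
```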
B. Landmark tracking
Landmark matching between two images is the basis
for our SLAM algorithm. Again, many possibilities were
considered before settling on the one described below. First,
we considered a KLT-based tracker [25], which did not
provide sufficient quality at our frame rate. Better results
were obtained using either normalized cross-correlation of
corners or SIFT matching, but the latter gave the best results.
The traditional approach for SIFT matching is to compute
the squared difference between SIFT descriptors. A match is
accepted if the ratio between its score and the score of the
second best match is at least 1.5. We found this criterion to be
far too restrictive for our application. Indeed, valid matches
between regions with repeating patterns, which are common in
urban scenes (e.g. buildings), were often rejected. Lowering
the threshold caused more landmarks to be matched, but the
quality of the matching was also reduced.
OUTPUT
  List of 3D points: X = {}
  List of cameras: P = {}
INITIALIZATION
  Tracking until a keyframe is found
  Preemptive RANSAC for relative pose estimation
  Triangulation of landmarks
MAIN LOOP  (θ_min = 5°, D_min = 20 pixels)
  At the current image I_n:
  Tracking:
    Detect and compute SIFT features
    Match with view I_{n-1}
    Preemptive RANSAC for relative pose E_0
    Refine matches using E_0
    If 95% of landmarks are within D_min, drop frame, repeat tracking
  Motion estimation:
    Preemptive RANSAC for relative pose E_{n-1,n}
    Remove bad tracks
    Preemptive RANSAC for camera translation P{n}
  Structure computation:
    For each landmark l:
      If camList_l = {}, set camList_l = {n-1}
      N-view triangulation with views from camList_l and view n -> 3D point M
      Reject point if re-projection error is high, go to next landmark
      Compute angle θ between M, P{camList_l(end)} and P{n}
      If θ > θ_min:
        X{l} <- M
        add {n+1} to camList_l
Fig. 1. Overview of our visual odometry algorithm. Details are provided
in §IV.
To improve the quality of the matching, we make a simple
assumption similar to instantaneous constant velocity as
proposed by Davison et al. among others [8]. Our assumption
is weaker and we only use it to speed up the search. Denote
the current and two previous images by I_n, I_{n-1} and I_{n-2}.
We constrain our search for correspondences between I_n
and I_{n-1} by assuming that the epipolar geometry changes
relatively little between neighboring images. In other words,
E_{n-2,n-1} ≈ E_{n-1,n}.
We divide the matching into two steps. First, for every
landmark, we only consider landmarks in the second image
that are within a relatively high distance (D_E = 20 pixels)
to the epipolar line. Furthermore, no minimum score ratio is
used, but only matches of a very good score (below 40000)
are kept. These matches are used to robustly estimate the
epipolar geometry between the two frames (see §IV-E). Then,
we recompute the matches, this time setting D_E to two
pixels while doubling the matching score threshold.
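As a rough sketch of this two-pass matching (not the authors' implementation), the loop below gates candidates by their distance to the epipolar line under the previous epipolar geometry and accepts only descriptor pairs below an absolute score. It is written for a perspective image pair with a fundamental matrix; for the omnidirectional rig the same gating would be applied to back-projection rays under the essential matrix, and the score threshold is an assumed stand-in for the value quoted above.

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Distance (pixels) of homogeneous point x2 to the epipolar line F @ x1."""
    line = F @ x1
    return abs(line @ x2) / np.hypot(line[0], line[1])

def match_guided(desc1, pts1, desc2, pts2, F_prior, gate_px=20.0, max_score=4.0e4):
    """One pass of epipolar-guided matching (illustrative sketch).

    For each feature in image 1, only features of image 2 lying within
    `gate_px` of its epipolar line under the prior F are considered, and the
    best candidate is accepted if its squared descriptor distance is below
    `max_score`. A second pass would tighten gate_px and adjust max_score.
    """
    matches = []
    for i, (d1, p1) in enumerate(zip(desc1, pts1)):
        best_j, best_score = -1, max_score
        for j, (d2, p2) in enumerate(zip(desc2, pts2)):
            if epipolar_distance(F_prior, p1, p2) > gate_px:
                continue  # too far from the predicted epipolar line
            score = float(np.sum((d1 - d2) ** 2))
            if score < best_score:
                best_j, best_score = j, score
        if best_j >= 0:
            matches.append((i, best_j))
    return matches
```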
C. Initialization
Initialization of the algorithm is very simple. Tracking
is done until a keyframe is found, as described above. We
then robustly estimate the relative orientation of the second
frame with respect to the first one. The distance between
the two cameras is set to one which also fixes the overall
scale of the reconstruction. Then, two-view triangulation of
the landmarks is performed.
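A minimal two-view bootstrap along these lines could look like the sketch below. It assumes a perspective camera model with calibration matrix K and Nx2 float pixel coordinates, and it uses OpenCV's standard RANSAC essential-matrix estimation in place of the preemptive RANSAC described in the paper.

```python
import cv2
import numpy as np

def initialize_map(pts1, pts2, K):
    """Two-view bootstrap (illustrative sketch, perspective model assumed).

    Estimates the relative pose between the first two keyframes, fixes the
    baseline length to 1 (which anchors the global scale of the whole
    reconstruction), and triangulates the matched landmarks.
    pts1, pts2: Nx2 float arrays of matched pixel coordinates.
    """
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    t = t / np.linalg.norm(t)                    # unit baseline fixes the scale

    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P1 = K @ np.hstack([R, t])
    X_h = cv2.triangulatePoints(P0, P1, pts1.T, pts2.T)   # 4 x N homogeneous
    X = (X_h[:3] / X_h[3]).T                               # N x 3 points
    return R, t, X, pose_mask.ravel().astype(bool)
```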
D. Keyframe selection
Our keyframe selection is akin to the one of Royer et
al. [33] and does not rely on Geometric Robust Information
Criteria [38]. A frame is dropped if more than 95% of the
landmarks moved within a threshold D_min, set to 20 in all
our experiments. In practice, almost all acquired images are
keyframes when the vehicle runs at full speed.
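A direct transcription of this test might look like the following sketch; the per-track displacement measure and the handling of the 95% boundary are assumptions.

```python
import numpy as np

def is_keyframe(prev_pts, curr_pts, d_min=20.0, fraction=0.95):
    """Keyframe test sketch: the frame is dropped if at least `fraction` of
    the tracked landmarks moved less than `d_min` pixels since the last
    keyframe (D_min = 20 px and 95% as quoted in the text).

    prev_pts, curr_pts: Nx2 arrays of corresponding pixel positions.
    Returns True if the frame should be kept as a keyframe.
    """
    disp = np.linalg.norm(curr_pts - prev_pts, axis=1)   # per-track motion
    return np.mean(disp < d_min) < fraction
```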
E. Robust pose estimation
Our pose estimation procedure is the key to the high
quality of the trajectory estimation. At this step, we are given
a set of tracked landmarks. In addition, an estimate for the
location of a 3D point, previously obtained by triangulation
(see §IV-F), is given for some of these tracks.
We proceed with a geometric hypothesize-and-test archi-
tecture similar to Nister et al. [30] and Royer et al. [33].
However, we obtain superior results in practice. Before going
into details, it is useful to outline the approach used in the
two aforementioned works.
The first step is to remove bad correspondences between
the two frames. Their approach, which we also adopt, is to
rely on Random Sample Consensus (RANSAC) [14]
with the minimal relative pose algorithm (the so-called 5-point
algorithm) [35]. Our implementation is a fast variation of
RANSAC called preemptive RANSAC [29]. Only landmarks
with known 3D points are retained for the second step.
Again, preemptive RANSAC is used with an algorithm for
computing the pose requiring three 3D-2D correspondences
[16]. Note that for this approach to be successful, the 3D
structure must be recovered with high accuracy. For this
reason, Royer et al. resort to iterative refinement of the
camera position and orientation with the current visible
3D points (local bundle-adjustment). This step is the most
computationally intensive and avoiding it significantly speeds
up the algorithm. In the case of Nister et al., they use
a firewall to reduce error propagation, with the drawback of
reducing the number of landmarks available for triangulation.
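For readers unfamiliar with preemptive RANSAC [29], the skeleton below sketches its structure under stated assumptions: make_hypothesis and score are hypothetical callables (for example, a 5-point relative pose solver and an epipolar inlier test), and the hypothesis budget and block size are illustrative values, not those of the paper.

```python
import numpy as np

def preemptive_ransac(data, sample_size, make_hypothesis, score, n_hyp=500, block=100):
    """Skeleton of preemptive RANSAC (illustrative sketch, not the authors' code).

    A fixed budget of hypotheses is generated from random minimal samples.
    Hypotheses are then scored on successive blocks of observations; after
    each block only the best-scoring half are kept, until a single hypothesis
    remains or all observations have been used.
    """
    rng = np.random.default_rng()
    hyps = []
    for _ in range(n_hyp):
        idx = rng.choice(len(data), size=sample_size, replace=False)
        h = make_hypothesis([data[i] for i in idx])
        if h is not None:
            hyps.append(h)
    if not hyps:
        raise ValueError("no valid hypotheses could be generated")

    order = rng.permutation(len(data))        # randomized evaluation order
    scores = np.zeros(len(hyps))
    for start in range(0, len(order), block):
        chunk = [data[i] for i in order[start:start + block]]
        scores = scores + np.array([sum(score(h, d) for d in chunk) for h in hyps])
        keep = np.argsort(-scores)[:max(1, len(hyps) // 2)]   # keep the best half
        hyps = [hyps[i] for i in keep]
        scores = scores[keep]
        if len(hyps) == 1:
            break
    return hyps[0]
```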
We propose decoupling the estimation of the camera
orientation and position without implementing bundle ad-
justment or firewall. First, the following observations are
made. In general, and especially in urban environments, the
number of landmarks with known 3D points is much lower
than the actual number of correspondences. This is because
triangulation can be accurately performed only for the part
of the scene located on the side of the vehicle. Furthermore,
we observe that the quality of the reconstruction is typically
quite low for a large portion of the 3D points, since land-
marks only appear in a few frames before being occluded.
Typically, the number of available 3D points varies between
20 and 300. On the other hand, correspondences between
two views are much more abundant, typically between 500
and 1500, and more uniformly distributed. Thus, it seems
natural that the epipolar constraint should not be “ignored”
while estimating the pose from the 3D points.

Fig. 2. Comparison between using a 3-point algorithm for pose estimation
and the proposed approach of obtaining the relative orientation using
two-view epipolar geometry and the camera position with the 2-point
algorithm. Top) Two images of a location where the number of landmarks
decreases significantly. Middle) Camera trajectories with respect to ground
truth given by the GPS (350 meters). Bottom) Inliers used for pose estimation.
As a matter of fact, epipolar geometry already provides
the relative position of the new camera up to one degree of
freedom: only the direction of the translation is estimated
[17]. In theory, only one 3D point is required to recover that
scale. In doing so, the estimated camera position is consistent
with both 3D points and the epipolar geometry. One may

argue that the recovered translation direction can be of poor
quality, especially under small camera motion. This is indeed
what we experienced in practice. For this reason, instead of
only estimating the translation scale, we actually estimate
the full camera position while fixing its orientation. Solving
for this is straightforward and requires only 2 (1 and a half)
3D-2D correspondences. Preemptive RANSAC is performed
followed by iterative refinement.
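To make the decoupling concrete, the sketch below solves for the translation with the rotation held fixed, here as a single least-squares problem over all 3D-2D correspondences. In the paper this step is instead carried out with preemptive RANSAC over minimal 2-point samples followed by iterative refinement, so treat this only as an illustration of why two correspondences already over-determine the three unknowns.

```python
import numpy as np

def skew(v):
    """3x3 cross-product matrix [v]_x such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def translation_from_fixed_rotation(R, points_3d, rays):
    """Solve for the camera translation t with the rotation R held fixed (sketch).

    Each 3D point X and its unit viewing ray x (in the camera frame) satisfy
    x ~ R X + t, i.e. [x]_x (R X + t) = 0, which is linear in t: two
    independent equations per correspondence, so two points suffice for the
    three unknowns. This is a plain least-squares stand-in for the
    2-point-plus-preemptive-RANSAC step described in the paper.
    """
    A, b = [], []
    for X, x in zip(points_3d, rays):
        S = skew(x)
        A.append(S)                  # coefficient of t
        b.append(-S @ (R @ X))       # right-hand side
    A = np.vstack(A)
    b = np.hstack(b)
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t
```

Inside a RANSAC loop, the same linear system would simply be built from the two sampled correspondences, while the epipolar-derived rotation is never modified.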
Figure 2 shows a 350 m trajectory estimated using both the
3-point and the 2-point algorithms, as well as the ground truth
given by our GPS. In this sequence, the number of landmarks
is significantly reduced around frame 200, resulting in a drift
in the 3-point algorithm. To better understand why our pose
estimation algorithm is less sensitive to drift we analyzed
the inlier count given by the two algorithms. In figure 2, we
observe that when drifting occurs, the number of inliers for
a 3-point algorithm is on average higher than that for the
2-point algorithm. This means that the drift is caused by the
use of some 3D points whose tracks are consistent with
the epipolar geometry but whose triangulated positions are erroneous.
F. Triangulation
As explained above, structure computation is only needed
to estimate the camera position, but not its orientation. A
small set of points of high quality is thus preferred over
a larger one of lower quality, and our triangulation procedure
is therefore conservative. When a landmark is available in only
two views, we rely on a fast closed-form two-view triangulation algorithm
[28]. Otherwise, we perform N-view triangulation using the
DLT algorithm [17]. Any attempt at relying on only a two-
view triangulation algorithm, including the one minimizing
the re-projection error [18], gave results of lower quality. No
iterative refinement of the 3D points using the re-projection
error is used in our experiments.
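For reference, the linear N-view (DLT) triangulation of [17] can be sketched as follows; this is the textbook method written for generic 3x4 projection matrices, not the authors' code, and no iterative refinement is applied afterwards.

```python
import numpy as np

def triangulate_dlt(proj_mats, image_pts):
    """N-view linear (DLT) triangulation sketch.

    proj_mats: list of 3x4 projection matrices P_i.
    image_pts: list of corresponding homogeneous 2D observations (3-vectors).
    Each view contributes the two rows  u*P[2] - w*P[0]  and  v*P[2] - w*P[1];
    the 3D point is the null vector of the stacked system, recovered by SVD.
    """
    rows = []
    for P, x in zip(proj_mats, image_pts):
        u, v, w = x
        rows.append(u * P[2] - w * P[0])
        rows.append(v * P[2] - w * P[1])
    A = np.vstack(rows)
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]                      # right singular vector of smallest value
    return X_h[:3] / X_h[3]           # dehomogenize to a 3D point
```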
We propose a simple heuristic to select the cameras used
for triangulation. We do so in order to distribute back-
projection rays as evenly as possible. Our goal here is
twofold. Firstly, we make sure that 3D points far away from
the camera are never used for estimating the position of
the camera. However, their corresponding 2D landmarks are
still used for estimating the camera orientation as explained
earlier. Secondly, as the distance of a 3D point from the
current camera increases, it is not re-triangulated over and over again.
We proceed as follows. For each landmark l, we keep a list
of indices camList_l corresponding to the cameras used for
triangulation. When a landmark first appears in two frames,
denoted I_n and I_{n+1}, we add {n} to camList_l. Note that
n + 1 is not yet added. We then estimate the 3D point using
cameras n and n + 1 and compute the angle between the
back-projection rays with respect to the 3D point. If the
angle is larger than a given threshold, θ_min, the 3D point
is kept and n + 1 is added to camList_l. When performing
triangulation for a landmark observed before frame f, we
proceed similarly. We compute a 3D point using camList_l
as well as the current camera. We then compute the angle
of the emerging rays from the camera corresponding to the
last element of camList_l and the current one. Again, if the
angle is too small, this newly triangulated point is rejected
and the one previously estimated is kept. At any of these
steps, if the re-projection error is higher than our threshold,
we completely reject the 3D point.

Fig. 3. Top) Ladybug 2 camera mounted on a vehicle. Bottom) Six images
acquired from the six cameras, after radial distortion correction.
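The per-landmark bookkeeping described above could be sketched roughly as follows. The data structures (a per-landmark list of camera indices and a table of camera centers) and the θ_min value taken from Fig. 1 are assumptions made for illustration, and the re-projection check is omitted.

```python
import numpy as np

def parallax_angle(X, C_a, C_b):
    """Angle (degrees) at the 3D point X between the rays to the camera
    centers C_a and C_b, used to decide whether a new view adds enough
    baseline for a reliable triangulation."""
    r_a = C_a - X
    r_b = C_b - X
    cos_t = np.dot(r_a, r_b) / (np.linalg.norm(r_a) * np.linalg.norm(r_b))
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

def update_landmark(cam_list, centers, X_new, n, theta_min=5.0):
    """Per-landmark update sketch: keep the freshly triangulated point and
    record the current view only if the parallax with respect to the last
    camera in cam_list exceeds theta_min; otherwise keep the old point.

    cam_list: list of camera indices already used for this landmark.
    centers:  mapping from camera index to its 3D center.
    X_new:    the newly triangulated 3D point; n: the current camera index.
    """
    if parallax_angle(X_new, centers[cam_list[-1]], centers[n]) > theta_min:
        cam_list.append(n)
        return X_new, True       # accept the new triangulation
    return None, False           # reject; the previous estimate is kept
```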
V. EXPERIMENTS
A. System configuration
We tested our system using a Point Grey Ladybug 2 camera
mounted on a vehicle, as illustrated in figure 3. The SDK
was used to rectify the six fisheye cameras and also
provided the relative orientation of the cameras. Each image
has a resolution of 768 × 1024. Only the first five cameras were
actually used. The images also show a large amount of clutter.
B. Results
Two sequences, of 1 km and 2.5 km respectively, and
both containing loops, were tested. The frame rate of the
camera during each experiment was set to 3 and 10 images
per second. A GPS unit was used and synchronized with the
image acquisition. It was used to test the accuracy of the
