Semi-Dense Visual Odometry for a Monocular Camera

Jakob Engel, Jürgen Sturm, Daniel Cremers
TU München, Germany
Abstract

We propose a fundamentally novel approach to real-time visual odometry for a monocular camera. It allows us to benefit from the simplicity and accuracy of dense tracking, which does not depend on visual features, while running in real-time on a CPU. The key idea is to continuously estimate a semi-dense inverse depth map for the current frame, which in turn is used to track the motion of the camera using dense image alignment. More specifically, we estimate the depth of all pixels which have a non-negligible image gradient. Each estimate is represented as a Gaussian probability distribution over the inverse depth. We propagate this information over time, and update it with new measurements as new images arrive. In terms of tracking accuracy and computational speed, the proposed method compares favorably to both state-of-the-art dense and feature-based visual odometry and SLAM algorithms. As our method runs in real-time on a CPU, it is of great practical value for robotics and augmented reality applications.
1. Towards Dense Monocular Visual Odometry

Tracking a hand-held camera and recovering the three-dimensional structure of the environment in real-time is among the most prominent challenges in computer vision. In recent years, dense approaches to these challenges have become increasingly popular: instead of operating solely on visual feature positions, they reconstruct and track on the whole image using a surface-based map, and thereby are fundamentally different from feature-based approaches. Yet, these methods are to date either not real-time capable on standard CPUs [11, 15, 17] or require direct depth measurements from the sensor [7], making them unsuitable for many practical applications.

In this paper, we propose a novel semi-dense visual odometry approach for a monocular camera, which combines the accuracy and robustness of dense approaches with the efficiency of feature-based methods. Further, it computes highly accurate semi-dense depth maps from the monocular images, providing rich information about the 3D structure of the environment. We use the term visual odometry as opposed to SLAM because, for simplicity, we deliberately maintain only information about the currently visible scene, instead of building a global world-model.

(This work was supported by the ERC Starting Grant ConvexVision and the DFG project Mapping on Demand.)

Figure 1. Semi-Dense Monocular Visual Odometry: Our approach works on a semi-dense inverse depth map and combines the accuracy and robustness of dense visual SLAM methods with the efficiency of feature-based techniques. Left: video frame. Right: color-coded semi-dense depth map (far to close), which consists of depth estimates in all image regions with sufficient structure.
1.1. Related Work

Feature-based monocular SLAM. In all feature-based methods (such as [4, 8]), tracking and mapping consist of two separate steps: First, discrete feature observations (i.e., their locations in the image) are extracted and matched to each other. Second, the camera and the full feature poses are calculated from a set of such observations, disregarding the images themselves. While this preliminary abstraction step greatly reduces the complexity of the overall problem and allows it to be tackled in real time, it inherently comes with two significant drawbacks: First, only image information conforming to the respective feature type and parametrization (typically image corners and blobs [6] or line segments [9]) is utilized. Second, features have to be matched to each other, which often requires the costly computation of scale- and rotation-invariant descriptors and robust outlier estimation methods like RANSAC.

Dense monocular SLAM. To overcome these limitations and to better exploit the available image information, dense monocular SLAM methods [11, 17] have recently been proposed. The fundamental difference to keypoint-based approaches is that these methods directly work on the images

instead of a set of extracted features, for both mapping and tracking: The world is modeled as a dense surface, while in turn new frames are tracked using whole-image alignment. This concept removes the need for discrete features and allows all information present in the image to be exploited, increasing tracking accuracy and robustness. To date, however, doing this in real-time is only possible using modern, powerful GPU processors.

Similar methods are broadly used in combination with RGB-D cameras [7], which directly measure the depth of each pixel, or stereo camera rigs [3], greatly reducing the complexity of the problem.

Dense multi-view stereo. Significant prior work exists on multi-view dense reconstruction, both in a real-time setting [13, 11, 15], as well as off-line [5, 14]. In particular for off-line reconstruction, there is a long history of using different baselines to steer the stereo-inherent trade-off between accuracy and precision [12]. Most similar to our approach is the early work of Matthies et al., who proposed probabilistic depth map fusion and propagation for image sequences [10], however only for structure from motion, i.e., not coupled with subsequent dense tracking.
1.2. Contributions

In this paper, we propose a novel semi-dense approach to monocular visual odometry, which does not require feature points. The key concepts are

• a probabilistic depth map representation,
• tracking based on whole-image alignment,
• the restriction to image regions which carry information (semi-dense), and
• the full incorporation of stereo measurement uncertainty.

To the best of our knowledge, this is the first featureless monocular visual odometry approach which runs in real-time on a CPU.
1.3. Method Outline

Our approach is partially motivated by the basic principle that for most real-time applications, video information is abundant and cheap to come by. Therefore, the computational budget should be spent such that the expected information gain is maximized. Instead of reducing the images to a sparse set of feature observations, however, our method continuously estimates a semi-dense inverse depth map for the current frame, i.e., a dense depth map covering all image regions with non-negligible gradient (see Fig. 2). It comprises one inverse depth hypothesis per pixel, modeled by a Gaussian probability distribution. This representation still allows us to use whole-image alignment [7] to track new frames, while at the same time greatly reducing computational complexity compared to volumetric methods. The estimated depth map is propagated from frame to frame, and updated with variable-baseline stereo comparisons. We explicitly use prior knowledge about a pixel's depth to select a suitable reference frame on a per-pixel basis, and to limit the disparity search range.

Figure 2. Semi-Dense Approach: Our approach reconstructs and tracks on a semi-dense inverse depth map, which is dense in all image regions carrying information (top-right). For comparison, the bottom row shows the respective result from a keypoint-based approach [8], a fully dense approach [11] and the ground truth from an RGB-D camera [16].
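To make this representation concrete, the following is a minimal Python/NumPy sketch of such a per-pixel Gaussian inverse-depth map. It is not the authors' implementation; the field names and the gradient threshold are illustrative assumptions.

```python
import numpy as np

class SemiDenseDepthMap:
    """Minimal sketch: one Gaussian inverse-depth hypothesis per pixel.

    Pixels without a hypothesis are marked invalid; only pixels with a
    non-negligible image gradient are ever given a hypothesis.
    """

    def __init__(self, height, width, grad_threshold=5.0):
        self.d = np.zeros((height, width))             # mean inverse depth
        self.var = np.full((height, width), np.inf)    # variance of the hypothesis
        self.valid = np.zeros((height, width), dtype=bool)
        self.grad_threshold = grad_threshold           # illustrative value

    def usable_pixels(self, image):
        """Mask of pixels with sufficient image gradient to be worth estimating."""
        gy, gx = np.gradient(image.astype(np.float64))
        grad_mag = np.hypot(gx, gy)
        return grad_mag > self.grad_threshold
```

Only pixels passing such a gradient test ever receive a hypothesis, which is what makes the map semi-dense.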
The remainder of this paper is organized as follows: Section 2 describes the semi-dense mapping part of the proposed method, including the derivation of the observation accuracy as well as the probabilistic data fusion, propagation and regularization steps. Section 3 describes how new frames are tracked using whole-image alignment, and Sec. 4 summarizes the complete visual odometry method. A qualitative as well as a quantitative evaluation is presented in Sec. 5. We then give a brief conclusion in Sec. 6.
2. Semi-Dense Depth Map Estimation

One of the key ideas proposed in this paper is to estimate a semi-dense inverse depth map for the current camera image, which in turn can be used for estimating the camera pose of the next frame. This depth map is continuously propagated from frame to frame, and refined with new stereo depth measurements, which are obtained by performing per-pixel, adaptive-baseline stereo comparisons. This allows us to accurately estimate the depth both of close-by and far-away image regions. In contrast to previous work that accumulates the photometric cost over a sequence of several frames [11, 15], we keep exactly one inverse depth hypothesis per pixel, which we represent as a Gaussian probability distribution.

Figure 3. Variable Baseline Stereo: Reference image (left), three stereo images at different baselines (right), and the respective matching cost functions (cost plotted over the inverse depth $d$). While a small baseline (black) gives a unique, but imprecise minimum, a large baseline (red) allows for a very precise estimate, but has many false minima.
This section comprises three main parts: Section 2.1 describes the stereo method used to extract new depth measurements from previous frames, and how they are incorporated into the prior depth map. In Sec. 2.2, we describe how the depth map is propagated from frame to frame. In Sec. 2.3, we detail how we partially regularize the obtained depth map in each iteration, and how outliers are handled. Throughout this section, $d$ denotes the inverse depth of a pixel.
2.1. Stereo-Based Depth Map Update

It is well known [12] that for stereo, there is a trade-off between precision and accuracy (see Fig. 3). While many multiple-baseline stereo approaches resolve this by accumulating the respective cost functions over many frames [5, 13], we propose a probabilistic approach which explicitly takes advantage of the fact that in a video, small-baseline frames are available before large-baseline frames.

The full depth map update (performed once for each new frame) consists of the following steps: First, a subset of pixels is selected for which the accuracy of a disparity search is sufficiently large. For this we use three intuitive and very efficiently computable criteria, which will be derived in Sec. 2.1.3. For each selected pixel, we then individually select a suitable reference frame, and perform a one-dimensional disparity search. Propagated prior knowledge is used to reduce the disparity search range when possible, decreasing computational cost and eliminating false minima. The obtained inverse depth estimate is then fused into the depth map.
2.1.1 Reference Frame Selection

Ideally, the reference frame is chosen such that it maximizes the stereo accuracy, while keeping the disparity search range as well as the observation angle sufficiently small. As the stereo accuracy depends on many factors and because this selection is done for each pixel independently, we employ the following heuristic: We use the oldest frame the pixel was observed in, where the disparity search range and the observation angle do not exceed a certain threshold (see Fig. 4). If a disparity search is unsuccessful (i.e., no good match is found), the pixel's "age" is increased, such that subsequent disparity searches use newer frames, where the pixel is more likely to be still visible.

Figure 4. Adaptive Baseline Selection: For each pixel in the new frame (top left), a different stereo-reference frame is selected, based on how long the pixel was visible (top right: the more yellow, the older the pixel). Some of the reference frames are displayed below; the red regions were used for stereo comparisons.
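A minimal sketch of this per-pixel reference-frame heuristic is given below; it assumes each candidate frame exposes a predicted disparity search range and observation angle for the pixel, and the attribute names and thresholds are illustrative rather than the authors' values.

```python
def select_reference_frame(candidate_frames, max_disparity_px=30.0, max_angle_deg=25.0):
    """Sketch of the per-pixel reference-frame heuristic (thresholds are illustrative).

    `candidate_frames` are the past frames in which the pixel was visible, ordered
    oldest first; each is assumed to expose the predicted disparity search range
    (in pixels) and the observation angle (in degrees) for this pixel under the
    current depth prior.
    """
    for frame in candidate_frames:  # prefer the oldest (largest-baseline) admissible frame
        if (frame.disparity_range_px <= max_disparity_px
                and frame.observation_angle_deg <= max_angle_deg):
            return frame
    return None  # no admissible frame; the caller falls back to a newer frame
```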
2.1.2 Stereo Matching Method

We perform an exhaustive search for the pixel's intensity along the epipolar line in the selected reference frame, and then perform a sub-pixel accurate localization of the matching disparity. If a prior inverse depth hypothesis is available, the search interval is limited by $d \pm 2\sigma_d$, where $d$ and $\sigma_d$ denote the mean and standard deviation of the prior hypothesis. Otherwise, the full disparity range is searched.

In our implementation, we use the SSD error over five equidistant points on the epipolar line: While this significantly increases robustness in high-frequency image regions, it does not change the purely one-dimensional nature of this search. Furthermore, it is computationally efficient, as 4 out of 5 interpolated image values can be re-used for each SSD evaluation.
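The search itself is a one-dimensional scan; the following sketch shows an exhaustive 5-point SSD search over positions sampled along the epipolar line, already restricted to the prior interval $d \pm 2\sigma_d$. For brevity it uses nearest-neighbor lookups instead of the interpolated values and sub-pixel refinement used in the paper.

```python
import numpy as np

def ssd_search_along_epipolar_line(ref_patch_vals, image, line_pts):
    """Exhaustive 1D search minimizing a 5-point SSD along a sampled epipolar line.

    ref_patch_vals : the 5 reference intensities (the pixel and 4 neighbors along the line).
    image          : reference image as a 2D array.
    line_pts       : (N, 2) array of (x, y) positions sampled along the epipolar line.
    Returns the index of the best match and its SSD cost.
    """
    xs = np.clip(np.round(line_pts[:, 0]).astype(int), 0, image.shape[1] - 1)
    ys = np.clip(np.round(line_pts[:, 1]).astype(int), 0, image.shape[0] - 1)
    vals = image[ys, xs].astype(np.float64)            # intensities along the line

    best_idx, best_cost = -1, np.inf
    for i in range(2, len(vals) - 2):
        # 5 equidistant points; 4 of the 5 values are re-used by the next evaluation
        window = vals[i - 2:i + 3]
        cost = np.sum((window - ref_patch_vals) ** 2)
        if cost < best_cost:
            best_idx, best_cost = i, cost
    return best_idx, best_cost
```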
2.1.3 Uncertainty Estimation

In this section, we use uncertainty propagation to derive an expression for the error variance $\sigma_d^2$ on the inverse depth $d$.

In general, this can be done by expressing the optimal inverse depth $d^*$ as a function of the noisy inputs: here we consider the images $I_0$, $I_1$ themselves, their relative orientation $\xi$ and the camera calibration in terms of a projection function¹ $\pi$:

$$d^* = d(I_0, I_1, \xi, \pi). \qquad (1)$$

The error-variance of $d^*$ is then given by

$$\sigma_d^2 = J_d \Sigma J_d^T, \qquad (2)$$

where $J_d$ is the Jacobian of $d$, and $\Sigma$ the covariance of the input-error. For more details on covariance propagation, including the derivation of this formula, we refer to [2]. For simplicity, the following analysis is performed for patch-free stereo, i.e., we consider only a point-wise search for a single intensity value along the epipolar line.
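Equation (2) is standard first-order covariance propagation; as a tiny numerical illustration, with made-up numbers for the Jacobian and input covariance:

```python
import numpy as np

# First-order uncertainty propagation, Eq. (2): sigma_d^2 = J_d * Sigma * J_d^T.
# The Jacobian and input covariance below are made-up numbers, for illustration only.
J_d = np.array([[0.8, -0.3, 1.5]])          # 1x3 Jacobian of d* w.r.t. the noisy inputs
Sigma = np.diag([0.01, 0.02, 0.005])        # covariance of the input errors
sigma_d_sq = (J_d @ Sigma @ J_d.T).item()   # scalar variance of the inverse depth estimate
print(sigma_d_sq)                           # 0.01945
```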
For this analysis, we split the computation into three steps: First, the epipolar line in the reference frame is computed. Second, the best matching position $\lambda^* \in \mathbb{R}$ along it (i.e., the disparity) is determined. Third, the inverse depth $d^*$ is computed from the disparity $\lambda^*$. The first two steps involve two independent error sources: the geometric error, which originates from noise on $\xi$ and $\pi$ and affects the first step, and the photometric error, which originates from noise in the images $I_0$, $I_1$ and affects the second step. The third step scales these errors by a factor, which depends on the baseline.
Geometric disparity error. The geometric error is the error $\epsilon_\lambda$ on the disparity $\lambda^*$ caused by noise on $\xi$ and $\pi$. While it would be possible to model, propagate, and estimate the complete covariance on $\xi$ and $\pi$, we found that the gain in accuracy does not justify the increase in computational complexity. We therefore use an intuitive approximation: Let the considered epipolar line segment $L \subset \mathbb{R}^2$ be defined by

$$L := \left\{\, l_0 + \lambda \begin{pmatrix} l_x \\ l_y \end{pmatrix} \;\middle|\; \lambda \in S \,\right\}, \qquad (3)$$

where $\lambda$ is the disparity with search interval $S$, $(l_x, l_y)^T$ the normalized epipolar line direction and $l_0$ the point corresponding to infinite depth. We now assume that only the absolute position of this line segment, i.e., $l_0$, is subject to isotropic Gaussian noise $\epsilon_l$. As in practice we keep the searched epipolar line segments short, the influence of rotational error is small, making this a good approximation.
Intuitively, a positioning error $\epsilon_l$ on the epipolar line causes a small disparity error $\epsilon_\lambda$ if the epipolar line is parallel to the image gradient, and a large one otherwise (see Fig. 5). This can be mathematically derived as follows: The image constrains the optimal disparity $\lambda^*$ to lie on a certain isocurve, i.e., a curve of equal intensity. We approximate this isocurve to be locally linear, i.e., the gradient direction to be locally constant. This gives

$$l_0 + \lambda \begin{pmatrix} l_x \\ l_y \end{pmatrix} \overset{!}{=} g_0 + \gamma \begin{pmatrix} -g_y \\ g_x \end{pmatrix}, \quad \gamma \in \mathbb{R}, \qquad (4)$$

where $g := (g_x, g_y)$ is the image gradient and $g_0$ a point on the isoline. The influence of noise on the image values will be derived in the next paragraph; hence at this point $g$ and $g_0$ are assumed noise-free. Solving for $\lambda$ gives the optimal disparity $\lambda^*$ in terms of the noisy input $l_0$:

$$\lambda^*(l_0) = \frac{\langle g,\, g_0 - l_0 \rangle}{\langle g,\, l \rangle}. \qquad (5)$$

¹ In the linear case, this is the camera matrix $K$; in practice, however, nonlinear distortion and other (unmodeled) effects also play a role.

Figure 5. Geometric Disparity Error: Influence of a small positioning error $\epsilon_l$ of the epipolar line on the disparity error $\epsilon_\lambda$. The dashed line represents the isocurve on which the matching point has to lie. The error $\epsilon_\lambda$ is small if the epipolar line is parallel to the image gradient (left), and large otherwise (right).
Analogously to (2), the variance of the geometric disparity error can then be expressed as

$$\sigma_{\lambda(\xi,\pi)}^2 = J_{\lambda^*(l_0)} \begin{pmatrix} \sigma_l^2 & 0 \\ 0 & \sigma_l^2 \end{pmatrix} J_{\lambda^*(l_0)}^T = \frac{\sigma_l^2}{\langle g,\, l \rangle^2}, \qquad (6)$$

where $g$ is the normalized image gradient, $l$ the normalized epipolar line direction and $\sigma_l^2$ the variance of $\epsilon_l$. Note that this error term solely originates from noise on the relative camera orientation $\xi$ and the camera calibration $\pi$, i.e., it is independent of image intensity noise.
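A direct transcription of Eq. (6) into a small helper; the normalization and the degenerate-case handling are the only additions.

```python
import numpy as np

def geometric_disparity_variance(grad, epi_dir, sigma_l_sq):
    """Eq. (6): variance of the geometric disparity error.

    grad      : image gradient at the pixel (normalized internally).
    epi_dir   : epipolar line direction (normalized internally).
    sigma_l_sq: variance of the positioning error of the epipolar line.
    The variance blows up when the gradient is (nearly) perpendicular to the
    epipolar line, which is exactly why such pixels are skipped.
    """
    g = np.asarray(grad, dtype=np.float64)
    l = np.asarray(epi_dir, dtype=np.float64)
    g = g / np.linalg.norm(g)
    l = l / np.linalg.norm(l)
    dot = float(np.dot(g, l))
    if abs(dot) < 1e-6:                      # gradient almost perpendicular to the line
        return np.inf
    return sigma_l_sq / dot**2
```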
Photometric disparity error. Intuitively, this error encodes that small image intensity errors have a large effect on the estimated disparity if the image gradient is small, and a small effect otherwise (see Fig. 6). Mathematically, this relation can be derived as follows. We seek the disparity $\lambda^*$ that minimizes the difference in intensities, i.e.,

$$\lambda^* = \min_\lambda \left( i_{\mathrm{ref}} - I_p(\lambda) \right)^2, \qquad (7)$$

where $i_{\mathrm{ref}}$ is the reference intensity, and $I_p(\lambda)$ the image intensity on the epipolar line at disparity $\lambda$. We assume a good initialization $\lambda_0$ to be available from the exhaustive search. Using a first-order Taylor approximation for $I_p$ gives

$$\lambda^*(I) = \lambda_0 + \left( i_{\mathrm{ref}} - I_p(\lambda_0) \right) g_p^{-1}, \qquad (8)$$

where $g_p$ is the gradient of $I_p$, that is, the image gradient along the epipolar line. For clarity we only consider noise on $i_{\mathrm{ref}}$ and $I_p(\lambda_0)$; equivalent results are obtained in the general case when taking into account noise on the image values involved in the computation of $g_p$.

Figure 6. Photometric Disparity Error: Noise $\epsilon_i$ on the image intensity values causes a small disparity error $\epsilon_\lambda$ if the image gradient along the epipolar line is large (left). If the gradient is small, the disparity error is magnified (right).
The variance of the photometric disparity error is given by

$$\sigma_{\lambda(I)}^2 = J_{\lambda^*(I)} \begin{pmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2 \end{pmatrix} J_{\lambda^*(I)}^T = \frac{2\sigma_i^2}{g_p^2}, \qquad (9)$$

where $\sigma_i^2$ is the variance of the image intensity noise. The respective error originates solely from noisy image intensity values, and hence is independent of the geometric disparity error.
Pixel to inverse depth conversion. Using the fact that, for small camera rotation, the inverse depth $d$ is approximately proportional to the disparity $\lambda$, the observation variance of the inverse depth $\sigma_{d,\mathrm{obs}}^2$ can be calculated using

$$\sigma_{d,\mathrm{obs}}^2 = \alpha^2 \left( \sigma_{\lambda(\xi,\pi)}^2 + \sigma_{\lambda(I)}^2 \right), \qquad (10)$$

where the proportionality constant $\alpha$, which in the general, non-rectified case is different for each pixel, can be calculated from

$$\alpha := \frac{\delta_d}{\delta_\lambda}, \qquad (11)$$

where $\delta_d$ is the length of the searched inverse depth interval, and $\delta_\lambda$ the length of the searched epipolar line segment. While $\alpha$ is inversely linear in the length of the camera translation, it also depends on the translation direction and the pixel's location in the image.

When using an SSD error over multiple points along the epipolar line, as our implementation does, a good upper bound for the matching uncertainty is then given by

$$\sigma_{d,\mathrm{obs\text{-}SSD}}^2 \le \alpha^2 \left( \min\{\sigma_{\lambda(\xi,\pi)}^2\} + \min\{\sigma_{\lambda(I)}^2\} \right), \qquad (12)$$

where the min goes over all points included in the SSD error.
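Combining Eqs. (6), (9) and (10), the predicted observation variance for a single-point match can be computed as in the following sketch; handling of zero gradients is omitted for brevity.

```python
def inverse_depth_observation_variance(sigma_l_sq, g_dot_l, sigma_i_sq, g_p, alpha):
    """Eqs. (6), (9), (10): observation variance of the inverse depth.

    sigma_l_sq : variance of the epipolar line positioning error (geometric term).
    g_dot_l    : <g, l>, dot product of normalized gradient and epipolar direction.
    sigma_i_sq : variance of the image intensity noise (photometric term).
    g_p        : image gradient along the epipolar line.
    alpha      : pixel-to-inverse-depth ratio, Eq. (11).
    """
    var_geometric = sigma_l_sq / g_dot_l**2               # Eq. (6)
    var_photometric = 2.0 * sigma_i_sq / g_p**2           # Eq. (9)
    return alpha**2 * (var_geometric + var_photometric)   # Eq. (10)
```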
2.1.4 Depth Observation Fusion

After a depth observation for a pixel in the current image has been obtained, we integrate it into the depth map as follows: If no prior hypothesis for a pixel exists, we initialize it directly with the observation. Otherwise, the new observation is incorporated into the prior, i.e., the two distributions are multiplied (corresponding to the update step in a Kalman filter): Given a prior distribution $\mathcal{N}(d_p, \sigma_p^2)$ and a noisy observation $\mathcal{N}(d_o, \sigma_o^2)$, the posterior is given by

$$\mathcal{N}\!\left( \frac{\sigma_p^2 d_o + \sigma_o^2 d_p}{\sigma_p^2 + \sigma_o^2},\; \frac{\sigma_p^2 \sigma_o^2}{\sigma_p^2 + \sigma_o^2} \right). \qquad (13)$$
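Equation (13) is the standard product of two Gaussians; a direct transcription:

```python
def fuse_hypotheses(d_prior, var_prior, d_obs, var_obs):
    """Eq. (13): product of two Gaussian inverse-depth hypotheses (Kalman-style update).

    Returns the posterior mean and variance. If no prior exists, the caller simply
    initializes the hypothesis with the observation instead.
    """
    denom = var_prior + var_obs
    d_post = (var_prior * d_obs + var_obs * d_prior) / denom
    var_post = (var_prior * var_obs) / denom
    return d_post, var_post
```

For example, fusing a prior $\mathcal{N}(0.5, 0.04)$ with an observation $\mathcal{N}(0.6, 0.01)$ yields a posterior mean of 0.58 and a variance of 0.008, i.e., the posterior is pulled toward the more certain estimate and is more confident than either input.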
2.1.5 Summary of Uncertainty-Aware Stereo

New stereo observations are obtained on a per-pixel basis, adaptively selecting for each pixel a suitable reference frame and performing a one-dimensional search along the epipolar line. We identified the three major factors which determine the accuracy of such a stereo observation, i.e.,

• the photometric disparity error $\sigma_{\lambda(I)}^2$, depending on the magnitude of the image gradient along the epipolar line,
• the geometric disparity error $\sigma_{\lambda(\xi,\pi)}^2$, depending on the angle between the image gradient and the epipolar line (independent of the gradient magnitude), and
• the pixel to inverse depth ratio $\alpha$, depending on the camera translation, the focal length and the pixel's position.

These three simple-to-compute and purely local criteria are used to determine for which pixel a stereo update is worth the computational cost (a minimal sketch of such a check is given below). Further, the computed observation variance is then used to integrate the new measurements into the existing depth map.
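The following sketch illustrates such a per-pixel check; the three thresholds are illustrative assumptions, not the values used by the authors.

```python
def worth_stereo_update(g_dot_l, g_p, alpha,
                        min_g_dot_l=0.1, min_g_p=3.0, max_alpha=0.5):
    """Sketch: per-pixel check of the three local criteria of Sec. 2.1.5.

    A stereo update is attempted only if the geometric term (angle between gradient
    and epipolar line), the photometric term (gradient magnitude along the line) and
    the pixel-to-inverse-depth ratio all predict a sufficiently accurate observation.
    """
    return (abs(g_dot_l) >= min_g_dot_l   # geometric error stays bounded, Eq. (6)
            and abs(g_p) >= min_g_p       # photometric error stays bounded, Eq. (9)
            and alpha <= max_alpha)       # disparity errors are not over-amplified, Eq. (10)
```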
2.2. Depth Map Propagation

We continuously propagate the estimated inverse depth map from frame to frame, once the camera position of the next frame has been estimated. Based on the inverse depth estimate $d_0$ for a pixel, the corresponding 3D point is calculated and projected into the new frame, providing an inverse depth estimate $d_1$ in the new frame. The hypothesis is then assigned to the closest integer pixel position; to eliminate discretization errors, the sub-pixel accurate image location of the projected point is kept, and re-used for the next propagation step.

For propagating the inverse depth variance, we assume the camera rotation to be small. The new inverse depth $d_1$ can then be approximated by

$$d_1(d_0) = \left( d_0^{-1} - t_z \right)^{-1}, \qquad (14)$$

where $t_z$ is the camera translation along the optical axis. The variance of $d_1$ is hence given by

$$\sigma_{d_1}^2 = J_{d_1} \sigma_{d_0}^2 J_{d_1}^T + \sigma_p^2 = \left( \frac{d_1}{d_0} \right)^4 \sigma_{d_0}^2 + \sigma_p^2, \qquad (15)$$

where $\sigma_p^2$ is the prediction uncertainty, which directly corresponds to the prediction step in an extended Kalman filter. It can also be interpreted as keeping the variance on
