Proceedings ArticleDOI

SVO: Fast semi-direct monocular visual odometry

29 Sep 2014-pp 15-22
TL;DR: A semi-direct monocular visual odometry algorithm is proposed that is precise, robust, and faster than current state-of-the-art methods, and is applied to micro-aerial-vehicle state-estimation in GPS-denied environments.
Abstract: We propose a semi-direct monocular visual odometry algorithm that is precise, robust, and faster than current state-of-the-art methods. The semi-direct approach eliminates the need for costly feature extraction and robust matching techniques for motion estimation. Our algorithm operates directly on pixel intensities, which results in subpixel precision at high frame-rates. A probabilistic mapping method that explicitly models outlier measurements is used to estimate 3D points, which results in fewer outliers and more reliable points. Precise and high frame-rate motion estimation brings increased robustness in scenes of little, repetitive, and high-frequency texture. The algorithm is applied to micro-aerial-vehicle state-estimation in GPS-denied environments and runs at 55 frames per second on the onboard embedded computer and at more than 300 frames per second on a consumer laptop. We call our approach SVO (Semi-direct Visual Odometry) and release our implementation as open-source software.

Summary (4 min read)

Introduction

  • Precise and high frame-rate motion estimation brings increased robustness in scenes of little, repetitive, and high-frequency texture.
  • Precise fully autonomous operation requires MAVs to rely on alternative localization systems.

A. Taxonomy of Visual Motion Estimation Methods

  • Methods that simultaneously recover camera pose and scene structure from video can be divided into two classes: feature-based methods and direct methods.
  • The majority of VO algorithms [12] follows the feature-based procedure (extract salient features, match them in successive frames, robustly recover camera motion and structure using epipolar geometry, and refine the pose and structure through reprojection error minimization), independent of the applied optimization framework.
  • A reason for the success of these methods is the availability of robust feature detectors and descriptors that allow matching between images even at large inter-frame movement.
  • Since direct methods operate directly on the intensity values of the image, the time for feature detection and invariant descriptor computation can be saved.

C. Contributions and Outline

  • The proposed Semi-Direct Visual Odometry (SVO) algorithm uses feature-correspondence; however, feature-correspondence is an implicit result of direct motion estimation rather than of explicit feature extraction and matching.
  • In contrast to previous direct methods, the authors use many (hundreds of) small patches rather than few (tens of) large planar patches [18]–[21].
  • A Bayesian filter that explicitly models outlier measurements is used to estimate the depth at feature locations.
  • Section II provides an overview of the pipeline and Section III, thereafter, introduces some required notation.
  • Sections IV and V explain the proposed motion-estimation and mapping algorithms.

II. SYSTEM OVERVIEW

  • The algorithm uses two parallel threads (as in [16]), one for estimating the camera motion, and a second one for mapping as the environment is being explored.
  • This separation allows fast and constant-time tracking in one thread, while the second thread extends the map, decoupled from hard real-time constraints.
  • The 2D coordinates corresponding to the reprojected points are refined in the next step through alignment of the corresponding feature-patches.
  • Motion estimation concludes by refining the pose and the structure through minimizing the reprojection error introduced in the previous feature-alignment step.
  • New depth-filters are initialised whenever a new keyframe is selected in regions of the image where few 3D-to-2D correspondences are found.

III. NOTATION

  • Before the algorithm is detailed, the authors briefly define the notation that is used throughout the paper.
  • The projection π is determined by the intrinsic camera parameters which are known from calibration.
  • The camera position and orientation at timestep k is expressed with the rigid-body transformation Tk,w ∈ SE(3).
  • During the optimization, the authors need a minimal representation of the transformation and, therefore, use the Lie algebra se(3) corresponding to the tangent space of SE(3) at the identity.
  • The authors denote the algebra elements—also named twist coordinates—with ξ = (ω, ν)^T ∈ R^6, where ω is called the angular velocity and ν the linear velocity.

A. Sparse Model-based Image Alignment

  • The maximum likelihood estimate of the rigid body transformation Tk,k−1 between two consecutive camera poses minimizes the negative log-likelihood of the intensity residuals: Tk,k−1 = argmin_T ∫∫_R̄ ρ[δI(T, u)] du (Eq. 4), where δI denotes the photometric intensity residual.
  • Assuming normally distributed residuals with unit variance, the negative log-likelihood minimizer corresponds to the least squares problem with ρ[·] := ½‖·‖².
  • In practice, the distribution has heavier tails due to occlusions and thus a robust cost function must be applied [10].
  • The authors denote small patches of 4 × 4 pixels around the feature point with the vector I(ui).
  • The authors use the inverse compositional formulation [27] of the intensity residual, which computes the update step T(ξ) for the reference image at time k−1 (Eq. 8).

B. Relaxation Through Feature Alignment

  • The last step aligned the camera with respect to the previous frame.
  • Through back-projection, the found relative pose Tk,k−1 implicitly defines an initial guess for the feature positions of all visible 3D points in the new image.
  • To reduce the drift, the camera pose should be aligned with respect to the map, rather than to the previous frame.
  • For each reprojected point, the keyframe r that observes the point with the closest observation angle is identified.
  • This step can be understood as a relaxation step that violates the epipolar constraints to achieve a higher correlation between the feature-patches.

C. Pose and Structure Refinement

  • In the previous step, the authors have established feature correspondence with subpixel accuracy at the cost of violating the epipolar constraints.
  • This is the well known problem of motion-only BA [17] and can efficiently be solved using an iterative non-linear least squares minimization algorithm such as Gauss Newton.
  • Finally, it is possible to apply local BA, in which both the pose of all close keyframes as well as the observed 3D points are jointly optimized.
  • The BA step is omitted in the fast parameter settings of the algorithm (Section VII).

D. Discussion

  • The first (Section IV-A) and the last (Section IV-C) optimization of the algorithm seem to be redundant as both optimize the 6 DoF pose of the camera.
  • Indeed, one could directly start with the second step and establish feature-correspondence through Lucas-Kanade tracking [27] of all feature-patches, followed by nonlinear pose refinement (Section IV-C).
  • While this would work, the processing time would be higher.
  • In SVO however, feature alignment is efficiently initialized by only optimizing six parameters—the camera pose—in the sparse image alignment step.
  • The authors found empirically that using the first step only results in significantly more drift compared to using all three steps together.

V. MAPPING

  • Given an image and its pose {Ik,Tk,w}, the mapping thread estimates the depth of 2D features for which the corresponding 3D point is not yet known.
  • The proposed depth estimation is very efficient when only a small range around the current depth estimate on the epipolar line is searched; in their case the range corresponds to twice the standard deviation of the current depth estimate (see the sketch after this list).
  • Subsequently, the optimization is initialized at the next finer level.
  • When a new keyframe is inserted in the map, the keyframe farthest apart from the current position of the camera is removed.
  • The same grid is also used for reprojecting the map before feature alignment.
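The bounded epipolar search mentioned in the list above can be made concrete with a short sketch. This is an illustration only (Python/NumPy, not the authors' C++ implementation); the pinhole intrinsic matrix K, the transform T_cur_ref from the reference keyframe to the current frame, and the function names are assumptions.

```python
import numpy as np

def project(K, p):
    """Pinhole projection of a 3D point p (camera frame) to pixel coordinates."""
    uvw = K @ p
    return uvw[:2] / uvw[2]

def epipolar_search_segment(K, T_cur_ref, u_ref, depth_mean, depth_sigma):
    """Endpoints of the epipolar search segment in the current image for a feature at
    pixel u_ref in the reference keyframe, limited to the current depth estimate
    plus/minus two standard deviations."""
    f_ref = np.linalg.inv(K) @ np.array([u_ref[0], u_ref[1], 1.0])  # bearing vector in the keyframe
    endpoints = []
    for d in (depth_mean - 2.0 * depth_sigma, depth_mean + 2.0 * depth_sigma):
        p_ref = f_ref * max(d, 1e-3)                     # 3D hypothesis at the depth bound
        p_cur = (T_cur_ref @ np.append(p_ref, 1.0))[:3]  # expressed in the current camera frame
        endpoints.append(project(K, p_cur))
    return endpoints  # the best patch match is then searched only along this short segment
```

Restricting the correlation search to this short segment is what keeps each per-frame depth-filter update cheap.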

VII. EXPERIMENTAL RESULTS

  • Experiments were performed on datasets recorded from a downward-looking camera1 attached to a MAV and sequences from a handheld camera.
  • The video was processed on both a laptop2 and on an embedded platform3 that is mounted on the MAV .
  • Note that at most two CPU cores are used for the algorithm.
  • On the embedded platform, only the fast parameter setting is used.
  • The authors compare the performance of SVO with the modified PTAM algorithm of [2].

A. Accuracy

  • The ground-truth for the trajectory originates from a motion capture system.
  • In order to generate the plots, the authors aligned the first 10 frames with the ground-truth using [31].
  • In [2], the use of lower resolution images is motivated by the fact that high-frequency self-similar texture in the image results in too many outlier 3D points.
  • The difference in accuracy between the fast and accurate parameter setting is not significant.
  • Optimizing the pose and the observed 3D points separately at every iteration (fast parameter setting) is accurate enough for MAV motion estimation.

B. Runtime Evaluation

  • Figures 13 and 14 show a break-up of the time required to compute the camera motion on the specified laptop and embedded platform respectively with the fast-parameter setting.
  • The laptop is capable of processing the frames at more than 300 frames per second (fps), while the embedded platform runs at 55 fps.
  • The corresponding frame rates for PTAM are 91 fps and 27 fps, respectively.
  • The reason why the authors can reliably track the camera with fewer features is the use of depth-filters, which ensures that the features being tracked are reliable.
  • The time required by the mapping thread to update all depth-filters with the new frame is highly dependent on the number of filters.

VIII. CONCLUSION

  • The authors proposed the semi-direct VO pipeline “SVO” that is precise and faster than the current state-of-the-art.
  • The gain in speed is due to the fact that feature-extraction and matching is not required for motion estimation.
  • Instead, a direct method is used, which is based directly on the image intensities.
  • The algorithm is particularly useful for state-estimation onboard MAVs as it runs at more than 50 frames per second on current embedded computers.
  • High frame-rate motion estimation, combined with an outlier-resistant probabilistic mapping method, provides increased robustness in scenes of little, repetitive, and high-frequency texture.


Zurich Open Repository and Archive
University of Zurich
University Library
Strickhofstrasse 39
CH-8057 Zurich
www.zora.uzh.ch
Year: 2014

SVO: fast semi-direct monocular visual odometry
Forster, Christian; Pizzoli, Matia; Scaramuzza, Davide

Abstract: We propose a semi-direct monocular visual odometry algorithm that is precise, robust, and faster than current state-of-the-art methods. The semi-direct approach eliminates the need for costly feature extraction and robust matching techniques for motion estimation. Our algorithm operates directly on pixel intensities, which results in subpixel precision at high frame-rates. A probabilistic mapping method that explicitly models outlier measurements is used to estimate 3D points, which results in fewer outliers and more reliable points. Precise and high frame-rate motion estimation brings increased robustness in scenes of little, repetitive, and high-frequency texture. The algorithm is applied to micro-aerial-vehicle state-estimation in GPS-denied environments and runs at 55 frames per second on the onboard embedded computer and at more than 300 frames per second on a consumer laptop. We call our approach SVO (Semi-direct Visual Odometry) and release our implementation as open-source software.

DOI: https://doi.org/10.1109/ICRA.2014.6906584
Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-125453
Conference or Workshop Item
Accepted Version

Originally published at:
Forster, Christian; Pizzoli, Matia; Scaramuzza, Davide (2014). SVO: fast semi-direct monocular visual odometry. In: IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, 31 May 2014 - 7 June 2014. Institute of Electrical and Electronics Engineers, 15-22.
DOI: https://doi.org/10.1109/ICRA.2014.6906584

SVO: Fast Semi-Direct Monocular Visual Odometry
Christian Forster, Matia Pizzoli, Davide Scaramuzza
Abstract: We propose a semi-direct monocular visual odometry algorithm that is precise, robust, and faster than current state-of-the-art methods. The semi-direct approach eliminates the need for costly feature extraction and robust matching techniques for motion estimation. Our algorithm operates directly on pixel intensities, which results in subpixel precision at high frame-rates. A probabilistic mapping method that explicitly models outlier measurements is used to estimate 3D points, which results in fewer outliers and more reliable points. Precise and high frame-rate motion estimation brings increased robustness in scenes of little, repetitive, and high-frequency texture. The algorithm is applied to micro-aerial-vehicle state-estimation in GPS-denied environments and runs at 55 frames per second on the onboard embedded computer and at more than 300 frames per second on a consumer laptop. We call our approach SVO (Semi-direct Visual Odometry) and release our implementation as open-source software.
I. INTRODUCTION
Micro Aerial Vehicles (MAVs) will soon play a major role in disaster management, industrial inspection and environment conservation. For such operations, navigating based on GPS information only is not sufficient. Precise fully autonomous operation requires MAVs to rely on alternative localization systems. For minimal weight and power consumption it was therefore proposed [1]–[5] to use only a single downward-looking camera in combination with an Inertial Measurement Unit. This setup allowed fully autonomous way-point following in outdoor areas [1]–[3] and collaboration between MAVs and ground robots [4], [5].

To our knowledge, all monocular Visual Odometry (VO) systems for MAVs [1], [2], [6], [7] are feature-based. In RGB-D and stereo-based SLAM systems, however, direct methods [8]–[11]—based on photometric error minimization—are becoming increasingly popular.

In this work, we propose a semi-direct VO that combines the success-factors of feature-based methods (tracking many features, parallel tracking and mapping, keyframe selection) with the accuracy and speed of direct methods. High frame-rate VO for MAVs promises increased robustness and faster flight maneuvers.

An open-source implementation and videos of this work are available at: http://rpg.ifi.uzh.ch/software
A. Taxonomy of Visual Motion Estimation Methods
Methods that simultaneously recover camera pose and scene structure from video can be divided into two classes:

∗ The authors are with the Robotics and Perception Group, University of Zurich, Switzerland—http://rpg.ifi.uzh.ch. This research was supported by the Swiss National Science Foundation through project number 200021-143607 (“Swarm of Flying Cameras”), the National Centre of Competence in Research Robotics, and the CTI project number 14652.1.

a) Feature-Based Methods: The standard approach is to extract a sparse set of salient image features (e.g. points, lines) in each image; match them in successive frames using invariant feature descriptors; robustly recover both camera motion and structure using epipolar geometry; and, finally, refine the pose and structure through reprojection error minimization. The majority of VO algorithms [12] follows this procedure, independent of the applied optimization framework. A reason for the success of these methods is the availability of robust feature detectors and descriptors that allow matching between images even at large inter-frame movement. The disadvantage of feature-based approaches is the reliance on detection and matching thresholds, the necessity for robust estimation techniques to deal with wrong correspondences, and the fact that most feature detectors are optimized for speed rather than precision, such that drift in the motion estimate must be compensated by averaging over many feature-measurements.

b) Direct Methods: Direct methods [13] estimate structure and motion directly from intensity values in the image. The local intensity gradient magnitude and direction is used in the optimisation, in contrast to feature-based methods, which consider only the distance to some feature location. Direct methods that exploit all the information in the image, even from areas where gradients are small, have been shown to outperform feature-based methods in terms of robustness in scenes with little texture [14] or in the case of camera defocus and motion blur [15]. The computation of the photometric error is more intensive than that of the reprojection error, as it involves warping and integrating large image regions. However, since direct methods operate directly on the intensity values of the image, the time for feature detection and invariant descriptor computation can be saved.
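To make the distinction between the two classes concrete, the sketch below (Python/NumPy, added for illustration and not part of the paper) contrasts the geometric reprojection residual used by feature-based methods with the photometric residual used by direct methods. The pinhole projection, the intrinsic matrix K, and all names are assumptions.

```python
import numpy as np

def project(K, p):
    """Pinhole projection of a 3D point p (camera frame) to pixel coordinates."""
    uvw = K @ p
    return uvw[:2] / uvw[2]

def reprojection_residual(K, T, p_w, u_detected):
    """Feature-based residual: pixel distance between a detected, matched feature and
    the reprojection of the corresponding 3D point (needs feature extraction and matching)."""
    p_c = (T @ np.append(p_w, 1.0))[:3]
    return u_detected - project(K, p_c)

def photometric_residual(I_cur, I_ref, K, T, p_ref):
    """Direct residual: intensity difference between the pixels of two images that
    observe the same 3D point (no descriptors or explicit matching needed)."""
    u_ref = project(K, p_ref)                             # pixel in the reference image
    u_cur = project(K, (T @ np.append(p_ref, 1.0))[:3])   # reprojected pixel in the current image
    xr, yr = int(round(u_ref[0])), int(round(u_ref[1]))
    xc, yc = int(round(u_cur[0])), int(round(u_cur[1]))
    return float(I_cur[yc, xc]) - float(I_ref[yr, xr])
```

The photometric residual is evaluated over image patches or regions rather than single pixels in practice, which is what makes it more expensive to compute but independent of feature detection.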
B. Related Work
Most monocular VO algorithms for MAVs [1], [2], [7] rely on PTAM [16]. PTAM is a feature-based SLAM algorithm that achieves robustness through tracking and mapping many (hundreds of) features. Simultaneously, it runs in real-time by parallelizing the motion estimation and mapping tasks and by relying on efficient keyframe-based Bundle Adjustment (BA) [17]. However, PTAM was designed for augmented reality applications in small desktop scenes, and multiple modifications (e.g., limiting the number of keyframes) were necessary to allow operation in large-scale outdoor environments [2].

Early direct monocular SLAM methods tracked and mapped few—sometimes manually selected—planar patches [18]–[21]. While the first approaches [18], [19] used filtering algorithms to estimate structure and motion, later methods [20]–[22] used nonlinear least squares optimization. All these methods estimate the surface normals of the patches, which allows tracking a patch over a wide range of viewpoints, thus greatly reducing drift in the estimation. The authors of [19]–[21] reported real-time performance, however only with few selected planar regions and on small datasets. A VO algorithm for omnidirectional cameras on cars was proposed in [22]. In [8], the local planarity assumption was relaxed and direct tracking with respect to arbitrary 3D structures computed from stereo cameras was proposed. In [9]–[11], the same approach was also applied to RGB-D sensors.

With DTAM [15], a novel direct method was introduced that computes a dense depthmap for each keyframe through minimisation of a global, spatially-regularised energy functional. The camera pose is found through direct whole image alignment using the depth-map. This approach is computationally very intensive and only possible through heavy GPU parallelization. To reduce the computational demand, the method described in [23], which was published during the review process of this work, uses only pixels characterized by strong gradient.
C. Contributions and Outline
The proposed Semi-Direct Visual Odometry (SVO) algorithm uses feature-correspondence; however, feature-correspondence is an implicit result of direct motion estimation rather than of explicit feature extraction and matching. Thus, feature extraction is only required when a keyframe is selected to initialize new 3D points (see Figure 1). The advantage is increased speed due to the lack of feature-extraction at every frame and increased accuracy through subpixel feature correspondence. In contrast to previous direct methods, we use many (hundreds of) small patches rather than few (tens of) large planar patches [18]–[21]. Using many small patches increases robustness and allows neglecting the patch normals. The proposed sparse model-based image alignment algorithm for motion estimation is related to model-based dense image alignment [8]–[10], [24]. However, we demonstrate that sparse information of depth is sufficient to get a rough estimate of the motion and to find feature-correspondences. As soon as feature correspondences and an initial estimate of the camera pose are established, the algorithm continues using only point-features; hence, the name “semi-direct”. This switch allows us to rely on fast and established frameworks for bundle adjustment (e.g., [25]).

A Bayesian filter that explicitly models outlier measurements is used to estimate the depth at feature locations. A 3D point is only inserted in the map when the corresponding depth-filter has converged, which requires multiple measurements. The result is a map with few outliers and points that can be tracked reliably.

The contributions of this paper are: (1) a novel semi-direct VO pipeline that is faster and more accurate than the current state-of-the-art for MAVs, and (2) the integration of a probabilistic mapping method that is robust to outlier measurements.
Fig. 1: Tracking and mapping pipeline. The motion estimation thread processes each new image through sparse model-based image alignment, feature alignment, and pose and structure refinement, using the last frame and the map. The mapping thread consumes a frame queue; if a frame is a keyframe, features are extracted and new depth-filters are initialized, otherwise the existing depth-filters are updated, and converged filters insert new points into the map.
Section II provides an overview of the pipeline and Section III, thereafter, introduces some required notation. Sections IV and V explain the proposed motion-estimation and mapping algorithms. Section VII provides experimental results and comparisons.
II. SYSTEM OVERVIEW
Figure 1 provides an overview of SVO. The algorithm uses two parallel threads (as in [16]), one for estimating the camera motion, and a second one for mapping as the environment is being explored. This separation allows fast and constant-time tracking in one thread, while the second thread extends the map, decoupled from hard real-time constraints.

The motion estimation thread implements the proposed semi-direct approach to relative-pose estimation. The first step is pose initialisation through sparse model-based image alignment: the camera pose relative to the previous frame is found through minimizing the photometric error between pixels corresponding to the projected location of the same 3D points (see Figure 2). The 2D coordinates corresponding to the reprojected points are refined in the next step through alignment of the corresponding feature-patches (see Figure 3). Motion estimation concludes by refining the pose and the structure through minimizing the reprojection error introduced in the previous feature-alignment step.

In the mapping thread, a probabilistic depth-filter is initialized for each 2D feature for which the corresponding 3D point is to be estimated. New depth-filters are initialised whenever a new keyframe is selected in regions of the image where few 3D-to-2D correspondences are found. The filters are initialised with a large uncertainty in depth. At every subsequent frame the depth estimate is updated in a Bayesian fashion (see Figure 5). When a depth filter's uncertainty becomes small enough, a new 3D point is inserted in the map and is immediately used for motion estimation.
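The two-thread split described above can be sketched structurally as follows. This is a minimal illustration only (Python, assumed here; the released implementation is C++), and the four stage functions are hypothetical placeholders standing in for the blocks of Figure 1.

```python
import queue
import threading

frame_queue = queue.Queue()

def track(frame):
    """Placeholder for sparse image alignment, feature alignment, pose/structure refinement."""
    return None  # would return the estimated camera pose T_{k,w}

def is_keyframe(frame, pose):
    """Placeholder for the keyframe selection criterion."""
    return False

def initialize_depth_filters(frame, pose):
    """Placeholder: extract features and start new depth-filters with large depth uncertainty."""

def update_depth_filters(frame, pose):
    """Placeholder: Bayesian depth update; converged filters insert new 3D points into the map."""

def motion_estimation_thread(camera):
    for frame in camera:
        pose = track(frame)              # fast, constant-time tracking
        frame_queue.put((frame, pose))   # every frame also feeds the mapping thread

def mapping_thread():
    while True:
        frame, pose = frame_queue.get()  # decoupled from hard real-time constraints
        if is_keyframe(frame, pose):
            initialize_depth_filters(frame, pose)
        update_depth_filters(frame, pose)

threading.Thread(target=mapping_thread, daemon=True).start()
motion_estimation_thread(camera=[])      # camera would be an iterable of images
```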

III. NOTATION
Before the algorithm is detailed, we briefly define the notation that is used throughout the paper.

The intensity image collected at timestep $k$ is denoted with $I_k : \Omega \subset \mathbb{R}^2 \mapsto \mathbb{R}$, where $\Omega$ is the image domain. Any 3D point $\mathbf{p} = (x, y, z)^\top \in \mathcal{S}$ on the visible scene surface $\mathcal{S} \subset \mathbb{R}^3$ maps to the image coordinates $\mathbf{u} = (u, v)^\top \in \Omega$ through the camera projection model $\pi : \mathbb{R}^3 \mapsto \mathbb{R}^2$:

$$\mathbf{u} = \pi({}_{k}\mathbf{p}), \qquad (1)$$

where the prescript $k$ denotes that the point coordinates are expressed in the camera frame of reference $k$. The projection $\pi$ is determined by the intrinsic camera parameters which are known from calibration. The 3D point corresponding to an image coordinate $\mathbf{u}$ can be recovered, given the inverse projection function $\pi^{-1}$ and the depth $d_{\mathbf{u}} \in \mathbb{R}$:

$${}_{k}\mathbf{p} = \pi^{-1}(\mathbf{u}, d_{\mathbf{u}}), \qquad (2)$$

where $\mathcal{R}$ is the domain for which the depth is known.

The camera position and orientation at timestep $k$ is expressed with the rigid-body transformation $\mathrm{T}_{k,w} \in SE(3)$. It allows us to map a 3D point from the world coordinate frame to the camera frame of reference: ${}_{k}\mathbf{p} = \mathrm{T}_{k,w} \cdot {}_{w}\mathbf{p}$. The relative transformation between two consecutive frames can be computed with $\mathrm{T}_{k,k-1} = \mathrm{T}_{k,w} \cdot \mathrm{T}_{k-1,w}^{-1}$. During the optimization, we need a minimal representation of the transformation and, therefore, use the Lie algebra $\mathfrak{se}(3)$ corresponding to the tangent space of $SE(3)$ at the identity. We denote the algebra elements—also named twist coordinates—with $\xi = (\omega, \nu)^\top \in \mathbb{R}^6$, where $\omega$ is called the angular velocity and $\nu$ the linear velocity. The twist coordinates $\xi$ are mapped to $SE(3)$ by the exponential map [26]:

$$\mathrm{T}(\xi) = \exp(\hat{\xi}). \qquad (3)$$
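As a concrete reference for Eq. (3), the following minimal sketch (Python/NumPy, added for illustration; the released implementation is C++) maps twist coordinates to a rigid-body transformation with the exponential map and applies it to a point. The numerical values are arbitrary examples.

```python
import numpy as np
from scipy.linalg import expm

def hat(xi):
    """Map twist coordinates xi = (omega, nu) in R^6 to the corresponding 4x4 se(3) matrix."""
    omega, nu = xi[:3], xi[3:]
    Omega = np.array([[0.0, -omega[2], omega[1]],
                      [omega[2], 0.0, -omega[0]],
                      [-omega[1], omega[0], 0.0]])
    xi_hat = np.zeros((4, 4))
    xi_hat[:3, :3] = Omega
    xi_hat[:3, 3] = nu
    return xi_hat

def exp_se3(xi):
    """Exponential map se(3) -> SE(3), Eq. (3): T(xi) = exp(xi^)."""
    return expm(hat(xi))

# Example: a small twist yields a 4x4 rigid-body transformation in SE(3)
xi = np.array([0.01, -0.02, 0.005, 0.10, 0.00, -0.05])  # (omega, nu)
T = exp_se3(xi)
p_w = np.array([1.0, 2.0, 5.0, 1.0])  # homogeneous 3D point in the world frame
p_k = T @ p_w                          # the same point mapped into the camera frame
```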
IV. MOTION ESTIMATION
SVO computes an initial guess of the relative camera motion and the feature correspondences using direct methods and concludes with a feature-based nonlinear reprojection-error refinement. Each step is detailed in the following sections and illustrated in Figures 2 to 4.
A. Sparse Model-based Image Alignment
The maximum likelihood estimate of the rigid body transformation $\mathrm{T}_{k,k-1}$ between two consecutive camera poses minimizes the negative log-likelihood of the intensity residuals:

$$\mathrm{T}_{k,k-1} = \arg\min_{\mathrm{T}} \iint_{\bar{\mathcal{R}}} \rho\big[\,\delta I(\mathrm{T}, \mathbf{u})\,\big]\, \mathrm{d}\mathbf{u}. \qquad (4)$$

The intensity residual $\delta I$ is defined by the photometric difference between pixels observing the same 3D point. It can be computed by back-projecting a 2D point $\mathbf{u}$ from the previous image $I_{k-1}$ and subsequently projecting it into the current camera view:

$$\delta I(\mathrm{T}, \mathbf{u}) = I_k\big(\pi(\mathrm{T} \cdot \pi^{-1}(\mathbf{u}, d_{\mathbf{u}}))\big) - I_{k-1}(\mathbf{u}) \quad \forall\, \mathbf{u} \in \bar{\mathcal{R}}, \qquad (5)$$
Fig. 2: Changing the relative pose $\mathrm{T}_{k,k-1}$ between the current and the previous frame implicitly moves the position of the reprojected points in the new image $\mathbf{u}'_i$. Sparse image alignment seeks to find $\mathrm{T}_{k,k-1}$ that minimizes the photometric difference between image patches corresponding to the same 3D point (blue squares). Note, in all figures, the parameters to optimize are drawn in red and the optimization cost is highlighted in blue.

Fig. 3: Due to inaccuracies in the 3D point and camera pose estimation, the photometric error between corresponding patches (blue squares) in the current frame and previous keyframes $r_i$ can further be minimised by optimising the 2D position of each patch individually.

Fig. 4: In the last motion estimation step, the camera pose and the structure (3D points) are optimized to minimize the reprojection error that has been established during the previous feature-alignment step.
where $\bar{\mathcal{R}}$ is the image region for which the depth $d_{\mathbf{u}}$ is known at time $k-1$ and for which the back-projected points are visible in the current image domain:

$$\bar{\mathcal{R}} = \big\{\, \mathbf{u} \;\big|\; \mathbf{u} \in \mathcal{R}_{k-1} \,\wedge\, \pi\big(\mathrm{T} \cdot \pi^{-1}(\mathbf{u}, d_{\mathbf{u}})\big) \in \Omega_k \,\big\}. \qquad (6)$$
For the sake of simplicity, we assume in the following that the intensity residuals are normally distributed with unit variance. The negative log likelihood minimizer then corresponds to the least squares problem: $\rho[\cdot] \,\hat{=}\, \tfrac{1}{2}\|\cdot\|^2$. In practice, the distribution has heavier tails due to occlusions and, thus, a robust cost function must be applied [10].

In contrast to previous works, where the depth is known for large regions in the image [8]–[10], [24], we only know the depth $d_{\mathbf{u}_i}$ at sparse feature locations $\mathbf{u}_i$. We denote small patches of $4 \times 4$ pixels around the feature point with the vector $\mathbf{I}(\mathbf{u}_i)$. We seek to find the camera pose that minimizes the photometric error of all patches (see Figure 2):

$$\mathrm{T}_{k,k-1} = \arg\min_{\mathrm{T}_{k,k-1}} \frac{1}{2} \sum_{i \in \bar{\mathcal{R}}} \big\| \delta I(\mathrm{T}_{k,k-1}, \mathbf{u}_i) \big\|^2. \qquad (7)$$
Since Equation (7) is nonlinear in $\mathrm{T}_{k,k-1}$, we solve it in an iterative Gauss-Newton procedure. Given an estimate of the relative transformation $\hat{\mathrm{T}}_{k,k-1}$, an incremental update $\mathrm{T}(\xi)$ to the estimate can be parametrised with a twist $\xi \in \mathfrak{se}(3)$. We use the inverse compositional formulation [27] of the intensity residual, which computes the update step $\mathrm{T}(\xi)$ for the reference image at time $k-1$:

$$\delta I(\xi, \mathbf{u}_i) = I_k\big(\pi(\hat{\mathrm{T}}_{k,k-1} \cdot \mathbf{p}_i)\big) - I_{k-1}\big(\pi(\mathrm{T}(\xi) \cdot \mathbf{p}_i)\big), \qquad (8)$$

with $\mathbf{p}_i = \pi^{-1}(\mathbf{u}_i, d_{\mathbf{u}_i})$. The inverse of the update step is then applied to the current estimate using Equation (3):

$$\hat{\mathrm{T}}_{k,k-1} \leftarrow \hat{\mathrm{T}}_{k,k-1} \cdot \mathrm{T}(\xi)^{-1}. \qquad (9)$$
Note that, for reasons of computational speed, we do not warp the patches. This assumption is valid in the case of small frame-to-frame motions and for small patch-sizes.
To find the optimal update step $\mathrm{T}(\xi)$, we compute the derivative of (7) and set it to zero:

$$\sum_{i \in \bar{\mathcal{R}}} \nabla\delta I(\xi, \mathbf{u}_i)^\top\, \delta I(\xi, \mathbf{u}_i) = 0. \qquad (10)$$
To solve this system, we linearize around the current state:

$$\delta I(\xi, \mathbf{u}_i) \approx \delta I(\mathbf{0}, \mathbf{u}_i) + \nabla\delta I(\mathbf{0}, \mathbf{u}_i) \cdot \xi. \qquad (11)$$
The Jacobian $\mathbf{J}_i := \nabla\delta I(\mathbf{0}, \mathbf{u}_i)$ has the dimension $16 \times 6$ because of the $4 \times 4$ patch-size and is computed with the chain-rule:

$$\frac{\partial\, \delta I(\xi, \mathbf{u}_i)}{\partial \xi} = \left.\frac{\partial I_{k-1}(\mathbf{a})}{\partial \mathbf{a}}\right|_{\mathbf{a} = \mathbf{u}_i} \cdot \left.\frac{\partial \pi(\mathbf{b})}{\partial \mathbf{b}}\right|_{\mathbf{b} = \mathbf{p}_i} \cdot \left.\frac{\partial \mathrm{T}(\xi)}{\partial \xi}\right|_{\xi = \mathbf{0}} \cdot \mathbf{p}_i.$$
By inserting (11) into (10) and by stacking the Jacobians in a matrix $\mathbf{J}$, we obtain the normal equations:

$$\mathbf{J}^\top \mathbf{J}\, \xi = -\mathbf{J}^\top \delta I(\mathbf{0}), \qquad (12)$$

which can be solved for the update twist $\xi$. Note that, by using the inverse compositional approach, the Jacobian can be precomputed as it remains constant over all iterations (the reference patch $\mathbf{I}_{k-1}(\mathbf{u}_i)$ and the point $\mathbf{p}_i$ do not change), which results in a significant speedup [27].
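The following sketch illustrates the Gauss-Newton loop over sparse 4 x 4 patches described above. It is a simplified illustration under several assumptions: a plain pinhole model K, bilinear patch interpolation, a finite-difference Jacobian in place of the precomputed analytic inverse-compositional Jacobian of Eq. (8), and no robust cost function. It reuses the hat() helper from the notation sketch and is not the authors' implementation.

```python
import numpy as np
from scipy.linalg import expm

def project(K, p):
    """Pinhole projection of a 3D point p (camera frame) to pixel coordinates."""
    uvw = K @ p
    return uvw[:2] / uvw[2]

def patch(img, u, half=2):
    """4x4 intensity patch around the subpixel location u, sampled with bilinear interpolation."""
    vals = []
    for dy in range(-half, half):
        for dx in range(-half, half):
            x, y = u[0] + dx, u[1] + dy
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            ax, ay = x - x0, y - y0
            vals.append((1 - ax) * (1 - ay) * img[y0, x0] + ax * (1 - ay) * img[y0, x0 + 1]
                        + (1 - ax) * ay * img[y0 + 1, x0] + ax * ay * img[y0 + 1, x0 + 1])
    return np.array(vals, dtype=np.float64)

def sparse_image_alignment(I_k, I_km1, K, points_km1, T_init, iters=10, eps=1e-4):
    """Estimate T_{k,k-1} by minimizing the photometric error of sparse patches (Eq. 7).
    points_km1: 3D points expressed in the camera frame of I_{k-1} (i.e. with known depth)."""
    T = T_init.copy()
    for _ in range(iters):
        J_rows, res = [], []
        for p in points_km1:
            u_ref = project(K, p)                                # feature location in I_{k-1}
            u_cur = project(K, (T @ np.append(p, 1.0))[:3])      # reprojection in I_k
            r = patch(I_k, u_cur) - patch(I_km1, u_ref)          # 16-vector photometric residual
            # Finite-difference Jacobian w.r.t. the 6 twist parameters. SVO instead uses
            # the inverse compositional formulation (Eq. 8) with an analytic Jacobian that
            # is precomputed on the reference image; this sketch trades speed for brevity.
            J = np.zeros((16, 6))
            for j in range(6):
                xi = np.zeros(6); xi[j] = eps
                T_pert = T @ np.linalg.inv(expm(hat(xi)))        # right-multiplicative perturbation
                u_pert = project(K, (T_pert @ np.append(p, 1.0))[:3])
                J[:, j] = (patch(I_k, u_pert) - patch(I_k, u_cur)) / eps
            J_rows.append(J); res.append(r)
        J = np.vstack(J_rows); r = np.hstack(res)
        xi = np.linalg.solve(J.T @ J, -J.T @ r)                  # normal equations (Eq. 12)
        T = T @ np.linalg.inv(expm(hat(xi)))                     # update as in Eq. (9)
    return T
```

In the real pipeline this alignment is additionally run coarse-to-fine, with the optimization initialized at the next finer pyramid level, which the sketch omits.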
B. Relaxation Through Feature Alignment
The last step aligned the camera with respect to the previous frame. Through back-projection, the found relative pose $\mathrm{T}_{k,k-1}$ implicitly defines an initial guess for the feature positions of all visible 3D points in the new image. Due to inaccuracies in the 3D points' positions and, thus, the camera pose, this initial guess can be improved. To reduce the drift, the camera pose should be aligned with respect to the map, rather than to the previous frame.

All 3D points of the map that are visible from the estimated camera pose are projected into the image, resulting in an estimate of the corresponding 2D feature positions $\mathbf{u}'_i$ (see Figure 3). For each reprojected point, the keyframe $r$ that observes the point with the closest observation angle is identified. The feature alignment step then optimizes all 2D feature-positions $\mathbf{u}'_i$ in the new image individually by minimizing the photometric error of the patch in the current image with respect to the reference patch in the keyframe $r$:

$$\mathbf{u}'_i = \arg\min_{\mathbf{u}'_i} \frac{1}{2} \big\| \mathbf{I}_k(\mathbf{u}'_i) - \mathbf{A}_i \cdot \mathbf{I}_r(\mathbf{u}_i) \big\|^2, \quad \forall\, i. \qquad (13)$$

This alignment is solved using the inverse compositional Lucas-Kanade algorithm [27]. In contrast to the previous step, we apply an affine warping $\mathbf{A}_i$ to the reference patch, since a larger patch size is used ($8 \times 8$ pixels) and the closest keyframe is typically farther away than the previous image. This step can be understood as a relaxation step that violates the epipolar constraints to achieve a higher correlation between the feature-patches.
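A minimal sketch of the per-feature 2D refinement of Eq. (13) follows. For brevity it uses a forward, translation-only Gauss-Newton formulation and omits the affine warp A_i, whereas SVO itself uses the inverse compositional Lucas-Kanade algorithm [27]; the bilinear interpolation and all names are assumptions for illustration.

```python
import numpy as np

def bilinear(img, x, y):
    """Bilinearly interpolated intensity at the subpixel location (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x0 + 1]
            + (1 - dx) * dy * img[y0 + 1, x0] + dx * dy * img[y0 + 1, x0 + 1])

def align_feature(I_k, ref_patch, u_init, half=4, iters=20):
    """Refine the 2D feature position u_init in I_k so that the 8x8 patch around it best
    matches ref_patch (the warped reference patch from keyframe r, flattened row by row)."""
    u = np.array(u_init, dtype=np.float64)
    offsets = [(dx, dy) for dy in range(-half, half) for dx in range(-half, half)]
    for _ in range(iters):
        r = np.empty(len(offsets))        # photometric residuals
        J = np.empty((len(offsets), 2))   # residual gradient w.r.t. the 2D position u
        for i, (dx, dy) in enumerate(offsets):
            x, y = u[0] + dx, u[1] + dy
            r[i] = bilinear(I_k, x, y) - ref_patch[i]
            J[i, 0] = 0.5 * (bilinear(I_k, x + 1, y) - bilinear(I_k, x - 1, y))  # central differences
            J[i, 1] = 0.5 * (bilinear(I_k, x, y + 1) - bilinear(I_k, x, y - 1))
        step = np.linalg.solve(J.T @ J, -J.T @ r)  # Gauss-Newton update of the position
        u += step
        if np.linalg.norm(step) < 1e-3:
            break
    return u  # refined subpixel feature position u'_i
```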
C. Pose and Structure Refinement
In the previous step, we have established feature correspondence with subpixel accuracy at the cost of violating the epipolar constraints. In particular, we have generated a reprojection residual $\|\delta\mathbf{u}_i\| = \|\mathbf{u}'_i - \pi(\mathrm{T}_{k,w}\, {}_{w}\mathbf{p}_i)\| \neq 0$, which on average is around 0.3 pixels (see Figure 11). In this final step, we again optimize the camera pose $\mathrm{T}_{k,w}$ to minimize the reprojection residuals (see Figure 4):

$$\mathrm{T}_{k,w} = \arg\min_{\mathrm{T}_{k,w}} \frac{1}{2} \sum_i \big\| \mathbf{u}'_i - \pi(\mathrm{T}_{k,w}\, {}_{w}\mathbf{p}_i) \big\|^2. \qquad (14)$$

This is the well-known problem of motion-only BA [17] and can efficiently be solved using an iterative non-linear least squares minimization algorithm such as Gauss-Newton. Subsequently, we optimize the position of the observed 3D points through reprojection error minimization (structure-only BA). Finally, it is possible to apply local BA, in which both the pose of all close keyframes as well as the observed 3D points are jointly optimized. The BA step is omitted in the fast parameter settings of the algorithm (Section VII).
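As an illustration of the motion-only BA of Eq. (14), the sketch below refines the pose by nonlinear least squares over a local twist perturbation using SciPy's Levenberg-Marquardt solver; the paper solves the same problem with Gauss-Newton. It reuses the hat() helper from the notation sketch, and the remaining names are assumptions.

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import least_squares

def reprojection_residuals(xi, T_init, K, points_w, observations):
    """Stacked reprojection residuals u'_i - pi(T(xi) * T_init * p_i), cf. Eq. (14)."""
    T = expm(hat(xi)) @ T_init            # left-multiplicative perturbation of the pose
    res = []
    for p_w, u_obs in zip(points_w, observations):
        p_c = (T @ np.append(p_w, 1.0))[:3]
        uvw = K @ p_c
        res.append(u_obs - uvw[:2] / uvw[2])
    return np.concatenate(res)

def refine_pose(T_init, K, points_w, observations):
    """Motion-only BA: optimize the 6-DoF camera pose while keeping the 3D points fixed."""
    sol = least_squares(reprojection_residuals, x0=np.zeros(6),
                        args=(T_init, K, points_w, observations), method="lm")
    return expm(hat(sol.x)) @ T_init      # refined T_{k,w}
```

Structure-only BA is the complementary problem: hold the refined pose fixed and minimize the same residuals over the position of each observed 3D point instead.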
D. Discussion
The first (Section IV-A) and the last (Section IV-C) optimization of the algorithm seem to be redundant, as both optimize the 6 DoF pose of the camera. Indeed, one could directly start with the second step and establish feature-correspondence through Lucas-Kanade tracking [27] of all feature-patches, followed by nonlinear pose refinement (Section IV-C). While this would work, the processing time would be higher. Tracking all features over large distances (e.g., 30 pixels) requires a larger patch and a pyramidal implementation. Furthermore, some features might be tracked inaccurately, which would require outlier detection. In SVO, however, feature alignment is efficiently initialized by only optimizing six parameters—the camera pose—in the sparse image alignment step. The sparse image alignment step satisfies implicitly the epipolar constraint and ensures that there are no outliers.

One may also argue that the first step (sparse image alignment) would be sufficient to estimate the camera motion.

Citations
Journal ArticleDOI
TL;DR: ORB-SLAM as discussed by the authors is a feature-based monocular SLAM system that operates in real time, in small and large indoor and outdoor environments, with a survival of the fittest strategy that selects the points and keyframes of the reconstruction.
Abstract: This paper presents ORB-SLAM, a feature-based monocular simultaneous localization and mapping (SLAM) system that operates in real time, in small and large indoor and outdoor environments. The system is robust to severe motion clutter, allows wide baseline loop closing and relocalization, and includes full automatic initialization. Building on excellent algorithms of recent years, we designed from scratch a novel system that uses the same features for all SLAM tasks: tracking, mapping, relocalization, and loop closing. A survival of the fittest strategy that selects the points and keyframes of the reconstruction leads to excellent robustness and generates a compact and trackable map that only grows if the scene content changes, allowing lifelong operation. We present an exhaustive evaluation in 27 sequences from the most popular datasets. ORB-SLAM achieves unprecedented performance with respect to other state-of-the-art monocular SLAM approaches. For the benefit of the community, we make the source code public.

4,522 citations



Cites methods from "SVO: Fast semi-direct monocular vis..."

  • ...…bags of binary words obtained from BRIEF descriptors [14] along with the very efficient FAST feature detector [15], reducing in more than one order of magnitude the time needed for feature extraction, compared to SURF [16] and SIFT [17] features that were used in bags of words approaches so far....

  • ...We delay the initialization until the method produces a unique solution with significant parallax....

Book ChapterDOI
06 Sep 2014
TL;DR: A novel direct tracking method which operates on \(\mathfrak{sim}(3)\), thereby explicitly detecting scale-drift, and an elegant probabilistic solution to include the effect of noisy depth values into tracking are introduced.
Abstract: We propose a direct (feature-less) monocular SLAM algorithm which, in contrast to current state-of-the-art regarding direct methods, allows to build large-scale, consistent maps of the environment Along with highly accurate pose estimation based on direct image alignment, the 3D environment is reconstructed in real-time as pose-graph of keyframes with associated semi-dense depth maps These are obtained by filtering over a large number of pixelwise small-baseline stereo comparisons The explicitly scale-drift aware formulation allows the approach to operate on challenging sequences including large variations in scene scale Major enablers are two key novelties: (1) a novel direct tracking method which operates on \(\mathfrak{sim}(3)\), thereby explicitly detecting scale-drift, and (2) an elegant probabilistic solution to include the effect of noisy depth values into tracking The resulting direct monocular SLAM system runs in real-time on a CPU

3,273 citations


Cites background from "SVO: Fast semi-direct monocular vis..."

  • ...Two major reasons are (1) their use in robotics, in particular to navigate unmanned aerial vehicles (UAVs) [10,8,1], and (2) augmented and virtual reality applications slowly making their way into the mass-market....

  • ...By combining direct tracking with keypoints, [10] achieves high framerates even on embedded platforms....

Journal ArticleDOI
TL;DR: In this article, a robust and versatile monocular visual-inertial state estimator is presented, which is the minimum sensor suite (in size, weight, and power) for the metric six degrees of freedom (DOF) state estimation.
Abstract: One camera and one low-cost inertial measurement unit (IMU) form a monocular visual-inertial system (VINS), which is the minimum sensor suite (in size, weight, and power) for the metric six degrees-of-freedom (DOF) state estimation. In this paper, we present VINS-Mono: a robust and versatile monocular visual-inertial state estimator. Our approach starts with a robust procedure for estimator initialization. A tightly coupled, nonlinear optimization-based method is used to obtain highly accurate visual-inertial odometry by fusing preintegrated IMU measurements and feature observations. A loop detection module, in combination with our tightly coupled formulation, enables relocalization with minimum computation. We additionally perform 4-DOF pose graph optimization to enforce the global consistency. Furthermore, the proposed system can reuse a map by saving and loading it in an efficient way. The current and previous maps can be merged together by the global pose graph optimization. We validate the performance of our system on public datasets and real-world experiments and compare against other state-of-the-art algorithms. We also perform an onboard closed-loop autonomous flight on the microaerial-vehicle platform and port the algorithm to an iOS-based demonstration. We highlight that the proposed work is a reliable, complete, and versatile system that is applicable for different applications that require high accuracy in localization. We open source our implementations for both PCs ( https://github.com/HKUST-Aerial-Robotics/VINS-Mono ) and iOS mobile devices ( https://github.com/HKUST-Aerial-Robotics/VINS-Mobile ).

2,305 citations

Journal ArticleDOI
TL;DR: Direct Sparse Odometry (DSO) as mentioned in this paper combines a fully direct probabilistic model with consistent, joint optimization of all model parameters, including geometry represented as inverse depth in a reference frame and camera motion.
Abstract: Direct Sparse Odometry (DSO) is a visual odometry method based on a novel, highly accurate sparse and direct structure and motion formulation. It combines a fully direct probabilistic model (minimizing a photometric error) with consistent, joint optimization of all model parameters, including geometry-represented as inverse depth in a reference frame-and camera motion. This is achieved in real time by omitting the smoothness prior used in other direct methods and instead sampling pixels evenly throughout the images. Since our method does not depend on keypoint detectors or descriptors, it can naturally sample pixels from across all image regions that have intensity gradient, including edges or smooth intensity variations on essentially featureless walls. The proposed model integrates a full photometric calibration, accounting for exposure time, lens vignetting, and non-linear response functions. We thoroughly evaluate our method on three different datasets comprising several hours of video. The experiments show that the presented approach significantly outperforms state-of-the-art direct and indirect methods in a variety of real-world settings, both in terms of tracking accuracy and robustness.

1,868 citations

References
Proceedings ArticleDOI
13 Nov 2007
TL;DR: A system specifically designed to track a hand-held camera in a small AR workspace, processed in parallel threads on a dual-core computer, that produces detailed maps with thousands of landmarks which can be tracked at frame-rate with accuracy and robustness rivalling that of state-of-the-art model-based systems.
Abstract: This paper presents a method of estimating camera pose in an unknown scene. While this has previously been attempted by adapting SLAM algorithms developed for robotic exploration, we propose a system specifically designed to track a hand-held camera in a small AR workspace. We propose to split tracking and mapping into two separate tasks, processed in parallel threads on a dual-core computer: one thread deals with the task of robustly tracking erratic hand-held motion, while the other produces a 3D map of point features from previously observed video frames. This allows the use of computationally expensive batch optimisation techniques not usually associated with real-time operation: The result is a system that produces detailed maps with thousands of landmarks which can be tracked at frame-rate, with an accuracy and robustness rivalling that of state-of-the-art model-based systems.

4,091 citations


"SVO: Fast semi-direct monocular vis..." refers background or methods in this paper

  • ...The reason we do not compare with the original version of PTAM [16] is because it does not handle large environments and is not robust enough in scenes of high-frequency texture [2]....

  • ...Most monocular VO algorithms for MAVs [1], [2], [7] rely on PTAM [16]....

  • ...Like in [16], we assume a locally planar scene and estimate a homography....

  • ...The algorithm uses two parallel threads (as in [16]), one for estimating the camera motion, and a second one for mapping as the environment is being explored....

Journal ArticleDOI
TL;DR: In this paper, a wide variety of extensions have been made to the original formulation of the Lucas-Kanade algorithm and their extensions can be used with the inverse compositional algorithm without any significant loss of efficiency.
Abstract: Since the Lucas-Kanade algorithm was proposed in 1981 image alignment has become one of the most widely used techniques in computer vision Applications range from optical flow and tracking to layered motion, mosaic construction, and face coding Numerous algorithms have been proposed and a wide variety of extensions have been made to the original formulation We present an overview of image alignment, describing most of the algorithms and their extensions in a consistent framework We concentrate on the inverse compositional algorithm, an efficient algorithm that we recently proposed We examine which of the extensions to Lucas-Kanade can be used with the inverse compositional algorithm without any significant loss of efficiency, and which cannot In this paper, Part 1 in a series of papers, we cover the quantity approximated, the warp update rule, and the gradient descent approximation In future papers, we will cover the choice of the error function, how to allow linear appearance variation, and how to impose priors on the parameters

3,168 citations

Proceedings ArticleDOI
24 Dec 2012
TL;DR: A large set of image sequences from a Microsoft Kinect with highly accurate and time-synchronized ground truth camera poses from a motion capture system is recorded for the evaluation of RGB-D SLAM systems.
Abstract: In this paper, we present a novel benchmark for the evaluation of RGB-D SLAM systems. We recorded a large set of image sequences from a Microsoft Kinect with highly accurate and time-synchronized ground truth camera poses from a motion capture system. The sequences contain both the color and depth images in full sensor resolution (640 × 480) at video frame rate (30 Hz). The ground-truth trajectory was obtained from a motion-capture system with eight high-speed tracking cameras (100 Hz). The dataset consists of 39 sequences that were recorded in an office environment and an industrial hall. The dataset covers a large variety of scenes and camera motions. We provide sequences for debugging with slow motions as well as longer trajectories with and without loop closures. Most sequences were recorded from a handheld Kinect with unconstrained 6-DOF motions but we also provide sequences from a Kinect mounted on a Pioneer 3 robot that was manually navigated through a cluttered indoor environment. To stimulate the comparison of different approaches, we provide automatic evaluation tools both for the evaluation of drift of visual odometry systems and the global pose error of SLAM systems. The benchmark website [1] contains all data, detailed descriptions of the scenes, specifications of the data formats, sample code, and evaluation tools.

3,050 citations


"SVO: Fast semi-direct monocular vis..." refers background in this paper

  • ...per second in Table II as proposed and motivated in [32]....

Proceedings ArticleDOI
09 May 2011
TL;DR: G2o, an open-source C++ framework for optimizing graph-based nonlinear error functions, is presented and demonstrated that while being general g2o offers a performance comparable to implementations of state-of-the-art approaches for the specific problems.
Abstract: Many popular problems in robotics and computer vision including various types of simultaneous localization and mapping (SLAM) or bundle adjustment (BA) can be phrased as least squares optimization of an error function that can be represented by a graph. This paper describes the general structure of such problems and presents g2o, an open-source C++ framework for optimizing graph-based nonlinear error functions. Our system has been designed to be easily extensible to a wide range of problems and a new problem typically can be specified in a few lines of code. The current implementation provides solutions to several variants of SLAM and BA. We provide evaluations on a wide range of real-world and simulated datasets. The results demonstrate that while being general g2o offers a performance comparable to implementations of state-of-the-art approaches for the specific problems.

2,192 citations


Additional excerpts

  • ..., [25])....

Journal ArticleDOI
TL;DR: The proposed theorem is a strict solution of the problem, and it always gives the correct transformation parameters even when the data is corrupted.
Abstract: In many applications of computer vision, the following problem is encountered. Two point patterns (sets of points) (x/sub i/) and (x/sub i/); i=1, 2, . . ., n are given in m-dimensional space, and the similarity transformation parameters (rotation, translation, and scaling) that give the least mean squared error between these point patterns are needed. Recently, K.S. Arun et al. (1987) and B.K.P. Horn et al. (1987) presented a solution of this problem. Their solution, however, sometimes fails to give a correct rotation matrix and gives a reflection instead when the data is severely corrupted. The proposed theorem is a strict solution of the problem, and it always gives the correct transformation parameters even when the data is corrupted. >

2,123 citations


"SVO: Fast semi-direct monocular vis..." refers methods in this paper

  • ...In order to generate the plots, we aligned the first 10 frames with the ground-truth using [31]....

Frequently Asked Questions (18)
Q1. What are the contributions mentioned in the paper "Svo: fast semi-direct monocular visual odometry" ?

The authors propose a semi-direct monocular visual odometry algorithm that is precise, robust, and faster than current state-of-the-art methods. The authors call their approach SVO (Semi-direct Visual Odometry) and release their implementation as open-source software.

The main advantage of the proposed methods over the standard approach of triangulating points from two views is that the authors observe far fewer outliers as every filter undergoes many measurements until convergence. 

The improved accuracy is due to the alignment of the new image with respect to the keyframes and the map, whereas sparse image alignment aligns the new frame only with respect to the previous frame. 

Since a camera is only an angle-sensor, it is impossible to obtain the scale of the map through a Structure from Motion pipeline. 

Since the plots are highly dependent on the accuracy of alignment of the first 10 frames, the authors also report the drift in meters. (Footnote 1: Matrix Vision BlueFox camera, global shutter, 752×480 pixel resolution.)

The authors suspect the main reason for this result to originate from the fact that the PTAM version of [2] does not extract features on the pyramid level of highest resolution and subpixel refinement is not performed for all features in PTAM. 
