SVO: Fast semi-direct monocular visual odometry
Summary (4 min read)
Introduction
- Precise and high frame-rate motion estimation brings increased robustness in scenes of little, repetitive, and high-frequency texture.
- Precise, fully autonomous operation requires MAVs to rely on localization systems other than GPS.
A. Taxonomy of Visual Motion Estimation Methods
- Methods that simultaneously recover camera pose and scene structure from video can be divided into two classes: feature-based methods and direct methods.
a) Feature-Based Methods:
- The majority of VO algorithms [12] follows this procedure, independent of the applied optimization framework.
- A reason for the success of these methods is the availability of robust feature detectors and descriptors that allow matching between images even at large inter-frame movement.
b) Direct Methods:
- Since direct methods operate directly on the intensity values of the image, the time for feature detection and invariant descriptor computation can be saved.
C. Contributions and Outline
- The proposed Semi-Direct Visual Odometry (SVO) algorithm uses feature correspondence; however, feature correspondence is an implicit result of direct motion estimation rather than of explicit feature extraction and matching.
- In contrast to previous direct methods, the authors use many small patches rather than a few (tens of) large planar patches [18]–[21].
- A Bayesian filter that explicitly models outlier measurements is used to estimate the depth at feature locations.
- Section II provides an overview of the pipeline and Section III introduces the required notation.
- Sections IV and V explain the proposed motion-estimation and mapping algorithms.
II. SYSTEM OVERVIEW
- The algorithm uses two parallel threads (as in [16]), one for estimating the camera motion, and a second one for mapping as the environment is being explored.
- This separation allows fast and constant-time tracking in one thread, while the second thread extends the map, decoupled from hard real-time constraints (see the structural sketch after this list).
- The 2D coordinates corresponding to the reprojected points are refined in the next step through alignment of the corresponding feature-patches.
- Motion estimation concludes by refining the pose and the structure through minimizing the reprojection error introduced in the previous feature-alignment step.
- New depth-filters are initialised whenever a new keyframe is selected in regions of the image where few 3D-to-2D correspondences are found.
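The two-thread organisation can be summarised with the following structural sketch. This is a hedged illustration, not SVO's actual API: the step functions (`estimate_motion`, `update_depth_filters`) are hypothetical placeholders for the operations described in Sections IV and V, stubbed so the skeleton runs.

```python
# Structural sketch of the two parallel threads (Section II).
import queue
import threading

frames = queue.Queue()      # camera frames feed the motion-estimation thread
keyframes = queue.Queue()   # selected keyframes feed the mapping thread

def estimate_motion(frame):
    """Placeholder for Sections IV-A to IV-C: sparse image alignment,
    feature alignment, and pose/structure refinement."""
    return {"pose": None, "is_keyframe": False}

def update_depth_filters(keyframe):
    """Placeholder for Section V: initialise and update depth-filters."""
    pass

def motion_thread():
    # Hard real-time path: constant-time tracking for every frame.
    while True:
        frame = frames.get()
        if frame is None:            # sentinel: shut down both threads
            keyframes.put(None)
            return
        result = estimate_motion(frame)
        if result["is_keyframe"]:
            keyframes.put(frame)

def mapping_thread():
    # Decoupled from real-time constraints: extends the map per keyframe.
    while True:
        kf = keyframes.get()
        if kf is None:
            return
        update_depth_filters(kf)

threading.Thread(target=motion_thread).start()
threading.Thread(target=mapping_thread).start()
frames.put("frame-0")   # toy usage: process one frame ...
frames.put(None)        # ... then stop cleanly
```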
III. NOTATION
- Before the algorithm is detailed, the authors briefly define the notation that is used throughout the paper.
- The projection π is determined by the intrinsic camera parameters which are known from calibration.
- The camera position and orientation at timestep k is expressed with the rigid-body transformation Tk,w ∈ SE(3).
- During the optimization, the authors need a minimal representation of the transformation and, therefore, use the Lie algebra se(3) corresponding to the tangent space of SE(3) at the identity.
- The authors denote the algebra elements, also called twist coordinates, with ξ = (ω, ν)ᵀ ∈ ℝ⁶, where ω is called the angular velocity and ν the linear velocity (a minimal sketch of the corresponding exponential map follows this list).
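To make the notation concrete, here is a minimal sketch (an illustration, not the authors' implementation) of the exponential map that turns twist coordinates ξ = (ω, ν)ᵀ into a rigid-body transformation T(ξ) = exp(ξ̂) ∈ SE(3), using Rodrigues' formula:

```python
# Exponential map se(3) -> SE(3) from twist coordinates xi = (omega, nu),
# following the paper's (angular, linear) ordering of the 6-vector.
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix with skew(w) @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Map a twist xi = (omega, nu) in R^6 to a 4x4 transform in SE(3)."""
    omega, nu = xi[:3], xi[3:]
    theta = np.linalg.norm(omega)
    W = skew(omega)
    if theta < 1e-10:                       # near-zero rotation: first order
        R, V = np.eye(3) + W, np.eye(3)
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (1.0 - A) / theta**2
        R = np.eye(3) + A * W + B * W @ W   # Rodrigues' rotation formula
        V = np.eye(3) + B * W + C * W @ W   # couples translation to rotation
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ nu
    return T
```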
IV. MOTION ESTIMATION
A. Sparse Model-based Image Alignment
- The maximum likelihood estimate of the rigid-body transformation Tk,k−1 between two consecutive camera poses minimizes the negative log-likelihood of the intensity residuals:

$$T_{k,k-1} = \arg\min_{T} \iint_{\bar{\mathcal{R}}} \rho\big[\,\delta I(T,\mathbf{u})\,\big]\, d\mathbf{u} \qquad (4)$$

where the intensity residual δI is the photometric difference between pixels observing the same 3D point.
- Assuming normally distributed residuals, the negative log-likelihood minimizer corresponds to the least-squares problem with $\rho[\cdot] \,\hat{=}\, \tfrac{1}{2}\|\cdot\|^{2}$.
- In practice, the distribution has heavier tails due to occlusions and thus, a robust cost function must be applied [10].
- The authors denote small patches of 4×4 pixels around the feature point with the vector I(u_i).
- The authors use the inverse compositional formulation [27] of the intensity residual, which computes the update step T(ξ) for the reference image at time k−1:

$$\delta I(\xi,\mathbf{u}_i) = I_k\big(\pi(\hat{T}_{k,k-1}\cdot\mathbf{p}_i)\big) - I_{k-1}\big(\pi(T(\xi)\cdot\mathbf{p}_i)\big), \qquad (8)$$

with $\mathbf{p}_i = \pi^{-1}(\mathbf{u}_i, d_{\mathbf{u}_i})$ (a minimal Gauss-Newton sketch of this alignment follows this list).
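The following is a hedged sketch of the sparse image alignment step under simplifying assumptions: single-pixel residuals instead of 4×4 patches, a numerical Jacobian instead of the analytic inverse compositional one, and no visibility or bounds checks. It reuses `se3_exp` from the notation sketch above; `pts` are 3D points expressed in the frame k−1 camera frame.

```python
# Gauss-Newton sparse image alignment over the 6 twist parameters (Sec. IV-A).
import numpy as np

def bilinear(img, u, v):
    """Bilinearly interpolated intensity at subpixel location (u, v)."""
    x0, y0 = int(np.floor(u)), int(np.floor(v))
    a, b = u - x0, v - y0
    return ((1-a)*(1-b)*img[y0, x0]   + a*(1-b)*img[y0, x0+1] +
            (1-a)*b*img[y0+1, x0]     + a*b*img[y0+1, x0+1])

def project(K, p):
    """Pinhole projection pi: 3D camera-frame point -> pixel (u, v)."""
    uvw = K @ p
    return uvw[:2] / uvw[2]

def residuals(xi, T, I_k, I_km1, pts, K):
    """Photometric residuals delta_I for a perturbation xi of T_{k,k-1}."""
    Txi = se3_exp(xi) @ T
    r = []
    for p in pts:
        u0, v0 = project(K, p)                              # pixel in I_{k-1}
        u1, v1 = project(K, (Txi @ np.append(p, 1.0))[:3])  # pixel in I_k
        r.append(bilinear(I_k, u1, v1) - bilinear(I_km1, u0, v0))
    return np.asarray(r)

def align(T_init, I_k, I_km1, pts, K, iters=10, eps=1e-6):
    """Minimise the photometric error over xi in R^6 by Gauss-Newton."""
    T = T_init
    for _ in range(iters):
        r0 = residuals(np.zeros(6), T, I_k, I_km1, pts, K)
        J = np.zeros((len(r0), 6))
        for j in range(6):                 # numerical Jacobian, column j
            d = np.zeros(6); d[j] = eps
            J[:, j] = (residuals(d, T, I_k, I_km1, pts, K) - r0) / eps
        dxi = np.linalg.lstsq(J, -r0, rcond=None)[0]
        T = se3_exp(dxi) @ T               # left-multiplicative update
        if np.linalg.norm(dxi) < 1e-8:
            break
    return T
```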
B. Relaxation Through Feature Alignment
- The last step aligned the camera with respect to the previous frame.
- Through back-projection, the found relative pose Tk,k−1 implicitly defines an initial guess for the feature positions of all visible 3D points in the new image.
- To reduce the drift, the camera pose should be aligned with respect to the map, rather than to the previous frame.
- For each reprojected point, the keyframe r that observes the point with the closest observation angle is identified.
- This step can be understood as a relaxation step that violates the epipolar constraints to achieve a higher correlation between the feature-patches.
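The per-feature refinement described above can be sketched as a translational Lucas-Kanade alignment of a small patch. This is a hedged simplification: SVO additionally warps the reference patch with an affine transformation A_i and uses the inverse compositional formulation [27], both omitted here; `bilinear` is reused from the sketch above.

```python
# Per-feature 2D alignment (Section IV-B): refine one feature position so the
# current-image patch best matches the reference-keyframe patch.
import numpy as np

def patch(img, u, v, half=2):
    """Bilinearly sampled (2*half)x(2*half) patch centred at (u, v)."""
    return np.array([[bilinear(img, u + dx, v + dy)
                      for dx in range(-half, half)]
                     for dy in range(-half, half)])

def align_feature(I_k, ref_patch, u0, v0, iters=10, step=0.25):
    """Gauss-Newton over (u, v): minimise ||patch(I_k, u, v) - ref_patch||^2."""
    u, v = u0, v0
    for _ in range(iters):
        r = (patch(I_k, u, v) - ref_patch).ravel()
        # numerical patch gradients w.r.t. the two translation parameters
        gu = (patch(I_k, u + step, v) - patch(I_k, u - step, v)).ravel() / (2*step)
        gv = (patch(I_k, u, v + step) - patch(I_k, u, v - step)).ravel() / (2*step)
        J = np.stack([gu, gv], axis=1)
        du, dv = np.linalg.lstsq(J, -r, rcond=None)[0]
        u, v = u + du, v + dv
        if du*du + dv*dv < 1e-6:
            break
    return u, v                   # subpixel-refined feature position u'_i
```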
C. Pose and Structure Refinement
- In the previous step, the authors have established feature correspondence with subpixel accuracy at the cost of violating the epipolar constraints.
- This is the well-known problem of motion-only BA [17] and can be solved efficiently using an iterative non-linear least-squares minimization algorithm such as Gauss-Newton (a minimal sketch follows this list).
- Finally, it is possible to apply local BA, in which both the pose of all close keyframes as well as the observed 3D points are jointly optimized.
- The BA step is omitted in the fast parameter setting of the algorithm (Section VII).
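A hedged sketch of motion-only bundle adjustment follows: Gauss-Newton refinement of the camera pose by minimising the reprojection error between the refined 2D feature positions from the previous step and the projected 3D map points. It reuses `se3_exp` and `project` from the sketches above and substitutes numerical Jacobians for the analytic ones a real implementation would use.

```python
# Motion-only BA (Section IV-C): optimise the 6-DoF pose, keep points fixed.
import numpy as np

def reproj_residuals(xi, T, pts_w, obs, K):
    """Stacked 2D reprojection errors for world points pts_w (Nx3) against
    their refined image observations obs (Nx2), for a perturbation xi of T."""
    Txi = se3_exp(xi) @ T
    r = []
    for p, u in zip(pts_w, obs):
        p_cam = (Txi @ np.append(p, 1.0))[:3]
        r.extend(project(K, p_cam) - u)
    return np.asarray(r)

def refine_pose(T, pts_w, obs, K, iters=10, eps=1e-6):
    """Gauss-Newton on the reprojection error; returns the refined pose."""
    for _ in range(iters):
        r0 = reproj_residuals(np.zeros(6), T, pts_w, obs, K)
        J = np.zeros((len(r0), 6))
        for j in range(6):
            d = np.zeros(6); d[j] = eps
            J[:, j] = (reproj_residuals(d, T, pts_w, obs, K) - r0) / eps
        dxi = np.linalg.lstsq(J, -r0, rcond=None)[0]
        T = se3_exp(dxi) @ T
        if np.linalg.norm(dxi) < 1e-8:
            break
    return T
```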
D. Discussion
- The first (Section IV-A) and the last (Section IV-C) optimization of the algorithm seem to be redundant as both optimize the 6 DoF pose of the camera.
- Indeed, one could directly start with the second step and establish feature correspondence through Lucas-Kanade tracking [27] of all feature-patches, followed by nonlinear pose refinement (Section IV-C).
- While this would work, the processing time would be higher.
- In SVO, however, feature alignment is efficiently initialized by optimizing only six parameters (the camera pose) in the sparse image alignment step.
- The authors found empirically that using the first step only results in significantly more drift compared to using all three steps together.
V. MAPPING
- Given an image and its pose {Ik,Tk,w}, the mapping thread estimates the depth of 2D features for which the corresponding 3D point is not yet known.
- The proposed depth estimation is very efficient because only a small range around the current depth estimate is searched along the epipolar line; in the authors' case, the range corresponds to twice the standard deviation of the current depth estimate (see the depth-filter sketch after this list).
- Subsequently, the optimization is initialized at the next finer level.
- When a new keyframe is inserted in the map, the keyframe farthest apart from the current position of the camera is removed.
- The same grid is also used for reprojecting the map before feature alignment.
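The depth-filter update can be illustrated with the following hedged sketch. The paper's actual filter follows the Gaussian + uniform mixture model of Vogiatzis and Hernández, whose posterior is approximated by a Gaussian over depth times a Beta distribution over the inlier ratio; the simplification below keeps only the Gaussian part, i.e. plain recursive Bayesian fusion of depth measurements.

```python
# Simplified recursive depth filter (Section V); outlier modelling omitted.
import numpy as np

class DepthFilter:
    def __init__(self, mu0, sigma2_0):
        self.mu = mu0           # current depth estimate (mean)
        self.sigma2 = sigma2_0  # its variance; large at initialisation

    def update(self, z, tau2):
        """Fuse a new depth measurement z with measurement variance tau2."""
        s2 = 1.0 / (1.0 / self.sigma2 + 1.0 / tau2)
        self.mu = s2 * (self.mu / self.sigma2 + z / tau2)
        self.sigma2 = s2

    def search_range(self):
        """Epipolar search is restricted to +/- 2 standard deviations
        around the current estimate, as described above."""
        s = np.sqrt(self.sigma2)
        return self.mu - 2.0 * s, self.mu + 2.0 * s

    def converged(self, thresh):
        return np.sqrt(self.sigma2) < thresh
```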
VII. EXPERIMENTAL RESULTS
- Experiments were performed on datasets recorded from a downward-looking camera (Matrix Vision BlueFox, global shutter, 752×480 pixel resolution) attached to a MAV and on sequences from a handheld camera.
- The video was processed both on a laptop and on an embedded platform mounted on the MAV.
- Note that at most two CPU cores are used by the algorithm.
- On the embedded platform, only the fast parameter setting is used.
- The authors compare the performance of SVO with the modified PTAM algorithm of [2].
A. Accuracy
- The ground-truth for the trajectory originates from a motion capture system.
- In order to generate the plots, the authors aligned the first 10 frames with the ground-truth using [31].
- In [2], the use of lower resolution images is motivated by the fact that high-frequency self-similar texture in the image results in too many outlier 3D points.
- The difference in accuracy between the fast and accurate parameter setting is not significant.
- Optimizing the pose and the observed 3D points separately at every iteration (fast parameter setting) is accurate enough for MAV motion estimation.
B. Runtime Evaluation
- Figures 13 and 14 show a breakdown of the time required to compute the camera motion on the specified laptop and embedded platform, respectively, with the fast parameter setting.
- The laptop processes frames at more than 300 frames per second (fps), while the embedded platform runs at 55 fps.
- The corresponding rates for PTAM are 91 fps and 27 fps, respectively.
- The reason the authors can reliably track the camera with fewer features is the use of depth-filters, which ensures that the features being tracked are reliable.
- The time required by the mapping thread to update all depth-filters with the new frame is highly dependent on the number of filters.
VIII. CONCLUSION
- The authors proposed the semi-direct VO pipeline “SVO” that is precise and faster than the current state-of-the-art.
- The gain in speed is due to the fact that feature extraction and matching are not required for motion estimation.
- Instead, a direct method is used, which is based directly on the image intensities.
- The algorithm is particularly useful for state estimation onboard MAVs as it runs at more than 50 frames per second on current embedded computers.
- High frame-rate motion estimation, combined with an outlier-resistant probabilistic mapping method, provides increased robustness in scenes of little, repetitive, and high-frequency texture.
Frequently Asked Questions (18)
Q2. Why can the authors reliably track the camera with fewer features?
The reason the authors can reliably track the camera with fewer features is the use of depth-filters, which ensures that the features being tracked are reliable.
Q3. What is the standard approach to recovering camera motion and structure from video?
The standard approach is to extract a sparse set of salient image features (e.g., points, lines) in each image; match them in successive frames using invariant feature descriptors; robustly recover both camera motion and structure using epipolar geometry; and finally refine the pose and structure through reprojection-error minimization.
Q4. What is the main advantage of the proposed methods over the standard approach of triangulating points?
The main advantage of the proposed methods over the standard approach of triangulating points from two views is that the authors observe far fewer outliers as every filter undergoes many measurements until convergence.
Q5. What is the reason for the success of these methods?
A reason for the success of these methods is the availability of robust feature detectors and descriptors that allow matching between images even at large inter-frame movement.
Q6. What is the contribution of this paper?
The contributions of this paper are: (1) a novel semi-direct VO pipeline that is faster and more accurate than the current state-of-the-art for MAVs, and (2) the integration of a probabilistic mapping method that is robust to outlier measurements.
Q7. How does the motion estimation process work?
Motion estimation concludes by refining the pose and the structure through minimizing the reprojection error introduced in the previous feature-alignment step.
Q8. What is the reason for the improved accuracy of the image?
The improved accuracy is due to the alignment of the new image with respect to the keyframes and the map, whereas sparse image alignment aligns the new frame only with respect to the previous frame.
Q9. How is the proposed depth estimation performed?
The proposed depth estimation is very efficient when only a small range around the current depth estimate on the epipolar line is searched; in their case the range corresponds to twice the standard deviation of the current depth estimate.
Q10. Why does the PTAM version of [2] use lower-resolution images?
In [2], the use of lower resolution images is motivated by the fact that high-frequency self-similar texture in the image results in too many outlier 3D points.
Q11. What is the first step in the motion estimation thread?
The first step is pose initialisation through sparse model-based image alignment: the camera pose relative to the previous frame is found through minimizing the photometric error between pixels corresponding to the projected location of the same 3D points (see Figure 2).
Q12. What is the inverse compositional formulation of the intensity residual?
The authors use the inverse compositional formulation [27] of the intensity residual, which computes the update step T(ξ) for the reference image at time k−1:

$$\delta I(\xi,\mathbf{u}_i) = I_k\big(\pi(\hat{T}_{k,k-1}\cdot\mathbf{p}_i)\big) - I_{k-1}\big(\pi(T(\xi)\cdot\mathbf{p}_i)\big), \qquad (8)$$

with $\mathbf{p}_i = \pi^{-1}(\mathbf{u}_i, d_{\mathbf{u}_i})$.
Q13. Over which image region are the intensity residuals evaluated?
The residuals are evaluated over the region

$$\bar{\mathcal{R}} = \big\{\, \mathbf{u} \;\big|\; \mathbf{u} \in \mathcal{R}_{k-1} \,\wedge\, \pi\big(T \cdot \pi^{-1}(\mathbf{u}, d_{\mathbf{u}})\big) \in \Omega_k \,\big\}. \qquad (6)$$

For the sake of simplicity, the authors assume in the following that the intensity residuals are normally distributed with unit variance.
Q14. How is the feature alignment algorithm solved?
The feature-alignment step then optimizes each 2D feature position in the new image individually by minimizing the photometric error of the patch in the current image with respect to the reference patch in the keyframe r:

$$\mathbf{u}'_i = \arg\min_{\mathbf{u}'_i} \tfrac{1}{2}\,\big\| I_k(\mathbf{u}'_i) - A_i \cdot I_r(\mathbf{u}_i) \big\|^{2}, \quad \forall i. \qquad (13)$$

This alignment is solved using the inverse compositional Lucas-Kanade algorithm [27].
Q15. Can the scale of the map be obtained through a Structure-from-Motion pipeline?
Since a camera is only an angle-sensor, it is impossible to obtain the scale of the map through a Structure from Motion pipeline.
Q16. How is the drift reported?
Since the plots are highly dependent on the accuracy of the alignment of the first 10 frames, the authors also report the drift in meters per second.
Q17. Why is the PTAM version of [2] less accurate?
The authors suspect the main reason for this result to originate from the fact that the PTAM version of [2] does not extract features on the pyramid level of highest resolution and subpixel refinement is not performed for all features in PTAM.
Q18. What is the difference between direct methods and feature-based methods?
Since direct methods operate directly on the intensity values of the image, the time for feature detection and invariant descriptor computation can be saved.