
DSAC - Differentiable RANSAC for Camera Localization
Eric Brachmann¹, Alexander Krull¹, Sebastian Nowozin², Jamie Shotton², Frank Michel¹, Stefan Gumhold¹, Carsten Rother¹
¹TU Dresden, ²Microsoft
Abstract
RANSAC is an important algorithm in robust optimiza-
tion and a central building block for many computer vision
applications. In recent years, traditionally hand-crafted
pipelines have been replaced by deep learning pipelines,
which can be trained in an end-to-end fashion. However,
RANSAC has so far not been used as part of such deep
learning pipelines, because its hypothesis selection proce-
dure is non-differentiable. In this work, we present two dif-
ferent ways to overcome this limitation. The most promising
approach is inspired by reinforcement learning, namely to
replace the deterministic hypothesis selection by a proba-
bilistic selection for which we can derive the expected loss
w.r.t. all learnable parameters. We call this approach
DSAC, the differentiable counterpart of RANSAC. We apply
DSAC to the problem of camera localization, where deep
learning has so far failed to improve on traditional ap-
proaches. We demonstrate that by directly minimizing the
expected loss of the output camera poses, robustly estimated
by RANSAC, we achieve an increase in accuracy. In the fu-
ture, any deep learning pipeline can use DSAC as a robust
optimization component.¹
1. Introduction
Introduced in 1981, the random sample consensus (RANSAC) algorithm [11] remains the most important algorithm for robust estimation. It is easy to implement, it can be applied to a wide range of problems, and it is able to handle data with a substantial percentage of outliers, i.e. data points that are not explained by the data model. RANSAC and variants thereof [39, 28, 7] have, for many years, been important tools in computer vision, including multi-view geometry [16], object retrieval [29], pose estimation [36, 4] and simultaneous localization and mapping (SLAM) [27]. Solutions to these diverse tasks often involve a common strategy: local predictions (e.g. feature matches) induce a global model (e.g. a homography). In this schema, RANSAC provides robustness to erroneous local predictions.

¹ We will make our source code publicly available on the DSAC project website.
Recently, deep learning has been shown to be highly successful at image recognition tasks [37, 17, 13, 31], and, increasingly, in other domains including geometry [10, 19, 20, 9]. Part of this recent success is the ability to perform end-to-end training, i.e. propagating gradients back through an entire pipeline to allow the direct optimization of a task-specific loss function; examples include [41, 1, 38].
In this work, we are interested in learning components of a computer vision pipeline that follows the principle: predict locally, fit globally. As explained earlier, RANSAC is an integral component of this widespread strategy. We ask whether such a pipeline can be trained end-to-end. More specifically, we want to learn the parameters of a convolutional neural network (CNN) such that models, fit robustly to its predictions via RANSAC, minimize a task-specific loss function.
RANSAC works by first creating multiple model hy-
potheses from small, random subsets of data points. Then
it scores each hypothesis by determining its consensus with
all data points. Finally, RANSAC selects the hypothesis
with the highest consensus as the final output. Unfortu-
nately, this hypothesis selection is non-differentiable, mean-
ing that it cannot directly be used in an end-to-end-trained
deep learning pipeline.
A common approach within the deep learning commu-
nity is to soften non-differentiable operators, e.g. argmax
in LIFT [
41] or visual word assignment in NetVLAD [1]. In
the case of RANSAC, the non-differentiable operator is the
argmax operator which selects the highest scoring hypoth-
esis. Similar to [41], we might replace the argmax with a soft argmax, which is a weighted average of arguments [6].
We indeed explore this direction but argue that this substitu-
tion changes the underlying principle of RANSAC. Instead
of learning how to select a good hypothesis, the pipeline
learns a (robust) average of hypotheses. We show experi-
mentally that this approach learns to focus on a narrow se-
lection of hypotheses and is prone to overfitting.
Alternatively, we aim to preserve the hard hypothesis se-
lection but treat it as a probabilistic process. We call this

approach DSAC (Differentiable SAmple Consensus), our new, differentiable counterpart to RANSAC. DSAC allows us to differentiate the expected loss of the pipeline w.r.t. all learnable parameters. This technique is well known in reinforcement learning for stochastic computation problems, e.g. in policy gradient approaches [34].
To demonstrate the principle, we choose the problem of
camera localization: From a single RGB image in a known
static scene, we estimate the 6D camera pose (3D transla-
tion and 3D rotation) relative to the scene. We demonstrate
an end-to-end trainable solution for this problem, build-
ing on the scene coordinate regression forest (SCoRF) ap-
proach [
36, 40, 5]. The original SCoRF approach uses a
regression forest to predict the 3D location of each pixel
in an observed image in terms of ‘scene coordinates’. A
hypothesize-verify-refine RANSAC loop then randomly selects
scene coordinates of four pixel locations to generate an
initial set of camera pose hypotheses, which is then itera-
tively pruned and refined until a single high-quality pose es-
timate remains. In contrast to previous SCoRF approaches,
we adopt two CNNs for predicting scene coordinates and
for scoring hypotheses. More importantly, the key novelty
of this work is to replace RANSAC by our new, differen-
tiable DSAC.
Our contributions are, in short:
- We present and discuss two alternative ways of making RANSAC differentiable: by soft argmax and by probabilistic selection. We call our new RANSAC version with the latter option DSAC (Differentiable SAmple Consensus).
- We put both options into a new end-to-end trainable camera localization pipeline. It contains two separate CNNs, linked by our new RANSAC, motivated by previous work [36, 23].
- We validate experimentally that probabilistic selection is superior, i.e. less sensitive to overfitting, for our application. We conjecture that its advantage lies in allowing hard decisions while, at the same time, keeping broad distributions over possible decisions.
- We exceed the state-of-the-art results on camera localization by 3.3%.
1.1. Related Work
Over the last decades, researchers have proposed many
variants of the original RANSAC algorithm [
11]. Most
works focus on either or both of two aspects: speed
[
8, 28, 7], or quality of the final estimate [39, 8]. For de-
tailed information about RANSAC variants we refer the
reader to [30]. To the best of our knowledge, this work
is the first to introduce a differentiable variant of RANSAC
for the purpose of end-to-end learning.
In the following, we review previous work on differen-
tiable algorithms and solutions for the problem of camera
localization.
Differentiable Algorithms. The success of deep learning
began with systems in which a CNN processes an image
in one forward pass to directly predict the desired output,
e.g. class probabilities [22], a semantic segmentation [25]
or depth values and normals [
10]. Given a sufficient amount
of training data, CNNs can autonomously discover useful
strategies for solving a task at hand, e.g. hierarchical part-
structures for object recognition [
42].
However, for many computer vision tasks, useful strate-
gies have been known for a long time. Recently, researchers
started to revisit and encode such strategies explicitly in
deep learning pipelines. This can reduce the necessary
amount of training data compared to CNNs with an un-
constrained architecture [
35]. Yi et al. [41] introduced a
stack of CNNs that remodels the established sparse fea-
ture pipeline of detection, orientation estimation and de-
scription, originally proposed in [
26]. Arandjelovic et
al. [
1] mapped the Vector of Locally Aggregated Descrip-
tors (VLAD) [
2] to a CNN architecture for place recogni-
tion. Thewlis et al. [
38] substituted the recursive decoding
of Deep Matching [
32] with reverse convolutions for end-
to-end trainable dense image matching.
Similar in spirit to these works, we show how to train
an established, RANSAC-based computer vision pipeline
in an end-to-end fashion. Instead of substituting hard as-
signments by soft counterparts as in [
41, 1], we enable end-
to-end learning by turning the hard selection into a proba-
bilistic process. Thus, we are able to calculate gradients to
minimize the expectation of the task loss function [
34].
Camera Localization. The SCoRF camera localization
pipeline [
36], already discussed in the introduction, has
been extended in several works. Guzman-Rivera et al. [
14]
trained a random forest to predict diverse scene coordinates
to resolve scene ambiguities. Valentin et al. [40] trained the
random forest to predict multi-model distributions of scene
coordinates for increased pose accuracy. Brachmann et
al. [
5] addressed camera localization from an RGB image
instead of RGB-D, utilizing the increased predictive power
of an auto-context random forest. None of these works sup-
port end-to-end learning.
In a system similar to SCoRF but for the task of object
pose estimation, Krull et al. [23] trained a CNN to measure
hypothesis consensus by comparing rendered and observed
images. In this work, we adopt the idea of a CNN measur-
ing hypothesis consensus, but learn it jointly with the scene
coordinate regressor and in an end-to-end fashion.
Kendall et al. [
20] demonstrated that a single CNN is
able to directly regress the 6D camera pose given an RGB
image, but its accuracy on indoor scenes is inferior to an RGB-based SCoRF pipeline [5].


[Figure 1 (graph): pipeline stages Correspondence Prediction, Minimal Set Sampling, Hypothesis Generation, Scoring, Hypothesis Selection, Refinement, Loss, Ground Truth; panels a) Vanilla RANSAC, b) Soft Selection (SoftAM), c) Probabilistic Selection (DSAC).]
Figure 1. Stochastic Computation Graphs [34]. A graphical representation of three RANSAC variants investigated in this work. The
variants differ in the way they select the final model hypothesis: a) non-differentiable, vanilla RANSAC with hard, deterministic argmax
selection; b) differentiable RANSAC with deterministic, soft argmax selection; c) differentiable RANSAC with hard, probabilistic se-
lection (named DSAC). Nodes shown as boxes represent deterministic functions, while circular nodes with yellow background represent
probabilistic functions. Arrows indicate dependency in computation. All differences between a), b) and c) are marked in red.
2. Method
2.1. Background
As a preface to explaining our method, we first briefly
review the standard RANSAC algorithm for model fitting,
and how it can be applied to the camera localization prob-
lem using discriminative scene coordinate regression.
Many problems in computer vision involve fitting a
model to a set of data points, which in practice usually in-
clude outliers due to sensor noise and other factors. The
RANSAC algorithm was specifically designed to be able to
fit models robustly in the presence of noise [11]. Dozens of
variations of RANSAC exist [
39, 8, 28, 7]. We consider a
general, basic variant here but the new principles presented
in this work can be applied to many RANSAC variants, such
as to locally-refined preemptive RANSAC [
36].
A basic RANSAC implementation consists of four steps:
(i) generate a set of model hypotheses by sampling minimal
subsets of the data; (ii) score hypotheses based on some
measure of consensus, e.g. by counting inliers; (iii) select
the best scoring hypothesis; (iv) refine the selected hypoth-
esis using additional data points, e.g. the full set of inliers.
Step (iv) is optional, though in practice important for high
accuracy.
We introduce our notation below using the example application of camera localization. We consider an RGB image I consisting of pixels indexed by i. We wish to estimate the parameters h̃ of a model that explains I. In the camera localization problem this is the 6D camera pose, i.e. the 3D rotation and 3D translation of the camera relative to the scene's coordinate frame. Following [36], we do not fit model h̃ directly to image data I, but instead make use of intermediate, noisy 2D-3D correspondences predicted for each pixel: Y(I) = {y(I, i) | ∀i}, where y(I, i) is the 'scene coordinate' of pixel i, i.e. a discriminative prediction for where the point imaged at pixel i lives in the 3D scene coordinate frame. We will use y_i as shorthand for y(I, i). Y(I) denotes the complete set of scene coordinate predictions for image I, and we write Y for Y(I). To estimate h̃ from Y we apply RANSAC as follows:
1. Generate a pool of hypotheses. Each hypothesis is generated from a subset of correspondences. This subset contains the minimal number of correspondences to compute a unique solution. We call this a minimal set Y_J with correspondence indices J = {j_1, ..., j_n}, where n is the minimal set size. To create the set, we uniformly sample n correspondence indices j_m ∈ [1, ..., |Y|] to get Y_J := {y_{j_1}, ..., y_{j_n}}. We assume a function H which generates a model hypothesis as h_J = H(Y_J) from the minimal set Y_J. In our application, H is the perspective-n-point (PNP) algorithm [12], and n = 4.
2. Score hypotheses. The scalar function s(h_J, Y) measures the consensus / quality of hypothesis h_J, e.g. by counting inlier correspondences. To define an inlier in our application, we first define the reprojection error of scene coordinate y_i:

    e_i = \| p_i - C h_J y_i \|,   (1)

where p_i is the 2D location of pixel i and C is the camera projection matrix. We call y_i an inlier if e_i < τ, where τ is the inlier threshold. In this work, instead of counting inliers, we aim to learn s(h_J, Y) to directly regress the hypothesis score from the reprojection errors e_i, as we will explain shortly.
3. Select best hypothesis. We take

    h_{AM} = \operatorname*{argmax}_{h_J} s(h_J, Y).   (2)

4. Refine hypothesis. h_{AM} is refined using a function R(h_{AM}, Y). Refinement may use all correspondences Y. A common approach is to select a set of inliers from Y and recalculate the function H on this set. The refined pose is the output of the algorithm: h̃_{AM} = R(h_{AM}, Y).
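To make the four steps concrete, the following is a minimal Python sketch of this basic RANSAC loop. It is illustrative only: solve_pnp and reprojection_error are hypothetical placeholders standing in for the minimal solver H and the error of Eq. 1, and the score is the simple inlier count of step 2, not the learned scoring function introduced later in this work.

```python
import numpy as np

def ransac_pose(correspondences, solve_pnp, reprojection_error,
                n_hypotheses=256, min_set_size=4, tau=10.0):
    """Minimal sketch of the four-step RANSAC loop described above.

    correspondences: list of (p_i, y_i) pixel / scene-coordinate pairs.
    solve_pnp, reprojection_error: placeholder callables for the minimal
    solver H and the reprojection error of Eq. 1.
    """
    rng = np.random.default_rng()
    hypotheses, scores = [], []
    for _ in range(n_hypotheses):
        # Step 1: sample a minimal set Y_J and generate a hypothesis h_J = H(Y_J).
        J = rng.choice(len(correspondences), size=min_set_size, replace=False)
        h = solve_pnp([correspondences[j] for j in J])
        # Step 2: score by counting inliers, i.e. correspondences with e_i < tau.
        errors = np.array([reprojection_error(h, p, y) for p, y in correspondences])
        hypotheses.append(h)
        scores.append(int((errors < tau).sum()))
    # Step 3: select the highest-scoring hypothesis (argmax of Eq. 2).
    h_best = hypotheses[int(np.argmax(scores))]
    # Step 4: refine by re-running the solver on the inlier set.
    errors = np.array([reprojection_error(h_best, p, y) for p, y in correspondences])
    inliers = [correspondences[i] for i in np.flatnonzero(errors < tau)]
    return solve_pnp(inliers) if len(inliers) >= min_set_size else h_best
```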
2.2. Learning in a RANSAC Pipeline

The system of Shotton et al. [36] had a single learned component, namely the regression forest that made the predictions y(I, i). Krull et al. [23] extended the approach to also learn the scoring function s(h_J, Y) as a generalization of the simpler inlier counting scheme of [36]. However, these have thus far been learned separately.
Our work instead aims to learn both the scene coordinate predictions and the scoring function, and to do so jointly in an end-to-end fashion within a RANSAC framework. Making the parameterizations explicit, we have y(I, i; w) and s(h_J, Y; v). We aim to learn parameters w and v, where w affects the quality of the poses that we generate, and v affects the selection process, which should choose a good hypothesis. We write Y^w to reflect that scene coordinate predictions depend on parameters w. Similarly, we write h^{w,v}_{AM} to reflect that the chosen hypothesis depends on w and v.
We would like to find parameters w and v such that the loss ℓ of the final, refined hypotheses over a training set of images \mathcal{I} is minimized, i.e.

    \tilde{w}, \tilde{v} = \operatorname*{argmin}_{w,v} \sum_{I \in \mathcal{I}} \ell(R(h^{w,v}_{AM}, Y^w), h^*),   (3)

where h^* are the ground truth model parameters for I. To allow end-to-end learning, we need to differentiate w.r.t. w and v. We assume a differentiable loss ℓ and a differentiable refinement R.
One might consider differentiating h^{w,v}_{AM} w.r.t. w via the minimal set Y_J of the single selected hypothesis of Eq. 2. But learning a RANSAC pipeline in this fashion fails, because the selection process itself depends on w and v, which is not represented in the gradients of the selected hypothesis.² Parameters v influence the selection directly via the scoring function s(h, Y; v), and parameters w influence the quality of competing hypotheses h, though neither influences the initial uniform sampling of minimal sets Y_J.

We next present two approaches to learning parameters w and v, soft argmax selection (Sec. 2.2.1) and probabilistic selection (Sec. 2.2.2), that do model the dependency of the selection process on the parameters.
2.2.1 Soft argmax Selection (SoftAM)

To solve the problem of non-differentiability, one can relax the argmax operator of Eq. 2 and replace it with a soft argmax operator [6]. The soft argmax turns the hypothesis selection into a weighted average of hypotheses:

    h^{w,v}_{SoftAM} = \sum_J P(J|v,w) \, h^w_J,   (4)

which averages over candidate hypotheses h^w_J with

    P(J|v,w) = \frac{\exp(s(h^w_J, Y^w; v))}{\sum_{J'} \exp(s(h^w_{J'}, Y^w; v))}.   (5)
In this variant, the scoring function s(h^w_J, Y^w; v) has to predict weights that lead to a robust average of hypotheses (i.e. model parameters). This means that model parameters corrupted by outliers should receive sufficiently small weights, such that they do not affect the accuracy of h^{w,v}_{SoftAM}.
Substituting h^{w,v}_{SoftAM} for h^{w,v}_{AM} in Eq. 3 allows us to calculate gradients to learn parameters w and v. We refer the reader to the supplementary materials for details.
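As an illustration, the following sketch computes the soft argmax selection of Eqs. 4 and 5 for a pool of hypotheses, assuming each hypothesis is given as a flat parameter vector; averaging pose parameters component-wise is a simplification, since rotations require more care in practice.

```python
import numpy as np

def soft_argmax_hypothesis(hypotheses, scores):
    """Soft argmax selection (Eqs. 4-5): a softmax-weighted average of hypotheses.

    hypotheses: (N, d) array of model parameter vectors h_J.
    scores:     (N,) array of scores s(h_J, Y; v).
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()                  # P(J | v, w), Eq. 5
    return (weights[:, None] * np.asarray(hypotheses)).sum(axis=0)  # Eq. 4
```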
By utilizing the soft argmax operator, we diverge from the RANSAC principle of making one hard decision for a hypothesis. Soft argmax hypothesis selection bears similarity to an independent strain within the field of robust optimization, namely robust averaging; see e.g. the work of Hartley et al. [15]. While we explore soft argmax selection in the experimental evaluation, we introduce an alternative in the next section that preserves the hard hypothesis selection and is empirically superior for our task.
2.2.2 Probabilistic Selection (DSAC)

We substitute the deterministic selection of the highest scoring model hypothesis in Eq. 2 with a probabilistic selection, i.e. we choose a hypothesis probabilistically according to

    h^{w,v}_{DSAC} = h^w_J, \quad \text{with } J \sim P(J|v,w),   (6)

where P(J|v,w) is the softmax distribution of the scores predicted by s(h^w_J, Y^w; v) (see Eq. 5).
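The selection of Eq. 6 thus remains a hard choice of a single hypothesis; only the choice itself is random. A minimal sketch with illustrative function names:

```python
import numpy as np

def dsac_select(hypotheses, scores, rng=None):
    """Probabilistic hypothesis selection of Eq. 6: sample J ~ P(J | v, w)."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.asarray(scores, dtype=float)
    probs = np.exp(s - s.max())
    probs /= probs.sum()                       # softmax of the scores, Eq. 5
    J = rng.choice(len(hypotheses), p=probs)   # hard but probabilistic choice
    return hypotheses[J]
```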
² We observed in early experiments that the training loss immediately increases without recovering.

The inspiration for this approach comes from policy gradient approaches in reinforcement learning that involve the minimization of a loss function defined over a stochastic process [34]. Similarly, we are able to learn parameters w and v that minimize the expected loss of the stochastic process defined in Eq. 6:

    \tilde{w}, \tilde{v} = \operatorname*{argmin}_{w,v} \sum_{I \in \mathcal{I}} \mathbb{E}_{J \sim P(J|v,w)}\left[\ell(R(h^w_J, Y^w))\right].   (7)

As shown in [34], we can calculate the derivative w.r.t. parameters w as follows (and similarly for parameters v):

    \frac{\partial}{\partial w} \mathbb{E}_{J \sim P(J|v,w)}\left[\ell(\cdot)\right] = \mathbb{E}_{J \sim P(J|v,w)}\left[\ell(\cdot) \frac{\partial}{\partial w} \log P(J|v,w) + \frac{\partial}{\partial w} \ell(\cdot)\right],   (8)

i.e. the derivative of the expectation is an expectation over derivatives of the loss and of the log probabilities of model hypotheses. We include further steps of the derivation of Eq. 8 in the supplementary materials.
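Since hypotheses are drawn from a finite pool (Sec. 3 samples 256 of them), the expectation in Eq. 7 can also be computed exactly by enumerating the pool, and differentiating that sum with automatic differentiation yields the same gradient as Eq. 8. A minimal PyTorch-style sketch, assuming scores and losses have been computed upstream by the two networks:

```python
import torch

def dsac_expected_loss(scores, losses):
    """Expected task loss of Eq. 7 over an enumerated hypothesis pool.

    scores: (N,) tensor of s(h_J, Y; v), differentiable w.r.t. v (and w).
    losses: (N,) tensor of pose losses of the refined hypotheses,
            differentiable w.r.t. w.
    Since d/dw sum_J P_J * l_J = E_J[l_J * dlogP_J/dw + dl_J/dw],
    backpropagating through this sum reproduces the gradient of Eq. 8.
    """
    probs = torch.softmax(scores, dim=0)   # P(J | v, w), Eq. 5
    return (probs * losses).sum()          # E_{J ~ P(J|v,w)}[loss], Eq. 7

# Usage (sketch): dsac_expected_loss(scores, losses).backward()
```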
We call this method of differentiating RANSAC, which preserves hard hypothesis selection, DSAC: Differentiable SAmple Consensus. See Fig. 1 for a schematic view of DSAC in comparison to the RANSAC variants introduced at the beginning of this section. While learning parameters with vanilla RANSAC is not possible, as mentioned before, both new variants (SoftAM and DSAC) are sensible options, which we evaluate in the experimental section.
3. Differentiable Camera Localization
We demonstrate the principles for differentiating
RANSAC for the task of one-shot camera localization from
an RGB image. Our pipeline is inspired by the state-of-the-
art pipeline of Brachmann et al. [
5], which is an extension
of the original SCoRF pipeline [
36] from RGB-D to RGB
images. Brachmann et al. use an auto-context random for-
est to predict multi-modal scene coordinate distributions per
image patch. After that, minimal sets of four scene coordi-
nates are randomly sampled and the PNP algorithm [
12] is
applied to create a pool of camera pose hypotheses. A pre-
emptive RANSAC schema iteratively refines, re-scores and
rejects hypotheses until only one remains. The preemptive
RANSAC scores hypotheses by counting inlier scene coordinates, i.e. scene coordinates y_i for which the reprojection error e_i < τ. In a last step, the final, remaining hypothesis is further optimized using the uncertainty of the scene coordinate distributions.
Our pipeline differs from Brachmann et al. [5] in the following aspects:
- Instead of a random forest, we use a CNN (called 'Coordinate CNN' below) to predict scene coordinates. For each 42x42 pixel image patch, it predicts a scene coordinate point estimate. We use a VGG-style architecture with 13 layers and 33M parameters. To reduce test time, we process only 40x40 patches per image.
- We score hypotheses using a second CNN (called 'Score CNN' below). We took inspiration from the work of Krull et al. [23] for the task of object pose estimation. Instead of learning a CNN to compare rendered and observed images as in [23], our Score CNN predicts hypothesis consensus based on reprojection errors. For each of the 40x40 scene coordinate predictions y_i we calculate the reprojection error e_i for hypothesis h_J (see Eq. 1). This results in a 40x40 reprojection-error image, which we feed into the Score CNN, a VGG-style architecture with 13 layers and 6M parameters (see the sketch after this list).
- Instead of the preemptive RANSAC schema, we score hypotheses only once and select the final pose, either by applying the soft argmax operator (SoftAM) or by probabilistic selection according to the softmaxed scores (DSAC).
- Only the final pose is refined. We choose inlier object coordinate predictions (at most 100), i.e. scene coordinates y_i with reprojection error e_i < τ, and solve PNP [24] again using this set. This is iterated multiple times. Since the Coordinate CNN predicts only point estimates, we do no further pose optimization using uncertainty.
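The following sketch illustrates how such a reprojection-error image could be assembled from the per-patch predictions. The function and argument names are illustrative, and it assumes the hypothesis pose (R, t) maps scene coordinates into the camera frame of a pinhole camera with intrinsics K.

```python
import numpy as np

def reprojection_error_image(scene_coords, pixel_coords, R, t, K):
    """Sketch: build the 40x40 reprojection-error image fed to the Score CNN.

    scene_coords: (40, 40, 3) scene coordinate predictions y_i.
    pixel_coords: (40, 40, 2) 2D patch centers p_i.
    R, t:         rotation matrix and translation of hypothesis pose h_J
                  (assumed to map scene coordinates into the camera frame).
    K:            3x3 intrinsic camera matrix.
    """
    pts = scene_coords.reshape(-1, 3) @ R.T + t       # transform into camera frame
    proj = pts @ K.T                                   # apply intrinsics
    proj = proj[:, :2] / proj[:, 2:3]                  # perspective division
    err = np.linalg.norm(proj - pixel_coords.reshape(-1, 2), axis=1)  # e_i of Eq. 1
    return err.reshape(40, 40)
```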
See Fig. 2 for an overview of our pipeline. Where appli-
cable we use the parameter values reported by Brachmann
et al. in [
5], e.g. sampling 256 hypotheses, using 8 refine-
ment steps and an inlier threshold of τ = 10px.
4. Experiments
For comparability to other methods, we show results on
the widely used 7-Scenes dataset [
36]. The dataset consists
of RGB-D images of 7 indoor environments where each
frame is annotated with its 6D camera pose. A 3D model of
each scene is also available. The data of each scene is com-
prised of multiple sequences (= independent camera paths)
which are assigned either to test or training. The number of images per scene ranges from 1k to 7k for training and test, respectively. We omit the depth channels and estimate poses using
RGB images only. See the supplementary materials for a
discussion of the difficulty of the 7-Scenes dataset.
We measure accuracy by the percentage of images for which the pose error is below 5° and 5cm. For training, we use the following differentiable loss, which is closely correlated with the task loss:

    \ell_{pose}(h, h^*) = \max(\angle(\theta, \theta^*), \|t - t^*\|),   (9)

where h = (θ, t), θ denotes the axis-angle representation of the camera rotation, and t is the camera translation.
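For illustration, a small sketch of this loss, assuming axis-angle rotation vectors and translations given in units that make the two terms comparable (e.g. degrees and centimeters, matching the 5°/5cm criterion); the rotational term measures the angle of the relative rotation between estimate and ground truth.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_loss(theta, t, theta_gt, t_gt):
    """Sketch of the pose loss of Eq. 9 for axis-angle rotations theta."""
    # Angle of the relative rotation between estimate and ground truth.
    r_rel = Rotation.from_rotvec(theta_gt).inv() * Rotation.from_rotvec(theta)
    rot_err = np.degrees(np.linalg.norm(r_rel.as_rotvec()))
    # Translational error, assumed to be in cm so both terms are comparable.
    trans_err = np.linalg.norm(np.asarray(t) - np.asarray(t_gt))
    return max(rot_err, trans_err)
```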
