
DSAC - Differentiable RANSAC for Camera Localization
Eric Brachmann¹, Alexander Krull¹, Sebastian Nowozin², Jamie Shotton², Frank Michel¹, Stefan Gumhold¹, Carsten Rother¹
¹TU Dresden, ²Microsoft
Abstract
RANSAC is an important algorithm in robust optimiza-
tion and a central building block for many computer vision
applications. In recent years, traditionally hand-crafted
pipelines have been replaced by deep learning pipelines,
which can be trained in an end-to-end fashion. However,
RANSAC has so far not been used as part of such deep
learning pipelines, because its hypothesis selection proce-
dure is non-differentiable. In this work, we present two dif-
ferent ways to overcome this limitation. The most promising
approach is inspired by reinforcement learning, namely to
replace the deterministic hypothesis selection by a proba-
bilistic selection for which we can derive the expected loss
w.r.t. all learnable parameters. We call this approach
DSAC, the differentiable counterpart of RANSAC. We apply
DSAC to the problem of camera localization, where deep
learning has so far failed to improve on traditional ap-
proaches. We demonstrate that by directly minimizing the
expected loss of the output camera poses, robustly estimated
by RANSAC, we achieve an increase in accuracy. In the fu-
ture, any deep learning pipeline can use DSAC as a robust
optimization component.¹
1. Introduction
Introduced in 1981, the random sample consensus (RANSAC) algorithm [11] remains the most important algorithm for robust estimation. It is easy to implement, it can be applied to a wide range of problems, and it is able to handle data with a substantial percentage of outliers, i.e. data points that are not explained by the data model. RANSAC and variants thereof [39, 28, 7] have, for many years, been important tools in computer vision, including multi-view geometry [16], object retrieval [29], pose estimation [36, 4] and simultaneous localization and mapping (SLAM) [27]. Solutions to these diverse tasks often involve a common strategy: local predictions (e.g. feature matches) induce a global model (e.g. a homography). In this schema, RANSAC provides robustness to erroneous local predictions.

¹ We will make our source code publicly available on the DSAC project website.
Recently, deep learning has been shown to be highly successful at image recognition tasks [37, 17, 13, 31], and, increasingly, in other domains including geometry [10, 19, 20, 9]. Part of this recent success is the ability to perform end-to-end training, i.e. propagating gradients back through an entire pipeline to allow the direct optimization of a task-specific loss function; examples include [41, 1, 38].
In this work, we are interested in learning components of a computer vision pipeline that follows the principle: predict locally, fit globally. As explained earlier, RANSAC is an integral component of this widespread strategy. We ask whether such a pipeline can be trained end-to-end. More specifically, we want to learn the parameters of a convolutional neural network (CNN) such that models, fit robustly to its predictions via RANSAC, minimize a task-specific loss function.
RANSAC works by first creating multiple model hy-
potheses from small, random subsets of data points. Then
it scores each hypothesis by determining its consensus with
all data points. Finally, RANSAC selects the hypothesis
with the highest consensus as the final output. Unfortu-
nately, this hypothesis selection is non-differentiable, mean-
ing that it cannot directly be used in an end-to-end-trained
deep learning pipeline.
A common approach within the deep learning commu-
nity is to soften non-differentiable operators, e.g. argmax
in LIFT [
41] or visual word assignment in NetVLAD [1]. In
the case of RANSAC, the non-differentiable operator is the
argmax operator which selects the highest scoring hypoth-
esis. Similar to [41], we might replace the argmax with a soft argmax, which is a weighted average of arguments [6].
We indeed explore this direction but argue that this substitu-
tion changes the underlying principle of RANSAC. Instead
of learning how to select a good hypothesis, the pipeline
learns a (robust) average of hypotheses. We show experi-
mentally that this approach learns to focus on a narrow se-
lection of hypotheses and is prone to overfitting.
Alternatively, we aim to preserve the hard hypothesis se-
lection but treat it as a probabilistic process. We call this

approach DSAC (Differentiable SAmple Consensus), our new, differentiable counterpart to RANSAC. DSAC allows us to differentiate the expected loss of the pipeline w.r.t. all learnable parameters. This technique is well known in reinforcement learning for stochastic computation problems, e.g. in policy gradient approaches [34].
To demonstrate the principle, we choose the problem of
camera localization: From a single RGB image in a known
static scene, we estimate the 6D camera pose (3D transla-
tion and 3D rotation) relative to the scene. We demonstrate
an end-to-end trainable solution for this problem, build-
ing on the scene coordinate regression forest (SCoRF) ap-
proach [
36, 40, 5]. The original SCoRF approach uses a
regression forest to predict the 3D location of each pixel
in an observed image in terms of ‘scene coordinates’. A
hypothesize-verify-refine RANSAC loop then randomly selects
scene coordinates of four pixel locations to generate an
initial set of camera pose hypotheses, which is then itera-
tively pruned and refined until a single high-quality pose es-
timate remains. In contrast to previous SCoRF approaches,
we adopt two CNNs for predicting scene coordinates and
for scoring hypotheses. More importantly, the key novelty
of this work is to replace RANSAC by our new, differen-
tiable DSAC.
Our contributions are, in short:
- We present and discuss two alternative ways of making RANSAC differentiable: by soft argmax and by probabilistic selection. We call our new RANSAC version with the latter option DSAC (Differentiable SAmple Consensus).
- We put both options into a new end-to-end trainable camera localization pipeline. It contains two separate CNNs, linked by our new RANSAC, motivated by previous work [36, 23].
- We validate experimentally that probabilistic selection is superior, i.e. less sensitive to overfitting, for our application. We conjecture that its advantage lies in allowing hard decisions while, at the same time, keeping broad distributions over possible decisions.
- We exceed the state-of-the-art results on camera localization by 3.3%.
1.1. Related Work
Over the last decades, researchers have proposed many
variants of the original RANSAC algorithm [
11]. Most
works focus on either or both of two aspects: speed
[
8, 28, 7], or quality of the final estimate [39, 8]. For de-
tailed information about RANSAC variants we refer the
reader to [30]. To the best of our knowledge, this work
is the first to introduce a differentiable variant of RANSAC
for the purpose of end-to-end learning.
In the following, we review previous work on differen-
tiable algorithms and solutions for the problem of camera
localization.
Differentiable Algorithms. The success of deep learning
began with systems in which a CNN processes an image
in one forward pass to directly predict the desired output,
e.g. class probabilities [22], a semantic segmentation [25]
or depth values and normals [
10]. Given a sufficient amount
of training data, CNNs can autonomously discover useful
strategies for solving a task at hand, e.g. hierarchical part-
structures for object recognition [
42].
However, for many computer vision tasks, useful strate-
gies have been known for a long time. Recently, researchers
started to revisit and encode such strategies explicitly in
deep learning pipelines. This can reduce the necessary
amount of training data compared to CNNs with an un-
constrained architecture [
35]. Yi et al. [41] introduced a
stack of CNNs that remodels the established sparse fea-
ture pipeline of detection, orientation estimation and de-
scription, originally proposed in [
26]. Arandjelovic et
al. [
1] mapped the Vector of Locally Aggregated Descrip-
tors (VLAD) [
2] to a CNN architecture for place recogni-
tion. Thewlis et al. [
38] substituted the recursive decoding
of Deep Matching [
32] with reverse convolutions for end-
to-end trainable dense image matching.
Similar in spirit to these works, we show how to train
an established, RANSAC-based computer vision pipeline
in an end-to-end fashion. Instead of substituting hard as-
signments by soft counterparts as in [
41, 1], we enable end-
to-end learning by turning the hard selection into a proba-
bilistic process. Thus, we are able to calculate gradients to
minimize the expectation of the task loss function [
34].
Camera Localization. The SCoRF camera localization
pipeline [
36], already discussed in the introduction, has
been extended in several works. Guzman-Rivera et al. [
14]
trained a random forest to predict diverse scene coordinates
to resolve scene ambiguities. Valentin et al. [40] trained the
random forest to predict multi-model distributions of scene
coordinates for increased pose accuracy. Brachmann et
al. [
5] addressed camera localization from an RGB image
instead of RGB-D, utilizing the increased predictive power
of an auto-context random forest. None of these works sup-
port end-to-end learning.
In a system similar to SCoRF but for the task of object
pose estimation, Krull et al. [23] trained a CNN to measure
hypothesis consensus by comparing rendered and observed
images. In this work, we adopt the idea of a CNN measur-
ing hypothesis consensus, but learn it jointly with the scene
coordinate regressor and in an end-to-end fashion.
Kendall et al. [
20] demonstrated that a single CNN is
able to directly regress the 6D camera pose given an RGB
image, but its accuracy on indoor scenes is inferior to an RGB-based SCoRF pipeline [5].


[Figure 1 (graph): pipeline stages Correspondence Prediction, Minimal Set Sampling, Hypothesis Generation, Scoring, Hypothesis Selection, Refinement, Loss, Ground Truth; panels a) Vanilla RANSAC, b) Soft Selection (SoftAM), c) Probabilistic Selection (DSAC).]
Figure 1. Stochastic Computation Graphs [34]. A graphical representation of three RANSAC variants investigated in this work. The
variants differ in the way they select the final model hypothesis: a) non-differentiable, vanilla RANSAC with hard, deterministic argmax
selection; b) differentiable RANSAC with deterministic, soft argmax selection; c) differentiable RANSAC with hard, probabilistic se-
lection (named DSAC). Nodes shown as boxes represent deterministic functions, while circular nodes with yellow background represent
probabilistic functions. Arrows indicate dependency in computation. All differences between a), b) and c) are marked in red.
2. Method
2.1. Background
As a preface to explaining our method, we first briefly
review the standard RANSAC algorithm for model fitting,
and how it can be applied to the camera localization prob-
lem using discriminative scene coordinate regression.
Many problems in computer vision involve fitting a
model to a set of data points, which in practice usually in-
clude outliers due to sensor noise and other factors. The
RANSAC algorithm was specifically designed to be able to
fit models robustly in the presence of noise [11]. Dozens of
variations of RANSAC exist [
39, 8, 28, 7]. We consider a
general, basic variant here but the new principles presented
in this work can be applied to many RANSAC variants, such
as to locally-refined preemptive RANSAC [
36].
A basic RANSAC implementation consists of four steps:
(i) generate a set of model hypotheses by sampling minimal
subsets of the data; (ii) score hypotheses based on some
measure of consensus, e.g. by counting inliers; (iii) select
the best scoring hypothesis; (iv) refine the selected hypoth-
esis using additional data points, e.g. the full set of inliers.
Step (iv) is optional, though in practice important for high
accuracy.
We introduce our notation below using the example application of camera localization. We consider an RGB image I consisting of pixels indexed by i. We wish to estimate the parameters h̃ of a model that explains I. In the camera localization problem this is the 6D camera pose, i.e. the 3D rotation and 3D translation of the camera relative to the scene's coordinate frame. Following [36], we do not fit model h̃ directly to image data I, but instead make use of intermediate, noisy 2D-3D correspondences predicted for each pixel: Y(I) = {y(I, i) | ∀i}, where y(I, i) is the 'scene coordinate' of pixel i, i.e. a discriminative prediction for where the point imaged at pixel i lives in the 3D scene coordinate frame. We will use y_i as shorthand for y(I, i). Y(I) denotes the complete set of scene coordinate predictions for image I, and we write Y for Y(I). To estimate h̃ from Y we apply RANSAC as follows:
1. Generate a pool of hypotheses. Each hypothesis is generated from a subset of correspondences. This subset contains the minimal number of correspondences to compute a unique solution. We call this a minimal set Y_J with correspondence indices J = {j_1, ..., j_n}, where n is the minimal set size. To create the set, we uniformly sample n correspondence indices j_m ∈ [1, ..., |Y|] to get Y_J := {y_{j_1}, ..., y_{j_n}}. We assume a function H which generates a model hypothesis as h_J = H(Y_J) from the minimal set Y_J. In our application, H is the perspective-n-point (PNP) algorithm [12], and n = 4.
2. Score hypotheses. The scalar function s(h_J, Y) measures the consensus / quality of hypothesis h_J, e.g. by counting inlier correspondences. To define an inlier in our application, we first define the reprojection error of scene coordinate y_i:

    e_i = \| p_i - C h_J y_i \|,   (1)

where p_i is the 2D location of pixel i and C is the camera projection matrix. We call y_i an inlier if e_i < τ, where τ is the inlier threshold. In this work, instead of counting inliers, we aim to learn s(h_J, Y) to directly regress the hypothesis score from the reprojection errors e_i, as we will explain shortly.
3. Select best hypothesis. We take

    h_{AM} = \operatorname*{argmax}_{h_J} s(h_J, Y).   (2)

4. Refine hypothesis. h_{AM} is refined using a function R(h_{AM}, Y). Refinement may use all correspondences Y. A common approach is to select a set of inliers from Y and recalculate the function H on this set. The refined pose is the output of the algorithm: h̃_{AM} = R(h_{AM}, Y).
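To make the four steps concrete, the following is a minimal Python sketch of this basic RANSAC loop. It is illustrative only: solve_pnp and reprojection_error are hypothetical placeholders standing in for the minimal solver H and the error of Eq. 1, and the score is the simple inlier count of step 2, not the learned scoring function introduced later in this work.

```python
import numpy as np

def ransac_pose(correspondences, solve_pnp, reprojection_error,
                n_hypotheses=256, min_set_size=4, tau=10.0):
    """Minimal sketch of the four-step RANSAC loop described above.

    correspondences: list of (p_i, y_i) pixel / scene-coordinate pairs.
    solve_pnp, reprojection_error: placeholder callables for the minimal
    solver H and the reprojection error of Eq. 1.
    """
    rng = np.random.default_rng()
    hypotheses, scores = [], []
    for _ in range(n_hypotheses):
        # Step 1: sample a minimal set Y_J and generate a hypothesis h_J = H(Y_J).
        J = rng.choice(len(correspondences), size=min_set_size, replace=False)
        h = solve_pnp([correspondences[j] for j in J])
        # Step 2: score by counting inliers, i.e. correspondences with e_i < tau.
        errors = np.array([reprojection_error(h, p, y) for p, y in correspondences])
        hypotheses.append(h)
        scores.append(int((errors < tau).sum()))
    # Step 3: select the highest-scoring hypothesis (argmax of Eq. 2).
    h_best = hypotheses[int(np.argmax(scores))]
    # Step 4: refine by re-running the solver on the inlier set.
    errors = np.array([reprojection_error(h_best, p, y) for p, y in correspondences])
    inliers = [correspondences[i] for i in np.flatnonzero(errors < tau)]
    return solve_pnp(inliers) if len(inliers) >= min_set_size else h_best
```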
2.2. Learning in a RANSAC Pipeline

The system of Shotton et al. [36] had a single learned component, namely the regression forest that made the predictions y(I, i). Krull et al. [23] extended the approach to also learn the scoring function s(h_J, Y) as a generalization of the simpler inlier counting scheme of [36]. However, these have thus far been learned separately.
Our work instead aims to learn both the scene coordinate predictions and the scoring function, and to do so jointly in an end-to-end fashion within a RANSAC framework. Making the parameterizations explicit, we have y(I, i; w) and s(h_J, Y; v). We aim to learn parameters w and v, where w affects the quality of the poses that we generate, and v affects the selection process, which should choose a good hypothesis. We write Y^w to reflect that scene coordinate predictions depend on parameters w. Similarly, we write h^{w,v}_{AM} to reflect that the chosen hypothesis depends on w and v.
We would like to find parameters w and v such that the loss ℓ of the final, refined hypotheses over a training set of images \mathcal{I} is minimized, i.e.

    \tilde{w}, \tilde{v} = \operatorname*{argmin}_{w,v} \sum_{I \in \mathcal{I}} \ell(R(h^{w,v}_{AM}, Y^w), h^*),   (3)

where h^* are the ground truth model parameters for I. To allow end-to-end learning, we need to differentiate w.r.t. w and v. We assume a differentiable loss ℓ and a differentiable refinement R.
One might consider differentiating h^{w,v}_{AM} w.r.t. w via the minimal set Y_J of the single selected hypothesis of Eq. 2. But learning a RANSAC pipeline in this fashion fails, because the selection process itself depends on w and v, which is not represented in the gradients of the selected hypothesis.² Parameters v influence the selection directly via the scoring function s(h, Y; v), and parameters w influence the quality of competing hypotheses h, though neither influences the initial uniform sampling of minimal sets Y_J.

We next present two approaches to learning parameters w and v, soft argmax selection (Sec. 2.2.1) and probabilistic selection (Sec. 2.2.2), that do model the dependency of the selection process on the parameters.
2.2.1 Soft argmax Selection (SoftAM)

To solve the problem of non-differentiability, one can relax the argmax operator of Eq. 2 and replace it with a soft argmax operator [6]. The soft argmax turns the hypothesis selection into a weighted average of hypotheses:

    h^{w,v}_{SoftAM} = \sum_J P(J|v,w) \, h^w_J,   (4)

which averages over candidate hypotheses h^w_J with

    P(J|v,w) = \frac{\exp(s(h^w_J, Y^w; v))}{\sum_{J'} \exp(s(h^w_{J'}, Y^w; v))}.   (5)
In this variant, the scoring function s(h^w_J, Y^w; v) has to predict weights that lead to a robust average of hypotheses (i.e. model parameters). This means that model parameters corrupted by outliers should receive sufficiently small weights, such that they do not affect the accuracy of h^{w,v}_{SoftAM}.
Substituting h^{w,v}_{SoftAM} for h^{w,v}_{AM} in Eq. 3 allows us to calculate gradients to learn parameters w and v. We refer the reader to the supplementary materials for details.
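As an illustration, the following sketch computes the soft argmax selection of Eqs. 4 and 5 for a pool of hypotheses, assuming each hypothesis is given as a flat parameter vector; averaging pose parameters component-wise is a simplification, since rotations require more care in practice.

```python
import numpy as np

def soft_argmax_hypothesis(hypotheses, scores):
    """Soft argmax selection (Eqs. 4-5): a softmax-weighted average of hypotheses.

    hypotheses: (N, d) array of model parameter vectors h_J.
    scores:     (N,) array of scores s(h_J, Y; v).
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()                  # P(J | v, w), Eq. 5
    return (weights[:, None] * np.asarray(hypotheses)).sum(axis=0)  # Eq. 4
```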
By utilizing the soft argmax operator, we diverge from the RANSAC principle of making one hard decision for a hypothesis. Soft argmax hypothesis selection bears similarity to an independent strain within the field of robust optimization, namely robust averaging; see e.g. the work of Hartley et al. [15]. While we explore soft argmax selection in the experimental evaluation, we introduce an alternative in the next section that preserves the hard hypothesis selection and is empirically superior for our task.
2.2.2 Probabilistic Selection (DSAC)

We substitute the deterministic selection of the highest scoring model hypothesis in Eq. 2 with a probabilistic selection, i.e. we choose a hypothesis probabilistically according to

    h^{w,v}_{DSAC} = h^w_J, \quad \text{with } J \sim P(J|v,w),   (6)

where P(J|v,w) is the softmax distribution of the scores predicted by s(h^w_J, Y^w; v) (see Eq. 5).
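The selection of Eq. 6 thus remains a hard choice of a single hypothesis; only the choice itself is random. A minimal sketch with illustrative function names:

```python
import numpy as np

def dsac_select(hypotheses, scores, rng=None):
    """Probabilistic hypothesis selection of Eq. 6: sample J ~ P(J | v, w)."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.asarray(scores, dtype=float)
    probs = np.exp(s - s.max())
    probs /= probs.sum()                       # softmax of the scores, Eq. 5
    J = rng.choice(len(hypotheses), p=probs)   # hard but probabilistic choice
    return hypotheses[J]
```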
² We observed in early experiments that the training loss immediately increases without recovering.

The inspiration for this approach comes from policy gradient approaches in reinforcement learning that involve the minimization of a loss function defined over a stochastic process [34]. Similarly, we are able to learn parameters w and v that minimize the expected loss of the stochastic process defined in Eq. 6:

    \tilde{w}, \tilde{v} = \operatorname*{argmin}_{w,v} \sum_{I \in \mathcal{I}} \mathbb{E}_{J \sim P(J|v,w)}\left[\ell(R(h^w_J, Y^w))\right].   (7)

As shown in [34], we can calculate the derivative w.r.t. parameters w as follows (and similarly for parameters v):

    \frac{\partial}{\partial w} \mathbb{E}_{J \sim P(J|v,w)}\left[\ell(\cdot)\right] = \mathbb{E}_{J \sim P(J|v,w)}\left[\ell(\cdot) \frac{\partial}{\partial w} \log P(J|v,w) + \frac{\partial}{\partial w} \ell(\cdot)\right],   (8)

i.e. the derivative of the expectation is an expectation over derivatives of the loss and of the log probabilities of model hypotheses. We include further steps of the derivation of Eq. 8 in the supplementary materials.
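Since hypotheses are drawn from a finite pool (Sec. 3 samples 256 of them), the expectation in Eq. 7 can also be computed exactly by enumerating the pool, and differentiating that sum with automatic differentiation yields the same gradient as Eq. 8. A minimal PyTorch-style sketch, assuming scores and losses have been computed upstream by the two networks:

```python
import torch

def dsac_expected_loss(scores, losses):
    """Expected task loss of Eq. 7 over an enumerated hypothesis pool.

    scores: (N,) tensor of s(h_J, Y; v), differentiable w.r.t. v (and w).
    losses: (N,) tensor of pose losses of the refined hypotheses,
            differentiable w.r.t. w.
    Since d/dw sum_J P_J * l_J = E_J[l_J * dlogP_J/dw + dl_J/dw],
    backpropagating through this sum reproduces the gradient of Eq. 8.
    """
    probs = torch.softmax(scores, dim=0)   # P(J | v, w), Eq. 5
    return (probs * losses).sum()          # E_{J ~ P(J|v,w)}[loss], Eq. 7

# Usage (sketch): dsac_expected_loss(scores, losses).backward()
```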
We call this method of differentiating RANSAC, which preserves hard hypothesis selection, DSAC: Differentiable SAmple Consensus. See Fig. 1 for a schematic view of DSAC in comparison to the RANSAC variants introduced at the beginning of this section. While learning parameters with vanilla RANSAC is not possible, as mentioned before, both new variants (SoftAM and DSAC) are sensible options, which we evaluate in the experimental section.
3. Differentiable Camera Localization
We demonstrate the principles for differentiating
RANSAC for the task of one-shot camera localization from
an RGB image. Our pipeline is inspired by the state-of-the-
art pipeline of Brachmann et al. [
5], which is an extension
of the original SCoRF pipeline [
36] from RGB-D to RGB
images. Brachmann et al. use an auto-context random for-
est to predict multi-modal scene coordinate distributions per
image patch. After that, minimal sets of four scene coordi-
nates are randomly sampled and the PNP algorithm [
12] is
applied to create a pool of camera pose hypotheses. A pre-
emptive RANSAC schema iteratively refines, re-scores and
rejects hypotheses until only one remains. The preemptive
RANSAC scores hypotheses by counting inlier scene coordinates, i.e. scene coordinates y_i for which the reprojection error e_i < τ. In a last step, the final, remaining hypothesis is further optimized using the uncertainty of the scene coordinate distributions.
Our pipeline differs from Brachmann et al. [5] in the following aspects:
- Instead of a random forest, we use a CNN (called 'Coordinate CNN' below) to predict scene coordinates. For each 42x42 pixel image patch, it predicts a scene coordinate point estimate. We use a VGG-style architecture with 13 layers and 33M parameters. To reduce test time, we process only 40x40 patches per image.
- We score hypotheses using a second CNN (called 'Score CNN' below). We took inspiration from the work of Krull et al. [23] for the task of object pose estimation. Instead of learning a CNN to compare rendered and observed images as in [23], our Score CNN predicts hypothesis consensus based on reprojection errors. For each of the 40x40 scene coordinate predictions y_i we calculate the reprojection error e_i for hypothesis h_J (see Eq. 1). This results in a 40x40 reprojection-error image, which we feed into the Score CNN, a VGG-style architecture with 13 layers and 6M parameters (see the sketch after this list).
- Instead of the preemptive RANSAC schema, we score hypotheses only once and select the final pose, either by applying the soft argmax operator (SoftAM) or by probabilistic selection according to the softmaxed scores (DSAC).
- Only the final pose is refined. We choose inlier object coordinate predictions (at most 100), i.e. scene coordinates y_i with reprojection error e_i < τ, and solve PNP [24] again using this set. This is iterated multiple times. Since the Coordinate CNN predicts only point estimates, we do no further pose optimization using uncertainty.
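The following sketch illustrates how such a reprojection-error image could be assembled from the per-patch predictions. The function and argument names are illustrative, and it assumes the hypothesis pose (R, t) maps scene coordinates into the camera frame of a pinhole camera with intrinsics K.

```python
import numpy as np

def reprojection_error_image(scene_coords, pixel_coords, R, t, K):
    """Sketch: build the 40x40 reprojection-error image fed to the Score CNN.

    scene_coords: (40, 40, 3) scene coordinate predictions y_i.
    pixel_coords: (40, 40, 2) 2D patch centers p_i.
    R, t:         rotation matrix and translation of hypothesis pose h_J
                  (assumed to map scene coordinates into the camera frame).
    K:            3x3 intrinsic camera matrix.
    """
    pts = scene_coords.reshape(-1, 3) @ R.T + t       # transform into camera frame
    proj = pts @ K.T                                   # apply intrinsics
    proj = proj[:, :2] / proj[:, 2:3]                  # perspective division
    err = np.linalg.norm(proj - pixel_coords.reshape(-1, 2), axis=1)  # e_i of Eq. 1
    return err.reshape(40, 40)
```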
See Fig. 2 for an overview of our pipeline. Where appli-
cable we use the parameter values reported by Brachmann
et al. in [
5], e.g. sampling 256 hypotheses, using 8 refine-
ment steps and an inlier threshold of τ = 10px.
4. Experiments
For comparability to other methods, we show results on
the widely used 7-Scenes dataset [
36]. The dataset consists
of RGB-D images of 7 indoor environments where each
frame is annotated with its 6D camera pose. A 3D model of
each scene is also available. The data of each scene is com-
prised of multiple sequences (= independent camera paths)
which are assigned either to test or training. The number of images per scene ranges from 1k to 7k for training and test, respectively. We omit the depth channels and estimate poses using
RGB images only. See the supplementary materials for a
discussion of the difficulty of the 7-Scenes dataset.
We measure accuracy by the percentage of images for which the pose error is below 5° and 5cm. For training, we use the following differentiable loss, which is closely correlated with the task loss:

    \ell_{pose}(h, h^*) = \max(\angle(\theta, \theta^*), \|t - t^*\|),   (9)

where h = (θ, t), θ denotes the axis-angle representation of the camera rotation, and t is the camera translation.
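For illustration, a small sketch of this loss, assuming axis-angle rotation vectors and translations given in units that make the two terms comparable (e.g. degrees and centimeters, matching the 5°/5cm criterion); the rotational term measures the angle of the relative rotation between estimate and ground truth.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_loss(theta, t, theta_gt, t_gt):
    """Sketch of the pose loss of Eq. 9 for axis-angle rotations theta."""
    # Angle of the relative rotation between estimate and ground truth.
    r_rel = Rotation.from_rotvec(theta_gt).inv() * Rotation.from_rotvec(theta)
    rot_err = np.degrees(np.linalg.norm(r_rel.as_rotvec()))
    # Translational error, assumed to be in cm so both terms are comparable.
    trans_err = np.linalg.norm(np.asarray(t) - np.asarray(t_gt))
    return max(rot_err, trans_err)
```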
