What are the main weaknesses of incremental approaches?

Incremental approaches are known to suffer from drift due to the accumulation of errors and to the difficulty of handling cycle closures of the camera trajectory.

How does the method work at city scale?

The authors believe that their method could work at city scale even on a standard computer, provided there is enough RAM for the final bundle adjustments, which are optional.

How many cameras are placed on a circle at distance 5?

A set of fifty 3D points is randomly generated in a [−1, 1]^3 cube and 3 cameras are placed on a circle at distance 5, at angles 0°, α and 2α respectively (see Figure 2, left).
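The synthetic setup described in this answer can be sketched in a few lines. This is only an illustration: the function name `make_synthetic_scene`, the fixed seed, and the choice of placing the circle in the x-z plane around the cube are assumptions, since the text does not specify them.

```python
import math
import random

def make_synthetic_scene(alpha_deg, n_points=50, radius=5.0, seed=0):
    """Return (points, camera_centers) for the three-view synthetic setup."""
    rng = random.Random(seed)
    # Fifty 3D points drawn uniformly in the cube [-1, 1]^3.
    points = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
              for _ in range(n_points)]
    # Three camera centers on a circle of the given radius,
    # at angles 0, alpha and 2*alpha.
    # Assumption: the circle lies in the x-z plane, centered on the cube.
    centers = []
    for k in range(3):
        a = math.radians(k * alpha_deg)
        centers.append((radius * math.sin(a), 0.0, radius * math.cos(a)))
    return points, centers

points, centers = make_synthetic_scene(alpha_deg=10.0)
```

A small `alpha_deg` gives a small baseline between consecutive cameras, which is the regime the experiment is meant to probe.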

What is the way to estimate a trifocal tensor?

Given estimated global rotations Ri, as computed in Section 2, the authors estimate a “reduced” trifocal tensor using an adaptive RANSAC procedure to be robust to outlier correspondences.

(Open Access) Global Fusion of Relative Motions for Robust, Accurate and Scalable Structure from Motion (2013) | Pierre Moulon

Q: What have the authors contributed in "Global fusion of relative motions for robust, accurate and scalable structure from motion" ?

The authors propose a new global calibration approach based on the fusion of relative motions between image pairs. The authors improve an existing method for robustly computing global rotations. The authors present an efficient a contrario trifocal tensor estimation method, from which stable and precise translation directions can be extracted.

Q: What is the approach to calculating the tensor?

Their approach consists in computing the tensor using a small-size linear program as minimal solver with four tracked points across the three views, in conjunction with the AC-RANSAC framework [21] to be robust to noise and outliers.

Q: What is the common method for a sequential SfM pipeline?

Sequential SfM pipelines start from a minimal reconstruction based on two or three views, then incrementally add new views into a merged representation.

Q: What is the method for estimating translations?

It relies on two linear programs, the first one identifying outliers, and the second one solving translations and 3D structure on the selected inliers.

Q: How can the authors calculate the relative rotations of a range scan?

This rotation averaging task can be performed by distributing the error along all cycles in a cycle basis, as done by Sharp et al. [28] for the alignment of range scans.

Q: How did the authors adjust the cycle error probability?

The authors adapted the cycle error probability using the results of Enqvist et al. [7], weighting errors by a factor 1/√l where l is the length of the cycle.

HAL Id: hal-00873504

https://hal-enpc.archives-ouvertes.fr/hal-00873504

Submitted on 15 Oct 2013


Global Fusion of Relative Motions for Robust, Accurate and Scalable Structure from Motion

Pierre Moulon, Pascal Monasse, Renaud Marlet

To cite this version:

Pierre Moulon, Pascal Monasse, Renaud Marlet. Global Fusion of Relative Motions for Robust, Accurate and Scalable Structure from Motion. ICCV, Dec 2013, Sydney, Australia. ⟨hal-00873504⟩


Pierre Moulon (1,2), Pascal Monasse (1), Renaud Marlet (1)

(1) Université Paris-Est, LIGM (UMR CNRS), ENPC, F-77455 Marne-la-Vallée.

(2) Mikros Image.

firstname.lastname@enpc.fr

Abstract

Multi-view structure from motion (SfM) estimates the position and orientation of pictures in a common 3D coordinate frame. When views are treated incrementally, this external calibration can be subject to drift, contrary to global methods that distribute residual errors evenly. We propose a new global calibration approach based on the fusion of relative motions between image pairs. We improve an existing method for robustly computing global rotations. We present an efficient a contrario trifocal tensor estimation method, from which stable and precise translation directions can be extracted. We also define an efficient translation registration method that recovers accurate camera positions. These components are combined into an original SfM pipeline. Our experiments show that, on most datasets, it outperforms in accuracy other existing incremental and global pipelines. It also achieves strikingly good running times: it is about 20 times faster than the other global method we could compare to, and as fast as the best incremental method. More importantly, it features better scalability properties.

1. Introduction

Photogrammetry, SLAM (simultaneous localization and mapping) and SfM (structure from motion) reconstruct a model of a scene given a set of pictures. They compute both a 3D point cloud (the structure) and camera poses, i.e., positions and orientations (the calibration). Methods for that can be divided into two classes: sequential and global.

Sequential SfM pipelines start from a minimal reconstruction based on two or three views, then incrementally add new views into a merged representation. The most widely used incremental pipeline is Bundler [31]. It performs multiple bundle adjustments (BA) to rigidify the local structure and motion. As a result, it is a rather slow procedure. Yet, some parts of the problem can be solved more efficiently. Image matching can be made more scalable, e.g., thanks to vocabulary tree techniques [24]. Bundle adjustment can be optimized with sparse matrices [1] or using GPU [36]. The number of variables can also be reduced by eliminating structure from the bundle adjustment [26]. Finally some approaches use a divide-and-conquer approach on the epipolar graph to reduce computations [32, 9, 30, 22].

Figure 1. Dense mesh obtained with our global SfM pipeline on the monument datasets (top: 160 images, bottom: 100 images).

However, incremental approaches are known to suffer from drift due to the accumulation of errors and to the difficulty of handling cycle closures of the camera trajectory. An additional weakness is that the quality of the reconstruction depends heavily on the choice of the initial image pair and on the order of subsequent image additions.

Most global pipelines solve the SfM optimization problem in two steps. The first step computes the global rotation of each view and the second step computes the camera translations, together with the structure or not. The interest of separating the two steps is that the relative two-view rotations can be estimated quite precisely even for small baselines, which is not true of relative translations. These approaches take into account the whole epipolar graph, whose nodes represent the views and where edges link views having enough consistent matching points. All cycles of the graph yield multi-view constraints, in the sense that the local relative motions in successive nodes of the cycle should compose into the identity when closing the cycle. Enforcing these constraints greatly reduces the risk of drift present in incremental methods. Moreover errors can be evenly distributed over the whole graph, contrary to incremental approaches. But such global approaches suffer from the fact that some two-view geometries, even when they have a large support of point correspondences, may fail to reflect the underlying global geometry, mainly because of mismatches, e.g., due to repetitive structures that create outliers. Additionally, as the minimization is based on the structure and the reprojection errors, the space and time requirements can get very large, even for limited-size datasets of images.

In this paper we present a new robust global SfM method for unordered image sets. The problem complexity is kept low using relative motions that can be merged very fast. We first solve the structure problem at a local scale (2 and 3 views), then merge the resulting relative motions into a common global coordinate frame. We assess the efficiency and precision of our reconstruction pipeline on scenes with ground truth calibration and on challenging datasets with false epipolar geometries. Compared to other approaches, we achieve better or similar precision with significantly shorter running times and better scalability. Figure 1 illustrates meshing [34] after calibrating with our pipeline.

1.1. Related work

Estimating global rotations. Given the relative rotations $R_{ij}$ between views i and j extracted from the essential matrices, computing the global rotation of each view $R_i$ consists in solving the system $R_j = R_{ij} R_i$ for all i, j. This topic is covered by Hartley et al. [14].

This rotation averaging task can be performed by distributing the error along all cycles in a cycle basis, as done by Sharp et al. [28] for the alignment of range scans. An approximate solution using least-squares minimization for multi-view registration is proposed by Govindu [10], reused by Martinec et al. [18], and extended with semi-definite programming [3]. Alternatively, the averaging can be performed in the SO(3) Lie group [11, 14]. Crandall et al. [5] use a cycle belief propagation, but they rely on known orientations, which makes it unsuitable in the general case.

Cycle consistency. As relative $R_{ij}$ estimates may contain outliers, rotation averaging has to be robust. Given the camera epipolar graph, the actual task is to identify both the global rotations and the inconsistent/outlier edges (false essential geometry). Two classes of methods stand out, based on spanning trees or cycles. The spanning tree approaches [12, 25] are based on the classic robust estimator scheme, RANSAC. Random spanning trees are sampled, and global putative rotations are computed by composing relative rotations while walking a spanning tree. The remaining edges, which create cycles, are evaluated based on the rotation angle of $R_j^\top R_{ij} R_i$, measuring the discrepancy between the relative motion and the global motion. The solution with the largest cardinal is kept. Angle thresholds of 0.25° [12] or 1° [25] have been used.

Enqvist et al. [7] perform cycle removal based on deviation from identity. For this, the graph edges are weighted with the numbers of inlier correspondences and a maximum spanning tree (MST) is extracted. Cycles formed by the remaining edges are considered. A cycle is kept if the deviation from identity over the cycle, normalized by a factor $1/\sqrt{l}$ where l is the cycle length, is small enough. The method is highly dependent on the chosen MST; if this tree is erroneous, estimated rotations are wrong.

Zach et al. [37] use Bayesian inference to detect incorrect relative rotations using cycle errors. A limit is set on the number of sampled trees and cycles to keep the problem tractable. The maximal cycle length is set to 6, also to avoid taking into account uncertainties w.r.t. cycle length.

Once global camera rotations $R_i$ are estimated, global translations $T_i$ can be computed. There are two main approaches, finding translations alone or with the structure.

Estimating translations alone. Govindu [10] proposes a method for recovering the unknown translations $T_i$ from the heading vectors $t_{ij}$, extracted from the estimated essential matrices. He solves a least square problem with linear equations in the unknowns $T_i$ and relative unknown scale factors $\lambda_{ij}$: $\lambda_{ij} t_{ij} = T_j - T_i$. Using random sampling, he tries to find the valid set of edges that best represents the global motion [12].

Sim et al. [29] propose a solution based on the heading vector extracted from the trifocal tensor that minimizes the angular error between the heading vector and the global camera position. The advantage of such a method is that it uses a compact formulation (3×number of camera variables), but it is highly dependent on the quality of the initial translation estimates. Arie-Nachimson et al. [3] use a least square minimization of the epipolar equation to find the unknown translations. The obvious drawback is the assumption that there is no outlier correspondence as all corresponding point pairs are used. Moreover, Rodríguez et al. [26] show that this method can handle neither collinear series of views nor shared optical centers.

Estimating both translations and 3D points. The joint estimation of translations and 3D points can be formulated using second-order cone programming expressing the problem with the $l_\infty$ norm, as proposed by Hartley and Shaffalitzky [15], and later generalized [16]. Such methods rely on upper constraints on the residual error of feature points and rapidly involve a large number of unknowns. They are computationally and memory expensive. The solution is globally optimal thanks to multiple convex optimizations, using bisections of a quasi-convex problem.
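The bisection scheme mentioned above can be sketched as follows. This is a generic illustration, not the solver of [15, 16]: `bisect_gamma` and the toy `is_feasible` oracle are hypothetical names, and in a real pipeline the oracle would solve a convex feasibility problem for the given error bound.

```python
def bisect_gamma(is_feasible, lo=0.0, hi=1.0, tol=1e-6):
    """Smallest gamma in [lo, hi] (up to tol) such that is_feasible(gamma).

    Assumes monotonicity: if a bound gamma is feasible, so is any larger one.
    """
    if not is_feasible(hi):
        raise ValueError("upper bound is not feasible")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_feasible(mid):
            hi = mid  # feasible: the optimum is at or below mid
        else:
            lo = mid  # infeasible: the optimum is above mid
    return hi

# Toy stand-in for the convex feasibility test: feasible iff gamma >= 0.3.
gamma_star = bisect_gamma(lambda g: g >= 0.3)
```

Each iteration costs one feasibility solve, so the number of convex problems grows only logarithmically with the requested precision.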

Dalalyan et al. [6] deal with outliers with a formulation using $l_1$ constraints instead of $l_2$ cones. It relies on two linear programs, the first one identifying outliers, and the second one solving translations and 3D structure on the selected inliers. It avoids the use of the non-negative slack variables in the single step procedure used by Olsson et al. [25] as adding one slack variable per measurement rapidly increases the problem size with the number of images.

Those $l_\infty$ problems can be solved faster. Seo et al. [27] find a global solution by using a growing feasible subset while all the residual errors of the measurements are under the precision of the subset. This approach is faster because only a subpart of the data is fed to the $l_\infty$ minimization. However, it is not robust to outliers. Agarwal et al. [2] test different bisection schemes and show that the Gugat algorithm [13] converges faster to the global solution. Zach et al. [38] use a proximal method to speed up the minimization of such convex problems.

Other approaches. Martinec et al. [18] use their global pipeline many times to iteratively discard two-view geometries with largest residuals. To keep good running time, they compute the translation and structure just on a few point pairs: each epipolar geometry is represented by 4 points only. Courchay et al. [4] use a linear parametrization of a tree of trifocal tensors over the epipolar graph to solve the camera position. The method is restricted to a single cycle.

1.2. Our global method for global calibration

Our input is an unordered set of pictures $\{I_1, \dots, I_n\}$. The internal calibration parameters $K_i$ are assumed known for each camera: our goal is to robustly recover the global pose of each camera (absolute motion rotation $R_i$ and translation $T_i$) from relative camera motions (rotation $R_{ij}$ and heading translation vector $t_{ij}$) between images $I_i$ and $I_j$.

Our contributions are the following:

1. We show that an iterative use of the Bayesian inference of Zach et al. [37], adjusted with the cycle length weighting of Enqvist et al. [7], can remove most outlier edges in the graph, allowing a more robust estimation of absolute rotations $R_i$ (Section 2).

2. We present a new trifocal tensor estimation method based on the $l_\infty$ norm, resulting in a linear program, which, used as minimal solver in an adaptive RANSAC algorithm, is efficient and yields stable relative translation directions $t_{ij}$ (Section 3).

3. We propose a new translation registration method, that estimates the relative translation scales $\lambda_{ij}$ and absolute translations $T_i$, based on the $l_\infty$ norm, resulting also in an efficient linear program (Section 4).

4. We put together these ingredients into an SfM pipeline (Section 5) that first cleans up an epipolar graph from outliers, then computes the global motions from the relative ones. Our experiments show its robustness, accuracy and scalability (Section 6).

2. Robust estimation of global rotations

For matching points $X$ and $X'$ in images $I_i$ and $I_j$ respectively, the two-view epipolar constraint can be written

$$(K_i^{-1} X)^\top E_{ij} (K_j^{-1} X') = 0. \qquad (1)$$

The five-point algorithm of Nistér [23] inserted as minimal solver in a RANSAC procedure robustly estimates the essential matrices $E_{ij} = [t_{ij}]_\times R_{ij}$, from which $R_{ij}$ can be extracted, together with the direction $t_{ij}$, since the scale is arbitrary. Four different motions $(R_{ij}, t_{ij})$ actually have to be tested; the one yielding the largest count of points satisfying the cheirality constraint (positive depth of the 3D point) is retained. It is important to note that the rotation accuracy is nearly insensitive to the baseline [7], contrary to the translation direction. Besides, although the camera rotations between connected views can be chained, the relative translations cannot since they are available up to a differing unknown scale factor $\lambda_{ij}$.

We identify inconsistent relative rotations in the graph using the edge disambiguation of Zach et al. [37]. As preliminary experiments showed that a number of outlier rotations could pass Zach et al.’s test, we made two improvements. First, we adapted the cycle error probability using the results of Enqvist et al. [7], weighting errors by a factor $1/\sqrt{l}$ where l is the length of the cycle. Second, we iterate Zach et al.’s algorithm until no more edges are removed by the Bayesian inference procedure. Finally, we check all the triplets of the graph and reject the ones with cycle deviation to identity larger than 2°. Experiments in Table 1 show that half of the outliers can remain after the first Bayesian inference, which motivates our iterated elimination.
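The final triplet check can be illustrated with a short sketch. It assumes the composition convention $R_j = R_{ij} R_i$, so that chaining the three relative rotations of a consistent triplet gives the identity. The helper names (`matmul`, `rot_z`, `triplet_deviation_deg`, `keep_triplet`) are hypothetical, and the Bayesian inference step itself is not reproduced here.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rot_z(deg):
    """Rotation about the z axis, used here only to build test data."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_angle_deg(r):
    """Angle of a rotation matrix, from its trace (clamped for safety)."""
    c = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def triplet_deviation_deg(r_ij, r_jk, r_ki):
    """Angle of R_ki R_jk R_ij, which is 0 for a consistent triplet."""
    return rotation_angle_deg(matmul(r_ki, matmul(r_jk, r_ij)))

def keep_triplet(r_ij, r_jk, r_ki, max_deg=2.0):
    """Apply the 2-degree threshold on the deviation from identity."""
    return triplet_deviation_deg(r_ij, r_jk, r_ki) <= max_deg

# A consistent triplet (30 + 40 - 70 = 0 degrees) and one with a 5-degree error.
good = keep_triplet(rot_z(30.0), rot_z(40.0), rot_z(-70.0))
bad_deviation = triplet_deviation_deg(rot_z(30.0), rot_z(40.0), rot_z(-65.0))
bad = keep_triplet(rot_z(30.0), rot_z(40.0), rot_z(-65.0))
```

In this toy data all rotations share an axis, so angles simply add; with general rotations the same functions apply unchanged.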

Global rotations are computed as done by Martinec et al. [18], with a least-square minimization that tries to satisfy equations $R_j = R_{ij} R_i$, followed by the computation of the nearest rotation to cover the lack of orthogonality constraint during minimization.
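The nearest-rotation step can be made explicit. The least-squares solve treats each global rotation as an unconstrained 3×3 matrix, written $\hat{M}_i$ here (a notation introduced only for this note), so the result is generally not orthogonal. A standard projection onto the rotation group in the Frobenius norm, given as a sketch and not necessarily the exact procedure of [18], uses the singular value decomposition:

```latex
\hat{M}_i = U \Sigma V^\top,
\qquad
R_i = U \,\mathrm{diag}\bigl(1,\ 1,\ \det(U V^\top)\bigr)\, V^\top .
```

The factor $\det(U V^\top) = \pm 1$ ensures that $\det R_i = +1$, so the projection returns a proper rotation and never a reflection.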

Dataset \ #Iterations       1    2    3    2° check
Orangerie (Fig. 5)          8    4    1    9
Opera (Fig. 1 top)          7    3    -    125
Pantheon (Fig. 1 bottom)    9    2    -    7

Table 1. Number of edges rejected by Bayesian inference iteration.

3. Relative translations from trifocal tensors

To improve robustness and accuracy when computing the relative motion between cameras, we consider triplets of views instead of pairs as usual. We show in Section 3.2 that this yields a precision jump of an order of magnitude in the estimated translations.

Footnote: More extensive experiments are provided as supplementary material.

3.1. Robust trifocal tensor with known rotations

Given estimated global rotations $R_i$, as computed in Section 2, we estimate a “reduced” trifocal tensor using an adaptive RANSAC procedure to be robust to outlier correspondences. Rather than minimizing an algebraic error having a closed form solution as Sim et al. [29], we minimize the $l_\infty$ reprojection error of 3D points $X_j$ compared to the observed points $\{(x_j^i, y_j^i)\}_{i \in \{1,2,3\}}$ in the three images:

$$\rho(t_i, X_j) = \left\| \left( x_j^i - \frac{R_i^1 X_j + t_i^1}{R_i^3 X_j + t_i^3},\ \ y_j^i - \frac{R_i^2 X_j + t_i^2}{R_i^3 X_j + t_i^3} \right) \right\|_\infty \qquad (2)$$

where $t_i$ is the translation of view i and $t_i^m$ its components (likewise, $R_i^m$ denotes the m-th row of $R_i$).

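The tensor is found by the feasibility of this linear program:

$$\begin{aligned}
\underset{\{t_i\}_i,\ \{X_j\}_j,\ \gamma}{\text{minimize}} \quad & \gamma \\
\text{subject to} \quad & \rho(t_i, X_j) \le \gamma, \quad \forall i, j \\
& R_i^3 X_j + t_i^3 \ge 1, \quad \forall i, j \\
& t_1 = (0, 0, 0).
\end{aligned} \qquad (3)$$

The second constraint ensures that all 3D points are in front of the cameras and the third one defines an origin for the local coordinate system of the triplet of views.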
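It may help to spell out why the reprojection constraint is linear once the rotations are known. The short derivation below is a sketch under assumed notation: $R_i^m$ is the m-th row of $R_i$, $t_i^m$ the m-th component of $t_i$, and $d_{ij} = R_i^3 X_j + t_i^3$ the depth of point j in view i, which the cheirality constraint keeps positive. For a fixed bound $\gamma$, multiplying through by $d_{ij} > 0$ gives:

```latex
\left| x_j^i - \frac{R_i^1 X_j + t_i^1}{d_{ij}} \right| \le \gamma
\iff
-\gamma\, d_{ij} \;\le\; x_j^i\, d_{ij} - \bigl(R_i^1 X_j + t_i^1\bigr) \;\le\; \gamma\, d_{ij},
\qquad d_{ij} = R_i^3 X_j + t_i^3 > 0 .
```

The same holds for the $y$ coordinate with $R_i^2$ and $t_i^2$. Both inequalities are linear in the unknowns $(t_i, X_j)$ because $R_i$ is known. When $\gamma$ is itself unknown, the product $\gamma\, d_{ij}$ is bilinear, so the smallest feasible $\gamma$ is searched over a sequence of linear feasibility problems, for instance by bisection.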

In general, using a linear program can lead to two issues. First, as the number of variables increases, the solving time grows polynomially [27]. Second, robustness to outliers is typically achieved with slack variables [25], which makes the problem even bigger.

Our approach consists in computing the tensor using a small-size linear program as minimal solver with four tracked points across the three views, in conjunction with the AC-RANSAC framework [21] to be robust to noise and outliers. This variant of RANSAC relies on a contrario (AC) methodology to compute an adaptive threshold for inlier/outlier discrimination: a configuration is considered meaningful if its observation in a random setting is unexpected. While global $l_\infty$ minimization aims at finding a solution with the lowest γ value, found by bisection, AC-RANSAC determines the number of false alarms (NFA):

$$\mathrm{NFA}(M, k) = (n - 4) \binom{n}{k} \binom{k}{4}\, e_k(M)^{k-4} \qquad (4)$$

where M is a tested trifocal tensor obtained by the minimal solver using four random correspondences, γ = 0.5 pixel, n is the number of corresponding points in the triplet, and where $e_k = \epsilon_k / \max(w, h)$ depends on the k-th error:

$$\epsilon_k = k\text{-th smallest element of } \bigl\{ \max_i \rho(t_i(M), X_j) \bigr\}_j. \qquad (5)$$

              Running time (s)          Angle accuracy (°)
#3D Points    Slack variables   AC      Slack variables   AC
200           1.37              0.09    0.07              0.03
400           4.06              0.11    0.06              0.03
600           7.94              0.13    0.04              0.02
800           13.1              0.15    0.03              0.02
1000          19.6              0.16    0.03              0.02

Table 2. Required time and accuracy (average angle of translation directions with ground truth) in robust estimation of trifocal tensor with the global formulation using slack variables [25] and our a contrario method (linear program combined with AC-RANSAC).

In these formulas, w and h are the dimensions of the images and $e_k$ is the probability of a point having reprojection error at most $\epsilon_k$. $X_j$ is obtained by least-square triangulation of the corresponding points $\{(x_j^i, y_j^i)\}_{i \in \{1,2,3\}}$. k represents a hypothesized number of inliers. In (4), $e_k(M)^{k-4}$ is therefore the probability that the k − 4 minimal reprojection errors of uniformly distributed independent corresponding points in the three images (our background model) have error at most $\epsilon_k$, playing the role of the optimal γ of (3) for the inliers. The other terms in (4) define the number of subsets of k inliers among the n − 4 remaining points. Thus NFA(M, k) is the expectation of random correspondences having maximum error $\gamma = \epsilon_k$. The trifocal tensor M is deemed meaningful (unlikely to occur by chance) if:

$$\mathrm{NFA}(M) = \min_{5 \le k \le n} \mathrm{NFA}(M, k) \le 1. \qquad (6)$$

In practice, we draw at most N = 300 random samples of 4 correspondences and evaluate the NFA of the associated models. As in Moisan et al. [19], as soon as a meaningful model M is found, we stop and refine it by resampling N/10 times among the inliers of M. If no sample satisfies (6), we discard the triplet. Finally, we refine the translations and the k inlier 3D points by bundle adjustment.
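The NFA test of (4) to (6) can be sketched for one candidate model as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the per-point errors are taken as given (in the pipeline they would come from the max reprojection error over the three views), and the computation is done in log10 to avoid overflow in the binomial terms.

```python
import math

def log10_nfa(n, k, eps_k, w, h, sample_size=4):
    """log10 of (n - 4) * C(n, k) * C(k, 4) * e_k^(k - 4), e_k = eps_k / max(w, h)."""
    e_k = max(eps_k, 1e-12) / max(w, h)  # guard against log10(0)
    return (math.log10(n - sample_size)
            + math.log10(math.comb(n, k))
            + math.log10(math.comb(k, sample_size))
            + (k - sample_size) * math.log10(e_k))

def best_nfa(errors, w, h, sample_size=4):
    """Minimize the NFA over k in [5, n]; returns (log10 NFA, k)."""
    ranked = sorted(errors)
    n = len(ranked)
    candidates = [(log10_nfa(n, k, ranked[k - 1], w, h, sample_size), k)
                  for k in range(sample_size + 1, n + 1)]
    return min(candidates)

# Toy data: 40 inliers with errors up to 0.5 pixel, 10 gross outliers.
errors = ([0.1 + 0.4 * i / 39 for i in range(40)]
          + [100.0 + 40.0 * i for i in range(10)])
log_nfa, k_best = best_nfa(errors, w=1000, h=1000)
is_meaningful = log_nfa <= 0.0  # NFA <= 1, as in (6)
```

The adaptive threshold is then the error of the `k_best`-th point: no inlier threshold has to be fixed in advance, which is the point of the a contrario formulation.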

Table 2 evaluates the computation time and accuracy of our robust a contrario trifocal estimation compared to the equivalent global estimation with slack variables [25] on synthetic configurations. A uniform 1-pixel noise is added to each perfect correspondence and 2% outliers are introduced. We evaluate the accuracy of the results (angular error between ground truth and computed translation) and the required time to find the solution. The global solution finds a solution that fits the noise of the data, but AC-RANSAC is able to go further and find a more precise solution.

3.2. Relative translation accuracy

Following experiments of Enqvist et al. [7] concerning two-view rotation precision, we demonstrate that using a trifocal tensor can lead to substantial improvement in the relative translation estimation. To assess the impact of small baseline, a simple synthetic experiment is performed. A set
