
Global Fusion of Relative Motions for Robust, Accurate and Scalable Structure from Motion

Abstract
Multi-view structure from motion (SfM) estimates the position and orientation of pictures in a common 3D coordinate frame. When views are treated incrementally, this external calibration can be subject to drift, contrary to global methods that distribute residual errors evenly. We propose a new global calibration approach based on the fusion of relative motions between image pairs. We improve an existing method for robustly computing global rotations. We present an efficient a contrario trifocal tensor estimation method, from which stable and precise translation directions can be extracted. We also define an efficient translation registration method that recovers accurate camera positions. These components are combined into an original SfM pipeline. Our experiments show that, on most datasets, it outperforms in accuracy other existing incremental and global pipelines. It also achieves strikingly good running times: it is about 20 times faster than the other global method we could compare to, and as fast as the best incremental method. More importantly, it features better scalability properties.


HAL Id: hal-00873504
https://hal-enpc.archives-ouvertes.fr/hal-00873504
Submitted on 15 Oct 2013
Global Fusion of Relative Motions for Robust, Accurate
and Scalable Structure from Motion
Pierre Moulon, Pascal Monasse, Renaud Marlet
To cite this version:
Pierre Moulon, Pascal Monasse, Renaud Marlet. Global Fusion of Relative Motions for Robust,
Accurate and Scalable Structure from Motion. ICCV, Dec 2013, Sydney, Australia. ⟨hal-00873504⟩

Pierre Moulon (1,2), Pascal Monasse (1), Renaud Marlet (1)
(1) Université Paris-Est, LIGM (UMR CNRS), ENPC, F-77455 Marne-la-Vallée.
(2) Mikros Image.
firstname.lastname@enpc.fr
1. Introduction
Photogrammetry, SLAM (simultaneous localization and mapping) and SfM (structure from motion) reconstruct a model of a scene given a set of pictures. They compute both a 3D point cloud (the structure) and camera poses, i.e., positions and orientations (the calibration). Methods for that can be divided into two classes: sequential and global.
Sequential SfM pipelines start from a minimal reconstruction based on two or three views, then incrementally add new views into a merged representation. The most widely used incremental pipeline is Bundler [31]. It performs multiple bundle adjustments (BA) to rigidify the local structure and motion. As a result, it is a rather slow procedure. Yet, some parts of the problem can be solved more efficiently. Image matching can be made more scalable, e.g., thanks to vocabulary tree techniques [24]. Bundle adjustment can be optimized with sparse matrices [1] or using GPU [36]. The number of variables can also be reduced by eliminating structure from the bundle adjustment [26]. Finally, some methods use a divide-and-conquer approach on the epipolar graph to reduce computations [32, 9, 30, 22].
Figure 1. Dense mesh obtained with our global SfM pipeline on the monument datasets (top: 160 images, bottom: 100 images).
However, incremental approaches are known to suffer from drift due to the accumulation of errors and to the difficulty of handling cycle closures of the camera trajectory. An additional weakness is that the quality of the reconstruction depends heavily on the choice of the initial image pair and on the order of subsequent image additions.
Most global pipelines solve the SfM optimization problem in two steps. The first step computes the global rotation of each view and the second step computes the camera translations, with or without the structure. The interest of separating the two steps is that the relative two-view rotations can be estimated quite precisely even for small baselines, which is not true of relative translations. These approaches take into account the whole epipolar graph, whose nodes represent the views and whose edges link views having enough consistent matching points. All cycles of the graph yield multi-view constraints, in the sense that the local relative motions in successive nodes of the cycle should compose into the identity when closing the cycle. Enforcing these constraints greatly reduces the risk of drift present in incremental methods. Moreover, errors can be evenly distributed over the whole graph, contrary to incremental approaches. But such global approaches suffer from the fact that some two-view geometries, even when they have a large support of point correspondences, may fail to reflect the underlying global geometry, mainly because of mismatches, e.g., due to repetitive structures that create outliers. Additionally, as the minimization is based on the structure and the reprojection errors, the space and time requirements can get very large, even for limited-size datasets of images.
In this paper we present a new robust global SfM method for unordered image sets. The problem complexity is kept low using relative motions that can be merged very fast. We first solve the structure problem at a local scale (2 and 3 views), then merge the resulting relative motions into a common global coordinate frame. We assess the efficiency and precision of our reconstruction pipeline on scenes with ground truth calibration and on challenging datasets with false epipolar geometries. Compared to other approaches, we achieve better or similar precision with significantly shorter running times and better scalability. Figure 1 illustrates meshing [34] after calibrating with our pipeline.
1.1. Related work
Estimating global rotations. Given the relative rotations $R_{ij}$ between views $i$ and $j$ extracted from the essential matrices, computing the global rotation of each view $R_i$ consists in solving the system $R_j = R_{ij} R_i$ for all $i, j$. This topic is covered by Hartley et al. [14].
This rotation averaging task can be performed by distributing the error along all cycles in a cycle basis, as done by Sharp et al. [28] for the alignment of range scans. An approximate solution using least-squares minimization for multi-view registration is proposed by Govindu [10], reused by Martinec et al. [18], and extended with semi-definite programming [3]. Alternatively, the averaging can be performed in the SO(3) Lie group [11, 14]. Crandall et al. [5] use a cycle belief propagation, but they rely on known orientations, which makes it unsuitable in the general case.
Cycle consistency. As relative $R_{ij}$ estimates may contain outliers, rotation averaging has to be robust. Given the camera epipolar graph, the actual task is to identify both the global rotations and the inconsistent/outlier edges (false essential geometry). Two classes of methods stand out, based on spanning trees or cycles. The spanning tree approaches [12, 25] are based on the classic robust estimator scheme, RANSAC. Random spanning trees are sampled, and global putative rotations are computed by composing relative rotations while walking a spanning tree. The remaining edges, which create cycles, are evaluated based on the rotation angle of $R_j^T R_{ij} R_i$, measuring the discrepancy between the relative motion and the global motion. The solution with the largest cardinality is kept. Angle thresholds of 0.25° [12] or 1° [25] have been used.
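The edge test described above can be sketched as follows. This is a minimal illustration assuming NumPy and the convention $R_j = R_{ij} R_i$ used in this paper; the function names (`rotation_angle_deg`, `edge_residual_deg`, `rot_z`) and the toy rotations are hypothetical and are not taken from [12] or [25].

```python
import numpy as np

def rotation_angle_deg(R):
    """Rotation angle of a 3x3 rotation matrix, in degrees."""
    cos_theta = (np.trace(R) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def edge_residual_deg(R_i, R_j, R_ij):
    """Angle of R_j^T R_ij R_i; it is zero when R_j = R_ij R_i holds exactly."""
    return rotation_angle_deg(R_j.T @ R_ij @ R_i)

def rot_z(deg):
    """Rotation about the z axis (toy data helper)."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Toy example: one consistent edge and one edge corrupted by 5 degrees.
R_i, R_j = rot_z(10.0), rot_z(40.0)
R_ij_good = R_j @ R_i.T
R_ij_bad = rot_z(5.0) @ R_ij_good
residual_good = edge_residual_deg(R_i, R_j, R_ij_good)
residual_bad = edge_residual_deg(R_i, R_j, R_ij_bad)
```

With a threshold such as 1°, the first edge would count as an inlier for the sampled spanning tree and the second would not.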
Enqvist et al. [7] perform cycle removal based on deviation from identity. For this, the graph edges are weighted with the numbers of inlier correspondences and a maximum spanning tree (MST) is extracted. Cycles formed by the remaining edges are considered. A cycle is kept if the deviation from identity over the cycle, normalized by a factor $1/\sqrt{l}$ where $l$ is the cycle length, is small enough. The method is highly dependent on the chosen MST; if this tree is erroneous, estimated rotations are wrong.
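The tree extraction step can be illustrated with a short Kruskal-style sketch in plain Python. The function name `maximum_spanning_tree`, the edge list and the inlier counts are made up for illustration; the cycle scoring of Enqvist et al. [7] itself is not reproduced here.

```python
def maximum_spanning_tree(n_views, weighted_edges):
    """Split edges (i, j, inlier_count) into tree edges and cycle-closing edges."""
    parent = list(range(n_views))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree, closing = [], []
    # Visit edges by decreasing inlier count so that the tree has maximum weight.
    for i, j, _count in sorted(weighted_edges, key=lambda e: -e[2]):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j))
        else:
            closing.append((i, j))  # this edge closes a cycle with the tree
    return tree, closing

# Hypothetical epipolar graph with 4 views.
edges = [(0, 1, 500), (1, 2, 300), (0, 2, 80), (2, 3, 250), (1, 3, 40)]
tree, closing = maximum_spanning_tree(4, edges)
```

Each edge in `closing` defines one cycle together with the tree path between its endpoints; these are the cycles whose deviation from identity would then be examined.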
Zach et al. [37] use Bayesian inference to detect incorrect relative rotations using cycle errors. A limit is set on the number of sampled trees and cycles to keep the problem tractable. The maximal cycle length is set to 6, also to avoid taking into account uncertainties w.r.t. cycle length.
Once global camera rotations $R_i$ are estimated, global translations $T_i$ can be computed. There are two main approaches, finding translations alone or with the structure.
Estimating translations alone. Govindu [10] proposes a method for recovering the unknown translations $T_i$ from the heading vectors $t_{ij}$, extracted from the estimated essential matrices. He solves a least-squares problem with linear equations in the unknowns $T_i$ and relative unknown scale factors $\lambda_{ij}$: $\lambda_{ij} t_{ij} = T_j - T_i$. Using random sampling, he tries to find the valid set of edges that best represents the global motion [12].
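One common way to handle the unknown scales $\lambda_{ij}$ in such a system is to eliminate them with a cross product, since $\lambda_{ij} t_{ij} = T_j - T_i$ implies $t_{ij} \times (T_j - T_i) = 0$. The sketch below, assuming NumPy, solves the resulting homogeneous system by SVD on noise-free toy data. It assumes the headings are already expressed in the global frame; the function names and the gauge choice ($T_0 = 0$) are illustrative assumptions, and this is neither Govindu's exact algorithm nor the method proposed in this paper.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x such that skew(v) @ w == np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def solve_translations(n_views, headings):
    """headings: {(i, j): unit vector t_ij}. Returns an (n_views, 3) array, up to scale."""
    A = np.zeros((3 * len(headings), 3 * n_views))
    for row, ((i, j), t) in enumerate(headings.items()):
        S = skew(np.asarray(t, dtype=float))
        A[3 * row:3 * row + 3, 3 * j:3 * j + 3] = S    # [t_ij]_x T_j
        A[3 * row:3 * row + 3, 3 * i:3 * i + 3] = -S   # -[t_ij]_x T_i
    # Gauge: fix T_0 = 0 by dropping its columns, then take the null vector.
    _, _, Vt = np.linalg.svd(A[:, 3:])
    T = np.vstack([np.zeros(3), Vt[-1].reshape(-1, 3)])
    # Resolve the sign so that T_j - T_i points along t_ij (positive scales).
    votes = sum(float(np.dot(t, T[j] - T[i])) for (i, j), t in headings.items())
    return T if votes > 0 else -T

# Toy data: four positions and exact headings on the complete graph.
T_true = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
headings = {}
for i in range(4):
    for j in range(i + 1, 4):
        d = T_true[j] - T_true[i]
        headings[(i, j)] = d / np.linalg.norm(d)
T_est = solve_translations(4, headings)
T_est = T_est / np.linalg.norm(T_est[1] - T_est[0])  # remove the global scale
```

With outlier headings such a least-squares solve degrades quickly, which is why robust sampling or robust norms are used in practice.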
Sim et al. [29] propose a solution based on the heading vector extracted from the trifocal tensor that minimizes the angular error between the heading vector and the global camera position. The advantage of such a method is its compact formulation (3 × number of cameras variables), but it is highly dependent on the quality of the initial translation estimates. Arie-Nachimson et al. [3] use a least-squares minimization of the epipolar equation to find the unknown translations. The obvious drawback is the assumption that there is no outlier correspondence, as all corresponding point pairs are used. Moreover, Rodríguez et al. [26] show that this method can handle neither collinear series of views nor shared optical centers.
Estimating both translations and 3D points. The joint estimation of translations and 3D points can be formulated using second-order cone programming expressing the problem with the $\ell_\infty$ norm, as proposed by Hartley and Schaffalitzky [15], and later generalized [16]. Such methods rely on upper constraints on the residual error of feature points and rapidly involve a large number of unknowns. They are computationally and memory expensive. The solution is globally optimal thanks to multiple convex optimizations, using bisections of a quasi-convex problem.
Dalalyan et al. [6] deal with outliers with a formulation using $\ell_1$ constraints instead of $\ell_2$ cones. It relies on two linear programs, the first one identifying outliers, and the second one solving translations and 3D structure on the selected inliers. It avoids the use of the non-negative slack variables in the single-step procedure used by Olsson et al. [25], as adding one slack variable per measurement rapidly increases the problem size with the number of images.
Those $\ell_\infty$ problems can be solved faster. Seo et al. [27] find a global solution by using a growing feasible subset while all the residual errors of the measurements are under the precision of the subset. This approach is faster because only a subpart of the data is fed to the $\ell_\infty$ minimization. However, it is not robust to outliers. Agarwal et al. [2] test different bisection schemes and show that the Gugat algorithm [13] converges faster to the global solution. Zach et al. [38] use a proximal method to speed up the minimization of such convex problems.
Other approaches. Martinec et al. [18] use their global pipeline many times to iteratively discard two-view geometries with the largest residuals. To keep good running times, they compute the translation and structure just on a few point pairs: each epipolar geometry is represented by 4 points only. Courchay et al. [4] use a linear parametrization of a tree of trifocal tensors over the epipolar graph to solve the camera position. The method is restricted to a single cycle.
1.2. Our global method for global calibration
Our input is an unordered set of pictures $\{I_1, \dots, I_n\}$. The internal calibration parameters $K_i$ are assumed known for each camera: our goal is to robustly recover the global pose of each camera (absolute motion rotation $R_i$ and translation $T_i$) from relative camera motions (rotation $R_{ij}$ and heading translation vector $t_{ij}$) between images $I_i$ and $I_j$.
Our contributions are the following:
1. We show that an iterative use of the Bayesian inference of Zach et al. [37], adjusted with the cycle length weighting of Enqvist et al. [7], can remove most outlier edges in the graph, allowing a more robust estimation of absolute rotations $R_i$ (Section 2).
2. We present a new trifocal tensor estimation method based on the $\ell_\infty$ norm, resulting in a linear program, which, used as minimal solver in an adaptive RANSAC algorithm, is efficient and yields stable relative translation directions $t_{ij}$ (Section 3).
3. We propose a new translation registration method that estimates the relative translation scales $\lambda_{ij}$ and absolute translations $T_i$, based on the $\ell_\infty$ norm, resulting also in an efficient linear program (Section 4).
4. We put together these ingredients into an SfM pipeline (Section 5) that first cleans up an epipolar graph from outliers, then computes the global motions from the relative ones. Our experiments show its robustness, accuracy and scalability (Section 6).¹
2. Robust estimation of global rotations
For matching points $X$ and $X'$ in images $I_i$ and $I_j$ respectively, the two-view epipolar constraint can be written
$$(K_i^{-1} X)^T E_{ij} (K_j^{-1} X') = 0. \qquad (1)$$
The five-point algorithm of Nistér [23] inserted as minimal solver in a RANSAC procedure robustly estimates the essential matrices $E_{ij} = [t_{ij}]_\times R_{ij}$, from which $R_{ij}$ can be extracted, together with the direction $t_{ij}$, since the scale is arbitrary. Four different motions $(R_{ij}, t_{ij})$ actually have to be tested; the one yielding the largest count of points satisfying the cheirality constraint (positive depth of the 3D point) is retained. It is important to note that the rotation accuracy is nearly insensitive to the baseline [7], contrary to the translation direction. Besides, although the camera rotations between connected views can be chained, the relative translations cannot since they are available up to a differing unknown scale factor $\lambda_{ij}$.
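The four candidate motions can be enumerated with the textbook SVD factorization of an essential matrix. The sketch below assumes NumPy and $E = [t]_\times R$; the helper names (`four_motions`, `skew`, `rot_x`, `rot_z`) and the toy motion are hypothetical. The cheirality test that selects one of the four candidates requires triangulating the matches and is left out.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def four_motions(E):
    """Return the four (R, t) candidates with E ~ [t]_x R, t being a unit direction."""
    U, _, Vt = np.linalg.svd(E)
    # Force proper rotations; flipping a sign only changes E to -E.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    t = U[:, 2]  # left null vector of E, defined up to sign
    return [(R, s * t) for R in (U @ W @ Vt, U @ W.T @ Vt) for s in (1.0, -1.0)]

def rot_x(deg):
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_z(deg):
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Toy check: the true motion must be among the four candidates.
R_true = rot_z(25.0) @ rot_x(-15.0)
t_true = np.array([1.0, 0.2, -0.3])
t_true = t_true / np.linalg.norm(t_true)
candidates = four_motions(skew(t_true) @ R_true)
found = any(np.allclose(R, R_true, atol=1e-8) and np.allclose(t, t_true, atol=1e-8)
            for R, t in candidates)
```

In a full pipeline, each candidate would be used to triangulate the inlier matches, and the candidate with the most points in front of both cameras would be kept.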
We identify inconsistent relative rotations in the graph using the edge disambiguation of Zach et al. [37]. As preliminary experiments showed that a number of outlier rotations could pass Zach et al.’s test, we made two improvements. First, we adapted the cycle error probability using the results of Enqvist et al. [7], weighting errors by a factor $1/\sqrt{l}$ where $l$ is the length of the cycle. Second, we iterate Zach et al.’s algorithm until no more edge is removed by the Bayesian inference procedure. Finally, we check all the triplets of the graph and reject the ones with cycle deviation to identity larger than 2°. Experiments in Table 1 show that half of the outliers can remain after the first Bayesian inference, which motivates our iterated elimination.
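The final triplet check can be sketched as below, assuming NumPy and the convention $R_j = R_{ij} R_i$, so that a consistent triplet satisfies $R_{ki} R_{jk} R_{ij} = I$. The function names and toy rotations are illustrative assumptions; the Bayesian inference stage is not reproduced.

```python
import numpy as np

def angle_deg(R):
    """Rotation angle of R in degrees (deviation from identity)."""
    c = (np.trace(R) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))

def triplet_is_consistent(R_ij, R_jk, R_ki, max_deg=2.0):
    """Keep the triplet when composing i -> j -> k -> i stays within max_deg of identity."""
    return angle_deg(R_ki @ R_jk @ R_ij) <= max_deg

def rot_x(deg):
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_z(deg):
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Toy global rotations and the relative rotations they induce.
R_i, R_j, R_k = np.eye(3), rot_x(20.0), rot_z(30.0) @ rot_x(-10.0)
R_ij, R_jk, R_ki = R_j @ R_i.T, R_k @ R_j.T, R_i @ R_k.T
R_jk_bad = rot_x(3.0) @ R_jk  # one edge corrupted by 3 degrees

ok_clean = triplet_is_consistent(R_ij, R_jk, R_ki)
ok_corrupt = triplet_is_consistent(R_ij, R_jk_bad, R_ki)
```

Here the clean triplet passes and the corrupted one is rejected, since a 3° error on one edge exceeds the 2° bound.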
Global rotations are computed as done by Martinec et al. [18], with a least-squares minimization that tries to satisfy equations $R_j = R_{ij} R_i$, followed by the computation of the nearest rotation to cover the lack of orthogonality constraint during minimization.
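A minimal sketch of this two-stage idea, assuming NumPy: the equations $R_j - R_{ij} R_i = 0$ are stacked into a linear least-squares problem over unconstrained 3×3 blocks, with view 0 fixed to the identity as a gauge (an assumption made here), and each block is then projected onto the nearest rotation by SVD. The function names are hypothetical and any weighting used by Martinec et al. [18] is not reproduced.

```python
import numpy as np

def nearest_rotation(M):
    """Closest rotation to M in Frobenius norm (SVD projection with det = +1)."""
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt

def global_rotations(n_views, relative):
    """relative: {(i, j): R_ij} with R_j = R_ij R_i. View 0 is fixed to the identity."""
    A = np.zeros((3 * len(relative), 3 * n_views))
    for row, ((i, j), R_ij) in enumerate(relative.items()):
        A[3 * row:3 * row + 3, 3 * j:3 * j + 3] = np.eye(3)   # + R_j
        A[3 * row:3 * row + 3, 3 * i:3 * i + 3] = -R_ij       # - R_ij R_i
    # Move the known block R_0 = I to the right-hand side and solve for the rest.
    X_rest, *_ = np.linalg.lstsq(A[:, 3:], -A[:, :3], rcond=None)
    blocks = [np.eye(3)] + [X_rest[3 * k:3 * k + 3] for k in range(n_views - 1)]
    return [nearest_rotation(B) for B in blocks]

def rot_x(deg):
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_z(deg):
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Toy graph with three views and exact relative rotations.
R_true = [np.eye(3), rot_z(30.0), rot_x(45.0) @ rot_z(-20.0)]
relative = {(i, j): R_true[j] @ R_true[i].T for (i, j) in [(0, 1), (1, 2), (0, 2)]}
R_est = global_rotations(3, relative)
```

With noisy relative rotations the least-squares blocks are no longer orthogonal, which is exactly what the projection step compensates for.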
Dataset \ #Iterations 1 2 3 2° check
Orangerie (Fig. 5) 8 4 1 9
Opera (Fig. 1 top) 7 3 125
Pantheon (Fig. 1 bottom) 9 2 7
Table 1. Number of edges rejected by Bayesian inference iteration.
3. Relative translations from trifocal tensors
To improve robustness and accuracy when computing the relative motion between cameras, we consider triplets of views instead of pairs as usual. We show in Section 3.2 that this yields a precision jump of an order of magnitude in the estimated translations.
¹ More extensive experiments are provided as supplementary material.
3.1. Robust trifocal tensor with known rotations
Given estimated global rotations $R_i$, as computed in Section 2, we estimate a “reduced” trifocal tensor using an adaptive RANSAC procedure to be robust to outlier correspondences. Rather than minimizing an algebraic error having a closed-form solution as Sim et al. [29], we minimize the $\ell_\infty$ reprojection error of 3D points $X_j$ compared to the observed points $\{(x_j^i, y_j^i)\}_{i \in \{1,2,3\}}$ in the three images:
$$\rho(t_i, X_j) = \left\| \left( x_j^i - \frac{R_i^1 X_j + t_i^1}{R_i^3 X_j + t_i^3},\; y_j^i - \frac{R_i^2 X_j + t_i^2}{R_i^3 X_j + t_i^3} \right) \right\|_\infty, \qquad (2)$$
where $t_i$ is the translation of view $i$ and $t_i^m$ its components. The tensor is found by the feasibility of this linear program:
$$\begin{aligned} \underset{\{t_i\}_i,\ \{X_j\}_j}{\text{minimize}} \quad & \gamma \\ \text{subject to} \quad & \rho(t_i, X_j) \le \gamma, && \forall i, j \\ & R_i^3 X_j + t_i^3 \ge 1, && \forall i, j \\ & t_1 = (0, 0, 0). \end{aligned} \qquad (3)$$
The second constraint ensures that all 3D points are in front of the cameras and the third one defines an origin for the local coordinate system of the triplet of views.
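To see why (3) can be treated as a linear feasibility problem, it may help to write out the first constraint for a fixed $\gamma$. The following is the usual linearization of an $\ell_\infty$ reprojection bound under positive depth, given as a reading aid rather than as the authors' derivation; it assumes $\rho$ is the $\ell_\infty$ norm of the residual in (2), and $d_{ij}$ is a symbol introduced here for the depth.

```latex
% Depth of point j in view i, positive by the second constraint of (3):
d_{ij} = R_i^3 X_j + t_i^3 \;\ge\; 1 .

% Multiplying the two residual components of (2) by d_{ij} > 0:
\rho(t_i, X_j) \le \gamma
\iff
\begin{cases}
  \left| x_j^i \, d_{ij} - (R_i^1 X_j + t_i^1) \right| \le \gamma \, d_{ij}, \\[2pt]
  \left| y_j^i \, d_{ij} - (R_i^2 X_j + t_i^2) \right| \le \gamma \, d_{ij}.
\end{cases}

% Each absolute value splits into two inequalities, e.g. for the x component:
(x_j^i - \gamma)\,(R_i^3 X_j + t_i^3) \;\le\; R_i^1 X_j + t_i^1 \;\le\; (x_j^i + \gamma)\,(R_i^3 X_j + t_i^3).
```

With the rotations $R_i$ known and $\gamma$ fixed, these four inequalities per observation are linear in the unknowns $t_i$ and $X_j$, so testing a given $\gamma$ amounts to a linear feasibility problem.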
In general, using a linear program can lead to two issues. First, as the number of variables increases, the solving time grows polynomially [27]. Second, robustness to outliers is typically achieved with slack variables [25], which makes the problem even bigger.
Our approach consists in computing the tensor using a small-size linear program as minimal solver with four tracked points across the three views, in conjunction with the AC-RANSAC framework [21] to be robust to noise and outliers. This variant of RANSAC relies on a contrario (AC) methodology to compute an adaptive threshold for inlier/outlier discrimination: a configuration is considered meaningful if its observation in a random setting is unexpected. While global $\ell_\infty$ minimization aims at finding a solution with the lowest $\gamma$ value, found by bisection, AC-RANSAC determines the number of false alarms (NFA):
$$\mathrm{NFA}(M, k) = (n - 4) \binom{n}{k} \binom{k}{4} \, e_k(M)^{k-4} \qquad (4)$$
where $M$ is a tested trifocal tensor obtained by the minimal solver using four random correspondences, $\gamma = 0.5$ pixel, $n$ is the number of corresponding points in the triplet, and where $e_k = \epsilon_k / \max(w, h)$ depends on the $k$-th error:
$$\epsilon_k = k\text{-th smallest element of } \{\max_i \rho(t_i(M), X_j)\}_j. \qquad (5)$$
              Running time (s)          Angle accuracy (°)
#3D Points    Slack variables   AC      Slack variables   AC
200           1.37              0.09    0.07              0.03
400           4.06              0.11    0.06              0.03
600           7.94              0.13    0.04              0.02
800           13.1              0.15    0.03              0.02
1000          19.6              0.16    0.03              0.02
Table 2. Required time and accuracy (average angle of translation directions with ground truth) in robust estimation of trifocal tensor with the global formulation using slack variables [25] and our a contrario method (linear program combined with AC-RANSAC).
In these formulas, $w$ and $h$ are the dimensions of the images and $e_k$ is the probability of a point having reprojection error at most $\epsilon_k$. $X_j$ is obtained by least-squares triangulation of the corresponding points $\{(x_j^i, y_j^i)\}_{i \in \{1,2,3\}}$. $k$ represents a hypothesized number of inliers. In (4), $e_k(M)^{k-4}$ is therefore the probability that the $k - 4$ minimal reprojection errors of uniformly distributed independent corresponding points in the three images (our background model) have error at most $\epsilon_k$, playing the role of the optimal $\gamma$ of (3) for the inliers. The other terms in (4) define the number of subsets of $k$ inliers among the $n - 4$ remaining points. Thus $\mathrm{NFA}(M, k)$ is the expectation of random correspondences having maximum error $\gamma = \epsilon_k$. The trifocal tensor $M$ is deemed meaningful (unlikely to occur by chance) if:
$$\mathrm{NFA}(M) = \min_{5 \le k \le n} \mathrm{NFA}(M, k) \le 1. \qquad (6)$$
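Equations (4) to (6) can be evaluated directly from the per-point errors of a candidate model. The sketch below uses only the Python standard library and works in $\log_{10}$ to avoid overflow of the binomial terms, so the test $\mathrm{NFA} \le 1$ becomes $\log_{10} \mathrm{NFA} \le 0$. The function names, the small floor on $e_k$ and the toy error values are assumptions made for illustration; the trifocal solver that would produce the errors is not included.

```python
import math

def log10_nfa(errors, k, width, height, sample_size=4):
    """log10 of NFA(M, k) from equation (4), given one error per correspondence."""
    n = len(errors)
    eps_k = sorted(errors)[k - 1]                    # k-th smallest error, equation (5)
    e_k = max(eps_k / max(width, height), 1e-300)    # floor avoids log10(0)
    return (math.log10(n - sample_size)
            + math.log10(math.comb(n, k))
            + math.log10(math.comb(k, sample_size))
            + (k - sample_size) * math.log10(e_k))

def best_nfa(errors, width, height, sample_size=4):
    """Minimize over k as in equation (6); returns (log10 NFA, k)."""
    n = len(errors)
    return min((log10_nfa(errors, k, width, height, sample_size), k)
               for k in range(sample_size + 1, n + 1))

# Toy errors in pixels: 40 small residuals (inliers) and 10 gross ones (outliers).
inlier_errors = [0.1 + 0.01 * i for i in range(40)]
outlier_errors = [200.0 + 20.0 * i for i in range(10)]
log_nfa, k_best = best_nfa(inlier_errors + outlier_errors, width=1000, height=1000)
is_meaningful = log_nfa <= 0.0
```

On this toy data the minimum is reached at the number of small residuals, and the error at that rank plays the role of the adaptive inlier threshold.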
In practice, we draw at most $N = 300$ random samples of 4 correspondences and evaluate the NFA of the associated models. As in Moisan et al. [19], as soon as a meaningful model $M$ is found, we stop and refine it by resampling $N/10$ times among the inliers of $M$. If no sample satisfies (6), we discard the triplet. Finally, we refine the translations and the $k$ inlier 3D points by bundle adjustment.
Table 2 evaluates the computation time and accuracy of our robust a contrario trifocal estimation compared to the equivalent global estimation with slack variables [25] on synthetic configurations. A uniform 1-pixel noise is added to each perfect correspondence and 2% outliers are introduced. We evaluate the accuracy of the results (angular error between ground truth and computed translation) and the required time to find the solution. The global formulation finds a solution that fits the noise of the data, but AC-RANSAC is able to go further and find a more precise solution.
3.2. Relative translation accuracy
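Following experiments of Enqvist et al. [7] concerning two-view rotation precision, we demonstrate that using a trifocal tensor can lead to substantial improvement in the relative translation estimation. To assess the impact of small baseline, a simple synthetic experiment is performed. A set of fifty 3D points are randomly generated in a $[-1, 1]^3$ cube and 3 cameras are placed on a circle at distance 5, at angles 0°, α and 2α respectively (see Figure 2, left).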
References
Distinctive Image Features from Scale-Invariant Keypoints (journal article).
Scalable Recognition with a Vocabulary Tree (proceedings article).
Photo tourism: exploring photo collections in 3D (journal article).
An efficient solution to the five-point relative pose problem (journal article).
Towards Linear-Time Incremental Structure from Motion (proceedings article).