Building Rome in a Day
Sameer Agarwal¹, Noah Snavely², Ian Simon¹, Steven M. Seitz¹, Richard Szeliski³
¹University of Washington    ²Cornell University    ³Microsoft Research
Abstract
We present a system that can match and reconstruct 3D
scenes from extremely large collections of photographs such
as those found by searching for a given city (e.g., Rome) on
Internet photo sharing sites. Our system uses a collection
of novel parallel distributed matching and reconstruction
algorithms, designed to maximize parallelism at each stage
in the pipeline and minimize serialization bottlenecks. It is
designed to scale gracefully with both the size of the problem
and the amount of available computation. We have experi-
mented with a variety of alternative algorithms at each stage
of the pipeline and report on which ones work best in a
parallel computing environment. Our experimental results
demonstrate that it is now possible to reconstruct cities con-
sisting of 150K images in less than a day on a cluster with
500 compute cores.
1. Introduction
Entering the search term “Rome” on flickr.com returns
more than two million photographs. This collection rep-
resents an increasingly complete photographic record of the
city, capturing every popular site, facade, interior, fountain,
sculpture, painting, cafe, and so forth. Most of these pho-
tographs are captured from hundreds or thousands of view-
points and illumination conditions—Trevi Fountain alone
has over 50,000 photographs on Flickr. Exciting progress
has been made on reconstructing individual buildings or
plazas from similar collections [16, 17, 8], showing the po-
tential of applying structure from motion (SfM) algorithms
on unstructured photo collections of up to a few thousand
photographs. This paper presents the first system capable of
city-scale reconstruction from unstructured photo collections.
We present models that are one to two orders of magnitude
larger than the next largest results reported in the literature.
Furthermore, our system enables the reconstruction of data
sets of 150,000 images in less than a day.
To whom correspondence should be addressed. Email:
sagarwal@cs.washington.edu
City-scale 3D reconstruction has been explored previ-
ously in the computer vision literature [12, 2, 6, 21] and is
now widely deployed e.g., in Google Earth and Microsoft’s
Virtual Earth. However, existing large scale structure from
motion systems operate on data that comes from a structured
source, e.g., aerial photographs taken by a survey aircraft
or street side imagery captured by a moving vehicle. These
systems rely on photographs captured using the same cal-
ibrated camera(s) at a regular sampling rate and typically
leverage other sensors such as GPS and Inertial Navigation
Units, vastly simplifying the computation.
Images harvested from the web have none of these sim-
plifying characteristics. They are taken from a variety of
different cameras, in varying illumination conditions, have
little to no geographic information associated with them, and
in many cases come with no camera calibration information.
The same variability that makes these photo collections so
hard to work with for the purposes of SfM also makes them
an extremely rich source of information about the world. In
particular, they specifically capture things that people find in-
teresting, i.e., worthy of photographing, and include interiors
and artifacts (sculptures, paintings, etc.) as well as exteriors
[14]. While reconstructions generated from such collections
do not capture a complete covering of scene surfaces, the
coverage improves over time, and can be complemented by
adding aerial or street-side images.
The key design goal of our system is to quickly produce
reconstructions by leveraging massive parallelism. This
choice is motivated by the increasing prevalence of parallel
compute resources both at the CPU level (multi-core) and
the network level (cloud computing). At today’s prices, for
example, you can rent 1000 nodes of a cluster for 24 hours
for $10,000 [1].
The cornerstone of our approach is a new system for large-
scale distributed computer vision problems, which we will
be releasing to the community. Our pipeline draws largely
from the existing state of the art of large scale matching and
SfM algorithms, including SIFT, vocabulary trees, bundle
adjustment, and other known techniques. For each stage
in our pipeline, we consider several alternatives, as some

algorithms naturally distribute and some do not, and issues
like memory usage and I/O bandwidth become critical. In
cases such as bundle adjustment, where we find that the
existing implementations did not scale, we have created our
own high performance implementations. Designing a truly
scalable system is challenging, and we discovered many
surprises along the way. The main contributions of our
paper, therefore, are the insights and lessons learned, as well
as technical innovations that we invented to improve the
parallelism and throughput of our system.
The rest of the paper is organized as follows. Section 2
discusses the detailed design choices and implementation of
our system. Section 3 reports the results of our experiments
on three city scale data sets, and Section 4 concludes with a
discussion of some of the lessons learned and directions for
future work.
2. System Design
Our system runs on a cluster of computers (nodes), with
one node designated as the master node. The master node is
responsible for the various job scheduling decisions.
In this section, we describe the detailed design of our
system, which can naturally be broken up into three distinct
phases: (1) pre-processing (§2.1), (2) matching (§2.2), and
(3) geometric estimation (§2.4).
2.1. Preprocessing and feature extraction
We assume that the images are available on a central
store, from which they are distributed to the cluster nodes on
demand in chunks of fixed size. This automatically performs
load balancing, with more powerful nodes receiving more
images to process.
This is the only stage where a central file server is required;
the rest of the system operates without using any shared stor-
age. This is done so that we can download the images from
the Internet independent of our matching and reconstruction
experiments. For production use, it would be straightforward
to have each cluster node crawl the Internet for images at
the start of the matching and reconstruction process.
On each node, we begin by verifying that the image files
are readable, valid images. We then extract the EXIF tags, if
present, and record the focal length. We also downsample
images larger than 2 Mega-pixels, preserving their aspect
ratios and scaling their focal lengths. The images are then
converted to grayscale and SIFT features are extracted from
them [10]. We use the SIFT++ implementation by Andrea
Vedaldi for its speed and flexible interface [20]. At the end of
this stage, the entire set of images is partitioned into disjoint
sets, one for each node. Each node owns the images and
SIFT features associated with its partition.
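As a concrete illustration of this stage, the sketch below performs the per-node checks described above using the Pillow library; the library choice, the helper name, and the EXIF handling are assumptions, and SIFT extraction itself is omitted.

    # Hypothetical per-node preprocessing sketch (library choice and EXIF
    # handling are assumptions; SIFT extraction itself is omitted).
    from PIL import Image, ExifTags

    MAX_PIXELS = 2_000_000          # downsample anything larger than 2 megapixels

    def preprocess(path):
        img = Image.open(path)
        img.verify()                # rejects unreadable / truncated files
        img = Image.open(path)      # Pillow requires re-opening after verify()

        exif = getattr(img, "_getexif", lambda: None)() or {}
        tags = {ExifTags.TAGS.get(k, k): v for k, v in exif.items()}
        focal = tags.get("FocalLength")      # EXIF focal length, if recorded

        w, h = img.size
        scale = 1.0
        if w * h > MAX_PIXELS:               # preserve the aspect ratio
            scale = (MAX_PIXELS / float(w * h)) ** 0.5
            img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))

        gray = img.convert("L")              # SIFT runs on grayscale images
        # A focal length expressed in pixels must be multiplied by `scale`
        # to stay consistent with the downsampled image.
        return gray, focal, scale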
Figure 1. Our multi-stage parallel matching pipeline. The N input
images are distributed onto M processing nodes, after which the
following processing stages are performed: (a) SIFT feature ex-
traction, vocabulary tree vector quantization, and term frequency
counting; (b) document frequency counting; (c) TFIDF computa-
tion and information broadcast; (d) computation of TF-based match
likelihoods; (e) aggregation at the master node; (f) round-robin
bin-packing distribution of match verification tasks based on the
top k1 matches per image; (g) match verification with optional
inter-node feature vector exchange; (h) match proposal expansion
based on images found in connected components CC using the
next k2 best matches per image; (i) more distributed verification;
(j) four more rounds of query expansion and verification; (k) track
merging based on local verified matches; (l) track aggregation
into C connected components; (m) final distribution of tracks by
image connected components and distributed merging.
2.2. Image Matching
The key computational tasks when matching two images
are the photometric matching of interest points and the geo-
metric verification of these matches using a robust estimate
of the fundamental or essential matrix. While exhaustive
matching of all features between two images is prohibitively
expensive, excellent results have been reported with approxi-
mate nearest neighbor search. We use the ANN library [3]
for matching SIFT features. For each pair of images, the
features of one image are inserted into a k-d tree, and the
features from the other image are used as queries. For each
query, we consider the two nearest neighbors, and matches
that pass Lowe’s ratio test are accepted [10]. We use the
priority queue based search method, with an upper bound
on the maximum number of bin visits of 200. In our ex-
perience, these parameters offer a good tradeoff between
computational cost and matching accuracy. The matches re-
turned by the approximate nearest neighbor search are then
pruned and verified using a RANSAC-based estimation of
the fundamental or essential matrix [18], depending on the
availability of focal length information from the EXIF tags.

These two operations form the computational kernel of our
matching engine.
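A minimal sketch of this matching kernel follows, using SciPy's cKDTree in place of the ANN library (so the bounded-bin-visit approximation is not reproduced); the ratio threshold shown is illustrative rather than a value taken from the text.

    # Illustrative matching kernel: approximate nearest neighbours + ratio test.
    # SciPy's cKDTree stands in for the ANN library; the ratio threshold is an
    # assumption, not a value quoted in the text.
    from scipy.spatial import cKDTree

    def match_features(desc_a, desc_b, ratio=0.8):
        """desc_a, desc_b: (N, 128) and (M, 128) arrays of SIFT descriptors."""
        tree = cKDTree(desc_b)                    # index one image's features
        dists, idx = tree.query(desc_a, k=2)      # two nearest neighbours per query
        matches = []
        for i in range(len(desc_a)):
            if dists[i, 0] < ratio * dists[i, 1]:  # Lowe's ratio test
                matches.append((i, idx[i, 0]))
        return matches

The accepted matches are then handed to the RANSAC-based estimation of the fundamental or essential matrix described above.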
Unfortunately, even with a well optimized implementa-
tion of the matching procedure described above, it is not
practical to match all pairs of images in our corpus. For a
corpus of 100,000 images, this translates into 5,000,000,000
pairwise comparisons, which with 500 cores operating at 10
image pairs per second per core would require about 11.5
days to match. Furthermore, this does not even take into
account the network transfers required for all cores to have
access to all the SIFT feature data for all images.
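For concreteness, the arithmetic behind this estimate is simply

    \binom{100{,}000}{2} \approx 5 \times 10^{9} \text{ pairs}, \qquad
    \frac{5 \times 10^{9} \text{ pairs}}{500 \text{ cores} \times 10 \text{ pairs/s per core}}
    = 10^{6} \text{ s} \approx 11.5 \text{ days}.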
Even if we were able to do all these pairwise matches,
it would be a waste of computational effort, since an over-
whelming majority of the image pairs do not match. This is
expected from a set of images associated with a broad tag
like the name of a city. Thus, we must be careful in choosing
the image pairs that the system spends its time matching.
Building upon recent work on efficient object re-
trieval [15, 11, 5, 13], we use a multi-stage matching scheme.
Each stage consists of a proposal and a verification step. In
the proposal step, the system determines a set of image pairs
that it expects to share common scene elements. In the
verification step, detailed feature matching is performed on
these image pairs. The matches obtained in this manner then
inform the next proposal step.
In our system, we use two methods to generate proposals:
vocabulary tree based whole image similarity (§2.2.1) and
query expansion (§2.2.4). The verification stage is described
in §2.2.2. Figure 1 shows the system diagram for the entire
matching pipeline.
2.2.1 Vocabulary Tree Proposals
Methods inspired by text retrieval have been applied with
great success to the problem of object and image retrieval.
These methods are based on representing an image as a bag
of words, where the words are obtained by quantizing the im-
age features. We use a vocabulary tree-based approach [11],
where a hierarchical k-means tree is used to quantize the
feature descriptors. (See §3 for details on how we build the
vocabulary tree using a small corpus of training images.)
These quantizations are aggregated over all features in an
image to obtain a term frequency (TF) vector for the image,
and a document frequency (DF) vector for the corpus of
images (Figure 1a–e). The document frequency vectors are
gathered across nodes into a single vector that is broadcast
across the cluster. Each node normalizes the term frequency
vectors it owns to obtain the TFIDF matrix for that node.
These per-node TFIDF matrices are broadcast across the
network, so that each node can calculate the inner product
between its TFIDF vectors and the rest of the TFIDF vectors.
In effect, this is a distributed product of the matrix of TFIDF
vectors with itself, but each node only calculates the block
of rows corresponding to the set of images it owns. For
each image, the top scoring k1 + k2 images are identified,
where the first k1 images are used in an initial verification
stage, and the additional k2 images are used to enlarge the
connected components (see below).
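Viewed per node, steps (a)-(e) of Figure 1 amount to weighting and normalizing the local term-frequency rows and then computing one block of rows of the product of the TFIDF matrix with itself. The sketch below assumes a dense NumPy layout and leaves the gather/broadcast of DF vectors and TFIDF blocks to whatever message-passing layer the cluster provides; the exact TF and IDF weighting used in the system may differ.

    # Per-node TFIDF scoring sketch; data layout, weighting details, and the
    # gather/broadcast primitives are assumptions.
    import numpy as np

    def node_tfidf(tf_rows, df_global, n_images):
        """tf_rows:   (n_local, V) raw term counts for images owned by this node.
           df_global: (V,) document frequencies gathered across all nodes.
           n_images:  total number of images in the corpus."""
        idf = np.log(n_images / np.maximum(df_global, 1.0))
        tfidf = tf_rows * idf                               # weight local TF rows
        norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
        return tfidf / np.maximum(norms, 1e-12)             # L2-normalise each row

    def top_proposals(my_tfidf, all_tfidf_blocks, k1, k2):
        """Score this node's images against every broadcast block and keep the
           k1 + k2 best-scoring images per local image (column 0 is the image
           itself, since each image also appears in the broadcast blocks)."""
        scores = my_tfidf @ np.vstack(all_tfidf_blocks).T   # one block of rows of
        order = np.argsort(-scores, axis=1)                 # the full TFIDF product
        return order[:, 1:k1 + k2 + 1]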
Our system differs from that of Nister and Stewenius [11],
since their system has a fixed database to which they match
incoming images. They can therefore store the database in
the vocabulary tree itself and evaluate the match score of
an image on the fly. In our case, the query set is the same
as the database, and it is not available when the features
are being encoded. Thus, we must have a separate matrix
multiplication stage to find the best matching images.
2.2.2 Verification and detailed matching
The next step is to verify candidate image matches, and to
then find a detailed set of matches between matching images.
If the images were all located on a single machine, the task of
verifying a proposed matching image pair would be a simple
matter of running through the image pairs, perhaps with
some attention paid to the order in which the verifications
are performed so as to minimize disk I/O. However, in our
case, the images and their feature descriptors are distributed
across the cluster. Thus, asking a node to match the image
pair (i, j) requires it to fetch the image features from two
other nodes of the cluster. This is undesirable, as there
is a large difference between network transfer speeds and
local disk transfers. Furthermore, this creates work for three
nodes. Thus, the image pairs should be distributed across the
network in a manner that respects the locality of the data and
minimizes the amount of network transfers (Figure 1f–g).
We experimented with a number of approaches with some
surprising results. We initially tried to optimize network
transfers before any verification is done. In this setup, once
the master node has all the image pairs that need to be ver-
ified, it builds a graph connecting image pairs which share
an image. Using
MeTiS
[
7
], this graph is partitioned into
as many pieces as there are compute nodes. Partitions are
then matched to the compute nodes by solving a linear as-
signment problem that minimizes the number of network
transfers needed to send the required files to each node.
This algorithm worked well for small problem sizes, but
as the problem size increased, its performance degraded.
Our assumption that detailed matches between all
pairs of images take the same constant amount of time was
wrong: some nodes finished early and were idling for up to
an hour.
The second idea we tried was to over-partition the graph
into small pieces, and to parcel them out to the cluster nodes
on demand. When a node requests another chunk of work,
the piece with the fewest network transfers is assigned to it.
This strategy achieved better load balancing, but as the size

of the problem grew, the graph we needed to partition grew
to be enormous, and partitioning itself became a bottleneck.
The approach that gave the best results was to use a simple
greedy bin-packing algorithm (where each bin represents the
set of jobs sent to a node), which works as follows. The
master node maintains a list of images on each node. When
a node asks for work, it runs through the list of available
image pairs, adding them to the bin if they do not require
any network transfers, until either the bin is full or there are
no more image pairs to add. It then chooses an image (list of
feature vectors) to transfer to the node, selecting the image
that will allow it to add the maximum number of image pairs
to the bin. This process is repeated until the bin is full. This
algorithm has one drawback: it can require multiple sweeps
over all the image pairs needing to be matched. For large
problems, the scheduling of jobs can become a bottleneck.
A simple solution is to only consider a subset of the jobs at a
time, instead of trying to optimize globally. This windowed
approach works very well in practice, and all our experiments
are run with this method.
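A sketch of this windowed greedy bin-packing scheduler follows; the data structures (a per-node set of resident images, a global list of pending pairs) and the default window and bin sizes are assumptions made for illustration.

    # Hypothetical sketch of the greedy, windowed bin-packing scheduler.
    # `resident[node]` is the set of images whose features the node already
    # holds; `pending` is the list of image pairs still awaiting verification.

    def _take_local(node, resident, candidates, bin_, bin_size):
        # Move every pair whose two images are already on `node` into the bin.
        for pair in list(candidates):
            if len(bin_) >= bin_size:
                break
            i, j = pair
            if i in resident[node] and j in resident[node]:
                bin_.append(pair)
                candidates.remove(pair)

    def fill_bin(node, resident, pending, bin_size=1000, window=10000):
        bin_, candidates = [], list(pending[:window])   # only look at a window of jobs
        _take_local(node, resident, candidates, bin_, bin_size)
        while len(bin_) < bin_size and candidates:
            # Count, for each missing image, how many pairs one transfer unlocks.
            gain = {}
            for i, j in candidates:
                if i in resident[node] and j not in resident[node]:
                    gain[j] = gain.get(j, 0) + 1
                elif j in resident[node] and i not in resident[node]:
                    gain[i] = gain.get(i, 0) + 1
            if not gain:
                break                                   # remaining pairs need 2+ transfers
            best = max(gain, key=gain.get)              # image unlocking the most pairs
            resident[node].add(best)                    # stands in for one network transfer
            _take_local(node, resident, candidates, bin_, bin_size)
        for pair in bin_:
            pending.remove(pair)
        return bin_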
Verifying an image pair is a two-step procedure, consist-
ing of photometric matching between feature descriptors,
and a robust estimation of the essential or fundamental ma-
trix depending upon the availability of camera calibration
information. In cases where the estimation of the essen-
tial matrix succeeds, there is a sufficient angle between the
viewing directions of the two cameras, and the number of
matches is above a threshold, we do a full Euclidean two-
view reconstruction and store it. This information is used in
later stages (see §2.4) to reduce the size of the reconstruction
problem.
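Assuming OpenCV as the estimator, one round of geometric verification might look like the sketch below; the RANSAC threshold and confidence values are illustrative, and the essential-matrix branch presumes an EXIF-derived focal length is available.

    # Sketch of the geometric verification step using OpenCV's RANSAC
    # estimators; threshold and confidence values are illustrative.
    import cv2
    import numpy as np

    def verify_pair(pts_a, pts_b, focal=None, pp=(0.0, 0.0)):
        """pts_a, pts_b: (N, 2) arrays of matched keypoint locations."""
        pts_a = np.asarray(pts_a, dtype=np.float64)
        pts_b = np.asarray(pts_b, dtype=np.float64)
        if focal is not None:   # calibration known: estimate the essential matrix
            E, mask = cv2.findEssentialMat(pts_a, pts_b, focal, pp,
                                           method=cv2.RANSAC,
                                           prob=0.999, threshold=1.0)
            return E, mask
        else:                   # otherwise fall back to the fundamental matrix
            F, mask = cv2.findFundamentalMat(pts_a, pts_b,
                                             cv2.FM_RANSAC, 1.0, 0.999)
            return F, mask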
2.2.3 Merging Connected Components
At this stage, consider a graph on the set of images with
edges connecting two images if matching features were
found between them. We refer to this as the match graph. To
get as comprehensive a reconstruction as possible, we want
the fewest number of connected components in this graph.
To this end, we make further use of the proposals from the
vocabulary tree to try and connect the various connected
components in this graph. For each image, we consider
the next k2 images suggested by the vocabulary tree. From
these, we verify those image pairs which straddle two differ-
ent connected components (Figure 1h–i). We do this only
for images which are in components of size 2 or more. Thus,
images which did not match any of their top k1 proposed
matches are effectively discarded. Again, the resulting im-
age pairs are subject to detailed feature matching. Figure 2
illustrates this. Notice that after the first round, the match
graph has two connected components, which get connected
after the second round of matching.
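The candidate pairs for this merging step can be generated with a standard union-find structure over the current match graph, as in the sketch below; the proposals[i] list standing in for the vocabulary tree ranking of image i is an assumed representation.

    # Sketch of selecting cross-component verification candidates (step (h)
    # of Figure 1). `parent` is a union-find forest over images, `comp_size`
    # maps each component root to its size, and `proposals[i]` is assumed to
    # hold the vocabulary-tree ranking for image i (k1 tried + k2 new entries).

    def find(parent, x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def cross_component_pairs(parent, comp_size, proposals, k1, k2):
        pairs = []
        for i in parent:
            if comp_size[find(parent, i)] < 2:
                continue                     # images that matched nothing are dropped
            for j in proposals[i][k1:k1 + k2]:
                if find(parent, i) != find(parent, j):
                    pairs.append((i, j))     # straddles two components: verify it
        return pairs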
2.2.4 Query Expansion
After performing two rounds of matching as described above,
we have a match graph which is usually not dense enough to
reliably produce a good reconstruction. To remedy this, we
use another idea from text and document retrieval research:
query expansion [5].
In its simplest form, query expansion is done by first
finding the documents that match a user’s query, and then
using them to query the database again, thus expanding
the initial query. The results returned by the system are
some combination of these two queries. In essence, if we
were to define a graph on the set of documents, with similar
documents connected by an edge, and treat the query as a
document too, then query expansion is equivalent to finding
all vertices that are within two steps of the query vertex.
In our system, we consider the image match graph, where
images i and j are connected if they have a certain minimum
number of features in common. Now, if image i is connected
to image j and image j is connected to image k, we perform
a detailed match to check if image i matches image k. This
process can be repeated a fixed number of times or until the
match graph converges.
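In graph terms, one round of query expansion reduces to proposing all two-step neighbours that are not already connected, as in the sketch below (adj[i] is an assumed adjacency-set representation of the match graph); the proposed pairs are then passed to the same geometric verification as before.

    # One round of query expansion over the match graph (sketch).
    # `adj[i]` is the set of images currently verified as matching image i.

    def expansion_proposals(adj):
        proposals = set()
        for i, neighbours in adj.items():
            for j in neighbours:
                for k in adj[j]:                      # two steps away from i
                    if k != i and k not in neighbours:
                        proposals.add((min(i, k), max(i, k)))
        return proposals          # pairs to send for geometric verification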
A concern when iterating rounds of query expansion is
drift. Results of the secondary queries can quickly diverge
from the original query. This is not a problem in our system,
since query expanded image pairs are subject to detailed
geometric verification before they are connected by an edge
in the match graph.
2.3. Track Generation
The final step of the matching process is to combine all the
pairwise matching information to generate consistent tracks
across images, i.e., to find and label all of the connected
components in the graph of individual feature matches (the
feature graph). Since the matching information is stored lo-
cally on the compute node that it was computed on, the track
generation process proceeds in two stages (Figure 1k–m). In
the first, each node generates tracks from all the matching
data it has available locally. This data is gathered at the
master node and then broadcast over the network to all the
nodes. Observe that the tracks for each connected compo-
nent of the match graph can be processed independently.
The track generation then proceeds with each compute node
being assigned a connected component for which the tracks
need to be produced. As we merge tracks based on shared
features, inconsistent tracks can be generated, where two
feature points in the same image belong to the same track.
In such cases, we drop the offending points from the track.
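A sketch of the local track-generation step using union-find over (image, feature) observations follows; the representation of the matches is assumed, and inconsistent observations are dropped point-wise as described above.

    # Sketch of local track generation via union-find over feature observations.
    # An observation is an (image_id, feature_id) pair; `matches` is a list of
    # ((img_a, feat_a), (img_b, feat_b)) verified feature matches.
    from collections import defaultdict

    def build_tracks(matches):
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        def union(a, b):
            parent[find(a)] = find(b)

        for obs_a, obs_b in matches:
            union(obs_a, obs_b)

        tracks = defaultdict(list)
        for obs in parent:
            tracks[find(obs)].append(obs)

        cleaned_tracks = []
        for obs_list in tracks.values():
            counts = defaultdict(int)
            for img, _ in obs_list:
                counts[img] += 1
            # Drop the offending points: keep only observations from images
            # that contribute exactly one feature to the track.
            cleaned = [(img, f) for img, f in obs_list if counts[img] == 1]
            if len(cleaned) >= 2:
                cleaned_tracks.append(cleaned)
        return cleaned_tracks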
Once the tracks have been generated, the next step is to
extract the 2D coordinates of the feature points that occur in
the tracks from the corresponding image feature files. We
also extract the pixel color for each such feature point, which
is later used for rendering the 3D point with the average

Figure 2. The evolution of the match graph as a function of the rounds of matching, and the skeletal set corresponding to it (columns, left to right: Initial Matches, CC Merge, Query Expansion 1, Query Expansion 4, Skeletal Set). Notice how the second round of matching merges the two components into one, and how rapidly the query expansion increases the density of the within-component connections. The last column shows the skeletal set corresponding to the final match graph. The skeletal sets algorithm can break up connected components found during the match phase if it determines that a reliable reconstruction is not possible, which is what happens in this case.
color of the feature points associated with it. Again, this
procedure proceeds in two steps. Given the per-component
tracks, each node extracts the feature point coordinates and
the point colors from the SIFT and image files that it owns.
This data is gathered and broadcast over the network, where
it is processed on a per connected component basis.
2.4. Geometric Estimation
Once the tracks have been generated, the next step is to
run structure from motion (SfM) on every connected compo-
nent of the match graph to recover a pose for every camera
and a 3D position for every track. Most SfM systems for
unordered photo collections are incremental, starting with a
small reconstruction, then growing a few images at a time,
triangulating new points, and doing one or more rounds
of nonlinear least squares optimization (known as bundle
adjustment [19]) to minimize the reprojection error. This
process is repeated until no more cameras can be added.
However, due to the scale of our collections, running such an
incremental approach on all the photos at once is impractical.
The incremental reconstruction procedure described
above has in it the implicit assumption that all images con-
tribute more or less equally to the coverage and accuracy of
the reconstruction. Internet photo collections, by their very
nature, are redundant—many photographs are taken from
nearby viewpoints and processing all of them does not nec-
essarily add to the reconstruction. It is thus preferable to
find and reconstruct a minimal subset of photographs that
capture the essential connectivity of the match graph and the
geometry of the scene [8, 17]. Once this is done, we can
add back in all the remaining images using pose estimation,
triangulate all remaining points, and then do a final bundle
adjustment to refine the SfM estimates.
For finding this minimal set, we use the skeletal sets
algorithm of [17], which computes a spanning set of pho-
tographs that preserves important connectivity information
in the image graph (such as large loops). In [17], a two-frame
reconstruction is computed for each pair of matching images
with known focal lengths. In our system, these pairwise
reconstructions are computed as part of the parallel matching
process. Once a skeletal set is computed, we estimate the
SfM parameters of each resulting component using the incre-
mental algorithm of [16]. The skeletal sets algorithm often
breaks up connected components across weakly-connected
boundaries, resulting in a larger set of components.
2.4.1 Bundle Adjustment
Having reduced the size of the SfM problem down to the
skeletal set, the primary bottleneck in the reconstruction pro-
cess is the non-linear minimization of the reprojection error,
or bundle adjustment (BA). The best performing BA soft-
ware available publicly is Sparse Bundle Adjustment (SBA)
by Lourakis & Argyros [
9
]. The key to its high performance
is the use of the so called Schur complement trick [
19
] to re-
duce the size of the linear system (also known as the normal
equations) that needs to be solved in each iteration of the
Levenberg-Marquardt (LM) algorithm. The size of this linear
system depends on the 3D point and the camera parameters,
whereas the size of the Schur complement only depends on
the camera parameters. SBA then uses a dense Cholesky
factorization to factor and solve the resulting reduced linear
system. Since the number of 3D points in a typical SfM
problem is usually two orders of magnitude or more larger
than the number of cameras, this leads to substantial sav-
ings. This works for small to moderate sized problems, but
for large problems with thousands of images, computing
the dense Cholesky factorization of the Schur complement
becomes a space and time bottleneck. For large problems,
however, the Schur complement itself is quite sparse (a 3D
point is usually visible in only a few cameras) and exploiting
this sparsity can lead to significant time and space savings.
We have developed a new high performance bundle adjust-
ment software that, depending upon the size of the problem,
chooses between a truncated and an exact step LM algorithm.
In the first case, a block diagonal preconditioned conjugate
gradient method is used to approximately solve the normal
equations. In the second case, CHOLMOD [4], a sparse
direct method for computing the Cholesky factorization, is
used to exactly solve the normal equations via the Schur
complement.
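For reference, the structure being exploited is the standard one (textbook bundle adjustment notation, not code from this system). Grouping the Levenberg-Marquardt normal equations into camera and point blocks,

    \begin{bmatrix} U & W \\ W^\top & V \end{bmatrix}
    \begin{bmatrix} \delta_c \\ \delta_p \end{bmatrix}
    =
    \begin{bmatrix} r_c \\ r_p \end{bmatrix},

where V is block diagonal with one 3x3 block per 3D point, eliminating the point updates yields the reduced camera system

    (U - W V^{-1} W^\top)\,\delta_c = r_c - W V^{-1} r_p,
    \qquad
    \delta_p = V^{-1}\,(r_p - W^\top \delta_c).

The matrix U - W V^{-1} W^\top is the Schur complement; its dimension depends only on the number of cameras, and its (i, j) block is nonzero only when cameras i and j observe a common point, which is the sparsity referred to above.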
