
Graph-Based Discriminative Learning for Location Recognition
Song Cao Noah Snavely
Cornell University
Abstract
Recognizing the location of a query image by matching
it to a database is an important problem in computer vision,
and one for which the representation of the database is a key
issue. We explore new ways for exploiting the structure of a
database by representing it as a graph, and show how the
rich information embedded in a graph can improve a bag-
of-words-based location recognition method. In particular,
starting from a graph on a set of images based on visual
connectivity, we propose a method for selecting a set of sub-
graphs and learning a local distance function for each using
discriminative techniques. For a query image, each database
image is ranked according to these local distance functions
in order to place the image in the right part of the graph. In
addition, we propose a probabilistic method for increasing
the diversity of these ranked database images, again based
on the structure of the image graph. We demonstrate that our
methods improve performance over standard bag-of-words
methods on several existing location recognition datasets.
1. Introduction
Location recognition—determining where an image was
taken—is an important problem in computer vision. How-
ever, there is no single definition for what it means to be
a location, and, accordingly, a wide variety of representa-
tions for places have been used in research: Are places, for
instance, a set of distinct landmarks, each represented by a
set of images? [33, 16] Are places latitude and longitude
coordinates, represented with a set of geotagged images?
[11] Should places be represented with 3D geometry, from
which we can estimate an explicit camera pose for a query
image? [17, 25, 18] This question of representation has ana-
logues in more general object recognition problems, where
many approaches regard objects as belonging to pre-defined
categories (cars, planes, bottles, etc.), but other work repre-
sents objects more implicitly as structural relations between
images, encoded as a graph (as in the Visual Memex [19]).
Inspired by this latter work, our paper addresses the loca-
tion recognition problem by representing places as graphs
encoding relations between images, and explores how this
Figure 1.
A segment of an example image matching graph with
three clusters defined by representative images A, B and C.
Nodes in this graph are images, and edges connect overlapping
images. In order to match a new query image to the graph, our
method learns local distance functions for a set of neighborhoods
that cover the graph, for instance, the neighborhoods centered at
nodes A, B, and C, circled with colored boundaries. Given a query
image, we match to the graph using these learned neighborhood
models, rather than considering database images individually. Each
neighborhood has its own distinctive features, and our goal is to
learn and use them to aid recognition.
representation can aid in recognition. In our case, graphs
represent visual overlap between images—nodes correspond
to images, and edges to overlapping, geometrically consis-
tent image pairs—leveraging recent work on automatically
building image graphs (and 3D models) from large-scale
image collections [1, 7, 4, 3]. An example image graph for
photos of the town of Dubrovnik is shown in Figure 1. Given
an image graph, our goal is to take a query image and plug it
in to the graph in the right place, in effect recognizing its lo-
cation. The idea is that the structure inherent in these graphs
encodes much richer information than the set of database
images alone, and that utilizing this structural information
can result in better recognition methods.
We make use of this structural information in a bag-of-
words-based location recognition framework, in which we
take a query image, retrieve similar images in the database,
and perform detailed matching to verify each retrieved image

until a match is found. While others have used image graphs
in various settings before (especially in 3D reconstruction),
our main contribution is to introduce two new ways to ex-
ploit the graph’s structure in recognition. First, we build local
models of what it means to be similar to each neighbor-
hood of the graph (Figure 1). To do so, we use the graph’s
structure to define sets of images that are similar, and sets
that are different, and use discriminative learning techniques
to compute local distance functions tuned to specific parts
of the graph. Second, we use the connectivity of the graph
to encourage diversity in the set of results, using a proba-
bilistic algorithm to retrieve a shortlist of similar images that
are more likely to have at least one match. We show that
our graph-based approach results in improvements over bag-
of-words retrieval methods, and yields performance that is
close to more expensive direct feature matching techniques
on existing location recognition datasets.
2. Related Work
Image retrieval and supervised learning.
As with other
location recognition approaches [27, 12, 14, 26], our work
uses an image-retrieval-based framework using a bag-of-
words model for a database of images. However, our goal
is not retrieval per se (i.e., to retrieve all related instances
of a query image), but instead recognition, where we aim
to determine where an image was taken (for which a single
correctly retrieved database image can be sufficient).
Our work uses supervised learning to improve on such
methods. Prior work has also used various forms of su-
pervision to improve bag-of-words-style methods for both
retrieval and recognition. One type of supervision is based
on geolocation; images that are physically close—on the
same street, say—should also be closer in terms of their
image distance than images across the city or the globe.
Geolocation cues have been used to reweight different vi-
sual words based on their geographic frequency [27, 14], or
to find patches that discriminate different cities [6]. Other
methods rely on image matching to identify good features,
as we do. Turcot and Lowe [31] perform feature matching
on database images to find reliable features. Arandjelovic
and Zisserman propose discriminative query expansion in
which a per-query-image distance metric is learned based
on feedback from image retrieval [2]. Mikulik et al. use
image matches to compute global correlations between
visual words [21]. In contrast, we use discriminative learning
to learn a set of local distance metrics for the database as a
pre-process (rather than at query time), leveraging the known
graph structure of the database images.
Representing places.
Places in computer vision are often
represented as sets of images (e.g., the Eiffel Tower can be
represented with a collection of photos [33]). However, many
other representations of places have been explored. Some
methods use iconic images to represent sets of images taken
from very similar viewpoints [15, 13]. Other approaches use
3D point clouds, derived from structure from motion, as a
richer geometric representation of a place [17, 24]. Closer
to our approach are methods that explicitly construct and
exploit image graphs. For instance, Torii et al. download
Google Streetview images to form a recognition database,
and leverage the underlying Street View image network; in
their approach, they take linear combinations of neighboring
images (in bag-of-words space) to more accurately recognize
the continuum of possible viewpoints [30]. Li et al. use a
visibility graph connecting images and 3D points in a structure-
from-motion model to reason about point co-occurrence for
location recognition [18]. A main contribution of our ap-
proach is to combine the power of discriminative learning
methods with the rich structural information in an image
graph, in order to learn a better database representation and
to better analyze results at query time.
3. Graph-based Location Recognition
We base our algorithm on a standard bag-of-words frame-
work [29], with images represented as L2-normalized
histograms of visual words, using a large vocabulary trained
from SIFT descriptors. Our problem takes as input a database
of images I represented as bag-of-words vectors, and an
image graph G, with a node for each image a ∈ I, and edges
(a, b) connecting overlapping, geometrically consis-
age pairs. Our goal is to take a new query image and predict
which part of the graph this image is connected to, then use
this information to recognize its location.
To achieve this goal, we use the query to retrieve a short-
list of similar database images, and perform detailed match-
ing and geometric verification on the top few matches. Be-
cause our goal is recognition, rather than retrieval, we want
to have at least one correct match appear as close as possible
to the top of the shortlist (rather than retrieve all similar im-
ages). Towards that end, our method improves on the often
noisy raw bag-of-words similarity measure by leveraging
the graph in two ways: (1) we discriminatively learn local
distance functions on neighborhoods of the image graph
(Section 3.2), and (2) we use the graph to generate a ranked
list that encourages more diverse results (Section 3.3).
3.1. Image Matching Graphs
We construct an image graph for the database using a
standard image matching pipeline [1]: we extract features
from each image, and, for a set of image pairs, find nearest
neighbor features and perform RANSAC-based geometric
verification. These matches are sufficient for our method
(though to improve the quality of the matching, we can also
run structure from motion to obtain a point cloud and a
refined set of image correspondences). For each image pair
(a, b) with sufficient inlier matches, we create an edge in
our graph G. We also save the number of inliers N(a, b)
for each image pair to derive edge weights for the graph. In
our experience, the graphs we compute have very few false
edges—almost all of the matching pairs are correct—though
there may be edges missing from the graph because we do
not exhaustively test all possible edges.
In parts of our algorithm, we will threshold edges by their
weights, discarding all edges below a threshold. The edge
weights we define are related to the idea of a Jaccard in-
dex; we define a weight J(a, b) = N(a, b) / (N(a) + N(b) − N(a, b)),
where N(a) and N(b) denote the total number of points seen in
a and b respectively. This measures the similarity of the
two images as the number of features N(a, b) they have
in common, normalized by the union of their feature sets.
This measure ranges from 0 to 1; 0 if no overlap, and 1 if
every feature was matched. This normalization reduces bias
towards images with large numbers of features.
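As an illustration (not from the paper's own code), the edge weighting can be sketched in a few lines of Python; the container names (num_inliers, num_features) and the inlier threshold are assumptions standing in for the matching pipeline's output.

```python
# Sketch: weighted image graph from pairwise match statistics.
# `num_inliers[(a, b)]` holds N(a, b), the verified inlier matches between
# images a and b; `num_features[a]` holds N(a). Both names are hypothetical.

def jaccard_weight(n_ab, n_a, n_b):
    """J(a, b) = N(a, b) / (N(a) + N(b) - N(a, b)), in [0, 1]."""
    denom = n_a + n_b - n_ab
    return n_ab / denom if denom > 0 else 0.0

def build_image_graph(image_ids, num_inliers, num_features, min_inliers=16):
    """Return an adjacency dict {a: {b: J(a, b)}} over verified pairs."""
    graph = {a: {} for a in image_ids}
    for (a, b), n_ab in num_inliers.items():
        if n_ab < min_inliers:          # illustrative inlier threshold
            continue
        w = jaccard_weight(n_ab, num_features[a], num_features[b])
        graph[a][b] = w
        graph[b][a] = w
    return graph
```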
3.2. Graph-based Discriminative Learning
How can we use the information encoded in the graph
to better recognize the location of a query image? We first
address this problem as one of distance (or similarity) metric
learning. There are many possible ways to learn a metric
for the images in the graph. For example, one could take
all the connected pairs in the graph to be positive examples
and the other pairs as negative examples, to learn a single,
global distance metric for a specific dataset [3]. At the other
extreme, one could learn a distance metric for each image in
the database, analogous to how Exemplar SVMs have been
used for object detection [20].
We tried both approaches, but found that we achieved
better performance with an approach somewhere in the middle
of these two extremes. In particular, we divide the graph
into a set of overlapping subgraphs, and learn a separate
distance metric for each of these representative subgraphs.
Our approach, then, consists of the following steps:
At Training Time
1. Compute a covering of the graph with a set of subgraphs.
2. Learn and calibrate an SVM-based distance metric for each subgraph.
At Query Time
3. Use the models in Step 2 to compute the distance from a query image to each database image, and generate a ranked shortlist of possible image matches.
4. Perform geometric verification with the top database images in the shortlist.
We now describe each step in more detail. Later, in Sec-
tion 3.3, we discuss how we improve Step 3 by reranking
the shortlist based on the structure of the graph.
Step 1: Selecting representative neighborhoods.
We start
by covering the graph with a set of representative subgraphs;
afterwards, for each subgraph, we will learn a local similarity
function, using the images in the subgraph as positive exam-
ples, and other, unrelated images in the graph as negative
examples. What makes a good subgraph? We want each
subgraph to contain images that are largely similar, so that
our learning problem has a relatively compact set of positive
example images that can be explained with a simple model.
On the other hand, we also want as many positive examples
as possible, so that our models have enough data from which
to generalize. Finally, we want our subgraphs to completely
cover the graph (i.e., each node is in at least one subgraph),
so that we can build models that apply to any image of the
location modeled in the database.
Based on these criteria, we cover the graph by selecting
a set of representative exemplar images, and defining their
(immediate) neighborhoods as subgraphs in a graph cover,
as illustrated in Figure 1. Formulated this way, the covering
problem becomes one of selecting a set of representative
images that form a dominating set of the graph. For a graph
G, and a set of exemplar images C, we say an image a ∈ I
is covered by C if either a ∈ C, or a is adjacent to an image
in C. If C covers all nodes, then C is a dominating set. We
would like C to be as small as possible, and accordingly, the
neighborhood of each node in C to be as large as possible.
Hence, we seek a minimum dominating set. Such sets have
been used before for 3D reconstruction [10]; here we use
them to define a set of classifiers.
Finding an exact minimum dominating set is an NP-complete
problem. We use a simple greedy algorithm to find an
approximate solution [9]. Starting with an empty set, we
iteratively choose the image that covers the maximum number
of as-yet uncovered images in the graph, until all images are
covered. Figure 2 shows an example image graph for the
Dubrovnik dataset [17] and the exemplar images selected by
our algorithm.
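A minimal Python sketch of this greedy cover, assuming the (thresholded) graph is available as an adjacency dictionary; this is the standard greedy approximation rather than the authors' exact implementation.

```python
def greedy_dominating_set(graph):
    """Greedy approximate minimum dominating set.

    `graph` maps each image to the set of its neighbors. At each step we
    pick the image covering the most still-uncovered images (itself plus
    its neighbors), until every image is covered.
    """
    uncovered = set(graph)
    exemplars = []
    while uncovered:
        best = max(graph, key=lambda c: len(({c} | set(graph[c])) & uncovered))
        exemplars.append(best)
        uncovered -= {best} | set(graph[best])
    return exemplars
```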
Step 2a: Discriminative learning on neighborhoods.
For
each neighborhood selected in Step 1, the next step is to
learn a classifier that will take a new image, and classify it
as belonging to that neighborhood or not. We learn these
classifiers using standard linear SVMs on bag-of-words his-
tograms, one for each neighborhood, and calibrate the set
of SVMs as described in Step 2b; at query time, these clas-
sifiers will be used to define a set of similarity functions
for ranking the database images given a query image. This
use of classifiers for ranking has found many applications in
vision and machine learning, for instance in image retrieval
using local distance functions [8] or Exemplar SVMs [28].
First, for each neighborhood around an exemplar node c ∈ C,
we must define a set of positive and negative example
images as training data for the SVM. To define the positive
set, we simply use the images in the neighborhood. For
this task, we found that thresholding the edges in the graph
by their weight—applying a stricter definition of connectivity,
and yielding more compact neighborhoods—yielded better
classifiers than using all edges found by the image matching
procedure.

Figure 2. Image matching graph for the Dubrovnik dataset.
This graph contains 6,844 images; the large, red nodes denote
representative images selected by our covering algorithm (478 images
in total). Although the set of representative images is much
smaller than the entire collection, their neighborhoods cover the
matching graph. For each neighborhood, we learn a classifier for
determining whether a new image belongs to that neighborhood.

To define the negative set for the neigh-
borhood around an exemplar c, we first find a small set of
hard negatives—images with high BoW similarities to c, but
not in its neighborhood. These hard negatives are combined
with other randomly sampled non-neighboring images in the
graph to form a negative set. Here we use the original, as
opposed to thresholded, graph to define connectivity, to mini-
mize the chances of including a false negative in the negative
set. In this way, the image graph G gives us the supervision
necessary to define positives and negatives for learning, just
as geotags have provided a supervisory cue for discrimina-
tive location recognition in previous work [27, 14].
Given the training data for each neighborhood, we learn
a linear SVM to separate neighborhood images from non-
neighborhood images, using the tf-idf weighted, L2-normalized
bag-of-words histograms for each image as features. We
randomly split the training data into training and validation
subsets for parameter selection in training the SVM (more
details in Section 4.2). For each neighborhood centered on
exemplar c, the result of training is an SVM weight vector
w_c and a bias term b_c. Given a new query image, represented
as a bag-of-words vector q, we can compute the decision
value w_c · q + b_c for each exemplar image c.
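For illustration, one neighborhood classifier could be trained as in the sketch below; scikit-learn's LinearSVC stands in for the LIBLINEAR setup used by the authors, and the fixed positive/negative lists are a simplification of the training-set construction described above.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_neighborhood_svm(bow, positives, negatives, C=1.0):
    """Train one linear SVM for a neighborhood.

    `bow[i]` is the tf-idf weighted, L2-normalized BoW vector of image i;
    `positives` are the neighborhood images, `negatives` the hard negatives
    plus randomly sampled non-neighbors. Returns (w_c, b_c).
    """
    X = np.vstack([bow[i] for i in positives] + [bow[i] for i in negatives])
    y = np.array([1] * len(positives) + [0] * len(negatives))
    clf = LinearSVC(C=C)          # C would be chosen on a validation split
    clf.fit(X, y)
    return clf.coef_.ravel(), float(clf.intercept_[0])

# Decision value for a query BoW vector q:  score = w_c @ q + b_c
```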
Figure 3.
Two example query images and their top 5 ranked
results of our method and raw tf-idf retrieval.
For each result, a
green border indicates a correct match, and a red border indicates
an incorrect match. These two example query images are difficult
for BoW retrieval techniques, due to drastically different lighting
conditions (query image 1) and confusing features (rooftops in
query image 2). However, with our discriminatively learned simi-
larity functions, correctly matching images are ranked higher than
with the baseline method.
Step 2b: Calibrating classifier outputs.
Since our classi-
fiers are independently trained, we need to normalize their
outputs before comparing them. To do so, we convert the de-
cision value of each SVM classifier into a probability value,
using Platt’s method [23] on the whole set of training data.
For a neighborhood around exemplar c, and a query image
vector q, we refer to this probability value as P_c(q).
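A minimal stand-in for this calibration step is sketched below: a one-dimensional logistic fit on decision values and labels, which produces the sigmoid form that Platt scaling yields (the exact fitting procedure in [23] differs in its details).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_sigmoid_calibration(decision_values, labels):
    """Map raw SVM decision values to probabilities via a sigmoid fit."""
    lr = LogisticRegression()
    lr.fit(np.asarray(decision_values).reshape(-1, 1), np.asarray(labels))
    A = lr.coef_.ravel()[0]
    B = lr.intercept_[0]
    return lambda s: 1.0 / (1.0 + np.exp(-(A * s + B)))

# P_c(q) = calibrated_c(w_c @ q + b_c), now comparable across neighborhoods.
```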
Step 3: Generating a ranked list of database images.
For a query image represented as a BoW vector q, we can now
compute a probability of q belonging to the neighborhood of
each exemplar image c. Using these values, it is straightforward
to generate a ranked list of the exemplar images c ∈ C by
sorting by P_c(q) in decreasing order. However, we found
that just verifying the query image against exemplar images
sometimes failed simply because the exemplar images rep-
resent a much sparser set of viewpoints than the full graph.
Hence, we would like to create a ranked list of all database
images. To do so, we take the sorted set of neighborhoods
given by the probability values, and then we sort the images
within each neighborhood by their original tf-idf similarity.
We then concatenate these per-neighborhood sorted lists;
since a database image can appear in multiple overlapping
neighborhoods (see Figure 1), in the final list it appears only
in the list of the best-ranked neighborhood. This results in a
ranking of the entire list of database images.
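A sketch of this ranking step, assuming the calibrated neighborhood probabilities and the raw tf-idf similarities for the query have already been computed (the dictionary layout is illustrative).

```python
def rank_database_images(prob_c, tfidf_sim, neighborhoods):
    """Concatenate per-neighborhood rankings into one global ranking.

    `prob_c[c]`        : P_c(q), calibrated probability for exemplar c.
    `tfidf_sim[a]`     : raw tf-idf similarity of database image a to the query.
    `neighborhoods[c]` : images in the neighborhood of exemplar c.
    Neighborhoods are visited in decreasing P_c(q); within each, images are
    sorted by tf-idf similarity; each image appears only once, in the list
    of its best-ranked neighborhood.
    """
    ranked, seen = [], set()
    for c in sorted(prob_c, key=prob_c.get, reverse=True):
        for a in sorted(neighborhoods[c], key=lambda i: tfidf_sim[i], reverse=True):
            if a not in seen:
                seen.add(a)
                ranked.append(a)
    return ranked
```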
Step 4: Geometric verification.
Finally, using the ranking
of database images from Step 3, we perform feature match-
ing and RANSAC-based geometric verification between the
query image and each of the images in the shortlist in turn,
until we find a true match. If we have a 3D structure from
motion model, we can then associate 3D points with matches

in the query image, and determine its pose [18]. If not, we
can associate the location of the matching database image as
the approximate location of the query image. Because fea-
ture matching and verification is relatively computationally
intensive, the quality of the ranking from Step 3 highly im-
pacts the efficiency of the system—ideally, a correct match
will be among the top few matches, if not the first match.
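The verification loop itself can be as simple as the following OpenCV-based sketch; the feature loader, ratio-test threshold, and inlier cutoff are all illustrative assumptions rather than the paper's exact settings.

```python
import cv2
import numpy as np

def verify_shortlist(query_kp, query_desc, shortlist, load_features, min_inliers=16):
    """Walk down the shortlist until one image passes geometric verification."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    for a in shortlist:
        kp_a, desc_a = load_features(a)               # hypothetical feature loader
        pairs = matcher.knnMatch(query_desc, desc_a, k=2)
        good = [p[0] for p in pairs
                if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]
        if len(good) < 8:                             # need 8+ points for the F-matrix
            continue
        pts_q = np.float32([query_kp[m.queryIdx].pt for m in good])
        pts_a = np.float32([kp_a[m.trainIdx].pt for m in good])
        _, mask = cv2.findFundamentalMat(pts_q, pts_a, cv2.FM_RANSAC, 3.0, 0.99)
        if mask is not None and int(mask.sum()) >= min_inliers:
            return a                                  # first verified match
    return None
```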
Using this simple approach, we observe improvements
in our ranked lists over raw BoW retrieval results, as shown
in the examples in Figure 3. In particular, the top image in
the ranked list is more often correct. However, when the
top ranked cluster is incorrect, this method has the effect
of saturating the top shortlist with similar images that are
all wrong—there is a lack of diversity in the list, with the
second-best cluster pushed further down the list. To avoid
this, we propose several methods to encourage a diverse
shortlist of images.
3.3. Improving the Shortlist
In this section, we first introduce a probabilistic method
that uses the graph to introduce more diversity into the short-
list, increasing the likelihood of finding a correct match
among the top few retrieved images. Second, we demon-
strate several techniques to introduce regularization using
BoW ranking to further improve recognition performance.
Probabilistic Reranking.
Our problem is akin to the well-
known Web search ranking problem (as opposed to standard
image retrieval). Rather than retrieve all instances relevant
to a given query, we want to retrieve a small set of results
that are both relevant and diverse (see Figure 4 for an ex-
ample), so as to cover multiple possible hypotheses—just
as a Web search for the term “Michael Jordan” might pro-
ductively return results for both the basketball player and
the machine learning researcher. While introducing diver-
sity in Web search has been studied in the machine learning
literature [32], we are unaware of it being used in location
recognition; in our problem, it is the automatic verification
procedure that is examining results, rather than a human.
To introduce diversity, we propose a probabilistic approach
for reranking the shortlist. The idea is, in some ways, the
converse of query expansion on positive matches to increase
recall in image retrieval. In our case, we use negative evi-
dence to increase the pool of diverse matches. For instance,
in the case where the first retrieved image is not a match to
the query, we want to select the second image conditioned
on this outcome, perhaps selecting an image dissimilar to
this first match (and similarly for the third image conditioned
on the first two being incorrect). How can we compute such
conditional probabilities? We again turn to the image graph.
First, some terminology. For a database image a, we define a
random variable X_a representing the event that the query image
matches image a; X_a = 1 if image a is a match, and 0 otherwise.
Thus, using the notation above, P_c = P(X_c = 1) for an exemplar
image c, and similarly P_a = P(X_a = 1) for any database image,
using the simple heuristic above that a non-exemplar database
image takes the maximum probability of all neighborhoods it
belongs to. As before, we choose the database image a with the
highest P_a as the top-ranked image. However, to select the
second ranked image, we are instead more interested in the
conditional probability P'_b = P(X_b = 1 | X_a = 0) than its
raw appearance-based probability P(X_b = 1) alone. We can
compute this conditional probability as:
$$
\begin{aligned}
P'_b = P(X_b = 1 \mid X_a = 0)
  &= \frac{P(X_b = 1,\, X_a = 0)}{P(X_a = 0)}
   = \frac{P(X_b = 1) - P(X_b = 1,\, X_a = 1)}{1 - P(X_a = 1)} \\
  &= \frac{P_b - P(X_b = 1 \mid X_a = 1)\, P(X_a = 1)}{1 - P_a}
   = \frac{P_b - P_{ba} P_a}{1 - P_a}
   = P_b \left( \frac{1 - \tfrac{P_{ba}}{P_b} P_a}{1 - P_a} \right)
\qquad (1)
\end{aligned}
$$
where P_ba = P(X_b = 1 | X_a = 1) denotes the conditional
probability that image b matches the query given that image a
matches. The last line in the derivation above relates P'_b to
P_b via an update factor, (1 − (P_ba / P_b) P_a) / (1 − P_a), that
depends on P_a (the probability that the top ranked image matches)
and P_ba (a conditional probability). We use the image graph to
estimate P_ba, the intuition being that the more similar b is to
a—i.e., the stronger the connection between a and b in the
graph—the higher P_ba should be. In particular, we estimate P_ba
as N(a, b) / N(a), the ratio of the number of shared features
between a and b divided by the total number of feature points in
a. Note that in general P_ab ≢ P_ba, i.e., this similarity measure
is asymmetric. These measures are precomputed, along with the
Jaccard indices J(a, b) described in Section 3.1.
The update factor in Eq. (1) has an intuitive interpretation:
if image b is very similar to image a according to the graph
(i.e., P_ba is large), then its probability score is downweighted
(because if a is an incorrect match, then b is also likely
incorrect). On the other hand, if b is not connected to a, its
score will tend to be boosted. However, we do not want to
apply this update too quickly, for fear of downweighting many
images based on the evidence of a single mismatch. To regulate
this factor, we introduce a parameter α, and define a regularized
update factor (1 − α (P_ba / P_b) P_a) / (1 − α P_a). If α = 0,
the update has no influence on the ranking result, and when
α = 1, it has its full effect. We use α = 0.9 in our experiments.
We iteratively choose the image b with the highest updated score
P'_b and recalculate scores using (1).
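A sketch of this reranking loop follows; treating each selected image's current (already updated) score as P_a, and clipping scores at zero, are implementation choices layered on top of Eq. (1), not details given in the paper.

```python
def probabilistic_rerank(prob, prob_cond, shortlist_size=20, alpha=0.9):
    """Diversify the shortlist using the regularized update factor.

    `prob[a]`          : calibrated probability that database image a matches.
    `prob_cond[(b, a)]`: P_ba, estimated as N(a, b) / N(a) (0 if no edge).
    After selecting image a with score P_a, every remaining score P_b becomes
    max(0, (P_b - alpha * P_ba * P_a) / (1 - alpha * P_a)).
    """
    scores = dict(prob)
    reranked = []
    while scores and len(reranked) < shortlist_size:
        a = max(scores, key=scores.get)
        p_a = scores.pop(a)
        reranked.append(a)
        denom = 1.0 - alpha * p_a
        if denom <= 0:
            continue
        for b in scores:
            p_ba = prob_cond.get((b, a), 0.0)
            scores[b] = max(0.0, (scores[b] - alpha * p_ba * p_a) / denom)
    return reranked
```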
BoW Regularization.
Our learned discriminative models
often perform well, but we observed that for some rare query
images, our models consistently perform poorly (perhaps due
