
LDAHash:
Improved Matching with Smaller Descriptors
Christoph Strecha, Alexander M. Bronstein, Member, IEEE,
Michael M. Bronstein, Member, IEEE, and Pascal Fua, Senior Member, IEEE
Abstract—SIFT-like local feature descriptors are ubiquitously employed in computer vision applications such as content-based
retrieval, video analysis, copy detection, object recognition, photo tourism, and 3D reconstruction. Feature descriptors can be designed
to be invariant to certain classes of photometric and geometric transformations, in particular, affine and intensity scale transformations.
However, real transformations that an image can undergo can only be approximately modeled in this way, and thus most descriptors
are only approximately invariant in practice. Second, descriptors are usually high dimensional (e.g., SIFT is represented as a
128-dimensional vector). In large-scale retrieval and matching problems, this can pose challenges in storing and retrieving descriptor
data. We map the descriptor vectors into the Hamming space in which the Hamming metric is used to compare the resulting
representations. This way, we reduce the size of the descriptors by representing them as short binary strings and learn descriptor
invariance from examples. We show extensive experimental validation, demonstrating the advantage of the proposed approach.
Index Terms—Local features, SIFT, DAISY, binarization, similarity-sensitive hashing, metric learning, 3D reconstruction, matching.
1 INTRODUCTION
Over the last decade, feature point descriptors such as
SIFT [1] and similar methods [2], [3], [4] have become
indispensable tools in the computer vision community. They
are usually represented as high-dimensional vectors, such as
the 128-dimensional SIFT or the 64-dimensional SURF
vectors. While a descriptor’s high dimensionality is not an
issue when only a few hundred points need to be
represented, it becomes a significant concern when millions
have to be stored on a device with limited computational and
storage resources. This happens, for example, when storing
all descriptors for a large-scale urban scene on a mobile
phone for image-based location purposes. Not only does this
require a tremendous amount of storage, it is also slow and
potentially unreliable because most recognition algorithms
rely on nearest-neighbor computations and computing
euclidean distances between long vectors is neither cheap
nor optimal.
Consequently, there have been many recent attempts at
compacting SIFT-like descriptors to allow for faster match-
ing while retaining their outstanding recognition rates. One
class of techniques relies on quantization [5], [6] and
dimensionality reduction [7], [8]. While helpful, this
approach is usually not sufficient to produce truly short
descriptors without loss of matching performance. Another
class [9], [10], [11], [12] takes advantage of training data to
learn short binary codes whose distances are small for
positive training pairs and large for others. This is
particularly promising because not only does binarization
reduce the descriptor size, but also partly increases
performance, as will be shown.
Binarization is usually performed by multiplying the
descriptors by a projection matrix, subtracting a threshold
vector, and retaining only the sign of the result. This maps
the data into a space of binary strings, greatly reducing their
size on the one hand and simplifying their similarity
computation (now becoming the Hamming metric, which
can be computed very efficiently on modern CPUs) on the
other. Another class of locality-sensitive hashing (LSH)
techniques and their variants [9], [13] encode similarity of
data points as the collision probability of their binary codes.
While such similarity can be evaluated very efficiently,
these techniques usually require a large number of hashing
functions to be constructed in order to achieve competitive
performance. Also, families of LSH functions have been
constructed only for classes of standard metrics, such as the
$L_p$ norms, and do not allow for supervision.
In most supervised binarization techniques based on a
linear projection, the matrix entries and thresholds are
selected so as to preserve similarity relationships in a training
set. Doing this efficiently involves solving a difficult non-
linear optimization problem and most of the existing
methods offer no guarantee of finding a global optimum.
By contrast, spectral hashing (SH) [14] does offer this
guarantee for simple data distributions and has proved very
successful. However, this approach is only weakly super-
vised by imposing a euclidean metric on the input data,
which we will argue is not a particularly good one in our case.
. C. Strecha is with the EPFL/IC/ISIM/CVLab, Station 14, Lausanne CH-
1015, Switzerland. E-mail: christoph.strecha@epfl.ch.
. A.M. Bronstein is with the Department of Computer Science, Technion-
Israel Institute of Technology, Room 341, Taub Building, Haifa 32000,
Israel. E-mail: bron@cs.technion.ac.il.
. M.M. Bronstein is with the Institute of Computational Science, Faculty of
Informatics, Università della Svizzera Italiana, Via Giuseppe Buffi 13, Lugano 6900, Switzerland.
E-mail: michael.bronstein@usi.ch.
. P. Fua is with the IC-CVLab, Station 14, EPFL, Lausanne CH-1015,
Switzerland. E-mail: pascal.fua@epfl.ch.
Manuscript received 27 Aug. 2010; revised 23 Jan. 2011; accepted 3 Mar.
2011; published online 13 May 2011.
Recommended for acceptance by F. Dellaert.
For information on obtaining reprints of this article, please send e-mail to:
tpami@computer.org, and reference IEEECS Log Number
TPAMI-2010-08-0660.
Digital Object Identifier no. 10.1109/TPAMI.2011.103.

To better take advantage of training data composed of
interest point descriptors corresponding to multiple 3D
points seen under different views, we introduce a global
optimization scheme that is inspired by an earlier local
optimization one [10]. In [10], the entries of the projection
matrix and thresholds vectors are constructed progressively
using AdaBoost. Given that AdaBoost is a gradient-based
method [15] and that the algorithm optimizes a few matrix
rows at a time, there is no guarantee the solution it finds is
optimal. By contrast, we first compute a projection matrix
that is designed either to solely minimize the in-class
covariance of the descriptors or to jointly minimize the in-
class covariance and maximize the covariance across
classes, both of which can be achieved in closed form. This
being done, we compute optimal thresholds that turn the
projections into binary vectors so as to maximize recogni-
tion rates. In essence, we perform Linear Discriminant
Analysis (LDA) on the descriptors before binarization and
will therefore refer to our approach as LDAHash.
Our experiments show that state-of-the-art metric learn-
ing methods based, e.g., on margin maximization [16], [17]
achieve exceptional performance in the low false negative
rate range, which degrades significantly in the low false
positive rate range. Binarization usually only deteriorates
performance. In large-scale applications that involve match-
ing keypoints against databases containing millions of
them, achieving good performance in the low false positive
rate range is crucial to preventing a list of potential matches
from becoming unacceptably long. We use ROC curves to
show that, in many different cases, the proposed method
has competitive performance in the low false negative range
while significantly outperforming other methods in the low
false positive range.
We also show that unlike many other techniques where
binarization produces performance degradation, using our
approach to binarize SIFT descriptors [1] actually improves
matching performance. This is especially true in the low
false positive range with 64- or 128-bit descriptors, which
means that they are about 10 to 20 times shorter than the
original ones. Furthermore, using competing approaches
[10], [14], [18] to produce descriptors of the same size as
ours results in lower matching performance over the full
false positive range.
In the following section, we briefly survey existing
approaches to binarization. In Section 3, we introduce our
own framework. In Section 4, we describe the correspond-
ing training methodology, training data, and analyze the
impact of individual components of our approach. Finally,
we present our results in Section 5.
2 PRIOR WORK
Most approaches for compacting SIFT-like descriptors and
allowing for faster matching rely on one or more of the
following techniques:
2.1 Tuning
In [8], [19], [6], [20], [18], the authors use training to optimize
the filtering and normalization steps that produce a SIFT-like
vector. The same authors optimize in [18] over the position of
the elements that make up a DAISY descriptor [4].
2.2 Quantization
The SIFT descriptor can be quantized using, for instance,
only 4 bits per coordinate [5], [18], thus saving memory and
speeding up matching because comparing short vectors is
faster than comparing long ones. Chandrasekhar et al. [20]
applied tree-coding methods for lossy compression of
probability distributions to SIFT-like descriptors to obtain
a compressed histogram of gradients (CHOG).
2.3 Dimensionality Reduction
PCA has been extensively used to reduce the dimensionality
of SIFT vectors [21], [6]. In this way, the number of bits
required to describe each dimension can be reduced without
loss in matching performance [6], [18]. In [22], a whitening
linear transform was additionally proposed to benefit from
the efficiency of fast nearest-neighbor search methods.
The three approaches above are mostly unsupervised
methods and sometimes require a complex optimization
scheme [20], [18]. Often, they are not specifically tuned for
keypoint matching and do not usually produce descriptors as
short as one would require for large-scale keypoint matching.
Our formulation relates to supervised metric learning
approaches. The problem of optimizing SIFT-like descrip-
tors can be approached from the perspective of metric
learning, where many efficient approaches have been
recently developed for learning similarity between data
from a training set of similar and dissimilar pairs [23], [24].
In particular, similarity-sensitive hashing (SSH) or locality-
sensitive hashing [9], [10], [14], [11], [12] algorithms seek to
find an efficient binary representation of high-dimensional
data maintaining their similarity in the new space. These
methods have also been applied to global image descrip-
tors and bag-of-feature representations in content-based
image search [25], [26], [27], [28], video copy detection [29],
and shape retrieval [30]. In [31] and [32], Hamming
embedding was used to replace vector quantization in
bag-of-feature construction.
There are a few appealing properties of similarity-
sensitive hashing methods in large-scale descriptor match-
ing applications. First, such methods combine the effects of
dimensionality reduction and binarization, which makes
the descriptors more compact and easier to store. Second,
the metric between the binarized descriptors is learned
from examples and renders their similarity more correctly.
In particular, it is possible to take advantage of feature
point redundancy and transitive closures in the training
set, such as those in Fig. 3. Finally, comparison of binary
descriptors is computationally very efficient and is amen-
able for efficient indexing.
Existing methods for similarity-sensitive hashing have a
few serious drawbacks in our application. The method of
Shakhnarovich [10] poses the similarity-sensitive hashing
problem as boosted classification and tries to find its solution
by means of a standard AdaBoost algorithm. However,
given that AdaBoost is a greedy algorithm equivalent to a
gradient-based method [15], there is no guarantee of global
optimality of the solution. The spectral hashing algorithm
[14], on the other hand, has a tacit underlying assumption of
euclidean descriptor similarity, which is typically far from
being correct. Moreover, it is worthwhile mentioning that
spectral hashing, similarity-sensitive hashing, and similar
methods have so far proven to be very efficient in retrieval
applications for ranking the matches, in which one typically
tries to achieve high recall. Thus, the operating point in these
applications is at low false negative rates, which ensures that
no relevant matches (typically, only a few) are missed. In
large-scale descriptor matching, on the other hand, one has
to create a list of likely candidate matches, which can be very
large if the false positive rate is high. For example, given a set
of 1M descriptors, which is modest for Internet-scale applications, and a 1 percent false positive rate, 10K candidates would have to be considered. Consequently, an
important concern in this application is a very low false
positive rate. As we show in the following, our approach is
especially successful at this operating point, while existing
algorithms show poor performance.
3 APPROACH
Let us assume we are given a large set of keypoint
descriptors. They are grouped into subsets corresponding
to the same 3D points and all pairs within the subsets
are therefore considered as belonging to the same class. The
main idea of our method is to find a mapping from the
descriptor space to the Hamming space by means of an
affine map followed by a sign function such that the
Hamming distance between the binarized descriptors is as
close as possible to the similarity of the given data set. Our
method involves two key steps:
Projection selection. We compute a projection matrix that is
designed either to solely minimize the in-class covariance of
the descriptors or to jointly minimize the in-class covariance
and maximize the covariance across classes, both of which
can be done in closed form (Sections 3.3.1 and 3.3.2).
Threshold selection. We find thresholds that can be used to
binarize the projections so that the resulting binary strings
maximize recognition rates. We show that this threshold
selection is a separable problem that can be solved using 1D
search. In the remainder of this section, we formalize these
steps and describe them in more detail.
3.1 Problem Formulation
Our set of keypoint descriptors is represented as $n$-dimensional vectors in $\mathbb{R}^n$. We attempt to find their representation in some metric space $(\mathbb{Z}, d_{\mathbb{Z}})$ by means of a map of the form $y : \mathbb{R}^n \to (\mathbb{Z}, d_{\mathbb{Z}})$. The metric $d_{\mathbb{Z}}(y, y')$ parameterizes the similarity between the feature descriptors, which may be difficult to compute in the original representation. Our goal in finding such a mapping is twofold. First, $\mathbb{Z}$ should be an efficient representation. This implies that $y(x)$ requires significantly less storage than $x$, and that $d_{\mathbb{Z}}(y(x), y(x'))$ is much easier to compute than, e.g., $\|x - x'\|$. Second, the metric $d_{\mathbb{Z}}(y, y')$ should better represent some ideal descriptor similarity, in the following sense: Given a set $\mathcal{P}$ of pairs of descriptors from corresponding points in different images, e.g., the same object under a different view point (referred to as positives), and a set $\mathcal{N}$ of pairs of descriptors from different points (negatives), we would like $d_{\mathbb{Z}}(y(x), y(x')) < R$ for all $(x, x') \in \mathcal{P}$ and $d_{\mathbb{Z}}(y(x), y(x')) > R$ for all $(x, x') \in \mathcal{N}$ to hold with high probability for some range $R$.
Setting $\mathbb{Z}$ to be the $m$-dimensional Hamming space $\mathbb{H}^m = \{\pm 1\}^m$, the embedding of a descriptor $x$ can be expressed as an $m$-dimensional binary string. Here, we limit our attention to affine embeddings of the form
$$y = \mathrm{sign}(Px + t), \qquad (1)$$
where $P$ is an $m \times n$ matrix and $t$ is an $m \times 1$ vector; embeddings having more complicated forms can be obtained in a relatively straightforward manner by introducing kernels. Even under the optimistic assumption that real numbers can be quantized and represented by 8 bits, the size of the original descriptor is $8n$ bits, while the size of the binary representation is $m$ bits. Thus, setting $m \ll n$ allows us to significantly alleviate the storage complexity and potentially improve descriptor indexing.
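For illustration, the following minimal NumPy sketch applies an affine binarization of the form (1) and packs the resulting bits into bytes; the random matrix, zero thresholds, and the sizes $n = 128$, $m = 64$ are placeholders standing in for the learned quantities, not values from this work.

```python
import numpy as np

def binarize(X, P, t):
    """Map n-dimensional descriptors to m-bit strings via y = sign(P x + t).

    X : (N, n) descriptors, P : (m, n) projection, t : (m,) thresholds.
    Returns an (N, m) array of bits in {0, 1} (sign +1 -> 1, sign -1 -> 0).
    """
    projections = X @ P.T + t            # (N, m) affine projections
    return (projections > 0).astype(np.uint8)

# Toy usage with random placeholders: n = 128 (SIFT), m = 64 bits.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))         # stand-in descriptors
P = rng.normal(size=(64, 128))           # stand-in for a learned projection
t = np.zeros(64)                         # stand-in for learned thresholds
bits = binarize(X, P, t)                 # (1000, 64) binary codes
codes = np.packbits(bits, axis=1)        # 8 bytes per descriptor
```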
Furthermore, the descriptor dissimilarity is computed in our representation using the Hamming metric $d_{\mathbb{H}^m}(y, y') = \frac{m}{2} - \frac{1}{2}\sum_{i=1}^{m} \mathrm{sign}(y_i\, y'_i)$, which is done by performing an XOR operation between $y$ and $y'$ and counting the number of nonzero bits in the result, an operation carried out in a single instruction on modern CPU architectures (POPCNT, SSE4.2).
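As an illustration of this step, the sketch below computes the Hamming distance between two packed codes with an XOR followed by a bit count, emulating the single-instruction POPCNT with a per-byte lookup table; the helper names are ours, not part of the method.

```python
import numpy as np

# Per-byte popcount lookup table (software stand-in for the POPCNT instruction).
_POPCOUNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(axis=1)

def hamming_distance(code_a, code_b):
    """Hamming distance between two packed binary codes (uint8 arrays).

    XOR marks the bits in which the codes differ; summing the popcounts of
    the resulting bytes counts them.
    """
    return int(_POPCOUNT[np.bitwise_xor(code_a, code_b)].sum())

# e.g., hamming_distance(codes[0], codes[1]) for codes packed with np.packbits.
```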
The embedding $y$ is constructed to minimize the expectation of the Hamming metric on the set of positive pairs while maximizing it on the set of negative pairs. This can be expressed as minimization of the loss function
$$L = \alpha\, E\{d_{\mathbb{H}^m}(y, y') \mid \mathcal{P}\} - E\{d_{\mathbb{H}^m}(y, y') \mid \mathcal{N}\} \qquad (2)$$
with respect to the projection parameters $P$ and $t$. Here, $\alpha$ is a parameter controlling the trade-off between false positive and false negative rates (higher $\alpha$ corresponds to lower false negative rates). In practice, the conditional expectations $E\{\cdot \mid \mathcal{P}\}$, $E\{\cdot \mid \mathcal{N}\}$ are replaced by averages on a training set of positive and negative pairs of descriptors, respectively.
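A minimal sketch of such an empirical estimate, assuming the codes are stored as $\pm 1$ arrays and using the identity $d_{\mathbb{H}^m}(y, y') = m/2 - y^{\mathrm{T}} y'/2$; the function signature and array layout are illustrative only.

```python
import numpy as np

def empirical_loss(pos_a, pos_b, neg_a, neg_b, alpha=1.0):
    """Training-set estimate of L = alpha * E{d_H | positives} - E{d_H | negatives}.

    Each argument is an (N, m) array of +/-1 codes; row i of the *_a and *_b
    arrays forms one labeled pair.
    """
    m = pos_a.shape[1]
    d_pos = 0.5 * (m - np.sum(pos_a * pos_b, axis=1))   # Hamming distances, positives
    d_neg = 0.5 * (m - np.sum(neg_a * neg_b, axis=1))   # Hamming distances, negatives
    return alpha * d_pos.mean() - d_neg.mean()
```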
3.2 LDAHash
Here, we note that up to constants, problem (2) is equivalent
to the minimization of
$$L = E\{y^{\mathrm{T}} y' \mid \mathcal{N}\} - \alpha\, E\{y^{\mathrm{T}} y' \mid \mathcal{P}\} \qquad (3)$$
or
$$L = \alpha\, E\{\|y - y'\|^2 \mid \mathcal{P}\} - E\{\|y - y'\|^2 \mid \mathcal{N}\}, \qquad (4)$$
attempting to make the correlation of the binary codes as negative as possible for negative pairs and as positive as possible for positive pairs. Direct minimization of $L$ is difficult since the terms $y$ involve a nondifferentiable sign nonlinearity. While, in principle, smooth approximation is possible, the solution of the resulting nonconvex problem in $(m + 1)n$ variables is challenging, typically containing thousands of unknowns.
As an alternative, we propose to relax the problem, removing the sign and minimizing a related function:
$$\tilde{L} = \alpha\, E\{\|Px - Px'\|^2 \mid \mathcal{P}\} - E\{\|Px - Px'\|^2 \mid \mathcal{N}\}. \qquad (5)$$
The above objective is independent of the affine term $t$, and optimization can be performed over the projection matrix $P$ only, which we further restrict to be orthogonal. Once the optimal matrix is found, we can fix it and minimize a smooth version of (4) with respect to $t$.

3.3 Projection Selection
Next, we describe two different approaches for computing
P, which we refer to as LDA and Difference of Covariances
(DIF) and that we compare in Sections 4 and 5.
3.3.1 Linear Discriminant Analysis
We start by observing that
$$E\{\|Px - Px'\|^2 \mid \mathcal{P}\} = \mathrm{tr}\big(P \Sigma_{\mathcal{P}} P^{\mathrm{T}}\big),$$
where $\Sigma_{\mathcal{P}} = E\{(x - x')(x - x')^{\mathrm{T}} \mid \mathcal{P}\}$ is the covariance matrix of the positive descriptor vector differences. This leads to
$$\tilde{L} = \alpha\, \mathrm{tr}\big(P \Sigma_{\mathcal{P}} P^{\mathrm{T}}\big) - \mathrm{tr}\big(P \Sigma_{\mathcal{N}} P^{\mathrm{T}}\big),$$
with $\Sigma_{\mathcal{N}} = E\{(x - x')(x - x')^{\mathrm{T}} \mid \mathcal{N}\}$ being the covariance matrix of the negative descriptor vector differences.
Transforming the coordinates by premultiplying $x$ by $\Sigma_{\mathcal{N}}^{-1/2}$ turns the second term of $\tilde{L}$ into a constant for any unitary $P$, leaving
$$\tilde{L} \propto \mathrm{tr}\big(P \Sigma_{\mathcal{N}}^{-1/2} \Sigma_{\mathcal{P}} \Sigma_{\mathcal{N}}^{-\mathrm{T}/2} P^{\mathrm{T}}\big) = \mathrm{tr}\big(P \Sigma_{\mathcal{P}} \Sigma_{\mathcal{N}}^{-1} P^{\mathrm{T}}\big) = \mathrm{tr}\big(P \Sigma_R P^{\mathrm{T}}\big), \qquad (6)$$
where $\Sigma_R = \Sigma_{\mathcal{P}} \Sigma_{\mathcal{N}}^{-1}$ is the ratio of the positive and negative covariance matrices. Since $\Sigma_R$ is a symmetric positive semidefinite matrix, it admits the eigendecomposition $\Sigma_R = U S U^{\mathrm{T}}$, where $S$ is a nonnegative diagonal matrix. An orthogonal $m \times n$ matrix $P$ minimizing the trace of $P \Sigma_R P^{\mathrm{T}}$ is a projection onto the space spanned by the $m$ smallest eigenvectors of $\Sigma_R$, and the minimizer of $\tilde{L}$ is given by
$$P = (\Sigma_R)_m^{-1/2}\, \Sigma_{\mathcal{N}}^{-1/2} = \tilde{S}_m^{-1/2} \tilde{U}^{\mathrm{T}} \Sigma_{\mathcal{N}}^{-1/2}, \qquad (7)$$
where $\tilde{S}$ is the $m \times m$ matrix with the smallest eigenvalues and $\tilde{U}$ is the $n \times m$ matrix with the corresponding eigenvectors (for notation brevity, we denote such a projection by $(\Sigma_R)_m^{-1/2}$). This approach resembles the spirit of linear discriminant analysis. A similar technique has been introduced in [29] within the framework of boosted similarity learning. Note that the normalization of the columns of $P$ is unimportant since a sign function is applied to its output. However, we keep the normalization by the inverse square root of the variances, which makes the projected differences $P(x - x')$ normal and white.
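The projection selection above could be sketched in NumPy as follows. The sketch eigendecomposes the symmetric whitened positive covariance $\Sigma_{\mathcal{N}}^{-1/2}\Sigma_{\mathcal{P}}\Sigma_{\mathcal{N}}^{-1/2}$, which has the same eigenvalues as $\Sigma_{\mathcal{P}}\Sigma_{\mathcal{N}}^{-1}$, and keeps the $m$ smallest directions; the regularization constant `eps` and the interface are illustrative implementation choices, not part of the method.

```python
import numpy as np

def lda_projection(diff_pos, diff_neg, m, eps=1e-6):
    """LDA-style projection (Sec. 3.3.1): favor directions in which positive
    descriptor differences are small relative to negative ones.

    diff_pos, diff_neg : (N, n) arrays of descriptor differences x - x' for
    positive and negative pairs. Returns an (m, n) projection matrix P.
    """
    cov_p = np.cov(diff_pos, rowvar=False)
    cov_n = np.cov(diff_neg, rowvar=False) + eps * np.eye(diff_neg.shape[1])

    # Inverse square root of the negative covariance (whitening transform).
    w, V = np.linalg.eigh(cov_n)
    cov_n_isqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Symmetric whitened positive covariance; same spectrum as cov_p @ inv(cov_n).
    ratio = cov_n_isqrt @ cov_p @ cov_n_isqrt
    s, U = np.linalg.eigh(ratio)                  # ascending eigenvalues
    U_m, s_m = U[:, :m], np.maximum(s[:m], eps)   # m smallest directions

    # Inverse-square-root normalization so that projected positive
    # differences are approximately white (cf. eq. (7)).
    return np.diag(1.0 / np.sqrt(s_m)) @ U_m.T @ cov_n_isqrt
```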
3.3.2 Difference of Covariances
An alternative approach can be derived by observing that
$$\tilde{L} = \mathrm{tr}\big(P \Sigma_D P^{\mathrm{T}}\big),$$
where $\Sigma_D = \alpha \Sigma_{\mathcal{P}} - \Sigma_{\mathcal{N}}$. This yields
$$P = (\Sigma_D)_m^{-1/2}, \qquad (8)$$
where at most $m$ smallest negative eigenvectors are selected. This selection of the projection matrix will be referred to as covariance difference and denoted by DIF. Note that it allows controlling the trade-off between false positive and negative rates through the parameter $\alpha$, which is impossible in the LDA approach.
The limit $\alpha \to \infty$ is of particular interest as it yields $\Sigma_D \propto \Sigma_{\mathcal{P}}$. In this case, the negative covariance does not play any role in the training, which is equivalent to assuming that the differences of negative descriptor vectors are white Gaussian, $\Sigma_{\mathcal{N}} = I$. The corresponding projection matrix is given by
$$P = (\Sigma_{\mathcal{P}})_m^{-1/2}. \qquad (9)$$
The main advantage of this approach is that it allows
learning the projection in a semi-supervised setting when
only positive pairs are available.
In general, a fully supervised approach is advantageous
over its semi-supervised counterpart, which assumes a
sometimes unrealistic unit covariance of the negative class
differences. However, unlike the positive training set
containing only pairs of knowingly matching descriptors,
the negative set might be contaminated by positive pairs (a
situation usually referred to as label noise). If such a
contamination is significant, the semi-supervised setting is
likely to perform better.
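Under the same caveats as the LDA sketch above, a corresponding sketch of the DIF selection (8) follows; the `eps` regularization, the fallback when fewer than $m$ negative eigenvalues exist, and the $|\lambda|^{-1/2}$ normalization are assumptions of this illustration, not prescriptions of the paper.

```python
import numpy as np

def dif_projection(diff_pos, diff_neg, m, alpha=1.0, eps=1e-6):
    """Covariance-difference (DIF) projection (Sec. 3.3.2).

    Eigendecomposes alpha * cov_pos - cov_neg and keeps at most m directions
    with negative eigenvalues, normalized by the square roots of |eigenvalue|.
    Returns a (k, n) projection with k <= m rows.
    """
    cov_d = alpha * np.cov(diff_pos, rowvar=False) - np.cov(diff_neg, rowvar=False)
    s, U = np.linalg.eigh(cov_d)                  # ascending: most negative first
    neg = int(np.sum(s < 0))
    keep = min(m, neg) if neg > 0 else m          # fallback: just take m smallest
    U_m, s_m = U[:, :keep], np.abs(s[:keep]) + eps
    return np.diag(1.0 / np.sqrt(s_m)) @ U_m.T
```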
3.4 Threshold Selection
Given the projection matrix P selected as described in the
previous section, our next step is to minimize a smooth
version of the loss function (3),
$$
\begin{aligned}
L &= E\{\mathrm{sign}(Px + t)^{\mathrm{T}} \mathrm{sign}(Px' + t) \mid \mathcal{N}\} - \alpha\, E\{\mathrm{sign}(Px + t)^{\mathrm{T}} \mathrm{sign}(Px' + t) \mid \mathcal{P}\} \\
  &= \sum_{i=1}^{m} \Big[ E\{\mathrm{sign}(p_i^{\mathrm{T}} x + t_i)\, \mathrm{sign}(p_i^{\mathrm{T}} x' + t_i) \mid \mathcal{N}\} - \alpha\, E\{\mathrm{sign}(p_i^{\mathrm{T}} x + t_i)\, \mathrm{sign}(p_i^{\mathrm{T}} x' + t_i) \mid \mathcal{P}\} \Big],
\end{aligned} \qquad (10)
$$
with respect to the thresholds $t$, where $p_i^{\mathrm{T}}$ denotes the $i$th row of $P$ and $t_i$ denotes the $i$th element of $t$. Observe that due to its separable form, the problem can be split into independent subproblems
$$\min_{t_i}\; E\{\mathrm{sign}(p_i^{\mathrm{T}} x + t_i)\, \mathrm{sign}(p_i^{\mathrm{T}} x' + t_i) \mid \mathcal{N}\} - \alpha\, E\{\mathrm{sign}(p_i^{\mathrm{T}} x + t_i)\, \mathrm{sign}(p_i^{\mathrm{T}} x' + t_i) \mid \mathcal{P}\}, \qquad (11)$$
which in turn can be solved using simple 1D search over each threshold $t_i$.
Let $y = p_i^{\mathrm{T}} x$ and $y' = p_i^{\mathrm{T}} x'$ be the $i$th elements of the projected training vectors $x$ and $x'$. The $i$th bits of $y$ and $y'$ coincide if $t_i < \min\{y, y'\}$ or $t_i > \max\{y, y'\}$, and differ if $\min\{y, y'\} \le t_i \le \max\{y, y'\}$. For a given value of the threshold, the false negative rate, i.e., the probability that the bits of a positive pair differ, can be expressed as
$$
\begin{aligned}
\mathrm{FN}(t) &= \Pr(\min\{y, y'\} < t \le \max\{y, y'\} \mid \mathcal{P}) \\
               &= \Pr(\min\{y, y'\} < t \mid \mathcal{P}) - \Pr(\max\{y, y'\} < t \mid \mathcal{P}) \\
               &= \mathrm{cdf}(\min\{y, y'\} \mid \mathcal{P}) - \mathrm{cdf}(\max\{y, y'\} \mid \mathcal{P}),
\end{aligned} \qquad (12)
$$
with cdf standing for the cumulative distribution function. Similarly, the false positive rate, i.e., the probability that the bits of a negative pair coincide, can be expressed as
$$
\begin{aligned}
\mathrm{FP}(t) &= \Pr(\min\{y, y'\} \ge t \text{ or } \max\{y, y'\} < t \mid \mathcal{N}) \\
               &= 1 - \Pr(\min\{y, y'\} < t \mid \mathcal{N}) + \Pr(\max\{y, y'\} < t \mid \mathcal{N}) \\
               &= 1 - \mathrm{cdf}(\min\{y, y'\} \mid \mathcal{N}) + \mathrm{cdf}(\max\{y, y'\} \mid \mathcal{N}).
\end{aligned} \qquad (13)
$$
We compute histograms of the minimal and maximal values of the projected positive and negative pairs, from which the cumulative densities are estimated. The optimal threshold $t_i$ is selected to minimize $\mathrm{FP} + \mathrm{FN}$ (or, alternatively, to maximize $\mathrm{TN} + \mathrm{TP}$, where $\mathrm{TP} = 1 - \mathrm{FN}$ and $\mathrm{TN} = 1 - \mathrm{FP}$ are the true positive and true negative rates, respectively). Fig. 1 visualizes TP, TN, and $\mathrm{TP} - \mathrm{FP}$ for the first two components $i = 1, 2$ of the LDA and DIF projections.
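The per-dimension 1D search could be sketched as follows, estimating the CDFs of the pairwise minima and maxima from the training projections; the candidate grid and helper names are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def empirical_cdf(sorted_values, t):
    """Fraction of values strictly below t, for a pre-sorted array."""
    return np.searchsorted(sorted_values, t) / len(sorted_values)

def select_threshold(pos_a, pos_b, neg_a, neg_b, n_candidates=256):
    """1D threshold search for one projection dimension i (cf. Section 3.4).

    pos_a/pos_b (resp. neg_a/neg_b) hold the scalar projections p_i^T x and
    p_i^T x' of the positive (resp. negative) training pairs. FN(t) is the
    fraction of positive pairs split by t (bits differ); FP(t) is the fraction
    of negative pairs not split by t (bits coincide). Returns argmin of FN+FP.
    """
    lo_p = np.sort(np.minimum(pos_a, pos_b))
    hi_p = np.sort(np.maximum(pos_a, pos_b))
    lo_n = np.sort(np.minimum(neg_a, neg_b))
    hi_n = np.sort(np.maximum(neg_a, neg_b))

    grid = np.linspace(min(lo_p[0], lo_n[0]), max(hi_p[-1], hi_n[-1]), n_candidates)
    fn = np.array([empirical_cdf(lo_p, t) - empirical_cdf(hi_p, t) for t in grid])
    fp = np.array([1.0 - empirical_cdf(lo_n, t) + empirical_cdf(hi_n, t) for t in grid])
    return grid[np.argmin(fn + fp)]
```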
4 TRAINING METHODOLOGY
In this section, we first describe our ground truth training
and evaluation data. We then evaluate different aspects of
our binary descriptors.
4.1 Ground Truth Data
To build our ground truth database, we used sets of
calibrated images for which we show the 3D point model
and a member image in Figs. 3, 4, 14, 15, and 16. These data
sets contain images we acquired ourselves, such as those in Figs. 14 and 15, sometimes over extended periods of time (Fig. 3). The data sets of Figs. 3, 4, and 15 also contain images downloaded from the Internet, and that of Fig. 16 was acquired entirely from this source.
We used our own calibration pipeline [33] to register
them and to compute internal and external camera
parameters as well as a sparse set of 3D points, each
corresponding to a single keypoint track. First, pairwise
keypoint correspondences are established using Vedaldi’s
[34] SIFT [1] descriptors that we compared using the
standard $L_2$ norm. These are transformed into keypoint
tracks which are used to grow initial reconstructions that
have been obtained by a robust fit of pairwise essential
matrices. This standard procedure is similar to [35] and we
refer to this and our work [33] for more information.
Because our data set contains multiple views of the same
scene, we have many conjunctive closure matches [36] such
as the one depicted by the blue line in Fig. 3 (bottom): A
keypoint that is matched in two other images, as depicted
by the green lines, gives rise to an additional match in these
other two images. Since they may be quite different from
each other, the L
2
distance between the corresponding
descriptors may be large. Yet, the descriptors in all three
images will be treated as belonging to the same class, which
is key to learning a metric that can achieve better matching
performance than the original $L_2$ norm. In our data sets, these conjunctive closures build up long chains in which individual pairs can have a quite large $L_2$ distance, as one can see in Fig. 2. In practice, we consider only chains with
five or more keypoints, i.e., 3D points that are visible in at
least five images.
For the negative examples, we randomly sampled the
same number of keypoint pairs and checked that none of
them belonged to the positive set.
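As an illustration of how such training pairs can be assembled from keypoint tracks, the following sketch enumerates all within-track pairs (including conjunctive closures) as positives and rejection-samples negatives; the function and its parameters are hypothetical, not the authors' pipeline.

```python
import itertools
import random

def build_training_pairs(tracks, num_descriptors, min_track_len=5, seed=0):
    """Assemble positive and negative index pairs from keypoint tracks.

    tracks : list of lists of descriptor indices, one list per reconstructed
    3D point; every pair within a sufficiently long track is a positive,
    including the transitive 'conjunctive closure' pairs never matched
    directly. Negatives are random index pairs rejected if they fall in the
    positive set.
    """
    positives = set()
    for track in tracks:
        if len(track) >= min_track_len:
            positives.update(itertools.combinations(sorted(track), 2))

    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < len(positives):
        i, j = rng.randrange(num_descriptors), rng.randrange(num_descriptors)
        if i != j and (min(i, j), max(i, j)) not in positives:
            negatives.add((min(i, j), max(i, j)))
    return sorted(positives), sorted(negatives)
```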
This training database is more specific than the one used
in [8] and [19], where the authors use a calibrated database
of images and their dense multiview stereo correspon-
dences. However, calibration and dense stereo information is used to extract the image patches, which are centered around 3D point projections, and these are used to build a training database of positive matches. In our framework, we
use the calibration only to geometrically verify SIFT matches
as being consistent with the camera parameters and with
the 3D structure. The 2D position, scale, and orientation of
the original interest points are kept such that we can perform
Fig. 1. Probability density functions of the classification performance for positive and negative training examples, for the first two dimensions of (a) LDA and (b) DIF.
Fig. 2. Some of the keypoints from the same 3D point for the Venice data set in Fig. 16 are shown as an example. The red circle shows the keypoint
(DoG) position and its scale. The track was extracted by consecutive SIFT $L_2$ matching, which makes it possible to include keypoint pairs
(conjunctive closures) that are quite different into the training and evaluation set.
