Query Adaptive Similarity for Large Scale Object Retrieval
Danfeng Qin Christian Wengert Luc van Gool
ETH Zürich, Switzerland
{qind,wengert,vangool}@vision.ee.ethz.ch
Abstract
Many recent object retrieval systems rely on local features for describing an image. The similarity between a pair of images is measured by aggregating the similarity between their corresponding local features. In this paper we present a probabilistic framework for modeling the feature to feature similarity measure. We then derive a query adaptive distance which is appropriate for global similarity evaluation. Furthermore, we propose a function to score the individual contributions into an image to image similarity within the probabilistic framework. Experimental results show that our method improves the retrieval accuracy significantly and consistently. Moreover, our result compares favorably to the state-of-the-art.
1. Introduction
We consider the problem of content-based image retrieval for applications such as object recognition or similar image retrieval. This problem has applications in web image retrieval, location recognition, mobile visual search, and tagging of photos.

Most of the recent state-of-the-art large scale image retrieval systems rely on local features, in particular the SIFT descriptor [14] and its variants. Moreover, these descriptors are typically used jointly with a bag-of-words (BOW) approach, considerably reducing the computational burden and memory requirements in large scale scenarios.

The similarity between two images is usually expressed by aggregating the similarities between corresponding local features. However, to the best of our knowledge, few attempts have been made to systematically analyze how to model the employed similarity measures.

In this paper we present a probabilistic view of the feature to feature similarity. We then derive a measure that is adaptive to the query feature. We show, both on simulated and real data, that the Euclidean distance density distribution is highly query dependent and that our model adapts the original distance accordingly. While it is difficult to know the distribution of true correspondences, it is actually quite easy to estimate the distribution of the distances of non-corresponding features. The expected distance to the non-corresponding features can be used to adapt the original distance and can be efficiently estimated by introducing a small set of random features as negative examples. Furthermore, we derive a global similarity function that scores the feature to feature similarities. Based on simulated data, this function approximates the analytical result.

Moreover, in contrast to some existing methods, our method does not require any parameter tuning to achieve its best performance on different datasets. Despite its simplicity, experimental results on standard benchmarks show that our method improves the retrieval accuracy consistently and significantly and compares favorably to the state-of-the-art.

Furthermore, all recently presented post-processing steps can still be applied on top of our method and yield an additional performance gain.

The rest of this paper is organized as follows. Section 2 gives an overview of related research. Section 3 describes our method in more detail. The experiments for evaluating our approach are described in Section 4. Results in a large scale image retrieval system are presented in Section 5 and compared with the state-of-the-art.
2. Related Work
Most of the recent works addressing the image similarity problem in image retrieval can be roughly grouped into three categories.

Feature-feature similarity The first group mainly works on establishing local feature correspondences. The most famous work in this group is the bag-of-words (BOW) approach [24]. Two features are considered to be similar if they are assigned to the same visual word. Despite the efficiency of the BOW model, the hard visual word assignment significantly reduces the discriminative power of the local features. In order to reduce quantization artifacts, [20] proposed to assign each feature to multiple visual words. In contrast, [8] relies on smaller codebooks in conjunction with short binary codes for each local feature, refining the feature matching within the same Voronoi cell.
2013 IEEE Conference on Computer Vision and Pattern Recognition
1063-6919/13 $26.00 © 2013 IEEE
DOI 10.1109/CVPR.2013.211

Additionally, product quantization [12] was used to estimate the pairwise Euclidean distance between features, and the top k nearest neighbors of a query feature are considered as matches. Recently, several researchers have addressed the problem of the Euclidean distance not being the optimal similarity measure in most situations. For instance, in [16] a probabilistic relationship between visual words is learned from a large collection of corresponding feature tracks. Alternatively, [21] learns a projection from the original feature space to a new space, such that the Euclidean metric in this new space can appropriately model feature similarity.
Intra-image similarity The second group focuses on effectively weighting the similarity of a feature pair considering its relationship to other matched pairs.

Several authors exploit the property that the local features inside the same image are not independent. As a consequence, a direct accumulation of local feature similarities can lead to inferior performance. This problem was addressed in [4] by down-weighting the contribution of non-incidentally co-occurring features. In [9] this problem was approached by re-weighting features according to their burstiness measurement.
As the BOW approach discards spatial information, a scoring step can be introduced which exploits the property that true matched feature pairs should follow a consistent spatial transformation. The authors of [19] proposed to use RANSAC to estimate the homography between images, and only count the contribution of feature pairs consistent with this model. [26] and [23] propose to quantize the image transformation parameter space in a Hough voting manner, and let each matching feature pair vote for its corresponding parameter cells. A feature pair is considered valid if it supports the cell with the maximum number of votes.
Inter-image similarity Finally, the third group addresses the problem of how to improve retrieval performance by exploiting additional information contained in other database images that depict the same object as the query image. [5] relies on query expansion: after retrieving a set of spatially verified database images, this new set is used to query the system again to increase recall. In [22], a set of relevant images is constructed using k-reciprocal nearest neighbors, and the similarity score is evaluated based on how similar a database image is to this set.
Our work belongs to the first group. By formulating the feature-feature matching problem in a probabilistic framework, we propose a similarity that adapts to each query feature, and a similarity function to approximate the quantitative result. Although the idea of adapting similarity by dissimilarity has already been exploited in [11][17], we propose to measure dissimilarity by the mean distance of the query to a set of random features, while those methods use k nearest neighbors (kNN). Since, in a realistic dataset, different objects may have different numbers of relevant images, it is actually quite hard for kNN-based methods to find a k that generalizes across all queries. Moreover, as kNN is an order statistic, it can be sensitive to outliers and cannot be used reliably as an estimator in realistic scenarios. In contrast, in our work, the set of random features can be considered a clean set of negative examples, and the mean operator is actually quite robust, as shown later.
Considering the large amount of data in a typical large scale image retrieval system, it is impractical to compute the pairwise distances between the high-dimensional original feature vectors. However, several approaches exist to relieve that burden using efficient approximations, such as [12, 13, 3, 6]. For simplicity, we adopt the method proposed in [12] to estimate the distance between features.
3. Our Approach
In this section, we present a theoretical framework for modeling the visual similarity between a pair of features, given a pairwise measurement. We then derive an analytical model for computing the accuracy of the similarity estimation in order to compare different similarity measures. Following the theoretical analysis, we continue the discussion on simulated data. Since the distribution of the Euclidean distance varies enormously from one query feature to another, we propose to normalize the distance locally to obtain a similar scale of measurement across queries. Furthermore, using the adaptive measure, we quantitatively analyze the similarity function on the simulated data and propose a function to approximate the quantitative result. Finally, we discuss how to integrate our findings into a retrieval system.
3.1. A probabilistic view of similarity estimation
We are interested in modeling the visual similarity between features based on a pairwise measurement.

Let x_i denote a local feature vector from the query image and Y = {y_1, ..., y_j, ..., y_n} a set of local features from a collection of database images. Furthermore, let m(x_i, y_j) denote a pairwise measurement between x_i and y_j. Finally, T(x_i) represents the set of features which are visually similar to x_i, and F(x_i) the set of features which are dissimilar to x_i. Instead of considering whether y_j is similar to x_i and how similar they look, we want to evaluate how likely it is that y_j belongs to T(x_i) given a measure m. This can be modeled as follows:

    f(x_i, y_j) = p(y_j ∈ T(x_i) | m(x_i, y_j))    (1)

For simplicity, we denote m_j = m(x_i, y_j), T_i = T(x_i), and F_i = F(x_i). As y_j belongs to either T_i or F_i, we have

    p(y_j ∈ T_i | m_j) + p(y_j ∈ F_i | m_j) = 1    (2)

Furthermore, according to Bayes' theorem,

    p(y_j ∈ T_i | m_j) = p(m_j | y_j ∈ T_i) × p(y_j ∈ T_i) / p(m_j)    (3)

and

    p(y_j ∈ F_i | m_j) = p(m_j | y_j ∈ F_i) × p(y_j ∈ F_i) / p(m_j)    (4)

Finally, by combining Equations 2, 3 and 4 we get

    p(y_j ∈ T_i | m_j) = { 1 + [p(m_j | y_j ∈ F_i) / p(m_j | y_j ∈ T_i)] × [p(y_j ∈ F_i) / p(y_j ∈ T_i)] }^(−1)    (5)

For large datasets the quantity p(y_j ∈ T_i) can be modeled by the occurrence frequency of x_i. Therefore, p(y_j ∈ T_i) and p(y_j ∈ F_i) only depend on the query feature x_i.

In contrast, p(m_j | y_j ∈ T_i) and p(m_j | y_j ∈ F_i) are the probability density functions of the distribution of m_j for {y_j | y_j ∈ T_i} and {y_j | y_j ∈ F_i}, respectively. We will show in Section 3.3 how to generate simulated data for estimating these distributions. In Section 3.5 we will further exploit these distributions in our framework.
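As an illustration, the posterior of Equation 5 can be evaluated numerically. The sketch below is a toy example rather than the paper's estimator: the two conditional densities are stand-in Gaussians instead of the empirical distributions of Section 3.3, and c plays the role of the prior ratio p(y_j ∈ F_i)/p(y_j ∈ T_i):

```python
import math

def match_posterior(m, pdf_true, pdf_false, c):
    """Eq. (5): p(y_j in T_i | m_j), given the two conditional densities
    and the prior ratio c = p(y_j in F_i) / p(y_j in T_i)."""
    lt, lf = pdf_true(m), pdf_false(m)
    if lt == 0.0:
        return 0.0
    return 1.0 / (1.0 + (lf / lt) * c)

def gauss(mu, sigma):
    """Stand-in density; the paper estimates these from simulated data."""
    return lambda z: math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

p_true = gauss(0.4, 0.1)    # hypothetical p(m_j | y_j in T_i): matches lie at small distances
p_false = gauss(1.0, 0.15)  # hypothetical p(m_j | y_j in F_i)

post_near = match_posterior(0.4, p_true, p_false, c=1000.0)
post_far = match_posterior(0.9, p_true, p_false, c=1000.0)
```

Even at the mode of the true-match density, a large prior ratio c noticeably pulls the posterior below 1, which is exactly the effect Equation 5 captures.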
3.2. Estimation accuracy
Since the pairwise measurement between features is the only observation for our model, it is essential to estimate its reliability. Intuitively, an optimal measurement should be able to perfectly separate the true correspondences from the false ones. In other words, the better the measurement distinguishes the true correspondences from the false ones, the more accurately the feature similarity based on it can be estimated. Therefore, the measurement accuracy can be modeled as the expected pureness. Let T be the collection of all matched pairs of features, i.e.,

    T = {(x, y) | y ∈ T(x)}    (6)

The probability that a pair of features is a true match given the measurement value z can be expressed as

    p(T | z) = p((x, y) ∈ T | m(x, y) = z)    (7)

Furthermore, the probability of observing a measurement value z given a corresponding feature pair is

    p(z | T) = p(m(x, y) = z | (x, y) ∈ T)    (8)

Then, the accuracy of the similarity estimation is

    Acc(m) = ∫_{−∞}^{∞} p(T | z) × p(z | T) dz    (9)

with m some pairwise measurement and Acc(m) the accuracy of the model based on m. Since

    p(T | z) ≤ 1  and  ∫_{−∞}^{∞} p(z | T) dz = 1    (10)

the accuracy of a measure m satisfies

    Acc(m) ≤ 1    (11)

with equality

    Acc(m) = 1  ⟺  p(T | z) = 1 for all z with p(z | T) > 0    (12)

This measure allows comparing the accuracy of different distance measurements, as will be shown in the next section.
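Since Acc(m) in Equation 9 is the expectation of p(T | z) under p(z | T), it can be approximated by Monte Carlo from samples of matching and non-matching measurements. A minimal sketch on synthetic Gaussian samples (the bin count, range, and distribution parameters are arbitrary choices, not the paper's data):

```python
import random

def accuracy(true_d, false_d, n_bins=50, lo=0.0, hi=2.0):
    """Monte-Carlo version of Eq. (9): Acc(m) = E_{z ~ p(z|T)}[p(T|z)],
    with p(T|z) estimated per histogram bin from the sample counts
    (the priors are implied by the two sample sizes)."""
    def bin_of(z):
        return min(n_bins - 1, max(0, int((z - lo) / (hi - lo) * n_bins)))
    t_cnt, f_cnt = [0] * n_bins, [0] * n_bins
    for z in true_d:
        t_cnt[bin_of(z)] += 1
    for z in false_d:
        f_cnt[bin_of(z)] += 1
    total = 0.0
    for z in true_d:
        b = bin_of(z)
        total += t_cnt[b] / (t_cnt[b] + f_cnt[b])  # bin-wise p(T|z)
    return total / len(true_d)

random.seed(0)
true_d = [random.gauss(0.4, 0.1) for _ in range(5000)]
false_separated = [random.gauss(1.2, 0.1) for _ in range(5000)]
false_overlapping = [random.gauss(0.5, 0.1) for _ in range(5000)]

acc_sep = accuracy(true_d, false_separated)
acc_ovl = accuracy(true_d, false_overlapping)
```

A measurement whose true and false populations are well separated scores near 1; heavily overlapping populations score markedly lower, matching the intuition behind Equation 12.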
3.3. Ground truth data generation
In order to model the properties of T(x_i), we simulate corresponding features using the following method: First, regions r_{i,0} are detected on a random set of images by the Hessian-Affine detector [15]. Then, we apply numerous random affine warpings (using the affine model proposed by ASIFT [25]) to r_{i,0}, generating a set of related regions. Finally, SIFT features are computed on all regions, resulting in {x_{i,1}, x_{i,2}, ..., x_{i,n}} as a subset of T(x_{i,0}).

The parameters for the simulated affine transformation are selected randomly, and some random jitter is added to model the detection errors occurring in a practical setting. The non-corresponding features F(x_i) are simply generated by selecting 500K random patches extracted from a different and unrelated dataset. In this way, we also generate a dataset D containing 100K matched pairs of features from different images, and 1M non-matched pairs. Figure 1 depicts two corresponding image patches randomly selected from the simulated data.
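The warping step can be sketched as follows. The rotation-tilt-scale decomposition and the parameter ranges below are illustrative stand-ins, not ASIFT's exact affine model; the additive jitter mimics detector localization error:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_affine(max_rot=np.pi, max_tilt=0.6, max_scale=0.3, jitter=0.02):
    """One random affine warp: rotation * anisotropic tilt * isotropic
    scale, plus small Gaussian jitter on the matrix entries.
    Ranges are hypothetical, chosen only for illustration."""
    th = rng.uniform(-max_rot, max_rot)
    R = np.array([[np.cos(th), -np.sin(th)],
                  [np.sin(th),  np.cos(th)]])
    T = np.diag([1.0 + rng.uniform(-max_tilt, max_tilt), 1.0])
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    return s * (R @ T) + rng.normal(0.0, jitter, size=(2, 2))

def warp_points(A, pts):
    """Apply the 2x2 affine part to a set of 2-D points."""
    return pts @ A.T

# Corners of a canonical detected region, warped several times to
# produce a family of related regions.
corners = np.array([[-1.0, -1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])
warped = [warp_points(random_affine(), corners) for _ in range(5)]
```

Descriptors extracted from such a family of warped regions would then populate a simulated T(x_{i,0}).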
Figure 1. Corresponding image patches for two randomly selected points of the simulated data.
3.4. Query adaptive distance
It has been observed that the Euclidean distance is not an appropriate measurement for similarity [21, 16, 11]. We argue that the Euclidean distance is a robust estimator when normalized locally.

As an example, Figure 2 depicts the distributions of the Euclidean distance of the corresponding and non-corresponding features for the two different interest points shown in Figure 1. For each sample point x_i, we collected a set of 500 corresponding features T(x_i) using the procedure from Section 3.3 and a set of 500K random non-corresponding features F(x_i). It can be seen that the Euclidean distance separates the matching from the non-matching features quite well in the local neighborhood of a given query feature x_i.
However, by averaging the distributions of T(x_i) and F(x_i), respectively, over all queries x_i, the Euclidean distance loses its discriminative power. This explains why the Euclidean distance has inferior performance in estimating visual similarity from a global point of view. A local adaptation is therefore necessary to recover the discriminability of the Euclidean distance.

Figure 2. Distribution of the Euclidean distance for two points from the simulated data. The solid lines show the distributions for corresponding features T(x_i), whereas the dotted lines depict non-corresponding ones F(x_i).
Another property can also be observed in Figure 2: if a feature has a large distance to its correspondences, it also has a large distance to the non-matching features. By exploiting this property, a normalization of the distance can be derived for each query feature:

    d_n(x_i, y_j) = d(x_i, y_j) / N_d(x_i)    (13)
where d_n(·,·) represents the normalized distance, d(·,·) the original Euclidean distance, and N_d(x_i) the expected distance of x_i to its non-matching features. It is intractable to estimate the distance distribution between all features and their correspondences, but it is simple to estimate the expected distance to non-corresponding features. Since the non-corresponding features are independent of the query, a set of randomly sampled, and thus unrelated, features can be used to represent the set of non-corresponding features for each query. Moreover, if we assume the distance distribution of the non-corresponding set to follow a normal distribution N(μ, σ²), then the estimation error of its mean based on a subset follows another normal distribution N(0, σ²/N), with N the size of the subset. Therefore, N_d(x_i) can be estimated sufficiently well and very efficiently from even a small set of random, i.e., non-corresponding, features.
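This estimation step can be sketched directly. The example below uses synthetic 128-dimensional vectors purely for illustration; `negatives` stands in for the small random negative set:

```python
import numpy as np

def normalized_distance(query, db_feats, random_feats):
    """Eq. (13): divide the Euclidean distances d(x_i, y_j) by
    N_d(x_i), estimated as the mean distance from the query to a
    small random (hence non-matching) feature sample."""
    n_d = np.linalg.norm(random_feats - query, axis=1).mean()
    return np.linalg.norm(db_feats - query, axis=1) / n_d

# Synthetic SIFT-like vectors (illustrative only).
rng = np.random.default_rng(1)
query = rng.normal(size=128)
db_feats = rng.normal(size=(1000, 128))
negatives = rng.normal(size=(100, 128))

d_n = normalized_distance(query, db_feats, negatives)
# For unrelated database features, d_n concentrates around 1,
# putting every query feature on a comparable scale.
```

This is the "local adaptation" in action: whatever the absolute scale of a query's distances, its non-matching population is mapped to roughly unit distance.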
The probability that an unknown feature matches the query feature when observing their distance z can be modeled as

    p(T | z) = N_T × p(z | T) / [N_T × p(z | T) + N_F × p(z | F)]
             = { 1 + (N_F / N_T) × p(z | F) / p(z | T) }^(−1)    (14)

with N_T and N_F the numbers of corresponding and non-corresponding pairs, respectively. In practical settings, N_F is usually many orders of magnitude larger than N_T. Therefore, as soon as p(z | F) becomes larger than 0, p(T | z) rapidly decreases, and the corresponding features quickly get confused with the non-corresponding ones.

Figure 3 illustrates how the adaptive distance recovers more correct matches compared to the Euclidean distance.

Moreover, by assuming that N_F / N_T = 1000, the measurement accuracy following Equation 9 can be computed. For the Euclidean distance, the estimation accuracy is 0.7291, and for the adaptive distance, the accuracy is 0.7748. Our proposed distance thus significantly outperforms the Euclidean distance.
3.5. Similarity function
In this section, we show how to derive a globally appropriate feature similarity in a quantitative manner. Having established the distance distribution of the query adaptive distance in the previous section, the only unknown in Equation 5 remains the prior ratio p(y_j ∈ F_i) / p(y_j ∈ T_i).

As discussed in Section 3.1, this quantity is inversely proportional to the occurrence frequency of x_i, and it is generally a very large term. Assuming c = p(y_j ∈ F_i) / p(y_j ∈ T_i) to lie between 10 and 100000, the full similarity function can be estimated and is depicted in Figure 4.

The resulting curves follow an inverse sigmoid form such that the similarity is 1 for d_n ≈ 0 and 0 for d_n ≥ 1. They all have roughly the same shape and differ approximately only by an offset. It is to be noted that they show a very sharp transition, making it very difficult to correctly estimate the transition point and thus to achieve a good separation between true and false matches.

In order to reduce the estimation error due to such sharp transitions, a smoother curve would be desirable. Since the distance distributions are all long-tailed, we have fitted different kinds of exponential functions to those curves and observed similar results. For simplicity, we choose to approximate the similarity function as

    f(x_i, y_j) = exp(−α × d_n(x_i, y_j)^4)    (15)

As can be seen in Figure 4, this curve is flatter and covers approximately the full range of possible values for c.

In Equation 15, α can be used to tune the shape of the final function and roughly steers its slope; we achieved the best results with α = 9 and keep this value throughout all experiments.

In the next section, the robustness of this function in a real image retrieval system will be evaluated.
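A minimal sketch of this scoring function, written with the decaying sign that matches the curve in Figure 4 (a score near 1 at d_n ≈ 0, falling to essentially 0 around d_n ≈ 1):

```python
import math

def feature_similarity(d_n, alpha=9.0):
    """Eq. (15)-style score f = exp(-alpha * d_n**4): flat near zero
    distance, long-tailed, effectively zero once d_n reaches 1."""
    return math.exp(-alpha * d_n ** 4)

# Monotone decay over the normalized-distance range.
scores = [round(feature_similarity(d), 4) for d in (0.0, 0.5, 0.8, 1.0)]
```

The quartic exponent keeps the curve flat for confident matches while still suppressing anything near the non-matching regime, which is the smoother alternative to the sharp sigmoids of Figure 4.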
Figure 3. Comparison of our adaptive distance to the Euclidean distance on dataset D: (a) Euclidean distance, (b) query adaptive distance, (c) comparison of the right tails of both distributions. The solid lines are the distance distributions of the matched pairs, and the dotted lines are the distance distributions of non-matched pairs. The green dashed lines denote where the probability of the non-matching distance exceeds 0.1%, i.e., where the non-matching features are very likely to dominate our observation.

Figure 4. Feature similarity evaluated on dataset D. Red lines are the visual similarity for different c evaluated on the simulated data. The blue line is our final similarity function with α = 9.

3.6. Overall method

In this section we integrate the query adaptive distance measurement and the similarity function presented before into an image retrieval system.
Let the visual similarity between a query image q = {x_1, ..., x_m} and a database image d = {y_1, ..., y_n} be

    sim(q, d) = Σ_{i=1}^{m} Σ_{j=1}^{n} f(x_i, y_j)    (16)

with f(x_i, y_j) the pairwise feature similarity as in Equation 15. As mentioned before, d_n(x_i, y_j) and N_d(x_i) are estimated using the random set of features.
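Equation 16 is a plain double sum over feature pairs. A small dense sketch with synthetic vectors, where each N_d(x_i) is estimated from a random negative set as in Section 3.4 (the dimensions and set sizes are toy values):

```python
import numpy as np

def image_similarity(q_feats, d_feats, n_d, alpha=9.0):
    """Eq. (16): sum f(x_i, y_j) over all query/database feature
    pairs, with per-query normalizers n_d[i] = N_d(x_i) (Eq. 13)."""
    sim = 0.0
    for x_i, nd in zip(q_feats, n_d):
        d_n = np.linalg.norm(d_feats - x_i, axis=1) / nd
        sim += float(np.exp(-alpha * d_n ** 4).sum())
    return sim

rng = np.random.default_rng(2)
q_feats = rng.normal(size=(5, 32))
negatives = rng.normal(size=(100, 32))
n_d = np.array([np.linalg.norm(negatives - x, axis=1).mean() for x in q_feats])

self_score = image_similarity(q_feats, q_feats, n_d)            # 5 exact matches
other_score = image_similarity(q_feats, rng.normal(size=(8, 32)), n_d)
```

In the actual system this sum is of course never evaluated densely; the inverted file described next restricts it to candidate features in the scanned lists.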
For retrieval, we use a standard bag-of-words inverted file. However, in order to have an estimation of the pairwise distance d(x_i, y_j) between query and database features, we add a product quantization scheme as in [12] and select the same parameters as the original authors. The feature space is first partitioned into N_c = 20,000 Voronoi cells according to a coarse quantization codebook K_c. All features located in the same Voronoi cell are grouped into the same inverted list. Each feature is further quantized with respect to its coarse quantization centroid. That is, the residual between the feature and its closest centroid is split equally into m = 8 parts, and each part is separately quantized according to a product quantization codebook K_p with N_p = 256 centroids. Then, each feature is encoded using its related image identifier and a set of quantization codes, and is stored in its corresponding inverted list.
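The encoding step can be sketched as follows. This is a toy stand-in, not the paper's pipeline: the codebooks are random rather than k-means-trained, and the sizes are shrunk from the paper's N_c = 20,000, m = 8, N_p = 256:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N_C, N_P = 16, 4, 32, 256   # toy dimensions and codebook sizes

K_c = rng.normal(size=(N_C, D))          # stand-in coarse codebook (untrained)
K_p = rng.normal(size=(M, N_P, D // M))  # stand-in product codebooks (untrained)

def encode(x):
    """Coarse-quantize x, then product-quantize its residual in M chunks."""
    c = int(np.argmin(np.linalg.norm(K_c - x, axis=1)))
    residual = (x - K_c[c]).reshape(M, D // M)
    codes = [int(np.argmin(np.linalg.norm(K_p[m] - residual[m], axis=1)))
             for m in range(M)]
    return c, codes  # inverted-list id + compact per-chunk codes

def decode(c, codes):
    """Approximate reconstruction used when estimating d(x_i, y_j)."""
    return K_c[c] + np.concatenate([K_p[m][k] for m, k in enumerate(codes)])

x = rng.normal(size=D)
c, codes = encode(x)
x_hat = decode(c, codes)
```

Product-quantizing the residual rather than the raw vector is what lets a short code refine matching inside a single Voronoi cell.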
We select random features from Flickr and add 100 of them to each inverted list. For performance reasons, we make sure that the random features are added to the inverted lists before adding the database vectors.

At query time, all inverted lists whose related coarse quantization centers are among the k nearest neighbors of the query vector are scanned.
With our indexing scheme, the distances to non-matching features are always computed first, and their mean value directly yields N_d(x_i). Then, the query adaptive distance d_n(x_i, y_j) to each database vector can directly be computed as in Equation 13. In order to reduce unnecessary computation even further, a threshold β is used to quickly drop features whose Euclidean distance is larger than β × N_d(x_i). This parameter has little influence on the retrieval performance but reduces the computational load significantly. Its influence is evaluated in Section 4.
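The pruning rule can be sketched as below. The value β = 1.2 is an illustrative stand-in (the paper evaluates β's influence in Section 4); a cutoff there is harmless because the Equation 15 score at d_n = 1.2 is already around exp(−9 × 1.2⁴) ≈ 10⁻⁸:

```python
import numpy as np

def score_candidates(d, n_d, alpha=9.0, beta=1.2):
    """Score database features against one query feature: drop any
    candidate with d > beta * N_d(x_i), apply the Eq. (15)-style
    score exp(-alpha * d_n**4) to the rest."""
    keep = d <= beta * n_d
    scores = np.zeros(d.shape)
    scores[keep] = np.exp(-alpha * (d[keep] / n_d) ** 4)
    return scores, keep

# Euclidean distances of four candidates to one query feature,
# whose estimated N_d(x_i) is 1.0 (synthetic values).
d = np.array([0.3, 0.9, 1.0, 2.5])
scores, keep = score_candidates(d, n_d=1.0)
```

Pruned candidates simply contribute zero to the double sum of Equation 16, so the threshold trades a negligible score change for a large reduction in distance evaluations.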
As pointed out by [9], local features of an image tend to occur in bursts. In order to avoid multiple counting of statistically correlated features, we incorporate both "intra burstiness" and "inter burstiness" normalization [9] to re-weight the contributions of every pair of features. The similarity function thus changes to

    sim(q, d) = Σ_{i=1}^{m} Σ_{j=1}^{n} w(x_i, y_j) × f(x_i, y_j)    (17)

with w(x_i, y_j) the burstiness weighting.

References (partial)

[12] H. Jégou, M. Douze, and C. Schmid. Product Quantization for Nearest Neighbor Search. IEEE TPAMI, 2011.
[14] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
[15] K. Mikolajczyk and C. Schmid. Scale & Affine Invariant Interest Point Detectors. IJCV, 2004.
[19] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object Retrieval with Large Vocabularies and Fast Spatial Matching. CVPR, 2007.
[24] J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV, 2003.