
HAL Id: hal-00817211
https://hal.inria.fr/hal-00817211
Submitted on 24 Apr 2013
Distance-Based Image Classication: Generalizing to
new classes at near-zero cost
Thomas Mensink, Jakob Verbeek, Florent Perronnin, Gabriela Csurka
To cite this version:
Thomas Mensink, Jakob Verbeek, Florent Perronnin, Gabriela Csurka. Distance-Based Image Classification: Generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, 2013, 35 (11), pp.2624-2637. 10.1109/TPAMI.2013.83. hal-00817211

Distance-Based Image Classification:
Generalizing to new classes at near-zero cost
Thomas Mensink, Member IEEE, Jakob Verbeek, Member, IEEE,
Florent Perronnin, and Gabriela Csurka
Abstract—We study large-scale image classification methods that can incorporate new classes and training images continuously
over time at negligible cost. To this end we consider two distance-based classifiers, the k-nearest neighbor (k-NN) and nearest
class mean (NCM) classifiers, and introduce a new metric learning approach for the latter. We also introduce an extension of the
NCM classifier to allow for richer class representations. Experiments on the ImageNet 2010 challenge dataset, which contains
over $10^6$ training images of 1,000 classes, show that, surprisingly, the NCM classifier compares favorably to the more flexible
k-NN classifier. Moreover, the NCM performance is comparable to that of linear SVMs which obtain current state-of-the-art
performance. Experimentally we study the generalization performance to classes that were not used to learn the metrics. Using
a metric learned on 1,000 classes, we show results for the ImageNet-10K dataset which contains 10,000 classes, and obtain
performance that is competitive with the current state-of-the-art, while being orders of magnitude faster. Furthermore, we show
how a zero-shot class prior based on the ImageNet hierarchy can improve performance when few training images are available.
Index Terms—Metric Learning, k-Nearest Neighbors Classification, Nearest Class Mean Classification, Large Scale Image
Classification, Transfer Learning, Zero-Shot Learning, Image Retrieval
1 INTRODUCTION
In this paper we focus on the problem of large-scale, multi-class image classification, where the goal is to automatically assign an image to one class out of a finite set of alternatives, e.g. the name of the main object appearing in the image, or a general label like the scene type of the image. To ensure scalability, often linear classifiers such as linear SVMs are used [1], [2]. Additionally, to speed up classification, dimension reduction techniques could be
used [3], or a hierarchy of classifiers could be learned [4],
[5]. The introduction of the ImageNet dataset [6], which
contains more than 14M manually labeled images of 22K
classes, has provided an important benchmark for large-scale
image classification and annotation algorithms. Recently,
impressive results have been reported on 10,000 or more
classes [1], [3], [7]. A drawback of these methods, however,
is that when images of new categories become available, new
classifiers have to be trained from scratch at a relatively high
computational cost.
Many real-life large-scale datasets are open-ended and
dynamic: new images are continuously added to existing
classes, new classes appear over time, and the semantics
of existing classes might evolve too.

Thomas Mensink, ISLA Lab - University of Amsterdam. E-mail: firstname.lastname@uva.nl
Jakob Verbeek, LEAR Team - INRIA Grenoble. E-mail: firstname.lastname@inria.fr
Florent Perronnin and Gabriela Csurka, Xerox Research Centre Europe. E-mail: firstname.lastname@xrce.xerox.com

Therefore, we are
interested in distance-based classifiers which enable the
addition of new classes and new images to existing classes
at (near) zero cost. Such methods can be used continuously
as new data becomes available, and can additionally be alternated
from time to time with a computationally heavier method
to learn a good metric using all available training data. In
particular we consider two distance-based classifiers.
The first is the k-nearest neighbor (k-NN) classifier, which
uses all examples to represent a class, and is a highly non-
linear classifier that has shown competitive performance for
image classification [3], [7], [8], [9]. New images (of new
classes) are simply added to the database, and can be used
for classification without further processing.
The second is the nearest class mean classifier (NCM), which represents each class by the mean feature vector of its elements, see e.g. [10]. Contrary to the k-NN classifier, this
is an efficient linear classifier. To incorporate new images (of
new classes), the relevant class means have to be adjusted or
added to the set of class means. In Section 3, we introduce
an extension which uses several prototypes per class, which
allows a trade-off between the model complexity and the
computational cost of classification.
The success of these methods critically depends on the
used distance functions. Therefore, we cast our classifier
learning problem as one of learning a low-rank Mahalanobis
distance which is shared across all classes. The dimension-
ality of the low-rank matrix serves as a regularizer and improves computational and storage efficiency.
In this paper we explore several strategies for learning
such a metric. For the NCM classifier, we propose a novel
metric learning algorithm based on multi-class logistic dis-
crimination (NCMML), where a sample from a class is
enforced to be closer to its class mean than to any other

class mean in the projected space. We show qualitatively and
quantitatively the advantages of our NCMML approach over
the classical Fisher Discriminant Analysis [10]. For k-NN
classification, we rely on the Large Margin Nearest Neighbor
(LMNN) framework [11] and investigate two variations
similar to the ideas presented in [11], [12] that significantly
improve classification performance.
Most of our experiments are conducted on the Im-
ageNet Large Scale Visual Recognition Challenge 2010
(ILSVRC’10) dataset, which consists of 1.2M training im-
ages of 1,000 classes. To apply the proposed metric learn-
ing techniques on such a large-scale dataset, we employ
stochastic gradient descent (SGD) algorithms, which access
only a small fraction of the training data at each iteration
[13]. To allow metric learning on high-dimensional image
features of datasets that are too large to fit in memory, we
use in addition product quantization [14], a data compression
technique that was recently used with success for large-scale
image retrieval [15] and classifier training [1].
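As a rough illustration of the product quantization idea (a generic sketch under our own assumptions, not the implementation of [14] or [1]): a D-dimensional vector is split into m sub-vectors, each sub-vector is quantized against its own small k-means codebook, and the vector is then stored as m codeword indices; decompression concatenates the selected centroids. All names below (train_pq_codebooks, pq_encode, pq_decode) are illustrative.

import numpy as np

def train_pq_codebooks(X, m=8, k=256, iters=10, seed=0):
    # Train one k-means codebook per sub-vector; assumes D is divisible by m.
    rng = np.random.default_rng(seed)
    codebooks = []
    for S in np.split(X, m, axis=1):
        C = S[rng.choice(len(S), k, replace=False)].copy()
        for _ in range(iters):
            d2 = ((S[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            for j in range(k):
                if np.any(assign == j):
                    C[j] = S[assign == j].mean(axis=0)
        codebooks.append(C)
    return codebooks

def pq_encode(x, codebooks):
    # Store a vector as one codeword index per sub-vector.
    parts = np.split(x, len(codebooks))
    return np.array([((C - p) ** 2).sum(axis=1).argmin()
                     for p, C in zip(parts, codebooks)], dtype=np.int32)

def pq_decode(codes, codebooks):
    # Approximate reconstruction: concatenate the selected centroids.
    return np.concatenate([C[i] for i, C in zip(codes, codebooks)])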
As a baseline approach, we follow the winning entry of
the ILSVRC’11 challenge [1]: Fisher vector image repre-
sentations [16] are used to describe images and one-vs-
rest linear SVM classifiers are learned independently for
each class. Surprisingly, we find that the NCM classifier
outperforms the more flexible k-NN classifier. Moreover, the
NCM classifier performs on par with the SVM baseline, and
shows competitive performance on new classes.
This paper extends our earlier work [17] as follows.
First, for the NCM classifier, in Section 3, we compare the
NCMML metric learning to the classic FDA, we introduce
an extension which uses multiple centroids per class, we
explore a different learning objective, and we examine the
critical points of the objective. Second, in Section 4, we
provide more details on the SGD triplet sampling strategy
used for LMNN metric learning, and we present an efficient
gradient evaluation method. Third, we extend the experimen-
tal evaluation with an experiment where NCMML is used
to learn a metric for instance level image retrieval.
The rest of the paper is organized as follows. We first
discuss a selection of related works which are most relevant
to this paper. In Section 3 we introduce the NCM classifier
and the NCMML metric learning approach. In Section 4
we review LMNN metric learning for k-NN classifiers.
We present extensive experimental results in Section 5,
analyzing different aspects of the proposed methods and
comparing them to the current state-of-the-art in different
application settings such as large scale image annotation,
transfer learning and image retrieval. Finally, we present our
conclusions in Section 6.
2 RELATED WORK
In this section we review related work on large-scale image
classification, metric learning, and transfer learning.
2.1 Large-scale image classification
The ImageNet dataset [6] has been a catalyst for research
on large-scale image annotation. The current state-of-the-art
[1], [2] uses efficient linear SVM classifiers trained in a one-
vs-rest manner in combination with high-dimensional bag-
of-words [18], [19] or Fisher vector representations [16].
Besides one-vs-rest training, large-scale ranking-based for-
mulations have also been explored in [3]. Interestingly, their
WSABIE approach performs joint classifier learning and
dimensionality reduction of the image features. Operating
in a lower-dimensional space acts as a regularization during
learning, and also reduces the cost of classifier evaluation
at test time. Our proposed NCM approach also learns low-
dimensional projection matrices but the weight vectors are
constrained to be the projected class means. This allows for
efficient addition of novel classes.
In [3], [7] k-NN classifiers were found to be competitive
with linear SVM classifiers in a very large-scale setting
involving 10,000 or more classes. The drawback of k-NN
classifiers, however, is that they are expensive in storage
and computation, since in principle all training data needs
to be kept in memory and accessed to classify new images.
This holds even more for Naive-Bayes Nearest Neighbor
(NBNN) [9], which does not use descriptor quantization, but
requires storage of all local descriptors of all training images.
The storage issue is also encountered when SVM classifiers
are trained since all training data needs to be processed in
multiple passes. Product quantization (PQ) was introduced
in [15] as a lossy compression mechanism for local SIFT
descriptors in a bag-of-features image retrieval system. It
has been subsequently used to compress bag-of-words and
Fisher vector image representations in the context of image
retrieval [20] and classifier training [1]. We also exploit PQ
encoding in our work to compress high-dimensional image
signatures when learning our metrics.
2.2 Metric learning
There is a large body of literature on metric learning, but
here we limit ourselves to highlighting a few methods
that learn metrics for (image) classification problems. Other
methods aim at learning metrics for verification problems
and essentially learn binary classifiers that threshold the
learned distance to decide whether two images belong to
the same class or not, see e.g. [21], [22], [23]. Yet another
line of work concerns metric learning for ranking problems,
e.g. to address text retrieval tasks as in [24].
Among those methods that learn metrics for classification,
the Large Margin Nearest Neighbor (LMNN) approach of
[11] is specifically designed to support k-NN classification.
It tries to ensure that for each image a predefined set of
target neighbors from the same class are closer than samples
from other classes. Since the cost function is defined over
triplets of points —that can be sampled in an SGD training
procedure— this method can scale to large datasets. The set
of target neighbors is chosen and fixed using the $\ell_2$ metric in the original space; this can be problematic as the $\ell_2$ distance
might be quite different from the optimal metric for image
classification. Therefore, we explore two variants of LMNN
that avoid using such a pre-defined set of target neighbors,
similar to the ideas presented in [12].

The large margin nearest local mean classifier [25] assigns
a test image to a class based on the distance to the mean of its
nearest neighbors in each class. This method was reported
to outperform LMNN but requires computing all pairwise
distances between training instances and therefore does not
scale well to large datasets. TagProp [8] suffers from the same problem; it assigns weights to training samples based on their distance to the test instance, and computes the class prediction from the total weight of samples of each class in a neighborhood.
Other closely related methods are metric learning by col-
lapsing classes [26] and neighborhood component analysis
[27]. Like TagProp, for each data point these methods define weights to other data points proportional to the exponential of the negative distance. In [26] the target is to learn a distance that makes the weights uniform for samples of the same class and close to zero for other samples, while in [27] the target is only to ensure that zero weight is assigned to samples
from other classes. These methods also require computing
distances between all pairs of data points. Because of their
poor scaling, we do not consider any of these methods below.
Closely related to our NCMML metric learning approach
for the NCM classifier is the LESS model of [28]. They
learn a diagonal scaling matrix to modify the $\ell_2$ distance by rescaling the data dimensions, and include an $\ell_1$ penalty on
the weights to perform feature selection. However, in their
case, NCM is used to address small sample size problems
in binary classification, i.e. cases where there are fewer
training points (tens to hundreds) than features (thousands).
Our approach differs significantly in that (i) we work in
a multi-class setting and (ii) we learn a low-dimensional
projection which allows for efficiency in large-scale settings.
Another closely related method is the Taxonomy-
embedding method of [29], where a nearest prototype classi-
fier is used in combination with a hierarchical cost function.
Documents are embedded in a lower dimensional space in
which each class is represented by a single prototype. In
contrast to our approach, they use a predefined embedding
of the images and learn low-dimensional classifiers, and therefore their method more closely resembles the WSABIE
method of [3].
The Sift-bag kernel of [30] is also related to our method
since it uses an NCM classifier and an $\ell_2$ distance in a
subspace that is orthogonal to the subspace with maximum
within-class variance. However, it involves computing the
first eigenvectors of the within-class covariance matrix,
which has a computational cost between $O(D^2)$ and $O(D^3)$,
undesirable for high-dimensional feature vectors. Moreover,
this metric is heuristically obtained, rather than directly
optimized for maximum classification performance.
Finally, the image-to-class metric learning method of [31] learns a per-class Mahalanobis metric, which in contrast to
our method cannot generalize to new classes. Besides, it uses
the idea of NBNN [9], and therefore requires the storage of
all local descriptors of all images, which is impractical for
the large-scale datasets used in this paper.
2.3 Transfer learning
The term transfer learning is used to refer to methods that
share information across classes during learning. Examples
of transfer learning in computer vision include the use
of part-based or attribute class representations. Part-based
object recognition models [32] define an object as a spatial
constellation of parts, and share the part detectors across
different classes. Attribute-based models [33] characterize
a category (e.g. a certain animal) by a combination of
attributes (e.g. is yellow, has stripes, is carnivore), and share
the attribute classifiers across classes. Other approaches
include biasing the weight vector learned for a new class
towards the weight vectors of classes that have already been
trained [34]. Zero-shot learning [35] is an extreme case of
transfer learning where for a new class no training instances
are available but a description is provided in terms of
parts, attributes, or other relations to already learned classes.
Transfer learning is related to multi-task learning, where
the goal is to leverage the commonalities between several
distinct but related classification problems, or to adapt classifiers learned for one type of images (e.g. ImageNet) to
a new domain (e.g. imagery obtained from a robot camera),
see e.g. [36], [37].
In [38] various transfer learning methods were evalu-
ated in a large-scale setting using the ILSVRC’10 dataset.
They found transfer learning methods to have little added
value when training images are available for all classes.
In contrast, transfer learning was found to be effective in
a zero-shot learning setting, where classifiers were trained
for 800 classes, and performance was tested in a 200-way
classification across the held-out classes.
In this paper we also aim at transfer learning, in the sense
that we allow only a trivial amount of processing on the
data of new classes (storing in a database, or averaging),
and rely on a metric that was trained on other classes to
recognize the new ones. In contrast to most works on transfer
learning, we do not use any intermediate representation in
terms of parts or attributes, nor do we train classifiers for
the new classes. While also considering zero-shot learning,
we further evaluate performance when combining a zero-
shot model inspired by [38] with progressively more training
images per class, from one up to thousands. We find that the
zero-shot model provides an effective prior when a small
amount of training data is available.
3 THE NEAREST CLASS MEAN CLASSIFIER
The nearest class mean (NCM) classifier assigns an image to the class $c^* \in \{1, \ldots, C\}$ with the closest mean:

$c^* = \operatorname{argmin}_{c \in \{1, \ldots, C\}} \; d(x, \mu_c)$,   (1)

$\mu_c = \frac{1}{N_c} \sum_{i : y_i = c} x_i$,   (2)

where $d(x, \mu_c)$ is the Euclidean distance between an image $x$ and the class mean $\mu_c$, $y_i$ is the ground-truth label of image $i$, and $N_c$ is the number of training images in class $c$.
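As a minimal sketch of Eqs. (1)-(2) (our illustrative NumPy code, not the authors' implementation; X holds one feature vector per row and y the class labels):

import numpy as np

def fit_class_means(X, y):
    # One mean feature vector per class, Eq. (2).
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, means

def ncm_predict(X, classes, means):
    # Assign each image to the class with the closest mean, Eq. (1).
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]

Adding a new class only requires computing its mean, which is what makes this classifier attractive in the open-ended setting described above.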

Next, we introduce our NCM metric learning approach,
and its relation to existing models. Then, we present an ex-
tension to use multiple centroids per class, which transforms
the NCM into a non-linear classifier. Finally, we explore
some variants of the objective which allow for smaller SGD
batch sizes, and we give some insights into the critical points
of the objective function.
3.1 Metric learning for the NCM classifier
In this section we introduce our metric learning approach,
which we will refer to as “nearest class mean metric learn-
ing” (NCMML). We replace the Euclidean distance in NCM
by a learned (squared) Mahalanobis distance:
$d_M(x, x') = (x - x')^\top M (x - x')$,   (3)

where $x$ and $x'$ are $D$-dimensional vectors, and $M$ is a positive definite matrix. We focus on low-rank metrics with $M = W^\top W$ and $W \in \mathbb{R}^{d \times D}$, where $d \leq D$ acts as regularizer and improves efficiency for computation and storage. The Mahalanobis distance induced by $W$ is equivalent to the squared $\ell_2$ distance after linear projection of the feature vectors on the rows of $W$:

$d_W(x, x') = (x - x')^\top W^\top W (x - x') = \lVert W x - W x' \rVert_2^2$.   (4)
We do not consider using the more general formulation of $M = W^\top W + S$, where $S$ is a diagonal matrix, as in [24]. While this formulation requires only $D$ additional parameters to estimate, it still requires computing distances in the original high-dimensional space. This is costly for the dense and high-dimensional (4K-64K) Fisher vector representations we use in our experiments, see Section 5.
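A short NumPy illustration of Eq. (4), assuming some projection matrix W of shape d x D is already given (the shapes below are arbitrary examples):

import numpy as np

def d_W(x, x_prime, W):
    # Squared low-rank Mahalanobis distance of Eq. (4):
    # project both vectors with W, then take the squared L2 distance.
    diff = W @ x - W @ x_prime
    return float(diff @ diff)

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 4096))          # d = 128, D = 4096
x, x_prime = rng.normal(size=4096), rng.normal(size=4096)
print(d_W(x, x_prime, W))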
We formulate the NCM classifier using a probabilistic
model based on multi-class logistic regression and define
the probability for a class $c$ given a feature vector $x$ as:

$p(c|x) = \dfrac{\exp\left( -\tfrac{1}{2} d_W(x, \mu_c) \right)}{\sum_{c'=1}^{C} \exp\left( -\tfrac{1}{2} d_W(x, \mu_{c'}) \right)}$.   (5)

This definition may also be interpreted as giving the posterior probabilities of a generative model where $p(x_i|c) = \mathcal{N}(x_i; \mu_c, \Sigma)$ is a Gaussian with mean $\mu_c$ and a covariance matrix $\Sigma = (W^\top W)^{-1}$, which is shared across all classes.^1
The class probabilities p(c) are set to be uniform over all
classes. Later, in Eq. (21), we formulate an NCM classifier
with non-uniform class probabilities.
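A hedged sketch of Eq. (5): the class posteriors are a softmax over minus one half of the squared projected distances (function and variable names are illustrative):

import numpy as np

def ncm_posteriors(X, means, W):
    # p(c|x) of Eq. (5) for every row of X.
    Xp = X @ W.T                   # projected images,      shape (n, d)
    Mp = means @ W.T               # projected class means, shape (C, d)
    d2 = ((Xp[:, None, :] - Mp[None, :, :]) ** 2).sum(axis=2)
    logits = -0.5 * d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)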
To learn the projection matrix W , we maximize the log-
likelihood of the correct predictions of the training images:
$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \ln p(y_i \mid x_i)$.   (6)
The gradient of the NCMML objective Eq. (6) is:

$\nabla_W \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \alpha_{ic} \, W z_{ic} z_{ic}^\top$,   (7)

where $\alpha_{ic} = p(c|x_i) - [\![ y_i = c ]\!]$, $z_{ic} = \mu_c - x_i$, and we use the Iverson brackets $[\![ \cdot ]\!]$ to denote the indicator function that equals one if its argument is true and zero otherwise.

Although not included above for clarity, the terms in the log-likelihood in Eq. (6) could be weighted in cases where the class distributions in the training data are not representative of those encountered when the learned model is applied.

1. Strictly speaking the covariance matrix is not properly defined, as the low-rank matrix $W^\top W$ is non-invertible.

Fig. 1: Illustration comparing FDA (left) and NCMML (right); the obtained projection direction is indicated by the gray line, on which the projected samples are also plotted. For FDA the result is clearly suboptimal, since the blue and green classes are collapsed in the projected space. The proposed NCMML method finds a projection direction which separates the classes reasonably well.
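Returning to Eq. (7), the sketch below shows one SGD update for NCMML as gradient ascent on the log-likelihood over a sampled mini-batch; the learning rate, batch handling, and function names are our own illustrative assumptions, not the paper's exact training recipe.

import numpy as np

def ncmml_sgd_step(W, X_batch, y_batch, means, class_ids, lr=0.1):
    # One ascent step on Eq. (6) using the gradient of Eq. (7).
    n = X_batch.shape[0]
    Xp, Mp = X_batch @ W.T, means @ W.T
    d2 = ((Xp[:, None, :] - Mp[None, :, :]) ** 2).sum(axis=2)
    logits = -0.5 * d2
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)              # p(c|x), Eq. (5)

    onehot = (y_batch[:, None] == class_ids[None, :]).astype(float)
    alpha = p - onehot                             # alpha_ic = p(c|x_i) - [[y_i = c]]
    Z = means[None, :, :] - X_batch[:, None, :]    # z_ic = mu_c - x_i
    WZ = Z @ W.T                                   # rows are (W z_ic)^T

    grad = np.einsum('nc,ncd,ncD->dD', alpha, WZ, Z) / n
    return W + lr * grad                           # ascend the log-likelihood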
3.2 Relation to existing linear classifiers
First we compare the NCMML objective with the classic
Fisher Discriminant Analysis (FDA) [10]. The objective of
FDA is to find a projection matrix W that maximizes the
ratio of between-class variance to within-class variance:
$\mathcal{L}_{\mathrm{FDA}} = \mathrm{tr}\!\left( \dfrac{W S_B W^\top}{W S_W W^\top} \right)$,   (8)

where $S_B = \sum_{c=1}^{C} \frac{N_c}{N} (\mu - \mu_c)(\mu - \mu_c)^\top$ is the weighted covariance matrix of the class centers ($\mu$ being the data center), and $S_W = \sum_{c=1}^{C} \frac{N_c}{N} \Sigma_c$ is the weighted sum of within-class covariance matrices $\Sigma_c$, see e.g. [10] for details.
In the case where the within class covariance for each
class equals the identity matrix, the FDA objective seeks
the direction of maximum variance in $S_B$, i.e. it performs
a PCA projection on the class means. To illustrate this, we
show an example of a two-dimensional problem with three
classes in Figure 1. In contrast, our NCMML method aims
at separating the classes which are nearby in the projected
space, so as to ensure correct predictions. The resulting
projection separates the three classes reasonably well.
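To make the comparison concrete, a small sketch of the scatter matrices entering Eq. (8); the FDA projection itself is usually obtained from the generalized eigenvectors of S_B and S_W, and the code below (our illustration, not the paper's) only builds the two matrices.

import numpy as np

def scatter_matrices(X, y):
    # Between-class (S_B) and within-class (S_W) scatter of Eq. (8).
    N, D = X.shape
    mu = X.mean(axis=0)
    S_B = np.zeros((D, D))
    S_W = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_B += (len(Xc) / N) * np.outer(mu - mu_c, mu - mu_c)
        S_W += (len(Xc) / N) * np.cov(Xc, rowvar=False, bias=True)
    return S_B, S_W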
To relate the NCM classifier to other linear classifiers, we
represent them using the class specific score functions:
$f(c, x) = w_c^\top x + b_c$,   (9)
which are used to assign samples to the class with maximum
score. NCM can be recognized as a linear classifier by
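As a hedged completion of this identification, reconstructed from Eqs. (4) and (9) (our derivation, not quoted from the paper): expanding the squared distance and dropping the term that does not depend on $c$ gives

% -1/2 d_W(x, mu_c) = -1/2 ||Wx||^2 + x^T W^T W mu_c - 1/2 ||W mu_c||^2;
% the first term is constant over c, so the NCM decision rule is linear with
\begin{align*}
  f(c, x) &= w_c^\top x + b_c, &
  w_c &= W^\top W \mu_c, &
  b_c &= -\tfrac{1}{2} \lVert W \mu_c \rVert_2^2 .
\end{align*}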

Frequently Asked Questions (13)
Q1. What have the authors contributed in "Distance-based image classification: generalizing to new classes at near-zero cost" ?

The authors study large-scale image classification methods that can incorporate new classes and training images continuously over time at negligible cost. To this end the authors consider two distance-based classifiers, the k-nearest neighbor (k-NN) and nearest class mean (NCM) classifiers, and introduce a new metric learning approach for the latter. The authors also introduce an extension of the NCM classifier to allow for richer class representations. Experimentally the authors study the generalization performance to classes that were not used to learn the metrics. Using a metric learned on 1,000 classes, the authors show results for the ImageNet-10K dataset which contains 10,000 classes, and obtain performance that is competitive with the current state-of-the-art, while being orders of magnitude faster. Furthermore, the authors show how a zero-shot class prior based on the ImageNet hierarchy can improve performance when few training images are available.

To obtain the centroids of each class, the authors apply k-means clustering on the features $x$ belonging to that class, using the $\ell_2$ distance.

To learn the projection matrix $W$, we use SGD training and sample at each iteration a fixed number of $m$ training images to estimate the gradient.

Query-by-example image retrieval can be seen as an image classification problem where only a single positive sample (the query) is given and negative examples are not explicitly provided. 

The cost of this computation is dominated by the computation of the squared distances $d_W(x, \mu_c)$, required to compute the $m \times C$ probabilities $p(c|x)$ for $C$ classes in the SGD update.

For either choice of the target set $P_q$, the gradient can be computed without explicitly iterating over all triplets, by sorting the distances w.r.t. query images.

The authors can first project both the $m$ data vectors and $C$ class centers, and then compute distances in the low-dimensional space, at a total cost of $O\big(dD(m+C) + mCd\big)$.


Instead of using a fixed set of class means, it could be advantageous to iterate the k-means clustering and the learning of the projection matrix $W$.

The posterior probability for class $c$ can be defined as $p(c|x) = \sum_{j=1}^{k} p(m_{cj}|x)$ (16), with $p(m_{cj}|x) = \frac{1}{Z} \exp\left( -\frac{1}{2} d_W(x, m_{cj}) \right)$ (17), where $p(m_{cj}|x)$ denotes the posterior of a centroid $m_{cj}$, and $Z = \sum_c \sum_j \exp\left( -\frac{1}{2} d_W(x, m_{cj}) \right)$ is the normalizer.
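A minimal sketch of Eqs. (16)-(17), assuming the k centroids of each class are stored in an array of shape (C, k, D) (our illustrative layout, not the authors' code):

import numpy as np

def multi_centroid_posteriors(x, centroids, W):
    # p(c|x) = sum_j p(m_cj | x), with a softmax over all C*k centroids.
    C, k, D = centroids.shape
    xp = W @ x                                    # projected image
    Mp = centroids.reshape(C * k, D) @ W.T        # projected centroids
    logits = -0.5 * ((Mp - xp) ** 2).sum(axis=1)
    logits -= logits.max()
    p_centroid = np.exp(logits)
    p_centroid /= p_centroid.sum()                # p(m_cj | x), Eq. (17)
    return p_centroid.reshape(C, k).sum(axis=1)   # p(c | x),    Eq. (16)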

The gradient of the objective of Eq. (6) w.r.t. $M$ is $\nabla_M \mathcal{L} = \frac{1}{N} \sum_{i,c} \alpha_{ic} \, z_{ic} z_{ic}^\top \equiv H$ (22), where $\alpha_{ic} = [\![ y_i = c ]\!] - p(c|x_i)$, and $z_{ic} = \mu_c - x_i$.

The core components of these methods can be written as matrix products (e.g. projections of the means or images, the gradients of the objectives, etc.), for which the authors benefit from optimized multi-threaded implementations.

This shows that once $A$ is available, the gradient can be computed in time $O(m^2)$, even if a much larger number of triplets is used.