
HAL Id: hal-00817211
https://hal.inria.fr/hal-00817211
Submitted on 24 Apr 2013
Distance-Based Image Classication: Generalizing to
new classes at near-zero cost
Thomas Mensink, Jakob Verbeek, Florent Perronnin, Gabriela Csurka
To cite this version:
Thomas Mensink, Jakob Verbeek, Florent Perronnin, Gabriela Csurka. Distance-Based Image Classification: Generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, 2013, 35 (11), pp.2624-2637. 10.1109/TPAMI.2013.83. hal-00817211

Distance-Based Image Classification:
Generalizing to new classes at near-zero cost
Thomas Mensink, Member IEEE, Jakob Verbeek, Member, IEEE,
Florent Perronnin, and Gabriela Csurka
Abstract—We study large-scale image classification methods that can incorporate new classes and training images continuously
over time at negligible cost. To this end we consider two distance-based classifiers, the k-nearest neighbor (k-NN) and nearest
class mean (NCM) classifiers, and introduce a new metric learning approach for the latter. We also introduce an extension of the
NCM classifier to allow for richer class representations. Experiments on the ImageNet 2010 challenge dataset, which contains
over $10^6$ training images of 1,000 classes, show that, surprisingly, the NCM classifier compares favorably to the more flexible
k-NN classifier. Moreover, the NCM performance is comparable to that of linear SVMs which obtain current state-of-the-art
performance. Experimentally we study the generalization performance to classes that were not used to learn the metrics. Using
a metric learned on 1,000 classes, we show results for the ImageNet-10K dataset which contains 10,000 classes, and obtain
performance that is competitive with the current state-of-the-art, while being orders of magnitude faster. Furthermore, we show
how a zero-shot class prior based on the ImageNet hierarchy can improve performance when few training images are available.
Index Terms—Metric Learning, k-Nearest Neighbors Classification, Nearest Class Mean Classification, Large Scale Image
Classification, Transfer Learning, Zero-Shot Learning, Image Retrieval
1 INTRODUCTION
In this paper we focus on the problem of large-scale, multi-class image classification, where the goal is to automatically assign an image to one class out of a finite set of alternatives, e.g. the name of the main object appearing in the image, or a general label like the scene type of the image. To ensure scalability, often linear classifiers such as linear SVMs are used [1], [2]. Additionally, to speed up classification, dimension reduction techniques could be
used [3], or a hierarchy of classifiers could be learned [4],
[5]. The introduction of the ImageNet dataset [6], which
contains more than 14M manually labeled images of 22K
classes, has provided an important benchmark for large-scale
image classification and annotation algorithms. Recently,
impressive results have been reported on 10,000 or more
classes [1], [3], [7]. A drawback of these methods, however,
is that when images of new categories become available, new
classifiers have to be trained from scratch at a relatively high
computational cost.
Many real-life large-scale datasets are open-ended and
dynamic: new images are continuously added to existing
classes, new classes appear over time, and the semantics
of existing classes might evolve too.

Thomas Mensink, ISLA Lab - University of Amsterdam. E-mail: firstname.lastname@uva.nl
Jakob Verbeek, LEAR Team - INRIA Grenoble. E-mail: firstname.lastname@inria.fr
Florent Perronnin and Gabriela Csurka, Xerox Research Centre Europe. E-mail: firstname.lastname@xrce.xerox.com

Therefore, we are
interested in distance-based classifiers which enable the
addition of new classes and new images to existing classes
at (near) zero cost. Such methods can be used continuously
as new data becomes available, and can additionally be alternated
from time to time with a computationally heavier method
to learn a good metric using all available training data. In
particular we consider two distance-based classifiers.
The first is the k-nearest neighbor (k-NN) classifier, which
uses all examples to represent a class, and is a highly non-
linear classifier that has shown competitive performance for
image classification [3], [7], [8], [9]. New images (of new
classes) are simply added to the database, and can be used
for classification without further processing.
The second is the nearest class mean classifier (NCM), which represents each class by the mean feature vector of its elements, see e.g. [10]. Contrary to the k-NN classifier, this
is an efficient linear classifier. To incorporate new images (of
new classes), the relevant class means have to be adjusted or
added to the set of class means. In Section 3, we introduce
an extension which uses several prototypes per class, which
allows a trade-off between the model complexity and the
computational cost of classification.
The success of these methods critically depends on the
used distance functions. Therefore, we cast our classifier
learning problem as one of learning a low-rank Mahalanobis
distance which is shared across all classes. The dimension-
ality of the low-rank matrix serves as a regularizer and improves computational and storage efficiency.
In this paper we explore several strategies for learning
such a metric. For the NCM classifier, we propose a novel
metric learning algorithm based on multi-class logistic dis-
crimination (NCMML), where a sample from a class is
enforced to be closer to its class mean than to any other

class mean in the projected space. We show qualitatively and
quantitatively the advantages of our NCMML approach over
the classical Fisher Discriminant Analysis [10]. For k-NN
classification, we rely on the Large Margin Nearest Neighbor
(LMNN) framework [11] and investigate two variations
similar to the ideas presented in [11], [12] that significantly
improve classification performance.
Most of our experiments are conducted on the Im-
ageNet Large Scale Visual Recognition Challenge 2010
(ILSVRC’10) dataset, which consists of 1.2M training im-
ages of 1,000 classes. To apply the proposed metric learn-
ing techniques on such a large-scale dataset, we employ
stochastic gradient descent (SGD) algorithms, which access
only a small fraction of the training data at each iteration
[13]. To allow metric learning on high-dimensional image
features of datasets that are too large to fit in memory, we
use in addition product quantization [14], a data compression
technique that was recently used with success for large-scale
image retrieval [15] and classifier training [1].
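As a rough illustration of the product quantization idea (a generic sketch under our own assumptions, not the implementation of [14] or [1]): a D-dimensional vector is split into m sub-vectors, each sub-vector is quantized against its own small k-means codebook, and the vector is then stored as m codeword indices; decompression concatenates the selected centroids. All names below (train_pq_codebooks, pq_encode, pq_decode) are illustrative.

import numpy as np

def train_pq_codebooks(X, m=8, k=256, iters=10, seed=0):
    # Train one k-means codebook per sub-vector; assumes D is divisible by m.
    rng = np.random.default_rng(seed)
    codebooks = []
    for S in np.split(X, m, axis=1):
        C = S[rng.choice(len(S), k, replace=False)].copy()
        for _ in range(iters):
            d2 = ((S[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            for j in range(k):
                if np.any(assign == j):
                    C[j] = S[assign == j].mean(axis=0)
        codebooks.append(C)
    return codebooks

def pq_encode(x, codebooks):
    # Store a vector as one codeword index per sub-vector.
    parts = np.split(x, len(codebooks))
    return np.array([((C - p) ** 2).sum(axis=1).argmin()
                     for p, C in zip(parts, codebooks)], dtype=np.int32)

def pq_decode(codes, codebooks):
    # Approximate reconstruction: concatenate the selected centroids.
    return np.concatenate([C[i] for i, C in zip(codes, codebooks)])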
As a baseline approach, we follow the winning entry of
the ILSVRC’11 challenge [1]: Fisher vector image repre-
sentations [16] are used to describe images and one-vs-
rest linear SVM classifiers are learned independently for
each class. Surprisingly, we find that the NCM classifier
outperforms the more flexible k-NN classifier. Moreover, the
NCM classifier performs on par with the SVM baseline, and
shows competitive performance on new classes.
This paper extends our earlier work [17] as follows.
First, for the NCM classifier, in Section 3, we compare the
NCMML metric learning to the classic FDA, we introduce
an extension which uses multiple centroids per class, we
explore a different learning objective, and we examine the
critical points of the objective. Second, in Section 4, we
provide more details on the SGD triplet sampling strategy
used for LMNN metric learning, and we present an efficient
gradient evaluation method. Third, we extend the experimen-
tal evaluation with an experiment where NCMML is used
to learn a metric for instance level image retrieval.
The rest of the paper is organized as follows. We first
discuss a selection of related works which are most relevant
to this paper. In Section 3 we introduce the NCM classifier
and the NCMML metric learning approach. In Section 4
we review LMNN metric learning for k-NN classifiers.
We present extensive experimental results in Section 5,
analyzing different aspects of the proposed methods and
comparing them to the current state-of-the-art in different
application settings such as large scale image annotation,
transfer learning and image retrieval. Finally, we present our
conclusions in Section 6.
2 RELATED WORK
In this section we review related work on large-scale image
classification, metric learning, and transfer learning.
2.1 Large-scale image classification
The ImageNet dataset [6] has been a catalyst for research
on large-scale image annotation. The current state-of-the-art
[1], [2] uses efficient linear SVM classifiers trained in a one-
vs-rest manner in combination with high-dimensional bag-
of-words [18], [19] or Fisher vector representations [16].
Besides one-vs-rest training, large-scale ranking-based for-
mulations have also been explored in [3]. Interestingly, their
WSABIE approach performs joint classifier learning and
dimensionality reduction of the image features. Operating
in a lower-dimensional space acts as a regularization during
learning, and also reduces the cost of classifier evaluation
at test time. Our proposed NCM approach also learns low-
dimensional projection matrices but the weight vectors are
constrained to be the projected class means. This allows for
efficient addition of novel classes.
In [3], [7] k-NN classifiers were found to be competitive
with linear SVM classifiers in a very large-scale setting
involving 10,000 or more classes. The drawback of k-NN
classifiers, however, is that they are expensive in storage
and computation, since in principle all training data needs
to be kept in memory and accessed to classify new images.
This holds even more for Naive-Bayes Nearest Neighbor
(NBNN) [9], which does not use descriptor quantization, but
requires storage of all local descriptors of all training images.
The storage issue is also encountered when SVM classifiers
are trained since all training data needs to be processed in
multiple passes. Product quantization (PQ) was introduced
in [15] as a lossy compression mechanism for local SIFT
descriptors in a bag-of-features image retrieval system. It
has been subsequently used to compress bag-of-words and
Fisher vector image representations in the context of image
retrieval [20] and classifier training [1]. We also exploit PQ
encoding in our work to compress high-dimensional image
signatures when learning our metrics.
2.2 Metric learning
There is a large body of literature on metric learning, but
here we limit ourselves to highlighting a few methods
that learn metrics for (image) classification problems. Other
methods aim at learning metrics for verification problems
and essentially learn binary classifiers that threshold the
learned distance to decide whether two images belong to
the same class or not, see e.g. [21], [22], [23]. Yet another
line of work concerns metric learning for ranking problems,
e.g. to address text retrieval tasks as in [24].
Among those methods that learn metrics for classification,
the Large Margin Nearest Neighbor (LMNN) approach of
[11] is specifically designed to support k-NN classification.
It tries to ensure that for each image a predefined set of
target neighbors from the same class are closer than samples
from other classes. Since the cost function is defined over
triplets of points —that can be sampled in an SGD training
procedure— this method can scale to large datasets. The set
of target neighbors is chosen and fixed using the $\ell_2$ metric in the original space; this can be problematic as the $\ell_2$ distance
might be quite different from the optimal metric for image
classification. Therefore, we explore two variants of LMNN
that avoid using such a pre-defined set of target neighbors,
similar to the ideas presented in [12].

The large margin nearest local mean classifier [25] assigns
a test image to a class based on the distance to the mean of its
nearest neighbors in each class. This method was reported
to outperform LMNN but requires computing all pairwise
distances between training instances and therefore does not
scale well to large datasets. TagProp [8] suffers from the same problem; it assigns weights to training samples based on their distance to the test instance, and computes the class prediction from the total weight of samples of each class in a neighborhood.
Other closely related methods are metric learning by col-
lapsing classes [26] and neighborhood component analysis
[27]. Like TagProp, for each data point these methods define weights to other data points proportional to the exponential of the negative distance. In [26] the target is to learn a distance that makes the weights uniform for samples of the same class and close to zero for other samples, while in [27] the target is only to ensure that zero weight is assigned to samples
from other classes. These methods also require computing
distances between all pairs of data points. Because of their
poor scaling, we do not consider any of these methods below.
Closely related to our NCMML metric learning approach
for the NCM classifier is the LESS model of [28]. They
learn a diagonal scaling matrix to modify the $\ell_2$ distance by rescaling the data dimensions, and include an $\ell_1$ penalty on
the weights to perform feature selection. However, in their
case, NCM is used to address small sample size problems
in binary classification, i.e. cases where there are fewer
training points (tens to hundreds) than features (thousands).
Our approach differs significantly in that (i) we work in
a multi-class setting and (ii) we learn a low-dimensional
projection which allows for efficiency in large-scale settings.
Another closely related method is the Taxonomy-
embedding method of [29], where a nearest prototype classi-
fier is used in combination with a hierarchical cost function.
Documents are embedded in a lower dimensional space in
which each class is represented by a single prototype. In
contrast to our approach, they use a predefined embedding
of the images and learn low-dimensional classifiers, and therefore their method more closely resembles the WSABIE
method of [3].
The Sift-bag kernel of [30] is also related to our method
since it uses an NCM classifier and an $\ell_2$ distance in a
subspace that is orthogonal to the subspace with maximum
within-class variance. However, it involves computing the
first eigenvectors of the within-class covariance matrix,
which has a computational cost between $O(D^2)$ and $O(D^3)$,
undesirable for high-dimensional feature vectors. Moreover,
this metric is heuristically obtained, rather than directly
optimized for maximum classification performance.
Finally, the image-to-class metric learning method of [31] learns a per-class Mahalanobis metric, which in contrast to
our method cannot generalize to new classes. Besides, it uses
the idea of NBNN [9], and therefore requires the storage of
all local descriptors of all images, which is impractical for
the large-scale datasets used in this paper.
2.3 Transfer learning
The term transfer learning is used to refer to methods that
share information across classes during learning. Examples
of transfer learning in computer vision include the use
of part-based or attribute class representations. Part-based
object recognition models [32] define an object as a spatial
constellation of parts, and share the part detectors across
different classes. Attribute-based models [33] characterize
a category (e.g. a certain animal) by a combination of
attributes (e.g. is yellow, has stripes, is carnivore), and share
the attribute classifiers across classes. Other approaches
include biasing the weight vector learned for a new class
towards the weight vectors of classes that have already been
trained [34]. Zero-shot learning [35] is an extreme case of
transfer learning where for a new class no training instances
are available but a description is provided in terms of
parts, attributes, or other relations to already learned classes.
Transfer learning is related to multi-task learning, where
the goal is to leverage the commonalities between several
distinct but related classification problems, or to adapt classifiers learned for one type of images (e.g. ImageNet) to
a new domain (e.g. imagery obtained from a robot camera),
see e.g. [36], [37].
In [38] various transfer learning methods were evalu-
ated in a large-scale setting using the ILSVRC’10 dataset.
They found transfer learning methods to have little added
value when training images are available for all classes.
In contrast, transfer learning was found to be effective in
a zero-shot learning setting, where classifiers were trained
for 800 classes, and performance was tested in a 200-way
classification across the held-out classes.
In this paper we also aim at transfer learning, in the sense
that we allow only a trivial amount of processing on the
data of new classes (storing in a database, or averaging),
and rely on a metric that was trained on other classes to
recognize the new ones. In contrast to most works on transfer
learning, we do not use any intermediate representation in
terms of parts or attributes, nor do we train classifiers for
the new classes. While also considering zero-shot learning,
we further evaluate performance when combining a zero-
shot model inspired by [38] with progressively more training
images per class, from one up to thousands. We find that the
zero-shot model provides an effective prior when a small
amount of training data is available.
3 THE NEAREST CLASS MEAN CLASSIFIER
The nearest class mean (NCM) classifier assigns an image to the class $c^* \in \{1, \ldots, C\}$ with the closest mean:

$c^* = \operatorname{argmin}_{c \in \{1, \ldots, C\}} \; d(x, \mu_c)$,   (1)

$\mu_c = \frac{1}{N_c} \sum_{i : y_i = c} x_i$,   (2)

where $d(x, \mu_c)$ is the Euclidean distance between an image $x$ and the class mean $\mu_c$, $y_i$ is the ground-truth label of image $i$, and $N_c$ is the number of training images in class $c$.
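As a minimal sketch of Eqs. (1)-(2) (our illustrative NumPy code, not the authors' implementation; X holds one feature vector per row and y the class labels):

import numpy as np

def fit_class_means(X, y):
    # One mean feature vector per class, Eq. (2).
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, means

def ncm_predict(X, classes, means):
    # Assign each image to the class with the closest mean, Eq. (1).
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]

Adding a new class only requires computing its mean, which is what makes this classifier attractive in the open-ended setting described above.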

Next, we introduce our NCM metric learning approach,
and its relation to existing models. Then, we present an ex-
tension to use multiple centroids per class, which transforms
the NCM into a non-linear classifier. Finally, we explore
some variants of the objective which allow for smaller SGD
batch sizes, and we give some insights into the critical points
of the objective function.
3.1 Metric learning for the NCM classifier
In this section we introduce our metric learning approach,
which we will refer to as “nearest class mean metric learn-
ing” (NCMML). We replace the Euclidean distance in NCM
by a learned (squared) Mahalanobis distance:
$d_M(x, x') = (x - x')^\top M (x - x')$,   (3)

where $x$ and $x'$ are $D$-dimensional vectors, and $M$ is a positive definite matrix. We focus on low-rank metrics with $M = W^\top W$ and $W \in \mathbb{R}^{d \times D}$, where $d \leq D$ acts as regularizer and improves efficiency for computation and storage. The Mahalanobis distance induced by $W$ is equivalent to the squared $\ell_2$ distance after linear projection of the feature vectors on the rows of $W$:

$d_W(x, x') = (x - x')^\top W^\top W (x - x') = \lVert W x - W x' \rVert_2^2$.   (4)
We do not consider using the more general formulation of $M = W^\top W + S$, where $S$ is a diagonal matrix, as in [24]. While this formulation requires only $D$ additional parameters to estimate, it still requires computing distances in the original high-dimensional space. This is costly for the dense and high-dimensional (4K-64K) Fisher vector representations we use in our experiments, see Section 5.
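A short NumPy illustration of Eq. (4), assuming some projection matrix W of shape d x D is already given (the shapes below are arbitrary examples):

import numpy as np

def d_W(x, x_prime, W):
    # Squared low-rank Mahalanobis distance of Eq. (4):
    # project both vectors with W, then take the squared L2 distance.
    diff = W @ x - W @ x_prime
    return float(diff @ diff)

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 4096))          # d = 128, D = 4096
x, x_prime = rng.normal(size=4096), rng.normal(size=4096)
print(d_W(x, x_prime, W))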
We formulate the NCM classifier using a probabilistic
model based on multi-class logistic regression and define
the probability for a class $c$ given a feature vector $x$ as:

$p(c|x) = \dfrac{\exp\left( -\tfrac{1}{2} d_W(x, \mu_c) \right)}{\sum_{c'=1}^{C} \exp\left( -\tfrac{1}{2} d_W(x, \mu_{c'}) \right)}$.   (5)

This definition may also be interpreted as giving the posterior probabilities of a generative model where $p(x_i|c) = \mathcal{N}(x_i; \mu_c, \Sigma)$ is a Gaussian with mean $\mu_c$ and a covariance matrix $\Sigma = (W^\top W)^{-1}$, which is shared across all classes.^1
The class probabilities p(c) are set to be uniform over all
classes. Later, in Eq. (21), we formulate an NCM classifier
with non-uniform class probabilities.
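A hedged sketch of Eq. (5): the class posteriors are a softmax over minus one half of the squared projected distances (function and variable names are illustrative):

import numpy as np

def ncm_posteriors(X, means, W):
    # p(c|x) of Eq. (5) for every row of X.
    Xp = X @ W.T                   # projected images,      shape (n, d)
    Mp = means @ W.T               # projected class means, shape (C, d)
    d2 = ((Xp[:, None, :] - Mp[None, :, :]) ** 2).sum(axis=2)
    logits = -0.5 * d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)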
To learn the projection matrix W , we maximize the log-
likelihood of the correct predictions of the training images:
$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \ln p(y_i \mid x_i)$.   (6)
The gradient of the NCMML objective Eq. (6) is:

$\nabla_W \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \alpha_{ic} \, W z_{ic} z_{ic}^\top$,   (7)

where $\alpha_{ic} = p(c|x_i) - [\![ y_i = c ]\!]$, $z_{ic} = \mu_c - x_i$, and we use the Iverson brackets $[\![ \cdot ]\!]$ to denote the indicator function that equals one if its argument is true and zero otherwise.

Although not included above for clarity, the terms in the log-likelihood in Eq. (6) could be weighted in cases where the class distributions in the training data are not representative of those encountered when the learned model is applied.

1. Strictly speaking the covariance matrix is not properly defined, as the low-rank matrix $W^\top W$ is non-invertible.

Fig. 1: Illustration comparing FDA (left) and NCMML (right); the obtained projection direction is indicated by the gray line, on which the projected samples are also plotted. For FDA the result is clearly suboptimal, since the blue and green classes are collapsed in the projected space. The proposed NCMML method finds a projection direction which separates the classes reasonably well.
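Returning to Eq. (7), the sketch below shows one SGD update for NCMML as gradient ascent on the log-likelihood over a sampled mini-batch; the learning rate, batch handling, and function names are our own illustrative assumptions, not the paper's exact training recipe.

import numpy as np

def ncmml_sgd_step(W, X_batch, y_batch, means, class_ids, lr=0.1):
    # One ascent step on Eq. (6) using the gradient of Eq. (7).
    n = X_batch.shape[0]
    Xp, Mp = X_batch @ W.T, means @ W.T
    d2 = ((Xp[:, None, :] - Mp[None, :, :]) ** 2).sum(axis=2)
    logits = -0.5 * d2
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)              # p(c|x), Eq. (5)

    onehot = (y_batch[:, None] == class_ids[None, :]).astype(float)
    alpha = p - onehot                             # alpha_ic = p(c|x_i) - [[y_i = c]]
    Z = means[None, :, :] - X_batch[:, None, :]    # z_ic = mu_c - x_i
    WZ = Z @ W.T                                   # rows are (W z_ic)^T

    grad = np.einsum('nc,ncd,ncD->dD', alpha, WZ, Z) / n
    return W + lr * grad                           # ascend the log-likelihood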
3.2 Relation to existing linear classifiers
First we compare the NCMML objective with the classic
Fisher Discriminant Analysis (FDA) [10]. The objective of
FDA is to find a projection matrix W that maximizes the
ratio of between-class variance to within-class variance:
$\mathcal{L}_{\mathrm{FDA}} = \mathrm{tr}\!\left( \dfrac{W S_B W^\top}{W S_W W^\top} \right)$,   (8)

where $S_B = \sum_{c=1}^{C} \frac{N_c}{N} (\mu - \mu_c)(\mu - \mu_c)^\top$ is the weighted covariance matrix of the class centers ($\mu$ being the data center), and $S_W = \sum_{c=1}^{C} \frac{N_c}{N} \Sigma_c$ is the weighted sum of within-class covariance matrices $\Sigma_c$, see e.g. [10] for details.
In the case where the within class covariance for each
class equals the identity matrix, the FDA objective seeks
the direction of maximum variance in $S_B$, i.e. it performs
a PCA projection on the class means. To illustrate this, we
show an example of a two-dimensional problem with three
classes in Figure 1. In contrast, our NCMML method aims
at separating the classes which are nearby in the projected
space, so as to ensure correct predictions. The resulting
projection separates the three classes reasonably well.
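To make the comparison concrete, a small sketch of the scatter matrices entering Eq. (8); the FDA projection itself is usually obtained from the generalized eigenvectors of S_B and S_W, and the code below (our illustration, not the paper's) only builds the two matrices.

import numpy as np

def scatter_matrices(X, y):
    # Between-class (S_B) and within-class (S_W) scatter of Eq. (8).
    N, D = X.shape
    mu = X.mean(axis=0)
    S_B = np.zeros((D, D))
    S_W = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_B += (len(Xc) / N) * np.outer(mu - mu_c, mu - mu_c)
        S_W += (len(Xc) / N) * np.cov(Xc, rowvar=False, bias=True)
    return S_B, S_W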
To relate the NCM classifier to other linear classifiers, we
represent them using the class specific score functions:
$f(c, x) = w_c^\top x + b_c$,   (9)
which are used to assign samples to the class with maximum
score. NCM can be recognized as a linear classifier by
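As a hedged completion of this identification, reconstructed from Eqs. (4) and (9) (our derivation, not quoted from the paper): expanding the squared distance and dropping the term that does not depend on $c$ gives

% -1/2 d_W(x, mu_c) = -1/2 ||Wx||^2 + x^T W^T W mu_c - 1/2 ||W mu_c||^2;
% the first term is constant over c, so the NCM decision rule is linear with
\begin{align*}
  f(c, x) &= w_c^\top x + b_c, &
  w_c &= W^\top W \mu_c, &
  b_c &= -\tfrac{1}{2} \lVert W \mu_c \rVert_2^2 .
\end{align*}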

Frequently Asked Questions (13)
Q1. What have the authors contributed in "Distance-based image classification: generalizing to new classes at near-zero cost" ?

The authors study large-scale image classification methods that can incorporate new classes and training images continuously over time at negligible cost. To this end the authors consider two distance-based classifiers, the k-nearest neighbor (k-NN) and nearest class mean (NCM) classifiers, and introduce a new metric learning approach for the latter. The authors also introduce an extension of the NCM classifier to allow for richer class representations. Experimentally the authors study the generalization performance to classes that were not used to learn the metrics. Using a metric learned on 1,000 classes, the authors show results for the ImageNet-10K dataset which contains 10,000 classes, and obtain performance that is competitive with the current state-of-the-art, while being orders of magnitude faster. Furthermore, the authors show how a zero-shot class prior based on the ImageNet hierarchy can improve performance when few training images are available.

To obtain the centroids of each class, the authors apply k-means clustering on the features $x$ belonging to that class, using the $\ell_2$ distance.

To learn the projection matrix $W$, we use SGD training and sample at each iteration a fixed number of $m$ training images to estimate the gradient.

Query-by-example image retrieval can be seen as an image classification problem where only a single positive sample (the query) is given and negative examples are not explicitly provided. 

The cost of this computation is dominated by the computation of the squared distances $d_W(x, \mu_c)$, required to compute the $m \times C$ probabilities $p(c|x)$ for $C$ classes in the SGD update.

For either choice of the target set $P_q$, the gradient can be computed without explicitly iterating over all triplets, by sorting the distances w.r.t. query images.

The authors can first project both the $m$ data vectors and $C$ class centers, and then compute distances in the low-dimensional space, at a total cost of $O\big(dD(m+C) + mCd\big)$.


Instead of using a fixed set of class means, it could be advantageous to iterate the k-means clustering and the learning of the projection matrix $W$.

The posterior probability for class $c$ can be defined as $p(c|x) = \sum_{j=1}^{k} p(m_{cj}|x)$ (16), with $p(m_{cj}|x) = \frac{1}{Z} \exp\left( -\frac{1}{2} d_W(x, m_{cj}) \right)$ (17), where $p(m_{cj}|x)$ denotes the posterior of a centroid $m_{cj}$, and $Z = \sum_c \sum_j \exp\left( -\frac{1}{2} d_W(x, m_{cj}) \right)$ is the normalizer.
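A minimal sketch of Eqs. (16)-(17), assuming the k centroids of each class are stored in an array of shape (C, k, D) (our illustrative layout, not the authors' code):

import numpy as np

def multi_centroid_posteriors(x, centroids, W):
    # p(c|x) = sum_j p(m_cj | x), with a softmax over all C*k centroids.
    C, k, D = centroids.shape
    xp = W @ x                                    # projected image
    Mp = centroids.reshape(C * k, D) @ W.T        # projected centroids
    logits = -0.5 * ((Mp - xp) ** 2).sum(axis=1)
    logits -= logits.max()
    p_centroid = np.exp(logits)
    p_centroid /= p_centroid.sum()                # p(m_cj | x), Eq. (17)
    return p_centroid.reshape(C, k).sum(axis=1)   # p(c | x),    Eq. (16)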

The gradient of the objective of Eq. (6) w.r.t. $M$ is $\nabla_M \mathcal{L} = \frac{1}{N} \sum_{i,c} \alpha_{ic} \, z_{ic} z_{ic}^\top \equiv H$ (22), where $\alpha_{ic} = [\![ y_i = c ]\!] - p(c|x_i)$, and $z_{ic} = \mu_c - x_i$.

The core components of these methods can be written as matrix products (e.g. projections of the means or images, the gradients of the objectives, etc.), for which the authors benefit from optimized multi-threaded implementations.

This shows that once $A$ is available, the gradient can be computed in time $O(m^2)$, even if a much larger number of triplets is used.