
TagProp: Discriminative Metric Learning
in Nearest Neighbor Models for Image Auto-Annotation
Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek and Cordelia Schmid
LEAR, INRIA Grenoble - Laboratoire Jean Kuntzmann
firstname.lastname@inrialpes.fr
Abstract
Image auto-annotation is an important open problem in
computer vision. For this task we propose TagProp, a dis-
criminatively trained nearest neighbor model. Tags of test
images are predicted using a weighted nearest-neighbor
model to exploit labeled training images. Neighbor weights
are based on neighbor rank or distance. TagProp allows
the integration of metric learning by directly maximizing
the log-likelihood of the tag predictions in the training set.
In this manner, we can optimally combine a collection of
image similarity metrics that cover different aspects of im-
age content, such as local shape descriptors, or global
color histograms. We also introduce a word specific sig-
moidal modulation of the weighted neighbor tag predictions
to boost the recall of rare words. We investigate the perfor-
mance of different variants of our model and compare to ex-
isting work. We present experimental results for three chal-
lenging data sets. On all three, TagProp makes a marked
improvement as compared to the current state-of-the-art.
1. Introduction
Image auto-annotation is an active subject of research [7, 15, 16, 18]. The goal is to develop methods that can predict
for a new image the relevant keywords from an annotation
vocabulary. These keyword predictions can be used either
to propose tags for an image, or to propose images for a
tag or a combination of tags. Such methods are becoming
more and more important given the growing collections of
user-provided visual content, e.g. on photo or video sharing
sites, and desktop photo management applications. These
large-scale collections feed the demand for automatic re-
trieval and annotation methods. Since the amount of images
with more or less structured annotations is also increasing,
this allows the deployment of machine learning techniques
to leverage this potential by estimating accurate tag predic-
tion models.
Although the general problem is a difficult one, progress
has been made in the research community by evaluations
on standardized annotated data sets. In the next section we
will detail the related work that is most closely linked to
ours. The main shortcomings of existing work are twofold.
First, models are often estimated to maximize generative
likelihood of image features and annotations, which might
not be optimal for tag prediction. Second, many para-
metric models are not rich enough to accurately capture
the intricate dependencies between image content and an-
notations. Non-parametric nearest neighbor like methods
have been found to be quite successful for tag prediction
[5, 11, 13, 17, 22, 27]. This is mainly due to the high ‘capac-
ity’ of such models: they can adapt flexibly to the patterns in
the data as more data is available. However, existing nearest
neighbor type methods do not allow for integrated learning
of the metric that defines the nearest neighbors in order to
maximize the predictive performance of the model. Either
a fixed metric [5, 27] or ad hoc combinations of several metrics [17] are used, despite much recent work showing the
benefits of metric learning for many computer vision tasks
such as image classification [12], image retrieval [10], or
visual identification [9].
In this paper we present TagProp, short for Tag Propaga-
tion, a new nearest neighbor type model that predicts tags by
taking a weighted combination of the tag absence/presence
among neighbors. Our contributions are the following.
First, the weights for neighbors are either determined based
on the neighbor rank or its distance, and set automatically
by maximizing the likelihood of annotations in a set of
training images. With rank based weights the k-th neigh-
bor always receives a fixed weight, whereas distance based
weights decay exponentially with the distance. Our tag pre-
diction model is conceptually simple, yet outperforms the
current state-of-the-art methods using the same feature set.
Second, contrary to earlier work, our model allows the in-
tegration of metric learning. This enables us to optimize
e.g. a Mahalanobis metric between image features or,
less costly, a combination of several distance measures
to define the neighbor weights for the tag prediction task.

[Figure 1: example test images from the Corel 5k, ESP Game, and IAPR TC12 data sets, each shown with its ground-truth tags and the five highest-ranked predicted tags with their relevance scores; see the caption below.]
Figure 1. Example test images from the three data sets. Next to each image, we show the ground truth annotation (left), and the five tags
with highest relevance predictions (correct ones are underlined) given by our TagProp model (σML variant with K = 200). Note the large
diversity between the data sets, and that the ground truth annotations do not always contain all relevant tags (e.g. ‘water’ for the bottom
left image), and sometimes contain tags for which one can argue whether they are relevant or not (e.g. ‘lot’ for the bottom right image).
Third, TagProp includes word-specific logistic discriminant
models. These models use the tag predictions of the word-
invariant models as inputs and are able, using just two pa-
rameters per word, to boost or suppress the tag presence
probabilities for very frequent or rare words. This results
in a significant increase in the number of words that are re-
called, i.e. assigned to at least one test image.
To evaluate our models and to compare to previous work,
we use three data sets (Corel 5k, IAPR TC12 and ESP Game) and standard measures including precision, recall,
mean average precision and break-even point. In Figure 1
we show several examples of images with their annota-
tions, and predictions from our model. On all data sets and
measures we show significantly improved accuracy of our
method as compared to earlier work.
The rest of this paper is organized as follows. In the next
section we give an overview of the related work. Then, in
Section 3, we present our tag prediction models, and how
we estimate their parameters. In Section 4 we present the
three data sets, evaluation criteria as well as the image fea-
tures we use in our experiments. The experimental results
are presented in Section 5. In Section 6 we present our con-
clusions and directions for further research.
2. Related Work
In this section we discuss models for image annotation
and keyword based retrieval most relevant for our work. We
identify four main groups of methods: those based on topic
models or mixture models, discriminatively trained ones,
and nearest neighbor type models.
The first group of methods are based on topic models
such as latent Dirichlet allocation, probabilistic latent se-
mantic analysis, and hierarchical Dirichlet processes, see
e.g. [1, 20, 25]. These methods model annotated images
as samples from a specific mix of topics, where each topic
is a distribution over image features and annotation words.
Parameter estimation involves estimating the topic mix for
each image, and estimating the data distributions of the top-
ics. Most often, a multinomial distribution over words is
used, and a Gaussian over visual features from different re-
gions of the image. Methods inspired by machine trans-
lation [4], in this case translating from discrete visual fea-
tures to the annotation vocabulary, can also be understood
as topic models, using one topic per visual descriptor type.
A second family of methods uses mixture models to de-
fine a joint distribution over image features and annota-
tion tags. To annotate a new image, these models com-
pute the conditional probability over tags given the visual
features by normalising the joint likelihood. Sometimes a
fixed number of mixture components over visual features
per keyword is used [2], while other models use the training
images as components to define a mixture model over visual
features and tags [5, 11, 13]. Each training image defines
a likelihood over visual features and tags by a smoothed
distribution around the observed values. These models can
be seen as non-parametric density estimators over the co-
occurrence of images and annotations. For visual features
Gaussians are used, while the distributions over annotations
are multinomials, or separate Bernoullis for each word.
Both families of generative models discussed above may
be criticized because they maximize the generative data
likelihood, which is not necessarily optimal for predictive
performance. Therefore, discriminative models for tag pre-
diction have also been proposed [3, 7, 10]. These methods
learn a separate classifier for each tag, and use these to pre-
dict for each test image whether it belongs to the class of
images that are annotated with each particular tag. Different
learning methods have been used, including support vector
machines, multiple-instance learning, and Bayes point ma-
chines. Notable is [7] which also addresses the problem of
retrieving images based on multi-word queries.
Given the increasing amount of training data that is
currently available, local learning techniques are becom-
ing more attractive as a simple yet powerful alternative to

parametric models. Examples of such techniques include
methods based on label diffusion over a similarity graph
of labeled and unlabeled images [16, 22], or learning dis-
criminative models in neighborhoods of test images [27].
A simpler ad hoc nearest-neighbor tag transfer mechanism
was recently introduced [17], showing state-of-the-art per-
formance. There, nearest neighbors are determined by the
average of several distances computed from different visual
features. The authors also combine the base distances by
learning a binary classifier separating image pairs that have
several tags in common from images that do not share any
tags. However, this linear distance combination did not give
better results than an equally weighted combination.
3. Tag Relevance Prediction Models
Our goal is to predict the relevance of annotation tags
for images. Given these relevance predictions we can an-
notate images by ranking the tags for a given image, or
do keyword based retrieval by ranking images for a given
tag. Our proposed method is based on a weighted nearest
neighbor approach, inspired by recent successful methods
[5, 11, 13, 17], that propagate the annotations of training
images to new images. Our models are learnt in a discrimi-
native manner, rather than using held-out data [5], or using
neighbors in an ad hoc manner [17]. We assume that some
visual similarity or distance measures between images are
given, abstracting away from their precise definition.
3.1. Weighted Nearest Neighbor Tag Prediction
To model image annotations, we use Bernoulli models
for each keyword. This choice is natural because keywords,
unlike natural text where word frequency is meaningful, are
either present or absent. The dependencies between key-
words in the training data are not explicitly modeled, but
are implicitly exploited in our model.
We use $y_{iw} \in \{-1, +1\}$ to denote the absence/presence of keyword $w$ for image $i$, hence encoding the image annotations. The tag presence prediction $p(y_{iw} = +1)$ for image $i$ is a weighted sum over the training images, indexed by $j$:

$$p(y_{iw} = +1) = \sum_j \pi_{ij}\, p(y_{iw} = +1 \mid j), \qquad (1)$$

$$p(y_{iw} = +1 \mid j) = \begin{cases} 1 - \epsilon & \text{for } y_{jw} = +1, \\ \epsilon & \text{otherwise}, \end{cases} \qquad (2)$$

where $\pi_{ij}$ denotes the weight of image $j$ for predicting the tags of image $i$. We require that $\pi_{ij} \geq 0$ and $\sum_j \pi_{ij} = 1$. We use $\epsilon$ to avoid zero prediction probabilities, and in practice we set $\epsilon = 10^{-5}$. To estimate the parameters that control the weights $\pi_{ij}$ we maximize the log-likelihood of the predictions of training annotations. Taking care to set the weight of training images to themselves to zero, i.e. $\pi_{ii} = 0$, our objective is to maximize

$$L = \sum_{i,w} c_{iw} \ln p(y_{iw}), \qquad (3)$$

where $c_{iw}$ is a cost that takes into account the imbalance between keyword presence and absence. Indeed, in practice, there are many more tag absences than presences, and absences are much noisier than presences. This is because most tags in annotations are relevant, but often the annotation does not include all relevant tags. We set $c_{iw} = 1/n^{+}$ if $y_{iw} = +1$, where $n^{+}$ is the total number of positive labels, and likewise $c_{iw} = 1/n^{-}$ when $y_{iw} = -1$.
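To make Eqs. (1)-(3) concrete, here is a minimal NumPy sketch (not the authors' code; the function names, the annotation matrix Y in {-1,+1} and the weight matrix pi are illustrative) of the weighted-neighbor tag probabilities and of the weighted log-likelihood used as the training objective:

```python
import numpy as np

def tag_probabilities(pi, Y, eps=1e-5):
    """Eqs. (1)-(2): p(y_iw = +1) as a weighted vote over training images.

    pi  : (N, M) neighbor weights, rows sum to 1 (with pi[i, i] = 0 for training images)
    Y   : (M, W) training annotations in {-1, +1}
    eps : floor that avoids zero probabilities
    """
    return pi @ np.where(Y == 1, 1.0 - eps, eps)        # (N, W) array of p(y_iw = +1)

def weighted_log_likelihood(pi, Y, eps=1e-5):
    """Eq. (3): sum_{i,w} c_iw * ln p(y_iw), evaluated on the training images,
    so pi is (N, N) and Y is (N, W); c_iw = 1/n+ for presences, 1/n- for absences."""
    p_pos = tag_probabilities(pi, Y, eps)
    n_pos, n_neg = float((Y == 1).sum()), float((Y == -1).sum())
    c = np.where(Y == 1, 1.0 / n_pos, 1.0 / n_neg)
    p_true = np.where(Y == 1, p_pos, 1.0 - p_pos)        # probability of the observed label
    return float((c * np.log(p_true)).sum())
```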
Rank-based weights. In the case of rank-based weights over $K$ neighbors we set $\pi_{ij} = \gamma_k$ if $j$ is the $k$-th nearest neighbor of $i$. The data log-likelihood (3) is concave in the parameters $\gamma_k$, which can be estimated using an EM algorithm or a projected-gradient algorithm. The derivative of Eq. (3) with respect to $\gamma_k$ equals

$$\frac{\partial L}{\partial \gamma_k} = \sum_{i,w} c_{iw} \frac{p(y_{iw} \mid n_{ik})}{p(y_{iw})}, \qquad (4)$$

where $n_{ik}$ denotes the index of the $k$-th neighbor of image $i$. The number of parameters equals the neighborhood size $K$. We refer to this variant as RK, for "rank-based".
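As an illustration of the RK variant, the sketch below (hypothetical names; a simplex-constrained gamma is assumed to be given) builds the weight matrix pi from precomputed neighbor ranks; learning gamma itself would maximize Eq. (3), e.g. by projected gradient using Eq. (4).

```python
import numpy as np

def rank_based_weights(neighbor_idx, gamma, n_train):
    """Rank-based weights: pi_ij = gamma_k when j is the k-th neighbor of i.

    neighbor_idx : (N, K) array, neighbor_idx[i, k] = index of the k-th nearest training image of i
    gamma        : (K,) non-negative weights summing to 1 (one weight per rank)
    """
    N, K = neighbor_idx.shape
    pi = np.zeros((N, n_train))
    rows = np.repeat(np.arange(N), K)                 # row index for each (image, rank) pair
    pi[rows, neighbor_idx.ravel()] = np.tile(gamma, N)
    return pi
```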
Distance-based weights. The other possibility is to define the weights directly as a function of the distance, rather than the rank. This has the advantage that weights will depend smoothly on the distance, which is crucial if the distance is to be adjusted during training. The weights of training images $j$ for an image $i$ are redefined as

$$\pi_{ij} = \frac{\exp(-d_\theta(i,j))}{\sum_{j'} \exp(-d_\theta(i,j'))}, \qquad (5)$$

where $d_\theta$ is a distance metric with parameters $\theta$ that we want to optimize. Note that the weights $\pi_{ij}$ decay exponentially with the distance $d_\theta$ to image $i$. Choices for $d_\theta$ include Mahalanobis distances $d_M$ parametrized by a semi-definite matrix $M$, and $d_w(i,j) = w^\top d_{ij}$, where $d_{ij}$ is a vector of base distances between images $i$ and $j$, and $w$ contains the positive coefficients of the linear distance combination. The number of parameters then equals the number of base distances that are combined. In the rest of the paper we focus on this particular case. When we use a single distance, referred to as the SD variant, $w$ is a scalar that controls the decay of the weights with distance, and it is the only parameter of the model. When multiple distances are used, the variant is referred to as ML, for "metric learning".

Again, rather than using an EM algorithm, we directly maximize the log-likelihood using a projected gradient algorithm under positivity constraints on the elements of $w$.
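A small sketch of Eq. (5) for the ML variant, assuming a tensor D of precomputed base distances (names are illustrative; in the full model pi_ii would additionally be forced to zero and only the K pre-selected neighbors would be considered):

```python
import numpy as np

def distance_based_weights(D, w):
    """Eq. (5): softmax weights from a linear combination of base distances.

    D : (N, M, B) array of B base distances for each test/train pair (i, j)
    w : (B,) non-negative combination coefficients (B = 1 gives the SD variant)
    """
    d = D @ w                                    # combined distance d_w(i, j) = w^T d_ij
    d = d - d.min(axis=1, keepdims=True)         # per-row shift for numerical stability
    expd = np.exp(-d)
    return expd / expd.sum(axis=1, keepdims=True)
```

With a single base distance this reduces to the SD variant, where the scalar coefficient only controls how fast the weights decay with distance.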

Using the new definition of the weights, the gradient of the log-likelihood Eq. (3) with respect to $w$ equals

$$\frac{\partial L}{\partial w} = \sum_{i,j} W_i\, (\pi_{ij} - \rho_{ij})\, d_{ij}, \qquad (6)$$

where $W_i = \sum_w c_{iw}$, and $\rho_{ij}$ denotes the weighted average over all words $w$ of the posterior probability of neighbor $j$ for image $i$ given the annotation:

$$\rho_{ij} = \sum_w \frac{c_{iw}}{W_i}\, p(j \mid y_{iw}). \qquad (7)$$
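The following sketch spells out Eqs. (6)-(7) as code, again with illustrative names and an explicit loop over images for readability rather than efficiency; it is not the authors' implementation.

```python
import numpy as np

def grad_w(w, D, Y, eps=1e-5):
    """Gradient of the log-likelihood w.r.t. the distance-combination weights w, Eqs. (6)-(7).

    D : (N, N, B) base distances between training images, Y : (N, W) labels in {-1, +1}.
    """
    N, _, B = D.shape
    d = D @ w                                             # combined distances, Eq. (5)
    expd = np.exp(-(d - d.min(axis=1, keepdims=True)))
    np.fill_diagonal(expd, 0.0)                           # pi_ii = 0
    pi = expd / expd.sum(axis=1, keepdims=True)

    n_pos, n_neg = float((Y == 1).sum()), float((Y == -1).sum())
    c = np.where(Y == 1, 1.0 / n_pos, 1.0 / n_neg)
    grad = np.zeros(B)
    for i in range(N):
        W_i = c[i].sum()                                  # W_i = sum_w c_iw
        agree = np.where(Y == Y[i], 1.0 - eps, eps)       # p(y_iw | j) for the observed y_iw
        p = pi[i] @ agree                                 # p(y_iw), Eq. (1)
        rho_i = pi[i] * (agree @ (c[i] / (W_i * p)))      # Eq. (7)
        grad += W_i * ((pi[i] - rho_i) @ D[i])            # Eq. (6)
    return grad

# one projected-gradient ascent step keeps the coefficients non-negative:
# w = np.maximum(w + step_size * grad_w(w, D, Y), 0.0)
```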
To reduce the computational cost of training the model, we do not compute all pairwise $\pi_{ij}$ and $\rho_{ij}$. Rather, for each $i$ we compute them only over a large set of neighbors, and assume the remaining $\pi_{ij}$ and $\rho_{ij}$ to be zero. For each $i$, we select $K$ neighbors such that we maximise $k^{*} = \min\{k_d\}$, where $k_d$ is the largest neighbor rank for which neighbors 1 to $k_d$ of base distance $d$ are included among the selected neighbors. In this way we are likely to include all images with large $\pi_{ij}$, regardless of the distance combination $w$ that is learnt. Therefore, after determining these neighborhoods, our algorithm scales linearly with the number of training images.
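One possible reading of this neighbor pre-selection, sketched below with hypothetical inputs (a per-base-distance ranking of the training images, with self-matches already removed), is to sweep the ranks and add, for each base distance in turn, its next-ranked neighbor until K candidates are collected. This is only an illustration of the criterion, not necessarily the authors' exact procedure.

```python
import numpy as np

def select_neighbors(ranked, K):
    """Pick K candidate neighbors for one image so that, for every base distance d,
    its top-k_d neighbors are all included, with k* = min{k_d} as large as possible.

    ranked : (B, M) array; ranked[d] lists training-image indices sorted by base distance d.
    """
    selected, seen = [], set()
    B, M = ranked.shape
    for r in range(M):                 # neighbor rank
        for d in range(B):             # base distance
            j = ranked[d, r]
            if j not in seen:
                seen.add(j)
                selected.append(j)
            if len(selected) == K:
                return np.array(selected)
    return np.array(selected)
```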
Note the relation of our model to the multi-class metric learning approach of [6]. In that work, a metric is learnt such that the weights $\pi_{ij}$ as defined by Eq. (5) are as close as possible, in the sense of the Kullback-Leibler (KL) divergence, to a fixed set of target weights $\rho_{ij}$. The target weights were defined to be zero for pairs from different classes, and set to a constant for all pairs from the same class. In fact, when deriving an EM algorithm for our model, we find the objective of the M-step to be of the form of a KL divergence between the $\rho_{ij}$ (fixed to values computed in the E-step) and the $\pi_{ij}$. For fixed $\rho_{ij}$ this KL divergence is convex in $w$.
3.2. Word-specific Logistic Discriminant Models
Weighted nearest neighbor approaches tend to have rel-
atively low recall scores, which is easily understood as fol-
lows. In order to receive a high probability for the presence
of a tag, it needs to be present among most neighbors with
a significant weight. This, however, is unlikely to be the
case for rare tags. So, even if we are lucky enough to have
a few neighbors annotated with the tag, we will predict the
presence with a low probability.
To overcome this, we introduce word-specific logistic
discriminant models that can boost the probability for rare
tags and decrease it for very frequent ones. The logistic
model uses weighted neighbor predictions by defining
$$p(y_{iw} = +1) = \sigma(\alpha_w x_{iw} + \beta_w), \qquad (8)$$

$$x_{iw} = \sum_j \pi_{ij}\, y_{jw}, \qquad (9)$$

where $\sigma(z) = (1 + \exp(-z))^{-1}$ and $x_{iw}$ is the weighted average of annotations for tag $w$ among the neighbors of $i$, which is equivalent to Eq. (1) up to an affine transformation. The word-specific models add 2 parameters to estimate for each word. The resulting modulated variants are referred to as σRK, σSD and σML, respectively.
For fixed $\pi_{ij}$ the model is a logistic discriminant model, the log-likelihood is concave in $\{\alpha_w, \beta_w\}$, and it can be trained per keyword. Using the new model, the gradient of the log-likelihood of the training annotations with respect to the parameters $\theta$ that control the weights equals

$$\frac{\partial L}{\partial \theta} = \sum_{i,w} c_{iw}\, \alpha_w\, \big(1 - p(y_{iw})\big)\, y_{iw}\, \frac{\partial x_{iw}}{\partial \theta}, \qquad (10)$$

and for the model based on rank or distance, respectively,

$$\frac{\partial x_{iw}}{\partial \gamma_k} = y_{n_{ik} w}, \qquad (11)$$

$$\frac{\partial x_{iw}}{\partial w} = \sum_j \pi_{ij}\, (x_{iw} - y_{jw})\, d_{ij}. \qquad (12)$$

In practice we estimate the parameters $\theta$ and $\{\alpha_w, \beta_w\}$ in an alternating fashion. We observe rapid convergence, typically after alternating the maximization three times.
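A minimal sketch of the word-specific modulation of Eqs. (8)-(9), fitting (alpha_w, beta_w) for fixed weights pi by plain gradient ascent (the names, step size and iteration count are illustrative; in practice any solver for concave logistic regression could be used, alternated with the updates of the weight parameters):

```python
import numpy as np

def fit_word_sigmoids(pi, Y, iters=200, lr=0.1):
    """Fit the word-specific parameters (alpha_w, beta_w) of Eq. (8) for fixed weights pi.

    pi : (N, N) neighbor weights over training images, Y : (N, W) labels in {-1, +1}.
    """
    N, W = Y.shape
    X = pi @ Y                                    # x_iw, Eq. (9)
    n_pos, n_neg = float((Y == 1).sum()), float((Y == -1).sum())
    c = np.where(Y == 1, 1.0 / n_pos, 1.0 / n_neg)
    alpha, beta = np.ones(W), np.zeros(W)
    for _ in range(iters):
        z = Y * (alpha * X + beta)                # margin per (image, word)
        g = c * Y / (1.0 + np.exp(z))             # d log sigma(z) / d(alpha*x + beta)
        alpha += lr * (g * X).sum(axis=0)
        beta += lr * g.sum(axis=0)
    return alpha, beta

# prediction for a test image with neighbor weights pi_test of shape (1, N):
# p = 1.0 / (1.0 + np.exp(-(alpha * (pi_test @ Y) + beta)))
```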
4. Data Sets and Experimental Setup
In this section we first present the data sets used in our
experiments, then in Section 4.2 we describe the different
features that we extract from images to compute distance
measures between images, and in Section 4.3 we discuss
the evaluation measures for image annotation and retrieval.
4.1. Data Sets
We consider three publicly available data sets that have
been used in previous work, and allow for direct compari-
son. Table 1 summarizes some statistics of these data sets,
example images are shown in Figure 1.
Corel 5k. This data set was first used in [4]. Since then,
it has become an important benchmark for keyword based
image retrieval and image annotation. It contains around
5000 images manually annotated with 1 to 5 keywords. The
vocabulary contains 260 words. A fixed set of 499 images
is used for testing, and the rest for training.
ESP Game. This data set is obtained from an online
game where two players, who cannot communicate outside the game, gain points by agreeing on words describing the image [24]. In this way the players are encouraged to provide important and meaningful tags for images. We use the subset of 20,000 images, out of the 60,000 publicly available, that was also used in [17]. This data set is very challenging, as it contains a wide variety of images, including logos, drawings, and personal photos.

Corel 5k ESP Game IAPR TC12
Vocabulary size 260 268 291
Nr. of images 4,493 18,689 17,665
Words per img. 3.4 / 5 4.7 / 15 5.7 / 23
Img. per word 58.6 / 1004 362.7 / 4553 347.7 / 4999
Table 1. Statistics of the training sets of the three data sets. Image
and word counts are given in the format mean / maximum. Statis-
tics for the test sets resemble closely those of the training sets.
IAPR TC12. This set of 20,000 images accompanied
with descriptions in several languages was initially pub-
lished for cross-lingual retrieval [8]. It can be transformed
into a format comparable to the other sets by extracting
common nouns using natural language processing tech-
niques. We use the same resulting annotation as in [17].
4.2. Feature Extraction
We extract different types of features commonly used for
image search and categorisation. We use two types of global
image descriptors: Gist features [21], and color histograms
with 16 bins in each color channel for RGB, LAB, HSV
representations. Local features include SIFT as well as a ro-
bust hue descriptor [23], both extracted densely on a multi-
scale grid or for Harris-Laplacian interest points. Each lo-
cal feature descriptor is quantized using k-means on sam-
ples from the training set. Images are then represented as a
‘bag-of-words’ histogram. All descriptors but Gist are L1-
normalised and also computed in a spatial arrangement [14].
We compute the histograms over three horizontal regions of
the image, and concatenate them to form a new global de-
scriptor, albeit one that encodes some information of the
spatial layout of the image. To limit color histogram sizes,
here, we reduced the quantization to 12 bins in each chan-
nel. Note that this spatial binning differs from segmented
image regions, as used in some previous work.
This results in 15 distinct descriptors, namely one Gist
descriptor, 6 color histograms and 8 bag-of-features (2 feature types × 2 descriptors × 2 layouts). To compute the
distances from the descriptors we follow previous work and
use L2 as the base metric for Gist, L1 for global color his-
tograms, and χ² for the others.
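As an illustration of these base metrics, a short sketch (the feature names and shapes are assumptions, not the authors' code) computing one distance per descriptor type between two images:

```python
import numpy as np

def base_distances(gist_a, gist_b, color_a, color_b, bow_a, bow_b):
    """Base distances between two images, following the metrics named above:
    L2 for Gist, L1 for color histograms, chi-squared for bag-of-words histograms.
    Each argument is a 1-D feature vector."""
    d_gist = np.sqrt(np.sum((gist_a - gist_b) ** 2))                # L2
    d_color = np.sum(np.abs(color_a - color_b))                     # L1
    denom = bow_a + bow_b
    mask = denom > 0
    d_chi2 = 0.5 * np.sum((bow_a[mask] - bow_b[mask]) ** 2 / denom[mask])  # chi-squared
    return np.array([d_gist, d_color, d_chi2])
```

The 15 per-descriptor distances obtained this way form the vector d_ij that Section 3.1 combines with the learned coefficients w.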
4.3. Evaluation Measures
We evaluate our models with standard performance mea-
sures, used in previous work, that evaluate retrieval perfor-
mance per keyword, and then average over keywords.
Precision and recall for fixed annotation length.
Following [4], each image is annotated with the 5 most rel-
evant keywords. Then, the mean precision P and recall R
over keywords are computed. N+ is used to denote the num-
ber of keywords with non-zero recall value. Note that each
image is forced to be annotated with 5 keywords, even if
the image has fewer or more keywords in the ground truth.
Therefore, even if a model predicts all ground-truth key-
words with a significantly higher probability than other key-
words, we will not measure perfect precision and recall.
Precision at different levels of recall. We also eval-
uate precision at different levels of recall as in [7]. The
break-even point (BEP), or R-precision, measures for each
keyword $w$ the precision among the top $n_w$ retrieved images, where $n_w$ is the number of images annotated with this
keyword in the ground truth. The mean average precision
(mAP) over keywords is found by computing for each key-
word the average of the precisions measured after each rel-
evant image is retrieved.
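The following sketch (hypothetical names; tie-breaking details may differ from the evaluation scripts used in the literature) computes the fixed-length annotation measures P, R and N+, as well as R-precision and average precision for a single keyword:

```python
import numpy as np

def annotation_scores(prob, Y_true, n_tags=5):
    """Mean per-keyword precision/recall when each image receives its n_tags highest-scoring
    keywords, plus N+ (number of keywords with non-zero recall).

    prob : (N, W) predicted relevance, Y_true : (N, W) ground truth in {0, 1}."""
    N, W = prob.shape
    pred = np.zeros_like(Y_true)
    top = np.argsort(-prob, axis=1)[:, :n_tags]
    pred[np.arange(N)[:, None], top] = 1
    tp = (pred * Y_true).sum(axis=0).astype(float)
    precision = np.divide(tp, pred.sum(axis=0), out=np.zeros(W), where=pred.sum(axis=0) > 0)
    recall = np.divide(tp, Y_true.sum(axis=0), out=np.zeros(W), where=Y_true.sum(axis=0) > 0)
    return precision.mean(), recall.mean(), int((recall > 0).sum())

def r_precision(scores, relevant):
    """Break-even point / R-precision: precision among the top n_w ranked images."""
    n_w = int(relevant.sum())
    top = np.argsort(-scores)[:n_w]
    return relevant[top].mean() if n_w > 0 else 0.0

def average_precision(scores, relevant):
    """AP for one keyword: mean of the precisions measured at each relevant image in the ranking."""
    order = np.argsort(-scores)
    rel = relevant[order]
    hits = np.cumsum(rel)
    return (hits[rel > 0] / (np.where(rel > 0)[0] + 1.0)).mean() if rel.sum() > 0 else 0.0
```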
5. Experimental Results
In this section we present a quantitative evaluation of
TagProp and compare to previous work, qualitative results
can be found in Figure 1. We first give a detailed presenta-
tion of results obtained on the Corel 5k data set, and com-
pare them to previous work. In Section 5.2 we present our
results for the IAPR TC12 and ESP Game data sets. Results
for multi-word image retrieval are presented in Section 5.3.
5.1. Results for the Corel 5k data set
In a first set of experiments we compare the different
variants of TagProp and compare them to the original re-
sults of [17], referred to as JEC, and also using our own
features (JEC-15). That is, we take an equally weighted
combination of our 15 normalized base distances to define
image similarity.
From the results in Table 2 we can make several obser-
vations. First, using the tag transfer method proposed in
[17] with our own features we obtain results very similar to
the original work. Thus, other performance differences ob-
tained using our methods must be due to the tag prediction
methods. Our models that use this fixed distance combina-
tion to define weights (either directly in SD or using ranks
in RK) perform comparably. Among these results, the ones
of the sigmoidal model using distance-based weights (σSD)
are the best, and they show a modest improvement over the
results obtained with the more ad hoc JEC-15.
More importantly, using our models that integrate met-
ric learning (ML and σML), much larger improvements are
obtained, in particular using the σML variant. Compared to
the current state-of-the-art method using the same features,
we obtain marked improvements of 5% in precision, 9% in
recall, and count 20 more words with positive recall. This
result shows clearly that nearest neighbor type tag predic-
tion can benefit from metric learning. Interestingly, earlier
efforts to exploit metric learning did not succeed [17], cf. Section 2. The key to our successful use of metric learning
is its integration in the prediction model.

References

Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories (conference paper).
Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope (journal article).
Labeling images with a computer game (conference paper).
Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary (book chapter).
Matching words and pictures (journal article).