
TagProp: Discriminative Metric Learning
in Nearest Neighbor Models for Image Auto-Annotation
Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek and Cordelia Schmid
LEAR, INRIA Grenoble - Laboratoire Jean Kuntzmann
firstname.lastname@inrialpes.fr
Abstract
Image auto-annotation is an important open problem in
computer vision. For this task we propose TagProp, a dis-
criminatively trained nearest neighbor model. Tags of test
images are predicted using a weighted nearest-neighbor
model to exploit labeled training images. Neighbor weights
are based on neighbor rank or distance. TagProp allows
the integration of metric learning by directly maximizing
the log-likelihood of the tag predictions in the training set.
In this manner, we can optimally combine a collection of
image similarity metrics that cover different aspects of im-
age content, such as local shape descriptors, or global
color histograms. We also introduce a word specific sig-
moidal modulation of the weighted neighbor tag predictions
to boost the recall of rare words. We investigate the perfor-
mance of different variants of our model and compare to ex-
isting work. We present experimental results for three chal-
lenging data sets. On all three, TagProp makes a marked
improvement as compared to the current state-of-the-art.
1. Introduction
Image auto-annotation is an active subject of research [7, 15, 16, 18]. The goal is to develop methods that can predict
for a new image the relevant keywords from an annotation
vocabulary. These keyword predictions can be used either
to propose tags for an image, or to propose images for a
tag or a combination of tags. Such methods are becoming
more and more important given the growing collections of
user-provided visual content, e.g. on photo or video sharing
sites, and desktop photo management applications. These
large-scale collections feed the demand for automatic re-
trieval and annotation methods. Since the amount of images
with more or less structured annotations is also increasing,
this allows the deployment of machine learning techniques
to leverage this potential by estimating accurate tag predic-
tion models.
Although the general problem is a difficult one, progress
has been made in the research community by evaluations
on standardized annotated data sets. In the next section we
will detail the related work that is most closely linked to
ours. The main shortcomings of existing work are twofold.
First, models are often estimated to maximize generative
likelihood of image features and annotations, which might
not be optimal for tag prediction. Second, many para-
metric models are not rich enough to accurately capture
the intricate dependencies between image content and an-
notations. Non-parametric nearest neighbor like methods
have been found to be quite successful for tag prediction
[5, 11, 13, 17, 22, 27]. This is mainly due to the high ‘capac-
ity’ of such models: they can adapt flexibly to the patterns in
the data as more data is available. However, existing nearest
neighbor type methods do not allow for integrated learning
of the metric that defines the nearest neighbors in order to
maximize the predictive performance of the model. Either
a fixed metric [5, 27] or ad hoc combinations of several metrics [17] are used, despite much recent work showing the
benefits of metric learning for many computer vision tasks
such as image classification [12], image retrieval [10], or
visual identification [9].
In this paper we present TagProp, short for Tag Propaga-
tion, a new nearest neighbor type model that predicts tags by
taking a weighted combination of the tag absence/presence
among neighbors. Our contributions are the following.
First, the weights for neighbors are either determined based
on the neighbor rank or its distance, and set automatically
by maximizing the likelihood of annotations in a set of
training images. With rank based weights the k-th neigh-
bor always receives a fixed weight, whereas distance based
weights decay exponentially with the distance. Our tag pre-
diction model is conceptually simple, yet outperforms the
current state-of-the-art methods using the same feature set.
Second, contrary to earlier work, our model allows the in-
tegration of metric learning. This enables us to optimize
e.g. a Mahalanobis metric between image features or,
less costly, a combination of several distance measures
to define the neighbor weights for the tag prediction task.

[Figure 1: example test images from the Corel 5k, ESP Game, and IAPR TC12 data sets, each shown with its ground-truth tags and the five highest-ranked predicted tags with their relevance scores; see the caption below.]
Figure 1. Example test images from the three data sets. Next to each image, we show the ground truth annotation (left), and the five tags
with highest relevance predictions (correct ones are underlined) given by our TagProp model (σML variant with K = 200). Note the large
diversity between the data sets, and that the ground truth annotations do not always contain all relevant tags (e.g. ‘water’ for the bottom
left image), and sometimes contain tags for which one can argue whether they are relevant or not (e.g. ‘lot’ for the bottom right image).
Third, TagProp includes word-specific logistic discriminant
models. These models use the tag predictions of the word-
invariant models as inputs and are able, using just two pa-
rameters per word, to boost or suppress the tag presence
probabilities for very frequent or rare words. This results
in a significant increase in the number of words that are re-
called, i.e. assigned to at least one test image.
To evaluate our models and to compare to previous work,
we use three data sets (Corel 5k, IAPR TC12 and ESP Game) and standard measures including precision, recall,
mean average precision and break-even point. In Figure 1
we show several examples of images with their annota-
tions, and predictions from our model. On all data sets and
measures we show significantly improved accuracy of our
method as compared to earlier work.
The rest of this paper is organized as follows. In the next
section we give an overview of the related work. Then, in
Section 3, we present our tag prediction models, and how
we estimate their parameters. In Section 4 we present the
three data sets, evaluation criteria as well as the image fea-
tures we use in our experiments. The experimental results
are presented in Section 5. In Section 6 we present our con-
clusions and directions for further research.
2. Related Work
In this section we discuss models for image annotation
and keyword based retrieval most relevant for our work. We
identify four main groups of methods: those based on topic
models or mixture models, discriminatively trained ones,
and nearest neighbor type models.
The first group of methods are based on topic models
such as latent Dirichlet allocation, probabilistic latent se-
mantic analysis, and hierarchical Dirichlet processes, see
e.g. [1, 20, 25]. These methods model annotated images
as samples from a specific mix of topics, where each topic
is a distribution over image features and annotation words.
Parameter estimation involves estimating the topic mix for
each image, and estimating the data distributions of the top-
ics. Most often, a multinomial distribution over words is
used, and a Gaussian over visual features from different re-
gions of the image. Methods inspired by machine trans-
lation [4], in this case translating from discrete visual fea-
tures to the annotation vocabulary, can also be understood
as topic models, using one topic per visual descriptor type.
A second family of methods uses mixture models to de-
fine a joint distribution over image features and annota-
tion tags. To annotate a new image, these models com-
pute the conditional probability over tags given the visual
features by normalising the joint likelihood. Sometimes a
fixed number of mixture components over visual features
per keyword is used [2], while other models use the training
images as components to define a mixture model over visual
features and tags [5, 11, 13]. Each training image defines
a likelihood over visual features and tags by a smoothed
distribution around the observed values. These models can
be seen as non-parametric density estimators over the co-
occurrence of images and annotations. For visual features
Gaussians are used, while the distributions over annotations
are multinomials, or separate Bernoullis for each word.
Both families of generative models discussed above may
be criticized because they maximize the generative data
likelihood, which is not necessarily optimal for predictive
performance. Therefore, discriminative models for tag pre-
diction have also been proposed [3, 7, 10]. These methods
learn a separate classifier for each tag, and use these to pre-
dict for each test image whether it belongs to the class of
images that are annotated with each particular tag. Different
learning methods have been used, including support vector
machines, multiple-instance learning, and Bayes point ma-
chines. Notable is [7] which also addresses the problem of
retrieving images based on multi-word queries.
Given the increasing amount of training data that is
currently available, local learning techniques are becom-
ing more attractive as a simple yet powerful alternative to

parametric models. Examples of such techniques include
methods based on label diffusion over a similarity graph
of labeled and unlabeled images [16, 22], or learning dis-
criminative models in neighborhoods of test images [27].
A simpler ad hoc nearest-neighbor tag transfer mechanism
was recently introduced [17], showing state-of-the-art per-
formance. There, nearest neighbors are determined by the
average of several distances computed from different visual
features. The authors also combine the base distances by
learning a binary classifier separating image pairs that have
several tags in common from images that do not share any
tags. However, this linear distance combination did not give
better results than an equally weighted combination.
3. Tag Relevance Prediction Models
Our goal is to predict the relevance of annotation tags
for images. Given these relevance predictions we can an-
notate images by ranking the tags for a given image, or
do keyword based retrieval by ranking images for a given
tag. Our proposed method is based on a weighted nearest
neighbor approach, inspired by recent successful methods
[5, 11, 13, 17], that propagate the annotations of training
images to new images. Our models are learnt in a discrimi-
native manner, rather than using held-out data [5], or using
neighbors in an ad hoc manner [17]. We assume that some
visual similarity or distance measures between images are
given, abstracting away from their precise definition.
3.1. Weighted Nearest Neighbor Tag Prediction
To model image annotations, we use Bernoulli models
for each keyword. This choice is natural because keywords,
unlike natural text where word frequency is meaningful, are
either present or absent. The dependencies between key-
words in the training data are not explicitly modeled, but
are implicitly exploited in our model.
We use $y_{iw} \in \{-1, +1\}$ to denote the absence/presence of keyword $w$ for image $i$, hence encoding the image annotations. The tag presence prediction $p(y_{iw} = +1)$ for image $i$ is a weighted sum over the training images, indexed by $j$:

$$p(y_{iw} = +1) = \sum_j \pi_{ij}\, p(y_{iw} = +1 \mid j), \qquad (1)$$

$$p(y_{iw} = +1 \mid j) = \begin{cases} 1 - \epsilon & \text{for } y_{jw} = +1, \\ \epsilon & \text{otherwise}, \end{cases} \qquad (2)$$

where $\pi_{ij}$ denotes the weight of image $j$ for predicting the tags of image $i$. We require that $\pi_{ij} \geq 0$ and $\sum_j \pi_{ij} = 1$. We use $\epsilon$ to avoid zero prediction probabilities, and in practice we set $\epsilon = 10^{-5}$. To estimate the parameters that control the weights $\pi_{ij}$ we maximize the log-likelihood of the predictions of training annotations. Taking care to set the weight of training images to themselves to zero, i.e. $\pi_{ii} = 0$, our objective is to maximize

$$L = \sum_{i,w} c_{iw} \ln p(y_{iw}), \qquad (3)$$

where $c_{iw}$ is a cost that takes into account the imbalance between keyword presence and absence. Indeed, in practice, there are many more tag absences than presences, and absences are much noisier than presences. This is because most tags in annotations are relevant, but often the annotation does not include all relevant tags. We set $c_{iw} = 1/n^{+}$ if $y_{iw} = +1$, where $n^{+}$ is the total number of positive labels, and likewise $c_{iw} = 1/n^{-}$ when $y_{iw} = -1$.
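To make Eqs. (1)-(3) concrete, here is a minimal NumPy sketch (not the authors' code; the function names, the annotation matrix Y in {-1,+1} and the weight matrix pi are illustrative) of the weighted-neighbor tag probabilities and of the weighted log-likelihood used as the training objective:

```python
import numpy as np

def tag_probabilities(pi, Y, eps=1e-5):
    """Eqs. (1)-(2): p(y_iw = +1) as a weighted vote over training images.

    pi  : (N, M) neighbor weights, rows sum to 1 (with pi[i, i] = 0 for training images)
    Y   : (M, W) training annotations in {-1, +1}
    eps : floor that avoids zero probabilities
    """
    return pi @ np.where(Y == 1, 1.0 - eps, eps)        # (N, W) array of p(y_iw = +1)

def weighted_log_likelihood(pi, Y, eps=1e-5):
    """Eq. (3): sum_{i,w} c_iw * ln p(y_iw), evaluated on the training images,
    so pi is (N, N) and Y is (N, W); c_iw = 1/n+ for presences, 1/n- for absences."""
    p_pos = tag_probabilities(pi, Y, eps)
    n_pos, n_neg = float((Y == 1).sum()), float((Y == -1).sum())
    c = np.where(Y == 1, 1.0 / n_pos, 1.0 / n_neg)
    p_true = np.where(Y == 1, p_pos, 1.0 - p_pos)        # probability of the observed label
    return float((c * np.log(p_true)).sum())
```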
Rank-based weights. In the case of rank-based weights over $K$ neighbors we set $\pi_{ij} = \gamma_k$ if $j$ is the $k$-th nearest neighbor of $i$. The data log-likelihood (3) is concave in the parameters $\gamma_k$, which can be estimated using an EM algorithm or a projected-gradient algorithm. The derivative of Eq. (3) with respect to $\gamma_k$ equals

$$\frac{\partial L}{\partial \gamma_k} = \sum_{i,w} c_{iw} \frac{p(y_{iw} \mid n_{ik})}{p(y_{iw})}, \qquad (4)$$

where $n_{ik}$ denotes the index of the $k$-th neighbor of image $i$. The number of parameters equals the neighborhood size $K$. We refer to this variant as RK, for "rank-based".
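As an illustration of the RK variant, the sketch below (hypothetical names; a simplex-constrained gamma is assumed to be given) builds the weight matrix pi from precomputed neighbor ranks; learning gamma itself would maximize Eq. (3), e.g. by projected gradient using Eq. (4).

```python
import numpy as np

def rank_based_weights(neighbor_idx, gamma, n_train):
    """Rank-based weights: pi_ij = gamma_k when j is the k-th neighbor of i.

    neighbor_idx : (N, K) array, neighbor_idx[i, k] = index of the k-th nearest training image of i
    gamma        : (K,) non-negative weights summing to 1 (one weight per rank)
    """
    N, K = neighbor_idx.shape
    pi = np.zeros((N, n_train))
    rows = np.repeat(np.arange(N), K)                 # row index for each (image, rank) pair
    pi[rows, neighbor_idx.ravel()] = np.tile(gamma, N)
    return pi
```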
Distance-based weights. The other possibility is to define the weights directly as a function of the distance, rather than the rank. This has the advantage that weights will depend smoothly on the distance, which is crucial if the distance is to be adjusted during training. The weights of training images $j$ for an image $i$ are redefined as

$$\pi_{ij} = \frac{\exp(-d_\theta(i,j))}{\sum_{j'} \exp(-d_\theta(i,j'))}, \qquad (5)$$

where $d_\theta$ is a distance metric with parameters $\theta$ that we want to optimize. Note that the weights $\pi_{ij}$ decay exponentially with the distance $d_\theta$ to image $i$. Choices for $d_\theta$ include Mahalanobis distances $d_M$ parametrized by a semi-definite matrix $M$, and $d_w(i,j) = w^\top d_{ij}$, where $d_{ij}$ is a vector of base distances between images $i$ and $j$, and $w$ contains the positive coefficients of the linear distance combination. The number of parameters then equals the number of base distances that are combined. In the rest of the paper we focus on this particular case. When we use a single distance, referred to as the SD variant, $w$ is a scalar that controls the decay of the weights with distance, and it is the only parameter of the model. When multiple distances are used, the variant is referred to as ML, for "metric learning".

Again, rather than using an EM algorithm, we directly maximize the log-likelihood using a projected gradient algorithm under positivity constraints on the elements of $w$.
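A small sketch of Eq. (5) for the ML variant, assuming a tensor D of precomputed base distances (names are illustrative; in the full model pi_ii would additionally be forced to zero and only the K pre-selected neighbors would be considered):

```python
import numpy as np

def distance_based_weights(D, w):
    """Eq. (5): softmax weights from a linear combination of base distances.

    D : (N, M, B) array of B base distances for each test/train pair (i, j)
    w : (B,) non-negative combination coefficients (B = 1 gives the SD variant)
    """
    d = D @ w                                    # combined distance d_w(i, j) = w^T d_ij
    d = d - d.min(axis=1, keepdims=True)         # per-row shift for numerical stability
    expd = np.exp(-d)
    return expd / expd.sum(axis=1, keepdims=True)
```

With a single base distance this reduces to the SD variant, where the scalar coefficient only controls how fast the weights decay with distance.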

Using the new definition of the weights, the gradient of the log-likelihood Eq. (3) with respect to $w$ equals

$$\frac{\partial L}{\partial w} = \sum_{i,j} W_i\, (\pi_{ij} - \rho_{ij})\, d_{ij}, \qquad (6)$$

where $W_i = \sum_w c_{iw}$, and $\rho_{ij}$ denotes the weighted average over all words $w$ of the posterior probability of neighbor $j$ for image $i$ given the annotation:

$$\rho_{ij} = \sum_w \frac{c_{iw}}{W_i}\, p(j \mid y_{iw}). \qquad (7)$$
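The following sketch spells out Eqs. (6)-(7) as code, again with illustrative names and an explicit loop over images for readability rather than efficiency; it is not the authors' implementation.

```python
import numpy as np

def grad_w(w, D, Y, eps=1e-5):
    """Gradient of the log-likelihood w.r.t. the distance-combination weights w, Eqs. (6)-(7).

    D : (N, N, B) base distances between training images, Y : (N, W) labels in {-1, +1}.
    """
    N, _, B = D.shape
    d = D @ w                                             # combined distances, Eq. (5)
    expd = np.exp(-(d - d.min(axis=1, keepdims=True)))
    np.fill_diagonal(expd, 0.0)                           # pi_ii = 0
    pi = expd / expd.sum(axis=1, keepdims=True)

    n_pos, n_neg = float((Y == 1).sum()), float((Y == -1).sum())
    c = np.where(Y == 1, 1.0 / n_pos, 1.0 / n_neg)
    grad = np.zeros(B)
    for i in range(N):
        W_i = c[i].sum()                                  # W_i = sum_w c_iw
        agree = np.where(Y == Y[i], 1.0 - eps, eps)       # p(y_iw | j) for the observed y_iw
        p = pi[i] @ agree                                 # p(y_iw), Eq. (1)
        rho_i = pi[i] * (agree @ (c[i] / (W_i * p)))      # Eq. (7)
        grad += W_i * ((pi[i] - rho_i) @ D[i])            # Eq. (6)
    return grad

# one projected-gradient ascent step keeps the coefficients non-negative:
# w = np.maximum(w + step_size * grad_w(w, D, Y), 0.0)
```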
To reduce the computational cost of training the model, we do not compute all pairwise $\pi_{ij}$ and $\rho_{ij}$. Rather, for each $i$ we compute them only over a large set of neighbors, and assume the remaining $\pi_{ij}$ and $\rho_{ij}$ to be zero. For each $i$, we select $K$ neighbors such that we maximise $k^{*} = \min\{k_d\}$, where $k_d$ is the largest neighbor rank for which neighbors 1 to $k_d$ of base distance $d$ are included among the selected neighbors. In this way we are likely to include all images with large $\pi_{ij}$, regardless of the distance combination $w$ that is learnt. Therefore, after determining these neighborhoods, our algorithm scales linearly with the number of training images.
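One possible reading of this neighbor pre-selection, sketched below with hypothetical inputs (a per-base-distance ranking of the training images, with self-matches already removed), is to sweep the ranks and add, for each base distance in turn, its next-ranked neighbor until K candidates are collected. This is only an illustration of the criterion, not necessarily the authors' exact procedure.

```python
import numpy as np

def select_neighbors(ranked, K):
    """Pick K candidate neighbors for one image so that, for every base distance d,
    its top-k_d neighbors are all included, with k* = min{k_d} as large as possible.

    ranked : (B, M) array; ranked[d] lists training-image indices sorted by base distance d.
    """
    selected, seen = [], set()
    B, M = ranked.shape
    for r in range(M):                 # neighbor rank
        for d in range(B):             # base distance
            j = ranked[d, r]
            if j not in seen:
                seen.add(j)
                selected.append(j)
            if len(selected) == K:
                return np.array(selected)
    return np.array(selected)
```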
Note the relation of our model to the multi-class metric learning approach of [6]. In that work, a metric is learnt such that the weights $\pi_{ij}$ as defined by Eq. (5) are as close as possible, in the sense of the Kullback-Leibler (KL) divergence, to a fixed set of target weights $\rho_{ij}$. The target weights were defined to be zero for pairs from different classes, and set to a constant for all pairs from the same class. In fact, when deriving an EM algorithm for our model, we find the objective of the M-step to be of the form of a KL divergence between the $\rho_{ij}$ (fixed to values computed in the E-step) and the $\pi_{ij}$. For fixed $\rho_{ij}$ this KL divergence is convex in $w$.
3.2. Word-specific Logistic Discriminant Models
Weighted nearest neighbor approaches tend to have rel-
atively low recall scores, which is easily understood as fol-
lows. In order to receive a high probability for the presence
of a tag, it needs to be present among most neighbors with
a significant weight. This, however, is unlikely to be the
case for rare tags. So, even if we are lucky enough to have
a few neighbors annotated with the tag, we will predict the
presence with a low probability.
To overcome this, we introduce word-specific logistic
discriminant models that can boost the probability for rare
tags and decrease it for very frequent ones. The logistic
model uses weighted neighbor predictions by defining
$$p(y_{iw} = +1) = \sigma(\alpha_w x_{iw} + \beta_w), \qquad (8)$$

$$x_{iw} = \sum_j \pi_{ij}\, y_{jw}, \qquad (9)$$

where $\sigma(z) = (1 + \exp(-z))^{-1}$ and $x_{iw}$ is the weighted average of annotations for tag $w$ among the neighbors of $i$, which is equivalent to Eq. (1) up to an affine transformation. The word-specific models add 2 parameters to estimate for each word. The resulting modulated variants are referred to as σRK, σSD and σML, respectively.
For fixed $\pi_{ij}$ the model is a logistic discriminant model, the log-likelihood is concave in $\{\alpha_w, \beta_w\}$, and it can be trained per keyword. Using the new model, the gradient of the log-likelihood of the training annotations with respect to the parameters $\theta$ that control the weights equals

$$\frac{\partial L}{\partial \theta} = \sum_{i,w} c_{iw}\, \alpha_w\, \big(1 - p(y_{iw})\big)\, y_{iw}\, \frac{\partial x_{iw}}{\partial \theta}, \qquad (10)$$

and for the model based on rank or distance, respectively,

$$\frac{\partial x_{iw}}{\partial \gamma_k} = y_{n_{ik} w}, \qquad (11)$$

$$\frac{\partial x_{iw}}{\partial w} = \sum_j \pi_{ij}\, (x_{iw} - y_{jw})\, d_{ij}. \qquad (12)$$

In practice we estimate the parameters $\theta$ and $\{\alpha_w, \beta_w\}$ in an alternating fashion. We observe rapid convergence, typically after alternating the maximization three times.
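A minimal sketch of the word-specific modulation of Eqs. (8)-(9), fitting (alpha_w, beta_w) for fixed weights pi by plain gradient ascent (the names, step size and iteration count are illustrative; in practice any solver for concave logistic regression could be used, alternated with the updates of the weight parameters):

```python
import numpy as np

def fit_word_sigmoids(pi, Y, iters=200, lr=0.1):
    """Fit the word-specific parameters (alpha_w, beta_w) of Eq. (8) for fixed weights pi.

    pi : (N, N) neighbor weights over training images, Y : (N, W) labels in {-1, +1}.
    """
    N, W = Y.shape
    X = pi @ Y                                    # x_iw, Eq. (9)
    n_pos, n_neg = float((Y == 1).sum()), float((Y == -1).sum())
    c = np.where(Y == 1, 1.0 / n_pos, 1.0 / n_neg)
    alpha, beta = np.ones(W), np.zeros(W)
    for _ in range(iters):
        z = Y * (alpha * X + beta)                # margin per (image, word)
        g = c * Y / (1.0 + np.exp(z))             # d log sigma(z) / d(alpha*x + beta)
        alpha += lr * (g * X).sum(axis=0)
        beta += lr * g.sum(axis=0)
    return alpha, beta

# prediction for a test image with neighbor weights pi_test of shape (1, N):
# p = 1.0 / (1.0 + np.exp(-(alpha * (pi_test @ Y) + beta)))
```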
4. Data Sets and Experimental Setup
In this section we first present the data sets used in our
experiments, then in Section 4.2 we describe the different
features that we extract from images to compute distance
measures between images, and in Section 4.3 we discuss
the evaluation measures for image annotation and retrieval.
4.1. Data Sets
We consider three publicly available data sets that have
been used in previous work, and allow for direct compari-
son. Table 1 summarizes some statistics of these data sets,
example images are shown in Figure 1.
Corel 5k. This data set was first used in [4]. Since then,
it has become an important benchmark for keyword based
image retrieval and image annotation. It contains around
5000 images manually annotated with 1 to 5 keywords. The
vocabulary contains 260 words. A fixed set of 499 images
is used for testing, and the rest for training.
ESP Game. This data set is obtained from an online
game where two players, who cannot communicate outside the game, gain points by agreeing on words describing the image [24]. In this way the players are encouraged to provide important and meaningful tags for images. We use the subset of 20,000 images, out of the 60,000 publicly available, that was also used in [17]. This data set is very challenging, as it contains a wide variety of images, including logos, drawings, and personal photos.

Corel 5k ESP Game IAPR TC12
Vocabulary size 260 268 291
Nr. of images 4,493 18,689 17,665
Words per img. 3.4 / 5 4.7 / 15 5.7 / 23
Img. per word 58.6 / 1004 362.7 / 4553 347.7 / 4999
Table 1. Statistics of the training sets of the three data sets. Image
and word counts are given in the format mean / maximum. Statis-
tics for the test sets resemble closely those of the training sets.
IAPR TC12. This set of 20,000 images accompanied
with descriptions in several languages was initially pub-
lished for cross-lingual retrieval [8]. It can be transformed
into a format comparable to the other sets by extracting
common nouns using natural language processing tech-
niques. We use the same resulting annotation as in [17].
4.2. Feature Extraction
We extract different types of features commonly used for
image search and categorisation. We use two types of global
image descriptors: Gist features [21], and color histograms
with 16 bins in each color channel for RGB, LAB, HSV
representations. Local features include SIFT as well as a ro-
bust hue descriptor [23], both extracted densely on a multi-
scale grid or for Harris-Laplacian interest points. Each lo-
cal feature descriptor is quantized using k-means on sam-
ples from the training set. Images are then represented as a
‘bag-of-words’ histogram. All descriptors but Gist are L1-
normalised and also computed in a spatial arrangement [14].
We compute the histograms over three horizontal regions of
the image, and concatenate them to form a new global de-
scriptor, albeit one that encodes some information of the
spatial layout of the image. To limit color histogram sizes,
here, we reduced the quantization to 12 bins in each chan-
nel. Note that this spatial binning differs from segmented
image regions, as used in some previous work.
This results in 15 distinct descriptors, namely one Gist
descriptor, 6 color histograms and 8 bag-of-features (2 feature types × 2 descriptors × 2 layouts). To compute the
distances from the descriptors we follow previous work and
use L2 as the base metric for Gist, L1 for global color his-
tograms, and χ² for the others.
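As an illustration of these base metrics, a short sketch (the feature names and shapes are assumptions, not the authors' code) computing one distance per descriptor type between two images:

```python
import numpy as np

def base_distances(gist_a, gist_b, color_a, color_b, bow_a, bow_b):
    """Base distances between two images, following the metrics named above:
    L2 for Gist, L1 for color histograms, chi-squared for bag-of-words histograms.
    Each argument is a 1-D feature vector."""
    d_gist = np.sqrt(np.sum((gist_a - gist_b) ** 2))                # L2
    d_color = np.sum(np.abs(color_a - color_b))                     # L1
    denom = bow_a + bow_b
    mask = denom > 0
    d_chi2 = 0.5 * np.sum((bow_a[mask] - bow_b[mask]) ** 2 / denom[mask])  # chi-squared
    return np.array([d_gist, d_color, d_chi2])
```

The 15 per-descriptor distances obtained this way form the vector d_ij that Section 3.1 combines with the learned coefficients w.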
4.3. Evaluation Measures
We evaluate our models with standard performance mea-
sures, used in previous work, that evaluate retrieval perfor-
mance per keyword, and then average over keywords.
Precision and recall for fixed annotation length.
Following [4], each image is annotated with the 5 most rel-
evant keywords. Then, the mean precision P and recall R
over keywords are computed. N+ is used to denote the num-
ber of keywords with non-zero recall value. Note that each
image is forced to be annotated with 5 keywords, even if
the image has fewer or more keywords in the ground truth.
Therefore, even if a model predicts all ground-truth key-
words with a significantly higher probability than other key-
words, we will not measure perfect precision and recall.
Precision at different levels of recall. We also eval-
uate precision at different levels of recall as in [7]. The
break-even point (BEP), or R-precision, measures for each
keyword $w$ the precision among the top $n_w$ retrieved images, where $n_w$ is the number of images annotated with this
keyword in the ground truth. The mean average precision
(mAP) over keywords is found by computing for each key-
word the average of the precisions measured after each rel-
evant image is retrieved.
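The following sketch (hypothetical names; tie-breaking details may differ from the evaluation scripts used in the literature) computes the fixed-length annotation measures P, R and N+, as well as R-precision and average precision for a single keyword:

```python
import numpy as np

def annotation_scores(prob, Y_true, n_tags=5):
    """Mean per-keyword precision/recall when each image receives its n_tags highest-scoring
    keywords, plus N+ (number of keywords with non-zero recall).

    prob : (N, W) predicted relevance, Y_true : (N, W) ground truth in {0, 1}."""
    N, W = prob.shape
    pred = np.zeros_like(Y_true)
    top = np.argsort(-prob, axis=1)[:, :n_tags]
    pred[np.arange(N)[:, None], top] = 1
    tp = (pred * Y_true).sum(axis=0).astype(float)
    precision = np.divide(tp, pred.sum(axis=0), out=np.zeros(W), where=pred.sum(axis=0) > 0)
    recall = np.divide(tp, Y_true.sum(axis=0), out=np.zeros(W), where=Y_true.sum(axis=0) > 0)
    return precision.mean(), recall.mean(), int((recall > 0).sum())

def r_precision(scores, relevant):
    """Break-even point / R-precision: precision among the top n_w ranked images."""
    n_w = int(relevant.sum())
    top = np.argsort(-scores)[:n_w]
    return relevant[top].mean() if n_w > 0 else 0.0

def average_precision(scores, relevant):
    """AP for one keyword: mean of the precisions measured at each relevant image in the ranking."""
    order = np.argsort(-scores)
    rel = relevant[order]
    hits = np.cumsum(rel)
    return (hits[rel > 0] / (np.where(rel > 0)[0] + 1.0)).mean() if rel.sum() > 0 else 0.0
```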
5. Experimental Results
In this section we present a quantitative evaluation of
TagProp and compare to previous work, qualitative results
can be found in Figure 1. We first give a detailed presenta-
tion of results obtained on the Corel 5k data set, and com-
pare them to previous work. In Section 5.2 we present our
results for the IAPR TC12 and ESP Game data sets. Results
for multi-word image retrieval are presented in Section 5.3.
5.1. Results for the Corel 5k data set
In a first set of experiments we compare the different
variants of TagProp and compare them to the original re-
sults of [17], referred to as JEC, and also using our own
features (JEC-15). That is, we take an equally weighted
combination of our 15 normalized base distances to define
image similarity.
From the results in Table 2 we can make several obser-
vations. First, using the tag transfer method proposed in
[17] with our own features we obtain results very similar to
the original work. Thus, other performance differences ob-
tained using our methods must be due to the tag prediction
methods. Our models that use this fixed distance combina-
tion to define weights (either directly in SD or using ranks
in RK) perform comparably. Among these results, the ones
of the sigmoidal model using distance-based weights (σSD)
are the best, and they show a modest improvement over the
results obtained with the more ad hoc JEC-15.
More importantly, using our models that integrate met-
ric learning (ML and σML), much larger improvements are
obtained, in particular using the σML variant. Compared to
the current state-of-the-art method using the same features,
we obtain marked improvements of 5% in precision, 9% in
recall, and count 20 more words with positive recall. This
result shows clearly that nearest neighbor type tag predic-
tion can benefit from metric learning. Interestingly, earlier
efforts to exploit metric learning did not succeed [17], cf. Section 2. The key to our successful use of metric learning
is its integration in the prediction model.

References

Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories (conference paper).
Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope (journal article).
Labeling images with a computer game (conference paper).
Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary (book chapter).
Matching words and pictures (journal article).