Book Chapter

# Image annotation using metric learning in semantic neighbourhoods

07 Oct 2012, pp. 836-849

TL;DR: 2PKNN, a two-step variant of the classical K-nearest neighbour algorithm, is proposed; it performs comparably to the current state of the art on three challenging image annotation datasets and shows significant improvements after metric learning.

Abstract: Automatic image annotation aims at predicting a set of textual labels for an image that describe its semantics. These are usually taken from an annotation vocabulary of a few hundred labels. Because of the large vocabulary, there is a high variance in the number of images corresponding to different labels ("class-imbalance"). Additionally, due to the limitations of manual annotation, a significant number of available images are not annotated with all the relevant labels ("weak-labelling"). These two issues badly affect the performance of most of the existing image annotation models. In this work, we propose 2PKNN, a two-step variant of the classical K-nearest neighbour algorithm, that addresses these two issues in the image annotation task. The first step of 2PKNN uses "image-to-label" similarities, while the second step uses "image-to-image" similarities, thus combining the benefits of both. Since the performance of nearest-neighbour based methods greatly depends on how features are compared, we also propose a metric learning framework over 2PKNN that learns weights for multiple features as well as distances together. This is done in a large margin set-up by generalizing a well-known (single-label) classification metric learning algorithm for multi-label prediction. For scalability, we implement it by alternating between stochastic sub-gradient descent and projection steps. Extensive experiments demonstrate that, though conceptually simple, 2PKNN alone performs comparably to the current state of the art on three challenging image annotation datasets, and shows significant improvements after metric learning.
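
The two-step idea in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not spell out: the first step is taken to keep, for every label, the `k` training images closest to the query that carry that label, and the second step is taken to score labels by an `exp(-distance)` vote over the selected images. The function name `two_pass_knn`, the Euclidean distance, and the voting weight are all illustrative choices, not the authors' exact formulation.

```python
import numpy as np

def two_pass_knn(query, train_feats, train_labels, k=2):
    """Score every label for `query` (illustrative sketch, not the paper's code).

    train_feats:  (n, d) array of image features
    train_labels: (n, L) binary array; train_labels[i, l] == 1 if image i has label l
    """
    dists = np.linalg.norm(train_feats - query, axis=1)
    n_labels = train_labels.shape[1]

    # Step 1 (image-to-label): per label, keep the k closest images that carry it,
    # so rare labels are represented as well as frequent ones.
    neighbourhood = set()
    for l in range(n_labels):
        members = np.flatnonzero(train_labels[:, l])
        closest = members[np.argsort(dists[members])[:k]]
        neighbourhood.update(closest.tolist())

    # Step 2 (image-to-image): each selected image votes for its own labels,
    # weighted by an assumed exp(-distance) similarity to the query.
    idx = np.array(sorted(neighbourhood))
    weights = np.exp(-dists[idx])
    return weights @ train_labels[idx]

feats = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
scores = two_pass_knn(np.array([0.0, 0.1]), feats, labels, k=1)
```

Because every label contributes the same number of candidates in the first step, a label with few training images is not crowded out by frequent labels, which is one way to read the class-imbalance argument above.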

##### Citations
Journal Article
TL;DR: This paper starts with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporates a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts.
Abstract: This paper investigates the problem of modeling Internet images and associated text or tags for tasks such as image-to-image search, tag-to-image search, and image-to-tag search (image annotation). We start with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporate a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts. We present two ways to train the three-view embedding: supervised, with the third view coming from ground-truth labels or search keywords; and unsupervised, with semantic themes automatically obtained by clustering the tags. To ensure high accuracy for retrieval tasks while keeping the learning process scalable, we combine multiple strong visual features and use explicit nonlinear kernel mappings to efficiently approximate kernel CCA. To perform retrieval, we use a specially designed similarity function in the embedded space, which substantially outperforms the Euclidean distance. The resulting system produces compelling qualitative results and outperforms a number of two-view baselines on retrieval tasks on three large-scale Internet image datasets.

545 citations

### Cites background or methods from "Image annotation using metric learn..."

• ...Metric learning has been used for image classification and annotation (Guillaumin et al., 2009; Mensink et al., 2012; Verma and Jawahar, 2012)....

[...]

• ...More recently, a number of publications have reported better results with simple data-driven schemes based on retrieving database images similar to a query and transferring the annotations from those images (Chua et al., 2009; Guillaumin et al., 2009; Makadia et al., 2008; Verma and Jawahar, 2012; Wang et al., 2008)....

[...]

• ...In fact, the standard datasets used for image annotation by Makadia et al. (2008); Guillaumin et al. (2009); Verma and Jawahar (2012) consist of 5K-20K images and have 260-290 tags each....

[...]

• ...One of the shortcomings of data-driven annotation approaches (Guillaumin et al., 2009; Makadia et al., 2008; Verma and Jawahar, 2012) as well as Wsabie is that they do not account for co-occurrence and mutual exclusion constraints between different tags for the same image....

[...]

Proceedings Article
23 Jun 2014
TL;DR: The key idea is to learn a query-specific generative model on the features of nearest neighbors and tags using the proposed NMF-KNN approach, which imposes a consensus constraint on the coefficient matrices across different features to solve the problem of feature fusion.
Abstract: Real-world image databases such as Flickr are characterized by the continuous addition of new images. Recent approaches to image annotation, i.e. the problem of assigning tags to images, have two major drawbacks. First, either models are learned using the entire training data or, to handle the issue of dataset imbalance, tag-specific discriminative models are trained. Such models become obsolete and require relearning when new images and tags are added to the database. Second, the task of feature fusion is typically dealt with using ad-hoc approaches. In this paper, we present a weighted extension of Multi-view Non-negative Matrix Factorization (NMF) to address the aforementioned drawbacks. The key idea is to learn a query-specific generative model on the features of nearest neighbors and tags using the proposed NMF-KNN approach, which imposes a consensus constraint on the coefficient matrices across different features. This makes the coefficient vectors consistent across features and thus naturally solves the problem of feature fusion, while the weight matrices introduced in the proposed formulation alleviate the issue of dataset imbalance. Furthermore, our approach, being query-specific, is unaffected by the addition of images and tags to a database. We tested our method on two datasets used for evaluation of image annotation and obtained competitive results.

116 citations

### Cites background or methods from "Image annotation using metric learn..."

• ...Another important difference between our method and existing methods [22, 15, 29] is the number of nearest-neighbors used to propagate the tags....

[...]

• ...Verma and Jawahar [29] presented two-pass kNN to find neighbors in semantic neighborhoods besides metric learning which learns weights for combining different features....

[...]

Proceedings Article
22 Jun 2015
TL;DR: It is demonstrated that word embedding vectors perform better than binary vectors as a representation of the tags associated with an image and the CCA model is compared to a simple CNN based linear regression model, which allows the CNN layers to be trained using back-propagation.
Abstract: We propose simple and effective models for image annotation that make use of Convolutional Neural Network (CNN) features extracted from an image and word embedding vectors to represent their associated tags. Our first set of models is based on the Canonical Correlation Analysis (CCA) framework, which helps in modeling both views of the data: visual features (CNN features) and textual features (word embedding vectors). Results on all three variants of the CCA models, namely linear CCA, kernel CCA and CCA with k-nearest neighbor (CCA-KNN) clustering, are reported. The best results are obtained using CCA-KNN, which outperforms previous results on the Corel-5k and the ESP-Game datasets and achieves comparable results on the IAPRTC-12 dataset. In our experiments we evaluate CNN features in the existing models, which brings out their advantages over dozens of handcrafted features. We also demonstrate that word embedding vectors perform better than binary vectors as a representation of the tags associated with an image. In addition, we compare the CCA model to a simple CNN-based linear regression model, which allows the CNN layers to be trained using back-propagation.

114 citations

### Cites background or methods from "Image annotation using metric learn..."

• ...TagProp [30] is again based on nearest neighbor model but they achieved significant improvement by using 15 different local and global features along with metric learning....

[...]

• ...These models can be generative [5, 17, 32], discriminative [2, 29, 10, 30] or nearest neighbor based and among these, nearest neighbor based models are shown to be the most successful [18, 10, 30]....

[...]

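• ...

| Method | | | Corel-5k P | Corel-5k R | Corel-5k F1 | Corel-5k N+ | ESP-Game P | ESP-Game R | ESP-Game F1 | ESP-Game N+ | IAPRTC-12 P | IAPRTC-12 R | IAPRTC-12 F1 | IAPRTC-12 N+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRM [17] | HC | - | 16 | 19 | 17 | 107 | - | - | - | - | - | - | - | - |
| SML [2] | HC | - | 23 | 29 | 26 | 137 | - | - | - | - | - | - | - | - |
| MRFA [32] | HC | - | 31 | 36 | 33 | 172 | - | - | - | - | - | - | - | - |
| GS [33] | HC | - | 30 | 33 | 31 | 146 | - | - | - | - | 32 | 29 | 30 | 252 |
| JEC [18] | HC | - | 27 | 32 | 29 | 139 | 22 | 25 | 23 | 224 | 28 | 29 | 29 | 250 |
| CCD [22] | HC | - | 36 | 41 | 38 | 159 | 36 | 24 | 29 | 232 | 44 | 29 | 35 | 251 |
| KSVM-VT [29] | HC | - | 32 | 42 | 36 | 179 | 33 | 32 | 33 | 259 | 47 | 29 | 36 | 268 |
| MBRM [5] | HC | - | 24 | 25 | 25 | 122 | 18 | 19 | 19 | 209 | 24 | 23 | 24 | 223 |
| TagProp(σML) [10] | HC | - | 33 | 42 | 37 | 160 | 39 | 27 | 32 | 239 | 46 | 35 | 40 | 266 |
| 2PKNN+ML [30] | HC | - | 44 | 46 | 45 | 191 | 53 | 27 | 36 | 252 | 54 | 37 | 44 | 278 |
| SVM-DMBRM [21] | HC | - | 36 | 48 | 41 | 197 | 55 | 25 | 34 | 259 | 56 | 29 | 38 | 283 |
| KCCA-2PKNN [1] | HC | - | 42 | 46 | 44 | 179 | - | - | - | - | 59 | 30 | 40 | 259 |
| SKL-CRM [20] | HC | - | 39 | 46 | 42 | 184 | 41 | 26 | 32 | 248 | 47 | 32 | 38 | 274 |

...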

[...]

• ...Multiple features with the right type of model are shown to improve the annotation performance significantly in the current state of the art system [30]....

[...]

• ...Firstly, we present the most widely reported type of evaluation where the recall and precision are computed per word and their average over all the words are reported [21, 1, 20, 30, 18, 5, 17]....

[...]
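
The per-word evaluation described in the last excerpt can be sketched as follows. Precision and recall are computed separately for each label and then averaged over the vocabulary. The F1 value and the count of labels with non-zero recall (often written N+) are commonly reported alongside them; including them here, and the choice to count a never-predicted label as precision 0, are assumptions of this sketch. The function name `per_label_metrics` is hypothetical.

```python
import numpy as np

def per_label_metrics(y_true, y_pred):
    """y_true, y_pred: (n_images, n_labels) binary arrays."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    tp = (y_true & y_pred).sum(axis=0).astype(float)  # correct predictions per label
    n_pred = y_pred.sum(axis=0)                       # times each label was predicted
    n_true = y_true.sum(axis=0)                       # times each label truly occurs

    # Labels that are never predicted (or never present) contribute 0.
    precision = np.divide(tp, n_pred, out=np.zeros_like(tp), where=n_pred > 0)
    recall = np.divide(tp, n_true, out=np.zeros_like(tp), where=n_true > 0)

    p, r = precision.mean(), recall.mean()            # average over the vocabulary
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    n_plus = int((recall > 0).sum())                  # labels recalled at least once
    return p, r, f1, n_plus

truth = [[1, 0, 1], [1, 1, 0]]
pred = [[1, 0, 0], [0, 1, 0]]
p, r, f1, n_plus = per_label_metrics(truth, pred)
```

Averaging per label rather than per image gives rare labels the same weight as frequent ones, which is why this protocol is sensitive to class imbalance.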

Proceedings Article
07 Dec 2015
TL;DR: In this paper, the authors use image metadata nonparametrically to generate neighborhoods of related images using Jaccard similarities, then use a deep neural network to blend visual information from the image and its neighbors.
Abstract: Some images that are difficult to recognize on their own may become clearer in the context of a neighborhood of related images with similar social-network metadata. We build on this intuition to improve multilabel image annotation. Our model uses image metadata nonparametrically to generate neighborhoods of related images using Jaccard similarities, then uses a deep neural network to blend visual information from the image and its neighbors. Prior work typically models image metadata parametrically; in contrast, our nonparametric treatment allows our model to perform well even when the vocabulary of metadata changes between training and testing. We perform comprehensive experiments on the NUS-WIDE dataset, where we show that our model outperforms state-of-the-art methods for multilabel image annotation even when our model is forced to generalize to new types of metadata.
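
The neighborhood construction mentioned in this abstract can be sketched with plain sets. The snippet covers only the Jaccard-based retrieval step; the neural network that blends the neighbors is not shown. The names `jaccard` and `metadata_neighbours`, the dictionary layout, and the toy metadata are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two collections of tokens."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def metadata_neighbours(query_tags, database, m=2):
    """Return the ids of the m images whose metadata is most similar to query_tags.

    database: dict mapping image id -> iterable of metadata tokens (hypothetical layout)
    """
    ranked = sorted(database,
                    key=lambda i: jaccard(query_tags, database[i]),
                    reverse=True)
    return ranked[:m]

db = {
    "a": {"beach", "sea"},
    "b": {"city"},
    "c": {"beach", "sand", "sea"},
}
neighbours = metadata_neighbours({"beach", "sea"}, db, m=2)
```

Since the similarity depends only on set overlap, no parameters are tied to a fixed metadata vocabulary, which is consistent with the abstract's point about vocabularies changing between training and testing.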

104 citations

Journal Article
, Fei Wu1
TL;DR: This paper proposes a novel multi-label learning approach, named multi-label dictionary learning (MLDL) with label consistency regularization and partial-identical label embedding, which conducts MLDL and partial-identical label embedding simultaneously.
Abstract: Image annotation has attracted a lot of research interest, and multi-label learning is an effective technique for image annotation. How to effectively exploit the underlying correlation among labels is a crucial task for multi-label learning. Most existing multi-label learning methods exploit the label correlation only in the output label space, leaving the connection between the label and the features of images untouched. Although some recent methods attempt to exploit the label correlation in the input feature space by using the label information, they cannot effectively conduct the learning process in both spaces simultaneously, and there is still much room for improvement. In this paper, we propose a novel multi-label learning approach, named multi-label dictionary learning (MLDL) with label consistency regularization and partial-identical label embedding, which conducts MLDL and partial-identical label embedding simultaneously. In the input feature space, we incorporate the dictionary learning technique into multi-label learning and design the label consistency regularization term to learn a better representation of features. In the output label space, we design the partial-identical label embedding, in which samples with exactly the same label set can cluster together, and samples with partial-identical label sets can collaboratively represent each other. Experimental results on three widely used image datasets, including Corel 5K, IAPR TC12, and ESP Game, demonstrate the effectiveness of the proposed approach.

70 citations

### Cites background from "Image annotation using metric learn..."

• ...It can conduct multi-label dictionary learning in input feature space and partial-identical label embedding in output label space, simultaneously....

[...]

• ...(Corresponding author: Xiao-Yuan Jing.)...

[...]

• ...In addition, MLDL specially designs the label consistency regularization term for multi-label dictionary learning to enhance the discriminability of learned dictionary....

[...]

##### References
Proceedings Article
05 Dec 2005
TL;DR: In this article, a Mahalanobis distance metric for k-NN classification is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin.
Abstract: We show how to learn a Mahalanobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification, for example achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification.

4,430 citations

Journal Article
TL;DR: This paper shows how to learn a Mahalanobis distance metric for kNN classification from labeled examples, and finds that metrics trained in this way lead to significant improvements in kNN classification.
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner.
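
The equivalence stated in this abstract, that a Mahalanobis metric acts like a global linear transformation followed by Euclidean distance, can be checked numerically. In the sketch below the matrix `L` is random rather than learned, and the helper names are hypothetical; the only point is that with `M = L.T @ L` the two squared distances coincide.

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    diff = x - y
    return diff @ M @ diff

def transformed_euclidean_sq(x, y, L):
    """Squared Euclidean distance after mapping both points through L."""
    return np.sum((L @ x - L @ y) ** 2)

rng = np.random.default_rng(0)
L = rng.normal(size=(3, 3))   # stand-in for a learned linear transformation
M = L.T @ L                   # induced positive semidefinite metric
x, y = rng.normal(size=3), rng.normal(size=3)

d_metric = mahalanobis_sq(x, y, M)
d_linear = transformed_euclidean_sq(x, y, L)
```

Because `M` built this way is always positive semidefinite, the distance is never negative, which is the constraint the semidefinite program in the abstract has to enforce.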

3,736 citations

### "Image annotation using metric learn..." refers background or methods in this paper

• ...With this goal, we perform metric learning over 2PKNN by generalizing the LMNN [11] algorithm for multi-label prediction....

[...]

• ...In such a scenario, (i) since each base distance contributes differently, we can learn appropriate weights to combine them in the distance space [2, 3]; and (ii) since every feature (such as SIFT or colour histogram) itself is represented as a multidimensional vector, its individual elements can also be weighted in the feature space [11]....

[...]

• ...Our extension of LMNN conceptually differs from its previous extensions such as [21] in at least two significant ways: (i) we adapt LMNN in its choice of target/impostors to learn metrics for multi-label prediction problems, whereas [21] uses the same definition of target/impostors as in LMNN to address classification problem in multi-task setting, and (ii) in our formulation, the amount of push applied on an impostor varies depending on its conceptual similarity w.r.t. a given sample, which makes it suitable for multi-label prediction tasks....

[...]

• ...Our metric learning framework extends LMNN in two major ways: (i) LMNN is meant for single-label classification (or simply classification) problems, while we adapt it for image annotation, which is a multi-label classification task; and (ii) LMNN learns a single Mahalanobis metric in the feature space, while we extend it to learn linear metrics for multiple features as well as distances together....

[...]

• ...For this purpose, we extend the classical LMNN [11] algorithm for multi-label prediction....

[...]

Proceedings Article
25 Apr 2004
TL;DR: A new interactive system: a game that is fun and can be used to create valuable output that addresses the image-labeling problem and encourages people to do the work by taking advantage of their desire to be entertained.
Abstract: We introduce a new interactive system: a game that is fun and can be used to create valuable output. When people play the game they help determine the contents of images by providing meaningful labels for them. If the game is played as much as popular online games, we estimate that most images on the Web can be labeled in a few months. Having proper labels associated with each image on the Web would allow for more accurate image search, improve the accessibility of sites (by providing descriptions of images to visually impaired individuals), and help users block inappropriate images. Our system makes a significant contribution because of its valuable output and because of the way it addresses the image-labeling problem. Rather than using computer vision techniques, which don't work well enough, we encourage people to do the work by taking advantage of their desire to be entertained.

2,274 citations

### "Image annotation using metric learn..." refers background in this paper

• ...ESP Game contains images annotated using an on-line game, where two (mutually unknown) players are randomly given an image for which they have to predict the same keyword(s) to score points [22]....

[...]

Journal Article
TL;DR: A simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines, which is particularly well suited for large text classification problems, and demonstrates an order-of-magnitude speedup over previous SVM learning methods.
Abstract: We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy $\epsilon$ is $\tilde{O}(1/\epsilon)$, where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require $\Omega(1/\epsilon^2)$ iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with $1/\lambda$, where $\lambda$ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is $\tilde{O}(d/(\lambda\epsilon))$, where $d$ is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.
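
A minimal sketch of the kind of solver this abstract describes is given below for a linear SVM: each iteration draws a single training example, takes a sub-gradient step on the regularized hinge loss with step size `1 / (lam * t)`, and then applies a projection onto a ball of radius `1 / sqrt(lam)`. The function name `pegasos`, the default parameters, and the toy data are assumptions; the step-size schedule and projection follow the commonly described form of the algorithm rather than anything stated in the abstract itself.

```python
import numpy as np

def pegasos(X, y, lam=0.1, n_iters=1000, seed=0):
    """Stochastic sub-gradient descent for a linear SVM (illustrative sketch).

    X: (n, d) features, y: (n,) labels in {-1, +1}, lam: regularization parameter.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        i = rng.integers(len(y))          # one training example per iteration
        eta = 1.0 / (lam * t)             # decreasing step size
        if y[i] * (w @ X[i]) < 1:         # hinge loss is active for this example
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                             # only the regularizer contributes
            w = (1 - eta * lam) * w
        # Projection step: keep w inside the ball of radius 1 / sqrt(lam).
        norm = np.linalg.norm(w)
        if norm > 0:
            w *= min(1.0, 1.0 / (np.sqrt(lam) * norm))
    return w

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = pegasos(X, y)
predictions = np.sign(X @ w)
```

Each iteration touches one example only, so the cost per step does not grow with the size of the training set, which matches the scalability argument in the abstract.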

1,891 citations

### "Image annotation using metric learn..." refers methods in this paper

• ...To overcome this issue, we solve it by alternately using stochastic sub-gradient descent and projection steps (similar to Pegasos [12])....

[...]

• ...To address this, we implement metric learning by alternating between stochastic sub-gradient descent and projection steps (similar to Pegasos [12])....

[...]
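
The alternation between stochastic sub-gradient descent and projection mentioned in these excerpts can be illustrated on a deliberately simplified problem: learning non-negative weights that combine several base distances. This is not the paper's formulation. The fixed unit margin, the triplet layout, the learning rate, and the name `learn_distance_weights` are all assumptions made to keep the sketch short.

```python
import numpy as np

def learn_distance_weights(d_target, d_impostor, lr=0.05, n_iters=500, seed=0):
    """Learn non-negative weights over base distances (simplified sketch).

    d_target[j], d_impostor[j]: vectors of base distances from a sample to a
    neighbour that should be close and to one that should be far (triplet j).
    """
    rng = np.random.default_rng(seed)
    w = np.ones(d_target.shape[1])
    for _ in range(n_iters):
        j = rng.integers(len(d_target))   # one random triplet per iteration
        # Hinge is active when the impostor is not at least 1 further than the target.
        if 1.0 + w @ d_target[j] - w @ d_impostor[j] > 0:
            w -= lr * (d_target[j] - d_impostor[j])   # sub-gradient step
        w = np.maximum(w, 0.0)            # projection onto non-negative weights
    return w

# Base distance 0 separates targets from impostors; base distance 1 is misleading.
d_target = np.array([[0.1, 0.9], [0.2, 0.8]])
d_impostor = np.array([[0.9, 0.1], [0.8, 0.2]])
w = learn_distance_weights(d_target, d_impostor)
```

On this toy input the weight of the informative base distance grows and the weight of the misleading one is driven towards zero, with the projection keeping every weight non-negative so that the combined distance remains a valid non-negative quantity.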

Book Chapter
28 May 2002
TL;DR: This work shows how to cluster words that are individually difficult to predict into clusters that can be predicted well; for example, the distinction between train and locomotive cannot be predicted using the current set of features, but the underlying concept can.
Abstract: We describe a model of object recognition as machine translation. In this model, recognition is a process of annotating image regions with words. Firstly, images are segmented into regions, which are classified into region types using a variety of features. A mapping between region types and keywords supplied with the images is then learned, using a method based around EM. This process is analogous to learning a lexicon from an aligned bitext. For the implementation we describe, these words are nouns taken from a large vocabulary. On a large test set, the method can predict numerous words with high accuracy. Simple methods identify words that cannot be predicted well. We show how to cluster words that individually are difficult to predict into clusters that can be predicted well: for example, we cannot predict the distinction between train and locomotive using the current set of features, but we can predict the underlying concept. The method is trained on a substantial collection of images. Extensive experimental results illustrate the strengths and weaknesses of the approach.

1,721 citations

### "Image annotation using metric learn..." refers background in this paper

• ...translation models [13, 14] and nearest-neighbour based relevance models [1, 8]....

[...]

• ...Corel 5K was first used in [14], and since then it has become a benchmark for comparing annotation performance....

[...]