TL;DR: This work studies tag completion, where the goal is to automatically fill in missing tags and correct noisy tags for given images. It represents the image-tag relation by a tag matrix, searches for the optimal tag matrix consistent with both the observed tags and the visual similarity, and proposes a new algorithm for solving this optimization problem.
Abstract: Many social image search engines are based on keyword/tag matching. This is because tag-based image retrieval (TBIR) is not only efficient but also effective. The performance of TBIR is highly dependent on the availability and quality of manual tags. Recent studies have shown that manual tags are often unreliable and inconsistent. In addition, since many users tend to choose general and ambiguous tags in order to minimize their efforts in choosing appropriate words, tags that are specific to the visual content of images tend to be missing or noisy, leading to a limited performance of TBIR. To address this challenge, we study the problem of tag completion, where the goal is to automatically fill in the missing tags as well as correct noisy tags for given images. We represent the image-tag relation by a tag matrix, and search for the optimal tag matrix consistent with both the observed tags and the visual similarity. We propose a new algorithm for solving this optimization problem. Extensive empirical studies show that the proposed algorithm is significantly more effective than the state-of-the-art algorithms. Our studies also verify that the proposed algorithm is computationally efficient and scales well to large databases.
With the remarkable growth in the popularity of social media websites, there has been a proliferation of digital images on the Internet, which poses a great challenge for large-scale image search.
To overcome the limitations of CBIR, TBIR represents the visual content of images by manually assigned keywords/tags.
Similar to the classification-based approaches for image annotation, these approaches require a large number of well-annotated images to achieve good performance, and are therefore not suitable for the tag completion problem.
The limitation of current automatic image annotation approaches motivates us to develop a new computational framework for tag completion.
2 RELATED WORK
Numerous algorithms have been proposed for automatic image annotation (see [18] and references therein).
The first group of approaches are based upon global image features [31], such as color moment, texture histogram, etc.
More recent work [34], [27] improves the performance of automatic image annotation by taking into account the spatial dependence among visual features.
Wang et al. [26] proposed a multi-label learning approach via maximum consistency.
Similar to the classification-based approaches, these methods require clean and complete image tags, making them unsuitable for the tag completion problem.
3.1 A Framework for Tag Completion
Figure 1 illustrates the tag completion task.
Let T̂ ∈ Rn×m be the partially observed tag matrix derived from user annotations, where T̂i,j is set to one if tag j is assigned to image i and zero otherwise.
To narrow down the solution for the complete tag matrix T, the authors consider the following three criteria for reconstructing T.
To address this challenge, the authors propose to exploit this criterion by comparing image similarities based on visual content with image similarities based on the overlap in annotated tags.
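To make these criteria concrete, below is a minimal Python sketch of an objective in this spirit; the exact penalty forms and the trade-off weights lam and mu are illustrative assumptions, not the paper's formulation (1).

```python
import numpy as np

def completion_objective(T, T_hat, S_visual, lam=1.0, mu=0.01):
    """Toy objective combining the criteria above (a sketch, not the
    paper's exact formulation): (i) the image similarity implied by the
    tags, T T^T, should match the visual similarity S_visual; (ii) T
    should stay close to the observed tags T_hat; (iii) T should stay
    sparse. lam and mu are hypothetical trade-off weights."""
    fit_visual = np.linalg.norm(T @ T.T - S_visual, "fro") ** 2
    fit_observed = np.linalg.norm(T - T_hat, "fro") ** 2
    sparsity = np.abs(T).sum()
    return fit_visual + lam * fit_observed + mu * sparsity
```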
There are, however, two problems with the formulation in (1).
3.2 Optimization
To solve the optimization problem in (2), the authors develop a subgradient descent based approach (Algorithm 1).
The subgradient descent approach is an iterative method.
At each iteration t, given the current solution (Tt, wt), the authors compute the subgradients ∇TA(Tt,wt) and ∇wA(Tt,wt), and update the solutions for T and w according to the theory of composite function optimization [3].
The authors' final question is how to decide the step size ηt.
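A minimal sketch of such a two-block subgradient loop, assuming grad_T and grad_w are callables returning subgradients of the objective A(T, w); the eta0/√t schedule is one standard diminishing step size consistent with the convergence rate discussed below, not necessarily the paper's exact rule.

```python
import numpy as np

def subgradient_descent(T, w, grad_T, grad_w, n_iters=500, eta0=0.1):
    """Generic two-block subgradient loop in the spirit of Algorithm 1.
    The projection choices (tags in [0, 1], non-negative weights) are
    illustrative assumptions."""
    T, w = T.copy(), w.copy()
    for t in range(1, n_iters + 1):
        eta = eta0 / np.sqrt(t)                          # diminishing step size
        T = np.clip(T - eta * grad_T(T, w), 0.0, 1.0)    # tags stay in [0, 1]
        w = np.clip(w - eta * grad_w(T, w), 0.0, None)   # weights stay non-negative
    return T, w
```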
3.3 Discussion
This, however, is not a serious issue from the viewpoint of learning theory [5].
To alleviate the problem of local optima, the authors run the algorithm 20 times and choose the run with the lowest objective value.
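A sketch of this multi-start heuristic, assuming run_once() performs one full optimization from a random initialization and returns an (objective_value, T, w) tuple:

```python
def best_of_restarts(run_once, n_restarts=20):
    """Multi-start heuristic described above: run the solver several
    times from random initializations and keep the run with the lowest
    objective value (the first tuple element)."""
    return min((run_once() for _ in range(n_restarts)), key=lambda run: run[0])
```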
The convergence rate for the adopted subgradient descent method is O(1/√t), where t is the number of iterations.
The authors finally note that since the objective of this work is to complete the tag matrix for all the images, it belongs to the category of transductive learning.
A similar strategy can be used to extend the proposed approach to make predictions for out-of-samples.
3.4 Tag Based Image Retrieval
Given the complete tag matrix T obtained by solving the optimization problem in (2), the authors briefly describe how to utilize the matrix T for tag-based image retrieval.
The authors first consider the simplest scenario, in which the query consists of a single tag.
A straightforward approach is to compute the tag-based similarity between the query and the images as T q, where q is the binary indicator vector of the query tag.
A shortcoming of this similarity measure is that it does not take into account the correlation between tags.
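A sketch of this scoring scheme; the optional tag_corr argument is a hypothetical tag-tag correlation matrix illustrating one way to remedy the shortcoming, not the paper's exact construction.

```python
import numpy as np

def rank_images(T, query_tag_idx, tag_corr=None):
    """Score gallery images for a single-tag query. With tag_corr=None
    this is the plain T q scoring from the text; passing a tag-tag
    correlation matrix spreads the query mass over correlated tags
    before scoring."""
    q = np.zeros(T.shape[1])
    q[query_tag_idx] = 1.0
    if tag_corr is not None:
        q = tag_corr @ q              # enrich the query with correlated tags
    scores = T @ q                    # one relevance score per image
    return np.argsort(-scores)        # image indices, most relevant first
```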
4 EXPERIMENTS
The authors evaluate the quality of the completed tag matrix on two tasks: automatic image annotation and tag-based image retrieval.
The maximum number of annotated tags per image is 82.
The authors then cluster the projected low dimensional SIFT features into 100,000 visual words, and represent the visual content of images by the histogram of the visual words.
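A small-scale sketch of this bag-of-visual-words pipeline using scikit-learn; the 1,000-word codebook (versus the paper's 100,000) and the use of MiniBatchKMeans are assumptions made so the example runs quickly.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def bow_histograms(descriptors_per_image, n_words=1000):
    """Bag-of-visual-words sketch: cluster the (projected) SIFT
    descriptors into a codebook, then describe each image by its
    normalized visual-word histogram."""
    all_desc = np.vstack(descriptors_per_image)
    codebook = MiniBatchKMeans(n_clusters=n_words, random_state=0).fit(all_desc)
    hists = []
    for desc in descriptors_per_image:
        words = codebook.predict(desc)                       # word id per keypoint
        h = np.bincount(words, minlength=n_words).astype(float)
        hists.append(h / max(h.sum(), 1.0))                  # L1-normalize per image
    return np.array(hists)
```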
To make a fair comparison with other state-of-the-art methods, the authors adopt average precision and average recall [6] as the evaluation metrics.
4.1 Experiment (I): Automatic Image Annotation
The authors first evaluate the proposed algorithm for tag completion by automatic image annotation.
To run the proposed algorithm for automatic image annotation, the authors simply view test images as special cases of partially tagged images, i.e., no tag is observed for test images.
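A sketch of this setup, with illustrative names and shapes: test images enter the completion algorithm as all-zero rows appended to the observed tag matrix.

```python
import numpy as np

def append_test_rows(T_hat_train, n_test):
    """Treat test images as described above: partially tagged images
    with no observed tags, i.e., all-zero rows of the observed tag
    matrix fed to the completion algorithm."""
    n_tags = T_hat_train.shape[1]
    return np.vstack([T_hat_train, np.zeros((n_test, n_tags))])
```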
The authors will discuss the parameter setting in more detail later in this section.
The authors also observed that as the number of returned tags increases from five to ten, the precision usually declines while the recall usually improves.
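This trade-off follows directly from how the per-image metrics are computed; a minimal sketch (averaging these over all test images gives the reported averages):

```python
def precision_recall_at_k(predicted_tags, true_tags, k):
    """Per-image precision/recall over the top-k returned tags. Growing
    k can only add hits, so recall is non-decreasing while precision
    tends to drop, matching the trade-off noted above."""
    hits = len(set(predicted_tags[:k]) & set(true_tags))
    precision = hits / k
    recall = hits / len(true_tags) if true_tags else 0.0
    return precision, recall
```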
4.2 Experiment (II): Tag Based Image Retrieval
Unlike the experiments for image annotation where each dataset is divided into a training set and a testing set, for the experiment of tag-based image retrieval, the authors include all the images from the dataset except the queries as the gallery images for retrieval.
Similar to the previous experiments, the authors only compare the proposed algorithm to TagProp and TagRel because the other approaches were unable to handle the partially tagged images.
Below, the authors first present the results for single-tag queries, and then the results for queries consisting of multiple tags.
4.2.1 Results for Single-tag Queries
Since every tag can be used as a query, the authors have in total 260 queries for the Corel5k dataset, 495 queries for the LabelMe dataset, and 1,000 queries each for the Flickr and TinyImage datasets.
The authors adopt a simple rule for determining relevance: an image is relevant if its annotation contains the query.
Besides the TagProp and TagRel methods, the authors also introduce a reference method that returns a gallery image if its observed tags include the query word.
By comparing to the reference method, the authors will be able to determine the improvement made by the proposed matrix completion method.
Table 4 shows the MAP results for the four datasets.
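For reference, a minimal sketch of how MAP can be computed under the relevance rule above; variable names are illustrative.

```python
import numpy as np

def average_precision(ranked_ids, relevant_ids):
    """AP for one query: an image counts as relevant if its annotation
    contains the query tag (the rule above)."""
    relevant = set(relevant_ids)
    hits, ap = 0, 0.0
    for rank, img in enumerate(ranked_ids, start=1):
        if img in relevant:
            hits += 1
            ap += hits / rank          # precision at each relevant hit
    return ap / max(len(relevant), 1)

def mean_average_precision(per_query):
    """MAP: mean of AP over all queries; per_query is a list of
    (ranked_ids, relevant_ids) pairs."""
    return float(np.mean([average_precision(r, g) for r, g in per_query]))
```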
4.2.2 Results for Multiple-tag Queries
To generate queries with multiple tags, the authors randomly select 200 images from the Flickr dataset, and use the annotated tags of the randomly selected images as the queries.
For all the methods in comparison, the authors follow the method presented in Section 3.4 for calculating the tag-based similarity between the textual query and the completed tags of gallery images.
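A sketch of the multi-tag case as a direct extension of the single-tag scoring: the query becomes a multi-hot vector (no normalization is applied here, as it would not change the ranking for a fixed query; the paper's exact normalization is not reproduced).

```python
import numpy as np

def rank_images_multi(T, query_tag_indices):
    """Multi-tag retrieval: build a multi-hot query vector over the tag
    vocabulary and score every gallery image by T q."""
    q = np.zeros(T.shape[1])
    q[list(query_tag_indices)] = 1.0
    return np.argsort(-(T @ q))       # image indices, most relevant first
```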
According to Table 5, the authors first observe a significant difference in MAP scores between CBIR and TBIR (i.e., the reference method, TagProp, and TMC), which is consistent with the observations reported in the previous study [23].
Second, the authors observe that the proposed method TMC outperforms all the baseline methods significantly.
Figure 5 shows examples of queries and images returned by the proposed method and the baselines.
4.3 Convergence and Computational Efficiency
The authors evaluate computational efficiency by measuring the running time for image annotation.
Table 6 summarizes the running times of both the proposed method and the baseline methods.
Note that for the Flickr and TinyImage datasets, the authors only report the running time for three methods, because the other methods either run into memory issues or take several days or more to finish.
Figure 3 shows how the objective function value is reduced over the iterations.
5 CONCLUSIONS
The authors have proposed a tag matrix completion method for image tagging and image retrieval.
The authors consider the image-tag relation as a tag matrix, and aim to optimize the tag matrix by minimizing the difference between tag based similarity and visual content based similarity.
The proposed method falls into the category of semi-supervised learning in that both tagged images and untagged images are exploited to find the optimal tag matrix.
Extensive experimental results on four open benchmark datasets show that the proposed method significantly outperforms several state-of-the-art methods for automatic image annotation.
TL;DR: A fully-automated approach for learning extensive models for a wide range of variations within any concept, which leverages vast resources of online books to discover the vocabulary of variance, and intertwines the data collection and modeling steps to alleviate the need for explicit human supervision in training the models.
Abstract: Recognition is graduating from labs to real-world applications. While it is encouraging to see its potential being tapped, it brings forth a fundamental challenge to the vision researcher: scalability. How can we learn a model for any concept that exhaustively covers all its appearance variations, while requiring minimal or no human supervision for compiling the vocabulary of visual variance, gathering the training images and annotations, and learning the models? In this paper, we introduce a fully-automated approach for learning extensive models for a wide range of variations (e.g. actions, interactions, attributes and beyond) within any concept. Our approach leverages vast resources of online books to discover the vocabulary of variance, and intertwines the data collection and modeling steps to alleviate the need for explicit human supervision in training the models. Our approach organizes the visual knowledge about a concept in a convenient and useful way, enabling a variety of applications across vision and NLP. Our online system has been queried by users to learn models for several interesting concepts including breakfast, Gandhi, beautiful, etc. To date, our system has models available for over 50,000 variations within 150 concepts, and has annotated more than 10 million images with bounding boxes.
376 citations
Cites background from "Tag Completion for Image Retrieval"
...Such an exhaustive vocabulary helps in generating fine-grained descriptions of images [17, 29, 34, 40, 50]....
TL;DR: This work introduces a strategy to dynamically select face regions useful for robust HR estimation, inspired by recent advances in matrix completion theory; the approach significantly outperforms state-of-the-art HR estimation methods in naturalistic conditions.
Abstract: Recent studies in computer vision have shown that, while practically invisible to a human observer, skin color changes due to blood flow can be captured on face videos and, surprisingly, be used to estimate the heart rate (HR). While considerable progress has been made in the last few years, still many issues remain open. In particular, state-of-the-art approaches are not robust enough to operate in natural conditions (e.g. in case of spontaneous movements, facial expressions, or illumination changes). Opposite to previous approaches that estimate the HR by processing all the skin pixels inside a fixed region of interest, we introduce a strategy to dynamically select face regions useful for robust HR estimation. Our approach, inspired by recent advances on matrix completion theory, allows us to predict the HR while simultaneously discovering the best regions of the face to be used for estimation. Thorough experimental evaluation conducted on public benchmarks suggests that the proposed approach significantly outperforms state-of-the-art HR estimation methods in naturalistic conditions.
280 citations
Cites background from "Tag Completion for Image Retrieval"
...Matrix completion has proved successful for many computer vision tasks, when data and labels are noisy or in the case of missing data, such as multi-label image classification [6], image retrieval and tagging [28, 9], manifold correspondence finding [16], head/body pose estimation [1] and emotion recognition from abstract paintings [2]....
TL;DR: A Deep Collaborative Embedding model is proposed to uncover a unified latent space for images and tags; it integrates the weakly-supervised image-tag correlation, image correlation, and tag correlation simultaneously and seamlessly to collaboratively explore the rich context information of social images.
Abstract: In this work, we investigate the problem of learning knowledge from the massive community-contributed images with rich weakly-supervised context information, which can benefit multiple image understanding tasks simultaneously, such as social image tag refinement and assignment, content-based image retrieval, tag-based image retrieval and tag expansion. Towards this end, we propose a Deep Collaborative Embedding (DCE) model to uncover a unified latent space for images and tags. The proposed method incorporates the end-to-end learning and collaborative factor analysis in one unified framework for the optimal compatibility of representation learning and latent space discovery. A nonnegative and discrete refined tagging matrix is learned to guide the end-to-end learning. To collaboratively explore the rich context information of social images, the proposed method integrates the weakly-supervised image-tag correlation, image correlation and tag correlation simultaneously and seamlessly. The proposed model is also extended to embed new tags in the uncovered space. To verify the effectiveness of the proposed method, extensive experiments are conducted on two widely-used social image benchmarks for multiple social image understanding tasks. The encouraging performance of the proposed method over the state-of-the-art approaches demonstrates its superiority.
269 citations
Cites background or methods from "Tag Completion for Image Retrieval"
...In [6], the relevance of images to tags is refined by constraining it to be consistent with the original one and the visual similarity....
[...]
...Some methods have recently been proposed to explore images and user-provided tags [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]....
[...]
...Following [6], [56], the performance is also evaluated by using average recall, which can measure the percentage of correct tags tagged by the tagging methods out of all ground-truth tags....
[...]
...The image-tag relation that is consistent with the original image-tag relation and the visual similarity is found for tag refinement in [6]....
[...]
...TBIR is to return images with a query tag based on the relevance between the tag and images [6], [37], [38], [39], [40]....
TL;DR: A novel weakly supervised deep matrix factorization algorithm is proposed, which uncovers the latent image representations and tag representations embedded in the latent subspace by collaboratively exploring the weakly supervised tagging information, the visual structure, and the semantic structure.
Abstract: The number of images associated with weakly supervised user-provided tags has increased dramatically in recent years. User-provided tags are incomplete, subjective and noisy. In this paper, we focus on the problem of social image understanding, i.e., tag refinement, tag assignment, and image retrieval. Different from previous work, we propose a novel weakly supervised deep matrix factorization algorithm, which uncovers the latent image representations and tag representations embedded in the latent subspace by collaboratively exploring the weakly supervised tagging information, the visual structure, and the semantic structure. Due to the well-known semantic gap, the hidden representations of images are learned by a hierarchical model, which are progressively transformed from the visual feature space. It can naturally embed new images into the subspace using the learned deep architecture. The semantic and visual structures are jointly incorporated to learn a semantic subspace without overfitting the noisy, incomplete, or subjective tags. Besides, to remove the noisy or redundant visual features, a sparse model is imposed on the transformation matrix of the first layer in the deep architecture. Finally, a unified optimization problem with a well-defined objective function is developed to formulate the proposed problem and solved by a gradient descent procedure with curvilinear search. Extensive experiments on real-world social image databases are conducted on the tasks of image understanding: image tag refinement, assignment, and retrieval. Encouraging results are achieved with comparison with the state-of-the-art algorithms, which demonstrates the effectiveness of the proposed method.
197 citations
Cites background from "Tag Completion for Image Retrieval"
...Unfortunately, these tags are provided by amateur users and are imperfect, i.e., they are often incomplete or inaccurate in describing the visual content of images, which brings challenges to the tasks of image understanding such as tag-based image retrieval [4]....
TL;DR: In this paper, a comprehensive survey of content-based image retrieval focuses on what people tag about an image and how such information can be exploited to construct a tag relevance function. A two-dimensional taxonomy is presented to structure the growing literature, understand the ingredients of the main works, clarify their connections and differences, and recognize their merits and limitations.
Abstract: Where previous reviews on content-based image retrieval emphasize what can be seen in an image to bridge the semantic gap, this survey considers what people tag about an image. A comprehensive treatise of three closely linked problems (i.e., image tag assignment, refinement, and tag-based image retrieval) is presented. While existing works vary in terms of their targeted tasks and methodology, they rely on the key functionality of tag relevance, that is, estimating the relevance of a specific tag with respect to the visual content of a given image and its social context. By analyzing what information a specific method exploits to construct its tag relevance function and how such information is exploited, this article introduces a two-dimensional taxonomy to structure the growing literature, understand the ingredients of the main works, clarify their connections and differences, and recognize their merits and limitations. For a head-to-head comparison with the state of the art, a new experimental protocol is presented, with training sets containing 10,000, 100,000, and 1 million images, and an evaluation on three test sets, contributed by various research groups. Eleven representative works are implemented and evaluated. Putting all this together, the survey aims to provide an overview of the past and foster progress for the near future.
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
16,989 citations
"Tag Completion for Image Retrieval" refers background in this paper
...To address this challenge, we study the problem of tag completion, where the goal is to automatically fill in the missing tags as well as correct noisy tags for given images....
TL;DR: This bag of keypoints method is based on vector quantization of affine invariant descriptors of image patches and shows that it is simple, computationally efficient and intrinsically invariant.
Abstract: We present a novel method for generic visual categorization: the problem of identifying the object content of natural images while generalizing across variations inherent to the object class. This bag of keypoints method is based on vector quantization of affine invariant descriptors of image patches. We propose and compare two alternative implementations using different classifiers: Naive Bayes and SVM. The main advantages of the method are that it is simple, computationally efficient and intrinsically invariant. We present results for simultaneously classifying seven semantic visual categories. These results clearly demonstrate that the method is robust to background clutter and produces good categorization accuracy even without exploiting geometric information.
5,046 citations
"Tag Completion for Image Retrieval" refers background in this paper
...Various visual features, including both global features [33] (e.g., color, texture, and shape) and local features [16] (e.g., SIFT keypoints), have been studied for CBIR....
TL;DR: In this article, a large collection of images with ground-truth labels is built for object detection and recognition research; such data is useful for supervised learning and quantitative evaluation.
Abstract: We seek to build a large collection of images with ground truth labels to be used for object detection and recognition research. Such data is useful for supervised learning and quantitative evaluation. To achieve this, we developed a web-based tool that allows easy image annotation and instant sharing of such annotations. Using this annotation tool, we have collected a large dataset that spans many object categories, often containing multiple instances over a wide variety of images. We quantify the contents of the dataset and compare against existing state of the art datasets used for object recognition and detection. Also, we show how to extend the dataset to automatically enhance object labels with WordNet, discover object parts, recover a depth ordering of objects in a scene, and increase the number of labels using minimal user supervision and images from the web.
TL;DR: This paper examines (and improves upon) the local image descriptor used by SIFT, and demonstrates that the PCA-based local descriptors are more distinctive, more robust to image deformations, and more compact than the standard SIFT representation.
Abstract: Stable local feature detection and representation is a fundamental component of many image registration and object recognition algorithms. Mikolajczyk and Schmid (June 2003) recently evaluated a variety of approaches and identified the SIFT [D. G. Lowe, 1999] algorithm as being the most resistant to common image deformations. This paper examines (and improves upon) the local image descriptor used by SIFT. Like SIFT, our descriptors encode the salient aspects of the image gradient in the feature point's neighborhood; however, instead of using SIFT's smoothed weighted histograms, we apply principal components analysis (PCA) to the normalized gradient patch. Our experiments demonstrate that the PCA-based local descriptors are more distinctive, more robust to image deformations, and more compact than the standard SIFT representation. We also present results showing that using these descriptors in an image retrieval application results in increased accuracy and faster matching.
TL;DR: Experiments on three different real-world multi-label learning problems, i.e., Yeast gene functional analysis, natural scene classification and automatic web page categorization, show that ML-KNN achieves superior performance to some well-established multi-label learning algorithms.
Abstract: Multi-label learning originated from the investigation of text categorization problem, where each document may belong to several predefined topics simultaneously. In multi-label learning, the training set is composed of instances each associated with a set of labels, and the task is to predict the label sets of unseen instances through analyzing training instances with known label sets. In this paper, a multi-label lazy learning approach named ML-KNN is presented, which is derived from the traditional K-nearest neighbor (KNN) algorithm. In detail, for each unseen instance, its K nearest neighbors in the training set are firstly identified. After that, based on statistical information gained from the label sets of these neighboring instances, i.e. the number of neighboring instances belonging to each possible class, maximum a posteriori (MAP) principle is utilized to determine the label set for the unseen instance. Experiments on three different real-world multi-label learning problems, i.e. Yeast gene functional analysis, natural scene classification and automatic web page categorization, show that ML-KNN achieves superior performance to some well-established multi-label learning algorithms.
2,832 citations
"Tag Completion for Image Retrieval" refers background in this paper
...[40] proposed a lazy learning algorithm for multi-label prediction....
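For readers unfamiliar with ML-KNN (described in the abstract above), a sketch of its first stage, neighbor counting, using scikit-learn; the full algorithm's MAP decision with posteriors estimated from training statistics is omitted, so this is not a complete implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_label_counts(X_train, Y_train, x, k=10):
    """First stage of ML-KNN for one unseen instance: identify its k
    nearest training neighbors and count, per label, how many of them
    carry that label. The MAP rule then decides the label set from
    these counts; that step is not reproduced here."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    return Y_train[idx[0]].sum(axis=0)   # neighbor count per label
```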
Q1. What are the contributions mentioned in the paper "Tag completion for image retrieval" ?
To address this challenge, the authors study the problem of tag completion, where the goal is to automatically fill in the missing tags as well as correct noisy tags for given images. The authors represent the image-tag relation by a tag matrix, and search for the optimal tag matrix consistent with both the observed tags and the visual similarity. The authors propose a new algorithm for solving this optimization problem.
Q2. What have the authors stated for future works in "Tag completion for image retrieval" ?
In future work, the authors plan to exploit computationally more efficient approaches for tag completion based on the theory of compressed sensing and matrix completion.