Author

Xin-Jing Wang

Other affiliations: Tsinghua University
Bio: Xin-Jing Wang is an academic researcher at Microsoft. The author has contributed to research on image retrieval and automatic image annotation, has an h-index of 20, and has co-authored 55 publications receiving 1,815 citations. Previous affiliations of Xin-Jing Wang include Tsinghua University.


Papers
Proceedings Article • DOI
Xin-Jing Wang, Lei Zhang, Feng Jing, Wei-Ying Ma
17 Jun 2006
TL;DR: This paper presents AnnoSearch, a novel way to annotate images using search and data mining technologies, which enables annotation with an unlimited vocabulary, something existing approaches cannot do.
Abstract: Although it has been studied for several years by the computer vision and machine learning communities, image annotation is still far from practical. In this paper, we present AnnoSearch, a novel way to annotate images using search and data mining technologies. Leveraging Web-scale images, we solve this problem in two steps: 1) searching for semantically and visually similar images on the Web, and 2) mining annotations from them. First, at least one accurate keyword is required to enable text-based search for a set of semantically similar images. Then content-based search is performed on this set to retrieve visually similar images. Finally, annotations are mined from the descriptions (titles, URLs and surrounding texts) of these images. It is worth highlighting that, to ensure efficiency, high-dimensional visual features are mapped to hash codes, which significantly speeds up the content-based search process. Our proposed approach enables annotation with an unlimited vocabulary, which is impossible for existing approaches. Experimental results on real Web images show the effectiveness and efficiency of the proposed algorithm.
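The hashing and mining steps lend themselves to a compact illustration. Below is a minimal sketch assuming random-projection hashing, Hamming-distance ranking, and simple frequency-based term mining; the 32-bit code length, function names, and scoring heuristic are illustrative assumptions rather than the paper's exact method.

```python
import numpy as np
from collections import Counter

def hash_features(features, projections):
    """Map high-dimensional visual features to a binary hash code by
    taking the sign of random projections (an assumed hashing scheme)."""
    return (features @ projections > 0).astype(np.uint8)

def hamming_distance(code_a, code_b):
    return int(np.count_nonzero(code_a != code_b))

def annotate(query_feature, candidates, projections, top_k=50, num_terms=5):
    """Rank keyword-filtered candidate images by Hamming distance in hash
    space, then mine the most frequent terms from their descriptions."""
    query_code = hash_features(query_feature, projections)
    ranked = sorted(
        candidates,
        key=lambda c: hamming_distance(query_code, hash_features(c["feature"], projections)),
    )
    terms = Counter()
    for cand in ranked[:top_k]:
        terms.update(cand["description"].lower().split())
    return [term for term, _ in terms.most_common(num_terms)]

# Usage: 'candidates' stands in for the images returned by the text-based
# search step, each carrying a visual feature vector and surrounding text.
rng = np.random.default_rng(0)
projections = rng.standard_normal((128, 32))   # 128-D feature -> 32-bit code
candidates = [{"feature": rng.standard_normal(128),
               "description": "sunset over the beach"} for _ in range(100)]
print(annotate(rng.standard_normal(128), candidates, projections))
```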

334 citations

Journal Article • DOI
TL;DR: This paper proposes a novel attempt at model-free image annotation, which is a data-driven approach that annotates images by mining their search results, and enables annotating with unlimited vocabulary and is highly scalable and robust to outliers.
Abstract: Although it has been studied for years by the computer vision and machine learning communities, image annotation is still far from practical. In this paper, we propose a novel attempt at model-free image annotation, a data-driven approach that annotates images by mining their search results. Some 2.4 million images with their surrounding text are collected from a few photo forums to support this approach. The entire process is formulated in a divide-and-conquer framework where a query keyword is provided along with the uncaptioned image to improve both effectiveness and efficiency. This is helpful when the collected data set is not dense everywhere. In this sense, our approach contains three steps: 1) the search process to discover visually and semantically similar search results, 2) the mining process to identify salient terms from textual descriptions of the search results, and 3) the annotation rejection process to filter out noisy terms yielded by Step 2. To ensure real-time annotation, two key techniques are leveraged: one is to map the high-dimensional image visual features into hash codes, and the other is to implement the approach as a distributed system in which the search and mining processes are provided as Web services. As a typical result, the entire process finishes in less than one second. Since no training data set is required, our approach enables annotating with an unlimited vocabulary and is highly scalable and robust to outliers. Experimental results on both real Web images and a benchmark image data set show the effectiveness and efficiency of the proposed algorithm. It is also worth noting that, although the entire approach is illustrated within the divide-and-conquer framework, a query keyword is not crucial to our current implementation; we provide experimental results to prove this.
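Steps 2 and 3 of this pipeline can be sketched compactly. The snippet below assumes a tf-idf-style salience score for the mining step and a simple support threshold for annotation rejection; both heuristics are illustrative, not the paper's actual scoring or rejection criteria.

```python
import math
from collections import Counter

def mine_salient_terms(result_texts, corpus_doc_freq, corpus_size, top_k=10):
    """Step 2 (sketch): score terms from the search-result descriptions by
    their frequency, down-weighted by how common they are in the collection."""
    tf = Counter()
    for text in result_texts:
        tf.update(text.lower().split())
    scores = {
        term: count * math.log(corpus_size / (1 + corpus_doc_freq.get(term, 0)))
        for term, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def reject_noisy_terms(terms, result_texts, min_support=0.2):
    """Step 3 (sketch): drop candidate annotations that appear in too few
    of the search-result descriptions."""
    n = max(len(result_texts), 1)
    def support(term):
        return sum(term in text.lower() for text in result_texts) / n
    return [t for t in terms if support(t) >= min_support]
```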

213 citations

Proceedings Article • DOI
Xin-Jing Wang, Xudong Tu, Dan Feng, Lei Zhang
19 Jul 2009
TL;DR: This work proposes an analogical reasoning-based approach that measures the analogy between new question-answer linkages and those of relevant prior knowledge containing only positive links; the candidate answer with the most analogous link is assumed to be the best answer.
Abstract: The method of finding high-quality answers has a significant impact on user satisfaction in community question answering systems. However, due to the lexical gap between questions and answers as well as the spam typically present in user-generated content, filtering and ranking answers is very challenging. Previous solutions mainly focus on generating redundant features or finding textual clues using machine learning techniques; none of them considers questions and their answers as relational data, instead modeling them as independent information. Moreover, they consider only the answers of the current question and ignore previous knowledge that could help bridge the lexical and semantic gap. We assume that answers are connected to their questions by various types of latent links, i.e., positive links indicating high-quality answers and negative links indicating incorrect answers or user-generated spam, and we propose an analogical reasoning-based approach that measures the analogy between new question-answer linkages and those of relevant knowledge containing only positive links; the candidate answer with the most analogous link is assumed to be the best answer. We conducted experiments on 29.8 million Yahoo! Answers question-answer threads and show the effectiveness of our approach.
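The link-analogy idea can be illustrated with a small sketch: each question-answer pair is reduced to a "linking feature" vector, and a candidate answer is ranked by how similar its link is to the positive links mined from similar past questions. The two linking features and the cosine-averaging score below are illustrative assumptions, not the paper's actual feature set or analogy measure.

```python
import numpy as np

def link_features(question, answer):
    """Toy linking features between a question and an answer:
    word-overlap ratio and log length ratio (assumed features)."""
    q, a = set(question.lower().split()), set(answer.lower().split())
    overlap = len(q & a) / max(len(q), 1)
    length_ratio = np.log((1 + len(a)) / (1 + len(q)))
    return np.array([overlap, length_ratio])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def rank_answers(question, candidate_answers, positive_links):
    """Score each candidate by the average similarity of its link vector to
    positive link vectors from similar, previously answered questions."""
    scored = []
    for answer in candidate_answers:
        link = link_features(question, answer)
        score = float(np.mean([cosine(link, pos) for pos in positive_links]))
        scored.append((score, answer))
    return sorted(scored, reverse=True)

# 'positive_links' would hold link_features() of the best answers of
# questions retrieved as similar to the new question.
```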

113 citations

Patent
Xin-Jing Wang, Lei Zhang, Wei-Ying Ma
23 Jan 2009
TL;DR: In this article, a plurality of first questions and corresponding first answers are identified at a community question-answer (CQA) site as a plurality of first question-answer (q-a) pairs.
Abstract: In some implementations, a plurality of first questions and corresponding first answers are identified at a community question-answer (CQA) site as a plurality of first question-answer (q-a) pairs. A query thread comprised of a second question and a plurality of candidate second answers is selected for making a determination of answer quality. A set of the first questions that are similar to the second question are identified from the plurality of first questions. First linking features between the identified set of first questions and their corresponding first answers are used for determining an analogy with second linking features between the second question and candidate answers for ranking the candidate answers.

111 citations

Proceedings Article • DOI
10 Oct 2004
TL;DR: An iterative similarity propagation approach is proposed to explore the inter-relationships between Web images and their textual annotations for image retrieval; experiments show that the proposed approach can significantly improve Web image retrieval performance.
Abstract: In this paper, we propose an iterative similarity propagation approach to explore the inter-relationships between Web images and their textual annotations for image retrieval. By considering Web images as one type of objects, their surrounding texts as another type, and constructing the links structure between them via webpage analysis, we can iteratively reinforce the similarities between images. The basic idea is that if two objects of the same type are both related to one object of another type, these two objects are similar; likewise, if two objects of the same type are related to two different, but similar objects of another type, then to some extent, these two objects are also similar. The goal of our method is to fully exploit the mutual reinforcement between images and their textual annotations. Our experiments based on 10,628 images crawled from the Web show that our proposed approach can significantly improve Web image retrieval performance.
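The mutual-reinforcement idea can be written down as a short iteration over the image-text link structure. The update rule and damping factor below are illustrative assumptions in the spirit of the approach, not the paper's exact formulation.

```python
import numpy as np

def propagate_similarity(links, init_img_sim, init_txt_sim, alpha=0.8, iters=10):
    """links[i, j] = 1 if image i is linked to text block j (from webpage
    analysis). Image similarities are reinforced through similar texts and
    text similarities through similar images, iteratively."""
    # Row-normalize the link matrix in both directions.
    img_to_txt = links / np.maximum(links.sum(axis=1, keepdims=True), 1)
    txt_to_img = links.T / np.maximum(links.T.sum(axis=1, keepdims=True), 1)

    img_sim, txt_sim = init_img_sim.copy(), init_txt_sim.copy()
    for _ in range(iters):
        new_img = alpha * img_to_txt @ txt_sim @ img_to_txt.T + (1 - alpha) * init_img_sim
        new_txt = alpha * txt_to_img @ img_sim @ txt_to_img.T + (1 - alpha) * init_txt_sim
        img_sim, txt_sim = new_img, new_txt
    return img_sim, txt_sim
```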

102 citations


Cited by
Journal Article • DOI
TL;DR: Almost 300 key theoretical and empirical contributions from the current decade related to image retrieval and automatic image annotation are surveyed, the spawning of related subfields is discussed, and the adaptation of existing image retrieval techniques to build systems useful in the real world is examined.
Abstract: We have witnessed great interest and a wealth of promise in content-based image retrieval as an emerging technology. While the last decade laid foundation to such promise, it also paved the way for a large number of new techniques and systems, got many new people involved, and triggered stronger association of weakly related fields. In this article, we survey almost 300 key theoretical and empirical contributions in the current decade related to image retrieval and automatic image annotation, and in the process discuss the spawning of related subfields. We also discuss significant challenges involved in the adaptation of existing image retrieval techniques to build systems that can be useful in the real world. In retrospect of what has been achieved so far, we also conjecture what the future may hold for image retrieval research.

3,433 citations

01 Jan 2006

3,012 citations

Proceedings Article • DOI
08 Jul 2009
TL;DR: The benchmark results indicate that it is possible to learn effective models from a sufficiently large image dataset to facilitate general image retrieval; four research issues on web image annotation and retrieval are also identified.
Abstract: This paper introduces a web image dataset created by NUS's Lab for Media Search. The dataset includes: (1) 269,648 images and the associated tags from Flickr, with a total of 5,018 unique tags; (2) six types of low-level features extracted from these images, including 64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments extracted over 5x5 fixed grid partitions, and 500-D bag of words based on SIFT descriptions; and (3) ground-truth for 81 concepts that can be used for evaluation. Based on this dataset, we highlight characteristics of Web image collections and identify four research issues on web image annotation and retrieval. We also provide the baseline results for web image annotation by learning from the tags using the traditional k-NN algorithm. The benchmark results indicate that it is possible to learn effective models from sufficiently large image dataset to facilitate general image retrieval.
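The k-NN baseline mentioned above is straightforward to sketch: an unlabeled image inherits the most frequent tags among its k nearest neighbors in low-level feature space. The distance metric, k, and number of returned tags below are illustrative choices, not the exact benchmark settings.

```python
import numpy as np
from collections import Counter

def knn_annotate(query_feat, train_feats, train_tags, k=5, num_tags=3):
    """train_feats: (N, D) matrix of low-level features (e.g., one of the
    six feature types in the dataset); train_tags: list of N tag lists."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    neighbors = np.argsort(dists)[:k]
    votes = Counter(tag for i in neighbors for tag in train_tags[i])
    return [tag for tag, _ in votes.most_common(num_tags)]
```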

2,648 citations

Proceedings Article • DOI
21 Jul 2017
TL;DR: A recurrent attention convolutional neural network (RA-CNN) is proposed that recursively learns discriminative region attention and region-based feature representation at multiple scales in a mutually reinforced way.
Abstract: Recognizing fine-grained categories (e.g., bird species) is difficult due to the challenges of discriminative region localization and fine-grained feature learning. Existing approaches predominantly solve these challenges independently, while neglecting the fact that region detection and fine-grained feature learning are mutually correlated and thus can reinforce each other. In this paper, we propose a novel recurrent attention convolutional neural network (RA-CNN) which recursively learns discriminative region attention and region-based feature representation at multiple scales in a mutually reinforced way. The learning at each scale consists of a classification sub-network and an attention proposal sub-network (APN). The APN starts from the full image and iteratively generates region attention from coarse to fine by taking the previous prediction as a reference, while the finer-scale network takes as input an amplified attended region from the previous scale in a recurrent way. The proposed RA-CNN is optimized by an intra-scale classification loss and an inter-scale ranking loss, to mutually learn accurate region attention and fine-grained representation. RA-CNN does not need bounding box/part annotations and can be trained end-to-end. We conduct comprehensive experiments and show that RA-CNN achieves the best performance in three fine-grained tasks, with relative accuracy gains of 3.3%, 3.7%, and 3.8% on CUB Birds, Stanford Dogs, and Stanford Cars, respectively.
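The combination of an intra-scale classification loss and an inter-scale ranking loss can be sketched as follows; the margin value, equal weighting, and plain-numpy formulation are illustrative assumptions rather than the exact training objective.

```python
import numpy as np

def cross_entropy(probs, true_class):
    """Intra-scale classification loss for one scale's softmax output."""
    return -np.log(probs[true_class] + 1e-12)

def ranking_loss(p_true_coarse, p_true_fine, margin=0.05):
    """Inter-scale ranking loss (sketch): penalize the finer scale if it is not
    at least 'margin' more confident on the true class than the coarser scale."""
    return max(0.0, p_true_coarse - p_true_fine + margin)

def ra_cnn_loss(scale_probs, true_class, margin=0.05):
    """scale_probs: list of per-scale softmax outputs, ordered coarse to fine."""
    cls_loss = sum(cross_entropy(p, true_class) for p in scale_probs)
    rank_loss = sum(
        ranking_loss(scale_probs[s][true_class], scale_probs[s + 1][true_class], margin)
        for s in range(len(scale_probs) - 1)
    )
    return cls_loss + rank_loss
```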

1,035 citations

Proceedings Article • DOI
23 Oct 2006
TL;DR: The proposed spatiotemporal video attention framework has been applied to over 20 test video sequences, and attended regions are detected that highlight interesting objects and motions in the sequences with a very high user satisfaction rate.
Abstract: The human vision system actively seeks interesting regions in images to reduce the search effort in tasks such as object detection and recognition. Similarly, prominent actions in video sequences are more likely to attract our first sight than their surrounding neighbors. In this paper, we propose a spatiotemporal video attention detection technique for detecting the attended regions that correspond to both interesting objects and actions in video sequences. Both spatial and temporal saliency maps are constructed and further fused in a dynamic fashion to produce the overall spatiotemporal attention model. In the temporal attention model, motion contrast is computed based on the planar motions (homography) between images, which are estimated by applying RANSAC on point correspondences in the scene. To compensate for the non-uniform spatial distribution of interest points, spanning areas of motion segments are incorporated in the motion contrast computation. In the spatial attention model, a fast method for computing pixel-level saliency maps has been developed using color histograms of images. A hierarchical spatial attention representation is established to reveal the interesting points in images as well as the interesting regions. Finally, a dynamic fusion technique is applied to combine the temporal and spatial saliency maps, where temporal attention is dominant over the spatial model when large motion contrast exists, and vice versa. The proposed spatiotemporal attention framework has been applied to over 20 test video sequences, and attended regions are detected that highlight interesting objects and motions present in the sequences with a very high user satisfaction rate.
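The dynamic fusion step can be illustrated with a small sketch in which the temporal saliency map dominates when motion contrast is strong and the spatial map dominates otherwise. The sigmoid weighting of a scalar motion-contrast summary is an illustrative assumption, not the paper's exact fusion rule.

```python
import numpy as np

def fuse_saliency(spatial_map, temporal_map, motion_contrast,
                  steepness=10.0, midpoint=0.5):
    """spatial_map, temporal_map: HxW saliency maps in [0, 1].
    motion_contrast: scalar summarizing how much motion contrast exists."""
    # Weight of the temporal map grows with motion contrast.
    w = 1.0 / (1.0 + np.exp(-steepness * (motion_contrast - midpoint)))
    return w * temporal_map + (1.0 - w) * spatial_map

# With strong motion (motion_contrast near 1) the fused map follows the
# temporal saliency; with little motion it falls back to spatial saliency.
```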

983 citations