A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics

doi:10.1007/S11263-013-0658-4

Open Access, Journal Article

Yunchao Gong, +3 more

- 01 Jan 2014 -

International Journal of Computer Vision

- Vol. 106, Iss: 2, pp 210-233

Abstract:

This paper investigates the problem of modeling Internet images and associated text or tags for tasks such as image-to-image search, tag-to-image search, and image-to-tag search (image annotation). We start with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporate a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts. We present two ways to train the three-view embedding: supervised, with the third view coming from ground-truth labels or search keywords; and unsupervised, with semantic themes automatically obtained by clustering the tags. To ensure high accuracy for retrieval tasks while keeping the learning process scalable, we combine multiple strong visual features and use explicit nonlinear kernel mappings to efficiently approximate kernel CCA. To perform retrieval, we use a specially designed similarity function in the embedded space, which substantially outperforms the Euclidean distance. The resulting system produces compelling qualitative results and outperforms a number of two-view baselines on retrieval tasks on three large-scale Internet image datasets.
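
As a rough illustration of the two-view starting point described in the abstract, the sketch below fits a regularized linear CCA on synthetic data and ranks images for a tag query in the shared space. It is not the paper's method: it omits the third semantic view and the explicit kernel mappings, and it uses plain cosine similarity as a stand-in for the paper's specially designed similarity function. The names `inv_sqrt`, `fit_cca`, `cosine_scores`, the regularizer `reg`, and the synthetic feature dimensions are all assumptions made for this example.

```python
import numpy as np


def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return (V / np.sqrt(w)) @ V.T


def fit_cca(X, Y, k, reg=1e-3):
    """Regularized linear CCA (hypothetical helper, two views only).

    Returns projection matrices for each view and the top-k canonical
    correlations, obtained from an SVD of the whitened cross-covariance.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky, full_matrices=False)
    return Kx @ U[:, :k], Ky @ Vt[:k].T, s[:k]


def cosine_scores(query, database):
    """Cosine similarity between one embedded query and each database row."""
    q = query / np.linalg.norm(query)
    D = database / np.linalg.norm(database, axis=1, keepdims=True)
    return D @ q


# Synthetic stand-ins for visual and tag features driven by a shared 2-D latent.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
visual = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(500, 5))
tags = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))

Wx, Wy, corr = fit_cca(visual, tags, k=2)

# Tag-to-image search: embed both views, then score every image for one tag vector.
img_emb = (visual - visual.mean(axis=0)) @ Wx
tag_emb = (tags - tags.mean(axis=0)) @ Wy
scores = cosine_scores(tag_emb[0], img_emb)
ranking = np.argsort(-scores)  # image indices, best match first
```

Image-to-tag search would follow the same pattern with the roles of the two embeddings swapped. A larger `reg` trades some correlation for numerical stability when the feature covariances are close to singular.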

Citations

Posted Content

Recent Advances in Convolutional Neural Networks

Jiuxiang Gu, +11 more

- 22 Dec 2015 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper details improvements to CNNs in several aspects, including layer design, activation function, loss function, regularization, optimization, and fast computation, and introduces various applications of convolutional neural networks in computer vision, speech, and natural language processing.

Proceedings Article

Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

Bryan A. Plummer, +5 more

TL;DR: This paper presents Flickr30K Entities, which augments the 158k captions from Flickr30k with 244k coreference chains linking mentions of the same entities in images, as well as 276k manually annotated bounding boxes corresponding to each entity, essential for continued progress in automatic image description and grounded language understanding.

Proceedings Article

CNN-RNN: A Unified Framework for Multi-label Image Classification

Jiang Wang, +5 more

TL;DR: In this article, a CNN-RNN framework is proposed to learn a joint image-label embedding to characterize the semantic label dependency as well as the image label relevance, and it can be trained end-to-end from scratch to integrate both information in a unified framework.

Proceedings Article

Learning from massive noisy labeled data for image classification

Tong Xiao, +4 more

TL;DR: A general framework to train CNNs with only a limited number of clean labels and millions of easily obtained noisy labels is introduced; the relationships between images, class labels, and label noise are modeled with a probabilistic graphical model, which is further integrated into an end-to-end deep learning system.

Proceedings Article

Adversarial Cross-Modal Retrieval

Bokun Wang, +4 more

TL;DR: Comprehensive experimental results show that the proposed ACMR method is superior in learning effective subspace representation and that it significantly outperforms the state-of-the-art cross-modal retrieval methods.

References

Proceedings Article

ImageNet: A large-scale hierarchical image database

Jia Deng, +5 more

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Journal Article

Distinctive Image Features from Scale-Invariant Keypoints

David G. Lowe

- 01 Nov 2004 -

International Journal of Computer Vision

TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.

Proceedings Article

Histograms of oriented gradients for human detection

Navneet Dalal, +1 more

TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.

Journal Article

Latent Dirichlet Allocation

David M. Blei, +2 more

- 01 Mar 2003 -

Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Proceedings Article

Latent Dirichlet Allocation

David M. Blei, +2 more

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
