Explain Images with Multimodal Recurrent Neural Networks.

Open AccessPosted Content

Explain Images with Multimodal Recurrent Neural Networks.

- 04 Oct 2014 -

arXiv: Computer Vision and Pattern Recog...

TLDR

The m-RNN model directly models the probability distribution of generating a word given previous words and the image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

Abstract:

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12, Flickr 8K, and Flickr 30K). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

Citations

PDF

Open Access

More filters

Proceedings ArticleDOI

Show and tell: A neural image caption generator

Oriol Vinyals, +3 more

TL;DR: In this paper, a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation is proposed to generate natural sentences describing an image, which can be used to automatically describe the content of an image.

...read moreread less

Proceedings ArticleDOI

Long-term recurrent convolutional networks for visual recognition and description

Jeff Donahue, +6 more

TL;DR: A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

...read moreread less

Proceedings ArticleDOI

Deep visual-semantic alignments for generating image descriptions

Andrej Karpathy, +1 more

TL;DR: A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.

...read moreread less

Posted Content

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Jeff Donahue, +6 more

- 17 Nov 2014 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

...read moreread less

Journal ArticleDOI

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Ranjay Krishna, +11 more

- 01 May 2017 -

International Journal of Computer Vision

TL;DR: The Visual Genome dataset as mentioned in this paper contains over 108k images where each image has an average of $35$35 objects, $26$26 attributes, and $21$21 pairwise relationships between objects.

...read moreread less

Collapse

References

PDF

Open Access

More filters

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

...read moreread less

Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

...read moreread less

Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

...read moreread less

Book ChapterDOI

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.

...read moreread less

Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

Tomas Mikolov, +4 more

TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.

...read moreread less

Collapse

Explain Images with Multimodal Recurrent Neural Networks.

Citations

Show and tell: A neural image caption generator

Long-term recurrent convolutional networks for visual recognition and description

Deep visual-semantic alignments for generating image descriptions

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

References

ImageNet Classification with Deep Convolutional Neural Networks

Very Deep Convolutional Networks for Large-Scale Image Recognition

Very Deep Convolutional Networks for Large-Scale Image Recognition

Microsoft COCO: Common Objects in Context

Distributed Representations of Words and Phrases and their Compositionality

Related Papers (5)

Show and tell: A neural image caption generator

Deep visual-semantic alignments for generating image descriptions

Bleu: a Method for Automatic Evaluation of Machine Translation

Long-term recurrent convolutional networks for visual recognition and description

Microsoft COCO: Common Objects in Context