Generation and Comprehension of Unambiguous Object Descriptions

doi:10.1109/CVPR.2016.9

Open Access Proceedings Article

Junhua Mao, +5 more

- pp 11-20

Abstract:

We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate, our task allows for easy objective evaluation. We also present a new large-scale dataset for referring expressions, based on MSCOCO. We have released the dataset and a toolbox for visualization and evaluation; see https://github.com/mjhucla/Google_Refexp_toolbox.
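The abstract's point about easy objective evaluation can be illustrated with a minimal sketch. It assumes that comprehension means picking one candidate box per expression and that a prediction counts as correct when its intersection over union (IoU) with the ground-truth box reaches a threshold. The 0.5 threshold, the `score` callable, and all function names here are illustrative assumptions, not taken from the released toolbox.

```python
from typing import Callable, Sequence, Tuple

# Boxes are (x_min, y_min, x_max, y_max); this layout is an assumption of the sketch.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def comprehend(
    expression: str,
    candidates: Sequence[Box],
    score: Callable[[str, Box], float],
) -> Box:
    """Return the candidate region that the (hypothetical) scorer rates highest."""
    return max(candidates, key=lambda box: score(expression, box))


def comprehension_accuracy(
    predictions: Sequence[Box],
    ground_truths: Sequence[Box],
    threshold: float = 0.5,  # assumed cut-off for a correct prediction
) -> float:
    """Fraction of predicted boxes whose IoU with the ground truth meets the threshold."""
    if not predictions:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)
```

In this sketch `score` stands in for whatever model rates how well an expression fits a region. Because each prediction is a single box compared against a single annotated box, the metric needs no human judgment, which is the contrast with caption evaluation that the abstract draws.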

Citations

Journal Article

Multimodal Machine Learning: A Survey and Taxonomy

Tadas Baltrusaitis, +2 more

- 01 Feb 2019 -

IEEE Transactions on Pattern Analysis an...

TL;DR: This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.

Proceedings Article

Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

Bryan A. Plummer, +5 more

TL;DR: This paper presents Flickr30K Entities, which augments the 158k captions from Flickr30k with 244k coreference chains linking mentions of the same entities in images, as well as 276k manually annotated bounding boxes corresponding to each entity, essential for continued progress in automatic image description and grounded language understanding.

Proceedings Article

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

Peter Anderson, +8 more

TL;DR: This paper introduces the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery, and the Room-to-Room (R2R) dataset, a benchmark for visually-grounded natural language navigation in real buildings.

Proceedings Article

From Recognition to Cognition: Visual Commonsense Reasoning

Rowan Zellers, +3 more

TL;DR: To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.

Proceedings Article

MAttNet: Modular Attention Network for Referring Expression Comprehension

Licheng Yu, +6 more

TL;DR: The authors decompose expressions into three modular components (subject appearance, location, and relationship to other objects) within an end-to-end framework, allowing the model to adapt flexibly to expressions that contain different types of information.

References

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

TL;DR: The authors trained a large, deep convolutional neural network that achieved state-of-the-art results on ImageNet classification. The network consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

Proceedings Article

ImageNet: A large-scale hierarchical image database

Jia Deng, +5 more

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Book Chapter

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

TL;DR: This paper presents a new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding, gathering images of complex everyday scenes that contain common objects in their natural context.

Proceedings Article

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

Ross Girshick, +3 more

TL;DR: R-CNN applies CNNs to bottom-up region proposals to localize and segment objects, and shows that when labeled training data is scarce, supervised pre-training for an auxiliary task followed by domain-specific fine-tuning yields a significant performance boost.

Related Papers (5)

Microsoft COCO: Common Objects in Context

Deep Residual Learning for Image Recognition

Long short-term memory

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Show and tell: A neural image caption generator