Generation and Comprehension of Unambiguous Object Descriptions
Junhua Mao,Jonathan Huang,Alexander Toshev,Oana-Maria Camburu,Alan L. Yuille,Kevin Murphy +5 more
- pp 11-20
Reads0
Chats0
TLDR
The authors proposed a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described.Abstract:
We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate, our task allows for easy objective evaluation. We also present a new large-scale dataset for referring expressions, based on MSCOCO. We have released the dataset and a toolbox for visualization and evaluation, see https://github.com/ mjhucla/Google_Refexp_toolbox.read more
Citations
More filters
Journal ArticleDOI
Multimodal Machine Learning: A Survey and Taxonomy
TL;DR: This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Proceedings ArticleDOI
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
Bryan A. Plummer,Liwei Wang,Christopher M. Cervantes,Juan C. Caicedo,Julia Hockenmaier,Svetlana Lazebnik +5 more
TL;DR: This paper presents Flickr30K Entities, which augments the 158k captions from Flickr30k with 244k coreference chains linking mentions of the same entities in images, as well as 276k manually annotated bounding boxes corresponding to each entity, essential for continued progress in automatic image description and grounded language understanding.
Proceedings ArticleDOI
Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
Peter Anderson,Qi Wu,Damien Teney,Jake Bruce,Mark Johnson,Niko Sünderhauf,Ian Reid,Stephen Gould,Anton van den Hengel +8 more
TL;DR: The Room-to-Room (R2R) dataset as mentioned in this paper provides a large-scale reinforcement learning environment based on real imagery for visually-grounded natural language navigation in real buildings.
Proceedings ArticleDOI
From Recognition to Cognition: Visual Commonsense Reasoning
TL;DR: To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.
Proceedings ArticleDOI
MAttNet: Modular Attention Network for Referring Expression Comprehension
TL;DR: The authors decompose expressions into three modular components related to subject appearance, location, and relationship to other objects in an end-to-end framework, which allows to flexibly adapt to expressions containing different types of information.
References
More filters
Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan,Andrew Zisserman +1 more
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings ArticleDOI
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Book ChapterDOI
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin,Michael Maire,Serge Belongie,James Hays,Pietro Perona,Deva Ramanan,Piotr Dollár,C. Lawrence Zitnick +7 more
TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.
Proceedings ArticleDOI
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
TL;DR: RCNN as discussed by the authors combines CNNs with bottom-up region proposals to localize and segment objects, and when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.