Open Access · Proceedings Article
Image Description using Visual Dependency Representations
Desmond Elliott, Frank Keller
pp. 1292–1302
TL;DR: In an image description task, two template-based description generation models that operate over visual dependency representations outperform approaches that rely on object proximity or corpus information to generate descriptions, on both automatic measures and on human judgements.

Abstract:
Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.
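The core idea of the abstract can be sketched as a small data structure: a visual dependency representation (VDR) is a directed graph over annotated image regions whose edges carry spatial relations, and template-based generation emits one clause per arc. The relation labels, region fields, and template below are illustrative assumptions, not the paper's exact inventory or models.

```python
# Illustrative sketch of a visual dependency representation (VDR):
# a directed graph over annotated image regions, with each arc
# labelled by a spatial relation. Relation names and the sentence
# template are assumptions for illustration only.
from dataclasses import dataclass, field


@dataclass
class Region:
    label: str    # object class, e.g. "man"
    bbox: tuple   # (x, y, width, height) in pixels


@dataclass
class VDR:
    """Dependency arcs: child region -> (head region, spatial relation)."""
    arcs: dict = field(default_factory=dict)

    def attach(self, child: Region, head: Region, relation: str) -> None:
        self.arcs[child.label] = (head.label, relation)

    def describe(self) -> list:
        # Template-based generation in the spirit of the paper:
        # one clause per dependency arc.
        return [f"The {child} is {rel} the {head}."
                for child, (head, rel) in self.arcs.items()]


man = Region("man", (50, 40, 60, 120))
bike = Region("bicycle", (100, 90, 80, 70))

vdr = VDR()
vdr.attach(man, bike, "beside")   # assumed relation label
print(vdr.describe())             # prints ['The man is beside the bicycle.']
```

Compared with a bag-of-regions representation, the arcs make the relationship between the man and the bicycle explicit, which is what the template consumes.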
Citations
Proceedings Article
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio
TL;DR: An attention based model that automatically learns to describe the content of images is introduced that can be trained in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound.
Proceedings ArticleDOI
Show and tell: A neural image caption generator
TL;DR: In this paper, a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation is proposed to automatically generate natural sentences describing the content of an image.
Proceedings ArticleDOI
Deep visual-semantic alignments for generating image descriptions
Andrej Karpathy, Li Fei-Fei
TL;DR: A model that generates natural language descriptions of images and their regions is presented, based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.
Proceedings ArticleDOI
CIDEr: Consensus-based image description evaluation
TL;DR: A novel paradigm for evaluating image descriptions using human consensus is proposed, along with a new automated metric that captures human judgement of consensus better than existing metrics across sentences generated by various sources.
References
Journal ArticleDOI
Object Detection with Discriminatively Trained Part-Based Models
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Journal ArticleDOI
LabelMe: A Database and Web-Based Tool for Image Annotation
TL;DR: In this article, a large collection of images with ground-truth labels is built for object detection and recognition research; such data is useful for supervised learning and quantitative evaluation.
The PASCAL visual object classes challenge 2006 (VOC2006) results
TL;DR: This report presents the results of the 2006 PASCAL Visual Object Classes Challenge (VOC2006).
Book ChapterDOI
Every picture tells a story: generating sentences from images
Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth
TL;DR: A system is presented that can compute a score linking an image to a sentence; this score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence.
Proceedings Article
Im2Text: Describing Images Using 1 Million Captioned Photographs
TL;DR: A new objective performance measure for image captioning is introduced, and methods incorporating many state-of-the-art, but fairly noisy, estimates of image content are developed to produce even more pleasing results.