Open Access Proceedings Article

Image Description using Visual Dependency Representations

Desmond Elliott et al.
pp. 1292–1302
TL;DR
In an image description task, two template-based description generation models that operate over visual dependency representations outperform approaches that rely on object proximity or corpus information, on both automatic measures and human judgements.
Abstract
Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.
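To make the idea concrete, here is a minimal sketch of a visual dependency representation as a data structure: a directed tree over annotated image regions whose arcs carry spatial relation labels, plus a toy template-based generator that reads a description off the arcs. The region labels, the relation name ("on"), and the one-clause-per-arc template are illustrative assumptions, not the paper's exact annotation scheme.

```python
# Minimal sketch of a visual dependency representation (VDR): a directed
# tree over annotated image regions, with each arc labelled by a spatial
# relation. Labels, relations, and the template are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Region:
    label: str                                     # object class of the region
    children: list = field(default_factory=list)   # outgoing (relation, Region) arcs

    def attach(self, relation: str, child: "Region") -> "Region":
        self.children.append((relation, child))
        return self


def describe(node: Region) -> list:
    """Template-based generation: emit one clause per dependency arc."""
    clauses = []
    for relation, child in node.children:
        clauses.append(f"The {node.label} is {relation} the {child.label}.")
        clauses.extend(describe(child))
    return clauses


# A man on a bike, which is on the road:
road = Region("road")
bike = Region("bike").attach("on", road)
man = Region("man").attach("on", bike)

print(" ".join(describe(man)))
# The man is on the bike. The bike is on the road.
```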



Citations
Proceedings Article

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

TL;DR: An attention-based model that automatically learns to describe the content of images is introduced; it can be trained deterministically using standard backpropagation techniques, or stochastically by maximizing a variational lower bound.
Posted Content

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

TL;DR: This paper proposes an attention-based model that automatically learns to describe the content of images by focusing on salient objects while generating the corresponding words in the output sequence; it achieves state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k, and MS COCO.
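As a rough illustration of the soft (deterministic) attention step these two entries summarize, here is a NumPy sketch: annotation vectors from a convolutional feature map are scored against the decoder's previous hidden state, softmax-normalized, and averaged into a context vector. The names, shapes, and the additive scoring function are assumptions for illustration, not the paper's exact parameterization.

```python
# Sketch of one soft-attention step: score each image location against the
# decoder state, normalize with a softmax, and take the expected context.
import numpy as np


def soft_attention(a, h, W_a, W_h, v):
    """a: (L, D) annotation vectors; h: (H,) previous decoder hidden state;
    W_a: (K, D), W_h: (K, H), v: (K,) alignment parameters (illustrative)."""
    scores = v @ np.tanh(W_a @ a.T + (W_h @ h)[:, None])  # (L,) alignment scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                  # softmax over locations
    z = alpha @ a                                         # (D,) expected context vector
    return alpha, z


rng = np.random.default_rng(0)
L, D, H, K = 196, 512, 256, 128
alpha, z = soft_attention(rng.normal(size=(L, D)), rng.normal(size=H),
                          rng.normal(size=(K, D)), rng.normal(size=(K, H)),
                          rng.normal(size=K))
print(alpha.shape, z.shape)  # (196,) (512,)
```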
Proceedings Article (DOI)

Show and tell: A neural image caption generator

TL;DR: A generative model based on a deep recurrent architecture, combining recent advances in computer vision and machine translation, is proposed to automatically generate natural sentences describing the content of an image.
Proceedings Article (DOI)

Deep visual-semantic alignments for generating image descriptions

TL;DR: A model is presented that generates natural language descriptions of images and their regions, based on a novel combination of convolutional neural networks over image regions, bidirectional recurrent neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.
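Here is a sketch of the kind of region-word alignment score such a structured objective can build on, under the assumption that regions and words have already been embedded into a shared space; the dot-product-and-max form is illustrative, not the paper's exact objective.

```python
# Sketch of an image-sentence alignment score over a shared embedding space:
# each word fragment aligns to its best-matching region fragment, and the
# pair's score is the sum of those best matches. Shapes are illustrative.
import numpy as np


def alignment_score(regions, words):
    """regions: (R, D) region embeddings; words: (T, D) word embeddings."""
    sims = words @ regions.T              # (T, R) inner products
    return float(sims.max(axis=1).sum())  # each word picks its best region


rng = np.random.default_rng(0)
score = alignment_score(rng.normal(size=(5, 64)), rng.normal(size=(8, 64)))
print(score)
```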
Proceedings Article (DOI)

CIDEr: Consensus-based image description evaluation

TL;DR: A novel paradigm for evaluating image descriptions based on human consensus is proposed, along with a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources.
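For reference, the consensus metric has, to the best of my recollection (treat the exact form as an assumption), a TF-IDF-weighted n-gram cosine form, averaged over the reference sentences and over n-gram lengths:

```latex
% CIDEr sketch: c_i is the candidate caption, S_i = {s_{i1}, ..., s_{im}} the
% reference set, and g^n(.) the TF-IDF weight vector over n-grams.
\[
\mathrm{CIDEr}_n(c_i, S_i) = \frac{1}{m} \sum_{j=1}^{m}
  \frac{g^n(c_i) \cdot g^n(s_{ij})}
       {\lVert g^n(c_i) \rVert \, \lVert g^n(s_{ij}) \rVert},
\qquad
\mathrm{CIDEr}(c_i, S_i) = \sum_{n=1}^{N} w_n \, \mathrm{CIDEr}_n(c_i, S_i),
\]
% with uniform weights w_n = 1/N and N = 4 n-gram lengths in the original paper.
```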
References
Journal Article (DOI)

Object Detection with Discriminatively Trained Part-Based Models

TL;DR: An object detection system is described, based on mixtures of multiscale deformable part models, that can represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges.
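A rough sketch of the deformable-part score this entry refers to: a root filter response plus, for each part, the best placement trading filter response against a quadratic deformation cost. The response maps stand in for HOG-pyramid filter responses; shapes and the cost form are illustrative assumptions.

```python
# Sketch of a deformable-part-model score with quadratic deformation costs.
import numpy as np


def best_part_score(response, anchor, deform):
    """response: (H, W) part filter response map; anchor: ideal (row, col)
    placement relative to the root; deform: (d_row, d_col) cost weights."""
    rows, cols = np.indices(response.shape)
    cost = deform[0] * (rows - anchor[0]) ** 2 + deform[1] * (cols - anchor[1]) ** 2
    return float((response - cost).max())


def dpm_score(root_response, part_responses, anchors, deforms):
    return root_response + sum(
        best_part_score(r, a, d)
        for r, a, d in zip(part_responses, anchors, deforms))


rng = np.random.default_rng(0)
parts = [rng.normal(size=(16, 16)) for _ in range(2)]
print(dpm_score(1.0, parts, anchors=[(4, 4), (10, 10)], deforms=[(0.1, 0.1)] * 2))
```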
Journal Article (DOI)

LabelMe: A Database and Web-Based Tool for Image Annotation

TL;DR: In this article, a large collection of images with ground-truth labels is built for object detection and recognition research; such data is useful for supervised learning and quantitative evaluation.

The PASCAL visual object classes challenge 2006 (VOC2006) results

TL;DR: This report presents the results of the 2006 PASCAL Visual Object Classes Challenge (VOC2006).
Book Chapter (DOI)

Every picture tells a story: generating sentences from images

TL;DR: A system is presented that computes a score linking an image to a sentence; the score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence.
Proceedings Article

Im2Text: Describing Images Using 1 Million Captioned Photographs

TL;DR: A new objective performance measure for image captioning is introduced, and methods incorporating many state-of-the-art, but fairly noisy, estimates of image content are developed to produce even more pleasing results.
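As a sketch of the retrieval flavour of this approach: describe a new image by transferring the caption of its nearest neighbour in a large captioned collection, under the assumption that images are compared with a global feature vector. The feature extraction step is omitted, and all names and shapes are illustrative.

```python
# Sketch of retrieval-based captioning: transfer the caption of the visually
# nearest database image, comparing global feature vectors.
import numpy as np


def transfer_caption(query, db_feats, db_captions):
    """query: (D,) feature; db_feats: (N, D); db_captions: N strings."""
    dists = np.linalg.norm(db_feats - query, axis=1)
    return db_captions[int(dists.argmin())]


rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 32))
print(transfer_caption(feats[1] + 0.01, feats,
                       ["a dog on grass", "a man on a bike", "a red car"]))
# a man on a bike
```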