Open Access · Proceedings Article
Image Description using Visual Dependency Representations
Desmond Elliott, Frank Keller
pp. 1292–1302
TL;DR: In an image description task, two template-based description generation models that operate over visual dependency representations outperform approaches that rely on object proximity or corpus information to generate descriptions, on both automatic measures and on human judgements.

Abstract:
Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.
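The core idea of the abstract can be sketched as a small data structure: a visual dependency representation (VDR) is a directed graph over annotated image regions whose edges carry spatial relations, and template-based generation emits one clause per arc. The relation labels, region fields, and template below are illustrative assumptions, not the paper's exact inventory or models.

```python
# Illustrative sketch of a visual dependency representation (VDR):
# a directed graph over annotated image regions, with each arc
# labelled by a spatial relation. Relation names and the sentence
# template are assumptions for illustration only.
from dataclasses import dataclass, field


@dataclass
class Region:
    label: str    # object class, e.g. "man"
    bbox: tuple   # (x, y, width, height) in pixels


@dataclass
class VDR:
    """Dependency arcs: child region -> (head region, spatial relation)."""
    arcs: dict = field(default_factory=dict)

    def attach(self, child: Region, head: Region, relation: str) -> None:
        self.arcs[child.label] = (head.label, relation)

    def describe(self) -> list:
        # Template-based generation in the spirit of the paper:
        # one clause per dependency arc.
        return [f"The {child} is {rel} the {head}."
                for child, (head, rel) in self.arcs.items()]


man = Region("man", (50, 40, 60, 120))
bike = Region("bicycle", (100, 90, 80, 70))

vdr = VDR()
vdr.attach(man, bike, "beside")   # assumed relation label
print(vdr.describe())             # prints ['The man is beside the bicycle.']
```

Compared with a bag-of-regions representation, the arcs make the relationship between the man and the bicycle explicit, which is what the template consumes.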
Citations
Proceedings Article
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio
TL;DR: An attention based model that automatically learns to describe the content of images is introduced that can be trained in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound.
Proceedings ArticleDOI
Show and tell: A neural image caption generator
TL;DR: In this paper, a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation is proposed to automatically generate natural sentences describing the content of an image.
Proceedings ArticleDOI
Deep visual-semantic alignments for generating image descriptions
Andrej Karpathy, Li Fei-Fei
TL;DR: A model that generates natural language descriptions of images and their regions is presented, based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.
Proceedings ArticleDOI
CIDEr: Consensus-based image description evaluation
TL;DR: A novel paradigm for evaluating image descriptions using human consensus is proposed, along with a new automated metric that captures human judgement of consensus better than existing metrics across sentences generated by various sources.
References
Journal ArticleDOI
Object Detection with Discriminatively Trained Part-Based Models
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Journal ArticleDOI
LabelMe: A Database and Web-Based Tool for Image Annotation
TL;DR: In this article, a large collection of images with ground-truth labels is built for object detection and recognition research; such data is useful for supervised learning and quantitative evaluation.
The PASCAL visual object classes challenge 2006 (VOC2006) results
TL;DR: This report presents the results of the 2006 PASCAL Visual Object Classes Challenge (VOC2006).
Book ChapterDOI
Every picture tells a story: generating sentences from images
Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth
TL;DR: A system is presented that can compute a score linking an image to a sentence; this score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence.
Proceedings Article
Im2Text: Describing Images Using 1 Million Captioned Photographs
TL;DR: A new objective performance measure for image captioning is introduced, and methods incorporating many state-of-the-art, but fairly noisy, estimates of image content are developed to produce even more pleasing results.