Open Access, Proceedings Article

Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract)

TLDR
This work proposes to frame sentence-based image annotation as the task of ranking a given pool of captions, and introduces a new benchmark collection, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
Abstract
In [Hodosh et al., 2013], we establish a ranking-based framework for sentence-based image description and retrieval. We introduce a new dataset of images paired with multiple descriptive captions that was specifically designed for these tasks. We also present strong KCCA-based baseline systems for description and search, and perform an in-depth study of evaluation metrics for these two tasks. Our results indicate that automatic evaluation metrics for our ranking-based tasks are more accurate and robust than those proposed for generation-based image description.
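As a rough illustration of what automatic evaluation of a ranking-based task can look like, the sketch below computes recall@k and median rank from a matrix of image-caption scores. It is not the authors' implementation: the function names, the toy score matrix, and the convention that caption i is the gold caption for image i are all assumptions made for this example, and the scores would in practice come from a model such as a KCCA-based system.

```python
from statistics import median


def gold_ranks(scores):
    """Return the 1-based rank of the gold caption for each image.

    Assumption for this sketch: scores[i][j] is the similarity between
    image i and caption j, and caption i is the gold caption for image i.
    """
    ranks = []
    for i, row in enumerate(scores):
        gold = row[i]
        # Rank = 1 + number of captions scored strictly higher than the gold one.
        ranks.append(1 + sum(1 for s in row if s > gold))
    return ranks


def recall_at_k(ranks, k):
    """Fraction of images whose gold caption appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the gold caption across all images."""
    return median(ranks)


# Hypothetical toy scores: 3 images ranked against a pool of 3 captions.
scores = [
    [0.9, 0.1, 0.3],
    [0.4, 0.2, 0.8],
    [0.5, 0.6, 0.7],
]
ranks = gold_ranks(scores)
print(ranks, recall_at_k(ranks, 1), median_rank(ranks))
```

Both measures depend only on where the gold caption lands in the ranked pool, so they can be computed without human judgments of generated text.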


Citations
Posted Content

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

TL;DR: This paper proposed an attention-based model that automatically learns to describe the content of images by focusing on salient objects while generating corresponding words in the output sequence, which achieved state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
Proceedings Article

Deep visual-semantic alignments for generating image descriptions

TL;DR: A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.
Posted Content

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

TL;DR: A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
Proceedings Article

VQA: Visual Question Answering

TL;DR: The task of free-form and open-ended Visual Question Answering (VQA) is proposed: given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Proceedings Article

CIDEr: Consensus-based image description evaluation

TL;DR: A novel paradigm for evaluating image descriptions that uses human consensus is proposed and a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources is evaluated.