scispace - formally typeset
Open AccessProceedings ArticleDOI

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Reads0
Chats0
TLDR
In this paper, a bottom-up and top-down attention mechanism was proposed to enable attention to be calculated at the level of objects and other salient image regions, which achieved state-of-the-art results on the MSCOCO test server.
Abstract
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

read more

Content maybe subject to copyright    Report

Citations
More filters
Posted Content

Answer Them All! Toward Universal Visual Question Answering Models

TL;DR: In this paper, the authors compare five state-of-the-art VQA algorithms across eight different datasets covering both natural image understanding and synthetic data sets and find that methods do not generalize across the two domains.
Book ChapterDOI

Learning to Generate Grounded Visual Captions Without Localization Supervision

TL;DR: This paper proposed a cyclical training regimen that forces the model to localize each word in the image after the sentence decoder generates it, and then reconstruct the sentence from the localized image region(s) to match the ground-truth.
Posted Content

Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps

TL;DR: This paper argues that a simple attention mechanism can do the same or even better job without any bells and whistles of multi-modality encoder design, and finds this simple baseline model consistently outperforms state-of-the-art (SOTA) models on two popular benchmarks, TextVQA and all three tasks of ST-V QA.
Proceedings ArticleDOI

Object-Difference Attention: A Simple Relational Attention for Visual Question Answering

TL;DR: An object-difference attention (ODA) is proposed which calculates the probability of attention by implementing difference operator between different image objects in an image under the guidance of questions in hand and Experimental results show those relational attentions have strengths on different types of questions.
Proceedings ArticleDOI

Cascade Reasoning Network for Text-based Visual Question Answering

TL;DR: A novel Cascade Reasoning Network (CRN) is proposed that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module that aims to explicitly model the connections and interactions between texts and visual concepts.
References
More filters
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Journal ArticleDOI

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Book ChapterDOI

Microsoft COCO: Common Objects in Context

TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.
Proceedings ArticleDOI

You Only Look Once: Unified, Real-Time Object Detection

TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Related Papers (5)