Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.

Open Access, Posted Content

- 25 Jul 2017 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: A combined bottom-up and top-down attention mechanism is proposed that enables attention to be calculated at the level of objects and other salient image regions; the same approach is applied to both image captioning and VQA.

Abstract:

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
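
The split described in the abstract, where one stage supplies per-region feature vectors and a second stage assigns a weight to each region, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the region features are hand-written stand-ins for what a detector such as Faster R-CNN would produce, the `query` vector stands in for task context (a partial caption or a question), and dot-product scoring is used for simplicity rather than as the scoring function of the paper.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, query):
    """Weight region features by relevance to a task-specific query.

    region_features: list of k feature vectors (the "bottom-up" input).
    query: one vector of the same dimension (the "top-down" signal).
    Scoring is a plain dot product, chosen here only for illustration.
    """
    scores = [sum(v_d * q_d for v_d, q_d in zip(v, query))
              for v in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    # The attended feature is the weighted sum of all region features.
    attended = [sum(w * v[d] for w, v in zip(weights, region_features))
                for d in range(dim)]
    return weights, attended


# Hypothetical toy input: three regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, attended = attend(regions, query)
```

In this toy input the first and third regions score equally against the query and receive equal weight, while the second region receives less.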

Citations

Posted Content

Attention U-Net: Learning Where to Look for the Pancreas

Ozan Oktay, +11 more

- 11 Apr 2018 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: A novel attention gate (AG) model for medical imaging that automatically learns to focus on target structures of varying shapes and sizes is proposed to eliminate the necessity of using explicit external tissue/organ localisation modules of cascaded convolutional neural networks (CNNs).

Proceedings Article

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Yash Goyal, +4 more

TL;DR: The authors balance the VQA dataset by collecting complementary images such that every question in the balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the same question.

Book Chapter

UNITER: UNiversal Image-TExt Representation Learning

Yen-Chun Chen, +7 more

TL;DR: UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

Posted Content

VisualBERT: A Simple and Performant Baseline for Vision and Language.

Liunian Harold Li, +4 more

- 09 Aug 2019 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

Proceedings Article

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu, +3 more

TL;DR: ViLBERT extends the BERT architecture to a multi-modal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional transformer layers.

References

Proceedings Article

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: A residual learning framework is proposed to ease the training of networks that are substantially deeper than those used previously; it won 1st place in the ILSVRC 2015 classification task.

Journal Article

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997 -

Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

Posted Content

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 10 Dec 2015 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Journal Article

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, +11 more

- 01 Dec 2015 -

International Journal of Computer Vision

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images. It has been run annually from 2010 to present, attracting participation from more than fifty institutions.

Proceedings Article

Glove: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

TL;DR: A new global log-bilinear regression model is introduced that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

arXiv: Learning

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, +3 more

Related Papers (5)

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Glove: Global Vectors for Word Representation

Adam: A Method for Stochastic Optimization

Bleu: a Method for Automatic Evaluation of Machine Translation

Deep Residual Learning for Image Recognition