Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

doi:10.1109/CVPR.2018.00636

Open Access, Proceedings Article

Peter Anderson, +6 more

- pp 6077-6086

TL;DR: Proposes a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions, achieving state-of-the-art image captioning results on the MSCOCO test server.

Abstract:

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
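
The division of labour described in the abstract can be illustrated with a small sketch: a set of region feature vectors (the bottom-up output) is weighted by scores computed from a task-specific query (the top-down signal), and the weighted sum is the attended feature. This is not the authors' implementation. The region features here are random stand-ins rather than Faster R-CNN outputs, the additive tanh scoring followed by a softmax is an assumed, commonly used form of soft attention, and every name (`top_down_attention`, `W_v`, `W_q`, `w_a`) and dimension is hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def top_down_attention(regions, query, W_v, W_q, w_a):
    """Weight bottom-up region features using a top-down query.

    regions: k feature vectors, one per proposed image region (bottom-up).
    query:   task context vector, e.g. a decoder state or question encoding.
    W_v, W_q, w_a: hypothetical learned parameters (random in this sketch).
    Returns (weights, attended); attended is the weighted sum of the regions.
    """
    q_proj = matvec(W_q, query)
    scores = []
    for v in regions:
        v_proj = matvec(W_v, v)
        hidden = [math.tanh(a + b) for a, b in zip(v_proj, q_proj)]
        scores.append(sum(w * h for w, h in zip(w_a, hidden)))
    weights = softmax(scores)  # one normalized weight per region
    dim = len(regions[0])
    attended = [sum(wt * v[d] for wt, v in zip(weights, regions))
                for d in range(dim)]
    return weights, attended


def random_matrix(rows, cols, rng):
    """Toy parameter initializer; a trained model would learn these values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
K, D, Q, H = 4, 6, 5, 3  # regions, feature dim, query dim, hidden dim (toy sizes)
regions = random_matrix(K, D, rng)  # stand-in for detector region features
query = [rng.uniform(-1, 1) for _ in range(Q)]
W_v = random_matrix(H, D, rng)
W_q = random_matrix(H, Q, rng)
w_a = [rng.uniform(-0.5, 0.5) for _ in range(H)]

weights, attended = top_down_attention(regions, query, W_v, W_q, w_a)
print("attention weights:", weights)
print("attended feature: ", attended)
```

Because the weights are a softmax output, they are positive and sum to one, so the attended feature is a convex combination of the region features. Changing `query` changes which regions dominate, which is the sense in which the weighting is top-down.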

Citations

Book Chapter

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

Jize Cao, +5 more

TL;DR: Introduces VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks, generalizable to standard pre-trained V+L models, that aim to decipher the inner workings of multimodal pre-training.

Proceedings Article

What Does BERT with Vision Look At?

Liunian Harold Li, +4 more

TL;DR: It is demonstrated that certain attention heads of a visually grounded language model actively ground elements of language to image regions, performing the task known as entity grounding.

Proceedings Article

Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech

Aditya Deshpande, +4 more

TL;DR: The authors use part-of-speech tags as summaries of captions and let these summaries drive caption generation, achieving high accuracy for diverse captions as evaluated by standard captioning metrics and user studies.

Posted Content

Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA

Ronghang Hu, +3 more

- 14 Nov 2019 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: A novel model is proposed based on a multimodal transformer architecture accompanied by a rich representation for text in images that enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification.

Proceedings Article

Context-Aware Visual Policy Network for Sequence-Level Image Captioning

Daqing Liu, +4 more

TL;DR: A Context-Aware Visual Policy network (CAVP) for sequence-level image captioning that explicitly accounts for the previous visual attentions as the context, and then decides whether the context is helpful for the current word generation given the current visual attention.

References

Proceedings Article

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: The authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; it won 1st place in the ILSVRC 2015 classification task.

Journal Article

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997 -

Neural Computation

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

Journal Article

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, +11 more

- 01 Dec 2015 -

International Journal of Computer Vision

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images. It has been run annually from 2010 to present, attracting participation from more than fifty institutions.

Book Chapter

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.

Proceedings Article

You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon, +3 more

TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

Related Papers (5)

Deep Residual Learning for Image Recognition

Microsoft COCO: Common Objects in Context

Attention Is All You Need

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks