Proceedings ArticleDOI

Multi-level Attention Networks for Visual Question Answering

TLDR
A multi-level attention network for visual question answering that can simultaneously reduce the semantic gap by semantic attention and benefit fine-grained spatial inference by visual attention is proposed.
Abstract
Inspired by the recent success of text-based question answering, visual question answering (VQA) aims to automatically answer natural language questions with reference to a given image. Compared with text-based QA, VQA is more challenging because reasoning in the visual domain requires both effective semantic embedding and fine-grained visual understanding. Existing approaches predominantly infer answers from abstract low-level visual features, neglecting both high-level image semantics and the rich spatial context of regions. To address these challenges, we propose a multi-level attention network for visual question answering that simultaneously reduces the semantic gap through semantic attention and supports fine-grained spatial inference through visual attention. First, we generate semantic concepts from the high-level semantics of a convolutional neural network (CNN) and select question-related concepts as semantic attention. Second, we encode region-based middle-level CNN outputs into a spatially embedded representation with a bidirectional recurrent neural network, and further pinpoint answer-related regions with a multilayer perceptron as visual attention. Third, we jointly optimize semantic attention, visual attention, and the question embedding through a softmax classifier to infer the final answer. Extensive experiments show that the proposed approach outperforms the state of the art on two challenging VQA datasets.
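The three-step pipeline in the abstract can be sketched as a minimal NumPy toy, purely for illustration: question-guided attention over semantic concepts, question-guided attention over region features, and a joint softmax over candidate answers. This is not the authors' implementation — the paper uses a bidirectional RNN to embed regions and an MLP to score them, which are simplified here to a single bilinear score; all names (`attend`, `Ws`, `Wv`, `Wc`) and the dimensions are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, items, W):
    # Score each item against the question embedding with a bilinear
    # form, then pool the items by their attention weights.
    scores = items @ W @ query              # (n_items,)
    weights = softmax(scores)               # attention distribution
    return weights @ items, weights         # (d,) context, (n_items,) weights

rng = np.random.default_rng(0)
d = 8
question = rng.standard_normal(d)           # question embedding
concepts = rng.standard_normal((5, d))      # high-level concept embeddings
regions = rng.standard_normal((10, d))      # middle-level region features

Ws = rng.standard_normal((d, d)) * 0.1      # semantic-attention parameters
Wv = rng.standard_normal((d, d)) * 0.1      # visual-attention parameters

sem_ctx, sem_w = attend(question, concepts, Ws)  # step 1: semantic attention
vis_ctx, vis_w = attend(question, regions, Wv)   # step 2: visual attention

# Step 3: fuse question, semantic, and visual contexts, then classify
# over a toy answer vocabulary of 3 candidates with a softmax.
joint = np.concatenate([question, sem_ctx, vis_ctx])
Wc = rng.standard_normal((3, joint.size)) * 0.1
answer_probs = softmax(Wc @ joint)
```

In the actual model the attention branches and the classifier are trained jointly end to end; the sketch only shows how the two attention levels produce complementary contexts that are fused before answer prediction.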



Citations
Proceedings ArticleDOI

Attention on Attention for Image Captioning

TL;DR: AoANet introduces an Attention on Attention (AoA) module that extends conventional attention mechanisms to determine the relevance between attention results and queries, achieving state-of-the-art performance on image captioning.
Proceedings ArticleDOI

Scene Graph Generation from Objects, Phrases and Region Captions

TL;DR: A multi-level scene description network (MSDN) is proposed to solve three vision tasks jointly in an end-to-end manner, where object, phrase, and caption regions are aligned with a dynamic graph based on their spatial and semantic connections.
Proceedings ArticleDOI

TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-Rays

TL;DR: A novel Text-Image Embedding network (TieNet) is proposed for extracting the distinctive image and text representations of chest X-rays and multi-level attention models are integrated into an end-to-end trainable CNN-RNN architecture for highlighting the meaningful text words and image regions.
Proceedings ArticleDOI

GeoMAN: Multi-level Attention Networks for Geo-sensory Time Series Prediction

TL;DR: This paper predicts the readings of a geo-sensor over several future hours by using a multi-level attention-based recurrent neural network that considers multiple sensors' readings, meteorological data, and spatial data.
Posted Content

Cross Attention Network for Few-shot Classification.

TL;DR: A novel Cross Attention Network is introduced to deal with the problem of unseen classes and a transductive inference algorithm is proposed to alleviate the low-data problem, which iteratively utilizes the unlabeled query set to augment the support set, thereby making the class features more representative.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: A residual learning framework is proposed to ease the training of networks substantially deeper than those used previously; it won 1st place on the ILSVRC 2015 classification task.
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: The effect of convolutional network depth on accuracy in the large-scale image recognition setting is investigated, showing that a significant improvement on prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Journal ArticleDOI

Gradient-based learning applied to document recognition

TL;DR: A graph transformer network (GTN) is proposed for handwritten character recognition, which can synthesize a complex decision surface able to classify high-dimensional patterns such as handwritten characters.
Book ChapterDOI

Microsoft COCO: Common Objects in Context

TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.