Visual question answering: A survey of methods and datasets

doi:10.1016/J.CVIU.2017.05.001

Qi Wu, +5 more

- 01 Oct 2017 -

Computer Vision and Image Understanding

- Vol. 163, pp 21-40

TLDR

Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and natural language processing communities. It requires reasoning over the visual elements of an image and general knowledge to infer the correct answer.

About:

This article is published in Computer Vision and Image Understanding. The article was published on 2017-10-01 and is currently open access. It has received 255 citations to date. The article focuses on the topics: Question answering & Natural language.

Citations

Proceedings Article

Detecting Visual Relationships with Deep Relational Networks

Bo Dai, +2 more

TL;DR: In this paper, the Deep Relational Network (DRN) is proposed to exploit the statistical dependencies between objects and their relationships, which achieves substantial improvement over state-of-the-art methods.

Proceedings Article

Graph-Structured Representations for Visual Question Answering

Damien Teney, +2 more

TL;DR: This paper proposes to build graphs over the scene objects and over the question words, and describes a deep neural network that exploits the structure in these representations, and achieves significant improvements over the state-of-the-art.

Posted Content

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions.

Christopher Clark, +5 more

- 24 May 2019 -

arXiv: Computation and Language

TL;DR: The authors study yes/no questions that are naturally occurring, meaning that they are generated in unprompted and unconstrained settings, and build a reading comprehension dataset, BoolQ, of such questions.

Proceedings Article

Deep Reinforcement Learning-Based Image Captioning with Embedding Reward

Zhou Ren, +4 more

TL;DR: A novel decision-making framework for image captioning that combines a policy network and a value network to collaboratively generate captions and outperforms state-of-the-art approaches across different evaluation metrics.

Proceedings Article

IQA: Visual Question Answering in Interactive Environments

Daniel Gordon, +5 more

TL;DR: In this paper, a Hierarchical Interactive Memory Network (HIMN) is proposed to operate at multiple levels of temporal abstraction, allowing the agent to interact with a dynamic visual environment.

References

Proceedings Article

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: This paper proposes a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the approach won 1st place in the ILSVRC 2015 classification task.

Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small convolution filters, and shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

Proceedings Article

ImageNet: A large-scale hierarchical image database

Jia Deng, +5 more

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Proceedings Article

Going deeper with convolutions

Christian Szegedy, +8 more

TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

Proceedings Article

GloVe: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

International Journal of Computer Vision

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson, +6 more

Related Papers (5)

VQA: Visual Question Answering

Deep Residual Learning for Image Recognition

Stacked Attention Networks for Image Question Answering

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering