Open Access Journal Article

Visual question answering: A survey of methods and datasets

TL;DR: Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. It requires reasoning over visual elements of the image and general knowledge to infer the correct answer.
About
This article was published in Computer Vision and Image Understanding on 2017-10-01 and is currently open access. It has received 255 citations to date. The article focuses on the topics: Question answering & Natural language.

Citations
Proceedings Article

Detecting Visual Relationships with Deep Relational Networks

TL;DR: In this paper, the Deep Relational Network (DRN) is proposed to exploit the statistical dependencies between objects and their relationships, which achieves substantial improvement over state-of-the-art methods.
Proceedings Article

Graph-Structured Representations for Visual Question Answering

TL;DR: This paper proposes to build graphs over the scene objects and over the question words, and describes a deep neural network that exploits the structure in these representations, and achieves significant improvements over the state-of-the-art.
Posted Content

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions.

TL;DR: The authors study yes/no questions that are naturally occurring, meaning that they are generated in unprompted and unconstrained settings, and build a reading comprehension dataset, BoolQ, of such questions.
Proceedings Article

Deep Reinforcement Learning-Based Image Captioning with Embedding Reward

TL;DR: A novel decision-making framework for image captioning that combines a policy network and a value network to collaboratively generate captions and outperforms state-of-the-art approaches across different evaluation metrics.
Proceedings Article

IQA: Visual Question Answering in Interactive Environments

TL;DR: In this paper, a Hierarchical Interactive Memory Network (HIMN) is proposed to operate at multiple levels of temporal abstraction, allowing the agent to interact with a dynamic visual environment.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting networks won 1st place on the ILSVRC 2015 classification task.
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings Article

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings Article

Going deeper with convolutions

TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Proceedings Article

GloVe: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.