Visual question answering: A survey of methods and datasets
TLDR
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. It requires reasoning over the visual elements of an image and general knowledge to infer the correct answer.
About:
This article is published in Computer Vision and Image Understanding. It was published on 2017-10-01 and is currently open access. It has received 255 citations to date. The article focuses on the topics: Question answering & Natural language.
Citations
Proceedings Article
Detecting Visual Relationships with Deep Relational Networks
Bo Dai, Yuqi Zhang, Dahua Lin
TL;DR: This paper proposes the Deep Relational Network (DRN), which exploits the statistical dependencies between objects and their relationships and achieves substantial improvement over state-of-the-art methods.
Proceedings Article
Graph-Structured Representations for Visual Question Answering
TL;DR: This paper proposes to build graphs over the scene objects and over the question words, and describes a deep neural network that exploits the structure in these representations, achieving significant improvements over the state of the art.
Posted Content
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, Kristina Toutanova
TL;DR: The authors study yes/no questions that are naturally occurring, meaning that they are generated in unprompted and unconstrained settings, and build a reading comprehension dataset, BoolQ, of such questions.
Proceedings Article
Deep Reinforcement Learning-Based Image Captioning with Embedding Reward
TL;DR: A novel decision-making framework for image captioning that combines a policy network and a value network to collaboratively generate captions and outperforms state-of-the-art approaches across different evaluation metrics.
Proceedings Article
IQA: Visual Question Answering in Interactive Environments
TL;DR: In this paper, a Hierarchical Interactive Memory Network (HIMN) is proposed to operate at multiple levels of temporal abstraction, allowing the agent to interact with a dynamic visual environment.
References
Proceedings Article
Deep Residual Learning for Image Recognition
TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; it won 1st place on the ILSVRC 2015 classification task.
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings Article
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings Article
Going deeper with convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Proceedings Article
GloVe: Global Vectors for Word Representation
TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.