VQA: Visual Question Answering

doi:10.1007/S11263-016-0966-6

Aishwarya Agrawal, +6 more

01 May 2017

International Journal of Computer Vision

Vol. 123, Iss. 1, pp. 4-31

Abstract:

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
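
The automatic evaluation mentioned in the abstract can be illustrated with a consensus-style accuracy. The sketch below assumes the form described in the full paper, where a predicted answer counts as fully correct when at least 3 of the human annotators gave the same answer. The function names and the simple lowercase/whitespace normalization are hypothetical simplifications for illustration, not the official evaluation code.

```python
from collections import Counter
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplified normalization)."""
    return " ".join(answer.lower().split())


def consensus_accuracy(predicted: str, human_answers: Sequence[str], needed: int = 3) -> float:
    """Score a prediction as min(#humans who gave that answer / needed, 1)."""
    counts = Counter(normalize(a) for a in human_answers)
    matches = counts[normalize(predicted)]  # Counter returns 0 for unseen answers
    return min(matches / needed, 1.0)


# Hypothetical example: ten human answers collected for one question.
humans = ["yes", "yes"] + ["no"] * 8
print(consensus_accuracy("no", humans))     # 1.0 (eight annotators agree)
print(consensus_accuracy("maybe", humans))  # 0.0 (no annotator agrees)
```

Because short answers such as "yes", "2", or "red" can be compared by string matching after normalization, a score of this kind can be computed without human judges, which is what makes the open-ended task practical to benchmark.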

Citations

Proceedings Article

AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

Tao Xu, +6 more

TL;DR: AttnGAN is an attentional generative network that synthesizes fine-grained details in different sub-regions of the image by attending to the relevant words in the natural language description.

Journal Article

A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI

Erico Tjoa, +1 more

27 Oct 2021

IEEE Transactions on Neural Networks and Learning Systems

TL;DR: This survey reviews and categorizes the notions of interpretability proposed across different research works, in the hope that interpretability will be approached with more consideration for medical practice; it also encourages initiatives to push forward data-based, mathematically grounded, and technically grounded medical education.

Posted Content

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

Fartash Faghri, +3 more

18 Jul 2017

arXiv: Learning

TL;DR: A simple change to common loss functions used for multi-modal embeddings, inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, is introduced, which yields significant gains in retrieval performance.

Proceedings Article

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

Drew A. Hudson, +1 more

TL;DR: GQA is a dataset for real-world visual reasoning and compositional question answering that leverages Visual Genome scene graph structures to create 22M diverse reasoning questions, each of which comes with a functional program representing its semantics.

Proceedings Article

Hierarchical Question-Image Co-Attention for Visual Question Answering

Jiasen Lu, +3 more

TL;DR: This paper proposes a co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).

References

Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small convolution filters, and shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

Journal Article

ImageNet classification with deep convolutional neural networks

Alex Krizhevsky, +2 more

24 May 2017

Communications of The ACM

TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

Book Chapter

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

TL;DR: This work presents a new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding, gathering images of complex everyday scenes that contain common objects in their natural context.

Posted Content

Distributed Representations of Words and Phrases and their Compositionality

Tomas Mikolov, +4 more

16 Oct 2013

arXiv: Computation and Language

TL;DR: This paper presents several extensions of the Skip-gram model, which learns high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships; the extensions improve both the quality of the vectors and the training speed.

Proceedings Article

Caffe: Convolutional Architecture for Fast Feature Embedding

Yangqing Jia, +7 more

TL;DR: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.

Related Papers (5)

Deep Residual Learning for Image Recognition

Glove: Global Vectors for Word Representation

Microsoft COCO: Common Objects in Context

Long short-term memory

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding