Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions

Posted Content (Open Access)

- 21 Jun 2016 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: The proposed approaches, based on LSTM-RNNs, VQA model uncertainty, and caption-question similarity, outperform strong baselines on both relevance tasks, and human studies show that VQA models augmented with question relevance reasoning are perceived as more intelligent, reasonable, and human-like.

Abstract:

Visual Question Answering (VQA) is the task of answering natural-language questions about images. We introduce the novel problem of determining the relevance of questions to images in VQA. Current VQA models do not reason about whether a question is even related to the given image (e.g. What is the capital of Argentina?) or if it requires information from external resources to answer correctly. This can break the continuity of a dialogue in human-machine interaction. Our approaches for determining relevance are composed of two stages. Given an image and a question, (1) we first determine whether the question is visual or not, (2) if visual, we determine whether the question is relevant to the given image or not. Our approaches, based on LSTM-RNNs, VQA model uncertainty, and caption-question similarity, are able to outperform strong baselines on both relevance tasks. We also present human studies showing that VQA models augmented with such question relevance reasoning are perceived as more intelligent, reasonable, and human-like.
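The two-stage decision described in the abstract can be sketched as a simple cascade. This is a minimal illustration under loud assumptions, not the authors' method: the image is stood in for by a caption string, and `toy_is_visual`, `toy_is_relevant`, `NON_VISUAL_CUES`, and `STOPWORDS` are hypothetical keyword and word-overlap heuristics that replace the LSTM-RNN, VQA-uncertainty, and caption-question-similarity models named in the abstract.

```python
import re

# Hypothetical word lists, chosen only to make the toy example work.
STOPWORDS = {"what", "is", "the", "a", "an", "of", "on", "in", "are", "there"}
NON_VISUAL_CUES = {"capital", "president", "invented"}


def tokens(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_is_visual(question):
    """Stage 1 stand-in: flag a question as non-visual if it contains a cue word."""
    return not (tokens(question) & NON_VISUAL_CUES)


def toy_is_relevant(caption, question):
    """Stage 2 stand-in: relevant if a content word of the question appears in the caption."""
    content_words = tokens(question) - STOPWORDS
    return bool(content_words & tokens(caption))


def classify_relevance(caption, question,
                       is_visual=toy_is_visual,
                       is_relevant=toy_is_relevant):
    """Two-stage cascade: visual vs. non-visual first, then true vs. false premise."""
    if not is_visual(question):
        return "non-visual"
    if not is_relevant(caption, question):
        return "false-premise"
    return "relevant"


caption = "a brown dog running on grass"
for q in ["What is the capital of Argentina?",
          "What color is the umbrella?",
          "What color is the dog?"]:
    print(q, "->", classify_relevance(caption, q))
```

Because the two predicates are passed in as arguments, learned classifiers could be substituted for the toy heuristics without changing the cascade itself; only a question that passes both stages would be handed to a VQA model.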

Citations

Proceedings Article

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Yash Goyal, +4 more

TL;DR: The authors balance the VQA dataset by collecting complementary images such that every question in the balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the same question.

Posted Content

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Justin Johnson, +5 more

- 20 Dec 2016 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This work presents a diagnostic dataset that tests a range of visual reasoning abilities and uses this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.

Journal Article

Visual Dialog

Abhishek Das, +8 more

- 01 May 2019 -

IEEE Transactions on Pattern Analysis an...

TL;DR: The authors introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately.

Proceedings Article

A Corpus of Natural Language for Visual Reasoning.

Alane Suhr, +3 more

TL;DR: This work presents a method of crowdsourcing linguistically diverse data; an analysis of the data demonstrates a broad set of linguistic phenomena requiring visual and set-theoretic reasoning.

Posted Content

VizWiz Grand Challenge: Answering Visual Questions from Blind People

Danna Gurari, +7 more

- 22 Feb 2018 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: The VizWiz dataset consists of over 31,000 visual questions originating from blind people who each took a picture using a mobile phone and recorded a spoken question about it, together with 10 crowdsourced answers per visual question.

References

Book Chapter

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

TL;DR: This work presents a new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding; the dataset gathers images of complex everyday scenes containing common objects in their natural context.

Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

Tomas Mikolov, +4 more

TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes negative sampling, a simple alternative to the hierarchical softmax.

Posted Content

Distributed Representations of Words and Phrases and their Compositionality

Tomas Mikolov, +4 more

- 16 Oct 2013 -

arXiv: Computation and Language

TL;DR: The Skip-gram model learns high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships; this paper presents extensions that improve both the quality of the vectors and the training speed.

Proceedings Article

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu, +10 more

TL;DR: This work introduces an attention-based model that automatically learns to describe the content of images; it can be trained deterministically using standard backpropagation techniques or stochastically by maximizing a variational lower bound.

Posted Content

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu, +7 more

- 10 Feb 2015 -

arXiv: Learning

TL;DR: This paper proposes an attention-based model that automatically learns to describe the content of images by focusing on salient objects while generating the corresponding words in the output sequence; it achieves state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k, and MS COCO.

Related Papers (5)

VQA: Visual Question Answering

Aishwarya Agrawal, +6 more

- 03 May 2015 -

arXiv: Computation and Language

Revisiting Visual Question Answering Baselines

Allan Jabri, +2 more

- 27 Jun 2016 -

arXiv: Computer Vision and Pattern Recog...

Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?

RUBi: Reducing Unimodal Biases for Visual Question Answering

Perception Matters: Detecting Perception Failures of VQA Models Using Metamorphic Testing