Making the V in VQA Matter:
Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal†   Tejas Khot†   Douglas Summers-Stay‡   Dhruv Batra§   Devi Parikh§
†Virginia Tech   ‡Army Research Laboratory   §Georgia Institute of Technology
†{ygoyal, tjskhot}@vt.edu   ‡douglas.a.summers-stay.civ@mail.mil   §{dbatra, parikh}@gatech.edu
Abstract
Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability.
We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset [3] by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at http://visualqa.org/ as part of the 2nd iteration of the Visual Question Answering Dataset and Challenge (VQA v2.0).
We further benchmark a number of state-of-the-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners.
Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model, which in addition to providing an answer to the given (image, question) pair, also provides a counter-example based explanation. Specifically, it identifies an image that is similar to the original image, but it believes has a different answer to the same question. This can help in building trust for machines among their users.
The first two authors contributed equally.
Figure 1: Examples from our balanced VQA dataset. Each question ("Who is wearing glasses?", "Where is the child sitting?", "Is the umbrella upside down?", "How many children are in the bed?") is paired with two similar images that yield different answers (woman/man, arms/fridge, no/yes, 1/2).
1. Introduction
Language and vision problems such as image captioning [8, 4, 7, 19, 40, 21, 28] and visual question answering (VQA) [3, 26, 27, 10, 31] have gained popularity in recent years as the computer vision research community is progressing beyond "bucketed" recognition and towards solving multi-modal problems.
The complex compositional structure of language makes problems at the intersection of vision and language challenging. But recent works [6, 47, 49, 16, 18, 1] have pointed out that language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content.
This phenomenon has been observed in image captioning [6] as well as visual question answering [47, 49, 16, 18, 1]. For instance, in the VQA [3] dataset, the most common sport answer "tennis" is the correct answer for 41% of the questions starting with "What sport is", and "2" is the correct answer for 39% of the questions starting with "How many". Moreover, Zhang et al. [47] point out a particular 'visual priming bias' in the VQA dataset: specifically, subjects saw an image while asking questions about it. Thus, people only ask the question "Is there a clock tower in the picture?" on images actually containing clock towers. As one particularly perverse example, for questions in the VQA dataset starting with the n-gram "Do you see a . . . ", blindly answering "yes" without reading the rest of the question or looking at the associated image results in a VQA accuracy of 87%!
These language priors can give a false impression that machines are making progress towards the goal of understanding images correctly when they are only exploiting language priors to achieve high accuracy. This can hinder progress in pushing the state of the art in the computer vision aspects of multi-modal AI [39, 47].
In this work, we propose to counter these language biases and elevate the role of image understanding in VQA. In order to accomplish this goal, we collect a balanced VQA dataset with significantly reduced language biases. Specifically, we create a balanced VQA dataset in the following way: given an (image, question, answer) triplet (I, Q, A) from the VQA dataset, we ask a human subject to identify an image I′ that is similar to I but results in the answer to the question Q becoming A′ (which is different from A). Examples from our balanced dataset are shown in Fig. 1. More random examples can be seen in Fig. 2 and on the project website (http://visualqa.org/).
Our hypothesis is that this balanced dataset will force VQA models to focus on visual information. After all, when a question has two different answers for two different images, the only way to know the right answer is by looking at the image. Language-only models have simply no basis for differentiating between the two cases (Q, I) and (Q, I′), and by construction must get one wrong. We believe that this construction will also prevent language+vision models from achieving high accuracy by exploiting language priors, enabling VQA evaluation protocols to more accurately reflect progress in image understanding.
Our balanced VQA dataset is also particularly difficult because the picked complementary image I′ is close to the original image I in the semantic (fc7) space of VGGNet [37] features. Therefore, VQA models will need to understand the subtle differences between the two images to predict the answers to both images correctly.
Note that simply ensuring that the answer distribution P(A) is uniform across the dataset would not accomplish the goal of alleviating the language biases discussed above. This is because language models exploit the correlation between question n-grams and the answers; e.g., questions starting with "Is there a clock" have the answer "yes" 98% of the time, and questions starting with "Is the man standing" have the answer "no" 69% of the time. What we need is not just higher entropy in P(A) across the dataset, but higher entropy in P(A|Q), so that the image I must play a role in determining A. This motivates our balancing at a per-question level.
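To make the distinction concrete, the following is a minimal Python sketch (our illustration, not the authors' code) that contrasts the entropy of the overall answer distribution P(A) with the frequency-weighted entropy of answers conditioned on a short question prefix, a crude stand-in for P(A|Q). The qa_pairs input is a hypothetical list of (question, answer) strings; a dataset with strong language priors shows a much lower conditional entropy.

from collections import Counter, defaultdict
from math import log2

def entropy(counts):
    # Shannon entropy (in bits) of a Counter of answer frequencies.
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def answer_entropies(qa_pairs, prefix_words=4):
    # H(A): entropy of the overall answer distribution.
    global_counts = Counter(a for _, a in qa_pairs)
    # H(A | prefix): answers grouped by the first few question words, e.g. "is there a clock".
    by_prefix = defaultdict(Counter)
    for q, a in qa_pairs:
        by_prefix[" ".join(q.lower().split()[:prefix_words])][a] += 1
    n = len(qa_pairs)
    conditional = sum(sum(c.values()) / n * entropy(c) for c in by_prefix.values())
    return entropy(global_counts), conditional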
Our complete balanced dataset contains approximately 1.1 million (image, question) pairs, almost double the size of the VQA [3] dataset, with approximately 13 million associated answers on the 200k images from COCO [23]. We believe this balanced VQA dataset is a better dataset to benchmark VQA approaches.
Finally, our data collection protocol enables us to develop a counter-example based explanation modality. We propose a novel model that not only answers questions about images, but also 'explains' its answer to an image-question pair by providing "hard negatives", i.e., examples of images that it believes are similar to the image at hand, but have different answers to the question. Such an explanation modality will allow users of the VQA model to establish greater trust in the model and identify its oncoming failures.
Our main contributions are as follows: (1) We balance the existing VQA dataset [3] by collecting complementary images such that almost every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. The result is a more balanced VQA dataset, which is also approximately twice the size of the original VQA dataset. (2) We evaluate state-of-the-art VQA models (with publicly available code) on our balanced dataset, and show that models trained on the existing 'unbalanced' VQA dataset perform poorly on our new balanced dataset. This finding confirms our hypothesis that these models have been exploiting language priors in the existing VQA dataset to achieve higher accuracy. (3) Finally, our data collection protocol for identifying complementary scenes enables us to develop a novel interpretable model, which in addition to answering questions about images, also provides a counter-example based explanation: it retrieves images that it believes are similar to the original image but have different answers to the question. Such explanations can help in building trust for machines among their users.
2. Related Work
Visual Question Answering. A number of recent works have proposed visual question answering datasets [3, 22, 26, 31, 10, 46, 38, 36] and models [9, 25, 2, 43, 24, 27, 47, 45, 44, 41, 35, 20, 29, 15, 42, 33, 17]. Our work builds on top of the VQA dataset from Antol et al. [3], which is one of the most widely used VQA datasets. We reduce the language biases present in this popular dataset, resulting in a dataset that is more balanced and about twice the size of the VQA dataset. We benchmark one 'baseline' VQA model [24], one attention-based VQA model [25], and the winning model from the VQA Challenge 2016 [9] on our balanced VQA dataset, and compare them to a language-only model.
Data Balancing and Augmentation. At a high level, our work may be viewed as constructing a more rigorous evaluation protocol by collecting 'hard negatives'. In that spirit, it is similar to the work of Hodosh et al. [14], who created a binary forced-choice image captioning task, where a machine must choose to caption an image with one of two similar captions. To compare, Hodosh et al. [14] implemented hand-designed rules to create two similar captions for images, while we create a novel annotation interface to collect two similar images for questions in VQA.
Figure 2: Random examples from our proposed balanced VQA dataset. Each question has two similar images with different answers to the question.
Perhaps the most relevant to our work is that of Zhang et al. [47], who study this goal of balancing VQA in a fairly restricted setting: binary (yes/no) questions on abstract scenes made from clipart (part of the VQA abstract scenes dataset [3]). Using clipart allows Zhang et al. to ask human annotators to "change the clipart scene such that the answer to the question changes". Unfortunately, such fine-grained editing of image content is simply not possible in real images. The novelty of our work over Zhang et al. is the proposed complementary-image data collection interface, application to real images, extension to all questions (not just binary ones), benchmarking of state-of-the-art VQA models on the balanced dataset, and finally the novel VQA model with counter-example based explanations.
Models with explanation. A number of recent works have proposed mechanisms for generating 'explanations' [13, 34, 48, 11, 32] for the predictions made by deep learning models, which are typically 'black-box' and non-interpretable. [13] generates a natural language explanation (sentence) for image categories. [34, 48, 11, 32] provide 'visual explanations', or spatial maps overlaid on images to highlight the regions that the model focused on while making its predictions. In this work, we introduce a third explanation modality: counter-examples, instances that the model believes are close to, but not belonging to, the category predicted by the model.
3. Dataset
We build on top of the VQA dataset introduced by Antol et al. [3]. The VQA real-images dataset contains just over 204K images from COCO [23], 614K free-form natural language questions (3 questions per image), and over 6 million free-form (but concise) answers (10 answers per question). While this dataset has spurred significant progress in the VQA domain, as discussed earlier, it has strong language biases.
Our key idea to counter this language bias is the following: for every (image, question, answer) triplet (I, Q, A) in the VQA dataset, our goal is to identify an image I′ that is similar to I, but results in the answer to the question Q becoming A′ (which is different from A). We built an annotation interface (shown in Fig. 3) to collect such complementary images on Amazon Mechanical Turk (AMT). AMT workers are shown 24 nearest-neighbor images of I, the question Q, and the answer A, and asked to pick an image I′ from the list of 24 images for which Q "makes sense" and the answer to Q is not A.
To capture "question makes sense", we explained to the workers (and conducted qualification tests to make sure that they understood) that any premise assumed in the question must hold true for the image they select. For instance, the question "What is the woman doing?" assumes that a woman is present and can be seen in the image. It does not make sense to ask this question on an image without a woman visible in it.
We compute the 24 nearest neighbors by first representing each image with the activations from the penultimate ('fc7') layer of a deep Convolutional Neural Network (CNN), in particular VGGNet [37], and then using ℓ2 distances to compute neighbors.
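As an illustration of this step, a minimal sketch assuming precomputed fc7 features stored in a NumPy array (this is not the authors' released pipeline) might look like:

import numpy as np

def nearest_neighbors(fc7, query_idx, k=24):
    # fc7: (num_images, 4096) array of penultimate-layer VGGNet activations.
    dists = np.linalg.norm(fc7 - fc7[query_idx], axis=1)  # L2 distance to every image
    order = np.argsort(dists)
    return order[order != query_idx][:k]  # the k closest images, excluding the query itself

The k images returned this way are the candidates shown to the AMT worker alongside Q and A.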
After the complementary images are collected, we conduct a second round of data annotation to collect answers on these new images. Specifically, we show the picked image I′ with the question Q to 10 new AMT workers, and collect 10 ground-truth answers (similar to [3]). The most common answer among the 10 is the new answer A′.
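A minimal sketch of this aggregation step (our illustration; the actual pipeline also normalizes answer strings) is:

from collections import Counter

def majority_answer(answers):
    # answers: the 10 free-form answers collected for (I', Q); the most common
    # one becomes the new ground-truth answer A'. Ties are broken arbitrarily here.
    return Counter(a.strip().lower() for a in answers).most_common(1)[0][0]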
This two-stage data collection process finally results in pairs of complementary images I and I′ that are semantically similar, but have different answers A and A′, respectively, to the same question Q. Since I and I′ are semantically similar, a VQA model will have to understand the subtle differences between I and I′ to provide the right answer to both images. Example complementary images are shown in Fig. 1, Fig. 2, and on the project website.
Note that sometimes it may not be possible to pick one of the 24 neighbors as a complementary image. This is because either (1) the question does not make sense for any of the 24 images (e.g., the question is "What is the woman doing?" and none of the neighboring images contain a woman), or (2) the question is applicable to some neighboring images, but the answer to the question is still A (the same as for the original image I). In such cases, our data collection interface allowed AMT workers to select "not possible".
We analyzed the data annotated with the "not possible" selection by AMT workers and found that this typically happens when (1) the object being asked about in the question is too small in the original image, and thus the nearest-neighbor images, while globally similar, do not necessarily contain the object, resulting in the question not making sense; or (2) the concept in the question is rare (e.g., when workers are asked to pick an image such that the answer to the question "What color is the banana?" is NOT "yellow").
In total, such "not possible" selections make up 22% of all the questions in the VQA dataset. We believe that a more sophisticated interface that allowed workers to scroll through many more than 24 neighboring images could possibly reduce this fraction. But (1) it will likely still not be 0 (there may be no image in COCO where the answer to "Is the woman flying?" is NOT "no"), and (2) the task would be significantly more cumbersome for workers, making the data collection significantly more expensive.
We collected complementary images and the corresponding new answers for all of the train, val and test splits of the VQA dataset. AMT workers picked "not possible" for approximately 135K total questions. In total, we collected approximately 195K complementary images for train, 93K complementary images for val, and 191K complementary images for the test set. In addition, we augment the test set with 18K additional (question, image) pairs to provide additional means to detect anomalous trends on the test data. Hence, our complete balanced dataset contains more than 443K train, 214K val and 453K test (question, image) pairs. Our complete balanced dataset is publicly available for download.
Figure 3: A snapshot of our Amazon Mechanical Turk (AMT) interface to collect complementary images.
We use the publicly released VQA evaluation script in our experiments. The evaluation metric uses 10 ground-truth answers for each question to compute VQA accuracies. As described above, we collected 10 answers for every complementary image and its corresponding question to be consistent with the VQA dataset [3]. Note that, while unlikely, it is possible that the majority vote of the 10 new answers may not match the intended answer of the person picking the image, either due to inter-human disagreement, or if the worker selecting the complementary image simply made a mistake. We find this to be the case, i.e., A to be the same as A′, for about 9% of our questions.
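For reference, the core of the standard VQA accuracy metric can be sketched as follows (a simplified illustration: the official script additionally averages over all 10-choose-9 subsets of the human answers and normalizes answer strings before matching):

def vqa_accuracy(predicted, human_answers):
    # An answer is counted as fully correct if at least 3 of the 10 humans gave it.
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)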
While comparing the distribution of answers per question type in our new balanced VQA dataset with the original (unbalanced) VQA dataset [3], we notice several interesting trends. First, binary questions (e.g. "is the", "is this", "is there", "are", "does") have a significantly more balanced distribution over "yes" and "no" answers in our balanced dataset compared to the unbalanced VQA dataset. "Baseball" is now slightly more popular than "tennis" under "what sport", and more importantly, overall "baseball" and "tennis" dominate less in the answer distribution. Several other sports like "frisbee", "skiing", "soccer", "skateboarding", "snowboard" and "surfing" are more visible in the answer distribution in the balanced dataset, suggesting that it contains heavier tails. Similar trends are found for colors, animals, numbers, etc. A visualization can be found at https://arxiv.org/abs/1612.00837. Quantitatively, we find that the entropy of answer distributions averaged across various question types (weighted by frequency of question types) increases by 56% after balancing, confirming the heavier tails in the answer distribution.

As the statistics show, while our balanced dataset is not perfectly balanced, it is significantly more balanced than the original VQA dataset. The resultant impact of this balancing on the performance of state-of-the-art VQA models is discussed in the next section.
4. Benchmarking Existing VQA Models
Our first approach to training a VQA model that emphasizes visual information over language priors alone is to re-train the existing state-of-the-art VQA models (with publicly available code [24, 25, 9]) on our new balanced VQA dataset. Our hypothesis is that simply training a model to answer questions correctly on our balanced dataset will already encourage the model to focus more on the visual signal, since the language signal alone has been impoverished. We experiment with the following models:
Deeper LSTM Question + norm Image (d-LSTM+n-I) [24]: This was the VQA model introduced in [3] together with the dataset. It uses a CNN embedding of the image and a Long Short-Term Memory (LSTM) embedding of the question, combines these two embeddings via a point-wise multiplication, followed by a multi-layer perceptron classifier to predict a probability distribution over the 1000 most frequent answers in the training dataset.
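For illustration, a rough PyTorch sketch of this style of architecture follows (this is not the released implementation; the layer sizes and the two-layer LSTM are assumptions chosen for readability):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeeperLSTMNormImage(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 img_dim=4096, common_dim=1024, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.q_proj = nn.Linear(hidden_dim, common_dim)
        self.i_proj = nn.Linear(img_dim, common_dim)
        self.classifier = nn.Sequential(
            nn.Linear(common_dim, common_dim), nn.ReLU(),
            nn.Linear(common_dim, num_answers))

    def forward(self, question_tokens, image_fc7):
        # LSTM embedding of the question: final hidden state of the last layer.
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = torch.tanh(self.q_proj(h[-1]))
        # L2-normalized CNN (fc7) image embedding.
        i = torch.tanh(self.i_proj(F.normalize(image_fc7, dim=1)))
        # Point-wise multiplication fuses the two, followed by an MLP classifier.
        return self.classifier(q * i)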
Hierarchical Co-attention (HieCoAtt) [25]: This is a recent attention-based VQA model that 'co-attends' to both the image and the question to predict an answer. Specifically, it models the question (and consequently the image, via the co-attention mechanism) in a hierarchical fashion: at the word level, phrase level and entire-question level. These levels are combined recursively to produce a distribution over the 1000 most frequent answers.
Multimodal Compact Bilinear Pooling (MCB) [9]: This is the winning entry on the real-images track of the VQA Challenge 2016. This model uses a multimodal compact bilinear pooling mechanism to attend over image features and combine the attended image features with language features. These combined features are then passed through a fully-connected layer to predict a probability distribution over the 3000 most frequent answers. It should be noted that MCB uses image features from a more powerful CNN architecture, ResNet [12], while the previous two models use image features from VGGNet [37].
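As a sketch of the central operation (our illustration of compact bilinear pooling in general, not the authors' exact code), each modality is projected with a random Count Sketch and the two sketches are combined by element-wise multiplication in the FFT domain, which approximates the outer product of the two feature vectors:

import torch

def count_sketch(x, h, s, d):
    # x: (batch, in_dim) features; h: LongTensor of random bucket indices in [0, d);
    # s: tensor of random signs in {-1, +1}; both h and s have length in_dim.
    out = x.new_zeros(x.size(0), d)
    out.index_add_(1, h, x * s)
    return out

def compact_bilinear(x, y, h1, s1, h2, s2, d=16000):
    fx = torch.fft.rfft(count_sketch(x, h1, s1, d))
    fy = torch.fft.rfft(count_sketch(y, h2, s2, d))
    # Element-wise product in the frequency domain = circular convolution of the sketches.
    return torch.fft.irfft(fx * fy, n=d)

In MCB for VQA, x would be an attended image feature and y the question embedding; a signed square root and L2 normalization are typically applied to the pooled output before classification.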
Baselines: To put the accuracies of these models in perspective, we compare to the following baselines. Prior: predicting the most common answer in the training set for all test questions; the most common answer is "yes" in both the unbalanced and balanced sets. Language-only: this baseline has a similar architecture to Deeper LSTM Q + norm I [24], except that it only accepts the question as input and does not utilize any visual information. Comparing VQA models to language-only ablations quantifies to what extent VQA models have succeeded in leveraging the image to answer the questions.
The results are shown in Table 1. For fair comparison of accuracies with the original (unbalanced) dataset, we create a balanced train set that is of similar size to the original dataset (referred to as B_half in the table). For benchmarking, we also report results using the full balanced train set.
Approach              UU      UB      B_half B   BB
Prior                 27.38   24.04   24.04      24.04
Language-only         48.21   41.40   41.47      43.01
d-LSTM+n-I [24]       54.40   47.56   49.23      51.62
HieCoAtt [25]         57.09   50.31   51.88      54.57
MCB [9]               60.36   54.22   56.08      59.14

Table 1: Performance of VQA models when trained/tested on unbalanced/balanced VQA datasets. UB stands for training on the Unbalanced train set and testing on the Balanced val set. UU, B_half B and BB are defined analogously.
We see that the current state-of-the-art VQA models trained on the original (unbalanced) VQA dataset perform significantly worse when evaluated on our balanced dataset, compared to evaluating on the original unbalanced VQA dataset (i.e., comparing UU to UB in the table). This finding confirms our hypothesis that existing models have learned severe language biases present in the dataset, resulting in a reduced ability to answer questions correctly when the same question has different answers on different images. When these models are trained on our balanced dataset, their performance improves (compare UB to B_half B in the table). Further, when models are trained on the complete balanced dataset (twice the size of the original dataset), the accuracy improves by 2-3% (compare B_half B to BB). This increase in accuracy suggests that current VQA models are data starved, and would benefit from even larger VQA datasets.
As the absolute numbers in the table suggest, there is significant room for improvement in building visual understanding models that can extract detailed information from images and leverage this information to answer free-form natural language questions about images accurately. As expected from the construction of this balanced dataset, the question-only approach performs significantly worse on the balanced dataset compared to the unbalanced dataset, again confirming the language bias in the original VQA dataset, and its successful alleviation (though not elimination) in our proposed balanced dataset.
Note that in addition to the lack of language bias, visual reasoning is also challenging on the balanced dataset, since there are pairs of images that are very similar to each other in the image representations learned by CNNs, but have different answers to the same question. To be successful, VQA models need to understand the subtle differences in these images.
The paired construction of our dataset allows us to analyze the performance of VQA models in unique ways. Given the prediction of a VQA model, we can count the number of questions where both complementary images ...

References
K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR 2016.
K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. CVPR 2009.
J. Pennington, R. Socher, C. D. Manning. GloVe: Global Vectors for Word Representation. EMNLP 2014.
T.-Y. Lin et al. Microsoft COCO: Common Objects in Context. ECCV 2014.