
Making the V in VQA Matter:
Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal (Virginia Tech)    Tejas Khot (Virginia Tech)    Douglas Summers-Stay (Army Research Laboratory)
Dhruv Batra (Georgia Institute of Technology)    Devi Parikh (Georgia Institute of Technology)
{ygoyal, tjskhot}@vt.edu    douglas.a.summers-stay.civ@mail.mil    {dbatra, parikh}@gatech.edu
Abstract
Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability.

We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset [3] by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at http://visualqa.org/ as part of the 2nd iteration of the Visual Question Answering Dataset and Challenge (VQA v2.0).

We further benchmark a number of state-of-the-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners.

Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model which, in addition to providing an answer to the given (image, question) pair, also provides a counter-example based explanation. Specifically, it identifies an image that is similar to the original image, but that it believes has a different answer to the same question. This can help in building trust for machines among their users.
The first two authors contributed equally.
Figure 1: Examples from our balanced VQA dataset. Each question ("Who is wearing glasses?", "Where is the child sitting?", "Is the umbrella upside down?", "How many children are in the bed?") is paired with two similar images that yield different answers (woman/man, arms/fridge, no/yes, 1/2).
1. Introduction
Language and vision problems such as image captioning [8, 4, 7, 19, 40, 21, 28] and visual question answering (VQA) [3, 26, 27, 10, 31] have gained popularity in recent years as the computer vision research community is progressing beyond "bucketed" recognition and towards solving multi-modal problems.
The complex compositional structure of language makes problems at the intersection of vision and language challenging. But recent works [6, 47, 49, 16, 18, 1] have pointed out that language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content.
This phenomenon has been observed in image captioning [6] as well as visual question answering [47, 49, 16, 18, 1]. For instance, in the VQA [3] dataset, the most common sport answer "tennis" is the correct answer for 41% of the questions starting with "What sport is", and "2" is the correct answer for 39% of the questions starting with "How many". Moreover, Zhang et al. [47] point out a particular 'visual priming bias' in the VQA dataset: specifically, subjects saw an image while asking questions about it. Thus, people only ask the question "Is there a clock tower in the picture?" on images actually containing clock towers. As one particularly perverse example, for questions in the VQA dataset starting with the n-gram "Do you see a . . . ", blindly answering "yes" without reading the rest of the question or looking at the associated image results in a VQA accuracy of 87%!
These language priors can give a false impression that machines are making progress towards the goal of understanding images correctly when they are only exploiting language priors to achieve high accuracy. This can hinder progress in pushing the state of the art in the computer vision aspects of multi-modal AI [39, 47].
In this work, we propose to counter these language biases and elevate the role of image understanding in VQA. In order to accomplish this goal, we collect a balanced VQA dataset with significantly reduced language biases. Specifically, we create a balanced VQA dataset in the following way: given an (image, question, answer) triplet (I, Q, A) from the VQA dataset, we ask a human subject to identify an image I′ that is similar to I but results in the answer to the question Q becoming A′ (which is different from A). Examples from our balanced dataset are shown in Fig. 1. More random examples can be seen in Fig. 2 and on the project website (http://visualqa.org/).
Our hypothesis is that this balanced dataset will force VQA models to focus on visual information. After all, when a question has two different answers for two different images, the only way to know the right answer is by looking at the image. Language-only models simply have no basis for differentiating between the two cases (Q, I) and (Q, I′), and by construction must get one wrong. We believe that this construction will also prevent language+vision models from achieving high accuracy by exploiting language priors, enabling VQA evaluation protocols to more accurately reflect progress in image understanding.
Our balanced VQA dataset is also particularly difficult because the picked complementary image I′ is close to the original image I in the semantic (fc7) space of VGGNet [37] features. Therefore, VQA models will need to understand the subtle differences between the two images to predict the answers to both images correctly.
Note that simply ensuring that the answer distribution P(A) is uniform across the dataset would not accomplish the goal of alleviating the language biases discussed above. This is because language models exploit the correlation between question n-grams and the answers; e.g., questions starting with "Is there a clock" have the answer "yes" 98% of the time, and questions starting with "Is the man standing" have the answer "no" 69% of the time. What we need is not just higher entropy in P(A) across the dataset, but higher entropy in P(A|Q), so that the image I must play a role in determining A. This motivates our balancing at a per-question level.
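To make the distinction concrete, the following is a minimal sketch (not part of the paper; the function names and the crude prefix-based notion of "question type" are our own assumptions) of how one might compare the overall answer entropy H(A) with a frequency-weighted conditional entropy H(A | question type) on a list of (question, answer) pairs:

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (in bits) of a count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def answer_entropies(qa_pairs, prefix_words=4):
    """Compare H(A) with the frequency-weighted average of H(A | question type).

    qa_pairs: iterable of (question, answer) strings.
    The question "type" is approximated by its first few words, which is only
    a rough stand-in for the question types used in the paper.
    """
    overall = Counter()
    per_type = defaultdict(Counter)
    for q, a in qa_pairs:
        qtype = " ".join(q.lower().split()[:prefix_words])
        overall[a] += 1
        per_type[qtype][a] += 1

    h_a = entropy(overall)
    n = sum(overall.values())
    # Weight each question type's answer entropy by how often that type occurs.
    h_a_given_q = sum(sum(c.values()) / n * entropy(c) for c in per_type.values())
    return h_a, h_a_given_q

if __name__ == "__main__":
    toy = [("Is there a clock?", "yes"), ("Is there a clock?", "no"),
           ("What sport is this?", "tennis"), ("What sport is this?", "baseball")]
    print(answer_entropies(toy))
```

Balancing per question raises the second quantity; Section 3 reports a 56% increase in the question-type-weighted answer entropy after balancing.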
Our complete balanced dataset contains approximately 1.1 million (image, question) pairs, almost double the size of the VQA [3] dataset, with approximately 13 million associated answers on the 200K images from COCO [23]. We believe this balanced VQA dataset is a better dataset to benchmark VQA approaches.
Finally, our data collection protocol enables us to develop a counter-example based explanation modality. We propose a novel model that not only answers questions about images, but also 'explains' its answer to an image-question pair by providing "hard negatives", i.e., examples of images that it believes are similar to the image at hand, but that it believes have different answers to the question. Such an explanation modality will allow users of the VQA model to establish greater trust in the model and identify its oncoming failures.
Our main contributions are as follows: (1) We balance the existing VQA dataset [3] by collecting complementary images such that almost every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. The result is a more balanced VQA dataset, which is also approximately twice the size of the original VQA dataset. (2) We evaluate state-of-the-art VQA models (with publicly available code) on our balanced dataset, and show that models trained on the existing 'unbalanced' VQA dataset perform poorly on our new balanced dataset. This finding confirms our hypothesis that these models have been exploiting language priors in the existing VQA dataset to achieve higher accuracy. (3) Finally, our data collection protocol for identifying complementary scenes enables us to develop a novel interpretable model which, in addition to answering questions about images, also provides a counter-example based explanation: it retrieves images that it believes are similar to the original image but have different answers to the question. Such explanations can help in building trust for machines among their users.
2. Related Work
Visual Question Answering. A number of recent works have proposed visual question answering datasets [3, 22, 26, 31, 10, 46, 38, 36] and models [9, 25, 2, 43, 24, 27, 47, 45, 44, 41, 35, 20, 29, 15, 42, 33, 17]. Our work builds on top of the VQA dataset from Antol et al. [3], which is one of the most widely used VQA datasets. We reduce the language biases present in this popular dataset, resulting in a dataset that is more balanced and about twice the size of the VQA dataset. We benchmark one 'baseline' VQA model [24], one attention-based VQA model [25], and the winning model from the VQA Challenge 2016 [9] on our balanced VQA dataset, and compare them to a language-only model.
Data Balancing and Augmentation. At a high level, our work may be viewed as constructing a more rigorous evaluation protocol by collecting 'hard negatives'. In that spirit, it is similar to the work of Hodosh et al. [14], who created a binary forced-choice image captioning task, where a machine must choose to caption an image with one of two similar captions. To compare, Hodosh et al. [14] implemented hand-designed rules to create two similar captions for images, while we create a novel annotation interface to collect two similar images for questions in VQA.

Figure 2: Random examples from our proposed balanced VQA dataset. Each question has two similar images with different answers to the question.
Perhaps the most relevant to our work is that of Zhang et al. [47], who study this goal of balancing VQA in a fairly restricted setting: binary (yes/no) questions on abstract scenes made from clipart (part of the VQA abstract scenes dataset [3]). Using clipart allows Zhang et al. to ask human annotators to "change the clipart scene such that the answer to the question changes". Unfortunately, such fine-grained editing of image content is simply not possible in real images. The novelty of our work over Zhang et al. is the proposed complementary-image data collection interface, its application to real images, the extension to all questions (not just binary ones), the benchmarking of state-of-the-art VQA models on the balanced dataset, and finally the novel VQA model with counter-example based explanations.
Models with explanation. A number of recent works have proposed mechanisms for generating 'explanations' [13, 34, 48, 11, 32] for the predictions made by deep learning models, which are typically 'black-box' and non-interpretable. [13] generates a natural language explanation (sentence) for image categories. [34, 48, 11, 32] provide 'visual explanations', or spatial maps overlaid on images that highlight the regions the model focused on while making its predictions. In this work, we introduce a third explanation modality: counter-examples, instances that the model believes are close to, but not belonging to, the category predicted by the model.
3. Dataset
We build on top of the VQA dataset introduced by Antol et al. [3]. The VQA real-images dataset contains just over 204K images from COCO [23], 614K free-form natural language questions (3 questions per image), and over 6 million free-form (but concise) answers (10 answers per question). While this dataset has spurred significant progress in the VQA domain, as discussed earlier, it has strong language biases.
Our key idea to counter this language bias is the following: for every (image, question, answer) triplet (I, Q, A) in the VQA dataset, our goal is to identify an image I′ that is similar to I, but results in the answer to the question Q becoming A′ (which is different from A). We built an annotation interface (shown in Fig. 3) to collect such complementary images on Amazon Mechanical Turk (AMT). AMT workers are shown the 24 nearest-neighbor images of I, the question Q, and the answer A, and are asked to pick an image I′ from the list of 24 images for which Q "makes sense" and the answer to Q is not A.

To capture "question makes sense", we explained to the workers (and conducted qualification tests to make sure that they understood) that any premise assumed in the question must hold true for the image they select. For instance, the question "What is the woman doing?" assumes that a woman is present and can be seen in the image. It does not make sense to ask this question on an image without a woman visible in it.
We compute the 24 nearest neighbors by first representing each image with the activations from the penultimate ('fc7') layer of a deep Convolutional Neural Network (CNN), in particular VGGNet [37], and then using ℓ2 distances to compute neighbors.
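As an illustration, here is a minimal sketch of this retrieval step, assuming the fc7 activations have already been extracted into a NumPy array (the array and function names are ours, not the authors'):

```python
import numpy as np

def nearest_neighbors(fc7, query_idx, k=24):
    """Return the indices of the k images closest to image `query_idx`
    under L2 distance in fc7 feature space (excluding the query itself).

    fc7: (num_images, 4096) array of penultimate-layer CNN activations.
    """
    diffs = fc7 - fc7[query_idx]            # broadcast against the query image
    dists = np.linalg.norm(diffs, axis=1)   # L2 distance to every image
    order = np.argsort(dists)               # closest first; order[0] is the query
    return order[1:k + 1]

# Example with random stand-in features:
if __name__ == "__main__":
    feats = np.random.rand(1000, 4096).astype(np.float32)
    print(nearest_neighbors(feats, query_idx=0, k=24))
```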
After the complementary images are collected, we conduct a second round of data annotation to collect answers on these new images. Specifically, we show the picked image I′ with the question Q to 10 new AMT workers, and collect 10 ground-truth answers (similar to [3]). The most common answer among the 10 is the new answer A′.

This two-stage data collection process finally results in pairs of complementary images I and I′ that are semantically similar, but have different answers A and A′, respectively, to the same question Q. Since I and I′ are semantically similar, a VQA model will have to understand the subtle differences between I and I′ to provide the right answer for both images. Example complementary images are shown in Fig. 1, Fig. 2, and on the project website.
Note that sometimes it may not be possible to pick one of the 24 neighbors as a complementary image. This is because either (1) the question does not make sense for any of the 24 images (e.g., the question is "What is the woman doing?" and none of the neighboring images contain a woman), or (2) the question is applicable to some neighboring images, but the answer to the question is still A (same as for the original image I). In such cases, our data collection interface allowed AMT workers to select "not possible".

We analyzed the data annotated with the "not possible" selection by AMT workers and found that this typically happens when (1) the object being talked about in the question is too small in the original image, and thus the nearest-neighbor images, while globally similar, do not necessarily contain the object, resulting in the question not making sense, or (2) the concept in the question is rare (e.g., when workers are asked to pick an image such that the answer to the question "What color is the banana?" is NOT "yellow").
In total, such "not possible" selections make up 22% of all the questions in the VQA dataset. We believe that a more sophisticated interface that allowed workers to scroll through many more than 24 neighboring images could possibly reduce this fraction. But (1) it will likely still not be 0 (there may be no image in COCO where the answer to "Is the woman flying?" is NOT "no"), and (2) the task would be significantly more cumbersome for workers, making the data collection significantly more expensive.
We collected complementary images and the corresponding new answers for all of the train, val, and test splits of the VQA dataset. AMT workers picked "not possible" for approximately 135K total questions. In total, we collected approximately 195K complementary images for train, 93K complementary images for val, and 191K complementary images for test. In addition, we augment the test set with 18K additional (question, image) pairs to provide additional means to detect anomalous trends on the test data. Hence, our complete balanced dataset contains more than 443K train, 214K val, and 453K test (question, image) pairs. Our complete balanced dataset is publicly available for download.

Figure 3: A snapshot of our Amazon Mechanical Turk (AMT) interface to collect complementary images.
We use the publicly released VQA evaluation script in our experiments. The evaluation metric uses the 10 ground-truth answers for each question to compute VQA accuracies. As described above, we collected 10 answers for every complementary image and its corresponding question to be consistent with the VQA dataset [3]. Note that, while unlikely, it is possible that the majority vote of the 10 new answers may not match the intended answer of the person picking the image, either due to inter-human disagreement, or because the worker selecting the complementary image simply made a mistake. We find this to be the case, i.e., A′ turns out to be the same as A, for about 9% of our questions.
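For reference, the core of the VQA accuracy metric credits a predicted answer in proportion to how many of the 10 annotators gave it, capped at full credit once 3 agree. Below is a minimal sketch of that core formula only (our simplification; the released script additionally normalizes answer strings and averages over subsets of the 10 human answers):

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: fully correct if at least 3 of the 10 human
    answers match the prediction, partially correct otherwise.

    predicted: the model's answer string.
    human_answers: list of 10 human-provided answer strings.
    """
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators agree with the prediction -> accuracy ~0.67
print(vqa_accuracy("yes", ["yes", "yes"] + ["no"] * 8))
```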
Comparing the distribution of answers per question type in our new balanced VQA dataset with the original (unbalanced) VQA dataset [3], we notice several interesting trends. First, binary questions (e.g., "is the", "is this", "is there", "are", "does") have a significantly more balanced distribution over "yes" and "no" answers in our balanced dataset compared to the unbalanced VQA dataset. "Baseball" is now slightly more popular than "tennis" under "what sport", and, more importantly, "baseball" and "tennis" dominate the answer distribution less overall. Several other sports like "frisbee", "skiing", "soccer", "skateboarding", "snowboard" and "surfing" are more visible in the answer distribution in the balanced dataset, suggesting that it has heavier tails. Similar trends are found for colors, animals, numbers, etc. A visualization can be found in the extended version of this paper (https://arxiv.org/abs/1612.00837). Quantitatively, we find that the entropy of answer distributions, averaged across question types (weighted by the frequency of each question type), increases by 56% after balancing, confirming the heavier tails in the answer distribution.

As the statistics show, while our balanced dataset is not perfectly balanced, it is significantly more balanced than the original VQA dataset. The resultant impact of this balancing on the performance of state-of-the-art VQA models is discussed in the next section.
4. Benchmarking Existing VQA Models
Our first approach to training a VQA model that emphasizes visual information over language priors alone is to re-train existing state-of-the-art VQA models (those with publicly available code [24, 25, 9]) on our new balanced VQA dataset. Our hypothesis is that simply training a model to answer questions correctly on our balanced dataset will already encourage the model to focus more on the visual signal, since the language signal alone has been impoverished.
We experiment with the following models:
Deeper LSTM Question + norm Image (d-LSTM+n-I) [24]: This was the VQA model introduced in [3] together with the dataset. It uses a CNN embedding of the image and a Long Short-Term Memory (LSTM) embedding of the question, combines these two embeddings via a point-wise multiplication, and feeds the result to a multi-layer perceptron classifier that predicts a probability distribution over the 1000 most frequent answers in the training dataset.
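The following is a minimal PyTorch sketch of this style of architecture, under our own assumptions about layer sizes and class names (it is not the authors' released implementation): an L2-normalized image feature is multiplied element-wise with an LSTM question encoding and classified over 1000 answers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeeperLSTMNormImg(nn.Module):
    """Sketch of a d-LSTM+n-I style VQA model (sizes are illustrative)."""

    def __init__(self, vocab_size, num_answers=1000,
                 img_dim=4096, emb_dim=300, hidden_dim=1024):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
        self.img_fc = nn.Linear(img_dim, hidden_dim)   # project the fc7 feature
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, img_feat, question_tokens):
        # Normalize the image feature (the "norm Image" part), then project it.
        img = torch.tanh(self.img_fc(F.normalize(img_feat, dim=1)))
        # Encode the question with an LSTM; take the final hidden state.
        _, (h, _) = self.lstm(self.word_emb(question_tokens))
        q = h[-1]
        # Fuse by point-wise multiplication and classify over answers.
        return self.classifier(img * q)

# Example forward pass with dummy inputs:
model = DeeperLSTMNormImg(vocab_size=10000)
logits = model(torch.randn(2, 4096), torch.randint(1, 10000, (2, 14)))
print(logits.shape)  # torch.Size([2, 1000])
```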
Hierarchical Co-attention (HieCoAtt) [25]: This is a recent attention-based VQA model that 'co-attends' to both the image and the question to predict an answer. Specifically, it models the question (and, via the co-attention mechanism, the image) in a hierarchical fashion: at the word level, the phrase level, and the entire-question level. These levels are combined recursively to produce a distribution over the 1000 most frequent answers.
Multimodal Compact Bilinear Pooling (MCB) [9]: This is the winning entry on the real-images track of the VQA Challenge 2016. It uses a multimodal compact bilinear pooling mechanism to attend over image features and to combine the attended image features with language features. The combined features are then passed through a fully-connected layer to predict a probability distribution over the 3000 most frequent answers. It should be noted that MCB uses image features from a more powerful CNN architecture, ResNet [12], while the previous two models use image features from VGGNet [37].
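Compact bilinear pooling approximates the (very large) outer product of two feature vectors by combining their Count Sketch projections with an FFT-based circular convolution. The NumPy sketch below illustrates that fusion step in general terms; it is our illustration of the technique, not the released MCB code, and it omits the attention mechanism entirely.

```python
import numpy as np

def count_sketch(v, h, s, d):
    """Project v into d dimensions: y[h[i]] += s[i] * v[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * v)
    return y

def compact_bilinear(x, q, d=16000, seed=0):
    """Approximate the outer product of x (image feature) and q (question
    feature) by convolving their Count Sketches in the Fourier domain."""
    rng = np.random.RandomState(seed)
    hx = rng.randint(d, size=x.size); sx = rng.choice([-1.0, 1.0], size=x.size)
    hq = rng.randint(d, size=q.size); sq = rng.choice([-1.0, 1.0], size=q.size)
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fq = np.fft.rfft(count_sketch(q, hq, sq, d))
    return np.fft.irfft(fx * fq, n=d)   # circular convolution of the sketches

# Example: fuse a 2048-d image feature with a 1024-d question feature.
fused = compact_bilinear(np.random.randn(2048), np.random.randn(1024))
print(fused.shape)  # (16000,)
```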
Baselines: To put the accuracies of these models in perspective, we compare to the following baselines. Prior: predicting the most common answer in the training set for all test questions; the most common answer is "yes" in both the unbalanced and balanced sets. Language-only: this baseline has a similar architecture to Deeper LSTM Q + norm I [24], except that it only accepts the question as input and does not utilize any visual information. Comparing VQA models to language-only ablations quantifies to what extent VQA models have succeeded in leveraging the image to answer the questions.
The results are shown in Table 1. For a fair comparison of accuracies with the original (unbalanced) dataset, we create a balanced train set that is of similar size to the original dataset (referred to as B_half in the table). For benchmarking, we also report results using the full balanced train set.

Approach            UU      UB      B_half B    BB
Prior               27.38   24.04   24.04       24.04
Language-only       48.21   41.40   41.47       43.01
d-LSTM+n-I [24]     54.40   47.56   49.23       51.62
HieCoAtt [25]       57.09   50.31   51.88       54.57
MCB [9]             60.36   54.22   56.08       59.14

Table 1: Performance of VQA models when trained/tested on unbalanced/balanced VQA datasets. UB stands for training on the Unbalanced train set and testing on the Balanced val set; UU, B_half B, and BB are defined analogously.
We see that current state-of-the-art VQA models trained on the original (unbalanced) VQA dataset perform significantly worse when evaluated on our balanced dataset than when evaluated on the original unbalanced VQA dataset (i.e., comparing UU to UB in the table). This finding confirms our hypothesis that existing models have learned severe language biases present in the dataset, resulting in a reduced ability to answer questions correctly when the same question has different answers for different images. When these models are trained on our balanced dataset, their performance improves (compare UB to B_half B in the table). Further, when the models are trained on the complete balanced dataset (twice the size of the original dataset), the accuracy improves by 2-3% (compare B_half B to BB). This increase in accuracy suggests that current VQA models are data starved, and would benefit from even larger VQA datasets.
As the absolute numbers in the table suggest, there is significant room for improvement in building visual understanding models that can extract detailed information from images and leverage this information to answer free-form natural language questions about images accurately. As expected from the construction of this balanced dataset, the question-only approach performs significantly worse on the balanced dataset compared to the unbalanced dataset, again confirming the language bias in the original VQA dataset, and its successful alleviation (though not elimination) in our proposed balanced dataset.

Note that, in addition to the reduced language bias, visual reasoning is also challenging on the balanced dataset, since there are pairs of images very similar to each other in the image representations learned by CNNs, but with different answers to the same question. To be successful, VQA models need to understand the subtle differences in these images.
The paired construction of our dataset allows us to analyze the performance of VQA models in unique ways. Given the prediction of a VQA model, we can count the number of questions where both complementary images

References (partial)

[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
[37] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. In EMNLP, 2014.