Proceedings ArticleDOI

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

01 Jul 2017, pp. 1988-1997

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Justin Johnson^1,2*   Li Fei-Fei^1   Bharath Hariharan^2   C. Lawrence Zitnick^2   Laurens van der Maaten^2   Ross Girshick^2
^1 Stanford University   ^2 Facebook AI Research
* Work done during an internship at FAIR.
Abstract
When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but have strong biases that models can exploit to correctly answer questions without reasoning. They also conflate multiple sources of error, making it hard to pinpoint model weaknesses. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.
1. Introduction
A long-standing goal of artificial intelligence research is to develop systems that can reason and answer questions about visual information. Recently, several datasets have been introduced to study this problem [4, 10, 21, 26, 32, 46, 49]. Each of these Visual Question Answering (VQA) datasets contains challenging natural language questions about images. Correctly answering these questions requires perceptual abilities such as recognizing objects, attributes, and spatial relationships as well as higher-level skills such as counting, performing logical inference, making comparisons, or leveraging commonsense world knowledge [31]. Numerous methods have attacked these problems [2, 3, 9, 24, 44], but many show only marginal improvements over strong baselines [4, 16, 48]. Unfortunately, our ability to understand the limitations of these methods is impeded by the inherent complexity of the VQA task. Are methods hampered by failures in recognition, poor reasoning, lack of commonsense knowledge, or something else?
Figure 1. A sample image and questions from CLEVR. Questions test aspects of visual reasoning such as attribute identification, counting, comparison, multiple attention, and logical operations.
Q: Are there an equal number of large things and metal spheres?
Q: What size is the cylinder that is left of the brown metal thing that is left of the big sphere?
Q: There is a sphere with the same size as the metal cube; is it made of the same material as the small red sphere?
Q: How many objects are either small cylinders or metal things?

The difficulty of understanding a system's competences is exemplified by Clever Hans, a 1900s era horse who appeared to be able to answer arithmetic questions. Careful observation revealed that Hans was correctly "answering" questions by reacting to cues read off his human observers [30]. Statistical learning systems, like those used for VQA, may develop similar "cheating" approaches to superficially "solve" tasks without learning the underlying reasoning processes [35, 36]. For instance, a statistical learner may correctly answer the question "What covers the ground?" not because it understands the scene but because biased datasets often ask questions about the ground when it is snow-covered [1, 47]. How can we determine whether a system is capable of sophisticated reasoning and not just exploiting biases of the world, similar to Clever Hans?
In this paper we propose a diagnostic dataset for studying the ability of VQA systems to perform visual reasoning. We refer to this dataset as the Compositional Language and Elementary Visual Reasoning diagnostics dataset (CLEVR; pronounced as clever in homage to Hans). CLEVR contains 100k rendered images and about one million automatically generated questions, of which 853k are unique. It has challenging images and questions that test visual reasoning abilities such as counting, comparing, logical reasoning, and storing information in memory, as illustrated in Figure 1.
We designed CLEVR with the explicit goal of enabling detailed analysis of visual reasoning. Our images depict simple 3D shapes; this simplifies recognition and allows us to focus on reasoning skills. We ensure that the information in each image is complete and exclusive so that external information sources, such as commonsense knowledge, cannot increase the chance of correctly answering questions. We control question-conditional bias via rejection sampling within families of related questions, and avoid degenerate questions that are seemingly complex but contain simple shortcuts to the correct answer. Finally, we use structured ground-truth representations for both images and questions: images are annotated with ground-truth object positions and attributes, and questions are represented as functional programs that can be executed to answer the question (see Section 3). These representations facilitate in-depth analyses not possible with traditional VQA datasets.
These design choices also mean that while images in CLEVR may be visually simple, its questions are complex and require a range of reasoning skills. For instance, factorized representations may be required to generalize to unseen combinations of objects and attributes. Tasks such as counting or comparing may require short-term memory [15] or attending to specific objects [24, 44]. Answering questions that combine multiple subtasks in diverse ways may require compositional systems [2, 3].
We use CLEVR to analyze a suite of VQA models and discover weaknesses that are not widely known. For example, we find that current state-of-the-art VQA models struggle on tasks requiring short-term memory, such as comparing the attributes of objects, or compositional reasoning, such as recognizing novel attribute combinations. These observations point to novel avenues for further research.

Finally, we stress that accuracy on CLEVR is not an end goal in itself: a hand-crafted system with explicit knowledge of the CLEVR universe might work well, but will not generalize to real-world settings. Therefore CLEVR should be used in conjunction with other VQA datasets in order to study the reasoning abilities of general VQA systems. The CLEVR dataset is publicly available at http://cs.stanford.edu/people/jcjohns/clevr/.
2. Related Work
In recent years, a range of benchmarks for visual understanding have been proposed, including datasets for image captioning [7, 8, 23, 45], referring to objects [19], relational graph prediction [21], and visual Turing tests [12, 27]. CLEVR, our diagnostic dataset, is most closely related to benchmarks for visual question answering [4, 10, 21, 26, 32, 37, 46, 49], as it involves answering natural-language questions about images. The two main differences between CLEVR and other VQA datasets are that: (1) CLEVR controls biases found in prior VQA datasets that can be used by learning systems to answer questions correctly without visual reasoning and (2) CLEVR's synthetic nature and detailed annotations facilitate in-depth analyses of reasoning abilities that are impossible with existing datasets.
Prior work has attempted to mitigate biases in VQA datasets in simple cases such as yes/no questions [12, 47], but it is difficult to apply such bias-reduction approaches to more complex questions without a high-quality semantic representation of both questions and answers. In CLEVR, this semantic representation is provided by the functional program underlying each image-question pair, and biases are largely eliminated via sampling. Winograd schemas [22] are another approach for controlling bias in question answering: these questions are carefully designed to be ambiguous based on syntax alone and require commonsense knowledge. Unfortunately this approach does not scale gracefully: the first phase of the 2016 Winograd Schema Challenge consists of just 60 hand-designed questions. CLEVR is also related to the bAbI question answering tasks [38] in that it aims to diagnose a set of clearly defined competences of a system, but CLEVR focuses on visual reasoning whereas bAbI is purely textual.
We are also not the first to consider synthetic data for studying (visual) reasoning. SHRDLU performed simple, interactive visual reasoning with the goal of moving specific objects in the visual scene [40]; this study was one of the first to demonstrate the brittleness of manually programmed semantic understanding. The pioneering DAQUAR dataset [28] contains both synthetic and human-written questions, but it includes only 420 synthetic questions generated from eight text templates. VQA [4] contains 150,000 natural-language questions about abstract scenes [50], but these questions do not control for question-conditional bias and are not equipped with functional program representations. CLEVR is similar in spirit to the SHAPES dataset [3], but is more complex and varied in both visual content and question variety and complexity: SHAPES contains 15,616 total questions with just 244 unique questions, while CLEVR contains nearly a million questions of which 853,554 are unique.
3. The CLEVR Diagnostic Dataset
CLEVR provides a dataset that requires complex reasoning to solve and that can be used to conduct rich diagnostics to better understand the visual reasoning capabilities of VQA systems. This requires complete control over the dataset, which we achieve by using synthetic images and automatically generated questions. The images have associated ground-truth object locations and attributes, and the questions have an associated machine-readable form. These ground-truth structures allow us to analyze models based on, for example: question type, question topology (chain vs. tree), question length, and various forms of relationships between objects. Figure 2 gives a brief overview of the main components of CLEVR, which we describe in detail below.

Figure 2. A field guide to the CLEVR universe. Left: Shapes, attributes, and spatial relationships (sizes, colors, shapes, and materials; left vs. right; in front vs. behind). Center: Examples of questions and their associated functional programs. A sample chain-structured question, "What color is the cube to the right of the yellow sphere?", corresponds to the program filter_color[yellow] -> filter_shape[sphere] -> unique -> relate[right] -> filter_shape[cube] -> unique -> query_color. A sample tree-structured question, "How many cylinders are in front of the small thing and on the left side of the green object?", combines two filter/unique/relate branches with And, then applies filter_shape[cylinder] and count. Right: Catalog of basic functions used to build questions (Filter <attr>, Unique, Relate, Count, Exist, Query <attr>, Same <attr>, Equal, Less/More, And, Or) with their input and output types. See Section 3 for details.
Objects and relationships. The CLEVR universe contains three object shapes (cube, sphere, and cylinder) that come in two absolute sizes (small and large), two materials (shiny "metal" and matte "rubber"), and eight colors. Objects are spatially related via four relationships: "left", "right", "behind", and "in front". The semantics of these prepositions are complex and depend not only on relative object positions but also on camera viewpoint and context. We found that generating questions that invoke spatial relationships with semantic accord was difficult. Instead we rely on a simple and unambiguous definition: projecting the camera viewpoint vector onto the ground plane defines the "behind" vector, and one object is behind another if its ground-plane position is further along the "behind" vector. The other relationships are similarly defined. Figure 2 (left) illustrates the objects, attributes, and spatial relationships in CLEVR. The CLEVR universe also includes one non-spatial relationship type that we refer to as the same-attribute relation. Two objects are in this relationship if they have equal attribute values for a specified attribute.
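To make this rule concrete, the sketch below (a simplified reimplementation under assumed inputs, not the released CLEVR generation code) computes pairwise relations by projecting the camera direction onto the ground plane and comparing ground-plane displacements along the resulting axes; the sign convention for "left"/"right" is arbitrary here.

```python
import numpy as np

# Simplified sketch (not the released CLEVR generation code) of the
# spatial-relationship rule above. `positions` is an assumed (N, 3) array of
# object centers; `camera_dir` is an assumed vector pointing from the camera
# toward the scene.
def spatial_relations(positions, camera_dir):
    positions = np.asarray(positions, dtype=float)
    behind = np.array([camera_dir[0], camera_dir[1], 0.0])  # project onto ground plane
    behind /= np.linalg.norm(behind)
    right = np.cross([0.0, 0.0, 1.0], behind)               # orthogonal ground-plane axis

    rel = {"behind": set(), "front": set(), "left": set(), "right": set()}
    for i, pi in enumerate(positions):
        for j, pj in enumerate(positions):
            if i == j:
                continue
            d = (pi - pj)[:2]                                # ground-plane displacement
            rel["behind" if d @ behind[:2] > 0 else "front"].add((i, j))
            rel["right" if d @ right[:2] > 0 else "left"].add((i, j))
    return rel
```

An object is then "behind" another exactly when its ground-plane position lies further along the projected camera direction, matching the definition above.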
Scene representation. Scenes are represented as collections of objects annotated with shape, size, color, material, and position on the ground plane. A scene can also be represented by a scene graph [17, 21], where nodes are objects annotated with attributes and edges connect spatially related objects. A scene graph contains all ground-truth information for an image and could be used to replace the vision component of a VQA system with perfect sight.
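For illustration, a single annotated scene might be stored as a record like the following; the field names and layout are hypothetical and are not claimed to match the released annotation files.

```python
# Hypothetical, simplified scene record; field names are illustrative and are
# not guaranteed to match the released CLEVR annotation format.
scene = {
    "objects": [
        {"shape": "sphere", "size": "small", "color": "yellow", "material": "rubber",
         "position": [-0.5, 0.9, 0.18]},   # ground-plane x, y and height
        {"shape": "cube",   "size": "large", "color": "red",    "material": "metal",
         "position": [0.8, -1.2, 0.35]},
    ],
    # Scene-graph edges: relationships[rel][i] lists the objects that are
    # <rel> of object i, e.g. object 1 (the cube) is to the right of object 0.
    "relationships": {
        "right":  [[1], []],
        "left":   [[], [0]],
        "behind": [[1], []],
        "front":  [[], [0]],
    },
}
```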
Image generation. CLEVR images are generated by randomly sampling a scene graph and rendering it using Blender [6]. Every scene contains between three and ten objects with random shapes, sizes, materials, colors, and positions. When placing objects we ensure that no objects intersect, that all objects are at least partially visible, and that there are small horizontal and vertical margins between the image-plane centers of each pair of objects; this helps reduce ambiguity in spatial relationships. In each image the positions of the lights and camera are randomly jittered.
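A minimal sketch of the placement constraints described above, under assumed simplifications (ground-plane positions rather than image-plane centers, a single radius per object); the released Blender pipeline enforces analogous checks, so this is only illustrative.

```python
import math
import random

# Illustrative rejection-based placement: keep sampling a ground-plane
# position until it neither intersects an already-placed object nor sits too
# close to one along either axis (a stand-in for the image-plane margin check).
def place_objects(num_objects, radius=0.4, margin=0.2, extent=3.0, max_tries=100):
    placed = []
    for _ in range(num_objects):
        for _ in range(max_tries):
            x, y = random.uniform(-extent, extent), random.uniform(-extent, extent)
            ok = all(
                math.hypot(x - px, y - py) > 2 * radius            # no intersection
                and abs(x - px) > margin and abs(y - py) > margin  # axis margins
                for px, py in placed
            )
            if ok:
                placed.append((x, y))
                break
        else:
            raise RuntimeError("could not place all objects; resample the scene")
    return placed
```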
Question representation. Each question in CLEVR is associated with a functional program that can be executed on an image's scene graph, yielding the answer to the question. Functional programs are built from simple basic functions that correspond to elementary operations of visual reasoning such as querying object attributes, counting sets of objects, or comparing values. As shown in Figure 2, complex questions can be represented by compositions of these simple building blocks. Full details about each basic function can be found in the supplementary material.
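To make the idea concrete, here is a small sketch of how such a program could be executed against ground-truth scene annotations; the function names mirror the catalog in Figure 2, but the implementation and data layout are assumptions for illustration rather than the released code.

```python
# Sketch of executing a CLEVR-style functional program on ground-truth
# annotations. relationships[rel][i] lists objects that are <rel> of object i.
SCENE = {
    "objects": [
        {"shape": "sphere", "color": "yellow", "size": "small", "material": "rubber"},
        {"shape": "cube",   "color": "red",    "size": "large", "material": "metal"},
    ],
    "relationships": {"right": [[1], []], "left": [[], [0]]},
}

def scene():                      return list(range(len(SCENE["objects"])))
def filter_attr(attr, value, xs): return [i for i in xs if SCENE["objects"][i][attr] == value]
def unique(xs):
    assert len(xs) == 1, "ill-posed question"   # exactly one referent required
    return xs[0]
def relate(rel, i):               return SCENE["relationships"][rel][i]
def query(attr, i):               return SCENE["objects"][i][attr]
def count(xs):                    return len(xs)

# "What color is the cube to the right of the yellow sphere?"
answer = query("color", unique(filter_attr("shape", "cube",
                 relate("right", unique(filter_attr("shape", "sphere",
                         filter_attr("color", "yellow", scene())))))))
print(answer)  # -> "red"
```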
As we will see in Section 4, representing questions as functional programs enables rich analysis that would be impossible with natural-language questions. A question's functional program tells us exactly which reasoning abilities are required to solve it, allowing us to compare performance on questions requiring different types of reasoning.

We categorize questions by question type, defined by the outermost function in the question's program; for example the questions in Figure 2 have types query-color and exist. Figure 3 shows the number of questions of each type.
Question families. We must overcome several key challenges to generate a VQA dataset using functional programs. Functional building blocks can be used to construct an infinite number of possible functional programs, and we must decide which program structures to consider. We also need a method for converting functional programs to natural language in a way that minimizes question-conditional bias. We solve these problems using question families.

A question family contains a template for constructing functional programs and several text templates providing multiple ways of expressing these programs in natural language. For example, the question "How many red things are there?" can be formed by instantiating the text template "How many <C> <M> things are there?", binding the parameters <C> and <M> (with types "color" and "material") to the values red and nil. The functional program count(filter_color(red, scene())) for this question can be formed by instantiating the associated program template count(filter_color(<C>, filter_material(<M>, scene()))) with the same values, using the convention that functions taking a nil input are removed after instantiation.

CLEVR contains a total of 90 question families, each with a single program template and an average of four text templates. Text templates were generated by manually writing one or two templates per family and then crowdsourcing question rewrites. To further increase language diversity we use a set of synonyms for each shape, color, and material. With up to 19 parameters per template, a small number of families can generate a huge number of unique questions; Figure 3 shows that of the nearly one million questions in CLEVR, more than 853k are unique. CLEVR can easily be extended by adding new question families.
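A small sketch of this instantiation step, using the example above; the template encoding and helper names are assumptions made for illustration, not the dataset-generation code itself.

```python
# Sketch of instantiating a question family. A family pairs a program template
# with text templates; binding a parameter to nil (None) drops the
# corresponding function. The encoding below is an illustrative assumption.
family = {
    "text": "How many <C> <M> things are there?",
    "program": [("filter_material", "<M>"), ("filter_color", "<C>"), ("count", None)],
}

def instantiate(family, bindings):
    # Fill the text template, dropping nil parameters and tidying whitespace.
    text = family["text"]
    for param, value in bindings.items():
        text = text.replace(param, "" if value is None else value)
    text = " ".join(text.split())

    # Build the program, skipping functions whose parameter is bound to nil.
    program = ["scene"]
    for fn, param in family["program"]:
        value = bindings.get(param) if param else None
        if param is not None and value is None:
            continue
        program.append(f"{fn}({value})" if param else fn)
    return text, program

print(instantiate(family, {"<C>": "red", "<M>": None}))
# -> ('How many red things are there?', ['scene', 'filter_color(red)', 'count'])
```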
Question generation. Generating a question for an image is conceptually simple: we choose a question family, select values for each of its template parameters, execute the resulting program on the image's scene graph to find the answer, and use one of the text templates from the question family to generate the final natural-language question.
However, many combinations of values give rise to questions which are either ill-posed or degenerate. The question "What color is the cube to the right of the sphere?" would be ill-posed if there were many cubes right of the sphere, or degenerate if there were only one cube in the scene since the reference to the sphere would then be unnecessary. Avoiding such ill-posed and degenerate questions is critical to ensure the correctness and complexity of our questions.
Figure 3. Top: Statistics for CLEVR; the majority of questions are unique and few questions from the val and test sets appear in the training set. Bottom left: Comparison of question lengths for different VQA datasets; CLEVR questions are generally much longer. Bottom right: Distribution of question types in CLEVR.

Split    Images     Questions    Unique questions    Overlap with train
Total    100,000    999,968      853,554             -
Train     70,000    699,989      608,607             -
Val       15,000    149,991      140,448             17,338
Test      15,000    149,988      140,352             17,335

A naïve solution is to randomly sample combinations of values and reject those which lead to ill-posed or degenerate questions. However, the number of possible configurations for a question family is exponential in its number of parameters, and most of them are undesirable. This makes brute-force search intractable for our complex question families.
Instead, we employ a depth-first search to find valid values for instantiating question families. At each step of the search, we use ground-truth scene information to prune large swaths of the search space which are guaranteed to produce undesirable questions; for example we need not entertain questions of the form "What color is the <S> to the <R> of the sphere?" for scenes that do not contain spheres.

Finally, we use rejection sampling to produce an approximately uniform answer distribution for each question family; this helps minimize question-conditional bias since all questions from the same family share linguistic structure.
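The balancing step can be sketched as follows: keep a running answer histogram per question family and probabilistically reject candidate questions whose answer is already over-represented. This is an illustrative scheme under assumed helpers, not the exact procedure used to build the dataset.

```python
import random
from collections import Counter

# Illustrative rejection-sampling loop for flattening the answer distribution
# within one question family. `candidates` is a hypothetical iterable of
# (question_text, answer) pairs produced by the instantiation step above; the
# exact acceptance rule used for the dataset may differ.
def balance_answers(candidates, target_count):
    counts, kept = Counter(), []
    for question, answer in candidates:
        if len(kept) >= target_count:
            break
        # Accept with probability that shrinks as this answer becomes
        # over-represented, pushing the kept set toward a uniform distribution.
        if random.random() < 1.0 / (1.0 + counts[answer]):
            counts[answer] += 1
            kept.append((question, answer))
    return kept
```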
4. VQA Systems on CLEVR
4.1. Models
VQA models typically represent images with features from pretrained CNNs and use word embeddings or recurrent networks to represent questions and/or answers. Models may train recurrent networks for answer generation [10, 28, 41], multiclass classifiers over common answers [4, 24, 25, 32, 48, 49], or binary classifiers on image-question-answer triples [9, 16, 33]. Many methods incorporate attention over the image [9, 33, 44, 49, 43] or question [24]. Some methods incorporate memory [42] or dynamic network architectures [2, 3].
Experimenting with all methods is logistically challenging, so we reproduced a representative subset of methods: baselines that do not look at the image (Q-type mode, LSTM), a simple baseline (CNN+BoW) that performs near state-of-the-art [16, 48], and more sophisticated methods using recurrent networks (CNN+LSTM), sophisticated feature pooling (CNN+LSTM+MCB), and spatial attention (CNN+LSTM+SA).¹ These are described in detail below.

Figure 4. Accuracy per question type of the six VQA methods on the CLEVR dataset (higher is better). Figure best viewed in color.
Q-type mode: Similar to the "per Q-type prior" method in [4], this baseline predicts the most frequent training-set answer for each question's type.
LSTM: Similar to "LSTM Q" in [4], the question is processed with learned word embeddings followed by a word-level LSTM [15]. The final LSTM hidden state is passed to a multi-layer perceptron (MLP) that predicts a distribution over answers. This method uses no image information so it can only model question-conditional bias.
CNN+BoW: Following [48], the question is encoded by averaging word vectors for each word in the question and the image is encoded using features from a convolutional network (CNN). The question and image features are concatenated and passed to an MLP which predicts a distribution over answers. We use word vectors trained on the GoogleNews corpus [29]; these are not fine-tuned during training.
CNN+LSTM: Images and questions are encoded using CNN features and final LSTM hidden states, respectively. These features are concatenated and passed to an MLP that predicts an answer distribution.
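For concreteness, a minimal sketch of this baseline in PyTorch (an assumption; the framework and exact layer sizes are not specified by the description above, and the placeholders below merely fall within the ranges given in the implementation details).

```python
import torch
import torch.nn as nn

# Minimal sketch (assumptions, not the authors' exact architecture) of the
# CNN+LSTM baseline: encode the question with an LSTM over learned word
# embeddings, concatenate with precomputed CNN image features, and classify
# over the answer vocabulary with an MLP.
class CnnLstmBaseline(nn.Module):
    def __init__(self, vocab_size, num_answers,
                 embed_dim=300, lstm_dim=1024, img_dim=2048, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, lstm_dim, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(lstm_dim + img_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, question_tokens, image_features):
        # question_tokens: (B, T) token ids; image_features: (B, img_dim)
        # pooled CNN features, extracted offline and not fine-tuned.
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q = h_n[-1]                                  # final LSTM hidden state
        return self.mlp(torch.cat([q, image_features], dim=1))
```

The image features would be extracted once with a frozen, ImageNet-pretrained ResNet-101, as described under the implementation details below.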
CNN+LSTM+MCB: Images and questions are encoded as above, but instead of concatenation, their features are pooled using compact multimodal pooling (MCB) [9, 11].
CNN+LSTM+SA: Again, the question and image are encoded using an LSTM and a CNN, respectively. Following [44], these representations are combined using one or more rounds of soft spatial attention and the final answer distribution is predicted with an MLP.
Human: We used Mechanical Turk to collect human responses for 5500 random questions from the test set, taking a majority vote among three workers for each question.
Implementation details. Our CNNs are ResNet-101 models pretrained on ImageNet [14] that are not finetuned; images are resized to 224 × 224 prior to feature extraction. CNN+LSTM+SA extracts features from the last layer of the conv4 stage, giving 14 × 14 × 1024-dimensional features. All other methods extract features from the final average pooling layer, giving 2048-dimensional features. LSTMs use one or two layers with 512 or 1024 units per layer. MLPs use ReLU functions and dropout [34]; they have one or two hidden layers with between 1024 and 8192 units per layer. All models are trained using Adam [20].

¹We performed initial experiments with dynamic module networks [2], but their parsing heuristics did not generalize to the complex questions in CLEVR, so they did not work out of the box; see supplementary material.
Experimental protocol. CLEVR is split into train, validation, and test sets (see Figure 3). We tuned hyperparameters (learning rate, dropout, word vector size, number and size of LSTM and MLP layers) independently per model based on the validation error. All experiments were designed on the validation set; after finalizing the design we ran each model once on the test set. All experimental findings generalized from the validation set to the test set.
4.2. Analysis by Question Type
We can use the program representation of questions to analyze model performance on different forms of reasoning. We first evaluate performance on each question type, defined as the outermost function in the program. Figure 4 shows results and detailed findings are discussed below.
Querying attributes: Query questions ask about an attribute of a particular object (e.g. "What color is the thing right of the red sphere?"). The CLEVR world has two sizes, eight colors, two materials, and three shapes. On questions asking about these different attributes, Q-type mode and LSTM obtain accuracies close to 50%, 12.5%, 50%, and 33.3% respectively, showing that the dataset has minimal question-conditional bias for these questions. CNN+LSTM+SA substantially outperforms all other models on these questions; its attention mechanism may help it focus on the target object and identify its attributes.
Comparing attributes: Attribute comparison questions ask whether two objects have the same value for some attribute (e.g. "Is the cube the same size as the sphere?"). The only valid answers are "yes" and "no". Q-type mode and LSTM achieve accuracies close to 50%, confirming there is no dataset bias for these questions. Unlike attribute-query

References

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In ECCV, 2014.