A Joint Speaker-Listener-Reinforcer Model for Referring Expressions
Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg
Department of Computer Science
University of North Carolina at Chapel Hill
{licheng, airsplay, mbansal, tlberg}@cs.unc.edu
Abstract
Referring expressions are natural language constructions used to identify particular objects within a scene. In this paper, we propose a unified framework for the tasks of referring expression comprehension and generation. Our model is composed of three modules: speaker, listener, and reinforcer. The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions. The listener-speaker modules are trained jointly in an end-to-end learning framework, allowing the modules to be aware of one another during learning while also benefiting from the discriminative reinforcer's feedback. We demonstrate that this unified framework and training achieves state-of-the-art results for both comprehension and generation on three referring expression datasets.
Project and demo page: https://vision.cs.unc.edu/refer
1. Introduction
People often use referring expressions in their everyday discourse to unambiguously identify or indicate particular objects within their physical environment. For example, one might point out a person in the crowd by referring to them as "the man in the blue shirt", or one might ask someone to "pass me the red pen on the table." In both of these examples, we have a pragmatic interaction between two people (or between a person and an intelligent agent such as a robot). First, we have a speaker who must generate an expression given a target object and its surrounding world context. Second, we have a listener who must interpret and comprehend the expression and map it to an object in the environment. Therefore, in this paper we propose an end-to-end trained listener-speaker framework that models these behaviors jointly.
In addition to the listener and speaker, we also introduce a new reinforcer module that learns a discriminative reward model to help generate less ambiguous expressions (expressions that apply to the target object but not to other objects in the image). This goal corresponds to the Gricean Maxim [8] of manner, where one tries to be as clear, brief, and orderly as possible while avoiding obscurity and ambiguity. Avoiding ambiguity is important because the generated expression should be easily and uniquely mapped to the target object. For example, if there were two pens on the table, one "long and red" and the other "short and red", asking for the "red pen" would be ambiguous while asking for the "long pen" would be better. The reinforcer module is incorporated using reinforcement learning, inspired by behavioral psychology, which says that agents operating in an environment should take actions that maximize their expected cumulative reward. In our case, the reward takes the form of a discriminative classifier trained to reward the speaker for generating less ambiguous expressions.
Within the realm of referring expressions, there are two tasks that can be computationally modeled, mimicking the listener and speaker roles. Referring Expression Generation (speaker) requires an algorithm to generate a referring expression for a given target object in a visual scene, as in Fig. 2. Referring Expression Comprehension (listener) requires an algorithm to localize the object/region in an image described by a given referring expression, as in Fig. 3.
The Referring Expression Generation (REG) task has been studied since the 1970s [29]. Many of the early works in this space focused on relatively limited datasets, using synthesized images of objects in artificial scenes or limited sets of real-world objects in simplified environments [20, 7, 15]. Recently, the research focus has shifted to more complex natural image datasets and has expanded to include the Referring Expression Comprehension task [13, 19, 31] as well as real-world interactions with robotics [4, 3]. One reason this has become feasible is that several large-scale REG datasets [13, 31, 19] have been collected on which deep learning models can be applied.
Recent neural approaches to the referring expression generation and comprehension tasks can be roughly split into two types. The first type uses a CNN-LSTM encoder-decoder generative model [25] to generate (decode) sentences given the encoded target object. With careful design of the visual representation of the target object, this model can generate unambiguous expressions [19, 31]. Here, the CNN-LSTM models P(r|o), where r is the referring expression and o is the target object, which can be easily converted to P(o|r) via Bayes' rule and used to address the comprehension task [10, 19, 31, 21] by selecting the o with the largest posterior probability. The second type of approach uses a joint-embedding model that projects both a visual representation of the target object and a semantic representation of the expression into a common space and learns a distance metric. Generation and comprehension can be performed by embedding a target object (or expression) into the embedding space and retrieving the closest expression (or object) in this space. This type of approach typically achieves better comprehension performance than the CNN-LSTM model, as in [23, 26], but previously was only applied to the referring expression comprehension task. Recent work [1] has also used both an encoder-decoder model (speaker) and an embedding model (listener) for referring expression generation in abstract images, where the offline listener reranks the speaker's output.
In this paper, we propose a unified model that jointly learns both the CNN-LSTM speaker and embedding-based listener models, for both the generation and comprehension tasks. Additionally, we add a discriminative reward-based reinforcer to guide the sampling of more discriminative expressions and further improve our final system. Instead of working independently, we let the speaker, listener, and reinforcer interact with each other, resulting in improved performance on both generation and comprehension tasks. Results evaluated on three standard, large-scale datasets verify that our proposed listener-speaker-reinforcer model significantly outperforms the state-of-the-art on both the comprehension task (Tables 1 and 2) and the generation task (evaluated using human judgements in Table 4, and automatic metrics in Table 3).
2. Related work
Recent years have witnessed a rise in multimodal research related to vision and language. Given the individual success in each area and the need for models with more advanced cognition capabilities, several tasks have emerged as evaluation applications, including image captioning, visual question answering, and referring expression generation/comprehension.
Image Captioning: The aim of image captioning is to generate a sentence describing the general content of an image. Most recent approaches use deep learning to address this problem. Perhaps the most common architecture is a CNN-LSTM model [25], which generates a sentence conditioned on visual information from the image. One paper related to our work is gLSTM [11], which uses CCA semantics to guide the caption generation. A further step beyond image captioning is to locate the regions being described in captions [27, 22, 26]. The Visual Genome dataset [16] collected captions for dense regions in an image, which have been used for dense-captioning tasks [12]. There has also been a movement toward more focused tasks, such as visual question answering [2, 30], and referring expression generation and comprehension, which involve specific regions/objects within an image (discussed below).
Referring Expression Datasets: REG has been studied for many years [29, 15, 20] in linguistics and natural language processing, but has mainly focused on small or artificial datasets. In 2014, Kazemzadeh et al. [13] introduced the first large-scale dataset, RefCLEF, using 20,000 real-world natural images [9]. This dataset was collected in a two-player game, where the first player writes a referring expression given an indicated target object. The second player is shown only the image and expression and has to click on the correct object described by the expression. If the click lies within the target object region, both sides get points and their roles switch. Using the same game interface, the authors further collected the RefCOCO and RefCOCO+ datasets on MSCOCO images [31]. The RefCOCO and RefCOCO+ datasets each contain 50,000 referred objects with 3 referring expressions on average. The main difference between RefCOCO and RefCOCO+ is that in RefCOCO+, players were forbidden from using absolute location words, e.g., "left dog", therefore focusing the referring expressions on purely appearance-based descriptions. In addition, Mao et al. [19] also collected a referring expression dataset, RefCOCOg, using MSCOCO images, but in a non-interactive framework. These expressions are more similar to the MSCOCO captions in that they are longer and more complex, as there was no time constraint in the non-interactive data collection setting. This dataset has 96,654 objects with 1.3 expressions per object on average.
Referring Expression Comprehension and Generation: Referring expressions are associated with two tasks, comprehension and generation. The comprehension task requires a system to select the region being described by a given referring expression. To address this problem, [19, 31, 21, 10] model P(r|o) and look for the object o maximizing the probability. Other work models P(o, r) directly using an embedding model [23, 26], which learns to minimize the distance between paired objects and sentences in the embedding space. The generation task asks a system to compose an expression for a specified object within an image. While some previous work used rule-based approaches to generate expressions with fixed grammar patterns [20, 5, 13], recent work has followed the CNN-LSTM structure to generate expressions [19, 31].

Figure 1: Framework: The Speaker is a CNN-LSTM model, which generates a referring expression for the target object. The
Listener is a joint-embedding model learned to minimize the distance between paired object and expression representations.
In addition, a Reinforcer module helps improve the speaker by sampling more discriminative expressions for training.
3. Model
Our model is composed of three modules: speaker (Sec. 3.1), listener (Sec. 3.2), and reinforcer (Sec. 3.3). During training, the speaker and listener are trained jointly so that they can benefit from each other and from the reinforcer. As the reward function for the reinforcer is not differentiable, it is incorporated using a policy-gradient reinforcement learning algorithm.
3.1. Speaker
For our speaker module, we follow the previous state-of-the-art [19, 31] and use a CNN-LSTM framework. Here, a pre-trained CNN model is used to define a visual representation for the target object and other visual context. Then, a long short-term memory (LSTM) network is used to generate the most likely expression given the visual representation.
Because of the improved quantitative performance over [19], we use the visual comparison model of [31] as our speaker (to encode the target object). Here, the visual representation includes the target object, context, location/size features, and two visual comparison features. Specifically, the target object representation $o_i$ is modeled as the fc7 features from a pre-trained VGG network [24]. Global context, $g_i$, is modeled as features extracted from the VGG-fc7 layer for the entire image. The location/size representation, $l_i$, for the target object is modeled as a 5-dimensional vector, encoding the x and y locations of the top left and bottom right corners of the target object bounding box, as well as the bounding box size with respect to the image, i.e., $l_i = [\frac{x_{tl}}{W}, \frac{y_{tl}}{H}, \frac{x_{br}}{W}, \frac{y_{br}}{H}, \frac{w \cdot h}{W \cdot H}]$.
As referring expressions often relate an object to other objects of the same type within the image ("the red ball" vs. "the blue ball" or "the larger elephant"), comparisons tend to be quite important for differentiation. The comparison features are composed of two parts: a) appearance similarity, $\delta v_i = \frac{1}{n} \sum_{j \neq i} \frac{o_i - o_j}{\| o_i - o_j \|}$, where $n$ is the number of objects chosen for comparisons; b) location and size similarity, $\delta l_i$, concatenating the 5-d difference on each compared object, $\delta l_{ij} = [\frac{[\Delta x_{tl}]_{ij}}{w_i}, \frac{[\Delta y_{tl}]_{ij}}{h_i}, \frac{[\Delta x_{br}]_{ij}}{w_i}, \frac{[\Delta y_{br}]_{ij}}{h_i}, \frac{w_j h_j}{w_i h_i}]$.
The final visual representation for the target object is then a concatenation of the above features followed by a fully-connected layer fusing them together, $r_i = W_m [o_i, g_i, l_i, \delta v_i, \delta l_i] + b_m$. This joint feature is then fed into the LSTM for referring expression generation. During training we minimize the negative log-likelihood:

$$L_1^s(\theta) = -\sum_i \log P(r_i \mid o_i; \theta) = -\sum_i \sum_t \log P(r_i^t \mid r_i^{t-1}, \ldots, r_i^1, o_i; \theta) \qquad (1)$$
Note that the speaker can be modeled using any form of
CNN-LSTM structure.
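To make the speaker concrete, the sketch below builds the concatenated visual feature and computes the LSTM negative log-likelihood of Eqn. 1. This is a minimal PyTorch sketch under assumed shapes (fc7 features precomputed, expressions pre-tokenized, index 0 used as padding); the class and function names (`Speaker`, `location_feature`) are ours for illustration, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def location_feature(box, W, H):
    """5-d location/size feature l_i = [x_tl/W, y_tl/H, x_br/W, y_br/H, w*h/(W*H)]."""
    x1, y1, x2, y2 = box
    return torch.tensor([x1 / W, y1 / H, x2 / W, y2 / H,
                         (x2 - x1) * (y2 - y1) / (W * H)])

class Speaker(nn.Module):
    """Minimal CNN-LSTM speaker: fuse the visual features, then decode words."""
    def __init__(self, vis_dim=4096, feat_dim=512, vocab_size=2000, hidden=512):
        super().__init__()
        # fc7 target (o_i) + fc7 context (g_i) + 5-d location (l_i)
        # + fc7-sized appearance difference (delta v_i) + 25-d location difference (delta l_i)
        in_dim = vis_dim + vis_dim + 5 + vis_dim + 25
        self.fuse = nn.Linear(in_dim, feat_dim)              # r_i = W_m [...] + b_m
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def nll(self, o, g, l, dv, dl, words):
        """Negative log-likelihood of ground-truth expressions (Eqn. 1); words: (B, T) token ids."""
        r = self.fuse(torch.cat([o, g, l, dv, dl], dim=1))   # fused visual feature (B, feat_dim)
        inp = torch.cat([r.unsqueeze(1), self.embed(words[:, :-1])], dim=1)  # condition on r
        h, _ = self.lstm(inp)
        logits = self.out(h)                                 # (B, T, vocab)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               words.reshape(-1), ignore_index=0)  # 0 assumed to be padding
```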
In [19], Mao et al. proposed to add a Maximum Mutual Information (MMI) constraint encouraging the generated expression to describe the target object better than the other objects within the image (i.e., a ranking loss on objects). We generalize this idea to incorporate two triplet hinge losses composed of a positive match and two negative matches. Given a positive match $(r_i, o_i)$, we sample the contrastive pair $(r_j, o_i)$, where $r_j$ is an expression describing some other object, and the pair $(r_i, o_k)$, where $o_k$ is some other object in the same image; we then optimize the following max-margin loss:
$$L_2^s(\theta) = \sum_i \big[ \lambda_1^s \max(0, M + \log P(r_i \mid o_k) - \log P(r_i \mid o_i)) + \lambda_2^s \max(0, M + \log P(r_j \mid o_i) - \log P(r_i \mid o_i)) \big] \qquad (2)$$
The first term is from [19], while the second term encourages the target object to be better described by the true expression than by expressions describing other objects in the image (i.e., a ranking loss on expressions).
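A sketch of the generalized MMI loss in Eqn. 2 follows, assuming the speaker exposes sentence log-probabilities $\log P(r|o)$ for the sampled positive and contrastive pairs; the helper name `mmi_triplet_loss` and the default weights are illustrative.

```python
import torch

def mmi_triplet_loss(logp_pos, logp_wrong_obj, logp_wrong_expr,
                     margin=1.0, lam1=1.0, lam2=0.1):
    """Eqn. 2: two hinge losses on speaker log-probabilities.

    logp_pos        = log P(r_i | o_i)   true expression, true object
    logp_wrong_obj  = log P(r_i | o_k)   true expression, other object
    logp_wrong_expr = log P(r_j | o_i)   other expression, true object
    """
    l1 = torch.clamp(margin + logp_wrong_obj - logp_pos, min=0.0)
    l2 = torch.clamp(margin + logp_wrong_expr - logp_pos, min=0.0)
    return (lam1 * l1 + lam2 * l2).sum()
```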
3.2. Listener
We use a joint-embedding model to mimic the listener's behaviour. The purpose of this embedding model is to encode the visual information from the target object and the semantic information from the referring expression into a joint space that embeds visually or semantically related vectors closer together. For the referring expression comprehension task, given a referring expression representation, the listener embeds it into the joint space, then selects the closest object in the embedding space as the predicted target object.

As illustrated in Fig. 1, for our listener joint-embedding model (outlined by a red dashed line), we use an LSTM to encode the input referring expression and the same visual representation as the speaker to encode the target object (thus connecting the speaker to the listener). We then add two MLPs (multi-layer perceptrons) and two L2-normalization layers, one following each view (the object and the expression). Each MLP is composed of two fully connected layers with ReLU nonlinearities between them, serving to transform the object view and the expression view into a common embedding space. The inner product of the two normalized representations is computed as their similarity score $S(r, o)$ in the space. As a listener, we enforce similarity between paired target objects and referring expressions by applying a hinge loss over triplets, which consist of a positive match and two negative matches:
$$L^l(\theta) = \sum_i \big[ \lambda_1^l \max(0, M + S(r_i, o_k) - S(r_i, o_i)) + \lambda_2^l \max(0, M + S(r_j, o_i) - S(r_i, o_i)) \big] \qquad (3)$$

where the negative matches are randomly chosen from the other objects and expressions in the same image.
Note that the listener model is not limited to this particular triplet-based model. For example, [23] computes a similarity score between every object and a given referring expression, and minimizes the cross-entropy of a softmax over these scores given the known target object, which could also be applied here.
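Below is a minimal sketch of the joint-embedding listener and the triplet hinge loss of Eqn. 3. The class and argument names (`Listener`, `obj_feat`, `expr_feat`) are assumptions for illustration, and the expression encoder is assumed to already produce a fixed-size LSTM summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Listener(nn.Module):
    """Joint-embedding listener: MLP + L2 normalization on each view, inner-product similarity."""
    def __init__(self, vis_dim=512, lang_dim=512, embed_dim=512):
        super().__init__()
        self.vis_mlp = nn.Sequential(nn.Linear(vis_dim, embed_dim), nn.ReLU(),
                                     nn.Linear(embed_dim, embed_dim))
        self.lang_mlp = nn.Sequential(nn.Linear(lang_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, embed_dim))

    def similarity(self, obj_feat, expr_feat):
        v = F.normalize(self.vis_mlp(obj_feat), dim=-1)       # L2-normalized object view
        s = F.normalize(self.lang_mlp(expr_feat), dim=-1)     # L2-normalized expression view
        return (v * s).sum(dim=-1)                            # S(r, o): inner product

def listener_loss(listener, o_pos, r_pos, o_neg, r_neg, margin=1.0, lam1=1.0, lam2=1.0):
    """Eqn. 3: hinge loss over a positive pair and two sampled negatives."""
    s_pos = listener.similarity(o_pos, r_pos)                 # S(r_i, o_i)
    s_wrong_obj = listener.similarity(o_neg, r_pos)           # S(r_i, o_k)
    s_wrong_expr = listener.similarity(o_pos, r_neg)          # S(r_j, o_i)
    return (lam1 * torch.clamp(margin + s_wrong_obj - s_pos, min=0.0)
            + lam2 * torch.clamp(margin + s_wrong_expr - s_pos, min=0.0)).sum()
```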
3.3. Reinforcer
Besides using the ground-truth pairs of target object and referring expression for training the speaker, we also use reinforcement learning to guide the speaker toward generating less ambiguous expressions (expressions that apply to the target object but not to other objects). This reinforcer module is composed of a discriminative reward function and performs a non-differentiable policy gradient update to the speaker.
Specifically, given the softmax output of the speaker's LSTM, we sample words according to the categorical distribution at each time step, resulting in a complete expression after sampling the <END> token. This sampling operation is non-differentiable, as we do not know whether an expression is ambiguous or not until we feed it into a reward function. Therefore, we use policy-gradient reinforcement learning to update the speaker's parameters. Here, the goal is to maximize the reward expectation $F(w_{1:T})$ under the distribution $p(w_{1:T}; \theta)$ parameterized by the speaker, i.e., $J = \mathbb{E}_{p(w_{1:T})}[F]$. According to the policy gradient algorithm [28], we have

$$\nabla_\theta J = \mathbb{E}_{p(w_{1:T})} \big[ F(w_{1:T}) \, \nabla_\theta \log p(w_{1:T}; \theta) \big], \qquad (4)$$

where $\log p(w_t)$ is defined by the softmax output. We then use this gradient to update our speaker model during training.
The only thing left is to choose a reward function that encourages the speaker to sample less ambiguous expressions. As illustrated in Fig. 1 (outlined in dashed orange), the reinforcer module learns a reward function using paired objects and expressions. We again use the same visual representation for the target object and use another LSTM to encode the expression representation. Rather than using two MLPs to encode each view as in the listener, here we concatenate the two views and feed them together into an MLP to learn a 1-d logistic regression score between 0 and 1. Trained with a cross-entropy loss, the reward function computes a match score between an input object and expression. We use this score as the reward signal in Eqn. 4 for sampled expression and target object pairs. After training, the reward function is fixed to assist our joint speaker-listener system.
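The sketch below illustrates the two pieces described above: a reward classifier over concatenated object/expression features, and a REINFORCE-style surrogate loss whose gradient matches Eqn. 4. Names (`RewardFunction`, `policy_gradient_loss`) and shapes are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class RewardFunction(nn.Module):
    """Discriminative reward: concatenate object and expression views, output a score in (0, 1)."""
    def __init__(self, vis_dim=512, lang_dim=512, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vis_dim + lang_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, obj_feat, expr_feat):
        return torch.sigmoid(self.mlp(torch.cat([obj_feat, expr_feat], dim=-1))).squeeze(-1)

def policy_gradient_loss(sampled_logprobs, reward):
    """Surrogate loss whose gradient is -F(w_{1:T}) * grad log p(w_{1:T}; theta), as in Eqn. 4.

    sampled_logprobs: (B, T) log-probs of the sampled words under the speaker
    reward:           (B,)   fixed reward-function scores for the sampled expressions (no gradient)
    """
    logp_seq = sampled_logprobs.sum(dim=1)          # log p(w_{1:T}; theta)
    return -(reward.detach() * logp_seq).mean()
```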
3.4. Joint Model
In this subsection, we describe some specifics of how our three modules (speaker, listener, reinforcer) are integrated into a joint framework (shown in Fig. 1). For the listener, we notice that the visual vector in the embedding space is learned to capture the neighbourhood of referring expression vectors, thus making it aware of the listener's knowledge. Therefore, we take this MLP-embedded vector as an additional input for the speaker, encoding the listener-based information. In Fig. 1, we use concatenation to jointly encode the standard visual representation of the target object and this listener-aware representation and then feed them into the speaker. Besides concatenation, an element-wise product or compact bilinear pooling could also be applied [6]. During training, we sample the same triplets for both the speaker and listener, and share the word embedding between the speaker and listener to reduce the number of parameters. For the reinforcer module, we perform sentence sampling using the speaker's LSTM, as shown in the top right of Fig. 1. Within each mini-batch, the sampled expressions for the target objects are fed into the reward function to obtain reward values.
The overall loss function is formulated as a multi-task learning problem:

$$\theta = \arg\min_\theta \; L_1^s(\theta) + L_2^s(\theta) + L^l(\theta) - \lambda_r J(\theta), \qquad (5)$$

where $\lambda_r$ is the weight on the reward loss. The weights on the speaker and listener losses are already included in Eqn. 2 and Eqn. 3. We list all hyper-parameter settings in Sec. 4.1.
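Putting the pieces together, a training step under this multi-task objective might look like the following sketch of Eqn. 5; the loss helpers are the ones sketched earlier in this section, and the variable names are illustrative.

```python
def joint_loss(speaker_nll, speaker_mmi, listener_hinge, rl_surrogate, lam_r=1.0):
    """Eqn. 5: L_1^s + L_2^s + L^l - lambda_r * J.

    rl_surrogate is the negated REINFORCE objective (see policy_gradient_loss above),
    so adding lam_r * rl_surrogate is equivalent to subtracting lam_r * J.
    """
    return speaker_nll + speaker_mmi + listener_hinge + lam_r * rl_surrogate

# usage sketch: total = joint_loss(L_s1, L_s2, L_l, pg_loss); total.backward()
```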
3.5. Comprehension and Generation
For the comprehension task, at test time, we can use either the speaker or listener to select the target object given an input expression. Using the listener, we would embed the input expression into the learned embedding space and select the closest object as the predicted target. Using the speaker, we would generate expressions for each object within the image and then select the object whose generated expression best matches the input expression. Therefore, we utilize both modules by ensembling the speaker and listener predictions together to pick the most probable object given a referring expression:

$$\hat{o} = \arg\max_o \; P(r \mid o) \, S(o, r)^{\lambda} \qquad (6)$$

Surprisingly, using the speaker alone (setting $\lambda$ to 0) already achieves state-of-the-art results due to our joint training. Adding the listener further improves performance by more than 4% over previous state-of-the-art results.
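A sketch of the test-time ensemble in Eqn. 6 follows, scoring every candidate object with the speaker likelihood and the listener similarity; `speaker_logprobs` and the listener's `similarity` method are the assumed interfaces from the earlier sketches, and clamping non-positive similarities is a choice of this sketch, not the paper.

```python
import torch

def comprehend(expr_feat, obj_feats, speaker_logprobs, listener, lam=1.0):
    """Pick the object maximizing P(r|o) * S(o, r)^lambda (Eqn. 6).

    speaker_logprobs: (N,)       log P(r | o) for each of the N candidate objects
    obj_feats:        (N, D_vis) visual features of the candidates
    expr_feat:        (D_lang,)  encoded input expression
    """
    expr = expr_feat.unsqueeze(0).expand(obj_feats.size(0), -1)
    sims = listener.similarity(obj_feats, expr)               # S(o, r) for every candidate
    # clamp non-positive similarities so the power/product stays well defined in this sketch
    scores = speaker_logprobs.exp() * sims.clamp(min=1e-8).pow(lam)
    return torch.argmax(scores).item()
```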
For the generation task, we first let the speaker generate multiple expressions per object via beam search. We then use the listener to rerank these expressions and select the least ambiguous expression, which is similar to [1]. To fully utilize the listener's power in generation, we propose to consider cross comprehension as well as the diversity of expressions by minimizing the potential:
$$E(r) = \sum_i \theta_i(r_i) + \sum_{i,j} \theta_{i,j}(r_i, r_j)$$
$$\theta_i(r_i) = -\log P(r_i \mid o_i) - \lambda_1 \log S(r_i, o_i) + \lambda_2 \max_{j \neq i} \log S(r_i, o_j)$$
$$\theta_{i,j}(r_i, r_j) = \lambda_3 \, \mathbb{1}(r_i = r_j) \qquad (7)$$
(7)
The first term and second term in the unary potential
measure how well the target object and generated expres-
sion match using the speaker and listener modules respec-
tively (also used in [
1]). The third term in the unary po-
tential measures the likelihood of the generated sentence
of describing other objects in the same image. The pair-
wise potential penalizes the same sentence being generated
for different objects (encouraging diversity in generation).
In this way, the expressions for every object in an image
are jointly generated. Compared with the previous model
that attempted to tie language generation of referring ex-
pressions together [
31], the constraints in Eqn. 7 are more
explicit and overall this works better to reduce ambiguity in
the generated expressions.
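The reranking idea can be sketched as scoring each beam-search candidate with the unary potential above (the pairwise diversity term is omitted here for brevity); the candidate tensors and helper name `rerank_candidates` are assumptions for illustration.

```python
import torch

def rerank_candidates(cand_logprobs, cand_sims_target, cand_sims_others,
                      lam1=1.0, lam2=1.0):
    """Score candidate expressions with the unary potential of Eqn. 7 (lower is better).

    cand_logprobs:    (K,)   log P(r | o_i) for K beam candidates of object i
    cand_sims_target: (K,)   listener similarity S(r, o_i)
    cand_sims_others: (K, N) listener similarity S(r, o_j) to the other N objects
    """
    unary = (-cand_logprobs
             - lam1 * torch.log(cand_sims_target.clamp(min=1e-8))
             + lam2 * torch.log(cand_sims_others.clamp(min=1e-8)).max(dim=1).values)
    return torch.argmin(unary).item()   # index of the least ambiguous candidate
```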
4. Experiments
4.1. Optimization
We optimize our model using Adam [14] with an initial learning rate of 0.0004, halved every 2,000 iterations, and a batch size of 32. The word embedding size and hidden state size of the LSTM are set to 512. To avoid overfitting, we apply dropout with a ratio of 0.2 after each linear transformation in the MLP layers. We also regularize the word-embedding and output layers of the speaker's LSTM using dropout with a ratio of 0.5. For the contrastive pairs, we set $\lambda_1^l = 1$ and $\lambda_2^l = 1$ in the listener (Eqn. 3), and $\lambda_1^s = 1$ and $\lambda_2^s = 0.1$ in the speaker (Eqn. 2). The weight on the reward loss is set to $\lambda_r = 1$.
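For reference, these hyper-parameters translate into an optimizer setup roughly like the following sketch; the placeholder parameter, the training loop, and the choice of `StepLR` as the halving schedule are assumptions, since the paper does not describe an implementation framework here.

```python
import torch
import torch.nn as nn

# stand-in parameter so the sketch runs; in practice these are the joint model's parameters
params = nn.Parameter(torch.zeros(8))
optimizer = torch.optim.Adam([params], lr=4e-4)                                     # initial LR 0.0004
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2000, gamma=0.5)   # halve every 2,000 iters

mlp_dropout = nn.Dropout(p=0.2)        # after each linear transformation in the MLPs
lstm_word_dropout = nn.Dropout(p=0.5)  # on the speaker LSTM's word-embedding and output layers

for iteration in range(5):             # mini-batches of 32 in the paper
    loss = (params ** 2).sum()         # placeholder for the joint loss of Eqn. 5
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```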
4.2. Datasets
We perform experiments on three referring expression datasets: RefCOCO, RefCOCO+ and RefCOCOg (described in Sec. 2). All three datasets are collected on MSCOCO images [17], but with several differences: 1) RefCOCO and RefCOCO+ were collected using an interactive game interface, while RefCOCOg was collected in a non-interactive setting and contains longer expressions; 2) RefCOCOg contains on average 1.63 objects of the same type per image, while RefCOCO and RefCOCO+ have 3.9 on average; 3) RefCOCO+ disallowed absolute location words in referring expressions. Overall, RefCOCO has 142,210 expressions for 50,000 objects in 19,994 images, RefCOCO+ has 141,565 expressions for 49,856 objects in 19,992 images, and RefCOCOg has 104,560 expressions for 54,822 objects in 26,711 images.

Additionally, each dataset is provided with dataset splits for evaluation. RefCOCO and RefCOCO+ provide person vs. object splits for evaluation. Images containing multiple people are in "TestA", while images containing multiple objects of other categories are in "TestB". For RefCOCOg, the authors divide their dataset by randomly partitioning objects into training and testing splits, so the same image may appear in both splits. As only the training and validation splits have been released for this dataset, we use the hyper-parameters cross-validated on RefCOCO to train models on RefCOCOg.
References

[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[17] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.