A Joint Speaker-Listener-Reinforcer Model for Referring Expressions
Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg
Department of Computer Science
University of North Carolina at Chapel Hill
{licheng, airsplay, mbansal, tlberg}@cs.unc.edu
Abstract
Referring expressions are natural language construc-
tions used to identify particular objects within a scene. In
this paper, we propose a unified framework for the tasks of
referring expression comprehension and generation. Our
model is composed of three modules: speaker, listener, and
reinforcer. The speaker generates referring expressions, the
listener comprehends referring expressions, and the rein-
forcer introduces a reward function to guide sampling of
more discriminative expressions. The listener-speaker mod-
ules are trained jointly in an end-to-end learning frame-
work, allowing the modules to be aware of one another
during learning while also benefiting from the discrimina-
tive reinforcer’s feedback. We demonstrate that this unified
framework and training achieves state-of-the-art results for
both comprehension and generation on three referring
expression datasets.¹
1. Introduction
People often use referring expressions in their everyday
discourse to unambiguously identify or indicate particular
objects within their physical environment. For example, one
might point out a person in the crowd by referring to them
as “the man in the blue shirt” or you might ask someone
to “pass me the red pen on the table.” In both of these ex-
amples, we have a pragmatic interaction between two peo-
ple (or between a person and an intelligent agent such as
a robot). First, we have a speaker who must generate an
expression given a target object and its surrounding world
context. Second, we have a listener who must interpret and
comprehend the expression and map it to an object in the
environment. Therefore, in this paper we propose an end-
to-end trained listener-speaker framework that models these
behaviors jointly.
In addition to the listener and speaker, we also introduce
a new reinforcer module that learns a discriminative
reward model to help generate less ambiguous expressions
(expressions that apply to the target object but not to other
objects in the image). This goal corresponds to the Gricean
Maxim [8] of manner, where one tries to be as clear, brief,
and orderly as possible while avoiding obscurity and ambiguity.
Avoiding ambiguity is important because the generated
expression should be easily and uniquely mapped to
the target object. For example, if there were two pens on
the table, one “long and red” and the other “short and red”,
asking for the “red pen” would be ambiguous while asking
for the “long pen” would be better. The reinforcer module is
incorporated using reinforcement learning, inspired by
behavioral psychology, which holds that agents operating in an
environment should take actions that maximize their expected
cumulative reward. In our case, the reward takes the form
of a discriminative classifier trained to reward the speaker
for generating less ambiguous expressions.
¹ Project and demo page: https://vision.cs.unc.edu/refer.
Within the realm of referring expressions, there are two
tasks that can be computationally modeled, mimicking the
listener and speaker roles. Referring Expression Genera-
tion (speaker) requires an algorithm to generate a referring
expression for a given target object in a visual scene, as in
Fig. 2. Referring Expression Comprehension (listener) requires
an algorithm to localize the object/region in an image
described by a given referring expression, as in Fig. 3.
The Referring Expression Generation (REG) task has
been studied since the 1970s [29]. Many of the early
works in this space focused on relatively limited datasets,
using synthesized images of objects in artificial scenes
or limited sets of real-world objects in simplified
environments [20, 7, 15]. Recently, the research focus has
shifted to more complex natural image datasets and has
expanded to include the Referring Expression Comprehension
task [13, 19, 31] as well as real-world interactions with
robotics [4, 3]. One reason this has become feasible is that
several large-scale REG datasets [13, 31, 19] have been
collected, to which deep learning models can be applied.
Recent neural approaches to the referring expression
generation and comprehension tasks can be roughly split
into two types. The first type uses a CNN-LSTM encoder-decoder
generative model [25] to generate (decode) sentences
given the encoded target object. With careful design
of the visual representation of the target object, this model
can generate unambiguous expressions [19, 31]. Here, the
CNN-LSTM models $P(r|o)$, where $r$ is the referring
expression and $o$ is the target object, which can be easily
converted to $P(o|r)$ via Bayes’ rule and used to address the
comprehension task [10, 19, 31, 21] by selecting the $o$ with
the largest posterior probability. The second type of
approach uses a joint-embedding model that projects both a
visual representation of the target object and a semantic
representation of the expression into a common space and learns
a distance metric. Generation and comprehension can be
performed by embedding a target object (or expression) into
the embedding space and retrieving the closest expression
(or object) in this space. This type of approach typically
achieves better comprehension performance than the
CNN-LSTM model, as in [23, 26], but previously was only
applied to the referring expression comprehension task.
Recent work [1] has also used both an encoder-decoder model
(speaker) and an embedding model (listener) for referring
expression generation in abstract images, where the offline
listener reranks the speaker’s output.
In this paper, we propose a unified model that jointly
learns both the CNN-LSTM speaker and embedding-based
listener models, for both the generation and comprehension
tasks. Additionally, we add a discriminative reward-based
reinforcer to guide the sampling of more discriminative
expressions and further improve our final system. Instead of
working independently, we let the speaker, listener, and
reinforcer interact with each other, resulting in improved
performance on both generation and comprehension tasks.
Results evaluated on three standard, large-scale datasets verify
that our proposed listener-speaker-reinforcer model
significantly outperforms the state-of-the-art on both the
comprehension task (Tables 1 and 2) and the generation task
(evaluated using human judgements in Table 4, and automatic
metrics in Table 3).
2. Related work
Recent years have witnessed a rise in multimodal re-
search related to vision and language. Given the individ-
ual success in each area and the need for models with more
advanced cognition capabilities, several tasks have emerged
as evaluation applications, including image captioning, vi-
sual question answering, and referring expression genera-
tion/comprehension.
Image Captioning: The aim of image captioning is to
generate a sentence describing the general content of an
image. Most recent approaches use deep learning to address
this problem. Perhaps the most common architecture is a
CNN-LSTM model [25], which generates a sentence
conditioned on visual information from the image. One
paper related to our work is gLSTM [11], which uses CCA
semantics to guide the caption generation. A further step
beyond image captioning is to locate the regions being
described in captions [27, 22, 26]. The Visual Genome dataset [16]
provides captions for dense regions in an image, which have
been used for dense-captioning tasks [12]. There has also
been a movement toward more focused tasks, such as
visual question answering [2, 30], and referring expression
generation and comprehension, which involve specific
regions/objects within an image (discussed below).
Referring Expression Datasets: REG has been studied
for many years [29, 15, 20] in linguistics and natural
language processing, but mainly focused on small or artificial
datasets. In 2014, Kazemzadeh et al. [13] introduced the first
large-scale dataset, RefCLEF, using 20,000 real-world
natural images [9]. This dataset was collected in a two-player
game, where the first player writes a referring expression
given an indicated target object. The second player is shown
only the image and expression and has to click on the
correct object described by the expression. If the click lies
within the target object region, both sides get points and
their roles switch. Using the same game interface, the
authors further collected the RefCOCO and RefCOCO+ datasets
on MSCOCO images [31]. The RefCOCO and RefCOCO+
datasets each contain 50,000 referred objects with 3
referring expressions on average. The main difference between
RefCOCO and RefCOCO+ is that in RefCOCO+, players
were forbidden from using absolute location words, e.g. left
dog, therefore focusing the referring expressions on purely
appearance-based descriptions. In addition, Mao et al. [19]
also collected a referring expression dataset, RefCOCOg,
using MSCOCO images, but in a non-interactive
framework. These expressions are more similar to the MSCOCO
captions in that they are longer and more complex, as there
was no time constraint in the non-interactive data collection
setting. This dataset has 96,654 objects with 1.3 expressions
per object on average.
Referring Expression Comprehension and Generation:
Referring expressions are associated with two tasks,
comprehension and generation. The comprehension task
requires a system to select the region being described by
a given referring expression. To address this problem,
[19, 31, 21, 10] model $P(r|o)$ and look for the object
$o$ maximizing the probability. Others model
$P(o, r)$ directly using an embedding model [23, 26], which
learns to minimize the distance between paired object and
sentence in the embedding space. The generation task asks
a system to compose an expression for a specified object
within an image. While some previous work used rule-based
approaches to generate expressions with a fixed grammar
pattern [20, 5, 13], recent work has followed the
CNN-LSTM structure to generate expressions [19, 31].
[Figure 1 diagram; labels: Speaker, Listener, Reinforcer, LSTM, MLP, Concat, L2-Normalization, Sampling, Generation Loss, Embedding Loss, Reward Loss; example expression: “Man in the middle wearing yellow”.]
Figure 1: Framework: The Speaker is a CNN-LSTM model, which generates a referring expression for the target object. The
Listener is a joint-embedding model learned to minimize the distance between paired object and expression representations.
In addition, a Reinforcer module helps improve the speaker by sampling more discriminative expressions for training.
3. Model
Our model is composed of three modules: speaker
(Sec 3.1), listener (Sec 3.2), and reinforcer (Sec 3.3).
During training, the speaker and listener are trained jointly so
that they can benefit from each other and from the
reinforcer. As the reward function for the reinforcer is not
differentiable, it is incorporated using a policy gradient
reinforcement learning algorithm.
3.1. Speaker
For our speaker module, we follow the previous
state-of-the-art [19, 31] and use a CNN-LSTM framework. Here, a
pre-trained CNN model is used to define a visual
representation for the target object and other visual context. Then,
a long short-term memory (LSTM) network is used to generate the
most likely expression given the visual representation.
Because of the improved quantitative performance
over [19], we use the visual comparison model of [31]
as our speaker (to encode the target object). Here, the
visual representation includes the target object, context,
location/size features, and two visual comparison features.
Specifically, the target object representation $o_i$ is modeled
as the fc7 features from a pre-trained VGG network [24].
Global context, $g_i$, is modeled as features extracted from
the VGG-fc7 layer for the entire image. The location/size
representation, $l_i$, for the target object is modeled as a
5-dimensional vector, encoding the x and y locations of the
top left and bottom right corners of the target object
bounding box, as well as the bounding box size with respect to the
image, i.e.,
$l_i = \left[\frac{x_{tl}}{W}, \frac{y_{tl}}{H}, \frac{x_{br}}{W}, \frac{y_{br}}{H}, \frac{w \cdot h}{W \cdot H}\right]$.
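As a minimal sketch of the location/size feature $l_i$ described above, the following Python function normalizes a bounding box by the image size. The function name and the `(x_tl, y_tl, x_br, y_br)` pixel box format are assumptions made for illustration, not the authors' code.

```python
def location_feature(box, image_width, image_height):
    """Return the 5-d location/size vector l_i for one bounding box.

    `box` is (x_tl, y_tl, x_br, y_br) in pixels; this corner format is an
    assumption made for the sketch.
    """
    x_tl, y_tl, x_br, y_br = box
    w, h = x_br - x_tl, y_br - y_tl
    return [
        x_tl / image_width,                      # top-left x, relative to W
        y_tl / image_height,                     # top-left y, relative to H
        x_br / image_width,                      # bottom-right x, relative to W
        y_br / image_height,                     # bottom-right y, relative to H
        (w * h) / (image_width * image_height),  # box area relative to image area
    ]


feat = location_feature((100, 50, 300, 250), image_width=400, image_height=500)
```

Every entry lies in [0, 1] for a box inside the image, so the feature is independent of the image resolution.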
As referring expressions often relate an object to other
objects of the same type within the image (“the red ball” vs
“the blue ball” or “the larger elephant”), comparisons tend
to be quite important for differentiation. The comparison
features are composed of two parts: a) appearance similarity
$\delta v_i = \frac{1}{n} \sum_{j \neq i} \frac{o_i - o_j}{\| o_i - o_j \|}$,
where $n$ is the number of objects
chosen for comparisons, b) location and size similarity $\delta l_i$,
concatenating the 5-d difference on each compared object
$\delta l_{ij} = \left[\frac{[\triangle x_{tl}]_{ij}}{w_i}, \frac{[\triangle y_{tl}]_{ij}}{h_i}, \frac{[\triangle x_{br}]_{ij}}{w_i}, \frac{[\triangle y_{br}]_{ij}}{h_i}, \frac{w_j h_j}{w_i h_i}\right]$.
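The two comparison features can be sketched in plain Python as follows. This is an illustrative sketch only: the function names are hypothetical, the sign convention of the offsets (object $j$ minus object $i$) is an assumption the text does not fix, and skipping zero-norm differences is a choice made here to avoid division by zero.

```python
import math


def appearance_difference(target, others):
    """delta v_i: mean of the unit-normalized differences o_i - o_j."""
    total = [0.0] * len(target)
    for other in others:
        diff = [a - b for a, b in zip(target, other)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            continue  # identical features contribute nothing (sketch choice)
        total = [t + d / norm for t, d in zip(total, diff)]
    n = len(others)
    return [t / n for t in total] if n else total


def location_difference(target_box, other_box):
    """delta l_ij: 5-d offset of box j, scaled by the size of target box i.

    Boxes are (x_tl, y_tl, x_br, y_br); offsets are taken as j minus i,
    which is an assumption of this sketch.
    """
    xi1, yi1, xi2, yi2 = target_box
    xj1, yj1, xj2, yj2 = other_box
    wi, hi = xi2 - xi1, yi2 - yi1
    wj, hj = xj2 - xj1, yj2 - yj1
    return [
        (xj1 - xi1) / wi,
        (yj1 - yi1) / hi,
        (xj2 - xi2) / wi,
        (yj2 - yi2) / hi,
        (wj * hj) / (wi * hi),
    ]


dv = appearance_difference([1.0, 0.0], [[0.0, 0.0], [1.0, -1.0]])
dl = location_difference((0, 0, 10, 20), (5, 10, 10, 20))
```

The full $\delta l_i$ would then be the concatenation of `location_difference` over the compared objects.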
The final visual representation for the target object
is then a concatenation of the above features followed
by a fully-connected layer fusing them together,
$r_i = W_m [o_i, g_i, l_i, \delta v_i, \delta l_i] + b_m$.
This joint feature is then fed
into the LSTM for referring expression generation. During
training we minimize the negative log-likelihood:
$$
\begin{aligned}
L^s_1(\theta) &= -\sum_i \log P(r_i | o_i; \theta) \\
&= -\sum_i \sum_t \log P(r_i^t | r_i^{t-1}, \ldots, r_i^1, o_i; \theta)
\end{aligned}
\tag{1}
$$
Note that the speaker can be modeled using any form of
CNN-LSTM structure.
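To make Eq. 1 concrete, the sketch below computes the negative log-likelihood from per-step word distributions. The names `sequence_nll` and `generation_loss` are hypothetical, and the step distributions are assumed to come from some speaker LSTM that is not shown here.

```python
import math


def sequence_nll(step_probs, token_ids):
    """Negative log-likelihood of one expression.

    step_probs[t] is an assumed softmax distribution over the vocabulary at
    step t, already conditioned on the object and the previous ground-truth
    words; token_ids[t] is the ground-truth word index at step t.
    """
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, token_ids))


def generation_loss(batch):
    """Sum the per-expression losses over (step_probs, token_ids) pairs."""
    return sum(sequence_nll(probs, toks) for probs, toks in batch)


# One toy expression of two words over a two-word vocabulary.
example = [([[0.5, 0.5], [0.25, 0.75]], [0, 1])]
loss = generation_loss(example)
```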
In [19], Mao et al. proposed to add a Maximum Mutual
Information (MMI) constraint encouraging the generated
expression to describe the target object better than the other objects
within the image (i.e., a ranking loss on objects). We
generalize this idea to incorporate two triplet hinge losses
composed of a positive match and two negative matches. Given
a positive match $(r_i, o_i)$, we sample the contrastive pair
$(r_j, o_i)$, where $r_j$ is the expression describing some other
object, and the pair $(r_i, o_k)$, where $o_k$ is some other object in the
same image. We then optimize the following max-margin
loss:
$$
\begin{aligned}
L^s_2(\theta) = \sum_i \big[ &\lambda^s_1 \max(0, M + \log P(r_i | o_k) - \log P(r_i | o_i)) \\
+ &\lambda^s_2 \max(0, M + \log P(r_j | o_i) - \log P(r_i | o_i)) \big]
\end{aligned}
\tag{2}
$$
The first term is from [19], while the second term encourages
the target object to be better described by the true
expression than by expressions describing other objects
in the image (i.e., a ranking loss on expressions).
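The per-triplet term of Eq. 2 can be sketched as below. The function name is hypothetical and the margin value is an assumption (the text does not state $M$ here); the default weights follow the values reported in Sec. 4.1.

```python
def speaker_ranking_loss(logp_pos, logp_wrong_object, logp_wrong_expr,
                         margin=1.0, lambda_s1=1.0, lambda_s2=0.1):
    """Two hinge terms for one triplet.

    logp_pos          = log P(r_i | o_i)  (true pair)
    logp_wrong_object = log P(r_i | o_k)  (same expression, other object)
    logp_wrong_expr   = log P(r_j | o_i)  (other expression, same object)
    The margin default is an assumption of this sketch.
    """
    object_term = max(0.0, margin + logp_wrong_object - logp_pos)
    expr_term = max(0.0, margin + logp_wrong_expr - logp_pos)
    return lambda_s1 * object_term + lambda_s2 * expr_term


# The wrong object is only 0.5 nats worse than the true pair, so the first
# hinge is active; the wrong expression is far worse, so the second is zero.
loss = speaker_ranking_loss(-2.0, -2.5, -6.0)
```

Each hinge is zero once the true pair beats the contrastive pair by at least the margin, so well-separated triplets stop contributing gradient.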
3.2. Listener
We use a joint-embedding model to mimic the listener’s
behaviour. The purpose of this embedding model is to
encode the visual information from the target object and
semantic information from the referring expression into a
joint space in which vectors that are visually or semantically
related lie closer together. For the referring
expression comprehension task, given a referring expression
representation, the listener embeds it into the joint
space, then selects the closest object in the embedding space
as the predicted target object.
As illustrated in Fig. 1, for our listener joint-embedding
model (outlined by a red dashed line), we use an LSTM to
encode the input referring expression and the same visual
representation as the speaker to encode the target object (thus
connecting the speaker to the listener). We then add two
MLPs (multi-layer perceptrons) and two L2 normalization
layers, one of each following each view (the object and the expression).
Each MLP is composed of two fully connected layers with
ReLU nonlinearities between them, serving to transform the
object view and the expression view into a common
embedding space. The inner product of the two normalized
representations is computed as their similarity score $S(r, o)$ in
the space. As a listener, we enforce the similarity on target
object and referring expression pairs by applying a hinge
loss over triplets, which consist of a positive match and two
negative matches:
$$
\begin{aligned}
L^l(\theta) = \sum_i \big[ &\lambda^l_1 \max(0, M + S(r_i, o_k) - S(r_i, o_i)) \\
+ &\lambda^l_2 \max(0, M + S(r_j, o_i) - S(r_i, o_i)) \big]
\end{aligned}
\tag{3}
$$
where the negative matches are randomly chosen from the
other objects and expressions in the same image.
Note that the listener model is not limited to this
particular triplet-based model. For example, [23] computes a
similarity score for every object given the referring
expression and minimizes the cross entropy of the softmax
given the known target object, which could also be applied here.
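The similarity score and the triplet loss of Eq. 3 can be sketched in a few lines, taking the MLP outputs as given vectors. Function names and the margin value are assumptions for illustration; the MLPs and LSTM encoder themselves are omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def similarity(expr_embedding, obj_embedding):
    """S(r, o): inner product of the two L2-normalized embeddings."""
    a, b = l2_normalize(expr_embedding), l2_normalize(obj_embedding)
    return sum(x * y for x, y in zip(a, b))


def listener_loss(pos_expr, pos_obj, neg_obj, neg_expr,
                  margin=0.5, lambda_l1=1.0, lambda_l2=1.0):
    """Hinge loss for one triplet; the margin default is an assumed value."""
    s_pos = similarity(pos_expr, pos_obj)
    object_term = max(0.0, margin + similarity(pos_expr, neg_obj) - s_pos)
    expr_term = max(0.0, margin + similarity(neg_expr, pos_obj) - s_pos)
    return lambda_l1 * object_term + lambda_l2 * expr_term


loss = listener_loss(pos_expr=[1.0, 0.0], pos_obj=[2.0, 0.0],
                     neg_obj=[0.0, 3.0], neg_expr=[1.0, 1.0])
```

Because both views are L2-normalized, $S(r, o)$ is a cosine similarity in [-1, 1], which keeps the margin on a fixed scale.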
3.3. Reinforcer
Besides using the ground-truth pairs of target object and
referring expression for training the speaker, we also use
reinforcement learning to guide the speaker toward gener-
ating less ambiguous expressions (expressions that apply to
the target object but not to other objects). This reinforcer
module is composed of a discriminative reward function
and performs a non-differentiable policy gradient update to
the speaker.
Specifically, given the softmax output of the speaker’s
LSTM, we sample words according to the categorical
distribution at each time step, resulting in a complete
expression after sampling the <END> token. This sampling
operation is non-differentiable as we do not know whether an
expression is ambiguous or not until we feed it into a
reward function. Therefore, we use policy gradient
reinforcement learning to update the speaker’s parameters. Here, the
goal is to maximize the expectation of the reward $F(w_{1:T})$ under
the distribution $p(w_{1:T}; \theta)$ parameterized by the speaker,
i.e., $J = E_{p(w_{1:T})}[F]$. According to the policy gradient
algorithm [28], we have
$$
\nabla_\theta J = E_{p(w_{1:T})} \left[ F(w_{1:T}) \nabla_\theta \log p(w_{1:T}; \theta) \right], \tag{4}
$$
where $\log p(w_t)$ is defined by the softmax output. We then
use this gradient to update our speaker model during
training.
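The estimator in Eq. 4 can be illustrated on a toy problem. The sketch below assumes a single-step "expression" (one word drawn from a softmax over logits), so that $\nabla \log p$ has the closed form one-hot minus probabilities; the function names and the toy reward are hypothetical, and a real speaker would sum such terms over all time steps and backpropagate through the LSTM.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def score_function_term(logits, action, reward):
    """reward * d log p(action) / d logits for a softmax policy."""
    probs = softmax(logits)
    return [reward * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)]


def estimate_policy_gradient(logits, reward_fn, num_samples, seed=0):
    """Monte Carlo estimate of the gradient of E[F] w.r.t. the logits."""
    rng = random.Random(seed)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        action = rng.choices(range(len(logits)), weights=probs)[0]
        term = score_function_term(logits, action, reward_fn(action))
        grad = [g + t / num_samples for g, t in zip(grad, term)]
    return grad


# Toy reward: word 0 is treated as unambiguous (reward 1), word 1 is not.
grad = estimate_policy_gradient([0.0, 0.0], lambda a: 1.0 if a == 0 else 0.0, 20000)
```

For this toy case the exact gradient is (0.25, -0.25), so a gradient ascent step raises the logit of the rewarded word, which is the behaviour the reinforcer is meant to induce.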
The only thing left is to choose a reward function that
encourages the speaker to sample less ambiguous
expressions. As illustrated in Fig. 1 (outlined in dashed orange),
the reinforcer module learns a reward function using paired
objects and expressions. We again use the same visual
representation for the target object and use another LSTM to
encode the expression representation. Rather than using two
MLPs to encode each view as in the listener, here we
concatenate the two views and feed them together into an MLP
to learn a 1-d logistic regression score between 0 and 1.
Trained with cross-entropy loss, the reward function
computes a match score between an input object and expression.
We use this score as the reward signal in Eqn. 4 for sampled
expression and target object pairs. After training, the reward
function is fixed to assist our joint speaker-listener system.
3.4. Joint Model
In this subsection, we describe some specifics of how
our three modules (speaker, listener, reinforcer) are
integrated into a joint framework (shown in Fig. 1). For the
listener, we notice that the visual vector in the embedding
space is learned to capture the neighbourhood vectors of
referring expressions, thus making it aware of the listener’s
knowledge. Therefore, we take this MLP embedded
vector as an additional input for the speaker, which encodes
the listener-based information. In Fig. 1, we use
concatenation to jointly encode the standard visual representation of
the target object and this listener-aware representation and then
feed them into the speaker. Besides concatenation, the
element-wise product or compact bilinear pooling can also be
applied [6]. During training, we sample the same triplets for
both the speaker and listener, and share the word embedding
between the speaker and listener to reduce the number of
parameters. For the reinforcer module, we do sentence
sampling using the speaker’s LSTM as shown in the top right of
Fig. 1. Within each mini-batch, the sampled expressions for
the target objects are fed into the reward function to obtain
reward values.
The overall loss function is formulated as a multi-task
learning problem:
$$
\theta = \arg\min L^s_1(\theta) + L^s_2(\theta) + L^l(\theta) - \lambda^r J(\theta), \tag{5}
$$
where $\lambda^r$ is the weight on the reward loss. The weights on the
losses of the speaker and listener are already included in Eqn. 2
and Eqn. 3. We list all hyper-parameter settings in Sec. 4.1.
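As a small illustration of how the terms in Eq. 5 combine, the sketch below adds the three losses and subtracts the weighted expected reward. The function and argument names are hypothetical; in practice each term would be a differentiable quantity (with $J$ handled through the estimator in Eq. 4) rather than a plain float.

```python
def joint_objective(generation_loss, speaker_ranking_loss, listener_loss,
                    expected_reward, lambda_r=1.0):
    """Scalar value of the multi-task objective for one mini-batch.

    The reward enters with a minus sign because the objective is minimized
    while the expected reward J should be maximized.
    """
    return (generation_loss + speaker_ranking_loss + listener_loss
            - lambda_r * expected_reward)


total = joint_objective(2.0, 0.5, 0.3, expected_reward=0.8)
```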
3.5. Comprehension and Generation
For the comprehension task, at test time, we can use
either the speaker or listener to select the target object given
an input expression. Using the listener, we would
embed the input expression into the learned embedding space
and select the closest object as the predicted target. Using
the speaker, we would generate expressions for each object
within the image and then select the object whose generated
expression best matches the input expression. Therefore,
we utilize both modules by ensembling the speaker and
listener predictions together to pick the most probable object
given a referring expression:
$$
\hat{o} = \arg\max_o P(r | o) \, S(o, r)^{\lambda} \tag{6}
$$
Surprisingly, using the speaker alone (setting $\lambda$ to 0)
already achieves state-of-the-art results due to our joint
training. Adding the listener further improves performance by more
than 4% over previous state-of-the-art results.
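The ensemble rule in Eq. 6 can be sketched in log space, where the product becomes $\log P(r|o) + \lambda \log S(o, r)$. The function name is hypothetical, and clamping the similarity to a small positive `eps` before taking the logarithm is an assumption of this sketch, since an inner product of normalized vectors can be non-positive.

```python
import math


def comprehend(log_p_r_given_o, listener_sims, lam=1.0, eps=1e-8):
    """Return the index of the object maximizing P(r|o) * S(o, r)**lam.

    log_p_r_given_o[k] is the speaker's log-likelihood of the input
    expression for candidate object k; listener_sims[k] is the listener's
    similarity for the same object. Similarities are clamped to `eps`
    (a sketch choice) so the logarithm is defined.
    """
    best_idx, best_score = None, -math.inf
    for idx, (logp, sim) in enumerate(zip(log_p_r_given_o, listener_sims)):
        score = logp + lam * math.log(max(sim, eps))
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


log_probs = [-3.0, -3.2, -5.0]
sims = [0.2, 0.9, 0.8]
speaker_only = comprehend(log_probs, sims, lam=0.0)
ensemble = comprehend(log_probs, sims, lam=1.0)
```

In this toy example the speaker alone prefers object 0, while the ensemble switches to object 1 because the listener strongly favours it.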
For the generation task, we first let the speaker
generate multiple expressions per object via beam search. We
then use the listener to rerank these expressions and select
the least ambiguous expression, which is similar to [1]. To
fully utilize the listener’s power in generation, we propose
to consider cross comprehension as well as the diversity of
expressions by minimizing the potential:
$$
\begin{aligned}
E(r) &= \sum_i \theta_i(r_i) + \sum_{i,j} \theta_{i,j}(r_i, r_j) \\
\theta_i(r_i) &= -\log P(r_i | o_i) - \lambda_1 \log S(r_i, o_i) + \lambda_2 \max_{j \neq i} \log S(r_i, o_j) \\
\theta_{i,j}(r_i, r_j) &= \lambda_3 I(r_i = r_j)
\end{aligned}
\tag{7}
$$
The first term and second term in the unary potential
measure how well the target object and generated
expression match using the speaker and listener modules
respectively (also used in [1]). The third term in the unary
potential measures the likelihood that the generated sentence
describes other objects in the same image. The
pairwise potential penalizes the same sentence being generated
for different objects (encouraging diversity in generation).
In this way, the expressions for every object in an image
are jointly generated. Compared with the previous model
that attempted to tie language generation of referring
expressions together [31], the constraints in Eqn. 7 are more
explicit and overall this works better to reduce ambiguity in
the generated expressions.
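A brute-force sketch of minimizing the potential in Eq. 7 over small beams is given below. Everything here is illustrative: the candidate tuple layout, the weights $\lambda_1, \lambda_2, \lambda_3$ (set to 1), the clamp `eps`, and counting each unordered pair of objects once are assumptions, and exhaustive enumeration is used only because it is easy to verify, not because the text prescribes it.

```python
import itertools
import math


def unary(cand, i, lam1, lam2, eps=1e-8):
    """cand = (expression, log P(r|o_i), [S(r, o_0), S(r, o_1), ...])."""
    _, logp, sims = cand
    log_sims = [math.log(max(s, eps)) for s in sims]
    others = [ls for j, ls in enumerate(log_sims) if j != i]
    cross = max(others) if others else 0.0  # best match to any other object
    return -logp - lam1 * log_sims[i] + lam2 * cross


def joint_rerank(candidates, lam1=1.0, lam2=1.0, lam3=1.0):
    """candidates[i] is the beam for object i; returns one expression each."""
    best, best_energy = None, math.inf
    for choice in itertools.product(*candidates):
        energy = sum(unary(c, i, lam1, lam2) for i, c in enumerate(choice))
        for a, b in itertools.combinations(choice, 2):
            if a[0] == b[0]:
                energy += lam3  # penalize identical sentences
        if energy < best_energy:
            best, best_energy = [c[0] for c in choice], energy
    return best


beams = [
    [("red pen", -1.0, [0.6, 0.6]), ("long red pen", -1.5, [0.9, 0.1])],
    [("red pen", -1.0, [0.6, 0.6]), ("short red pen", -1.5, [0.1, 0.9])],
]
chosen = joint_rerank(beams)
```

Although “red pen” has the highest speaker likelihood for both objects, its cross-comprehension term and the duplicate penalty push the joint solution toward the two more specific expressions.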
4. Experiments
4.1. Optimization
We optimize our model using Adam [14] with an initial
learning rate of 0.0004, halved every 2,000 iterations, with
a batch size of 32. The word embedding size and hidden
state size of the LSTM are set to 512. To avoid overfitting,
we apply dropout with a ratio of 0.2 after each linear
transformation in the MLP layers. We also regularize the
word-embedding and output layers of the speaker’s LSTM using
dropout with a ratio of 0.5. For the contrastive pairs, we set
$\lambda^l_1 = 1$ and $\lambda^l_2 = 1$ in the listener (Eqn. 3), and set
$\lambda^s_1 = 1$ and $\lambda^s_2 = 0.1$ in the speaker (Eqn. 2).
The weight on the reward loss is set as $\lambda^r = 1$.
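The learning-rate schedule described above can be written as a one-line function. This sketch assumes a staircase decay (the rate drops at iterations 2,000, 4,000, and so on), which is one reading of “halved every 2,000 iterations”; the function name is hypothetical.

```python
def learning_rate(iteration, base_lr=0.0004, halve_every=2000):
    """Staircase schedule: base_lr halved once per `halve_every` iterations."""
    return base_lr * 0.5 ** (iteration // halve_every)


schedule = [learning_rate(it) for it in (0, 1999, 2000, 4500)]
```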
4.2. Datasets
We perform experiments on three referring expression
datasets: RefCOCO, RefCOCO+ and RefCOCOg
(described in Sec 2). All three datasets are collected on
MSCOCO images [17], but with several differences: 1)
RefCOCO and RefCOCO+ were collected using an interactive
game interface while RefCOCOg was collected in a
non-interactive setting and contains longer expressions, 2)
RefCOCOg contains on average 1.63 objects of the same type
per image, while RefCOCO and RefCOCO+ have 3.9 on
average, 3) RefCOCO+ disallowed absolute location words in
referring expressions. Overall, RefCOCO has 142,210
expressions for 50,000 objects in 19,994 images, RefCOCO+
has 141,565 expressions for 49,856 objects in 19,992
images, and RefCOCOg has 104,560 expressions for 54,822
objects in 26,711 images.
Additionally, each dataset is provided with dataset splits
for evaluation. RefCOCO and RefCOCO+ provide person
vs. object splits for evaluation. Images containing multi-
ple people are in “TestA” while images containing multiple
objects of other categories are in “TestB”. For RefCOCOg,
the authors divide their dataset by randomly partitioning ob-
jects into training and testing splits. Thus the same image
may appear in both splits. As only training and validation
splits have been released for this dataset, we use the
hyper-parameters cross-validated on RefCOCO to train models on
RefCOCOg.