A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

doi:10.1109/CVPR.2017.375

Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg

Department of Computer Science

University of North Carolina at Chapel Hill

{licheng, airsplay, mbansal, tlberg}@cs.unc.edu

Abstract

Referring expressions are natural language construc-

tions used to identify particular objects within a scene. In

this paper, we propose a uniﬁed framework for the tasks of

referring expression comprehension and generation. Our

model is composed of three modules: speaker, listener, and

reinforcer. The speaker generates referring expressions, the

listener comprehends referring expressions, and the rein-

forcer introduces a reward function to guide sampling of

more discriminative expressions. The listener-speaker mod-

ules are trained jointly in an end-to-end learning frame-

work, allowing the modules to be aware of one another

during learning while also beneﬁting from the discrimina-

tive reinforcer’s feedback. We demonstrate that this uniﬁed

framework and training achieves state-of-the-art results for

both comprehension and generation on three referring expression datasets.¹

1. Introduction

People often use referring expressions in their everyday

discourse to unambiguously identify or indicate particular

objects within their physical environment. For example, one

might point out a person in the crowd by referring to them

as “the man in the blue shirt” or you might ask someone

to “pass me the red pen on the table.” In both of these ex-

amples, we have a pragmatic interaction between two peo-

ple (or between a person and an intelligent agent such as

a robot). First, we have a speaker who must generate an

expression given a target object and its surrounding world

context. Second, we have a listener who must interpret and

comprehend the expression and map it to an object in the

environment. Therefore, in this paper we propose an end-

to-end trained listener-speaker framework that models these

behaviors jointly.

In addition to the listener and speaker, we also introduce a new reinforcer module that learns a discriminative reward model to help generate less ambiguous expressions (expressions that apply to the target object but not to other objects in the image). This goal corresponds to the Gricean Maxim [8] of manner, where one tries to be as clear, brief, and orderly as possible while avoiding obscurity and ambiguity. Avoiding ambiguity is important because the generated expression should be easily and uniquely mapped to the target object. For example, if there were two pens on the table, one “long and red” and the other “short and red”, asking for the “red pen” would be ambiguous while asking for the “long pen” would be better. The reinforcer module is incorporated using reinforcement learning, inspired by behavioral psychology, which says that agents operating in an environment should take actions that maximize their expected cumulative reward. In our case, the reward takes the form of a discriminative classifier trained to reward the speaker for generating less ambiguous expressions.

¹ Project and demo page: https://vision.cs.unc.edu/refer.

Within the realm of referring expressions, there are two

tasks that can be computationally modeled, mimicking the

listener and speaker roles. Referring Expression Genera-

tion (speaker) requires an algorithm to generate a referring

expression for a given target object in a visual scene, as in

Fig. 2. Referring Expression Comprehension (listener) requires an algorithm to localize the object/region in an image described by a given referring expression, as in Fig. 3.

The Referring Expression Generation (REG) task has

been studied since the 1970s [29]. Many of the early

works in this space focused on relatively limited datasets,

using synthesized images of objects in artiﬁcial scenes

or limited sets of real-world objects in simplified environments [20, 7, 15]. Recently, the research focus has

shifted to more complex natural image datasets and has ex-

panded to include the Referring Expression Comprehension

task [13, 19, 31] as well as to real-world interactions with robotics [4, 3]. One reason this has become feasible is that several large-scale REG datasets [13, 31, 19] have been collected where deep learning models can be applied.

Recent neural approaches to the referring expression

generation and comprehension tasks can be roughly split

into two types. The first type uses a CNN-LSTM encoder-decoder generative model [25] to generate (decode) sentences given the encoded target object. With careful design of the visual representation of the target object, this model can generate unambiguous expressions [19, 31]. Here, the CNN-LSTM models P(r|o), where r is the referring expression and o is the target object, which can be easily converted to P(o|r) via Bayes’ rule and used to address the comprehension task [10, 19, 31, 21] by selecting the o with

the largest posterior probability. The second type of ap-

proach uses a joint-embedding model that projects both a vi-

sual representation of the target object and a semantic repre-

sentation of the expression into a common space and learns

a distance metric. Generation and comprehension can be

performed by embedding a target object (or expression) into

the embedding space and retrieving the closest expression

(or object) in this space. This type of approach typically

achieves better comprehension performance than the CNN-LSTM model, as in [23, 26], but previously was only applied to the referring expression comprehension task. Recent work [1] has also used both an encoder-decoder model

(speaker) and an embedding model (listener) for referring

expression generation in abstract images, where the ofﬂine

listener reranks the speaker’s output.
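The Bayes' rule conversion used by the first type of approach can be written out as follows. The uniform prior over candidate objects is an assumption made here for illustration; the text does not state which prior is used.

```latex
P(o \mid r) = \frac{P(r \mid o)\, P(o)}{\sum_{o'} P(r \mid o')\, P(o')}
\quad\Longrightarrow\quad
\hat{o} = \arg\max_{o} P(o \mid r) = \arg\max_{o} P(r \mid o)
\quad \text{if } P(o) \text{ is uniform over the candidate objects.}
```

Under that assumption the denominator and the prior are constant in o, so ranking objects by the speaker likelihood P(r|o) gives the same answer as ranking by the posterior.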

In this paper, we propose a uniﬁed model that jointly

learns both the CNN-LSTM speaker and embedding-based

listener models, for both the generation and comprehension

tasks. Additionally, we add a discriminative reward-based

reinforcer to guide the sampling of more discriminative ex-

pressions and further improve our ﬁnal system. Instead of

working independently, we let the speaker, listener, and re-

inforcer interact with each other, resulting in improved per-

formance on both generation and comprehension tasks. Re-

sults evaluated on three standard, large-scale datasets verify

that our proposed listener-speaker-reinforcer model significantly outperforms the state-of-the-art on both the comprehension task (Tables 1 and 2) and the generation task (evaluated using human judgements in Table 4, and automatic metrics in Table 3).

2. Related work

Recent years have witnessed a rise in multimodal re-

search related to vision and language. Given the individ-

ual success in each area and the need for models with more

advanced cognition capabilities, several tasks have emerged

as evaluation applications, including image captioning, vi-

sual question answering, and referring expression genera-

tion/comprehension.

Image Captioning: The aim of image captioning is to gen-

erate a sentence describing the general content of an im-

age. Most recent approaches use deep learning to address

this problem. Perhaps the most common architecture is a

CNN-LSTM model [25], which generates a sentence conditioned on visual information from the image. One paper related to our work is gLSTM [11], which uses CCA

semantics to guide the caption generation. A further step

beyond image captioning is to locate the regions being described in captions [27, 22, 26]. The Visual Genome [16] collected captions for dense regions in an image that have been used for dense-captioning tasks [12]. There also has

been a movement toward more focused tasks, such as vi-

sual question answering [2, 30], and referring expression

generation and comprehension which involve speciﬁc re-

gions/objects within an image (discussed below).

Referring Expression Datasets: REG has been studied

for many years [29, 15, 20] in linguistics and natural lan-

guage processing, but mainly focused on small or artiﬁcial

datasets. In 2014, Kazemzadeh et al. [13] introduced the first large-scale dataset, RefCLEF, using 20,000 real-world natural images [9]. This dataset was collected in a two-player

game, where the ﬁrst player writes a referring expression

given an indicated target object. The second player is shown

only the image and expression and has to click on the cor-

rect object described by the expression. If the click lies

within the target object region, both sides get points and

their roles switch. Using the same game interface, the authors further collected the RefCOCO and RefCOCO+ datasets on MSCOCO images [31]. The RefCOCO and RefCOCO+

datasets each contain 50,000 referred objects with 3 refer-

ring expressions on average. The main difference between

RefCOCO and RefCOCO+ is that in RefCOCO+, players

were forbidden from using absolute location words (e.g., “left dog”), thereby restricting the referring expressions to purely appearance-based descriptions. In addition, Mao et al. [19] also collected a referring expression dataset, RefCOCOg, using MSCOCO images, but in a non-interactive framework. These expressions are more similar to the MSCOCO captions in that they are longer and more complex, as there was no time constraint in the non-interactive data collection

setting. This dataset has 96,654 objects with 1.3 expressions

per object on average.

Referring Expression Comprehension and Generation:

Referring expressions are associated with two tasks, com-

prehension and generation. The comprehension task re-

quires a system to select the region being described by

a given referring expression. To address this problem,

[19, 31, 21, 10] model P(r|o) and look for the object o maximizing the probability. Other work models P(o, r) directly using an embedding model [23, 26], which

learns to minimize the distance between paired object and

sentence in the embedding space. The generation task asks

a system to compose an expression for a speciﬁed object

within an image. While some previous work used rule-based approaches to generate expressions with a fixed grammar pattern [20, 5, 13], recent work has followed the CNN-LSTM structure to generate expressions [19, 31].

[Figure 1 diagram; labels recovered from the image: Speaker, Listener, Reinforcer, LSTM, Concat, MLP, L2-Normalization, Sampling, Generation Loss, Embedding Loss, Reward Loss; example expression: “Man in the middle wearing yellow”.]

Figure 1: Framework: The Speaker is a CNN-LSTM model, which generates a referring expression for the target object. The

Listener is a joint-embedding model learned to minimize the distance between paired object and expression representations.

In addition, a Reinforcer module helps improve the speaker by sampling more discriminative expressions for training.

3. Model

Our model is composed of three modules: speaker (Sec. 3.1), listener (Sec. 3.2), and reinforcer (Sec. 3.3). During training, the speaker and listener are trained jointly so that they can benefit from each other and from the reinforcer. As the reward function for the reinforcer is not differentiable, it is incorporated using a policy gradient reinforcement learning algorithm.

3.1. Speaker

For our speaker module, we follow the previous state-of-the-art [19, 31] and use a CNN-LSTM framework. Here, a pre-trained CNN model is used to define a visual representation for the target object and other visual context. Then, a Long Short-Term Memory (LSTM) network is used to generate the most likely expression given the visual representation.

Because of the improved quantitative performance over [19], we use the visual comparison model of [31] as our speaker (to encode the target object). Here, the visual representation includes the target object, context, location/size features, and two visual comparison features. Specifically, the target object representation $o_i$ is modeled as the fc7 features from a pre-trained VGG network [24]. Global context, $g_i$, is modeled as features extracted from the VGG-fc7 layer for the entire image. The location/size representation, $l_i$, for the target object is modeled as a 5-dimensional vector, encoding the x and y locations of the top left and bottom right corners of the target object bounding box, as well as the bounding box size with respect to the image, i.e., $l_i = \left[\frac{x_{tl}}{W}, \frac{y_{tl}}{H}, \frac{x_{br}}{W}, \frac{y_{br}}{H}, \frac{w \cdot h}{W \cdot H}\right]$.
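As a rough illustration of the 5-d location/size vector described above, the following Python sketch computes it from pixel coordinates. The function and argument names are hypothetical, and it assumes that w and h are the box width and height in pixels (x_br - x_tl and y_br - y_tl), which the text does not spell out.

```python
def location_feature(x_tl, y_tl, x_br, y_br, img_w, img_h):
    """Return [x_tl/W, y_tl/H, x_br/W, y_br/H, (w*h)/(W*H)] for one bounding box."""
    # Assumption: box width/height are the corner differences in pixels.
    w = x_br - x_tl
    h = y_br - y_tl
    return [
        x_tl / img_w,
        y_tl / img_h,
        x_br / img_w,
        y_br / img_h,
        (w * h) / (img_w * img_h),
    ]


# Example: a 200x200 box inside a 400x500 image.
feat = location_feature(100, 50, 300, 250, 400, 500)
```

All five entries are normalized by the image size, so the feature is comparable across images of different resolution.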

As referring expressions often relate an object to other objects of the same type within the image (“the red ball” vs “the blue ball” or “the larger elephant”), comparisons tend to be quite important for differentiation. The comparison features are composed of two parts: a) appearance similarity $\delta v_i = \frac{1}{n} \sum_{j \neq i} \frac{o_i - o_j}{\| o_i - o_j \|}$, where n is the number of objects chosen for comparisons, b) location and size similarity $\delta l_i$, concatenating the 5-d difference on each compared object $\delta l_{ij} = \left[\frac{[\Delta x_{tl}]_{ij}}{w_i}, \frac{[\Delta y_{tl}]_{ij}}{h_i}, \frac{[\Delta x_{br}]_{ij}}{w_i}, \frac{[\Delta y_{br}]_{ij}}{h_i}, \frac{w_j h_j}{w_i h_i}\right]$.
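The two comparison features above can be sketched in plain Python as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the sign convention of the coordinate differences (compared object minus target) is an assumption, and identical feature vectors are skipped to avoid dividing by zero.

```python
import math


def appearance_difference(o_i, others):
    """Mean of the unit-normalized difference vectors (o_i - o_j) over compared objects."""
    n = len(others)
    acc = [0.0] * len(o_i)
    for o_j in others:
        diff = [a - b for a, b in zip(o_i, o_j)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            continue  # assumption: identical features contribute a zero vector
        for k, d in enumerate(diff):
            acc[k] += d / norm
    return [a / n for a in acc] if n else acc


def location_difference(box_i, box_j):
    """5-d offset of box_j relative to box_i; boxes are (x_tl, y_tl, x_br, y_br)."""
    w_i, h_i = box_i[2] - box_i[0], box_i[3] - box_i[1]
    w_j, h_j = box_j[2] - box_j[0], box_j[3] - box_j[1]
    return [
        (box_j[0] - box_i[0]) / w_i,  # assumed sign convention: j minus i
        (box_j[1] - box_i[1]) / h_i,
        (box_j[2] - box_i[2]) / w_i,
        (box_j[3] - box_i[3]) / h_i,
        (w_j * h_j) / (w_i * h_i),
    ]
```

Concatenating `location_difference` over the n compared objects would give the full location and size comparison vector.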

The final visual representation for the target object is then a concatenation of the above features followed by a fully-connected layer fusing them together, $r_i = W_m [o_i, g_i, l_i, \delta v_i, \delta l_i] + b_m$. This joint feature is then fed into the LSTM for referring expression generation. During training we minimize the negative log-likelihood:

$$L^s_1(\theta) = -\sum_i \log P(r_i \mid o_i; \theta) = -\sum_i \sum_t \log P(r_i^t \mid r_i^{t-1}, \ldots, r_i^1, o_i; \theta) \tag{1}$$

Note that the speaker can be modeled using any form of CNN-LSTM structure.
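For a single expression, the inner sum of Eqn. 1 is just the sum of negative log-probabilities of the ground-truth words. The sketch below assumes the per-step softmax distributions are already available as lists; the function name and data layout are illustrative only.

```python
import math


def expression_nll(step_probs, word_ids):
    """Negative log-likelihood of one expression.

    step_probs[t] is the softmax distribution over the vocabulary at step t
    (already conditioned on the object and the previous words);
    word_ids[t] is the index of the ground-truth word at step t.
    """
    return -sum(math.log(probs[w]) for probs, w in zip(step_probs, word_ids))


# Two-step toy expression over a two-word vocabulary.
nll = expression_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

Summing this quantity over all training pairs gives the outer sum over i.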

In [19], Mao et al. proposed to add a Maximum Mutual Information (MMI) constraint encouraging the generated expression to describe the target object better than the other objects within the image (i.e., a ranking loss on objects). We generalize this idea to incorporate two triplet hinge losses composed of a positive match and two negative matches. Given a positive match $(r_i, o_i)$, we sample the contrastive pair $(r_j, o_i)$, where $r_j$ is the expression describing some other object, and the pair $(r_i, o_k)$, where $o_k$ is some other object in the same image; we then optimize the following max-margin loss:

$$L^s_2(\theta) = \sum_i \big[ \lambda^s_1 \max(0, M + \log P(r_i \mid o_k) - \log P(r_i \mid o_i)) + \lambda^s_2 \max(0, M + \log P(r_j \mid o_i) - \log P(r_i \mid o_i)) \big] \tag{2}$$

The first term is from [19], while the second term encourages the target object to be better described by the true expression than by expressions describing other objects in the image (i.e., a ranking loss on expressions).
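For one triplet, the two hinge terms of the max-margin loss can be sketched as below. The function name is hypothetical and the default margin is a placeholder, since the value of M is not given in this section; the default weights follow the speaker settings reported in Sec. 4.1.

```python
def speaker_hinge_loss(logp_pos, logp_wrong_obj, logp_wrong_expr,
                       margin=0.1, lam1=1.0, lam2=0.1):
    """Max-margin speaker loss for a single triplet.

    logp_pos        : log P(r_i | o_i), the matched pair
    logp_wrong_obj  : log P(r_i | o_k), same expression, other object
    logp_wrong_expr : log P(r_j | o_i), other expression, same object
    margin          : M (placeholder value, not taken from the paper)
    """
    rank_objects = max(0.0, margin + logp_wrong_obj - logp_pos)
    rank_expressions = max(0.0, margin + logp_wrong_expr - logp_pos)
    return lam1 * rank_objects + lam2 * rank_expressions


# The object term is satisfied here; only the expression term is violated.
loss = speaker_hinge_loss(-2.0, -5.0, -1.5, margin=1.0)
```

A term contributes nothing once the matched pair beats the mismatched pair by at least the margin.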

3.2. Listener

We use a joint-embedding model to mimic the listener’s

behaviour. The purpose of this embedding model is to

encode the visual information from the target object and

semantic information from the referring expression into a

joint space that embeds vectors that are visually or semanti-

cally related closer together in the space. Here for the refer-

ring expression comprehension task, given a referring ex-

pression representation, the listener embeds it into the joint

space, then selects the closest object in the embedding space

for the predicted target object.

As illustrated in Fig. 1, for our listener joint-embedding model (outlined by a red dashed line), we use an LSTM to encode the input referring expression and the same visual representation as the speaker to encode the target object (thus connecting the speaker to the listener). We then add two MLPs (multi-layer perceptrons) and two L2 normalization layers, one following each view: the object and the expression.

Each MLP is composed of two fully connected layers with ReLU nonlinearities between them, serving to transform the object view and the expression view into a common embedding space. The inner product of the two normalized representations is computed as their similarity score $S(r, o)$ in the space. As a listener, we enforce high similarity for matching target object and referring expression pairs by applying a hinge loss over triplets, which consist of a positive match and two negative matches:

$$L^l(\theta) = \sum_i \big[ \lambda^l_1 \max(0, M + S(r_i, o_k) - S(r_i, o_i)) + \lambda^l_2 \max(0, M + S(r_j, o_i) - S(r_i, o_i)) \big] \tag{3}$$

where the negative matches are randomly chosen from the other objects and expressions in the same image.
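The similarity score and triplet loss of the listener can be sketched as follows, assuming the two MLP outputs are already available as plain vectors. Function names are hypothetical and the default margin is a placeholder, as M is not specified here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def similarity(expr_emb, obj_emb):
    """S(r, o): inner product of the two L2-normalized embeddings."""
    a, b = l2_normalize(expr_emb), l2_normalize(obj_emb)
    return sum(x * y for x, y in zip(a, b))


def listener_hinge_loss(r_i, o_i, r_j, o_k, margin=0.1, lam1=1.0, lam2=1.0):
    """Triplet hinge loss for one positive pair (r_i, o_i) and two negatives."""
    s_pos = similarity(r_i, o_i)
    wrong_object = max(0.0, margin + similarity(r_i, o_k) - s_pos)
    wrong_expression = max(0.0, margin + similarity(r_j, o_i) - s_pos)
    return lam1 * wrong_object + lam2 * wrong_expression
```

Because both views are normalized, S(r, o) is a cosine similarity in [-1, 1], so the margin is expressed on that scale.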

Note that the listener model is not limited to this particular triplet-based model. For example, [23] computes a similarity score for every object given the referring expression and minimizes the softmax cross-entropy with respect to the known target object, which could also be applied here.

3.3. Reinforcer

Besides using the ground-truth pairs of target object and

referring expression for training the speaker, we also use

reinforcement learning to guide the speaker toward gener-

ating less ambiguous expressions (expressions that apply to

the target object but not to other objects). This reinforcer

module is composed of a discriminative reward function

and performs a non-differentiable policy gradient update to

the speaker.

Specifically, given the softmax output of the speaker’s LSTM, we sample words according to the categorical distribution at each time step, resulting in a complete expression after sampling the <END> token. This sampling operation is non-differentiable as we do not know whether an expression is ambiguous or not until we feed it into a reward function. Therefore, we use policy gradient reinforcement learning to update the speaker’s parameters. Here, the goal is to maximize the expected reward $F(w_{1:T})$ under the distribution $p(w_{1:T}; \theta)$ parameterized by the speaker, i.e., $J = E_{p(w_{1:T})}[F]$. According to the policy gradient algorithm [28], we have

$$\nabla_\theta J = E_{p(w_{1:T})} \big[ F(w_{1:T}) \nabla_\theta \log p(w_{1:T}; \theta) \big], \tag{4}$$

where $\log p(w_t)$ is defined by the softmax output. We then use this gradient to update our speaker model during training.
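A single-sample estimate of the policy gradient in Eqn. 4 can be sketched as below. This is a simplified illustration under stated assumptions: the per-step logits are given up front rather than produced by an LSTM that conditions on the sampled words, the reward is a plain number standing in for the learned reward function, and all names are hypothetical. The gradient with respect to the logits uses the standard identity that the derivative of log-softmax at the sampled word is the one-hot vector minus the probabilities.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_expression(step_logits, end_id, rng):
    """Sample one word id per step until <END>; return ids, log-prob, and step distributions."""
    words, logp, step_probs = [], 0.0, []
    for logits in step_logits:
        probs = softmax(logits)
        w = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        words.append(w)
        step_probs.append(probs)
        logp += math.log(probs[w])
        if w == end_id:
            break
    return words, logp, step_probs


def reinforce_logit_grads(step_probs, words, reward):
    """Estimate of dJ/dlogits at each step: F * (onehot(w_t) - p_t)."""
    return [
        [reward * ((1.0 if k == w else 0.0) - p) for k, p in enumerate(probs)]
        for probs, w in zip(step_probs, words)
    ]
```

With a positive reward, the update raises the logit of each sampled word and lowers the others, which is how higher-reward expressions become more likely.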

The only thing left is to choose a reward function that

encourages the speaker to sample less ambiguous expressions. As illustrated in Fig. 1 (outlined in dashed orange),

the reinforcer module learns a reward function using paired

objects and expressions. We again use the same visual rep-

resentation for the target object and use another LSTM to

encode the expression representation. Rather than using two

MLPs to encode each view as in the listener, here we concatenate the two views and feed them together into an MLP to learn a 1-d logistic regression score between 0 and 1. Trained with cross-entropy loss, the reward function computes a match score between an input object and expression. We use this score as the reward signal in Eqn. 4 for sampled

expression and target object pairs. After training, the reward

function is ﬁxed to assist our joint speaker-listener system.

3.4. Joint Model

In this subsection, we describe some speciﬁcs of how

our three modules (speaker, listener, reinforcer) are integrated into a joint framework (shown in Fig. 1). For the

listener, we notice that the visual vector in the embedding

space is learned to capture the neighbourhood vectors of re-

ferring expressions, thus making it aware of the listener’s

knowledge. Therefore, we take this MLP embedded vec-

tor as an additional input for the speaker, which encodes

the listener-based information. In Fig. 1, we use concatenation to jointly encode the standard visual representation of the target object and this listener-aware representation and then feed them into the speaker. Besides concatenation, the element-wise product or compact bilinear pooling can also be applied [6]. During training, we sample the same triplets for both the speaker and listener, and share the word embedding between the speaker and listener to reduce the number of parameters. For the reinforcer module, we do sentence sampling using the speaker’s LSTM as shown in the top right of Fig. 1. Within each mini-batch, the sampled expressions for

the target objects are fed into the reward function to obtain

reward values.

The overall loss function is formulated as a multi-task learning problem:

$$\theta = \arg\min_\theta \; L^s_1(\theta) + L^s_2(\theta) + L^l(\theta) - \lambda^r J(\theta), \tag{5}$$

where $\lambda^r$ is the weight on the reward loss. The weights on the speaker and listener losses are already included in Eqn. 2 and Eqn. 3. We list all hyper-parameter settings in Sec. 4.1.

3.5. Comprehension and Generation

For the comprehension task, at test time, we can use ei-

ther the speaker or listener to select the target object given

an input expression. Using the listener, we would em-

bed the input expression into the learned embedding space

and select the closest object as the predicted target. Using

the speaker, we would generate expressions for each object

within the image and then select the object whose generated

expression best matches the input expression. Therefore, we utilize both modules by ensembling the speaker and listener predictions together to pick the most probable object given a referring expression:

$$\hat{o} = \arg\max_o P(r \mid o) \, S(o, r)^{\lambda} \tag{6}$$

Surprisingly, using the speaker alone (setting λ to 0) already achieves state-of-the-art results due to our joint training. Adding the listener further improves performance by more than 4% over previous state-of-the-art results.
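The ensemble rule in Eqn. 6 is easiest to evaluate in the log domain. The sketch below assumes the speaker log-likelihoods and listener similarities have already been computed for every candidate object; the function name is hypothetical, and clamping the similarity to a small positive value before taking the log is an assumption added here, since an inner product of normalized vectors can be zero or negative.

```python
import math


def comprehend(log_p_r_given_o, sims, lam=1.0, eps=1e-8):
    """Return the index of the object maximizing P(r|o) * S(o, r)**lam.

    log_p_r_given_o[k] : speaker log-likelihood log P(r | o_k)
    sims[k]            : listener similarity S(o_k, r)
    eps                : clamp for non-positive similarities (assumption)
    """
    scores = [
        lp + lam * math.log(max(s, eps))
        for lp, s in zip(log_p_r_given_o, sims)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# The listener overrides the speaker's slight preference for object 1.
picked = comprehend([-3.0, -2.5], [0.9, 0.2], lam=1.0)
speaker_only = comprehend([-3.0, -2.5], [0.9, 0.2], lam=0.0)
```

Setting `lam=0.0` reduces the rule to the speaker alone, matching the speaker-only setting mentioned in the text.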

For the generation task, we ﬁrst let the speaker gener-

ate multiple expressions per object via beam search. We

then use the listener to rerank these expressions and select

the least ambiguous expression, which is similar to [1]. To

fully utilize the listener’s power in generation, we propose

to consider cross comprehension as well as the diversity of

expressions by minimizing the potential:

$$\begin{aligned}
E(r) &= \sum_i \theta_i(r_i) + \sum_{i,j} \theta_{i,j}(r_i, r_j) \\
\theta_i(r_i) &= -\log P(r_i \mid o_i) - \lambda_1 \log S(r_i, o_i) + \lambda_2 \max_{j \neq i} \log S(r_i, o_j) \\
\theta_{i,j}(r_i, r_j) &= \lambda_3 \, I(r_i = r_j)
\end{aligned} \tag{7}$$

The ﬁrst term and second term in the unary potential

measure how well the target object and generated expression match using the speaker and listener modules respectively (also used in [1]). The third term in the unary potential measures the likelihood of the generated sentence describing other objects in the same image. The pairwise potential penalizes the same sentence being generated

for different objects (encouraging diversity in generation).

In this way, the expressions for every object in an image

are jointly generated. Compared with the previous model

that attempted to tie language generation of referring expressions together [31], the constraints in Eqn. 7 are more

explicit and overall this works better to reduce ambiguity in

the generated expressions.
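To make the potential in Eqn. 7 concrete, the sketch below picks one candidate expression per object by exhaustively scoring every combination. The exhaustive search, the candidate data layout, the default weights, and the similarity clamp are all assumptions for illustration; the text does not say how the minimization is carried out or which weight values are used. Each unordered pair of objects is counted once in the pairwise term.

```python
import itertools
import math


def unary_potential(i, cand, lam1, lam2, eps=1e-8):
    """theta_i for candidate `cand` of object i.

    cand["logp"] : speaker score log P(r | o_i)
    cand["sims"] : listener scores S(r, o_j) for every object j in the image
    """
    log_sims = [math.log(max(s, eps)) for s in cand["sims"]]
    others = [ls for j, ls in enumerate(log_sims) if j != i]
    confusion = max(others) if others else 0.0
    return -cand["logp"] - lam1 * log_sims[i] + lam2 * confusion


def generate_jointly(candidates, lam1=1.0, lam2=1.0, lam3=1.0):
    """candidates[i] is the beam of candidate dicts for object i."""
    best_texts, best_energy = None, float("inf")
    index_ranges = [range(len(c)) for c in candidates]
    for choice in itertools.product(*index_ranges):
        picked = [candidates[i][c] for i, c in enumerate(choice)]
        energy = sum(unary_potential(i, p, lam1, lam2) for i, p in enumerate(picked))
        # Pairwise term: penalize identical sentences for different objects.
        energy += sum(lam3 for a, b in itertools.combinations(picked, 2)
                      if a["text"] == b["text"])
        if energy < best_energy:
            best_texts, best_energy = [p["text"] for p in picked], energy
    return best_texts, best_energy
```

On the two-pen example from the introduction, a candidate such as "red pen" scores well for both objects, so the cross-comprehension and diversity terms push the selection toward the more specific candidates. Exhaustive enumeration grows exponentially with the number of objects, so it is only practical for small beams and few objects.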

4. Experiments

4.1. Optimization

We optimize our model using Adam [14] with an initial

learning rate of 0.0004, halved every 2,000 iterations, with

a batch size of 32. The word embedding size and hidden

state size of the LSTM are set to 512. To avoid overﬁtting,

we apply dropout with a ratio of 0.2 after each linear trans-

formation in the MLP layers. We also regularize the word-

embedding and output layers of the speaker’s LSTM using

dropout with a ratio of 0.5. For the contrastive pairs, we set $\lambda^l_1 = 1$ and $\lambda^l_2 = 1$ in the listener (Eqn. 3), and set $\lambda^s_1 = 1$ and $\lambda^s_2 = 0.1$ in the speaker (Eqn. 2). The weight on the reward loss is set as $\lambda^r = 1$.
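The learning-rate schedule described above can be written as a small helper. This is a sketch that assumes step-wise halving at exact multiples of 2,000 iterations; the function name and the step-wise interpretation are assumptions.

```python
def learning_rate(iteration, base_lr=4e-4, halve_every=2000):
    """Initial rate of 0.0004, halved after every `halve_every` iterations."""
    return base_lr * 0.5 ** (iteration // halve_every)


# Rates at a few points in training.
schedule = [learning_rate(it) for it in (0, 1999, 2000, 4500)]
```

Under this reading the rate stays constant within each 2,000-iteration window and drops by a factor of two at each boundary.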

4.2. Datasets

We perform experiments on three referring expression

datasets: RefCOCO, RefCOCO+ and RefCOCOg (described in Sec. 2). All three datasets are collected on MSCOCO images [17], but with several differences: 1) RefCOCO and RefCOCO+ were collected using an interactive game interface while RefCOCOg was collected in a non-interactive setting and contains longer expressions, 2) RefCOCOg contains on average 1.63 objects of the same type per image, while RefCOCO and RefCOCO+ have 3.9 on average, 3) RefCOCO+ disallowed absolute location words in

referring expressions. Overall, RefCOCO has 142,210 ex-

pressions for 50,000 objects in 19,994 images, RefCOCO+

has 141,565 expressions for 49,856 objects in 19,992 im-

ages, and RefCOCOg has 104,560 expressions for 54,822

objects in 26,711 images.

Additionally, each dataset is provided with dataset splits

for evaluation. RefCOCO and RefCOCO+ provide person

vs. object splits for evaluation. Images containing multi-

ple people are in “TestA” while images containing multiple

objects of other categories are in “TestB”. For RefCOCOg,

the authors divide their dataset by randomly partitioning ob-

jects into training and testing splits. Thus the same image

may appear in both splits. As only training and validation

splits have been released for this dataset, we use the hyper-parameters cross-validated on RefCOCO to train models on

RefCOCOg.
