Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning

doi:10.1109/CVPR.2017.345


Jiasen Lu²*†, Caiming Xiong¹†, Devi Parikh³, Richard Socher¹

¹ Salesforce Research, ² Virginia Tech, ³ Georgia Institute of Technology

jiasenlu@vt.edu, parikh@gatech.edu, {cxiong, rsocher}@salesforce.com

Abstract

Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as “the” and “of”. Other words that may seem visual can often be predicted reliably just from the language model, e.g., “sign” after “behind a red stop” or “phone” following “talking on a cell”. In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin.

1. Introduction

Automatically generating captions for images has emerged as a prominent interdisciplinary research problem in both academia and industry [8, 11, 18, 23, 27, 30]. It can aid visually impaired users, and make it easy for users to organize and navigate through large amounts of typically unstructured visual data. In order to generate high quality captions, the model needs to incorporate fine-grained visual clues from the image. Recently, visual attention-based neural encoder-decoder models [30, 11, 32] have been explored, where the attention mechanism typically produces a spatial map highlighting image regions relevant to each generated word.

Most attention models for image captioning and visual question answering attend to the image at every time step, irrespective of which word is going to be emitted next [31, 29, 17]. However, not all words in the caption have corresponding visual signals. Consider the example in Fig. 1 that shows an image and its generated caption “A white bird perched on top of a red stop sign”. The words “a” and “of” do not have corresponding canonical visual signals. Moreover, language correlations make the visual signal unnecessary when generating words like “on” and “top” following “perched”, and “sign” following “a red stop”. In fact, gradients from non-visual words could mislead and diminish the overall effectiveness of the visual signal in guiding the caption generation process.

[Figure 1 diagram labels: CNN, RNN, Adaptive Attention Model, Spatial Attention, Sentinel Gate; plotted quantity: visual grounding probability.]

Figure 1: Our model learns an adaptive attention model that automatically determines when to look (sentinel gate) and where to look (spatial attention) for word generation, which are explained in Sections 2.2, 2.3 & 5.4.

∗ The major part of this work was done while J. Lu was an intern at Salesforce Research.

† Equal contribution.

In this paper, we introduce an adaptive attention encoder-decoder framework which can automatically decide when to rely on visual signals and when to just rely on the language model. Of course, when relying on visual signals, the model also decides where (which image region) it should attend. We first propose a novel spatial attention model for extracting spatial image features. Then, as our proposed adaptive attention mechanism, we introduce a new Long Short-Term Memory (LSTM) extension, which produces an additional “visual sentinel” vector instead of a single hidden state. The “visual sentinel”, an additional latent representation of the decoder’s memory, provides a fallback option to the decoder. We further design a new sentinel gate, which decides how much new information the decoder wants to get from the image as opposed to relying on the visual sentinel when generating the next word. For example, as illustrated in Fig. 1, our model learns to attend to the image more when generating words “white”, “bird”, “red” and “stop”, and relies more on the visual sentinel when generating words “top”, “of” and “sign”.

Overall, the main contributions of this paper are:

• We introduce an adaptive encoder-decoder framework that automatically decides when to look at the image and when to rely on the language model to generate the next word.

• We first propose a new spatial attention model, and then build on it to design our novel adaptive attention model with “visual sentinel”.

• Our model significantly outperforms other state-of-the-art methods on COCO and Flickr30k.

• We perform an extensive analysis of our adaptive attention model, including visual grounding probabilities of words and weakly supervised localization of generated attention maps.

2. Method

We first describe the generic neural encoder-decoder framework for image captioning in Sec. 2.1, then introduce our proposed attention-based image captioning models in Sec. 2.2 & 2.3.

2.1. Encoder-Decoder for Image Captioning

We start by briefly describing the encoder-decoder image captioning framework [27, 30]. Given an image and the corresponding caption, the encoder-decoder model directly maximizes the following objective:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I, y)} \log p(y \mid I; \theta) \tag{1}$$

where $\theta$ are the parameters of the model, $I$ is the image, and $y = \{y_1, \ldots, y_T\}$ is the corresponding caption of length $T$. Using the chain rule, the log likelihood of the joint probability distribution can be decomposed into ordered conditionals:

$$\log p(y) = \sum_{t=1}^{T} \log p(y_t \mid y_1, \ldots, y_{t-1}, I) \tag{2}$$

where we drop the dependency on model parameters for convenience.
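As a small numerical illustration of the decomposition in Equation 2, the sketch below sums per-word log-probabilities and checks that the result equals the log of their product. The function name `caption_log_likelihood` and the probability values are hypothetical, chosen only for the example.

```python
import math

def caption_log_likelihood(step_probs):
    """Sum the per-word log-probabilities, as in Eq. 2."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical values of p(y_t | y_1, ..., y_{t-1}, I) for a 3-word caption.
step_probs = [0.5, 0.25, 0.8]
log_p = caption_log_likelihood(step_probs)

# The sum of logs equals the log of the joint probability (the product).
print(log_p, math.log(0.5 * 0.25 * 0.8))
```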

In the encoder-decoder framework with a recurrent neural network (RNN), each conditional probability is modeled as:

$$\log p(y_t \mid y_1, \ldots, y_{t-1}, I) = f(h_t, c_t) \tag{3}$$

where $f$ is a nonlinear function that outputs the probability of $y_t$, $c_t$ is the visual context vector at time $t$ extracted from image $I$, and $h_t$ is the hidden state of the RNN at time $t$. In this paper, we adopt Long Short-Term Memory (LSTM) instead of a vanilla RNN, since LSTMs have demonstrated state-of-the-art performance on a variety of sequence modeling tasks. $h_t$ is modeled as:

$$h_t = \mathrm{LSTM}(x_t, h_{t-1}, m_{t-1}) \tag{4}$$

where $x_t$ is the input vector and $m_{t-1}$ is the memory cell vector at time $t-1$.

The context vector $c_t$ is an important factor in the neural encoder-decoder framework: it provides visual evidence for caption generation [18, 27, 30, 34]. The different ways of modeling the context vector fall into two categories: vanilla encoder-decoder and attention-based encoder-decoder frameworks:

• First, in the vanilla framework, $c_t$ depends only on the encoder, a Convolutional Neural Network (CNN). The input image $I$ is fed into the CNN, which extracts the last fully connected layer as a global image feature [18, 27]. Across generated words, the context vector $c_t$ stays constant and does not depend on the hidden state of the decoder.

• Second, in the attention-based framework, $c_t$ depends on both the encoder and the decoder. At time $t$, based on the hidden state, the decoder attends to specific regions of the image and computes $c_t$ using the spatial image features from a convolutional layer of a CNN. [30, 34] show that attention models can significantly improve the performance of image captioning.

To compute the context vector $c_t$, we first propose our spatial attention model in Sec. 2.2, then extend the model to an adaptive attention model in Sec. 2.3.

2.2. Spatial Attention Model

First, we propose a spatial attention model for computing the context vector $c_t$, which is defined as:

$$c_t = g(V, h_t) \tag{5}$$

where $g$ is the attention function and $V = [v_1, \ldots, v_k]$, $v_i \in \mathbb{R}^{d}$, are the spatial image features, each of which is a $d$-dimensional representation corresponding to a part of the image. $h_t$ is the hidden state of the RNN at time $t$.

Figure 2: An illustration of the soft attention model from [30] (a) and our proposed spatial attention model (b).

Given the spatial image features $V \in \mathbb{R}^{d \times k}$ and the hidden state $h_t \in \mathbb{R}^{d}$ of the LSTM, we feed them through a single-layer neural network followed by a softmax function to generate the attention distribution over the $k$ regions of the image:

$$z_t = w_h^{T} \tanh\left(W_v V + (W_g h_t) \mathbf{1}^{T}\right) \tag{6}$$

$$\alpha_t = \mathrm{softmax}(z_t) \tag{7}$$

where $\mathbf{1} \in \mathbb{R}^{k}$ is a vector with all elements set to 1. $W_v, W_g \in \mathbb{R}^{k \times d}$ and $w_h \in \mathbb{R}^{k}$ are parameters to be learnt. $\alpha_t \in \mathbb{R}^{k}$ is the attention weight over features in $V$. Based on the attention distribution, the context vector $c_t$ can be obtained by:

$$c_t = \sum_{i=1}^{k} \alpha_{ti} v_i \tag{8}$$

where $c_t$ and $h_t$ are combined to predict next word $y_{t+1}$ as in Equation 3.
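Equations 6 to 8 can be sketched in a few lines of NumPy. This is a minimal illustration with randomly initialized parameters, not the authors' implementation; the function names, the toy sizes `d = 8` and `k = 5`, and the random seed are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def spatial_attention(V, h_t, W_v, W_g, w_h):
    """Sketch of Eqs. 6-8.

    V: (d, k) spatial features, one column per image region.
    h_t: (d,) current LSTM hidden state.
    W_v, W_g: (k, d) projections; w_h: (k,) scoring vector.
    """
    k = V.shape[1]
    # Eq. 6: column i of the (k, k) matrix is W_v v_i + W_g h_t.
    hidden = np.tanh(W_v @ V + np.outer(W_g @ h_t, np.ones(k)))
    z_t = w_h @ hidden        # (k,) one score per region
    alpha_t = softmax(z_t)    # Eq. 7: attention distribution
    c_t = V @ alpha_t         # Eq. 8: attention-weighted sum of the columns
    return alpha_t, c_t

# Toy sizes and random parameters (hypothetical, for the demo only).
rng = np.random.default_rng(0)
d, k = 8, 5
V = rng.normal(size=(d, k))
h_t = rng.normal(size=d)
W_v, W_g = rng.normal(size=(k, d)), rng.normal(size=(k, d))
w_h = rng.normal(size=k)

alpha_t, c_t = spatial_attention(V, h_t, W_v, W_g, w_h)
print(alpha_t.sum(), c_t.shape)
```

The outer product with the all-ones vector simply repeats the projected hidden state once per region, so every region is scored against the same decoder state.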

Different from [30], shown in Fig. 2, we use the current hidden state $h_t$ to analyze where to look (i.e., generating the context vector $c_t$), then combine both sources of information to predict the next word. Our motivation stems from the superior performance of residual network [10]. The generated context vector $c_t$ could be considered as the residual visual information of current hidden state $h_t$, which diminishes the uncertainty or complements the informativeness of the current hidden state for next word prediction. We also empirically find our spatial attention model performs better, as illustrated in Table 1.

2.3. Adaptive Attention Model

While spatial attention based decoders have proven to be effective for image captioning, they cannot determine when to rely on the visual signal and when to rely on the language model. In this section, motivated by Merity et al. [19], we introduce a new concept, the “visual sentinel”, which is a latent representation of what the decoder already knows. With the “visual sentinel”, we extend our spatial attention model, and propose an adaptive model that is able to determine whether it needs to attend to the image to predict the next word.

Figure 3: An illustration of the proposed model generating the $t$-th target word $y_t$ given the image.

What is visual sentinel? The decoder’s memory stores both long and short term visual and linguistic information. Our model learns to extract a new component from this that the model can fall back on when it chooses to not attend to the image. This new component is called the visual sentinel, and the gate that decides whether to attend to the image or to the visual sentinel is the sentinel gate. When the decoder RNN is an LSTM, we consider the information preserved in its memory cell. Therefore, we extend the LSTM to obtain the “visual sentinel” vector $s_t$ by:

$$g_t = \sigma(W_x x_t + W_h h_{t-1}) \tag{9}$$

$$s_t = g_t \odot \tanh(m_t) \tag{10}$$

where $W_x$ and $W_h$ are weight parameters to be learned, $x_t$ is the input to the LSTM at time step $t$, and $g_t$ is the gate applied on the memory cell $m_t$. $\odot$ represents the element-wise product and $\sigma$ is the logistic sigmoid activation.
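The sentinel computation in Equations 9 and 10 has the same form as an LSTM output gate applied to the memory cell. A minimal NumPy sketch follows; the function names, the toy dimensions and the random values are assumptions for illustration, and the LSTM update that would produce $m_t$ is not shown.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def visual_sentinel(x_t, h_prev, m_t, W_x, W_h):
    """Sketch of Eqs. 9-10.

    x_t: (e,) LSTM input; h_prev: (d,) previous hidden state;
    m_t: (d,) current memory cell; W_x: (d, e); W_h: (d, d).
    """
    g_t = sigmoid(W_x @ x_t + W_h @ h_prev)  # Eq. 9: gate values in (0, 1)
    s_t = g_t * np.tanh(m_t)                 # Eq. 10: element-wise product
    return s_t

# Toy sizes and random values (hypothetical, for the demo only).
rng = np.random.default_rng(0)
e, d = 6, 4
x_t, h_prev, m_t = rng.normal(size=e), rng.normal(size=d), rng.normal(size=d)
W_x, W_h = rng.normal(size=(d, e)), rng.normal(size=(d, d))

s_t = visual_sentinel(x_t, h_prev, m_t, W_x, W_h)
print(s_t.shape)
```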

Based on the visual sentinel, we propose an adaptive attention model to compute the context vector. In our proposed architecture (see Fig. 3), our new adaptive context vector is defined as $\hat{c}_t$, which is modeled as a mixture of the spatially attended image features (i.e., the context vector of the spatial attention model) and the visual sentinel vector. This trades off how much new information the network is considering from the image with what it already knows in the decoder memory (i.e., the visual sentinel). The mixture model is defined as follows:

$$\hat{c}_t = \beta_t s_t + (1 - \beta_t) c_t \tag{11}$$

where $\beta_t$ is the new sentinel gate at time $t$. In our mixture model, $\beta_t$ produces a scalar in the range [0, 1]. A value of 1 implies that only the visual sentinel information is used and 0 means only spatial image information is used when generating the next word.

To compute the new sentinel gate $\beta_t$, we modify the spatial attention component. In particular, we add an additional element to $z$, the vector containing attention scores as defined in Equation 6. This element indicates how much “attention” the network is placing on the sentinel (as opposed to the image features). The addition of this extra element is summarized by converting Equation 7 to:

$$\hat{\alpha}_t = \mathrm{softmax}\left(\left[z_t;\ w_h^{T} \tanh(W_s s_t + W_g h_t)\right]\right) \tag{12}$$

where $[\cdot\,;\cdot]$ indicates concatenation. $W_s$ and $W_g$ are weight parameters. Notably, $W_g$ is the same weight parameter as in Equation 6. $\hat{\alpha}_t \in \mathbb{R}^{k+1}$ is the attention distribution over both the spatial image features and the visual sentinel vector. We interpret the last element of this vector to be the gate value: $\beta_t = \hat{\alpha}_t[k + 1]$.

The probability over a vocabulary of possible words at time $t$ can be calculated as:

$$p_t = \mathrm{softmax}\left(W_p(\hat{c}_t + h_t)\right) \tag{13}$$

where $W_p$ is the weight parameter to be learnt.

This formulation encourages the model to adaptively attend to the image vs. the visual sentinel when generating the next word. The sentinel vector is updated at each time step. With this adaptive attention model, we call our framework the adaptive encoder-decoder image captioning framework.
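Equations 11 to 13 can be put together in a short NumPy sketch. It is an illustration under assumed shapes (with $W_s \in \mathbb{R}^{k \times d}$, matching $W_g$), not the authors' code; the function names, toy sizes and random parameters are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def adaptive_attention(V, h_t, s_t, W_v, W_g, W_s, w_h):
    """Sketch of Eqs. 6, 11 and 12 for one time step."""
    k = V.shape[1]
    z_t = w_h @ np.tanh(W_v @ V + np.outer(W_g @ h_t, np.ones(k)))  # Eq. 6
    z_s = w_h @ np.tanh(W_s @ s_t + W_g @ h_t)   # extra score for the sentinel
    alpha_hat = softmax(np.append(z_t, z_s))     # Eq. 12, shape (k + 1,)
    beta_t = alpha_hat[-1]                       # sentinel gate: last element
    c_t = V @ softmax(z_t)                       # Eqs. 7-8: spatial context
    c_hat = beta_t * s_t + (1.0 - beta_t) * c_t  # Eq. 11: mixture
    return alpha_hat, beta_t, c_hat

def word_probabilities(c_hat, h_t, W_p):
    """Eq. 13: distribution over the vocabulary."""
    return softmax(W_p @ (c_hat + h_t))

# Toy sizes and random parameters (hypothetical, for the demo only).
rng = np.random.default_rng(0)
d, k, vocab_size = 8, 5, 11
V, h_t, s_t = rng.normal(size=(d, k)), rng.normal(size=d), rng.normal(size=d)
W_v, W_g, W_s = (rng.normal(size=(k, d)) for _ in range(3))
w_h, W_p = rng.normal(size=k), rng.normal(size=(vocab_size, d))

alpha_hat, beta_t, c_hat = adaptive_attention(V, h_t, s_t, W_v, W_g, W_s, w_h)
p_t = word_probabilities(c_hat, h_t, W_p)
print(beta_t, p_t.sum())
```

Because the first $k$ entries of $\hat{\alpha}_t$ equal $(1 - \beta_t)\,\alpha_t$, the mixture in Equation 11 can equivalently be written as $\beta_t s_t + \sum_{i=1}^{k} \hat{\alpha}_{ti} v_i$, which is a convenient consistency check.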

3. Implementation Details

In this section, we describe the implementation details of our model and how we train our network.

Encoder-CNN. The encoder uses a CNN to get the representation of images. Specifically, the spatial feature outputs of the last convolutional layer of ResNet [10] are used, which have a dimension of 2048 × 7 × 7. We use $A = \{a_1, \ldots, a_k\}$, $a_i \in \mathbb{R}^{2048}$, to represent the spatial CNN features at each of the $k$ grid locations. Following [10], the global image feature can be obtained by:

$$a^{g} = \frac{1}{k} \sum_{i=1}^{k} a_i \tag{14}$$

where $a^{g}$ is the global image feature. For modeling convenience, we use a single layer perceptron with rectifier activation function to transform the image feature vector into new vectors with dimension $d$:

$$v_i = \mathrm{ReLU}(W_a a_i) \tag{15}$$

$$v^{g} = \mathrm{ReLU}(W_b a^{g}) \tag{16}$$

where $W_a$ and $W_b$ are the weight parameters. The transformed spatial image features form $V = [v_1, \ldots, v_k]$.
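The feature transformation in Equations 14 to 16 amounts to a mean over grid locations followed by two linear projections with ReLU. The NumPy sketch below uses a random array in place of real ResNet activations; the function name, the choice `d = 512` and the random weights are assumptions for illustration.

```python
import numpy as np

def encode_image_features(feature_map, W_a, W_b):
    """Sketch of Eqs. 14-16.

    feature_map: (C, H, W) output of the last convolutional layer.
    W_a, W_b: (d, C) projection matrices.
    Returns V with shape (d, k), where k = H * W, and v_g with shape (d,).
    """
    channels = feature_map.shape[0]
    A = feature_map.reshape(channels, -1)  # (C, k): one column per grid cell
    a_g = A.mean(axis=1)                   # Eq. 14: global image feature
    V = np.maximum(W_a @ A, 0.0)           # Eq. 15: ReLU(W_a a_i) per column
    v_g = np.maximum(W_b @ a_g, 0.0)       # Eq. 16: ReLU(W_b a^g)
    return V, v_g

# Random stand-in for a 2048 x 7 x 7 ResNet feature map (hypothetical values).
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(2048, 7, 7))
d = 512  # assumed projection size
W_a = rng.normal(scale=0.01, size=(d, 2048))
W_b = rng.normal(scale=0.01, size=(d, 2048))

V, v_g = encode_image_features(feature_map, W_a, W_b)
print(V.shape, v_g.shape)  # (512, 49) (512,)
```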

Decoder-RNN. We concatenate the word embedding vector $w_t$ and global image feature vector $v^{g}$ to get the input vector $x_t = [w_t; v^{g}]$. We use a single layer neural network to transform the visual sentinel vector $s_t$ and LSTM output vector $h_t$ into new vectors that have the dimension $d$.

Training details. In our experiments, we use a single layer LSTM with hidden size of 512. We use the Adam optimizer with base learning rate of 5e-4 for the language model and 1e-5 for the CNN. The momentum and weight-decay are 0.8 and 0.999 respectively. We finetune the CNN network after 20 epochs. We set the batch size to be 80 and train for up to 50 epochs with early stopping if the validation CIDEr [26] score has not improved over the last 6 epochs. Our model can be trained within 30 hours on a single Titan X GPU. We use beam size of 3 when sampling the caption for both COCO and Flickr30k datasets.
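The early-stopping rule described above (up to 50 epochs, stop once the validation score has not improved for 6 epochs) can be sketched as follows. The function name and the list of validation scores are hypothetical; a real training loop would compute the validation CIDEr score after each epoch instead of reading it from a list.

```python
def early_stopping_epoch(val_scores, max_epochs=50, patience=6):
    """Return (last_epoch_run, best_epoch, best_score).

    val_scores: one validation score per epoch (higher is better).
    Training stops once `patience` epochs have passed without a new best.
    """
    best_score = float("-inf")
    best_epoch = 0
    epoch = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return epoch, best_epoch, best_score

# Hypothetical scores: the best value is reached at epoch 3, then no improvement.
scores = [0.5, 0.6, 0.7] + [0.65] * 10
print(early_stopping_epoch(scores))  # (9, 3, 0.7)
```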

4. Related Work

Image captioning has many important applications ranging from helping visually impaired users to human-robot interaction. As a result, many different models have been developed for image captioning. In general, those methods can be divided into two categories: template-based [9, 13, 14, 20] and neural-based [12, 18, 6, 3, 27, 7, 11, 30, 8, 34, 32, 33].

Template-based approaches generate caption templates whose slots are filled in based on outputs of object detection, attribute classification, and scene recognition. Farhadi et al. [9] infer a triplet of scene elements which is converted to text using templates. Kulkarni et al. [13] adopt a Conditional Random Field (CRF) to jointly reason across objects, attributes, and prepositions before filling the slots. [14, 20] use more powerful language templates such as a syntactically well-formed tree, and add descriptive information from the output of attribute detection.

Neural-based approaches are inspired by the success of sequence-to-sequence encoder-decoder frameworks in machine translation [4, 24, 2] with the view that image captioning is analogous to translating images to text. Kiros et al. [12] proposed a feed forward neural network with a multimodal log-bilinear model to predict the next word given the image and previous word. Other methods then replaced the feed forward neural network with a recurrent neural network [18, 3]. Vinyals et al. [27] use an LSTM instead of a vanilla RNN as the decoder. However, all these approaches represent the image with the last fully connected layer of a CNN. Karpathy et al. [11] adopt the result of object detection from R-CNN and output of a bidirectional RNN to learn a joint embedding space for caption ranking and generation.

Recently, attention mechanisms have been introduced to encoder-decoder neural frameworks in image captioning. Xu et al. [30] incorporate an attention mechanism to learn a latent alignment from scratch when generating corresponding words. [28, 34] utilize high-level concepts or attributes and inject them into a neural-based approach as semantic attention to enhance image captioning. Yang et al. [32] extend current attention encoder-decoder frameworks using a review network, which captures the global properties in a compact vector representation that is usable by the attention mechanism in the decoder. Yao et al. [33] present variants of architectures for augmenting high-level attributes from images to complement image representation for sentence generation.

To the best of our knowledge, ours is the first work to reason about when a model should attend to an image when generating a sequence of words.

                     Flickr30k                                       MS-COCO
Method               B-1     B-2     B-3     B-4     METEOR  CIDEr   B-1     B-2     B-3     B-4     METEOR  CIDEr
DeepVS [11]          0.573   0.369   0.240   0.157   0.153   0.247   0.625   0.450   0.321   0.230   0.195   0.660
Hard-Attention [30]  0.669   0.439   0.296   0.199   0.185   -       0.718   0.504   0.357   0.250   0.230   -
ATT-FCN† [34]        0.647   0.460   0.324   0.230   0.189   -       0.709   0.537   0.402   0.304   0.243   -
ERD [32]             -       -       -       -       -       -       -       -       -       0.298   0.240   0.895
MSM† [33]            -       -       -       -       -       -       0.730   0.565   0.429   0.325   0.251   0.986
Ours-Spatial         0.644   0.462   0.327   0.231   0.202   0.493   0.734   0.566   0.418   0.304   0.257   1.029
Ours-Adaptive        0.677   0.494   0.354   0.251   0.204   0.531   0.742   0.580   0.439   0.332   0.266   1.085

Table 1: Performance on Flickr30k and COCO test splits. † indicates ensemble models. B-n is BLEU score that uses up to n-grams. Higher is better in all columns. For future comparisons, our ROUGE-L/SPICE Flickr30k scores are 0.467/0.145 and the COCO scores are 0.549/0.194.

Method               B-1           B-2           B-3           B-4           METEOR        ROUGE-L       CIDEr
                     c5     c40    c5     c40    c5     c40    c5     c40    c5     c40    c5     c40    c5     c40
Google NIC [27]      0.713  0.895  0.542  0.802  0.407  0.694  0.309  0.587  0.254  0.346  0.530  0.682  0.943  0.946
MS Captivator [8]    0.715  0.907  0.543  0.819  0.407  0.710  0.308  0.601  0.248  0.339  0.526  0.680  0.931  0.937
m-RNN [18]           0.716  0.890  0.545  0.798  0.404  0.687  0.299  0.575  0.242  0.325  0.521  0.666  0.917  0.935
LRCN [7]             0.718  0.895  0.548  0.804  0.409  0.695  0.306  0.585  0.247  0.335  0.528  0.678  0.921  0.934
Hard-Attention [30]  0.705  0.881  0.528  0.779  0.383  0.658  0.277  0.537  0.241  0.322  0.516  0.654  0.865  0.893
ATT-FCN [34]         0.731  0.900  0.565  0.815  0.424  0.709  0.316  0.599  0.250  0.335  0.535  0.682  0.943  0.958
ERD [32]             0.720  0.900  0.550  0.812  0.414  0.705  0.313  0.597  0.256  0.347  0.533  0.686  0.965  0.969
MSM [33]             0.739  0.919  0.575  0.842  0.436  0.740  0.330  0.632  0.256  0.350  0.542  0.700  0.984  1.003
Ours-Adaptive        0.748  0.920  0.584  0.845  0.444  0.744  0.336  0.637  0.264  0.359  0.550  0.705  1.042  1.059

Table 2: Leaderboard of the published state-of-the-art image captioning models on the online COCO testing server. Our submission is an ensemble of 5 models trained with different initialization.

5. Results

5.1. Experiment Settings

We experiment with two datasets: Flickr30k [35] and COCO [16].

Flickr30k contains 31,783 images collected from Flickr. Most of these images depict humans performing various activities. Each image is paired with 5 crowd-sourced captions. We use the publicly available splits¹ containing 1,000 images for validation and test each.

COCO is the largest image captioning dataset, containing 82,783, 40,504 and 40,775 images for training, validation and test respectively. This dataset is more challenging, since most images contain multiple objects in the context of complex scenes. Each image has 5 human annotated captions. For offline evaluation, we use the same data split as in [32, 33, 34] containing 5000 images for validation and test each. For online evaluation on the COCO evaluation server, we reserve 2000 images from validation for development and the rest for training.

¹ https://github.com/karpathy/neuraltalk

Pre-processing. We truncate captions longer than 18 words for COCO and 22 for Flickr30k. We then build a vocabulary of words that occur at least 5 and 3 times in the training set, resulting in 9567 and 7649 words for COCO and Flickr30k respectively.
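The pre-processing described above can be sketched with the standard library. The lowercasing and whitespace tokenization are assumptions (the text does not state how captions are tokenized), and the function names and toy captions are hypothetical.

```python
from collections import Counter

def truncate_captions(captions, max_len):
    """Lowercase, split on whitespace, and keep at most max_len words."""
    return [caption.lower().split()[:max_len] for caption in captions]

def build_vocab(tokenized_captions, min_count):
    """Keep the words that occur at least min_count times, sorted alphabetically."""
    counts = Counter(word for caption in tokenized_captions for word in caption)
    return sorted(word for word, n in counts.items() if n >= min_count)

# Toy data; the paper's settings would be max_len=18, min_count=5 for COCO
# and max_len=22, min_count=3 for Flickr30k.
captions = ["A dog runs", "a dog sits", "a cat"]
tokens = truncate_captions(captions, max_len=2)
vocab = build_vocab(tokens, min_count=2)
print(tokens)  # [['a', 'dog'], ['a', 'dog'], ['a', 'cat']]
print(vocab)   # ['a', 'dog']
```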

Compared Approaches: For offline evaluation on Flickr30k and COCO, we first compare our full model (Ours-Adaptive) with an ablated version (Ours-Spatial), which only performs the spatial attention. The goal of this comparison is to verify that our improvements are not the result of orthogonal contributions (e.g., better CNN features or better optimization). We further compare our method with DeepVS [11], Hard-Attention [30] and the recently proposed ATT [34], ERD [32] and the best-performing method (LSTM-A$_5$) of MSM [33]. For online evaluation, we compare our method with Google NIC [27], MS Captivator [8], m-RNN [18], LRCN [7], Hard-Attention [30], ATT-FCN [34], ERD [32] and MSM [33].

5.2. Quantitative Analysis

We report results using the COCO captioning evaluation tool [16], which reports the following metrics: BLEU [21], Meteor [5], Rouge-L [15] and CIDEr [26]. We also report results using the new metric SPICE [1], which was found to better correlate with human judgments.

Table 1 shows results on the Flickr30k and COCO datasets. Comparing the full model w.r.t. ablated versions without visual sentinel verifies the effectiveness of the pro-
