
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning

TL;DR: This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.
Abstract: Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as “the” and “of”. Other words that may seem visual can often be predicted reliably just from the language model, e.g., “sign” after “behind a red stop” or “phone” following “talking on a cell”. In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin.


Jiasen Lu², Caiming Xiong¹, Devi Parikh³, Richard Socher¹
¹Salesforce Research, ²Virginia Tech, ³Georgia Institute of Technology
jiasenlu@vt.edu, parikh@gatech.edu, {cxiong, rsocher}@salesforce.com
1. Introduction
Automatically generating captions for images has emerged as a prominent interdisciplinary research problem in both academia and industry [8, 11, 18, 23, 27, 30]. It can aid visually impaired users, and make it easy for users to organize and navigate through large amounts of typically unstructured visual data. In order to generate high quality captions, the model needs to incorporate fine-grained visual clues from the image. Recently, visual attention-based neural encoder-decoder models [30, 11, 32] have been explored, where the attention mechanism typically produces a spatial map highlighting image regions relevant to each generated word.
Most attention models for image captioning and visual question answering attend to the image at every time step, irrespective of which word is going to be emitted next [31, 29, 17]. However, not all words in the caption have corresponding visual signals. Consider the example in Fig. 1 that shows an image and its generated caption “A white bird perched on top of a red stop sign”. The words “a” and “of” do not have corresponding canonical visual signals. Moreover, language correlations make the visual signal unnecessary when generating words like “on” and “top” following “perched”, and “sign” following “a red stop”. In fact, gradients from non-visual words could mislead and diminish the overall effectiveness of the visual signal in guiding the caption generation process.

Figure 1: Our model learns an adaptive attention model that automatically determines when to look (sentinel gate) and where to look (spatial attention) for word generation, which are explained in Sections 2.2, 2.3 & 5.4. [Diagram labels: CNN, RNN, Spatial Attention, Sentinel Gate, Adaptive Attention Model; plotted quantity: visual grounding probability, axis ticks 0.3 to 0.9.]

Author footnotes: The major part of this work was done while J. Lu was an intern at Salesforce Research. Equal contribution.
In this paper, we introduce an adaptive attention encoder-decoder framework which can automatically decide when to rely on visual signals and when to just rely on the language model. Of course, when relying on visual signals, the model also decides where (that is, which image region) it should attend to. We first propose a novel spatial attention model for extracting spatial image features. Then, as our proposed adaptive attention mechanism, we introduce a new Long Short Term Memory (LSTM) extension, which produces an additional “visual sentinel” vector instead of a single hidden state. The “visual sentinel”, an additional latent representation of the decoder’s memory, provides a fallback option to the decoder. We further design a new sentinel gate, which decides how much new information the decoder wants to get from the image as opposed to relying on the visual sentinel when generating the next word. For example, as illustrated in Fig. 1, our model learns to attend to the image more when generating words “white”, “bird”, “red” and “stop”, and relies more on the visual sentinel when generating words “top”, “of” and “sign”.
Overall, the main contributions of this paper are:
• We introduce an adaptive encoder-decoder framework that automatically decides when to look at the image and when to rely on the language model to generate the next word.
• We first propose a new spatial attention model, and then build on it to design our novel adaptive attention model with “visual sentinel”.
• Our model significantly outperforms other state-of-the-art methods on COCO and Flickr30k.
• We perform an extensive analysis of our adaptive attention model, including visual grounding probabilities of words and weakly supervised localization of generated attention maps.
2. Method
We first describe the generic neural encoder-decoder framework for image captioning in Sec. 2.1, then introduce our proposed attention-based image captioning models in Sec. 2.2 & 2.3.
2.1. Encoder-Decoder for Image Captioning
We start by briefly describing the encoder-decoder image
captioning framework [27, 30]. Given an image and the
corresponding caption, the encoder-decoder model directly
maximizes the following objective:
$$\theta^* = \arg\max_{\theta} \sum_{(I, y)} \log p(y \mid I; \theta) \tag{1}$$

where $\theta$ are the parameters of the model, $I$ is the image, and $y = \{y_1, \ldots, y_T\}$ is the corresponding caption. Using the chain rule, the log likelihood of the joint probability distribution can be decomposed into ordered conditionals:

$$\log p(y) = \sum_{t=1}^{T} \log p(y_t \mid y_1, \ldots, y_{t-1}, I) \tag{2}$$
where we drop the dependency on model parameters for
convenience.
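As a small illustration of Equation 2, the sketch below sums the log of each per-step conditional probability. The function name `caption_log_likelihood` and the probability values are assumptions made for this example; in the actual model each conditional would come from the decoder.

```python
import math

def caption_log_likelihood(step_probs):
    """Sum log p(y_t | y_1..y_{t-1}, I) over the T steps of one caption (Eq. 2)."""
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("each conditional probability must lie in (0, 1]")
    return sum(math.log(p) for p in step_probs)

# Hypothetical per-word conditional probabilities for a three-word caption.
probs = [0.5, 0.25, 0.8]
log_lik = caption_log_likelihood(probs)

# The sum of logs equals the log of the product of the conditionals.
assert math.isclose(log_lik, math.log(0.5 * 0.25 * 0.8))
```

Working with the sum of logs rather than the product of probabilities avoids numerical underflow for long captions.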
In the encoder-decoder framework, with a recurrent neural network (RNN), each conditional probability is modeled as:

$$\log p(y_t \mid y_1, \ldots, y_{t-1}, I) = f(h_t, c_t) \tag{3}$$

where $f$ is a nonlinear function that outputs the probability of $y_t$, $c_t$ is the visual context vector at time $t$ extracted from image $I$, and $h_t$ is the hidden state of the RNN at time $t$. In this paper, we adopt Long-Short Term Memory (LSTM) instead of a vanilla RNN. The former has demonstrated state-of-the-art performance on a variety of sequence modeling tasks. $h_t$ is modeled as:

$$h_t = \mathrm{LSTM}(x_t, h_{t-1}, m_{t-1}) \tag{4}$$

where $x_t$ is the input vector and $m_{t-1}$ is the memory cell vector at time $t-1$.
The context vector $c_t$ is an important factor in the neural encoder-decoder framework, as it provides visual evidence for caption generation [18, 27, 30, 34]. The different ways of modeling the context vector fall into two categories: vanilla encoder-decoder and attention-based encoder-decoder frameworks:
• First, in the vanilla framework, $c_t$ is only dependent on the encoder, a Convolutional Neural Network (CNN). The input image $I$ is fed into the CNN, which extracts the last fully connected layer as a global image feature [18, 27]. Across generated words, the context vector $c_t$ keeps constant, and does not depend on the hidden state of the decoder.
• Second, in the attention-based framework, $c_t$ is dependent on both encoder and decoder. At time $t$, based on the hidden state, the decoder would attend to the specific regions of the image and compute $c_t$ using the spatial image features from a convolution layer of a CNN. In [30, 34], they show that attention models can significantly improve the performance of image captioning.
To compute the context vector $c_t$, we first propose our spatial attention model in Sec. 2.2, then extend the model to an adaptive attention model in Sec. 2.3.
2.2. Spatial Attention Model
First, we propose a spatial attention model for computing the context vector $c_t$, which is defined as:

$$c_t = g(V, h_t) \tag{5}$$

where $g$ is the attention function, $V = [v_1, \ldots, v_k]$, $v_i \in \mathbb{R}^d$ are the spatial image features, each of which is a $d$-dimensional representation corresponding to a part of the image, and $h_t$ is the hidden state of the RNN at time $t$.
Given the spatial image feature $V \in \mathbb{R}^{d \times k}$ and hidden state $h_t \in \mathbb{R}^d$ of the LSTM, we feed them through a single layer neural network followed by a softmax function to generate the attention distribution over the $k$ regions of the image:

$$z_t = w_h^T \tanh(W_v V + (W_g h_t)\mathbb{1}^T) \tag{6}$$

$$\alpha_t = \mathrm{softmax}(z_t) \tag{7}$$

where $\mathbb{1} \in \mathbb{R}^k$ is a vector with all elements set to 1. $W_v, W_g \in \mathbb{R}^{k \times d}$ and $w_h \in \mathbb{R}^k$ are parameters to be learnt. $\alpha \in \mathbb{R}^k$ is the attention weight over features in $V$. Based on the attention distribution, the context vector $c_t$ can be obtained by:

$$c_t = \sum_{i=1}^{k} \alpha_{ti} v_{ti} \tag{8}$$

where $c_t$ and $h_t$ are combined to predict next word $y_{t+1}$ as in Equation 3.

Figure 2: An illustration of soft attention model from [30] (a) and our proposed spatial attention model (b). [Diagram labels: LSTM, Atten, MLP, $x_t$, $V$, $h_{t-1}$, $h_t$, $c_t$, $y_t$.]
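Equations 6 to 8 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`softmax`, `matvec`, `spatial_attention`) and the toy numbers are invented for the example, $V$ is stored as a list of $k$ region vectors (the columns of the $d \times k$ matrix), and the region feature written $v_{ti}$ in Equation 8 is simply `V[i]`.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def spatial_attention(V, h, W_v, W_g, w_h):
    """V: list of k region vectors (each of length d); h: hidden state h_t.
    W_v, W_g: k x d matrices; w_h: length-k vector. Returns (alpha_t, c_t)."""
    k, d = len(V), len(V[0])
    u = matvec(W_g, h)                  # W_g h_t; the 1^T factor repeats it for every region
    cols = [matvec(W_v, v) for v in V]  # cols[i] is the i-th column of W_v V
    z = [sum(w_h[a] * math.tanh(cols[i][a] + u[a]) for a in range(k))
         for i in range(k)]             # Eq. 6
    alpha = softmax(z)                  # Eq. 7
    c = [sum(alpha[i] * V[i][j] for i in range(k)) for j in range(d)]  # Eq. 8
    return alpha, c

# Toy example with made-up numbers: k = 2 regions, d = 2 feature dimensions.
V = [[1.0, 0.0], [0.0, 1.0]]
h = [0.5, -0.5]
W_v = [[1.0, 0.0], [0.0, 0.0]]
W_g = [[0.0, 0.0], [0.0, 0.0]]
w_h = [1.0, 0.0]
alpha, c = spatial_attention(V, h, W_v, W_g, w_h)
assert math.isclose(sum(alpha), 1.0)
assert alpha[0] > alpha[1]  # the first region receives the larger score here
```

Because the attention weights sum to one, the context vector is a convex combination of the region features.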
Different from [30], shown in Fig. 2, we use the current hidden state $h_t$ to analyze where to look (i.e., generating the context vector $c_t$), then combine both sources of information to predict the next word. Our motivation stems from the superior performance of residual network [10]. The generated context vector $c_t$ could be considered as the residual visual information of current hidden state $h_t$, which diminishes the uncertainty or complements the informativeness of the current hidden state for next word prediction. We also empirically find our spatial attention model performs better, as illustrated in Table 1.
2.3. Adaptive Attention Model
While spatial attention based decoders have proven to be effective for image captioning, they cannot determine when to rely on visual signal and when to rely on the language model. In this section, motivated by Merity et al. [19], we introduce a new concept, the “visual sentinel”, which is a latent representation of what the decoder already knows. With the “visual sentinel”, we extend our spatial attention model, and propose an adaptive model that is able to determine whether it needs to attend to the image to predict the next word.
What is a visual sentinel? The decoder’s memory stores both long and short term visual and linguistic information. Our model learns to extract a new component from this that the model can fall back on when it chooses to not attend to the image. This new component is called the visual sentinel, and the gate that decides whether to attend to the image or to the visual sentinel is the sentinel gate. When the decoder RNN is an LSTM, we consider the information preserved in its memory cell. Therefore, we extend the LSTM to obtain the “visual sentinel” vector $s_t$ by:

$$g_t = \sigma(W_x x_t + W_h h_{t-1}) \tag{9}$$

$$s_t = g_t \odot \tanh(m_t) \tag{10}$$

where $W_x$ and $W_h$ are weight parameters to be learned, $x_t$ is the input to the LSTM at time step $t$, and $g_t$ is the gate applied on the memory cell $m_t$. $\odot$ represents the element-wise product and $\sigma$ is the logistic sigmoid activation.

Figure 3: An illustration of the proposed model generating the $t$-th target word $y_t$ given the image. [Diagram labels: LSTM, Atten, MLP, $x_t$, $V$, $v_1$, $v_2$, $h_{t-1}$, $h_t$, $s_t$, attention weights $a_{t1}$, $a_{t2}$, $a_{tL}$, $\beta_t$, $\hat{c}_t$, $y_t$.]
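Equations 9 and 10 amount to one extra gate computed next to the usual LSTM gates. The sketch below shows only that extra gate in plain Python; the names `visual_sentinel`, `matvec` and `sigmoid` and all numeric values are assumptions for illustration, and the memory cell $m_t$ is taken as given rather than computed by a full LSTM step.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def visual_sentinel(x, h_prev, m, W_x, W_h):
    """x: LSTM input x_t; h_prev: h_{t-1}; m: memory cell m_t.
    Returns the sentinel vector s_t."""
    gx = matvec(W_x, x)
    gh = matvec(W_h, h_prev)
    g = [sigmoid(a + b) for a, b in zip(gx, gh)]         # Eq. 9
    return [gi * math.tanh(mi) for gi, mi in zip(g, m)]  # Eq. 10

# Toy example with made-up numbers (2-dimensional state, 3-dimensional input).
x = [1.0, 0.0, -1.0]
h_prev = [0.5, -0.5]
m = [0.0, 1.0]
W_x = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
W_h = [[0.0, 0.0], [0.0, 0.0]]
s = visual_sentinel(x, h_prev, m, W_x, W_h)

# With all-zero weights the gate is 0.5 everywhere, so s_t = 0.5 * tanh(m_t).
assert math.isclose(s[0], 0.0, abs_tol=1e-12)
assert math.isclose(s[1], 0.5 * math.tanh(1.0))
```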
Based on the visual sentinel, we propose an adaptive attention model to compute the context vector. In our proposed architecture (see Fig. 3), our new adaptive context vector is defined as $\hat{c}_t$, which is modeled as a mixture of the spatially attended image features (i.e. context vector of spatial attention model) and the visual sentinel vector. This trades off how much new information the network is considering from the image with what it already knows in the decoder memory (i.e., the visual sentinel). The mixture model is defined as follows:

$$\hat{c}_t = \beta_t s_t + (1 - \beta_t) c_t \tag{11}$$

where $\beta_t$ is the new sentinel gate at time $t$. In our mixture model, $\beta_t$ produces a scalar in the range [0, 1]. A value of 1 implies that only the visual sentinel information is used and 0 means only spatial image information is used when generating the next word.
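The mixture in Equation 11 is a per-element convex combination, which the following sketch makes concrete. The function name `adaptive_context` and the vectors are hypothetical values chosen for the example.

```python
def adaptive_context(beta, s, c):
    """Mix the sentinel s_t and the spatial context c_t with gate beta_t (Eq. 11)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [beta * si + (1.0 - beta) * ci for si, ci in zip(s, c)]

# Hypothetical sentinel and spatial context vectors.
s = [4.0, 0.0]
c = [0.0, 4.0]

assert adaptive_context(1.0, s, c) == s  # only the visual sentinel is used
assert adaptive_context(0.0, s, c) == c  # only spatial image information is used
assert adaptive_context(0.25, s, c) == [1.0, 3.0]
```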
To compute the new sentinel gate $\beta_t$, we modified the spatial attention component. In particular, we add an additional element to $z$, the vector containing attention scores as defined in Equation 6. This element indicates how much “attention” the network is placing on the sentinel (as opposed to the image features). The addition of this extra element is summarized by converting Equation 7 to:

$$\hat{\alpha}_t = \mathrm{softmax}([z_t; w_h^T \tanh(W_s s_t + (W_g h_t))]) \tag{12}$$

where $[\cdot;\cdot]$ indicates concatenation. $W_s$ and $W_g$ are weight parameters. Notably, $W_g$ is the same weight parameter as in Equation 6. $\hat{\alpha}_t \in \mathbb{R}^{k+1}$ is the attention distribution over both the spatial image feature as well as the visual sentinel vector. We interpret the last element of this vector to be the gate value: $\beta_t = \hat{\alpha}_t[k + 1]$.
The probability over a vocabulary of possible words at time $t$ can be calculated as:

$$p_t = \mathrm{softmax}(W_p(\hat{c}_t + h_t)) \tag{13}$$

where $W_p$ is the weight parameter to be learnt.
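Equation 13 can be illustrated with a short sketch: the adaptive context and the hidden state are added element-wise, projected to vocabulary size, and normalized. The function name `word_distribution`, the three-word vocabulary and the numbers are assumptions for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def word_distribution(c_hat, h, W_p):
    """Return p_t = softmax(W_p (c_hat_t + h_t)) as in Eq. 13."""
    summed = [a + b for a, b in zip(c_hat, h)]
    return softmax(matvec(W_p, summed))

# Hypothetical vocabulary of 3 words and d = 2.
W_p = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
p = word_distribution([0.5, 0.0], [0.5, 0.0], W_p)
assert math.isclose(sum(p), 1.0)
assert p[0] == max(p)  # logits are [1, 0, 0], so the first word is most likely
```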
This formulation encourages the model to adaptively attend to the image vs. the visual sentinel when generating the next word. The sentinel vector is updated at each time step. With this adaptive attention model, we call our framework the adaptive encoder-decoder image captioning framework.
3. Implementation Details
In this section, we describe the implementation details of
our model and how we train our network.
Encoder-CNN. The encoder uses a CNN to get the representation of images. Specifically, the spatial feature outputs of the last convolutional layer of ResNet [10] are used, which have a dimension of 2048 × 7 × 7. We use $A = \{a_1, \ldots, a_k\}$, $a_i \in \mathbb{R}^{2048}$ to represent the spatial CNN features at each of the $k$ grid locations. Following [10], the global image feature can be obtained by:

$$a^g = \frac{1}{k} \sum_{i=1}^{k} a_i \tag{14}$$

where $a^g$ is the global image feature. For modeling convenience, we use a single layer perceptron with rectifier activation function to transform the image feature vector into new vectors with dimension $d$:

$$v_i = \mathrm{ReLU}(W_a a_i) \tag{15}$$

$$v^g = \mathrm{ReLU}(W_b a^g) \tag{16}$$

where $W_a$ and $W_b$ are the weight parameters. The transformed spatial image features form $V = [v_1, \ldots, v_k]$.
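The feature preparation in Equations 14 to 16 can be sketched as below. The names `global_feature`, `relu_project` and `matvec` are invented for this example, and the two-dimensional features stand in for the 2048-dimensional ResNet outputs (with a 7 × 7 grid, $k$ would be 49).

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def global_feature(A):
    """Average the k grid features a_i into the global feature a^g (Eq. 14)."""
    k = len(A)
    return [sum(a[j] for a in A) / k for j in range(len(A[0]))]

def relu_project(W, a):
    """Single-layer projection with ReLU, as in Eqs. 15 and 16."""
    return [max(0.0, x) for x in matvec(W, a)]

# Toy grid features: k = 2 locations, 2 channels each (made-up numbers).
A = [[1.0, 3.0], [3.0, 5.0]]
a_g = global_feature(A)
assert a_g == [2.0, 4.0]

W_a = [[1.0, -1.0], [1.0, 1.0]]  # hypothetical projection to d = 2
V = [relu_project(W_a, a) for a in A]  # Eq. 15, one v_i per grid location
assert V == [[0.0, 4.0], [0.0, 8.0]]

W_b = [[1.0, 1.0], [-1.0, 0.0]]  # hypothetical projection for the global feature
v_g = relu_project(W_b, a_g)     # Eq. 16
assert v_g == [6.0, 0.0]
```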
Decoder-RNN. We concatenate the word embedding vector $w_t$ and global image feature vector $v^g$ to get the input vector $x_t = [w_t; v^g]$. We use a single layer neural network to transform the visual sentinel vector $s_t$ and LSTM output vector $h_t$ into new vectors that have the dimension $d$.
Training details. In our experiments, we use a single layer LSTM with hidden size of 512. We use the Adam optimizer with base learning rate of 5e-4 for the language model and 1e-5 for the CNN. The momentum and weight-decay are 0.8 and 0.999 respectively. We finetune the CNN network after 20 epochs. We set the batch size to be 80 and train for up to 50 epochs with early stopping if the validation CIDEr [26] score had not improved over the last 6 epochs. Our model can be trained within 30 hours on a single Titan X GPU. We use beam size of 3 when sampling the caption for both COCO and Flickr30k datasets.
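The stopping rule described above can be sketched as a small function over a list of per-epoch validation scores. This is one possible reading, not the authors' training script: the function name `early_stopping` is hypothetical, and it is an assumption that "not improved over the last 6 epochs" means six consecutive epochs without beating the best score so far.

```python
def early_stopping(val_scores, patience=6, max_epochs=50):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without the
    validation score beating the best value seen so far, or at `max_epochs`.
    """
    best_score = float("-inf")
    best_epoch = 0
    last_epoch = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return last_epoch, best_epoch

# Made-up validation CIDEr curve: best at epoch 2, then six epochs with no gain.
scores = [0.90, 1.00] + [0.95] * 10
assert early_stopping(scores) == (8, 2)
```

In practice the checkpoint from `best_epoch` would be the one kept for evaluation.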
4. Related Work
Image captioning has many important applications ranging from helping visually impaired users to human-robot interaction. As a result, many different models have been developed for image captioning. In general, those methods can be divided into two categories: template-based [9, 13, 14, 20] and neural-based [12, 18, 6, 3, 27, 7, 11, 30, 8, 34, 32, 33].
Template-based approaches generate caption templates whose slots are filled in based on outputs of object detection, attribute classification, and scene recognition. Farhadi et al. [9] infer a triplet of scene elements which is converted to text using templates. Kulkarni et al. [13] adopt a Conditional Random Field (CRF) to jointly reason across objects, attributes, and prepositions before filling the slots. [14, 20] use more powerful language templates such as a syntactically well-formed tree, and add descriptive information from the output of attribute detection.
Neural-based approaches are inspired by the success of sequence-to-sequence encoder-decoder frameworks in machine translation [4, 24, 2] with the view that image captioning is analogous to translating images to text. Kiros et al. [12] proposed a feed forward neural network with a multimodal log-bilinear model to predict the next word given the image and previous word. Other methods then replaced the feed forward neural network with a recurrent neural network [18, 3]. Vinyals et al. [27] use an LSTM instead of a vanilla RNN as the decoder. However, all these approaches represent the image with the last fully connected layer of a CNN. Karpathy et al. [11] adopt the result of object detection from R-CNN and output of a bidirectional RNN to learn a joint embedding space for caption ranking and generation.
Recently, attention mechanisms have been introduced to encoder-decoder neural frameworks in image captioning. Xu et al. [30] incorporate an attention mechanism to learn a latent alignment from scratch when generating corresponding words. [28, 34] utilize high-level concepts or attributes and inject them into a neural-based approach as semantic attention to enhance image captioning. Yang et al. [32] extend current attention encoder-decoder frameworks using a review network, which captures the global properties in a compact vector representation that is usable by the attention mechanism in the decoder. Yao et al. [33] present variants of architectures for augmenting high-level attributes from images to complement image representation for sentence generation.
To the best of our knowledge, ours is the first work to reason about when a model should attend to an image when generating a sequence of words.

| Method | Flickr30k: B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr | MS-COCO: B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepVS [11] | 0.573 | 0.369 | 0.240 | 0.157 | 0.153 | 0.247 | 0.625 | 0.450 | 0.321 | 0.230 | 0.195 | 0.660 |
| Hard-Attention [30] | 0.669 | 0.439 | 0.296 | 0.199 | 0.185 | - | 0.718 | 0.504 | 0.357 | 0.250 | 0.230 | - |
| ATT-FCN† [34] | 0.647 | 0.460 | 0.324 | 0.230 | 0.189 | - | 0.709 | 0.537 | 0.402 | 0.304 | 0.243 | - |
| ERD [32] | - | - | - | - | - | - | - | - | - | 0.298 | 0.240 | 0.895 |
| MSM† [33] | - | - | - | - | - | - | 0.730 | 0.565 | 0.429 | 0.325 | 0.251 | 0.986 |
| Ours-Spatial | 0.644 | 0.462 | 0.327 | 0.231 | 0.202 | 0.493 | 0.734 | 0.566 | 0.418 | 0.304 | 0.257 | 1.029 |
| Ours-Adaptive | 0.677 | 0.494 | 0.354 | 0.251 | 0.204 | 0.531 | 0.742 | 0.580 | 0.439 | 0.332 | 0.266 | 1.085 |

Table 1: Performance on Flickr30k and COCO test splits. † indicates ensemble models. B-n is BLEU score that uses up to n-grams. Higher is better in all columns. For future comparisons, our ROUGE-L/SPICE Flickr30k scores are 0.467/0.145 and the COCO scores are 0.549/0.194.

| Method | B-1 c5 | B-1 c40 | B-2 c5 | B-2 c40 | B-3 c5 | B-3 c40 | B-4 c5 | B-4 c40 | METEOR c5 | METEOR c40 | ROUGE-L c5 | ROUGE-L c40 | CIDEr c5 | CIDEr c40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Google NIC [27] | 0.713 | 0.895 | 0.542 | 0.802 | 0.407 | 0.694 | 0.309 | 0.587 | 0.254 | 0.346 | 0.530 | 0.682 | 0.943 | 0.946 |
| MS Captivator [8] | 0.715 | 0.907 | 0.543 | 0.819 | 0.407 | 0.710 | 0.308 | 0.601 | 0.248 | 0.339 | 0.526 | 0.680 | 0.931 | 0.937 |
| m-RNN [18] | 0.716 | 0.890 | 0.545 | 0.798 | 0.404 | 0.687 | 0.299 | 0.575 | 0.242 | 0.325 | 0.521 | 0.666 | 0.917 | 0.935 |
| LRCN [7] | 0.718 | 0.895 | 0.548 | 0.804 | 0.409 | 0.695 | 0.306 | 0.585 | 0.247 | 0.335 | 0.528 | 0.678 | 0.921 | 0.934 |
| Hard-Attention [30] | 0.705 | 0.881 | 0.528 | 0.779 | 0.383 | 0.658 | 0.277 | 0.537 | 0.241 | 0.322 | 0.516 | 0.654 | 0.865 | 0.893 |
| ATT-FCN [34] | 0.731 | 0.900 | 0.565 | 0.815 | 0.424 | 0.709 | 0.316 | 0.599 | 0.250 | 0.335 | 0.535 | 0.682 | 0.943 | 0.958 |
| ERD [32] | 0.720 | 0.900 | 0.550 | 0.812 | 0.414 | 0.705 | 0.313 | 0.597 | 0.256 | 0.347 | 0.533 | 0.686 | 0.965 | 0.969 |
| MSM [33] | 0.739 | 0.919 | 0.575 | 0.842 | 0.436 | 0.740 | 0.330 | 0.632 | 0.256 | 0.350 | 0.542 | 0.700 | 0.984 | 1.003 |
| Ours-Adaptive | 0.748 | 0.920 | 0.584 | 0.845 | 0.444 | 0.744 | 0.336 | 0.637 | 0.264 | 0.359 | 0.550 | 0.705 | 1.042 | 1.059 |

Table 2: Leaderboard of the published state-of-the-art image captioning models on the online COCO testing server. Our submission is an ensemble of 5 models trained with different initialization.
5. Results
5.1. Experiment Settings
We experiment with two datasets: Flickr30k [35] and COCO [16].
Flickr30k contains 31,783 images collected from Flickr. Most of these images depict humans performing various activities. Each image is paired with 5 crowd-sourced captions. We use the publicly available splits¹ containing 1,000 images for validation and test each.
COCO is the largest image captioning dataset, containing 82,783, 40,504 and 40,775 images for training, validation and test respectively. This dataset is more challenging, since most images contain multiple objects in the context of complex scenes. Each image has 5 human annotated captions. For offline evaluation, we use the same data split as in [32, 33, 34] containing 5000 images for validation and test each. For online evaluation on the COCO evaluation server, we reserve 2000 images from validation for development and the rest for training.
Pre-processing. We truncate captions longer than 18 words for COCO and 22 for Flickr30k. We then build a vocabulary of words that occur at least 5 and 3 times in the training set, resulting in 9567 and 7649 words for COCO and Flickr30k respectively.

¹ https://github.com/karpathy/neuraltalk
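The pre-processing step can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the function names are hypothetical, captions are assumed to be already tokenized, "truncate" is read as keeping the first `max_len` tokens, and counting word frequencies after truncation is a choice made for the example. Handling of out-of-vocabulary words is not covered.

```python
from collections import Counter

def truncate_caption(tokens, max_len):
    """Keep at most the first max_len tokens of a caption."""
    return tokens[:max_len]

def build_vocab(captions, min_count, max_len):
    """Return the sorted list of words occurring at least min_count times
    in the truncated training captions."""
    counts = Counter(
        tok for cap in captions for tok in truncate_caption(cap, max_len)
    )
    return sorted(word for word, n in counts.items() if n >= min_count)

# Tiny made-up corpus; the paper's settings would be
# (min_count=5, max_len=18) for COCO and (min_count=3, max_len=22) for Flickr30k.
captions = [["a", "dog", "runs"], ["a", "cat"], ["a", "dog"]]
vocab = build_vocab(captions, min_count=2, max_len=2)
assert vocab == ["a", "dog"]
```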
Compared Approaches: For offline evaluation on Flickr30k and COCO, we first compare our full model (Ours-Adaptive) with an ablated version (Ours-Spatial), which only performs the spatial attention. The goal of this comparison is to verify that our improvements are not the result of orthogonal contributions (e.g. better CNN features or better optimization). We further compare our method with DeepVS [11], Hard-Attention [30] and recently proposed ATT [34], ERD [32] and the best-performing method (LSTM-A$_5$) of MSM [33]. For online evaluation, we compare our method with Google NIC [27], MS Captivator [8], m-RNN [18], LRCN [7], Hard-Attention [30], ATT-FCN [34], ERD [32] and MSM [33].
5.2. Quantitative Analysis
We report results using the COCO captioning evaluation tool [16], which reports the following metrics: BLEU [21], Meteor [5], Rouge-L [15] and CIDEr [26]. We also report results using the new metric SPICE [1], which was found to better correlate with human judgments.
Table 1 shows results on the Flickr30k and COCO datasets. Comparing the full model w.r.t. ablated versions without visual sentinel verifies the effectiveness of the pro-
References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations

Book ChapterDOI
06 Sep 2014
TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

30,462 citations


"Knowing When to Look: Adaptive Atte..." refers methods in this paper

  • ...We experiment with two datasets: Flickr30k [35] and COCO [16]....


  • ...We report results using the COCO captioning evaluation tool [16], which reports the following metrics: BLEU [21], Meteor [5], Rouge-L [15] and CIDEr [26]....


Proceedings ArticleDOI
06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Abstract: Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that cannot be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.

21,126 citations

Proceedings Article
01 Jan 2015
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Abstract: Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

20,027 citations


"Knowing When to Look: Adaptive Atte..." refers background in this paper

  • ...Neural-based approaches are inspired by the success of sequence-to-sequence encoder-decoder frameworks in machine translation [4, 24, 2] with the view that image captioning is analogous to translating images to text....


Proceedings ArticleDOI
01 Jan 2014
TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
Abstract: In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

19,998 citations