Under review as a conference paper at ICLR 2017
BOOSTING IMAGE CAPTIONING WITH ATTRIBUTES
Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, Tao Mei
Microsoft Research Asia
{tiyao, v-yipan, v-yehl, v-zhqiu, tmei}@microsoft.com
ABSTRACT
Automatically describing an image in natural language has been an emerging
challenge in both computer vision and natural language processing.
In this paper, we present Long Short-Term Memory with Attributes (LSTM-A),
a novel architecture that integrates attributes into the successful Convolutional
Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning
framework by training them in an end-to-end manner. To incorporate
attributes, we construct variants of architectures by feeding image representations
and attributes into RNNs in different ways to explore the mutual but also fuzzy
relationship between them. Extensive experiments are conducted on the COCO image
captioning dataset, and our framework achieves superior results compared
to state-of-the-art deep models. Most remarkably, we obtain METEOR/CIDEr-D
of 25.2%/98.6% on the testing data of the widely used and publicly available splits in
(Karpathy & Fei-Fei, 2015) when extracting image representations by GoogleNet,
and achieve, to date, top-1 performance on the COCO captioning Leaderboard.
1 INTRODUCTION
Accelerated by a tremendous increase in Internet bandwidth and the proliferation of sensor-rich mobile
devices, image data has been generated, published and spread explosively, becoming an indispensable
part of today’s big data. This has encouraged the development of advanced techniques for a
broad range of image understanding applications. A fundamental issue that underlies the success
of these technological advances is recognition (Szegedy et al., 2015; Simonyan & Zisserman,
2015; He et al., 2016). Recently, researchers have strived to automatically describe the content of
an image with a complete and natural sentence, which has great potential impact on, for instance,
robotic vision or helping visually impaired people. Nevertheless, this problem is very challenging,
as a description generation model should not only capture the objects or scenes presented in the image,
but also be capable of expressing how the objects/scenes relate to each other in a natural sentence.
The main inspiration of recent attempts on this problem (Donahue et al., 2015; Vinyals et al., 2015;
Xu et al., 2015; You et al., 2016) comes from the advances of using RNNs in machine translation
(Sutskever et al., 2014), i.e., translating a text from one language (e.g., English) to another
(e.g., Chinese). The basic idea is to perform sequence to sequence learning for translation, where
an encoder RNN reads the input sentence one word at a time until the end of the sentence,
and then a decoder RNN generates the sentence in the target language, one word at each
time step. Following this philosophy, it is natural to employ a CNN instead of the encoder RNN for
image captioning, where the CNN is regarded as an image encoder that produces image representations.
While encouraging performances are reported, these CNN plus RNN image captioning methods
translate directly from image representations to language, without explicitly taking more high-level
semantic information from images into account. Attributes, in turn, are properties observed in
images with rich semantic cues and have been proven effective in visual recognition (Parikh &
Grauman, 2011). A valid question is how to incorporate high-level image attributes into the CNN plus
RNN image captioning architecture as complementary knowledge in addition to image representations.
In this paper, we particularly investigate architectures that exploit the mutual relationship
between image representations and attributes to enhance image description generation.
Specifically, to better demonstrate the impact of simultaneously utilizing the two kinds of representations,
we devise variants of architectures by feeding them into the RNN in different placements and moments,
e.g., leveraging only attributes, inserting image representations first and then attributes or vice versa,
and inputting image representations/attributes once or at each time step.
The main contribution of this work is the proposal of attribute augmented architectures that integrate
attributes into the CNN plus RNN image captioning framework, which is a problem not yet fully
understood in the literature. By leveraging more knowledge for building richer representations and
description models, our work takes a further step forward to enhance image captioning and could
indicate a new direction for vision and language research. More importantly,
the utilization of attributes also has great potential to be an elegant solution for generating
open-vocabulary sentences, making image captioning systems truly practical.
2 RELATED WORK
The research on image captioning has proceeded along three different dimensions: template-based
methods (Kulkarni et al., 2013; Yang et al., 2011; Mitchell et al., 2012), search-based approaches
(Farhadi et al., 2010; Ordonez et al., 2011; Devlin et al., 2015), and language-based models (Don-
ahue et al., 2015; Kiros et al., 2014; Mao et al., 2014; Vinyals et al., 2015; Xu et al., 2015; Wu et al.,
2016; You et al., 2016).
The first direction, template-based methods, predefines a template for sentence generation which
follows specific rules of language grammar and splits the sentence into several parts (e.g., subject,
verb, and object). With such sentence fragments, many works align each part with image content
and then generate the sentence for the image. Obviously, most of them depend heavily on the sentence
templates and always generate sentences with a fixed syntactical structure. For example, Kulkarni
et al. employ a Conditional Random Field (CRF) model to predict labeling based on the detected
objects, attributes, and prepositions, and then generate a sentence with a template by filling in slots
with the most likely labeling (Kulkarni et al., 2013). Similar in spirit, Yang et al. utilize a Hidden
Markov Model (HMM) to select the best objects, scenes, verbs, and prepositions with the highest
log-likelihood ratio for template-based sentence generation (Yang et al., 2011). Furthermore, the
traditional simple template is extended to syntactic trees in (Mitchell et al., 2012), which also starts
by detecting attributes from the image as description anchors and then connects ordered objects
with a syntactically well-formed tree, followed by adding necessary descriptive information.
Search-based approaches “generate” a sentence for an image by selecting the most semantically
similar sentences from a sentence pool or directly copying sentences from other visually similar images.
This direction can indeed achieve human-level descriptions, as all sentences come from existing
human-generated sentences. The need to collect human-generated sentences, however, makes the sentence
pool hard to scale up. Moreover, the approaches in this dimension cannot generate novel
descriptions. For instance, in (Farhadi et al., 2010), an intermediate meaning space based on the triplet
of object, action, and scene is proposed to measure the similarity between image and sentence,
where the top sentences are regarded as the generated sentences for the target image. Ordonez et al.
(Ordonez et al., 2011) search images in a large captioned photo collection by using the combination
of object, stuff, people, and scene information and transfer the associated sentences to the query
image. Recently, a simple k-nearest neighbor retrieval model is utilized in (Devlin et al., 2015) and
the best or consensus caption is selected from the returned candidate captions, which even performs
as well as several state-of-the-art language-based models.
Different from template-based and search-based models, language-based models aim to learn the
probability distribution in the common space of visual content and textual sentence to generate
novel sentences with more flexible syntactical structures. In this direction, recent works explore such
probability distribution mainly using neural networks for image captioning. For instance, Vinyals
et al. (Vinyals et al., 2015) propose an end-to-end neural network architecture that utilizes LSTM
to generate a sentence for an image, which is further incorporated with an attention mechanism in (Xu
et al., 2015) to automatically focus on salient objects when generating corresponding words. More
recently, in (Wu et al., 2016), high-level concepts/attributes are shown to obtain clear improvements
on image captioning when injected into an existing state-of-the-art RNN-based model, and such visual
attributes are further utilized as semantic attention in (You et al., 2016) to enhance image captioning.
In short, our work in this paper belongs to the language-based models. Different from most of
the aforementioned language-based models which mainly focus on sentence generation by solely
depending on image representations (Donahue et al., 2015; Kiros et al., 2014; Mao et al., 2014;
Vinyals et al., 2015; Xu et al., 2015) or high-level attributes (Wu et al., 2016), our work contributes
by studying not only how to jointly exploit image representations and attributes for image captioning,
but also how the architecture can be better devised by exploring the mutual relationship between them. It
is also worth noting that (You et al., 2016) additionally involves attributes for image captioning.
Ours is fundamentally different in that (You et al., 2016) utilizes attributes
to model semantic attention to the locally previous words, as opposed to holistically employing
attributes as a kind of complementary representation as in this work.
3 BOOSTING IMAGE CAPTIONING WITH ATTRIBUTES
In this paper, we devise our CNN plus RNN architectures to generate descriptions for images under
the umbrella of additionally incorporating the detected high-level attributes. Specifically, we begin
this section by presenting the problem formulation, followed by five variants of our image
captioning framework with attributes.
3.1 PROBLEM FORMULATION
Suppose we have an image $I$ to be described by a textual sentence $S$, where $S = \{w_1, w_2, \ldots, w_{N_s}\}$
consists of $N_s$ words. Let $I \in \mathbb{R}^{D_v}$ and $w_t \in \mathbb{R}^{D_s}$ denote the $D_v$-dimensional
representation of the image and the $D_s$-dimensional textual features of the $t$-th word in sentence $S$,
respectively. Furthermore, we have a feature vector $A \in \mathbb{R}^{D_a}$ to represent the probability distribution
over the high-level attributes for image $I$. Specifically, we train the attribute detectors by using the
weakly-supervised approach of Multiple Instance Learning (MIL) in (Fang et al., 2015) on image
captioning benchmarks. For an attribute $w_a$, one image $I$ is regarded as a positive bag of regions
(instances) if $w_a$ exists in image $I$’s ground-truth sentences, and as a negative bag otherwise. By inputting
all the bags into a noisy-OR MIL model, the probability that the bag $b_I$ contains attribute $w_a$ is
measured on the probabilities of all the regions in the bag as
$$\Pr\nolimits_{I}^{w_a} = 1 - \prod_{r_i \in b_I} \left(1 - p_i^{w_a}\right), \qquad (1)$$
where $p_i^{w_a}$ is the probability of the attribute $w_a$ predicted by region $r_i$ and can be calculated through
a sigmoid layer after the last convolutional layer in the fully convolutional network. In particular, the
dimension of convolutional activations from the last convolutional layer is $x \times x \times h$, where $h$ represents
the representation dimension of each region, resulting in an $x \times x$ response map which preserves the
spatial dependency of the image. Then, a cross entropy loss is calculated based on the probabilities
of all the attributes at the top of the whole architecture to optimize the MIL model. With the learnt MIL
model on the image captioning dataset, we treat the final image-level response probabilities of all the
attributes as $A$.
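As an illustration of Eq. (1) only, the noisy-OR combination can be sketched in a few lines of dependency-free Python. The function name `noisy_or` and the example response values are assumptions made for this sketch; they are not taken from the attribute detector of (Fang et al., 2015).

```python
def noisy_or(region_probs):
    """Bag-level probability of one attribute from per-region probabilities (Eq. 1)."""
    prod = 1.0
    for p in region_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("region probabilities must lie in [0, 1]")
        prod *= 1.0 - p  # probability that this region does NOT show the attribute
    return 1.0 - prod


# Hypothetical 2 x 2 response map for a single attribute, flattened to four regions.
regions = [0.1, 0.2, 0.5, 0.0]
bag_prob = noisy_or(regions)  # 1 - 0.9 * 0.8 * 0.5 * 1.0
```

Under this combination a single confident region is enough to push the image-level probability towards one, which matches the bag-of-regions view in which an attribute only needs to appear somewhere in the image.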
Inspired by the recent successes of probabilistic sequence models leveraged in statistical machine
translation (Bahdanau et al., 2015; Sutskever et al., 2014), we aim to formulate our image captioning
models in an end-to-end fashion based on RNNs which encode the given image and/or its detected
attributes into a fixed dimensional vector and then decode it to the target output sentence. Hence,
the sentence generation problem we explore here can be formulated as minimizing the following
energy loss function:
$$E(I, A, S) = -\log \Pr(S \mid I, A), \qquad (2)$$
which is the negative log probability of the correct textual sentence given the image representations
and detected attributes.
Since the model produces one word in the sentence at each time step, it is natural to apply the chain rule
to model the joint probability over the sequential words. Thus, the log probability of the sentence is
given by the sum of the log probabilities over the words and can be expressed as
$$\log \Pr(S \mid I, A) = \sum_{t=1}^{N_s} \log \Pr(w_t \mid I, A, w_0, \ldots, w_{t-1}). \qquad (3)$$
By minimizing this loss, the contextual relationship among the words in the sentence can be guaranteed
given the image and its detected attributes.
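The relation between the energy in Eq. (2) and the per-word factorization in Eq. (3) can be checked numerically with a minimal sketch. It assumes the model has already produced, for each time step, the probability it assigns to the correct word; the function name `sentence_energy` and the probability values are hypothetical.

```python
import math


def sentence_energy(step_probs):
    """Negative log-likelihood of a sentence, summed over time steps (Eqs. 2-3).

    step_probs[t] stands for Pr(w_t | I, A, w_0, ..., w_{t-1}) of the correct word.
    """
    return -sum(math.log(p) for p in step_probs)


# Hypothetical probabilities of the ground-truth words at three time steps.
step_probs = [0.5, 0.25, 0.8]
energy = sentence_energy(step_probs)
# exp(-energy) recovers the joint probability of the whole sentence.
joint_prob = math.exp(-energy)
```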
Figure 1: Five variants of our LSTM-A framework (better viewed in color).
We formulate this task as a variable-length sequence to sequence problem and model the parametric
distribution $\Pr(w_t \mid I, A, w_0, \ldots, w_{t-1})$ in Eq. (3) with Long Short-Term Memory (LSTM), which
is a widely used type of RNN. The vector formulas for an LSTM layer forward pass are summarized
below. For time step $t$, $x_t$ and $h_t$ are the input and output vectors respectively, $T$ are input weight
matrices, $R$ are recurrent weight matrices and $b$ are bias vectors. Sigmoid $\sigma$ and hyperbolic tangent
$\phi$ are element-wise non-linear activation functions. The element-wise product of two vectors is denoted with
$\odot$. Given inputs $x_t$, $h_{t-1}$ and $c_{t-1}$, the LSTM unit updates for time step $t$ are:
$$
\begin{aligned}
g_t &= \phi(T_g x_t + R_g h_{t-1} + b_g), & i_t &= \sigma(T_i x_t + R_i h_{t-1} + b_i), \\
f_t &= \sigma(T_f x_t + R_f h_{t-1} + b_f), & c_t &= g_t \odot i_t + c_{t-1} \odot f_t, \\
o_t &= \sigma(T_o x_t + R_o h_{t-1} + b_o), & h_t &= \phi(c_t) \odot o_t,
\end{aligned}
$$
where $g_t$, $i_t$, $f_t$, $c_t$, $o_t$, and $h_t$ are cell input, input gate, forget gate, cell state, output gate, and cell
output of the LSTM, respectively.
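The update equations above translate almost line by line into code. The following is a minimal sketch in plain Python (lists instead of tensors) of a single LSTM step; the helper names `matvec`, `sigmoid`, `lstm_step` and the `params` layout are assumptions of this sketch, not part of the paper.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update; params maps 'g', 'i', 'f', 'o' to a (T, R, b) triple."""
    def pre(name):
        T, R, b = params[name]
        # T x_t + R h_{t-1} + b for the requested gate
        return [tx + rh + bk
                for tx, rh, bk in zip(matvec(T, x), matvec(R, h_prev), b)]

    g = [math.tanh(z) for z in pre("g")]  # cell input
    i = [sigmoid(z) for z in pre("i")]    # input gate
    f = [sigmoid(z) for z in pre("f")]    # forget gate
    o = [sigmoid(z) for z in pre("o")]    # output gate
    # c_t = g_t * i_t + c_{t-1} * f_t (element-wise)
    c = [gk * ik + ck * fk for gk, ik, ck, fk in zip(g, i, c_prev, f)]
    # h_t = tanh(c_t) * o_t (element-wise)
    h = [math.tanh(ck) * ok for ck, ok in zip(c, o)]
    return h, c


# Toy usage: 2-dimensional input, 3-dimensional state, all-zero parameters.
zeros = lambda rows, cols: [[0.0] * cols for _ in range(rows)]
params = {k: (zeros(3, 2), zeros(3, 3), [0.0] * 3) for k in "gifo"}
h1, c1 = lstm_step([1.0, -1.0], [0.0] * 3, [1.0, 2.0, 3.0], params)
```

With all-zero parameters every gate evaluates to 0.5 and the cell input to 0, so the new cell state is simply half of the previous one, which gives a quick sanity check of the wiring.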
3.2 LONG SHORT-TERM MEMORY WITH ATTRIBUTES
Unlike the existing image captioning models in (Donahue et al., 2015; Vinyals et al., 2015) which
solely encode image representations for sentence generation, our proposed Long Short-Term Memory
with Attributes (LSTM-A) model additionally integrates the detected high-level attributes into
LSTM. We devise five variants of LSTM-A to serve two design purposes. The first purpose
concerns where to feed attributes into LSTM, and three architectures, i.e., LSTM-A$_1$ (leveraging
only attributes), LSTM-A$_2$ (inserting image representations first) and LSTM-A$_3$ (feeding attributes
first), are derived from this view. The second concerns when to input attributes or image
representations into LSTM, and we design LSTM-A$_4$ (inputting image representations at each time step)
and LSTM-A$_5$ (inputting attributes at each time step) for this purpose. An overview of the LSTM-A
architectures is depicted in Figure 1.
3.2.1 LSTM-A$_1$ (LEVERAGING ONLY ATTRIBUTES)
Given the detected attributes, one natural way is to directly inject the attributes as representations at
the initial time step to inform the LSTM about the high-level attributes. This kind of architecture with
only attributes as input is named LSTM-A$_1$. It is also worth noting that the attributes-based model
in (Wu et al., 2016) is similar to LSTM-A$_1$ and can be regarded as one special case of our LSTM-A.
Given the attribute representations $A$ and the corresponding sentence $W \equiv [w_0, w_1, \ldots, w_{N_s}]$, the
LSTM updating procedure in LSTM-A$_1$ is
$$x_{-1} = T_a A,$$
$$x_t = T_s w_t,\ t \in \{0, \ldots, N_s - 1\} \quad \text{and} \quad h_t = f(x_t),\ t \in \{0, \ldots, N_s - 1\},$$
where $D_e$ is the dimensionality of the LSTM input, $T_a \in \mathbb{R}^{D_e \times D_a}$ and $T_s \in \mathbb{R}^{D_e \times D_s}$ are the
transformation matrices for attribute representations and textual features of words, respectively, and $f$ is the
updating function within the LSTM unit. Please note that for the input sentence $W \equiv [w_0, \ldots, w_{N_s}]$,
we take $w_0$ as the start sign word to mark the beginning of the sentence and $w_{N_s}$ as the end sign
word which indicates the end of the sentence. Both of the special sign words are included in our
vocabulary. More specifically, at the initial time step, the attribute representations are transformed
as the input to the LSTM, and then in the next steps, the word embedding $x_t$ is input into the
LSTM along with the previous step’s hidden state $h_{t-1}$. In each time step (except the initial step),
we use the LSTM cell output $h_t$ to predict the next word. Here a softmax layer is applied after
the LSTM layer to produce a probability distribution over all the $D_s$ words in the vocabulary as
$$\Pr\nolimits_{t+1}(w_{t+1}) = \frac{\exp\left\{T_h^{(w_{t+1})} h_t\right\}}{\sum_{w \in \mathcal{W}} \exp\left\{T_h^{(w)} h_t\right\}},$$
where $\mathcal{W}$ is the word vocabulary space and $T_h^{(w)}$ is the parameter matrix in the softmax layer.
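The softmax prediction step can be sketched as follows, assuming each vocabulary word owns one row of the softmax parameter matrix so that its score is the inner product of that row with the cell output. The function name `word_distribution` is hypothetical, and subtracting the maximum score before exponentiating is a common numerical-stability choice of this sketch rather than something stated in the paper.

```python
import math


def word_distribution(T_h, h):
    """Softmax over the vocabulary; row k of T_h scores word k against h_t."""
    scores = [sum(w * x for w, x in zip(row, h)) for row in T_h]
    top = max(scores)  # shift scores so the largest exponent is exp(0)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: a three-word vocabulary and a two-dimensional cell output.
T_h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = word_distribution(T_h, [2.0, 1.0])
next_word = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
```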
3.2.2 LSTM-A$_2$ (INSERTING IMAGE REPRESENTATIONS FIRST)
To further leverage both image representations and high-level attributes in the encoding stage of our
LSTM-A, we design the second architecture LSTM-A$_2$ by treating both of them as atoms in the input
sequence to the LSTM. Specifically, at the initial step, the image representations $I$ are first transformed
and fed into the LSTM to inform it about the image content, followed by the attribute representations
$A$, which are encoded into the LSTM at the next time step to inform it about the high-level attributes. Then,
the LSTM decodes each output word based on the previous word $x_t$ and the previous step’s hidden state $h_{t-1}$,
which is similar to the decoding stage in LSTM-A$_1$. The LSTM updating procedure in LSTM-A$_2$ is
designed as
$$x_{-2} = T_v I \quad \text{and} \quad x_{-1} = T_a A,$$
$$x_t = T_s w_t,\ t \in \{0, \ldots, N_s - 1\} \quad \text{and} \quad h_t = f(x_t),\ t \in \{0, \ldots, N_s - 1\},$$
where $T_v \in \mathbb{R}^{D_e \times D_v}$ is the transformation matrix for image representations.
3.2.3 LSTM-A$_3$ (FEEDING ATTRIBUTES FIRST)
The third design LSTM-A$_3$ is similar to LSTM-A$_2$ as both designs utilize image representations
and high-level attributes to form the input sequence to the LSTM in the encoding stage, except that the
order of encoding is different. In LSTM-A$_3$, the attribute representations are first encoded into the
LSTM and then the image representations are transformed and fed into the LSTM at the second time step. The
whole LSTM updating procedure in LSTM-A$_3$ is
$$x_{-2} = T_a A \quad \text{and} \quad x_{-1} = T_v I,$$
$$x_t = T_s w_t,\ t \in \{0, \ldots, N_s - 1\} \quad \text{and} \quad h_t = f(x_t),\ t \in \{0, \ldots, N_s - 1\}.$$
3.2.4 LSTM-A$_4$ (INPUTTING IMAGE REPRESENTATIONS AT EACH TIME STEP)
Different from the former three architectures, which mainly inject high-level attributes
and image representations at the encoding stage of the LSTM, we next modify the decoding stage in
our LSTM-A by additionally incorporating image representations or high-level attributes. More
specifically, in LSTM-A$_4$, the attribute representations are injected once at the initial step to inform
the LSTM about the high-level attributes, and then image representations are fed at each time step as
an extra input to the LSTM to emphasize the image content frequently among the memory cells in the LSTM.
Hence, the LSTM updating procedure in LSTM-A$_4$ is:
$$x_{-1} = T_a A,$$
$$x_t = T_s w_t + T_v I,\ t \in \{0, \ldots, N_s - 1\} \quad \text{and} \quad h_t = f(x_t),\ t \in \{0, \ldots, N_s - 1\}.$$
3.2.5 LSTM-A$_5$ (INPUTTING ATTRIBUTES AT EACH TIME STEP)
The last design LSTM-A$_5$ is similar to LSTM-A$_4$ except that it first encodes image representations
and then feeds attribute representations as an additional input to the LSTM at each step in the decoding
stage to emphasize the high-level attributes frequently. Accordingly, the LSTM updating procedure
in LSTM-A$_5$ is
$$x_{-1} = T_v I,$$
$$x_t = T_s w_t + T_a A,\ t \in \{0, \ldots, N_s - 1\} \quad \text{and} \quad h_t = f(x_t),\ t \in \{0, \ldots, N_s - 1\}.$$
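The five variants differ only in which vectors are fed to the LSTM before decoding and what is added to each word embedding during decoding. A minimal sketch of that input construction is given below; the function name `build_inputs`, the list-based vectors, and the split into "encoder" and "decoder" inputs are illustrative assumptions, and the LSTM update $f$ itself is omitted.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vadd(a, b):
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def build_inputs(variant, I, A, words, T_v, T_a, T_s):
    """Return (encoder_inputs, decoder_inputs) for LSTM-A_1 ... LSTM-A_5.

    words is [w_0, ..., w_{N_s - 1}]; encoder inputs are x_{-2} / x_{-1},
    decoder inputs are x_0, ..., x_{N_s - 1}.
    """
    img = matvec(T_v, I)                      # T_v I
    att = matvec(T_a, A)                      # T_a A
    emb = [matvec(T_s, w) for w in words]     # T_s w_t
    if variant == 1:   # attributes only
        return [att], emb
    if variant == 2:   # image first, then attributes
        return [img, att], emb
    if variant == 3:   # attributes first, then image
        return [att, img], emb
    if variant == 4:   # attributes once, image added at each decoding step
        return [att], [vadd(e, img) for e in emb]
    if variant == 5:   # image once, attributes added at each decoding step
        return [img], [vadd(e, att) for e in emb]
    raise ValueError("variant must be an integer from 1 to 5")


# Toy usage with 2-dimensional features and identity transformations.
eye = [[1, 0], [0, 1]]
enc, dec = build_inputs(4, [1, 0], [0, 1], [[1, 1], [2, 2]], eye, eye, eye)
```

Keeping the variants behind one function makes it explicit that the recurrent unit and the softmax layer are shared across all five designs, and only the input schedule changes.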