Boosting Image Captioning with Attributes

doi:10.1109/ICCV.2017.524

Under review as a conference paper at ICLR 2017

Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, Tao Mei

Microsoft Research Asia

{tiyao, v-yipan, v-yehl, v-zhqiu, tmei}@microsoft.com

ABSTRACT

Automatically describing an image in natural language has been an emerging challenge in both computer vision and natural language processing. In this paper, we present Long Short-Term Memory with Attributes (LSTM-A), a novel architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning framework by training them in an end-to-end manner. To incorporate attributes, we construct variants of architectures by feeding image representations and attributes into RNNs in different ways to explore the mutual but also fuzzy relationship between them. Extensive experiments are conducted on the COCO image captioning dataset, and our framework achieves superior results when compared to state-of-the-art deep models. Most remarkably, we obtain METEOR/CIDEr-D of 25.2%/98.6% on the testing data of the widely used and publicly available splits in (Karpathy & Fei-Fei, 2015) when extracting image representations by GoogleNet, and achieve, to date, top-1 performance on the COCO captioning leaderboard.

1 INTRODUCTION

Accelerated by the tremendous increase in Internet bandwidth and the proliferation of sensor-rich mobile devices, image data has been generated, published and spread explosively, becoming an indispensable part of today's big data. This has encouraged the development of advanced techniques for a broad range of image understanding applications. A fundamental issue that underlies the success of these technological advances is recognition (Szegedy et al., 2015; Simonyan & Zisserman, 2015; He et al., 2016). Recently, researchers have strived to automatically describe the content of an image with a complete and natural sentence, which has great potential impact, for instance, on robotic vision or on helping visually impaired people. Nevertheless, this problem is very challenging, as a description generation model should not only capture the objects or scenes presented in the image, but also be capable of expressing how the objects/scenes relate to each other in a natural sentence.

The main inspiration for recent attempts on this problem (Donahue et al., 2015; Vinyals et al., 2015; Xu et al., 2015; You et al., 2016) comes from the advances achieved by using RNNs in machine translation (Sutskever et al., 2014), which translates a text from one language (e.g., English) to another (e.g., Chinese). The basic idea is to perform sequence-to-sequence learning for translation, where an encoder RNN reads the input sentence one word at a time until the end of the sentence, and then a decoder RNN is exploited to generate the sentence in the target language, one word at each time step. Following this philosophy, it is natural to employ a CNN instead of the encoder RNN for image captioning, where the CNN is regarded as an image encoder that produces image representations.

While encouraging performances are reported, these CNN plus RNN image captioning methods translate directly from image representations to language, without explicitly taking more high-level semantic information from images into account. Furthermore, attributes are properties observed in images with rich semantic cues and have been proven effective in visual recognition (Parikh & Grauman, 2011). A valid question is how to incorporate high-level image attributes into the CNN plus RNN image captioning architecture as complementary knowledge in addition to image representations. In this paper, we particularly investigate architectures that exploit the mutual relationship between image representations and attributes to enhance image description generation. Specifically, to better demonstrate the impact of simultaneously utilizing the two kinds of representations, we devise variants of architectures by feeding them into the RNN in different placements and at different moments, e.g., leveraging only attributes, inserting image representations first and then attributes or vice versa, and inputting image representations/attributes once or at each time step.

The main contribution of this work is the proposal of attribute-augmented architectures that integrate attributes into the CNN plus RNN image captioning framework, which is a problem not yet fully understood in the literature. By leveraging more knowledge for building richer representations and description models, our work takes a further step forward in enhancing image captioning and could indicate a new direction for vision and language research. More importantly, the utilization of attributes also has great potential to be an elegant solution for generating open-vocabulary sentences, making image captioning systems truly practical.

2 RELATED WORK

The research on image captioning has proceeded along three different dimensions: template-based methods (Kulkarni et al., 2013; Yang et al., 2011; Mitchell et al., 2012), search-based approaches (Farhadi et al., 2010; Ordonez et al., 2011; Devlin et al., 2015), and language-based models (Donahue et al., 2015; Kiros et al., 2014; Mao et al., 2014; Vinyals et al., 2015; Xu et al., 2015; Wu et al., 2016; You et al., 2016).

The first direction, template-based methods, predefines a template for sentence generation that follows specific rules of language grammar and splits the sentence into several parts (e.g., subject, verb, and object). With such sentence fragments, many works align each part with image content and then generate the sentence for the image. Obviously, most of them depend heavily on the sentence templates and always generate sentences with the syntactical structure that the template prescribes. For example, Kulkarni et al. employ a Conditional Random Field (CRF) model to predict the labeling based on the detected objects, attributes, and prepositions, and then generate a sentence with a template by filling in slots with the most likely labeling (Kulkarni et al., 2013). Similar in spirit, Yang et al. utilize a Hidden Markov Model (HMM) to select the best objects, scenes, verbs, and prepositions with the highest log-likelihood ratio for template-based sentence generation (Yang et al., 2011). Furthermore, the traditional simple template is extended to syntactic trees in (Mitchell et al., 2012), which also starts by detecting attributes from the image as description anchors and then connects ordered objects with a syntactically well-formed tree, followed by adding necessary descriptive information.

Search-based approaches “generate” a sentence for an image by selecting the most semantically similar sentences from a sentence pool or by directly copying sentences from other visually similar images. This direction can indeed achieve human-level descriptions, as all sentences come from existing human-generated sentences. The need to collect human-generated sentences, however, makes the sentence pool hard to scale up. Moreover, the approaches in this dimension cannot generate novel descriptions. For instance, in (Farhadi et al., 2010), an intermediate meaning space based on the triplet of object, action, and scene is proposed to measure the similarity between image and sentence, where the top sentences are regarded as the generated sentences for the target image. Ordonez et al. (Ordonez et al., 2011) search images in a large captioned photo collection by using the combination of object, stuff, people, and scene information and transfer the associated sentences to the query image. Recently, a simple k-nearest neighbor retrieval model is utilized in (Devlin et al., 2015), and the best or consensus caption is selected from the returned candidate captions, which performs as well as several state-of-the-art language-based models.

Different from template-based and search-based models, language-based models aim to learn the probability distribution in the common space of visual content and textual sentence to generate novel sentences with more flexible syntactical structures. In this direction, recent works explore such probability distributions mainly using neural networks for image captioning. For instance, Vinyals et al. (Vinyals et al., 2015) propose an end-to-end neural network architecture that utilizes LSTM to generate a sentence for an image, which is further combined with an attention mechanism in (Xu et al., 2015) to automatically focus on salient objects when generating the corresponding words. More recently, in (Wu et al., 2016), high-level concepts/attributes are shown to yield clear improvements on image captioning when injected into an existing state-of-the-art RNN-based model, and such visual attributes are further utilized as semantic attention in (You et al., 2016) to enhance image captioning.

In short, our work in this paper belongs to the language-based models. Different from most of the aforementioned language-based models, which mainly focus on sentence generation by solely depending on image representations (Donahue et al., 2015; Kiros et al., 2014; Mao et al., 2014; Vinyals et al., 2015; Xu et al., 2015) or high-level attributes (Wu et al., 2016), our work contributes by studying not only how to jointly exploit image representations and attributes for image captioning, but also how the architecture can be better devised by exploring the mutual relationship between them. It is also worth noting that (You et al., 2016) additionally involves attributes for image captioning. Ours is fundamentally different in that (You et al., 2016) utilizes attributes to model semantic attention to the locally previous words, as opposed to holistically employing attributes as a kind of complementary representation as in this work.

3 BOOSTING IMAGE CAPTIONING WITH ATTRIBUTES

In this paper, we devise our CNN plus RNN architectures to generate descriptions for images under the umbrella of additionally incorporating the detected high-level attributes. Specifically, we begin this section by presenting the problem formulation, followed by five variants of our image captioning framework with attributes.

3.1 PROBLEM FORMULATION

Suppose we have an image $I$ to be described by a textual sentence $S$, where $S = \{w_1, w_2, \ldots, w_{N_s}\}$ consists of $N_s$ words. Let $I \in \mathbb{R}^{D_v}$ and $w_t \in \mathbb{R}^{D_s}$ denote the $D_v$-dimensional image representations of the image $I$ and the $D_s$-dimensional textual features of the $t$-th word in sentence $S$, respectively. Furthermore, we have a feature vector $A \in \mathbb{R}^{D_a}$ to represent the probability distribution over the high-level attributes for image $I$. Specifically, we train the attribute detectors by using the weakly-supervised approach of Multiple Instance Learning (MIL) in (Fang et al., 2015) on image captioning benchmarks. For an attribute $w_a$, one image $I$ is regarded as a positive bag of regions (instances) if $w_a$ exists in image $I$'s ground-truth sentences, and as a negative bag otherwise. By inputting all the bags into a noisy-OR MIL model, the probability that the bag $b_I$ contains attribute $w_a$ is measured on the probabilities of all the regions in the bag as

$$\mathrm{Pr}_I^{w_a} = 1 - \prod_{r_i \in b_I} \left(1 - p_i^{w_a}\right), \tag{1}$$

where $p_i^{w_a}$ is the probability of the attribute $w_a$ predicted by region $r_i$ and can be calculated through a sigmoid layer after the last convolutional layer in the fully convolutional network. In particular, the dimension of convolutional activations from the last convolutional layer is $x \times x \times h$, where $h$ represents the representation dimension of each region, resulting in an $x \times x$ response map which preserves the spatial dependency of the image. Then, a cross entropy loss is calculated based on the probabilities of all the attributes at the top of the whole architecture to optimize the MIL model. With the MIL model learnt on the image captioning dataset, we treat the final image-level response probabilities of all the attributes as $A$.
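As an illustration of the noisy-OR combination in Eq. (1), the following minimal Python sketch computes the bag-level probability of one attribute from its per-region probabilities. The function name `noisy_or_bag_probability` and the toy region probabilities are assumptions made for illustration only; they are not taken from (Fang et al., 2015) or from the original implementation.

```python
from math import prod


def noisy_or_bag_probability(region_probs):
    """Eq. (1): probability that a bag (image) contains an attribute.

    region_probs holds p_i^{w_a}, the probability of the attribute for
    each region r_i in the bag, each value lying in [0, 1].
    """
    # The bag is negative only if every single region is negative.
    return 1.0 - prod(1.0 - p for p in region_probs)


# Hypothetical sigmoid outputs of one attribute on a 2 x 2 response map.
toy_region_probs = [0.1, 0.5, 0.2, 0.0]
print(noisy_or_bag_probability(toy_region_probs))
```

A single confident region is enough to push the bag probability close to 1, which is the behaviour expected from weak image-level supervision.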

Inspired by the recent successes of probabilistic sequence models leveraged in statistical machine translation (Bahdanau et al., 2015; Sutskever et al., 2014), we aim to formulate our image captioning models in an end-to-end fashion based on RNNs which encode the given image and/or its detected attributes into a fixed dimensional vector and then decode it to the target output sentence. Hence, the sentence generation problem we explore here can be formulated by minimizing the following energy loss function as

$$E(I, A, S) = -\log \Pr(S \mid I, A), \tag{2}$$

which is the negative log probability of the correct textual sentence given the image representations and detected attributes.

Since the model produces one word in the sentence at each time step, it is natural to apply the chain rule to model the joint probability over the sequential words. Thus, the log probability of the sentence is given by the sum of the log probabilities over the words and can be expressed as

$$\log \Pr(S \mid I, A) = \sum_{t=1}^{N_s} \log \Pr(w_t \mid I, A, w_0, \ldots, w_{t-1}). \tag{3}$$

By minimizing this loss, the contextual relationship among the words in the sentence can be guaranteed given the image and its detected attributes.
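To make the relation between Eq. (2) and Eq. (3) concrete, the sketch below evaluates the energy of a sentence from the per-step probabilities of its ground-truth words. The helper name `sentence_energy` and the probability values are hypothetical; in practice these probabilities would come from the softmax output of the decoder.

```python
import math


def sentence_energy(step_probs):
    """E(I, A, S) = -sum_t log Pr(w_t | I, A, w_0, ..., w_{t-1})."""
    return -sum(math.log(p) for p in step_probs)


# Hypothetical probabilities assigned to the correct words w_1, ..., w_{N_s}.
toy_step_probs = [0.5, 0.25, 0.8]
energy = sentence_energy(toy_step_probs)

# By the chain rule, the same value is -log of the joint probability.
joint = 0.5 * 0.25 * 0.8
print(energy, -math.log(joint))
```

Summing log probabilities rather than multiplying raw probabilities avoids numerical underflow for long sentences.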

Figure 1: Five variants of our LSTM-A framework (better viewed in color).

We formulate this task as a variable-length sequence-to-sequence problem and model the parametric distribution $\Pr(w_t \mid I, A, w_0, \ldots, w_{t-1})$ in Eq. (3) with Long Short-Term Memory (LSTM), which is a widely used type of RNN. The vector formulas for an LSTM layer forward pass are summarized below. For time step $t$, $x_t$ and $h_t$ are the input and output vectors respectively, $T$ are input weight matrices, $R$ are recurrent weight matrices and $b$ are bias vectors. Sigmoid $\sigma$ and hyperbolic tangent $\phi$ are element-wise non-linear activation functions. The element-wise product of two vectors is denoted with $\odot$. Given inputs $x_t$, $h_{t-1}$ and $c_{t-1}$, the LSTM unit updates for time step $t$ are:

$$\begin{aligned}
g_t &= \phi(T_g x_t + R_g h_{t-1} + b_g), & i_t &= \sigma(T_i x_t + R_i h_{t-1} + b_i),\\
f_t &= \sigma(T_f x_t + R_f h_{t-1} + b_f), & c_t &= g_t \odot i_t + c_{t-1} \odot f_t,\\
o_t &= \sigma(T_o x_t + R_o h_{t-1} + b_o), & h_t &= \phi(c_t) \odot o_t,
\end{aligned}$$

where $g_t$, $i_t$, $f_t$, $c_t$, $o_t$, and $h_t$ are cell input, input gate, forget gate, cell state, output gate, and cell output of the LSTM, respectively.
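The LSTM update above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: vectors are Python lists, matrices are lists of rows, and the names `sigmoid`, `affine`, `lstm_step` and the `params` dictionary layout are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(T, x, R, h, b):
    """Compute T x + R h + b, with matrices given as lists of rows."""
    return [
        sum(t * xj for t, xj in zip(T_row, x))
        + sum(r * hj for r, hj in zip(R_row, h))
        + b_k
        for T_row, R_row, b_k in zip(T, R, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update; params maps 'g', 'i', 'f', 'o' to (T, R, b)."""
    def pre(name):
        T, R, b = params[name]
        return affine(T, x_t, R, h_prev, b)

    g_t = [math.tanh(z) for z in pre("g")]  # cell input
    i_t = [sigmoid(z) for z in pre("i")]    # input gate
    f_t = [sigmoid(z) for z in pre("f")]    # forget gate
    o_t = [sigmoid(z) for z in pre("o")]    # output gate
    # c_t = g_t (.) i_t + c_{t-1} (.) f_t, element-wise
    c_t = [g * i + c * f for g, i, c, f in zip(g_t, i_t, c_prev, f_t)]
    # h_t = phi(c_t) (.) o_t, element-wise
    h_t = [math.tanh(c) * o for c, o in zip(c_t, o_t)]
    return h_t, c_t


# Toy check with hypothetical all-zero weights (input size 2, hidden size 1).
zero = {k: ([[0.0, 0.0]], [[0.0]], [0.0]) for k in "gifo"}
h_t, c_t = lstm_step([1.0, -1.0], [0.3], [2.0], zero)
print(h_t, c_t)
```

With all-zero weights every gate equals 0.5 and the cell input is 0, so the new cell state is half of the previous one, which gives a quick sanity check of the update order.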

3.2 LONG SHORT-TERM MEMORY WITH ATTRIBUTES

Unlike the existing image captioning models in (Donahue et al., 2015; Vinyals et al., 2015), which solely encode image representations for sentence generation, our proposed Long Short-Term Memory with Attributes (LSTM-A) model additionally integrates the detected high-level attributes into LSTM. We devise five variants of LSTM-A to address two design questions. The first is where to feed attributes into LSTM, and three architectures, i.e., LSTM-A$_1$ (leveraging only attributes), LSTM-A$_2$ (inserting image representations first) and LSTM-A$_3$ (feeding attributes first), are derived from this view. The second is when to input attributes or image representations into LSTM, and we design LSTM-A$_4$ (inputting image representations at each time step) and LSTM-A$_5$ (inputting attributes at each time step) for this purpose. An overview of the LSTM-A architectures is depicted in Figure 1.

3.2.1 LSTM-A$_1$ (LEVERAGING ONLY ATTRIBUTES)

Given the detected attributes, one natural way is to directly inject the attributes as representations at the initial time step to inform the LSTM about the high-level attributes. This kind of architecture with only attributes as input is named LSTM-A$_1$. It is also worth noting that the attributes-based model in (Wu et al., 2016) is similar to LSTM-A$_1$ and can be regarded as one special case of our LSTM-A.

Given the attribute representations $A$ and the corresponding sentence $W \equiv [w_0, w_1, \ldots, w_{N_s}]$, the LSTM updating procedure in LSTM-A$_1$ is

$$\begin{aligned}
x_{-1} &= T_a A,\\
x_t &= T_s w_t, \quad t \in \{0, \ldots, N_s - 1\} \quad \text{and} \quad h_t = f(x_t), \quad t \in \{0, \ldots, N_s - 1\},
\end{aligned}$$

where $D_e$ is the dimensionality of the LSTM input, $T_a \in \mathbb{R}^{D_e \times D_a}$ and $T_s \in \mathbb{R}^{D_e \times D_s}$ are the transformation matrices for attribute representations and textual features of words, respectively, and $f$ is the updating function within the LSTM unit. Please note that for the input sentence $W \equiv [w_0, \ldots, w_{N_s}]$, we take $w_0$ as the start sign word that marks the beginning of the sentence and $w_{N_s}$ as the end sign word that marks the end of the sentence. Both of the special sign words are included in our vocabulary. More specifically, at the initial time step, the attribute representations are transformed as the input to the LSTM, and then in the following steps, the word embedding $x_t$ is input into the LSTM along with the previous step's hidden state $h_{t-1}$. At each time step (except the initial step), we use the LSTM cell output $h_t$ to predict the next word. Here a softmax layer is applied after the LSTM layer to produce a probability distribution over all the $D_s$ words in the vocabulary as

$$\mathrm{Pr}_{t+1}(w_{t+1}) = \frac{\exp\left\{T_h^{(w_{t+1})} h_t\right\}}{\sum_{w \in W} \exp\left\{T_h^{(w)} h_t\right\}},$$

where $W$ is the word vocabulary space and $T_h^{(w)}$ is the parameter matrix in the softmax layer.
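The softmax prediction step can be illustrated with the following minimal sketch. It assumes, for illustration only, that the softmax parameters are stored as a dictionary mapping each vocabulary word $w$ to its parameter row $T_h^{(w)}$; the function name `next_word_distribution` and the three-word toy vocabulary are hypothetical.

```python
import math


def next_word_distribution(T_h, h_t):
    """Softmax over the vocabulary given the LSTM cell output h_t.

    T_h maps each word w to its parameter row T_h^{(w)} (an assumption
    about the storage layout made for this sketch).
    """
    scores = {w: sum(a * b for a, b in zip(row, h_t)) for w, row in T_h.items()}
    # Subtracting the maximum score leaves the softmax unchanged
    # but avoids overflow in exp().
    top = max(scores.values())
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Hypothetical toy vocabulary with a 2-dimensional cell output.
toy_T_h = {"a": [1.0, 0.0], "dog": [0.0, 1.0], "<end>": [0.0, 0.0]}
probs = next_word_distribution(toy_T_h, [2.0, 0.0])
print(max(probs, key=probs.get), probs)
```

At inference time the most probable word (or a beam of candidates) would be fed back as the next input word; that feedback loop is omitted here.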

3.2.2 LSTM-A$_2$ (INSERTING IMAGE REPRESENTATIONS FIRST)

To further leverage both image representations and high-level attributes in the encoding stage of our LSTM-A, we design the second architecture, LSTM-A$_2$, by treating both of them as atoms in the input sequence to the LSTM. Specifically, at the initial step, the image representations $I$ are first transformed and fed into the LSTM to inform it about the image content, followed by the attribute representations $A$, which are encoded into the LSTM at the next time step to inform it about the high-level attributes. Then, the LSTM decodes each output word based on the previous word $x_t$ and the previous step's hidden state $h_{t-1}$, which is similar to the decoding stage in LSTM-A$_1$. The LSTM updating procedure in LSTM-A$_2$ is designed as

$$\begin{aligned}
x_{-2} &= T_v I \quad \text{and} \quad x_{-1} = T_a A,\\
x_t &= T_s w_t, \quad t \in \{0, \ldots, N_s - 1\} \quad \text{and} \quad h_t = f(x_t), \quad t \in \{0, \ldots, N_s - 1\},
\end{aligned}$$

where $T_v \in \mathbb{R}^{D_e \times D_v}$ is the transformation matrix for image representations.

3.2.3 LSTM-A$_3$ (FEEDING ATTRIBUTES FIRST)

The third design, LSTM-A$_3$, is similar to LSTM-A$_2$ in that both designs utilize image representations and high-level attributes to form the input sequence to the LSTM in the encoding stage, except that the order of encoding is different. In LSTM-A$_3$, the attribute representations are first encoded into the LSTM, and then the image representations are transformed and fed into the LSTM at the second time step. The whole LSTM updating procedure in LSTM-A$_3$ is

$$\begin{aligned}
x_{-2} &= T_a A \quad \text{and} \quad x_{-1} = T_v I,\\
x_t &= T_s w_t, \quad t \in \{0, \ldots, N_s - 1\} \quad \text{and} \quad h_t = f(x_t), \quad t \in \{0, \ldots, N_s - 1\}.
\end{aligned}$$
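The first three variants differ only in which transformed vectors are fed to the LSTM before the sentence starts. The sketch below builds these encoding-stage inputs for LSTM-A$_1$, LSTM-A$_2$ and LSTM-A$_3$. The helper names `matvec` and `encoding_inputs`, the list-based matrices, and the toy dimensions are assumptions made for illustration.

```python
def matvec(M, v):
    """Matrix-vector product with M given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def encoding_inputs(variant, T_v, I, T_a, A):
    """Inputs fed to the LSTM before the start word w_0.

    Returns [x_{-1}] for LSTM-A_1 and [x_{-2}, x_{-1}] for LSTM-A_2/A_3.
    """
    if variant == 1:  # attributes only
        return [matvec(T_a, A)]
    if variant == 2:  # image representations first, then attributes
        return [matvec(T_v, I), matvec(T_a, A)]
    if variant == 3:  # attributes first, then image representations
        return [matvec(T_a, A), matvec(T_v, I)]
    raise ValueError("variant must be 1, 2 or 3")


# Hypothetical toy sizes: D_e = 2, D_v = 2, D_a = 3.
T_v = [[1.0, 0.0], [0.0, 1.0]]
T_a = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
I = [3.0, 4.0]
A = [0.2, 0.3, 0.9]
for variant in (1, 2, 3):
    print(variant, encoding_inputs(variant, T_v, I, T_a, A))
```

After these steps, all three variants continue identically with $x_t = T_s w_t$ for the words of the sentence.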

3.2.4 LSTM-A$_4$ (INPUTTING IMAGE REPRESENTATIONS AT EACH TIME STEP)

Different from the former three architectures, which mainly inject high-level attributes and image representations at the encoding stage of the LSTM, we next modify the decoding stage in our LSTM-A by additionally incorporating image representations or high-level attributes. More specifically, in LSTM-A$_4$, the attribute representations are injected once at the initial step to inform the LSTM about the high-level attributes, and then image representations are fed at each time step as an extra input to the LSTM to repeatedly emphasize the image content among the memory cells in the LSTM. Hence, the LSTM updating procedure in LSTM-A$_4$ is:

$$\begin{aligned}
x_{-1} &= T_a A,\\
x_t &= T_s w_t + T_v I, \quad t \in \{0, \ldots, N_s - 1\} \quad \text{and} \quad h_t = f(x_t), \quad t \in \{0, \ldots, N_s - 1\}.
\end{aligned}$$

3.2.5 LSTM-A$_5$ (INPUTTING ATTRIBUTES AT EACH TIME STEP)

The last design, LSTM-A$_5$, is similar to LSTM-A$_4$ except that it first encodes image representations and then feeds attribute representations as an additional input to the LSTM at each step in the decoding stage to repeatedly emphasize the high-level attributes. Accordingly, the LSTM updating procedure in LSTM-A$_5$ is

$$\begin{aligned}
x_{-1} &= T_v I,\\
x_t &= T_s w_t + T_a A, \quad t \in \{0, \ldots, N_s - 1\} \quad \text{and} \quad h_t = f(x_t), \quad t \in \{0, \ldots, N_s - 1\}.
\end{aligned}$$
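For LSTM-A$_4$ and LSTM-A$_5$, the second kind of representation is added to the word input at every decoding step. The following sketch builds the full input sequence $x_{-1}, x_0, \ldots, x_{N_s-1}$ for both variants. The helper names `matvec` and `decoding_inputs`, the one-hot word vectors, and the toy dimensions are assumptions made for illustration.

```python
def matvec(M, v):
    """Matrix-vector product with M given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def decoding_inputs(variant, words, T_s, T_v, I, T_a, A):
    """Return [x_{-1}, x_0, ..., x_{N_s - 1}] for LSTM-A_4 or LSTM-A_5."""
    image_input = matvec(T_v, I)
    attribute_input = matvec(T_a, A)
    if variant == 4:    # attributes once, image at each time step
        first, extra = attribute_input, image_input
    elif variant == 5:  # image once, attributes at each time step
        first, extra = image_input, attribute_input
    else:
        raise ValueError("variant must be 4 or 5")
    sequence = [first]
    for w_t in words:
        embedded = matvec(T_s, w_t)  # T_s w_t
        sequence.append([e + x for e, x in zip(embedded, extra)])
    return sequence


# Hypothetical toy sizes: D_e = 2, D_s = 3, D_v = 2, D_a = 2.
T_s = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
T_v = [[1.0, 0.0], [0.0, 1.0]]
T_a = [[1.0, 0.0], [0.0, 1.0]]
I = [3.0, 4.0]
A = [0.5, 0.25]
words = [[1, 0, 0], [0, 0, 1]]  # one-hot vectors for w_0 and w_1
print(decoding_inputs(4, words, T_s, T_v, I, T_a, A))
print(decoding_inputs(5, words, T_s, T_v, I, T_a, A))
```

The two variants are symmetric: swapping the roles of the image and attribute vectors turns one into the other.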
