
A Hierarchical Approach for Generating Descriptive Image Paragraphs
Jonathan Krause Justin Johnson Ranjay Krishna Li Fei-Fei
Stanford University
{jkrause,jcjohns,ranjaykrishna,feifeili}@cs.stanford.edu
Abstract
Recent progress on image captioning has made it possible
to generate novel sentences describing images in natural
language, but compressing an image into a single sentence
can describe visual content in only coarse detail. While one
new captioning approach, dense captioning, can potentially
describe images in finer levels of detail by captioning many
regions within an image, it in turn is unable to produce a
coherent story for an image. In this paper we overcome these
limitations by generating entire paragraphs for describing
images, which can tell detailed, unified stories. We develop
a model that decomposes both images and paragraphs into
their constituent parts, detecting semantic regions in images
and using a hierarchical recurrent neural network to reason
about language. Linguistic analysis confirms the complexity
of the paragraph generation task, and thorough experiments
on a new dataset of image and paragraph pairs demonstrate
the effectiveness of our approach.
1. Introduction
Vision is the primary sensory modality for human percep-
tion, and language is our most powerful tool for communi-
cating with the world. Building systems that can simultane-
ously understand visual stimuli and describe them in natural
language is therefore a core problem in both computer vi-
sion and artificial intelligence as a whole. With the advent
of large datasets pairing images with natural language descriptions [20, 34, 10, 16] it has recently become possible to generate novel sentences describing images [4, 6, 12, 22, 30].
While the success of these methods is encouraging, they all
share one key limitation: detail. By only describing images
with a single high-level sentence, there is a fundamental
upper-bound on the quantity and quality of information ap-
proaches can produce.
One recent alternative to sentence-level captioning is the
task of dense captioning [11], which overcomes this limita-
tion by detecting many regions of interest in an image and
describing each with a short phrase. By extending the task
of object detection to include natural language description,
Sentences:
1) A girl is eating donuts with a boy in a restaurant
2) A boy and girl sitting at a table with doughnuts.
3) Two kids sitting a coffee shop eating some frosted donuts
4) Two children sitting at a table eating donuts.
5) Two children eat doughnuts at a restaurant table.
Paragraph:
Two children are sitting at a table in a restaurant. The children are one little girl and one little boy. The little girl is eating a pink frosted donut with white icing lines on top of it. The girl has blonde hair and is wearing a green jacket with a black long sleeve shirt underneath. The little boy is wearing a black zip up jacket and is holding his finger to his lip but is not eating. A metal napkin dispenser is in between them at the table. The wall next to them is white brick. Two adults are on the other side of the short white brick wall. The room has white circular lights on the ceiling and a large window in the front of the restaurant. It is daylight outside.
Figure 1. Paragraphs are longer, more informative, and more
linguistically complex than sentence-level captions. Here we show
an image with its sentence-level captions from MS COCO [20]
(top) and the paragraph used in this work (bottom).
dense captioning describes images in considerably more de-
tail than standard image captioning. However, this comes at
a cost: descriptions generated for dense captioning are not
coherent, i.e. they do not form a cohesive whole describing
the entire image.
In this paper we address the shortcomings of both tra-
ditional image captioning and the recently-proposed dense
image captioning by introducing the task of generating para-
graphs that richly describe images (Fig. 1). Paragraph gen-
eration combines the strengths of these tasks but does not
suffer from their weaknesses: like traditional captioning,
paragraphs give a coherent natural language description for
images, but like dense captioning, they can do so in fine-
grained detail.
Generating paragraphs for images is challenging, requir-
ing both fine-grained image understanding and long-term
language reasoning. To overcome these challenges, we pro-
pose a model that decomposes images and paragraphs into
their constituent parts: We break images into semantically
meaningful pieces by detecting objects and other regions of
interest, and we reason about language with a hierarchical
recurrent neural network, decomposing paragraphs into their
corresponding sentences. In addition, we also demonstrate
for the first time the ability to transfer visual and linguistic
knowledge from large-scale region captioning [16], which
we show has the ability to improve paragraph generation.
To validate our method, we collected a dataset of image
and paragraph pairs, which complements the whole-image
and region-level annotations of MS COCO [20] and Visual Genome [16]. To validate the complexity of the paragraph
generation task, we performed a linguistic analysis of our
collected paragraphs, comparing them to sentence-level im-
age captioning. We compare our approach with numerous
baselines, showcasing the benefits of hierarchical modeling
for generating descriptive paragraphs.
The rest of this paper is organized as follows: Sec. 2
overviews related work in image captioning and hierarchical
RNNs, Sec. 3 introduces the paragraph generation task, de-
scribes our newly-collected dataset, and performs a simple
linguistic analysis on it, Sec. 4 details our model for para-
graph generation, Sec. 5 contains experiments, and Sec. 6
concludes with discussion.
2. Related Work
Image Captioning
Building connections between visual
and textual data has been a longstanding goal in computer
vision. One line of work treats the problem as a ranking task,
using images to retrieve relevant captions from a database
and vice-versa [8, 10, 13]. Due to the compositional nature
of language, it is unlikely that any database will contain
all possible image captions; therefore another line of work
focuses on generating captions directly. Early work uses
handwritten templates to generate language [17] while more
recent methods train recurrent neural network language mod-
els conditioned on image features [4, 6, 12, 22, 30, 33] and sample from them to generate text. Similar methods have also been applied to generate captions for videos [6, 32, 35].
A handful of approaches to image captioning reason not
only about whole images but also image regions. Xu et
al. [31] generate captions using a recurrent network with
attention, so that the model produces a distribution over im-
age regions for each word. In contrast to their work, which
uses a coarse grid as image regions, we use semantically
meaningful regions of interest. Karpathy and Fei-Fei [12] use a ranking loss to align image regions with sentence fragments but do not do generation with the model. Johnson et al. [11] introduce the task of dense captioning, which detects
and describes regions of interest, but these descriptions are
independent and do not form a coherent whole.
There has also been some pioneering work on video cap-
tioning with multiple sentences [27]. While videos are a
natural candidate for multi-sentence description generation,
image captioning cannot leverage strong temporal dependen-
cies, adding extra challenge.
Hierarchical Recurrent Networks
In order to generate
a paragraph description, a model must reason about long-
term linguistic structures spanning multiple sentences. Due
to vanishing gradients, recurrent neural networks trained
with stochastic gradient descent often struggle to learn long-
term dependencies. Alternative recurrent architectures such
as long short-term memory (LSTM) [9] help alleviate this
problem through a gating mechanism that improves gradient
flow. Another solution is a hierarchical recurrent network,
where the architecture is designed such that different parts
of the model operate on different time scales.
Early work applied hierarchical recurrent networks to
simple algorithmic problems [7]. The Clockwork RNN [15]
uses a related technique for audio signal generation, spoken
word classification, and handwriting recognition; a similar
hierarchical architecture was also used in [2] for speech
recognition. In these approaches, each recurrent unit is up-
dated on a fixed schedule: some units are updated on every
timestep, while other units might be updated every other
or every fourth timestep. This type of hierarchy helps re-
duce the vanishing gradient problem, but the hierarchy of the
model does not directly reflect the hierarchy of the output
sequence.
More related to our work are hierarchical architectures
that directly mirror the hierarchy of language. Li et al. [18] introduce a hierarchical autoencoder, and Lin et al. [19] use different recurrent units to model sentences and words. Most similar to our work is Yu et al. [35], who generate
multi-sentence descriptions for cooking videos using a dif-
ferent hierarchical model. Due to the less constrained non-
temporal setting in our work, our method has to learn in
a much more generic fashion and has been made simpler
as a result, relying more on learning the interplay between
sentences. Additionally, our method reasons about semantic
regions in images, which both enables the transfer of infor-
mation from these regions and leads to more interpretability
in generation.

                      Sentences (COCO [20])   Paragraphs (Ours)
Description Length              11.30               67.50
Sentence Length                 11.30               11.91
Diversity                       19.01               70.49
Nouns                          33.45%              25.81%
Adjectives                     27.23%              27.64%
Verbs                          10.72%              15.21%
Pronouns                        1.23%               2.45%

Table 1. Statistics of paragraph descriptions, compared with sentence-level captions used in prior work. Description and sentence lengths are represented by the number of tokens present, diversity is the inverse of the average CIDEr score between sentences of the same image, and part of speech distributions are aggregated from Penn Treebank [23] part of speech tags.
3. Paragraphs are Different
To what extent does describing images with paragraphs
differ from sentence-level captioning? To answer this ques-
tion, we collected a novel dataset of paragraph annota-
tions, comprised of 19,551 MS COCO [20] and Visual Genome [16] images, where each image has been annotated
with a paragraph description. Annotations were collected
on Amazon Mechanical Turk, using U.S. workers with at
least 5,000 accepted HITs and an acceptance rate of 98% or
greater¹, and were additionally subject to automatic and man-
ual spot checks on quality. Fig. 1 demonstrates an example,
comparing our collected paragraph with the five correspond-
ing sentence-level captions from MS COCO. Though it is
clear that the paragraph is longer and more descriptive than
any one sentence, we note further that a single paragraph can
be more detailed than all five sentence captions, even when
combined. This occurs because of redundancy in sentence-
level captions: while each caption might use slightly differ-
ent words to describe the image, since all sentence captions
have the goal of describing the image as a whole, they are
fundamentally limited in terms of both diversity and their
total information.
We quantify these observations along with various other
statistics of language in Tab. 1. For example, we find that
each paragraph is roughly six times as long as the average
sentence caption, and the individual sentences in each para-
graph are of comparable length as sentence-level captions.
To examine the issue of sentence diversity, we compute the
average CIDEr [29] similarity between COCO sentences for
each image and between the individual sentences in each
collected paragraph, defining the final diversity score as 100 minus the average CIDEr similarity. Viewed through this metric, the difference in diversity is striking: sentences within paragraphs are substantially more diverse than sentence captions, with a diversity score of 70.49 compared to only 19.01. This quantifiable evidence demonstrates that sentences in paragraphs provide significantly more information about images.

¹ Available at http://cs.stanford.edu/people/ranjaykrishna/im2p/index.html
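To make the diversity metric concrete, the following is a minimal sketch in Python. The helper `cider_similarity` is a hypothetical stand-in for an actual CIDEr implementation and is assumed to return pairwise scores on the same 0-100 scale used above; neither the function nor the scaling is part of the paper's released code.

```python
from itertools import combinations

def diversity_score(sentences, cider_similarity):
    """Diversity = 100 minus the average pairwise CIDEr similarity.

    `sentences` is the list of captions (or paragraph sentences) for one
    image; `cider_similarity(a, b)` is an assumed callable returning a
    CIDEr similarity on a 0-100 scale for a pair of sentences.
    """
    pairs = list(combinations(sentences, 2))
    if not pairs:
        return 0.0
    avg_sim = sum(cider_similarity(a, b) for a, b in pairs) / len(pairs)
    return 100.0 - avg_sim
```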
Diving deeper, we performed a simple linguistic analysis
on COCO sentences and our collected paragraphs, com-
prised of annotating each word with a part of speech tag
from Penn Treebank via Stanford CoreNLP [21] and aggre-
gating parts of speech into higher-level linguistic categories.
A few common parts of speech are given in Tab. 1. As a
proportion, paragraphs have somewhat more verbs and pro-
nouns, a comparable frequency of adjectives, and somewhat
fewer nouns. Given the nature of paragraphs, this makes
sense: longer descriptions go beyond the presence of a few
salient objects and include information about their properties
and relationships. We also note but do not quantify that para-
graphs exhibit higher frequencies of more complex linguistic
phenomena, e.g. coreference occurring in Fig. 1, wherein
sentences refer to either “two children”, “one little girl and
one little boy”, “the girl”, or “the boy”. We believe that these
types of long-range phenomena are a fundamental property
of descriptive paragraphs with human-like language and can-
not be adequately explored with sentence-level captions.
4. Method
Overview
Our model takes an image as input, generating
a natural-language paragraph describing it, and is designed
to take advantage of the compositional structure of both
images and paragraphs. Fig. 2 provides an overview. We
first decompose the input image by detecting objects and
other regions of interest, then aggregate features across these
regions to produce a pooled representation richly expressing
the image semantics. This feature vector is taken as input
by a hierarchical recurrent neural network composed of two
levels: a sentence RNN and a word RNN. The sentence RNN
receives the image features, decides how many sentences to
generate in the resulting paragraph, and produces an input
topic vector for each sentence. Given this topic vector, the
word RNN generates the words of a single sentence. We
also show how to transfer knowledge from a dense image
captioning [11] task to our model for paragraph generation.
[Figure 2 diagram: image (3 × H × W) → region detector (CNN + RPN) → regions with features (M × D) → projection, pooling → pooled vector (1 × P) → sentence RNN → sentence topic vectors (S × P) → word RNNs → generated sentences, e.g. “A baseball player is swinging a bat.” / “He is wearing a red helmet and a white shirt.” / “The catcher’s mitt is behind the batter.”]
Figure 2. Overview of our model. Given an image (left), a region detector (comprising a convolutional network and a region proposal network) detects regions of interest and produces features for each. Region features are projected to R^P, pooled to give a compact image representation, and passed to a hierarchical recurrent neural network language model comprising a sentence RNN and a word RNN. The sentence RNN determines the number of sentences to generate based on the halting distribution p_i and also generates sentence topic vectors, which are consumed by each word RNN to generate sentences.

4.1. Region Detector
The region detector receives an input image of size 3 × H × W, detects regions of interest, and produces a feature vector of dimension D = 4096 for each region. Our region detector follows [26, 11]; we provide a summary here for completeness: The image is resized so that its longest edge is 720 pixels, and is then passed through a convolutional network initialized from the 16-layer VGG network [28]. The resulting feature map is processed by a region proposal network [26], which regresses from a set of anchors to pro-
pose regions of interest. These regions are projected onto
the convolutional feature map, and the corresponding region
of the feature map is reshaped to a fixed size using bilinear
interpolation and processed by two fully-connected layers to
give a vector of dimension D for each region.
Given a dataset of images and ground-truth regions of
interest, the region detector can be trained in an end-to-end
fashion as in [26] for object detection and [11] for dense cap-
tioning. Since paragraph descriptions do not have annotated
groundings to regions of interest, we use a region detector
trained for dense image captioning on the Visual Genome
dataset [16], using the publicly available implementation of
[11]. This produces M = 50 detected regions.
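For illustration, here is a minimal sketch of this region feature pipeline in Python/PyTorch. This is not the authors' Torch implementation: it assumes proposal boxes are already available from an RPN, uses torchvision's VGG-16 backbone, and uses `roi_align` as a stand-in for the bilinear interpolation step; `fc_layers` (e.g. two `nn.Linear` layers mapping the 512×7×7 pooled features to R^D) is left to the caller.

```python
import torchvision
from torchvision.ops import roi_align

def region_features(image, boxes, fc_layers):
    """Produce one D-dimensional feature per region, roughly mirroring Sec. 4.1.

    image: (3, H, W) tensor, assumed already resized so max(H, W) == 720.
    boxes: (M, 4) tensor of proposal boxes in (x1, y1, x2, y2) coordinates,
           assumed to come from a region proposal network.
    fc_layers: module mapping the flattened pooled features to R^D.
    """
    # VGG-16 conv layers without the final max pool, giving stride-16 features.
    backbone = torchvision.models.vgg16(weights="DEFAULT").features[:-1]
    backbone.eval()
    feat = backbone(image.unsqueeze(0))                 # (1, 512, H/16, W/16)
    # Bilinearly resample each region onto a fixed 7x7 grid.
    pooled = roi_align(feat, [boxes], output_size=(7, 7),
                       spatial_scale=1.0 / 16.0)        # (M, 512, 7, 7)
    return fc_layers(pooled.flatten(start_dim=1))       # (M, D)
```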
One alternative worth noting is to use a region detector
trained strictly for object detection, rather than dense caption-
ing. Although such an approach would capture many salient
objects in an image, its paragraphs would suffer: an ideal
paragraph describes not only objects, but also scenery and
relationships, which are better captured by the dense captioning task, since it aims to describe all noteworthy elements of a scene.
4.2. Region Pooling
The region detector produces a set of vectors v_1, . . . , v_M ∈ R^D, each describing a different region in the input image. We wish to aggregate these vectors into a single pooled vector v_p ∈ R^P that compactly describes the content of the image. To this end, we learn a projection matrix W_pool ∈ R^{P×D} and bias b_pool ∈ R^P; the pooled vector v_p is computed by projecting each region vector using W_pool and taking an elementwise maximum, so that v_p = max_{i=1,...,M}(W_pool v_i + b_pool). While alternative approaches for representing collections of regions, such as spatial attention [31], may also be possible, we view these as complementary to the model proposed in this paper; furthermore we note recent work [25] which has proven max pooling sufficient for representing any continuous set function, giving motivation that max pooling does not, in principle, sacrifice expressive power.
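A minimal sketch of this pooling step (PyTorch-style): D = 4096 follows Sec. 4.1, while P = 1024 is an illustrative choice, since the pooled dimension is not fixed in this excerpt.

```python
import torch.nn as nn

class RegionPooling(nn.Module):
    """Project M region vectors to R^P and take an elementwise max."""

    def __init__(self, D=4096, P=1024):
        super().__init__()
        self.proj = nn.Linear(D, P)       # W_pool and b_pool

    def forward(self, regions):           # regions: (M, D)
        projected = self.proj(regions)    # (M, P)
        v_p, _ = projected.max(dim=0)     # elementwise max over regions
        return v_p                        # (P,)
```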
4.3. Hierarchical Recurrent Network
The pooled region vector v_p ∈ R^P is given as input to a hierarchical neural language model composed of two modules: a sentence RNN and a word RNN. The sentence RNN is responsible for deciding the number of sentences S that should be in the generated paragraph and for producing a P-dimensional topic vector for each of these sentences. Given a topic vector for a sentence, the word RNN generates the words of that sentence. We adopt the standard LSTM architecture [9] for both the word RNN and sentence RNN.
As an alternative to this hierarchical approach, one could
instead use a non-hierarchical language model to directly
generate the words of a paragraph, treating the end-of-
sentence token as another word in the vocabulary. Our hier-
archical model is advantageous because it reduces the length
of time over which the recurrent networks must reason. Our
paragraphs contain an average of 67.5 words (Tab. 1), so
a non-hierarchical approach must reason over dozens of
time steps, which is extremely difficult for language mod-
els. However, since our paragraphs contain an average of
5.7 sentences, each with an average of 11.9 words, both
the paragraph and sentence RNNs need only reason over
much shorter time-scales, making learning an appropriate
representation much more tractable.
Sentence RNN
The sentence RNN is a single-layer LSTM with hidden size H = 512 and initial hidden and cell states set to zero. At each time step, the sentence RNN receives the pooled region vector v_p as input, and in turn produces a sequence of hidden states h_1, . . . , h_S ∈ R^H, one for each sentence in the paragraph. Each hidden state h_i is used in two ways: First, a linear projection from h_i and a logistic classifier produce a distribution p_i over the two states {CONTINUE = 0, STOP = 1}, which determines whether the i-th sentence is the last sentence in the paragraph. Second, the hidden state h_i is fed through a two-layer fully-connected network to produce the topic vector t_i ∈ R^P for the i-th sentence of the paragraph, which is the input to the word RNN.
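A sketch of such a sentence RNN, building on the RegionPooling sketch above. The single LSTM layer with H = 512, the two-class stopping classifier, and the two-layer topic network follow the text; the topic-network width, P = 1024, and unrolling for a fixed S_max steps (rather than the ground-truth S used at training time) are illustrative simplifications.

```python
import torch.nn as nn

class SentenceRNN(nn.Module):
    """Unroll an LSTM over v_p to produce stop logits and topic vectors."""

    def __init__(self, P=1024, H=512, S_max=6):
        super().__init__()
        self.S_max = S_max
        self.lstm = nn.LSTM(input_size=P, hidden_size=H, batch_first=True)
        self.stop = nn.Linear(H, 2)                   # CONTINUE / STOP logits
        self.topic = nn.Sequential(nn.Linear(H, P),   # two-layer FC network
                                   nn.ReLU(),
                                   nn.Linear(P, P))

    def forward(self, v_p):                           # v_p: (B, P)
        # The same pooled vector is fed at every timestep; initial hidden
        # and cell states default to zero, as in the paper.
        inputs = v_p.unsqueeze(1).repeat(1, self.S_max, 1)   # (B, S_max, P)
        h, _ = self.lstm(inputs)                             # (B, S_max, H)
        return self.stop(h), self.topic(h)   # (B, S_max, 2), (B, S_max, P)
```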

Word RNN
The word RNN is a two-layer LSTM with hidden size H = 512, which, given a topic vector t_i ∈ R^P from the sentence RNN, is responsible for generating the words of a sentence. We follow the input formulation of [30]: the first and second inputs to the RNN are the topic vector and a special START token, and subsequent inputs are learned embedding vectors for the words of the sentence. At each timestep the hidden state of the last LSTM layer is used to predict a distribution over the words in the vocabulary, and a special END token signals the end of a sentence. After each word RNN has generated the words of its respective sentence, these sentences are finally concatenated to form the generated paragraph.
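A sketch of greedy decoding with such a word RNN. The two-layer LSTM with H = 512 and the topic-then-START input order follow the text; the linear projection of the topic vector to the LSTM input size, the embedding dimension, and the vocabulary handling are illustrative choices, not details from the paper.

```python
import torch
import torch.nn as nn

class WordRNN(nn.Module):
    """Two-layer LSTM that emits one sentence from a topic vector."""

    def __init__(self, vocab_size, P=1024, H=512, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.topic_in = nn.Linear(P, embed_dim)  # illustrative: match input sizes
        self.lstm = nn.LSTM(embed_dim, H, num_layers=2, batch_first=True)
        self.out = nn.Linear(H, vocab_size)

    @torch.no_grad()
    def generate(self, topic, start_id, end_id, max_words=50):
        # First two inputs: the topic vector, then the START token.
        _, state = self.lstm(self.topic_in(topic).view(1, 1, -1))
        x = self.embed(torch.tensor([[start_id]]))
        words = []
        for _ in range(max_words):
            h, state = self.lstm(x, state)
            nxt = self.out(h[:, -1]).argmax(dim=-1)   # greedy: most likely word
            if nxt.item() == end_id:                  # END signals end of sentence
                break
            words.append(nxt.item())
            x = self.embed(nxt.view(1, 1))
        return words
```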
4.4. Training and Sampling
Training data consists of pairs (x, y), with x an image and y a ground-truth paragraph description for that image, where y has S sentences, the i-th sentence has N_i words, and y_ij is the j-th word of the i-th sentence. After computing the pooled region vector v_p for the image, we unroll the sentence RNN for S timesteps, giving a distribution p_i over the {CONTINUE, STOP} states for each sentence. We feed the sentence topic vectors to S copies of the word RNN, unrolling the i-th copy for N_i timesteps, producing distributions p_ij over each word of each sentence. Our training loss ℓ(x, y) for the example (x, y) is a weighted sum of two cross-entropy terms: a sentence loss ℓ_sent on the stopping distribution p_i, and a word loss ℓ_word on the word distribution p_ij:

\ell(x, y) = \lambda_{sent} \sum_{i=1}^{S} \ell_{sent}\big(p_i, I[i = S]\big) + \lambda_{word} \sum_{i=1}^{S} \sum_{j=1}^{N_i} \ell_{word}\big(p_{ij}, y_{ij}\big)   (1)
To generate a paragraph for an image, we run the sentence RNN forward until the stopping probability p_i(STOP) exceeds a threshold T_STOP or after S_MAX sentences, whichever comes first. We then sample sentences from the word RNN, choosing the most likely word at each timestep and stopping after choosing the STOP token or after N_MAX words. We set the parameters T_STOP = 0.5, S_MAX = 6, and N_MAX = 50 based on validation set performance.
4.5. Transfer Learning
Transfer learning has become pervasive in computer vision. For tasks such as object detection [26] and image captioning [6, 12, 30, 31], it has become standard practice not only to process images with convolutional neural networks, but also to initialize the weights of these networks from weights that had been tuned for image classification, such as the 16-layer VGG network [28]. Initializing from a pretrained convolutional network allows a form of knowledge transfer from large classification datasets, and is particularly effective on datasets of limited size. Might transfer learning also be useful for paragraph generation?
We propose to utilize transfer learning in two ways. First, we initialize our region detection network from a model trained for dense image captioning [11]; although our model is end-to-end differentiable, we keep this sub-network fixed during training both for efficiency and also to prevent overfitting. Second, we initialize the word embedding vectors, recurrent network weights, and output linear projection of the word RNN from a language model that had been trained on region-level captions [11], fine-tuning these parameters during training to be better suited for the task of paragraph generation. Parameters for tokens not present in the region model are initialized from the parameters for the UNK token. This initialization strategy allows our model to utilize linguistic knowledge learned on large-scale region caption datasets [16] to produce better paragraph descriptions, and we validate the efficacy of this strategy in our experiments.
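A sketch of the embedding-transfer step for the word RNN, covering the UNK-based initialization of unseen tokens; the vocabulary format (token-to-index dicts) and tensor layout are assumptions, not details of the released code.

```python
import torch

def init_embeddings_from_region_model(region_vocab, para_vocab,
                                      region_embed, unk_token="UNK"):
    """Initialize paragraph-model word embeddings from a region-captioning
    language model; tokens unseen by the region model copy the UNK row.

    region_vocab / para_vocab: dicts mapping token -> row index (assumed format).
    region_embed: (len(region_vocab), E) tensor of trained embeddings.
    """
    E = region_embed.size(1)
    para_embed = torch.empty(len(para_vocab), E)
    unk_row = region_embed[region_vocab[unk_token]]
    for token, idx in para_vocab.items():
        src = region_vocab.get(token)
        para_embed[idx] = region_embed[src] if src is not None else unk_row
    return para_embed
```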
5. Experiments
In this section we describe our paragraph generation ex-
periments on the collected data described in Sec. 3, which
we divide into 14,575 training, 2,487 validation, and 2,489
testing images.
5.1. Baselines
Sentence-Concat:
To demonstrate the difference between
sentence-level and paragraph captions, this baseline samples
and concatenates five sentence captions from a model [12] trained on MS COCO captions [20]. The first sentence uses beam search (beam size = 2) and the rest are sampled. The
motivation for this is as follows: the image captioning model
first produces the sentence that best describes the image as
a whole, and subsequent sentences use sampling in order to
generate a diverse range of sentences, since the alternative
is to repeat the same sentence from beam search. We have
validated that this approach works better than using either
only beam search or only sampling, as the intent is to make
the strongest possible comparison at a task-level to standard
image captioning. We also note that, while Sentence-Concat
is trained on MS COCO, all images in our dataset are also in
MS COCO, and our descriptions were also written by users
on Amazon Mechanical Turk.
Image-Flat:
This model uses a flat representation for both
images and language, and is equivalent to the standard image
captioning model NeuralTalk [12]. It takes the whole image as input, and decodes into a paragraph token by token. We use the publicly available implementation of [12], which uses the 16-layer VGG network [28] to extract CNN features and projects them as input into an LSTM [9], training the
whole model jointly end-to-end.

Citations
Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this article, the authors introduce a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language, which is called ActivityNet Captions.
Abstract: Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time. Finally, we report performances of our model for dense-captioning events, video retrieval and localization.

551 citations

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This article proposed a new framework based on conditional generative adversarial networks (CGAN), which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content.
Abstract: Despite the substantial progress in recent years, the image captioning techniques are still far from being perfect. Sentences produced by existing methods, e.g. those based on RNNs, are often overly rigid and lacking in variability. This issue is related to a learning principle widely used in practice, that is, to maximize the likelihood of training samples. This principle encourages high resemblance to the “ground-truth” captions, while suppressing other reasonable descriptions. Conventional evaluation metrics, e.g. BLEU and METEOR, also favor such restrictive methods. In this paper, we explore an alternative approach, with the aim to improve the naturalness and diversity – two essential properties of human expression. Specifically, we propose a new framework based on Conditional Generative Adversarial Networks (CGAN), which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content. It is noteworthy that training a sequence generator is nontrivial. We overcome the difficulty by Policy Gradient, a strategy stemming from Reinforcement Learning, which allows the generator to receive early feedback along the way. We tested our method on two large datasets, where it performed competitively against real people in our user study and outperformed other methods on various tasks.

415 citations

Journal ArticleDOI
TL;DR: A comprehensive review of state-of-the-art deep learning approaches that have been used in the context of histopathological image analysis can be found in this paper, where a survey of over 130 papers is presented.

260 citations

Proceedings ArticleDOI
01 May 2018
TL;DR: The authors proposed a unified learning framework that collectively addresses all the above issues by composing a committee of discriminators that can guide a base RNN generator towards more globally coherent generations, and human evaluation demonstrates that text generated by their model is preferred over that of baselines by a large margin, significantly enhancing the overall coherence, style, and information of the generations.
Abstract: Despite their local fluency, long-form text generated from RNNs is often generic, repetitive, and even self-contradictory. We propose a unified learning framework that collectively addresses all the above issues by composing a committee of discriminators that can guide a base RNN generator towards more globally coherent generations. More concretely, discriminators each specialize in a different principle of communication, such as Grice’s maxims, and are collectively combined with the base RNN generator through a composite decoding objective. Human evaluation demonstrates that text generated by our model is preferred over that of baselines by a large margin, significantly enhancing the overall coherence, style, and information of the generations.

214 citations

References
Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

111,197 citations


"A Hierarchical Approach for Generat..." refers methods in this paper

  • ...Training is done via stochastic gradient descent with Adam [13] learning rate updates, implemented in Torch....

  • ...Training is done via stochastic gradient descent with Adam [14], implemented in Torch....

Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O. 1. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations


"A Hierarchical Approach for Generat..." refers background or methods in this paper

  • ...All baseline neural language models use two layers of LSTM [9] units with 512 dimensions....

  • ...Alternative recurrent architectures such as long-short term memory (LSTM) [8] help alleviate this problem through a gating mechanism that improves gradient flow....

  • ...At each timestep the hidden state of the last LSTM layer is used to predict a distribution over the words in the vocabulary, and a special END token signals the end of a sentence....

  • ...We adopt the standard LSTM architecture [9] for both the word RNN and sentence RNN....

  • ...Second, the hidden state hi is fed through a two-layer fullyconnected network to produce the topic vector ti ∈ RP for the ith sentence of the paragraph, which is the direct input to the word RNN. Word RNN The word RNN is a two-layer LSTM with hidden size H = 512, which, given a topic vector ti ∈ RP from the sentence RNN, is responsible for generating the words of a sentence....

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Book ChapterDOI
06 Sep 2014
TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

30,462 citations


"A Hierarchical Approach for Generat..." refers background or methods in this paper

  • ...Our hierarchical method also had a much wider vocabulary compared to the Template approach, though Sentence-Concat, trained on hundreds of thousands of MS COCO [20] captions, is a bit larger....

  • ...Sentence-Concat: To demonstrate the difference between sentence-level and paragraph captions, this baseline samples and concatenates five sentence captions from a model [12] trained on MS COCO captions [20]....

  • ...With the advent of large datasets pairing images with natural language descriptions [20, 34, 10, 16] it has recently become possible to generate novel sentences describing images [4, 6, 12, 22, 30]....

  • ...To validate our method, we collected a dataset of image and paragraph pairs, which complements the whole-image and region-level annotations of MS COCO [20] and Visual Genome [16]....

  • ...To what extent does describing images with paragraphs differ from sentence-level captioning? To answer this question, we collected a novel dataset of paragraph annotations, comprised of 19,551 MS COCO [20] and Visual Genome [16] images, where each image has been annotated with a paragraph description....

Proceedings ArticleDOI
06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Abstract: Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.

21,126 citations