
A Hierarchical Approach for Generating Descriptive Image Paragraphs
Jonathan Krause Justin Johnson Ranjay Krishna Li Fei-Fei
Stanford University
{jkrause,jcjohns,ranjaykrishna,feifeili}@cs.stanford.edu
Abstract
Recent progress on image captioning has made it possible
to generate novel sentences describing images in natural
language, but compressing an image into a single sentence
can describe visual content in only coarse detail. While one
new captioning approach, dense captioning, can potentially
describe images in finer levels of detail by captioning many
regions within an image, it in turn is unable to produce a
coherent story for an image. In this paper we overcome these
limitations by generating entire paragraphs for describing
images, which can tell detailed, unified stories. We develop
a model that decomposes both images and paragraphs into
their constituent parts, detecting semantic regions in images
and using a hierarchical recurrent neural network to reason
about language. Linguistic analysis confirms the complexity
of the paragraph generation task, and thorough experiments
on a new dataset of image and paragraph pairs demonstrate
the effectiveness of our approach.
1. Introduction
Vision is the primary sensory modality for human percep-
tion, and language is our most powerful tool for communi-
cating with the world. Building systems that can simultane-
ously understand visual stimuli and describe them in natural
language is therefore a core problem in both computer vi-
sion and artificial intelligence as a whole. With the advent
of large datasets pairing images with natural language descriptions [20, 34, 10, 16] it has recently become possible to generate novel sentences describing images [4, 6, 12, 22, 30].
While the success of these methods is encouraging, they all
share one key limitation: detail. By only describing images
with a single high-level sentence, there is a fundamental
upper-bound on the quantity and quality of information ap-
proaches can produce.
One recent alternative to sentence-level captioning is the
task of dense captioning [11], which overcomes this limita-
tion by detecting many regions of interest in an image and
describing each with a short phrase. By extending the task
of object detection to include natural language description,
Sentences:
1) A girl is eating donuts with a boy in a restaurant
2) A boy and girl sitting at a table with doughnuts.
3) Two kids sitting a coffee shop eating some frosted donuts
4) Two children sitting at a table eating donuts.
5) Two children eat doughnuts at a restaurant table.
Paragraph:
Two children are sitting at a table in a restaurant. The children are one little girl and one little boy. The little girl is eating a pink frosted donut with white icing lines on top of it. The girl has blonde hair and is wearing a green jacket with a black long sleeve shirt underneath. The little boy is wearing a black zip up jacket and is holding his finger to his lip but is not eating. A metal napkin dispenser is in between them at the table. The wall next to them is white brick. Two adults are on the other side of the short white brick wall. The room has white circular lights on the ceiling and a large window in the front of the restaurant. It is daylight outside.
Figure 1. Paragraphs are longer, more informative, and more
linguistically complex than sentence-level captions. Here we show
an image with its sentence-level captions from MS COCO [20]
(top) and the paragraph used in this work (bottom).
dense captioning describes images in considerably more de-
tail than standard image captioning. However, this comes at
a cost: descriptions generated for dense captioning are not
coherent, i.e. they do not form a cohesive whole describing
the entire image.
In this paper we address the shortcomings of both tra-
ditional image captioning and the recently-proposed dense
image captioning by introducing the task of generating para-
graphs that richly describe images (Fig. 1). Paragraph gen-
eration combines the strengths of these tasks but does not
suffer from their weaknesses: like traditional captioning,
paragraphs give a coherent natural language description for
images, but like dense captioning, they can do so in fine-
grained detail.
Generating paragraphs for images is challenging, requir-
ing both fine-grained image understanding and long-term
language reasoning. To overcome these challenges, we pro-
pose a model that decomposes images and paragraphs into
their constituent parts: We break images into semantically
meaningful pieces by detecting objects and other regions of
interest, and we reason about language with a hierarchical
recurrent neural network, decomposing paragraphs into their
corresponding sentences. In addition, we also demonstrate
for the first time the ability to transfer visual and linguistic
knowledge from large-scale region captioning [16], which
we show has the ability to improve paragraph generation.
To validate our method, we collected a dataset of image
and paragraph pairs, which complements the whole-image
and region-level annotations of MS COCO [20] and Visual Genome [16]. To validate the complexity of the paragraph
generation task, we performed a linguistic analysis of our
collected paragraphs, comparing them to sentence-level im-
age captioning. We compare our approach with numerous
baselines, showcasing the benefits of hierarchical modeling
for generating descriptive paragraphs.
The rest of this paper is organized as follows: Sec. 2
overviews related work in image captioning and hierarchical
RNNs, Sec. 3 introduces the paragraph generation task, de-
scribes our newly-collected dataset, and performs a simple
linguistic analysis on it, Sec. 4 details our model for para-
graph generation, Sec. 5 contains experiments, and Sec. 6
concludes with discussion.
2. Related Work
Image Captioning
Building connections between visual
and textual data has been a longstanding goal in computer
vision. One line of work treats the problem as a ranking task,
using images to retrieve relevant captions from a database
and vice-versa [8, 10, 13]. Due to the compositional nature
of language, it is unlikely that any database will contain
all possible image captions; therefore another line of work
focuses on generating captions directly. Early work uses
handwritten templates to generate language [17] while more
recent methods train recurrent neural network language mod-
els conditioned on image features [4, 6, 12, 22, 30, 33] and sample from them to generate text. Similar methods have also been applied to generate captions for videos [6, 32, 35].
A handful of approaches to image captioning reason not
only about whole images but also image regions. Xu et
al. [31] generate captions using a recurrent network with
attention, so that the model produces a distribution over im-
age regions for each word. In contrast to their work, which
uses a coarse grid as image regions, we use semantically
meaningful regions of interest. Karpathy and Fei-Fei [12] use a ranking loss to align image regions with sentence fragments but do not do generation with the model. Johnson et al. [11] introduce the task of dense captioning, which detects
and describes regions of interest, but these descriptions are
independent and do not form a coherent whole.
There has also been some pioneering work on video cap-
tioning with multiple sentences [27]. While videos are a
natural candidate for multi-sentence description generation,
image captioning cannot leverage strong temporal dependen-
cies, adding extra challenge.
Hierarchical Recurrent Networks
In order to generate
a paragraph description, a model must reason about long-
term linguistic structures spanning multiple sentences. Due
to vanishing gradients, recurrent neural networks trained
with stochastic gradient descent often struggle to learn long-
term dependencies. Alternative recurrent architectures such
as long short-term memory (LSTM) [9] help alleviate this
problem through a gating mechanism that improves gradient
flow. Another solution is a hierarchical recurrent network,
where the architecture is designed such that different parts
of the model operate on different time scales.
Early work applied hierarchical recurrent networks to
simple algorithmic problems [7]. The Clockwork RNN [15]
uses a related technique for audio signal generation, spoken
word classification, and handwriting recognition; a similar
hierarchical architecture was also used in [2] for speech
recognition. In these approaches, each recurrent unit is up-
dated on a fixed schedule: some units are updated on every
timestep, while other units might be updated every other
or every fourth timestep. This type of hierarchy helps re-
duce the vanishing gradient problem, but the hierarchy of the
model does not directly reflect the hierarchy of the output
sequence.
More related to our work are hierarchical architectures
that directly mirror the hierarchy of language. Li et al. [18] introduce a hierarchical autoencoder, and Lin et al. [19] use different recurrent units to model sentences and words. Most similar to our work is Yu et al. [35], who generate
multi-sentence descriptions for cooking videos using a dif-
ferent hierarchical model. Due to the less constrained non-
temporal setting in our work, our method has to learn in
a much more generic fashion and has been made simpler
as a result, relying more on learning the interplay between
sentences. Additionally, our method reasons about semantic
regions in images, which both enables the transfer of infor-
mation from these regions and leads to more interpretability
in generation.

                      Sentences (COCO [20])   Paragraphs (Ours)
Description Length              11.30               67.50
Sentence Length                 11.30               11.91
Diversity                       19.01               70.49
Nouns                          33.45%              25.81%
Adjectives                     27.23%              27.64%
Verbs                          10.72%              15.21%
Pronouns                        1.23%               2.45%

Table 1. Statistics of paragraph descriptions, compared with sentence-level captions used in prior work. Description and sentence lengths are represented by the number of tokens present, diversity is the inverse of the average CIDEr score between sentences of the same image, and part of speech distributions are aggregated from Penn Treebank [23] part of speech tags.
3. Paragraphs are Different
To what extent does describing images with paragraphs
differ from sentence-level captioning? To answer this ques-
tion, we collected a novel dataset of paragraph annota-
tions, comprised of 19,551 MS COCO [20] and Visual Genome [16] images, where each image has been annotated
with a paragraph description. Annotations were collected
on Amazon Mechanical Turk, using U.S. workers with at
least 5,000 accepted HITs and an acceptance rate of 98% or
greater¹, and were additionally subject to automatic and man-
ual spot checks on quality. Fig. 1 demonstrates an example,
comparing our collected paragraph with the five correspond-
ing sentence-level captions from MS COCO. Though it is
clear that the paragraph is longer and more descriptive than
any one sentence, we note further that a single paragraph can
be more detailed than all five sentence captions, even when
combined. This occurs because of redundancy in sentence-
level captions: while each caption might use slightly differ-
ent words to describe the image, since all sentence captions
have the goal of describing the image as a whole, they are
fundamentally limited in terms of both diversity and their
total information.
We quantify these observations along with various other
statistics of language in Tab. 1. For example, we find that
each paragraph is roughly six times as long as the average
sentence caption, and the individual sentences in each para-
graph are of comparable length as sentence-level captions.
To examine the issue of sentence diversity, we compute the
average CIDEr [29] similarity between COCO sentences for
each image and between the individual sentences in each
collected paragraph, defining the final diversity score as 100 minus the average CIDEr similarity. Viewed through this metric, the difference in diversity is striking: sentences within paragraphs are substantially more diverse than sentence captions, with a diversity score of 70.49 compared to only 19.01. This quantifiable evidence demonstrates that sentences in paragraphs provide significantly more information about images.

¹ Available at http://cs.stanford.edu/people/ranjaykrishna/im2p/index.html
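To make the diversity metric concrete, the following is a minimal sketch in Python. The helper `cider_similarity` is a hypothetical stand-in for an actual CIDEr implementation and is assumed to return pairwise scores on the same 0-100 scale used above; neither the function nor the scaling is part of the paper's released code.

```python
from itertools import combinations

def diversity_score(sentences, cider_similarity):
    """Diversity = 100 minus the average pairwise CIDEr similarity.

    `sentences` is the list of captions (or paragraph sentences) for one
    image; `cider_similarity(a, b)` is an assumed callable returning a
    CIDEr similarity on a 0-100 scale for a pair of sentences.
    """
    pairs = list(combinations(sentences, 2))
    if not pairs:
        return 0.0
    avg_sim = sum(cider_similarity(a, b) for a, b in pairs) / len(pairs)
    return 100.0 - avg_sim
```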
Diving deeper, we performed a simple linguistic analysis
on COCO sentences and our collected paragraphs, com-
prised of annotating each word with a part of speech tag
from Penn Treebank via Stanford CoreNLP [21] and aggre-
gating parts of speech into higher-level linguistic categories.
A few common parts of speech are given in Tab. 1. As a
proportion, paragraphs have somewhat more verbs and pro-
nouns, a comparable frequency of adjectives, and somewhat
fewer nouns. Given the nature of paragraphs, this makes
sense: longer descriptions go beyond the presence of a few
salient objects and include information about their properties
and relationships. We also note but do not quantify that para-
graphs exhibit higher frequencies of more complex linguistic
phenomena, e.g. coreference occurring in Fig. 1, wherein
sentences refer to either “two children”, “one little girl and
one little boy”, “the girl”, or “the boy”. We believe that these
types of long-range phenomena are a fundamental property
of descriptive paragraphs with human-like language and can-
not be adequately explored with sentence-level captions.
4. Method
Overview
Our model takes an image as input, generating
a natural-language paragraph describing it, and is designed
to take advantage of the compositional structure of both
images and paragraphs. Fig. 2 provides an overview. We
first decompose the input image by detecting objects and
other regions of interest, then aggregate features across these
regions to produce a pooled representation richly expressing
the image semantics. This feature vector is taken as input
by a hierarchical recurrent neural network composed of two
levels: a sentence RNN and a word RNN. The sentence RNN
receives the image features, decides how many sentences to
generate in the resulting paragraph, and produces an input
topic vector for each sentence. Given this topic vector, the
word RNN generates the words of a single sentence. We
also show how to transfer knowledge from a dense image
captioning [11] task to our model for paragraph generation.
[Figure 2 diagram: image (3 × H × W) → region detector (CNN + RPN) → regions with features (M × D) → projection, pooling → pooled vector (1 × P) → sentence RNN → sentence topic vectors (S × P) → word RNNs → generated sentences, e.g. “A baseball player is swinging a bat.” / “He is wearing a red helmet and a white shirt.” / “The catcher’s mitt is behind the batter.”]
Figure 2. Overview of our model. Given an image (left), a region detector (comprising a convolutional network and a region proposal network) detects regions of interest and produces features for each. Region features are projected to R^P, pooled to give a compact image representation, and passed to a hierarchical recurrent neural network language model comprising a sentence RNN and a word RNN. The sentence RNN determines the number of sentences to generate based on the halting distribution p_i and also generates sentence topic vectors, which are consumed by each word RNN to generate sentences.

4.1. Region Detector
The region detector receives an input image of size 3 × H × W, detects regions of interest, and produces a feature vector of dimension D = 4096 for each region. Our region detector follows [26, 11]; we provide a summary here for completeness: The image is resized so that its longest edge is 720 pixels, and is then passed through a convolutional network initialized from the 16-layer VGG network [28]. The resulting feature map is processed by a region proposal network [26], which regresses from a set of anchors to pro-
pose regions of interest. These regions are projected onto
the convolutional feature map, and the corresponding region
of the feature map is reshaped to a fixed size using bilinear
interpolation and processed by two fully-connected layers to
give a vector of dimension D for each region.
Given a dataset of images and ground-truth regions of
interest, the region detector can be trained in an end-to-end
fashion as in [26] for object detection and [11] for dense cap-
tioning. Since paragraph descriptions do not have annotated
groundings to regions of interest, we use a region detector
trained for dense image captioning on the Visual Genome
dataset [16], using the publicly available implementation of
[11]. This produces M = 50 detected regions.
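For illustration, here is a minimal sketch of this region feature pipeline in Python/PyTorch. This is not the authors' Torch implementation: it assumes proposal boxes are already available from an RPN, uses torchvision's VGG-16 backbone, and uses `roi_align` as a stand-in for the bilinear interpolation step; `fc_layers` (e.g. two `nn.Linear` layers mapping the 512×7×7 pooled features to R^D) is left to the caller.

```python
import torchvision
from torchvision.ops import roi_align

def region_features(image, boxes, fc_layers):
    """Produce one D-dimensional feature per region, roughly mirroring Sec. 4.1.

    image: (3, H, W) tensor, assumed already resized so max(H, W) == 720.
    boxes: (M, 4) tensor of proposal boxes in (x1, y1, x2, y2) coordinates,
           assumed to come from a region proposal network.
    fc_layers: module mapping the flattened pooled features to R^D.
    """
    # VGG-16 conv layers without the final max pool, giving stride-16 features.
    backbone = torchvision.models.vgg16(weights="DEFAULT").features[:-1]
    backbone.eval()
    feat = backbone(image.unsqueeze(0))                 # (1, 512, H/16, W/16)
    # Bilinearly resample each region onto a fixed 7x7 grid.
    pooled = roi_align(feat, [boxes], output_size=(7, 7),
                       spatial_scale=1.0 / 16.0)        # (M, 512, 7, 7)
    return fc_layers(pooled.flatten(start_dim=1))       # (M, D)
```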
One alternative worth noting is to use a region detector
trained strictly for object detection, rather than dense caption-
ing. Although such an approach would capture many salient
objects in an image, its paragraphs would suffer: an ideal
paragraph describes not only objects, but also scenery and
relationships, which are better captured by the dense captioning task, since it aims to describe all noteworthy elements of a scene.
4.2. Region Pooling
The region detector produces a set of vectors v_1, . . . , v_M ∈ R^D, each describing a different region in the input image. We wish to aggregate these vectors into a single pooled vector v_p ∈ R^P that compactly describes the content of the image. To this end, we learn a projection matrix W_pool ∈ R^{P×D} and bias b_pool ∈ R^P; the pooled vector v_p is computed by projecting each region vector using W_pool and taking an elementwise maximum, so that v_p = max_{i=1,...,M}(W_pool v_i + b_pool). While alternative approaches for representing collections of regions, such as spatial attention [31], may also be possible, we view these as complementary to the model proposed in this paper; furthermore we note recent work [25] which has proven max pooling sufficient for representing any continuous set function, giving motivation that max pooling does not, in principle, sacrifice expressive power.
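A minimal sketch of this pooling step (PyTorch-style): D = 4096 follows Sec. 4.1, while P = 1024 is an illustrative choice, since the pooled dimension is not fixed in this excerpt.

```python
import torch.nn as nn

class RegionPooling(nn.Module):
    """Project M region vectors to R^P and take an elementwise max."""

    def __init__(self, D=4096, P=1024):
        super().__init__()
        self.proj = nn.Linear(D, P)       # W_pool and b_pool

    def forward(self, regions):           # regions: (M, D)
        projected = self.proj(regions)    # (M, P)
        v_p, _ = projected.max(dim=0)     # elementwise max over regions
        return v_p                        # (P,)
```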
4.3. Hierarchical Recurrent Network
The pooled region vector v_p ∈ R^P is given as input to a hierarchical neural language model composed of two modules: a sentence RNN and a word RNN. The sentence RNN is responsible for deciding the number of sentences S that should be in the generated paragraph and for producing a P-dimensional topic vector for each of these sentences. Given a topic vector for a sentence, the word RNN generates the words of that sentence. We adopt the standard LSTM architecture [9] for both the word RNN and sentence RNN.
As an alternative to this hierarchical approach, one could
instead use a non-hierarchical language model to directly
generate the words of a paragraph, treating the end-of-
sentence token as another word in the vocabulary. Our hier-
archical model is advantageous because it reduces the length
of time over which the recurrent networks must reason. Our
paragraphs contain an average of 67.5 words (Tab. 1), so
a non-hierarchical approach must reason over dozens of
time steps, which is extremely difficult for language mod-
els. However, since our paragraphs contain an average of
5.7 sentences, each with an average of 11.9 words, both
the paragraph and sentence RNNs need only reason over
much shorter time-scales, making learning an appropriate
representation much more tractable.
Sentence RNN
The sentence RNN is a single-layer LSTM with hidden size H = 512 and initial hidden and cell states set to zero. At each time step, the sentence RNN receives the pooled region vector v_p as input, and in turn produces a sequence of hidden states h_1, . . . , h_S ∈ R^H, one for each sentence in the paragraph. Each hidden state h_i is used in two ways: First, a linear projection from h_i and a logistic classifier produce a distribution p_i over the two states {CONTINUE = 0, STOP = 1}, which determines whether the i-th sentence is the last sentence in the paragraph. Second, the hidden state h_i is fed through a two-layer fully-connected network to produce the topic vector t_i ∈ R^P for the i-th sentence of the paragraph, which is the input to the word RNN.
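A sketch of such a sentence RNN, building on the RegionPooling sketch above. The single LSTM layer with H = 512, the two-class stopping classifier, and the two-layer topic network follow the text; the topic-network width, P = 1024, and unrolling for a fixed S_max steps (rather than the ground-truth S used at training time) are illustrative simplifications.

```python
import torch.nn as nn

class SentenceRNN(nn.Module):
    """Unroll an LSTM over v_p to produce stop logits and topic vectors."""

    def __init__(self, P=1024, H=512, S_max=6):
        super().__init__()
        self.S_max = S_max
        self.lstm = nn.LSTM(input_size=P, hidden_size=H, batch_first=True)
        self.stop = nn.Linear(H, 2)                   # CONTINUE / STOP logits
        self.topic = nn.Sequential(nn.Linear(H, P),   # two-layer FC network
                                   nn.ReLU(),
                                   nn.Linear(P, P))

    def forward(self, v_p):                           # v_p: (B, P)
        # The same pooled vector is fed at every timestep; initial hidden
        # and cell states default to zero, as in the paper.
        inputs = v_p.unsqueeze(1).repeat(1, self.S_max, 1)   # (B, S_max, P)
        h, _ = self.lstm(inputs)                             # (B, S_max, H)
        return self.stop(h), self.topic(h)   # (B, S_max, 2), (B, S_max, P)
```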

Word RNN
The word RNN is a two-layer LSTM with hidden size H = 512, which, given a topic vector t_i ∈ R^P from the sentence RNN, is responsible for generating the words of a sentence. We follow the input formulation of [30]: the first and second inputs to the RNN are the topic vector and a special START token, and subsequent inputs are learned embedding vectors for the words of the sentence. At each timestep the hidden state of the last LSTM layer is used to predict a distribution over the words in the vocabulary, and a special END token signals the end of a sentence. After each word RNN has generated the words of its respective sentence, these sentences are finally concatenated to form the generated paragraph.
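A sketch of greedy decoding with such a word RNN. The two-layer LSTM with H = 512 and the topic-then-START input order follow the text; the linear projection of the topic vector to the LSTM input size, the embedding dimension, and the vocabulary handling are illustrative choices, not details from the paper.

```python
import torch
import torch.nn as nn

class WordRNN(nn.Module):
    """Two-layer LSTM that emits one sentence from a topic vector."""

    def __init__(self, vocab_size, P=1024, H=512, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.topic_in = nn.Linear(P, embed_dim)  # illustrative: match input sizes
        self.lstm = nn.LSTM(embed_dim, H, num_layers=2, batch_first=True)
        self.out = nn.Linear(H, vocab_size)

    @torch.no_grad()
    def generate(self, topic, start_id, end_id, max_words=50):
        # First two inputs: the topic vector, then the START token.
        _, state = self.lstm(self.topic_in(topic).view(1, 1, -1))
        x = self.embed(torch.tensor([[start_id]]))
        words = []
        for _ in range(max_words):
            h, state = self.lstm(x, state)
            nxt = self.out(h[:, -1]).argmax(dim=-1)   # greedy: most likely word
            if nxt.item() == end_id:                  # END signals end of sentence
                break
            words.append(nxt.item())
            x = self.embed(nxt.view(1, 1))
        return words
```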
4.4. Training and Sampling
Training data consists of pairs (x, y), with x an image and y a ground-truth paragraph description for that image, where y has S sentences, the i-th sentence has N_i words, and y_ij is the j-th word of the i-th sentence. After computing the pooled region vector v_p for the image, we unroll the sentence RNN for S timesteps, giving a distribution p_i over the {CONTINUE, STOP} states for each sentence. We feed the sentence topic vectors to S copies of the word RNN, unrolling the i-th copy for N_i timesteps, producing distributions p_ij over each word of each sentence. Our training loss ℓ(x, y) for the example (x, y) is a weighted sum of two cross-entropy terms: a sentence loss ℓ_sent on the stopping distribution p_i, and a word loss ℓ_word on the word distribution p_ij:

\ell(x, y) = \lambda_{sent} \sum_{i=1}^{S} \ell_{sent}\big(p_i, I[i = S]\big) + \lambda_{word} \sum_{i=1}^{S} \sum_{j=1}^{N_i} \ell_{word}\big(p_{ij}, y_{ij}\big)   (1)
To generate a paragraph for an image, we run the sentence RNN forward until the stopping probability p_i(STOP) exceeds a threshold T_STOP or after S_MAX sentences, whichever comes first. We then sample sentences from the word RNN, choosing the most likely word at each timestep and stopping after choosing the STOP token or after N_MAX words. We set the parameters T_STOP = 0.5, S_MAX = 6, and N_MAX = 50 based on validation set performance.
4.5. Transfer Learning
Transfer learning has become pervasive in computer vision. For tasks such as object detection [26] and image captioning [6, 12, 30, 31], it has become standard practice not only to process images with convolutional neural networks, but also to initialize the weights of these networks from weights that had been tuned for image classification, such as the 16-layer VGG network [28]. Initializing from a pretrained convolutional network allows a form of knowledge transfer from large classification datasets, and is particularly effective on datasets of limited size. Might transfer learning also be useful for paragraph generation?
We propose to utilize transfer learning in two ways. First, we initialize our region detection network from a model trained for dense image captioning [11]; although our model is end-to-end differentiable, we keep this sub-network fixed during training both for efficiency and also to prevent overfitting. Second, we initialize the word embedding vectors, recurrent network weights, and output linear projection of the word RNN from a language model that had been trained on region-level captions [11], fine-tuning these parameters during training to be better suited for the task of paragraph generation. Parameters for tokens not present in the region model are initialized from the parameters for the UNK token. This initialization strategy allows our model to utilize linguistic knowledge learned on large-scale region caption datasets [16] to produce better paragraph descriptions, and we validate the efficacy of this strategy in our experiments.
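A sketch of the embedding-transfer step for the word RNN, covering the UNK-based initialization of unseen tokens; the vocabulary format (token-to-index dicts) and tensor layout are assumptions, not details of the released code.

```python
import torch

def init_embeddings_from_region_model(region_vocab, para_vocab,
                                      region_embed, unk_token="UNK"):
    """Initialize paragraph-model word embeddings from a region-captioning
    language model; tokens unseen by the region model copy the UNK row.

    region_vocab / para_vocab: dicts mapping token -> row index (assumed format).
    region_embed: (len(region_vocab), E) tensor of trained embeddings.
    """
    E = region_embed.size(1)
    para_embed = torch.empty(len(para_vocab), E)
    unk_row = region_embed[region_vocab[unk_token]]
    for token, idx in para_vocab.items():
        src = region_vocab.get(token)
        para_embed[idx] = region_embed[src] if src is not None else unk_row
    return para_embed
```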
5. Experiments
In this section we describe our paragraph generation ex-
periments on the collected data described in Sec. 3, which
we divide into 14,575 training, 2,487 validation, and 2,489
testing images.
5.1. Baselines
Sentence-Concat:
To demonstrate the difference between
sentence-level and paragraph captions, this baseline samples
and concatenates five sentence captions from a model [12] trained on MS COCO captions [20]. The first sentence uses beam search (beam size = 2) and the rest are sampled. The
motivation for this is as follows: the image captioning model
first produces the sentence that best describes the image as
a whole, and subsequent sentences use sampling in order to
generate a diverse range of sentences, since the alternative
is to repeat the same sentence from beam search. We have
validated that this approach works better than using either
only beam search or only sampling, as the intent is to make
the strongest possible comparison at a task-level to standard
image captioning. We also note that, while Sentence-Concat
is trained on MS COCO, all images in our dataset are also in
MS COCO, and our descriptions were also written by users
on Amazon Mechanical Turk.
Image-Flat:
This model uses a flat representation for both
images and language, and is equivalent to the standard image
captioning model NeuralTalk [12]. It takes the whole image as input, and decodes into a paragraph token by token. We use the publicly available implementation of [12], which uses the 16-layer VGG network [28] to extract CNN features and projects them as input into an LSTM [9], training the
whole model jointly end-to-end.

Citations
Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this article, the authors introduce a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language, which is called ActivityNet Captions.
Abstract: Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time. Finally, we report performances of our model for dense-captioning events, video retrieval and localization.

551 citations

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This article proposed a new framework based on conditional generative adversarial networks (CGAN), which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content.
Abstract: Despite the substantial progress in recent years, the image captioning techniques are still far from being perfect. Sentences produced by existing methods, e.g. those based on RNNs, are often overly rigid and lacking in variability. This issue is related to a learning principle widely used in practice, that is, to maximize the likelihood of training samples. This principle encourages high resemblance to the “ground-truth” captions, while suppressing other reasonable descriptions. Conventional evaluation metrics, e.g. BLEU and METEOR, also favor such restrictive methods. In this paper, we explore an alternative approach, with the aim to improve the naturalness and diversity – two essential properties of human expression. Specifically, we propose a new framework based on Conditional Generative Adversarial Networks (CGAN), which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content. It is noteworthy that training a sequence generator is nontrivial. We overcome the difficulty by Policy Gradient, a strategy stemming from Reinforcement Learning, which allows the generator to receive early feedback along the way. We tested our method on two large datasets, where it performed competitively against real people in our user study and outperformed other methods on various tasks.

415 citations

Journal ArticleDOI
TL;DR: A comprehensive review of state-of-the-art deep learning approaches that have been used in the context of histopathological image analysis can be found in this paper, where a survey of over 130 papers is presented.

260 citations

Proceedings ArticleDOI
01 May 2018
TL;DR: The authors proposed a unified learning framework that collectively addresses all the above issues by composing a committee of discriminators that can guide a base RNN generator towards more globally coherent generations, and human evaluation demonstrates that text generated by their model is preferred over that of baselines by a large margin, significantly enhancing the overall coherence, style, and information of the generations.
Abstract: Despite their local fluency, long-form text generated from RNNs is often generic, repetitive, and even self-contradictory. We propose a unified learning framework that collectively addresses all the above issues by composing a committee of discriminators that can guide a base RNN generator towards more globally coherent generations. More concretely, discriminators each specialize in a different principle of communication, such as Grice’s maxims, and are collectively combined with the base RNN generator through a composite decoding objective. Human evaluation demonstrates that text generated by our model is preferred over that of baselines by a large margin, significantly enhancing the overall coherence, style, and information of the generations.

214 citations

References
Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

111,197 citations


"A Hierarchical Approach for Generat..." refers methods in this paper

  • ...Training is done via stochastic gradient descent with Adam [13] learning rate updates, implemented in Torch....

  • ...Training is done via stochastic gradient descent with Adam [14], implemented in Torch....

Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O. 1. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations


"A Hierarchical Approach for Generat..." refers background or methods in this paper

  • ...All baseline neural language models use two layers of LSTM [9] units with 512 dimensions....

  • ...Alternative recurrent architectures such as long-short term memory (LSTM) [8] help alleviate this problem through a gating mechanism that improves gradient flow....

  • ...At each timestep the hidden state of the last LSTM layer is used to predict a distribution over the words in the vocabulary, and a special END token signals the end of a sentence....

  • ...We adopt the standard LSTM architecture [9] for both the word RNN and sentence RNN....

  • ...Second, the hidden state hi is fed through a two-layer fullyconnected network to produce the topic vector ti ∈ RP for the ith sentence of the paragraph, which is the direct input to the word RNN. Word RNN The word RNN is a two-layer LSTM with hidden size H = 512, which, given a topic vector ti ∈ RP from the sentence RNN, is responsible for generating the words of a sentence....

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Book ChapterDOI
06 Sep 2014
TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

30,462 citations


"A Hierarchical Approach for Generat..." refers background or methods in this paper

  • ...Our hierarchical method also had a much wider vocabulary compared to the Template approach, though Sentence-Concat, trained on hundreds of thousands of MS COCO [20] captions, is a bit larger....

  • ...Sentence-Concat: To demonstrate the difference between sentence-level and paragraph captions, this baseline samples and concatenates five sentence captions from a model [12] trained on MS COCO captions [20]....

  • ...With the advent of large datasets pairing images with natural language descriptions [20, 34, 10, 16] it has recently become possible to generate novel sentences describing images [4, 6, 12, 22, 30]....

  • ...To validate our method, we collected a dataset of image and paragraph pairs, which complements the whole-image and region-level annotations of MS COCO [20] and Visual Genome [16]....

  • ...To what extent does describing images with paragraphs differ from sentence-level captioning? To answer this question, we collected a novel dataset of paragraph annotations, comprised of 19,551 MS COCO [20] and Visual Genome [16] images, where each image has been annotated with a paragraph description....

Proceedings ArticleDOI
06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Abstract: Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.

21,126 citations