Proceedings ArticleDOI

Boosting Image Captioning with Attributes

01 Oct 2017-pp 4904-4912
TL;DR: Yao et al. propose a Long Short-Term Memory with Attributes (LSTM-A) architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner.
Abstract: Automatically describing an image with a natural language has been an emerging challenge in both fields of computer vision and natural language processing. In this paper, we present Long Short-Term Memory with Attributes (LSTM-A) - a novel architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner. Particularly, the learning of attributes is strengthened by integrating inter-attribute correlations into Multiple Instance Learning (MIL). To incorporate attributes into captioning, we construct variants of architectures by feeding image representations and attributes into RNNs in different ways to explore the mutual but also fuzzy relationship between them. Extensive experiments are conducted on COCO image captioning dataset and our framework shows clear improvements when compared to state-of-the-art deep models. More remarkably, we obtain METEOR/CIDEr-D of 25.5%/100.2% on testing data of widely used and publicly available splits in [10] when extracting image representations by GoogleNet and achieve superior performance on COCO captioning Leaderboard.


Under review as a conference paper at ICLR 2017
BOOSTING IMAGE CAPTIONING WITH ATTRIBUTES
Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, Tao Mei
Microsoft Research Asia
{tiyao, v-yipan, v-yehl, v-zhqiu, tmei}@microsoft.com
ABSTRACT
Automatically describing an image with a natural language has been an emerg-
ing challenge in both fields of computer vision and natural language processing.
In this paper, we present Long Short-Term Memory with Attributes (LSTM-A)
- a novel architecture that integrates attributes into the successful Convolutional
Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image cap-
tioning framework, by training them in an end-to-end manner. To incorporate
attributes, we construct variants of architectures by feeding image representations
and attributes into RNNs in different ways to explore the mutual but also fuzzy re-
lationship between them. Extensive experiments are conducted on COCO image
captioning dataset and our framework achieves superior results when compared
to state-of-the-art deep models. Most remarkably, we obtain METEOR/CIDEr-D
of 25.2%/98.6% on the testing data of the widely used and publicly available splits in
(Karpathy & Fei-Fei, 2015) when extracting image representations by GoogleNet
and achieve the top-1 performance to date on the COCO captioning Leaderboard.
1 INTRODUCTION
Accelerated by tremendous increase in Internet bandwidth and proliferation of sensor-rich mobile
devices, image data has been generated, published and spread explosively, becoming an indispens-
able part of today’s big data. This has encouraged the development of advanced techniques for a
broad range of image understanding applications. A fundamental issue that underlies the success
of these technological advances is image recognition (Szegedy et al., 2015; Simonyan & Zisserman,
2015; He et al., 2016). Recently, researchers have strived to automatically describe the content of
an image with a complete and natural sentence, which has a great potential impact for instance on
robotic vision or helping visually impaired people. Nevertheless, this problem is very challenging,
as the description generation model should not only capture the objects or scenes presented in the image,
but also be capable of expressing how the objects/scenes relate to each other in a natural sentence.
The main inspiration of recent attempts on this problem (Donahue et al., 2015; Vinyals et al., 2015;
Xu et al., 2015; You et al., 2016) comes from the advances of using RNNs in machine translation
(Sutskever et al., 2014), which is to translate a text from one language (e.g., English) to another
(e.g., Chinese). The basic idea is to perform sequence-to-sequence learning for translation, where
an encoder RNN reads the input sentence one word at a time till the end of the sentence,
and then a decoder RNN is exploited to generate the sentence in the target language, one word at each
time step. Following this philosophy, it is natural to employ a CNN instead of the encoder RNN for
image captioning, which is regarded as an image encoder to produce image representations.
While encouraging performances are reported, these CNN plus RNN image captioning methods
translate directly from image representations to language, without explicitly taking more high-level
semantic information from images into account. Furthermore, attributes are properties observed in
images with rich semantic cues and have been proved to be effective in visual recognition (Parikh &
Grauman, 2011). A valid question is how to incorporate high-level image attributes into CNN plus
RNN image captioning architecture as complementary knowledge in addition to image representa-
tions. In this paper, we particularly investigate architectures that exploit the mutual relationship
between image representations and attributes to enhance image description generation. Specifi-
cally, to better demonstrate the impact of simultaneously utilizing the two kinds of representations,
we devise variants of architectures by feeding them into the RNN at different places and moments,
e.g., leveraging only attributes, inserting image representations first and then attributes or vice versa,
and inputting image representations/attributes once or at each time step.
The main contribution of this work is the proposal of attribute-augmented architectures that integrate
attributes into the CNN plus RNN image captioning framework, which is a problem not yet fully
understood in the literature. By leveraging more knowledge for building richer representations and
description models, our work takes a further step forward to enhance image captioning and could
have a direct impact by indicating a new direction for vision and language research. More importantly,
the utilization of attributes also has great potential to be an elegant solution for generating open-
vocabulary sentences, making image captioning systems truly practical.
2 RELATED WORK
The research on image captioning has proceeded along three different dimensions: template-based
methods (Kulkarni et al., 2013; Yang et al., 2011; Mitchell et al., 2012), search-based approaches
(Farhadi et al., 2010; Ordonez et al., 2011; Devlin et al., 2015), and language-based models (Don-
ahue et al., 2015; Kiros et al., 2014; Mao et al., 2014; Vinyals et al., 2015; Xu et al., 2015; Wu et al.,
2016; You et al., 2016).
The first direction, template-based methods, predefines templates for sentence generation which
follow specific rules of language grammar and split the sentence into several parts (e.g., subject,
verb, and object). With such sentence fragments, many works align each part with image content
and then generate the sentence for the image. Obviously, most of them highly depend on the sentence
templates and always generate sentences with rigid syntactical structures. For example, Kulkarni
et al. employ a Conditional Random Field (CRF) model to predict the labeling based on the detected
objects, attributes, and prepositions, and then generate the sentence with a template by filling in slots
with the most likely labeling (Kulkarni et al., 2013). Similar in spirit, Yang et al. utilize Hidden
Markov Model (HMM) to select the best objects, scenes, verbs, and prepositions with the highest
log-likelihood ratio for template-based sentence generation in (Yang et al., 2011). Furthermore, the
traditional simple template is extended to syntactic trees in (Mitchell et al., 2012) which also starts
from detecting attributes from image as description anchors and then connecting ordered objects
with a syntactically well-formed tree, followed by adding necessary descriptive information.
Search-based approaches “generate” a sentence for an image by selecting the most semantically sim-
ilar sentences from a sentence pool or by directly copying sentences from other visually similar images.
This direction can indeed achieve human-level descriptions, as all sentences are from existing human-
generated sentences. The need to collect human-generated sentences, however, makes the sentence
pool hard to scale up. Moreover, the approaches in this dimension cannot generate novel de-
scriptions. For instance, in (Farhadi et al., 2010), an intermediate meaning space based on the triplet
of object, action, and scene is proposed to measure the similarity between image and sentence,
where the top sentences are regarded as the generated sentences for the target image. Ordonez et al.
(Ordonez et al., 2011) search images in a large captioned photo collection by using the combination
of object, stuff, people, and scene information and transfer the associated sentences to the query
image. Recently, a simple k-nearest neighbor retrieval model is utilized in (Devlin et al., 2015) and
the best or consensus caption is selected from the returned candidate captions, which even performs
as well as several state-of-the-art language-based models.
Different from template-based and search-based models, language-based models aim to learn the
probability distribution in the common space of visual content and textual sentence to generate
novel sentences with more flexible syntactical structures. In this direction, recent works explore such
probability distribution mainly using neural networks for image captioning. For instance, in (Vinyals
et al., 2015), Vinyals et al. propose an end-to-end neural network architecture that utilizes LSTM
to generate the sentence for an image, which is further extended with an attention mechanism in (Xu
et al., 2015) to automatically focus on salient objects when generating the corresponding words. More
recently, in (Wu et al., 2016), high-level concepts/attributes are shown to obtain clear improvements
on image captioning when injected into existing state-of-the-art RNN-based model and such visual
attributes are further utilized as semantic attention in (You et al., 2016) to enhance image captioning.
In short, our work in this paper belongs to the language-based models. Different from most of
the aforementioned language-based models which mainly focus on sentence generation by solely
depending on image representations (Donahue et al., 2015; Kiros et al., 2014; Mao et al., 2014;
Vinyals et al., 2015; Xu et al., 2015) or high-level attributes (Wu et al., 2016), our work contributes
by studying not only jointly exploiting image representations and attributes for image captioning,
but also how the architecture can be better devised by exploring the mutual relationship between them. It
is also worth noting that (You et al., 2016) additionally involve attributes for image captioning.
Ours is fundamentally different in that (You et al., 2016) utilize attributes
to model semantic attention over the locally previous words, as opposed to holistically employing
attributes as a kind of complementary representation, as we do in this work.
3 BOOSTING IMAGE CAPTIONING WITH ATTRIBUTES
In this paper, we devise our CNN plus RNN architectures to generate descriptions for images under
the umbrella of additionally incorporating the detected high-level attributes. Specifically, we be-
gin this section by presenting the problem formulation, followed by five variants of our image
captioning frameworks with attributes.
3.1 PROBLEM FORMULATION
Suppose we have an image I to be described by a textual sentence S, where S = {w_1, w_2, ..., w_{N_s}}
consists of N_s words. Let I ∈ R^{D_v} and w_t ∈ R^{D_s} denote the D_v-dimensional image
representations of the image I and the D_s-dimensional textual features of the t-th word in sentence S,
respectively. Furthermore, we have a feature vector A ∈ R^{D_a} to represent the probability distribution
over the high-level attributes for image I. Specifically, we train the attribute detectors by using the
weakly-supervised approach of Multiple Instance Learning (MIL) in (Fang et al., 2015) on image
captioning benchmarks. For an attribute w_a, one image I is regarded as a positive bag of regions
(instances) if w_a exists in image I's ground-truth sentences, and as a negative bag otherwise. By inputting
all the bags into a noisy-OR MIL model, the probability of the bag b_I which contains attribute w_a is
measured on the probabilities of all the regions in the bag as

    Pr_I^{w_a} = 1 - ∏_{r_i ∈ b_I} (1 - p_i^{w_a}),    (1)

where p_i^{w_a} is the probability of the attribute w_a predicted by region r_i and can be calculated through
a sigmoid layer after the last convolutional layer in the fully convolutional network. In particular, the
dimension of the convolutional activations from the last convolutional layer is x × x × h, where h represents
the representation dimension of each region, resulting in an x × x response map which preserves the
spatial dependency of the image. Then, a cross-entropy loss is calculated based on the probabilities
of all the attributes at the top of the whole architecture to optimize the MIL model. With the learnt MIL
model on the image captioning dataset, we treat the final image-level response probabilities of all the
attributes as A.
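As an illustration of the noisy-OR aggregation in Eq. (1), here is a minimal NumPy sketch; the region-probability values and the number of regions/attributes are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def noisy_or_attribute_probs(region_probs):
    """Aggregate per-region attribute probabilities into image-level ones.

    region_probs: array of shape (num_regions, num_attributes), where
    entry (i, a) is p_i^{w_a}, the probability that region r_i contains
    attribute w_a (e.g., from a sigmoid layer over conv activations).

    Returns an array of shape (num_attributes,) with
    Pr_I^{w_a} = 1 - prod_i (1 - p_i^{w_a}), as in Eq. (1).
    """
    return 1.0 - np.prod(1.0 - region_probs, axis=0)

# Hypothetical example: 4 regions (e.g., a 2x2 response map) and 3 attributes.
region_probs = np.array([
    [0.10, 0.80, 0.05],
    [0.20, 0.60, 0.02],
    [0.05, 0.90, 0.01],
    [0.15, 0.70, 0.03],
])
A = noisy_or_attribute_probs(region_probs)  # image-level attribute probabilities, used as A
print(A)  # approximately [0.42, 0.998, 0.11]
```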
Inspired by the recent successes of probabilistic sequence models leveraged in statistical machine
translation (Bahdanau et al., 2015; Sutskever et al., 2014), we aim to formulate our image captioning
models in an end-to-end fashion based on RNNs which encode the given image and/or its detected
attributes into a fixed dimensional vector and then decode it to the target output sentence. Hence,
the sentence generation problem we explore here can be formulated by minimizing the following
energy loss function as
    E(I, A, S) = -log Pr(S | I, A),    (2)
which is the negative log probability of the correct textual sentence given the image representations
and detected attributes.
Since the model produces one word in the sentence at each time step, it is natural to apply chain rule
to model the joint probability over the sequential words. Thus, the log probability of the sentence is
given by the sum of the log probabilities over the word and can be expressed as
    log Pr(S | I, A) = ∑_{t=1}^{N_s} log Pr(w_t | I, A, w_0, . . . , w_{t-1}).    (3)
By minimizing this loss, the contextual relationship among the words in the sentence can be guar-
anteed given the image and its detected attributes.
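To make Eq. (2)-(3) concrete, here is a minimal NumPy sketch of the per-sentence training loss, assuming the decoder has already produced a softmax distribution over the vocabulary at every time step; the array names and shapes are illustrative, not from the paper.

```python
import numpy as np

def sentence_loss(step_word_probs, target_word_ids):
    """Negative log-likelihood of a sentence, as in Eq. (2)-(3).

    step_word_probs: array of shape (N_s, vocab_size); row t is the model's
    distribution Pr(w_t | I, A, w_0, ..., w_{t-1}) after a softmax.
    target_word_ids: length-N_s array of ground-truth word indices
    (w_1, ..., w_{N_s}, where w_{N_s} is the end-of-sentence token).
    """
    log_probs = np.log(step_word_probs[np.arange(len(target_word_ids)), target_word_ids])
    return -np.sum(log_probs)  # E(I, A, S) = -log Pr(S | I, A)
```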
Figure 1: Five variants of our LSTM-A framework (better viewed in color).
We formulate this task as a variable-length sequence-to-sequence problem and model the parametric
distribution Pr(w_t | I, A, w_0, . . . , w_{t-1}) in Eq. (3) with Long Short-Term Memory (LSTM), which
is a widely used type of RNN. The vector formulas for an LSTM layer forward pass are summarized
below. For time step t, x_t and h_t are the input and output vectors respectively, T are input weight
matrices, R are recurrent weight matrices and b are bias vectors. Sigmoid σ and hyperbolic tangent
φ are element-wise non-linear activation functions. The element-wise product of two vectors is denoted with
⊙. Given inputs x_t, h_{t-1} and c_{t-1}, the LSTM unit updates for time step t are:

    g_t = φ(T_g x_t + R_g h_{t-1} + b_g),    i_t = σ(T_i x_t + R_i h_{t-1} + b_i),
    f_t = σ(T_f x_t + R_f h_{t-1} + b_f),    c_t = g_t ⊙ i_t + c_{t-1} ⊙ f_t,
    o_t = σ(T_o x_t + R_o h_{t-1} + b_o),    h_t = φ(c_t) ⊙ o_t,

where g_t, i_t, f_t, c_t, o_t, and h_t are the cell input, input gate, forget gate, cell state, output gate, and cell
output of the LSTM, respectively.
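As a reading aid, the following NumPy sketch implements one step of the LSTM update exactly as written above; the dictionary layout for the weights T, R, b and the sigmoid helper are illustrative assumptions, not part of the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, T, R, b):
    """One LSTM update, following the formulas in the text.

    T, R, b are dicts keyed by 'g', 'i', 'f', 'o' holding the input weight
    matrices, recurrent weight matrices, and bias vectors, respectively.
    """
    g_t = np.tanh(T['g'] @ x_t + R['g'] @ h_prev + b['g'])   # cell input
    i_t = sigmoid(T['i'] @ x_t + R['i'] @ h_prev + b['i'])   # input gate
    f_t = sigmoid(T['f'] @ x_t + R['f'] @ h_prev + b['f'])   # forget gate
    c_t = g_t * i_t + c_prev * f_t                           # cell state
    o_t = sigmoid(T['o'] @ x_t + R['o'] @ h_prev + b['o'])   # output gate
    h_t = np.tanh(c_t) * o_t                                 # cell output
    return h_t, c_t
```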
3.2 LONG SHORT-TERM MEMORY WITH ATTRIBUTES
Unlike the existing image captioning models in (Donahue et al., 2015; Vinyals et al., 2015) which
solely encode image representations for sentence generation, our proposed Long Short-Term Mem-
ory with Attributes (LSTM-A) model additionally integrates the detected high-level attributes into
LSTM. We devise five variants of LSTM-A to serve two design purposes. The first purpose
is about where to feed attributes into the LSTM, and three architectures, i.e., LSTM-A_1 (leveraging
only attributes), LSTM-A_2 (inserting image representations first) and LSTM-A_3 (feeding attributes
first), are derived from this view. The second is about when to input attributes or image represen-
tations into the LSTM, and we design LSTM-A_4 (inputting image representations at each time step)
and LSTM-A_5 (inputting attributes at each time step) for this purpose. An overview of the LSTM-A
architectures is depicted in Figure 1.
3.2.1 LSTM-A_1 (LEVERAGING ONLY ATTRIBUTES)
Given the detected attributes, one natural way is to directly inject the attributes as representations at
the initial time to inform the LSTM about the high-level attributes. This kind of architecture with
only attributes input is named as LSTM-A_1. It is also worth noting that the attributes-based model
in (Wu et al., 2016) is similar to LSTM-A_1 and can be regarded as one special case of our LSTM-A.
Given the attribute representations A and the corresponding sentence W = [w_0, w_1, ..., w_{N_s}], the
LSTM updating procedure in LSTM-A_1 is as

    x_{-1} = T_a A,
    x_t = T_s w_t, t ∈ {0, . . . , N_s - 1}   and   h_t = f(x_t), t ∈ {0, . . . , N_s - 1},

where D_e is the dimensionality of the LSTM input, T_a ∈ R^{D_e × D_a} and T_s ∈ R^{D_e × D_s} are the
transformation matrices for attribute representations and textual features of words, respectively, and f is the
updating function within the LSTM unit. Please note that for the input sentence W = [w_0, . . . , w_{N_s}],
we take w_0 as the start sign word to inform the beginning of the sentence and w_{N_s} as the end sign
word which indicates the end of the sentence. Both of the special sign words are included in our
vocabulary. More specifically, at the initial time step, the attribute representations are transformed
as the input to the LSTM, and then in the next steps, the word embedding x_t is input into the
LSTM along with the previous step's hidden state h_{t-1}. In each time step (except the initial step),
we use the LSTM cell output h_t to predict the next word. Here a softmax layer is applied after
the LSTM layer to produce a probability distribution over all the D_s words in the vocabulary as

    Pr_{t+1}(w_{t+1}) = exp(T_h^{(w_{t+1})} h_t) / ∑_{w ∈ W} exp(T_h^{(w)} h_t),

where W is the word vocabulary space and T_h^{(w)} is the parameter matrix in the softmax layer.
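Building on the lstm_step sketch above, a single LSTM-A_1 decoding step could look as follows: the previous word is embedded via T_s, passed through the LSTM together with the previous hidden and cell states, and the output h_t is turned into a word distribution by the softmax layer. The parameter dictionary and the embedding lookup are hypothetical; only the T_s and T_h notation follows the text.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def lstm_a1_decode_step(w_prev_id, h_prev, c_prev, params):
    """One decoding step of LSTM-A_1 (attributes were injected at step -1).

    params holds T_s (word transformation), the LSTM weights T, R, b as in
    lstm_step, the softmax parameters T_h (one row per vocabulary word),
    and a table of textual word features.
    """
    w_prev = params['word_vectors'][w_prev_id]        # textual feature of the previous word
    x_t = params['T_s'] @ w_prev                      # x_t = T_s w_t
    h_t, c_t = lstm_step(x_t, h_prev, c_prev, params['T'], params['R'], params['b'])
    probs = softmax(params['T_h'] @ h_t)              # Pr_{t+1}(w) over the vocabulary
    return probs, h_t, c_t
```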
3.2.2 LSTM-A_2 (INSERTING IMAGE REPRESENTATIONS FIRST)
To further leverage both image representations and high-level attributes in the encoding stage of our
LSTM-A, we design the second architecture LSTM-A_2 by treating both of them as atoms in the input
sequence to LSTM. Specifically, at the initial step, the image representations I are firstly transformed
and fed into LSTM to inform the LSTM about the image content, followed by the attribute representations
A, which are encoded into LSTM at the next time step to inform it of the high-level attributes. Then,
LSTM decodes each output word based on the previous word x_t and the previous step's hidden state h_{t-1},
which is similar to the decoding stage in LSTM-A_1. The LSTM updating procedure in LSTM-A_2 is
designed as

    x_{-2} = T_v I   and   x_{-1} = T_a A,
    x_t = T_s w_t, t ∈ {0, . . . , N_s - 1}   and   h_t = f(x_t), t ∈ {0, . . . , N_s - 1},

where T_v ∈ R^{D_e × D_v} is the transformation matrix for image representations.
3.2.3 LSTM-A_3 (FEEDING ATTRIBUTES FIRST)
The third design LSTM-A_3 is similar to LSTM-A_2 as both designs utilize image representations
and high-level attributes to form the input sequence to LSTM in the encoding stage, except that the
orders of encoding are different. In LSTM-A_3, the attribute representations are firstly encoded into
LSTM and then the image representations are transformed into LSTM at the second time step. The
whole LSTM updating procedure in LSTM-A_3 is as

    x_{-2} = T_a A   and   x_{-1} = T_v I,
    x_t = T_s w_t, t ∈ {0, . . . , N_s - 1}   and   h_t = f(x_t), t ∈ {0, . . . , N_s - 1}.
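LSTM-A_1, LSTM-A_2 and LSTM-A_3 differ only in which inputs are fed to the LSTM before the sentence starts and in what order. A minimal sketch of that encoding stage, reusing the lstm_step helper above (the parameter dictionary is again a hypothetical layout):

```python
def encode_before_sentence(variant, I, A, params, h, c):
    """Run the encoding steps (time steps -2 and/or -1) for LSTM-A_1/2/3.

    variant 1 feeds only attributes; variant 2 feeds image then attributes;
    variant 3 feeds attributes then image. I and A are the image representation
    and attribute distribution; params holds T_v, T_a and the LSTM weights.
    """
    if variant == 1:
        encoder_inputs = [params['T_a'] @ A]                      # x_{-1} = T_a A
    elif variant == 2:
        encoder_inputs = [params['T_v'] @ I, params['T_a'] @ A]   # x_{-2} = T_v I, x_{-1} = T_a A
    elif variant == 3:
        encoder_inputs = [params['T_a'] @ A, params['T_v'] @ I]   # x_{-2} = T_a A, x_{-1} = T_v I
    else:
        raise ValueError("variant must be 1, 2 or 3")
    for x in encoder_inputs:
        h, c = lstm_step(x, h, c, params['T'], params['R'], params['b'])
    return h, c  # state passed on to the word-by-word decoding stage
```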
3.2.4 LSTM-A_4 (INPUTTING IMAGE REPRESENTATIONS AT EACH TIME STEP)
Different from the former three designed architectures which mainly inject high-level attributes
and image representations at the encoding stage of LSTM, we next modify the decoding stage in
our LSTM-A by additionally incorporating image representations or high-level attributes. More
specifically, in LSTM-A_4, the attribute representations are injected once at the initial step to inform
the LSTM about the high-level attributes, and then image representations are fed at each time step as
an extra input to LSTM to emphasize the image content frequently among memory cells in LSTM.
Hence, the LSTM updating procedure in LSTM-A_4 is:

    x_{-1} = T_a A,
    x_t = T_s w_t + T_v I, t ∈ {0, . . . , N_s - 1}   and   h_t = f(x_t), t ∈ {0, . . . , N_s - 1}.
3.2.5 LSTM-A_5 (INPUTTING ATTRIBUTES AT EACH TIME STEP)
The last design LSTM-A_5 is similar to LSTM-A_4 except that it firstly encodes image representations
and then feeds attribute representations as an additional input to LSTM at each step in the decoding
stage to emphasize the high-level attributes frequently. Accordingly, the LSTM updating procedure
in LSTM-A_5 is as

    x_{-1} = T_v I,
    x_t = T_s w_t + T_a A, t ∈ {0, . . . , N_s - 1}   and   h_t = f(x_t), t ∈ {0, . . . , N_s - 1}.
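LSTM-A_4 and LSTM-A_5 instead modify the decoding stage, adding either the transformed image representation or the transformed attribute vector to every word input. A small sketch of that per-step input construction under the same hypothetical parameter layout as above:

```python
def decode_step_input(variant, w_t, I, A, params):
    """Build the LSTM input x_t at decoding step t for LSTM-A_4/5.

    LSTM-A_4: x_t = T_s w_t + T_v I (image fed at every step).
    LSTM-A_5: x_t = T_s w_t + T_a A (attributes fed at every step).
    """
    x_t = params['T_s'] @ w_t
    if variant == 4:
        x_t = x_t + params['T_v'] @ I
    elif variant == 5:
        x_t = x_t + params['T_a'] @ A
    return x_t
```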

Citations
Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this paper, a bottom-up and top-down attention mechanism was proposed to enable attention to be calculated at the level of objects and other salient image regions, which achieved state-of-the-art results on the MSCOCO test server.
Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

2,904 citations

Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.
Abstract: Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as the and of. Other words that may seem visual can often be predicted reliably just from the language model e.g., sign after behind a red stop or phone following talking on a cell. In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin.

1,093 citations

Book ChapterDOI
08 Sep 2018
TL;DR: Yao et al. propose GCN-LSTM, which integrates both semantic and spatial object relationships into the image encoder with an attention mechanism, to explore the connections between objects for image captioning under the umbrella of the attention-based encoder-decoder framework.
Abstract: It is always well believed that modeling relationships between objects would be helpful for representing and eventually describing an image. Nevertheless, there has not been evidence in support of the idea on image description generation. In this paper, we introduce a new design to explore the connections between objects for image captioning under the umbrella of attention-based encoder-decoder framework. Specifically, we present Graph Convolutional Networks plus Long Short-Term Memory (dubbed as GCN-LSTM) architecture that novelly integrates both semantic and spatial object relationships into image encoder. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representations of each region proposed on objects are then refined by leveraging graph structure through GCN. With the learnt region-level features, our GCN-LSTM capitalizes on LSTM-based captioning framework with attention mechanism for sentence generation. Extensive experiments are conducted on COCO image captioning dataset, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on COCO testing set.

775 citations

Proceedings ArticleDOI
01 Oct 2019
TL;DR: AoANet as mentioned in this paper proposes an Attention on Attention (AoA) module, which extends the conventional attention mechanisms to determine the relevance between attention results and queries and achieves state-of-the-art performance.
Abstract: Attention mechanisms are widely used in current encoder/decoder frameworks of image captioning, where a weighted average on encoded vectors is generated at each time step to guide the caption decoding process. However, the decoder has little idea of whether or how well the attended vector and the given attention query are related, which could make the decoder give misled results. In this paper, we propose an Attention on Attention (AoA) module, which extends the conventional attention mechanisms to determine the relevance between attention results and queries. AoA first generates an information vector and an attention gate using the attention result and the current context, then adds another attention by applying element-wise multiplication to them and finally obtains the attended information, the expected useful knowledge. We apply AoA to both the encoder and the decoder of our image captioning model, which we name as AoA Network (AoANet). Experiments show that AoANet outperforms all previously published methods and achieves a new state-of-the-art performance of 129.8 CIDEr-D score on MS COCO Karpathy offline test split and 129.6 CIDEr-D (C40) score on the official online testing server. Code is available at https://github.com/husthuaan/AoANet.

641 citations

Proceedings ArticleDOI
01 Jun 2018
TL;DR: The authors decompose expressions into three modular components related to subject appearance, location, and relationship to other objects in an end-to-end framework, which allows to flexibly adapt to expressions containing different types of information.
Abstract: In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. Demo1 and code2 are provided.

626 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations

Book ChapterDOI
06 Sep 2014
TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

30,462 citations