
End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering

01 Jul 2017-pp 3261-3269
TL;DR: In this paper, a high-level concept word detector is proposed that can be integrated with any video-to-language models to generate a list of concept words as useful semantic priors for language generation models.


End-to-end Concept Word Detection
for Video Captioning, Retrieval, and Question Answering
Youngjae Yu Hyungjin Ko Jongwook Choi Gunhee Kim
Seoul National University, Seoul, Korea
{yj.yu, hj.ko}@vision.snu.ac.kr, {wookayin, gunhee}@snu.ac.kr
http://vision.snu.ac.kr/project/lsmdc-2016
Abstract
We propose a high-level concept word detector that can be integrated with any video-to-language model. It takes a video as input and generates a list of concept words as useful semantic priors for language generation models. The proposed word detector has two important properties. First, it does not require any external knowledge sources for training. Second, it is trainable in an end-to-end manner jointly with any video-to-language model. To effectively exploit the detected words, we also develop a semantic attention mechanism that selectively focuses on the detected concept words and fuses them with the word encoding and decoding in the language model. In order to demonstrate that the proposed approach indeed improves the performance of multiple video-to-language tasks, we participate in all four tasks of LSMDC 2016 [18]. Our approach has won three of them, including fill-in-the-blank, multiple-choice test, and movie retrieval.
1. Introduction
Video-to-language tasks, including video captioning [6, 8, 17, 27, 32, 35] and video question answering (QA) [23], are recently emerging challenges in computer vision research. This set of problems is interesting as one of the frontiers of artificial intelligence; beyond that, it can also enable multiple practical applications, such as retrieving video content by users' free-form queries or helping visually impaired people understand visual content. Recently, a number of large-scale datasets have been introduced as a common ground for researchers to promote the progress of video-to-language research (e.g. [4, 16, 18, 23]).
The objective of this work is to propose a concept word detector, as shown in Fig. 1, which takes a training set of videos and associated sentences as input, and generates a list of high-level concept words per video as useful semantic priors for a variety of video-to-language tasks, including video captioning, retrieval, and question answering.
Figure 1. The intuition of the proposed concept word detector. Given a video clip, a set of tracing LSTMs extracts multiple concept words that consistently appear across frame regions. We then employ semantic attention to combine the detected concepts with text encoding/decoding for several video-to-language tasks of LSMDC 2016, such as captioning, retrieval, and question answering. (The original figure shows an input movie clip, the K detected concept words, e.g. outside, street, car, pull, drive, get, down, road, housefront, and examples of the four tasks: description, fill-in-the-blank, multiple-choice test, and retrieval.)
We design our word detector to have the following two characteristics, so that it can be easily integrated with any video-to-language model. First, it does not require any external knowledge sources for training. Instead, our detector learns the correlation between words in the captions and video regions from the whole training data. To this end, we use a continuous soft attention mechanism that traces consistent visual information across frames and associates it with concept words from the captions. Second, the word detector is trainable in an end-to-end manner jointly with any video-to-language model. The loss function for learning the word detector can be plugged as an auxiliary term into the model's overall cost function; as a result, we can reduce the effort of separately collecting training examples and learning both models.
We also develop language model components to effectively exploit the detected words. Inspired by semantic attention in image captioning research [34], we develop an attention mechanism that selectively focuses on the detected concept words and fuses them with the word encoding and decoding in the language model. That is, the detected concept words are combined with input words to better represent the hidden states of encoders, and with output words to generate more accurate word predictions.
In order to demonstrate that the proposed word detector and attention mechanism indeed improve the performance of multiple video-to-language tasks, we participate in the four tasks of LSMDC 2016 (Large Scale Movie Description Challenge) [18], which is one of the most active and successful benchmarks advancing the progress of video-to-language research. The challenge includes movie description and multiple-choice test as video captioning, fill-in-the-blank as video question answering, and movie retrieval as video retrieval. Following the public evaluation protocol of LSMDC 2016, our approach achieves the best accuracies in three tasks (fill-in-the-blank, multiple-choice test, and movie retrieval), and comparable performance in the other task (movie description).
1.1. Related Work
Our work can be uniquely positioned in the context of
two recent research directions in image/video captioning.
Image/Video Captioning with Word Detection. Image and video captioning has been actively studied in recent vision and language research, including [5, 6, 8, 17, 19, 27, 28], to name a few. Among them, there have been several attempts to detect a set of concept words or attributes from visual input to boost captioning performance. In image captioning research, Fang et al. [7] exploit a multiple instance learning (MIL) approach to train visual detectors that identify a set of words associated with bounding-boxed regions of the image. Based on the detected words, they retrieve and re-rank the best caption sentence for the image. Wu et al. [29] use a CNN to learn a mapping between an image and semantic attributes, and then exploit the mapping as an input to the captioning decoder. They also extend the framework to explicitly leverage external knowledge bases such as DBpedia for question answering tasks. Venugopalan et al. [26] generate descriptions with novel words beyond the ones in the training set by leveraging external sources, including object recognition datasets like ImageNet and external text corpora like Wikipedia. You et al. [34] also exploit weak labels and tags on Internet images to train additional parametric visual classifiers for image captioning.
In the video domain, it is more ambiguous to learn the relation between descriptive words and visual patterns, and there has been only little work in video captioning. Rohrbach et al. [17] propose a two-step approach for video captioning on the LSMDC dataset. They first extract verbs, objects, and places from movie descriptions, and separately train SVM-based classifiers for each group. They then learn an LSTM decoder that generates the text description based on the responses of these visual classifiers.
While almost all previous captioning methods exploit external classifiers for concept or attribute detection, the novelty of our work lies in that we use only the captioning training data, with no external sources, to learn the word detector, and propose an end-to-end design for learning word detection and caption generation simultaneously. Moreover, compared to the video captioning work of [17], which addresses only the movie description task of LSMDC, this work is more comprehensive in that we validate the usefulness of our method on all four tasks of LSMDC.
Attention for Captioning. Attention mechanisms have been successfully applied to caption generation. One of the earliest works is [31], which dynamically focuses on different image regions to produce an output word sequence. Later, this soft attention was extended to temporal attention over video frames [33, 35] for video captioning.
Beyond attention on the spatial or temporal structure of visual input, You et al. [34] recently propose attention on attribute words for image captioning. That is, the method enumerates a set of important object labels in the image, and then dynamically switches attention among these concept labels. Although our approach also exploits the idea of semantic attention, it bears two key differences. First, we extend semantic attention to the video domain for the first time, not only for video captioning but also for retrieval and question answering tasks. Second, the approach of [34] relies on classifiers that are separately learned from external datasets, whereas our approach is learnable end-to-end with only the captioning training data. This significantly reduces the effort of preparing additional multi-label classifiers.
1.2. Contributions
We summarize the contributions of this work as follows.
(1) We propose a novel end-to-end learning approach for detecting a list of concept words and attending to them to enhance the performance of multiple video-to-language tasks. The proposed concept word detection and attention model can be plugged into any model for video captioning, retrieval, and question answering. Our technical novelty can be seen against two recent trends of image/video captioning research. First, our work is the first end-to-end trainable model not only for concept word detection but also for language generation. Second, our work is the first semantic attention model for video-to-language tasks.
(2) To validate the applicability of the proposed approach, we participate in all four tasks of LSMDC 2016. Our models have won three of them, including fill-in-the-blank, multiple-choice test, and movie retrieval. We also attain comparable performance on movie description.
2. Detection of Concept Words from Videos
We first explain the pre-processing steps for the representation of words and video frames. Then, we explain how we detect concept words for a given video.
2.1. Preprocessing
Dictionary and Word Embedding. We define a vocabulary dictionary V by collecting the words that occur more than three times in the dataset. The dictionary size is |V| = 12,486, from which our models sequentially select words as output. We train the word2vec skip-gram embedding [14] to obtain the word embedding matrix $E \in \mathbb{R}^{d \times |V|}$, where $d$ is the word embedding dimension and $|V|$ is the dictionary size. We set $d = 300$ in our implementation.
Video Representation. We first equidistantly sample one out of every ten frames from a video, to reduce frame redundancy while minimizing the loss of information. We denote the number of sampled video frames by $N$. We limit the maximum number of frames to $N_{\max} = 40$; if a video is too long, we use a wider interval for uniform sampling.
We employ a convolutional neural network (CNN) to encode the video input. Specifically, we extract the feature map of each frame from the res5c layer (i.e. $\mathbb{R}^{7 \times 7 \times 2048}$) of ResNet [9] pretrained on the ImageNet dataset [20], and then apply a $2 \times 2$ max-pooling followed by a $3 \times 3$ convolution to reduce the dimension to $\mathbb{R}^{4 \times 4 \times 500}$. Reducing the number of spatial grid regions to $4 \times 4$ helps the concept word detector train much faster, while not hurting detection performance significantly. We denote the resulting visual features of the frames by $\{\mathbf{v}_n\}_{n=1}^{N}$. Throughout this paper, we use $n$ to denote the video frame index.
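The frame encoding could look roughly like the following sketch, using torchvision's ResNet-152 as a stand-in for the Caffe model whose res5c layer the paper uses; the pooling ceil_mode, the convolution padding, and freezing the backbone are assumptions, since the paper only specifies the input and output tensor sizes.

```python
import torch
import torch.nn as nn
import torchvision.models as models  # assumes torchvision >= 0.13

class FrameFeatureExtractor(nn.Module):
    """Maps a batch of frames to 4x4x500 visual features {v_n} (sketch)."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        # everything up to layer4 yields a 7x7x2048 feature map, playing the role of res5c
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # 2x2 max-pooling (ceil_mode so that 7x7 -> 4x4), then a 3x3 convolution to 500 channels
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)
        self.reduce = nn.Conv2d(2048, 500, kernel_size=3, padding=1)

    def forward(self, frames):            # frames: (N, 3, 224, 224), N sampled frames
        with torch.no_grad():             # assumption: the CNN backbone is kept frozen
            fmap = self.backbone(frames)  # (N, 2048, 7, 7)
        v = self.reduce(self.pool(fmap))  # (N, 500, 4, 4)
        return v
```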
2.2. An Attention Model for Concept Detection
Concept Words and Traces. We propose a concept word detector using LSTM networks with a soft attention mechanism. Its structure is shown in the red box of Fig. 2. Its goal is, for a given video, to discover a list of concept words that consistently appear across frame regions. The detected concept words are used as additional references for the video captioning model (Section 3.1), which generates an output sentence by selectively attending to those words.
We first define a set of candidate words of size $V$ from all training captions. Among them, we discover $K$ concept words per video. We set $V$ = 2,000 and $K$ = 10. We first apply the automatic POS tagging of NLTK [3] to extract nouns, verbs, and adjectives from all training caption sentences [7]. We then compute the frequencies of those words in the training set, and select the $V$ most common words as concept word candidates.
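A minimal sketch of this candidate selection step is given below; the lowercasing and the exact Penn Treebank tag filter are assumptions, and NLTK's tokenizer and tagger resources are assumed to be available.

```python
from collections import Counter

import nltk  # assumes the punkt and averaged_perceptron_tagger resources are downloaded

def candidate_concept_words(caption_sentences, num_candidates=2000):
    """Return the V most frequent nouns/verbs/adjectives in the training captions (sketch)."""
    counter = Counter()
    for sent in caption_sentences:
        tokens = nltk.word_tokenize(sent.lower())
        for word, tag in nltk.pos_tag(tokens):
            # Penn Treebank tags: NN*, VB*, JJ* cover nouns, verbs, and adjectives
            if tag.startswith(('NN', 'VB', 'JJ')):
                counter[word] += 1
    return [w for w, _ in counter.most_common(num_candidates)]
```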
Since we do not have groundtruth bounding boxes for concept words in the videos, we cannot train individual concept detectors in a standard supervised setting. Our idea is to adopt a soft attention mechanism to infer words by tracking regions that are spatially consistent. To this end, we employ a set of tracing LSTMs, each of which takes care of a single spatially consistent meaning being tracked over time, which we call a trace. That is, we keep track of spatial attention over video frames using an LSTM, so that the spatial attention in adjacent frames captures the spatial consistency of a single concept (e.g. a moving object, or an action in the video clip; see Fig. 1). We use a total of $L$ tracing LSTMs to capture $L$ traces (or concepts), where $L$ is the number of spatial regions in the visual feature (i.e. $L = 4 \times 4 = 16$ for $\mathbf{v} \in \mathbb{R}^{4 \times 4 \times D}$). Fusing these $L$ concepts together, we finally discover $K$ concept words, as will be described next.
Computation of Spatial Attention. For each trace $l$, we maintain spatial attention weights $\alpha_n^{(l)} \in \mathbb{R}^{4 \times 4}$, indicating where to attend on the $(4 \times 4)$ spatial grid locations of $\mathbf{v}_n$, through video frames $n = 1, \dots, N$. The initial attention weight $\alpha_0^{(l)}$ at $n = 0$ is initialized with a one-hot matrix, one for each of the $L$ grid locations. We compute the hidden states $\mathbf{h}_n^{(l)} \in \mathbb{R}^{500}$ of the LSTM through $n = 1, \dots, N$ by:

$$\mathbf{c}_n^{(l)} = \alpha_n^{(l)} \otimes \mathbf{v}_n, \quad (1)$$
$$\mathbf{h}_n^{(l)} = \mathrm{LSTM}(\mathbf{c}_n^{(l)}, \mathbf{h}_{n-1}^{(l)}), \quad (2)$$
where $A \otimes B = \sum_{j,k} A_{(j,k)} \cdot B_{(j,k,:)}$. The input to the LSTMs is the context vector $\mathbf{c}_n^{(l)} \in \mathbb{R}^{500}$, which is obtained by applying the spatial attention $\alpha_n^{(l)}$ to the visual feature $\mathbf{v}_n$. Note that the parameters of the $L$ LSTMs are shared.
The attention weight vector $\alpha_n^{(l)} \in \mathbb{R}^{4 \times 4}$ at time step $n$ is updated as follows:

$$\mathbf{e}_n^{(l)}(j,k) = \mathbf{v}_n(j,k) \odot \mathbf{h}_{n-1}^{(l)}, \quad (3)$$
$$\alpha_n^{(l)} = \mathrm{softmax}\big(\mathrm{Conv}(\mathbf{e}_n^{(l)})\big), \quad (4)$$

where $\odot$ is the elementwise product, and $\mathrm{Conv}(\cdot)$ denotes the two convolution operations before the softmax layer in Fig. 2. Note that $\alpha_n^{(l)}$ in Eq. (3)–(4) is computed from the previous hidden state $\mathbf{h}_{n-1}^{(l)}$ of the LSTM.
The spatial attention $\alpha_n^{(l)}$ measures how each spatial grid location of the visual feature is related to the concept being tracked by the tracing LSTM. By repeating these two steps of Eq. (1)–(4) from $n = 1$ to $N$, our model can continuously find important and temporally consistent meanings over time that are closely related to a part of the video, rather than focusing on each video frame individually.
Finally, we predict the concept confidence vector $\mathbf{p}$:

$$\mathbf{p} = \sigma\big(\mathbf{W}_p\,[\mathbf{h}_N^{(1)}; \cdots; \mathbf{h}_N^{(L)}] + \mathbf{b}_p\big) \in \mathbb{R}^{V}, \quad (5)$$

Figure 2. The architecture of the concept word detector (top red box, Section 2.2) and our video description model (bottom), which uses semantic attention on the detected concept words (Section 3.1).
That is, we first concatenate the hidden states $\{\mathbf{h}_N^{(l)}\}_{l=1}^{L}$ at the last time step of all tracing LSTMs, apply a linear transform parameterized by $\mathbf{W}_p \in \mathbb{R}^{V \times (500 L)}$ and $\mathbf{b}_p \in \mathbb{R}^{V}$, and apply the elementwise sigmoid activation $\sigma$.
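To make the computation in Eq. (1)-(5) concrete, here is a minimal PyTorch sketch of the tracing-LSTM detector. The exact layer sizes of the two attention convolutions, the ReLU between them, and the way the one-hot initialization of the attention enters the first step are assumptions; only the tensor shapes and the shared LSTM parameters follow the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptWordDetector(nn.Module):
    """Sketch of the tracing-LSTM concept word detector (Eq. 1-5)."""
    def __init__(self, D=500, grid=4, num_candidates=2000):
        super().__init__()
        self.D, self.grid, self.L = D, grid, grid * grid   # one trace per spatial grid cell
        self.cell = nn.LSTMCell(D, D)                      # parameters shared by all L tracing LSTMs
        self.att_conv = nn.Sequential(                     # the "two convolutions" before softmax (Eq. 4)
            nn.Conv2d(D, D, 3, padding=1), nn.ReLU(),      # intermediate width and ReLU are assumptions
            nn.Conv2d(D, 1, 3, padding=1))
        self.fc = nn.Linear(D * self.L, num_candidates)    # W_p, b_p of Eq. (5)

    def forward(self, v):                                  # v: (N, D, 4, 4) features of one video
        N, D, G, _ = v.shape
        L = self.L
        alpha = torch.eye(L, device=v.device).view(L, G, G)   # one-hot alpha_0, one per grid cell
        h = v.new_zeros(L, D)
        c = v.new_zeros(L, D)
        for n in range(N):
            vn = v[n]                                          # (D, G, G)
            if n > 0:
                # Eq. (3)-(4): attention from the previous hidden state and the current frame
                e = vn.unsqueeze(0) * h.view(L, D, 1, 1)       # (L, D, G, G), elementwise product
                alpha = F.softmax(self.att_conv(e).view(L, -1), dim=1).view(L, G, G)
            # Eq. (1): context = attention-weighted sum over the grid locations
            ctx = torch.einsum('ljk,djk->ld', alpha, vn)       # (L, D)
            h, c = self.cell(ctx, (h, c))                      # Eq. (2), all L traces as a batch
        # Eq. (5): concatenate the final hidden states of all traces, then linear + sigmoid
        p = torch.sigmoid(self.fc(h.reshape(1, -1)))           # (1, V) concept confidences
        return p.squeeze(0)
```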
Training and Inference. For training, we obtain a reference concept confidence vector $\mathbf{p}^* \in \mathbb{R}^{V}$ whose element $p_i^*$ is 1 if the corresponding word exists in the groundtruth caption, and 0 otherwise. We minimize the following sigmoid cross-entropy cost $\mathcal{L}_{\mathrm{con}}$, which is often used for multi-label classification [30] where each class is independent and not mutually exclusive:

$$\mathcal{L}_{\mathrm{con}} = -\frac{1}{V} \sum_{i=1}^{V} \big[ p_i^* \log(p_i) + (1 - p_i^*) \log(1 - p_i) \big]. \quad (6)$$

Strictly speaking, since we apply an end-to-end learning approach, the cost of Eq. (6) is used as an auxiliary term in the overall cost function, which will be discussed in Section 3.
For inference, we compute $\mathbf{p}$ for a given query video, and find the top $K$ words according to the score $\mathbf{p}$ (i.e. $\mathrm{argmax}_{1:K}\, \mathbf{p}$). Finally, we represent these $K$ concept words by their word embeddings $\{\mathbf{a}_i\}_{i=1}^{K}$.
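Under the same assumptions, the multi-label loss of Eq. (6) and the top-K inference step could be sketched as follows; the vocab_index mapping from candidate words to indices is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

def concept_loss(p_pred, gt_words, vocab_index):
    """Eq. (6): sigmoid cross-entropy between predicted confidences and caption words (sketch).
    p_pred: (V,) sigmoid outputs; gt_words: words of the groundtruth caption;
    vocab_index: dict word -> candidate index (hypothetical helper)."""
    target = torch.zeros_like(p_pred)
    for w in gt_words:
        if w in vocab_index:
            target[vocab_index[w]] = 1.0
    return F.binary_cross_entropy(p_pred, target)     # averaged over the V candidates

def detect_concepts(p_pred, candidates, E, K=10):
    """Inference: the top-K scoring candidate words and their embeddings {a_i}."""
    topk = torch.topk(p_pred, K).indices
    words = [candidates[i] for i in topk.tolist()]
    a = E[:, topk]                                     # (d, K): columns of E for the chosen words
    return words, a
```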
3. Video-to-Language Models
We design a different base model for each of the LSMDC tasks, while they all share the concept word detector and the semantic attention mechanism. That is, we aim to validate that the proposed concept word detection is useful for a wide range of video-to-language models. For the base models, we take advantage of state-of-the-art techniques, which we do not claim as our contribution. We refer to our video-to-language models leveraging the concept word detector as CT-SAN (Concept-Tracing Semantic Attention Network).
For a better understanding of our models, we outline the four LSMDC tasks as follows: (i) Movie description: generating a single descriptive sentence for a given movie clip; (ii) Fill-in-the-blank: given a video and a sentence with a single blank, finding a suitable word for the blank from the whole vocabulary set; (iii) Multiple-choice test: given a video query and five descriptive sentences, choosing the correct one among them; and (iv) Movie retrieval: ranking 1,000 movie clips for a given natural language query.
We defer further model details to the supplementary file. In particular, we omit the description of the multiple-choice and movie retrieval models in Figure 3(b)-(c), which can be found in the supplementary file.
3.1. A Model for Description
Fig. 2 shows the proposed video captioning model. It takes the video features $\{\mathbf{v}_n\}_{n=1}^{N}$ and the detected concept words $\{\mathbf{a}_i\}_{i=1}^{K}$ as input, and produces a word sequence $\{\mathbf{y}_t\}_{t=1}^{T}$ as output. The model comprises video encoding and caption decoding LSTMs, and two semantic attention models. The two LSTM networks are two layers deep, with layer normalization [1] and dropout [22] with a rate of 0.2.
Video Encoder. The video encoding LSTM encodes a video into a sequence of hidden states $\{\mathbf{s}_n\}_{n=1}^{N}$, $\mathbf{s}_n \in \mathbb{R}^{D}$:

$$\mathbf{s}_n = \mathrm{LSTM}(\bar{\mathbf{v}}_n, \mathbf{s}_{n-1}), \quad (7)$$

where $\bar{\mathbf{v}}_n \in \mathbb{R}^{D}$ is obtained by $(4,4)$-average-pooling $\mathbf{v}_n$.
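A minimal sketch of the encoder of Eq. (7) is shown below; for brevity it uses a single-layer LSTM, whereas the paper's encoder has two layers with layer normalization and dropout.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Eq. (7): average-pool each 4x4 feature map and run an LSTM over the frames (sketch)."""
    def __init__(self, D=500):
        super().__init__()
        self.lstm = nn.LSTM(input_size=D, hidden_size=D, batch_first=True)

    def forward(self, v):                     # v: (N, D, 4, 4) visual features of one video
        v_bar = v.mean(dim=(2, 3))            # (4,4)-average pooling -> (N, D)
        s, _ = self.lstm(v_bar.unsqueeze(0))  # hidden states {s_n}, shape (1, N, D)
        return s.squeeze(0)                   # (N, D); s[-1] initializes the decoder
```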
Caption Decoder. The caption decoding LSTM is a normal LSTM network:

$$\mathbf{h}_t = \mathrm{LSTM}(\mathbf{x}_t, \mathbf{h}_{t-1}), \quad (8)$$

where the input $\mathbf{x}_t$ is an intermediate representation of the $t$-th input word with semantic attention applied, as described below. We initialize the hidden state at $t = 0$ with the last hidden state of the video encoder: $\mathbf{h}_0 = \mathbf{s}_N \in \mathbb{R}^{D}$.
Semantic Attention. Based on [34], our model in Fig. 2 uses semantic attention in two different parts, which we call the input and output semantic attention, respectively.
The input semantic attention $\phi$ computes an attention weight $\gamma_{t,i}$, which is assigned to each predicted concept word $\mathbf{a}_i$. It helps the caption decoding LSTM focus on different concept words dynamically at each step $t$.
The attention weights $\gamma_t \in \mathbb{R}^{K}$ and the input vector $\mathbf{x}_t \in \mathbb{R}^{D}$ to the LSTM are obtained by

$$\gamma_{t,i} \propto \exp\big((\mathbf{E}\mathbf{y}_{t-1})^{\top} \mathbf{W}_{\gamma}\, \mathbf{a}_i\big), \quad (9)$$
$$\mathbf{x}_t = \phi(\mathbf{y}_{t-1}, \{\mathbf{a}_i\}) = \mathbf{W}_x \Big(\mathbf{E}\mathbf{y}_{t-1} + \mathrm{diag}(\mathbf{w}_{x,a}) \sum_i \gamma_{t,i}\, \mathbf{a}_i\Big). \quad (10)$$

We multiply the previous word $\mathbf{y}_{t-1} \in \mathbb{R}^{|V|}$ by the word embedding matrix $\mathbf{E}$ to make it $d$-dimensional. The parameters to learn include $\mathbf{W}_{\gamma} \in \mathbb{R}^{d \times d}$, $\mathbf{W}_x \in \mathbb{R}^{D \times d}$, and $\mathbf{w}_{x,a} \in \mathbb{R}^{d}$.
The output semantic attention $\varphi$ guides how to dynamically weight the concept words $\{\mathbf{a}_i\}$ when generating an output word $\mathbf{y}_t$ at each step. We use $\mathbf{h}_t$, the hidden state of the decoding LSTM at step $t$, as an input to the output attention function $\varphi$. We then compute $\mathbf{p}_t \in \mathbb{R}^{D}$ by attending to the concept word set $\{\mathbf{a}_i\}$ with weights $\beta_{t,i}$:

$$\beta_{t,i} \propto \exp\big(\mathbf{h}_t^{\top} \mathbf{W}_{\beta}\, \sigma(\mathbf{a}_i)\big), \quad (11)$$
$$\mathbf{p}_t = \varphi(\mathbf{h}_t, \{\mathbf{a}_i\}) = \mathbf{h}_t + \mathrm{diag}(\mathbf{w}_{h,a}) \sum_i \beta_{t,i}\, \mathbf{W}_{\beta}\, \sigma(\mathbf{a}_i), \quad (12)$$

where $\sigma$ is the hyperbolic tangent, and the parameters include $\mathbf{w}_{h,a} \in \mathbb{R}^{D}$ and $\mathbf{W}_{\beta} \in \mathbb{R}^{D \times d}$.
Finally, the probability of the output word is obtained as

$$p(\mathbf{y}_t \mid \mathbf{y}_{1:t-1}) = \mathrm{softmax}(\mathbf{W}_y\, \mathbf{p}_t + \mathbf{b}_y), \quad (13)$$

where $\mathbf{W}_y \in \mathbb{R}^{|V| \times D}$ and $\mathbf{b}_y \in \mathbb{R}^{|V|}$. This procedure loops until $\mathbf{y}_t$ corresponds to the <EOS> token.
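The following sketch puts Eq. (8)-(13) together for a single decoding step. The parameter initialization, the single-layer LSTMCell (instead of the paper's two layers with layer normalization), and the dense one-hot input are simplifications; the shapes and attention formulas follow the equations above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAttentionDecoderStep(nn.Module):
    """One caption-decoding step with input/output semantic attention (Eq. 8-13), as a sketch."""
    def __init__(self, vocab_size, d=300, D=500):
        super().__init__()
        # word embedding E (pretrained word2vec in the paper; random init here for brevity)
        self.E = nn.Parameter(0.01 * torch.randn(d, vocab_size))
        self.W_gamma = nn.Parameter(0.01 * torch.randn(d, d))     # Eq. (9)
        self.W_x = nn.Linear(d, D, bias=False)                    # W_x of Eq. (10)
        self.w_xa = nn.Parameter(torch.ones(d))                   # diag(w_{x,a})
        self.cell = nn.LSTMCell(D, D)                             # Eq. (8)
        self.W_beta = nn.Parameter(0.01 * torch.randn(D, d))      # Eq. (11)-(12)
        self.w_ha = nn.Parameter(torch.ones(D))                   # diag(w_{h,a})
        self.out = nn.Linear(D, vocab_size)                       # W_y, b_y of Eq. (13)

    def forward(self, y_prev, a, h, c):
        # y_prev: (|V|,) one-hot previous word; a: (K, d) concept embeddings; h, c: (D,) state
        ey = self.E @ y_prev                                      # (d,) embedded previous word
        # input semantic attention, Eq. (9)-(10)
        gamma = F.softmax(a @ (self.W_gamma.t() @ ey), dim=0)     # (K,) weights over concepts
        x = self.W_x(ey + self.w_xa * (gamma.unsqueeze(1) * a).sum(0))
        # decoding LSTM step, Eq. (8)
        h, c = self.cell(x.unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
        h, c = h.squeeze(0), c.squeeze(0)
        # output semantic attention, Eq. (11)-(12); sigma is tanh
        proj = torch.tanh(a) @ self.W_beta.t()                    # (K, D) rows = W_beta * tanh(a_i)
        beta = F.softmax(proj @ h, dim=0)                         # (K,)
        p_t = h + self.w_ha * (beta.unsqueeze(1) * proj).sum(0)   # (D,)
        # word probabilities, Eq. (13)
        return F.softmax(self.out(p_t), dim=0), h, c
```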
Training. To learn the parameters of the model, we define a loss function as the total negative log-likelihood of all the words, with regularization terms on the attention weights $\{\alpha_{t,i}\}$, $\{\beta_{t,i}\}$, and $\{\gamma_{t,i}\}$ [34], as well as the loss $\mathcal{L}_{\mathrm{con}}$ for concept discovery (Eq. 6):

$$\mathcal{L} = -\sum_t \log p(\mathbf{y}_t) + \lambda_1 \big(g(\beta) + g(\gamma)\big) + \lambda_2 \mathcal{L}_{\mathrm{con}}, \quad (14)$$

where $\lambda_1, \lambda_2$ are hyperparameters and $g$ is a regularization function with $p = 2$, $q = 0.5$:

$$g(\alpha) = \|\alpha\|_{1,p} + \|\alpha^{\top}\|_{1,q} = \Big[\sum_i \Big(\sum_t \alpha_{t,i}\Big)^{p}\Big]^{1/p} + \Big[\sum_t \Big(\sum_i \alpha_{t,i}\Big)^{q}\Big]^{1/q}. \quad (15)$$
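A direct transcription of the regularizer of Eq. (15) and the total loss of Eq. (14) might look like the following sketch, where beta and gamma are collected into (T, K) matrices over all decoding steps.

```python
import torch

def attention_regularizer(att, p=2.0, q=0.5):
    """g(alpha) of Eq. (15) for an attention matrix att of shape (T, K):
    rows index decoding steps t, columns index concept words i (sketch)."""
    term_p = (att.sum(dim=0) ** p).sum() ** (1.0 / p)   # [sum_i (sum_t a_{t,i})^p]^(1/p)
    term_q = (att.sum(dim=1) ** q).sum() ** (1.0 / q)   # [sum_t (sum_i a_{t,i})^q]^(1/q)
    return term_p + term_q

def total_loss(log_probs, beta, gamma, concept_cost, lam1, lam2):
    """Eq. (14): negative log-likelihood + attention regularizers + concept loss.
    log_probs: (T,) log p(y_t) of the groundtruth words; beta, gamma: (T, K) attention weights."""
    nll = -log_probs.sum()
    reg = attention_regularizer(beta) + attention_regularizer(gamma)
    return nll + lam1 * reg + lam2 * concept_cost
```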
For the rest of the models, we transfer the parameters of the concept word detector trained with the description model, and allow the parameters to be fine-tuned.
3.2. A Model for Fill-in-the-Blank
Fig. 3(a) illustrates the proposed model for the fill-in-the-blank task. It is based on a bidirectional LSTM network (BLSTM) [21, 10], which is useful in predicting a blank word from an imperfect sentence, since it considers the sequence in both the forward and backward directions. Our key idea is to employ the semantic attention mechanism on both the input and output of the BLSTM, to strengthen the meaning of input and output words with the detected concept words.
The model takes the word representations $\{\mathbf{c}_t\}_{t=1}^{T}$ and concept words $\{\mathbf{a}_i\}_{i=1}^{K}$ as input. Each $\mathbf{c}_t \in \mathbb{R}^{d}$ is obtained by multiplying the one-hot word vector by the embedding matrix $\mathbf{E}$. Suppose that the $t$-th text input is a blank, for which we use a special token <blank>. We add the word prediction module only on top of the $t$-th step of the BLSTM.
BLSTM. The input video is represented by the video encoding LSTM in Figure 2. The hidden state of the final video frame, $\mathbf{s}_N$, is used to initialize the hidden states of the BLSTM: $\mathbf{h}_{T+1}^{b} = \mathbf{h}_{0}^{f} = \mathbf{s}_N$, where $\{\mathbf{h}_t^{f}\}_{t=1}^{T}$ and $\{\mathbf{h}_t^{b}\}_{t=1}^{T}$ are the forward and backward hidden states of the BLSTM, respectively:

$$\mathbf{h}_t^{f} = \mathrm{LSTM}(\mathbf{x}_t, \mathbf{h}_{t-1}^{f}), \quad (16)$$
$$\mathbf{h}_t^{b} = \mathrm{LSTM}(\mathbf{x}_t, \mathbf{h}_{t+1}^{b}). \quad (17)$$

We also use layer normalization [1].
Semantic Attention. The input and output semantic attention of this model are almost identical to those of the captioning model in Section 3.1, except that the word representation $\mathbf{c}_t \in \mathbb{R}^{d}$ is used as input at each time step, instead of the previous word vector $\mathbf{y}_{t-1}$. The attention-weighted word vectors $\{\mathbf{x}_t\}_{t=1}^{T}$ are then fed into the BLSTM.
The output semantic attention is also similar to that of the captioning model in Section 3.1, except that we apply the attention only once, at the $t$-th step where the <blank> token is taken as input. We feed the output of the BLSTM,

$$\mathbf{o}_t = \tanh(\mathbf{W}_o [\mathbf{h}_t^{f}; \mathbf{h}_t^{b}] + \mathbf{b}_o), \quad (18)$$

where $\mathbf{W}_o \in \mathbb{R}^{D \times 2D}$ and $\mathbf{b}_o \in \mathbb{R}^{D}$, into the output attention function $\varphi$, which generates $\mathbf{p} \in \mathbb{R}^{D}$ as in Eq. (12) of the description model: $\mathbf{p} = \varphi(\mathbf{o}_t, \{\mathbf{a}_i\})$.
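A rough sketch of the BLSTM of Eq. (16)-(18) follows. Initializing the cell states with s_N, skipping layer normalization, and attaching a plain linear classifier directly to o_t (in place of the output semantic attention of Eq. (12)) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class BlankPredictor(nn.Module):
    """Fill-in-the-blank BLSTM of Eq. (16)-(18), as a simplified sketch."""
    def __init__(self, D=500, vocab_size=12486):
        super().__init__()
        self.fwd = nn.LSTMCell(D, D)                 # forward LSTM, Eq. (16)
        self.bwd = nn.LSTMCell(D, D)                 # backward LSTM, Eq. (17)
        self.W_o = nn.Linear(2 * D, D)               # W_o, b_o of Eq. (18)
        self.classifier = nn.Linear(D, vocab_size)   # assumed prediction head over the vocabulary

    def forward(self, x, s_N, blank_pos):
        # x: (T, D) attention-weighted word vectors; s_N: (D,) last encoder state; blank_pos: int
        T, D = x.shape
        hf = cf = s_N.unsqueeze(0)                   # h^f_0 = s_N (cell-state init is an assumption)
        hb = cb = s_N.unsqueeze(0)                   # h^b_{T+1} = s_N
        fwd, bwd = [], [None] * T
        for t in range(T):                           # forward pass, Eq. (16)
            hf, cf = self.fwd(x[t].unsqueeze(0), (hf, cf))
            fwd.append(hf)
        for t in reversed(range(T)):                 # backward pass, Eq. (17)
            hb, cb = self.bwd(x[t].unsqueeze(0), (hb, cb))
            bwd[t] = hb
        # Eq. (18): combine both directions at the <blank> position only
        o = torch.tanh(self.W_o(torch.cat([fwd[blank_pos], bwd[blank_pos]], dim=1)))
        # in the full model, o goes through the output semantic attention before word prediction
        return self.classifier(o).squeeze(0)         # (|V|,) scores for the blank word
```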


References
He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. CVPR, 2016.
Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. ICLR, 2015.
Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 1997.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 2014.
Russakovsky, O., Deng, J., Su, H., et al. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.