
End-to-end Concept Word Detection
for Video Captioning, Retrieval, and Question Answering
Youngjae Yu Hyungjin Ko Jongwook Choi Gunhee Kim
Seoul National University, Seoul, Korea
{yj.yu, hj.ko}@vision.snu.ac.kr, {wookayin, gunhee}@snu.ac.kr
http://vision.snu.ac.kr/project/lsmdc-2016
Abstract
We propose a high-level concept word detector that can be integrated with any video-to-language models. It takes a video as input and generates a list of concept words as useful semantic priors for language generation models. The proposed word detector has two important properties. First, it does not require any external knowledge sources for training. Second, the proposed word detector is trainable in an end-to-end manner jointly with any video-to-language models. To effectively exploit the detected words, we also develop a semantic attention mechanism that selectively focuses on the detected concept words and fuses them with the word encoding and decoding in the language model. In order to demonstrate that the proposed approach indeed improves the performance of multiple video-to-language tasks, we participate in all the four tasks of LSMDC 2016 [18]. Our approach has won three of them, including fill-in-the-blank, multiple-choice test, and movie retrieval.
1. Introduction

Video-to-language tasks, including video captioning [6, 8, 17, 27, 32, 35] and video question answering (QA) [23], are recent emerging challenges in computer vision research. This set of problems is interesting as one of the frontiers in artificial intelligence; beyond that, it can also potentiate multiple practical applications, such as retrieving video content by users' free-form queries or helping visually impaired people understand the visual content. Recently, a number of large-scale datasets have been introduced as a common ground for researchers to promote the progress of video-to-language research (e.g. [4, 16, 18, 23]).

The objective of this work is to propose a concept word detector, as shown in Fig. 1, which takes a training set of videos and associated sentences as input, and generates a list of high-level concept words per video as useful semantic priors for a variety of video-to-language tasks, including video captioning, retrieval, and question answering.
Figure 1. The intuition of the proposed concept word detector. Given a video clip, a set of tracing LSTMs extract multiple concept words that consistently appear across frame regions. We then employ semantic attention to combine the detected concepts with text encoding/decoding for several video-to-language tasks of LSMDC 2016, such as captioning, retrieval, and question answering.
We design our word detector to have the following two characteristics, so that it can be easily integrated with any video-to-language models. First, it does not require any external knowledge sources for training. Instead, our detector learns the correlation between words in the captions and video regions from the whole training data. To this end, we use a continuous soft attention mechanism that traces consistent visual information across frames and associates it with concept words from captions. Second, the word detector is trainable in an end-to-end manner jointly with any video-to-language models. The loss function for learning the word detector can be plugged as an auxiliary term into the model's overall cost function; as a result, we reduce the effort to separately collect training examples and learn both models.
We also develop language model components to effectively exploit the detected words. Inspired by semantic attention in image captioning research [34], we develop an attention mechanism that selectively focuses on the detected concept words and fuses them with word encoding and decoding in the language model. That is, the detected concept words are combined with input words to better represent the hidden states of encoders, and with output words to generate more accurate word predictions.
In order to demonstrate that the proposed word detector and attention mechanism indeed improve the performance of multiple video-to-language tasks, we participate in the four tasks of LSMDC 2016 (Large Scale Movie Description Challenge) [18], which is one of the most active and successful benchmarks advancing the progress of video-to-language research. The challenges include movie description and multiple-choice test as video captioning, fill-in-the-blank as video question answering, and movie retrieval as video retrieval. Following the public evaluation protocol of LSMDC 2016, our approach achieves the best accuracies in three tasks (fill-in-the-blank, multiple-choice test, and movie retrieval), and comparable performance in the remaining task (movie description).
1.1. Related Work

Our work can be uniquely positioned in the context of two recent research directions in image/video captioning.

Image/Video Captioning with Word Detection. Image and video captioning has been actively studied in recent vision and language research, including [5, 6, 8, 17, 19, 27, 28], to name a few. Among them, there have been several attempts to detect a set of concept words or attributes from visual input to boost the captioning performance. In image captioning research, Fang et al. [7] exploit a multiple instance learning (MIL) approach to train visual detectors that identify a set of words with bounding boxed regions of the image. Based on the detected words, they retrieve and re-rank the best caption sentence for the image. Wu et al. [29] use a CNN to learn a mapping between an image and semantic attributes. They then exploit the mapping as an input to the captioning decoder. They also extend the framework to explicitly leverage external knowledge bases such as DBpedia for question answering tasks. Venugopalan et al. [26] generate descriptions with novel words beyond the ones in the training set, by leveraging external sources, including object recognition datasets like ImageNet and external text corpora like Wikipedia. You et al. [34] also exploit weak labels and tags on Internet images to train additional parametric visual classifiers for image captioning.
In the video domain, it is more ambiguous to learn the relation between descriptive words and visual patterns, and there has been only little work on word detection for video captioning. Rohrbach et al. [17] propose a two-step approach for video captioning on the LSMDC dataset. They first extract verbs, objects, and places from movie descriptions, and separately train SVM-based classifiers for each group. They then learn an LSTM decoder that generates text descriptions based on the responses of these visual classifiers.

While almost all previous captioning methods exploit external classifiers for concept or attribute detection, the novelty of our work lies in that we use only the captioning training data, with no external sources, to learn the word detector, and propose an end-to-end design for learning both word detection and caption generation simultaneously. Moreover, compared to the video captioning work of [17], which addresses only the movie description task of LSMDC, this work is more comprehensive in that we validate the usefulness of our method on all four tasks of LSMDC.
Attention for Captioning. Attention mechanisms have been successfully applied to caption generation. One of the earliest works is [31], which dynamically focuses on different image regions to produce an output word sequence. Later, this soft attention was extended to temporal attention over video frames [33, 35] for video captioning.

Beyond attention on the spatial or temporal structure of visual input, You et al. [34] recently propose attention on attribute words for image captioning. That is, the method enumerates a set of important object labels in the image, and then dynamically switches attention among these concept labels. Although our approach also exploits the idea of semantic attention, it bears two key differences. First, we extend semantic attention to the video domain for the first time, not only for video captioning but also for retrieval and question answering tasks. Second, the approach of [34] relies on classifiers that are separately learned from external datasets, whereas our approach is learnable end-to-end with only the training data of captioning. This significantly reduces the effort required to prepare additional multi-label classifiers.
1.2. Contributions

We summarize the contributions of this work as follows.

(1) We propose a novel end-to-end learning approach for detecting a list of concept words and attending on them to enhance the performance of multiple video-to-language tasks. The proposed concept word detection and attention model can be plugged into any models for video captioning, retrieval, and question answering. Our technical novelties can be seen from two recent trends of image/video captioning research. First, our work is the first end-to-end trainable model not only for concept word detection but also for language generation. Second, our work is the first semantic attention model for video-to-language tasks.

(2) To validate the applicability of the proposed approach, we participate in all the four tasks of LSMDC 2016. Our models have won three of them, including fill-in-the-blank, multiple-choice test, and movie retrieval. We also attain comparable performance for movie description.
2. Detection of Concept Words from Videos

We first explain the pre-processing steps for the representation of words and video frames. Then, we explain how we detect concept words for a given video.
2.1. Preprocessing

Dictionary and Word Embedding. We define a vocabulary dictionary V by collecting the words that occur more than three times in the dataset. The dictionary size is |V| = 12,486, from which our models sequentially select words as output. We train the word2vec skip-gram embedding [14] to obtain the word embedding matrix $E \in \mathbb{R}^{d \times |V|}$, where d is the word embedding dimension and |V| is the dictionary size. We set d = 300 in our implementation.
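As an illustration of this preprocessing step, a vocabulary and skip-gram embedding matrix could be built roughly as follows. This is a minimal sketch, assuming tokenized caption sentences and the gensim library; the function name and the gensim-4.x `vector_size` argument are my choices, not part of the paper.

```python
from collections import Counter
import numpy as np
from gensim.models import Word2Vec

def build_vocab_and_embedding(captions, dim=300):
    """captions: list of tokenized caption sentences, e.g. [["someone", "opens", "the", "door"], ...]."""
    counts = Counter(w for sent in captions for w in sent)
    vocab = [w for w, c in counts.items() if c > 3]               # words occurring more than three times
    w2v = Word2Vec(captions, vector_size=dim, sg=1, min_count=4)  # skip-gram, d = 300 as in the paper
    # Word embedding matrix E of shape (d, |V|), one column per vocabulary word.
    E = np.stack([w2v.wv[w] if w in w2v.wv else np.zeros(dim) for w in vocab], axis=1)
    return vocab, E
```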
Video Representation. We first equidistantly sample one out of every ten frames from a video, to reduce frame redundancy while minimizing the loss of information. We denote the number of video frames by N. We limit the maximum number of frames to N_max = 40; if a video is too long, we use a wider interval for uniform sampling.

We employ a convolutional neural network (CNN) to encode the video input. Specifically, we extract the feature map of each frame from the res5c layer (i.e. $\mathbb{R}^{7 \times 7 \times 2048}$) of ResNet [9] pretrained on the ImageNet dataset [20], and then apply a 2×2 max-pooling followed by a 3×3 convolution to reduce the dimension to $\mathbb{R}^{4 \times 4 \times 500}$. Reducing the number of spatial grid regions to 4×4 helps the concept word detector train much faster, while not hurting detection performance significantly. We denote the resulting visual features of the frames by $\{v_n\}_{n=1}^{N}$. Throughout this paper, we use n to denote the video frame index.
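For reference, the frame-feature pipeline described above can be sketched in PyTorch/torchvision as below; the pooling padding, the initialization of the 2048-to-500 convolution, and freezing the ResNet backbone are assumptions on my part, since the paper specifies only the kernel sizes and the resulting 4×4×500 shape.

```python
import torch
import torch.nn as nn
from torchvision import models

class FrameEncoder(nn.Module):
    """res5c feature map (7x7x2048) -> 2x2 max-pool -> 3x3 conv -> 4x4x500 grid features."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool + fc -> res5c output
        for p in self.backbone.parameters():                           # the pretrained CNN is kept frozen here
            p.requires_grad = False
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=1)   # 7x7 -> 4x4 (padding is an assumption)
        self.reduce = nn.Conv2d(2048, 500, kernel_size=3, padding=1)   # 4x4x2048 -> 4x4x500

    def forward(self, frames):
        # frames: (N, 3, 224, 224) sampled video frames, ImageNet-normalized
        feat = self.backbone(frames)          # (N, 2048, 7, 7), i.e. res5c
        feat = self.reduce(self.pool(feat))   # (N, 500, 4, 4)
        return feat.permute(0, 2, 3, 1)       # (N, 4, 4, 500) = {v_n}
```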
2.2. An Attention Model for Concept Detection

Concept Words and Traces. We propose the concept word detector using LSTM networks with a soft attention mechanism. Its structure is shown in the red box of Fig. 2. Its goal is, for a given video, to discover a list of concept words that consistently appear across frame regions. The detected concept words are used as additional references for the video captioning model (section 3.1), which generates output sentences by selectively attending on those words.

We first define a set of candidate words with a size of V from all training captions. Among them, we discover K concept words per video. We set V = 2,000 and K = 10. We first apply the automatic POS tagging of NLTK [3] to extract nouns, verbs and adjectives from all training caption sentences [7]. We then compute the frequencies of those words in the training set, and select the V most common words as concept word candidates.
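A minimal sketch of this candidate-selection step with NLTK follows; lower-casing and the Penn Treebank tag prefixes used to pick nouns, verbs, and adjectives are my assumptions.

```python
from collections import Counter
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

def select_candidate_words(captions, V=2000):
    """captions: list of raw caption strings; returns the V most frequent nouns/verbs/adjectives."""
    counts = Counter()
    for sent in captions:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sent.lower())):
            if tag.startswith(('NN', 'VB', 'JJ')):   # nouns, verbs, adjectives
                counts[word] += 1
    return [w for w, _ in counts.most_common(V)]
```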
Since we do not have groundtruth bounding boxes for concept words in the videos, we cannot train individual concept detectors in a standard supervised setting. Our idea is to adopt a soft attention mechanism to infer words by tracking regions that are spatially consistent. To this end, we employ a set of tracing LSTMs, each of which takes care of a single spatially-consistent meaning being tracked over time, which we call a trace. That is, we keep track of spatial attention over video frames using an LSTM, so that the spatial attentions in adjacent frames preserve the spatial consistency of a single concept (e.g. a moving object, or an action in video clips; see Fig. 1). We use a total of L tracing LSTMs to capture L traces (or concepts), where L is the number of spatial regions in the visual feature (i.e. L = 4×4 = 16 for $v_n \in \mathbb{R}^{4 \times 4 \times D}$). Fusing these L concepts together, we finally discover K concept words, as will be described next.
Computation of Spatial Attention. For each trace l, we maintain spatial attention weights $\alpha_n^{(l)} \in \mathbb{R}^{4 \times 4}$, indicating where to attend on the (4×4) spatial grid locations of $v_n$, through video frames n = 1 ... N. The initial attention weight $\alpha_0^{(l)}$ at n = 0 is initialized with a one-hot matrix, for each of the L grid locations. We compute the hidden states $h_n^{(l)} \in \mathbb{R}^{500}$ of the LSTM through n = 1 ... N by

$$c_n^{(l)} = \alpha_n^{(l)} \otimes v_n, \qquad (1)$$

$$h_n^{(l)} = \mathrm{LSTM}(c_n^{(l)}, h_{n-1}^{(l)}), \qquad (2)$$

where $A \otimes B = \sum_{j,k} A_{(j,k)} \cdot B_{(j,k,:)}$. The input to the LSTMs is the context vector $c_n^{(l)} \in \mathbb{R}^{500}$, which is obtained by applying the spatial attention $\alpha_n^{(l)}$ to the visual feature $v_n$. Note that the parameters of the L LSTMs are shared.

The attention weight vector $\alpha_n^{(l)} \in \mathbb{R}^{4 \times 4}$ at time step n is updated as follows:

$$e_n^{(l)}(j,k) = v_n(j,k) \odot h_{n-1}^{(l)}, \qquad (3)$$

$$\alpha_n^{(l)} = \mathrm{softmax}\big(\mathrm{Conv}(e_n^{(l)})\big), \qquad (4)$$

where $\odot$ is the elementwise product, and Conv(·) denotes the two convolution operations before the softmax layer in Fig. 2. Note that $\alpha_n^{(l)}$ in Eqs. (3)–(4) is computed from the previous hidden state $h_{n-1}^{(l)}$ of the LSTM.

The spatial attention $\alpha_n^{(l)}$ measures how each spatial grid location of the visual feature is related to the concept being tracked by the tracing LSTM. By repeating these two steps of Eqs. (1)–(4) from n = 1 to N, our model can continuously find important and temporally consistent meanings over time that are closely related to a part of the video, rather than focusing on each video frame individually.
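To make Eqs. (1)–(4) concrete, the following is a minimal PyTorch sketch of a single tracing-LSTM step. The channel widths inside Conv(·) and the exact ordering of the attention and state updates across frames are assumptions, since the paper specifies only the 4×4 grid, the 500-d feature and hidden sizes, and "two convolution operations before the softmax".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TracingLSTM(nn.Module):
    """Sketch of one tracing-LSTM step (Eqs. (1)-(4)); one instance is shared by all L traces."""

    def __init__(self, feat_dim=500, hidden_dim=500):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)   # parameters shared across all L traces
        self.conv = nn.Sequential(                      # Conv(.) of Eq. (4); the widths are assumed
            nn.Conv2d(feat_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 1, 3, padding=1),
        )

    def step(self, v_n, alpha, state):
        """v_n: (B,4,4,D) frame feature; alpha: (B,4,4) attention carried over from the
        previous frame (alpha_0 is one-hot); state: (h, c) of the LSTMCell or None."""
        # Eq. (1): attention-weighted sum over the 4x4 grid -> 500-d context vector
        ctx = (alpha.unsqueeze(-1) * v_n).sum(dim=(1, 2))
        # Eq. (2): update the trace state
        h, c = self.lstm(ctx, state)
        # Eq. (3): elementwise product between each grid feature and the (now previous) hidden state
        e = v_n * h.view(-1, 1, 1, h.size(-1))
        # Eq. (4): two convolutions + softmax over the 16 grid cells -> attention for the next frame
        scores = self.conv(e.permute(0, 3, 1, 2)).flatten(1)    # (B, 16)
        next_alpha = F.softmax(scores, dim=1).view(-1, 4, 4)
        return next_alpha, (h, c)
```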
Finally, we predict the concept confidence vector p:

$$p = \sigma\big(W_p\,[h_N^{(1)}; \cdots; h_N^{(L)}] + b_p\big) \in \mathbb{R}^{V}, \qquad (5)$$

that is, we first concatenate the hidden states $\{h_N^{(l)}\}_{l=1}^{L}$ of all tracing LSTMs at the last time step, apply a linear transform parameterized by $W_p \in \mathbb{R}^{V \times (500L)}$ and $b_p \in \mathbb{R}^{V}$, and apply the elementwise sigmoid activation σ.

Figure 2. The architecture of the concept word detection in the top red box (section 2.2), and our video description model at the bottom, which uses semantic attention on the detected concept words (section 3.1).
Training and Inference. For training, we obtain a reference concept confidence vector $p^{*} \in \mathbb{R}^{V}$ whose element $p^{*}_i$ is 1 if the corresponding word exists in the groundtruth caption, and 0 otherwise. We minimize the following sigmoid cross-entropy cost $\mathcal{L}_{con}$, which is often used for multi-label classification [30], where each class is independent and not mutually exclusive:

$$\mathcal{L}_{con} = -\frac{1}{V} \sum_{i=1}^{V} \big[\, p^{*}_i \log(p_i) + (1 - p^{*}_i) \log(1 - p_i) \,\big]. \qquad (6)$$

Strictly speaking, since we apply an end-to-end learning approach, the cost of Eq. (6) is used as an auxiliary term in the overall cost function, which will be discussed in section 3.

For inference, we compute p for a given query video, and find the top K words according to the score p (i.e. $\mathrm{argmax}_{1:K}\, p$). Finally, we represent these K concept words by their word embeddings $\{a_i\}_{i=1}^{K}$.
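The confidence prediction of Eq. (5), the multi-label loss of Eq. (6), and the top-K inference step can be sketched as follows; the hypothetical `vocab` list of the V candidate words and the batch layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptPredictor(nn.Module):
    """Eq. (5): concatenate the L final trace states and predict V word confidences."""
    def __init__(self, hidden_dim=500, L=16, V=2000):
        super().__init__()
        self.fc = nn.Linear(hidden_dim * L, V)   # W_p, b_p

    def forward(self, trace_states):
        # trace_states: (B, L, hidden_dim) = {h_N^(l)}; returns logits (sigmoid applied in the loss)
        return self.fc(trace_states.flatten(1))

def concept_loss(logits, caption_words, vocab):
    """Eq. (6): sigmoid cross-entropy against a 0/1 reference vector built from the caption."""
    target = torch.zeros_like(logits)
    index = {w: i for i, w in enumerate(vocab)}
    for b, words in enumerate(caption_words):                  # caption_words: list of token lists
        for w in words:
            if w in index:
                target[b, index[w]] = 1.0
    return F.binary_cross_entropy_with_logits(logits, target)  # averaged over the V classes

def top_k_concepts(logits, vocab, K=10):
    """Inference: pick the K highest-confidence concept words per video."""
    idx = torch.sigmoid(logits).topk(K, dim=1).indices
    return [[vocab[i] for i in row.tolist()] for row in idx]
```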
3. Video-to-Language Models

We design a different base model for each of the LSMDC tasks, while they share the concept word detector and the semantic attention mechanism. That is, we aim to validate that the proposed concept word detection is useful for a wide range of video-to-language models. For the base models, we take advantage of state-of-the-art techniques, which we do not claim as our contribution. We refer to our video-to-language models leveraging the concept word detector as CT-SAN (Concept-Tracing Semantic Attention Network).

For a better understanding of our models, we outline the four LSMDC tasks as follows: (i) Movie description: generating a single descriptive sentence for a given movie clip, (ii) Fill-in-the-blank: given a video and a sentence with a single blank, finding a suitable word for the blank from the whole vocabulary set, (iii) Multiple-choice test: given a video query and five descriptive sentences, choosing the correct one out of them, and (iv) Movie retrieval: ranking 1,000 movie clips for a given natural language query.

We defer more model details to the supplementary file. In particular, we skip the description of the multiple-choice and movie retrieval models in Figure 3(b)–(c), which can be found in the supplementary file.
3.1. A Model for Description

Fig. 2 shows the proposed video captioning model. It takes the video features $\{v_n\}_{n=1}^{N}$ and the detected concept words $\{a_i\}_{i=1}^{K}$ as input, and produces a word sequence $\{y_t\}_{t=1}^{T}$ as output. The model comprises video encoding and caption decoding LSTMs, and two semantic attention models. The two LSTM networks are two layers deep, with layer normalization [1] and dropout [22] with a rate of 0.2.

Video Encoder. The video encoding LSTM encodes a video into a sequence of hidden states $\{s_n\}_{n=1}^{N} \in \mathbb{R}^{D}$:

$$s_n = \mathrm{LSTM}(\bar{v}_n, s_{n-1}), \qquad (7)$$

where $\bar{v}_n \in \mathbb{R}^{D}$ is obtained by (4,4)-average-pooling $v_n$.
Caption Decoder. The caption decoding LSTM is a standard LSTM network:

$$h_t = \mathrm{LSTM}(x_t, h_{t-1}), \qquad (8)$$

where the input $x_t$ is an intermediate representation of the t-th input word with semantic attention applied, as described below. We initialize the hidden state at t = 0 with the last hidden state of the video encoder: $h_0 = s_N \in \mathbb{R}^{D}$.
Semantic Attention. Based on [34], our model in Fig. 2 uses semantic attention in two different parts, which we call the input and the output semantic attention, respectively.

The input semantic attention φ computes an attention weight $\gamma_{t,i}$, which is assigned to each predicted concept word $a_i$. It helps the caption decoding LSTM focus on different concept words dynamically at each step t. The attention weight vector $\gamma_t \in \mathbb{R}^{K}$ and the input vector $x_t \in \mathbb{R}^{D}$ to the LSTM are obtained by

$$\gamma_{t,i} \propto \exp\big((E y_{t-1})^{\top} W_{\gamma}\, a_i\big), \qquad (9)$$

$$x_t = \phi(y_{t-1}, \{a_i\}) = W_x \Big( E y_{t-1} + \mathrm{diag}(w_{x,a}) \sum_i \gamma_{t,i}\, a_i \Big). \qquad (10)$$

We multiply the previous (one-hot) word vector $y_{t-1} \in \mathbb{R}^{|V|}$ by the word embedding matrix E so that it becomes d-dimensional. The parameters to learn include $W_{\gamma} \in \mathbb{R}^{d \times d}$, $W_x \in \mathbb{R}^{D \times d}$ and $w_{x,a} \in \mathbb{R}^{d}$.

The output semantic attention ϕ guides how to dynamically weight the concept words $\{a_i\}$ when generating an output word $y_t$ at each step. We use $h_t$, the hidden state of the decoding LSTM at step t, as an input to the output attention function ϕ. We then compute $p_t \in \mathbb{R}^{D}$ by attending to the concept word set $\{a_i\}$ with the weights $\beta_{t,i}$:

$$\beta_{t,i} \propto \exp\big(h_t^{\top} W_{\beta}\, \sigma(a_i)\big), \qquad (11)$$

$$p_t = \varphi(h_t, \{a_i\}) = h_t + \mathrm{diag}(w_{h,a}) \sum_i \beta_{t,i}\, W_{\beta}\, \sigma(a_i), \qquad (12)$$

where σ is the hyperbolic tangent, and the parameters include $w_{h,a} \in \mathbb{R}^{D}$ and $W_{\beta} \in \mathbb{R}^{D \times d}$.

Finally, the probability of the output word is obtained as

$$p(y_t \mid y_{1:t-1}) = \mathrm{softmax}(W_y\, p_t + b_y), \qquad (13)$$

where $W_y \in \mathbb{R}^{|V| \times D}$ and $b_y \in \mathbb{R}^{|V|}$. This procedure loops until $y_t$ corresponds to the <EOS> token.
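The following is one possible reading of Eqs. (9)–(13) in PyTorch; normalizing the ∝ exp(·) weights with a softmax, the parameter initializations, and the hidden size D are assumptions, and this is a sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAttention(nn.Module):
    """Sketch of the input/output semantic attention (Eqs. (9)-(13)).
    d = word embedding size, D = LSTM hidden size, K = number of detected concept words."""

    def __init__(self, d=300, D=512, vocab_size=12486):
        super().__init__()
        self.W_gamma = nn.Parameter(torch.randn(d, d) * 0.01)
        self.W_x = nn.Linear(d, D, bias=False)       # W_x in Eq. (10)
        self.w_xa = nn.Parameter(torch.ones(d))      # diag(w_{x,a})
        self.W_beta = nn.Parameter(torch.randn(D, d) * 0.01)
        self.w_ha = nn.Parameter(torch.ones(D))      # diag(w_{h,a})
        self.W_y = nn.Linear(D, vocab_size)          # W_y, b_y in Eq. (13)

    def input_attention(self, prev_word_emb, concepts):
        # prev_word_emb: (B, d) = E y_{t-1}; concepts: (B, K, d) = {a_i}
        logits = torch.einsum('bd,de,bke->bk', prev_word_emb, self.W_gamma, concepts)  # Eq. (9)
        gamma = F.softmax(logits, dim=1)
        mixed = (gamma.unsqueeze(-1) * concepts).sum(1)                  # sum_i gamma_{t,i} a_i
        return self.W_x(prev_word_emb + self.w_xa * mixed)               # Eq. (10): x_t

    def output_attention(self, h_t, concepts):
        # h_t: (B, D) decoder hidden state; concepts: (B, K, d)
        proj = torch.tanh(concepts) @ self.W_beta.t()                    # W_beta sigma(a_i): (B, K, D)
        beta = F.softmax((h_t.unsqueeze(1) * proj).sum(-1), dim=1)       # Eq. (11)
        p_t = h_t + self.w_ha * (beta.unsqueeze(-1) * proj).sum(1)       # Eq. (12)
        return F.log_softmax(self.W_y(p_t), dim=-1)                      # Eq. (13), as log-probabilities
```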
Training. To learn the parameters of the model, we define a loss function as the total negative log-likelihood of all the words, with regularization terms on the attention weights $\{\alpha_{t,i}\}$, $\{\beta_{t,i}\}$, and $\{\gamma_{t,i}\}$ [34], as well as the loss $\mathcal{L}_{con}$ for concept discovery (Eq. 6):

$$\mathcal{L} = -\sum_t \log p(y_t) + \lambda_1 \big( g(\beta) + g(\gamma) \big) + \lambda_2 \mathcal{L}_{con}, \qquad (14)$$

where $\lambda_1, \lambda_2$ are hyperparameters and g is a regularization function with p = 2 and q = 0.5:

$$g(\alpha) = \|\alpha\|_{1,p} + \|\alpha^{\top}\|_{1,q} = \Big[ \sum_i \Big( \sum_t \alpha_{t,i} \Big)^{p} \Big]^{1/p} + \Big[ \sum_t \Big( \sum_i \alpha_{t,i} \Big)^{q} \Big]^{1/q}. \qquad (15)$$

For the rest of the models, we transfer the parameters of the concept word detector trained with the description model, and allow these parameters to be fine-tuned.
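For illustration, the regularizer of Eq. (15) and the overall cost of Eq. (14) might be computed as below; the λ values are placeholders, not the paper's settings, and the small epsilon is added only for numerical stability.

```python
import torch

def attention_regularizer(att, p=2.0, q=0.5, eps=1e-8):
    """g(.) of Eq. (15) for an attention matrix att of shape (T, K):
    a (1,p)-norm over per-concept totals plus a (1,q)-norm over per-step totals."""
    term_p = (att.sum(dim=0).pow(p).sum() + eps).pow(1.0 / p)   # [sum_i (sum_t a_{t,i})^p]^(1/p)
    term_q = (att.sum(dim=1).pow(q).sum() + eps).pow(1.0 / q)   # [sum_t (sum_i a_{t,i})^q]^(1/q)
    return term_p + term_q

def total_loss(nll, beta, gamma, l_con, lam1=1e-4, lam2=1.0):
    """Eq. (14): caption negative log-likelihood + attention regularizers + concept loss."""
    return nll + lam1 * (attention_regularizer(beta) + attention_regularizer(gamma)) + lam2 * l_con
```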
3.2. A Model for Fill-in-the-Blank

Fig. 3(a) illustrates the proposed model for the fill-in-the-blank task. It is based on a bidirectional LSTM network (BLSTM) [21, 10], which is useful for predicting a blank word in an imperfect sentence, since it considers the sequence in both forward and backward directions. Our key idea is to employ the semantic attention mechanism on both the input and the output of the BLSTM, to strengthen the meaning of input and output words with the detected concept words.

The model takes the word representations $\{c_t\}_{t=1}^{T}$ and concept words $\{a_i\}_{i=1}^{K}$ as input. Each $c_t \in \mathbb{R}^{d}$ is obtained by multiplying the one-hot word vector by the embedding matrix E. Suppose that the t-th text input is a blank, for which we use a special token <blank>. We add the word prediction module only on top of the t-th step of the BLSTM.

BLSTM. The input video is represented by the video encoding LSTM in Figure 2. The hidden state of the final video frame $s_N$ is used to initialize the hidden states of the BLSTM: $h^{b}_{T+1} = h^{f}_{0} = s_N$, where $\{h^{f}_t\}_{t=1}^{T}$ and $\{h^{b}_t\}_{t=1}^{T}$ are the forward and backward hidden states of the BLSTM, respectively:

$$h^{f}_{t} = \mathrm{LSTM}(x_t, h^{f}_{t-1}), \qquad (16)$$

$$h^{b}_{t} = \mathrm{LSTM}(x_t, h^{b}_{t+1}). \qquad (17)$$

We also use layer normalization [1].

Semantic Attention. The input and output semantic attention of this model is almost identical to that of the captioning model in section 3.1, except that the word representation $c_t \in \mathbb{R}^{d}$ is used as input at each time step, instead of the previous word vector $y_{t-1}$. The attention-weighted word vectors $\{x_t\}_{t=1}^{T}$ are then fed into the BLSTM.

The output semantic attention is also similar to that of the captioning model in section 3.1, except that we apply the attention only once, at the t-th step where the <blank> token is taken as input. We feed the output of the BLSTM

$$o_t = \tanh\big(W_o [h^{f}_{t}; h^{b}_{t}] + b_o\big), \qquad (18)$$

where $W_o \in \mathbb{R}^{D \times 2D}$ and $b_o \in \mathbb{R}^{D}$, into the output attention function ϕ, which generates $p \in \mathbb{R}^{D}$ as in Eq. (12) of the description model, $p = \varphi(o_t, \{a_i\})$.
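A rough sketch of the fill-in-the-blank head built from Eqs. (16)–(18) is given below. It reuses the SemanticAttention sketch from section 3.1 above, omits the initialization of the BLSTM states from s_N for brevity, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class BlankPredictor(nn.Module):
    """Sketch of the fill-in-the-blank head: BLSTM (Eqs. (16)-(17)), output projection (Eq. (18)),
    and output semantic attention to score the blank word."""

    def __init__(self, d=300, D=512, vocab_size=12486):
        super().__init__()
        self.fwd = nn.LSTM(D, D, batch_first=True)   # forward direction, Eq. (16)
        self.bwd = nn.LSTM(D, D, batch_first=True)   # backward direction, Eq. (17)
        self.W_o = nn.Linear(2 * D, D)               # W_o, b_o in Eq. (18)
        self.att = SemanticAttention(d=d, D=D, vocab_size=vocab_size)

    def forward(self, x, concepts, blank_pos):
        # x: (B, T, D) attention-weighted word vectors; concepts: (B, K, d); blank_pos: (B,) long tensor
        h_f, _ = self.fwd(x)                                    # forward hidden states
        h_b, _ = self.bwd(torch.flip(x, dims=[1]))              # run on the reversed sequence
        h_b = torch.flip(h_b, dims=[1])                         # re-align backward states with time
        idx = blank_pos.view(-1, 1, 1).expand(-1, 1, h_f.size(-1))
        o_t = torch.tanh(self.W_o(torch.cat(
            [h_f.gather(1, idx), h_b.gather(1, idx)], dim=-1))).squeeze(1)   # Eq. (18)
        return self.att.output_attention(o_t, concepts)         # word distribution for the blank
```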
