
End-to-end Concept Word Detection
for Video Captioning, Retrieval, and Question Answering
Youngjae Yu Hyungjin Ko Jongwook Choi Gunhee Kim
Seoul National University, Seoul, Korea
{yj.yu, hj.ko}@vision.snu.ac.kr, {wookayin, gunhee}@snu.ac.kr
http://vision.snu.ac.kr/project/lsmdc-2016
Abstract
We propose a high-level concept word detector that can be integrated with any video-to-language models. It takes a video as input and generates a list of concept words as useful semantic priors for language generation models. The proposed word detector has two important properties. First, it does not require any external knowledge sources for training. Second, the proposed word detector is trainable in an end-to-end manner jointly with any video-to-language models. To effectively exploit the detected words, we also develop a semantic attention mechanism that selectively focuses on the detected concept words and fuses them with the word encoding and decoding in the language model. In order to demonstrate that the proposed approach indeed improves the performance of multiple video-to-language tasks, we participate in all the four tasks of LSMDC 2016 [18]. Our approach has won three of them, including fill-in-the-blank, multiple-choice test, and movie retrieval.
1. Introduction

Video-to-language tasks, including video captioning [6, 8, 17, 27, 32, 35] and video question answering (QA) [23], are recent emerging challenges in computer vision research. This set of problems is interesting as one of the frontiers in artificial intelligence; beyond that, it can also potentiate multiple practical applications, such as retrieving video content by users' free-form queries or helping visually impaired people understand the visual content. Recently, a number of large-scale datasets have been introduced as a common ground for researchers to promote the progress of video-to-language research (e.g. [4, 16, 18, 23]).

The objective of this work is to propose a concept word detector, as shown in Fig. 1, which takes a training set of videos and associated sentences as input, and generates a list of high-level concept words per video as useful semantic priors for a variety of video-to-language tasks, including video captioning, retrieval, and question answering.
Figure 1. The intuition of the proposed concept word detector. Given a video clip, a set of tracing LSTMs extract multiple concept words that consistently appear across frame regions. We then employ semantic attention to combine the detected concepts with text encoding/decoding for several video-to-language tasks of LSMDC 2016, such as captioning, retrieval, and question answering.
We design our word detector to have the following two characteristics, so that it can be easily integrated with any video-to-language models. First, it does not require any external knowledge sources for training. Instead, our detector learns the correlation between words in the captions and video regions from the whole training data. To this end, we use a continuous soft attention mechanism that traces consistent visual information across frames and associates it with concept words from captions. Second, the word detector is trainable in an end-to-end manner jointly with any video-to-language models. The loss function for learning the word detector can be plugged as an auxiliary term into the model's overall cost function; as a result, we reduce the effort to separately collect training examples and learn both models.
We also develop language model components to effectively exploit the detected words. Inspired by semantic attention in image captioning research [34], we develop an attention mechanism that selectively focuses on the detected concept words and fuses them with word encoding and decoding in the language model. That is, the detected concept words are combined with input words to better represent the hidden states of encoders, and with output words to generate more accurate word predictions.
In order to demonstrate that the proposed word detector and attention mechanism indeed improve the performance of multiple video-to-language tasks, we participate in the four tasks of LSMDC 2016 (Large Scale Movie Description Challenge) [18], which is one of the most active and successful benchmarks advancing the progress of video-to-language research. The challenges include movie description and multiple-choice test as video captioning, fill-in-the-blank as video question answering, and movie retrieval as video retrieval. Following the public evaluation protocol of LSMDC 2016, our approach achieves the best accuracies in three tasks (fill-in-the-blank, multiple-choice test, and movie retrieval), and comparable performance in the remaining task (movie description).
1.1. Related Work

Our work can be uniquely positioned in the context of two recent research directions in image/video captioning.

Image/Video Captioning with Word Detection. Image and video captioning has been actively studied in recent vision and language research, including [5, 6, 8, 17, 19, 27, 28], to name a few. Among them, there have been several attempts to detect a set of concept words or attributes from visual input to boost the captioning performance. In image captioning research, Fang et al. [7] exploit a multiple instance learning (MIL) approach to train visual detectors that identify a set of words with bounding boxed regions of the image. Based on the detected words, they retrieve and re-rank the best caption sentence for the image. Wu et al. [29] use a CNN to learn a mapping between an image and semantic attributes. They then exploit the mapping as an input to the captioning decoder. They also extend the framework to explicitly leverage external knowledge bases such as DBpedia for question answering tasks. Venugopalan et al. [26] generate descriptions with novel words beyond the ones in the training set, by leveraging external sources, including object recognition datasets like ImageNet and external text corpora like Wikipedia. You et al. [34] also exploit weak labels and tags on Internet images to train additional parametric visual classifiers for image captioning.
In the video domain, it is more ambiguous to learn the relation between descriptive words and visual patterns, and there has been only little work on word detection for video captioning. Rohrbach et al. [17] propose a two-step approach for video captioning on the LSMDC dataset. They first extract verbs, objects, and places from movie descriptions, and separately train SVM-based classifiers for each group. They then learn an LSTM decoder that generates text descriptions based on the responses of these visual classifiers.

While almost all previous captioning methods exploit external classifiers for concept or attribute detection, the novelty of our work lies in that we use only the captioning training data, with no external sources, to learn the word detector, and propose an end-to-end design for learning both word detection and caption generation simultaneously. Moreover, compared to the video captioning work of [17], which addresses only the movie description task of LSMDC, this work is more comprehensive in that we validate the usefulness of our method on all four tasks of LSMDC.
Attention for Captioning. Attention mechanisms have been successfully applied to caption generation. One of the earliest works is [31], which dynamically focuses on different image regions to produce an output word sequence. Later, this soft attention was extended to temporal attention over video frames [33, 35] for video captioning.

Beyond attention on the spatial or temporal structure of visual input, You et al. [34] recently propose attention on attribute words for image captioning. That is, the method enumerates a set of important object labels in the image, and then dynamically switches attention among these concept labels. Although our approach also exploits the idea of semantic attention, it bears two key differences. First, we extend semantic attention to the video domain for the first time, not only for video captioning but also for retrieval and question answering tasks. Second, the approach of [34] relies on classifiers that are separately learned from external datasets, whereas our approach is learnable end-to-end with only the training data of captioning. This significantly reduces the effort required to prepare additional multi-label classifiers.
1.2. Contributions

We summarize the contributions of this work as follows.

(1) We propose a novel end-to-end learning approach for detecting a list of concept words and attending on them to enhance the performance of multiple video-to-language tasks. The proposed concept word detection and attention model can be plugged into any models for video captioning, retrieval, and question answering. Our technical novelties can be seen from two recent trends of image/video captioning research. First, our work is the first end-to-end trainable model not only for concept word detection but also for language generation. Second, our work is the first semantic attention model for video-to-language tasks.

(2) To validate the applicability of the proposed approach, we participate in all the four tasks of LSMDC 2016. Our models have won three of them, including fill-in-the-blank, multiple-choice test, and movie retrieval. We also attain comparable performance for movie description.
2. Detection of Concept Words from Videos

We first explain the pre-processing steps for the representation of words and video frames. Then, we explain how we detect concept words for a given video.
2.1. Preprocessing

Dictionary and Word Embedding. We define a vocabulary dictionary V by collecting the words that occur more than three times in the dataset. The dictionary size is |V| = 12,486, from which our models sequentially select words as output. We train the word2vec skip-gram embedding [14] to obtain the word embedding matrix $E \in \mathbb{R}^{d \times |V|}$, where d is the word embedding dimension and |V| is the dictionary size. We set d = 300 in our implementation.
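As an illustration of this preprocessing step, a vocabulary and skip-gram embedding matrix could be built roughly as follows. This is a minimal sketch, assuming tokenized caption sentences and the gensim library; the function name and the gensim-4.x `vector_size` argument are my choices, not part of the paper.

```python
from collections import Counter
import numpy as np
from gensim.models import Word2Vec

def build_vocab_and_embedding(captions, dim=300):
    """captions: list of tokenized caption sentences, e.g. [["someone", "opens", "the", "door"], ...]."""
    counts = Counter(w for sent in captions for w in sent)
    vocab = [w for w, c in counts.items() if c > 3]               # words occurring more than three times
    w2v = Word2Vec(captions, vector_size=dim, sg=1, min_count=4)  # skip-gram, d = 300 as in the paper
    # Word embedding matrix E of shape (d, |V|), one column per vocabulary word.
    E = np.stack([w2v.wv[w] if w in w2v.wv else np.zeros(dim) for w in vocab], axis=1)
    return vocab, E
```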
Video Representation. We first equidistantly sample one out of every ten frames from a video, to reduce frame redundancy while minimizing the loss of information. We denote the number of video frames by N. We limit the maximum number of frames to N_max = 40; if a video is too long, we use a wider interval for uniform sampling.

We employ a convolutional neural network (CNN) to encode the video input. Specifically, we extract the feature map of each frame from the res5c layer (i.e. $\mathbb{R}^{7 \times 7 \times 2048}$) of ResNet [9] pretrained on the ImageNet dataset [20], and then apply a 2×2 max-pooling followed by a 3×3 convolution to reduce the dimension to $\mathbb{R}^{4 \times 4 \times 500}$. Reducing the number of spatial grid regions to 4×4 helps the concept word detector train much faster, while not hurting detection performance significantly. We denote the resulting visual features of the frames by $\{v_n\}_{n=1}^{N}$. Throughout this paper, we use n to denote the video frame index.
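For reference, the frame-feature pipeline described above can be sketched in PyTorch/torchvision as below; the pooling padding, the initialization of the 2048-to-500 convolution, and freezing the ResNet backbone are assumptions on my part, since the paper specifies only the kernel sizes and the resulting 4×4×500 shape.

```python
import torch
import torch.nn as nn
from torchvision import models

class FrameEncoder(nn.Module):
    """res5c feature map (7x7x2048) -> 2x2 max-pool -> 3x3 conv -> 4x4x500 grid features."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool + fc -> res5c output
        for p in self.backbone.parameters():                           # the pretrained CNN is kept frozen here
            p.requires_grad = False
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=1)   # 7x7 -> 4x4 (padding is an assumption)
        self.reduce = nn.Conv2d(2048, 500, kernel_size=3, padding=1)   # 4x4x2048 -> 4x4x500

    def forward(self, frames):
        # frames: (N, 3, 224, 224) sampled video frames, ImageNet-normalized
        feat = self.backbone(frames)          # (N, 2048, 7, 7), i.e. res5c
        feat = self.reduce(self.pool(feat))   # (N, 500, 4, 4)
        return feat.permute(0, 2, 3, 1)       # (N, 4, 4, 500) = {v_n}
```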
2.2. An Attention Model for Concept Detection

Concept Words and Traces. We propose the concept word detector using LSTM networks with a soft attention mechanism. Its structure is shown in the red box of Fig. 2. Its goal is, for a given video, to discover a list of concept words that consistently appear across frame regions. The detected concept words are used as additional references for the video captioning model (section 3.1), which generates output sentences by selectively attending on those words.

We first define a set of candidate words with a size of V from all training captions. Among them, we discover K concept words per video. We set V = 2,000 and K = 10. We first apply the automatic POS tagging of NLTK [3] to extract nouns, verbs and adjectives from all training caption sentences [7]. We then compute the frequencies of those words in the training set, and select the V most common words as concept word candidates.
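A minimal sketch of this candidate-selection step with NLTK follows; lower-casing and the Penn Treebank tag prefixes used to pick nouns, verbs, and adjectives are my assumptions.

```python
from collections import Counter
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

def select_candidate_words(captions, V=2000):
    """captions: list of raw caption strings; returns the V most frequent nouns/verbs/adjectives."""
    counts = Counter()
    for sent in captions:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sent.lower())):
            if tag.startswith(('NN', 'VB', 'JJ')):   # nouns, verbs, adjectives
                counts[word] += 1
    return [w for w, _ in counts.most_common(V)]
```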
Since we do not have groundtruth bounding boxes for concept words in the videos, we cannot train individual concept detectors in a standard supervised setting. Our idea is to adopt a soft attention mechanism to infer words by tracking regions that are spatially consistent. To this end, we employ a set of tracing LSTMs, each of which takes care of a single spatially-consistent meaning being tracked over time, which we call a trace. That is, we keep track of spatial attention over video frames using an LSTM, so that the spatial attentions in adjacent frames preserve the spatial consistency of a single concept (e.g. a moving object, or an action in video clips; see Fig. 1). We use a total of L tracing LSTMs to capture L traces (or concepts), where L is the number of spatial regions in the visual feature (i.e. L = 4×4 = 16 for $v_n \in \mathbb{R}^{4 \times 4 \times D}$). Fusing these L concepts together, we finally discover K concept words, as will be described next.
Computation of Spatial Attention. For each trace l, we maintain spatial attention weights $\alpha_n^{(l)} \in \mathbb{R}^{4 \times 4}$, indicating where to attend on the (4×4) spatial grid locations of $v_n$, through video frames n = 1 ... N. The initial attention weight $\alpha_0^{(l)}$ at n = 0 is initialized with a one-hot matrix, for each of the L grid locations. We compute the hidden states $h_n^{(l)} \in \mathbb{R}^{500}$ of the LSTM through n = 1 ... N by

$$c_n^{(l)} = \alpha_n^{(l)} \otimes v_n, \qquad (1)$$

$$h_n^{(l)} = \mathrm{LSTM}(c_n^{(l)}, h_{n-1}^{(l)}), \qquad (2)$$

where $A \otimes B = \sum_{j,k} A_{(j,k)} \cdot B_{(j,k,:)}$. The input to the LSTMs is the context vector $c_n^{(l)} \in \mathbb{R}^{500}$, which is obtained by applying the spatial attention $\alpha_n^{(l)}$ to the visual feature $v_n$. Note that the parameters of the L LSTMs are shared.

The attention weight vector $\alpha_n^{(l)} \in \mathbb{R}^{4 \times 4}$ at time step n is updated as follows:

$$e_n^{(l)}(j,k) = v_n(j,k) \odot h_{n-1}^{(l)}, \qquad (3)$$

$$\alpha_n^{(l)} = \mathrm{softmax}\big(\mathrm{Conv}(e_n^{(l)})\big), \qquad (4)$$

where $\odot$ is the elementwise product, and Conv(·) denotes the two convolution operations before the softmax layer in Fig. 2. Note that $\alpha_n^{(l)}$ in Eqs. (3)–(4) is computed from the previous hidden state $h_{n-1}^{(l)}$ of the LSTM.

The spatial attention $\alpha_n^{(l)}$ measures how each spatial grid location of the visual feature is related to the concept being tracked by the tracing LSTM. By repeating these two steps of Eqs. (1)–(4) from n = 1 to N, our model can continuously find important and temporally consistent meanings over time that are closely related to a part of the video, rather than focusing on each video frame individually.
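To make Eqs. (1)–(4) concrete, the following is a minimal PyTorch sketch of a single tracing-LSTM step. The channel widths inside Conv(·) and the exact ordering of the attention and state updates across frames are assumptions, since the paper specifies only the 4×4 grid, the 500-d feature and hidden sizes, and "two convolution operations before the softmax".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TracingLSTM(nn.Module):
    """Sketch of one tracing-LSTM step (Eqs. (1)-(4)); one instance is shared by all L traces."""

    def __init__(self, feat_dim=500, hidden_dim=500):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)   # parameters shared across all L traces
        self.conv = nn.Sequential(                      # Conv(.) of Eq. (4); the widths are assumed
            nn.Conv2d(feat_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 1, 3, padding=1),
        )

    def step(self, v_n, alpha, state):
        """v_n: (B,4,4,D) frame feature; alpha: (B,4,4) attention carried over from the
        previous frame (alpha_0 is one-hot); state: (h, c) of the LSTMCell or None."""
        # Eq. (1): attention-weighted sum over the 4x4 grid -> 500-d context vector
        ctx = (alpha.unsqueeze(-1) * v_n).sum(dim=(1, 2))
        # Eq. (2): update the trace state
        h, c = self.lstm(ctx, state)
        # Eq. (3): elementwise product between each grid feature and the (now previous) hidden state
        e = v_n * h.view(-1, 1, 1, h.size(-1))
        # Eq. (4): two convolutions + softmax over the 16 grid cells -> attention for the next frame
        scores = self.conv(e.permute(0, 3, 1, 2)).flatten(1)    # (B, 16)
        next_alpha = F.softmax(scores, dim=1).view(-1, 4, 4)
        return next_alpha, (h, c)
```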
Finally, we predict the concept confidence vector p:

$$p = \sigma\big(W_p\,[h_N^{(1)}; \cdots; h_N^{(L)}] + b_p\big) \in \mathbb{R}^{V}, \qquad (5)$$

that is, we first concatenate the hidden states $\{h_N^{(l)}\}_{l=1}^{L}$ of all tracing LSTMs at the last time step, apply a linear transform parameterized by $W_p \in \mathbb{R}^{V \times (500L)}$ and $b_p \in \mathbb{R}^{V}$, and apply the elementwise sigmoid activation σ.

Figure 2. The architecture of the concept word detection in the top red box (section 2.2), and our video description model at the bottom, which uses semantic attention on the detected concept words (section 3.1).
Training and Inference. For training, we obtain a reference concept confidence vector $p^{*} \in \mathbb{R}^{V}$ whose element $p^{*}_i$ is 1 if the corresponding word exists in the groundtruth caption, and 0 otherwise. We minimize the following sigmoid cross-entropy cost $\mathcal{L}_{con}$, which is often used for multi-label classification [30], where each class is independent and not mutually exclusive:

$$\mathcal{L}_{con} = -\frac{1}{V} \sum_{i=1}^{V} \big[\, p^{*}_i \log(p_i) + (1 - p^{*}_i) \log(1 - p_i) \,\big]. \qquad (6)$$

Strictly speaking, since we apply an end-to-end learning approach, the cost of Eq. (6) is used as an auxiliary term in the overall cost function, which will be discussed in section 3.

For inference, we compute p for a given query video, and find the top K words according to the score p (i.e. $\mathrm{argmax}_{1:K}\, p$). Finally, we represent these K concept words by their word embeddings $\{a_i\}_{i=1}^{K}$.
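The confidence prediction of Eq. (5), the multi-label loss of Eq. (6), and the top-K inference step can be sketched as follows; the hypothetical `vocab` list of the V candidate words and the batch layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptPredictor(nn.Module):
    """Eq. (5): concatenate the L final trace states and predict V word confidences."""
    def __init__(self, hidden_dim=500, L=16, V=2000):
        super().__init__()
        self.fc = nn.Linear(hidden_dim * L, V)   # W_p, b_p

    def forward(self, trace_states):
        # trace_states: (B, L, hidden_dim) = {h_N^(l)}; returns logits (sigmoid applied in the loss)
        return self.fc(trace_states.flatten(1))

def concept_loss(logits, caption_words, vocab):
    """Eq. (6): sigmoid cross-entropy against a 0/1 reference vector built from the caption."""
    target = torch.zeros_like(logits)
    index = {w: i for i, w in enumerate(vocab)}
    for b, words in enumerate(caption_words):                  # caption_words: list of token lists
        for w in words:
            if w in index:
                target[b, index[w]] = 1.0
    return F.binary_cross_entropy_with_logits(logits, target)  # averaged over the V classes

def top_k_concepts(logits, vocab, K=10):
    """Inference: pick the K highest-confidence concept words per video."""
    idx = torch.sigmoid(logits).topk(K, dim=1).indices
    return [[vocab[i] for i in row.tolist()] for row in idx]
```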
3. Video-to-Language Models

We design a different base model for each of the LSMDC tasks, while they share the concept word detector and the semantic attention mechanism. That is, we aim to validate that the proposed concept word detection is useful for a wide range of video-to-language models. For the base models, we take advantage of state-of-the-art techniques, which we do not claim as our contribution. We refer to our video-to-language models leveraging the concept word detector as CT-SAN (Concept-Tracing Semantic Attention Network).

For a better understanding of our models, we outline the four LSMDC tasks as follows: (i) Movie description: generating a single descriptive sentence for a given movie clip, (ii) Fill-in-the-blank: given a video and a sentence with a single blank, finding a suitable word for the blank from the whole vocabulary set, (iii) Multiple-choice test: given a video query and five descriptive sentences, choosing the correct one out of them, and (iv) Movie retrieval: ranking 1,000 movie clips for a given natural language query.

We defer more model details to the supplementary file. In particular, we skip the description of the multiple-choice and movie retrieval models in Figure 3(b)–(c), which can be found in the supplementary file.
3.1. A Model for Description

Fig. 2 shows the proposed video captioning model. It takes the video features $\{v_n\}_{n=1}^{N}$ and the detected concept words $\{a_i\}_{i=1}^{K}$ as input, and produces a word sequence $\{y_t\}_{t=1}^{T}$ as output. The model comprises video encoding and caption decoding LSTMs, and two semantic attention models. The two LSTM networks are two layers deep, with layer normalization [1] and dropout [22] with a rate of 0.2.

Video Encoder. The video encoding LSTM encodes a video into a sequence of hidden states $\{s_n\}_{n=1}^{N} \in \mathbb{R}^{D}$:

$$s_n = \mathrm{LSTM}(\bar{v}_n, s_{n-1}), \qquad (7)$$

where $\bar{v}_n \in \mathbb{R}^{D}$ is obtained by (4,4)-average-pooling $v_n$.
Caption Decoder. The caption decoding LSTM is a standard LSTM network:

$$h_t = \mathrm{LSTM}(x_t, h_{t-1}), \qquad (8)$$

where the input $x_t$ is an intermediate representation of the t-th input word with semantic attention applied, as described below. We initialize the hidden state at t = 0 with the last hidden state of the video encoder: $h_0 = s_N \in \mathbb{R}^{D}$.
Semantic Attention. Based on [34], our model in Fig. 2 uses semantic attention in two different parts, which we call the input and the output semantic attention, respectively.

The input semantic attention φ computes an attention weight $\gamma_{t,i}$, which is assigned to each predicted concept word $a_i$. It helps the caption decoding LSTM focus on different concept words dynamically at each step t. The attention weight vector $\gamma_t \in \mathbb{R}^{K}$ and the input vector $x_t \in \mathbb{R}^{D}$ to the LSTM are obtained by

$$\gamma_{t,i} \propto \exp\big((E y_{t-1})^{\top} W_{\gamma}\, a_i\big), \qquad (9)$$

$$x_t = \phi(y_{t-1}, \{a_i\}) = W_x \Big( E y_{t-1} + \mathrm{diag}(w_{x,a}) \sum_i \gamma_{t,i}\, a_i \Big). \qquad (10)$$

We multiply the previous (one-hot) word vector $y_{t-1} \in \mathbb{R}^{|V|}$ by the word embedding matrix E so that it becomes d-dimensional. The parameters to learn include $W_{\gamma} \in \mathbb{R}^{d \times d}$, $W_x \in \mathbb{R}^{D \times d}$ and $w_{x,a} \in \mathbb{R}^{d}$.

The output semantic attention ϕ guides how to dynamically weight the concept words $\{a_i\}$ when generating an output word $y_t$ at each step. We use $h_t$, the hidden state of the decoding LSTM at step t, as an input to the output attention function ϕ. We then compute $p_t \in \mathbb{R}^{D}$ by attending to the concept word set $\{a_i\}$ with the weights $\beta_{t,i}$:

$$\beta_{t,i} \propto \exp\big(h_t^{\top} W_{\beta}\, \sigma(a_i)\big), \qquad (11)$$

$$p_t = \varphi(h_t, \{a_i\}) = h_t + \mathrm{diag}(w_{h,a}) \sum_i \beta_{t,i}\, W_{\beta}\, \sigma(a_i), \qquad (12)$$

where σ is the hyperbolic tangent, and the parameters include $w_{h,a} \in \mathbb{R}^{D}$ and $W_{\beta} \in \mathbb{R}^{D \times d}$.

Finally, the probability of the output word is obtained as

$$p(y_t \mid y_{1:t-1}) = \mathrm{softmax}(W_y\, p_t + b_y), \qquad (13)$$

where $W_y \in \mathbb{R}^{|V| \times D}$ and $b_y \in \mathbb{R}^{|V|}$. This procedure loops until $y_t$ corresponds to the <EOS> token.
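The following is one possible reading of Eqs. (9)–(13) in PyTorch; normalizing the ∝ exp(·) weights with a softmax, the parameter initializations, and the hidden size D are assumptions, and this is a sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAttention(nn.Module):
    """Sketch of the input/output semantic attention (Eqs. (9)-(13)).
    d = word embedding size, D = LSTM hidden size, K = number of detected concept words."""

    def __init__(self, d=300, D=512, vocab_size=12486):
        super().__init__()
        self.W_gamma = nn.Parameter(torch.randn(d, d) * 0.01)
        self.W_x = nn.Linear(d, D, bias=False)       # W_x in Eq. (10)
        self.w_xa = nn.Parameter(torch.ones(d))      # diag(w_{x,a})
        self.W_beta = nn.Parameter(torch.randn(D, d) * 0.01)
        self.w_ha = nn.Parameter(torch.ones(D))      # diag(w_{h,a})
        self.W_y = nn.Linear(D, vocab_size)          # W_y, b_y in Eq. (13)

    def input_attention(self, prev_word_emb, concepts):
        # prev_word_emb: (B, d) = E y_{t-1}; concepts: (B, K, d) = {a_i}
        logits = torch.einsum('bd,de,bke->bk', prev_word_emb, self.W_gamma, concepts)  # Eq. (9)
        gamma = F.softmax(logits, dim=1)
        mixed = (gamma.unsqueeze(-1) * concepts).sum(1)                  # sum_i gamma_{t,i} a_i
        return self.W_x(prev_word_emb + self.w_xa * mixed)               # Eq. (10): x_t

    def output_attention(self, h_t, concepts):
        # h_t: (B, D) decoder hidden state; concepts: (B, K, d)
        proj = torch.tanh(concepts) @ self.W_beta.t()                    # W_beta sigma(a_i): (B, K, D)
        beta = F.softmax((h_t.unsqueeze(1) * proj).sum(-1), dim=1)       # Eq. (11)
        p_t = h_t + self.w_ha * (beta.unsqueeze(-1) * proj).sum(1)       # Eq. (12)
        return F.log_softmax(self.W_y(p_t), dim=-1)                      # Eq. (13), as log-probabilities
```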
Training. To learn the parameters of the model, we define a loss function as the total negative log-likelihood of all the words, with regularization terms on the attention weights $\{\alpha_{t,i}\}$, $\{\beta_{t,i}\}$, and $\{\gamma_{t,i}\}$ [34], as well as the loss $\mathcal{L}_{con}$ for concept discovery (Eq. 6):

$$\mathcal{L} = -\sum_t \log p(y_t) + \lambda_1 \big( g(\beta) + g(\gamma) \big) + \lambda_2 \mathcal{L}_{con}, \qquad (14)$$

where $\lambda_1, \lambda_2$ are hyperparameters and g is a regularization function with p = 2 and q = 0.5:

$$g(\alpha) = \|\alpha\|_{1,p} + \|\alpha^{\top}\|_{1,q} = \Big[ \sum_i \Big( \sum_t \alpha_{t,i} \Big)^{p} \Big]^{1/p} + \Big[ \sum_t \Big( \sum_i \alpha_{t,i} \Big)^{q} \Big]^{1/q}. \qquad (15)$$

For the rest of the models, we transfer the parameters of the concept word detector trained with the description model, and allow these parameters to be fine-tuned.
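For illustration, the regularizer of Eq. (15) and the overall cost of Eq. (14) might be computed as below; the λ values are placeholders, not the paper's settings, and the small epsilon is added only for numerical stability.

```python
import torch

def attention_regularizer(att, p=2.0, q=0.5, eps=1e-8):
    """g(.) of Eq. (15) for an attention matrix att of shape (T, K):
    a (1,p)-norm over per-concept totals plus a (1,q)-norm over per-step totals."""
    term_p = (att.sum(dim=0).pow(p).sum() + eps).pow(1.0 / p)   # [sum_i (sum_t a_{t,i})^p]^(1/p)
    term_q = (att.sum(dim=1).pow(q).sum() + eps).pow(1.0 / q)   # [sum_t (sum_i a_{t,i})^q]^(1/q)
    return term_p + term_q

def total_loss(nll, beta, gamma, l_con, lam1=1e-4, lam2=1.0):
    """Eq. (14): caption negative log-likelihood + attention regularizers + concept loss."""
    return nll + lam1 * (attention_regularizer(beta) + attention_regularizer(gamma)) + lam2 * l_con
```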
3.2. A Model for Fill-in-the-Blank

Fig. 3(a) illustrates the proposed model for the fill-in-the-blank task. It is based on a bidirectional LSTM network (BLSTM) [21, 10], which is useful for predicting a blank word in an imperfect sentence, since it considers the sequence in both forward and backward directions. Our key idea is to employ the semantic attention mechanism on both the input and the output of the BLSTM, to strengthen the meaning of input and output words with the detected concept words.

The model takes the word representations $\{c_t\}_{t=1}^{T}$ and concept words $\{a_i\}_{i=1}^{K}$ as input. Each $c_t \in \mathbb{R}^{d}$ is obtained by multiplying the one-hot word vector by the embedding matrix E. Suppose that the t-th text input is a blank, for which we use a special token <blank>. We add the word prediction module only on top of the t-th step of the BLSTM.

BLSTM. The input video is represented by the video encoding LSTM in Figure 2. The hidden state of the final video frame $s_N$ is used to initialize the hidden states of the BLSTM: $h^{b}_{T+1} = h^{f}_{0} = s_N$, where $\{h^{f}_t\}_{t=1}^{T}$ and $\{h^{b}_t\}_{t=1}^{T}$ are the forward and backward hidden states of the BLSTM, respectively:

$$h^{f}_{t} = \mathrm{LSTM}(x_t, h^{f}_{t-1}), \qquad (16)$$

$$h^{b}_{t} = \mathrm{LSTM}(x_t, h^{b}_{t+1}). \qquad (17)$$

We also use layer normalization [1].

Semantic Attention. The input and output semantic attention of this model is almost identical to that of the captioning model in section 3.1, except that the word representation $c_t \in \mathbb{R}^{d}$ is used as input at each time step, instead of the previous word vector $y_{t-1}$. The attention-weighted word vectors $\{x_t\}_{t=1}^{T}$ are then fed into the BLSTM.

The output semantic attention is also similar to that of the captioning model in section 3.1, except that we apply the attention only once, at the t-th step where the <blank> token is taken as input. We feed the output of the BLSTM

$$o_t = \tanh\big(W_o [h^{f}_{t}; h^{b}_{t}] + b_o\big), \qquad (18)$$

where $W_o \in \mathbb{R}^{D \times 2D}$ and $b_o \in \mathbb{R}^{D}$, into the output attention function ϕ, which generates $p \in \mathbb{R}^{D}$ as in Eq. (12) of the description model, $p = \varphi(o_t, \{a_i\})$.
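A rough sketch of the fill-in-the-blank head built from Eqs. (16)–(18) is given below. It reuses the SemanticAttention sketch from section 3.1 above, omits the initialization of the BLSTM states from s_N for brevity, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class BlankPredictor(nn.Module):
    """Sketch of the fill-in-the-blank head: BLSTM (Eqs. (16)-(17)), output projection (Eq. (18)),
    and output semantic attention to score the blank word."""

    def __init__(self, d=300, D=512, vocab_size=12486):
        super().__init__()
        self.fwd = nn.LSTM(D, D, batch_first=True)   # forward direction, Eq. (16)
        self.bwd = nn.LSTM(D, D, batch_first=True)   # backward direction, Eq. (17)
        self.W_o = nn.Linear(2 * D, D)               # W_o, b_o in Eq. (18)
        self.att = SemanticAttention(d=d, D=D, vocab_size=vocab_size)

    def forward(self, x, concepts, blank_pos):
        # x: (B, T, D) attention-weighted word vectors; concepts: (B, K, d); blank_pos: (B,) long tensor
        h_f, _ = self.fwd(x)                                    # forward hidden states
        h_b, _ = self.bwd(torch.flip(x, dims=[1]))              # run on the reversed sequence
        h_b = torch.flip(h_b, dims=[1])                         # re-align backward states with time
        idx = blank_pos.view(-1, 1, 1).expand(-1, 1, h_f.size(-1))
        o_t = torch.tanh(self.W_o(torch.cat(
            [h_f.gather(1, idx), h_b.gather(1, idx)], dim=-1))).squeeze(1)   # Eq. (18)
        return self.att.output_attention(o_t, concepts)         # word distribution for the blank
```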
