Concepts Not Alone: Exploring Pairwise Relationships for Zero-Shot Video Activity Recognition

Chuang Gan¹, Ming Lin³, Yi Yang², Gerard de Melo¹ and Alexander G. Hauptmann⁴
¹IIIS, Tsinghua University, Beijing, China
²QCIS, University of Technology Sydney, Sydney, Australia
³DCM&B, University of Michigan, Ann Arbor, USA
⁴SCS, Carnegie Mellon University, Pittsburgh, USA
Abstract
Vast quantities of videos are now being captured at astonishing rates, but the majority of these are not labelled. To cope with such data, we consider the task of content-based activity recognition in videos without any manually labelled examples, also known as zero-shot video recognition. To achieve this, videos are represented in terms of detected visual concepts, which are then scored as relevant or irrelevant according to their similarity with a given textual query. In this paper, we propose a more robust approach for scoring concepts in order to alleviate many of the brittleness and low precision problems of previous work. We jointly consider semantic relatedness, visual reliability, and discriminative power when selecting concepts. To handle noise and non-linearities in the ranking scores of the selected concepts, we propose a novel pairwise order matrix approach for score aggregation. Extensive experiments on the large-scale TRECVID Multimedia Event Detection data show the superiority of our approach.
1 Introduction
Motivation. The increasing ubiquity of devices capable of capturing videos has led to an explosion in the amount of recorded video content. Smartphones, action cameras, and surveillance cameras mean that ever-increasing amounts of our daily activities are captured on video. Due to the torrential volume of this data, the vast majority of videos are never labeled. Moreover, even for those that are shared online, the human-supplied metadata is often vague or unspecific (e.g., "Albufeira, Summer 2015"). Unfortunately, video search engines such as YouTube, Yahoo, and Bing crucially depend on textual keyword matching. Their approach works well for popular videos but fails hopelessly for long-tail content or personal video collections with insufficient metadata.
Fortunately, encouraging progress has been made on content-based video analysis in recent years. Standard approaches rely on low-level audio/visual input features that are fed into machine learning algorithms such as support vector machines (SVMs) (Chang and Lin 2011) or deep convolutional neural networks (Karpathy et al. 2014; Gan et al. 2015b). These achieve promising results when there are sufficient numbers of labeled training examples for every search query of interest.
However, due to the large number of possible search queries, query-specific training is not always feasible. Zero-shot learning (Lampert, Nickisch, and Harmeling 2009) addresses this problem by providing an alternative paradigm that does not require positive training exemplars for every class of videos. Given a textual query, one aims to retrieve videos that are most relevant to it, exploiting visual attributes of the videos.

While several algorithms (Jiang et al. 2015a; Wu et al. 2014; Dalton, Allan, and Mirajkar 2013; Habibian, Mensink, and Snoek 2014; Liu et al. 2013; Singh et al. 2015) have recently been proposed for such zero-shot video activity recognition, state-of-the-art systems still suffer from brittleness and low precision.
Contributions. In this paper, we show how to make zero-shot learning more robust. Similar to previous work, our system consists of two main components. The first of these aims at a semantic query interpretation, in which the system selects concepts pertaining to the query description from a large pool of potential candidate concepts. The second component produces an aggregation of the individual concept-specific video ranking lists. We propose important strategies for making both more robust:
• We propose a simple yet effective concept selection approach for representing queries. Unlike previous work, concept reliability and discriminative power are considered as critical indicators in order to ensure robust zero-shot activity recognition.
• We devise a novel robust video ranking approach that relies on the recovery of a low-rank order matrix from multiple pairwise order matrices for different concept ranking lists.
Experimental results on challenging unconstrained video data confirm that the proposed system outperforms state-of-the-art zero-shot approaches.
2 Related Work
Video analysis has attracted a lot of research interest in the past decade. A recent review can be found in (Jiang et al. 2013). Standard video activity recognition systems, despite their reasonable recognition performance, rely on custom low-level representations, such as improved dense trajectories (Wang and Schmid 2013) and Mel-Frequency Cepstral Coefficients (MFCC) (Rabiner and Schafer 2007). These suffer from several deal-breaking drawbacks. First, they are incapable of providing a semantic interpretation of a video. Second, because of their high dimensionality, training effective event classifiers on the low-level representation requires a substantial number of training examples per class. When only a few or even no positive examples are available, the power of low-level representations is limited. In our work, we instead draw on concept detection to derive higher-level representations.
Semantic video representations describe a video in terms of pre-defined pools of activity concepts and attributes, and have proven both robust for video activity recognition and interpretable by humans (Hauptmann et al. 2007). This form of representation has inspired the development of zero-shot video activity recognition. Most existing zero-shot learning frameworks follow a two-stage classification approach. Given a novel class, first its semantic properties (i.e., attributes or concepts) are identified, then its class label is predicted as a ranking function of those attributes or concepts. To identify relevant concepts, most of the current approaches rely on manually defined concept schemes or blindly adopt inventories based on some outside knowledge sources, such as WordNet (Lin 1998), Wikipedia (Gabrilovich and Markovitch 2007), or label co-occurrence data (Mensink, Gavves, and Snoek). In contrast to our approach, none of the existing approaches take visual reliability and discriminative power into consideration. As we will see in Section 4, the selection of concepts has a dramatic impact on the retrieval results.
When fusing rankings, most existing approaches to zero-shot activity recognition directly combine the raw ranking scores of related attributes or concepts into the final ranking list. Unlike these methods, we propose a more robust ranking function: we first convert the raw ranking scores into scale-invariant pairwise order matrices, and then decompose the multiple pairwise order matrices into a single shared order matrix, which is then consulted to generate the ranking list. Our experiments confirm that this leads to more robust ranking results.

Our work is also related to the sentence generation task (Guadarrama et al. 2013; Sun, Gan, and Nevatia 2015), where the goal is to generate natural language descriptions of a video. Our goal, in contrast, is to address the opposite direction: given a textual query, we seek to retrieve videos that match the query. Our work is further related to matrix factorization, which has been widely used in different tasks (Ye et al. 2012; Fan et al. 2014; Yan et al. 2015).
3 Proposed Method
3.1 Overview
In our approach, we first represent both the video data and the textual description by embedding them in a semantic concept space. The overall framework of our approach is outlined in Figure 1. It consists of two major components.

The first of the two components aims at a semantic query interpretation. Given a query, i.e., a textual description of the target event, we apply text analytics methods to extract salient words that describe this target class. The system selects relevant concepts matching these query words from a large pool of potential candidate concepts.

We are given a textual query q as input. Based on a vocabulary of d visual concepts, we first select relevant concepts according to their semantic similarity with q. A higher similarity score indicates that the corresponding concept is more related to the target video class denoted by q.

We also apply a bank of visual concept detectors to generate semantic representations for the videos. Thus, each video is represented as a vector whose elements are detection scores of different semantic concepts.

Having embedded both the query and the videos as numerical vectors of concepts, we select relevant concepts based on their similarity to the query, but also prune away selected semantic concepts that have insufficient reliability or discriminative power.
The second component produces a ranked list of videos. This is achieved by aggregating the individual concept-specific ranking lists. After the preliminary filtering, the detection scores of the selected concepts are still noisy. To alleviate the problems of noise and non-linearity across the different ranking lists, we first convert the raw ranking scores of the selected concepts into pairwise order matrices, in which each entry characterizes the comparative relationship of two test samples. Our hypothesis is that the relative score relations are consistent within component ranking lists, despite the large variations that may exist in the absolute values of the raw scores. Thus, we take the pairwise order matrices of different semantic concepts as input, and recover a common shared rank-2 order matrix. Even though each original matrix might be inaccurate, the joint relations from multiple matrices may be complementary with respect to each other, and hence the shared common matrix may well model the correct order. Finally, we transform the shared order matrix into a ranking list as the final retrieval result.
3.2 Query Interpretation
The goal of the query interpretation phase is to create a semantic representation of the query in terms of the salient semantic concepts that it expresses. These are taken from a large pool of candidate concepts and should not only be relevant to the textual query but also visually discernible in the videos. It is tedious, if not impossible, to manually label all related concepts in a large pool of potential candidate concepts. Therefore, most recent work relies on natural language processing technologies to compute the semantic similarity between the query and the candidate concepts (Gan et al. 2015a).

While fully automatic, this approach alone is far from satisfactory, due to the well-known semantic gap between query and visual content. For instance, a given abstract concept such as meeting may have vastly different visual representations in different circumstances. It appears unreasonable to choose the relevant visual concepts based only on semantic similarity scores between the concept names and the textual query. In our approach, we posit that the selected relevant concepts should have the following key properties:
1. Semantic relevance: the selected concepts should be semantically related to the textual query.
2. Visual reliability: the selected concepts should be reliably detectable across different datasets.
3. Event discriminativeness: the selected concepts should be discriminative enough for detecting the video activity.

Figure 1: An illustration of our zero-shot video activity recognition framework. We first automatically compute semantic similarity scores between the query and concepts based on the cosine distances of continuous word vectors, and filter out the concepts that are not sufficiently visually reliable or discriminative for events. Then we convert the raw ranking scores of the selected concepts into multiple pairwise order matrices, which are taken as input to recover a rank-2 order matrix under low-rank and skew-symmetric constraints. A robust ranking score vector is finally extracted to fit the recovered low-rank order matrix.
Query Analysis. To automatically generate query terms from an event description, we apply standard natural language processing techniques to clean up the textual description, including removal of common stop words and lemmatization (stemming) to normalize word inflections. We then compute the TF-IDF score of the remaining terms and select the top 5 terms as event query terms. Next, we compute the similarity of the query terms with the concept terms as described below, and average over all query terms.
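To make this query-analysis step concrete, the following minimal Python sketch illustrates one possible implementation. It assumes NLTK's stop-word list and WordNet lemmatizer (which require the corresponding NLTK data packages) and scikit-learn's TfidfVectorizer; the helper names `preprocess` and `extract_query_terms` are ours, not the authors'.

```python
import re
from nltk.corpus import stopwords            # requires the NLTK "stopwords" data package
from nltk.stem import WordNetLemmatizer      # requires the NLTK "wordnet" data package
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and lemmatize."""
    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words("english"))
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens if t not in stop)

def extract_query_terms(event_descriptions, target_idx, k=5):
    """Return the top-k TF-IDF terms of one event description as its query terms."""
    cleaned = [preprocess(d) for d in event_descriptions]
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(cleaned)             # one row per event description
    row = tfidf[target_idx].toarray().ravel()
    vocab = vec.get_feature_names_out()
    top = row.argsort()[::-1][:k]
    return [vocab[i] for i in top if row[i] > 0]
```

TF-IDF is computed across all event descriptions, so terms that appear in every description (e.g. "people") are automatically down-weighted.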
Semantic Similarity Computation. The semantic similarity computation requires us to have trained a model that can quantify the degree of semantic similarity between two words. This can conveniently be done beforehand in an offline process. We draw on the recent success of the skip-gram with negative sampling neural network model (Mikolov et al. 2013). Given a large text corpus such as Wikipedia, the objective is to produce vectors that represent words such that vector similarity reflects word similarity. The training objective of the skip-gram model achieves this by optimizing for word representations that allow for predicting the surrounding context words in a sentence. More precisely, given a sequence of words $\{w_1, w_2, \ldots, w_d\}$, it searches for a vector representation of each word $w_i$ such that

$$\frac{1}{d} \sum_{t=1}^{d} \sum_{-c \le j \le c,\, j \neq 0} \log P(w_{t+j} \mid w_t) \tag{1}$$
is maximized, where $c$ controls the context window length. The probability of $w_{t+j}$ given $w_t$ is defined by the softmax function

$$P(w_i \mid w_j) = \frac{\exp(w_i^{\top} w_j)}{\sum_{w} \exp(w^{\top} w_j)} \tag{2}$$
In order to optimize this more efficiently, a binary Huffman tree is used to predict words from the vocabulary, and training is carried out with stochastic gradient ascent, using negative sampling to limit the number of predictions (Mikolov et al. 2013).
Once we have optimized the word vector representations using the corpus, we can measure the similarity between words via the standard cosine measure. The larger the cosine score between two word vectors, the more they are deemed semantically related.
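As an illustration of how this selection could be implemented, here is a small Python sketch. It assumes the word vectors have already been trained offline (e.g., a skip-gram model on Wikipedia) and are available as a plain dict mapping words to numpy arrays; `concept_similarity` and `rank_concepts` are our own helper names.

```python
import numpy as np

def cosine(u, v):
    """Standard cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def concept_similarity(query_terms, concept_name, word_vectors):
    """Average cosine similarity between the query terms and a concept name.

    Multi-word concept names (e.g. "bee hive") are represented by the mean of
    their word vectors; terms missing from the vocabulary are skipped.
    """
    concept_words = [w for w in concept_name.lower().split() if w in word_vectors]
    if not concept_words:
        return 0.0
    c_vec = np.mean([word_vectors[w] for w in concept_words], axis=0)
    sims = [cosine(word_vectors[t], c_vec) for t in query_terms if t in word_vectors]
    return float(np.mean(sims)) if sims else 0.0

def rank_concepts(query_terms, concept_names, word_vectors):
    """Score every candidate concept against the query and sort by similarity."""
    scored = [(c, concept_similarity(query_terms, c, word_vectors)) for c in concept_names]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```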
Reliability and Discriminativeness Validation. While concepts selected by the above pipeline are likely semantically related, they need not be visually reliable or discriminative enough for activity recognition. To test reliability, we use two-fold drop-out cross validation. The precision averaged over the two folds reflects the reliability of the tested concept. We filter out concepts whose precision is below a threshold (set to 80% in all our experiments to reasonably balance precision and recall).

The discriminative power is assessed using detection scores on held-out data. Concepts that have detection scores over 1/2 for more than 50% of all activity classes are deleted. For example, person is semantically related to the event bee keeping and has a high reliability score. However, it obtains high scores on most of the videos, so the term lacks discriminative power for discerning more specific activities. Therefore, we may prune such concepts from the concept pool.
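A compact sketch of these two filters, assuming the cross-validation precisions and held-out detection scores have already been computed; the thresholds mirror the ones stated above (80% precision, scores above 1/2 on at most 50% of classes), but the data-structure names are ours.

```python
import numpy as np

def filter_concepts(candidates, cv_precision, heldout_scores,
                    min_precision=0.8, fire_thresh=0.5, max_fire_ratio=0.5):
    """Drop candidate concepts that are unreliable or non-discriminative.

    cv_precision[c]   : averaged precision of concept c under two-fold cross validation
    heldout_scores[c] : 1-D array of detection scores of concept c, one value per
                        held-out activity class (here: the UCF101 classes)

    A concept is kept only if its precision is at least `min_precision` and it
    scores above `fire_thresh` on at most `max_fire_ratio` of the activity classes.
    """
    kept = []
    for c in candidates:
        reliable = cv_precision[c] >= min_precision
        fire_ratio = float(np.mean(np.asarray(heldout_scores[c]) > fire_thresh))
        discriminative = fire_ratio <= max_fire_ratio
        if reliable and discriminative:
            kept.append(c)
    return kept
```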
3.3 Pairwise Order Matrix Construction
The next step is to use the semantic representation of the query to retrieve and rank a set of videos. For retrieval, most existing approaches (Jiang et al. 2015a; Wu et al. 2014; Dalton, Allan, and Mirajkar 2013; Habibian, Mensink, and Snoek 2014; Liu et al. 2013) naively average the detection scores of the selected relevant concepts. This leads to a suboptimal solution. First, the scales of the detection scores of different concepts are not comparable, even after normalization. Thus, it is unwise to use the same weight when fusing them, but manually fine-tuning weights is also not possible in practice. Second, even within the same concept, the detection scores are not necessarily linear. For example, we may not be able to discern any apparent difference between 0.5 and 0.9, yet perceive a marked difference between 0.1 and 0.15.

Assume that we have d concepts and n videos in the system. We apply d concept detectors to these videos. Our implementation uses a deep convolutional neural network (CNN) architecture (Krizhevsky, Sutskever, and Hinton 2012). We take the key frames of a given test video as input, run a forward pass through the CNN, and use the softmax score as the concept detection score for the key frame. To arrive at a video-level representation, we rely on simple average pooling.
This process yields a detection score matrix $X \in \mathbb{R}^{d \times n}$. Each column $X_i$ stores the detection scores of the $i$-th video with respect to all d concepts. For the k-th concept, we construct a pairwise order matrix $T^{(k)}$ with

$$T^{(k)}_{i,j} = \mathrm{sign}(X_{k,i} - X_{k,j}).$$

The matrix $T^{(k)}$ encodes the pairwise order of every two videos measured under the k-th concept. In particular, $T^{(k)}_{i,j} = 1$ indicates that the $i$-th sample is detected as positive with greater confidence than the $j$-th sample for the k-th concept, while $T^{(k)}_{i,j} = -1$ indicates the opposite comparative relation. Meanwhile, $T^{(k)}_{i,j} = 0$ indicates that the $i$-th sample has a detection confidence value similar to that of the $j$-th sample. As this order matrix captures relative assessments between different samples, it is not influenced by the scale or non-linearity of the detection scores within or between concepts.
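The score matrix X and the per-concept order matrices can be built with a few lines of numpy. This is a sketch under the definitions above (average pooling of frame scores, then $T^{(k)}_{i,j} = \mathrm{sign}(X_{k,i} - X_{k,j})$), not the authors' code; the function names are ours.

```python
import numpy as np

def video_level_scores(frame_scores):
    """Average-pool frame-level CNN softmax scores (frames x d) into one d-vector."""
    return np.asarray(frame_scores, dtype=float).mean(axis=0)

def pairwise_order_matrices(X):
    """Build one pairwise order matrix per concept from a d x n score matrix X.

    The result has shape (d, n, n) with T[k, i, j] = sign(X[k, i] - X[k, j]);
    each n x n slice is skew-symmetric and invariant to the scale or any
    monotone distortion of the raw detection scores.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, :, None] - X[:, None, :]      # shape (d, n, n)
    return np.sign(diff)
```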
Assume that there is a ground truth ranking score vector denoted as s. The corresponding pairwise order matrix $\hat{T}$ is defined by

$$\hat{T} = s e^{\top} - e s^{\top} \tag{3}$$

where $e = [1, 1, \cdots]^{\top}$. The next proposition shows that the rank of $\hat{T}$ is exactly 2.

Proposition 1. The rank of $\hat{T}$ is exactly 2 when the scores in s are not all equal to a constant value.
Proof. It is easy to confirm that $\mathrm{rank}(\hat{T}) \le 2$ from Eq. (3). If there is some s such that $\hat{T}$ has rank 1, then $\hat{T}_i = c_i \hat{T}_1$ for some constants $c_i$. Since $\hat{T} = -\hat{T}^{\top}$ and $\hat{T}_{i,i} = 0$, we have $\hat{T} = 0$, which contradicts the assumption that the scores in s are not all equal. Hence the rank is exactly 2.
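As a quick sanity check of Proposition 1, the following short snippet (ours, plain numpy) verifies numerically that $\hat{T} = s e^{\top} - e s^{\top}$ is skew-symmetric and has rank 2 for a random, non-constant score vector s.

```python
import numpy as np

# Numerical check of Proposition 1: for a non-constant score vector s,
# T_hat = s e^T - e s^T is skew-symmetric and has rank exactly 2.
rng = np.random.default_rng(0)
s = rng.random(6)                          # random, hence non-constant, scores
e = np.ones_like(s)
T_hat = np.outer(s, e) - np.outer(e, s)
assert np.allclose(T_hat, -T_hat.T)        # skew-symmetric
assert np.linalg.matrix_rank(T_hat) == 2   # rank is exactly 2
```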
Proposition 1 entails that $\mathrm{rank}(\hat{T}) = 2$ is a restricted convex relaxation with respect to directly optimizing s (Yuan, Li, and Zhang 2014). This suggests that we may use rank-2 hard iterative singular value thresholding, as we will explain shortly in more detail.

Based on the above discussion, we now turn to studying how to estimate $\hat{T}$ from a set of pairwise order matrices $T^{(k)}, k = 1, 2, \cdots, d$, under rank-2 and skew-symmetric constraints.
3.4 Recovering the Pairwise Order Matrix
Assume that we have a set of d pairwise order matrices $T^{(1)}, T^{(2)}, \cdots, T^{(d)}$. Because the detection scores are noisy, the order constraints of the $T^{(k)}$ may contradict each other. We need a robust approach to recover the matrix $\hat{T}$ by adaptively fusing the $T^{(k)}$. To this end, each time $\hat{T}_{i,j}$ violates the given order $T^{(k)}_{i,j}$, we penalize it with a loss function $\ell(\hat{T}_{i,j}, T^{(k)}_{i,j})$. To maximize the margin, we use the hinge loss in this paper (although theoretically any other loss function is applicable):
$$\ell(\hat{T}, T^{(k)}) \triangleq \sum_{i,j=1}^{n} \left[1 - T^{(k)}_{i,j} \hat{T}_{i,j}\right]_{+} \tag{4}$$

where $[z]_+ = \max(z, 0)$. Then we have the following optimization problem to recover $\hat{T}$:
$$\min_{\hat{T}}\ \ell(\hat{T}) \triangleq \sum_{k=1}^{d} \ell(\hat{T}, T^{(k)}) + \lambda \|\hat{T}\|^2 \tag{5}$$
$$\text{s.t.}\quad \mathrm{rank}(\hat{T}) = 2 \tag{6}$$
$$\qquad\;\; \hat{T} = -\hat{T}^{\top}. \tag{7}$$

Algorithm 1 Hard Iterative Singular Value Thresholding
1: Input: $T^{(k)}$, step size $\eta$.
2: $\hat{T}_0 = 0$
3: for $t = 1, 2, \cdots, L$ do
4:   $\hat{G}_t = \hat{T}_{t-1} - \eta \nabla_{\hat{T}}\, \ell(\hat{T})$
5:   $G_t = \frac{1}{2}(\hat{G}_t - \hat{G}_t^{\top})$
6:   SVD: $G_t = \sum_i \lambda_i U_i V_i^{\top}$
7:   rank-2 thresholding: $\hat{T}_t = \lambda_1 U_1 V_1^{\top} + \lambda_2 U_2 V_2^{\top}$
8: end for
9: Output: $\hat{T} = \hat{T}_L$
The above optimization can be effectively solved with hard iterative singular value thresholding, as depicted in Algorithm 1. It is not difficult to show that Algorithm 1 converges geometrically to the global optimum because the optimization problem is restricted convex.

In Algorithm 1, we first carry out a gradient descent step with respect to $\hat{T}$ with step size $\eta$. In line 5, we project the intermediate solution onto the skew-symmetric subspace. In lines 6 and 7, we greedily threshold the intermediate solution with the rank-2 constraint. It is important to note that line 5 must come before lines 6 and 7, because a skew-symmetric matrix is still skew-symmetric after rank-2 singular value thresholding, but the reverse does not hold.
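For concreteness, here is a small numpy sketch of Algorithm 1 applied to the hinge loss of Eq. (4) and the regularized objective of Eq. (5). The step size, regularization weight, and iteration count are placeholder values, and the subgradient computation is our own reading of the objective, not the authors' reference implementation.

```python
import numpy as np

def hard_isvt(T_list, step=0.01, lam=0.1, n_iter=100):
    """Recover a shared rank-2, skew-symmetric order matrix from noisy
    per-concept pairwise order matrices (sketch of Algorithm 1)."""
    T_list = [np.asarray(T, dtype=float) for T in T_list]
    n = T_list[0].shape[0]
    T_hat = np.zeros((n, n))
    for _ in range(n_iter):
        # Subgradient of  sum_k sum_ij [1 - T^(k)_ij * T_hat_ij]_+  +  lam * ||T_hat||^2
        grad = 2.0 * lam * T_hat
        for T_k in T_list:
            violated = (1.0 - T_k * T_hat) > 0            # margin violations
            grad -= T_k * violated
        G = T_hat - step * grad                           # line 4: gradient descent step
        G = 0.5 * (G - G.T)                               # line 5: skew-symmetric projection
        U, sing, Vt = np.linalg.svd(G)                    # line 6: SVD
        T_hat = (U[:, :2] * sing[:2]) @ Vt[:2, :]         # line 7: keep the top-2 singular values
    return T_hat
```

Because numpy returns the singular values in descending order, keeping the first two columns of U and rows of Vt implements the rank-2 thresholding of line 7.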
After obtaining the optimized matrix $\hat{T}$, we seek to recover the final ranking score vector $\hat{s}$. Based on the rank-2 assumption mentioned above, we expect that $\hat{T}$ is generated from $\hat{s}$ as $\hat{T} \approx \hat{s} e^{\top} - e \hat{s}^{\top}$. The authors of (Jiang et al. 2011) have shown that using $(1/m)\,\hat{T} e$ as the recovered s provides the best least-squares approximation, which can be formally described as follows:

$$(1/m)\,\hat{T} e = \arg\min_{\hat{s}} \left\|\hat{T} - (\hat{s} e^{\top} - e \hat{s}^{\top})\right\|. \tag{8}$$

Therefore, we can treat $(1/m)\,\hat{T} e$ as the recovered $\hat{s}$, giving us our final retrieval results.
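Continuing the sketch above, the final scores follow directly from Eq. (8). The helper `recover_scores` is our own name, and the commented usage assumes the functions from the earlier sketches.

```python
import numpy as np

def recover_scores(T_hat):
    """Least-squares ranking scores from the fused order matrix: s_hat = (1/m) * T_hat @ e."""
    m = T_hat.shape[0]
    return T_hat @ np.ones(m) / m

# Putting the pieces together (using the sketches above):
#   T_list  = pairwise_order_matrices(X)        # one n x n order matrix per selected concept
#   s_hat   = recover_scores(hard_isvt(T_list))
#   ranking = np.argsort(-s_hat)                # video indices, highest score first
```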
3.5 Out-of-Sample Extension
We can additionally deal with the case of new out-of-sample test videos. Given a new test video $x_{m+1}$, we first represent it semantically as an n-dimensional vector. For each dimension, we find its nearest neighbour among the existing test data $X = \{x_1, x_2, \ldots, x_m\}$. Let $x_i$ denote the nearest example for the $i$-th semantic concept, and $w_i$ denote the feature similarity based on the $i$-th semantic feature type. Then, the ranking score of $x_{m+1}$ can be computed as

$$\hat{s}(x_{m+1}) = \sum_{i=1}^{n} \frac{w(t^{i}_{m+1}, x_i)}{\sum_{i=1}^{n} w(t^{i}_{m+1}, x_i)}\, \hat{s}(x_i),$$

where $\hat{s}(x_i)$ is the aggregated score for sample $x_i$.
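A rough sketch of this nearest-neighbour extension is given below. The concrete similarity measure w(·,·) is not spelled out in the text, so the inverse-distance weight used here is our own assumption; the function name is also ours.

```python
import numpy as np

def out_of_sample_score(x_new, X, s_hat):
    """Score a new video without re-running the order-matrix recovery.

    x_new : concept detection scores for the new video (length d)
    X     : d x m matrix of concept scores for the existing test videos
    s_hat : length-m vector of aggregated ranking scores recovered above
    """
    d, m = X.shape
    weights, neighbour_scores = [], []
    for i in range(d):
        j = int(np.argmin(np.abs(X[i] - x_new[i])))     # nearest neighbour on concept i
        w = 1.0 / (1.0 + abs(X[i, j] - x_new[i]))       # assumed similarity measure
        weights.append(w)
        neighbour_scores.append(s_hat[j])
    weights = np.asarray(weights)
    return float(np.dot(weights, neighbour_scores) / weights.sum())
```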
4 Experiments
4.1 Experimental Setup
Dataset and Metrics. We conduct experiments on the challenging TRECVID Multimedia Event Detection datasets from 2013 (MED13test) and 2014 (MED14test). Each includes 25,000 test videos (over 960 hours of video) with per-video ground truth annotations for 20 event categories, all officially provided by NIST. Each category has a textual description in the form of an event name, definition, explication, and related evidence types. Since we focus on zero-shot event detection, the experiments are conducted without using any examples. To evaluate the results, we apply the official metrics: average precision (AP) per event, and mean average precision (mAP) obtained by averaging over all 20 events.
Image-based Concepts. We obtain 1000 image-based concept detectors using a deep convolutional neural network (CNN) (Krizhevsky, Sutskever, and Hinton 2012). We use the VGG19 architecture (Simonyan and Zisserman 2015), as implemented in the Caffe toolbox (Jia 2013). The network is trained on the ImageNet ILSVRC-2014 dataset (Deng et al. 2009), which includes 1.2M training images categorized into 1000 classes.
Video-based Concepts. We also obtain video-based concepts from four publicly available datasets: UCF101 (Soomro, Zamir, and Shah 2012), FCVID (Jiang et al. 2015b), Google Sports1M (Karpathy et al. 2014), and ActivityNet (Heilbron et al.). They contain 101 action categories, 239 action categories, 487 sports categories, and 203 activity categories, respectively. We extract improved dense trajectory features (Wang and Schmid 2013) from the videos and aggregate the local features into video-level feature vectors using Fisher vectors (Oneata et al. 2013). We train linear SVM classifiers and employ 5-fold cross validation to select the parameters.
Held-Out Data. In order to obtain the discriminative power scores, we test on the UCF101 dataset (crcv.ucf.edu/data/).
4.2 Experimental Results
Comparison with Previous Work. In Table 1, we compare our approach with other recent state-of-the-art systems, specifically the Bi-Concept approach (Habibian, Mensink, and Snoek 2014), EventNet (Ye et al.), the weak concepts approach (Wu et al. 2014), Selecting (Singh et al. 2015), and SPaR (Jiang et al. 2014). The first three of these only rely on concept aggregation, while Selecting and SPaR combine concept aggregation and re-ranking strategies. We report results on MED13test, as this allows us to directly quote the values given in the original papers, for fairness. The results are comparable, as we use the same data split. To better analyse our approach, we also implement a traditional attribute-based retrieval approach ("Basic") (Gan et al. 2015a). However, for a fair comparison, we use our concept features, as these are stronger, as shown later on.
