Deep Local Video Feature for Action Recognition

Zhenzhong Lan¹   Yi Zhu²   Alexander G. Hauptmann¹   Shawn Newsam²
¹Carnegie Mellon University   ²University of California, Merced
{lanzhzh,alex}@cs.cmu.edu   {yzhu25,snewsam}@ucmerced.edu
Abstract
We investigate the problem of representing an entire
video using CNN features for human action recognition.
End-to-end learning of CNN/RNNs is currently not possible
for whole videos due to GPU memory limitations and so a
common practice is to use sampled frames as inputs along
with the video labels as supervision. However, the global
video labels might not be suitable for all of the temporally
local samples as the videos often contain content besides
the action of interest. We therefore propose to instead treat
the deep networks trained on local inputs as local feature
extractors. The local features are then aggregated to form
global features which are used to assign video-level labels
through a second classification stage. This framework is
more robust to the noisy local labels that result from propa-
gating video-level labels. We investigate a number of design
choices for this local feature approach such as the optimal
sampling and aggregation methods. Experimental results
on the HMDB51 and UCF101 datasets show that a simple
maximum pooling on the sparsely sampled local features
leads to significant performance improvement.
1. Introduction
Despite much effort and progress, deep convolutional
neural networks (CNNs) and recurrent neural networks
(RNNs) have yet to achieve the same success on video clas-
sification as they have on image classification. This can in
large part be attributed to the following two differences between images and videos, differences which are key to deep-learning based approaches. First, videos are much larger in
size and thus it becomes memory prohibitive to train and ap-
ply CNNs/RNNs at the video level. Second, it is very diffi-
cult to construct the large labeled video datasets required for
training deep networks. Recent approaches [14, 22, 23] circumvent these problems by learning on sampled frames or very short video clips (temporally local inputs¹) with video-level (global) labels.

¹ Local from here on will mean temporally local.
However, video-level label information can be incom-
plete or even missing at frame/clip-level. This information
mismatch leads to the problem of false label assignment. In
other words, the imprecise frame/clip-level labels populated
from video labels are too noisy to guide precise mapping
from local video snippets to labels. To deal with this prob-
lem, a common practice is to sample multiple frames/clips
from a video at testing time and aggregate the prediction
scores of these sampled frames/clips to get the final results
for that video. However, simply averaging the prediction scores, without another level of mapping, is not enough to recover the damage caused by false label assignment. The problem becomes even more significant for long, untrimmed, open-domain videos [7, 4].
We instead compensate for the noisy labels by treating
the deep networks trained on local inputs as feature extrac-
tors as shown in Figure 1. Local features extracted using the
pre-trained networks are aggregated into global video-level
features and another mapping function (e.g., a shallow net-
work) is learned using the same dataset to assign video-level
labels.
Our method is therefore related to the fine-tuning prac-
tices that are popular in image classification. The major dif-
ference is that we train our feature extraction networks with
local data and with very noisy labels due to the false label
assignment. We thus rely heavily on the shallow network to
compensate for the suboptimal local feature learning.
Our method is also similar to the common practice of
using networks pre-trained on the ImageNet image classifi-
cation task to extract frame-level (local) features for video
classification [26, 9]. The main difference is that our local
feature extractors (deep networks) are trained on the target
dataset. Therefore, the features extracted from the deep networks are in-domain, so we do not suffer from the domain gap that arises when using ImageNet-trained networks.
We name our new class of local video features Deep lO-
cal Video Feature (DOVF).
In summary, DOVF is a class of local video features that
are extracted from deep neural networks trained on local
video clips using global video labels.
Figure 1. Overview of our proposed framework which consists of two stages. First (Left): A temporal segment network [23] is trained
using local video snippets with the video-level labels. These networks are used as local feature extractors. Second (Right): The local
features are aggregated to form a global feature which is then mapped to the video-level labels. Our results show that this two-stage process
compensates for the noisy snippet labels that result from propagating the video-level labels.
In this paper, we investigate the following design choices related to DOVF:
- From which layer(s) of the CNNs should the local features be extracted? Without further investigation, the only guidance we have is that we should avoid the probability layer as it is likely to severely overfit the noisy training data and thus result in a distribution that varies greatly between the training and test sets.
- What is the best way to aggregate the local features into video-level global features? We consider a number of feature aggregation methods such as mean pooling, max pooling, Fisher Vector (FV) encoding, etc.
- How densely should the local features be extracted? Sparse temporal sampling would be preferred from an efficiency standpoint.
- How complementary is DOVF to traditional local features such as IDT [20]? The more complementary they are, the more opportunity there is for improvement by applying techniques that have been developed for traditional local features.
The remainder of this paper is organized as follows. We
first provide some background on video features with an
emphasis on recent work on learning with deep neural net-
works. We then describe the experimental settings we use
to evaluate our framework on the HMDB51 and UCF101
datasets. We conclude with a discussion including potential
improvements.
2. Related works
New video representations have typically been the main
source of breakthroughs for video classification.
In traditional video representations, trajectory-based approaches [20, 6], especially Dense Trajectories (DT) and their improved form, IDT [19, 20], are the basis of current state-of-the-art hand-crafted algorithms. These trajectory-based methods are designed to address the shortcomings of image-extended video features. Their superior performance validates the need for motion feature representations. Many studies have tried to improve upon IDT due to its success and popularity. Peng et al. [11] enhanced the performance of IDT by increasing the codebook sizes and fusing multiple coding methods. Sapienza et al. [13] explored ways to sub-sample and generate vocabularies for DT features. Hoai and Zisserman [5] achieved superior performance on several action recognition datasets by applying data augmentation, modeling the score distributions over video sub-sequences, and capturing the relationships among action classes. Fernando et al. [3] modeled the evolution of appearance in video and achieved state-of-the-art results on the Hollywood2 dataset. The work in [10] proposed to extract features from videos at multiple playback speeds to achieve speed invariance. However, these traditional, hand-crafted methods have recently started to become overshadowed by the rise of deep learning with neural networks.
Motivated by the success of CNNs, researchers have invested significant effort towards developing CNN equivalents for learning video features. Several accomplishments have been reported from using CNNs for action recognition in videos [27, 25, 16, 29]. Karpathy et al. [8] trained deep
CNNs using one million weakly labeled YouTube videos
and reported moderate success using the network as a fea-
ture extractor. Simonyan and Zisserman [14] demonstrated a result competitive with IDT [20] by training deep CNNs using both sampled frames and stacked optical flow. Tran et al. [15] explored 3D CNNs to simultaneously learn spatiotemporal features without pre-computing optical flow. This allows them to achieve competitive performance at much faster rates. Wang et al. [21, 22, 23] provide insight-
ful analyses on improving two-stream frameworks such as
pre-training two-stream CNNs, using smaller learning rates,
using deeper networks, etc. These improvements result in a
CNN-based approach that finally outperforms IDT [20] by
a large margin on the UCF101 dataset. These approaches,
however, all rely on shot-clip predictions to determine the
final video labels and do not use global features.
Two concurrent works [1, 12] on global features for action recognition have recently been posted on arXiv. Both propose new feature aggregation methods to pool the local neural network features to form global video features. Diba et al. [1] propose a bilinear model to pool the outputs of the last convolutional layers of pre-trained networks and achieve state-of-the-art results on both the HMDB51 and UCF101 datasets. Qiu et al. [12] propose a new quantization method similar to FV and achieve comparable performance to [1]. However, neither work provides detailed analysis of the local neural network features that are used. In this paper, we perform an extensive analysis and show that a simple max pooling can achieve similar or better results compared to much more complex feature aggregation methods such as those in [1, 12].
3. Methodology
In this section, we first review temporal segment networks [23], the architecture upon which our approach is
built. We next describe our Deep lOcal Video Features
(DOVF), methods for aggregating them to form global fea-
tures, and the mapping of the global features to video-level
labels. Finally, we provide our experimental settings.
3.1. Temporal Segment Networks
With the goal of capturing long-range temporal structure
for improved action recognition, Wang et al. propose tem-
poral segment networks (TSN) [23] with a sparse sampling
strategy. This allows an entire video to be analyzed with
reasonable computational costs. TSN first divides a video
evenly into three segments and one short snippet is ran-
domly selected from each segment. Two-stream networks
are then applied to the short snippets to obtain the initial
action class prediction scores. TSN finally uses a segmen-
tal consensus function to combine the outputs from multiple
short snippets to predict the action class probabilities for the
video as a whole.
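As an illustrative sketch of this sparse sampling (this is not the released TSN code; the segment count of three and the use of frame indices are assumptions), one random snippet index can be drawn from each of N equally sized temporal segments as follows:

```python
import numpy as np

def sample_snippet_indices(num_frames, num_segments=3, rng=None):
    """Draw one random frame index from each of `num_segments`
    equally sized temporal segments (TSN-style sparse sampling)."""
    rng = np.random.default_rng() if rng is None else rng
    # Segment boundaries split [0, num_frames) into contiguous chunks.
    bounds = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    return [int(rng.integers(lo, hi)) for lo, hi in zip(bounds[:-1], bounds[1:])]

# Example: one snippet index per segment for a 150-frame video.
print(sample_snippet_indices(150))   # e.g. [17, 63, 112]
```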
Wang et al. [23] show TSN achieves state-of-the-
art results on the popular action recognition benchmarks
UCF101 and HMDB51. These results demonstrate the im-
portance of capturing long-range temporal information for
video analysis. However, the training of the local snippet
classifiers is performed using the video-level labels. As
noted earlier, these are likely to be noisy labels and will
thus limit the accuracy of the snippet-level classification.
We therefore propose instead to use the snippet-level anal-
ysis for local feature extraction and add a second stage
which maps the aggregated features to the video-level la-
bels. The combination of DOVF and a second classification
stage compensates for the suboptimal snippet-level classifi-
cation that results from the noisy training dataset.
3.2. Deep local video feature (DOVF)
Instead of performing action recognition in a single step
like [23, 1], our framework consists of two stages. In the first stage, deep networks (e.g., TSN) that have been trained with video-level labels to perform snippet-level classifica-
tion are used as local feature extractors. In the second stage,
the local features are aggregated to form global features
and another classifier which has also been trained using the
video-level labels performs the video-level classification.
The training of our classification framework proceeds as follows, where each video $V$ in the training set has ground-truth action label $p$. In the first stage, $V$ is evenly divided into $N$ segments, $v_1, v_2, \cdots, v_N$, and one short snippet is randomly selected from each segment, $s_1, s_2, \cdots, s_N$. These snippets are assigned the video-level labels, and the snippets from all the training videos are used to train two-stream CNNs (a single RGB video frame and a stack of consecutive optical flow images). The details on training the two-stream CNNs can be found in [22, 23]. Once trained, the network is used to extract local features, $f_1, f_2, \cdots, f_N$, from a video.

In the second stage, the local features are aggregated into a global feature $f_G$,

$$f_G = G(f_1, f_2, \cdots, f_N) \tag{1}$$

where $G$ denotes the aggregation function. We explore different aggregation functions in Section 4.2. We then learn a classifier $M$ that maps the global feature $f_G$ to the video label $p$:

$$p = M(f_G). \tag{2}$$

Once trained, the framework can be used to predict the label of a video. Figure 1 contains an overview of the framework.
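As a minimal sketch of this two-stage pipeline (illustrative only, not the implementation used in our experiments), the snippet below assumes a hypothetical extract_feature function standing in for the trained two-stream CNN, uses max pooling for the aggregation function G, and substitutes a linear SVM for the mapping M:

```python
import numpy as np
from sklearn.svm import LinearSVC

def global_feature(snippets, extract_feature, aggregate=np.max):
    """Stage 1 + aggregation: local features f_1..f_N -> global feature f_G."""
    local_feats = np.stack([extract_feature(s) for s in snippets])  # shape (N, d)
    return aggregate(local_feats, axis=0)                           # shape (d,)

def train_second_stage(train_videos, train_labels, extract_feature):
    """Stage 2: learn the mapping M from global features to video-level labels."""
    X = np.stack([global_feature(v, extract_feature) for v in train_videos])
    return LinearSVC(C=100.0).fit(X, train_labels)

def predict_video(clf, snippets, extract_feature):
    """Predict a video-level label from its snippets."""
    return clf.predict(global_feature(snippets, extract_feature)[None, :])[0]
```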

| ID  | VGG16 layer | Dimension | Type | Inception-BN layer | Dimension | Type |
|-----|-------------|-----------|------|--------------------|-----------|------|
| L-1 | fc8         | 101       | FC   | fc-action          | 101       | FC   |
| L-2 | fc7         | 4096      | FC   | global_pool        | 1024      | Conv |
| L-3 | fc6         | 4096      | FC   | inception_5b       | 50176     | Conv |
| L-4 | pool5       | 25088     | Conv | inception_5a       | 50176     | Conv |
| L-5 | conv5_3     | 100352    | Conv | inception_4e       | 51744     | Conv |

Table 1. Names, dimensions and types of the layers that we consider in the VGG16 and Inception-BN networks for local feature extraction.
| Layer    | Spatial VGG-16 | Spatial Inception-BN | Temporal VGG-16 | Temporal Inception-BN | Two-stream VGG-16 | Two-stream Inception-BN |
|----------|----------------|----------------------|-----------------|-----------------------|-------------------|-------------------------|
| L-1      | 77.8           | 83.9                 | 82.6            | 83.7                  | 89.6              | 91.7                    |
| L-2      | 79.5           | 88.3                 | 85.1            | 88.8                  | 91.4              | 94.2                    |
| L-3      | 80.1           | 88.3                 | 86.6            | 88.7                  | 91.8              | 93.9                    |
| L-4      | 83.7           | 85.6                 | 86.5            | 85.3                  | 92.4              | 91.4                    |
| L-5      | 83.5           | 83.6                 | 87.0            | 83.6                  | 92.3              | 89.8                    |
| TSN [23] | 79.8           | 85.7                 | 85.7            | 87.9                  | 90.9              | 93.5                    |

Table 2. Layer-wise comparison of VGG-16 and Inception-BN networks on split 1 of UCF101. The values are the overall video-level classification accuracy (%) of our complete framework.
3.3. Experimental settings
We compare two networks, VGG16 and Inception-BN,
for the local feature extraction. (We use networks trained by
Wang et al. [22, 23].) We further compare the outputs from the last five layers of each network as our features. Table 1 shows the layer names of each network and the corresponding feature dimensions. We divide the layers into two cate-
gories: fully-connected (FC) layers and convolution (Conv)
layers (pooling layers are treated as Conv layers). FC layers
have significantly more parameters and are thus more likely
to overfit the training data than the Conv layers. As shown,
VGG16 has three FC layers while Inception-BN only has
one.
Following the scheme of [14, 23], we evenly sample 25 frames and flow clips for each video. For each frame/clip, we perform 10x data augmentation by cropping the 4 corners and the center along with horizontal flipping. A single feature is computed for each frame/clip by averaging over the augmented data. This results in a set of 25 local features for each video. The dimensions of the local features extracted from different network/layer combinations are shown in Table 1.
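The 10x augmentation can be sketched as follows (illustrative only; the 224x224 crop size and the hypothetical extract_feature function standing in for the spatial or temporal CNN are assumptions):

```python
import numpy as np

def ten_crops(frame, crop=224):
    """Return the 4 corner crops, the center crop, and horizontal flips of all five."""
    h, w = frame.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    crops = [frame[t:t + crop, l:l + crop] for t, l in offsets]
    return crops + [c[:, ::-1] for c in crops]   # append horizontal flips

def augmented_local_feature(frame, extract_feature, crop=224):
    """Average the CNN feature over the 10 augmented crops to get one local feature."""
    return np.mean([extract_feature(c) for c in ten_crops(frame, crop)], axis=0)
```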
We compare a number of local feature aggregation methods ranging from simple mean and maximum pooling to more complex feature encoding methods such as Bag of Words (BoW), Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vector (FV) encoding. In order to incorporate global temporal information, we divide each video into three parts and perform the aggregation separately. That is, the first eight, middle nine and final eight of the 25 local features are separately aggregated and then concatenated to form the final global feature. This increases the final feature dimension by a factor of three. After concatenation, we perform square root normalization and L2 normalization as in [10] on the global feature.
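A minimal sketch of this three-part aggregation and normalization (our illustration, assuming max pooling as the per-part aggregator and the 8/9/8 split of the 25 local features described above):

```python
import numpy as np

def aggregate_global_feature(local_feats, splits=(8, 9, 8), pool=np.max):
    """Pool each temporal part separately, concatenate, then apply
    square-root and L2 normalization to obtain the global feature."""
    assert local_feats.shape[0] == sum(splits)          # e.g. 25 x d
    parts, start = [], 0
    for n in splits:
        parts.append(pool(local_feats[start:start + n], axis=0))
        start += n
    f_g = np.concatenate(parts)                         # dimension 3 * d
    f_g = np.sign(f_g) * np.sqrt(np.abs(f_g))           # square-root (power) normalization
    return f_g / (np.linalg.norm(f_g) + 1e-12)          # L2 normalization
```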
We use support vector machines (SVMs) to map (clas-
sify) the global features to video-level labels. We use a chi-
square kernel and C = 100 as in [10] except for the FV and VLAD aggregated features where we use a linear kernel as suggested in [18]. Note that while we use SVMs to predict
the video action labels, other mappings/classifiers, such as
a shallow neural network, could also be used.
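For reference, a chi-square kernel SVM with C = 100 can be set up with scikit-learn's precomputed-kernel interface; this is a sketch, and the kernel bandwidth gamma below is our own assumption since it is not specified above:

```python
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_chi2_svm(X_train, y_train, gamma=1.0, C=100.0):
    """Chi-square kernel SVM on non-negative global features."""
    K_train = chi2_kernel(X_train, gamma=gamma)
    return SVC(kernel="precomputed", C=C).fit(K_train, y_train)

def predict_chi2_svm(clf, X_train, X_test, gamma=1.0):
    # The test kernel is computed against the training features.
    return clf.predict(chi2_kernel(X_test, X_train, gamma=gamma))
```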
The spatial-net and temporal-net prediction scores of the
two-stream network are fused with weights 1 and 1.5, re-
spectively, as in [23].
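The weighted late fusion then amounts to the following (a sketch; spatial_scores and temporal_scores are assumed to be per-class score arrays for one video):

```python
import numpy as np

def fuse_two_stream(spatial_scores, temporal_scores, w_spatial=1.0, w_temporal=1.5):
    """Weighted late fusion of the spatial and temporal prediction scores."""
    fused = w_spatial * np.asarray(spatial_scores) + w_temporal * np.asarray(temporal_scores)
    return int(np.argmax(fused))   # predicted action class index
```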
4. Evaluation
In this section, we experimentally explore the design
choices posed in the Introduction using the UCF101 and
HMDB51 datasets.
UCF101 is composed of realistic action videos from
YouTube. It contains 13,320 video clips distributed among 101 action classes. HMDB51 includes 6,766 video clips of 51 actions extracted from a wide range of sources, such as online videos and movies. UCF101 and HMDB51 both have a standard three-split evaluation protocol. We report
the average recognition accuracies over the three splits.
Our default configuration uses the outputs of the global_pool layer in the Inception-BN network as the local features due to this layer's low dimensionality (1,024 per snippet, or 3,072 dimensions after the three-part concatenation) and its encoding of global spatial information. It also uses maximum pooling to aggregate the local features to form the global features.

| Aggregation | Spatial HMDB51 | Spatial UCF101 | Temporal HMDB51 | Temporal UCF101 | Two-stream HMDB51 | Two-stream UCF101 |
|-------------|----------------|----------------|-----------------|-----------------|-------------------|-------------------|
| Mean        | 56.0           | 87.5           | 63.7            | 88.3            | 71.1              | 93.8              |
| Mean_Std    | 58.1           | 88.1           | 65.2            | 88.5            | 72.0              | 94.2              |
| Max         | 57.7           | 88.3           | 64.8            | 88.8            | 72.5              | 94.2              |
| BoW         | 36.9           | 71.9           | 47.9            | 80.0            | 53.4              | 85.3              |
| FV          | 39.1           | 69.8           | 55.6            | 81.3            | 58.5              | 83.8              |
| VLAD        | 45.3           | 77.3           | 57.4            | 84.7            | 64.7              | 89.2              |

Table 3. Comparison of different local feature aggregation methods on split 1 of UCF101 and HMDB51. Values are classification accuracy (%).
| # of samples | Spatial HMDB51 | Spatial UCF101 | Temporal HMDB51 | Temporal UCF101 | Two-stream HMDB51 | Two-stream UCF101 |
|--------------|----------------|----------------|-----------------|-----------------|-------------------|-------------------|
| 3            | 52.5           | 85.6           | 54.9            | 82.4            | 64.6              | 91.6              |
| 9            | 56.1           | 87.4           | 62.2            | 87.7            | 70.9              | 93.5              |
| 15           | 56.9           | 88.2           | 64.4            | 88.5            | 72.3              | 93.8              |
| 21           | 57.1           | 88.1           | 64.8            | 88.6            | 71.8              | 94.1              |
| 25           | 57.7           | 88.3           | 64.8            | 88.8            | 72.5              | 94.2              |
| Max          | 57.6           | 88.4           | 65.3            | 88.9            | 72.4              | 94.3              |

Table 4. Number of samples per video versus accuracy (%) on split 1 of UCF101 and HMDB51.
4.1. From which layer(s) should the local features be extracted?
We conduct experiments using both VGG16 and
Inception-BN to explore which layers are optimal for ex-
tracting the local features. The video-level action classifica-
tion accuracies on split 1 of UCF101 using different layers
are shown in Table 2.
Layer L-2 from Inception-BN and layer L-4 from VGG16 give the best performance. These are the final convolution layers in each network, which suggests the following three reasons for their superior performance. First, the convolution layers have far fewer parameters compared to the fully connected layers and thus are less likely to overfit the training data, which suffers from the false label assignment problem. Second, the fully connected layers do not preserve spatial information while the convolution layers do. Third, the later convolution layers encode more global (spatial) information than the earlier ones. We conclude that extracting the local features from the final convolution layers is the optimal choice. We believe this finding helps explain why several recent works [26, 1, 12] also choose the output of the final convolution layer for further processing.
Compared to the results of Wang et al. [23], from which we get the pre-trained networks, we can see that our approach does improve the performance of both the spatial and temporal networks. However, the improvements for the spatial networks are much larger. This may be because, when training the local feature extractors, the inputs to the spatial network are single frames while the inputs to the temporal network are clips of 10 stacked optical flow frames. Smaller inputs lead to a greater chance of false label assignment and hence a larger performance gap compared to our global feature approach.
Previous work [27, 9, 26] on using local features from networks pre-trained on the ImageNet dataset shows that
combining features from multiple layers can improve the
overall performance significantly. We investigated combin-
ing features from multiple layers but found no improve-
ment. This difference shows that fine-tuning brings some
new characteristics to the local features.
In the remaining experiments, we use the output of
the global_pool layer from the Inception-BN network as it
achieves the best performance.
4.2. What is the optimal aggregation strategy?
We consider six aggregation methods on split 1 of the UCF101 and HMDB51 datasets.
Given n local features, each of which has a dimension of d, the six different aggregation methods are as follows (a short sketch of the simplest of these follows the list):

- Mean computes the mean value of the n local features along each dimension.
- Max selects the maximum value along each dimension.
- Mean_Std, inspired by Fisher Vector encoding, computes the mean and standard deviation along each dimension.
- BoW quantizes each of the n local features as one of k codewords using a codebook generated through k-means clustering.
- VLAD is similar to BoW but encodes the distance between each of the n local features and the assigned codewords.
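For concreteness, the three simplest aggregators above can be written as follows (our sketch; F is an n x d matrix of local features, and BoW, VLAD and FV are omitted since they additionally require a learned codebook):

```python
import numpy as np

def agg_mean(F):
    """Mean: per-dimension average of the n local features."""
    return F.mean(axis=0)

def agg_max(F):
    """Max: per-dimension maximum of the n local features."""
    return F.max(axis=0)

def agg_mean_std(F):
    """Mean_Std: concatenated per-dimension mean and standard deviation (2d dims)."""
    return np.concatenate([F.mean(axis=0), F.std(axis=0)])
```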

References
[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[14] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[20] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[23] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.