Deep Local Video Feature for Action Recognition

Zhenzhong Lan¹   Yi Zhu²   Alexander G. Hauptmann¹   Shawn Newsam²
¹Carnegie Mellon University   ²University of California, Merced
{lanzhzh,alex}@cs.cmu.edu   {yzhu25,snewsam}@ucmerced.edu
Abstract
We investigate the problem of representing an entire
video using CNN features for human action recognition.
End-to-end learning of CNN/RNNs is currently not possible
for whole videos due to GPU memory limitations and so a
common practice is to use sampled frames as inputs along
with the video labels as supervision. However, the global
video labels might not be suitable for all of the temporally
local samples as the videos often contain content besides
the action of interest. We therefore propose to instead treat
the deep networks trained on local inputs as local feature
extractors. The local features are then aggregated to form
global features which are used to assign video-level labels
through a second classification stage. This framework is
more robust to the noisy local labels that result from propa-
gating video-level labels. We investigate a number of design
choices for this local feature approach such as the optimal
sampling and aggregation methods. Experimental results
on the HMDB51 and UCF101 datasets show that a simple
maximum pooling on the sparsely sampled local features
leads to significant performance improvement.
1. Introduction
Despite much effort and progress, deep convolutional
neural networks (CNNs) and recurrent neural networks
(RNNs) have yet to achieve the same success on video clas-
sification as they have on image classification. This can in
large part be attributed to the following two differences between images and videos, differences which are key to deep-learning based approaches. First, videos are much larger in
size and thus it becomes memory prohibitive to train and ap-
ply CNNs/RNNs at the video level. Second, it is very diffi-
cult to construct the large labeled video datasets required for
training deep networks. Recent approaches [14, 22, 23] circumvent these problems by learning on sampled frames or very short video clips (temporally local inputs¹) with video-level (global) labels.

¹ Local from here on will mean temporally local.
However, video-level label information can be incom-
plete or even missing at frame/clip-level. This information
mismatch leads to the problem of false label assignment. In
other words, the imprecise frame/clip-level labels populated
from video labels are too noisy to guide precise mapping
from local video snippets to labels. To deal with this prob-
lem, a common practice is to sample multiple frames/clips
from a video at testing time and aggregate the prediction
scores of these sampled frames/clips to get the final results
for that video. However, simply averaging the prediction scores, without another level of mapping, is not enough to recover the damage caused by false label assignment. The problem becomes even more significant for long, untrimmed, open-domain videos [7, 4].
We instead compensate for the noisy labels by treating
the deep networks trained on local inputs as feature extrac-
tors as shown in Figure 1. Local features extracted using the
pre-trained networks are aggregated into global video-level
features and another mapping function (e.g., a shallow net-
work) is learned using the same dataset to assign video-level
labels.
Our method is therefore related to the fine-tuning prac-
tices that are popular in image classification. The major dif-
ference is that we train our feature extraction networks with
local data and with very noisy labels due to the false label
assignment. We thus rely heavily on the shallow network to
compensate for the suboptimal local feature learning.
Our method is also similar to the common practice of
using networks pre-trained on the ImageNet image classifi-
cation task to extract frame-level (local) features for video
classification [26, 9]. The main difference is that our local
feature extractors (deep networks) are trained on the target
dataset. Therefore, the features extracted from the deep networks are in-domain, so we do not suffer from the domain gap that arises when using ImageNet-trained networks.
We name our new class of local video features Deep lO-
cal Video Feature (DOVF).
In summary, DOVF is a class of local video features that
are extracted from deep neural networks trained on local
video clips using global video labels.
Figure 1. Overview of our proposed framework which consists of two stages. First (Left): A temporal segment network [23] is trained
using local video snippets with the video-level labels. These networks are used as local feature extractors. Second (Right): The local
features are aggregated to form a global feature which is then mapped to the video-level labels. Our results show that this two-stage process
compensates for the noisy snippet labels that result from propagating the video-level labels.
In this paper, we investigate the following design choices related to DOVF:
- From which layer(s) of the CNNs should the local features be extracted? Without further investigation, the only guidance we have is that we should avoid the probability layer as it is likely to severely overfit the noisy training data and thus result in a distribution that varies greatly between the training and test sets.
- What is the best way to aggregate the local features into video-level global features? We consider a number of feature aggregation methods such as mean pooling, max pooling, Fisher Vector (FV) encoding, etc.
- How densely should the local features be extracted? Sparse temporal sampling would be preferred from an efficiency standpoint.
- How complementary is DOVF to traditional local features such as IDT [20]? The more complementary they are, the more opportunity there is for improvement by applying techniques that have been developed for traditional local features.
The remainder of this paper is organized as follows. We
first provide some background on video features with an
emphasis on recent work on learning with deep neural net-
works. We then describe the experimental settings we use
to evaluate our framework on the HMDB51 and UCF101
datasets. We conclude with a discussion including potential
improvements.
2. Related works
New video representations have typically been the main
source of breakthroughs for video classification.
In traditional video representations, trajectory-based approaches [20, 6], especially Dense Trajectories (DT) and their improved form, IDT [19, 20], are the basis of current state-of-the-art hand-crafted algorithms. These trajectory-based methods are designed to address the shortcomings of image-extended video features. Their superior performance validates the need for motion feature representations. Many studies have tried to improve upon IDT due to its success and popularity. Peng et al. [11] enhanced the performance of IDT by increasing the codebook sizes and fusing multiple coding methods. Sapienza et al. [13] explored ways to sub-sample and generate vocabularies for DT features. Hoai and Zisserman [5] achieved superior performance on several action recognition datasets by applying data augmentation, modeling the score distributions over video sub-sequences, and capturing the relationships among action classes. Fernando et al. [3] modeled the evolution of appearance in video and achieved state-of-the-art results on the Hollywood2 dataset. The work in [10] proposed to extract features from videos at multiple playback speeds to achieve speed invariance. However, these traditional, hand-crafted methods have recently started to become overshadowed by the rise of deep learning with neural networks.
Motivated by the success of CNNs, researchers have invested significant effort towards developing CNN equivalents for learning video features. Several accomplishments have been reported from using CNNs for action recognition in videos [27, 25, 16, 29]. Karpathy et al. [8] trained deep
CNNs using one million weakly labeled YouTube videos
and reported moderate success using the network as a fea-
ture extractor. Simonyan and Zisserman [14] demonstrated a result competitive with IDT [20] by training deep CNNs using both sampled frames and stacked optical flow. Tran et al. [15] explored 3D CNNs to simultaneously learn spatiotemporal features without pre-computing optical flow. This allows them to achieve competitive performance at much faster rates. Wang et al. [21, 22, 23] provide insight-
ful analyses on improving two-stream frameworks such as
pre-training two-stream CNNs, using smaller learning rates,
using deeper networks, etc. These improvements result in a
CNN-based approach that finally outperforms IDT [20] by
a large margin on the UCF101 dataset. These approaches,
however, all rely on shot-clip predictions to determine the
final video labels and do not use global features.
Two concurrent works [1, 12] on global features for action recognition have recently been posted on arXiv. Both propose new feature aggregation methods to pool the local neural network features to form global video features. Diba et al. [1] propose a bilinear model to pool the outputs of the last convolutional layers of pre-trained networks and achieve state-of-the-art results on both the HMDB51 and UCF101 datasets. Qiu et al. [12] propose a new quantization method similar to FV and achieve comparable performance to [1]. However, neither work provides detailed analysis of the local neural network features that are used. In this paper, we perform an extensive analysis and show that a simple max pooling can achieve similar or better results compared to much more complex feature aggregation methods such as those in [1, 12].
3. Methodology
In this section, we first review temporal segment networks [23], the architecture upon which our approach is
built. We next describe our Deep lOcal Video Features
(DOVF), methods for aggregating them to form global fea-
tures, and the mapping of the global features to video-level
labels. Finally, we provide our experimental settings.
3.1. Temporal Segment Networks
With the goal of capturing long-range temporal structure
for improved action recognition, Wang et al. propose tem-
poral segment networks (TSN) [23] with a sparse sampling
strategy. This allows an entire video to be analyzed with
reasonable computational costs. TSN first divides a video
evenly into three segments and one short snippet is ran-
domly selected from each segment. Two-stream networks
are then applied to the short snippets to obtain the initial
action class prediction scores. TSN finally uses a segmen-
tal consensus function to combine the outputs from multiple
short snippets to predict the action class probabilities for the
video as a whole.
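As an illustrative sketch of this sparse sampling (this is not the released TSN code; the segment count of three and the use of frame indices are assumptions), one random snippet index can be drawn from each of N equally sized temporal segments as follows:

```python
import numpy as np

def sample_snippet_indices(num_frames, num_segments=3, rng=None):
    """Draw one random frame index from each of `num_segments`
    equally sized temporal segments (TSN-style sparse sampling)."""
    rng = np.random.default_rng() if rng is None else rng
    # Segment boundaries split [0, num_frames) into contiguous chunks.
    bounds = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    return [int(rng.integers(lo, hi)) for lo, hi in zip(bounds[:-1], bounds[1:])]

# Example: one snippet index per segment for a 150-frame video.
print(sample_snippet_indices(150))   # e.g. [17, 63, 112]
```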
Wang et al. [23] show TSN achieves state-of-the-
art results on the popular action recognition benchmarks
UCF101 and HMDB51. These results demonstrate the im-
portance of capturing long-range temporal information for
video analysis. However, the training of the local snippet
classifiers is performed using the video-level labels. As
noted earlier, these are likely to be noisy labels and will
thus limit the accuracy of the snippet-level classification.
We therefore propose instead to use the snippet-level anal-
ysis for local feature extraction and add a second stage
which maps the aggregated features to the video-level la-
bels. The combination of DOVF and a second classification
stage compensates for the suboptimal snippet-level classifi-
cation that results from the noisy training dataset.
3.2. Deep local video feature (DOVF)
Instead of performing action recognition in a single step
like [23, 1], our framework consists of two stages. In the first stage, deep networks (e.g., TSN) that have been trained with video-level labels to perform snippet-level classifica-
tion are used as local feature extractors. In the second stage,
the local features are aggregated to form global features
and another classifier which has also been trained using the
video-level labels performs the video-level classification.
The training of our classification framework proceeds as follows, where each video $V$ in the training set has ground-truth action label $p$. In the first stage, $V$ is evenly divided into $N$ segments, $v_1, v_2, \cdots, v_N$, and one short snippet is randomly selected from each segment, $s_1, s_2, \cdots, s_N$. These snippets are assigned the video-level labels, and the snippets from all the training videos are used to train two-stream CNNs (a single RGB video frame and a stack of consecutive optical flow images). The details on training the two-stream CNNs can be found in [22, 23]. Once trained, the network is used to extract local features, $f_1, f_2, \cdots, f_N$, from a video.

In the second stage, the local features are aggregated into a global feature $f_G$,

$$f_G = G(f_1, f_2, \cdots, f_N) \tag{1}$$

where $G$ denotes the aggregation function. We explore different aggregation functions in Section 4.2. We then learn a classifier $M$ that maps the global feature $f_G$ to the video label $p$:

$$p = M(f_G). \tag{2}$$

Once trained, the framework can be used to predict the label of a video. Figure 1 contains an overview of the framework.
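As a minimal sketch of this two-stage pipeline (illustrative only, not the implementation used in our experiments), the snippet below assumes a hypothetical extract_feature function standing in for the trained two-stream CNN, uses max pooling for the aggregation function G, and substitutes a linear SVM for the mapping M:

```python
import numpy as np
from sklearn.svm import LinearSVC

def global_feature(snippets, extract_feature, aggregate=np.max):
    """Stage 1 + aggregation: local features f_1..f_N -> global feature f_G."""
    local_feats = np.stack([extract_feature(s) for s in snippets])  # shape (N, d)
    return aggregate(local_feats, axis=0)                           # shape (d,)

def train_second_stage(train_videos, train_labels, extract_feature):
    """Stage 2: learn the mapping M from global features to video-level labels."""
    X = np.stack([global_feature(v, extract_feature) for v in train_videos])
    return LinearSVC(C=100.0).fit(X, train_labels)

def predict_video(clf, snippets, extract_feature):
    """Predict a video-level label from its snippets."""
    return clf.predict(global_feature(snippets, extract_feature)[None, :])[0]
```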

| ID  | VGG16 layer | Dimension | Type | Inception-BN layer | Dimension | Type |
|-----|-------------|-----------|------|--------------------|-----------|------|
| L-1 | fc8         | 101       | FC   | fc-action          | 101       | FC   |
| L-2 | fc7         | 4096      | FC   | global_pool        | 1024      | Conv |
| L-3 | fc6         | 4096      | FC   | inception_5b       | 50176     | Conv |
| L-4 | pool5       | 25088     | Conv | inception_5a       | 50176     | Conv |
| L-5 | conv5_3     | 100352    | Conv | inception_4e       | 51744     | Conv |

Table 1. Names, dimensions and types of the layers that we consider in the VGG16 and Inception-BN networks for local feature extraction.
| Layer    | Spatial VGG-16 | Spatial Inception-BN | Temporal VGG-16 | Temporal Inception-BN | Two-stream VGG-16 | Two-stream Inception-BN |
|----------|----------------|----------------------|-----------------|-----------------------|-------------------|-------------------------|
| L-1      | 77.8           | 83.9                 | 82.6            | 83.7                  | 89.6              | 91.7                    |
| L-2      | 79.5           | 88.3                 | 85.1            | 88.8                  | 91.4              | 94.2                    |
| L-3      | 80.1           | 88.3                 | 86.6            | 88.7                  | 91.8              | 93.9                    |
| L-4      | 83.7           | 85.6                 | 86.5            | 85.3                  | 92.4              | 91.4                    |
| L-5      | 83.5           | 83.6                 | 87.0            | 83.6                  | 92.3              | 89.8                    |
| TSN [23] | 79.8           | 85.7                 | 85.7            | 87.9                  | 90.9              | 93.5                    |

Table 2. Layer-wise comparison of VGG-16 and Inception-BN networks on split 1 of UCF101. The values are the overall video-level classification accuracy (%) of our complete framework.
3.3. Experimental settings
We compare two networks, VGG16 and Inception-BN,
for the local feature extraction. (We use networks trained by
Wang et al. [22, 23].) We further compare the outputs from the last five layers of each network as our features. Table 1 shows the layer names of each network and the corresponding feature dimensions. We divide the layers into two cate-
gories: fully-connected (FC) layers and convolution (Conv)
layers (pooling layers are treated as Conv layers). FC layers
have significantly more parameters and are thus more likely
to overfit the training data than the Conv layers. As shown,
VGG16 has three FC layers while Inception-BN only has
one.
Following the scheme of [14, 23], we evenly sample 25 frames and flow clips for each video. For each frame/clip, we perform 10x data augmentation by cropping the 4 corners and the center along with horizontal flipping. A single feature is computed for each frame/clip by averaging over the augmented data. This results in a set of 25 local features for each video. The dimensions of the local features extracted from different network/layer combinations are shown in Table 1.
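The 10x augmentation can be sketched as follows (illustrative only; the 224x224 crop size and the hypothetical extract_feature function standing in for the spatial or temporal CNN are assumptions):

```python
import numpy as np

def ten_crops(frame, crop=224):
    """Return the 4 corner crops, the center crop, and horizontal flips of all five."""
    h, w = frame.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    crops = [frame[t:t + crop, l:l + crop] for t, l in offsets]
    return crops + [c[:, ::-1] for c in crops]   # append horizontal flips

def augmented_local_feature(frame, extract_feature, crop=224):
    """Average the CNN feature over the 10 augmented crops to get one local feature."""
    return np.mean([extract_feature(c) for c in ten_crops(frame, crop)], axis=0)
```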
We compare a number of local feature aggregation methods ranging from simple mean and maximum pooling to more complex feature encoding methods such as Bag of Words (BoW), Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vector (FV) encoding. In order to incorporate global temporal information, we divide each video into three parts and perform the aggregation separately. That is, the first eight, middle nine and final eight of the 25 local features are separately aggregated and then concatenated to form the final global feature. This increases the final feature dimension by a factor of three. After concatenation, we perform square root normalization and L2 normalization as in [10] on the global feature.
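A minimal sketch of this three-part aggregation and normalization (our illustration, assuming max pooling as the per-part aggregator and the 8/9/8 split of the 25 local features described above):

```python
import numpy as np

def aggregate_global_feature(local_feats, splits=(8, 9, 8), pool=np.max):
    """Pool each temporal part separately, concatenate, then apply
    square-root and L2 normalization to obtain the global feature."""
    assert local_feats.shape[0] == sum(splits)          # e.g. 25 x d
    parts, start = [], 0
    for n in splits:
        parts.append(pool(local_feats[start:start + n], axis=0))
        start += n
    f_g = np.concatenate(parts)                         # dimension 3 * d
    f_g = np.sign(f_g) * np.sqrt(np.abs(f_g))           # square-root (power) normalization
    return f_g / (np.linalg.norm(f_g) + 1e-12)          # L2 normalization
```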
We use support vector machines (SVMs) to map (clas-
sify) the global features to video-level labels. We use a chi-
square kernel and C = 100 as in [10] except for the FV and VLAD aggregated features where we use a linear kernel as suggested in [18]. Note that while we use SVMs to predict
the video action labels, other mappings/classifiers, such as
a shallow neural network, could also be used.
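For reference, a chi-square kernel SVM with C = 100 can be set up with scikit-learn's precomputed-kernel interface; this is a sketch, and the kernel bandwidth gamma below is our own assumption since it is not specified above:

```python
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_chi2_svm(X_train, y_train, gamma=1.0, C=100.0):
    """Chi-square kernel SVM on non-negative global features."""
    K_train = chi2_kernel(X_train, gamma=gamma)
    return SVC(kernel="precomputed", C=C).fit(K_train, y_train)

def predict_chi2_svm(clf, X_train, X_test, gamma=1.0):
    # The test kernel is computed against the training features.
    return clf.predict(chi2_kernel(X_test, X_train, gamma=gamma))
```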
The spatial-net and temporal-net prediction scores of the
two-stream network are fused with weights 1 and 1.5, re-
spectively, as in [23].
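The weighted late fusion then amounts to the following (a sketch; spatial_scores and temporal_scores are assumed to be per-class score arrays for one video):

```python
import numpy as np

def fuse_two_stream(spatial_scores, temporal_scores, w_spatial=1.0, w_temporal=1.5):
    """Weighted late fusion of the spatial and temporal prediction scores."""
    fused = w_spatial * np.asarray(spatial_scores) + w_temporal * np.asarray(temporal_scores)
    return int(np.argmax(fused))   # predicted action class index
```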
4. Evaluation
In this section, we experimentally explore the design
choices posed in the Introduction using the UCF101 and
HMDB51 datasets.
UCF101 is composed of realistic action videos from
YouTube. It contains 13,320 video clips distributed among 101 action classes. HMDB51 includes 6,766 video clips of 51 actions extracted from a wide range of sources, such as online videos and movies. UCF101 and HMDB51 both have a standard three-split evaluation protocol. We report
the average recognition accuracies over the three splits.
Our default configuration uses the outputs of the global_pool layer in the Inception-BN network as the local features due to this layer's low dimensionality (1,024 per snippet, or 3,072 dimensions after the three-part concatenation) and its encoding of global spatial information. It also uses maximum pooling to aggregate the local features to form the global features.

| Aggregation | Spatial HMDB51 | Spatial UCF101 | Temporal HMDB51 | Temporal UCF101 | Two-stream HMDB51 | Two-stream UCF101 |
|-------------|----------------|----------------|-----------------|-----------------|-------------------|-------------------|
| Mean        | 56.0           | 87.5           | 63.7            | 88.3            | 71.1              | 93.8              |
| Mean_Std    | 58.1           | 88.1           | 65.2            | 88.5            | 72.0              | 94.2              |
| Max         | 57.7           | 88.3           | 64.8            | 88.8            | 72.5              | 94.2              |
| BoW         | 36.9           | 71.9           | 47.9            | 80.0            | 53.4              | 85.3              |
| FV          | 39.1           | 69.8           | 55.6            | 81.3            | 58.5              | 83.8              |
| VLAD        | 45.3           | 77.3           | 57.4            | 84.7            | 64.7              | 89.2              |

Table 3. Comparison of different local feature aggregation methods on split 1 of UCF101 and HMDB51. Values are classification accuracy (%).
| # of samples | Spatial HMDB51 | Spatial UCF101 | Temporal HMDB51 | Temporal UCF101 | Two-stream HMDB51 | Two-stream UCF101 |
|--------------|----------------|----------------|-----------------|-----------------|-------------------|-------------------|
| 3            | 52.5           | 85.6           | 54.9            | 82.4            | 64.6              | 91.6              |
| 9            | 56.1           | 87.4           | 62.2            | 87.7            | 70.9              | 93.5              |
| 15           | 56.9           | 88.2           | 64.4            | 88.5            | 72.3              | 93.8              |
| 21           | 57.1           | 88.1           | 64.8            | 88.6            | 71.8              | 94.1              |
| 25           | 57.7           | 88.3           | 64.8            | 88.8            | 72.5              | 94.2              |
| Max          | 57.6           | 88.4           | 65.3            | 88.9            | 72.4              | 94.3              |

Table 4. Number of samples per video versus accuracy (%) on split 1 of UCF101 and HMDB51.
4.1. From which layer(s) should the local features be extracted?
We conduct experiments using both VGG16 and
Inception-BN to explore which layers are optimal for ex-
tracting the local features. The video-level action classifica-
tion accuracies on split 1 of UCF101 using different layers
are shown in Table 2.
Layer L-2 from Inception-BN and layer L-4 from VGG16 give the best performance. These are the final convolution layers in each network, which suggests the following three reasons for their superior performance. First, the convolution layers have far fewer parameters compared to the fully connected layers and thus are less likely to overfit the training data, which suffers from the false label assignment problem. Second, the fully connected layers do not preserve spatial information while the convolution layers do. Third, the later convolution layers encode more global (spatial) information than the earlier ones. We conclude that extracting the local features from the final convolution layers is the optimal choice. We believe this finding helps explain why several recent works [26, 1, 12] also choose the output of the final convolution layer for further processing.
Compared to the results of Wang et al. [23], from which we get the pre-trained networks, we can see that our approach does improve the performance of both the spatial and temporal networks. However, the improvements for the spatial networks are much larger. This may be because, when training the local feature extractors, the inputs to the spatial network are single frames while the inputs to the temporal network are clips of 10 stacked optical flow frames. Smaller inputs lead to a greater chance of false label assignment and hence a larger performance gap compared to our global feature approach.
Previous work [27, 9, 26] on using local features from networks pre-trained on the ImageNet dataset shows that
combining features from multiple layers can improve the
overall performance significantly. We investigated combin-
ing features from multiple layers but found no improve-
ment. This difference shows that fine-tuning brings some
new characteristics to the local features.
In the remaining experiments, we use the output of
the global_pool layer from the Inception-BN network as it
achieves the best performance.
4.2. What is the optimal aggregation strategy?
We consider six aggregation methods on split 1 of the UCF101 and HMDB51 datasets.
Given n local features, each of which has a dimension of d, the six different aggregation methods are as follows (a short sketch of the simplest of these follows the list):

- Mean computes the mean value of the n local features along each dimension.
- Max selects the maximum value along each dimension.
- Mean_Std, inspired by Fisher Vector encoding, computes the mean and standard deviation along each dimension.
- BoW quantizes each of the n local features as one of k codewords using a codebook generated through k-means clustering.
- VLAD is similar to BoW but encodes the distance between each of the n local features and the assigned codewords.
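For concreteness, the three simplest aggregators above can be written as follows (our sketch; F is an n x d matrix of local features, and BoW, VLAD and FV are omitted since they additionally require a learned codebook):

```python
import numpy as np

def agg_mean(F):
    """Mean: per-dimension average of the n local features."""
    return F.mean(axis=0)

def agg_max(F):
    """Max: per-dimension maximum of the n local features."""
    return F.max(axis=0)

def agg_mean_std(F):
    """Mean_Std: concatenated per-dimension mean and standard deviation (2d dims)."""
    return np.concatenate([F.mean(axis=0), F.std(axis=0)])
```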

References
[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[14] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[20] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[23] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.