
Temporal Segment Networks: Towards Good
Practices for Deep Action Recognition
Limin Wang¹, Yuanjun Xiong², Zhe Wang³, Yu Qiao³, Dahua Lin², Xiaoou Tang², and Luc Van Gool¹

¹ Computer Vision Lab, ETH Zurich, Zurich, Switzerland
07wanglimin@gmail.com
² Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong, China
³ Shenzhen Institutes of Advanced Technology, CAS, Shenzhen, China
Abstract. Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains the state-of-the-art performance on the datasets of HMDB51 (69.4%) and UCF101 (94.2%). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices (models and code at https://github.com/yjxiong/temporal-segment-networks).
Keywords: Action recognition · Temporal segment networks · Good practices · ConvNets
1 Introduction
Video-based action recognition has drawn a significant amount of attention from the academic community [1–6], owing to its applications in many areas like security and behavior analysis. In action recognition, there are two crucial and complementary aspects: appearances and dynamics. The performance of a recognition system depends, to a large extent, on whether it is able to extract
and utilize relevant information therefrom. However, extracting such information is non-trivial due to a number of complexities, such as scale variations, viewpoint changes, and camera motions. Thus it becomes crucial to design effective representations that can deal with these challenges while preserving the categorical information of action classes. Recently, Convolutional Networks (ConvNets) [7] have witnessed great success in classifying images of objects, scenes, and complex events [8–11]. ConvNets have also been introduced to solve the problem of video-based action recognition [1,12–14]. Deep ConvNets come with great modeling capacity and are capable of learning discriminative representations from raw visual data with the help of large-scale supervised datasets. However, unlike image classification, end-to-end deep ConvNets remain unable to achieve a significant advantage over traditional hand-crafted features for video-based action recognition.
In our view, the application of ConvNets in video-based action recognition is impeded by two major obstacles. First, long-range temporal structure plays an important role in understanding the dynamics in action videos [15–18]. However, mainstream ConvNet frameworks [1,13] usually focus on appearances and short-term motions, thus lacking the capacity to incorporate long-range temporal structure. Recently there are a few attempts [4,19,20] to deal with this problem. These methods mostly rely on dense temporal sampling with a pre-defined sampling interval. This approach would incur excessive computational cost when applied to long video sequences, which limits its application in real-world practice and poses a risk of missing important information for videos longer than the maximal sequence length. Second, in practice, training deep ConvNets requires a large volume of training samples to achieve optimal performance. However, due to the difficulty in data collection and annotation, publicly available action recognition datasets (e.g. UCF101 [21], HMDB51 [22]) remain limited, in both size and diversity. Consequently, very deep ConvNets [9,23], which have attained remarkable success in image classification, are confronted with a high risk of over-fitting.
These challenges motivate us to study two problems: (1) how to design an effective and efficient video-level framework for learning video representation that is able to capture long-range temporal structure; (2) how to learn the ConvNet models given limited training samples. In particular, we build our method on top of the successful two-stream architecture [1] while tackling the problems mentioned above. In terms of temporal structure modeling, a key observation is that consecutive frames are highly redundant. Therefore, dense temporal sampling, which usually results in highly similar sampled frames, is unnecessary. Instead, a sparse temporal sampling strategy is more favorable in this case. Motivated by this observation, we develop a video-level framework, called temporal segment network (TSN). This framework extracts short snippets over a long video sequence with a sparse sampling scheme, where the samples distribute uniformly along the temporal dimension. Thereon, a segmental structure is employed to aggregate information from the sampled snippets. In this sense, temporal segment networks are capable of modeling long-range temporal structure over the whole video. Moreover, this sparse sampling strategy preserves relevant infor-
mation with dramatically lower cost, thus enabling end-to-end learning over
long video sequences under a reasonable budget in both time and computing
resources.
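To make the sparse sampling scheme concrete, here is a minimal Python sketch of one way to realize it: the frame range is divided into K equal segments and one snippet start index is drawn at random from each segment. The function and variable names are ours for illustration, not those of the released code, and details such as the snippet length are placeholders.

```python
import random

def sample_snippet_indices(num_frames, num_segments=3, snippet_len=1):
    """Illustrative sparse sampling: one snippet per equal-length segment.

    num_frames   -- total frames in the video
    num_segments -- K, the number of segments the video is divided into
    snippet_len  -- frames consumed by one snippet (e.g. a stack of flow fields)
    """
    usable = max(num_frames - snippet_len + 1, num_segments)
    seg_len = usable / float(num_segments)
    starts = []
    for k in range(num_segments):
        lo = int(k * seg_len)                  # first valid start in segment k
        hi = int((k + 1) * seg_len) - 1        # last valid start in segment k
        starts.append(random.randint(lo, max(lo, hi)))
    return starts

# e.g. a 300-frame video with K = 3: one random start index from each third
print(sample_snippet_indices(300, num_segments=3, snippet_len=5))
```

Because consecutive frames are largely redundant, drawing one snippet per segment covers the whole video at a fraction of the cost of dense sampling.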
To unleash the full potential of the temporal segment network framework, we adopt very deep ConvNet architectures [9,23] introduced recently, and explore a number of good practices to overcome the aforementioned difficulties caused by the limited number of training samples, including (1) cross-modality pre-training; (2) regularization; (3) enhanced data augmentation. Meanwhile, to fully utilize visual content from videos, we empirically study four types of input modalities to two-stream ConvNets, namely a single RGB image, stacked RGB difference, stacked optical flow field, and stacked warped optical flow field.
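For the stacked RGB difference modality mentioned above, a rough sketch of the idea is given below: channel-wise differences between consecutive RGB frames are stacked into a single input tensor as a cheap proxy for motion. The array layout and any scaling are assumptions made for illustration, not the exact preprocessing used in the paper.

```python
import numpy as np

def stacked_rgb_difference(frames):
    """frames: uint8 array of shape (T, H, W, 3) holding consecutive RGB frames.

    Returns a float32 array of shape (H, W, 3 * (T - 1)) in which channel block i
    is frames[i + 1] - frames[i], a cheap proxy for motion between frames.
    """
    frames = frames.astype(np.float32)
    diffs = frames[1:] - frames[:-1]            # (T-1, H, W, 3)
    t, h, w, c = diffs.shape
    return diffs.transpose(1, 2, 0, 3).reshape(h, w, t * c)

# Example: 6 consecutive 224x224 frames -> a 15-channel difference stack
dummy = np.random.randint(0, 256, size=(6, 224, 224, 3), dtype=np.uint8)
print(stacked_rgb_difference(dummy).shape)      # (224, 224, 15)
```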
We perform experiments on two challenging action recognition datasets, namely UCF101 [21] and HMDB51 [22], to verify the effectiveness of our method. In experiments, models learned using the temporal segment network significantly outperform the state of the art on these two challenging action recognition datasets. We also visualize our learned two-stream models, trying to provide some insights for future action recognition research.
2 Related Works
Action recognition has been extensively studied in the past few years [2,18,24–26]. Previous works related to ours fall into two categories: (1) convolutional networks for action recognition, (2) temporal structure modeling.
Convolutional Networks for Action Recognition. Several works have been trying to design effective ConvNet architectures for action recognition in videos [1,12,13,27,28]. Karpathy et al. [12] tested ConvNets with deep structures on a large dataset (Sports-1M). Simonyan et al. [1] designed two-stream ConvNets containing spatial and temporal nets by exploiting the ImageNet dataset for pre-training and calculating optical flow to explicitly capture motion information. Tran et al. [13] explored 3D ConvNets [27] on realistic and large-scale video datasets, where they tried to learn both appearance and motion features with 3D convolution operations. Sun et al. [28] proposed factorized spatio-temporal ConvNets and exploited different ways to decompose 3D convolutional kernels. Recently, several works focused on modeling long-range temporal structure with ConvNets [4,19,20]. However, these methods directly operate on longer continuous video streams. Limited by computational cost, these methods usually process sequences of fixed lengths ranging from 64 to 120 frames. It is non-trivial for these methods to learn from the entire video due to their limited temporal coverage. Our method differs from these end-to-end deep ConvNets by its novel adoption of a sparse temporal sampling strategy, which enables efficient learning using entire videos without the limitation of sequence length.
Temporal Structure Modeling. Many research works have been devoted to modeling the temporal structure for action recognition [15–18,29,30].

Gaidon et al. [16] annotated each atomic action for each video and proposed the Actom Sequence Model (ASM) for action detection. Niebles et al. [15] proposed to use latent variables to model the temporal decomposition of complex actions, and resorted to the Latent SVM [31] to learn the model parameters in an iterative approach. Wang et al. [17] and Pirsiavash et al. [29] extended the temporal decomposition of complex actions into a hierarchical manner using the Latent Hierarchical Model (LHM) and the Segmental Grammar Model (SGM), respectively. Wang et al. [30] designed a sequential skeleton model (SSM) to capture the relations among dynamic-poselets, and performed spatio-temporal action detection. Fernando [18] modeled the temporal evolution of BoVW representations for action recognition. These methods, however, remain unable to assemble an end-to-end learning scheme for modeling the temporal structure. The proposed temporal segment network, while also emphasizing this principle, is the first framework for end-to-end temporal structure modeling on entire videos.
3 Action Recognition with Temporal Segment Networks
In this section, we give detailed descriptions of performing action recognition
with temporal segment networks. Specifically, we first introduce the basic con-
cepts in the framework of temporal segment network. Then, we study the good
practices in learning two-stream ConvNets within the temporal segment net-
work framework. Finally, we describe the testing details of the learned two-
stream ConvNets.
3.1 Temporal Segment Networks
As we discussed in Sect. 1, an obvious problem of the two-stream ConvNets in their current form is their inability to model long-range temporal structure. This is mainly due to their limited access to temporal context, as they are designed to operate only on a single frame (spatial networks) or a single stack of frames in a short snippet (temporal network). However, complex actions, such as sports actions, comprise multiple stages spanning a relatively long time. It would be quite a loss to fail to incorporate the long-range temporal structures of these actions into ConvNet training. To tackle this issue, we propose temporal segment network, a video-level framework, as shown in Fig. 1, to model dynamics throughout the whole video.
Specifically, our proposed temporal segment network framework, aiming to
utilize the visual information of entire videos to perform video-level prediction,
is also composed of spatial stream ConvNets and temporal stream ConvNets.
Instead of working on single frames or frame stacks, temporal segment networks
operate on a sequence of short snippets sparsely sampled from the entire video.
Each snippet in this sequence will produce its own preliminary prediction of
the action classes. Then a consensus among the snippets will be derived as
the video-level prediction. In the learning process, the loss values of video-level predictions, rather than those of snippet-level predictions which were used in two-stream ConvNets, are optimized by iteratively updating the model parameters.
Formally, given a video V, we divide it into K segments {S_1, S_2, ..., S_K} of equal durations. Then, the temporal segment network models a sequence of snippets as follows:

TSN(T_1, T_2, \ldots, T_K) = H(G(F(T_1; W), F(T_2; W), \ldots, F(T_K; W))).   (1)
Here (T_1, T_2, ..., T_K) is a sequence of snippets. Each snippet T_k is randomly sampled from its corresponding segment S_k. F(T_k; W) is the function representing a ConvNet with parameters W which operates on the short snippet T_k and produces class scores for all the classes. The segmental consensus function G combines the outputs from multiple short snippets to obtain a consensus of class hypothesis among them. Based on this consensus, the prediction function H predicts the probability of each action class for the whole video. Here we choose the widely used Softmax function for H. Combining with the standard categorical cross-entropy loss, the final loss function regarding the segmental consensus G = G(F(T_1; W), F(T_2; W), ..., F(T_K; W)) is formed as

L(y, G) = -\sum_{i=1}^{C} y_i \left( G_i - \log \sum_{j=1}^{C} \exp G_j \right),   (2)
where C is the number of action classes and y_i is the groundtruth label concerning class i. In experiments, the number of snippets K is set to 3 according to previous works on temporal modeling [16,17]. The form of the consensus function G remains an open question. In this work we use the simplest form of G, where G_i = g(F_i(T_1), ..., F_i(T_K)). Here a class score G_i is inferred from the scores of the same class on all the snippets, using an aggregation function g. We empirically evaluated several different forms of the aggregation function g, including evenly averaging, maximum, and weighted averaging in our experiments. Among them, evenly averaging is used to report our final recognition accuracies.
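As a small numeric illustration of Eqs. (1) and (2), the sketch below aggregates per-snippet class scores with an evenly-averaging (or maximum) consensus g and evaluates the video-level cross-entropy loss. The snippet scores stand in for the outputs of F(T_k; W); all function and variable names are ours.

```python
import numpy as np

def segmental_consensus(snippet_scores, mode="avg"):
    """snippet_scores: (K, C) array of class scores, one row per snippet T_k."""
    if mode == "avg":      # evenly averaging, used for the reported results
        return snippet_scores.mean(axis=0)
    if mode == "max":      # maximum over snippets, per class
        return snippet_scores.max(axis=0)
    raise ValueError(mode)

def video_level_loss(consensus, label):
    """Eq. (2): L(y, G) = -sum_i y_i (G_i - log sum_j exp G_j), y one-hot."""
    log_probs = consensus - np.log(np.exp(consensus).sum())   # log-softmax of G
    return -log_probs[label]

# Three snippets (K = 3), five classes, groundtruth class 2
scores = np.array([[0.1, 0.3, 2.0, 0.2, 0.4],
                   [0.0, 0.5, 1.5, 0.1, 0.2],
                   [0.2, 0.1, 1.8, 0.3, 0.1]])
G = segmental_consensus(scores, mode="avg")
print(G, video_level_loss(G, label=2))
```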
This temporal segment network is differentiable or at least has subgradients, depending on the choice of g. This allows us to utilize the multiple snippets to jointly optimize the model parameters W with standard back-propagation algorithms. In the back-propagation process, the gradients of the model parameters W with respect to the loss value L can be derived as

\frac{\partial L(y, G)}{\partial W} = \frac{\partial L}{\partial G} \sum_{k=1}^{K} \frac{\partial G}{\partial F(T_k)} \frac{\partial F(T_k)}{\partial W},   (3)
where K is the number of segments used by the temporal segment network. When we use a gradient-based optimization method, like stochastic gradient descent (SGD), to learn the model parameters, Eq. (3) guarantees that the parameter updates utilize the segmental consensus G derived from all snippet-level predictions. Optimized in this manner, temporal segment networks can learn model parameters from the entire video rather than a short snippet.
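To see how Eq. (3) lets the video-level loss update the shared weights W through the consensus, here is a minimal PyTorch-style sketch under assumed settings (a toy network and placeholder tensor shapes, not the architecture used in the paper): the same ConvNet is applied to all K snippets, their scores are averaged into G, and a single backward pass accumulates gradients from every snippet into the shared parameters.

```python
import torch
import torch.nn as nn

K, C = 3, 101                                  # segments and action classes
convnet = nn.Sequential(                       # placeholder for F(.; W)
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, C),
)

snippets = torch.randn(K, 3, 112, 112)         # one sampled snippet per segment
label = torch.tensor([5])

scores = convnet(snippets)                     # (K, C): F(T_k; W) for every snippet
consensus = scores.mean(dim=0, keepdim=True)   # G: evenly-averaging consensus
loss = nn.functional.cross_entropy(consensus, label)   # Eq. (2) with softmax H

loss.backward()                                # Eq. (3): gradients flow through G
# Every shared parameter now holds a gradient contributed by all K snippets.
print(convnet[3].weight.grad.abs().sum())
```

Because the averaging consensus is differentiable, each snippet contributes a gradient term of the form in Eq. (3), which is what allows video-level rather than snippet-level supervision.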
