
Temporal Segment Networks: Towards Good
Practices for Deep Action Recognition
Limin Wang¹, Yuanjun Xiong², Zhe Wang³, Yu Qiao³, Dahua Lin², Xiaoou Tang², and Luc Van Gool¹

¹ Computer Vision Lab, ETH Zurich, Zurich, Switzerland
07wanglimin@gmail.com
² Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong, China
³ Shenzhen Institutes of Advanced Technology, CAS, Shenzhen, China
Abstract. Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains the state-of-the-art performance on the datasets of HMDB51 (69.4%) and UCF101 (94.2%). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices (models and code at https://github.com/yjxiong/temporal-segment-networks).
Keywords: Action recognition · Temporal segment networks · Good practices · ConvNets
1 Introduction
Video-based action recognition has drawn a significant amount of attention from the academic community [1–6], owing to its applications in many areas like security and behavior analysis. In action recognition, there are two crucial and complementary aspects: appearances and dynamics. The performance of a recognition system depends, to a large extent, on whether it is able to extract
and utilize relevant information therefrom. However, extracting such information is non-trivial due to a number of complexities, such as scale variations, viewpoint changes, and camera motions. Thus it becomes crucial to design effective representations that can deal with these challenges while preserving the categorical information of action classes. Recently, Convolutional Networks (ConvNets) [7] have witnessed great success in classifying images of objects, scenes, and complex events [8–11]. ConvNets have also been introduced to solve the problem of video-based action recognition [1,12–14]. Deep ConvNets come with great modeling capacity and are capable of learning discriminative representations from raw visual data with the help of large-scale supervised datasets. However, unlike image classification, end-to-end deep ConvNets remain unable to achieve a significant advantage over traditional hand-crafted features for video-based action recognition.
In our view, the application of ConvNets in video-based action recognition is impeded by two major obstacles. First, long-range temporal structure plays an important role in understanding the dynamics in action videos [15–18]. However, mainstream ConvNet frameworks [1,13] usually focus on appearances and short-term motions, thus lacking the capacity to incorporate long-range temporal structure. Recently there are a few attempts [4,19,20] to deal with this problem. These methods mostly rely on dense temporal sampling with a pre-defined sampling interval. This approach would incur excessive computational cost when applied to long video sequences, which limits its application in real-world practice and poses a risk of missing important information for videos longer than the maximal sequence length. Second, in practice, training deep ConvNets requires a large volume of training samples to achieve optimal performance. However, due to the difficulty in data collection and annotation, publicly available action recognition datasets (e.g. UCF101 [21], HMDB51 [22]) remain limited, in both size and diversity. Consequently, very deep ConvNets [9,23], which have attained remarkable success in image classification, are confronted with a high risk of over-fitting.
These challenges motivate us to study two problems: (1) how to design an effective and efficient video-level framework for learning video representation that is able to capture long-range temporal structure; (2) how to learn the ConvNet models given limited training samples. In particular, we build our method on top of the successful two-stream architecture [1] while tackling the problems mentioned above. In terms of temporal structure modeling, a key observation is that consecutive frames are highly redundant. Therefore, dense temporal sampling, which usually results in highly similar sampled frames, is unnecessary. Instead, a sparse temporal sampling strategy is more favorable in this case. Motivated by this observation, we develop a video-level framework, called temporal segment network (TSN). This framework extracts short snippets over a long video sequence with a sparse sampling scheme, where the samples distribute uniformly along the temporal dimension. Thereon, a segmental structure is employed to aggregate information from the sampled snippets. In this sense, temporal segment networks are capable of modeling long-range temporal structure over the whole video. Moreover, this sparse sampling strategy preserves relevant infor-
mation with dramatically lower cost, thus enabling end-to-end learning over
long video sequences under a reasonable budget in both time and computing
resources.
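To make the sparse sampling scheme concrete, here is a minimal Python sketch of one way to realize it: the frame range is divided into K equal segments and one snippet start index is drawn at random from each segment. The function and variable names are ours for illustration, not those of the released code, and details such as the snippet length are placeholders.

```python
import random

def sample_snippet_indices(num_frames, num_segments=3, snippet_len=1):
    """Illustrative sparse sampling: one snippet per equal-length segment.

    num_frames   -- total frames in the video
    num_segments -- K, the number of segments the video is divided into
    snippet_len  -- frames consumed by one snippet (e.g. a stack of flow fields)
    """
    usable = max(num_frames - snippet_len + 1, num_segments)
    seg_len = usable / float(num_segments)
    starts = []
    for k in range(num_segments):
        lo = int(k * seg_len)                  # first valid start in segment k
        hi = int((k + 1) * seg_len) - 1        # last valid start in segment k
        starts.append(random.randint(lo, max(lo, hi)))
    return starts

# e.g. a 300-frame video with K = 3: one random start index from each third
print(sample_snippet_indices(300, num_segments=3, snippet_len=5))
```

Because consecutive frames are largely redundant, drawing one snippet per segment covers the whole video at a fraction of the cost of dense sampling.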
To unleash the full potential of the temporal segment network framework, we adopt very deep ConvNet architectures [9,23] introduced recently, and explore a number of good practices to overcome the aforementioned difficulties caused by the limited number of training samples, including (1) cross-modality pre-training; (2) regularization; (3) enhanced data augmentation. Meanwhile, to fully utilize visual content from videos, we empirically study four types of input modalities to two-stream ConvNets, namely a single RGB image, stacked RGB difference, stacked optical flow field, and stacked warped optical flow field.
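For the stacked RGB difference modality mentioned above, a rough sketch of the idea is given below: channel-wise differences between consecutive RGB frames are stacked into a single input tensor as a cheap proxy for motion. The array layout and any scaling are assumptions made for illustration, not the exact preprocessing used in the paper.

```python
import numpy as np

def stacked_rgb_difference(frames):
    """frames: uint8 array of shape (T, H, W, 3) holding consecutive RGB frames.

    Returns a float32 array of shape (H, W, 3 * (T - 1)) in which channel block i
    is frames[i + 1] - frames[i], a cheap proxy for motion between frames.
    """
    frames = frames.astype(np.float32)
    diffs = frames[1:] - frames[:-1]            # (T-1, H, W, 3)
    t, h, w, c = diffs.shape
    return diffs.transpose(1, 2, 0, 3).reshape(h, w, t * c)

# Example: 6 consecutive 224x224 frames -> a 15-channel difference stack
dummy = np.random.randint(0, 256, size=(6, 224, 224, 3), dtype=np.uint8)
print(stacked_rgb_difference(dummy).shape)      # (224, 224, 15)
```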
We perform experiments on two challenging action recognition datasets, namely UCF101 [21] and HMDB51 [22], to verify the effectiveness of our method. In experiments, models learned using the temporal segment network significantly outperform the state of the art on these two challenging action recognition datasets. We also visualize our learned two-stream models, trying to provide some insights for future action recognition research.
2 Related Works
Action recognition has been extensively studied in the past few years [2,18,24–26]. Previous works related to ours fall into two categories: (1) convolutional networks for action recognition, (2) temporal structure modeling.
Convolutional Networks for Action Recognition. Several works have been trying to design effective ConvNet architectures for action recognition in videos [1,12,13,27,28]. Karpathy et al. [12] tested ConvNets with deep structures on a large dataset (Sports-1M). Simonyan et al. [1] designed two-stream ConvNets containing spatial and temporal nets by exploiting the ImageNet dataset for pre-training and calculating optical flow to explicitly capture motion information. Tran et al. [13] explored 3D ConvNets [27] on realistic and large-scale video datasets, where they tried to learn both appearance and motion features with 3D convolution operations. Sun et al. [28] proposed factorized spatio-temporal ConvNets and exploited different ways to decompose 3D convolutional kernels. Recently, several works focused on modeling long-range temporal structure with ConvNets [4,19,20]. However, these methods directly operate on longer continuous video streams. Limited by computational cost, these methods usually process sequences of fixed lengths ranging from 64 to 120 frames. It is non-trivial for these methods to learn from the entire video due to their limited temporal coverage. Our method differs from these end-to-end deep ConvNets by its novel adoption of a sparse temporal sampling strategy, which enables efficient learning using entire videos without the limitation of sequence length.
Temporal Structure Modeling. Many research works have been devoted to modeling the temporal structure for action recognition [15–18,29,30].

Gaidon et al. [16] annotated each atomic action for each video and proposed the Actom Sequence Model (ASM) for action detection. Niebles et al. [15] proposed to use latent variables to model the temporal decomposition of complex actions, and resorted to the Latent SVM [31] to learn the model parameters in an iterative approach. Wang et al. [17] and Pirsiavash et al. [29] extended the temporal decomposition of complex actions into a hierarchical manner using the Latent Hierarchical Model (LHM) and the Segmental Grammar Model (SGM), respectively. Wang et al. [30] designed a sequential skeleton model (SSM) to capture the relations among dynamic-poselets, and performed spatio-temporal action detection. Fernando [18] modeled the temporal evolution of BoVW representations for action recognition. These methods, however, remain unable to assemble an end-to-end learning scheme for modeling the temporal structure. The proposed temporal segment network, while also emphasizing this principle, is the first framework for end-to-end temporal structure modeling on entire videos.
3 Action Recognition with Temporal Segment Networks
In this section, we give detailed descriptions of performing action recognition
with temporal segment networks. Specifically, we first introduce the basic con-
cepts in the framework of temporal segment network. Then, we study the good
practices in learning two-stream ConvNets within the temporal segment net-
work framework. Finally, we describe the testing details of the learned two-
stream ConvNets.
3.1 Temporal Segment Networks
As we discussed in Sect. 1, an obvious problem of the two-stream ConvNets in their current form is their inability to model long-range temporal structure. This is mainly due to their limited access to temporal context, as they are designed to operate only on a single frame (spatial networks) or a single stack of frames in a short snippet (temporal network). However, complex actions, such as sports actions, comprise multiple stages spanning a relatively long time. It would be quite a loss to fail to incorporate the long-range temporal structures of these actions into ConvNet training. To tackle this issue, we propose temporal segment network, a video-level framework, as shown in Fig. 1, to model dynamics throughout the whole video.
Specifically, our proposed temporal segment network framework, aiming to
utilize the visual information of entire videos to perform video-level prediction,
is also composed of spatial stream ConvNets and temporal stream ConvNets.
Instead of working on single frames or frame stacks, temporal segment networks
operate on a sequence of short snippets sparsely sampled from the entire video.
Each snippet in this sequence will produce its own preliminary prediction of
the action classes. Then a consensus among the snippets will be derived as
the video-level prediction. In the learning process, the loss values of video-level predictions, rather than those of snippet-level predictions which were used in two-stream ConvNets, are optimized by iteratively updating the model parameters.
Formally, given a video V, we divide it into K segments {S_1, S_2, ..., S_K} of equal durations. Then, the temporal segment network models a sequence of snippets as follows:

TSN(T_1, T_2, \ldots, T_K) = H(G(F(T_1; W), F(T_2; W), \ldots, F(T_K; W))).   (1)
Here (T_1, T_2, ..., T_K) is a sequence of snippets. Each snippet T_k is randomly sampled from its corresponding segment S_k. F(T_k; W) is the function representing a ConvNet with parameters W which operates on the short snippet T_k and produces class scores for all the classes. The segmental consensus function G combines the outputs from multiple short snippets to obtain a consensus of class hypothesis among them. Based on this consensus, the prediction function H predicts the probability of each action class for the whole video. Here we choose the widely used Softmax function for H. Combining with the standard categorical cross-entropy loss, the final loss function regarding the segmental consensus G = G(F(T_1; W), F(T_2; W), ..., F(T_K; W)) is formed as

L(y, G) = -\sum_{i=1}^{C} y_i \left( G_i - \log \sum_{j=1}^{C} \exp G_j \right),   (2)
where C is the number of action classes and y_i is the groundtruth label concerning class i. In experiments, the number of snippets K is set to 3 according to previous works on temporal modeling [16,17]. The form of the consensus function G remains an open question. In this work we use the simplest form of G, where G_i = g(F_i(T_1), ..., F_i(T_K)). Here a class score G_i is inferred from the scores of the same class on all the snippets, using an aggregation function g. We empirically evaluated several different forms of the aggregation function g, including evenly averaging, maximum, and weighted averaging in our experiments. Among them, evenly averaging is used to report our final recognition accuracies.
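As a small numeric illustration of Eqs. (1) and (2), the sketch below aggregates per-snippet class scores with an evenly-averaging (or maximum) consensus g and evaluates the video-level cross-entropy loss. The snippet scores stand in for the outputs of F(T_k; W); all function and variable names are ours.

```python
import numpy as np

def segmental_consensus(snippet_scores, mode="avg"):
    """snippet_scores: (K, C) array of class scores, one row per snippet T_k."""
    if mode == "avg":      # evenly averaging, used for the reported results
        return snippet_scores.mean(axis=0)
    if mode == "max":      # maximum over snippets, per class
        return snippet_scores.max(axis=0)
    raise ValueError(mode)

def video_level_loss(consensus, label):
    """Eq. (2): L(y, G) = -sum_i y_i (G_i - log sum_j exp G_j), y one-hot."""
    log_probs = consensus - np.log(np.exp(consensus).sum())   # log-softmax of G
    return -log_probs[label]

# Three snippets (K = 3), five classes, groundtruth class 2
scores = np.array([[0.1, 0.3, 2.0, 0.2, 0.4],
                   [0.0, 0.5, 1.5, 0.1, 0.2],
                   [0.2, 0.1, 1.8, 0.3, 0.1]])
G = segmental_consensus(scores, mode="avg")
print(G, video_level_loss(G, label=2))
```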
This temporal segment network is differentiable or at least has subgradients, depending on the choice of g. This allows us to utilize the multiple snippets to jointly optimize the model parameters W with standard back-propagation algorithms. In the back-propagation process, the gradients of the model parameters W with respect to the loss value L can be derived as

\frac{\partial L(y, G)}{\partial W} = \frac{\partial L}{\partial G} \sum_{k=1}^{K} \frac{\partial G}{\partial F(T_k)} \frac{\partial F(T_k)}{\partial W},   (3)
where K is the number of segments used by the temporal segment network. When we use a gradient-based optimization method, like stochastic gradient descent (SGD), to learn the model parameters, Eq. (3) guarantees that the parameter updates utilize the segmental consensus G derived from all snippet-level predictions. Optimized in this manner, temporal segment networks can learn model parameters from the entire video rather than a short snippet.
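To see how Eq. (3) lets the video-level loss update the shared weights W through the consensus, here is a minimal PyTorch-style sketch under assumed settings (a toy network and placeholder tensor shapes, not the architecture used in the paper): the same ConvNet is applied to all K snippets, their scores are averaged into G, and a single backward pass accumulates gradients from every snippet into the shared parameters.

```python
import torch
import torch.nn as nn

K, C = 3, 101                                  # segments and action classes
convnet = nn.Sequential(                       # placeholder for F(.; W)
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, C),
)

snippets = torch.randn(K, 3, 112, 112)         # one sampled snippet per segment
label = torch.tensor([5])

scores = convnet(snippets)                     # (K, C): F(T_k; W) for every snippet
consensus = scores.mean(dim=0, keepdim=True)   # G: evenly-averaging consensus
loss = nn.functional.cross_entropy(consensus, label)   # Eq. (2) with softmax H

loss.backward()                                # Eq. (3): gradients flow through G
# Every shared parameter now holds a gradient contributed by all K snippets.
print(convnet[3].weight.grad.abs().sum())
```

Because the averaging consensus is differentiable, each snippet contributes a gradient term of the form in Eq. (3), which is what allows video-level rather than snippet-level supervision.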
