Proceedings ArticleDOI

Deep Temporal Linear Encoding Networks

01 Jul 2017-pp 1541-1550
TL;DR: Temporal linear encoding (TLE), as discussed by the authors, encodes the entire video into a compact feature representation, learning the semantics and a discriminative feature space, and is applicable to all kinds of networks like 2D and 3D CNNs.
Abstract: The CNN-encoding of features from entire videos for the representation of human actions has rarely been addressed. Instead, CNN work has focused on approaches to fuse spatial and temporal networks, but these were typically limited to processing shorter sequences. We present a new video representation, called temporal linear encoding (TLE) and embedded inside of CNNs as a new layer, which captures the appearance and motion throughout entire videos. It encodes this aggregated information into a robust video feature representation, via end-to-end learning. Advantages of TLEs are: (a) they encode the entire video into a compact feature representation, learning the semantics and a discriminative feature space, (b) they are applicable to all kinds of networks like 2D and 3D CNNs for video classification, and (c) they model feature interactions in a more expressive way and without loss of information. We conduct experiments on two challenging human action datasets: HMDB51 and UCF101. The experiments show that TLE outperforms current state-of-the-art methods on both datasets.

Summary (4 min read)

1. Introduction

  • Human action recognition [6, 15, 25, 35] in videos has attracted quite some attention, due to the potential applications in video surveillance, behavior analysis, video retrieval, and more.
  • Even if considerable progress was made, the performance of computer vision systems still falls behind that of people.
  • On top of the challenges that make object class recognition hard, there are issues like camera motion and the continuously changing viewpoints that come with it.

Temporal Linear Encoding (TLE) Layer

  • Given several segments of an entire video, be it either a number of frames or a number of clips, the model builds a compact video representation from the spatial and temporal cues they contain, through end-to-end learning.
  • This reliance on dense temporal sampling leads to excessive computational costs for longer videos.
  • The fusion methods of spatial and motion information lie at the heart of the state-of-the-art two-stream ConvNets.
  • Specifically, TLE captures the important concepts from the long-range temporal structure in different frames or clips, and aggregates it into a compact and robust feature representation by linear encoding.
  • Finally, conclusions are drawn in Section 6.

Score Fusion

  • The proposed temporal linear encoding captures more expressive interactions between the segments across entire videos, and encodes these interactions into a compact representation for video-level prediction.
  • To the best of their knowledge, this is the first end-to-end deep network that encodes temporal features from entire videos.

3. Approach

  • Motivated by this, IDTs [35] showed that densely sampling feature points in video frames and using optical flow to track them yields a good video representation.
  • This suggests that the authors need a video representation that encodes all the frames together, in order to also capture long-range dynamics.
  • Given earlier successes with deep learning, creating effective video representations seems possible via the end-to-end learning of deep neural networks.
  • The hope would be that such representations embody more of the semantic information extracted along the whole video.
  • More details about the CNN encoding layer are given in Section 3.1.

3.1. Deep Temporal linear encoding

  • The aggregation function can be applied to the output of different convolutional layers.
  • This temporal aggregation allows us to linearly encode and aggregate information from the entire video into a compact and robust feature representation.
  • This retains the temporal relationship between all the segments without the loss of important information.
  • The authors investigated different functions T for the temporal aggregation of the segments.

Input:

  • Of all the temporal aggregation functions illustrated above, element-wise multiplication of feature maps yielded best results, and was therefore selected.
  • The advantage of encoding is that every channel of the aggregated temporal segments interacts with every other channel, thus leading to a powerful feature representation of the entire video.
  • The resulting bilinear features capture the interaction of features with each other at all spatial locations, hence leading to a high-dimensional representation.
  • As the network has fully-connected layers between the last convolutional layer and the classification layer, the model parameters of the fully-connected layer and classification layer are learned when training the network from scratch or when fine-tuning a pre-trained network.
  • Let ℓ denote the loss function, and dℓ/dX be the gradient of the loss function with respect to X. Algorithm 2 illustrates the forward and backward passes of their temporal linear encoding steps for the 3-segment setup used in end-to-end training (a sketch of such a forward pass is shown below).
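As a rough illustration of the steps summarized in this block (element-wise multiplication of the segment feature maps, bilinear encoding with signed square root and L2 normalization, followed by a classifier), the following PyTorch sketch is provided. It is an assumption-based rendering, not the authors' Caffe implementation: the class name, tensor shapes, and the use of full rather than compact (Tensor Sketch) bilinear pooling are illustrative choices.

```python
# Illustrative sketch only: a minimal TLE-style forward pass, assuming K
# segment feature maps of shape (batch, channels, height, width) from a
# shared backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TLEBilinear(nn.Module):
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(channels * channels, num_classes)

    def forward(self, segments):            # list of K tensors, each (B, C, H, W)
        x = segments[0]
        for s in segments[1:]:
            x = x * s                       # element-wise multiplication aggregation
        b, c, h, w = x.shape
        x = x.view(b, c, h * w)             # (B, C, HW)
        y = torch.bmm(x, x.transpose(1, 2)) # full bilinear pooling: (B, C, C)
        y = y.flatten(1) / (h * w)          # normalize by number of spatial locations
        y = torch.sign(y) * torch.sqrt(torch.abs(y) + 1e-8)  # signed square root
        y = F.normalize(y, dim=1)           # L2 normalization
        return self.classifier(y)           # softmax is applied inside the loss

# Toy usage: 3 segments with 64-channel feature maps, 10 action classes.
model = TLEBilinear(channels=64, num_classes=10)
feats = [torch.randn(2, 64, 7, 7) for _ in range(3)]
logits = model(feats)                       # shape (2, 10)
```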

4. Evaluation

  • The authors first introduce the datasets and implementation details of their proposed approach.
  • Then the authors demonstrate the applicability of their temporal linear encoding on 2D and 3D ConvNets using frames or clips to encode long-range dynamics across entire videos.
  • Finally, the authors compare temporal linear encoding with the state-of-the-art methods.

4.1. Datasets

  • The authors conduct experiments on two challenging video datasets with human actions, namely HMDB51 [18] and UCF101 [28] .
  • The HMDB51 dataset consists of 51 action categories with 6,766 video clips in all.
  • Both of these datasets have at least 100 video clips for each action category.
  • For both datasets, the authors use the three training/testing splits provided as the original evaluation scheme for these datasets, and report the average accuracy over these three splits.

4.2. Implementation details

  • The authors use the Caffe toolbox [14] for ConvNet implementation, and all the networks are trained on two GeForce Titan X GPUs.
  • The convolutional feature maps extracted from the last convolutional layers (after the rectified output of the last convolutional layer, when there is one) are fed as input into the bilinear models.
  • The authors follow two steps to fine-tune the whole model.
  • The authors apply the same augmentation and discretization techniques for both RGB and optical flow frames, as discussed earlier.

C3D ConvNets:

  • The video is decomposed into non-overlapping, equal-duration clips of 16 frames.
  • For fine-tuning the network, the authors replace the previous classification layer with a C-way softmax layer, where C is the number of action categories.
  • For fine-tuning the model, the authors use the same two steps scheme explained earlier.
  • The authors use the C3D ConvNet [33] architecture as the default ConvNet architecture for TLE (see the sketch below for the classification-layer swap described above).
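A minimal sketch of the fine-tuning step described in this block: replacing the previous classification layer with a C-way layer (softmax is applied by the cross-entropy loss during training). A torchvision 3D ResNet is used as a stand-in backbone, since the paper's C3D Caffe definition is not reproduced here.

```python
# Assumption-based sketch: any 3D ConvNet with a final fully-connected layer
# named `fc` would work the same way; r3d_18 is only a stand-in for C3D.
import torch.nn as nn
from torchvision.models.video import r3d_18

def replace_classifier(num_classes: int) -> nn.Module:
    model = r3d_18()                                  # randomly initialized stand-in
    in_features = model.fc.in_features
    model.fc = nn.Linear(in_features, num_classes)    # C-way classification layer
    return model

model = replace_classifier(num_classes=101)           # e.g. UCF101 has 101 classes
```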

Testing:

  • The three parts are associated with the 3 segments.
  • For TLE two-stream ConvNet testing, the authors extract, at a time, 1 RGB frame or 10 optical flow frames from each part and feed these into the 3-segment network sequentially.
  • For video prediction, the authors average predictions over all groups of frame segments.
  • The prediction scores of the spatial and temporal ConvNets are combined in a late fusion approach via averaging.
  • The authors decompose each video into non-overlapping clips of 16 frames; they then divide the clips into 3 equal parts (see the sketch below).
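The sketch below mirrors the testing protocol summarized above. All helper names (split_into_parts, video_prediction) are hypothetical; the spatial and temporal networks are assumed to be callables that take 3 segments and return class-score vectors, and 5 sampling rounds per stream are assumed, matching the 5 frames/stacks per part mentioned elsewhere on this page.

```python
# Illustrative sketch of segment-wise test-time prediction with late fusion.
import numpy as np

def split_into_parts(items, num_parts=3):
    """Divide an ordered list of frames/clips into equal, ordered parts."""
    size = len(items) // num_parts
    return [items[i * size:(i + 1) * size] for i in range(num_parts)]

def video_prediction(spatial_net, temporal_net, rgb_frames, flow_stacks, rounds=5):
    rgb_parts = split_into_parts(rgb_frames)
    flow_parts = split_into_parts(flow_stacks)
    spatial_scores, temporal_scores = [], []
    for r in range(rounds):  # one sample per part per round, fed as the 3 segments
        rgb_segments = [part[(r * len(part)) // rounds] for part in rgb_parts]
        flow_segments = [part[(r * len(part)) // rounds] for part in flow_parts]
        spatial_scores.append(spatial_net(rgb_segments))
        temporal_scores.append(temporal_net(flow_segments))
    spatial_avg = np.mean(spatial_scores, axis=0)    # average over segment groups
    temporal_avg = np.mean(temporal_scores, axis=0)
    return (spatial_avg + temporal_avg) / 2.0        # late fusion by averaging
```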

4.3. Evaluation of TLE

  • The authors explore (i) different aggregation functions T to linearly aggregate the segments into a compact intermediate representation for encoding; and (ii) different ConvNet architectures for both two-stream (spatial and temporal networks) and C3D networks.
  • For this evaluation, the authors report the accuracy of split1 on UCF101 and HMDB51.
  • The reported performance is for TLE with bilinear models using the tensor sketch algorithm.

Two-Stream ConvNets:

  • In their evaluation, the authors explore three aggregation functions: (i) element-wise average, (ii) element-wise maximum, and (iii) element-wise multiplication.
  • The authors observe that the element-wise multiplication performs the best.
  • The authors believe combining the feature maps in this way allows them to aggregate the appearance and motion information accurately, hence leading to better results.
  • Interestingly, the authors also found that aggregating the rectified output of the last convolutional feature maps achieves around the same classification performance as the non-rectified ones.
  • Therefore, the authors choose BN-Inception as a default ConvNet architecture for TLE.

4.4. Comparison with the state-of-the-art

  • Finally, after exploring the aggregation function and good ConvNet architectures, the authors compare their TLE with the current state-of-the-art methods over all three splits of the UCF101 and HMDB51 datasets.
  • The authors report the average accuracy over the three splits of both datasets.
  • The performance gap to TLE:Bilinear on UCF101/HMDB51 is, however, small: 0.5/0.5% for TLE with bilinear models using the tensor sketch algorithm (TLE:Bilinear+TS), and 3.4/2.3% for TLE with fully-connected pooling (TLE:FC-Pooling).

5. Scene Context Embedding

  • This section describes an additional experiment to incorporate scene context in order to improve the success of action recognition.
  • The authors transfer the learned representations between the two tasks for better action recognition.
  • The authors exploit the context information from the scenes to improve the action recognition.
  • The authors use the VGG-16 network architecture in this experiment.
  • The accuracy is reported for split1 on both datasets.

6. Conclusion

  • The model performs action prediction over an entire video.
  • Even though the authors have focused on two-stream and C3D ConvNet architectures in this paper, their method has the potential to generalize to other architectures, and can readily be employed with other encoding methods as well.
  • Thus, it can lead to more accurate classification.
  • Another potential of this work is that TLEs are flexible enough to be readily employed to other forms of sequential data streams for feature embedding.


Deep Temporal Linear Encoding Networks
Ali Diba (1,⋆), Vivek Sharma (2,⋆), Luc Van Gool (1,3)
1 ESAT-PSI, KU Leuven   2 CV:HCI, KIT   3 CVL, ETH Zürich
{ali.diba,luc.vangool}@esat.kuleuven.be, vivek.sharma@kit.edu
Abstract
The CNN-encoding of features from entire videos for the representation of human actions has rarely been addressed. Instead, CNN work has focused on approaches to fuse spatial and temporal networks, but these were typically limited to processing shorter sequences. We present a new video representation, called temporal linear encoding (TLE) and embedded inside of CNNs as a new layer, which captures the appearance and motion throughout entire videos. It encodes this aggregated information into a robust video feature representation, via end-to-end learning. Advantages of TLEs are: (a) they encode the entire video into a compact feature representation, learning the semantics and a discriminative feature space; (b) they are applicable to all kinds of networks like 2D and 3D CNNs for video classification; and (c) they model feature interactions in a more expressive way and without loss of information. We conduct experiments on two challenging human action datasets: HMDB51 and UCF101. The experiments show that TLE outperforms current state-of-the-art methods on both datasets.
1. Introduction
Human action recognition [6, 15, 25, 35] in videos has attracted quite some attention, due to the potential applications in video surveillance, behavior analysis, video retrieval, and more. Even if considerable progress was made, the performance of computer vision systems still falls behind that of people. On top of the challenges that make object class recognition hard, there are issues like camera motion and the continuously changing viewpoints that come with it. Whereas Convolutional Networks (ConvNets) have caused several sub-fields of vision to leap forward, they still lack the capacity to exploit long-range temporal information, probably the main reason why end-to-end networks are still unable to outperform methods using hand-crafted features [35].

⋆ Ali Diba and Vivek Sharma contributed equally to this work and are listed in alphabetical order. This work was carried out while he was at ESAT-PSI, KU Leuven.

Figure 1: Temporal linear encoding for video classification. Given several segments of an entire video, be it either a number of frames or a number of clips, the model builds a compact video representation from the spatial and temporal cues they contain, through end-to-end learning. The ConvNets applied to different segments share the same weights.
Neural networks for action recognition can be categorized into two types, namely one-stream ConvNets [15, 33] (which use only one stream at a time: either spatial or temporal information), and two-stream ConvNets [25] (which integrate both spatial and temporal information at the same time).

As to the one-stream ConvNets, spatial networks perform action recognition from individual video frames. They lack any form of motion modeling. On the other hand, temporal networks typically get their motion information from dense optical flow. This reliance on dense temporal sampling leads to excessive computational costs for longer videos. One way to avoid processing the abundance of input frames is by extracting a fixed number of shorter clips, evenly distributed over the video [25, 33].

The two-stream ConvNets have been shown to outperform one-stream ConvNets. They exploit fusion techniques like trajectory-constrained pooling [37], 3D pooling [8], and consensus pooling [38]. The fusion methods of spatial and motion information lie at the heart of the state-of-the-art two-stream ConvNets.
Motivated by the above observations, we propose the new spatio-temporal encoding illustrated in Figure 1. The design of the spatio-temporal deep feature encoding aims to aggregate multiple video segments (i.e. frames or clips) over longer time ranges. To that end, we use our 'temporal linear encoding' (TLE), which is inspired by previous works on video representations [35] and feature encoding methods [20, 31]. TLE is a form of temporal aggregation of features sparsely sampled over the whole video using feature map aggregation techniques, and then projected to a lower dimensional feature space using encoding methods powered by end-to-end learning of deep networks. Specifically, TLE captures the important concepts from the long-range temporal structure in different frames or clips, and aggregates it into a compact and robust feature representation by linear encoding. The compact temporal feature representation fits action recognition well, as it is a global feature representation over the whole video. The goal of the paper is not only to achieve high performance, but also to show that TLEs are computationally efficient, robust, and compact. TLE is evaluated on two challenging action recognition datasets, namely HMDB51 [18] and UCF101 [28]. We experimentally show that the two-stream ConvNets when combined with TLEs achieve state-of-the-art performance on HMDB51 (71.1%) and UCF101 (95.6%).

The rest of the paper is organized as follows. In Section 2, we discuss related work. Section 3 describes our proposed approach. Experimental results and their analysis are presented in Section 4 and Section 5. Finally, conclusions are drawn in Section 6.
2. Related Work
Action Recognition without ConvNets: Over the last two decades, several action recognition techniques in videos have been proposed by the vision community. Quite a few are concerned with effective representations using local spatio-temporal features, such as HOG3D [16], SIFT3D [24], HOF [19], ESURF [39], and MBH [4]. Recently, IDT [35] was proposed, which is currently the state-of-the-art among hand-crafted features. Despite this good performance, these features have several shortcomings: they are computationally expensive; they fail to capture semantic concepts; they lack discriminative capacity as well as scalability. To overcome such issues, several techniques have been proposed to model the temporal structure for action recognition, such as the actom sequence model [10] which considers sequences of histograms; temporal action decomposition [21] which exploits the temporal structure of human actions by temporally decomposing video frames; dynamic poselets [36] which uses a relational model for action detection; and the temporal evolution of appearance representations [9] which uses a ranking function capable of modeling the evolution of both appearance and motion over time.
ConvNets for Action Recognition: Recently several attempts have been made to go beyond individual image-level appearance information and exploit the temporal information using ConvNet architectures. End-to-end ConvNets have been introduced in [8, 25, 33, 38] for action recognition. Karpathy et al. [15] trained a deep network operating on individual frames using a very large sports activities dataset (Sports-1M). Yet, the deep model turned out to be less accurate than an IDT-based representation because it could not capture the motion information. To overcome this problem, Simonyan et al. [25] proposed a two-stream network, cohorts of spatial and temporal ConvNets. The inputs to the spatial and temporal networks are RGB frames and stacks of multiple-frame dense optical flow fields, respectively. The network was still limited in its capacity to capture temporal information, because it operated on a fixed number of regularly spaced, single frames from the entire video. Tran et al. [33] explored 3D ConvNets on video streams for spatio-temporal feature learning for clips of 16 frames, with filter kernels of size 3 × 3 × 3. In this way, they avoid calculating the optical flow explicitly and still achieve good performance. Sun et al. [30] proposed a factorized spatio-temporal ConvNet and decomposed the 3D convolutions into 2D spatial and 1D temporal convolutions. Similar to [25] and [33] is Feichtenhofer et al.'s [8] work, where they employ 3D Conv fusion and 3D pooling to fuse spatial and temporal networks using RGB images and a stack of 10 optical flow frames as input. Wang et al. [38] use multiple clips sparsely sampled from the whole video as input for both streams, and then combine the scores for all clips in a late fusion approach.
Encoding Methods: As to prior encoding methods, there is a vast literature on BoW [3, 27], Fisher vector encoding [22] and sparse encoding [40]. Such methods have performed very well in various vision tasks. FV encoding [31] and VLAD [1, 12] have lately been integrated as a layer in ConvNet architectures, and CNN encoded features have produced superior results for several challenging tasks. Likewise, bilinear models [20, 32] have been widely used and have achieved state-of-the-art results. Bilinear models are computationally expensive as they return matrix outer products, hence can lead to prohibitively high dimensions. To tackle this problem, compact bilinear pooling [11] was proposed, which uses the Tensor Sketch algorithm [23] to project features from a high dimensional space to a lower dimensional one, while retaining state-of-the-art performance. Compact bilinear pooling has been shown to perform better than FV encoding and fully-connected networks [11]. Moreover, this type of feature representation is compact, non-redundant, avoids over-fitting, and reduces the number of parameters of CNNs significantly, as it replaces fully-connected layers.

Our proposed temporal linear encoding captures more expressive interactions between the segments across entire videos, and encodes these interactions into a compact representation for video-level prediction. To the best of our knowledge, this is the first end-to-end deep network that encodes temporal features from entire videos.

Figure 2: Our temporal linear encoding applied to the design of two-stream ConvNets [25]: spatial and temporal networks. The spatial network operates on RGB frames, and the temporal network operates on optical flow fields. The feature maps from the spatial and temporal ConvNets for multiple such segments are aggregated and encoded. Finally, the scores for the two ConvNets are combined in a late fusion approach as averaging. The ConvNet weights for the spatial stream are shared, and similarly for the temporal stream.
3. Approach
In a video, the motion between consecutive frames tends to be small. Motivated by this, IDTs [35] showed that densely sampling feature points in video frames and using optical flow to track them yields a good video representation. This suggests that we need a video representation that encodes all the frames together, in order to also capture long-range dynamics. To tackle this issue, recently some techniques have combined several consecutive [25] or sparsely sampled [38] frames into short clips. Unlike IDTs, these techniques use ConvNets with late fusion to combine spatial and temporal cues, but they still fail to efficiently encode all frames together.

Given earlier successes with deep learning, creating effective video representations seems possible via the end-to-end learning of deep neural networks. The hope would be that such representations embody more of the semantic information extracted along the whole video. Our goal is to create a single feature space in which to represent each video using all its selected frames or clips, rather than scoring separate frames/clips with classifiers and labeling the video based on score aggregation. We propose temporal linear encoding (TLE) to aggregate spatial and temporal information from an entire video, and to encode it into a robust and compact representation, using end-to-end learning, as shown in Fig. 2 and Fig. 3. Algorithm 1 sketches the steps of the proposed TLE. More details about the CNN encoding layer are given in Section 3.1.
3.1. Deep Temporal linear encoding
Consider the output feature maps of CNNs truncated at a convolutional layer for K segments extracted from a video V. The feature maps are matrices {S_1, S_2, ..., S_K} of size S ∈ ℝ^(h×w×c), where h, w and c denote the height, width, and number of channels of the CNN feature maps. A temporal aggregation function T : S_1, S_2, ..., S_K → X aggregates the K temporal feature maps to output an encoded feature map X. The aggregation function can be applied to the output of different convolutional layers. This temporal aggregation allows us to linearly encode and aggregate information from the entire video into a compact and robust feature representation. This retains the temporal relationship between all the segments without the loss of important information. We investigated different functions T for the temporal aggregation of the segments.
Algorithm 1 Deep Temporal Linear Encoding Layer
Input: CNN features for K frames/clips {S_1, S_2, ..., S_K} of video V, S ∈ ℝ^(h×w×c), where h, w and c are the height, width, and channels of the feature maps respectively.
Output: Temporal linear encoded feature map y ∈ ℝ^d, where d is the encoded feature dimension.
Temporal Linear Encoding:
1. X = S_1 ∘ S_2 ∘ ... ∘ S_K, X ∈ ℝ^(h×w×c), where ∘ is an aggregation operator
2. y = EncodingMethod(X), y ∈ ℝ^d, where d denotes the encoded feature dimension
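For concreteness, the two steps of Algorithm 1 can be rendered as a small PyTorch sketch, with the aggregation operator and the encoding method passed in as callables. This is an assumption-based illustration, not the paper's Caffe layer; the global-average-pool encoder in the example is only a stand-in for the bilinear or fully-connected encoders the paper actually uses.

```python
# Minimal sketch of Algorithm 1: aggregate K segment feature maps, then encode.
from functools import reduce
import torch

def temporal_linear_encoding(segments, aggregate, encode):
    """segments: list of K tensors (B, C, H, W); returns encoded features y."""
    X = reduce(aggregate, segments)   # step 1: X = S1 o S2 o ... o SK
    return encode(X)                  # step 2: y = EncodingMethod(X)

# Example: element-wise multiplication + global average pooling as a stand-in
# encoder (the paper instead uses bilinear models or fully-connected pooling).
segments = [torch.randn(2, 64, 7, 7) for _ in range(3)]
y = temporal_linear_encoding(
    segments,
    aggregate=lambda a, b: a * b,
    encode=lambda X: X.mean(dim=(2, 3)),
)
print(y.shape)  # torch.Size([2, 64])
```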

Figure 3: Our temporal linear encoding applied to 3D ConvNets [33]. These use video clips as input. The feature maps from the clips are aggregated and encoded. The output of the network is a video-level prediction. The ConvNets operating on the different clips all share the same weights.
Element-wise Average of Segments:

X = (S_1 + S_2 + ... + S_K) / K    (1)

Element-wise Maximum of Segments:

X = max{S_1, S_2, ..., S_K}    (2)

Element-wise Multiplication of Segments:

X = S_1 ⊙ S_2 ⊙ ... ⊙ S_K    (3)

Of all the temporal aggregation functions illustrated above, element-wise multiplication of feature maps yielded the best results, and was therefore selected.

The temporally aggregated matrix X is fed as input to an encoding (or pooling) method E : X → y, resulting in a linearly encoded feature vector y ∈ ℝ^d, where d denotes the encoded feature dimension. The advantage of encoding is that every channel of the aggregated temporal segments interacts with every other channel, thus leading to a powerful feature representation of the entire video. In this work, we investigate two encoding methods E:
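The three candidates of Eqs. (1)-(3) can be written as the following illustrative sketch (tensors follow PyTorch's (channels, height, width) layout rather than the paper's h × w × c convention).

```python
# Sketch of the three aggregation functions over a list of K equally-shaped maps.
import torch

def elementwise_average(segments):         # Eq. (1)
    return torch.stack(segments).mean(dim=0)

def elementwise_maximum(segments):         # Eq. (2)
    return torch.stack(segments).max(dim=0).values

def elementwise_multiplication(segments):  # Eq. (3), the variant the paper selects
    out = segments[0]
    for s in segments[1:]:
        out = out * s
    return out

segments = [torch.randn(64, 7, 7) for _ in range(3)]
X = elementwise_multiplication(segments)   # same shape as each input segment
```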
Bilinear Models: A bilinear model [20, 32] computes the outer product of two feature maps, given by:

y = W[X ⊗ X′]    (4)

where X ∈ ℝ^((hw)×c) and X′ ∈ ℝ^((hw)×c′) are input feature maps, y ∈ ℝ^(cc′) are the bilinear features, ⊗ denotes the outer product, [·] turns the matrix into a vector by concatenating the columns, and W represents the model parameters to be learned (here linear). In our case, X = X′. The resulting bilinear features capture the interaction of features with each other at all spatial locations, hence leading to a high-dimensional representation. For this reason, we use the Tensor Sketch algorithm [11, 23], which projects this high-dimensional space to a lower-dimensional space, without computing the outer product directly. That cuts down on the number of model parameters significantly. The model parameters W are learned with end-to-end back-propagation.

Figure 4: Computing gradients for back-propagation in the temporal linear encoding.
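The sketch below shows one way compact bilinear pooling via the Tensor Sketch algorithm can be implemented for the case X = X′: two independent count sketches of each spatial feature vector, combined by circular convolution in the Fourier domain, then sum-pooled over locations. It is an assumption-based illustration (projection dimension 8,192, sum pooling, signed square root and L2 normalization), not the authors' implementation.

```python
# Sketch of compact bilinear pooling (Tensor Sketch) over an aggregated map X.
import torch

class CompactBilinear(torch.nn.Module):
    def __init__(self, channels: int, dim: int = 8192):
        super().__init__()
        self.dim = dim
        for k in (1, 2):  # two independent count-sketch projections
            self.register_buffer(f"h{k}", torch.randint(dim, (channels,)))
            self.register_buffer(f"s{k}", torch.randint(0, 2, (channels,)).float() * 2 - 1)

    def count_sketch(self, x, h, s):
        # x: (N, C) feature vectors; returns (N, dim) count sketches.
        sketch = x.new_zeros(x.size(0), self.dim)
        return sketch.index_add_(1, h, x * s)

    def forward(self, X):
        # X: (B, C, H, W) aggregated feature map; sum-pool over spatial locations.
        b, c, h, w = X.shape
        x = X.permute(0, 2, 3, 1).reshape(-1, c)                  # (B*H*W, C)
        p1 = torch.fft.fft(self.count_sketch(x, self.h1, self.s1), dim=1)
        p2 = torch.fft.fft(self.count_sketch(x, self.h2, self.s2), dim=1)
        y = torch.fft.ifft(p1 * p2, dim=1).real                   # circular convolution
        y = y.view(b, h * w, self.dim).sum(dim=1)                 # (B, dim)
        y = torch.sign(y) * torch.sqrt(torch.abs(y) + 1e-8)       # signed square root
        return torch.nn.functional.normalize(y, dim=1)            # L2 normalization
```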
Fully connected pooling: As the network has fully-connected layers between the last convolutional layer and the classification layer, the model parameters of the fully-connected layer and classification layer are learned when training the network from scratch or when fine-tuning a pre-trained network.

Compared to the fully-connected pooling method, bilinear models project the high dimensional feature space to a lower dimensional one, with far fewer parameters, and still perform better than fully-connected layers, apart from being computationally more efficient.

One can readily employ other encoding methods like deep Fisher encoding [31] or VLAD [1, 12], instead of bilinear models or fully connected pooling. When bilinear models are used, the features are passed through a signed square root and L2-normalization. In either case, we use softmax as a classifier.
End-to-end training: We use K = 3, following the advice from temporal modeling work [10]. Let the output feature maps of the CNNs be S_1, S_2, and S_3. The temporally aggregated features are given by X = S_1 ⊙ S_2 ⊙ S_3, and the temporal linearly encoded features are denoted by y. Let ℓ denote the loss function, and dℓ/dX be the gradient of the loss function with respect to X. Algorithm 2 illustrates the forward and backward passes of our temporal linear encoding steps for the 3-segment setup.

Algorithm 2 Forward & backward propagation steps for our deep temporal linear encoding with bilinear models for a scheme with 3 segments.
Input: Convolutional feature maps for a scheme of 3 segments, {S_1, S_2, S_3}, S ∈ ℝ^(h×w×c)
Output: y ∈ ℝ^d
Temporal Linear Encoding:
Forward Pass:
1. X = S_1 ⊙ S_2 ⊙ S_3, X ∈ ℝ^(h×w×c)
2. y = [X Xᵀ], y ∈ ℝ^(c²)
Backward Pass:
1. dℓ/dS_1 = (S_2 ⊙ S_3) ⊙ dℓ/dX,  dℓ/dS_2 = (S_1 ⊙ S_3) ⊙ dℓ/dX,  dℓ/dS_3 = (S_1 ⊙ S_2) ⊙ dℓ/dX

The back-propagation for the joint optimization of the K temporal segments can be derived as:

dℓ/dS_k = ((S_1 ⊙ ... ⊙ S_K) \ S_k) ⊙ dℓ/dX,  ∀ k ∈ K    (5)

In the end-to-end learning, the model parameters for the K temporal segments are optimized using stochastic gradient descent (SGD). Moreover, the temporal linear encoding model parameters are learned from the entire video. The scheme is illustrated in Fig. 4.
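A quick autograd sanity check of the product-rule gradients in Eq. (5) for K = 3, under the toy assumption of a squared-sum loss over X (the actual loss in the paper is the classification loss on the encoded features):

```python
# Verify that dl/dS_k equals (product of the other segments) * dl/dX.
import torch

S = [torch.randn(4, 5, requires_grad=True) for _ in range(3)]
X = S[0] * S[1] * S[2]
loss = (X ** 2).sum()          # toy differentiable loss over X
dX = 2 * X                     # analytic dl/dX for this toy loss
loss.backward()

for k in range(3):
    others = [S[j] for j in range(3) if j != k]
    expected = (others[0] * others[1] * dX).detach()
    assert torch.allclose(S[k].grad, expected, atol=1e-5)
```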
4. Evaluation
In this section, we first introduce the datasets and implementation details of our proposed approach. Then we demonstrate the applicability of our temporal linear encoding on 2D and 3D ConvNets using frames or clips to encode long-range dynamics across entire videos. Finally, we compare temporal linear encoding with the state-of-the-art methods.
4.1. Datasets
We conduct experiments on two challenging video datasets with human actions, namely HMDB51 [18] and UCF101 [28]. The HMDB51 dataset consists of 51 action categories with 6,766 video clips in all. The UCF101 dataset consists of 101 action classes with 13,320 video clips. Both of these datasets have at least 100 video clips for each action category. For both datasets, we use the three training/testing splits provided as the original evaluation scheme for these datasets, and report the average accuracy over these three splits.
4.2. Implementation details
We use the Caffe toolbox [14] for ConvNet implementation, and all the networks are trained on two GeForce Titan X GPUs. Here, we describe the implementation details of our two schemes, temporal linear encoding with two-stream ConvNets and temporal linear encoding with C3D ConvNets, using bilinear models and fully-connected pooling. As mentioned earlier in the approach section, we use 3 segments for ConvNet training and testing.

Two-stream ConvNets: We employ three pre-trained models trained on the ImageNet dataset [5], namely AlexNet [17], VGG-16 [26], and BN-Inception [13], for the design of the two-stream ConvNets. The two-stream network consists of spatial and temporal networks; the spatial ConvNet operates on RGB frames, and the temporal ConvNet operates on a stack of 10 dense optical flow frames. The input RGB image or optical flow frames are of size 256 × 340, and are randomly cropped to a size of 224 × 224, and then mean-subtracted for network training. To fine-tune the network, we replace the previous classification layer with a C-way softmax layer, where C is the number of action categories. We use mini-batch stochastic gradient descent (SGD) to learn the model parameters with a fixed weight decay of 5 × 10⁻⁴, momentum of 0.9, and a batch size of 15 for network training. The prediction scores of the spatial and temporal ConvNets are combined in a late fusion approach as averaging before softmax normalization.
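The SGD settings above can be restated as a short PyTorch sketch (the paper used Caffe solvers, so this is only an assumption-based transcription); the initial learning rate and the step size of the decay schedule are taken from the fine-tuning details in the next paragraph, and the model is a stand-in.

```python
# Hedged sketch of the optimizer/schedule described in the text.
import torch
import torch.nn as nn

model = nn.Linear(512, 101)  # stand-in for the actual two-stream network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-3,             # initial learning rate used for fine-tuning
    momentum=0.9,
    weight_decay=5e-4,
)
# Decrease the learning rate by a factor of 10 every 4,000 iterations
# (spatial stream); the temporal stream uses a 10,000-iteration step instead.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4000, gamma=0.1)
```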
TLE with Bilinear Models: In our experiments for bilinear models, we retain only the convolutional layers of each network; more specifically, we remove all the fully connected layers, similar to [11, 20]. The convolutional feature maps extracted from the last convolutional layers (after the rectified output of the last convolutional layer, when there is one) are fed as input into the bilinear models. For example, the convolutional feature maps for the last layer of BN-Inception produce an output of size 14 × 14 × 1024, leading to bilinear features of size 1024 × 1024, and 8,196 features for compact bilinear models. We follow two steps to fine-tune the whole model. First, we train the last layer using logistic regression. Secondly, we fine-tune the whole model. In both steps for training spatial ConvNets, we initialize the learning rate with 10⁻³ and decrease it by a factor of 10 every 4,000 iterations. The maximum number of iterations is set to 12,000. We use flip augmentation about the horizontal axis and RGB jittering for RGB frames. For the temporal ConvNet, we use a stack of 10 optical flow frames as the input clip. We rescale the optical flow fields linearly to a range of [0, 255] and compress them as JPEG images. For the extraction of the optical flow frames, we use the TVL1 optical flow algorithm [42] from the OpenCV toolbox with CUDA implementation. In both steps for training the temporal ConvNets, we initialize the learning rate with 10⁻³ and manually decrease it by a factor of 10 every 10,000 iterations. The maximum number of iterations is set to 30,000. We use batch normalization. Before the features are fed into the softmax layer, the features are passed through a signed square root and L2-normalization, as described in Section 3.1.

Citations
Posted Content
TL;DR: The complete state-of-the-art techniques in the action recognition and prediction are surveyed, including existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are provided.
Abstract: Derived from rapid advances in computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are such tasks, where action recognition is to infer human actions (present state) based upon complete action executions, and action prediction to predict human actions (future state) based upon incomplete action executions. These two tasks have become particularly prevalent topics recently because of their explosively emerging real-world applications, such as visual surveillance, autonomous driving vehicle, entertainment, and video retrieval, etc. Many attempts have been devoted in the last a few decades in order to build a robust and effective framework for action recognition and prediction. In this paper, we survey the complete state-of-the-art techniques in the action recognition and prediction. Existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are also provided with systematic discussions.

351 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: A novel representation that gracefully encodes the movement of some semantic keypoints is introduced that outperforms other state-of-the-art pose representations and is complementary to standard appearance and motion streams.
Abstract: Most state-of-the-art methods for action recognition rely on a two-stream architecture that processes appearance and motion independently. In this paper, we claim that considering them jointly offers rich information for action recognition. We introduce a novel representation that gracefully encodes the movement of some semantic keypoints. We use the human joints as these keypoints and term our Pose moTion representation PoTion. Specifically, we first run a state-of-the-art human pose estimator [4] and extract heatmaps for the human joints in each frame. We obtain our PoTion representation by temporally aggregating these probability maps. This is achieved by 'colorizing' each of them depending on the relative time of the frames in the video clip and summing them. This fixed-size representation for an entire video clip is suitable to classify actions using a shallow convolutional neural network. Our experimental evaluation shows that PoTion outperforms other state-of-the-art pose representations [6, 48]. Furthermore, it is complementary to standard appearance and motion streams. When combining PoTion with the recent two-stream I3D approach [5], we obtain state-of-the-art performance on the JHMDB, HMDB and UCF101 datasets.

286 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this article, the authors proposed to train a deep network directly on the compressed video, which has a higher information density, and found the training to be easier than learning deep image representations.
Abstract: Training robust deep video representations has proven to be much more challenging than learning deep image representations. This is in part due to the enormous size of raw video streams and the high temporal redundancy; the true and interesting signal is often drowned in too much irrelevant data. Motivated by that the superfluous information can be reduced by up to two orders of magnitude by video compression (using H.264, HEVC, etc.), we propose to train a deep network directly on the compressed video. This representation has a higher information density, and we found the training to be easier. In addition, the signals in a compressed video provide free, albeit noisy, motion information. We propose novel techniques to use them effectively. Our approach is about 4.6 times faster than Res3D and 2.7 times faster than ResNet-152. On the task of action recognition, our approach outperforms all the other methods on the UCF-101, HMDB-51, and Charades dataset.

281 citations

Book ChapterDOI
02 Dec 2018
TL;DR: In this paper, a hidden two-stream CNN architecture is proposed, which takes raw video frames as input and directly predicts action classes without explicitly computing optical flow, which is 10x faster than its two-stage baseline.
Abstract: Analyzing videos of human actions involves understanding the temporal relationships among video frames. State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs. Such a two-stage approach is computationally expensive, storage demanding, and not end-to-end trainable. In this paper, we present a novel CNN architecture that implicitly captures motion information between adjacent frames. We name our approach hidden two-stream CNNs because it only takes raw video frames as input and directly predicts action classes without explicitly computing optical flow. Our end-to-end approach is 10x faster than its two-stage baseline. Experimental results on four challenging action recognition datasets: UCF101, HMDB51, THUMOS14 and ActivityNet v1.2 show that our approach significantly outperforms the previous best real-time approaches.

266 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this article, the optical flow guided feature (OFF) is proposed to extract spatio-temporal information, especially the temporal information between frames simultaneously, which enables the network to distill temporal information through a fast and robust approach.
Abstract: Motion representation plays a vital role in human action recognition in videos. In this study, we introduce a novel compact motion representation for video action recognition, named Optical Flow guided Feature (OFF), which enables the network to distill temporal information through a fast and robust approach. The OFF is derived from the definition of optical flow and is orthogonal to the optical flow. The derivation also provides theoretical support for using the difference between two frames. By directly calculating pixel-wise spatio-temporal gradients of the deep feature maps, the OFF could be embedded in any existing CNN based video action recognition framework with only a slight additional cost. It enables the CNN to extract spatiotemporal information, especially the temporal information between frames simultaneously. This simple but powerful idea is validated by experimental results. The network with OFF fed only by RGB inputs achieves a competitive accuracy of 93.3% on UCF-101, which is comparable with the result obtained by two streams (RGB and optical flow), but is 15 times faster in speed. Experimental results also show that OFF is complementary to other motion modalities such as optical flow. When the proposed method is plugged into the state-of-the-art video action recognition framework, it has 96.0% and 74.2% accuracy on UCF-101 and HMDB-51 respectively. The code for this project is available at: https://github.com/kevin-ssy/Optical-Flow-Guided-Feature

261 citations

References
Proceedings Article
03 Dec 2012
TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overriding in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Proceedings ArticleDOI
Jia Deng1, Wei Dong1, Richard Socher1, Li-Jia Li1, Kai Li1, Li Fei-Fei1 
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations

Proceedings Article
Sergey Ioffe1, Christian Szegedy1
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.

30,843 citations

Posted Content
TL;DR: Caffe as discussed by the authors is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Abstract: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU ($\approx$ 2.5 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.

12,531 citations

Frequently Asked Questions (15)
Q1. What contributions have the authors mentioned in the paper "Deep temporal linear encoding networks" ?

The authors present a new video representation, called temporal linear encoding ( TLE ) and embedded inside of CNNs as a new layer, which captures the appearance and motion throughout entire videos. The authors conduct experiments on two challenging human action datasets: HMDB51 and UCF101. 

In future work, concerning the spatial and temporal segment aggregation, the authors plan to further investigate architectural alternatives. 

The goal of the paper is not only to achieve high performance, but also to show that TLEs are computationally efficient, robust, and compact. 

Similar to the two-stream ConvNets, element-wise multiplication performs better than the other candidate functions, and was therefore selected as the default aggregation function.

A temporal aggregation function T : S_1, S_2, ..., S_K → X aggregates the K temporal feature maps to output an encoded feature map X.

On top of the challenges that make object class recognition hard, there are issues like camera motion and the continuously changing viewpoints that come with it.

In total, the authors sample 5 RGB frames or stacks of optical flow frames (i.e. 15 frames for the three-segments in total) from the whole video. 

Their proposed temporal linear encoding captures more expressive interactions between the segments across entire videos, and encodes these interactions into a compact representation for video-level prediction. 

The feature maps are matrices {S_1, S_2, ..., S_K} of size S ∈ ℝ^(h×w×c), where h, w and c denote the height, width, and number of channels of the CNN feature maps.

One way to avoid processing the abundance of input frames is by extracting a fixed number of shorter clips, evenly distributed over the video [25, 33]. 

For TLE two-stream ConvNet testing, the authors extract, at a time, 1 RGB frame or 10 optical flow frames from each part and feed these into the 3-segment network sequentially.

In both steps for training spatial ConvNets, the authors initialize the learning rate with 10−3 and decrease it by a factor of 10 every 4,000 iterations. 

The performance gap to TLE:Bilinear on UCF101/HMDB51 is, however, small: 0.5/0.5% for TLE with bilinear models using the tensor sketch algorithm (TLE:Bilinear+TS), and 3.4/2.3% for TLE with fully-connected pooling (TLE:FC-Pooling).

The authors also found that aggregating the rectified output of the last convolutional feature maps achieves around the same classification performance as the non-rectified ones.

One can observe that the optical flow is better at capturing the motion information (shown in Table 2), and when combined with the appearance information in long-range temporal structure is effective to perform video-level learning.