Proceedings ArticleDOI

Deep Temporal Linear Encoding Networks

01 Jul 2017-pp 1541-1550
TL;DR: Temporal linear encoding (TLE), as discussed by the authors, encodes the entire video into a compact feature representation, learning the semantics and a discriminative feature space, and is applicable to all kinds of networks like 2D and 3D CNNs.
Abstract: The CNN-encoding of features from entire videos for the representation of human actions has rarely been addressed. Instead, CNN work has focused on approaches to fuse spatial and temporal networks, but these were typically limited to processing shorter sequences. We present a new video representation, called temporal linear encoding (TLE) and embedded inside of CNNs as a new layer, which captures the appearance and motion throughout entire videos. It encodes this aggregated information into a robust video feature representation, via end-to-end learning. Advantages of TLEs are: (a) they encode the entire video into a compact feature representation, learning the semantics and a discriminative feature space, (b) they are applicable to all kinds of networks like 2D and 3D CNNs for video classification, and (c) they model feature interactions in a more expressive way and without loss of information. We conduct experiments on two challenging human action datasets: HMDB51 and UCF101. The experiments show that TLE outperforms current state-of-the-art methods on both datasets.

Summary (4 min read)

1. Introduction

  • Human action recognition [6, 15, 25, 35] in videos has attracted quite some attention, due to the potential applications in video surveillance, behavior analysis, video retrieval, and more.
  • Even if considerable progress was made, the performance of computer vision systems still falls behind that of people.
  • On top of the challenges that make object class recognition hard, there are issues like camera motion and the continuously changing viewpoints that come with it.

Temporal Linear Encoding (TLE) Layer

  • Given several segments of an entire video, be it either a number of frames or a number of clips, the model builds a compact video representation from the spatial and temporal cues they contain, through end-to-end learning.
  • This reliance on dense temporal sampling leads to excessive computational costs for longer videos.
  • The fusion methods of spatial and motion information lie at the heart of the state-of-the-art two-stream ConvNets.
  • Specifically, TLE captures the important concepts from the long-range temporal structure in different frames or clips, and aggregates it into a compact and robust feature representation by linear encoding.
  • Finally, conclusions are drawn in Section 6.

Score Fusion

  • The proposed temporal linear encoding captures more expressive interactions between the segments across entire videos, and encodes these interactions into a compact representation for video-level prediction.
  • To the best of their knowledge, this is the first end-to-end deep network that encodes temporal features from entire videos.

3. Approach

  • Motivated by this, IDTs [35] showed that densely sampling feature points in video frames and using optical flow to track them yields a good video representation.
  • This suggests that the authors need a video representation that encodes all the frames together, in order to also capture long-range dynamics.
  • Given earlier successes with deep learning, creating effective video representations seems possible via the end-to-end learning of deep neural networks.
  • The hope would be that such representations embody more of the semantic information extracted along the whole video.
  • More details about the CNN encoding layer are given in Section 3.1.

3.1. Deep Temporal linear encoding

  • The aggregation function can be applied to the output of different convolutional layers.
  • This temporal aggregation allows us to linearly encode and aggregate information from the entire video into a compact and robust feature representation.
  • This retains the temporal relationship between all the segments without the loss of important information.
  • The authors investigated different functions T for the temporal aggregation of the segments.

Input:

  • Of all the temporal aggregation functions illustrated above, element-wise multiplication of feature maps yielded best results, and was therefore selected.
  • The advantage of encoding is that every channel of the aggregated temporal segments interacts with every other channel, thus leading to a powerful feature representation of the entire video.
  • The resulting bilinear features capture the interaction of features with each other at all spatial locations, hence leading to a high-dimensional representation.
  • As the network has fully-connected layers between the last convolutional layer and the classification layer, the model parameters of the fully-connected layer and classification layer are learned when training the network from scratch or when fine-tuning a pre-trained network.
  • Let ℓ denote the loss function, and dℓ/dX be the gradient of the loss function with respect to X. Algorithm 2 illustrates the forward and backward passes of their temporal linear encoding steps for the 3-segment setup used in end-to-end training (a sketch of such a forward pass is shown below).
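As a rough illustration of the steps summarized in this block (element-wise multiplication of the segment feature maps, bilinear encoding with signed square root and L2 normalization, followed by a classifier), the following PyTorch sketch is provided. It is an assumption-based rendering, not the authors' Caffe implementation: the class name, tensor shapes, and the use of full rather than compact (Tensor Sketch) bilinear pooling are illustrative choices.

```python
# Illustrative sketch only: a minimal TLE-style forward pass, assuming K
# segment feature maps of shape (batch, channels, height, width) from a
# shared backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TLEBilinear(nn.Module):
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(channels * channels, num_classes)

    def forward(self, segments):            # list of K tensors, each (B, C, H, W)
        x = segments[0]
        for s in segments[1:]:
            x = x * s                       # element-wise multiplication aggregation
        b, c, h, w = x.shape
        x = x.view(b, c, h * w)             # (B, C, HW)
        y = torch.bmm(x, x.transpose(1, 2)) # full bilinear pooling: (B, C, C)
        y = y.flatten(1) / (h * w)          # normalize by number of spatial locations
        y = torch.sign(y) * torch.sqrt(torch.abs(y) + 1e-8)  # signed square root
        y = F.normalize(y, dim=1)           # L2 normalization
        return self.classifier(y)           # softmax is applied inside the loss

# Toy usage: 3 segments with 64-channel feature maps, 10 action classes.
model = TLEBilinear(channels=64, num_classes=10)
feats = [torch.randn(2, 64, 7, 7) for _ in range(3)]
logits = model(feats)                       # shape (2, 10)
```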

4. Evaluation

  • The authors first introduce the datasets and implementation details of their proposed approach.
  • Then the authors demonstrate the applicability of their temporal linear encoding on 2D and 3D ConvNets using frames or clips to encode long-range dynamics across entire videos.
  • Finally, the authors compare temporal linear encoding with the state-of-the-art methods.

4.1. Datasets

  • The authors conduct experiments on two challenging video datasets with human actions, namely HMDB51 [18] and UCF101 [28] .
  • The HMDB51 dataset consists of 51 action categories with 6,766 video clips in all.
  • Both of these datasets have at least 100 video clips for each action category.
  • For both datasets, the authors use the three training/testing splits provided as the original evaluation scheme for these datasets, and report the average accuracy over these three splits.

4.2. Implementation details

  • The authors use the Caffe toolbox [14] for ConvNet implementation, and all the networks are trained on two GeForce Titan X GPUs.
  • The convolutional feature maps extracted from the last convolutional layers (after the rectified output of the last convolutional layer, when there is one) are fed as input into the bilinear models.
  • The authors follow two steps to fine-tune the whole model.
  • The authors apply the same augmentation and discretization techniques for both RGB and optical flow frames, as discussed earlier.

C3D ConvNets:

  • The video is decomposed into non-overlapping, equal-duration clips of 16 frames.
  • For fine-tuning the network, the authors replace the previous classification layer with a C-way softmax layer, where C is the number of action categories.
  • For fine-tuning the model, the authors use the same two steps scheme explained earlier.
  • The authors use the C3D ConvNet [33] architecture as the default ConvNet architecture for TLE (see the sketch below for the classification-layer swap described above).
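A minimal sketch of the fine-tuning step described in this block: replacing the previous classification layer with a C-way layer (softmax is applied by the cross-entropy loss during training). A torchvision 3D ResNet is used as a stand-in backbone, since the paper's C3D Caffe definition is not reproduced here.

```python
# Assumption-based sketch: any 3D ConvNet with a final fully-connected layer
# named `fc` would work the same way; r3d_18 is only a stand-in for C3D.
import torch.nn as nn
from torchvision.models.video import r3d_18

def replace_classifier(num_classes: int) -> nn.Module:
    model = r3d_18()                                  # randomly initialized stand-in
    in_features = model.fc.in_features
    model.fc = nn.Linear(in_features, num_classes)    # C-way classification layer
    return model

model = replace_classifier(num_classes=101)           # e.g. UCF101 has 101 classes
```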

Testing:

  • The three parts are associated with the 3 segments.
  • For TLE two-stream ConvNet testing, the authors extract, at a time, 1 RGB frame or 10 optical flow frames from each part and feed these into the 3-segment network sequentially.
  • For video prediction, the authors average predictions over all groups of frame segments.
  • The prediction scores of the spatial and temporal ConvNets are combined in a late fusion approach via averaging.
  • The authors decompose each video into non-overlapping clips of 16 frames; they then divide the clips into 3 equal parts (see the sketch below).
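The sketch below mirrors the testing protocol summarized above. All helper names (split_into_parts, video_prediction) are hypothetical; the spatial and temporal networks are assumed to be callables that take 3 segments and return class-score vectors, and 5 sampling rounds per stream are assumed, matching the 5 frames/stacks per part mentioned elsewhere on this page.

```python
# Illustrative sketch of segment-wise test-time prediction with late fusion.
import numpy as np

def split_into_parts(items, num_parts=3):
    """Divide an ordered list of frames/clips into equal, ordered parts."""
    size = len(items) // num_parts
    return [items[i * size:(i + 1) * size] for i in range(num_parts)]

def video_prediction(spatial_net, temporal_net, rgb_frames, flow_stacks, rounds=5):
    rgb_parts = split_into_parts(rgb_frames)
    flow_parts = split_into_parts(flow_stacks)
    spatial_scores, temporal_scores = [], []
    for r in range(rounds):  # one sample per part per round, fed as the 3 segments
        rgb_segments = [part[(r * len(part)) // rounds] for part in rgb_parts]
        flow_segments = [part[(r * len(part)) // rounds] for part in flow_parts]
        spatial_scores.append(spatial_net(rgb_segments))
        temporal_scores.append(temporal_net(flow_segments))
    spatial_avg = np.mean(spatial_scores, axis=0)    # average over segment groups
    temporal_avg = np.mean(temporal_scores, axis=0)
    return (spatial_avg + temporal_avg) / 2.0        # late fusion by averaging
```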

4.3. Evaluation of TLE

  • The authors explore (i) different aggregation functions T to linearly aggregate the segments into a compact intermediate representation for encoding; and (ii) different ConvNet architectures for both two-stream (spatial and temporal networks) and C3D networks.
  • For this evaluation, the authors report the accuracy of split1 on UCF101 and HMDB51.
  • The reported performance is for TLE with bilinear models using the tensor sketch algorithm.

Two-Stream ConvNets:

  • In their evaluation, the authors explore three aggregation functions: (i) element-wise average, (ii) element-wise maximum, and (iii) element-wise multiplication.
  • The authors observe that the element-wise multiplication performs the best.
  • The authors believe combining the feature maps in this way allows them to aggregate the appearance and motion information accurately, hence leading to better results.
  • Interestingly, the authors also found that aggregating the rectified output of the last convolutional feature maps achieves around the same classification performance as the non-rectified ones.
  • Therefore, the authors choose BN-Inception as a default ConvNet architecture for TLE.

4.4. Comparison with the state-of-the-art

  • Finally, after exploring the aggregation function and good ConvNet architectures, the authors compare their TLE with the current state-of-the-art methods over all three splits of the UCF101 and HMDB51 datasets.
  • The authors report the average accuracy over the three splits of both datasets.
  • The performance gap to TLE:Bilinear on UCF101/HMDB51 is, however, small: 0.5/0.5% for TLE with bilinear models using the tensor sketch algorithm (TLE:Bilinear+TS), and 3.4/2.3% for TLE with fully-connected pooling (TLE:FC-Pooling).

5. Scene Context Embedding

  • This section describes an additional experiment to incorporate scene context in order to improve the success of action recognition.
  • The authors transfer the learned representations between the two tasks for better action recognition.
  • The authors exploit the context information from the scenes to improve the action recognition.
  • The authors use the VGG-16 network architecture in this experiment.
  • The accuracy is reported for split1 on both datasets.

6. Conclusion

  • The model performs action prediction over an entire video.
  • Even though the authors have focused on two-stream and C3D ConvNet architectures in this paper, their method has the potential to generalize to other architectures, and can readily be employed with other encoding methods as well.
  • Thus, it can lead to more accurate classification.
  • Another potential of this work is that TLEs are flexible enough to be readily employed to other forms of sequential data streams for feature embedding.


Deep Temporal Linear Encoding Networks
Ali Diba (1,⋆), Vivek Sharma (2,⋆), Luc Van Gool (1,3)
1 ESAT-PSI, KU Leuven   2 CV:HCI, KIT   3 CVL, ETH Zürich
{ali.diba,luc.vangool}@esat.kuleuven.be, vivek.sharma@kit.edu
Abstract
The CNN-encoding of features from entire videos for the representation of human actions has rarely been addressed. Instead, CNN work has focused on approaches to fuse spatial and temporal networks, but these were typically limited to processing shorter sequences. We present a new video representation, called temporal linear encoding (TLE) and embedded inside of CNNs as a new layer, which captures the appearance and motion throughout entire videos. It encodes this aggregated information into a robust video feature representation, via end-to-end learning. Advantages of TLEs are: (a) they encode the entire video into a compact feature representation, learning the semantics and a discriminative feature space; (b) they are applicable to all kinds of networks like 2D and 3D CNNs for video classification; and (c) they model feature interactions in a more expressive way and without loss of information. We conduct experiments on two challenging human action datasets: HMDB51 and UCF101. The experiments show that TLE outperforms current state-of-the-art methods on both datasets.
1. Introduction
Human action recognition [6, 15, 25, 35] in videos has attracted quite some attention, due to the potential applications in video surveillance, behavior analysis, video retrieval, and more. Even if considerable progress was made, the performance of computer vision systems still falls behind that of people. On top of the challenges that make object class recognition hard, there are issues like camera motion and the continuously changing viewpoints that come with it. Whereas Convolutional Networks (ConvNets) have caused several sub-fields of vision to leap forward, they still lack the capacity to exploit long-range temporal information, probably the main reason why end-to-end networks are still unable to outperform methods using hand-crafted features [35].

⋆ Ali Diba and Vivek Sharma contributed equally to this work and are listed in alphabetical order. This work was carried out while he was at ESAT-PSI, KU Leuven.

Figure 1: Temporal linear encoding for video classification. Given several segments of an entire video, be it either a number of frames or a number of clips, the model builds a compact video representation from the spatial and temporal cues they contain, through end-to-end learning. The ConvNets applied to different segments share the same weights.
Neural networks for action recognition can be categorized into two types, namely one-stream ConvNets [15, 33] (which use only one stream at a time: either spatial or temporal information), and two-stream ConvNets [25] (which integrate both spatial and temporal information at the same time).

As to the one-stream ConvNets, spatial networks perform action recognition from individual video frames. They lack any form of motion modeling. On the other hand, temporal networks typically get their motion information from dense optical flow. This reliance on dense temporal sampling leads to excessive computational costs for longer videos. One way to avoid processing the abundance of input frames is by extracting a fixed number of shorter clips, evenly distributed over the video [25, 33].

The two-stream ConvNets have been shown to outperform one-stream ConvNets. They exploit fusion techniques like trajectory-constrained pooling [37], 3D pooling [8], and consensus pooling [38]. The fusion methods of spatial and motion information lie at the heart of the state-of-the-art two-stream ConvNets.
Motivated by the above observations, we propose the new spatio-temporal encoding illustrated in Figure 1. The design of the spatio-temporal deep feature encoding aims to aggregate multiple video segments (i.e. frames or clips) over longer time ranges. To that end, we use our 'temporal linear encoding' (TLE), which is inspired by previous works on video representations [35] and feature encoding methods [20, 31]. TLE is a form of temporal aggregation of features sparsely sampled over the whole video using feature map aggregation techniques, and then projected to a lower dimensional feature space using encoding methods powered by end-to-end learning of deep networks. Specifically, TLE captures the important concepts from the long-range temporal structure in different frames or clips, and aggregates it into a compact and robust feature representation by linear encoding. The compact temporal feature representation fits action recognition well, as it is a global feature representation over the whole video. The goal of the paper is not only to achieve high performance, but also to show that TLEs are computationally efficient, robust, and compact. TLE is evaluated on two challenging action recognition datasets, namely HMDB51 [18] and UCF101 [28]. We experimentally show that the two-stream ConvNets when combined with TLEs achieve state-of-the-art performance on HMDB51 (71.1%) and UCF101 (95.6%).

The rest of the paper is organized as follows. In Section 2, we discuss related work. Section 3 describes our proposed approach. Experimental results and their analysis are presented in Section 4 and Section 5. Finally, conclusions are drawn in Section 6.
2. Related Work
Action Recognition without ConvNets: Over the last two decades, several action recognition techniques in videos have been proposed by the vision community. Quite a few are concerned with effective representations using local spatio-temporal features, such as HOG3D [16], SIFT3D [24], HOF [19], ESURF [39], and MBH [4]. Recently, IDT [35] was proposed, which is currently the state-of-the-art among hand-crafted features. Despite this good performance, these features have several shortcomings: they are computationally expensive; they fail to capture semantic concepts; they lack discriminative capacity as well as scalability. To overcome such issues, several techniques have been proposed to model the temporal structure for action recognition, such as the actom sequence model [10] which considers sequences of histograms; temporal action decomposition [21] which exploits the temporal structure of human actions by temporally decomposing video frames; dynamic poselets [36] which uses a relational model for action detection; and the temporal evolution of appearance representations [9] which uses a ranking function capable of modeling the evolution of both appearance and motion over time.
ConvNets for Action Recognition: Recently several attempts have been made to go beyond individual image-level appearance information and exploit the temporal information using ConvNet architectures. End-to-end ConvNets have been introduced in [8, 25, 33, 38] for action recognition. Karpathy et al. [15] trained a deep network operating on individual frames using a very large sports activities dataset (Sports-1M). Yet, the deep model turned out to be less accurate than an IDT-based representation because it could not capture the motion information. To overcome this problem, Simonyan et al. [25] proposed a two-stream network, cohorts of spatial and temporal ConvNets. The inputs to the spatial and temporal networks are RGB frames and stacks of multiple-frame dense optical flow fields, respectively. The network was still limited in its capacity to capture temporal information, because it operated on a fixed number of regularly spaced, single frames from the entire video. Tran et al. [33] explored 3D ConvNets on video streams for spatio-temporal feature learning for clips of 16 frames, with filter kernels of size 3 × 3 × 3. In this way, they avoid calculating the optical flow explicitly and still achieve good performance. Sun et al. [30] proposed a factorized spatio-temporal ConvNet and decomposed the 3D convolutions into 2D spatial and 1D temporal convolutions. Similar to [25] and [33] is Feichtenhofer et al.'s [8] work, where they employ 3D Conv fusion and 3D pooling to fuse spatial and temporal networks using RGB images and a stack of 10 optical flow frames as input. Wang et al. [38] use multiple clips sparsely sampled from the whole video as input for both streams, and then combine the scores for all clips in a late fusion approach.
Encoding Methods: As to prior encoding methods, there is a vast literature on BoW [3, 27], Fisher vector encoding [22] and sparse encoding [40]. Such methods have performed very well in various vision tasks. FV encoding [31] and VLAD [1, 12] have lately been integrated as a layer in ConvNet architectures, and CNN encoded features have produced superior results for several challenging tasks. Likewise, bilinear models [20, 32] have been widely used and have achieved state-of-the-art results. Bilinear models are computationally expensive as they return matrix outer products, hence can lead to prohibitively high dimensions. To tackle this problem, compact bilinear pooling [11] was proposed, which uses the Tensor Sketch algorithm [23] to project features from a high dimensional space to a lower dimensional one, while retaining state-of-the-art performance. Compact bilinear pooling has been shown to perform better than FV encoding and fully-connected networks [11]. Moreover, this type of feature representation is compact, non-redundant, avoids over-fitting, and reduces the number of parameters of CNNs significantly, as it replaces fully-connected layers.

Our proposed temporal linear encoding captures more expressive interactions between the segments across entire videos, and encodes these interactions into a compact representation for video-level prediction. To the best of our knowledge, this is the first end-to-end deep network that encodes temporal features from entire videos.

Figure 2: Our temporal linear encoding applied to the design of two-stream ConvNets [25]: spatial and temporal networks. The spatial network operates on RGB frames, and the temporal network operates on optical flow fields. The feature maps from the spatial and temporal ConvNets for multiple such segments are aggregated and encoded. Finally, the scores for the two ConvNets are combined in a late fusion approach as averaging. The ConvNet weights for the spatial stream are shared, and similarly for the temporal stream.
3. Approach
In a video, the motion between consecutive frames tends to be small. Motivated by this, IDTs [35] showed that densely sampling feature points in video frames and using optical flow to track them yields a good video representation. This suggests that we need a video representation that encodes all the frames together, in order to also capture long-range dynamics. To tackle this issue, recently some techniques have combined several consecutive [25] or sparsely sampled [38] frames into short clips. Unlike IDTs, these techniques use ConvNets with late fusion to combine spatial and temporal cues, but they still fail to efficiently encode all frames together.

Given earlier successes with deep learning, creating effective video representations seems possible via the end-to-end learning of deep neural networks. The hope would be that such representations embody more of the semantic information extracted along the whole video. Our goal is to create a single feature space in which to represent each video using all its selected frames or clips, rather than scoring separate frames/clips with classifiers and labeling the video based on score aggregation. We propose temporal linear encoding (TLE) to aggregate spatial and temporal information from an entire video, and to encode it into a robust and compact representation, using end-to-end learning, as shown in Fig. 2 and Fig. 3. Algorithm 1 sketches the steps of the proposed TLE. More details about the CNN encoding layer are given in Section 3.1.
3.1. Deep Temporal linear encoding
Consider the output feature maps of CNNs truncated at a convolutional layer for K segments extracted from a video V. The feature maps are matrices {S_1, S_2, ..., S_K} of size S ∈ ℝ^(h×w×c), where h, w and c denote the height, width, and number of channels of the CNN feature maps. A temporal aggregation function T : S_1, S_2, ..., S_K → X aggregates the K temporal feature maps to output an encoded feature map X. The aggregation function can be applied to the output of different convolutional layers. This temporal aggregation allows us to linearly encode and aggregate information from the entire video into a compact and robust feature representation. This retains the temporal relationship between all the segments without the loss of important information. We investigated different functions T for the temporal aggregation of the segments.
Algorithm 1 Deep Temporal Linear Encoding Layer
Input: CNN features for K frames/clips {S_1, S_2, ..., S_K} of video V, S ∈ ℝ^(h×w×c), where h, w and c are the height, width, and channels of the feature maps respectively.
Output: Temporal linear encoded feature map y ∈ ℝ^d, where d is the encoded feature dimension.
Temporal Linear Encoding:
1. X = S_1 ∘ S_2 ∘ ... ∘ S_K, X ∈ ℝ^(h×w×c), where ∘ is an aggregation operator
2. y = EncodingMethod(X), y ∈ ℝ^d, where d denotes the encoded feature dimension
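For concreteness, the two steps of Algorithm 1 can be rendered as a small PyTorch sketch, with the aggregation operator and the encoding method passed in as callables. This is an assumption-based illustration, not the paper's Caffe layer; the global-average-pool encoder in the example is only a stand-in for the bilinear or fully-connected encoders the paper actually uses.

```python
# Minimal sketch of Algorithm 1: aggregate K segment feature maps, then encode.
from functools import reduce
import torch

def temporal_linear_encoding(segments, aggregate, encode):
    """segments: list of K tensors (B, C, H, W); returns encoded features y."""
    X = reduce(aggregate, segments)   # step 1: X = S1 o S2 o ... o SK
    return encode(X)                  # step 2: y = EncodingMethod(X)

# Example: element-wise multiplication + global average pooling as a stand-in
# encoder (the paper instead uses bilinear models or fully-connected pooling).
segments = [torch.randn(2, 64, 7, 7) for _ in range(3)]
y = temporal_linear_encoding(
    segments,
    aggregate=lambda a, b: a * b,
    encode=lambda X: X.mean(dim=(2, 3)),
)
print(y.shape)  # torch.Size([2, 64])
```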

Figure 3: Our temporal linear encoding applied to 3D ConvNets [33]. These use video clips as input. The feature maps from the clips are aggregated and encoded. The output of the network is a video-level prediction. The ConvNets operating on the different clips all share the same weights.
Element-wise Average of Segments:

X = (S_1 + S_2 + ... + S_K) / K    (1)

Element-wise Maximum of Segments:

X = max{S_1, S_2, ..., S_K}    (2)

Element-wise Multiplication of Segments:

X = S_1 ⊙ S_2 ⊙ ... ⊙ S_K    (3)

Of all the temporal aggregation functions illustrated above, element-wise multiplication of feature maps yielded the best results, and was therefore selected.

The temporally aggregated matrix X is fed as input to an encoding (or pooling) method E : X → y, resulting in a linearly encoded feature vector y ∈ ℝ^d, where d denotes the encoded feature dimension. The advantage of encoding is that every channel of the aggregated temporal segments interacts with every other channel, thus leading to a powerful feature representation of the entire video. In this work, we investigate two encoding methods E:
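The three candidates of Eqs. (1)-(3) can be written as the following illustrative sketch (tensors follow PyTorch's (channels, height, width) layout rather than the paper's h × w × c convention).

```python
# Sketch of the three aggregation functions over a list of K equally-shaped maps.
import torch

def elementwise_average(segments):         # Eq. (1)
    return torch.stack(segments).mean(dim=0)

def elementwise_maximum(segments):         # Eq. (2)
    return torch.stack(segments).max(dim=0).values

def elementwise_multiplication(segments):  # Eq. (3), the variant the paper selects
    out = segments[0]
    for s in segments[1:]:
        out = out * s
    return out

segments = [torch.randn(64, 7, 7) for _ in range(3)]
X = elementwise_multiplication(segments)   # same shape as each input segment
```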
Bilinear Models: A bilinear model [20, 32] computes the outer product of two feature maps, given by:

y = W[X ⊗ X′]    (4)

where X ∈ ℝ^((hw)×c) and X′ ∈ ℝ^((hw)×c′) are input feature maps, y ∈ ℝ^(cc′) are the bilinear features, ⊗ denotes the outer product, [·] turns the matrix into a vector by concatenating the columns, and W represents the model parameters to be learned (here linear). In our case, X = X′. The resulting bilinear features capture the interaction of features with each other at all spatial locations, hence leading to a high-dimensional representation. For this reason, we use the Tensor Sketch algorithm [11, 23], which projects this high-dimensional space to a lower-dimensional space, without computing the outer product directly. That cuts down on the number of model parameters significantly. The model parameters W are learned with end-to-end back-propagation.

Figure 4: Computing gradients for back-propagation in the temporal linear encoding.
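The sketch below shows one way compact bilinear pooling via the Tensor Sketch algorithm can be implemented for the case X = X′: two independent count sketches of each spatial feature vector, combined by circular convolution in the Fourier domain, then sum-pooled over locations. It is an assumption-based illustration (projection dimension 8,192, sum pooling, signed square root and L2 normalization), not the authors' implementation.

```python
# Sketch of compact bilinear pooling (Tensor Sketch) over an aggregated map X.
import torch

class CompactBilinear(torch.nn.Module):
    def __init__(self, channels: int, dim: int = 8192):
        super().__init__()
        self.dim = dim
        for k in (1, 2):  # two independent count-sketch projections
            self.register_buffer(f"h{k}", torch.randint(dim, (channels,)))
            self.register_buffer(f"s{k}", torch.randint(0, 2, (channels,)).float() * 2 - 1)

    def count_sketch(self, x, h, s):
        # x: (N, C) feature vectors; returns (N, dim) count sketches.
        sketch = x.new_zeros(x.size(0), self.dim)
        return sketch.index_add_(1, h, x * s)

    def forward(self, X):
        # X: (B, C, H, W) aggregated feature map; sum-pool over spatial locations.
        b, c, h, w = X.shape
        x = X.permute(0, 2, 3, 1).reshape(-1, c)                  # (B*H*W, C)
        p1 = torch.fft.fft(self.count_sketch(x, self.h1, self.s1), dim=1)
        p2 = torch.fft.fft(self.count_sketch(x, self.h2, self.s2), dim=1)
        y = torch.fft.ifft(p1 * p2, dim=1).real                   # circular convolution
        y = y.view(b, h * w, self.dim).sum(dim=1)                 # (B, dim)
        y = torch.sign(y) * torch.sqrt(torch.abs(y) + 1e-8)       # signed square root
        return torch.nn.functional.normalize(y, dim=1)            # L2 normalization
```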
Fully connected pooling: As the network has fully-connected layers between the last convolutional layer and the classification layer, the model parameters of the fully-connected layer and classification layer are learned when training the network from scratch or when fine-tuning a pre-trained network.

Compared to the fully-connected pooling method, bilinear models project the high dimensional feature space to a lower dimensional one, with far fewer parameters, and still perform better than fully-connected layers, apart from being computationally more efficient.

One can readily employ other encoding methods like deep Fisher encoding [31] or VLAD [1, 12], instead of bilinear models or fully connected pooling. When bilinear models are used, the features are passed through a signed square root and L2-normalization. In either case, we use softmax as a classifier.
End-to-end training: We use K = 3, following the advice from temporal modeling work [10]. Let the output feature maps of the CNNs be S_1, S_2, and S_3. The temporally aggregated features are given by X = S_1 ⊙ S_2 ⊙ S_3, and the temporal linearly encoded features are denoted by y. Let ℓ denote the loss function, and dℓ/dX be the gradient of the loss function with respect to X. Algorithm 2 illustrates the forward and backward passes of our temporal linear encoding steps for the 3-segment setup.

Algorithm 2 Forward & backward propagation steps for our deep temporal linear encoding with bilinear models for a scheme with 3 segments.
Input: Convolutional feature maps for a scheme of 3 segments, {S_1, S_2, S_3}, S ∈ ℝ^(h×w×c)
Output: y ∈ ℝ^d
Temporal Linear Encoding:
Forward Pass:
1. X = S_1 ⊙ S_2 ⊙ S_3, X ∈ ℝ^(h×w×c)
2. y = [X Xᵀ], y ∈ ℝ^(c²)
Backward Pass:
1. dℓ/dS_1 = (S_2 ⊙ S_3) ⊙ dℓ/dX,  dℓ/dS_2 = (S_1 ⊙ S_3) ⊙ dℓ/dX,  dℓ/dS_3 = (S_1 ⊙ S_2) ⊙ dℓ/dX

The back-propagation for the joint optimization of the K temporal segments can be derived as:

dℓ/dS_k = ((S_1 ⊙ ... ⊙ S_K) \ S_k) ⊙ dℓ/dX,  ∀ k ∈ K    (5)

In the end-to-end learning, the model parameters for the K temporal segments are optimized using stochastic gradient descent (SGD). Moreover, the temporal linear encoding model parameters are learned from the entire video. The scheme is illustrated in Fig. 4.
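A quick autograd sanity check of the product-rule gradients in Eq. (5) for K = 3, under the toy assumption of a squared-sum loss over X (the actual loss in the paper is the classification loss on the encoded features):

```python
# Verify that dl/dS_k equals (product of the other segments) * dl/dX.
import torch

S = [torch.randn(4, 5, requires_grad=True) for _ in range(3)]
X = S[0] * S[1] * S[2]
loss = (X ** 2).sum()          # toy differentiable loss over X
dX = 2 * X                     # analytic dl/dX for this toy loss
loss.backward()

for k in range(3):
    others = [S[j] for j in range(3) if j != k]
    expected = (others[0] * others[1] * dX).detach()
    assert torch.allclose(S[k].grad, expected, atol=1e-5)
```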
4. Evaluation
In this section, we first introduce the datasets and implementation details of our proposed approach. Then we demonstrate the applicability of our temporal linear encoding on 2D and 3D ConvNets using frames or clips to encode long-range dynamics across entire videos. Finally, we compare temporal linear encoding with the state-of-the-art methods.
4.1. Datasets
We conduct experiments on two challenging video datasets with human actions, namely HMDB51 [18] and UCF101 [28]. The HMDB51 dataset consists of 51 action categories with 6,766 video clips in all. The UCF101 dataset consists of 101 action classes with 13,320 video clips. Both of these datasets have at least 100 video clips for each action category. For both datasets, we use the three training/testing splits provided as the original evaluation scheme for these datasets, and report the average accuracy over these three splits.
4.2. Implementation details
We use the Caffe toolbox [14] for ConvNet implementation, and all the networks are trained on two GeForce Titan X GPUs. Here, we describe the implementation details of our two schemes, temporal linear encoding with two-stream ConvNets and temporal linear encoding with C3D ConvNets, using bilinear models and fully-connected pooling. As mentioned earlier in the approach section, we use 3 segments for ConvNet training and testing.

Two-stream ConvNets: We employ three pre-trained models trained on the ImageNet dataset [5], namely AlexNet [17], VGG-16 [26], and BN-Inception [13], for the design of the two-stream ConvNets. The two-stream network consists of spatial and temporal networks; the spatial ConvNet operates on RGB frames, and the temporal ConvNet operates on a stack of 10 dense optical flow frames. The input RGB image or optical flow frames are of size 256 × 340, and are randomly cropped to a size of 224 × 224, and then mean-subtracted for network training. To fine-tune the network, we replace the previous classification layer with a C-way softmax layer, where C is the number of action categories. We use mini-batch stochastic gradient descent (SGD) to learn the model parameters with a fixed weight decay of 5 × 10⁻⁴, momentum of 0.9, and a batch size of 15 for network training. The prediction scores of the spatial and temporal ConvNets are combined in a late fusion approach as averaging before softmax normalization.
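The SGD settings above can be restated as a short PyTorch sketch (the paper used Caffe solvers, so this is only an assumption-based transcription); the initial learning rate and the step size of the decay schedule are taken from the fine-tuning details in the next paragraph, and the model is a stand-in.

```python
# Hedged sketch of the optimizer/schedule described in the text.
import torch
import torch.nn as nn

model = nn.Linear(512, 101)  # stand-in for the actual two-stream network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-3,             # initial learning rate used for fine-tuning
    momentum=0.9,
    weight_decay=5e-4,
)
# Decrease the learning rate by a factor of 10 every 4,000 iterations
# (spatial stream); the temporal stream uses a 10,000-iteration step instead.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4000, gamma=0.1)
```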
TLE with Bilinear Models: In our experiments for bilinear models, we retain only the convolutional layers of each network; more specifically, we remove all the fully connected layers, similar to [11, 20]. The convolutional feature maps extracted from the last convolutional layers (after the rectified output of the last convolutional layer, when there is one) are fed as input into the bilinear models. For example, the convolutional feature maps for the last layer of BN-Inception produce an output of size 14 × 14 × 1024, leading to bilinear features of size 1024 × 1024, and 8,196 features for compact bilinear models. We follow two steps to fine-tune the whole model. First, we train the last layer using logistic regression. Secondly, we fine-tune the whole model. In both steps for training spatial ConvNets, we initialize the learning rate with 10⁻³ and decrease it by a factor of 10 every 4,000 iterations. The maximum number of iterations is set to 12,000. We use flip augmentation about the horizontal axis and RGB jittering for RGB frames. For the temporal ConvNet, we use a stack of 10 optical flow frames as the input clip. We rescale the optical flow fields linearly to a range of [0, 255] and compress them as JPEG images. For the extraction of the optical flow frames, we use the TVL1 optical flow algorithm [42] from the OpenCV toolbox with CUDA implementation. In both steps for training the temporal ConvNets, we initialize the learning rate with 10⁻³ and manually decrease it by a factor of 10 every 10,000 iterations. The maximum number of iterations is set to 30,000. We use batch normalization. Before the features are fed into the softmax layer, the features are passed through a signed square root and L2-normalization, as described in Section 3.1.

Citations
Posted Content
TL;DR: The complete state-of-the-art techniques in the action recognition and prediction are surveyed, including existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are provided.
Abstract: Derived from rapid advances in computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are such tasks, where action recognition is to infer human actions (present state) based upon complete action executions, and action prediction to predict human actions (future state) based upon incomplete action executions. These two tasks have become particularly prevalent topics recently because of their explosively emerging real-world applications, such as visual surveillance, autonomous driving vehicle, entertainment, and video retrieval, etc. Many attempts have been devoted in the last a few decades in order to build a robust and effective framework for action recognition and prediction. In this paper, we survey the complete state-of-the-art techniques in the action recognition and prediction. Existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are also provided with systematic discussions.

351 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: A novel representation that gracefully encodes the movement of some semantic keypoints is introduced that outperforms other state-of-the-art pose representations and is complementary to standard appearance and motion streams.
Abstract: Most state-of-the-art methods for action recognition rely on a two-stream architecture that processes appearance and motion independently. In this paper, we claim that considering them jointly offers rich information for action recognition. We introduce a novel representation that gracefully encodes the movement of some semantic keypoints. We use the human joints as these keypoints and term our Pose moTion representation PoTion. Specifically, we first run a state-of-the-art human pose estimator [4] and extract heatmaps for the human joints in each frame. We obtain our PoTion representation by temporally aggregating these probability maps. This is achieved by 'colorizing' each of them depending on the relative time of the frames in the video clip and summing them. This fixed-size representation for an entire video clip is suitable to classify actions using a shallow convolutional neural network. Our experimental evaluation shows that PoTion outperforms other state-of-the-art pose representations [6, 48]. Furthermore, it is complementary to standard appearance and motion streams. When combining PoTion with the recent two-stream I3D approach [5], we obtain state-of-the-art performance on the JHMDB, HMDB and UCF101 datasets.

286 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this article, the authors proposed to train a deep network directly on the compressed video, which has a higher information density, and found the training to be easier than learning deep image representations.
Abstract: Training robust deep video representations has proven to be much more challenging than learning deep image representations. This is in part due to the enormous size of raw video streams and the high temporal redundancy; the true and interesting signal is often drowned in too much irrelevant data. Motivated by that the superfluous information can be reduced by up to two orders of magnitude by video compression (using H.264, HEVC, etc.), we propose to train a deep network directly on the compressed video. This representation has a higher information density, and we found the training to be easier. In addition, the signals in a compressed video provide free, albeit noisy, motion information. We propose novel techniques to use them effectively. Our approach is about 4.6 times faster than Res3D and 2.7 times faster than ResNet-152. On the task of action recognition, our approach outperforms all the other methods on the UCF-101, HMDB-51, and Charades dataset.

281 citations

Book ChapterDOI
02 Dec 2018
TL;DR: In this paper, a hidden two-stream CNN architecture is proposed, which takes raw video frames as input and directly predicts action classes without explicitly computing optical flow, which is 10x faster than its two-stage baseline.
Abstract: Analyzing videos of human actions involves understanding the temporal relationships among video frames. State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs. Such a two-stage approach is computationally expensive, storage demanding, and not end-to-end trainable. In this paper, we present a novel CNN architecture that implicitly captures motion information between adjacent frames. We name our approach hidden two-stream CNNs because it only takes raw video frames as input and directly predicts action classes without explicitly computing optical flow. Our end-to-end approach is 10x faster than its two-stage baseline. Experimental results on four challenging action recognition datasets: UCF101, HMDB51, THUMOS14 and ActivityNet v1.2 show that our approach significantly outperforms the previous best real-time approaches.

266 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this article, the optical flow guided feature (OFF) is proposed to extract spatio-temporal information, especially the temporal information between frames simultaneously, which enables the network to distill temporal information through a fast and robust approach.
Abstract: Motion representation plays a vital role in human action recognition in videos. In this study, we introduce a novel compact motion representation for video action recognition, named Optical Flow guided Feature (OFF), which enables the network to distill temporal information through a fast and robust approach. The OFF is derived from the definition of optical flow and is orthogonal to the optical flow. The derivation also provides theoretical support for using the difference between two frames. By directly calculating pixel-wise spatio-temporal gradients of the deep feature maps, the OFF could be embedded in any existing CNN based video action recognition framework with only a slight additional cost. It enables the CNN to extract spatiotemporal information, especially the temporal information between frames simultaneously. This simple but powerful idea is validated by experimental results. The network with OFF fed only by RGB inputs achieves a competitive accuracy of 93.3% on UCF-101, which is comparable with the result obtained by two streams (RGB and optical flow), but is 15 times faster in speed. Experimental results also show that OFF is complementary to other motion modalities such as optical flow. When the proposed method is plugged into the state-of-the-art video action recognition framework, it has 96.0% and 74.2% accuracy on UCF-101 and HMDB-51 respectively. The code for this project is available at: https://github.com/kevin-ssy/Optical-Flow-Guided-Feature

261 citations

References
Proceedings Article
03 Dec 2012
TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overriding in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Proceedings ArticleDOI
Jia Deng1, Wei Dong1, Richard Socher1, Li-Jia Li1, Kai Li1, Li Fei-Fei1 
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations

Proceedings Article
Sergey Ioffe1, Christian Szegedy1
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.

30,843 citations

Posted Content
TL;DR: Caffe as discussed by the authors is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Abstract: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU ($\approx$ 2.5 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.

12,531 citations

Frequently Asked Questions (15)
Q1. What contributions have the authors mentioned in the paper "Deep temporal linear encoding networks" ?

The authors present a new video representation, called temporal linear encoding ( TLE ) and embedded inside of CNNs as a new layer, which captures the appearance and motion throughout entire videos. The authors conduct experiments on two challenging human action datasets: HMDB51 and UCF101. 

In future work, concerning the spatial and temporal segment aggregation, the authors plan to further investigate architectural alternatives. 

The goal of the paper is not only to achieve high performance, but also to show that TLEs are computationally efficient, robust, and compact. 

Similar to the two-stream ConvNets, element-wise multiplication performs better than the other candidate functions, and was therefore selected as the default aggregation function.

A temporal aggregation function T : S_1, S_2, ..., S_K → X aggregates the K temporal feature maps to output an encoded feature map X.

On top of the challenges that make object class recognition hard, there are issues like camera motion and the continuously changing viewpoints that come with it.

In total, the authors sample 5 RGB frames or stacks of optical flow frames (i.e. 15 frames for the three-segments in total) from the whole video. 

Their proposed temporal linear encoding captures more expressive interactions between the segments across entire videos, and encodes these interactions into a compact representation for video-level prediction. 

The feature maps are matrices {S_1, S_2, ..., S_K} of size S ∈ ℝ^(h×w×c), where h, w and c denote the height, width, and number of channels of the CNN feature maps.

One way to avoid processing the abundance of input frames is by extracting a fixed number of shorter clips, evenly distributed over the video [25, 33]. 

For TLE two-stream ConvNet testing, the authors extract, at a time, 1 RGB frame or 10 optical flow frames from each part and feed these into the 3-segment network sequentially.

In both steps for training spatial ConvNets, the authors initialize the learning rate with 10−3 and decrease it by a factor of 10 every 4,000 iterations. 

The performance gap to TLE:Bilinear on UCF101/HMDB51 is, however, small: 0.5/0.5% for TLE with bilinear models using the tensor sketch algorithm (TLE:Bilinear+TS), and 3.4/2.3% for TLE with fully-connected pooling (TLE:FC-Pooling).

The authors also found that aggregating the rectified output of the last convolutional feature maps achieves around the same classification performance as the non-rectified ones.

One can observe that the optical flow is better at capturing the motion information (shown in Table 2), and when combined with the appearance information in long-range temporal structure is effective to perform video-level learning.