Unsupervised Learning of Long-Term Motion Dynamics for Videos
Zelun Luo, Boya Peng, De-An Huang, Alexandre Alahi, Li Fei-Fei
Stanford University
{zelunluo,boya,dahuang,alahi,feifeili}@cs.stanford.edu
Abstract
We present an unsupervised representation learning ap-
proach that compactly encodes the motion dependencies
in videos. Given a pair of images from a video clip, our
framework learns to predict the long-term 3D motions. To
reduce the complexity of the learning framework, we pro-
pose to describe the motion as a sequence of atomic 3D
flows computed with RGB-D modality. We use a Recur-
rent Neural Network based Encoder-Decoder framework
to predict these sequences of flows. We argue that in or-
der for the decoder to reconstruct these sequences, the en-
coder must learn a robust video representation that captures
long-term motion dependencies and spatial-temporal rela-
tions. We demonstrate the effectiveness of our learned tem-
poral representations on activity classification across multi-
ple modalities and datasets such as NTU RGB+D and MSR
Daily Activity 3D. Our framework is generic to any input
modality, i.e., RGB, depth, and RGB-D videos.
1. Introduction
Human activities can often be described as a sequence of
basic motions. For instance, common activities like brush-
ing hair or waving a hand can be described as a sequence of
successive raising and lowering of the hand. Over the past
years, researchers have studied multiple strategies to effec-
tively represent motion dynamics and classify activities in
videos [38, 20, 42]. However, the existing methods suffer
from the inability to compactly encode long-term motion
dependencies. In this work, we propose to learn a represen-
tation that can describe the sequence of motions by learning
to predict it. In other words, we are interested in learning a
representation that, given a pair of video frames, can predict
the sequence of basic motions (see Figure 1). We believe
that if the learned representation has encoded enough infor-
mation to predict the motion, it is discriminative enough to
classify activities in videos. Hence, our final goal is to use
our learned representation to classify activities in videos.
Figure 1. We propose a method that learns a video representation by predicting a sequence of basic motions described as atomic 3D flows. The learned representation is then extracted from this model to recognize activities.

To classify activities, we argue that a video representation needs to capture not only the semantics, but also the motion dependencies in a long temporal sequence. Since
robust representations exist to extract semantic informa-
tion [29], we focus our effort on learning a representation
that encodes the sequence of basic motions in consecutive
frames. We define basic motions as atomic 3D flows. The
atomic 3D flows are computed by quantizing the estimated
dense 3D flows in space and time using RGB-D modal-
ity. Given a pair of images from a video clip, our frame-
work learns a representation that can predict the sequence
of atomic 3D flows.
Our learning framework is unsupervised, i.e., it does
not require human-labeled data. Not relying on labels has
the following benefits. It is not clear how many labels
are needed to understand activities in videos. For a sin-
gle image, millions of labels have been used to surpass
human-level accuracy in extracting semantic information
[29]. Consequently, we would expect that videos will re-
quire several orders of magnitude more labels to learn a rep-
resentation in a supervised setting. It would be unrealistic to collect all these labels.
Recently, a stream of unsupervised methods have been
proposed to learn temporal structures from videos. These

methods are formulated with various objectives, i.e., forms of supervision. Some focus on reconstructing future frames [32, 23],
or enforcing the learned representations to be temporally
smooth [53], while others make use of the sequential or-
der of frames sampled from a video [17, 44]. Although
they show promising results, most of the learned representations either focus heavily on capturing semantic features [17] or are not discriminative enough for classifying activities, as the output supervision is too large and coarse (e.g., frame reconstruction).
When learning a representation that predicts motions,
the following properties are needed: the output supervision needs to be i) of low dimensionality, ii) easy to parameterize, and iii) discriminative enough for other tasks. We
address the first two properties by reducing the dimension-
ality of the flows through clustering. Then, we address the
third property by augmenting the RGB videos with depth
modality to reason on 3D motions. By inferring 3D mo-
tion as opposed to view-specific 2D optical flow, our model
is able to learn an intermediate representation that captures
less view-specific spatial-temporal interactions. Compared
to 2D dense trajectories [38], our 3D motions are of much
lower dimensionality. Moreover, we focus on inferring the
sequence of basic motions that describes an activity as op-
posed to tracking keypoints over space and time. We claim
that our proposed description of the motion enables our
learning framework to predict longer motion dependencies
since the complexity of the output space is reduced. In Sec-
tion 5.2, we show quantitatively that our proposed method
outperforms previous methods on activity recognition.
The contributions of our work are as follows:
(i) We propose to use a Recurrent Neural Network based
Encoder-Decoder framework to effectively learn a rep-
resentation that predicts the sequence of basic motions.
Whereas existing unsupervised methods describe mo-
tion as either a single optical flow [37] or 2D dense tra-
jectories [38], we propose to describe it as a sequence
of atomic 3D flows over a long period of time (Section
3).
(ii) We are the first to explore and generalize unsuper-
vised learning methods across different modalities. We
study the performance of our unsupervised task, predicting the sequence of basic motions, using various input modalities: RGB → motion, depth → motion, and RGB-D → motion (Section 5.1).
(iii) We show the effectiveness of our learned represen-
tations on activity recognition tasks across multiple
modalities and datasets (Section 5.2). At the time of
its introduction, our model outperforms state-of-the-
art unsupervised methods [17, 32] across modalities
(RGB and depth).
2. Related Work
We first present previous works on unsupervised repre-
sentation learning for images and videos. Then, we give a
brief overview on existing methods that classify activities in
multi-modal videos.
Unsupervised Representation Learning. In the RGB
domain, unsupervised learning of visual representations
has shown usefulness for various supervised tasks such as
pedestrian detection and object detection [1, 26]. To ex-
ploit temporal structures, researchers have started focusing
on learning visual representations using RGB videos. Early
works such as [53] focused on inclusion of constraints via
video to autoencoder framework. The most common con-
straint is enforcing learned representations to be temporally
smooth [53]. More recently, a stream of reconstruction-
based models has been proposed. Ranzato et al. [23]
proposed a generative model that uses a recurrent neural
network to predict the next frame or interpolate between
frames. This was extended by Srivastava et al. [32] where
they utilized an LSTM Encoder-Decoder framework to reconstruct the current frame or predict future frames. Another
line of work [44] uses video data to mine patches which be-
long to the same object to learn representations useful for
distinguishing objects. Misra et al. [17] presented an ap-
proach to learn visual representation with an unsupervised
sequential verification task, and showed performance gain
for supervised tasks like activity recognition and pose esti-
mation. One common problem with these learned representations is that they mostly capture either semantic features that can already be obtained from ImageNet, or short-range activities, while neglecting long-term temporal features.
RGB-D / depth-Based Activity Recognition. Techniques
for activity recognition in this domain use appearance and motion information in order to reason about activities involving non-rigid human deformations. Feature-based approaches
such as HON4D [20], HOPC [21], and DCSF [46] capture
spatio-temporal features in a temporal grid-like structure.
Skeleton-based approaches such as [5, 22, 35, 39, 50] move
beyond such sparse grid-like pooling and focus on how to
propose good skeletal representations. Haque et al. [4] pro-
posed an alternative to skeleton representation by using a
Recurrent Attention model (RAM). Another stream of work
uses probabilistic graphical models such as Hidden Markov
Models (HMM) [49], Conditional Random Fields (CRF)
[12] or Latent Dirichlet Allocation (LDA) [45] to capture
spatial-temporal structures and learn the relations in activi-
ties from RGB-D videos. However, most of these works re-
quire a lot of feature engineering and can only model short-
range action relations. State-of-the-art methods [15, 16]
for RGB-D/depth-based activity recognition report human
level performance on well-established datasets like MSR-
DailyActivity3D [14] and CAD-120 [33]. However, these
datasets were often constructed under various constraints,

Figure 2. Our proposed learning framework based on the LSTM Encoder-Decoder method. During the encoding step, a downsampling
network (referred to as “Conv”) extracts a low-dimensionality feature from the input frames. Note that we use a pair of frames as the input
to reduce temporal ambiguity. Then, the LSTM learns a temporal representation. This representation is then decoded with the upsampling
network (referred to as “Deconv”) to output the atomic 3D flows.
including single-view, single background, or with very few
subjects. On the other hand, [27] shows that there is a big
performance gap between humans and existing methods on a
more challenging dataset [27], which contains significantly
more subjects, viewpoints, and background information.
RGB-Based Activity Recognition. The past few years
have seen great progress on activity recognition on short
clips [13, 51, 28, 38, 40]. These works can be roughly
divided into two categories. The first category focuses
on handcrafted local features and Bag of Visual Words
(BoVWs) representation. The most successful example is to
extract improved trajectory features [38] and employ Fisher
vector representation [25]. The second category utilizes
deep convolutional neural networks (ConvNets) to learn
video representations from raw data (e.g., RGB images or
optical flow fields) and train a recognition system in an end-
to-end manner. The most competitive deep learning model
is the deep two-stream ConvNets [42] and its successors
[43, 41], which combine both semantic features extracted
by ConvNets and traditional optical flow that captures mo-
tion. However, unlike image classification, the benefit of
using deep neural networks over traditional handcrafted fea-
tures is not very evident. This is potentially because super-
vised training of deep networks requires a lot of data, whilst
the current RGB activity recognition datasets are still too
small.
3. Method
The goal of our method is to learn a representation that
predicts the sequence of basic motions, which are defined as
atomic 3D flows (described in detail in Section 3.1). The
problem is formulated as follows: given a pair of images ⟨X_1, X_2⟩, our objective is to predict the sequence of atomic 3D flows over T temporal steps: ⟨Ŷ_1, Ŷ_2, ..., Ŷ_T⟩, where Ŷ_t is the atomic 3D flow at time t (see Figure 2). Note that X_i ∈ R^{H×W×D} and Ŷ_t ∈ R^{H×W×3}, where D is the number of input channels, and H, W are the height and width of the video frames, respectively. In Section 5, we experiment with inputs from three different modalities: RGB only (D = 3), depth only (D = 1), and RGB-D (D = 4).
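To make the setup concrete, the following sketch (ours, not from the paper; the frame resolution, frame rate, and array layout are illustrative assumptions) spells out the tensors a model trained under this formulation consumes and produces.

```python
import numpy as np

# Illustrative shapes only; H, W, D, T follow the notation above.
H, W = 224, 224   # frame height and width (assumed resolution)
D = 4             # input channels, e.g., RGB-D (RGB: D = 3, depth: D = 1)
T = 8             # number of predicted timesteps (Figure 3 uses 8 steps, 0.8 s)

# Input: a pair of frames <X_1, X_2> from a video clip.
X1 = np.zeros((H, W, D), dtype=np.float32)
X2 = np.zeros((H, W, D), dtype=np.float32)

# Target: a sequence of T atomic 3D flows, each of shape H x W x 3
# (flow components along the x, y, and z axes).
Y = np.zeros((T, H, W, 3), dtype=np.float32)
```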
The learned representation (the red cuboid in Figure 2) can then be used as a motion feature for activity recognition (as described in Section 4). In the remainder of this section, we first present details on how we describe basic motions. Then, we present the learning framework.
3.1. Sequence of Atomic 3D Flows
To effectively predict the sequence of basic motions, we
need to describe the motion as a low-dimensional signal
such that it is easy to parameterize and is discriminative
enough for other tasks such as activity recognition. Inspired
by the vector quantization algorithms for image compres-
sion [9], we propose to address the first goal by quantiz-
ing the estimated 3D flows in space and time, referred to as
atomic 3D flows. We address the discriminative property
by inferring a long-term sequence of 3D flows instead of a
single 3D flow. With these properties, our learned represen-
tation has the ability to capture longer term motion depen-
dencies.
Reasoning in 3D. Whereas previous unsupervised learning
methods model 2D motions in the RGB space [37], we pro-
pose to predict motions in 3D. The benefit of using depth
information along with RGB input is to overcome difficul-
ties such as variations of texture, illumination, shape, view-
point, self occlusion, clutter and occlusion. We augment the
RGB videos with depth modality and estimate the 3D flows [8] in order to reduce the level of ambiguities that exist in each independent modality.

Figure 3. Qualitative results on predicting motion: two examples of long-term flow prediction (8 timesteps, 0.8s). One example illustrates the “Getting up” activity and the other the “Sitting down” activity. A: Ground truth 3D flow. Each row corresponds to flow along the x, y, z directions respectively. B: Predicted 3D flows. C: Ground truth depth. The two frames in green boxes are the input. D: Depth reconstructed by adding the ground truth depth and the predicted flow. E: Depth reconstructed by adding the previous reconstructed depth and the predicted flow, except for the first frame, in which case the ground truth depth is used.
Reasoning with sequences. Previous unsupervised learn-
ing methods have modeled motion as either a single optical flow [37] or dense trajectories over multiple frames [38].
The first approach has the advantage of representing motion
with a single fixed size image. However, it only encodes
a short range motion. The second approach addresses the
long-term motion dependencies but is difficult to efficiently
model each keypoint. We propose a third alternative: model
the motion as a sequence of flows. Motivated by the recent success of RNNs in predicting sequences of images [34], we propose to learn to predict the sequence of flows over a long
period of time. To ease the prediction of the sequence, we
can further transform the flow into a lower dimensionality
signal (referred to as atomic flows).
Reasoning with atomic flows. Flow prediction can be
posed as a regression problem where the loss is squared
Euclidean distance between the ground truth flow and pre-
dicted flow. Unfortunately, the squared Euclidean distance
in pixel space is not a good metric, since it is not stable
to small image deformations, and the output tends to be smoothed toward the mean [23]. Instead, we formulate the flow prediction task as a classification task using Z = F(Y), where Y ∈ R^{H×W×3}, Z ∈ R^{h×w×K}, and F maps each non-overlapping M × M 3D flow patch in Y to a probability distribution over K quantized classes (i.e., atomic flows). More specifically, we assign a soft class label over K quantized codewords for each M × M flow patch, where M = H/h = W/w. After mapping each patch to a probability distribution, we get a probability distribution Z ∈ R^{h×w×K} over all patches. We investigated three
quantization methods: k-means codebook (similar to [37]),
uniform codebook, and learnable codebook (initialized with
k-means or uniform codebook, and trained end-to-end). We
got the best result with the uniform codebook; training the codebook end-to-end only leads to a minor performance gain. The k-means codebook results in inferior performance because the imbalanced flow distribution causes k-means to produce a poor clustering.
Our uniform quantization is performed as follows: we construct a codebook C ∈ R^{K×3} by quantizing the bounded 3D flow into equal-sized bins, where we have ∛K distinct classes along each axis. For each M × M 3D flow patch, we compute its mean and retrieve its k nearest neighbors (each representing one flow class) from the codebook. Empirically, we find that having the number of nearest neighbors k > 1 (soft label) yields better performance. To reconstruct the predicted flow Ŷ from the predicted distribution Ẑ, we replace each patch distribution with the corresponding linear combination of codewords. The parameters are determined empirically such that K = 125 (5 quantized bins along each dimension) and M = 8.
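As a concrete illustration of this quantization step, here is a minimal numpy sketch of the uniform codebook and the soft-label assignment, assuming a symmetric clipping bound on the flow values and inverse-distance weights over the k nearest codewords (both of which are our assumptions; the text above does not specify them).

```python
import numpy as np

def build_uniform_codebook(bins_per_axis=5, flow_bound=1.0):
    """Uniform codebook C of shape (K, 3), with K = bins_per_axis**3.

    flow_bound is an assumed clipping range; the text only states that
    the 3D flow is bounded before quantization.
    """
    centers = np.linspace(-flow_bound, flow_bound, bins_per_axis)
    grid = np.stack(np.meshgrid(centers, centers, centers, indexing="ij"), axis=-1)
    return grid.reshape(-1, 3)                              # (K, 3)

def quantize_patch(flow_patch, codebook, k=3):
    """Soft label for one M x M x 3 flow patch: a distribution over K codewords."""
    mean_flow = flow_patch.reshape(-1, 3).mean(axis=0)      # (3,)
    dists = np.linalg.norm(codebook - mean_flow, axis=1)    # (K,)
    nn = np.argsort(dists)[:k]                              # k nearest codewords
    z = np.zeros(len(codebook))
    # Inverse-distance weighting is one reasonable choice; the paper does not
    # specify how the soft label is spread over the k neighbors.
    w = 1.0 / (dists[nn] + 1e-8)
    z[nn] = w / w.sum()
    return z                                                # (K,) soft label

def reconstruct_patch_flow(z, codebook):
    """Reconstruct one 3D flow vector as a linear combination of codewords."""
    return z @ codebook                                     # (3,)
```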
3.2. Learning framework
To learn a representation that encodes the long-term mo-
tion dependencies in videos, we cast the learning framework
as a sequence-to-sequence problem. We propose to use a
Recurrent Neural Network (RNN) based Encoder-Decoder
framework to effectively learn these motion dependencies.
Given two frames, our proposed RNN predicts the sequence
of atomic 3D flows.
Figure 2 presents an overview of our learning frame-
work, which can be divided into encoding and decoding steps. During encoding, a downsampling network (referred
to as “Conv”) extracts a low-dimensionality feature from the input frames. Then, the LSTM runs through the sequence of extracted features to learn a temporal representation. This representation is then decoded with the upsampling network (“Deconv”) to output the atomic 3D flows.

Figure 4. Our proposed network architecture for activity recognition. Each pair of video frames is encoded with our learned temporal representation (fixing the weights). Then, a classification layer is trained to infer the activities.
The LSTM Encoder-Decoder framework [34] provides a
general framework for sequence-to-sequence learning prob-
lems, and its ability to capture long-term temporal depen-
dencies makes it a natural choice for this application. How-
ever, vanilla LSTMs do not take spatial correlations into
consideration. In fact, putting them between the downsampling and upsampling networks leads to much slower convergence and significantly worse performance, com-
pared to a single-step flow prediction without LSTMs. To
preserve the spatial information in intermediate represen-
tations, we use the convolutional LSTM unit [47] that has
convolutional structures in both the input-to-state and state-
to-state transitions. Here are more details on the downsam-
pling and upsampling networks:
Downsampling Network (“Conv” ). We train a Convolu-
tional Neural Network (CNN) to extract high-level features
from each input frame. The architecture of our network is
similar to the standard VGG-16 network [29] with the fol-
lowing modifications. Our network is fully convolutional,
with the first two fully connected layers converted to con-
volution with the same number of parameters to preserve
spatial information. The last softmax layer is replaced by a
convolutional layer with a filter of size 1 × 1 × 32, result-
ing in a downsampled output of shape 7 × 7 × 32. A batch
normalization layer [7] is added to the output of every con-
volutional layer. In addition, the number of input channels
in the first convolutional layer is adapted according to the
modality.
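A rough sketch of such a downsampling network is given below (PyTorch-style, ours). Only the modifications stated above are reflected faithfully: a fully convolutional VGG-like backbone, batch normalization after every convolution, a modality-dependent number of input channels, and a final 1 × 1 convolution producing a 7 × 7 × 32 output; the remaining layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class DownsamplingNet(nn.Module):
    """VGG-16-like fully convolutional encoder (sketch, abbreviated backbone)."""

    def __init__(self, in_channels=4):                  # RGB: 3, depth: 1, RGB-D: 4
        super().__init__()

        def block(cin, cout, n=2):
            layers = []
            for i in range(n):
                layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                           nn.BatchNorm2d(cout),
                           nn.ReLU(inplace=True)]
            layers += [nn.MaxPool2d(2)]
            return nn.Sequential(*layers)

        self.features = nn.Sequential(
            block(in_channels, 64), block(64, 128), block(128, 256),
            block(256, 512), block(512, 512),
        )
        # fc6/fc7 converted to convolutions, then a 1x1 conv with 32 outputs.
        self.head = nn.Sequential(
            nn.Conv2d(512, 4096, 7, padding=3), nn.BatchNorm2d(4096), nn.ReLU(inplace=True),
            nn.Conv2d(4096, 4096, 1), nn.BatchNorm2d(4096), nn.ReLU(inplace=True),
            nn.Conv2d(4096, 32, 1),                      # -> 7 x 7 x 32 for 224 x 224 inputs
        )

    def forward(self, x):                                # x: (B, D, 224, 224)
        return self.head(self.features(x))               # -> (B, 32, 7, 7)
```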
Upsampling Network (“Deconv”). We use an upsampling
CNN with fractionally-strided convolution [31] to perform
spatial upsampling and atomic 3D flow prediction. A stack
of five fractionally-strided convolutions upsamples each input to the predicted distribution Ẑ ∈ R^{h×w×K}, where Ẑ_{ij} represents the unscaled log probabilities over the (i, j)-th flow patch.

Figure 5. Motion prediction error on NTU RGB+D. We plot the per-pixel root mean square error of estimating the atomic 3D flows with respect to time across different input modalities.
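Putting the pieces together, the encoder-decoder could be wired roughly as in the following sketch (ours; the hidden size, the zero decoder input at every step, and the exact interfaces of the `conv`/`deconv` modules are assumptions rather than the authors' published configuration).

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell [47]: gates are computed with
    convolutions so the hidden state keeps its spatial layout."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

class MotionSeq2Seq(nn.Module):
    """Sketch of the encoder-decoder: encode the Conv features of the frame
    pair, then unroll the decoder for T steps and map each hidden state to
    atomic-flow logits with the Deconv network (assumed interfaces)."""
    def __init__(self, conv, deconv, feat_ch=32, hid_ch=256, T=8):
        super().__init__()
        self.conv, self.deconv, self.T = conv, deconv, T
        self.enc = ConvLSTMCell(feat_ch, hid_ch)
        self.dec = ConvLSTMCell(feat_ch, hid_ch)

    def forward(self, x1, x2):
        f1, f2 = self.conv(x1), self.conv(x2)             # (B, 32, 7, 7) each
        B, _, H, W = f1.shape
        h = torch.zeros(B, self.enc.hid_ch, H, W, device=f1.device)
        c = torch.zeros_like(h)
        for f in (f1, f2):                                # encoding step
            h, c = self.enc(f, (h, c))
        outputs, dec_in = [], torch.zeros_like(f1)        # zero decoder input (assumed)
        for _ in range(self.T):                           # decoding step
            h, c = self.dec(dec_in, (h, c))
            outputs.append(self.deconv(h))                # (B, K, h, w) logits per step
        return torch.stack(outputs, dim=1)                # (B, T, K, h, w)
```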
3.3. Loss Function
Finally, we define a loss function that is stable and easy
to optimize for motion prediction. As described in section
3.1, we define the cross-entropy loss between the ground
truth distribution Z over the atomic 3D flow space C and
the predicted distribution Ẑ:

L_ce(Z, Ẑ) = − Σ_{i=1}^{h} Σ_{j=1}^{w} Σ_{k=1}^{K} w_k Z_{ijk} log Ẑ_{ijk}    (1)

where w ∈ R^K is a weighting vector for rebalancing the loss based on the frequency of each atomic flow vector.
The distribution of atomic 3D flows is strongly biased to-
wards classes with small flow magnitude, as there is little to
no motion in the background. Without accounting for this,
the loss function is dominated by classes with very small
flow magnitudes, causing the model to predict only class
0 which represents no motion. Following the approach in
[52], we define the class weights w as follows:

w ∝ ((1 − λ) p̃ + λ/K)^{−1},   Σ_{k=1}^{K} p̃_k w_k = 1    (2)

where p̃ is the empirical distribution of the codewords in codebook C, and λ is the smoothing weight.
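For concreteness, here is a minimal sketch of this rebalanced loss and of the weights of Eq. (2), under the shapes used above (PyTorch-style, ours; the batch dimension, the reduction over patches, and the value of λ are assumptions).

```python
import torch
import torch.nn.functional as F

def rebalanced_cross_entropy(z_hat, z, w):
    """Class-rebalanced cross-entropy of Eq. (1) (sketch).

    z_hat : (B, K, h, w) unscaled log probabilities from the Deconv network
    z     : (B, K, h, w) soft ground-truth distributions over the K atomic flows
    w     : (K,) rebalancing weights from Eq. (2)
    """
    log_p = F.log_softmax(z_hat, dim=1)                   # normalize the logits
    loss = -(w.view(1, -1, 1, 1) * z * log_p).sum(dim=1)  # sum over the K classes
    return loss.sum(dim=(1, 2)).mean()                    # sum over patches, mean over batch

def class_weights(p_emp, lam=0.5):
    """Weights of Eq. (2): w ∝ ((1 - λ) p̃ + λ/K)^(-1), normalized so that
    sum_k p̃_k w_k = 1. λ = 0.5 is an assumed value; the paper sets it empirically."""
    K = p_emp.numel()
    w = 1.0 / ((1.0 - lam) * p_emp + lam / K)
    return w / (p_emp * w).sum()
```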
4. Activity recognition
The final goal of our learned representation is to clas-
sify activities in videos. We use our encoder architecture (Figure 4) to encode each pair of video frames with the learned temporal representation (fixing the weights), and train a classification layer on top to infer the activities.

References

D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015.
K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015.
O. Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.