Unsupervised Learning of Long-Term Motion Dynamics for Videos
Zelun Luo, Boya Peng, De-An Huang, Alexandre Alahi, Li Fei-Fei
Stanford University
{zelunluo,boya,dahuang,alahi,feifeili}@cs.stanford.edu
Abstract
We present an unsupervised representation learning ap-
proach that compactly encodes the motion dependencies
in videos. Given a pair of images from a video clip, our
framework learns to predict the long-term 3D motions. To
reduce the complexity of the learning framework, we pro-
pose to describe the motion as a sequence of atomic 3D
flows computed with RGB-D modality. We use a Recur-
rent Neural Network based Encoder-Decoder framework
to predict these sequences of flows. We argue that in or-
der for the decoder to reconstruct these sequences, the en-
coder must learn a robust video representation that captures
long-term motion dependencies and spatial-temporal rela-
tions. We demonstrate the effectiveness of our learned tem-
poral representations on activity classification across multi-
ple modalities and datasets such as NTU RGB+D and MSR
Daily Activity 3D. Our framework is generic to any input
modality, i.e., RGB, depth, and RGB-D videos.
1. Introduction
Human activities can often be described as a sequence of
basic motions. For instance, common activities like brush-
ing hair or waving a hand can be described as a sequence of
successive raising and lowering of the hand. Over the past
years, researchers have studied multiple strategies to effec-
tively represent motion dynamics and classify activities in
videos [38, 20, 42]. However, the existing methods suffer
from the inability to compactly encode long-term motion
dependencies. In this work, we propose to learn a represen-
tation that can describe the sequence of motions by learning
to predict it. In other words, we are interested in learning a
representation that, given a pair of video frames, can predict
the sequence of basic motions (see Figure 1). We believe
that if the learned representation has encoded enough infor-
mation to predict the motion, it is discriminative enough to
classify activities in videos. Hence, our final goal is to use
our learned representation to classify activities in videos.
Figure 1. We propose a method that learns a video representation by predicting a sequence of basic motions described as atomic 3D flows. The learned representation is then extracted from this model to recognize activities.

To classify activities, we argue that a video representation needs to capture not only the semantics, but also the motion dependencies in a long temporal sequence. Since
robust representations exist to extract semantic informa-
tion [29], we focus our effort on learning a representation
that encodes the sequence of basic motions in consecutive
frames. We define basic motions as atomic 3D flows. The
atomic 3D flows are computed by quantizing the estimated
dense 3D flows in space and time using RGB-D modal-
ity. Given a pair of images from a video clip, our frame-
work learns a representation that can predict the sequence
of atomic 3D flows.
Our learning framework is unsupervised, i.e., it does
not require human-labeled data. Not relying on labels has
the following benefits. It is not clear how many labels
are needed to understand activities in videos. For a sin-
gle image, millions of labels have been used to surpass
human-level accuracy in extracting semantic information
[29]. Consequently, we would expect that videos will re-
quire several orders of magnitude more labels to learn a rep-
resentation in a supervised setting. It would be unrealistic to collect all these labels.
Recently, a stream of unsupervised methods have been
proposed to learn temporal structures from videos. These

methods are formulated with various objectives, i.e., forms of supervision. Some focus on reconstructing future frames [32, 23],
or enforcing the learned representations to be temporally
smooth [53], while others make use of the sequential or-
der of frames sampled from a video [17, 44]. Although
they show promising results, most of the learned representations either focus heavily on capturing semantic features [17] or are not discriminative enough for classifying activities, as the output supervision is too large and coarse (e.g., frame reconstruction).
When learning a representation that predicts motions,
the following properties are needed: the output supervision needs to be i) of low dimensionality, ii) easy to parameterize, and iii) discriminative enough for other tasks. We
address the first two properties by reducing the dimension-
ality of the flows through clustering. Then, we address the
third property by augmenting the RGB videos with depth
modality to reason on 3D motions. By inferring 3D mo-
tion as opposed to view-specific 2D optical flow, our model
is able to learn an intermediate representation that captures
less view-specific spatial-temporal interactions. Compared
to 2D dense trajectories [38], our 3D motions are of much
lower dimensionality. Moreover, we focus on inferring the
sequence of basic motions that describes an activity as op-
posed to tracking keypoints over space and time. We claim
that our proposed description of the motion enables our
learning framework to predict longer motion dependencies
since the complexity of the output space is reduced. In Sec-
tion 5.2, we show quantitatively that our proposed method
outperforms previous methods on activity recognition.
The contributions of our work are as follows:
(i) We propose to use a Recurrent Neural Network based
Encoder-Decoder framework to effectively learn a rep-
resentation that predicts the sequence of basic motions.
Whereas existing unsupervised methods describe mo-
tion as either a single optical flow [37] or 2D dense tra-
jectories [38], we propose to describe it as a sequence
of atomic 3D flows over a long period of time (Section
3).
(ii) We are the first to explore and generalize unsuper-
vised learning methods across different modalities. We
study the performance of our unsupervised task, predicting the sequence of basic motions, using various input modalities: RGB → motion, depth → motion, and RGB-D → motion (Section 5.1).
(iii) We show the effectiveness of our learned represen-
tations on activity recognition tasks across multiple
modalities and datasets (Section 5.2). At the time of
its introduction, our model outperforms state-of-the-
art unsupervised methods [17, 32] across modalities
(RGB and depth).
2. Related Work
We first present previous works on unsupervised repre-
sentation learning for images and videos. Then, we give a
brief overview on existing methods that classify activities in
multi-modal videos.
Unsupervised Representation Learning. In the RGB
domain, unsupervised learning of visual representations
has shown usefulness for various supervised tasks such as
pedestrian detection and object detection [1, 26]. To ex-
ploit temporal structures, researchers have started focusing
on learning visual representations using RGB videos. Early
works such as [53] focused on inclusion of constraints via
video to autoencoder framework. The most common con-
straint is enforcing learned representations to be temporally
smooth [53]. More recently, a stream of reconstruction-
based models has been proposed. Ranzato et al. [23]
proposed a generative model that uses a recurrent neural
network to predict the next frame or interpolate between
frames. This was extended by Srivastava et al. [32] where
they utilized an LSTM Encoder-Decoder framework to reconstruct the current frame or predict future frames. Another
line of work [44] uses video data to mine patches which be-
long to the same object to learn representations useful for
distinguishing objects. Misra et al. [17] presented an ap-
proach to learn visual representation with an unsupervised
sequential verification task, and showed performance gain
for supervised tasks like activity recognition and pose esti-
mation. One common problem with these learned representations is that they mostly capture either semantic features that can already be obtained from ImageNet, or short-range activities, while neglecting long-term temporal features.
RGB-D / depth-Based Activity Recognition. Techniques
for activity recognition in this domain use appearance and motion information in order to reason about activities involving non-rigid human deformations. Feature-based approaches
such as HON4D [20], HOPC [21], and DCSF [46] capture
spatio-temporal features in a temporal grid-like structure.
Skeleton-based approaches such as [5, 22, 35, 39, 50] move
beyond such sparse grid-like pooling and focus on how to
propose good skeletal representations. Haque et al. [4] pro-
posed an alternative to skeleton representation by using a
Recurrent Attention model (RAM). Another stream of work
uses probabilistic graphical models such as Hidden Markov
Models (HMM) [49], Conditional Random Fields (CRF)
[12] or Latent Dirichlet Allocation (LDA) [45] to capture
spatial-temporal structures and learn the relations in activi-
ties from RGB-D videos. However, most of these works re-
quire a lot of feature engineering and can only model short-
range action relations. State-of-the-art methods [15, 16]
for RGB-D/depth-based activity recognition report human
level performance on well-established datasets like MSR-
DailyActivity3D [14] and CAD-120 [33]. However, these
datasets were often constructed under various constraints,

Figure 2. Our proposed learning framework based on the LSTM Encoder-Decoder method. During the encoding step, a downsampling
network (referred to as “Conv”) extracts a low-dimensionality feature from the input frames. Note that we use a pair of frames as the input
to reduce temporal ambiguity. Then, the LSTM learns a temporal representation. This representation is then decoded with the upsampling
network (referred to as “Deconv”) to output the atomic 3D flows.
including single-view, single background, or with very few
subjects. On the other hand, [27] shows that there is a big
performance gap between humans and existing methods on a
more challenging dataset [27], which contains significantly
more subjects, viewpoints, and background information.
RGB-Based Activity Recognition. The past few years
have seen great progress on activity recognition on short
clips [13, 51, 28, 38, 40]. These works can be roughly
divided into two categories. The first category focuses
on handcrafted local features and Bag of Visual Words
(BoVWs) representation. The most successful example is to
extract improved trajectory features [38] and employ Fisher
vector representation [25]. The second category utilizes
deep convolutional neural networks (ConvNets) to learn
video representations from raw data (e.g., RGB images or
optical flow fields) and train a recognition system in an end-
to-end manner. The most competitive deep learning model
is the deep two-stream ConvNets [42] and its successors
[43, 41], which combine both semantic features extracted
by ConvNets and traditional optical flow that captures mo-
tion. However, unlike image classification, the benefit of
using deep neural networks over traditional handcrafted fea-
tures is not very evident. This is potentially because super-
vised training of deep networks requires a lot of data, whilst
the current RGB activity recognition datasets are still too
small.
3. Method
The goal of our method is to learn a representation that
predicts the sequence of basic motions, which are defined as
atomic 3D flows (described in detail in Section 3.1). The
problem is formulated as follows: given a pair of images ⟨X_1, X_2⟩, our objective is to predict the sequence of atomic 3D flows over T temporal steps: ⟨Ŷ_1, Ŷ_2, ..., Ŷ_T⟩, where Ŷ_t is the atomic 3D flow at time t (see Figure 2). Note that X_i ∈ R^{H×W×D} and Ŷ_t ∈ R^{H×W×3}, where D is the number of input channels, and H, W are the height and width of the video frames, respectively. In Section 5, we experiment with inputs from three different modalities: RGB only (D = 3), depth only (D = 1), and RGB-D (D = 4).
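To make the setup concrete, the following sketch (ours, not from the paper; the frame resolution, frame rate, and array layout are illustrative assumptions) spells out the tensors a model trained under this formulation consumes and produces.

```python
import numpy as np

# Illustrative shapes only; H, W, D, T follow the notation above.
H, W = 224, 224   # frame height and width (assumed resolution)
D = 4             # input channels, e.g., RGB-D (RGB: D = 3, depth: D = 1)
T = 8             # number of predicted timesteps (Figure 3 uses 8 steps, 0.8 s)

# Input: a pair of frames <X_1, X_2> from a video clip.
X1 = np.zeros((H, W, D), dtype=np.float32)
X2 = np.zeros((H, W, D), dtype=np.float32)

# Target: a sequence of T atomic 3D flows, each of shape H x W x 3
# (flow components along the x, y, and z axes).
Y = np.zeros((T, H, W, 3), dtype=np.float32)
```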
The learned representation (the red cuboid in Figure 2) can then be used as a motion feature for activity recognition (as described in Section 4). In the remainder of this section, we first present details on how we describe basic motions. Then, we present the learning framework.
3.1. Sequence of Atomic 3D Flows
To effectively predict the sequence of basic motions, we
need to describe the motion as a low-dimensional signal
such that it is easy to parameterize and is discriminative
enough for other tasks such as activity recognition. Inspired
by the vector quantization algorithms for image compres-
sion [9], we propose to address the first goal by quantiz-
ing the estimated 3D flows in space and time, referred to as
atomic 3D flows. We address the discriminative property
by inferring a long-term sequence of 3D flows instead of a
single 3D flow. With these properties, our learned represen-
tation has the ability to capture longer term motion depen-
dencies.
Reasoning in 3D. Whereas previous unsupervised learning
methods model 2D motions in the RGB space [37], we pro-
pose to predict motions in 3D. The benefit of using depth
information along with RGB input is to overcome difficul-
ties such as variations of texture, illumination, shape, view-
point, self occlusion, clutter and occlusion. We augment the
RGB videos with depth modality and estimate the 3D flows [8] in order to reduce the level of ambiguities that exist in each independent modality.

Figure 3. Qualitative results on predicting motion: two examples of long-term flow prediction (8 timesteps, 0.8s). One example illustrates the “Getting up” activity and the other the “Sitting down” activity. A: Ground truth 3D flow. Each row corresponds to flow along the x, y, z directions respectively. B: Predicted 3D flows. C: Ground truth depth. The two frames in green boxes are the input. D: Depth reconstructed by adding the ground truth depth and the predicted flow. E: Depth reconstructed by adding the previous reconstructed depth and the predicted flow, except for the first frame, in which case the ground truth depth is used.
Reasoning with sequences. Previous unsupervised learn-
ing methods have modeled motion as either a single optical flow [37] or dense trajectories over multiple frames [38].
The first approach has the advantage of representing motion
with a single fixed size image. However, it only encodes
a short range motion. The second approach addresses the
long-term motion dependencies but is difficult to efficiently
model each keypoint. We propose a third alternative: model
the motion as a sequence of flows. Motivated by the recent success of RNNs in predicting sequences of images [34], we propose to learn to predict the sequence of flows over a long
period of time. To ease the prediction of the sequence, we
can further transform the flow into a lower dimensionality
signal (referred to as atomic flows).
Reasoning with atomic flows. Flow prediction can be
posed as a regression problem where the loss is squared
Euclidean distance between the ground truth flow and pre-
dicted flow. Unfortunately, the squared Euclidean distance
in pixel space is not a good metric, since it is not stable
to small image deformations, and the output tends to be smoothed toward the mean [23]. Instead, we formulate the flow prediction task as a classification task using Z = F(Y), where Y ∈ R^{H×W×3}, Z ∈ R^{h×w×K}, and F maps each non-overlapping M × M 3D flow patch in Y to a probability distribution over K quantized classes (i.e., atomic flows). More specifically, we assign a soft class label over K quantized codewords for each M × M flow patch, where M = H/h = W/w. After mapping each patch to a probability distribution, we get a probability distribution Z ∈ R^{h×w×K} over all patches. We investigated three
quantization methods: k-means codebook (similar to [37]),
uniform codebook, and learnable codebook (initialized with
k-means or uniform codebook, and trained end-to-end). We
got the best result with the uniform codebook; training the codebook end-to-end only leads to a minor performance gain. The k-means codebook results in inferior performance because the imbalanced flow distribution causes k-means to produce a poor clustering.
Our uniform quantization is performed as follows: we construct a codebook C ∈ R^{K×3} by quantizing the bounded 3D flow into equal-sized bins, where we have ∛K distinct classes along each axis. For each M × M 3D flow patch, we compute its mean and retrieve its k nearest neighbors (each representing one flow class) from the codebook. Empirically, we find that having the number of nearest neighbors k > 1 (soft label) yields better performance. To reconstruct the predicted flow Ŷ from the predicted distribution Ẑ, we replace each patch distribution with the corresponding linear combination of codewords. The parameters are determined empirically such that K = 125 (5 quantized bins along each dimension) and M = 8.
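As a concrete illustration of this quantization step, here is a minimal numpy sketch of the uniform codebook and the soft-label assignment, assuming a symmetric clipping bound on the flow values and inverse-distance weights over the k nearest codewords (both of which are our assumptions; the text above does not specify them).

```python
import numpy as np

def build_uniform_codebook(bins_per_axis=5, flow_bound=1.0):
    """Uniform codebook C of shape (K, 3), with K = bins_per_axis**3.

    flow_bound is an assumed clipping range; the text only states that
    the 3D flow is bounded before quantization.
    """
    centers = np.linspace(-flow_bound, flow_bound, bins_per_axis)
    grid = np.stack(np.meshgrid(centers, centers, centers, indexing="ij"), axis=-1)
    return grid.reshape(-1, 3)                              # (K, 3)

def quantize_patch(flow_patch, codebook, k=3):
    """Soft label for one M x M x 3 flow patch: a distribution over K codewords."""
    mean_flow = flow_patch.reshape(-1, 3).mean(axis=0)      # (3,)
    dists = np.linalg.norm(codebook - mean_flow, axis=1)    # (K,)
    nn = np.argsort(dists)[:k]                              # k nearest codewords
    z = np.zeros(len(codebook))
    # Inverse-distance weighting is one reasonable choice; the paper does not
    # specify how the soft label is spread over the k neighbors.
    w = 1.0 / (dists[nn] + 1e-8)
    z[nn] = w / w.sum()
    return z                                                # (K,) soft label

def reconstruct_patch_flow(z, codebook):
    """Reconstruct one 3D flow vector as a linear combination of codewords."""
    return z @ codebook                                     # (3,)
```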
3.2. Learning framework
To learn a representation that encodes the long-term mo-
tion dependencies in videos, we cast the learning framework
as a sequence-to-sequence problem. We propose to use a
Recurrent Neural Network (RNN) based Encoder-Decoder
framework to effectively learn these motion dependencies.
Given two frames, our proposed RNN predicts the sequence
of atomic 3D flows.
Figure 2 presents an overview of our learning frame-
work, which can be divided into encoding and decoding steps. During encoding, a downsampling network (referred
to as “Conv”) extracts a low-dimensionality feature from the input frames. Then, the LSTM runs through the sequence of extracted features to learn a temporal representation. This representation is then decoded with the upsampling network (“Deconv”) to output the atomic 3D flows.

Figure 4. Our proposed network architecture for activity recognition. Each pair of video frames is encoded with our learned temporal representation (fixing the weights). Then, a classification layer is trained to infer the activities.
The LSTM Encoder-Decoder framework [34] provides a
general framework for sequence-to-sequence learning prob-
lems, and its ability to capture long-term temporal depen-
dencies makes it a natural choice for this application. How-
ever, vanilla LSTMs do not take spatial correlations into
consideration. In fact, putting them between the downsampling and upsampling networks leads to much slower convergence and significantly worse performance, com-
pared to a single-step flow prediction without LSTMs. To
preserve the spatial information in intermediate represen-
tations, we use the convolutional LSTM unit [47] that has
convolutional structures in both the input-to-state and state-
to-state transitions. Here are more details on the downsam-
pling and upsampling networks:
Downsampling Network (“Conv” ). We train a Convolu-
tional Neural Network (CNN) to extract high-level features
from each input frame. The architecture of our network is
similar to the standard VGG-16 network [29] with the fol-
lowing modifications. Our network is fully convolutional,
with the first two fully connected layers converted to con-
volution with the same number of parameters to preserve
spatial information. The last softmax layer is replaced by a
convolutional layer with a filter of size 1 × 1 × 32, result-
ing in a downsampled output of shape 7 × 7 × 32. A batch
normalization layer [7] is added to the output of every con-
volutional layer. In addition, the number of input channels
in the first convolutional layer is adapted according to the
modality.
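A rough sketch of such a downsampling network is given below (PyTorch-style, ours). Only the modifications stated above are reflected faithfully: a fully convolutional VGG-like backbone, batch normalization after every convolution, a modality-dependent number of input channels, and a final 1 × 1 convolution producing a 7 × 7 × 32 output; the remaining layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class DownsamplingNet(nn.Module):
    """VGG-16-like fully convolutional encoder (sketch, abbreviated backbone)."""

    def __init__(self, in_channels=4):                  # RGB: 3, depth: 1, RGB-D: 4
        super().__init__()

        def block(cin, cout, n=2):
            layers = []
            for i in range(n):
                layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                           nn.BatchNorm2d(cout),
                           nn.ReLU(inplace=True)]
            layers += [nn.MaxPool2d(2)]
            return nn.Sequential(*layers)

        self.features = nn.Sequential(
            block(in_channels, 64), block(64, 128), block(128, 256),
            block(256, 512), block(512, 512),
        )
        # fc6/fc7 converted to convolutions, then a 1x1 conv with 32 outputs.
        self.head = nn.Sequential(
            nn.Conv2d(512, 4096, 7, padding=3), nn.BatchNorm2d(4096), nn.ReLU(inplace=True),
            nn.Conv2d(4096, 4096, 1), nn.BatchNorm2d(4096), nn.ReLU(inplace=True),
            nn.Conv2d(4096, 32, 1),                      # -> 7 x 7 x 32 for 224 x 224 inputs
        )

    def forward(self, x):                                # x: (B, D, 224, 224)
        return self.head(self.features(x))               # -> (B, 32, 7, 7)
```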
Upsampling Network (“Deconv”). We use an upsampling
CNN with fractionally-strided convolution [31] to perform
spatial upsampling and atomic 3D flow prediction. A stack
of five fractionally-strided convolutions upsamples each input to the predicted distribution Ẑ ∈ R^{h×w×K}, where Ẑ_{ij} represents the unscaled log probabilities over the (i, j)-th flow patch.

Figure 5. Motion prediction error on NTU RGB+D. We plot the per-pixel root mean square error of estimating the atomic 3D flows with respect to time across different input modalities.
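Putting the pieces together, the encoder-decoder could be wired roughly as in the following sketch (ours; the hidden size, the zero decoder input at every step, and the exact interfaces of the `conv`/`deconv` modules are assumptions rather than the authors' published configuration).

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell [47]: gates are computed with
    convolutions so the hidden state keeps its spatial layout."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

class MotionSeq2Seq(nn.Module):
    """Sketch of the encoder-decoder: encode the Conv features of the frame
    pair, then unroll the decoder for T steps and map each hidden state to
    atomic-flow logits with the Deconv network (assumed interfaces)."""
    def __init__(self, conv, deconv, feat_ch=32, hid_ch=256, T=8):
        super().__init__()
        self.conv, self.deconv, self.T = conv, deconv, T
        self.enc = ConvLSTMCell(feat_ch, hid_ch)
        self.dec = ConvLSTMCell(feat_ch, hid_ch)

    def forward(self, x1, x2):
        f1, f2 = self.conv(x1), self.conv(x2)             # (B, 32, 7, 7) each
        B, _, H, W = f1.shape
        h = torch.zeros(B, self.enc.hid_ch, H, W, device=f1.device)
        c = torch.zeros_like(h)
        for f in (f1, f2):                                # encoding step
            h, c = self.enc(f, (h, c))
        outputs, dec_in = [], torch.zeros_like(f1)        # zero decoder input (assumed)
        for _ in range(self.T):                           # decoding step
            h, c = self.dec(dec_in, (h, c))
            outputs.append(self.deconv(h))                # (B, K, h, w) logits per step
        return torch.stack(outputs, dim=1)                # (B, T, K, h, w)
```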
3.3. Loss Function
Finally, we define a loss function that is stable and easy
to optimize for motion prediction. As described in section
3.1, we define the cross-entropy loss between the ground
truth distribution Z over the atomic 3D flow space C and
the predicted distribution Ẑ:

L_ce(Z, Ẑ) = − Σ_{i=1}^{h} Σ_{j=1}^{w} Σ_{k=1}^{K} w_k Z_{ijk} log Ẑ_{ijk}    (1)

where w ∈ R^K is a weighting vector for rebalancing the loss based on the frequency of each atomic flow vector.
The distribution of atomic 3D flows is strongly biased to-
wards classes with small flow magnitude, as there is little to
no motion in the background. Without accounting for this,
the loss function is dominated by classes with very small
flow magnitudes, causing the model to predict only class
0 which represents no motion. Following the approach in
[52], we define the class weights w as follows:

w ∝ ((1 − λ) p̃ + λ/K)^{−1},   Σ_{k=1}^{K} p̃_k w_k = 1    (2)

where p̃ is the empirical distribution of the codewords in codebook C, and λ is the smoothing weight.
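For concreteness, here is a minimal sketch of this rebalanced loss and of the weights of Eq. (2), under the shapes used above (PyTorch-style, ours; the batch dimension, the reduction over patches, and the value of λ are assumptions).

```python
import torch
import torch.nn.functional as F

def rebalanced_cross_entropy(z_hat, z, w):
    """Class-rebalanced cross-entropy of Eq. (1) (sketch).

    z_hat : (B, K, h, w) unscaled log probabilities from the Deconv network
    z     : (B, K, h, w) soft ground-truth distributions over the K atomic flows
    w     : (K,) rebalancing weights from Eq. (2)
    """
    log_p = F.log_softmax(z_hat, dim=1)                   # normalize the logits
    loss = -(w.view(1, -1, 1, 1) * z * log_p).sum(dim=1)  # sum over the K classes
    return loss.sum(dim=(1, 2)).mean()                    # sum over patches, mean over batch

def class_weights(p_emp, lam=0.5):
    """Weights of Eq. (2): w ∝ ((1 - λ) p̃ + λ/K)^(-1), normalized so that
    sum_k p̃_k w_k = 1. λ = 0.5 is an assumed value; the paper sets it empirically."""
    K = p_emp.numel()
    w = 1.0 / ((1.0 - lam) * p_emp + lam / K)
    return w / (p_emp * w).sum()
```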
4. Activity recognition
The final goal of our learned representation is to clas-
sify activities in videos. We use our encoder architecture (Figure 4) to encode each pair of video frames with the learned temporal representation (fixing the weights), and train a classification layer on top to infer the activities.

References

D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015.
K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015.
O. Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.