Proceedings ArticleDOI

Deep Learning on Lie Groups for Skeleton-Based Action Recognition

TL;DR: The Lie group structure is incorporated into a deep network architecture to learn more appropriate Lie group features for 3D action recognition, and a logarithm mapping layer is proposed to map the resulting manifold data into a tangent space that facilitates the application of regular output layers for the final classification.
Abstract: In recent years, skeleton-based action recognition has become a popular 3D classification problem. State-of-the-art methods typically first represent each motion sequence as a high-dimensional trajectory on a Lie group with an additional dynamic time warping, and then shallowly learn favorable Lie group features. In this paper we incorporate the Lie group structure into a deep network architecture to learn more appropriate Lie group features for 3D action recognition. Within the network structure, we design rotation mapping layers to transform the input Lie group features into desirable ones, which are aligned better in the temporal domain. To reduce the high feature dimensionality, the architecture is equipped with rotation pooling layers for the elements on the Lie group. Furthermore, we propose a logarithm mapping layer to map the resulting manifold data into a tangent space that facilitates the application of regular output layers for the final classification. Evaluations of the proposed network for standard 3D human action recognition datasets clearly demonstrate its superiority over existing shallow Lie group feature learning methods as well as most conventional deep learning methods.

Summary (2 min read)

1. Introduction

  • The authors focus on studying manifold-based approaches [41, 3, 42] to learn more appropriate Lie group representations of skeletal action data, which have achieved state-of-the-art performance on some 3D human action recognition benchmarks.
  • To handle the temporal misalignment caused by speed variations, these methods typically employ dynamic time warping (DTW), as originally used in speech processing [30].
  • To address the high dimensionality of such representations, [41, 3, 42] attempt to first flatten the underlying manifold via tangent approximation or rolling maps, and then exploit SVM or PCA-like methods to learn features in the resulting flattened space.
  • The proposed network provides a paradigm to incorporate the Lie group structure into deep learning, which generalizes the traditional neural network model to non-Euclidean Lie groups.

2. Relevant Work

  • In particular, two sub-classes of the general Lie group learning theories were studied in detail, tackling first-order (gradient-based) and second-order (non-gradient-based) learning. [15] introduced deep symmetry networks, a generalization of convolutional networks that forms feature maps over arbitrary symmetry groups that are basically Lie groups.
  • The symnets utilize kernel-based interpolation to tractably tie parameters and pool over symmetry spaces of any dimension.
  • [10] proposed a spectral version of convolutional networks to handle graphs.
  • For shape analysis, [28] proposed a ‘geodesic convolution’ on local geodesic coordinate systems to extract local patches on the shape manifold.

3. Lie Group Representation for Skeletal Data

  • The local coordinate system of body part e_n is calculated by rotating with minimum rotation so that its starting joint becomes the origin and the part coincides with the x-axis.
  • When the anchor point is the identity matrix I_n ∈ SO_n, the resulting tangent space is known as the Lie algebra so_n.

4. Lie Group Network for Skeleton-based Action Recognition

  • For the problem of skeleton-based action recognition, the authors build a deep network architecture to learn the Lie group representations of skeletal data.
  • The network structure is dubbed LieNet, where each input is an element on the Lie group.
  • Like convolutional networks, the LieNet also exhibits fully connected convolution-like layers and pooling layers, named rotation mapping layers and rotation pooling layers respectively.
  • In particular, the proposed RotMap layers perform transformations on input rotation matrices to generate new rotation matrices, which have the same manifold property, and are expected to be aligned more accurately for more reliable matching.
  • This transforms the rotation matrices into the usual skew-symmetric matrices, which lie in Euclidean space and hence can be fed into any regular output layers.

4.4. Output Layers

  • After performing the LogMap layers, the outputs can be transformed into vector form and concatenated directly frame by frame within one sequence due to their Euclidean nature.
  • Then, the authors can add any regular network layers such as rectified linear unit (ReLU) layers and regular fully connected (FC) layers.
  • In the FC layer, the weight dimensionality is set to d_k × d_{k-1}, where d_k and d_{k-1} are the class number and the vector dimensionality, respectively.
  • Besides, as studied in [37, 26], learning temporal dependencies over the sequential data can improve human action recognition.
  • Because of the space limitation, the authors do not study this any further.

5. Training Procedure

  • In order to train the proposed LieNets, the authors exploit the stochastic gradient descent (SGD) algorithm, one of the most popular network training tools.
  • The gradients of the data involved in RotPooling, LogMap and regular output layers can be calculated by Eqn.14 as usual.
  • As a consequence, merely using Eqn.13 to compute their Euclidean gradients rather than Riemannian gradients in the procedure of backpropagation would not generate valid rotation weights.
  • To handle this problem, the authors propose a new approach of updating the weights used in Eqn.6 for the RotMap layers.
  • Then, such an update is mapped back to the SO_3 manifold with a retraction operation.

6.1. Evaluation Datasets

  • G3D-Gaming dataset [5] contains 663 sequences of 20 different gaming motions.
  • Each subject performed every action more than two times.
  • Due to its large scale, the dataset is highly suitable for deep learning.

6.2. Implementation Details

  • As a result, for each moving skeleton, the authors finally compute a Lie group curve of length 100, 16, 64 for the G3D-Gaming, HDM05 and NTU RGB-D datasets, respectively.
  • As the focus of this work is on skeleton-based action recognition, the authors mainly utilize manifold-based approaches for comparison.
  • For a fair comparison, the authors use the source codes from the original authors, and set the involved parameters as in the original papers.
  • For the proposed LieNet, the authors build its architecture with one or more blocks of RotMap/RotPooling layers, as illustrated in Fig. 1, followed by three final layers: LogMap, FC and softmax.
  • The LieNet achieves promising results on all datasets with the same configuration, which shows its insensitivity to the parameter settings.

6.3. Experimental Results

  • For the dataset, the authors follow a cross-subject test setting, where half the subjects are used for training and the other half are employed for testing.
  • As shown in Table 1, the LieNet shows its superiority over the two baseline methods SO and SE.
  • This extreme case would result in the loss of temporal resolution and thus undermine the performance of recognizing activities.
  • The left of Fig.3 verifies the necessity of using RotMap, RotPooling and LogMap layers to improve the proposed LieNet-3Blocks.


Deep Learning on Lie Groups for Skeleton-based Action Recognition
Zhiwu Huang, Chengde Wan, Thomas Probst, Luc Van Gool
Computer Vision Lab, ETH Zurich, Switzerland; VISICS, KU Leuven, Belgium
{zhiwu.huang, wanc, probstt, vangool}@vision.ee.ethz.ch
Abstract

In recent years, skeleton-based action recognition has become a popular 3D classification problem. State-of-the-art methods typically first represent each motion sequence as a high-dimensional trajectory on a Lie group with an additional dynamic time warping, and then shallowly learn favorable Lie group features. In this paper we incorporate the Lie group structure into a deep network architecture to learn more appropriate Lie group features for 3D action recognition. Within the network structure, we design rotation mapping layers to transform the input Lie group features into desirable ones, which are aligned better in the temporal domain. To reduce the high feature dimensionality, the architecture is equipped with rotation pooling layers for the elements on the Lie group. Furthermore, we propose a logarithm mapping layer to map the resulting manifold data into a tangent space that facilitates the application of regular output layers for the final classification. Evaluations of the proposed network for standard 3D human action recognition datasets clearly demonstrate its superiority over existing shallow Lie group feature learning methods as well as most conventional deep learning methods.
1. Introduction

Due to the development of depth sensors, 3D human activity analysis [27, 45, 23, 43, 41, 3, 42, 37, 44, 26, 35, 17] has attracted more interest than ever before. Recent manifold-based approaches are quite successful at 3D human action recognition thanks to their view-invariant manifold-based representations for skeletal data. Typical examples include shape silhouettes in the Kendall's shape space [40, 3], linear dynamical systems on the Grassmann manifold [39], histograms of oriented optical flow on a hyper-sphere [11], and pairwise transformations of skeletal joints on a Lie group [41, 3, 42]. In this paper, we focus on studying manifold-based approaches [41, 3, 42] to learn more appropriate Lie group representations of skeletal action data, which have achieved state-of-the-art performance on some 3D human action recognition benchmarks.

As studied in [41, 3, 42], Lie group feature learning methods often suffer from speed variations (i.e., temporal misalignment), which tend to deteriorate classification accuracy. To handle this issue, they typically employ dynamic time warping (DTW), as originally used in speech processing [30]. Unfortunately, such a process costs additional time, and also results in a two-step system that typically performs worse than an end-to-end learning scheme. Moreover, such Lie group representations for action recognition tend to be extremely high-dimensional, in part because the features are extracted per skeletal segment and then stacked. As a result, any computation on such nonlinear trajectories is expensive and complicated. To address this problem, [41, 3, 42] attempt to first flatten the underlying manifold via tangent approximation or rolling maps, and then exploit SVM or PCA-like methods to learn features in the resulting flattened space. Although these methods achieve some success, they merely adopt shallow linear learning schemes, yielding suboptimal solutions on the specific nonlinear manifolds.

Deep neural networks have shown their great power in learning compact and discriminative representations for images and videos, thanks to their ability to perform nonlinear computations and the effectiveness of gradient descent training with backpropagation. This has motivated us to build a deep neural network architecture for representation learning on Lie groups. In particular, inspired by the classical manifold learning theory [38, 36, 4, 12, 20, 19], we equip the new network structure with rotation mapping layers, with which the input Lie group features are transformed to new ones with better alignment. As a result, the effect of speed variations can be appropriately mitigated. In order to reduce the high dimensionality of the Lie group features, we design special pooling layers to compose them in terms of spatial and temporal levels, respectively. As the output data reside on nonlinear manifolds, we also propose a Riemannian computing layer, whose outputs could be fed into any regular output layers such as a softmax layer. In short, our main contributions are:

  • A novel neural network architecture is introduced to deeply learn more desirable Lie group representations for the problem of skeleton-based action recognition.
  • The proposed network provides a paradigm to incorporate the Lie group structure into deep learning, which generalizes the traditional neural network model to non-Euclidean Lie groups.
  • To train the network within the backpropagation framework, a variant of stochastic gradient descent optimization is exploited in the context of Lie groups.
2. Relevant Work

Already quite some works [46, 34, 2, 29, 33, 14, 15] have applied aspects of Lie group theory to deep neural networks. For example, [33] investigated how stability properties of a continuous recursive neural network can be altered within neighbourhoods of equilibrium points by the use of Lie group projections operating on the synaptic weight matrix. [14] studied the behavior of unsupervised neural networks with orthonormality constraints, by exploiting the differential geometry of Lie groups. In particular, two sub-classes of the general Lie group learning theories were studied in detail, tackling first-order (gradient-based) and second-order (non-gradient-based) learning. [15] introduced deep symmetry networks (symnets), a generalization of convolutional networks that forms feature maps over arbitrary symmetry groups that are basically Lie groups. The symnets utilize kernel-based interpolation to tractably tie parameters and pool over symmetry spaces of any dimension.

Moreover, recently some deep learning models have emerged [10, 7, 28, 25, 18, 21] that deal with data in a non-Euclidean domain. For instance, [10] proposed a spectral version of convolutional networks to handle graphs. It exploits the notion of non shift-invariant convolution, relying on the analogy between the classical Fourier transform and the Laplace-Beltrami eigenbasis. [25] developed a scalable method for treating an arbitrary spatio-temporal graph as a rich recurrent neural network mixture, which can be used to transform any spatio-temporal graph by employing a certain set of well-defined steps. For shape analysis, [28] proposed a 'geodesic convolution' on local geodesic coordinate systems to extract local patches on the shape manifold. This approach performs convolutions by sliding a window over the manifold, and local geodesic coordinates are used instead of image patches. To deeply learn symmetric positive definite (SPD) matrices - used in many tasks - [18] developed a Riemannian network on the manifolds of SPD matrices, with some layers specially designed to deal with such structured matrices.

In summary, such works have applied some theories of Lie groups to regular networks, and even generalized the common networks to non-Euclidean domains. Nevertheless, to the best of our knowledge, this is the first work that studies a deep learning architecture on Lie groups to handle the problem of skeleton-based action recognition.
3. Lie Group Representation for Skeletal Data

Let S = (V, E) be a body skeleton, where V = {v_1, ..., v_N} denotes the set of body joints, and E = {e_1, ..., e_M} indicates the set of edges, i.e. oriented rigid body bones. As studied in [41, 3, 42], the relative geometry of a pair of body parts e_n and e_m can be represented in a local coordinate system attached to the other. The local coordinate system of body part e_n is calculated by rotating with minimum rotation so that its starting joint becomes the origin and it coincides with the x-axis. With this process, we consequently get the transformed 3D vectors \hat{e}_m, \hat{e}_n for the two edges e_m, e_n respectively. Then we can compute the rotation matrix R_{m,n} (R_{m,n}^T R_{m,n} = R_{m,n} R_{m,n}^T = I_n, |R_{m,n}| = 1) from e_m to the local coordinate system of e_n. Specifically, we can firstly calculate the axis-angle representation (\omega, \theta) for the rotation matrix R_{m,n} by

    \omega = \frac{\hat{e}_m \times \hat{e}_n}{\|\hat{e}_m \times \hat{e}_n\|},   (1)
    \theta = \arccos(\hat{e}_m \cdot \hat{e}_n),   (2)

where \times and \cdot denote the outer and inner products, respectively. Then, the axis-angle representation can be easily transformed to a rotation matrix R_{m,n}. In the same way, the rotation matrix R_{n,m} from e_n to the local coordinate system of e_m can be computed. To fully encode the relative geometry between e_m and e_n, R_{m,n} and R_{n,m} are both used. As a result, a skeleton S at the time instance t is represented by the form (R_{1,2}(t), R_{2,1}(t), \ldots, R_{M-1,M}(t), R_{M,M-1}(t)), where M is the number of body parts, and the number of rotation matrices is 2C_M^2 (C_M^2 is the combination formula).
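For illustration only, the construction of Eqns. 1-2 can be sketched in a few lines of NumPy; the function names below (`rodrigues`, `relative_rotation`) are our own, not code released with the paper, and the sketch assumes the two bone vectors are unit length and not (anti)parallel.

```python
import numpy as np

def rodrigues(omega, theta):
    """Convert an axis-angle pair (unit axis omega, angle theta) into a 3x3 rotation matrix."""
    K = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])      # skew-symmetric cross-product matrix of omega
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def relative_rotation(e_m_hat, e_n_hat):
    """Rotation matrix R_{m,n} aligning bone e_m with bone e_n, following Eqns. (1)-(2)."""
    cross = np.cross(e_m_hat, e_n_hat)
    omega = cross / np.linalg.norm(cross)           # Eqn. (1): rotation axis (undefined for parallel bones)
    theta = np.arccos(np.clip(np.dot(e_m_hat, e_n_hat), -1.0, 1.0))  # Eqn. (2): rotation angle
    return rodrigues(omega, theta)

# A skeleton frame is then the tuple (R_{1,2}, R_{2,1}, ..., R_{M-1,M}, R_{M,M-1}).
```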
The set of n×n rotation matrices in R^{n×n} forms the special orthogonal group SO_n, which is actually a matrix Lie group [22, 9, 16]. Accordingly, each motion sequence of a moving skeleton is represented with a curve on the Lie group SO_3 × ... × SO_3. It is known that the matrix Lie group is endowed with a Riemannian manifold structure that is differentiable. Hence, at each point R_0 on SO_n, one can derive the tangent space T_{R_0}SO_n that is a vector space spanned by the set of skew-symmetric matrices. When the anchor point is the identity matrix I_n ∈ SO_n, the resulting tangent space is known as the Lie algebra so_n. As the tangent spaces are equipped with the inner product, the Riemannian metric on SO_n can be defined by the Frobenius inner product:

    \langle A_1, A_2 \rangle = \mathrm{trace}(A_1^T A_2), \quad A_1, A_2 \in T_{R_0}SO_n.   (3)

The logarithm map log_{R_0} and exponential map exp_{R_0} at R_0 on SO_n associated with the Riemannian metric can be expressed in terms of the usual matrix logarithm log and exponential exp as

    \log_{R_0}(R_1) = \log(R_1 R_0^T), \quad R_0, R_1 \in SO_n,   (4)
    \exp_{R_0}(A_1) = \exp(A_1)\,R_0^T, \quad A_1 \in T_{R_0}SO_n.   (5)
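As an aside, the logarithm map of Eqn. 4 can be evaluated with SciPy's matrix functions; the snippet below is our own illustration (not the authors' code), and Section 4.3 later gives the cheaper closed form that the network actually uses.

```python
import numpy as np
from scipy.linalg import expm, logm

def riemannian_log(R0, R1):
    """Logarithm map at the anchor R0, Eqn. (4): log_{R0}(R1) = log(R1 R0^T), a skew-symmetric matrix."""
    return np.real(logm(R1 @ R0.T))

# With the identity as anchor the result lies in the Lie algebra so_3,
# and the matrix exponential inverts it: expm(riemannian_log(np.eye(3), R)) recovers R.
```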
Figure 1. Conceptual illustration of the proposed Lie group Network (LieNet) architecture. In the network structure, the data space of each RotMap/RotPooling layer corresponds to a Lie group, while the weight spaces of the RotMap layers are Lie groups as well.
4. Lie Group Network for Skeleton-based Action Recognition

For the problem of skeleton-based action recognition, we build a deep network architecture to learn the Lie group representations of skeletal data. The network structure is dubbed LieNet, where each input is an element on the Lie group. Like convolutional networks (ConvNets), the LieNet also exhibits fully connected convolution-like layers and pooling layers, named rotation mapping (RotMap) layers and rotation pooling (RotPooling) layers respectively. In particular, the proposed RotMap layers perform transformations on input rotation matrices to generate new rotation matrices, which have the same manifold property, and are expected to be aligned more accurately for more reliable matching. The RotPooling layers aim to pool the resulting rotation matrices at both spatial and temporal levels such that the Lie group feature dimensionality can be reduced. Since the rotation matrices reside on non-Euclidean manifolds, we have to design a layer named logarithm mapping (LogMap) layer to perform the Riemannian computations on them. This transforms the rotation matrices into the usual skew-symmetric matrices, which lie in Euclidean space and hence can be fed into any regular output layers. The architecture of the proposed LieNet is shown in Fig. 1.
4.1. RotMap Layer

As well-known from classical manifold learning theory [38, 36, 4, 12, 20, 19], one can learn or preserve the original data structure to faithfully maintain geodesic distances for better classification. Accordingly, we design a RotMap layer to transform the input rotation matrices to new ones that are more suitable for the final classification. Formally, the RotMap layers adopt a rotation mapping f_r as

    f_r^{(k)}\big((R_1^{k-1}, R_2^{k-1}, \ldots, R_{\hat{M}}^{k-1}); W_1^{k}, W_2^{k}, \ldots, W_{\hat{M}}^{k}\big)
        = (W_1^{k} R_1^{k-1}, W_2^{k} R_2^{k-1}, \ldots, W_{\hat{M}}^{k} R_{\hat{M}}^{k-1})
        = (R_1^{k}, R_2^{k}, \ldots, R_{\hat{M}}^{k}),   (6)

where \hat{M} = 2C_M^2 (M is the number of body bones in one skeleton, C_M^2 is the combination computation), (R_1^{k-1}, R_2^{k-1}, \ldots, R_{\hat{M}}^{k-1}) ∈ SO_3 × SO_3 × ... × SO_3 is the input Lie group feature (i.e., product of rotation matrices) for one skeleton in the k-th layer, W_i^k ∈ R^{3×3} is the transformation matrix (connection weights), and (R_1^{k}, R_2^{k}, \ldots, R_{\hat{M}}^{k}) is the resulting Lie group representation. Note that although there is only one transformation matrix for each rotation matrix, it would be easily extended with multiple projections for each input. To ensure the form (R_1^{k}, R_2^{k}, \ldots, R_{\hat{M}}^{k}) becomes a valid product of rotation matrices residing on SO_3 × SO_3 × ... × SO_3, the transformation matrices W_1^{k}, W_2^{k}, \ldots, W_{\hat{M}}^{k} are all basically required to be rotation matrices. Accordingly, both the data and the weight spaces on each RotMap layer correspond to a Lie group SO_3 × SO_3 × ... × SO_3.

Since the RotMap layers are designed to work together with the classification layer, each resulting skeleton representation is tuned for more accurate classification in an end-to-end deep learning manner. In other words, the major purpose of designing the RotMap layers is to align the Lie group representations of a moving skeleton for more faithful matching.
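For concreteness, the RotMap forward pass of Eqn. 6 amounts to one learned rotation per input rotation matrix; the sketch below is illustrative only, with shapes and names of our own choosing rather than the paper's implementation.

```python
import numpy as np

def rotmap_forward(R_in, W):
    """RotMap layer, Eqn. (6): rotate each input rotation matrix by its own learned rotation weight.

    R_in : array of shape (M_hat, 3, 3), input rotation matrices for one skeleton frame.
    W    : array of shape (M_hat, 3, 3), layer weights, each constrained to lie in SO(3).
    """
    # Batched matrix product W_i @ R_i for every bone pair i; the output stays on SO(3) x ... x SO(3).
    return np.einsum('mij,mjk->mik', W, R_in)
```

Because the weights must remain valid rotations, they cannot be updated with a plain Euclidean gradient step; Section 5 describes the Riemannian update used for them, and a corresponding sketch appears after Eqn. 18.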
4.2. RotPooling Layer

In order to reduce the complexity of deep models, it is typically useful to reduce the size of the representations to decrease the amount of parameters and computation in the network. For this purpose, it is common to insert a pooling layer in-between successive convolutional layers in a typical ConvNet architecture. The pooling layers are often designed to compute statistics in local neighborhoods, such as sum aggregation, average energy and maximum activation.

Without loss of generality, we here just introduce max pooling¹ to the LieNet setting with the equivalent notion of neighborhood. Since the input and output of the special pooling layers are both expected to be rotation matrices, we call this kind of layer a rotation pooling (RotPooling) layer. For the RotPooling, we propose two different concepts of neighborhood in this work. The first one is on the spatial level. As shown in Fig. 2(a)(b), we first pool the Lie group features on each pair of basic bones e_m, e_n in the i-th frame, which is represented by the two rotation matrices R_{m,n}^{k-1,i}, R_{n,m}^{k-1,i} (here k-1 is the order of the layer) as aforementioned. Then, as depicted in Fig. 2(b)(c), we can perform pooling on the adjacent bones that belong to the same group (here, we can define five part groups, i.e., torso, two arms and two legs, of the body). However, the second step would inevitably result in a serious spatial misalignment problem, and thus lead to bad matching performances. Therefore, we finally only adopt the first step pooling. In this setting, the function of the max pooling is given by

    f_p^{(k)}(\{R_{m,n}^{k-1,i}, R_{n,m}^{k-1,i}\}) = \max(\{R_{m,n}^{k-1,i}, R_{n,m}^{k-1,i}\})
        = \begin{cases} R_{m,n}^{k-1,i}, & \text{if } \Theta(R_{m,n}^{k-1,i}) > \Theta(R_{n,m}^{k-1,i}), \\ R_{n,m}^{k-1,i}, & \text{otherwise}, \end{cases}   (7)

where \Theta(\cdot) is the representation of the given rotation matrix such as quaternion, Euler angle or Euler axis-angle. For example, the Euler axis \omega and angle \theta representations are typically calculated by

    \omega(R_{n,m}) = \frac{1}{2\sin(\theta(R_{n,m}))} \begin{bmatrix} R_{n,m}(3,2) - R_{n,m}(2,3) \\ R_{n,m}(1,3) - R_{n,m}(3,1) \\ R_{n,m}(2,1) - R_{n,m}(1,2) \end{bmatrix},   (8)

    \theta(R_{n,m}) = \arccos\left(\frac{\mathrm{trace}(R_{n,m}) - 1}{2}\right),   (9)

where R_{n,m}(i, j) is the i-th row, j-th column element of R_{n,m}. Unfortunately, except for the angle representation, it is non-trivial to define an ordering relation for a quaternion or an axis-angle representation. Hence, in this paper, we finally adopt the angle form Eqn. 9 of rotation matrices and its simple ordering relation to calculate the function \Theta(\cdot).
¹ In contrast to sum and mean poolings, max pooling can generate valid rotation matrices directly, and hence suits the proposed LieNets. On the other hand, leveraging Lie group computing to enable sum and mean pooling to work for the LieNets, however, goes beyond the scope of this paper.

Figure 2. Illustration of spatial pooling (SpaPooling) (a)(b)(c) and temporal pooling (TemPooling) (c)(d) schemes.

The other pooling scheme is on the temporal level. As shown in Fig. 2(c)(d), the aim of the temporal pooling is to obtain more compact representations for a motion sequence. This is because a sequence often contains many frames, which results in the problem of extremely high-dimensional representations. Thus, pooling in the temporal domain can reduce the model complexity as well. Formally, the function of this kind of max pooling is defined as

    f_p^{(k)}(\{(R_{1,2}^{k-1,1}, \ldots, R_{M-1,M}^{k-1,1}), \ldots, (R_{1,2}^{k-1,p}, \ldots, R_{M-1,M}^{k-1,p})\})
        = (\max(\{R_{1,2}^{k-1,1}, \ldots, R_{1,2}^{k-1,p}\}), \ldots, \max(\{R_{M-1,M}^{k-1,1}, \ldots, R_{M-1,M}^{k-1,p}\})),   (10)

where M is the number of body parts in one skeleton, p is the number of skeleton frames for pooling, and the function max(·) is defined in the way of Eqn. 7.
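To illustrate Eqns. 7-10, the sketch below keeps, within each neighborhood, the rotation with the largest angle Θ(·) of Eqn. 9; all function names are hypothetical and the code is our own illustration rather than the paper's implementation.

```python
import numpy as np

def rotation_angle(R):
    """Theta(R) of Eqn. (9), used as the ordering for max pooling."""
    return np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))

def rot_max(rotations):
    """Eqn. (7): return the rotation matrix with the largest angle in the neighborhood."""
    return max(rotations, key=rotation_angle)

def spatial_rotpooling(frame):
    """Pool each bone pair {R_{m,n}, R_{n,m}} of one frame down to a single rotation (Fig. 2(a)(b))."""
    return [rot_max(pair) for pair in frame]        # frame: list of (R_{m,n}, R_{n,m}) pairs

def temporal_rotpooling(frames, p):
    """Eqn. (10): pool every group of p consecutive frames, bone by bone."""
    pooled = []
    for start in range(0, len(frames), p):
        group = frames[start:start + p]
        pooled.append([rot_max([frame[j] for frame in group]) for j in range(len(group[0]))])
    return pooled
```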
4.3. LogMap Layer

Classification of curves on the Lie group SO_3 × ... × SO_3 is a complicated task due to the non-Euclidean nature of the underlying space. To address the problem as in [42], we design the logarithm map (LogMap) layer to flatten the Lie group SO_3 × ... × SO_3 to its Lie algebra so_3 × ... × so_3. Accordingly, by using the logarithm map Eqn. 4, the function of this layer can be defined as

    f_l^{(k)}\big((R_1^{k-1}, R_2^{k-1}, \ldots, R_{\hat{M}}^{k-1})\big) = \big(\log(R_1^{k-1}), \log(R_2^{k-1}), \ldots, \log(R_{\hat{M}}^{k-1})\big).   (11)

One typical approach to calculate the logarithm map is to use \log(R) = U \log(\Sigma) U^T, where R = U \Sigma U^T and \log(\Sigma) is the diagonal matrix of the eigenvalue logarithms. However, the spectral operation not only suffers from the problem of zeroes occurring in \log(\Sigma) due to the property of the rotation matrix R, but also consumes too much time for matrix gradient computation [24]. Therefore, we resort to other approaches to perform the function of this layer. Fortunately, we can explore the relationship between the logarithm map and the axis-angle representation as:

    \log(R) = \begin{cases} 0, & \text{if } \theta(R) = 0, \\ \dfrac{\theta(R)}{2\sin(\theta(R))}\,(R - R^T), & \text{otherwise}, \end{cases}   (12)

where \theta(R) is the angle Eqn. 9 of R. With this equation, the corresponding matrix gradient can be easily derived by traditional element-wise matrix calculation.
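The closed form of Eqns. 9 and 12 translates directly into a few lines of NumPy; this is our own illustrative sketch (and, like Eqn. 12 itself, it does not handle the degenerate case θ = π):

```python
import numpy as np

def rotation_angle(R):
    """Rotation angle theta(R) of a matrix in SO(3), Eqn. (9)."""
    return np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))

def logmap(R, eps=1e-8):
    """LogMap of a single rotation matrix, Eqn. (12): returns a skew-symmetric matrix in so(3)."""
    theta = rotation_angle(R)
    if theta < eps:                       # near the identity, log(R) is (approximately) the zero matrix
        return np.zeros((3, 3))
    return theta / (2.0 * np.sin(theta)) * (R - R.T)
```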
4.4. Output Layers

After performing the LogMap layers, the outputs can be transformed into vector form and concatenated directly frame by frame within one sequence due to their Euclidean nature. Then, we can add any regular network layers such as rectified linear unit (ReLU) layers and regular fully connected (FC) layers. In particular for the ReLU layer, we can simply set relatively small elements to zero as done in classical ReLU. In the FC layer, the dimensionality of the weight is set to d_k × d_{k-1}, where d_k and d_{k-1} are the class number and the vector dimensionality, respectively. For skeleton-based action recognition, we employ a common softmax layer as the final output layer. Besides, as studied in [37, 26], learning temporal dependencies over the sequential data can improve human action recognition. Hence, we can also feed the outputs into Long Short-Term Memory (LSTM) units to learn useful temporal features. Because of the space limitation, we do not study this any further.
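A minimal sketch of this output stage follows (vectorize the LogMap outputs, stack them over frames, then apply an FC layer and softmax); the shapes, names, and the standard ReLU used here are our own illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def output_layers(log_feats, W_fc, b_fc):
    """Output stage of Section 4.4.

    log_feats : array of shape (T, M_hat, 3, 3), skew-symmetric LogMap outputs for one sequence.
    W_fc      : array of shape (num_classes, d) fully connected weights; b_fc: (num_classes,) bias.
    """
    x = log_feats.reshape(-1)                 # vectorize and concatenate frame by frame
    x = np.maximum(x, 0.0)                    # ReLU-style thresholding (the paper zeroes "relatively small" entries)
    logits = W_fc @ x + b_fc                  # FC layer of size num_classes x d
    z = np.exp(logits - logits.max())         # numerically stable softmax
    return z / z.sum()
```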
5. Training Procedure

In order to train the proposed LieNets, we exploit the stochastic gradient descent (SGD) algorithm, which is one of the most popular network training tools. To begin with, let the LieNet model be represented as a sequence of function compositions f = f^{(l)} \circ f^{(l-1)} \circ \cdots \circ f^{(1)} with a parameter tuple W = (W_l, W_{l-1}, \ldots, W_1), where f^{(k)} is the function for the k-th layer, W_k (dropping the sample index for simplicity) represents the weight parameters of the k-th layer, and l is the number of layers. The loss of the k-th layer is defined by L^{(k)} = \ell \circ f^{(l)} \circ \cdots \circ f^{(k)}, where \ell is the loss function for the final output layer.

To optimize the deep model, one classical SGD algorithm needs to compute the gradient of the objective function, which is typically achieved by the backpropagation chain rule. In particular, the gradients of the weight W_k and the data R_{k-1} (dropping the sample index for simplicity) for the k-th layer can be respectively computed by the chain rule:

    \frac{\partial L^{(k)}(R_{k-1}, y)}{\partial W_k} = \frac{\partial L^{(k+1)}(R_k, y)}{\partial R_k} \cdot \frac{\partial f^{(k)}(R_{k-1})}{\partial W_k},   (13)

    \frac{\partial L^{(k)}(R_{k-1}, y)}{\partial R_{k-1}} = \frac{\partial L^{(k+1)}(R_k, y)}{\partial R_k} \cdot \frac{\partial f^{(k)}(R_{k-1})}{\partial R_{k-1}},   (14)

where y is the class label and R_k = f^{(k)}(R_{k-1}). Eqn. 13 is the gradient for updating W_k, while Eqn. 14 computes the gradients in the layers below to update R_{k-1}.

The gradients of the data involved in RotPooling, LogMap and regular output layers can be calculated by Eqn. 14 as usual. Particularly, the gradient for the data in RotPooling can be computed with the same gradient computing approach used in a regular max pooling layer in the context of traditional ConvNets. For the data in the LogMap layer, the gradient can be obtained by the element-wise gradient computation on the involved rotation matrices.
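Schematically, the backpropagation of Eqns. 13-14 walks the layers in reverse, propagating data gradients and collecting weight gradients. The sketch below is purely illustrative (the per-layer `backward` interface is our own assumption); RotMap weight gradients are set aside for the Riemannian update described next.

```python
def backward_pass(layers, caches, grad_output):
    """Schematic LieNet backpropagation following Eqns. (13)-(14).

    layers: list of layer objects, each with backward(cache, grad_out) -> (grad_input, grad_weight).
    caches: per-layer forward activations R_{k-1}; grad_output: dL/dR_l from the loss layer.
    """
    weight_grads = {}
    grad = grad_output
    for k in reversed(range(len(layers))):
        # Eqn. (14): gradient w.r.t. the layer input, passed further down;
        # Eqn. (13): gradient w.r.t. the layer weights (None for RotPooling/LogMap).
        grad, grad_w = layers[k].backward(caches[k], grad)
        if grad_w is not None:
            weight_grads[k] = grad_w   # RotMap weights later receive the Riemannian update of Eqns. (15)-(18)
    return weight_grads
```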
On the other hand, the computation of the gradients of the parameter weights defined in the RotMap layers is non-trivial. This is because the weight matrices are enforced to be on the Riemannian manifold SO_3 of the rotation matrices, i.e. the Lie group. As a consequence, merely using Eqn. 13 to compute their Euclidean gradients rather than Riemannian gradients in the procedure of backpropagation would not generate valid rotation weights. To handle this problem, we propose a new approach of updating the weights used in Eqn. 6 for the RotMap layers. As studied in [1], the steepest descent direction for the used loss function L^{(k)}(R_{k-1}, y) with respect to W_k on the manifold SO_3 is the Riemannian gradient \tilde{\nabla}L^{(k)}_{W_k}, which can be obtained by parallel transporting the Euclidean gradients onto the corresponding tangent space. In particular, transporting the gradient from a point W_k^t to another point W_k^{t+1} requires subtracting the normal component \bar{\nabla}L^{(k)}_{W_k} at W_k^{t+1}, which can be obtained as follows:

    \bar{\nabla}L^{(k)}_{W_k} = \nabla L^{(k)}_{W_k} W_k^T W_k,   (15)

where the Euclidean gradient \nabla L^{(k)}_{W_k} is computed by using Eqn. 13 as

    \nabla L^{(k)}_{W_k} = \frac{\partial L^{(k+1)}(R_k, y)}{\partial R_k} R_{k-1}^T.   (16)

Thanks to the parallel transport, the Riemannian gradient can be calculated by

    \tilde{\nabla}L^{(k)}_{W_k} = \nabla L^{(k)}_{W_k} - \bar{\nabla}L^{(k)}_{W_k}.   (17)

Searching along the tangential direction takes the update in the tangent space of the SO_3 manifold. Then, such an update is mapped back to the SO_3 manifold with a retraction operation. Consequently, an update of the weight W_k on the SO_3 manifold is of the following form:

    W_k^{t+1} = \Gamma\big(W_k^t - \lambda \tilde{\nabla}L^{(k)}_{W_k}\big),   (18)

where W_k^t is the current weight, \Gamma is the retraction operation, and \lambda is the learning rate.
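To make Eqns. 15-18 concrete, here is a minimal sketch of one RotMap weight update. It is our own illustration, not the released code: the tangent projection is written in the common Edelman-style form G - W Gᵀ W for a square orthogonal W, and the retraction Γ is taken to be an SVD-based projection onto SO(3); both are standard choices that the text above does not pin down.

```python
import numpy as np

def retract_to_SO3(M):
    """Map an arbitrary 3x3 matrix back onto SO(3); an SVD-based projection is one common retraction."""
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:              # enforce det = +1 (a proper rotation)
        U[:, -1] *= -1
        R = U @ Vt
    return R

def update_rotmap_weight(W, euclid_grad, lr):
    """One Riemannian SGD step for a RotMap weight W in SO(3), in the spirit of Eqns. (15)-(18).

    euclid_grad is the Euclidean gradient of Eqn. (16), i.e. (dL/dR_k) @ R_{k-1}.T.
    """
    # Remove the component normal to SO(3) at W to obtain a tangent-space (Riemannian) gradient.
    riem_grad = euclid_grad - W @ euclid_grad.T @ W
    # Step in the tangent direction, then retract back onto the manifold, Eqn. (18).
    return retract_to_SO3(W - lr * riem_grad)
```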

Citations
Journal ArticleDOI
TL;DR: This work introduces a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and contains more than 114 thousand video samples and 8 million frames, and investigates a novel one-shot 3D activity recognition problem on this dataset.
Abstract: Research on depth-based human activity analysis achieved outstanding performance and demonstrated the effectiveness of 3D representation for action recognition. The existing depth-based and RGB+D-based action recognition benchmarks have a number of limitations, including the lack of large-scale training samples, realistic number of distinct class categories, diversity in camera views, varied environmental conditions, and variety of human subjects. In this work, we introduce a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and contains more than 114 thousand video samples and 8 million frames. This dataset contains 120 different action classes including daily, mutual, and health-related activities. We evaluate the performance of a series of existing 3D activity analysis methods on this dataset, and show the advantage of applying deep learning methods for 3D-based human action recognition. Furthermore, we investigate a novel one-shot 3D activity recognition problem on our dataset, and a simple yet effective Action-Part Semantic Relevance-aware (APSR) framework is proposed for this task, which yields promising results for recognition of the novel action classes. We believe the introduction of this large-scale dataset will enable the community to apply, adapt, and develop various data-hungry learning techniques for depth-based and RGB+D-based human activity understanding.

837 citations


Cites background from "Deep Learning on Lie Groups for Ske..."

  • ...Specifically, many of them have been evaluated based on the preliminary version [47] of our dataset, or pre-trained on it for transfer learning for other tasks [43], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79]....

    [...]

  • ...[62] incorporated Lie group structure into a deep architecture for skeleton-based action recognition....

    [...]

Proceedings ArticleDOI
13 Mar 2018
TL;DR: Independently Recurrent Neural Network (IndRNN) as discussed by the authors is a new type of RNN, where neurons in the same layer are independent of each other and they are connected across layers.
Abstract: Recurrent neural networks (RNNs) have been widely used for processing sequential data. However, RNNs are commonly difficult to train due to the well-known gradient vanishing and exploding problems and hard to learn long-term patterns. Long short-term memory (LSTM) and gated recurrent unit (GRU) were developed to address these problems, but the use of hyperbolic tangent and the sigmoid action functions results in gradient decay over layers. Consequently, construction of an efficiently trainable deep network is challenging. In addition, all the neurons in an RNN layer are entangled together and their behaviour is hard to interpret. To address these problems, a new type of RNN, referred to as independently recurrent neural network (IndRNN), is proposed in this paper, where neurons in the same layer are independent of each other and they are connected across layers. We have shown that an IndRNN can be easily regulated to prevent the gradient exploding and vanishing problems while allowing the network to learn long-term dependencies. Moreover, an IndRNN can work with non-saturated activation functions such as relu (rectified linear unit) and be still trained robustly. Multiple IndRNNs can be stacked to construct a network that is deeper than the existing RNNs. Experimental results have shown that the proposed IndRNN is able to process very long sequences (over 5000 time steps), can be used to construct very deep networks (21 layers used in the experiment) and still be trained robustly. Better performances have been achieved on various tasks by using IndRNNs compared with the traditional RNN and LSTM.

437 citations

Journal ArticleDOI
TL;DR: A new gating mechanism within LSTM module is introduced, with which the network can learn the reliability of the sequential data and accordingly adjust the effect of the input data on the updating procedure of the long-term context representation stored in the unit's memory cell.
Abstract: Skeleton-based human action recognition has attracted a lot of research attention during the past few years. Recent works attempted to utilize recurrent neural networks to model the temporal dependencies between the 3D positional configurations of human body joints for better analysis of human activities in the skeletal data. The proposed work extends this idea to spatial domain as well as temporal domain to better analyze the hidden sources of action-related information within the human skeleton sequences in both of these domains simultaneously. Based on the pictorial structure of Kinect's skeletal data, an effective tree-structure based traversal framework is also proposed. In order to deal with the noise in the skeletal data, a new gating mechanism within LSTM module is introduced, with which the network can learn the reliability of the sequential data and accordingly adjust the effect of the input data on the updating procedure of the long-term context representation stored in the unit's memory cell. Moreover, we introduce a novel multi-modal feature fusion strategy within the LSTM unit in this paper. The comprehensive experimental results on seven challenging benchmark datasets for human action recognition demonstrate the effectiveness of the proposed method.

436 citations


Additional excerpts

  • ...Skeleton-based action recognition has been explored in different aspects in recent years [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57]....

    [...]

Proceedings ArticleDOI
Yansong Tang, Yi Tian, Jiwen Lu, Peiyang Li, Jie Zhou
18 Jun 2018
TL;DR: A deep progressive reinforcement learning (DPRL) method for action recognition in skeleton-based videos, which aims to distil the most informative frames and discard ambiguous frames in sequences for recognizing actions.
Abstract: In this paper, we propose a deep progressive reinforcement learning (DPRL) method for action recognition in skeleton-based videos, which aims to distil the most informative frames and discard ambiguous frames in sequences for recognizing actions. Since the choices of selecting representative frames are multitudinous for each video, we model the frame selection as a progressive process through deep reinforcement learning, during which we progressively adjust the chosen frames by taking two important factors into account: (1) the quality of the selected frames and (2) the relationship between the selected frames to the whole video. Moreover, considering the topology of human body inherently lies in a graph-based structure, where the vertices and edges represent the hinged joints and rigid bones respectively, we employ the graph-based convolutional neural network to capture the dependency between the joints for action recognition. Our approach achieves very competitive performance on three widely used benchmarks.

380 citations

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper develops the graph analogues of three prominent explainability methods for convolutional neural networks: contrastive gradient-based (CG) saliency maps, Class Activation Mapping (CAM), and Excitation Back-Propagation (EB) and their variants, gradient-weighted CAM (Grad-CAM) and contrastive EB (c-EB).
Abstract: With the growing use of graph convolutional neural networks (GCNNs) comes the need for explainability. In this paper, we introduce explainability methods for GCNNs. We develop the graph analogues of three prominent explainability methods for convolutional neural networks: contrastive gradient-based (CG) saliency maps, Class Activation Mapping (CAM), and Excitation Back-Propagation (EB) and their variants, gradient-weighted CAM (Grad-CAM) and contrastive EB (c-EB). We show a proof-of-concept of these methods on classification problems in two application domains: visual scene graphs and molecular graphs. To compare the methods, we identify three desirable properties of explanations: (1) their importance to classification, as measured by the impact of occlusions, (2) their contrastivity with respect to different classes, and (3) their sparseness on a graph. We call the corresponding quantitative metrics fidelity, contrastivity, and sparsity and evaluate them for each method. Lastly, we analyze the salient subgraphs obtained from explanations and report frequently occurring patterns.

329 citations


Cites methods from "Deep Learning on Lie Groups for Ske..."

  • ...In [36] GCNNs were used for shape segmentation, and in [14], they were used for skeleton-based action recognition....

    [...]

References
Journal ArticleDOI
22 Dec 2000-Science
TL;DR: Locally linear embedding (LLE) is introduced, an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs that learns the global structure of nonlinear manifolds.
Abstract: Many areas of science depend on exploratory data analysis and visualization. The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. Here, we introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text.

15,106 citations


"Deep Learning on Lie Groups for Ske..." refers background or methods in this paper

  • ...In particular, inspired by the classical manifold learning theory [38, 36, 4, 12, 20, 19], we equip the new network structure with rotation mapping layers, with which the input Lie group features are transformed to new ones with better alignment....

    [...]

  • ...As well-known from classical manifold learning theory [38, 36, 4, 12, 20, 19], one can learn or preserve the original data structure to faithfully maintain geodesic distances for better classification....

    [...]

Journal ArticleDOI
22 Dec 2000-Science
TL;DR: An approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set and efficiently computes a globally optimal solution, and is guaranteed to converge asymptotically to the true structure.
Abstract: Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. The human brain confronts the same problem in everyday perception, extracting from its high-dimensional sensory inputs-30,000 auditory nerve fibers or 10(6) optic nerve fibers-a manageably small number of perceptually relevant features. Here we describe an approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set. Unlike classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), our approach is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations, such as human handwriting or images of a face under different viewing conditions. In contrast to previous algorithms for nonlinear dimensionality reduction, ours efficiently computes a globally optimal solution, and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.

13,652 citations


"Deep Learning on Lie Groups for Ske..." refers background or methods in this paper

  • ...In particular, inspired by the classical manifold learning theory [38, 36, 4, 12, 20, 19], we equip the new network structure with rotation mapping layers, with which the input Lie group features are transformed to new ones with better alignment....

    [...]

  • ...As well-known from classical manifold learning theory [38, 36, 4, 12, 20, 19], one can learn or preserve the original data structure to faithfully maintain geodesic distances for better classification....

    [...]

Journal ArticleDOI
TL;DR: In this article, the authors proposed a geometrically motivated algorithm for representing high-dimensional data, based on the correspondence between the graph Laplacian, the Laplace Beltrami operator on the manifold and the connections to the heat equation.
Abstract: One of the central problems in machine learning and pattern recognition is to develop appropriate representations for complex data. We consider the problem of constructing a representation for data lying on a low-dimensional manifold embedded in a high-dimensional space. Drawing on the correspondence between the graph Laplacian, the Laplace Beltrami operator on the manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for representing the high-dimensional data. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality-preserving properties and a natural connection to clustering. Some potential applications and illustrative examples are discussed.

7,210 citations

Book ChapterDOI
01 Jan 2010
TL;DR: A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems.
Abstract: During the last decade, the data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods is limited by the computing time rather than the sample size. A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set.

5,561 citations


"Deep Learning on Lie Groups for Ske..." refers methods in this paper

  • ...While the convergence of the used SGD algorithm on Riemannian manifolds has been studied well in [8, 6] already, the convergence behavior (see Fig....

    [...]

Proceedings Article
21 May 2014
TL;DR: This paper considers possible generalizations of CNNs to signals defined on more general domains without the action of a translation group, and proposes two constructions, one based upon a hierarchical clustering of the domain, and another based on the spectrum of the graph Laplacian.
Abstract: Convolutional Neural Networks are extremely efficient architectures in image and audio recognition tasks, thanks to their ability to exploit the local translational invariance of signal classes over their domain. In this paper we consider possible generalizations of CNNs to signals defined on more general domains without the action of a translation group. In particular, we propose two constructions, one based upon a hierarchical clustering of the domain, and another based on the spectrum of the graph Laplacian. We show through experiments that for low-dimensional graphs it is possible to learn convolutional layers with a number of parameters independent of the input size, resulting in efficient deep architectures.

3,460 citations


"Deep Learning on Lie Groups for Ske..." refers background in this paper

  • ...Moreover, recently some deep learning models have emerged [10, 7, 28, 25, 18, 21] that deal with data in a nonEuclidean domain....

    [...]

  • ...For instance, [10] proposed a spectral version of convolutional networks to handle graphs....

    [...]