Hierarchical Aligned Cluster Analysis
for Temporal Clustering of Human Motion
Feng Zhou, Student Member, IEEE, Fernando De la Torre, and Jessica K. Hodgins
Abstract—Temporal segmentation of human motion into plausible motion primitives is central to understanding and building
computational models of human motion. Several issues contribute to the challenge of discovering motion primitives: the exponential
nature of all possible movement combinations, the variability in the temporal scale of human actions, and the complexity of
representing articulated motion. We pose the problem of learning motion primitives as a temporal clustering one, and derive an
unsupervised hierarchical bottom-up framework called hierarchical aligned cluster analysis (HACA). HACA finds a partition of a given
multidimensional time series into m disjoint segments such that each segment belongs to one of k clusters. HACA combines kernel
k-means with the generalized dynamic time alignment kernel to cluster time series data. Moreover, it provides a natural framework to
find a low-dimensional embedding for the time series. HACA is efficiently optimized with a coordinate descent strategy and dynamic
programming. Experimental results on motion capture and video data demonstrate the effectiveness of HACA for segmenting complex
motions and as a visualization tool. We also compare the performance of HACA to state-of-the-art algorithms for temporal clustering on
data of a honey bee dance. The HACA code is available online.
Index Terms—Temporal segmentation, time series clustering, time series visualization, human motion analysis, Kernel k-means,
spectral clustering, dynamic programming
1 INTRODUCTION
Systems that can detect, recognize, and synthesize human
motion are of interest in both research and industry due
to the large number of potential applications in virtual
reality, smart surveillance systems, advanced user
interfaces, and motion analysis (see [1], [2], [3] for a review).
The quality of the detection, recognition, or synthesis in
these applications greatly depends on the spatial and
temporal resolution of motion databases, as well as the
complexity of the models. Unsupervised techniques to learn
motion primitives from training data have recently attracted
the interest of many scientists in computer vision [4], [5], [6],
[7], [8], [9], [10], [11], [12] and computer graphics [13], [14],
[15], [16], [17], [18], [19], [20]. Fig. 1 illustrates the problem
addressed in this paper: Given a sequence of a person
walking and running, the first level of the hierarchy
provided by our algorithm (HACA) is able to group the
frames into two classes: running and walking. In a finer
level of the hierarchy, HACA decomposes each of the
actions (e.g., running, walking) into motion primitives of
smaller temporal scale. However, some temporal compo-
nents might not necessarily have a physical meaning.
The inherent difficulty of temporally decomposing hu-
man motion stems from the large number of possible
movement combinations, a relatively large range of temporal
scales for different behaviors, the irregularity in the
periodicity of human actions, and the intraperson motion
variability. To address these challenges, this paper frames the
problem of hierarchical temporal decomposition of human
motion as an unsupervised learning problem, and proposes a
hierarchical aligned cluster analysis (HACA). HACA is a
generalization of kernel k-means (KKM) and spectral
clustering (SC) for time series clustering and embedding.
Over the last few years, several approaches for unsuper-
vised segmentation of activities have been proposed (see, for
example, [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15],
[16], [17], [18]). HACA presents several advantages:
. The temporal clustering problem is posed as an
energy minimization.
. HACA provides a natural embedding for clustering
and visualizing time series data.
. HACA provides a hierarchical decomposition at
several temporal scales (see Fig. 1). The time granu-
larity of the motion primitives is specified manually.
. Minimizing HACA is an NP-hard problem. This paper
proposes an efficient coordinate descent minimiza-
tion algorithm to find a solution for HACA via
dynamic programming.
2 PREVIOUS WORK
We build on prior research in human motion analysis and
temporal clustering.
2.1 Human Motion Analysis
Extensive literature in graphics and computer vision
addresses the problem of grouping human actions. In the
computer graphics literature, Barbic et al. [15] proposed an
algorithm to decompose human motion into distinct actions
based on probabilistic principal component analysis, which
places a cut when the distribution of human poses changes.
Jenkins et al. [13], [14] used the zero-velocity crossing points
. The authors are with the Robotics Institute, Carnegie Mellon University,
Smith Hall, 5000 Forbes Ave, Pittsburgh, PA 15232.
E-mail: zhfe99@gmail.com, {ftorre, jkh}@cs.cmu.edu.
Manuscript received 27 Apr. 2011; revised 19 Apr. 2012; accepted 12 May
2012; published online 14 June 2012.
Recommended for acceptance by C. Sminchisescu.
Digital Object Identifier no. 10.1109/TPAMI.2012.137.

of the angular velocity to segment the stream of motion
capture data into short sequences. Jenkins and Matarić [21]
further extended the work by finding a nonlinear embedding
using Isomap [22] that reveals the temporal structure of
segmented motion. Beaudoin et al. [20] developed a string-
based motif-finding algorithm to decompose motion into
action primitives and interpret actions as a composition on
the alphabet of these action primitives.
In the computer vision literature, Zhong et al. [23] used a
bipartite graph co-clustering algorithm to segment and
detect unusual activities in video. Zelnik-Manor and Irani
[5] extracted spatiotemporal features at multiple temporal
scales to isolate and cluster events. An outcome of the
clustering process is the temporal segmentation of long
video sequences into event subsequences. De la Torre et al.
[9] proposed a geometric-invariant clustering algorithm to
decompose a stream of facial behavior into facial gestures.
Unusual facial expressions can be detected through the
analysis of outlying temporal patterns. De la Torre and Agell
[24] decomposed a multimodal stream of human behavior
into several activities using semi-supervised temporal
clustering. Recently, Guerra-Filho and Aloimonos [8], [25]
presented a linguistic framework for modeling and learning
human activity representations from video. To obtain a low-
level representation, they segmented the movement by
estimating the velocity and acceleration of the actuator
attached to the joint. Minnen et al. [26] discovered motifs in
real-valued, multivariate time series data by locating regions
of high density in the space of all time series subsequences.
2.2 Temporal Clustering
Segmentation and clustering of time series data is a topic
that has been explored in fields other than computer vision
and graphics. In particular, there is a substantial amount of
work in the field of data mining [27], [28], speech processing
[29], [30], animal behavior analysis [12], [31], and signal
processing [32], [33].
Two of the most popular approaches are change-point
detection and switching linear dynamical system (SLDS).
The goal of change-point detection [32], [33], [34] is to
identify changes at unknown times and estimate the location
of changes in stochastic processes. Unlike previous work on
change-point detection, HACA finds the change points that
minimize the error across several segments (not only two)
that belong to one of k clusters.
SLDSs describe the dynamics of the time series by
switching several linear dynamical systems over time. The
switching states in SLDS inference implicitly provide the
segmentation of an input sequence. Because the exact
inference in SLDS is intractable, Pavlović et al. [35] proposed
approximate inference algorithms by casting the SLDS
model as a dynamic Bayesian network (DBN). Oh et al.
[12] introduced a data-driven MCMC (DD-MCMC) infer-
ence method to identify the exact posterior of SLDSs in the
presence of intractability. In their framework [12], the
standard SLDS has been improved by incorporating a
duration model, thereby yielding a more accurate result in
segmentation. To address the problem of learning SLDSs
with an unknown number of modes, Fox et al. [31] proposed
a nonparametric Bayesian method that utilizes the hier-
archical Dirichlet process (HDP) as a prior on the para-
meters of SLDSs. Recently, Fox et al. [36] further extended
this work by adding the beta process prior to discover and
model dynamical behaviors that are shared among multiple
related time series.
3 ALIGNED CLUSTER ANALYSIS (ACA)
This section describes ACA and hierarchical ACA (HACA),
an extension of kernel k-means and spectral clustering for
clustering time series. Section 3.1 reviews the matrix
formulation for k-means, KKM and SC. Section 3.2 describes
the properties of the frame kernel matrix that are key to
understanding ACA and HACA. Section 3.3 reviews the
dynamic time alignment kernel (DTAK) that is used as a
similarity measure between segments. Section 3.4 proposes
the ACA energy function and its matrix formulation is
discussed in Section 3.5. Section 3.6 describes a coordinate-
descent strategy for optimizing ACA. Section 3.7 presents an
efficient optimization strategy for ACA. Section 3.8 de-
scribes HACA.
3.1 k-Means, KKM and SC
Clustering refers to the partition of $n$ data points into $k$ disjoint clusters. Among the various approaches to unsupervised clustering, k-means [37] is favored for its simplicity. k-means clustering splits a set of $n$ samples into $k$ groups by minimizing the within-cluster variation, finding a partition of the data that is a local optimum of the following energy function [38], [39]:
$$J_{km}(\mathbf{Z}, \mathbf{G}) = \sum_{c=1}^{k} \sum_{i=1}^{n} g_{ci}\, \|\mathbf{x}_i - \mathbf{z}_c\|^2 = \|\mathbf{X} - \mathbf{Z}\mathbf{G}\|_F^2, \qquad \text{s.t. } \mathbf{G}^T \mathbf{1}_k = \mathbf{1}_n, \quad (1)$$

where $\mathbf{x}_i \in \mathbb{R}^d$ (see notation$^1$) is a vector representing the $i$th data point and $\mathbf{z}_c \in \mathbb{R}^d$ is the geometric centroid of the data points for class $c$. $\mathbf{G} \in \{0,1\}^{k \times n}$ is a binary indicator matrix such that $g_{ci} = 1$ if the sample $\mathbf{x}_i$ belongs to cluster $c$ and zero otherwise.
Fig. 1. Hierarchical decomposition of human motion. Each level of the
figure corresponds to one hierarchy found by HACA at different temporal
resolutions. The top row shows some samples of motion capture data of
a person walking and then running (7.2 seconds). The second row
shows the first level of the decomposition found by HACA. Each
temporal pattern contains samples for walking or running. The bottom
row shows the lower level, which contains subcycles of running and
walking.
1. Bold capital letters denote a matrix $\mathbf{X}$, bold lower-case letters a column vector $\mathbf{x}$. $\mathbf{x}_i$ represents the $i$th column of the matrix $\mathbf{X}$. $x_{ij}$ denotes the scalar in the $i$th row and $j$th column of the matrix $\mathbf{X}$. All non-bold letters represent scalars. $\mathbf{1}_{m \times n}, \mathbf{0}_{m \times n} \in \mathbb{R}^{m \times n}$ are matrices of ones and zeros. $\mathbf{I}_n \in \mathbb{R}^{n \times n}$ is an identity matrix. $\|\mathbf{x}\| = \sqrt{\mathbf{x}^T\mathbf{x}}$ denotes the Euclidean distance. $\|\mathbf{X}\|_F^2 = \mathrm{tr}(\mathbf{X}^T\mathbf{X}) = \mathrm{tr}(\mathbf{X}\mathbf{X}^T)$ designates the Frobenius norm of a matrix. $\mathbf{X} \circ \mathbf{Y}$ is the Hadamard product of matrices. $\mathrm{diag}(\mathbf{x})$ is a diagonal matrix whose diagonal elements are $\mathbf{x}$. $[i, j]$ and $[i, j)$ list the integers $\{i, i+1, \ldots, j-1, j\}$ and $\{i, i+1, \ldots, j-1\}$, respectively. $\mathbf{X}_{[i,j]} = [\mathbf{x}_i, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_j]$ is composed of the columns of $\mathbf{X}$ indexed by the integers in $[i, j]$. $\dot{\mathbf{X}}$ denotes the previous value of $\mathbf{X}$ in an updating scheme.

The k-means algorithm performs coordinate descent on the energy function $J_{km}(\mathbf{Z}, \mathbf{G})$. Given the current value of the means $\dot{\mathbf{Z}} \in \mathbb{R}^{d \times k}$, the first step finds, for each data point $\mathbf{x}_i$, an indicator $\mathbf{g}_i \in \{0,1\}^k$ in which exactly one entry is one and the others are zero, while minimizing (1). The second step computes $\mathbf{Z} = \mathbf{X}\mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}$, which is equivalent to computing the mean of each cluster. These alternating steps are guaranteed to converge to a local minimum of $J_{km}(\mathbf{Z}, \mathbf{G})$ [40].
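To make the two alternating steps concrete, here is a minimal NumPy sketch of k-means in the matrix notation of (1). It is an illustrative reconstruction, not the authors' released code; the initialization, the convergence test, and the use of a pseudo-inverse to guard against empty clusters are our own choices.

```python
import numpy as np

def kmeans_matrix(X, k, n_iter=100, seed=0):
    """Coordinate descent on J_km(Z, G) = ||X - Z G||_F^2 for X in R^{d x n}."""
    d, n = X.shape
    rng = np.random.default_rng(seed)
    Z = X[:, rng.choice(n, k, replace=False)]          # initial means (d x k)
    G = np.zeros((k, n))
    for _ in range(n_iter):
        # Step 1: assign each sample to its closest mean (one 1 per column of G).
        dist2 = ((X[:, None, :] - Z[:, :, None]) ** 2).sum(axis=0)   # k x n
        G_new = np.zeros((k, n))
        G_new[dist2.argmin(axis=0), np.arange(n)] = 1
        if np.array_equal(G_new, G):
            break
        G = G_new
        # Step 2: Z = X G^T (G G^T)^{-1}, i.e., the mean of each cluster.
        Z = X @ G.T @ np.linalg.pinv(G @ G.T)
    return Z, G
```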
A major limitation of the k-means algorithm is that it is only optimal for spherical clusters. To overcome this limitation, kernel k-means [41] implicitly maps the data to a higher dimensional space using kernels. KKM [41], [38] minimizes

$$J_{kkm}(\mathbf{G}) = \sum_{c=1}^{k} \sum_{i=1}^{n} g_{ci} \underbrace{\|\phi(\mathbf{x}_i) - \mathbf{z}_c\|^2}_{\mathrm{dist}^2(\mathbf{x}_i, \mathbf{z}_c)} = \|\phi(\mathbf{X}) - \mathbf{Z}\mathbf{G}\|_F^2, \qquad \text{s.t. } \mathbf{G}^T \mathbf{1}_k = \mathbf{1}_n, \quad (2)$$
where $\mathrm{dist}^2(\mathbf{x}_i, \mathbf{z}_c)$ is the squared distance between the $i$th sample and the center of class $c$ in the feature space, that is,

$$\mathrm{dist}^2(\mathbf{x}_i, \mathbf{z}_c) = \kappa_{ii} - \frac{2}{n_c} \sum_{j=1}^{n} g_{cj}\,\kappa_{ij} + \frac{1}{n_c^2} \sum_{j_1, j_2 = 1}^{n} g_{cj_1} g_{cj_2}\,\kappa_{j_1 j_2}, \quad (3)$$

where $n_c = \sum_{j=1}^{n} g_{cj}$ is the number of samples that belong to class $c$. The kernel function is defined as $\kappa_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.
Similarly to the first step in the k-means algorithm, KKM assigns each sample to the closest cluster mean ($\dot{\mathbf{z}}_c$) computed in the previous step:

$$g_{c_i i} = 1, \quad \text{where } c_i = \arg\min_c \ \mathrm{dist}^2(\mathbf{x}_i, \dot{\mathbf{z}}_c). \quad (4)$$

In KKM, in general, the mean cannot be computed explicitly. However, there is no need to compute the mean, because $\mathrm{dist}^2(\mathbf{x}_i, \dot{\mathbf{z}}_c)$ can be calculated from the kernel matrix.
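Because (3) expresses $\mathrm{dist}^2(\mathbf{x}_i, \dot{\mathbf{z}}_c)$ entirely in terms of kernel values, the assignment step (4) never needs the means explicitly. Below is a minimal sketch of one such assignment step, assuming a precomputed kernel matrix K and the indicator matrix from the previous iteration; the names and the empty-cluster guard are our own.

```python
import numpy as np

def kkm_assign(K, G):
    """One KKM assignment step: recompute the k x n indicator G from the
    kernel matrix K (n x n) and the previous G, using eqs. (3)-(4)."""
    k, n = G.shape
    n_c = G.sum(axis=1).astype(float)                    # samples per cluster
    n_c[n_c == 0] = 1.0                                  # guard against empty clusters
    # dist^2(x_i, z_c) = K_ii - (2/n_c) sum_j g_cj K_ij
    #                    + (1/n_c^2) sum_{j1,j2} g_cj1 g_cj2 K_j1j2
    cross = (G @ K) / n_c[:, None]                       # k x n
    within = np.einsum('cj,jl,cl->c', G, K, G) / n_c**2  # k
    dist2 = np.diag(K)[None, :] - 2 * cross + within[:, None]
    G_new = np.zeros_like(G)
    G_new[dist2.argmin(axis=0), np.arange(n)] = 1
    return G_new
```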
Spectral clustering also minimizes a weighted version of
(2), but G is relaxed to be continuous. See [41] and [38] for a
more detailed explanation of the relation between SC and
KKM. In this paper, we will extend KKM and SC to cluster
and find a low-dimensional embedding of the time series.
3.2 Frame Kernel Matrix
This section describes some properties of the frame kernel matrix, $\mathbf{K} = \phi(\mathbf{X})^T \phi(\mathbf{X}) \in \mathbb{R}^{n \times n}$, where $\mathbf{X} \in \mathbb{R}^{d \times n}$ is a multidimensional time series of length $n$. Each entry, $\kappa_{ij}$, defines the similarity between two frames, $\mathbf{x}_i$ and $\mathbf{x}_j$, by means of a kernel function $\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$. The linear kernel, $\kappa_{ij} = \mathbf{x}_i^T \mathbf{x}_j$, and the Gaussian kernel, $\kappa_{ij} = \exp\!\big(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\big)$, are perhaps the most commonly used kernels. In the literature on dynamical systems, the frame kernel matrix ($\mathbf{K}$) is alternatively called the recurrence matrix [42], [43], and its structure reveals important information about the dynamics.
To illustrate the properties of this matrix, consider the 1D time series shown in Fig. 2a. In this case, we compute the frame kernel matrix using the exponential kernel, $\kappa_{ij} = \exp\!\big(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\big)$. We choose an infinitely small bandwidth ($\sigma \to 0$) to obtain a binary frame kernel matrix (Fig. 2b). In the following, we highlight one property, period ambiguity, that is relevant to ACA. Fig. 2a (second and third rows) plots two different, but valid, decompositions of the same time series at two different temporal scales. To avoid this ambiguity, we introduce a parameter $n_{\max}$ to constrain the length of the segments. In this case, we set $n_{\max} = 2$ and $n_{\max} = 4$ for the second and third rows, respectively. Similarly, Fig. 2c shows an example of a multidimensional time series of motion capture data of a subject performing two activities. Fig. 2d illustrates the corresponding frame kernel matrix at two different temporal resolution levels.
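As a concrete illustration of the frame kernel matrix described above, the snippet below builds K with a Gaussian kernel and, for a very small bandwidth, thresholds it into a binary recurrence-like matrix as in Fig. 2b. The toy series and the threshold are our own choices, not values from the paper.

```python
import numpy as np

def frame_kernel(X, sigma):
    """Gaussian frame kernel K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for X in R^{d x n}."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # n x n squared distances
    return np.exp(-sq / (2.0 * sigma ** 2))

# A periodic 1D toy series; as sigma -> 0 the kernel approaches a binary matrix.
X = np.array([[0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 1.0, 2.0]])
K = frame_kernel(X, sigma=1e-3)
K_binary = (K > 0.5).astype(int)
```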
3.3 Dynamic Time Alignment Kernel
A temporal clustering algorithm needs to define a distance
between segments of different length. Ideally, this distance
should be invariant to the speed of the human action. This
section reviews the DTAK that extends dynamic time
warping (DTW) to satisfy the properties of a distance.
A frequent approach to aligning time series has been
DTW. A known drawback of using DTW as a distance is that
it fails to satisfy the triangle inequality [44]. To address this
issue, Shimodaira et al. [45] proposed the DTAK. Given two sequences $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_{n_x}] \in \mathbb{R}^{d \times n_x}$ and $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_{n_y}] \in \mathbb{R}^{d \times n_y}$, DTAK computes the similarity using dynamic programming. DTAK uses the cumulative kernel matrix $\mathbf{U} \in \mathbb{R}^{n_x \times n_y}$ (as in DTW), computed in a recursive manner as

$$\tau(\mathbf{X}, \mathbf{Y}) = \frac{u_{n_x n_y}}{n_x + n_y}, \qquad u_{ij} = \max\begin{cases} u_{i-1,j} + \kappa_{ij}, \\ u_{i-1,j-1} + 2\kappa_{ij}, \\ u_{i,j-1} + \kappa_{ij}. \end{cases} \quad (5)$$

$\mathbf{U}$ is initialized at the upper left, i.e., $u_{11} = 2\kappa_{11}$.

$$\kappa_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{y}_j) = \exp\!\left(-\frac{\|\mathbf{x}_i - \mathbf{y}_j\|^2}{2\sigma^2}\right)$$

is the frame kernel that constitutes the kernel matrix $\mathbf{K} \in \mathbb{R}^{n_x \times n_y}$. Fig. 3c illustrates the procedure to build $\mathbf{U}$ for two short sequences (Fig. 3a). Fig. 3b shows the binary frame kernel matrix $\mathbf{K}$ when $\sigma \to 0$. The final value of DTAK, $\tau(\mathbf{X}, \mathbf{Y}) = \frac{11}{13}$, is computed by normalizing the bottom-right entry of $\mathbf{U}$ by the sum of the sequence lengths.
Fig. 2. Decomposition of a time series at two different temporal scales. (a) Temporal clustering of a 1D time series. Vertical black dotted lines denote the segment boundaries. (b) Frame kernel matrix. Each segment corresponds to a rectangular block (yellow line). (c) Temporal clustering of motion capture data. (d) Frame kernel matrix.

A more revealing mathematical expression for understanding DTAK can be obtained using matrix notation. Observe that DTAK computes a monotonic trajectory (the red curve in Fig. 3b) starting from the top-left corner and ending at the bottom-right corner of the frame kernel matrix $\mathbf{K}$. This monotonic trajectory can be mathematically parameterized by two frame-index vectors $\mathbf{p} \in \{1{:}n_x\}^l$ and $\mathbf{q} \in \{1{:}n_y\}^l$, where $l$ is the number of steps needed to align $\mathbf{X}$ and $\mathbf{Y}$ by DTAK (e.g., $l = 8$ in the case of Fig. 3). Using these indexes, we can define a new normalized correspondence matrix $\mathbf{W} = [w_{ij}]_{n_x \times n_y} \in \mathbb{R}^{n_x \times n_y}$, where $w_{ij} = \frac{1}{n_x + n_y}(p_c - p_{c-1} + q_c - q_{c-1})$ if there exist $p_c = i$ and $q_c = j$ for some $c$ (i.e., for every one of the $l$ steps we have two indexes that encode the correspondence between the two time series); otherwise, $w_{ij} = 0$. See Fig. 3d for an example of $\mathbf{W}$. Using this new matrix, DTAK can be rewritten in a more compact way as follows:

$$\tau(\mathbf{X}, \mathbf{Y}) = \mathrm{tr}(\mathbf{K}^T\mathbf{W}) = \psi(\mathbf{X})^T \psi(\mathbf{Y}), \quad (6)$$

where $\psi(\cdot)$ denotes a mapping of the sequence into a feature space. By Mercer's theorem [46], this mapping exists when $\tau(\mathbf{X}, \mathbf{Y})$ is a positive definite kernel. Unfortunately, DTAK is not necessarily a strictly positive definite kernel [47], [48], and a regularization of the kernel matrix needs to be performed (see Section 3.5).
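The sketch below implements the recursion (5) on a precomputed frame kernel K and backtracks the alignment path to form the correspondence matrix W of (6), so that trace(K^T W) reproduces the DTAK value. It is a reconstruction from the formulas above rather than the authors' implementation; the backpointer bookkeeping and tie-breaking are our own.

```python
import numpy as np

def dtak_with_path(K):
    """DTAK between two sequences from their frame kernel K (n_x by n_y), eq. (5).
    Returns the similarity tau and the normalized correspondence matrix W of eq. (6),
    so that tau == np.sum(K * W) == trace(K^T W)."""
    nx, ny = K.shape
    U = np.full((nx, ny), -np.inf)
    step = np.zeros((nx, ny), dtype=int)      # 0: diagonal, 1: vertical, 2: horizontal
    U[0, 0] = 2 * K[0, 0]                     # u_11 = 2 * kappa_11
    for i in range(nx):
        for j in range(ny):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:
                cands.append((U[i - 1, j - 1] + 2 * K[i, j], 0))
            if i > 0:
                cands.append((U[i - 1, j] + K[i, j], 1))
            if j > 0:
                cands.append((U[i, j - 1] + K[i, j], 2))
            U[i, j], step[i, j] = max(cands)
    tau = U[-1, -1] / (nx + ny)

    # Backtrack the monotonic path and assign the weights w_ij of eq. (6):
    # 2/(n_x+n_y) for a diagonal step (and the starting cell), 1/(n_x+n_y) otherwise.
    W = np.zeros_like(K, dtype=float)
    i, j = nx - 1, ny - 1
    while not (i == 0 and j == 0):
        s = step[i, j]
        W[i, j] = (2.0 if s == 0 else 1.0) / (nx + ny)
        if s == 0:
            i, j = i - 1, j - 1
        elif s == 1:
            i -= 1
        else:
            j -= 1
    W[0, 0] = 2.0 / (nx + ny)
    return tau, W
```

As with standard DTW, the cost of filling U (and of the backtracking) is O(n_x n_y).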
3.4 Energy Function for ACA
Given a sequence $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ with $n$ samples, ACA decomposes $\mathbf{X}$ into $m$ disjoint segments, each of which corresponds to one of $k$ classes. The $i$th segment, $\mathbf{Y}_i \doteq \mathbf{X}_{[s_i, s_{i+1})} = [\mathbf{x}_{s_i}, \ldots, \mathbf{x}_{s_{i+1}-1}] \in \mathbb{R}^{d \times n_i}$, is composed of the samples that begin at position $s_i$ and end at $s_{i+1} - 1$. The length of the segment is constrained as $n_i = s_{i+1} - s_i \le n_{\max}$, where $n_{\max}$ is the maximum length of a segment and controls the temporal granularity of the factorization. An indicator matrix $\mathbf{G} \in \{0,1\}^{k \times m}$ assigns each segment to a class; $g_{ci} = 1$ if $\mathbf{Y}_i$ belongs to class $c$, and $g_{ci} = 0$ otherwise. For instance (see Fig. 4), the 1D sequence (Fig. 4a) with 23 frames has been segmented into seven segments that belong to three clusters (Fig. 4b).
A major limitation of standard k-means clustering for the analysis of time series data [49] is that the temporal ordering of the frames is not taken into account. This section combines KKM and SC with the DTAK to achieve temporal clustering. ACA extends previous work on KKM and SC by minimizing:

$$J_{aca}(\mathbf{G}, \mathbf{s}) = \sum_{c=1}^{k} \sum_{i=1}^{m} g_{ci} \underbrace{\|\psi(\mathbf{X}_{[s_i, s_{i+1})}) - \mathbf{z}_c\|^2}_{\mathrm{dist}^2(\mathbf{Y}_i, \mathbf{z}_c)} = \|[\psi(\mathbf{Y}_1), \ldots, \psi(\mathbf{Y}_m)] - \mathbf{Z}\mathbf{G}\|_F^2,$$
$$\text{s.t. } \mathbf{G}^T \mathbf{1}_k = \mathbf{1}_m \ \text{ and } \ s_{i+1} - s_i \in [1, n_{\max}], \quad (7)$$
where $\mathbf{G} \in \{0,1\}^{k \times m}$ is a class indicator matrix and $\mathbf{s} \in \mathbb{R}^{m+1}$ is the vector that contains the start and end of each segment. $\mathbf{Y}_i \doteq \mathbf{X}_{[s_i, s_{i+1})}$ denotes a segment. Similarly to KKM, $\mathrm{dist}^2(\mathbf{Y}_i, \mathbf{z}_c)$ is the squared distance between the $i$th segment and the center of class $c$ in the feature space defined by the nonlinear mapping $\psi(\cdot)$, which is

$$\mathrm{dist}^2(\mathbf{Y}_i, \mathbf{z}_c) = \tau_{ii} - \frac{2}{m_c} \sum_{j=1}^{m} g_{cj}\,\tau_{ij} + \frac{1}{m_c^2} \sum_{j_1, j_2 = 1}^{m} g_{cj_1} g_{cj_2}\,\tau_{j_1 j_2},$$

where $m_c = \sum_{j=1}^{m} g_{cj}$ is the number of segments that belong to class $c$. The dynamic kernel function is defined as $\tau_{ij} = \psi(\mathbf{Y}_i)^T \psi(\mathbf{Y}_j) = \mathrm{tr}(\mathbf{W}_{ij}^T \mathbf{K}_{ij})$, based on (6).
The differences between ACA (7) and KKM (2) are worth pointing out:
1. ACA clusters variable-length features, that is, each segment $\mathbf{Y}_i$ might have a different number of samples (columns of $\mathbf{Y}_i$), whereas standard KKM has a fixed number of features (rows of $\mathbf{x}_i$).
2. A new variable, $\mathbf{s}$, is introduced to represent the start and end of each segment.
3. The distance used in ACA, $\mathrm{dist}(\mathbf{Y}_i, \mathbf{z}_c)$, is based on DTAK, which is robust to noise and invariant to temporal scaling factors.
4. A DP-based approach is used to efficiently solve ACA.
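Since dist²(Y_i, z_c) above has exactly the form of (3) with the segment-level kernel values τ_ij in place of κ_ij, the cluster-assignment step for segments can reuse the kkm_assign sketch from Section 3.1, applied to the matrix T = [τ_ij] (formally introduced in Section 3.5). A hypothetical usage line, with T and the previous k x m segment-cluster indicator G assumed given:

```python
# G: previous k x m segment-cluster indicator, T: m x m matrix of DTAK values tau_ij.
G_new = kkm_assign(T, G)   # same update as (3)-(4), applied at the segment level
```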
Fig. 3. Computation of DTAK. (a) Alignment examples for two 1D
sequences. (b) Frame kernel matrix (K). (c) Cumulative kernel matrix
(U). (d) Normalized correspondence matrix (W).
Fig. 4. A synthetic example of temporal clustering. (a) An example of 1D time series. (b) Temporal clustering of the 1D time series. (c) Sample-
segment indicator matrix (H). (d) Normalized correspondence matrix (W). (e) Segment-cluster indicator matrix (G). (f) Frame kernel matrix (K).
(g) Segment kernel matrix (T). (h) 2D embedding computed using the top two eigenvectors of T.

3.5 Matrix Formulation for ACA
A more enlightening formulation of ACA is the matrix form. Suppose that a sequence $\mathbf{X} \in \mathbb{R}^{d \times n}$ of length $n$ has been segmented into $m$ segments, $\{\mathbf{Y}_i \in \mathbb{R}^{d \times n_i}\}_{i=1}^{m}$, with $\sum_{i=1}^{m} n_i = n$. A key insight to understanding ACA is that we can define two kernel matrices: $\mathbf{T} = [\tau_{ij}]_{m \times m} \in \mathbb{R}^{m \times m}$, the segment kernel matrix (kernel between segments), and $\mathbf{K} = [\kappa_{ij}]_{n \times n} \in \mathbb{R}^{n \times n}$, the frame kernel matrix (kernel between frames). Each element of the segment kernel matrix ($\mathbf{T}$), $\tau_{ij} = \tau(\mathbf{Y}_i, \mathbf{Y}_j) = \mathrm{tr}(\mathbf{K}_{ij}^T \mathbf{W}_{ij})$, is the DTAK between the $i$th and $j$th segments ($\mathbf{Y}_i$ and $\mathbf{Y}_j$) computed using (6), where $\mathbf{K}_{ij} \in \mathbb{R}^{n_i \times n_j}$ and $\mathbf{W}_{ij} \in \mathbb{R}^{n_i \times n_j}$ are the frame kernel matrix and the normalized correspondence matrix between segments $\mathbf{Y}_i$ and $\mathbf{Y}_j$, respectively.

After some linear algebra, it can be shown that $\mathbf{T} \in \mathbb{R}^{m \times m}$ can be expressed as the product of a global correspondence matrix ($\mathbf{W}$), a global frame kernel matrix ($\mathbf{K}$), and a sample-segment indicator matrix ($\mathbf{H}$) as follows:

$$\mathbf{T} = \big[\mathrm{tr}\big(\mathbf{K}_{ij}^T \mathbf{W}_{ij}\big)\big]_{m \times m} = \mathbf{H}\,(\mathbf{K} \circ \mathbf{W})\,\mathbf{H}^T, \quad (8)$$

where $\mathbf{W} = [\mathbf{W}_{ij}]_{m \times m} \in \mathbb{R}^{n \times n}$ and $\mathbf{K} = [\mathbf{K}_{ij}]_{m \times m} \in \mathbb{R}^{n \times n}$ are obtained by arranging the $m \times m$ blocks $\mathbf{W}_{ij}$ and $\mathbf{K}_{ij}$, respectively. $\mathbf{H} \in \{0,1\}^{m \times n}$ is a matrix that encodes the correspondence between samples and segments such that $h_{ij} = 1$ if the $j$th sample belongs to the $i$th segment. See Fig. 4 for an example of these matrices.
Unfortunately, DTAK is not a strictly positive definite kernel [47], [48]. Thus, we add a scaled identity matrix to $\mathbf{K}$, that is, $\mathbf{K} \leftarrow \mathbf{K} + \lambda\mathbf{I}_n$, where $\lambda$ is chosen to be the absolute value of the smallest eigenvalue of $\mathbf{T}$ if $\mathbf{T}$ has negative eigenvalues.$^2$
After substituting the optimal value of $\mathbf{Z} = [\psi(\mathbf{Y}_1), \ldots, \psi(\mathbf{Y}_m)]\,\mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}$ in (7), a more understandable form of $J_{aca}$ results:

$$J_{aca}(\mathbf{G}, \mathbf{H}) = \mathrm{tr}\big((\mathbf{I}_m - \mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G})\,\mathbf{T}\big) = \mathrm{tr}\big((\mathbf{I}_m - \mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G})\,\mathbf{H}(\mathbf{K} \circ \mathbf{W})\mathbf{H}^T\big) = \mathrm{tr}\big((\mathbf{L} \circ \mathbf{W})\,\mathbf{K}\big),$$
$$\text{where } \mathbf{L} = \mathbf{I}_n - \mathbf{H}^T\mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G}\mathbf{H}. \quad (10)$$
Recall that $\mathbf{H}$ depends on $\mathbf{s}$ and that $\mathbf{G} \in \{0,1\}^{k \times m}$ is the segment-cluster indicator matrix such that $g_{ij} = 1$ if the $j$th segment belongs to the $i$th temporal cluster. See Fig. 4 for an example of temporal clustering and the role of the matrices $\mathbf{K}, \mathbf{W}, \mathbf{H}$. Consider the special case where each segment is one frame, i.e., $m = n$ and $\mathbf{H} = \mathbf{I}_n$. The segment kernel matrix then becomes simply the frame kernel matrix, i.e., $\tau_{ij} = \kappa_{ij}$ and $\mathbf{W} = \mathbf{1}_{n \times n}$. In this case, the energy function of ACA is equivalent to the function minimized by KKM [41], [50], [51]:

$$J_{kkm}(\mathbf{G}) = \mathrm{tr}(\mathbf{L}\mathbf{K}), \quad \text{where } \mathbf{L} = \mathbf{I}_n - \mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G}. \quad (11)$$

KKM finds the binary matrix $\mathbf{G} \in \{0,1\}^{k \times n}$ (i.e., the indicator matrix between samples and clusters) that makes $\mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G}$ as correlated as possible with the sample kernel matrix $\mathbf{K}$. On the other hand, ACA has two indicator matrices: $\mathbf{G}$, the segment-cluster indicator matrix, which solves for the correspondence between segments and clusters, and $\mathbf{H}$, the sample-segment indicator matrix, which encodes the correspondence between samples and segments. ACA finds the two binary matrices $\mathbf{G}$ and $\mathbf{H}$ that, after applying DTAK between all pairwise segments, make the matrix $(\mathbf{H}^T\mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G}\mathbf{H}) \circ \mathbf{W}$ as correlated as possible with the frame kernel $\mathbf{K}$. Fig. 4 illustrates the role of the different matrices in a synthetic temporal clustering example. Notice that once the matrices $\mathbf{K}, \mathbf{W}, \mathbf{H}$ are computed by ACA, the eigenvectors of the matrix $\mathbf{T}$ (8) provide a natural embedding for visualizing the seven segments of the time series in a low-dimensional space (see Fig. 4h).
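Putting (8) and (10) together, the following sketch builds H from the boundary vector s, assembles T block by block (reusing the hypothetical dtak_with_path helper sketched in Section 3.3 and a precomputed frame kernel K), evaluates J_aca, and embeds the segments with the leading eigenvectors of T as in Fig. 4h. The function names and the eigenvalue scaling of the embedding are our own choices, not the authors' implementation.

```python
import numpy as np

def indicator_H(s, n):
    """Sample-segment indicator H in {0,1}^{m x n}: h_ij = 1 iff frame j lies in segment
    i = [s_i, s_{i+1}). Boundaries s are 0-indexed here, with s[0] = 0 and s[m] = n."""
    m = len(s) - 1
    H = np.zeros((m, n))
    for i in range(m):
        H[i, s[i]:s[i + 1]] = 1
    return H

def segment_kernel_T(K, s):
    """Segment kernel matrix T_ij = DTAK(Y_i, Y_j) of eq. (8), built block by block
    from the frame kernel K; dtak_with_path is the helper sketched in Section 3.3."""
    m = len(s) - 1
    T = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            Kij = K[s[i]:s[i + 1], s[j]:s[j + 1]]
            T[i, j], _ = dtak_with_path(Kij)
    return T

def j_aca(T, G):
    """J_aca = trace((I_m - G^T (G G^T)^{-1} G) T), the first equality in (10)."""
    m = T.shape[0]
    P = G.T @ np.linalg.pinv(G @ G.T) @ G
    return np.trace((np.eye(m) - P) @ T)

def embed_segments(T, dim=2):
    """Embed the m segments with the top eigenvectors of T (cf. Fig. 4h); scaling each
    eigenvector by the square root of its eigenvalue is one common convention."""
    vals, vecs = np.linalg.eigh(T)
    top = np.argsort(vals)[::-1][:dim]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
```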
3.6 Coordinate-Descent Optimization for ACA
In a previous section, we have formulated the problem of
temporal clustering as an integer programming problem (7)
over two variables (G and s). Recall that G encodes the
segment-cluster correspondence and s (or, equivalently, H)
encodes the sample-segment correspondence. Optimizing
over G and s is NP-hard. This section proposes an efficient
coordinate-descent scheme that alternates between comput-
ing s using dynamic programming and G with a winner-
take-all strategy [52].
We solve the following subproblem at each iteration:

$$\mathbf{G}, \mathbf{s} = \arg\min_{\mathbf{G}, \mathbf{s}} J_{aca}(\mathbf{G}, \mathbf{s}) = \arg\min_{\mathbf{G}, \mathbf{s}} \sum_{c=1}^{k} \sum_{i=1}^{m} g_{ci}\, \mathrm{dist}^2(\mathbf{Y}_i, \dot{\mathbf{z}}_c),$$

where $\dot{\mathbf{z}}_c$ is the cluster mean implicitly computed from the segmentation $(\dot{\mathbf{G}}, \dot{\mathbf{s}})$ obtained in the previous step. Given a sequence $\mathbf{X}$ of length $n$, however, the number of all possible segmentations is exponential, i.e., $O(2^n)$, which makes a brute-force search for $\mathbf{s}$ infeasible. We use a DP-based algorithm to exhaustively examine all possible segmentations in polynomial time. Observe that the matrix $\mathbf{H}$ (see Fig. 4c) has a monotonic structure and can be optimally computed using DP.

Recall that we can rewrite (7) as a sum of distances:

$$J_{aca}(\mathbf{G}, \mathbf{s}) = \sum_{c=1}^{k} \sum_{i=1}^{m} g_{ci}\, \mathrm{dist}^2(\mathbf{Y}_i, \mathbf{z}_c).$$

To further leverage the relationship between $\mathbf{G}$ and $\mathbf{s}$, we introduce an auxiliary function, $J(\cdot): [1, n] \to \mathbb{R}$,

$$J(v) = \min_{\mathbf{G}, \mathbf{s}} J_{aca}(\mathbf{G}, \mathbf{s})\big|_{\mathbf{X}_{[1,v]}}, \quad (12)$$

to relate the minimum energy directly to the tail position $v$ of the subsequence $\mathbf{X}_{[1,v]} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_v]$. We can further verify that $J(\cdot)$ satisfies the principle of optimality [53], i.e.,

$$J(v) = \min_{1 < i \le v} \Big( J(i-1) + \min_{\mathbf{G}, \mathbf{s}} J_{aca}(\mathbf{G}, \mathbf{s})\big|_{\mathbf{X}_{[i,v]}} \Big), \quad (13)$$
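A sketch of the forward recursion (13), with the segment-length constraint n_max of (7) and the backtracking needed to recover s, is given below. The per-segment cost min_c dist²(X_{[i,v]}, ż_c) is left as a user-supplied function, since its efficient DTAK-based evaluation is developed in the remainder of this section; the indexing and backpointer scheme are our own scaffolding.

```python
import numpy as np

def dp_segment(n, n_max, seg_cost):
    """Forward DP for eq. (13): J(v) = min_i J(i-1) + seg_cost(i, v), with segment
    length v - i + 1 <= n_max, where seg_cost(i, v) stands for
    min_c dist^2(X_[i,v], z_c) under the current cluster means.
    Frames are 1-indexed as in the text; returns the boundary vector s and J(n)."""
    J = np.full(n + 1, np.inf)     # J[v] = best energy for X_[1,v]; J[0] = 0 (empty prefix)
    J[0] = 0.0
    back = np.zeros(n + 1, dtype=int)
    for v in range(1, n + 1):
        for i in range(max(1, v - n_max + 1), v + 1):
            cost = J[i - 1] + seg_cost(i, v)
            if cost < J[v]:
                J[v], back[v] = cost, i
    # Backtrack the boundaries s = [s_1, ..., s_{m+1}] with s_1 = 1 and s_{m+1} = n + 1.
    s = [n + 1]
    v = n
    while v > 0:
        s.append(back[v])
        v = back[v] - 1
    return list(reversed(s)), J[n]
```

The forward pass evaluates the segment cost O(n · n_max) times, so the overall efficiency hinges on how cheaply that DTAK-based term can be computed.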
2. It can be proven that adding $\lambda\mathbf{I}_n$ to $\mathbf{K}$ has the same effect as adding $\lambda\mathbf{I}_m$ to $\mathbf{T}$, that is,

$$\mathbf{H}\big((\mathbf{K} + \lambda\mathbf{I}_n) \circ \mathbf{W}\big)\mathbf{H}^T = \mathbf{T} + \lambda\mathbf{H}\bar{\mathbf{W}}\mathbf{H}^T = \mathbf{T} + \lambda\mathbf{I}_m, \quad (9)$$

where $\bar{\mathbf{W}} = \mathbf{I}_n \circ \mathbf{W} \in \mathbb{R}^{n \times n}$ is a diagonal matrix. Notice that the diagonal of $\bar{\mathbf{W}}$ is composed of the $m$ diagonal blocks $\mathbf{W}_{ii}$ and that $\mathbf{W}_{ii} = \frac{1}{n_i}\mathbf{I}_{n_i}$ according to (5). Therefore, we can conclude that $\bar{\mathbf{W}} = \mathrm{diag}(\frac{1}{n_1}\mathbf{1}_{n_1}, \ldots, \frac{1}{n_m}\mathbf{1}_{n_m})$. In addition, because $\mathbf{H} = [\mathbf{h}^1, \ldots, \mathbf{h}^m]^T \in \{0,1\}^{m \times n}$ is binary and its rows are orthogonal, we can conclude that $\mathbf{h}^{iT}\mathbf{h}^i = n_i$ and $\mathbf{h}^{iT}\mathbf{h}^j = 0$ for $i \ne j$. Combining these results for $\bar{\mathbf{W}}$ and $\mathbf{H}$, we prove that $\mathbf{H}\bar{\mathbf{W}}\mathbf{H}^T = \mathbf{I}_m$.
