Hierarchical Aligned Cluster Analysis
for Temporal Clustering of Human Motion
Feng Zhou, Student Member, IEEE, Fernando De la Torre, and Jessica K. Hodgins
Abstract—Temporal segmentation of human motion into plausible motion primitives is central to understanding and building
computational models of human motion. Several issues contribute to the challenge of discovering motion primitives: the exponential
nature of all possible movement combinations, the variability in the temporal scale of human actions, and the complexity of
representing articulated motion. We pose the problem of learning motion primitives as a temporal clustering one, and derive an
unsupervised hierarchical bottom-up framework called hierarchical aligned cluster analysis (HACA). HACA finds a partition of a given
multidimensional time series into m disjoint segments such that each segment belongs to one of k clusters. HACA combines kernel
k-means with the generalized dynamic time alignment kernel to cluster time series data. Moreover, it provides a natural framework to
find a low-dimensional embedding for the time series. HACA is efficiently optimized with a coordinate descent strategy and dynamic
programming. Experimental results on motion capture and video data demonstrate the effectiveness of HACA for segmenting complex
motions and as a visualization tool. We also compare the performance of HACA to state-of-the-art algorithms for temporal clustering on
data of a honey bee dance. The HACA code is available online.
Index Terms—Temporal segmentation, time series clustering, time series visualization, human motion analysis, Kernel k-means,
spectral clustering, dynamic programming
1 INTRODUCTION
Systems that can detect, recognize, and synthesize human
motion are of interest in both research and industry due
to the large number of potential applications in virtual
reality, smart surveillance systems, advanced user
interfaces, and motion analysis (see [1], [2], [3] for a review).
The quality of the detection, recognition, or synthesis in
these applications greatly depends on the spatial and
temporal resolution of motion databases, as well as the
complexity of the models. Unsupervised techniques to learn
motion primitives from training data have recently attracted
the interest of many scientists in computer vision [4], [5], [6],
[7], [8], [9], [10], [11], [12] and computer graphics [13], [14],
[15], [16], [17], [18], [19], [20]. Fig. 1 illustrates the problem
addressed in this paper: Given a sequence of a person
walking and running, the first level of the hierarchy
provided by our algorithm (HACA) is able to group the
frames into two classes: running and walking. In a finer
level of the hierarchy, HACA decomposes each of the
actions (e.g., running, walking) into motion primitives of
smaller temporal scale. However, some temporal compo-
nents might not necessarily have a physical meaning.
The inherent difficulty of temporally decomposing hu-
man motion stems from the large number of possible
movement combinations, a relatively large range of temporal
scales for different behaviors, the irregularity in the
periodicity of human actions, and the intraperson motion
variability. To address these challenges, this paper frames the
problem of hierarchical temporal decomposition of human
motion as an unsupervised learning problem, and proposes a
hierarchical aligned cluster analysis (HACA). HACA is a
generalization of kernel k-means (KKM) and spectral
clustering (SC) for time series clustering and embedding.
Over the last few years, several approaches for unsuper-
vised segmentation of activities have been proposed (see, for
example, [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15],
[16], [17], [18]). HACA presents several advantages:
. The temporal clustering problem is posed as an
energy minimization.
. HACA provides a natural embedding for clustering
and visualizing time series data.
. HACA provides a hierarchical decomposition at
several temporal scales (see Fig. 1). The time granu-
larity of the motion primitives is specified manually.
. Minimizing HACA is an NP-hard problem. This paper
proposes an efficient coordinate descent minimiza-
tion algorithm to find a solution for HACA via
dynamic programming.
2 PREVIOUS WORK
We build on prior research in human motion analysis and
temporal clustering.
2.1 Human Motion Analysis
Extensive literature in graphics and computer vision
addresses the problem of grouping human actions. In the
computer graphics literature, Barbic et al. [15] proposed an
algorithm to decompose human motion into distinct actions
based on probabilistic principal component analysis, which
places a cut when the distribution of human poses changes.
Jenkins et al. [13], [14] used the zero-velocity crossing points
. The authors are with the Robotics Institute, Carnegie Mellon University,
Smith Hall, 5000 Forbes Ave, Pittsburgh, PA 15232.
E-mail: zhfe99@gmail.com, {ftorre, jkh}@cs.cmu.edu.
Manuscript received 27 Apr. 2011; revised 19 Apr. 2012; accepted 12 May
2012; published online 14 June 2012.
Recommended for acceptance by C. Sminchisescu.
Digital Object Identifier no. 10.1109/TPAMI.2012.137.

of the angular velocity to segment the stream of motion
capture data into short sequences. Jenkins and Matarić [21]
further extended the work by finding a nonlinear embedding
using Isomap [22] that reveals the temporal structure of
segmented motion. Beaudoin et al. [20] developed a string-
based motif-finding algorithm to decompose motion into
action primitives and interpret actions as a composition on
the alphabet of these action primitives.
In the computer vision literature, Zhong et al. [23] used a
bipartite graph co-clustering algorithm to segment and
detect unusual activities in video. Zelnik-Manor and Irani
[5] extracted spatiotemporal features at multiple temporal
scales to isolate and cluster events. An outcome of the
clustering process is the temporal segmentation of long
video sequences into event subsequences. De la Torre et al.
[9] proposed a geometric-invariant clustering algorithm to
decompose a stream of facial behavior into facial gestures.
Unusual facial expressions can be detected through the
analysis of outlying temporal patterns. De la Torre and Agell
[24] decomposed a multimodal stream of human behavior
into several activities using semi-supervised temporal
clustering. Recently, Guerra-Filho and Aloimonos [8], [25]
presented a linguistic framework for modeling and learning
human activity representations from video. To obtain a low-
level representation, they segmented the movement by
estimating the velocity and acceleration of the actuator
attached to the joint. Minnen et al. [26] discovered motifs in
real-valued, multivariate time series data by locating regions
of high density in the space of all time series subsequences.
2.2 Temporal Clustering
Segmentation and clustering of time series data is a topic
that has been explored in fields other than computer vision
and graphics. In particular, there is a substantial amount of
work in the field of data mining [27], [28], speech processing
[29], [30], animal behavior analysis [12], [31], and signal
processing [32], [33].
Two of the most popular approaches are change-point
detection and switching linear dynamical system (SLDS).
The goal of change-point detection [32], [33], [34] is to
identify changes at unknown times and estimate the location
of changes in stochastic processes. Unlike previous work on
change-point detection, HACA finds the change points that
minimize the error across several segments (not only two)
that belong to one of k clusters.
SLDSs describe the dynamics of the time series by
switching several linear dynamical systems over time. The
switching states in SLDS inference implicitly provide the
segmentation of an input sequence. Because the exact
inference in SLDS is intractable, Pavlović et al. [35] proposed
approximate inference algorithms by casting the SLDS
model as a dynamic Bayesian network (DBN). Oh et al.
[12] introduced a data-driven MCMC (DD-MCMC) infer-
ence method to identify the exact posterior of SLDSs in the
presence of intractability. In their framework [12], the
standard SLDS has been improved by incorporating a
duration model, thereby yielding a more accurate result in
segmentation. To address the problem of learning SLDSs
with an unknown number of modes, Fox et al. [31] proposed
a nonparametric Bayesian method that utilizes the hier-
archical Dirichlet process (HDP) as a prior on the para-
meters of SLDSs. Recently, Fox et al. [36] further extended
this work by adding the beta process prior to discover and
model dynamical behaviors that are shared among multiple
related time series.
3 ALIGNED CLUSTER ANALYSIS (ACA)
This section describes ACA and hierarchical ACA (HACA),
an extension of kernel k-means and spectral clustering for
clustering time series. Section 3.1 reviews the matrix
formulation for k-means, KKM and SC. Section 3.2 describes
the properties of the frame kernel matrix that are key to
understanding ACA and HACA. Section 3.3 reviews the
dynamic time alignment kernel (DTAK) that is used as a
similarity measure between segments. Section 3.4 proposes
the ACA energy function and its matrix formulation is
discussed in Section 3.5. Section 3.6 describes a coordinate-
descent strategy for optimizing ACA. Section 3.7 presents an
efficient optimization strategy for ACA. Section 3.8 de-
scribes HACA.
3.1 k-Means, KKM and SC
Clustering refers to the partition of $n$ data points into $k$ disjoint clusters. Among the various approaches to unsupervised clustering, k-means [37] is favored for its simplicity. k-means clustering splits a set of $n$ samples into $k$ groups by minimizing the within-cluster variation, finding a partition of the data that is a local optimum of the following energy function [38], [39]:
$$J_{km}(\mathbf{Z}, \mathbf{G}) = \sum_{c=1}^{k} \sum_{i=1}^{n} g_{ci}\, \|\mathbf{x}_i - \mathbf{z}_c\|^2 = \|\mathbf{X} - \mathbf{Z}\mathbf{G}\|_F^2, \qquad \text{s.t. } \mathbf{G}^T \mathbf{1}_k = \mathbf{1}_n, \quad (1)$$

where $\mathbf{x}_i \in \mathbb{R}^d$ (see notation$^1$) is a vector representing the $i$th data point and $\mathbf{z}_c \in \mathbb{R}^d$ is the geometric centroid of the data points for class $c$. $\mathbf{G} \in \{0,1\}^{k \times n}$ is a binary indicator matrix such that $g_{ci} = 1$ if the sample $\mathbf{x}_i$ belongs to cluster $c$ and zero otherwise.
Fig. 1. Hierarchical decomposition of human motion. Each level of the
figure corresponds to one hierarchy found by HACA at different temporal
resolutions. The top row shows some samples of motion capture data of
a person walking and then running (7.2 seconds). The second row
shows the first level of the decomposition found by HACA. Each
temporal pattern contains samples for walking or running. The bottom
row shows the lower level, which contains subcycles of running and
walking.
1. Bold capital letters denote a matrix $\mathbf{X}$, bold lower-case letters a column vector $\mathbf{x}$. $\mathbf{x}_i$ represents the $i$th column of the matrix $\mathbf{X}$. $x_{ij}$ denotes the scalar in the $i$th row and $j$th column of the matrix $\mathbf{X}$. All non-bold letters represent scalars. $\mathbf{1}_{m \times n}, \mathbf{0}_{m \times n} \in \mathbb{R}^{m \times n}$ are matrices of ones and zeros. $\mathbf{I}_n \in \mathbb{R}^{n \times n}$ is an identity matrix. $\|\mathbf{x}\| = \sqrt{\mathbf{x}^T\mathbf{x}}$ denotes the Euclidean distance. $\|\mathbf{X}\|_F^2 = \mathrm{tr}(\mathbf{X}^T\mathbf{X}) = \mathrm{tr}(\mathbf{X}\mathbf{X}^T)$ designates the Frobenius norm of a matrix. $\mathbf{X} \circ \mathbf{Y}$ is the Hadamard product of matrices. $\mathrm{diag}(\mathbf{x})$ is a diagonal matrix whose diagonal elements are $\mathbf{x}$. $[i, j]$ and $[i, j)$ list the integers $\{i, i+1, \ldots, j-1, j\}$ and $\{i, i+1, \ldots, j-1\}$, respectively. $\mathbf{X}_{[i,j]} = [\mathbf{x}_i, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_j]$ is composed of the columns of $\mathbf{X}$ indexed by the integers in $[i, j]$. $\dot{\mathbf{X}}$ denotes the previous value of $\mathbf{X}$ in an updating scheme.

The k-means algorithm performs coordinate descent on the energy function $J_{km}(\mathbf{Z}, \mathbf{G})$. Given the current value of the means $\dot{\mathbf{Z}} \in \mathbb{R}^{d \times k}$, the first step finds, for each data point $\mathbf{x}_i$, an indicator $\mathbf{g}_i \in \{0,1\}^k$ in which exactly one entry is one and the others are zero, while minimizing (1). The second step computes $\mathbf{Z} = \mathbf{X}\mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}$, which is equivalent to computing the mean of each cluster. These alternating steps are guaranteed to converge to a local minimum of $J_{km}(\mathbf{Z}, \mathbf{G})$ [40].
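To make the two alternating steps concrete, here is a minimal NumPy sketch of k-means in the matrix notation of (1). It is an illustrative reconstruction, not the authors' released code; the initialization, the convergence test, and the use of a pseudo-inverse to guard against empty clusters are our own choices.

```python
import numpy as np

def kmeans_matrix(X, k, n_iter=100, seed=0):
    """Coordinate descent on J_km(Z, G) = ||X - Z G||_F^2 for X in R^{d x n}."""
    d, n = X.shape
    rng = np.random.default_rng(seed)
    Z = X[:, rng.choice(n, k, replace=False)]          # initial means (d x k)
    G = np.zeros((k, n))
    for _ in range(n_iter):
        # Step 1: assign each sample to its closest mean (one 1 per column of G).
        dist2 = ((X[:, None, :] - Z[:, :, None]) ** 2).sum(axis=0)   # k x n
        G_new = np.zeros((k, n))
        G_new[dist2.argmin(axis=0), np.arange(n)] = 1
        if np.array_equal(G_new, G):
            break
        G = G_new
        # Step 2: Z = X G^T (G G^T)^{-1}, i.e., the mean of each cluster.
        Z = X @ G.T @ np.linalg.pinv(G @ G.T)
    return Z, G
```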
A major limitation of the k-means algorithm is that it is only optimal for spherical clusters. To overcome this limitation, kernel k-means [41] implicitly maps the data to a higher dimensional space using kernels. KKM [41], [38] minimizes

$$J_{kkm}(\mathbf{G}) = \sum_{c=1}^{k} \sum_{i=1}^{n} g_{ci} \underbrace{\|\phi(\mathbf{x}_i) - \mathbf{z}_c\|^2}_{\mathrm{dist}^2(\mathbf{x}_i, \mathbf{z}_c)} = \|\phi(\mathbf{X}) - \mathbf{Z}\mathbf{G}\|_F^2, \qquad \text{s.t. } \mathbf{G}^T \mathbf{1}_k = \mathbf{1}_n, \quad (2)$$
where $\mathrm{dist}^2(\mathbf{x}_i, \mathbf{z}_c)$ is the squared distance between the $i$th sample and the center of class $c$ in the feature space, that is,

$$\mathrm{dist}^2(\mathbf{x}_i, \mathbf{z}_c) = \kappa_{ii} - \frac{2}{n_c} \sum_{j=1}^{n} g_{cj}\,\kappa_{ij} + \frac{1}{n_c^2} \sum_{j_1, j_2 = 1}^{n} g_{cj_1} g_{cj_2}\,\kappa_{j_1 j_2}, \quad (3)$$

where $n_c = \sum_{j=1}^{n} g_{cj}$ is the number of samples that belong to class $c$. The kernel function is defined as $\kappa_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.
Similarly to the first step in the k-means algorithm, KKM assigns each sample to the closest cluster mean ($\dot{\mathbf{z}}_c$) computed in the previous step:

$$g_{c_i i} = 1, \quad \text{where } c_i = \arg\min_c \ \mathrm{dist}^2(\mathbf{x}_i, \dot{\mathbf{z}}_c). \quad (4)$$

In KKM, in general, the mean cannot be computed explicitly. However, there is no need to compute the mean, because $\mathrm{dist}^2(\mathbf{x}_i, \dot{\mathbf{z}}_c)$ can be calculated from the kernel matrix.
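Because (3) expresses $\mathrm{dist}^2(\mathbf{x}_i, \dot{\mathbf{z}}_c)$ entirely in terms of kernel values, the assignment step (4) never needs the means explicitly. Below is a minimal sketch of one such assignment step, assuming a precomputed kernel matrix K and the indicator matrix from the previous iteration; the names and the empty-cluster guard are our own.

```python
import numpy as np

def kkm_assign(K, G):
    """One KKM assignment step: recompute the k x n indicator G from the
    kernel matrix K (n x n) and the previous G, using eqs. (3)-(4)."""
    k, n = G.shape
    n_c = G.sum(axis=1).astype(float)                    # samples per cluster
    n_c[n_c == 0] = 1.0                                  # guard against empty clusters
    # dist^2(x_i, z_c) = K_ii - (2/n_c) sum_j g_cj K_ij
    #                    + (1/n_c^2) sum_{j1,j2} g_cj1 g_cj2 K_j1j2
    cross = (G @ K) / n_c[:, None]                       # k x n
    within = np.einsum('cj,jl,cl->c', G, K, G) / n_c**2  # k
    dist2 = np.diag(K)[None, :] - 2 * cross + within[:, None]
    G_new = np.zeros_like(G)
    G_new[dist2.argmin(axis=0), np.arange(n)] = 1
    return G_new
```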
Spectral clustering also minimizes a weighted version of
(2), but G is relaxed to be continuous. See [41] and [38] for a
more detailed explanation of the relation between SC and
KKM. In this paper, we will extend KKM and SC to cluster
and find a low-dimensional embedding of the time series.
3.2 Frame Kernel Matrix
This section describes some properties of the frame kernel matrix, $\mathbf{K} = \phi(\mathbf{X})^T \phi(\mathbf{X}) \in \mathbb{R}^{n \times n}$, where $\mathbf{X} \in \mathbb{R}^{d \times n}$ is a multidimensional time series of length $n$. Each entry, $\kappa_{ij}$, defines the similarity between two frames, $\mathbf{x}_i$ and $\mathbf{x}_j$, by means of a kernel function $\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$. The linear kernel, $\kappa_{ij} = \mathbf{x}_i^T \mathbf{x}_j$, and the Gaussian kernel, $\kappa_{ij} = \exp\!\big(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\big)$, are perhaps the most commonly used kernels. In the literature on dynamical systems, the frame kernel matrix ($\mathbf{K}$) is alternatively called the recurrence matrix [42], [43], and its structure reveals important information about the dynamics.
To illustrate the properties of this matrix, consider the 1D time series shown in Fig. 2a. In this case, we compute the frame kernel matrix using the exponential kernel, $\kappa_{ij} = \exp\!\big(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\big)$. We choose an infinitely small bandwidth ($\sigma \to 0$) to obtain a binary frame kernel matrix (Fig. 2b). In the following, we highlight one property, period ambiguity, that is relevant to ACA. Fig. 2a (second and third rows) plots two different, but valid, decompositions of the same time series at two different temporal scales. To avoid this ambiguity, we introduce a parameter $n_{\max}$ to constrain the length of the segments. In this case, we set $n_{\max} = 2$ and $n_{\max} = 4$ for the second and third rows, respectively. Similarly, Fig. 2c shows an example of a multidimensional time series of motion capture data of a subject performing two activities. Fig. 2d illustrates the corresponding frame kernel matrix at two different temporal resolution levels.
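As a concrete illustration of the frame kernel matrix described above, the snippet below builds K with a Gaussian kernel and, for a very small bandwidth, thresholds it into a binary recurrence-like matrix as in Fig. 2b. The toy series and the threshold are our own choices, not values from the paper.

```python
import numpy as np

def frame_kernel(X, sigma):
    """Gaussian frame kernel K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for X in R^{d x n}."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # n x n squared distances
    return np.exp(-sq / (2.0 * sigma ** 2))

# A periodic 1D toy series; as sigma -> 0 the kernel approaches a binary matrix.
X = np.array([[0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 1.0, 2.0]])
K = frame_kernel(X, sigma=1e-3)
K_binary = (K > 0.5).astype(int)
```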
3.3 Dynamic Time Alignment Kernel
A temporal clustering algorithm needs to define a distance
between segments of different length. Ideally, this distance
should be invariant to the speed of the human action. This
section reviews the DTAK that extends dynamic time
warping (DTW) to satisfy the properties of a distance.
A frequent approach to aligning time series has been
DTW. A known drawback of using DTW as a distance is that
it fails to satisfy the triangle inequality [44]. To address this
issue, Shimodaira et al. [45] proposed the DTAK. Given two sequences $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_{n_x}] \in \mathbb{R}^{d \times n_x}$ and $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_{n_y}] \in \mathbb{R}^{d \times n_y}$, DTAK computes the similarity using dynamic programming. DTAK uses the cumulative kernel matrix $\mathbf{U} \in \mathbb{R}^{n_x \times n_y}$ (as in DTW), computed in a recursive manner as

$$\tau(\mathbf{X}, \mathbf{Y}) = \frac{u_{n_x n_y}}{n_x + n_y}, \qquad u_{ij} = \max\begin{cases} u_{i-1,j} + \kappa_{ij}, \\ u_{i-1,j-1} + 2\kappa_{ij}, \\ u_{i,j-1} + \kappa_{ij}. \end{cases} \quad (5)$$

$\mathbf{U}$ is initialized at the upper left, i.e., $u_{11} = 2\kappa_{11}$.

$$\kappa_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{y}_j) = \exp\!\left(-\frac{\|\mathbf{x}_i - \mathbf{y}_j\|^2}{2\sigma^2}\right)$$

is the frame kernel that constitutes the kernel matrix $\mathbf{K} \in \mathbb{R}^{n_x \times n_y}$. Fig. 3c illustrates the procedure to build $\mathbf{U}$ for two short sequences (Fig. 3a). Fig. 3b shows the binary frame kernel matrix $\mathbf{K}$ when $\sigma \to 0$. The final value of DTAK, $\tau(\mathbf{X}, \mathbf{Y}) = \frac{11}{13}$, is computed by normalizing the bottom-right entry of $\mathbf{U}$ by the sum of the sequence lengths.
Fig. 2. Decomposition of a time series at two different temporal scales. (a) Temporal clustering of a 1D time series. Vertical black dotted lines denote the segment boundaries. (b) Frame kernel matrix. Each segment corresponds to a rectangular block (yellow line). (c) Temporal clustering of motion capture data. (d) Frame kernel matrix.

A more revealing mathematical expression for understanding DTAK can be obtained using matrix notation. Observe that DTAK computes a monotonic trajectory (the red curve in Fig. 3b) starting from the top-left corner and ending at the bottom-right corner of the frame kernel matrix $\mathbf{K}$. This monotonic trajectory can be mathematically parameterized by two frame-index vectors $\mathbf{p} \in \{1{:}n_x\}^l$ and $\mathbf{q} \in \{1{:}n_y\}^l$, where $l$ is the number of steps needed to align $\mathbf{X}$ and $\mathbf{Y}$ by DTAK (e.g., $l = 8$ in the case of Fig. 3). Using these indexes, we can define a new normalized correspondence matrix $\mathbf{W} = [w_{ij}]_{n_x \times n_y} \in \mathbb{R}^{n_x \times n_y}$, where $w_{ij} = \frac{1}{n_x + n_y}(p_c - p_{c-1} + q_c - q_{c-1})$ if there exist $p_c = i$ and $q_c = j$ for some $c$ (i.e., for every one of the $l$ steps we have two indexes that encode the correspondence between the two time series); otherwise, $w_{ij} = 0$. See Fig. 3d for an example of $\mathbf{W}$. Using this new matrix, DTAK can be rewritten in a more compact way as follows:

$$\tau(\mathbf{X}, \mathbf{Y}) = \mathrm{tr}(\mathbf{K}^T\mathbf{W}) = \psi(\mathbf{X})^T \psi(\mathbf{Y}), \quad (6)$$

where $\psi(\cdot)$ denotes a mapping of the sequence into a feature space. By Mercer's theorem [46], this mapping exists when $\tau(\mathbf{X}, \mathbf{Y})$ is a positive definite kernel. Unfortunately, DTAK is not necessarily a strictly positive definite kernel [47], [48], and a regularization of the kernel matrix needs to be performed (see Section 3.5).
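The sketch below implements the recursion (5) on a precomputed frame kernel K and backtracks the alignment path to form the correspondence matrix W of (6), so that trace(K^T W) reproduces the DTAK value. It is a reconstruction from the formulas above rather than the authors' implementation; the backpointer bookkeeping and tie-breaking are our own.

```python
import numpy as np

def dtak_with_path(K):
    """DTAK between two sequences from their frame kernel K (n_x by n_y), eq. (5).
    Returns the similarity tau and the normalized correspondence matrix W of eq. (6),
    so that tau == np.sum(K * W) == trace(K^T W)."""
    nx, ny = K.shape
    U = np.full((nx, ny), -np.inf)
    step = np.zeros((nx, ny), dtype=int)      # 0: diagonal, 1: vertical, 2: horizontal
    U[0, 0] = 2 * K[0, 0]                     # u_11 = 2 * kappa_11
    for i in range(nx):
        for j in range(ny):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:
                cands.append((U[i - 1, j - 1] + 2 * K[i, j], 0))
            if i > 0:
                cands.append((U[i - 1, j] + K[i, j], 1))
            if j > 0:
                cands.append((U[i, j - 1] + K[i, j], 2))
            U[i, j], step[i, j] = max(cands)
    tau = U[-1, -1] / (nx + ny)

    # Backtrack the monotonic path and assign the weights w_ij of eq. (6):
    # 2/(n_x+n_y) for a diagonal step (and the starting cell), 1/(n_x+n_y) otherwise.
    W = np.zeros_like(K, dtype=float)
    i, j = nx - 1, ny - 1
    while not (i == 0 and j == 0):
        s = step[i, j]
        W[i, j] = (2.0 if s == 0 else 1.0) / (nx + ny)
        if s == 0:
            i, j = i - 1, j - 1
        elif s == 1:
            i -= 1
        else:
            j -= 1
    W[0, 0] = 2.0 / (nx + ny)
    return tau, W
```

As with standard DTW, the cost of filling U (and of the backtracking) is O(n_x n_y).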
3.4 Energy Function for ACA
Given a sequence $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ with $n$ samples, ACA decomposes $\mathbf{X}$ into $m$ disjoint segments, each of which corresponds to one of $k$ classes. The $i$th segment, $\mathbf{Y}_i \doteq \mathbf{X}_{[s_i, s_{i+1})} = [\mathbf{x}_{s_i}, \ldots, \mathbf{x}_{s_{i+1}-1}] \in \mathbb{R}^{d \times n_i}$, is composed of the samples that begin at position $s_i$ and end at $s_{i+1} - 1$. The length of the segment is constrained as $n_i = s_{i+1} - s_i \le n_{\max}$, where $n_{\max}$ is the maximum length of a segment and controls the temporal granularity of the factorization. An indicator matrix $\mathbf{G} \in \{0,1\}^{k \times m}$ assigns each segment to a class; $g_{ci} = 1$ if $\mathbf{Y}_i$ belongs to class $c$, and $g_{ci} = 0$ otherwise. For instance (see Fig. 4), the 1D sequence (Fig. 4a) with 23 frames has been segmented into seven segments that belong to three clusters (Fig. 4b).
A major limitation of standard k-means clustering for the analysis of time series data [49] is that the temporal ordering of the frames is not taken into account. This section combines KKM and SC with the DTAK to achieve temporal clustering. ACA extends previous work on KKM and SC by minimizing:

$$J_{aca}(\mathbf{G}, \mathbf{s}) = \sum_{c=1}^{k} \sum_{i=1}^{m} g_{ci} \underbrace{\|\psi(\mathbf{X}_{[s_i, s_{i+1})}) - \mathbf{z}_c\|^2}_{\mathrm{dist}^2(\mathbf{Y}_i, \mathbf{z}_c)} = \|[\psi(\mathbf{Y}_1), \ldots, \psi(\mathbf{Y}_m)] - \mathbf{Z}\mathbf{G}\|_F^2,$$
$$\text{s.t. } \mathbf{G}^T \mathbf{1}_k = \mathbf{1}_m \ \text{ and } \ s_{i+1} - s_i \in [1, n_{\max}], \quad (7)$$
where $\mathbf{G} \in \{0,1\}^{k \times m}$ is a class indicator matrix and $\mathbf{s} \in \mathbb{R}^{m+1}$ is the vector that contains the start and end of each segment. $\mathbf{Y}_i \doteq \mathbf{X}_{[s_i, s_{i+1})}$ denotes a segment. Similarly to KKM, $\mathrm{dist}^2(\mathbf{Y}_i, \mathbf{z}_c)$ is the squared distance between the $i$th segment and the center of class $c$ in the feature space defined by the nonlinear mapping $\psi(\cdot)$, which is

$$\mathrm{dist}^2(\mathbf{Y}_i, \mathbf{z}_c) = \tau_{ii} - \frac{2}{m_c} \sum_{j=1}^{m} g_{cj}\,\tau_{ij} + \frac{1}{m_c^2} \sum_{j_1, j_2 = 1}^{m} g_{cj_1} g_{cj_2}\,\tau_{j_1 j_2},$$

where $m_c = \sum_{j=1}^{m} g_{cj}$ is the number of segments that belong to class $c$. The dynamic kernel function is defined as $\tau_{ij} = \psi(\mathbf{Y}_i)^T \psi(\mathbf{Y}_j) = \mathrm{tr}(\mathbf{W}_{ij}^T \mathbf{K}_{ij})$, based on (6).
The differences between ACA (7) and KKM (2) are worth pointing out:
1. ACA clusters variable-length features, that is, each segment $\mathbf{Y}_i$ might have a different number of samples (columns of $\mathbf{Y}_i$), whereas standard KKM has a fixed number of features (rows of $\mathbf{x}_i$).
2. A new variable, $\mathbf{s}$, is introduced to represent the start and end of each segment.
3. The distance used in ACA, $\mathrm{dist}(\mathbf{Y}_i, \mathbf{z}_c)$, is based on DTAK, which is robust to noise and invariant to temporal scaling factors.
4. A DP-based approach is used to efficiently solve ACA.
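Since dist²(Y_i, z_c) above has exactly the form of (3) with the segment-level kernel values τ_ij in place of κ_ij, the cluster-assignment step for segments can reuse the kkm_assign sketch from Section 3.1, applied to the matrix T = [τ_ij] (formally introduced in Section 3.5). A hypothetical usage line, with T and the previous k x m segment-cluster indicator G assumed given:

```python
# G: previous k x m segment-cluster indicator, T: m x m matrix of DTAK values tau_ij.
G_new = kkm_assign(T, G)   # same update as (3)-(4), applied at the segment level
```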
Fig. 3. Computation of DTAK. (a) Alignment examples for two 1D
sequences. (b) Frame kernel matrix (K). (c) Cumulative kernel matrix
(U). (d) Normalized correspondence matrix (W).
Fig. 4. A synthetic example of temporal clustering. (a) An example of 1D time series. (b) Temporal clustering of the 1D time series. (c) Sample-
segment indicator matrix (H). (d) Normalized correspondence matrix (W). (e) Segment-cluster indicator matrix (G). (f) Frame kernel matrix (K).
(g) Segment kernel matrix (T). (h) 2D embedding computed using the top two eigenvectors of T.

3.5 Matrix Formulation for ACA
A more enlightening formulation of ACA is the matrix form. Suppose that a sequence $\mathbf{X} \in \mathbb{R}^{d \times n}$ of length $n$ has been segmented into $m$ segments, $\{\mathbf{Y}_i \in \mathbb{R}^{d \times n_i}\}_{i=1}^{m}$, with $\sum_{i=1}^{m} n_i = n$. A key insight to understanding ACA is that we can define two kernel matrices: $\mathbf{T} = [\tau_{ij}]_{m \times m} \in \mathbb{R}^{m \times m}$, the segment kernel matrix (kernel between segments), and $\mathbf{K} = [\kappa_{ij}]_{n \times n} \in \mathbb{R}^{n \times n}$, the frame kernel matrix (kernel between frames). Each element of the segment kernel matrix ($\mathbf{T}$), $\tau_{ij} = \tau(\mathbf{Y}_i, \mathbf{Y}_j) = \mathrm{tr}(\mathbf{K}_{ij}^T \mathbf{W}_{ij})$, is the DTAK between the $i$th and $j$th segments ($\mathbf{Y}_i$ and $\mathbf{Y}_j$) computed using (6), where $\mathbf{K}_{ij} \in \mathbb{R}^{n_i \times n_j}$ and $\mathbf{W}_{ij} \in \mathbb{R}^{n_i \times n_j}$ are the frame kernel matrix and the normalized correspondence matrix between segments $\mathbf{Y}_i$ and $\mathbf{Y}_j$, respectively.

After some linear algebra, it can be shown that $\mathbf{T} \in \mathbb{R}^{m \times m}$ can be expressed as the product of a global correspondence matrix ($\mathbf{W}$), a global frame kernel matrix ($\mathbf{K}$), and a sample-segment indicator matrix ($\mathbf{H}$) as follows:

$$\mathbf{T} = \big[\mathrm{tr}\big(\mathbf{K}_{ij}^T \mathbf{W}_{ij}\big)\big]_{m \times m} = \mathbf{H}\,(\mathbf{K} \circ \mathbf{W})\,\mathbf{H}^T, \quad (8)$$

where $\mathbf{W} = [\mathbf{W}_{ij}]_{m \times m} \in \mathbb{R}^{n \times n}$ and $\mathbf{K} = [\mathbf{K}_{ij}]_{m \times m} \in \mathbb{R}^{n \times n}$ are obtained by arranging the $m \times m$ blocks $\mathbf{W}_{ij}$ and $\mathbf{K}_{ij}$, respectively. $\mathbf{H} \in \{0,1\}^{m \times n}$ is a matrix that encodes the correspondence between samples and segments such that $h_{ij} = 1$ if the $j$th sample belongs to the $i$th segment. See Fig. 4 for an example of these matrices.
Unfortunately, DTAK is not a strictly positive definite kernel [47], [48]. Thus, we add a scaled identity matrix to $\mathbf{K}$, that is, $\mathbf{K} \leftarrow \mathbf{K} + \lambda\mathbf{I}_n$, where $\lambda$ is chosen to be the absolute value of the smallest eigenvalue of $\mathbf{T}$ if $\mathbf{T}$ has negative eigenvalues.$^2$
After substituting the optimal value of $\mathbf{Z} = [\psi(\mathbf{Y}_1), \ldots, \psi(\mathbf{Y}_m)]\,\mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}$ in (7), a more understandable form of $J_{aca}$ results:

$$J_{aca}(\mathbf{G}, \mathbf{H}) = \mathrm{tr}\big((\mathbf{I}_m - \mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G})\,\mathbf{T}\big) = \mathrm{tr}\big((\mathbf{I}_m - \mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G})\,\mathbf{H}(\mathbf{K} \circ \mathbf{W})\mathbf{H}^T\big) = \mathrm{tr}\big((\mathbf{L} \circ \mathbf{W})\,\mathbf{K}\big),$$
$$\text{where } \mathbf{L} = \mathbf{I}_n - \mathbf{H}^T\mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G}\mathbf{H}. \quad (10)$$
Recall that $\mathbf{H}$ depends on $\mathbf{s}$ and that $\mathbf{G} \in \{0,1\}^{k \times m}$ is the segment-cluster indicator matrix such that $g_{ij} = 1$ if the $j$th segment belongs to the $i$th temporal cluster. See Fig. 4 for an example of temporal clustering and the role of the matrices $\mathbf{K}, \mathbf{W}, \mathbf{H}$. Consider the special case where each segment is one frame, i.e., $m = n$ and $\mathbf{H} = \mathbf{I}_n$. The segment kernel matrix then becomes simply the frame kernel matrix, i.e., $\tau_{ij} = \kappa_{ij}$ and $\mathbf{W} = \mathbf{1}_{n \times n}$. In this case, the energy function of ACA is equivalent to the function minimized by KKM [41], [50], [51]:

$$J_{kkm}(\mathbf{G}) = \mathrm{tr}(\mathbf{L}\mathbf{K}), \quad \text{where } \mathbf{L} = \mathbf{I}_n - \mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G}. \quad (11)$$

KKM finds the binary matrix $\mathbf{G} \in \{0,1\}^{k \times n}$ (i.e., the indicator matrix between samples and clusters) that makes $\mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G}$ as correlated as possible with the sample kernel matrix $\mathbf{K}$. On the other hand, ACA has two indicator matrices: $\mathbf{G}$, the segment-cluster indicator matrix, which solves for the correspondence between segments and clusters, and $\mathbf{H}$, the sample-segment indicator matrix, which encodes the correspondence between samples and segments. ACA finds the two binary matrices $\mathbf{G}$ and $\mathbf{H}$ that, after applying DTAK between all pairwise segments, make the matrix $(\mathbf{H}^T\mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G}\mathbf{H}) \circ \mathbf{W}$ as correlated as possible with the frame kernel $\mathbf{K}$. Fig. 4 illustrates the role of the different matrices in a synthetic temporal clustering example. Notice that once the matrices $\mathbf{K}, \mathbf{W}, \mathbf{H}$ are computed by ACA, the eigenvectors of the matrix $\mathbf{T}$ (8) provide a natural embedding for visualizing the seven segments of the time series in a low-dimensional space (see Fig. 4h).
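Putting (8) and (10) together, the following sketch builds H from the boundary vector s, assembles T block by block (reusing the hypothetical dtak_with_path helper sketched in Section 3.3 and a precomputed frame kernel K), evaluates J_aca, and embeds the segments with the leading eigenvectors of T as in Fig. 4h. The function names and the eigenvalue scaling of the embedding are our own choices, not the authors' implementation.

```python
import numpy as np

def indicator_H(s, n):
    """Sample-segment indicator H in {0,1}^{m x n}: h_ij = 1 iff frame j lies in segment
    i = [s_i, s_{i+1}). Boundaries s are 0-indexed here, with s[0] = 0 and s[m] = n."""
    m = len(s) - 1
    H = np.zeros((m, n))
    for i in range(m):
        H[i, s[i]:s[i + 1]] = 1
    return H

def segment_kernel_T(K, s):
    """Segment kernel matrix T_ij = DTAK(Y_i, Y_j) of eq. (8), built block by block
    from the frame kernel K; dtak_with_path is the helper sketched in Section 3.3."""
    m = len(s) - 1
    T = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            Kij = K[s[i]:s[i + 1], s[j]:s[j + 1]]
            T[i, j], _ = dtak_with_path(Kij)
    return T

def j_aca(T, G):
    """J_aca = trace((I_m - G^T (G G^T)^{-1} G) T), the first equality in (10)."""
    m = T.shape[0]
    P = G.T @ np.linalg.pinv(G @ G.T) @ G
    return np.trace((np.eye(m) - P) @ T)

def embed_segments(T, dim=2):
    """Embed the m segments with the top eigenvectors of T (cf. Fig. 4h); scaling each
    eigenvector by the square root of its eigenvalue is one common convention."""
    vals, vecs = np.linalg.eigh(T)
    top = np.argsort(vals)[::-1][:dim]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
```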
3.6 Coordinate-Descent Optimization for ACA
In a previous section, we have formulated the problem of
temporal clustering as an integer programming problem (7)
over two variables (G and s). Recall that G encodes the
segment-cluster correspondence and s (or, equivalently, H)
encodes the sample-segment correspondence. Optimizing
over G and s is NP-hard. This section proposes an efficient
coordinate-descent scheme that alternates between comput-
ing s using dynamic programming and G with a winner-
take-all strategy [52].
We solve the following subproblem at each iteration:

$$\mathbf{G}, \mathbf{s} = \arg\min_{\mathbf{G}, \mathbf{s}} J_{aca}(\mathbf{G}, \mathbf{s}) = \arg\min_{\mathbf{G}, \mathbf{s}} \sum_{c=1}^{k} \sum_{i=1}^{m} g_{ci}\, \mathrm{dist}^2(\mathbf{Y}_i, \dot{\mathbf{z}}_c),$$

where $\dot{\mathbf{z}}_c$ is the cluster mean implicitly computed from the segmentation $(\dot{\mathbf{G}}, \dot{\mathbf{s}})$ obtained in the previous step. Given a sequence $\mathbf{X}$ of length $n$, however, the number of all possible segmentations is exponential, i.e., $O(2^n)$, which makes a brute-force search for $\mathbf{s}$ infeasible. We use a DP-based algorithm to exhaustively examine all possible segmentations in polynomial time. Observe that the matrix $\mathbf{H}$ (see Fig. 4c) has a monotonic structure and can be optimally computed using DP.

Recall that we can rewrite (7) as a sum of distances:

$$J_{aca}(\mathbf{G}, \mathbf{s}) = \sum_{c=1}^{k} \sum_{i=1}^{m} g_{ci}\, \mathrm{dist}^2(\mathbf{Y}_i, \mathbf{z}_c).$$

To further leverage the relationship between $\mathbf{G}$ and $\mathbf{s}$, we introduce an auxiliary function, $J(\cdot): [1, n] \to \mathbb{R}$,

$$J(v) = \min_{\mathbf{G}, \mathbf{s}} J_{aca}(\mathbf{G}, \mathbf{s})\big|_{\mathbf{X}_{[1,v]}}, \quad (12)$$

to relate the minimum energy directly to the tail position $v$ of the subsequence $\mathbf{X}_{[1,v]} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_v]$. We can further verify that $J(\cdot)$ satisfies the principle of optimality [53], i.e.,

$$J(v) = \min_{1 < i \le v} \Big( J(i-1) + \min_{\mathbf{G}, \mathbf{s}} J_{aca}(\mathbf{G}, \mathbf{s})\big|_{\mathbf{X}_{[i,v]}} \Big), \quad (13)$$
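A sketch of the forward recursion (13), with the segment-length constraint n_max of (7) and the backtracking needed to recover s, is given below. The per-segment cost min_c dist²(X_{[i,v]}, ż_c) is left as a user-supplied function, since its efficient DTAK-based evaluation is developed in the remainder of this section; the indexing and backpointer scheme are our own scaffolding.

```python
import numpy as np

def dp_segment(n, n_max, seg_cost):
    """Forward DP for eq. (13): J(v) = min_i J(i-1) + seg_cost(i, v), with segment
    length v - i + 1 <= n_max, where seg_cost(i, v) stands for
    min_c dist^2(X_[i,v], z_c) under the current cluster means.
    Frames are 1-indexed as in the text; returns the boundary vector s and J(n)."""
    J = np.full(n + 1, np.inf)     # J[v] = best energy for X_[1,v]; J[0] = 0 (empty prefix)
    J[0] = 0.0
    back = np.zeros(n + 1, dtype=int)
    for v in range(1, n + 1):
        for i in range(max(1, v - n_max + 1), v + 1):
            cost = J[i - 1] + seg_cost(i, v)
            if cost < J[v]:
                J[v], back[v] = cost, i
    # Backtrack the boundaries s = [s_1, ..., s_{m+1}] with s_1 = 1 and s_{m+1} = n + 1.
    s = [n + 1]
    v = n
    while v > 0:
        s.append(back[v])
        v = back[v] - 1
    return list(reversed(s)), J[n]
```

The forward pass evaluates the segment cost O(n · n_max) times, so the overall efficiency hinges on how cheaply that DTAK-based term can be computed.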
2. It can be proven that adding $\lambda\mathbf{I}_n$ to $\mathbf{K}$ has the same effect as adding $\lambda\mathbf{I}_m$ to $\mathbf{T}$, that is,

$$\mathbf{H}\big((\mathbf{K} + \lambda\mathbf{I}_n) \circ \mathbf{W}\big)\mathbf{H}^T = \mathbf{T} + \lambda\mathbf{H}\bar{\mathbf{W}}\mathbf{H}^T = \mathbf{T} + \lambda\mathbf{I}_m, \quad (9)$$

where $\bar{\mathbf{W}} = \mathbf{I}_n \circ \mathbf{W} \in \mathbb{R}^{n \times n}$ is a diagonal matrix. Notice that the diagonal of $\bar{\mathbf{W}}$ is composed of the $m$ diagonal blocks $\mathbf{W}_{ii}$ and that $\mathbf{W}_{ii} = \frac{1}{n_i}\mathbf{I}_{n_i}$ according to (5). Therefore, we can conclude that $\bar{\mathbf{W}} = \mathrm{diag}(\frac{1}{n_1}\mathbf{1}_{n_1}, \ldots, \frac{1}{n_m}\mathbf{1}_{n_m})$. In addition, because $\mathbf{H} = [\mathbf{h}^1, \ldots, \mathbf{h}^m]^T \in \{0,1\}^{m \times n}$ is binary and its rows are orthogonal, we can conclude that $\mathbf{h}^{iT}\mathbf{h}^i = n_i$ and $\mathbf{h}^{iT}\mathbf{h}^j = 0$ for $i \ne j$. Combining these results for $\bar{\mathbf{W}}$ and $\mathbf{H}$, we prove that $\mathbf{H}\bar{\mathbf{W}}\mathbf{H}^T = \mathbf{I}_m$.
