
Action Recognition using Exemplar-based Embedding
Daniel Weinland
Edmond Boyer
LJK - INRIA Rhône-Alpes, France
{weinland, eboyer}@inrialpes.fr
Abstract
In this paper, we address the problem of representing human actions using visual cues for the purpose of learning and recognition. Traditional approaches model actions as space-time representations which explicitly or implicitly encode the dynamics of an action through temporal dependencies. In contrast, we propose a new compact and efficient representation which does not account for such dependencies. Instead, motion sequences are represented with respect to a set of discriminative static key-pose exemplars and without modeling any temporal ordering. The interest is a time-invariant representation that drastically simplifies learning and recognition by removing time-related information such as the speed or length of an action. The proposed representation is equivalent to embedding actions into a space defined by distances to key-pose exemplars. We show how to build such embedding spaces of low dimension by identifying a vocabulary of highly discriminative exemplars using a forward selection. To test our representation, we have used a publicly available dataset which demonstrates that our method can precisely recognize actions, even with cluttered and non-segmented sequences.
1. Introduction
Action recognition is of central importance in computer vision, with many applications in visual surveillance, human computer interaction and entertainment, among others. A challenging issue in this field originates from the diversity of information which describes an action. This includes purely visual cues, e.g. shape and appearance, as well as dynamic cues, e.g. space-time trajectories and motion fields. Such diversity raises the question of the relative importance of these sources and also to what degree they compensate for each other.
In a seminal work, Johansson [15] demonstrated through psychophysical experiments that humans can recognize actions merely from the motion of a few light points attached to the human body. Following this idea, several works, e.g. [1, 11], attempted to recognize actions using trajectories of markers at specific locations on the human body. While successful in constrained environments, these approaches do not, however, extend to general scenarios.
(D. Weinland is supported by a grant from the European Community under the EST Marie-Curie Project Visitor.)
Besides, static visual information also gives very strong cues about activities. In particular, humans are able to recognize many actions from a single image (see for instance Figure 1). Consequently, a significant effort has been put into representations based on visual cues. Two main directions have been followed. Implicit representations simultaneously model in space and time, with space-time volumes, e.g. [3, 25], or by using space-time features, e.g. [6, 18, 19]. Explicit representations equip traditional temporal models, such as hidden Markov models (HMMs), with powerful image matching abilities based on exemplar representations, e.g. [21, 7, 8, 24].
In this work we take a different strategy and represent actions using static visual information without temporal dependencies. Our results show that such representations can effectively model complex actions and yield recognition rates that equal or exceed those of the current state-of-the-art approaches, with the virtues of simplicity and efficiency.
Our approach builds on recent works on exemplar-based embedding methods [2, 12]. In these approaches, complex distances between signals are approximated in a Euclidean embedding space that is spanned by a set of distances to exemplar measures. Our representation is grounded on such an embedding, focusing only on the visual components of an action. The main contribution is a time-invariant representation that does not require a time warping step and is insensitive to variations in the speed and length of an action. To the best of our knowledge, no previous work has attempted to use such an embedding-based representation to model actions.
In the paper, we will show how to select exemplars for such a representation using a forward feature selection technique [16]. In particular, we will demonstrate how complex actions can be described in terms of small but highly discriminative exemplar sets. Experiments on the well-known Weizmann-dataset [3] confirm that action recognition can be achieved without considering temporal dependencies.

Figure 1. Sample images from the Weizmann-dataset [3]. A human observer can easily identify many, if not all, actions from a single image. The interested reader may recognize the following actions: bend, jumping-jack, jump-in-place, jump-forward, run, gallop-sideways, walk, wave one hand, wave two hands, and jump-forward-one-leg. Note that the displayed images have been automatically identified by our method as discriminative exemplars.
Another important feature of our approach is that it can be used with advanced image matching techniques, such as the Chamfer distance [10], for visual measurements. In contrast to the classical use of dimensionality reduction with silhouette representations, e.g. [23], such a method can be used in scenarios where no background subtraction is available. In a second experiment we will demonstrate that, even on cluttered, non-segmented sequences, our method yields precise recognition results.
The paper is organized as follows: in Section 2 we review related work. In Section 3 we present the embedding representation. In Section 4 we show how to compute a small but discriminative exemplar set. In Section 5 we evaluate our approach on a publicly available dataset before concluding and discussing issues in Section 6.
2. Related Work
Actions can be recognized using the occurrences of key-frames. In the work of Carlsson and Sullivan [4], class-representative silhouettes are matched against video frames to recognize forehand and backhand strokes in tennis recordings. In a similar way, our approach uses a set of representative silhouette-like models, i.e. the exemplars, but does not assume a deterministic framework as in [4], where exemplars are exclusively linked to classes and decisions are based on single-frame detections.
Other exemplar-based approaches, e.g. [7, 8, 21, 24], learn HMMs with observation probabilities based on matching distances to exemplars. In all these models, dynamics are explicitly modeled through Markovian transitions over discrete state variables, whereas distances are mapped onto probabilities, which can involve additional difficulties [17].
Dedeoglu et al. [5] propose a real-time system for action recognition based on key-poses and histograms. Histograms introduce some degree of temporal invariance, although temporal order remains partially constrained with such a representation. Moreover, the conversion of exemplar distances into normalized distributions can cause additional loss.
Exemplar-based embedding methods have already been proposed, e.g. [2, 12]. In [2], Athitsos and Sclaroff present an approach for hand pose estimation based on Lipschitz embeddings. Guo et al. [12] use an exemplar-based embedding approach to match images of cars over different viewpoints. However, no attempt has been made to apply such exemplar-based embedding approaches to action recognition.
Wang and Suter [23] use kernel-PCA to derive a low-dimensional representation of silhouettes, and factorial conditional random fields to model dynamics. While its evaluation results are similar to those of our method, such an approach is computationally expensive and, moreover, only practical in background-subtracted scenes.
Interestingly, the S3-C3 stage of the biologically motivated system by Jhuang et al. [14] also shares some similarities with our embedding representation. However, the two representations are derived in very different contexts.
3. Action Modeling
Our approach proceeds as illustrated in Figure 2. An action sequence is matched against a set of n exemplars. For each exemplar, the minimum matching distance to any of the frames of the sequence is determined, and the resulting set of distances forms a vector D* in the embedding space R^n. The intuition we follow is that similar sequences will yield similar proximities to the discriminative exemplars, and hence that their point representations in R^n should be close. We thus model actions in R^n, where both learning and recognition are performed. This is detailed in the following sections.
3.1. Exemplar-based Embedding
Our aim is to classify an action sequence Y = y_1, ..., y_t over time with respect to the occurrence of known representative exemplars X = {x_1, ..., x_n}, e.g. silhouettes. The exemplar selection is presented in a further section (see Section 4), and we assume here that they are given.
We start by computing, for each exemplar x_i, the minimum distance to the frames of the sequence:

    d_i(Y) = min_j d(x_i, y_j),    (1)

where d is a distance function between the primitives considered, as described in Section 3.3.
At this stage, distances could be thresholded and converted into binary detections, in the sense of a key-frame classifier [4]. This, however, requires thresholds to be chosen and furthermore does not allow uncertainties to be modeled.

Figure 2. Overview of the embedding method: Two action sequences Y (walk) and Ý (jump forward on one leg) are matched against a set of silhouette exemplars x_i. For each exemplar, the best matching frame in the sequence is identified (exemplar displayed on top of the corresponding frame; light colors correspond to high matching distances, dark colors to low matching distances). The resulting matching distances d_i form the vector D*, which is interpreted as an embedding of the sequences into a low-dimensional space R^n. The final classifier is learned over R^n, where each point represents a complete sequence.
Probabilistic exemplar-based approaches [21] do model such uncertainties by converting distances into probabilities, but, as mentioned earlier, at the price of complex computations for normalization constants. We instead simply work on the vectors that result from concatenating all the minimum distances,

    D*(Y) = (d_1(Y), ..., d_n(Y)) ∈ R^n,    (2)

without any probabilistic treatment. Note that our representation is similar in principle to the embedding described in [2, 12] in a static context. We extend it to temporal sequences.
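To make the construction concrete, the following is a minimal sketch of the embedding in equations (1) and (2); it is not the authors' implementation (which was written in MATLAB). Frames and exemplars are assumed to be already vectorized, e.g. flattened binary silhouettes, and the default distance is the squared Euclidean distance of Section 3.3; the function name `embed_sequence` is illustrative.

```python
import numpy as np

def embed_sequence(frames, exemplars, dist=None):
    """Embed a sequence Y into R^n via minimum distances to the exemplars.

    frames:    array of shape (t, p), one row per frame (e.g. a flattened binary silhouette)
    exemplars: array of shape (n, p), one row per exemplar
    dist:      frame-to-exemplar distance d(x_i, y_j); defaults to squared Euclidean
    """
    if dist is None:
        dist = lambda x, y: float(np.sum((x - y) ** 2))
    # equation (1): d_i(Y) = min_j d(x_i, y_j)
    # equation (2): D*(Y) = (d_1(Y), ..., d_n(Y)) in R^n
    return np.array([min(dist(x, y) for y in frames) for x in exemplars])
```

Whatever its length or speed, a sequence is thus reduced to a single point in R^n, on which the classifier of Section 3.2 operates.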
3.2. Classifier
In the embedding space R^n, classification of time sequences reduces to a simple operation, which is to label the vectors D*(Y). A major advantage over traditional approaches is that such vectors encode complete sequences without the need for time normalizations or alignments. These vectors are points in R^n that are labelled using a standard Bayes classifier. Each class c ∈ 1...C is represented through a single Gaussian distribution p(D*|c) = N(D*|µ_c, Σ_c), which we found adequate in experiments to model all important dependencies between exemplars. Assignments are determined through maximum a posteriori estimation:

    g(D*) = argmax_c p(D*|c) p(c),    (3)

with p(c) being the prior of class c which, without loss of generality, is assumed to be uniform.
Note that when estimating the covariance Σ_c, and depending on the dimension n, it is often the case that insufficient training data is available, and consequently the estimate may be non-invertible. We hence work with a regularized covariance of the form Σ̂ = Σ + εI, with I being the identity matrix and ε a small value.
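A minimal sketch of this classifier over embedding vectors, assuming uniform class priors and the regularized covariance Σ + εI described above (the function names `fit_gaussian_bayes` and `classify` are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_bayes(D_train, labels, eps=1e-3):
    """Fit one Gaussian N(mu_c, Sigma_c + eps*I) per class over embedding vectors."""
    params = {}
    for c in np.unique(labels):
        Dc = D_train[labels == c]
        mu = Dc.mean(axis=0)
        Sigma = np.cov(Dc, rowvar=False) + eps * np.eye(D_train.shape[1])  # regularized covariance
        params[c] = (mu, Sigma)
    return params

def classify(D, params):
    """Equation (3) with uniform class priors: g(D*) = argmax_c p(D*|c)."""
    scores = {c: multivariate_normal.logpdf(D, mean=mu, cov=Sigma)
              for c, (mu, Sigma) in params.items()}
    return max(scores, key=scores.get)
```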
3.3. Image Representation and Distance Functions
Actions are represented as vectors of distances from exemplars to the frames in the action's sequence. Such distances could be of several types, depending on the available information in the images, e.g. silhouettes or edges. In the following, we assume that silhouettes are available for the exemplars, which is a reasonable assumption in the learning phase, and we consider two situations for recognition. First, silhouettes, obtained for instance with background subtraction, are available; second, only edges can be considered.
Silhouette-to-Silhouette Matching. In this scenario we assume that background-subtracted sequences are available. Consequently, x and y are both represented through silhouettes. While difficult to obtain in many practical contexts, silhouettes, when available, provide rich and strong cues. Consequently, they can be matched with a standard distance function, and we choose the squared Euclidean distance d(x, y) = |x − y|^2, which is computed between the vector representations of the binary silhouette images. Hence, the distance is simply the number of pixels with different values in both images.
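For binary silhouette images this distance reduces to a count of disagreeing pixels; a small illustrative sketch of that interpretation:

```python
import numpy as np

def silhouette_distance(x, y):
    """Squared Euclidean distance between two binary silhouette images.

    For {0,1}-valued images this is exactly the number of pixels
    at which the two silhouettes disagree.
    """
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    return float(np.sum((x - y) ** 2))
```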
Silhouette-to-Edge Matching. In a more realistic scenario, background subtraction will not be possible due to a moving or changing background as well as changing lighting, among other reasons. In that case, more advanced distances dealing with imperfect image segmentations must be considered. In our experiments, we use such a scenario, where edge observations y, instead of silhouettes, are taken into account. In such observations, edges are usually spurious or missing. As mentioned earlier, we assume that exemplars are represented through edge templates, computed using background subtraction in a learning phase. The distance we consider is then the Chamfer distance [10], which measures the closest distance from each edge point of the observation x to any edge point in the exemplar y,

    d(x, y) = (1 / |x|) Σ_{f ∈ x} d_y(f),    (4)

where |x| is the number of edge points in x and d_y(f) is the distance between edge point f and the closest edge point in y. An efficient way to compute the Chamfer distance is by correlating the distance-transformed observation with the exemplar silhouette.
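Equation (4) can be evaluated efficiently via a distance transform, as the last sentence suggests. A minimal sketch, assuming binary edge maps of equal size and using SciPy's Euclidean distance transform in place of a chamfer approximation; here the distance transform of the observation is averaged over the exemplar template's edge pixels, matching the correlation formulation (the roles of observation and template can be swapped depending on convention):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_distance(obs_edges, template_edges):
    """Mean distance from each template edge pixel to the nearest observation edge pixel.

    obs_edges, template_edges: binary 2D arrays of identical size (True/1 at edge pixels).
    """
    # distance transform of the observation: at every pixel, the distance
    # to the nearest observation edge pixel
    dt = distance_transform_edt(~obs_edges.astype(bool))
    pts = template_edges.astype(bool)
    if not pts.any():
        return np.inf
    # averaging dt over the template's edge pixels corresponds to correlating
    # the distance-transformed observation with the edge template
    return float(dt[pts].mean())
```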
4. Key-Pose Selection
In the previous section, we assumed that the exemplars, a set of discriminative primitives, are known. We explain in this section how to obtain them using forward feature selection. In a classical way, such a selection has to deal with two conflicting objectives. First, the set of exemplars must be small, to avoid learning and classification in high dimensions (curse of dimensionality) and to allow for fast computations. Second, the set must contain enough elements to account for variations within and between classes. We will use the wrapper technique for feature selection introduced in [16], but other possibilities will be discussed in Section 4.2.
Several criteria exist to measure and optimize the quality of a feature set (see e.g. [13]). The wrapper approach can be seen as a direct and straightforward solution to this problem. The criterion optimized is the validation performance of the considered classifier, which is itself used as a black box by the wrapper while performing a greedy search over the feature space. There are different search strategies for the wrapper; we use forward selection, which we recently applied successfully in a similar setting [24].
4.1. Forward Selection
Forward selection is a bottom-up search procedure that adds new exemplars to the final exemplar set one at a time until the final set size is reached. Candidate exemplars are all frames in the training set, or a sub-sampled set of these frames. In each step of the selection, classifiers for each candidate exemplar set are learned and evaluated. Consequently, in the first iteration a classifier for each single candidate exemplar is learned, the exemplar with the best evaluation performance is added to the final exemplar set, and the learning and evaluation step is repeated using pairs of exemplars (containing the one already selected), triples, quadruples, etc. The algorithm is given below (see Algorithm 1).
Algorithm 1 Forward Selection
Input: training sequences Y = {Y_1, ..., Y_m}, validation sequences Ŷ = {Y_1, ..., Y_m̂}
1. let candidate exemplar set X̄ = {y : y ∈ Y}
2. let final exemplar set X = ∅
3. while the size of X is smaller than n:
   (a) for each y ∈ X̄:
       i. set X′ ← {y} ∪ X
       ii. train classifier g on Y using the exemplar set X′ and record the validation performance on Ŷ
   (b) set X ← {y*} ∪ X, where y* corresponds to the best validation performance obtained in step 3(a). If multiple y* with the same performance exist, randomly pick one.
   (c) set X̄ ← X̄ \ {y*}
4. return X
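A minimal sketch of Algorithm 1, reusing the illustrative `embed_sequence`, `fit_gaussian_bayes`, and `classify` functions from the earlier sketches. Accuracy on the validation sequences serves as the wrapper criterion; ties are broken by order here rather than randomly as in the algorithm.

```python
import numpy as np

def forward_selection(train_seqs, train_labels, val_seqs, val_labels,
                      candidate_frames, n_exemplars, dist=None):
    """Greedy wrapper selection of a small, discriminative exemplar set."""
    candidates = list(candidate_frames)   # X_bar: candidate exemplars (individual frames)
    selected = []                         # X: final exemplar set
    train_labels = np.asarray(train_labels)
    while len(selected) < n_exemplars and candidates:
        best_acc, best_idx = -1.0, None
        for idx, cand in enumerate(candidates):
            exemplars = np.array(selected + [cand])          # X' = {y} U X
            D_train = np.array([embed_sequence(s, exemplars, dist) for s in train_seqs])
            D_val = np.array([embed_sequence(s, exemplars, dist) for s in val_seqs])
            params = fit_gaussian_bayes(D_train, train_labels)
            preds = [classify(d, params) for d in D_val]
            acc = float(np.mean([p == l for p, l in zip(preds, val_labels)]))
            if acc > best_acc:                               # keep the best-performing candidate
                best_acc, best_idx = acc, idx
        selected.append(candidates.pop(best_idx))            # add y* to X, remove it from X_bar
    return np.array(selected)
```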
4.2. Selection Discussion
Many techniques have been used in the literature to select exemplars and vocabulary sets in related approaches. For instance, several methods sub-sample or cluster the space of exemplars, e.g. [2, 21]. While generally applicable in our context, such methods nevertheless require very large sets of exemplars in order to reach the performance of a smaller set that has been specifically selected with respect to an optimization criterion. Moreover, as we observed in [24], a clustering can miss important discriminative exemplars, e.g. clusters may discriminate body shapes instead of actions.

Figure 3. Sample sequences and corresponding edge images. (Top
to bottom) jumping-jack, walk, wave two hands.
Another solution is to select features based on advanced classification techniques such as support vector machines [22] or AdaBoost [9]. Unfortunately, support vector machines are mainly designed for binary classification and, though extensions to multiple classes exist, they hardly extract a single feature set for all classes. On the other hand, AdaBoost [9] can be extended to multiple classes and is known for its ability to search over large numbers of features. We experimented with AdaBoost using weak classifiers based on single exemplars and pairs of exemplars, but performances were less consistent than with the forward selection.
Wrapper methods, such as forward selection, are known to be particularly robust against over-fitting [13], but are sometimes criticized for being slow due to the repetitive learning and evaluation cycles. In our case, we need approximately n × m learning and validation cycles to select n features out of a candidate set of size m. With a non-optimized implementation in MATLAB, selecting approximately 50 features out of a few hundred takes around 5 minutes. This is a very reasonable computation time considering that this step is only required during the learning phase and that a compact exemplar set benefits all recognition phases.
5. Experiments
We have evaluated our approach on the Weizmann-dataset [3] (see Figures 1 and 3), which has recently been used by several authors [1, 14, 18, 20, 23]. It contains 10 actions: bend (bend), jumping-jack (jack), jump-in-place (pjump), jump-forward (jump), run (run), gallop-sideways (side), jump-forward-one-leg (skip), walk (walk), wave one hand (wave1), and wave two hands (wave2), performed by 9 actors. Silhouettes extracted from backgrounds and the original image sequences are provided.
All recognition rates were computed with leave-one-out cross-validation. Details are as follows: 8 of the 9 actors in the database are used to train the classifier and select the exemplars, and the 9th is used for the evaluation. This is repeated for all 9 actors and the rates are averaged. For the exemplar selection, we further need to divide the 8 training actors into training and validation sets. We do this as well with leave-one-out cross-validation, using 7 training actors and the 8th as the validation set, then iterating over all possibilities. Exemplars are always selected from the 8 training actors, and never from the 9th actor, which is used for the evaluation. Also note that, due to the small size of the training set, the validation rate can easily reach 100% if too many exemplars are considered. In this case, we randomly remove exemplars during the validation step to reduce the validation rate and to allow new exemplars to be added. For testing, we nevertheless use all selected exemplars.
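A simplified sketch of this leave-one-actor-out protocol, reusing the illustrative functions from the earlier sketches. For brevity it uses a single inner validation fold and trains the final classifier on the remaining actors, whereas the paper iterates over all inner folds and selects exemplars from all 8 training actors; the data layout (sequences and labels grouped per actor) is assumed.

```python
import numpy as np

def leave_one_actor_out(seqs_by_actor, labels_by_actor, n_exemplars, dist=None):
    """Average test accuracy over folds, holding one actor out at a time."""
    actors = sorted(seqs_by_actor)
    accuracies = []
    for test_actor in actors:
        train_actors = [a for a in actors if a != test_actor]
        # single inner split for brevity: last remaining actor as validation set
        val_actor, fit_actors = train_actors[-1], train_actors[:-1]
        train_seqs = [s for a in fit_actors for s in seqs_by_actor[a]]
        train_labels = [l for a in fit_actors for l in labels_by_actor[a]]
        # candidate exemplars: frames sub-sampled by a factor of 1/20 (Section 5.1)
        candidates = [f for s in train_seqs for f in s[::20]]
        exemplars = forward_selection(train_seqs, train_labels,
                                      seqs_by_actor[val_actor], labels_by_actor[val_actor],
                                      candidates, n_exemplars, dist)
        D_train = np.array([embed_sequence(s, exemplars, dist) for s in train_seqs])
        params = fit_gaussian_bayes(D_train, np.array(train_labels))
        preds = [classify(embed_sequence(s, exemplars, dist), params)
                 for s in seqs_by_actor[test_actor]]
        accuracies.append(np.mean([p == l for p, l in
                                   zip(preds, labels_by_actor[test_actor])]))
    return float(np.mean(accuracies))
```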
5.1. Evaluation on Segmented Sequences
In these experiments, the background-subtracted silhouettes which are provided with the Weizmann-dataset were used to evaluate our method. For the exemplar selection, we first uniformly sub-sample the sequences by a factor of 1/20 and perform the selection on the remaining set of approximately 300 candidate frames. When we use all 300 frames as exemplars, the recognition rate of our method is 100%.
To reduce the number of exemplars, we search over this set via forward selection. In Figure 4, we show a sample exemplar set as returned by the selection method. Figure 4(a) shows the average validation rates per action, which were computed on the training set during the selection. Note that even though the overall validation rate reaches 100% for 15 exemplars, not all classes are explicitly represented through an exemplar, indicating that exemplars are shared between actions. The recognition rate on the test set with respect to the number of exemplars is shown in Figure 4(b). Since the forward selection includes one random step, in the case where several exemplars present the same validation rate, we repeat the experiment 10 times with all actors and average over the results. In Figure 4(c), we show recognition rates for the individual classes. Note in particular the actions jump-forward and jump-forward-one-leg, which are difficult to classify because they are easily confused.
In summary, our approach can reach recognition rates of up to 100% with approximately 120 exemplars. Moreover, with very small exemplar sets (e.g. around 20 exemplars), the average recognition rate on a dataset of 10 actions and 9 actors is already higher than 90% and continuously increases with additional exemplars (e.g. 97.7% for 50 exemplars). In comparison, the space-time volume ap-
References
V. N. Vapnik. Statistical Learning Theory.
Y. Freund and R. E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting.
I. Guyon and A. Elisseeff. An Introduction to Variable and Feature Selection.
R. Kohavi and G. H. John. Wrappers for Feature Subset Selection.
G. Johansson. Visual Perception of Biological Motion and a Model for Its Analysis.