
Action Recognition using Exemplar-based Embedding
Daniel Weinland
Edmond Boyer
LJK - INRIA Rhône-Alpes, France
{weinland, eboyer}@inrialpes.fr
Abstract
In this paper, we address the problem of representing human actions using visual cues for the purpose of learning and recognition. Traditional approaches model actions as space-time representations which explicitly or implicitly encode the dynamics of an action through temporal dependencies. In contrast, we propose a new compact and efficient representation which does not account for such dependencies. Instead, motion sequences are represented with respect to a set of discriminative static key-pose exemplars and without modeling any temporal ordering. The interest is a time-invariant representation that drastically simplifies learning and recognition by removing time-related information such as the speed or length of an action. The proposed representation is equivalent to embedding actions into a space defined by distances to key-pose exemplars. We show how to build such embedding spaces of low dimension by identifying a vocabulary of highly discriminative exemplars using a forward selection. To test our representation, we have used a publicly available dataset which demonstrates that our method can precisely recognize actions, even with cluttered and non-segmented sequences.
1. Introduction
Action recognition is of central importance in computer vision, with many applications in visual surveillance, human computer interaction and entertainment, among others. A challenging issue in this field originates from the diversity of information which describes an action. This includes purely visual cues, e.g. shape and appearance, as well as dynamic cues, e.g. space-time trajectories and motion fields. Such diversity raises the question of the relative importance of these sources and also to what degree they compensate for each other.
In a seminal work, Johansson [15] demonstrated through psychophysical experiments that humans can recognize actions merely from the motion of a few light points attached to the human body. Following this idea, several works, e.g. [1, 11], attempted to recognize actions using trajectories of markers at specific locations on the human body. While successful in constrained environments, these approaches do not, however, extend to general scenarios.
(D. Weinland is supported by a grant from the European Community under the EST Marie-Curie Project Visitor.)
Besides, static visual information also gives very strong cues about activities. In particular, humans are able to recognize many actions from a single image (see for instance Figure 1). Consequently, a significant effort has been put into representations based on visual cues. Two main directions have been followed. Implicit representations simultaneously model in space and time, with space-time volumes, e.g. [3, 25], or by using space-time features, e.g. [6, 18, 19]. Explicit representations equip traditional temporal models, such as hidden Markov models (HMMs), with powerful image matching abilities based on exemplar representations, e.g. [21, 7, 8, 24].
In this work we take a different strategy and represent actions using static visual information without temporal dependencies. Our results show that such representations can effectively model complex actions and yield recognition rates that equal or exceed those of the current state-of-the-art approaches, with the virtues of simplicity and efficiency.
Our approach builds on recent works on exemplar-based embedding methods [2, 12]. In these approaches, complex distances between signals are approximated in a Euclidean embedding space that is spanned by a set of distances to exemplar measures. Our representation is grounded on such an embedding, focusing only on the visual components of an action. The main contribution is a time-invariant representation that does not require a time warping step and is insensitive to variations in the speed and length of an action. To the best of our knowledge, no previous work has attempted to use such an embedding-based representation to model actions.
In the paper, we will show how to select exemplars for such a representation using a forward feature selection technique [16]. In particular, we will demonstrate how complex actions can be described in terms of small but highly discriminative exemplar sets. Experiments on the well-known Weizmann-dataset [3] confirm that action recognition can be achieved without considering temporal dependencies.

Figure 1. Sample images from the Weizmann-dataset [3]. A human observer can easily identify many, if not all, actions from a single image. The interested reader may recognize the following actions: bend, jumping-jack, jump-in-place, jump-forward, run, gallop-sideways, walk, wave one hand, wave two hands, and jump-forward-one-leg. Note that the displayed images have been automatically identified by our method as discriminative exemplars.
Another important feature of our approach is that it can be used with advanced image matching techniques, such as the Chamfer distance [10], for visual measurements. In contrast to the classical use of dimensionality reduction with silhouette representations, e.g. [23], such a method can be used in scenarios where no background subtraction is available. In a second experiment we will demonstrate that, even on cluttered, non-segmented sequences, our method yields precise recognition results.
The paper is organized as follows: in Section 2 we review related work. In Section 3 we present the embedding representation. In Section 4 we show how to compute a small but discriminative exemplar set. In Section 5 we evaluate our approach on a publicly available dataset before concluding and discussing issues in Section 6.
2. Related Work
Actions can be recognized using the occurrences of key-frames. In the work of Carlsson and Sullivan [4], class-representative silhouettes are matched against video frames to recognize forehand and backhand strokes in tennis recordings. In a similar way, our approach uses a set of representative silhouette-like models, i.e. the exemplars, but does not assume a deterministic framework as in [4], where exemplars are exclusively linked to classes and decisions are based on single-frame detections.
Other exemplar-based approaches, e.g. [7, 8, 21, 24], learn HMMs with observation probabilities based on matching distances to exemplars. In all these models, dynamics are explicitly modeled through Markovian transitions over discrete state variables, whereas distances are mapped onto probabilities, which can involve additional difficulties [17].
Dedeoglu et al. [5] propose a real-time system for action recognition based on key-poses and histograms. Histograms introduce some degree of temporal invariance, although temporal order remains partially constrained with such a representation. Moreover, the conversion of exemplar distances into normalized distributions can cause additional loss.
Exemplar-based embedding methods have already been proposed, e.g. [2, 12]. In [2], Athitsos and Sclaroff present an approach for hand pose estimation based on Lipschitz embeddings. Guo et al. [12] use an exemplar-based embedding approach to match images of cars over different viewpoints. However, no attempt has been made to apply such exemplar-based embedding approaches to action recognition.
Wang and Suter [23] use kernel-PCA to derive a low-dimensional representation of silhouettes, and factorial conditional random fields to model dynamics. While its evaluation results are similar to those of our method, such an approach is computationally expensive and, moreover, only practical in background-subtracted scenes.
Interestingly, the S3-C3 stage of the biologically motivated system by Jhuang et al. [14] also shares some similarities with our embedding representation. However, the two representations are derived in very different contexts.
3. Action Modeling
Our approach proceeds as illustrated in Figure 2. An action sequence is matched against a set of n exemplars. For each exemplar, the minimum matching distance to any of the frames of the sequence is determined, and the resulting set of distances forms a vector D* in the embedding space R^n. The intuition we follow is that similar sequences will yield similar proximities to the discriminative exemplars, and hence that their point representations in R^n should be close. We thus model actions in R^n, where both learning and recognition are performed. This is detailed in the following sections.
3.1. Exemplar-based Embedding
Our aim is to classify an action sequence Y = y_1, ..., y_t over time with respect to the occurrence of known representative exemplars X = {x_1, ..., x_n}, e.g. silhouettes. The exemplar selection is presented in a further section (see Section 4), and we assume here that they are given.
We start by computing, for each exemplar x_i, the minimum distance to the frames of the sequence:

    d_i(Y) = min_j d(x_i, y_j),    (1)

where d is a distance function between the primitives considered, as described in Section 3.3.
At this stage, distances could be thresholded and converted into binary detections, in the sense of a key-frame classifier [4]. This, however, requires thresholds to be chosen and furthermore does not allow uncertainties to be modeled.

Figure 2. Overview of the embedding method: Two action sequences Y (walk) and Ý (jump forward on one leg) are matched against a set of silhouette exemplars x_i. For each exemplar, the best matching frame in the sequence is identified (exemplar displayed on top of the corresponding frame; light colors correspond to high matching distances, dark colors to low matching distances). The resulting matching distances d_i form the vector D*, which is interpreted as an embedding of the sequences into a low-dimensional space R^n. The final classifier is learned over R^n, where each point represents a complete sequence.
Probabilistic exemplar-based approaches [21] do model such uncertainties by converting distances into probabilities, but, as mentioned earlier, at the price of complex computations for normalization constants. We instead simply work on the vectors that result from concatenating all the minimum distances,

    D*(Y) = (d_1(Y), ..., d_n(Y)) ∈ R^n,    (2)

without any probabilistic treatment. Note that our representation is similar in principle to the embedding described in [2, 12] in a static context. We extend it to temporal sequences.
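To make the construction concrete, the following is a minimal sketch of the embedding in equations (1) and (2); it is not the authors' implementation (which was written in MATLAB). Frames and exemplars are assumed to be already vectorized, e.g. flattened binary silhouettes, and the default distance is the squared Euclidean distance of Section 3.3; the function name `embed_sequence` is illustrative.

```python
import numpy as np

def embed_sequence(frames, exemplars, dist=None):
    """Embed a sequence Y into R^n via minimum distances to the exemplars.

    frames:    array of shape (t, p), one row per frame (e.g. a flattened binary silhouette)
    exemplars: array of shape (n, p), one row per exemplar
    dist:      frame-to-exemplar distance d(x_i, y_j); defaults to squared Euclidean
    """
    if dist is None:
        dist = lambda x, y: float(np.sum((x - y) ** 2))
    # equation (1): d_i(Y) = min_j d(x_i, y_j)
    # equation (2): D*(Y) = (d_1(Y), ..., d_n(Y)) in R^n
    return np.array([min(dist(x, y) for y in frames) for x in exemplars])
```

Whatever its length or speed, a sequence is thus reduced to a single point in R^n, on which the classifier of Section 3.2 operates.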
3.2. Classifier
In the embedding space R^n, classification of time sequences reduces to a simple operation, which is to label the vectors D*(Y). A major advantage over traditional approaches is that such vectors encode complete sequences without the need for time normalizations or alignments. These vectors are points in R^n that are labelled using a standard Bayes classifier. Each class c ∈ 1...C is represented through a single Gaussian distribution p(D*|c) = N(D*|µ_c, Σ_c), which we found adequate in experiments to model all important dependencies between exemplars. Assignments are determined through maximum a posteriori estimation:

    g(D*) = argmax_c p(D*|c) p(c),    (3)

with p(c) being the prior of class c which, without loss of generality, is assumed to be uniform.
Note that when estimating the covariance Σ_c, and depending on the dimension n, it is often the case that insufficient training data is available, and consequently the estimate may be non-invertible. We hence work with a regularized covariance of the form Σ̂ = Σ + εI, with I being the identity matrix and ε a small value.
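A minimal sketch of this classifier over embedding vectors, assuming uniform class priors and the regularized covariance Σ + εI described above (the function names `fit_gaussian_bayes` and `classify` are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_bayes(D_train, labels, eps=1e-3):
    """Fit one Gaussian N(mu_c, Sigma_c + eps*I) per class over embedding vectors."""
    params = {}
    for c in np.unique(labels):
        Dc = D_train[labels == c]
        mu = Dc.mean(axis=0)
        Sigma = np.cov(Dc, rowvar=False) + eps * np.eye(D_train.shape[1])  # regularized covariance
        params[c] = (mu, Sigma)
    return params

def classify(D, params):
    """Equation (3) with uniform class priors: g(D*) = argmax_c p(D*|c)."""
    scores = {c: multivariate_normal.logpdf(D, mean=mu, cov=Sigma)
              for c, (mu, Sigma) in params.items()}
    return max(scores, key=scores.get)
```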
3.3. Image Representation and Distance Functions
Actions are represented as vectors of distances from exemplars to the frames in the action's sequence. Such distances could be of several types, depending on the available information in the images, e.g. silhouettes or edges. In the following, we assume that silhouettes are available for the exemplars, which is a reasonable assumption in the learning phase, and we consider two situations for recognition. First, silhouettes, obtained for instance with background subtraction, are available; second, only edges can be considered.
Silhouette-to-Silhouette Matching. In this scenario we assume that background-subtracted sequences are available. Consequently, x and y are both represented through silhouettes. While difficult to obtain in many practical contexts, silhouettes, when available, provide rich and strong cues. Consequently, they can be matched with a standard distance function, and we choose the squared Euclidean distance d(x, y) = |x − y|^2, which is computed between the vector representations of the binary silhouette images. Hence, the distance is simply the number of pixels with different values in both images.
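For binary silhouette images this distance reduces to a count of disagreeing pixels; a small illustrative sketch of that interpretation:

```python
import numpy as np

def silhouette_distance(x, y):
    """Squared Euclidean distance between two binary silhouette images.

    For {0,1}-valued images this is exactly the number of pixels
    at which the two silhouettes disagree.
    """
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    return float(np.sum((x - y) ** 2))
```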
Silhouette-to-Edge Matching. In a more realistic scenario, background subtraction will not be possible due to a moving or changing background as well as changing lighting, among other reasons. In that case, more advanced distances dealing with imperfect image segmentations must be considered. In our experiments, we use such a scenario, where edge observations y, instead of silhouettes, are taken into account. In such observations, edges are usually spurious or missing. As mentioned earlier, we assume that exemplars are represented through edge templates, computed using background subtraction in a learning phase. The distance we consider is then the Chamfer distance [10], which measures the closest distance from each edge point of the observation x to any edge point in the exemplar y,

    d(x, y) = (1 / |x|) Σ_{f ∈ x} d_y(f),    (4)

where |x| is the number of edge points in x and d_y(f) is the distance between edge point f and the closest edge point in y. An efficient way to compute the Chamfer distance is by correlating the distance-transformed observation with the exemplar silhouette.
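Equation (4) can be evaluated efficiently via a distance transform, as the last sentence suggests. A minimal sketch, assuming binary edge maps of equal size and using SciPy's Euclidean distance transform in place of a chamfer approximation; here the distance transform of the observation is averaged over the exemplar template's edge pixels, matching the correlation formulation (the roles of observation and template can be swapped depending on convention):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_distance(obs_edges, template_edges):
    """Mean distance from each template edge pixel to the nearest observation edge pixel.

    obs_edges, template_edges: binary 2D arrays of identical size (True/1 at edge pixels).
    """
    # distance transform of the observation: at every pixel, the distance
    # to the nearest observation edge pixel
    dt = distance_transform_edt(~obs_edges.astype(bool))
    pts = template_edges.astype(bool)
    if not pts.any():
        return np.inf
    # averaging dt over the template's edge pixels corresponds to correlating
    # the distance-transformed observation with the edge template
    return float(dt[pts].mean())
```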
4. Key-Pose Selection
In the previous section, we assumed that the exemplars, a set of discriminative primitives, are known. We explain in this section how to obtain them using forward feature selection. In a classical way, such a selection has to deal with two conflicting objectives. First, the set of exemplars must be small, to avoid learning and classification in high dimensions (curse of dimensionality) and to allow for fast computations. Second, the set must contain enough elements to account for variations within and between classes. We will use the wrapper technique for feature selection introduced in [16], but other possibilities will be discussed in Section 4.2.
Several criteria exist to measure and optimize the quality of a feature set (see e.g. [13]). The wrapper approach can be seen as a direct and straightforward solution to this problem. The criterion optimized is the validation performance of the considered classifier, which is itself used as a black box by the wrapper while performing a greedy search over the feature space. There are different search strategies for the wrapper; we use forward selection, which we recently applied successfully in a similar setting [24].
4.1. Forward Selection
Forward selection is a bottom-up search procedure that adds new exemplars to the final exemplar set one at a time until the final set size is reached. Candidate exemplars are all frames in the training set, or a sub-sampled set of these frames. In each step of the selection, classifiers for each candidate exemplar set are learned and evaluated. Consequently, in the first iteration a classifier for each single candidate exemplar is learned, the exemplar with the best evaluation performance is added to the final exemplar set, and the learning and evaluation step is repeated using pairs of exemplars (containing the one already selected), triples, quadruples, etc. The algorithm is given below (see Algorithm 1).
Algorithm 1 Forward Selection
Input: training sequences Y = {Y_1, ..., Y_m}, validation sequences Ŷ = {Y_1, ..., Y_m̂}
1. let candidate exemplar set X̄ = {y : y ∈ Y}
2. let final exemplar set X = ∅
3. while the size of X is smaller than n:
   (a) for each y ∈ X̄:
       i. set X′ ← {y} ∪ X
       ii. train classifier g on Y using the exemplar set X′ and record the validation performance on Ŷ
   (b) set X ← {y*} ∪ X, where y* corresponds to the best validation performance obtained in step 3(a). If multiple y* with the same performance exist, randomly pick one.
   (c) set X̄ ← X̄ \ {y*}
4. return X
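A minimal sketch of Algorithm 1, reusing the illustrative `embed_sequence`, `fit_gaussian_bayes`, and `classify` functions from the earlier sketches. Accuracy on the validation sequences serves as the wrapper criterion; ties are broken by order here rather than randomly as in the algorithm.

```python
import numpy as np

def forward_selection(train_seqs, train_labels, val_seqs, val_labels,
                      candidate_frames, n_exemplars, dist=None):
    """Greedy wrapper selection of a small, discriminative exemplar set."""
    candidates = list(candidate_frames)   # X_bar: candidate exemplars (individual frames)
    selected = []                         # X: final exemplar set
    train_labels = np.asarray(train_labels)
    while len(selected) < n_exemplars and candidates:
        best_acc, best_idx = -1.0, None
        for idx, cand in enumerate(candidates):
            exemplars = np.array(selected + [cand])          # X' = {y} U X
            D_train = np.array([embed_sequence(s, exemplars, dist) for s in train_seqs])
            D_val = np.array([embed_sequence(s, exemplars, dist) for s in val_seqs])
            params = fit_gaussian_bayes(D_train, train_labels)
            preds = [classify(d, params) for d in D_val]
            acc = float(np.mean([p == l for p, l in zip(preds, val_labels)]))
            if acc > best_acc:                               # keep the best-performing candidate
                best_acc, best_idx = acc, idx
        selected.append(candidates.pop(best_idx))            # add y* to X, remove it from X_bar
    return np.array(selected)
```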
4.2. Selection Discussion
Many techniques have been used in the literature to select exemplars and vocabulary sets in related approaches. For instance, several methods sub-sample or cluster the space of exemplars, e.g. [2, 21]. While generally applicable in our context, such methods nevertheless require very large sets of exemplars in order to reach the performance of a smaller set that has been specifically selected with respect to an optimization criterion. Moreover, as we observed in [24], a clustering can miss important discriminative exemplars, e.g. clusters may discriminate body shapes instead of actions.

Figure 3. Sample sequences and corresponding edge images. (Top
to bottom) jumping-jack, walk, wave two hands.
Another solution is to select features based on advanced classification techniques such as support vector machines [22] or AdaBoost [9]. Unfortunately, support vector machines are mainly designed for binary classification and, though extensions to multiple classes exist, they hardly extract a single feature set for all classes. On the other hand, AdaBoost [9] can be extended to multiple classes and is known for its ability to search over large numbers of features. We experimented with AdaBoost using weak classifiers based on single exemplars and pairs of exemplars, but performances were less consistent than with the forward selection.
Wrapper methods, such as forward selection, are known to be particularly robust against over-fitting [13], but are sometimes criticized for being slow due to the repetitive learning and evaluation cycles. In our case, we need approximately n × m learning and validation cycles to select n features out of a candidate set of size m. With a non-optimized implementation in MATLAB, selecting approximately 50 features out of a few hundred takes around 5 minutes. This is a very reasonable computation time considering that this step is only required during the learning phase and that a compact exemplar set benefits all recognition phases.
5. Experiments
We have evaluated our approach on the Weizmann-dataset [3] (see Figures 1 and 3), which has recently been used by several authors [1, 14, 18, 20, 23]. It contains 10 actions: bend (bend), jumping-jack (jack), jump-in-place (pjump), jump-forward (jump), run (run), gallop-sideways (side), jump-forward-one-leg (skip), walk (walk), wave one hand (wave1), and wave two hands (wave2), performed by 9 actors. Silhouettes extracted from backgrounds and the original image sequences are provided.
All recognition rates were computed with leave-one-out cross-validation. Details are as follows: 8 of the 9 actors in the database are used to train the classifier and select the exemplars, and the 9th is used for the evaluation. This is repeated for all 9 actors and the rates are averaged. For the exemplar selection, we further need to divide the 8 training actors into training and validation sets. We do this as well with leave-one-out cross-validation, using 7 training actors and the 8th as the validation set, then iterating over all possibilities. Exemplars are always selected from the 8 training actors, and never from the 9th actor, which is used for the evaluation. Also note that, due to the small size of the training set, the validation rate can easily reach 100% if too many exemplars are considered. In this case, we randomly remove exemplars during the validation step to reduce the validation rate and to allow new exemplars to be added. For testing, we nevertheless use all selected exemplars.
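A simplified sketch of this leave-one-actor-out protocol, reusing the illustrative functions from the earlier sketches. For brevity it uses a single inner validation fold and trains the final classifier on the remaining actors, whereas the paper iterates over all inner folds and selects exemplars from all 8 training actors; the data layout (sequences and labels grouped per actor) is assumed.

```python
import numpy as np

def leave_one_actor_out(seqs_by_actor, labels_by_actor, n_exemplars, dist=None):
    """Average test accuracy over folds, holding one actor out at a time."""
    actors = sorted(seqs_by_actor)
    accuracies = []
    for test_actor in actors:
        train_actors = [a for a in actors if a != test_actor]
        # single inner split for brevity: last remaining actor as validation set
        val_actor, fit_actors = train_actors[-1], train_actors[:-1]
        train_seqs = [s for a in fit_actors for s in seqs_by_actor[a]]
        train_labels = [l for a in fit_actors for l in labels_by_actor[a]]
        # candidate exemplars: frames sub-sampled by a factor of 1/20 (Section 5.1)
        candidates = [f for s in train_seqs for f in s[::20]]
        exemplars = forward_selection(train_seqs, train_labels,
                                      seqs_by_actor[val_actor], labels_by_actor[val_actor],
                                      candidates, n_exemplars, dist)
        D_train = np.array([embed_sequence(s, exemplars, dist) for s in train_seqs])
        params = fit_gaussian_bayes(D_train, np.array(train_labels))
        preds = [classify(embed_sequence(s, exemplars, dist), params)
                 for s in seqs_by_actor[test_actor]]
        accuracies.append(np.mean([p == l for p, l in
                                   zip(preds, labels_by_actor[test_actor])]))
    return float(np.mean(accuracies))
```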
5.1. Evaluation on Segmented Sequences
In these experiments, the background-subtracted silhouettes which are provided with the Weizmann-dataset were used to evaluate our method. For the exemplar selection, we first uniformly sub-sample the sequences by a factor of 1/20 and perform the selection on the remaining set of approximately 300 candidate frames. When we use all 300 frames as exemplars, the recognition rate of our method is 100%.
To reduce the number of exemplars, we search over this set via forward selection. In Figure 4, we show a sample exemplar set as returned by the selection method. Figure 4(a) shows the average validation rates per action, which were computed on the training set during the selection. Note that even though the overall validation rate reaches 100% for 15 exemplars, not all classes are explicitly represented through an exemplar, indicating that exemplars are shared between actions. The recognition rate on the test set with respect to the number of exemplars is shown in Figure 4(b). Since the forward selection includes one random step, in the case where several exemplars present the same validation rate, we repeat the experiment 10 times with all actors and average over the results. In Figure 4(c), we show recognition rates for the individual classes. Note in particular the actions jump-forward and jump-forward-one-leg, which are difficult to classify because they are easily confused.
In summary, our approach can reach recognition rates of up to 100% with approximately 120 exemplars. Moreover, with very small exemplar sets (e.g. around 20 exemplars), the average recognition rate on a dataset of 10 actions and 9 actors is already higher than 90% and continuously increases with additional exemplars (e.g. 97.7% for 50 exemplars). In comparison, the space-time volume ap-
References
V. N. Vapnik. Statistical Learning Theory.
Y. Freund and R. E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting.
I. Guyon and A. Elisseeff. An Introduction to Variable and Feature Selection.
R. Kohavi and G. H. John. Wrappers for Feature Subset Selection.
G. Johansson. Visual Perception of Biological Motion and a Model for Its Analysis.