Performance metrics for activity recognition
Summary
1. INTRODUCTION
- Human activity recognition (AR) is a fast-growing research topic with many promising real-world applications.
- As it matures, so does the need for a comprehensive system of metrics that can be used to summarise and compare different AR systems.
- A valid methodology for performance evaluation should fulfil two basic criteria: (1) It must be objective and unambiguous.
2. PERFORMANCE EVALUATION
- In its general form AR is a multi-class problem with c “interesting” classes plus a “NULL” class.
- In addition to insertions and deletions, such multi-class problems can produce substitution errors, which are instances of one class being mistaken for another.
- The quality of the similarity measure depends on the application domain and the underlying assumptions.
- (2) Events in the classifier output are detected with a time shift of at most the duration of the event itself.
- Another permissible variant is that several events in the output overlap with one event in the ground truth.
2.1 Existing Methods for Error Scoring
- Performance metrics are usually calculated in three steps.
- From this comparison, matches and errors are scored.
- Two basic units of comparison are typically used: frames or events.
- Scoring frames: a frame is often the smallest unit of measure defined by the system (the sample rate) and in such cases approximates continuous time; a minimal scoring sketch follows at the end of this section.
- There is not necessarily a one-to-one relation between E and R. A comparison can instead be made using alternative means: for example DTW [Berndt and Clifford 1994], measuring the longest common subsequence [Agrawal et al. 1995], or a combination of different transformations [Perng et al. 2000].
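As a minimal illustration of frame scoring, the sketch below (illustrative code, not from the paper) assigns each frame of a binary problem to one of TP, TN, FP, or FN:

```python
# Minimal sketch of frame-by-frame scoring for a binary AR problem.
# `ground` and `output` are equal-length per-frame sequences
# (1/True = positive class); the names are illustrative only.
from collections import Counter

def score_frames(ground, output):
    counts = Counter()
    for g, o in zip(ground, output):
        if g and o:
            counts["TP"] += 1      # both agree on the positive class
        elif not g and not o:
            counts["TN"] += 1      # both agree on the negative class
        elif o:
            counts["FP"] += 1      # output fires, ground truth does not
        else:
            counts["FN"] += 1      # ground-truth frame missed
    return counts

# A 10-frame example with one missed frame and one overfilled frame:
ground = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
output = [0, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(score_frames(ground, output))
# Counter({'TN': 5, 'TP': 3, 'FN': 1, 'FP': 1})
```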
2.2 Shortcomings of Conventional Performance Characterisation
- Existing metrics often fall short of providing sufficient insight into the performance of an AR system.
- These plot a short section (300 s) of results described by Bulling et al. [2008] on the recognition of reading activities using body-worn sensors.
- The authors also decide that several events detected by one output count only as a single true positive.
- Together with a poorer event precision, this indicates a larger number of false insertions in A.
- However they fail to explicitly account for fragmented or merged events.
2.3 Significance of the Problem
- To assess the prevalence of fragmenting, merging, and timing errors, the authors surveyed a selection of papers on continuous AR published between 2004 and 2010 at selected computing conferences and journals (e.g., Pervasive, Ubicomp, Wearable Computing).
- Table I highlights the main metrics used by each work, and whether these were based on frame, event, or some combination of both evaluation methods.
- The final 3 columns indicate, either through explicit mention in the paper or through evidence in an included graph, whether artefacts such as timing errors, fragmenting, or merging were encountered.
- The simple frame-based accuracy metric was heavily used in earlier work (often accompanied by a full confusion matrix), but has since given way to the pairing of precision and recall.
- In most, however, there is strong evidence of timing offsets being an issue.
3. EXTENDED METHODS USING ADDITIONAL ERROR CATEGORIES
- Ward et al. [2006a] introduced an extension to the standard frame scoring scheme that the authors adopt here for the single-class problem.
- First the authors introduce additional categories of events to capture information on fragmenting and merge behaviour.
- The authors then show how these are scored in an objective and unambiguous way.
3.1 Additional Event Information
- This is when several events in the ground truth are recognised by a single return in the output (see the sketch at the end of this section).
- The authors say that these ground events are merged (M), and refer to the single return event as a merging return (M ′).
- A ground event can be both fragmented and merged.
- The ground truth event is clearly fragmented (into two returns).
- But the second return in A also covers another event, thus merging the two.
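Under the definitions above, fragmented and merged ground events can be flagged by counting overlaps between ground-truth and returned event intervals; the following is a rough sketch, not the authors' implementation:

```python
# Flag fragmented (F) and merged (M) ground-truth events from event
# intervals. Events are (start, end) pairs; overlap is half-open.

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def fragmented_and_merged(ground_events, return_events):
    fragmented, merged = set(), set()
    for gi, g in enumerate(ground_events):
        hits = [r for r in return_events if overlaps(g, r)]
        if len(hits) > 1:
            fragmented.add(gi)        # one ground event, several returns
    for r in return_events:
        covered = [gi for gi, g in enumerate(ground_events)
                   if overlaps(g, r)]
        if len(covered) > 1:
            merged.update(covered)    # one return spans several ground events
    # an event index may appear in both sets: fragmented and merged (FM)
    return fragmented, merged
```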
3.2 Scoring Segments
- An alternative scoring strategy, introduced by Ward et al. [2006a], provides a mid-way solution: it keeps the one-to-one mapping of frame scoring while retaining useful information from event scoring.
- This hybrid scheme is based on the notion of segments.
- A segment is the largest part of an event on which the comparison between the ground truth and the output of the recognition system can be made in an unambiguous way.
- For a binary problem, positive (p) versus negative (n), there are four possible outcomes to be scored: TPs, TNs, FPs and FNs.
- Insertion, Is: an FPs that corresponds exactly to an inserted return, I.
- Merge, Ms: an FPs that occurs between two TPs segments within a merging return (i.e., the part that joins two events).
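One way to construct such segments is to cut the timeline wherever either the ground truth or the output changes; the sketch below follows this definition, with illustrative names:

```python
# Build segments: maximal intervals over which neither the ground
# truth nor the system output changes, so each segment has exactly
# one unambiguous score (TP, TN, FP, or FN).

def segments(ground, output):
    """ground/output: equal-length per-frame boolean sequences."""
    segs, start = [], 0
    for i in range(1, len(ground) + 1):
        at_end = i == len(ground)
        if at_end or ground[i] != ground[start] or output[i] != output[start]:
            g, o = ground[start], output[start]
            label = ("TP" if g and o else
                     "TN" if not g and not o else
                     "FP" if o else "FN")
            segs.append((start, i, label))  # half-open frame range
            start = i
    return segs
```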
3.3 Scoring Frames
- Once the authors have assigned error categories to segments, it is a simple matter to transfer those assignments to the frames that constitute each segment.
- The authors use these numbers in their subsequent frame analysis.
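Since every frame inherits the category of the segment containing it, transferring segment scores to frames reduces to summing segment lengths per category; a small sketch building on the `segments` helper above:

```python
from collections import Counter

def frame_counts(segs):
    # segs: (start, end, label) triples; frame count = segment length
    counts = Counter()
    for start, end, label in segs:
        counts[label] += end - start
    return counts
```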
3.4 Deriving Event Scores Using Segments
- Figure 2(b) shows an example of how event scores can be unambiguously assigned using information provided by the corresponding segment scores.
- Note that a key difference between the frame (and segment) error scores and the event scores is that the former analysis focuses on characterising and reporting frame errors (FP and FN), whereas here the authors report on counts of matched events.
- This is a troublesome definition because it completely ignores the possibility of fragmentation.
- The authors assume that it is better to classify as correct only those events that cannot be assigned to any of the other event categories.
- A correct event, as used here, is one that is matched with exactly one return event.
3.5 Limits of Time Shift Tolerance
- A key concern behind their work is to distinguish between errors that are caused by small shifts in the recognition timing (which may be irrelevant for many applications) and the more “serious” errors of misclassified instances.
- This may seem surprising given the fact that their evaluation works on sequential segment comparison.
- So long as the recognized event has an overlap with the ground truth there will be a segment that is identified as correct, and adjoining segments will be labelled as timing errors (or fragmentation/merge when relevant).
- Clearly, this would be a problem in cases that involve very short (in terms of the time scale of the sensor and recognition system), widely spaced events.
- Moreover, many applications look at complex, longer-term activities that can take many seconds or even minutes.
4.1 Frame Metrics
- Accuracy ((TP + TN)/(P + N)) is the most commonly used metric that can be calculated from a confusion matrix.
- One drawback of precision is that it is heavily affected by changes in the proportions of classes in the dataset (class skew) [Fawcett 2004].
- For this reason the authors prefer the skew-invariant fpr metric paired alongside tpr.
- This is sometimes summarised in a single area-under-curve (AUC) metric [Ling et al. 2003].
- This 2-class segment error table (2SET) is shown in Figure 4(a).
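The frame metrics discussed above follow directly from the four counts; a minimal sketch (the guards against empty classes are an implementation detail, not from the paper):

```python
# Skew-sensitive (accuracy, precision) vs. skew-invariant (tpr, fpr)
# frame metrics computed from TP/TN/FP/FN frame counts.

def frame_metrics(TP, TN, FP, FN):
    P, N = TP + FN, TN + FP              # total positive / negative frames
    return {
        "accuracy":  (TP + TN) / (P + N),
        "precision": TP / (TP + FP) if TP + FP else 0.0,
        "tpr":       TP / P if P else 0.0,   # true positive rate (recall)
        "fpr":       FP / N if N else 0.0,   # false positive rate
    }
```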
4.2 Event Metrics
- From the categories laid out in 3.4 there are 8 different types of event error scores.
- Four of these can be applied to ground truth events: deletions (D), fragmented (F), fragmented and merged (FM) and merged (M).
- Together with correct events (C), these scores can be visualised in a single figure (see Figure 5), which the authors term the event analysis diagram (EAD).
- Likewise, C + M′ + FM′ + F′ + I completely contains all of the returned events in a system output.
- The EAD trivially shows exact counts of the event categories.
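The two identities behind the EAD can be checked mechanically; the tallies below are made-up numbers purely for illustration:

```python
# Illustrative EAD tallies. C is common to both sums because a correct
# event pairs exactly one ground-truth event with one returned event.
ground_counts = {"C": 10, "D": 2, "F": 3, "FM": 1, "M": 4}     # ground truth
return_counts = {"C": 10, "M'": 2, "FM'": 1, "F'": 7, "I": 5}  # system output

# C + D + F + FM + M accounts for every ground-truth event:
assert sum(ground_counts.values()) == 20
# C + M' + FM' + F' + I accounts for every returned event:
assert sum(return_counts.values()) == 25
```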
4.3 Application to Reading Example
- The frame results for the two examples, A and B, are shown in pie chart format in Figure 6(b).
- At first glance, these figures reveal the most striking differences between the two examples: the existence of insertion (ir) and fragmenting (fr) errors in A, where none are seen in B.
- This influence of inexact timing is not apparent when the standard metrics in Figure 1 are used.
- The charts are useful for indicating how much of the false negative and false positive frames is given over to specific types of error.
- This is where an event analysis is useful.
5. DATASETS
- To assess the utility of the proposed method the authors use results calculated from three publicly available datasets: D1, from Bulling et al.; D2, from Huynh et al.; and D3, from Logan et al.
- Following the original papers, each set is evaluated using a different classifier: D1 using string matching; D2 using HMMs; and D3 using a decision tree.
- The aim of this diverse selection is to show that the method can be applied to a range of different datasets and using different classifiers.
- The authors do not intend to compare these results with one another (nor with the original results as published).
- Rather the authors wish to show how results compare when presented using traditional metrics against those presented using their proposed metrics.
5.1 EOG Reading Dataset (D1)
- The example in Figure 1 was taken from a study by Bulling et al. on recognising reading activity from patterns of horizontal electrooculogram-based (EOG) eye movements.
- Six hours of data were collected from eight participants (D1 can be downloaded at http://www.andreas-bulling.de/publications/conferences/).
- The activities in this dataset are very fine-grained.
- Following the method described in the original paper, the authors use string matching on discretised sequences of horizontal eye movements.
- A threshold is applied to the output distance vector to determine 'reading' or not.
5.2 Darmstadt Daily Routines Dataset (D2)
- Huynh et al. introduced a novel approach for modelling daily routines using data from pocket and wrist-mounted accelerometers.
- They collected a 7-day, single-subject dataset.
- The remaining 25% of the dataset is not modelled here (the unclassified, or ‘NULL’, case).
- Each observation feature vector is modelled using a mixture of two Gaussians.
- The competing models are successively applied to a 30 s sliding window.
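A minimal sketch of such competing-model classification over a sliding window, assuming scikit-learn-style GaussianMixture models with a `score_samples` API (an assumption for illustration; the original work uses its own model implementation):

```python
from sklearn.mixture import GaussianMixture  # assumed stand-in model

def classify_windows(X, models, win=30, step=30):
    """X: (n_frames, n_features) feature matrix at one frame per second;
    models: dict mapping class label -> fitted GaussianMixture."""
    labels = []
    for start in range(0, len(X) - win + 1, step):
        window = X[start:start + win]
        # each class model scores the window; the highest total
        # log-likelihood wins the window
        best = max(models,
                   key=lambda m: models[m].score_samples(window).sum())
        labels.append((start, best))
    return labels
```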
5.3 MIT PLCouple1 Dataset (D3)
- Logan et al. presented a study aimed at recognising common activities in an indoor setting using a large variety and number of ambient sensors.
- A single subject was tracked and annotated for 100 hours using the MIT PlaceLab [Intille et al. 2006].
- A wide range of activities are targeted, five of which the authors choose as a representative sample of the dataset, including watching TV, dishwashing, eating, and using a computer.
5.4 Application of Metrics to Datasets
- Table II shows how the results from the three datasets might be analysed using standard metrics.
- The opposite is also shown here: the ‘N’ charts for the D3 classes show that by far the most common frame errors within fpr are insertions (ir).
- High ir correlates with what might be expected given the low event precision for these classes.
- For reading (D1): almost 52% (14) of the events are merged together into 6 large merge outputs.
6.1 Highlighting the Benefits
- To illustrate the benefits of the proposed metrics the authors take a second, more detailed look at two examples from the data presented in 5.4.
- In both classes around 50% of the positive frames are correctly recognized.
- Thus, in both cases, an application designer may be inclined not to use the system, or to find a work-around that does not require the recognition of the particular classes.
- For both classes around half of non-recognized true positive frames are due to timing errors, not real deletions.
- This implies that the number of events that the system has returned is between 5 (for computer) and nearly 20 (for watching TV) times higher than the true number of events.
7. CONCLUSION
- The authors have shown that on results generated using published, non-trivial datasets, the proposed metrics reveal novel information about classifier performance, for both frame and event analysis.
- Because it is based on total durations, or number of frames, this method of reporting can be misleading when activity event durations are variable.
- AR researchers have largely avoided this method of evaluation, in part, because of the difficulty of scoring correct and incorrect activities.
- The introduction of a full characterisation of fragmented and merged events, and a revised definition of insertions and deletions, provides one possible solution to these difficulties.
Frequently Asked Questions (11)
Q2. What future work have the authors mentioned in the paper "Performance metrics for activity recognition"?
- The authors believe that it is better to show all results in the brightest (coldest) light, and then give explanations afterwards if need be. Again, a challenge for future work is how this information might be displayed in an informative way. In these cases, the events marked F may be aggregated with C and presented in an additional, application-specific metric. Although SET completely captures both segment and frame errors, it can be difficult to interpret.
Q3. What are the common metrics used to score frames?
Because of the one-to-one mapping between ground and output, scoring frames is trivial, with frames assigned to one of: true positive (TP), true negative (TN), false positive (FP) or false negative (FN).
Q4. What are the examples of substitution errors in multi-class AR?
- In addition to insertions and deletions, such multi-class problems can produce substitution errors, which are instances of one class being mistaken for another.
Q5. What are the new metrics for class skew invariance?
- To maintain class skew invariance, the new 2SET metrics introduced here are based around tpr and fpr: that is, FN errors are expressed as a ratio of the total positive frames, P; and the FP errors are expressed as a ratio of the total negative frames, N.
Q6. What are some of the common methods of comparing events?
Traditional event-based comparisons might be able to accommodate offsets using techniques such as dynamic time warping (DTW), or fuzzy event boundaries.
Q7. What is the main drawback of the segment-based method?
- The segment-based method presented by Ward et al. [2006a] is intrinsically multi-class: each pairing of ground truth and output segment is assigned to exactly one of six categories (insertion-deletion, insertion-underfill, insertion-fragmenting, overfill-deletion, overfill-underfill, and merge-deletion).
Q8. What is the way to handle inexact time matching of ground truth to output?
- The problem of how to handle inexact time matching of ground truth to output has been identified in a range of AR research, with a typical solution being to ignore any result within a set margin of the event boundaries [Bao and Intille 2004], or to employ some minimum coverage rule [Tapia et al. 2004; Westeyn et al. 2005; Fogarty et al. 2006].
Q9. What do the final 3 columns of Table I indicate?
The final 3 columns indicate, either through explicit mention in the paper, or through evidence in an included graph, whether artefacts such as timing errors, fragmenting or merge were encountered.
Q10. What are the two fundamental assumptions that the authors make?
- Here the authors make two fundamental assumptions: (1) ground truth and classifier prediction are available for each individual frame of the signal; and (2) events in the classifier output are detected with a time shift of at most the duration of the event.
Q11. What are the key observations of Figure 1?
Three key observations can be made of Figure 1: 1) some events in A are fragmented into several smaller chunks (f); 2) multiple events in B are recognised as a single merged output (m); and 3) outputs are often offset in time.