Performance metrics for activity recognition
Summary
1. INTRODUCTION
- Human activity recognition (AR) is a fast-growing research topic with many promising real-world applications.
- As it matures, so does the need for a comprehensive system of metrics that can be used to summarise and compare different AR systems.
- A valid methodology for performance evaluation should fulfil two basic criteria: (1) It must be objective and unambiguous.
2. PERFORMANCE EVALUATION
- In its general form AR is a multi-class problem with c “interesting” classes plus a “NULL” class.
- In addition to insertions and deletions, such multi-class problems can produce substitution errors, which are instances of one class being mistaken for another.
- The quality of the similarity measure depends on the application domain and the underlying assumptions.
- (2) Events in the classifier output are detected with a time shift of at most the duration of the event itself.
- Another permissible variant is that several events in the output overlap with one event in the ground truth.
2.1 Existing Methods for Error Scoring
- Performance metrics are usually calculated in three steps.
- From this comparison, matches and errors are scored.
- Two basic units of comparison are typically used: frames or events.
- Scoring frames: a frame is often the smallest unit of measure defined by the system (the sample rate) and in such cases approximates continuous time; a minimal scoring sketch follows at the end of this section.
- There is not necessarily a one-to-one relation between E and R. A comparison can instead be made using alternative means: for example DTW [Berndt and Clifford 1994], measuring the longest common subsequence [Agrawal et al. 1995], or a combination of different transformations [Perng et al. 2000].
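As a minimal illustration of frame scoring, the sketch below (illustrative code, not from the paper) assigns each frame of a binary problem to one of TP, TN, FP, or FN:

```python
# Minimal sketch of frame-by-frame scoring for a binary AR problem.
# `ground` and `output` are equal-length per-frame sequences
# (1/True = positive class); the names are illustrative only.
from collections import Counter

def score_frames(ground, output):
    counts = Counter()
    for g, o in zip(ground, output):
        if g and o:
            counts["TP"] += 1      # both agree on the positive class
        elif not g and not o:
            counts["TN"] += 1      # both agree on the negative class
        elif o:
            counts["FP"] += 1      # output fires, ground truth does not
        else:
            counts["FN"] += 1      # ground-truth frame missed
    return counts

# A 10-frame example with one missed frame and one overfilled frame:
ground = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
output = [0, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(score_frames(ground, output))
# Counter({'TN': 5, 'TP': 3, 'FN': 1, 'FP': 1})
```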
2.2 Shortcomings of Conventional Performance Characterisation
- Existing metrics often fall short of providing sufficient insight into the performance of an AR system.
- These plot a short section (300 s) of results described by Bulling et al. [2008] on the recognition of reading activities using body-worn sensors.
- The authors also decide that several events detected by one output count only as a single true positive.
- Together with a poorer event precision, this indicates a larger number of false insertions in A.
- However they fail to explicitly account for fragmented or merged events.
2.3 Significance of the Problem
- To assess the prevalence of fragmenting, merging, and timing errors, the authors surveyed a selection of papers on continuous AR published between 2004 and 2010 at selected computing conferences and journals (e.g., Pervasive, Ubicomp, Wearable Computing).
- Table I highlights the main metrics used by each work, and whether these were based on frame, event, or some combination of both evaluation methods.
- The final 3 columns indicate, either through explicit mention in the paper or through evidence in an included graph, whether artefacts such as timing errors, fragmenting, or merging were encountered.
- The simple frame-based accuracy metric was heavily used in earlier work (often accompanied by a full confusion matrix), but has since given way to the pairing of precision and recall.
- In most, however, there is strong evidence of timing offsets being an issue.
3. EXTENDED METHODS USING ADDITIONAL ERROR CATEGORIES
- Ward et al. [2006a] introduced an extension to the standard frame scoring scheme that the authors adopt here for the single-class problem.
- First the authors introduce additional categories of events to capture information on fragmenting and merge behaviour.
- The authors then show how these are scored in an objective and unambiguous way.
3.1 Additional Event Information
- This is when several events in the ground truth are recognised by a single return in the output (see the sketch at the end of this section).
- The authors say that these ground events are merged (M), and refer to the single return event as a merging return (M ′).
- A ground event can be both fragmented and merged.
- The ground truth event is clearly fragmented (into two returns).
- But the second return in A also covers another event, thus merging the two.
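Under the definitions above, fragmented and merged ground events can be flagged by counting overlaps between ground-truth and returned event intervals; the following is a rough sketch, not the authors' implementation:

```python
# Flag fragmented (F) and merged (M) ground-truth events from event
# intervals. Events are (start, end) pairs; overlap is half-open.

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def fragmented_and_merged(ground_events, return_events):
    fragmented, merged = set(), set()
    for gi, g in enumerate(ground_events):
        hits = [r for r in return_events if overlaps(g, r)]
        if len(hits) > 1:
            fragmented.add(gi)        # one ground event, several returns
    for r in return_events:
        covered = [gi for gi, g in enumerate(ground_events)
                   if overlaps(g, r)]
        if len(covered) > 1:
            merged.update(covered)    # one return spans several ground events
    # an event index may appear in both sets: fragmented and merged (FM)
    return fragmented, merged
```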
3.2 Scoring Segments
- An alternative scoring strategy, introduced by Ward et al. [2006a], provides a mid-way solution: it keeps the one-to-one mapping of frame scoring while retaining useful information from event scoring.
- This hybrid scheme is based on the notion of segments.
- A segment is the largest part of an event on which the comparison between the ground truth and the output of the recognition system can be made in an unambiguous way.
- For a binary problem, positive (p) versus negative (n), there are four possible outcomes to be scored: TPs, TNs, FPs and FNs.
- Insertion, Is: an FPs that corresponds exactly to an inserted return, I.
- Merge, Ms: an FPs that occurs between two TPs segments within a merging return (i.e., the part that joins two events).
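One way to construct such segments is to cut the timeline wherever either the ground truth or the output changes; the sketch below follows this definition, with illustrative names:

```python
# Build segments: maximal intervals over which neither the ground
# truth nor the system output changes, so each segment has exactly
# one unambiguous score (TP, TN, FP, or FN).

def segments(ground, output):
    """ground/output: equal-length per-frame boolean sequences."""
    segs, start = [], 0
    for i in range(1, len(ground) + 1):
        at_end = i == len(ground)
        if at_end or ground[i] != ground[start] or output[i] != output[start]:
            g, o = ground[start], output[start]
            label = ("TP" if g and o else
                     "TN" if not g and not o else
                     "FP" if o else "FN")
            segs.append((start, i, label))  # half-open frame range
            start = i
    return segs
```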
3.3 Scoring Frames
- Once the authors have assigned error categories to segments, it is a simple matter to transfer those assignments to the frames that constitute each segment.
- The authors use these numbers in their subsequent frame analysis.
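Since every frame inherits the category of the segment containing it, transferring segment scores to frames reduces to summing segment lengths per category; a small sketch building on the `segments` helper above:

```python
from collections import Counter

def frame_counts(segs):
    # segs: (start, end, label) triples; frame count = segment length
    counts = Counter()
    for start, end, label in segs:
        counts[label] += end - start
    return counts
```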
3.4 Deriving Event Scores Using Segments
- Figure 2(b) shows an example of how event scores can be unambiguously assigned using information provided by the corresponding segment scores.
- Note that a key difference between the frame (and segment) error scores and the event scores is that the former analysis focuses on characterising and reporting frame errors (FP and FN), whereas here the authors report on counts of matched events.
- This is a troublesome definition because it completely ignores the possibility of fragmentation.
- The authors assume that it is better to classify as correct only those events that cannot be assigned to any of the other event categories.
- A correct event, as used here, is one that is matched with exactly one return event.
3.5 Limits of Time Shift Tolerance
- A key concern behind their work is to distinguish between errors that are caused by small shifts in the recognition timing (which may be irrelevant for many applications) and the more “serious” errors of misclassified instances.
- This may seem surprising given the fact that their evaluation works on sequential segment comparison.
- So long as the recognized event has an overlap with the ground truth there will be a segment that is identified as correct, and adjoining segments will be labelled as timing errors (or fragmentation/merge when relevant).
- Clearly, this would be a problem in cases that involve very short (in terms of the time scale of the sensor and recognition system), widely spaced events.
- Moreover, many applications look at complex, longer-term activities that can take many seconds or even minutes.
4.1 Frame Metrics
- Accuracy ((TP + TN)/(P + N)) is the most commonly used metric that can be calculated from a confusion matrix.
- One drawback of precision is that it is heavily affected by changes in the proportions of classes in the dataset (class skew) [Fawcett 2004].
- For this reason the authors prefer the skew-invariant fpr metric paired alongside tpr.
- This is sometimes summarised in a single area-under-curve (AUC) metric [Ling et al. 2003].
- This 2-class segment error table (2SET) is shown in Figure 4(a).
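The frame metrics discussed above follow directly from the four counts; a minimal sketch (the guards against empty classes are an implementation detail, not from the paper):

```python
# Skew-sensitive (accuracy, precision) vs. skew-invariant (tpr, fpr)
# frame metrics computed from TP/TN/FP/FN frame counts.

def frame_metrics(TP, TN, FP, FN):
    P, N = TP + FN, TN + FP              # total positive / negative frames
    return {
        "accuracy":  (TP + TN) / (P + N),
        "precision": TP / (TP + FP) if TP + FP else 0.0,
        "tpr":       TP / P if P else 0.0,   # true positive rate (recall)
        "fpr":       FP / N if N else 0.0,   # false positive rate
    }
```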
4.2 Event Metrics
- From the categories laid out in 3.4 there are 8 different types of event error scores.
- Four of these can be applied to ground truth events: deletions (D), fragmented (F), fragmented and merged (FM) and merged (M).
- Together with correct events (C), these scores can be visualised in a single figure (see Figure 5), which the authors term the event analysis diagram (EAD).
- Likewise, C + M′ + FM′ + F′ + I completely contains all of the returned events in a system output.
- The EAD trivially shows exact counts of the event categories.
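The two identities behind the EAD can be checked mechanically; the tallies below are made-up numbers purely for illustration:

```python
# Illustrative EAD tallies. C is common to both sums because a correct
# event pairs exactly one ground-truth event with one returned event.
ground_counts = {"C": 10, "D": 2, "F": 3, "FM": 1, "M": 4}     # ground truth
return_counts = {"C": 10, "M'": 2, "FM'": 1, "F'": 7, "I": 5}  # system output

# C + D + F + FM + M accounts for every ground-truth event:
assert sum(ground_counts.values()) == 20
# C + M' + FM' + F' + I accounts for every returned event:
assert sum(return_counts.values()) == 25
```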
4.3 Application to Reading Example
- The frame results for the two examples, A and B, are shown in pie chart format in Figure 6(b).
- At first glance, these figures reveal the most striking differences between the two examples: the existence of insertion (ir) and fragmenting (fr) errors in A, where none are seen in B.
- This influence of inexact timing is not apparent when the standard metrics in Figure 1 are used.
- The charts are useful for indicating how much of the false negative and false positive frames is given over to specific types of error.
- This is where an event analysis is useful.
5. DATASETS
- To assess the utility of the proposed method the authors use results calculated from three publicly available datasets: D1, from Bulling et al.; D2, from Huynh et al.; and D3, from Logan et al.
- Following the original papers, each set is evaluated using a different classifier: D1 using string matching; D2 using HMMs; and D3 using a decision tree.
- The aim of this diverse selection is to show that the method can be applied to a range of different datasets and using different classifiers.
- The authors do not intend to compare these results with one another (nor with the original results as published).
- Rather the authors wish to show how results compare when presented using traditional metrics against those presented using their proposed metrics.
5.1 EOG Reading Dataset (D1)
- The example in Figure 1 was taken from a study by Bulling et al. on recognising reading activity from patterns of horizontal electrooculogram-based (EOG) eye movements.
- Six hours of data were collected from eight participants (D1 can be downloaded at http://www.andreas-bulling.de/publications/conferences/).
- The activities in this dataset are very fine-grained.
- Following the method described in the original paper, the authors use string matching on discretised sequences of horizontal eye movements.
- A threshold is applied to the output distance vector to determine 'reading' or not.
5.2 Darmstadt Daily Routines Dataset (D2)
- Huynh et al. introduced a novel approach for modelling daily routines using data from pocket and wrist-mounted accelerometers.
- They collected a 7-day, single-subject dataset.
- The remaining 25% of the dataset is not modelled here (the unclassified, or ‘NULL’, case).
- Each observation feature vector is modelled using a mixture of two Gaussians.
- The competing models are successively applied to a 30 s sliding window.
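A minimal sketch of such competing-model classification over a sliding window, assuming scikit-learn-style GaussianMixture models with a `score_samples` API (an assumption for illustration; the original work uses its own model implementation):

```python
from sklearn.mixture import GaussianMixture  # assumed stand-in model

def classify_windows(X, models, win=30, step=30):
    """X: (n_frames, n_features) feature matrix at one frame per second;
    models: dict mapping class label -> fitted GaussianMixture."""
    labels = []
    for start in range(0, len(X) - win + 1, step):
        window = X[start:start + win]
        # each class model scores the window; the highest total
        # log-likelihood wins the window
        best = max(models,
                   key=lambda m: models[m].score_samples(window).sum())
        labels.append((start, best))
    return labels
```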
5.3 MIT PLCouple1 Dataset (D3)
- Logan et al. presented a study aimed at recognising common activities in an indoor setting using a large variety and number of ambient sensors.
- A single subject was tracked and annotated for 100 hours using the MIT PlaceLab [Intille et al. 2006].
- A wide range of activities are targeted, five of which the authors choose as a representative sample of the dataset, including watching TV, dishwashing, eating, and using a computer.
5.4 Application of Metrics to Datasets
- Table II shows how the results from the three datasets might be analysed using standard metrics.
- The opposite is also shown here: the ‘N’ charts for the D3 classes show that by far the most common frame errors within fpr are insertions (ir).
- High ir correlates with what might be expected given the low event precision for these classes.
- For reading (D1): almost 52% (14) of the events are merged together into 6 large merge outputs.
6.1 Highlighting the Benefits
- To illustrate the benefits of the proposed metrics the authors take a second, more detailed look at two examples from the data presented in 5.4.
- In both classes around 50% of the positive frames are correctly recognized.
- Thus, in both cases, an application designer may be inclined not to use the system, or to find a work-around that does not require the recognition of the particular classes.
- For both classes around half of non-recognized true positive frames are due to timing errors, not real deletions.
- This implies that the number of events that the system has returned is between 5 (for computer) and nearly 20 (for watching TV) times higher than the true number of events.
7. CONCLUSION
- The authors have shown that on results generated using published, non-trivial datasets, the proposed metrics reveal novel information about classifier performance, for both frame and event analysis.
- Because it is based on total durations, or number of frames, this method of reporting can be misleading when activity event durations are variable.
- AR researchers have largely avoided this method of evaluation, in part, because of the difficulty of scoring correct and incorrect activities.
- The introduction of a full characterisation of fragmented and merged events, and a revised definition of insertions and deletions, provides one possible solution to these difficulties.
Frequently Asked Questions (11)
Q2. What future work have the authors mentioned in the paper "Performance metrics for activity recognition"?
- The authors believe that it is better to show all results in the brightest (coldest) light, and then give explanations afterwards if need be. Again, a challenge for future work is how this information might be displayed in an informative way. In these cases, the events marked F may be aggregated with C and presented in an additional, application-specific metric. Although SET completely captures both segment and frame errors, it can be difficult to interpret.
Q3. What are the common metrics used to score frames?
Because of the one-to-one mapping between ground and output, scoring frames is trivial, with frames assigned to one of: true positive (TP), true negative (TN), false positive (FP) or false negative (FN).
Q4. What are the examples of substitution errors in multi-class AR?
- In addition to insertions and deletions, such multi-class problems can produce substitution errors, which are instances of one class being mistaken for another.
Q5. What are the new metrics for class skew invariance?
- To maintain class skew invariance, the new 2SET metrics introduced here are based around tpr and fpr: that is, FN errors are expressed as a ratio of the total positive frames, P; and the FP errors are expressed as a ratio of the total negative frames, N.
Q6. What are some of the common methods of comparing events?
Traditional event-based comparisons might be able to accommodate offsets using techniques such as dynamic time warping (DTW), or fuzzy event boundaries.
Q7. What is the main drawback of the segment-based method?
- The segment-based method presented by Ward et al. [2006a] is intrinsically multi-class: each pairing of ground truth and output segment is assigned to exactly one of six categories (insertion-deletion, insertion-underfill, insertion-fragmenting, overfill-deletion, overfill-underfill, and merge-deletion).
Q8. What is the way to handle inexact time matching of ground truth to output?
- The problem of how to handle inexact time matching of ground truth to output has been identified in a range of AR research, with a typical solution being to ignore any result within a set margin of the event boundaries [Bao and Intille 2004], or to employ some minimum coverage rule [Tapia et al. 2004; Westeyn et al. 2005; Fogarty et al. 2006].
Q9. What do the final 3 columns of Table I indicate?
The final 3 columns indicate, either through explicit mention in the paper, or through evidence in an included graph, whether artefacts such as timing errors, fragmenting or merge were encountered.
Q10. What are the two fundamental assumptions that the authors make?
- Here the authors make two fundamental assumptions: (1) ground truth and classifier prediction are available for each individual frame of the signal; and (2) events in the classifier output are detected with a time shift of at most the duration of the event.
Q11. What are the key observations of Figure 1?
Three key observations can be made of Figure 1: 1) some events in A are fragmented into several smaller chunks (f); 2) multiple events in B are recognised as a single merged output (m); and 3) outputs are often offset in time.