
Performance metrics for activity recognition

TL;DR: A comprehensive set of performance metrics and visualisations for continuous activity recognition (AR) and shows that where event- and frame-based precision and recall lead to an ambiguous interpretation of results in some cases, the proposed metrics provide a consistently unambiguous explanation.
Abstract: In this article, we introduce and evaluate a comprehensive set of performance metrics and visualisations for continuous activity recognition (AR). We demonstrate how standard evaluation methods, often borrowed from related pattern recognition problems, fail to capture common artefacts found in continuous AR—specifically event fragmentation, event merging and timing offsets. We support our assertion with an analysis on a set of recently published AR papers. Building on an earlier initial work on the topic, we develop a frame-based visualisation and corresponding set of class-skew invariant metrics for the one class versus all evaluation. These are complemented by a new complete set of event-based metrics that allow a quick graphical representation of system performance—showing events that are correct, inserted, deleted, fragmented, merged and those which are both fragmented and merged. We evaluate the utility of our approach through comparison with standard metrics on data from three different published experiments. This shows that where event- and frame-based precision and recall lead to an ambiguous interpretation of results in some cases, the proposed metrics provide a consistently unambiguous explanation.

Summary (5 min read)

1. INTRODUCTION

  • Human activity recognition (AR) is a fast growing research topic with many promising real-world applications.
  • As it matures so does the need for a comprehensive system of metrics that can be used to summarise and compare different AR systems.
  • A valid methodology for performance evaluation should fulfil two basic criteria: (1) it must be objective and unambiguous, and (2) it should not only grade, but also characterise performance.

2. PERFORMANCE EVALUATION

  • In its general form AR is a multi-class problem with c “interesting” classes plus a “NULL” class.
  • In addition to insertions and deletions, such multi-class problems can produce substitution errors, which are instances of one class being mistaken for another.
  • The quality of the similarity measure depends on the application domain and the underlying assumptions.
  • The authors assume that the time shift with which events are detected in the classifier output is at most within the range of the event, so that output events can be assigned to ground truth events based on their time overlap.
  • Another permissible variant is that several events in the output overlap with one event in the ground truth.

2.1 Existing Methods for Error Scoring

  • Performance metrics are usually calculated in three steps.
  • From the comparison a scoring is made on the matches and errors.
  • Two basic units of comparison are typically used: frames or events.
  • A frame is a fixed-length, fixed-rate unit of time; it is often the smallest unit of measure defined by the system (the sample rate) and in such cases approximates continuous time.
  • There is not necessarily a one-to-one relation between E and R. A comparison can instead be made using alternative means: for example DTW [Berndt and Clifford 1994], measuring the longest common subsequence [Agrawal et al. 1995], or a combination of different transformations [Perng et al. 2000].

2.2 Shortcomings of Conventional Performance Characterisation

  • Existing metrics often fall short of providing sufficient insight into the performance of an AR recognition system.
  • These plot a short section (300 s) of results described by Bulling et al. [2008] on the recognition of reading activities using body-worn sensors.
  • The authors also decide that several events detected by one output count only as a single true positive.
  • Together with a poorer event precision, this indicates a larger number of false insertions in A.
  • However they fail to explicitly account for fragmented or merged events.

2.3 Significance of the Problem

  • To assess the prevalence of fragmenting, merge and timing errors, the authors surveyed a selection of papers on continuous AR published between 2004 and 2010 at selected computing conferences and journals (e.g., Pervasive, Ubicomp, Wearable Computing, etc.).
  • Table I highlights the main metrics used by each work, and whether these were based on frame, event, or some combination of both evaluation methods.
  • The final 3 columns indicate, either through explicit mention in the paper, or through evidence in an included graph, whether artefacts such as timing errors, fragmenting or merge were encountered.
  • The simple frame-based accuracy metric was heavily used in earlier work (often accompanied by a full confusion matrix), but has since given way to the pairing of precision and recall.
  • In most, however, there is strong evidence of timing offsets being an issue.

3. EXTENDED METHODS USING ADDITIONAL ERROR CATEGORIES

  • Ward et al. 2006a introduced an extension to the standard frame scoring scheme that the authors adopt here for the single class problem.
  • First the authors introduce additional categories of events to capture information on fragmenting and merge behaviour.
  • The authors then show how these are scored in an objective and unambiguous way.

3.1 Additional Event Information

  • A ground truth event that is recognised by several returns in the output is said to be fragmented (F), and the corresponding returns are fragmenting returns (F′).
  • Conversely, when several ground truth events are covered by a single return, the authors say that these ground events are merged (M), and refer to the single return event as a merging return (M′).
  • A ground event can be both fragmented and merged.
  • In the example discussed, the ground truth event is clearly fragmented (into two returns).
  • But the second return in A also covers another event, thus merging the two.

3.2 Scoring Segments

  • An alternative scoring strategy, introduced by Ward et al. [2006a], provides a mid-way solution: it keeps the unambiguous one-to-one mapping of frame scoring while retaining useful information from event scoring.
  • This hybrid scheme is based on the notion of segments.
  • A segment is the largest part of an event on which the comparison between the ground truth and the output of the recognition system can be made in an unambiguous way.
  • For a binary problem, positive (p) versus negative (n), there are four possible outcomes to be scored: TPs, TNs, FPs and FNs.
  • Two of the FP segment categories: an insertion segment (Is) is an FP segment that corresponds exactly to an inserted return, I; a merge segment (Ms) is an FP segment that occurs between two TP segments within a merging return (i.e., the part that joins two events). (A minimal sketch of segment derivation follows below.)
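To make the segment idea concrete, the following is a minimal Python sketch (our own illustration, not the authors' code) that splits a pair of boolean frame sequences into segments and labels each segment TP, TN, FP or FN. The FN segment that splits the ground truth event in the example is exactly the kind of fragmenting segment described above; all names are illustrative.

```python
# Minimal sketch of segment derivation (illustrative, not the authors' code).
# Ground truth and output are boolean per-frame sequences of equal length.

def segments(ground, output):
    """Split the timeline into segments: maximal runs over which both the
    ground truth and the output are constant, so each segment is
    unambiguously TP, TN, FP or FN."""
    assert len(ground) == len(output)
    segs = []
    start = 0
    for i in range(1, len(ground) + 1):
        boundary = (i == len(ground) or
                    ground[i] != ground[i - 1] or
                    output[i] != output[i - 1])
        if boundary:
            g, o = ground[start], output[start]
            label = {(True, True): "TP", (False, False): "TN",
                     (False, True): "FP", (True, False): "FN"}[(g, o)]
            segs.append((start, i, label))   # [start, end) frame range
            start = i
    return segs

# Example: one ground truth event fragmented into two returns; the FN
# segment between the two TP segments is a fragmenting segment.
gt  = [False, True, True, True, True, True, False]
out = [False, True, True, False, True, True, False]
print(segments(gt, out))
# [(0, 1, 'TN'), (1, 3, 'TP'), (3, 4, 'FN'), (4, 6, 'TP'), (6, 7, 'TN')]
```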

3.3 Scoring Frames

  • Once the authors have assigned error categories to segments, it is a simple matter to transfer those assignments to the frames that constitute each segment.
  • The authors use these numbers in their subsequent frame analysis.

3.4 Deriving Event Scores Using Segments

  • Figure 2(b) shows an example of how event scores can be unambiguously assigned using information provided by the corresponding segment scores.
  • Note that a key difference between the frame (and segment) error scores and the event scores is that the former analysis focuses on characterising and reporting frame errors (FP and FN), whereas here the authors report on counts of matched events.
  • Defining a correct event simply as one that is detected by at least one output is a troublesome definition because it completely ignores the possibility of fragmentation.
  • The authors argue that it is better to count as correct only those events that do not fall into any of the other event categories.
  • A correct event as used here is one that is matched with exactly one return event. (A rough overlap-based sketch of this scoring follows below.)
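As a rough illustration of these event categories, the sketch below assigns each ground truth event to D, F, M, FM or C directly from interval overlaps. Note that the paper derives event scores from segment scores; the overlap-based shortcut used here is an assumption made for brevity, and all names are hypothetical.

```python
# Rough sketch of event scoring by temporal overlap (illustrative only;
# the paper derives these categories from segment scores instead).
# Events are (start, end) intervals with start < end.

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def score_events(ground_events, return_events):
    """Assign each ground truth event to D, F, M, FM or C, and collect
    returns that overlap no ground event as insertions."""
    scores = {}
    for i, e in enumerate(ground_events):
        rs = [r for r in return_events if overlaps(e, r)]
        if not rs:
            scores[i] = "D"                  # deletion: no return overlaps
            continue
        fragmented = len(rs) > 1             # several returns cover one event
        merged = any(
            any(overlaps(r, e2) for j, e2 in enumerate(ground_events) if j != i)
            for r in rs
        )                                    # a covering return spans another event
        if fragmented and merged:
            scores[i] = "FM"
        elif fragmented:
            scores[i] = "F"
        elif merged:
            scores[i] = "M"
        else:
            scores[i] = "C"                  # exactly one return, matching only this event
    insertions = [r for r in return_events
                  if not any(overlaps(e, r) for e in ground_events)]
    return scores, insertions

gt = [(0, 10), (12, 20), (25, 30)]
out = [(1, 5), (6, 9), (11, 26), (40, 45)]
print(score_events(gt, out))
# ({0: 'F', 1: 'M', 2: 'M'}, [(40, 45)])
```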

3.5 Limits of Time Shift Tolerance

  • A key concern behind their work is to distinguish between errors that are caused by small shifts in the recognition timing (which may be irrelevant for many applications) and the more “serious” errors of misclassified instances.
  • This tolerance to small shifts may seem surprising given that their evaluation works on sequential segment comparison.
  • So long as the recognized event has an overlap with the ground truth there will be a segment that is identified as correct, and adjoining segments will be labelled as timing errors (or fragmentation/merge when relevant).
  • Clearly, in cases that involve very short (relative to the time scale of the sensor and recognition system), widely spaced events, this would be a problem.
  • Moreover, many applications look at complex, longer-term activities that can take many seconds or even minutes.

4.1 Frame Metrics

  • Accuracy ((TP+TN)/(P+N)) is the most commonly used metric that can be calculated from a confusion matrix.
  • One drawback of precision is that it is heavily affected by changes in the proportions of classes in the dataset (class skew) [Fawcett 2004].
  • For this reason the authors prefer the skew-invariant fpr metric paired alongside tpr (the small numeric sketch below illustrates the effect of class skew).
  • This is sometimes summarised in a single area-under-curve (AUC) metric [Ling et al. 2003].
  • The 2-class segment error table (2SET) is shown in Figure 4(a).
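The small numeric check below is our own example, not taken from the paper; it shows why the authors prefer tpr and fpr: increasing the amount of NULL data changes precision even when the classifier's per-class behaviour is unchanged, while tpr and fpr stay fixed.

```python
# Small numeric illustration (not from the paper) of why tpr/fpr are
# class-skew invariant while precision is not. All quantities are frame counts.

def rates(tp, fn, fp, tn):
    tpr = tp / (tp + fn)          # recall on the positive class
    fpr = fp / (fp + tn)          # errors as a fraction of negative frames
    pr  = tp / (tp + fp)          # precision
    return tpr, fpr, pr

# A classifier that recognises 80% of positive frames and mislabels 5% of
# negative frames, evaluated first on 1000 negative frames, then on 10000.
base   = rates(tp=800, fn=200, fp=50,  tn=950)
skewed = rates(tp=800, fn=200, fp=500, tn=9500)

print(base)    # (0.8, 0.05, ~0.94)
print(skewed)  # (0.8, 0.05, ~0.62)  <- precision drops purely from class skew
```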

4.2 Event Metrics

  • From the categories laid out in Section 3.4 there are eight different types of event error scores.
  • Four of these can be applied to ground truth events: deletions (D), fragmented (F), fragmented and merged (FM) and merged (M).
  • Together with correct events (C), these scores can be visualised in a single figure (see Figure 5), which the authors term the event analysis diagram (EAD).
  • The counts C+D+F+FM+M account for every ground truth event; likewise, C+M′+FM′+F′+I completely contains all of the returned events in a system output.
  • The EAD trivially shows exact counts of the event categories (see the counting sketch below).
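A tiny sketch of tabulating EAD-style counts from per-event labels follows; the labels are invented purely to show the bookkeeping, and the two identities noted above hold by construction.

```python
# Illustrative tabulation of event analysis diagram (EAD) style counts.
# The labels are made up; in practice they come from event scoring.
from collections import Counter

ground_scores = ["C", "C", "D", "F", "M", "M", "FM"]        # one per ground event
return_scores = ["C", "C", "F'", "F'", "M'", "FM'", "I'"]   # one per returned event

ead_ground  = Counter(ground_scores)   # C + D + F + FM + M covers all ground events
ead_returns = Counter(return_scores)   # C + M' + FM' + F' + I' covers all returns

print(ead_ground)    # Counter({'C': 2, 'M': 2, 'D': 1, 'F': 1, 'FM': 1})
print(ead_returns)
assert sum(ead_ground.values()) == len(ground_scores)
assert sum(ead_returns.values()) == len(return_scores)
```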

4.3 Application to Reading Example

  • The frame results for the two examples, A and B , are shown in pie chart format in Figure 6(b).
  • At first glance, these figures reveal the most striking differences between the two examples: the existence of insertion (ir) and fragmenting (fr) errors in A, where none are seen in B.
  • This influence of inexact timing is not apparent when the standard metrics in Figure 1 are used.
  • The charts are useful indicators of how much of the false negative and false positive frames are given over to specific types of error.
  • This is where an event analysis is useful.

5. DATASETS

  • To assess the utility of the proposed method the authors use results calculated from three publicly available datasets: D1, from Bulling et al.; D2, the Darmstadt Daily Routines dataset from Huynh et al.; and D3, the MIT PLCouple1 dataset from Logan et al.
  • Following the original papers, each set is evaluated using a different classifier: D1 using string matching; D2 using HMMs; and D3 using a decision tree.
  • The aim of this diverse selection is to show that the method can be applied to a range of different datasets and using different classifiers.
  • The authors do not intend to compare these results with one another (nor with the original results as published).
  • Rather the authors wish to show how results compare when presented using traditional metrics against those presented using their proposed metrics.

5.1 EOG Reading Dataset (D1)

  • The example in Figure 1 was taken from a study by Bulling et al. on recognising reading activity from patterns of horizontal electrooculogram-based (EOG) eye movements.
  • Six hours of data was collected from eight participants (D1 can be downloaded at: http://www.andreas-bulling.de/publications/conferences/).
  • The activities in this dataset are very fine-grained.
  • Following the method described in the original paper, the authors use string matching on discretised sequences of horizontal eye movements.
  • A threshold is applied to the output distance vector to determine ’reading’ or not.

5.2 Darmstadt Daily Routines Dataset (D2)

  • Huynh et al. introduced a novel approach for modelling daily routines using data from pocket and wrist-mounted accelerometers.
  • They collected a 7 day, single-subject dataset.
  • A remaining 25% of the dataset is not modelled here (the unclassified case, or ‘NULL’).
  • Each observation feature vector is modelled using a mixture of two Gaussians.
  • The competing models are successively applied to a 30 s sliding window (a sketch of this kind of pipeline follows below).
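A sketch of this kind of pipeline is shown below: one two-component Gaussian mixture per class, scored over a sliding window, with the best-scoring model winning. This is our own illustration, not the original implementation; the class names, window length in samples and feature dimensionality are placeholders.

```python
# Sketch of per-class Gaussian-mixture scoring over a sliding window
# (an illustration of the pipeline described above, not the authors' code).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(features_by_class, seed=0):
    """Fit one 2-component GMM per class on that class's feature vectors."""
    return {c: GaussianMixture(n_components=2, random_state=seed).fit(X)
            for c, X in features_by_class.items()}

def classify_windows(models, features, window=30, step=1):
    """Slide a `window`-sample window over the feature sequence and pick the
    class whose model gives the highest mean log-likelihood."""
    labels = []
    for start in range(0, len(features) - window + 1, step):
        chunk = features[start:start + window]
        scores = {c: m.score(chunk) for c, m in models.items()}  # mean log-lik
        labels.append(max(scores, key=scores.get))
    return labels

# Toy usage with random 2-D features and hypothetical class names,
# purely to show the call pattern.
rng = np.random.default_rng(0)
train = {"commuting": rng.normal(0, 1, (200, 2)),
         "office":    rng.normal(3, 1, (200, 2))}
models = train_models(train)
stream = rng.normal(3, 1, (120, 2))           # 120 samples of unseen data
print(classify_windows(models, stream)[:5])    # mostly 'office'
```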

5.3 MIT PLCouple1 Dataset (D3)

  • Logan et al. presented a study aimed at recognising common activities in an indoor setting using a large variety and number of ambient sensors.
  • A single subject was tracked and annotated for 100 hours using the MIT PlaceLab [Intille et al. 2006].
  • A wide range of activities are targeted, five of which the authors choose as a representative sample of the dataset, including watching TV, dishwashing, eating and using a computer.

5.4 Application of Metrics to Datasets

  • Table II shows how the results from the three datasets might be analysed using standard metrics.
  • The opposite is also shown here: the ‘N’ charts for the D3 classes show that by far the most common frame errors within fpr are insertions (ir).
  • High ir correlates with what might be expected given the low event precision for these classes.
  • Reading, D1: almost 52% (14) of the ground truth reading events are merged together into 6 large merge outputs.

6.1 Highlighting the Benefits

  • To illustrate the benefits of the proposed metrics the authors take a second, more detailed look at two examples from the data presented in 5.4.
  • In both classes around 50% of the positive frames are correctly recognized.
  • Thus, in both cases, an application designer may be inclined not to use the system, or to find a work-around that does not require the recognition of the particular classes.
  • For both classes around half of non-recognized true positive frames are due to timing errors, not real deletions.
  • This implies that the number of events that the system has returned is between 5 (for computer) and nearly 20 (for watching TV) times higher than the true number of events.

7. CONCLUSION

  • The authors have shown that on results generated using published, non-trivial datasets, the proposed metrics reveal novel information about classifier performance, for both frame and event analysis.
  • Because it is based on total durations, or number of frames, this method of reporting can be misleading when activity event durations are variable.
  • AR researchers have largely avoided this method of evaluation, in part, because of the difficulty of scoring correct and incorrect activities.
  • The introduction of a full characterisation of fragmented and merged events, and a revised definition of insertions and deletions, provides one possible solution to these difficulties.


Performance metrics for activity recognition
JAMIE A. WARD
Lancaster University
PAUL LUKOWICZ
University of Passau
and
HANS W. GELLERSEN
Lancaster University
In this article we introduce and evaluate a comprehensive set of performance metrics and visualisations for continuous activity recognition (AR). We demonstrate how standard evaluation methods, often borrowed from related pattern recognition problems, fail to capture common artefacts found in continuous AR, specifically event fragmentation, event merging and timing offsets. We support our assertion with an analysis on a set of recently published AR papers. Building on an earlier initial work on the topic, we develop a frame-based visualisation and corresponding set of class-skew invariant metrics for the one class versus all evaluation. These are complemented by a new complete set of event-based metrics that allow a quick graphical representation of system performance, showing events that are correct, inserted, deleted, fragmented, merged and those which are both fragmented and merged. We evaluate the utility of our approach through comparison with standard metrics on data from three different published experiments. This shows that where event- and frame-based precision and recall lead to an ambiguous interpretation of results in some cases, the proposed metrics provide a consistently unambiguous explanation.
Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Design Methodology
General Terms: Performance, Standardization
Additional Key Words and Phrases: Activity recognition, metrics, performance evaluation
1. INTRODUCTION
Human activity recognition (AR) is a fast growing research topic with many promising
real-world applications. As it matures so does the need for a comprehensive
system of metrics that can be used to summarise and compare different AR systems.
A valid methodology for performance evaluation should fulfil two basic criteria:
(1) It must be objective and unambiguous. The outcome of an evaluation should
not depend on any arbitrary assumptions or parameters.
(2) It should not only grade, but also characterise performance. When comparing
systems the method should give more than a binary decision, such as “A is
better than B”. Instead it should quantify the strengths and weaknesses of
each and give the system designer hints as to how improvements can be made.
Ward et al. [2006a] demonstrated that the standard evaluation metrics currently
used in AR do not adequately characterise performance. Information about typical
characteristics of activity events is routinely ignored in favour of making recognition
results fit standard metrics such as event or frame accuracy. For example,
existing metrics do not reveal whether an activity has been fragmented into several
smaller activities, whether several activities have been merged into a single large
activity; or whether there are timing offsets in the recognition of an activity. This
can lead to a presentation of results that can be confusing, and even misleading.
As we will show in this article, this is not just a theoretical problem but an issue
routinely encountered in real applications.
The problem of how to handle inexact time matching of ground truth to output
has been identified in a range of AR research, with a typical solution being to ignore
any result within a set margin of the event boundaries [Bao and Intille 2004], or
to employ some minimum coverage rule [Tapia et al. 2004; Westeyn et al. 2005;
Fogarty et al. 2006]. The problem of fragmented output has been noted in a
handful of publications, with solutions ranging from treating fragments as correct
events [Fogarty et al. 2006], to incorporating them in an equal way to insertion and
deletion errors (e.g., ‘reverse splicing’ [Patterson et al. 2005]). Evidence of merging
was hinted at by Lester et al. [2005], and is discussed as an ‘episode spanning two
activities,’ by Buettner et al. [2009].
In a first attempt at characterising AR performance, Ward et al. [2006a] introduced
an unambiguous method for calculating insertions and deletions alongside
four new types of error: fragmentation, merge and the timing offset errors of
overfill and underfill. Corresponding frame-by-frame metrics derived from all of these
categories were also proposed alongside a convenient visualisation of the information.
Although used in a handful of subsequent publications [Bulling et al. 2008;
Minnen et al. 2007; Stiefmeier et al. 2006; Ward et al. 2006b], the original metrics
suffer from a number of shortcomings:
(1) visualisation of frame errors using the error division diagram (EDD), which
plots insertion, deletion, fragmenting, merge, correct and timing errors as a
percentage of the total experiment time, is influenced by changes in the proportion
of different classes, or class skew. This makes comparability between
datasets difficult.
(2) event errors were not represented in a metric format suitable for comparison.
Instead absolute counts of insertions, deletions, etc., were shown.
This article extends the previous work in four ways, specifically we: 1) introduce a
system of frame-by-frame metrics which are invariant to class skew and 2) introduce
a new system of metrics for recording and visualising event performance. We then
3) apply the metrics to three previously published data sets and 4) show how these
offer an improvement over traditional metrics. The contributed methods are based
on sequential, segment-wise comparison, but it is worth noting that they also have a
significant amount of tolerance against small time shifts in the recognition. Unlike
in other approaches (e.g., dynamic time warping, DTW [Berndt and Clifford 1994]),
the time shift is not masked (or hidden in an abstract number such as matching
costs), but explicitly described in the form of underfill and overfill errors.
The article is organised as follows. We first lay the groundwork for our contribution
with an analysis of the AR performance evaluation problem, including a survey
of selected publications from the past six years of AR research. This is followed by
the introduction of AR event categories that extend Ward et al. [2006a]’s scoring
system (Section 3). We then introduce a new system of frame and event metrics
and show how they are applied (Section 4). The metrics are then evaluated by application
to results from three previously published datasets (Section 5), followed
by a concluding analysis of their benefits and limitations (Section 6).
2. PERFORMANCE EVALUATION
In its general form AR is a multi-class problem with c “interesting” classes plus a
“NULL” class. The latter includes all parts of the signal where no relevant activity
has taken place. In addition to insertions and deletions, such multi-class problems
can produce substitution errors, which are instances of one class being mistaken for
another. Note that insertions and deletions are a special case of a substitution with
one of the classes involved being the NULL class.
In this paper, we approach performance evaluation of multi-class AR by considering
a class at a time. In doing so, the root problem we address is the characterisation
and summary of performance in a single, time-wise continuous, binary classification.
That is, the output of the classifier at any one time is either positive, p or nega-
tive, n. Evaluation can then be viewed as a comparison of two discrete time-series
(recognition output versus ground truth). We know that there is no objectively
‘best’ similarity measure for time series comparison. The quality of the similarity
measure depends on the application domain and the underlying assumptions. Here
we make two fundamental assumptions:
(1) Ground truth and classifier prediction are available for each individual frame
of the signal.
(2) The time shift in which events are detected in the classifier output is at most
within the range of the event. This means that events in the recognition output
can be assigned to events in the ground truth based on their time overlap. For
example, assume that we have two events, e_1 and e_2, in the ground truth. If output r_x has temporal overlap with e_1 then we assume that it is a prediction for e_1 (similarly for e_2). If it has no temporal overlap with either of the two then we assume it to be an insertion.[1] This allows us to do error scoring without having to worry about permutations of assignments of events from the ground truth to the classifier prediction. From our study of published work we have found this assumption to be plausible for most applications.

[1] Note that it is permissible for r_x to overlap with both part of e_1 and part of e_2 (and possibly more events). Another permissible variant is that several events in the output overlap with one event in the ground truth.
Fig. 1. Recognition results from a 300 s extract of the reading experiment reported by Bulling
et al. 2008. A sequence of 11 ground truth events (gt) are shown alongside outputs for unsmoothed
(A) and smoothed (B) recognition. Five event errors are highlighted: i) insertion, d) deletion, f)
fragmentation, m) merge, and fm) fragmented and merged. For each sequence the table shows
the % event recall (Rec.) and % event precision (Pre.), as well as the % frame-based true positive
and false positive rates (tpr, fpr) and precision (pr):

          events           frames
        Rec.   Pre.    tpr   fpr   pr
    A    63     44      77    18   80
    B    55    100      86     9   90
2.1 Existing Methods for Error Scoring
Performance metrics are usually calculated in three steps. First a comparison is
made between the returned system output and what is known to have occurred
(or an approximation of what occurred). From the comparison a scoring is made
on the matches and errors. Finally these scores are summarised by one or more
metrics, usually expressed as a normalised rate or percentage.
Two basic units of comparison are typically used: frames or events.
Scoring Frames. A frame is a fixed-length, fixed-rate unit of time. It is often the
smallest unit of measure defined by the system (the sample rate) and in such cases
approximates continuous time. Because of the one-to-one mapping between ground
and output, scoring frames is trivial, with frames assigned to one of: true positive
(TP), true negative (TN), false positive (FP) or false negative (FN).
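As a minimal sketch of this frame scoring, assuming ground truth and output are available as boolean sequences of equal length (an illustration of ours, not code from the paper):

```python
# Minimal frame-scoring sketch: per-frame comparison of ground truth and
# output (both boolean/0-1 sequences), as described above.
from collections import Counter

def score_frames(ground, output):
    counts = Counter()
    for g, o in zip(ground, output):
        if g and o:            counts["TP"] += 1
        elif not g and not o:  counts["TN"] += 1
        elif o:                counts["FP"] += 1
        else:                  counts["FN"] += 1
    return counts

gt  = [0, 1, 1, 1, 0, 0, 1, 1]
out = [0, 1, 0, 1, 1, 0, 1, 0]
print(score_frames(gt, out))   # Counter({'TP': 3, 'TN': 2, 'FN': 2, 'FP': 1})
```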
Scoring Events. We define an event as a variable duration sequence of positive
frames within a continuous time-series. It has a start time and a stop time. Given a
test sequence of g known events, E = {e_1, e_2, ..., e_g}, a recognition outputs h return
events, R = {r_1, r_2, ..., r_h}. There is not necessarily a one-to-one relation between
E and R. A comparison can instead be made using alternative means: for example
DTW [Berndt and Clifford 1994], measuring the longest common subsequence
[Agrawal et al. 1995], or a combination of different transformations [Perng et al.
An event can then be scored as either correctly detected (C); falsely inserted (I′),
where there is no corresponding event in the ground truth; or deleted (D), where
there is a failure to detect an event.
Commonly recommended frame based metrics include: true positive rate
(tpr = TP/(TP+FN)), false positive rate (fpr = FP/(TN+FP)) and precision
(pr = TP/(TP+FP)); or some combination of these (see 4.1.1). Similarly, event
scores can be summarized by precision (correct/output returns), recall
(correct/total), or simply a count of I′ and D.
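A minimal sketch of these frame and event summaries, computed from raw counts, is given below. The counts are invented so that the resulting rates land close to the values quoted for sequence A in Figure 1; the true counts are not given in the text.

```python
# Minimal sketch: the frame and event metrics listed above, computed from
# raw counts (names follow the text; this is illustrative, not library code).

def frame_metrics(tp, fn, fp, tn):
    return {
        "tpr": tp / (tp + fn),          # true positive rate
        "fpr": fp / (tn + fp),          # false positive rate
        "pr":  tp / (tp + fp),          # precision
    }

def event_metrics(correct, returned, total_ground):
    return {
        "precision": correct / returned,       # correct / output returns
        "recall":    correct / total_ground,   # correct / total ground events
    }

# Invented counts chosen to roughly reproduce the rates quoted for
# sequence A (tpr 77, fpr 18, pr 80; Rec. 63, Pre. 44).
print(frame_metrics(tp=770, fn=230, fp=180, tn=820))
print(event_metrics(correct=7, returned=16, total_ground=11))
```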
2.2 Shortcomings of Conventional Performance Characterisation
Existing metrics often fall short of providing sufficient insight into the performance
of an AR recognition system. We illustrate this using the examples in Figure 1.
These plot a short section (300 s) of results described by Bulling et al. [2008] on the
recognition of reading activities using body-worn sensors. Plot A shows a classifier
output with classes ‘reading’ versus ‘not reading’; plot B shows the same output
but smoothed by a 30s sliding window; and gt shows the annotated ground truth.
For both A and B, traditional frame metrics (tpr, fpr, pr) are calculated, as are
event-based precision and recall (Pre., Rec.). For the event analysis, a decision
needs to be made as to what constitutes a ‘correct’ event. Here we define a true
positive event as one that is detected by at least one output. We also decide that
several events detected by a one output count only as a single true positive.
The frame results show that the fpr of A is almost 10% higher than that of
B. Together with a poorer event precision, this indicates a larger number of false
insertions in A. A’s frame tpr is almost 10% lower than B’s. This might suggest more
deletions, and thus a lower recall, but in fact its recall is 8% higher. Why? The
answer is not clear from the metrics alone so we have to look at the plots. This
instantly shows that A is more fragmented than B: many short false negatives
break up some of the larger events. This has the effect of reducing the true positive
frame count, while leaving the event count (based on the above assumption of
‘detected at least once’) unaffected.
Three key observations can be made of Figure 1: 1) some events in A are fragmented
into several smaller chunks (f); 2) multiple events in B are recognised as a
single merged output (m); and 3) outputs are often offset in time. These anomalies
represent typical fragmenting, merge and time errors, none of which are captured by
conventional metrics. Frame error scores of false positive or false negative simply do
not distinguish between frames that belong to a ‘serious’ error, such as insertion or
deletion, and those that are timing offsets of otherwise correct events. Traditional
event-based comparisons might be able to accommodate offsets using techniques
such as dynamic time warping (DTW), or fuzzy event boundaries. However they
fail to explicitly account for fragmented or merged events.
2.3 Significance of the Problem
To assess the prevalence of fragmenting, merge and timing errors, we surveyed a
selection of papers on continuous AR published between 2004 and 2010 at selected
computing conferences and journals (e.g., Pervasive, Ubicomp, Wearable Computing,
etc.). Table I highlights the main metrics used by each work, and whether these
were based on frame, event, or some combination of both evaluation methods. The
final 3 columns indicate, either through explicit mention in the paper, or through
evidence in an included graph, whether artefacts such as timing errors, fragmenting
or merge were encountered.
The simple frame-based accuracy metric was heavily used in earlier work (often
accompanied by a full confusion matrix), but has since given way to the pairing
of precision and recall. Event analysis has been applied by several researchers,
however there is no clear consensus on the definition of a ‘correct’ event, nor on the
metrics that should be used. In most, however, there is strong evidence of timing
offsets being an issue. Several highlight fragmenting and merge (though only those
using EDD acknowledge these as specific error categories).
3. EXTENDED METHODS USING ADDITIONAL ERROR CATEGORIES
Ward et al. [2006a] introduced an extension to the standard frame scoring scheme
that we adopt here for the single class problem. First we introduce additional
categories of events to capture information on fragmenting and merge behaviour.
We then show how these are scored in an objective and unambiguous way.

References

Berndt, D. J. and Clifford, J. 1994. Using dynamic time warping to find patterns in time series. In AAAI-94 Workshop on Knowledge Discovery in Databases (KDD Workshop).

Bao, L. and Intille, S. S. 2004. Activity recognition from user-annotated acceleration data. In Proceedings of Pervasive 2004.

Fawcett, T. 2004. ROC graphs: Notes and practical considerations for researchers.

Tapia, E. M., Intille, S. S., and Larson, K. 2004. Activity recognition in the home using simple and ubiquitous sensors. In Proceedings of Pervasive 2004.

Provost, F., Fawcett, T., and Kohavi, R. 1998. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the 15th International Conference on Machine Learning (ICML).

