Recognizing Workshop Activity Using Body Worn Microphones and Accelerometers

Abstract
The paper presents a technique to automatically track the progress of maintenance or assembly tasks using body worn sensors. The technique is based on a novel way of combining data from accelerometers with simple frequency matching sound classification. This includes the intensity analysis of signals from microphones at different body locations to correlate environmental sounds with user activity. To evaluate our method we apply it to activities in a wood shop. On a simulated assembly task our system can successfully segment and identify most shop activities in a continuous data stream with zero false positives and 84.4% accuracy.


Recognizing Workshop Activity Using Body Worn Microphones and
Accelerometers
P. Lukowicz, J. Ward, H. Junker, M. Stäger, G. Tröster
ETH - Swiss Federal Institute of Technology
Wearable Computing Laboratory
8092 Zürich, Switzerland
www.wearable.ethz.ch
A. Atrash and T. Starner
College of Computing
Georgia Institute of Technology
Atlanta, Georgia 30332-0280
{amin,thad}@cc.gatech.edu
Abstract
Most gesture recognition systems analyze gestures intended
for communication (e.g. sign language) or for command
(e.g. navigation in a virtual world). We attempt instead to
recognize gestures made in the course of performing every-
day work activities. Specifically, we examine activities in
a wood shop, both in isolation as well as in the context of
a simulated assembly task. We apply linear discriminant
analysis (LDA) and hidden Markov model (HMM) tech-
niques to features derived from body-worn accelerometers
and microphones. The resulting system can successfully
segment and identify most shop activities with zero false
positives and 83.5% accuracy.
1 Introduction
Advances in technology are allowing computer support for
mobile applications. Delivery, maintenance, and manufac-
turing personnel are adopting mobile computing devices to
support their work. Similarly, consumers now have access
to mobile electronic tourist guides, communication devices,
and health and wellness monitoring devices.
A key issue in most such mobile applications is the effort required to operate the devices. Whereas
in a desktop setting the computer is the focus of the user’s
attention, the user is forced to focus his attention on the
environment for many mobile applications. Accessing the
computer should require minimal cognitive and physical ef-
fort to prevent distracting the user from his primary task.
1.1 Context Sensitivity in Wearable Systems
In addressing the above issues, wearable computers have recently emerged as a promising new paradigm. To reduce the physical effort required to operate the device, they are designed to be a permanently accessible part of the user's outfit, with mostly hands-free input devices and head-up displays.
With respect to the cognitive load, many wearable sys-
tems focus on context sensitivity and proactiveness (e.g. [1]).
The system should be aware of the user's actions and the activities occurring in his environment. Based on this aware-
ness, the system can adapt its configuration, deliver infor-
mation to the user, or record interesting events without any
explicit user input [17]. For example, a maintenance sup-
port system could recognize what particular task is being
performed by the user and automatically display the rele-
vant manual pages on the system’s head-up display. The
wearable could also record the sequence of operations that
are being performed for later analysis or could warn the user
if an important step has been forgotten.
1.2 Recognition Approach
Past approaches by the authors have used head-mounted
cameras and computer vision techniques to identify user
context [17]. Although the visual signal contains much rel-
evant information about any given situation, vision–based
recognition has several disadvantages. For one, reliable lo-
calization and recognition of the relevant objects (hands,
machine parts, tools) in complex scenes is an open research
problem. In addition, computer vision techniques have difficulty with the unstructured, moving backgrounds and varying lighting conditions common to many wearable scenarios, and relevant parts of the scene might be out of view or obstructed. Finally, video recognition is computationally
intensive, often requiring resources not available on a wear-
able system.
A recognition approach gaining popularity in the wearable community is the use of simple sensors integrated into the user's outfit and into the user's artifacts (e.g. tools, appliances, or parts of the machinery) [10]. One of the key aspects of
this approach is the recognition and tracking of postures and
gestures using motion sensors attached to appropriate loca-
tions on the user’s limbs. Initial experiments have shown
that many activities can be well identified through such
analysis [8]. Another important source of information about

environmental activity is sound. It has been shown that in
many situations ambient sound analysis can be used to dis-
tinguish between different settings, activities, and situations
[4].
1.3 Paper Aims and Contributions
This paper is part of our work aiming to develop a reli-
able context recognition methodology based on the above
approach. It presents a novel way of combining motion
sensor-based gesture recognition with sound data from dis-
tributed microphones. In particular we exploit intensity dif-
ferences between microphones on the wrist of the dominant
hand and on the chest to identify relevant actions performed
by the user’s hand.
In the paper we focus on tracking user activity during as-
sembly or maintenance tasks. Such tasks are among the
most important applications of wearable computing (e.g.
[2, 7]) and could significantly benefit from context sensi-
tivity. At the same time these tasks are well structured and
limited to a reasonable number of often repetitive actions.
In addition, machines and tools typical to a workshop envi-
ronment generate distinct sounds. Therefore, these activi-
ties are well suited for a combination of gesture and sound–
based recognition.
This paper describes our approach and the results pro-
duced in an experiment performed on an assembly task in
a wood workshop. We demonstrate that simple sensors
placed on the user’s body can reliably select and recognize
user actions during a workshop procedure.
1.4 Related Work
Acceleration-based activity recognition has been studied by different research groups [11, 14, 19]. However, all of the above work focused on recognizing comparatively simple activities (walking, running, and sitting). Sound-based situation analysis has been investigated by Peltonen et al. and, in the wearables domain, by Clarkson and Pentland [12, 5].
Intelligent hearing aids have also exploited sound analysis
to improve their performance [3].
2 Experimental Setup
Performing initial experiments on live assembly or maintenance tasks is inadvisable due to cost and safety concerns and the difficulty of obtaining repeatable measurements under experimental conditions. As a consequence, we decided to focus on an "artificial" task performed at the workbench of our lab's wood workshop (see Figure 1). The task
consisted of assembling a simple object made of two pieces
of wood and a piece of metal. The task required 8 processing steps using different tools, and included walking and other gestures similar to an assembly task in a real-world setting.

Figure 1: Left: the wood workshop with 1) grinder, 2) drill, 3) file and saw, 4) vice, and 5) cabinet with drawers. Right: the sensor type and placement is identical to that used in our experiment: 1, 4: microphones; 2, 3, and 5: 3-axis acceleration sensors.
2.1 Procedure
No. Action
1 take the wood out of the drawer
2 put the wood into the vice
3 take out the saw
4 saw
5 put the saw into the drawer
6 take the wood out of the vice
7 drill
8 get the nail and the hammer
9 hammer
10 put away the hammer, get the driver and the screw
11 drive the screw in
12 put away the driver
13 pick up the metal
14 grind
15 put away the metal, pick up the wood
16 put the wood into the vice
17 take the file out of the drawer
18 file
19 put away the file, take the sandpaper
20 sand
21 take the wood out of the vice
Table 1: Steps of workshop assembly task.
The assembly sequence consists of sawing a piece of
wood, drilling a hole in it, grinding a piece of metal, at-
taching it to the piece of wood with a screw, hammering in
a nail to connect the two pieces of wood, and then finish-
ing the product by smoothing away rough edges with a file

and a piece of sandpaper. The wood was fixed in the vice
for sawing, filing, and smoothing (and removed whenever
necessary). The test subject moved between areas in the
workshop between steps. Also, whenever a tool or an ob-
ject (nail, screw, wood) was required, it was retrieved from
its drawer in the cabinet and returned after use.
The exact sequence of actions is listed in Table 1. The
task was to recognize all tool-based activities. Tool-based
activities exclude drawer manipulation, user locomotion,
and clapping (a calibration gesture). The experiment was
repeated 10 times in the same sequence to collect data for
training and testing. For practical reasons, the individual
processing steps were only executed long enough to obtain
an adequate sample of the activity. This policy did not re-
quire the complete execution of any one task (e.g. the wood
was not completely sawn), allowing us to complete the ex-
periment in a reasonable amount of time. However this pro-
tocol influenced only the duration of each activity and not
the manner in which it was performed.
2.2 Data Collection System
The data was collected using the ETH PadNET sensor network [8], equipped with 3-axis accelerometer nodes and two Sony mono microphones connected to a body-worn computer. The position of the sensors on the body is shown in Figure 1: an accelerometer node on each wrist and on the right upper arm, and a microphone on the chest and on the right wrist (the test subject was right-handed).
As can be seen in Figure 1, each PadNET sensor node consists of two modules. The main module incorporates an MSP430F149 low-power 16-bit mixed-signal microprocessor (MPU) from Texas Instruments running at a 6 MHz maximum clock speed. The current module version reads out up to three analog sensor signals, including amplification and filtering, and handles the communication between modules through dedicated I/O pins. The sensors themselves are hosted on an even smaller sensor module that can be either placed directly on the main module or connected through wires. In the experiment described in this
paper, the sensor modules were based on a 3-axis accelerometer package consisting of two ADXL202E devices from Analog Devices. The analog signals from the sensors were lowpass filtered and digitized with 12-bit resolution at a sampling rate of 100 Hz.
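As a rough illustration of this conditioning chain (our own sketch, not the PadNET firmware): the paper gives the 100 Hz sampling rate and 12-bit resolution, but the analog low-pass cutoff is not given here, so the digital filter below uses an assumed 25 Hz cutoff on synthetic data, with SciPy as an assumed choice of tooling.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS_HZ = 100.0      # sampling rate after digitization (from the paper)
CUTOFF_HZ = 25.0   # assumed illustrative cutoff; the paper's value is not given here

def condition_acceleration(raw: np.ndarray) -> np.ndarray:
    """Low-pass filter a (n_samples, 3) acceleration array, as a digital
    stand-in for the analog filtering stage before 12-bit digitization."""
    b, a = butter(N=4, Wn=CUTOFF_HZ / (FS_HZ / 2.0), btype="low")
    return filtfilt(b, a, raw, axis=0)

if __name__ == "__main__":
    t = np.arange(0, 5, 1.0 / FS_HZ)
    # Synthetic 3-axis signal: a 2 Hz "sawing" motion plus broadband noise.
    raw = np.stack([np.sin(2 * np.pi * 2 * t)] * 3, axis=1)
    raw += 0.3 * np.random.randn(len(t), 3)
    smooth = condition_acceleration(raw)
    print(smooth.shape)  # (500, 3)
```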
3 Recognition
3.1 Acceleration Data Analysis
Figure 2 shows a segment of the acceleration data collected
during the experiment. The segment includes sawing, re-
moving the wood from the vice, and drilling. The user ac-
cesses the drawer two times and walks between the vice and
the drill. Clear differences can be seen in the acceleration
signals. For example, sawing clearly reflects a periodic mo-
tion. By contrast, the drawer access (marked as 1a and 1b in
the figure) shows a low frequency “bump” in acceleration.
This bump corresponds to the 90 degree turns of the wrist
as the user releases the drawer handle, retrieves the object,
and grasps the handle again to close the drawer.
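As a toy illustration of this difference (our own sketch, not part of the paper's method; the 100 Hz rate comes from Section 2.2), the dominant frequency of a single wrist axis separates a periodic sawing motion from a slow drawer-access bump:

```python
import numpy as np

def dominant_frequency(axis_signal: np.ndarray, fs: float = 100.0) -> float:
    """Return the strongest non-DC frequency (Hz) in one accelerometer axis.
    A sawing segment shows a clear peak at the stroke rate, whereas a
    drawer-access 'bump' concentrates its energy at very low frequencies."""
    spectrum = np.abs(np.fft.rfft(axis_signal - axis_signal.mean()))
    freqs = np.fft.rfftfreq(len(axis_signal), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]
```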
Given the data, time series recognition techniques such
as hidden Markov models (HMMs) [13] should allow the
recognition of the relevant gestures. However, a closer anal-
ysis reveals two potential problems. First, not all relevant
activities are strictly constrained to a particular sequence of
motions. While the characteristic motions associated with
sawing or hammering are distinct, there is high variation in
drawer manipulation and grinding. Secondly, the activities
are separated by sequences of user motions unrelated to the
task (e.g. the user scratching his head). Such motions may
be confused with the relevant activities. We define a “noise”
class to handle these unrelated gestures.
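As a hedged sketch of how such HMM-based gesture classification might be set up (not the authors' implementation), the snippet below trains one Gaussian HMM per activity on accelerometer feature sequences and labels a new segment with the model that scores the highest log-likelihood. The hmmlearn dependency, the number of states, and the feature layout are all assumptions.

```python
import numpy as np
from hmmlearn import hmm  # assumed dependency; any HMM toolkit would do

def train_activity_hmms(train_segments, n_states=5):
    """train_segments: dict mapping activity name -> list of (T_i, D) arrays
    of accelerometer features. Returns one fitted HMM per activity."""
    models = {}
    for activity, segments in train_segments.items():
        X = np.vstack(segments)               # concatenate the sequences
        lengths = [len(s) for s in segments]  # per-sequence lengths
        m = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[activity] = m
    return models

def classify_segment(models, segment):
    """Label a (T, D) feature segment by the highest-likelihood model."""
    scores = {a: m.score(segment) for a, m in models.items()}
    return max(scores, key=scores.get)
```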
3.2 Sound Data Analysis
Considering that most gestures relevant to the assembly/maintenance scenario are associated with distinct sounds, sound analysis should help to address the problems described above. We distinguish between three different
types of sounds:
1. Sounds made by a hand tool: Such sounds are directly
correlated with user hand motion. Examples are saw-
ing, hammering, filing, and sanding. These actions are
generally repetitive, quasi–stationary sounds (i.e. rel-
atively constant over time - such that each time slice
on a sample would produce an identical spectrum over
a reasonable length of time). In addition these sounds
are much louder than the background noise (dominant)
and are likely to be much louder at the microphone
on the user’s hand than on his chest. For example,
the intensity curve for sanding (see Figure 2 top right)
reflects the periodic sanding motion with the minima
corresponding to the changes in direction and the max-
ima coinciding with the maximum sanding speed in the
middle of the motion. Since the user’s hand is directly
on the source of the sound the intensity difference is
large. For other activities it is smaller, however in most
cases still detectable.
2. Semi-autonomous sounds: These sounds are initiated
by user’s hand, possibly (but not necessarily) remain-
ing close to the source for most of the sound duration.
This class includes sound produced by a machine, such as the drill or grinder. Although ideally quasi-stationary, sounds in this class may not necessarily be
dominant and tend to have a less distinct intensity difference between the hand and the chest (for example, when a user moves their hand away from the machine during operation).

Figure 2: Left: example accelerometer data from sawing and drilling. Right top: audio profile of sanding from wrist and chest microphones. Right bottom: clustering of activities in LDA space.
3. Autonomous sounds: These are sounds generated by
activities not driven by the user's hands (e.g. loud back-
ground noises or the user speaking).
Obviously the vast majority of relevant actions in assembly
and maintenance are associated with hand tool sounds and semi-autonomous sounds. In principle, these sounds should
be easy to identify using intensity differences between the
wrist and the chest microphone. In addition, if extracted ap-
propriately, these sounds may be treated as quasi-stationary
and can be reliably classified using simple spectrum pattern
matching techniques.
The main problem with this approach is that many ir-
relevant actions are also likely to fall within the definition
of handtool and semi–autonomous sound. Such actions
include scratching or putting down an object. Thus, like
acceleration analysis, sound–based classification also has
problems distinguishing relevant from irrelevant actions and
will produce a number of false positives.
3.3 Recognition Methodology
Neither acceleration nor sound provide enough information
for perfect extraction and classification of all relevant activ-
ities; however, we hypothesize that their sources of error are
likely to be statistically distinct. Thus, we develop a tech-
nique based on the fusion of both methods. Our procedure
consists of three steps:
1. Extraction of the relevant data segments using the in-
tensity difference between the wrist and the chest mi-
crophone. We expect that this technique will segment
the data stream into individual actions (including many
actions we will model as noise).
2. Independent classification of the actions based on
sound or acceleration. This step will yield imperfect
recognition results by both the sound and acceleration
subsystems.
3. Removal of false positives. While the sound and acceleration subsystems are each imperfect, when their classifications of a segment agree, the result may be more reliable (if the sources of error are statistically distinct); a minimal sketch of this agreement rule is given below.
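As a minimal sketch of the agreement rule in step 3 (our own illustration with hypothetical label values, not the authors' code), a segment's label is kept only when the sound-based and acceleration-based classifiers agree; otherwise the segment is treated as noise.

```python
from typing import List, Optional

NOISE = None  # label used for rejected / irrelevant segments

def fuse_labels(sound_labels: List[str],
                accel_labels: List[str]) -> List[Optional[str]]:
    """Keep a segment's label only when both classifiers agree; otherwise
    mark it as noise. This is the false-positive removal step."""
    fused = []
    for s, a in zip(sound_labels, accel_labels):
        fused.append(s if s == a else NOISE)
    return fused

# Example: the third segment is rejected because the classifiers disagree.
print(fuse_labels(["saw", "drill", "hammer"], ["saw", "drill", "file"]))
# ['saw', 'drill', None]
```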
4 Isolated Activity Recognition
As an initial experiment, we segment the activities in the
data files by hand and test the accuracy of the sound and
acceleration methods separately.

4.1 Sound Recognition
4.1.1 Method
The basic classification scheme operates on individual sound segments of fixed length. The approach follows a three-step process: feature extraction, dimensionality reduction, and the actual classification.

The features used are the spectral components of each segment, obtained by a Fast Fourier Transform (FFT). This produces one high-dimensional feature vector per segment.
Rather than attempting to classify such large feature vectors directly, Linear Discriminant Analysis (LDA) [6] is employed to derive an optimal projection of the data into a smaller, (M-1)-dimensional feature space (where M is the number of classes). In the "recognition phase", the LDA transformation is applied to the data segment under test to produce the corresponding (M-1)-dimensional feature vector.
Using a labeled training set, class means are calculated in the (M-1)-dimensional space. Classification is performed simply by choosing the class mean which has the minimum Euclidean distance from the test feature vector (see Figure 2, bottom right).
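To make the pipeline concrete, here is a minimal sketch under assumed parameters: FFT magnitude spectra as features, scikit-learn's LinearDiscriminantAnalysis as a stand-in for the authors' own LDA implementation, and nearest-class-mean classification in the projected space. Window length, sampling rate, and variable names are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def spectral_features(window: np.ndarray) -> np.ndarray:
    """Magnitude spectrum of one audio window (the FFT feature vector)."""
    return np.abs(np.fft.rfft(window))

def fit_lda_classifier(windows, labels):
    """windows: (n_windows, n_samples) array of raw audio windows.
    Projects FFT features into the (M-1)-dimensional LDA space and stores
    the per-class means."""
    feats = np.array([spectral_features(w) for w in windows])
    lda = LinearDiscriminantAnalysis()
    proj = lda.fit_transform(feats, labels)
    means = {c: proj[np.array(labels) == c].mean(axis=0)
             for c in set(labels)}
    return lda, means

def classify_window(lda, means, window):
    """Assign the class whose LDA-space mean is nearest (Euclidean)."""
    p = lda.transform(spectral_features(window)[None, :])[0]
    return min(means, key=lambda c: np.linalg.norm(p - means[c]))
```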
4.1.2 Intensity Analysis
Making use of the fact that signal intensity is inversely proportional to the square of the distance from its source, the ratio of the two intensities, $I_w/I_c$ (wrist over chest), is used as a measure of the distance of the source from the user. Assuming the sound source is at distance $d$ from the wrist microphone and $d_c$ from the chest, the ratio of the intensities will be proportional to the squared ratio of the distances, $I_w/I_c \propto d_c^2/d^2$.

When both microphones are separated by at least $a$, any sound produced at a distance $d$ (where $d \gg a$) from the user will bring this ratio close to one. Sounds produced near the chest microphone (e.g. the user speaking) will cause the ratio to approach zero, whereas any sounds close to the wrist mic will make this ratio large.
Sound extraction is performed by sliding a window over the resampled audio data. On each iteration, the signal energy within the window is calculated for each channel. For these windows, the difference between the intensity ratio $I_w/I_c$ and its reciprocal is obtained and compared to an empirically determined threshold.

This difference, $D = I_w/I_c - I_c/I_w$, provides a convenient metric for thresholding: zero indicates a far-off (or exactly equidistant) sound, while values above or below zero indicate a sound closer to the wrist or to the chest mic, respectively.
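A minimal sketch of this segmentation metric follows; the window length, threshold, and function names are assumed placeholders rather than the paper's values. Per-window energies for the wrist and chest channels are computed, and the difference between the energy ratio and its reciprocal is thresholded.

```python
import numpy as np

def intensity_difference(wrist: np.ndarray, chest: np.ndarray,
                         win: int = 256, eps: float = 1e-12) -> np.ndarray:
    """For each non-overlapping window, return D = Iw/Ic - Ic/Iw.
    D ~ 0: distant or equidistant sound; D >> 0: close to the wrist mic;
    D << 0: close to the chest mic. The window size is an assumed placeholder."""
    n = min(len(wrist), len(chest)) // win
    d = np.empty(n)
    for i in range(n):
        w = wrist[i * win:(i + 1) * win]
        c = chest[i * win:(i + 1) * win]
        iw, ic = np.sum(w ** 2) + eps, np.sum(c ** 2) + eps
        d[i] = iw / ic - ic / iw
    return d

def segment_hand_sounds(wrist, chest, threshold=2.0, win=256):
    """Indices of windows whose sound source is near the user's hand;
    the threshold is illustrative and would be chosen empirically."""
    return np.nonzero(intensity_difference(wrist, chest, win) > threshold)[0]
```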
Sound LDA IA+LDA maj(IA+LDA)
Hammer 96.79 98.85 100
Saw 92.71 92.98 100
Filing 69.68 81.43 100
Drilling 99.59 99.35 100
Sanding 93.66 92.87 100
Grinding 97.77 97.75 100
Screwing 91.17 93.29 100
Vice 80.10 81.14 100
Overall 90.18 92.21 100
Table 2: Isolated Recognition Accuracy Per Sound (in %)
for LDA alone, LDA with IA preselection, and majority decision.
4.2 Results
In order to analyze the performance of the LDA on isolated
classes, individual examples of each class were partitioned
from each of the 10 experiments, providing 10 examples
of each class. Eight examples of each class were used for
training while testing on the remaining two examples.
Earlier work [15] cited a sampling rate of 5 kHz and a window length of 0.05 seconds (256 points) as optimal parameters for general purpose sound recognition tasks. In this task, it was found that recognition rates were improved using a larger window of 0.1 seconds; at the same time the sampling rate could be reduced to 2 kHz without any notable adverse effects.
With these parameters, a sliding-window LDA classification was run directly over all the class-partitioned samples. This process returned an overall recognition rate of
90.19%. The individual class results are given in the first
column of Table 2. We next used intensity analysis to select
only the samples over a given threshold to pass to the LDA
procedure. This technique resulted in a slightly higher accu-
racy of 92.21% as shown in the second column of Table 2.
The third column of Table 2 shows a variation of this tech-
nique where we slide a window over the data and classify
the data at each window segment. A majority decision over
the window segments was used to determine the overall la-
bel for a given isolated activity. This technique resulted in
100% recognition over the test data.
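As an illustration of that majority decision (our sketch with hypothetical per-window labels, not the authors' code), the per-window classifications of one isolated activity are collapsed into a single label:

```python
from collections import Counter

def majority_label(window_labels):
    """Collapse per-window classifications of one isolated activity into a
    single label by majority vote."""
    return Counter(window_labels).most_common(1)[0][0]

# Example with hypothetical per-window decisions for one sawing segment.
print(majority_label(["saw", "saw", "file", "saw", "saw"]))  # 'saw'
```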
Figure 3: HMM topologies.

References

Rabiner, L. R., Juang, B. H.: An introduction to hidden Markov models. IEEE ASSP Magazine 3(1), 4-16 (1986).

Starner, T., Weaver, J., Pentland, A.: Real-time American sign language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(12), 1371-1375 (1998).

Mäntyjärvi, J., Himberg, J., Seppänen, T.: Recognizing human motion with multiple acceleration sensors. In: Proc. IEEE International Conference on Systems, Man, and Cybernetics (2001).