What convolutional filter size gives the best performance?

For simple datasets, a filter size of two seems to provide the best performance, while for more complex datasets this size needs to be increased up to three.

What is the benefit of combining deep learnt and shallow features?

Combining both deep learnt and shallow features allows the best of the two approaches to be exploited, producing a more generalizable solution that, as shown in the authors' experiments, outperforms the other approaches on all datasets except the Skoda dataset.

Why might a deep learning approach fail to generalize?

Another possibility is that, since the extraction of features through deep learning is driven by data, if the dataset does not represent all the possible modalities well (i.e., location of the sensor, different sensor properties such as amplitude or sampling rate), the deep learning approach is not capable of generalizing across these data modalities automatically for the classification task.

How is the proposed method suitable for real-time on-node HAR?

The authors show that the computation time obtained on low-power devices, such as smartphones, wearable devices, and IoT devices, is suitable for real-time on-node HAR.

What is the plausible reason for the lower precision and recall results observed?

For the Daphnet FoG dataset, the underrepresentation of the class “freeze” in the training data is the most plausible reason for the lower precision and recall results observed.

Which libraries are used to implement the deep learning model?

For both architectures, the authors use the FFTW3 library [33] to extract the spectrogram and the Torch framework [34] to implement the deep learning model.

What is the largest dataset in terms of number of samples?

ActiveMiles is one of the largest datasets in terms of number of samples, with around 30 h of labeled raw data, and it is the first database that groups together data captured using different sensor configurations.

How long did the computation take to perform the classification task?

To evaluate if the proposed method could achieve real-time performance on a smartphone or a miniature wearable device, the computation time required to perform the classification task for a 10 s segment of data was measured.

(Open Access) A Deep Learning Approach to on-Node Sensor Data Analytics for Mobile or Wearable Devices (2017) | Daniele Ravi

Q: What is the main challenge when designing a classification method for time-series analysis?

One of the main challenges, when designing a classification method for time-series analysis, is selecting a suitable set of features for subsequent classification.

Q: What is the main advantage of the filters?

These filters are applied repeatedly to the entire spectrogram and the main advantage is that the network contains just a number of neurons equal to a single instance of the filters, which drastically reduces the connections from the typical neural network architecture.

56 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 21, NO. 1, JANUARY 2017

A Deep Learning Approach to on-Node Sensor

Data Analytics for Mobile or Wearable Devices

Daniele Ravì, Charence Wong, Benny Lo, and Guang-Zhong Yang, Fellow, IEEE

Abstract: The increasing popularity of wearable devices in recent years means that a diverse range of physiological and functional data can now be captured continuously for applications in sports, wellbeing, and healthcare. This wealth of information requires efficient methods of classification and analysis where deep learning is a promising technique for large-scale data analytics. While deep learning has been successful in implementations that utilize high-performance computing platforms, its use on low-power wearable devices is limited by resource constraints. In this paper, we propose a deep learning methodology, which combines features learned from inertial sensor data together with complementary information from a set of shallow features to enable accurate and real-time activity classification. The design of this combined method aims to overcome some of the limitations present in a typical deep learning framework where on-node computation is required. To optimize the proposed method for real-time on-node computation, spectral domain preprocessing is used before the data are passed onto the deep learning framework. The classification accuracy of our proposed deep learning approach is evaluated against state-of-the-art methods using both laboratory and real world activity datasets. Our results show the validity of the approach on different human activity datasets, outperforming other methods, including the two methods used within our combined pipeline. We also demonstrate that the computation times for the proposed method are consistent with the constraints of real-time on-node processing on smartphones and a wearable sensor platform.

Index Terms: ActiveMiles, deep learning, Human Activity Recognition (HAR), Internet-of-Things (IoT), low-power devices, wearable.

I. INTRODUCTION

Deep learning is a paradigm of machine learning that uses multiple processing layers to infer and extract information from big data. Research has shown that the use of deep learning can achieve improved performance in a range of applications when compared to traditional approaches [1]–[6]. Conventional learning approaches use a set of predesigned features, also known as “shallow” features, to represent the data for a specific classification task. In image processing and machine vision, shallow features such as SIFT or FAST are often used for landmark detection [7], whereas for time-series analysis, statistical parameters are used [8]–[11].

Manuscript received July 20, 2016; revised October 19, 2016; accepted November 18, 2016. Date of publication December 23, 2016; date of current version January 31, 2017. This research work was supported by EPSRC reference: EP/L014149/1 Smart Sensing for Surgery project and EPSRC-NIHR HTC Partnership Award (EP/M000257/1 and EP/N027132/1).

The authors are with the Hamlyn Centre, Imperial College London, London SW7 2AZ, U.K. (e-mail: d.ravi@imperial.ac.uk; charence@imperial.ac.uk; benny.lo@imperial.ac.uk; g.z.yang@imperial.ac.uk).

Digital Object Identifier 10.1109/JBHI.2016.2633287

Human Activity Recognition (HAR), for example, generally exploits time-series data from inertial sensors to identify the actions being performed. In healthcare, inertial sensor data can be used for monitoring the onset of diseases as well as the efficacy of treatment options [11], [12]. For patients with neurodegenerative diseases, such as Parkinson’s, HAR can be used to compile diaries of their daily activities and detect episodes such as freezing-of-gait events, for assessing the patient’s condition [13]. Quantifying physical activity through HAR can also provide invaluable information for other applications, such as evaluating the condition of patients with chronic obstructive pulmonary disease (COPD) [14], [15] or evaluating the recovery progress of patients during rehabilitation [16], [17].

Currently, smartphones, wearable devices, and Internet-of-Things (IoT) devices are becoming more affordable and ubiquitous. Many commercial products, such as the Apple Watch, Fitbit, and Microsoft Band, and smartphone apps including Runkeeper and Strava, are already available for continuous collection of physiological data. These products typically contain sensors that enable them to sense the environment, have modest computing resources for data processing and transfer, and can be placed in a pocket or purse, worn on the body, or installed at home. Accurate and meaningful interpretation of the recorded physiological data from these devices can potentially be applied to HAR. However, most current commercial products only provide relatively simple metrics, such as step count or cadence. The emergence of deep learning methodologies, which are able to extract discriminating features from the data, and increased processing capabilities in wearable technologies give rise to the possibility of performing detailed data analysis in situ and in real time. The ability to perform more complex analysis, such as activity classification, on the wearable device would be advantageous for the aforementioned applications.

The rest of the paper is organized as follows: In Section II, we introduce the current state-of-the-art in machine learning for HAR. Our proposed methodology is then described in Section III. Datasets used for performance evaluation are presented in Section IV along with detailed comparison of the different approaches. Our findings and contributions are concluded in Section V.

This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/


II. RELATED WORK

One of the main challenges, when designing a classification method for time-series analysis, is selecting a suitable set of features for subsequent classification. Recent surveys of research in activity recognition show the diverse range of features and classification methods used [18], [19].

In [20], a simple energy thresholding method applied to frequency analysis of the input data is used for the detection of freezing of gait in Parkinson’s patients. In other applications, statistical parameters [8], basis transform coding [9], and symbolic representation [10] are often used as “shallow” features to describe time-series data. Methods such as decision trees and support vector machines (SVM) are then trained to classify the data using the given features [21]–[23]. Catal et al. [24] proposed a method for HAR that combines multiple classification methods, known as an ensemble of classifiers, to maximize the accuracy that can be attained from each classification method.

Using deep learning methods, such as deep belief networks (DBN), restricted Boltzmann machines (RBM), and convolutional neural networks (CNN), a discriminative set of features can be learnt directly from the input data [3]–[5]. However, for HAR, changes in sensor orientation, placement, and other factors mean that deep learning approaches must use complex designs with many layers in order to discover a complete hierarchy of features to properly classify the raw data. Alsheikh et al. [6] demonstrate activity recognition using a method based on DBNs and RBMs formed using multiple hidden layers. A hybrid deep learning and hidden Markov model (HMM) approach is finally used with three 1000-neuron layers. While utilizing additional hidden layers and neurons to improve recognition accuracy is not a significant burden for high-performance computer systems, it makes these methods unsuitable for devices with fewer resources.

A deep learning approach optimized for low-power devices presented in [1] uses a spectrogram representation of the inertial input data to provide invariance against changes in sensor placement, amplitude, or sampling rate, thus allowing a more compact method design. However, the results reported in [1] do not always exceed the accuracy obtained from shallow features, which may be due to resource limitations and the simple design of the method. For this reason, we propose to combine a set of shallow features with those obtained from deep learning in this paper. As far as we know, we are the first to propose an efficient combination of shallow and deep features in a method that can be executed in real time on a wearable device.

III. METHODS

As mentioned previously, in Ravì et al. [1], it is shown that features derived from a deep learning method performed on devices with limited resources are sometimes less discriminative than a complete set of predefined shallow features. A possible reason for this behavior may lie in the fact that deep learning methods with fewer computational layers cannot find the entire hierarchy of features. Another possibility is that, since the extraction of features through deep learning is driven by data, if the dataset does not represent all the possible modalities well (i.e., location of the sensor, different sensor properties such as amplitude or sampling rate), the deep learning approach is not capable of generalizing across these data modalities automatically for the classification task. In these scenarios, shallow features may achieve better performance than deep learning approaches. Consequently, we believe that shallow and deep learnt features provide complementary information that can be jointly used for classification.

Fig. 1. Schematic workflow of the proposed method: the raw datasets measured by the inertial sensors are collected and divided into segments. The automatically learnt features and the shallow features are extracted in processes A and B, respectively. In the last block, the features are combined together and classified using a fully connected layer and a soft-max layer of the deep learning model.

The pipeline of the proposed approach that combines both shallow and deep learnt features is described in Fig. 1. The first block within the pipeline collects the raw data obtained from the inertial sensors. The second block divides the input data into segments to be used along both process A and process B of the pipeline, where features from a deep learning method and shallow features are computed in parallel. In the final block of the pipeline, these two sets of features are merged together and classified using a fully connected layer and a soft-max layer. The details of the approach are further explained in Algorithm 1 and each of these blocks is described as follows:

A. Input

For the application of HAR, we will be using inertial sensors, such as accelerometers and gyroscopes, for the input block of Fig. 1. It is important to note that the described approach also caters for additional time-series data from other sensor types, such as Electrocardiography (ECG) for measuring heart rhythm or Electromyography (EMG) for muscle activity.

B. Extract Segments

After the raw signals are collected, segments of n samples are extracted and forwarded along processes A and B of the pipeline. The number of samples to consider depends on the type of application involved. Increasing the length of the segments can improve recognition accuracy, but it also delays the response in real-time applications, because longer segments of data need to be obtained, and the boundaries between different activities become less well defined. Typically, segments of 4 to 10 s are used for HAR [6]. Segments rather than single data points are used because the highly fluctuating raw inertial measurements make the classification of a single data point impractical [25]. Therefore, segments are obtained using a sliding window applied individually to each axis of the sensor.
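The sliding-window step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_segments`, the 50 Hz sampling rate, the 4 s window, and the 50% overlap are all assumptions chosen for the example; the text only states that segments of 4 to 10 s are typical.

```python
import numpy as np

def extract_segments(signal, n, step):
    """Split a 1-D signal into windows of n samples, advancing by `step` samples."""
    signal = np.asarray(signal, dtype=float)
    if n <= 0 or step <= 0:
        raise ValueError("n and step must be positive")
    starts = range(0, len(signal) - n + 1, step)
    # One row per window; reshape keeps a (0, n) shape when the signal is too short.
    return np.array([signal[s:s + n] for s in starts]).reshape(-1, n)

# Example: 10 s of one accelerometer axis at an assumed 50 Hz,
# cut into 4 s windows with an assumed 50% overlap.
fs = 50
axis = np.sin(2 * np.pi * 2.0 * np.arange(10 * fs) / fs)
segments = extract_segments(axis, n=4 * fs, step=2 * fs)
print(segments.shape)  # (4, 200)
```

The same function would be applied separately to each axis of each sensor, so a three-axis accelerometer and a three-axis gyroscope yield six segment streams.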

C. Spectrogram and Deep Learning Module

In process A of Fig. 1, a set of deep features is automatically extracted using the proposed deep learning module. This module takes advantage of a spectrogram representation and an efficient design to achieve its task. In previous work, Ravì et al. [1] show the importance of using a suitable domain when a deep learning methodology is applied to time-series data. Specifically, they show that the spectrogram representation is essential for extracting interpretable features that capture the intensity differences among nearest inertial data points. The spectrogram representation also provides a form of time and sampling rate invariance. This enables the classification to be more robust against data shifting in time and also against changes in amplitude of the signal and sampling rate. Moreover, frequency selection in the spectrogram domain also provides an implicit way to allow noise filtering of the data over time.

A spectrogram of an inertial signal x is a new representation of the signal as a function of frequency and time. Specifically, the spectrogram is the magnitude squared of the short-time Fourier transform (STFT). The procedure for computing the spectrogram is to divide a longer time signal into short segments of equal length and then compute the Fourier transform separately on each shorter segment. This can be expressed as

$$\mathrm{STFT}\{x[n]\}(m, \omega) = X(m, \omega) = \sum_{n=-\infty}^{\infty} x[n]\, w[n - m]\, e^{-j\omega n} \tag{1}$$

with signal x[n] and window w[n]. The magnitude squared of the STFT yields the spectrogram of the function:

$$\mathrm{spectrogram}\{x[n]\}(m, \omega) = |X(m, \omega)|^{2}. \tag{2}$$

The resulting spectrogram is a matrix st × sf, where st is the number of different short-term, time-localized points and sf is the number of frequencies considered. Therefore, the spectrogram describes the changing spectra as a function of time. In Fig. 2, we show examples of the averaged spectrograms across different activities. As we can see, their representations exhibit different patterns. Specifically, it appears that highly variable activities exhibit higher spectrogram values along all frequencies, whereas repetitive activities, such as walking or running, only show high values at specific frequencies. These discriminative patterns can be detected by the deep learning module, which aims to extract features and characterize activities.
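As an illustration of Eqs. (1) and (2), the following sketch computes a spectrogram with NumPy. It is not the authors' FFTW3-based code: the Hann window, the window length, the hop size, and the 50 Hz sampling rate are assumptions made for the example, and the helper name `spectrogram` is hypothetical.

```python
import numpy as np

def spectrogram(x, win_len, hop):
    """Magnitude-squared STFT; returns an (st, sf) array of time frames by frequency bins."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(win_len)              # window w[n]; Hann is an assumption
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.array([x[s:s + win_len] * window for s in starts])
    X = np.fft.rfft(frames, axis=1)           # one Fourier transform per short segment
    return np.abs(X) ** 2                     # Eq. (2): |X(m, omega)|^2

fs = 50                                        # Hz, illustrative
t = np.arange(4 * fs) / fs
x = np.sin(2 * np.pi * 5.0 * t)                # a repetitive 5 Hz signal
S = spectrogram(x, win_len=50, hop=25)
freqs = np.fft.rfftfreq(50, d=1 / fs)
print(S.shape)                                 # (7, 26), i.e. st = 7, sf = 26
print(freqs[S.mean(axis=0).argmax()])          # 5.0
```

For this purely periodic input the energy concentrates in a single frequency bin, which mirrors the observation that repetitive activities show high values only at specific frequencies.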

Once the spectrograms have been computed, they are processed using the deep learning module. The design of our deep learning module is aimed at overcoming some of the issues typically present in a deep learning framework where on-node computation is required. Specifically, these disadvantages include the following:
1) deep learning modules can contain redundant links between pairs of nodes that connect two consecutive layers of the neural network;
2) correlations in different signal points are usually overlooked; and
3) a large set of layers can be built on top of each other to extract a hierarchy of features from low level to high level.

Deep learning approaches with these designs tend to have high computation demands and are unsuitable for the low-power devices being considered in this paper. In our proposed approach, we reduce the computation cost by limiting the connections between the nodes of the network and by computing the features efficiently through the use of few hidden layers. Specifically, the spectral representations of different axes and sensors are prearranged so that the data represent local correlations and they can be processed using 1-D convolutional kernels with the same principle that CNN [26] follows. These filters are applied repeatedly to the entire spectrogram and the main advantage is that the network contains just a number of neurons equal to a single instance of the filters, which drastically reduces the connections from the typical neural network architecture.

Fig. 2. Examples of averaged spectrograms extracted from different activities of the ActiveMiles dataset. Their representations exhibit different patterns for feature extraction and class recognition.

The proposed prearrangement of the spectrograms is shown in the deep learning module of Fig. 1. Here the spectrograms computed on the x, y, and z axes are grouped together column wise while the spectrograms obtained from different sensors are grouped row wise. The processing of our deep learning module is based on the use of sums of local 1-D convolutions over this prearranged input. Since each activity has a discriminative distribution of frequencies, as shown in Fig. 2, the sum is performed in correspondence to each frequency. Specifically, each filter w, with size kw × st, is applied to the spectrogram vertically, and the weighted sum of the convolved signal at time t is computed as follows:

$$o[t][i] = \sum_{j=1}^{st} \sum_{k=1}^{kw} w[i][j][k] * \mathrm{input}[j][dw * (t - 1) + k] \tag{3}$$

where dw is the stride of the convolution. These convolutions produce an output layer o with size wp × OutputFrame, with OutputFrame = (InputFrame − kw)/dw + 1 and wp the number of filters. The results of the convolution obtained from the x, y, and z axes of an inertial sensor are summed together without any discrimination so that the orientation invariance property is maintained. This helps the proposed deep learning framework to be more generalizable even when variation in the data resulting from different sensor orientation is not well represented in the dataset. The filters applied to the three axes share the same weights, which is important for reducing the number of parameters for each convolution layer.

TABLE I
SHALLOW FEATURES EXTRACTED FROM THE PROPOSED APPROACH AND COMBINED WITH THE LEARNT FEATURES

Input Data         Features
Raw signal         Interquartile Range, Amplitude, Kurtosis, Root Mean Square, Variance, Mean, Standard Deviation, Skewness, Min, Mean-cross, Median, Max, Zero-cross
First derivative   Root Mean Square, Variance, Mean, Standard Deviation
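The convolution of Eq. (3) and the OutputFrame formula can be sketched in NumPy as below. This is an illustrative reading rather than the authors' Torch code: indices are 0-based, the array layout (`inp` as rows × frames, `w` as filters × rows × kernel width) and all sizes in the example are assumptions, and the output is stored as o[t][i], so its shape is OutputFrame × wp.

```python
import numpy as np

def temporal_conv(inp, w, dw=1):
    """Eq. (3) with 0-based indices.
    inp: (rows, input_frame); w: (wp, rows, kw); returns o of shape (output_frame, wp)."""
    wp, rows, kw = w.shape
    if inp.shape[0] != rows:
        raise ValueError("filter and input must have the same number of rows")
    input_frame = inp.shape[1]
    output_frame = (input_frame - kw) // dw + 1
    o = np.zeros((output_frame, wp))
    for t in range(output_frame):
        patch = inp[:, dw * t: dw * t + kw]                   # (rows, kw) slice of the input
        o[t] = np.tensordot(w, patch, axes=([1, 2], [0, 1]))  # sum over j and k for each filter i
    return o

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 4, 2))                        # wp = 3 filters, kw = 2 (assumed sizes)
sx, sy, sz = (rng.standard_normal((4, 10)) for _ in range(3))  # stand-ins for per-axis spectrograms

# Shared weights on every axis, then an undiscriminated sum over the axes.
summed = temporal_conv(sx, w) + temporal_conv(sy, w) + temporal_conv(sz, w)
print(summed.shape)                                       # (9, 3): (10 - 2) / 1 + 1 = 9 frames
```

Because the three axes share weights and are summed, swapping the order of the axes leaves the output unchanged, which is one simple way to see the invariance the text refers to.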

D. Shallow Features

In process B of Fig. 1, 17 predefined shallow features are considered. These features, listed in Table I, are extracted separately from each segment of each axis, creating a vector representation for the considered segment. This step is expressed on line 5 in Algorithm 1. In our case, it takes six input segments, a[1], a[2], a[3], g[1], g[2], g[3], representing, respectively, the accelerometer and the gyroscope data vectors along the three axes, and produces a final vector of 102 features as output.

TABLE II
SUMMARY OF HUMAN ACTIVITY DATASETS

Dataset       Description                                                              # of Classes   Subjects   Samples      Sampling Rate   Reference
ActiveMiles   Daily activities collected by smartphone in uncontrolled environments   7              10         4,390,726    50–200 Hz       [1]
WISDM v1.1    Daily activities collected by smartphone in a laboratory                6              29         1,098,207    20 Hz           [27]
WISDM v2.0    Daily activities collected by smartphone in uncontrolled environments   6              563        2,980,765    20 Hz           [28], [29]
Daphnet FoG   Freezing of gait episodes in Parkinson’s patients                        2              10         1,917,887    64 Hz           [20]
Skoda         Manipulative gestures performed in a car maintenance scenario           10             1          ~701,440     98 Hz           [30]
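The shallow-feature step (17 features per axis, 102 for six axes) could look roughly like the sketch below. The source names the features but does not define them, so every formula here is an assumption: amplitude is taken as max minus min, kurtosis is the non-excess (Pearson) form, moments use the population standard deviation, and crossings are counted as sign changes between consecutive samples. The function name `shallow_features` is hypothetical.

```python
import numpy as np

def crossings(v):
    """Number of sign changes between consecutive samples."""
    return int(np.sum(np.signbit(v[:-1]) != np.signbit(v[1:])))

def shallow_features(seg):
    """Return 17 features for one segment of one axis (13 raw + 4 on the first derivative)."""
    seg = np.asarray(seg, dtype=float)
    d = np.diff(seg)                                   # first derivative (finite difference)
    mean, std = seg.mean(), seg.std()
    centred = seg - mean
    q75, q25 = np.percentile(seg, [75, 25])
    skew = np.mean(centred ** 3) / std ** 3 if std > 0 else 0.0
    kurt = np.mean(centred ** 4) / std ** 4 if std > 0 else 0.0
    raw = [q75 - q25, seg.max() - seg.min(), kurt,     # interquartile range, amplitude, kurtosis
           np.sqrt(np.mean(seg ** 2)), seg.var(), mean,  # root mean square, variance, mean
           std, skew, seg.min(),                       # standard deviation, skewness, min
           crossings(centred), np.median(seg), seg.max(),  # mean-cross, median, max
           crossings(seg)]                             # zero-cross
    deriv = [np.sqrt(np.mean(d ** 2)), d.var(), d.mean(), d.std()]
    return np.array(raw + deriv, dtype=float)

rng = np.random.default_rng(0)
six_axes = rng.standard_normal((6, 200))               # stand-ins for a[1..3] and g[1..3]
vector = np.concatenate([shallow_features(s) for s in six_axes])
print(vector.shape)                                    # (102,)
```

Each axis contributes 13 raw-signal features and 4 first-derivative features, so six axes give 6 × 17 = 102 values, matching the vector length stated in the text.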

E. Classiﬁcation

Once both deep and shallow features have been computed, they are merged together into a unique vector and classified through a fully connected layer and a soft-max layer, as shown by lines 20 and 21 of Algorithm 1.
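A minimal sketch of this final block, assuming the deep and shallow features are simply concatenated before a single fully connected layer. The weights here are random placeholders rather than trained values, and the sizes (40 deep features, 7 classes) are illustrative assumptions; only the 102 shallow features come from the text.

```python
import numpy as np

def softmax(z):
    """Numerically stable soft-max over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(deep, shallow, W, b):
    """Merge both feature sets, apply a fully connected layer, then soft-max."""
    features = np.concatenate([deep, shallow])
    return softmax(W @ features + b)

rng = np.random.default_rng(0)
deep = rng.standard_normal(40)                  # assumed number of learnt features
shallow = rng.standard_normal(102)              # 102 shallow features, as in the text
W = 0.01 * rng.standard_normal((7, 142))        # placeholder weights for 7 assumed classes
b = np.zeros(7)
probs = classify(deep, shallow, W, b)
print(probs.shape, int(probs.argmax()))
```

The output is a probability vector over the activity classes, and the predicted activity is the index of its largest entry.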

F. Training Process

Shallow and deep features are trained together in a unified deep neural network. During each stage of the training, errors between the target and obtained values are used in the backward propagation routine to update the weights of the different hidden layers. Stochastic gradient descent (SGD) is used to minimize the loss function defined by the L2-norm. To further improve the training procedure of the weights, we have used three regularizations:
1) Weight decay: a term in the weight update rule that causes the weights to decay exponentially to zero if no other update is scheduled. It is used to avoid overfitting.
2) Momentum: a technique for accelerating gradient descent and attempting to move toward the global minimum of the function. It accumulates a velocity vector in directions of persistent reduction in the objective across iterations.
3) Dropout: a technique that prevents overfitting and provides a way of combining many different neural network architectures efficiently for consensus purposes. At each iteration of the training, dropout temporarily removes nodes from a neural network, along with all their incoming and outgoing connections. The choice of which units to drop is random and is determined according to a probability p of retaining a node. Training a network with dropout leads to significantly lower generalization error.
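The three techniques listed above can be illustrated with a small NumPy sketch. This shows one common formulation, not the authors' Torch configuration: the learning rate, momentum, and decay values are assumptions, and the dropout here uses the "inverted" variant that rescales retained activations by 1/p, which the source does not specify.

```python
import numpy as np

def sgd_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and weight decay (L2 penalty added to the gradient)."""
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity

def dropout(activations, p_retain, rng):
    """Keep each node with probability p_retain; rescale so the expected value is unchanged."""
    mask = rng.random(activations.shape) < p_retain
    return activations * mask / p_retain

# With a zero gradient and no momentum, weight decay alone shrinks the weights
# geometrically: each step multiplies them by (1 - lr * weight_decay) = 0.95.
w, v = np.ones(3), np.zeros(3)
for _ in range(10):
    w, v = sgd_step(w, np.zeros(3), v, lr=0.1, momentum=0.0, weight_decay=0.5)
print(np.allclose(w, 0.95 ** 10))   # True

rng = np.random.default_rng(0)
hidden = np.ones(1000)
dropped = dropout(hidden, p_retain=0.5, rng=rng)
print(np.count_nonzero(dropped) < hidden.size)
```

The first example makes the "exponential decay to zero" behaviour of weight decay concrete; the second shows that roughly half of the nodes are silenced when p = 0.5.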

IV. EXPERIMENTAL RESULTS

A. Datasets

To evaluate the proposed system, we analyze the performance obtained on complex real world activity data, collected from multiple users. Five public datasets are analyzed using tenfold cross validation. Table II summarizes these datasets. Noteworthy is the release of our dataset, ActiveMiles (available at http://hamlyn.doc.ic.ac.uk/activemiles/), which contains unconstrained real world human activity data from ten subjects collected using five different smartphones. Each subject was asked to annotate the activities they carried out during the day using an Android app developed for this purpose. There are no limitations on where the smartphone is located (i.e., pocket, bag, or held in the hand). Annotations record the start time, end time, and label of a continuous activity. Since each smartphone uses a different brand of sensor, the final dataset will contain data that have many modalities, including different sampling rates and amplitude ranges. It is one of the largest datasets in terms of number of samples with around 30 h of labeled raw data, and it is the first database that groups together data captured using different sensor configurations.

Fig. 3. Behavior of the proposed approach by increasing the probability of retaining a node in the dropout regularization, where size of the convolutional kernel is 2, number of levels is 2, and number of filters is 40.

B. Parameters Optimizations

The proposed deep learning framework contains a few hyperparameters that must be defined before training the final model. An optimization process based on a grid search is proposed to find the best values for the following:
1) the probability of retaining a node during the dropout regularization;
2) the size of the convolutional kernels for all relative convolutional layers;
3) the total number of convolutional layers; and
4) the total number of filters in each convolutional layer.
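A grid search over these four hyperparameters can be sketched as follows. The candidate values in `grid` and the scoring function `toy_evaluate` are hypothetical placeholders: in practice the score would come from training the network and measuring cross-validated accuracy, which is not reproduced here.

```python
from itertools import product

# Hypothetical candidate values for the four hyperparameters listed above.
grid = {
    "p_retain": [0.5, 0.7, 0.9],
    "kernel_size": [2, 3],
    "n_layers": [1, 2, 3],
    "n_filters": [10, 20, 40],
}

def grid_search(grid, evaluate):
    """Evaluate every combination in the grid and return the best configuration and score."""
    names = list(grid)
    best_score, best_cfg = float("-inf"), None
    for values in product(*(grid[name] for name in names)):
        cfg = dict(zip(names, values))
        score = evaluate(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score

def toy_evaluate(cfg):
    """Placeholder score standing in for cross-validated accuracy of a trained model."""
    return -(abs(cfg["p_retain"] - 0.7) + abs(cfg["kernel_size"] - 2)
             + abs(cfg["n_layers"] - 2) + abs(cfg["n_filters"] - 40) / 40)

best_cfg, best_score = grid_search(grid, toy_evaluate)
print(best_cfg)
```

The exhaustive loop evaluates 3 × 2 × 3 × 3 = 54 configurations for this grid, so the cost grows quickly with the number of candidate values per hyperparameter.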

The behavior of the system when these parameters are systematically tested is shown in Figs. 3, 4, and 5. In Fig. 3, we can infer that for datasets that have many classes and large variability, increasing the probability of retaining a node during dropout
