
Data Min Knowl Disc (2007) 15:107–144
DOI 10.1007/s10618-007-0064-z
Experiencing SAX: a novel symbolic representation
of time series
Jessica Lin · Eamonn Keogh · Li Wei ·
Stefano Lonardi
Received: 15 June 2006 / Accepted: 10 January 2007 / Published online: 3 April 2007
© Springer Science+Business Media, LLC 2007
Abstract Many high level representations of time series have been proposed
for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise
polynomial models, etc. Many researchers have also considered symbolic rep-
resentations of time series, noting that such representations would potentially
allow researchers to avail themselves of the wealth of data structures and algorithms from
the text processing and bioinformatics communities. While many symbolic rep-
resentations of time series have been introduced over the past decades, they all
suffer from two fatal flaws. First, the dimensionality of the symbolic represen-
tation is the same as the original data, and virtually all data mining algorithms
scale poorly with dimensionality. Second, although distance measures can be
defined on the symbolic approaches, these distance measures have little corre-
lation with distance measures defined on the original time series.
In this work we formulate a new symbolic representation of time series. Our
representation is unique in that it allows dimensionality/numerosity reduction,
Responsible editor: Johannes Gehrke.
J. Lin (✉)
Information and Software Engineering Department, George Mason University, Fairfax,
VA 22030, USA
e-mail: jessica@ise.gmu.edu
E. Keogh · L. Wei · S. Lonardi
Computer Science & Engineering Department, University of California-Riverside, Riverside,
CA 92521, USA
e-mail: eamonn@cs.ucr.edu
L. Wei
e-mail: wli@cs.ucr.edu
S. Lonardi
e-mail: stelo@cs.ucr.edu

and it also allows distance measures to be defined on the symbolic approach that
lower bound corresponding distance measures defined on the original series.
As we shall demonstrate, this latter feature is particularly exciting because it
allows one to run certain data mining algorithms on the efficiently manipulated
symbolic representation, while producing identical results to the algorithms
that operate on the original data. In particular, we will demonstrate the utility
of our representation on various data mining tasks of clustering, classification,
query by content, anomaly detection, motif discovery, and visualization.
Keywords Time series · Data mining · Symbolic representation · Discretize
1 Introduction
Many high level representations of time series have been proposed for data min-
ing. Figure 1 illustrates a hierarchy of all the various time series representations
in the literature (Andre-Jonsson and Badal 1997; Chan and Fu 1999; Faloutsos
et al. 1994; Geurts 2001; Huang and Yu 1999; Keogh et al. 2001a; Keogh and
Pazzani 1998; Roddick et al. 2001; Shahabi et al. 2000; Yi and Faloutsos 2000).
One representation that the data mining community has not considered in detail
is the discretization of the original data into symbolic strings. At first glance this
seems a surprising oversight. There is an enormous wealth of existing algorithms
and data structures that allow the efficient manipulations of strings. Such algo-
rithms have received decades of attention in the text retrieval community, and
more recent attention from the bioinformatics community (Apostolico et al.
2002; Durbin et al. 1998; Gionis and Mannila 2003; Reinert et al. 2000; Staden
et al. 1989; Tompa and Buhler 2001). Some simple examples of “tools” that are
not defined for real-valued sequences but are defined for symbolic approaches
include hashing, Markov models, suffix trees, decision trees, etc.
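As a concrete illustration of the kind of string machinery being alluded to, the short sketch below (our own example, not from the paper) hashes the k-grams of a toy symbolic series into a frequency table; such counts are the raw material of Markov models and suffix-based pattern queries, and have no direct counterpart for raw real-valued sequences.

```python
from collections import Counter

def kgram_counts(symbolic_series: str, k: int = 2) -> Counter:
    """Count every length-k substring (k-gram) of a symbolic series."""
    return Counter(symbolic_series[i:i + k]
                   for i in range(len(symbolic_series) - k + 1))

if __name__ == "__main__":
    s = "abcbccbaabccba"            # a toy symbolic series over the alphabet {a, b, c}
    print(kgram_counts(s, k=2))     # e.g. Counter({'bc': 3, 'cb': 3, 'ab': 2, ...})
```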
There is, however, a simple explanation for the data mining community’s
lack of interest in string manipulation as a supporting technique for mining
time series. If the data are transformed into virtually any of the other repre-
sentations depicted in Fig. 1, then it is possible to measure the similarity of two
time series in that representation space, such that the distance is guaranteed to
lower bound the true distance between the time series in the original space.¹
This simple fact is at the core of almost all algorithms in time series data mining
and indexing (Faloutsos et al. 1994). However, in spite of the fact that there are
dozens of techniques for producing different variants of the symbolic represen-
tation (Andre-Jonsson and Badal 1997; Daw et al. 2001; Huang and Yu 1999),
there is no known method for calculating the distance in the symbolic space,
while providing the lower bounding guarantee.
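To make the lower-bounding property concrete, the following sketch (our illustration, on arbitrary toy data) uses the Piecewise Aggregate Approximation (PAA) from Fig. 1, for which the scaled Euclidean distance between the reduced vectors is known to never exceed the Euclidean distance between the original series (Keogh et al. 2001a; Yi and Faloutsos 2000).

```python
import numpy as np

def paa(x: np.ndarray, w: int) -> np.ndarray:
    """Piecewise Aggregate Approximation: reduce n points to w segment means
    (for simplicity, n is assumed to be divisible by w)."""
    return x.reshape(w, -1).mean(axis=1)

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.sqrt(np.sum((a - b) ** 2)))

def paa_lower_bound(xa: np.ndarray, xb: np.ndarray, n: int) -> float:
    """Distance in the reduced space, scaled so it never exceeds the true distance."""
    w = len(xa)
    return float(np.sqrt(n / w) * euclidean(xa, xb))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, w = 128, 8
    q, c = rng.standard_normal(n), rng.standard_normal(n)
    d_true = euclidean(q, c)
    d_lb = paa_lower_bound(paa(q, w), paa(c, w), n)
    assert d_lb <= d_true           # the lower-bounding guarantee
    print(f"lower bound {d_lb:.3f} <= true distance {d_true:.3f}")
```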
In addition to allowing the creation of lower bounding distance measures,
there is one other highly desirable property of any time series representation,
¹ The exceptions are random mappings, which are only guaranteed to be within an epsilon of the true distance with a certain probability, trees, interpolation, and natural language.

[Figure 1 shows a tree rooted at Time Series Representations, divided into Data Adaptive approaches (Sorted Coefficients; Piecewise Polynomial: Piecewise Linear Approximation (Interpolation, Regression) and Adaptive Piecewise Constant Approximation; Singular Value Decomposition; Symbolic: Natural Language and Strings (Lower Bounding, Non-Lower Bounding); Trees) and Non Data Adaptive approaches (Wavelets: Orthonormal (Haar, Daubechies dbn n > 1) and Bi-Orthonormal (Coiflets, Symlets); Random Mappings; Spectral: Discrete Fourier Transform and Discrete Cosine Transform; Piecewise Aggregate Approximation).]
Fig. 1 A hierarchy of all the various time series representations in the literature. The leaf nodes refer to the actual representation, and the internal nodes refer to the classification of the approach. The contribution of this paper is to introduce a new representation, the lower bounding symbolic approach
including a symbolic one. Almost all time series datasets are very high dimen-
sional. This is a challenging fact because all non-trivial data mining and indexing
algorithms degrade exponentially with dimensionality. For example, above 16–
20 dimensions, index structures degrade to sequential scanning (Hellerstein
et al. 1997). None of the symbolic representations that we are aware of allow
dimensionality reduction (Andre-Jonsson and Badal 1997; Daw et al. 2001;
Huang and Yu 1999). There is some reduction in the storage space required,
since fewer bits are required for each value; however, the intrinsic dimension-
ality of the symbolic representation is the same as the original data.
There is no doubt that a new symbolic representation that remedies all
the problems mentioned above would be highly desirable. More specifically,
the symbolic representation should meet the following criteria: space effi-
ciency, time efficiency (fast indexing), and correctness of answer sets (no false
dismissals).
In this work we formally formulate a novel symbolic representation and
show its utility on other time series tasks.² Our representation is unique in
that it allows dimensionality/numerosity reduction, and it also allows distance
measures to be defined on the symbolic representation that lower bound corre-
sponding popular distance measures defined on the original data. As we shall
demonstrate, the latter feature is particularly exciting because it allows one
to run certain data mining algorithms on the efficiently manipulated symbolic
representation, while producing identical results to the algorithms that operate
on the original data. In particular, we will demonstrate the utility of our repre-
sentation on the classic data mining tasks of clustering (Kalpakis et al. 2001),
classification (Geurts 2001), indexing (Agrawal et al. 1995; Faloutsos et al. 1994;
Keogh et al. 2001a; Yi and Faloutsos 2000), and anomaly detection (Dasgupta
and Forrest 1999; Keogh et al. 2002; Shahabi et al. 2000).
The rest of this paper is organized as follows. Section 2 briefly discusses
background material on time series data mining and related work. Section 3
introduces our novel symbolic approach, and discusses its dimensionality reduc-
tion, numerosity reduction and lower bounding abilities. Section 4 contains an
experimental evaluation of the symbolic approach on a variety of data mining
tasks. Impact of the symbolic approach is also discussed. Finally, Section 5 offers
some conclusions and suggestions for future work.

² A preliminary version of this paper appears in Lin et al. (2003).
2 Background and related work
Time series data mining has attracted enormous attention in the last decade.
The review below is necessarily brief; we refer interested readers to (Keogh
and Kasetty 2002; Roddick et al. 2001) for a more in-depth review.
2.1 Time series data mining tasks
While making no pretence to be exhaustive, the following list summarizes the
areas that have seen the majority of research interest in time series data mining.
Indexing: Given a query time series Q, and some similarity/dissimilarity mea-
sure D(Q,C), find the most similar time series in database DB (Agrawal
et al. 1995; Chan and Fu 1999; Faloutsos et al. 1994; Keogh et al. 2001a; Yi
and Faloutsos 2000).
Clustering: Find natural groupings of the time series in database DB under
some similarity/dissimilarity measure D(Q,C) (Kalpakis et al. 2001; Keogh
and Pazzani 1998).
Classification: Given an unlabeled time series Q, assign it to one of two or
more predefined classes (Geurts 2001).
Summarization: Given a time series Q containing n datapoints where n is
an extremely large number, create a (possibly graphic) approximation of Q
which retains its essential features but fits on a single page, computer screen,
executive summary, etc. (Lin et al. 2002).
Anomaly detection: Given a time series Q, and some model of “normal”
behavior, find all sections of Q which contain anomalies or
“surprising/interesting/unexpected/novel” behavior (Dasgupta and Forrest
1999; Keogh et al. 2002; Shahabi et al. 2000).
Since the datasets encountered by data miners typically don’t fit in main mem-
ory, and disk I/O tends to be the bottleneck for any data mining task, a simple
generic framework for time series data mining has emerged (Faloutsos et al.
1994). The basic approach is outlined in Table 1.
It should be clear that the utility of this framework depends heavily on the
quality of the approximation created in Step 1. If the approximation is very
faithful to the original data, then the solution obtained in main memory is likely
to be the same as, or very close to, the solution we would have obtained on the
original data. The handful of disk accesses made in Step 3 to confirm or slightly
modify the solution will be inconsequential compared to the number of disk
accesses required had we worked on the original data. With this in mind, there
has been great interest in approximate representations of time series, which we
consider below.

Table 1 A generic time series data mining approach
1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest.
2. Approximately solve the task at hand in main memory.
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data.
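The sketch below (ours, and only one of many possible instantiations) shows how the framework of Table 1 might play out for an exact one-nearest-neighbour query: an in-memory PAA approximation and its lower-bounding distance act as the filter, and the original series on "disk" are consulted only while a candidate could still beat the best answer found so far, so no true match can be dismissed.

```python
import numpy as np

def paa(x: np.ndarray, w: int) -> np.ndarray:
    return x.reshape(w, -1).mean(axis=1)

def one_nearest_neighbor(query: np.ndarray, database: list, w: int = 8):
    """Exact 1-NN under Euclidean distance, following the framework of Table 1."""
    n = len(query)
    # Step 1: an approximation of the data that fits in main memory.
    approx_db = np.stack([paa(x, w) for x in database])
    q_approx = paa(query, w)
    # Step 2: solve the problem approximately, using a lower-bounding distance.
    lower_bounds = np.sqrt(n / w) * np.linalg.norm(approx_db - q_approx, axis=1)
    order = np.argsort(lower_bounds)        # most promising candidates first
    # Step 3: a (hopefully small) number of accesses to the original series.
    best_index, best_dist, accesses = -1, np.inf, 0
    for i in order:
        if lower_bounds[i] >= best_dist:    # nothing left can beat the best answer
            break
        d = np.linalg.norm(query - database[i])
        accesses += 1
        if d < best_dist:
            best_index, best_dist = int(i), float(d)
    return best_index, best_dist, accesses

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    db = [np.cumsum(rng.standard_normal(128)) for _ in range(1000)]  # random-walk data
    q = np.cumsum(rng.standard_normal(128))
    idx, dist, accesses = one_nearest_neighbor(q, db)
    print(f"best match {idx} at distance {dist:.2f}, raw series touched: {accesses}/1000")
```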
[Figure 2 shows four panels, each plotting a time series over 0–100 together with its approximation under one representation: Discrete Fourier Transform, Piecewise Linear Approximation, Haar Wavelet, and Adaptive Piecewise Constant Approximation.]
Fig. 2 The most common representations for time series data mining. Each can be visualized as an attempt to approximate the signal with a linear combination of basis functions
2.2 Time series representations
As with most problems in computer science, the suitable choice of representa-
tion greatly affects the ease and efficiency of time series data mining. With this
in mind, a great number of time series representations have been introduced,
including the Discrete Fourier Transform (DFT) (Faloutsos et al. 1994), the
Discrete Wavelet Transform (DWT) (Chan and Fu 1999), the Piecewise Linear
and Piecewise Constant models, including Piecewise Aggregate Approximation (PAA)
(Keogh et al. 2001a) and Adaptive Piecewise Constant Approximation (APCA)
(Geurts 2001; Keogh et al. 2001a), and Singular Value Decomposition (SVD)
(Keogh et al. 2001a). Figure 2 illustrates the most commonly used representations.
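The "linear combination of basis functions" view can be made concrete with the Discrete Fourier Transform; the brief sketch below (our example) keeps only a few low-frequency coefficients and reconstructs a smooth approximation of a noisy signal.

```python
import numpy as np

def dft_approximation(x: np.ndarray, n_coefficients: int = 8) -> np.ndarray:
    """Approximate x by keeping only its lowest-frequency Fourier coefficients."""
    coefficients = np.fft.rfft(x)
    coefficients[n_coefficients:] = 0        # drop the high-frequency terms
    return np.fft.irfft(coefficients, n=len(x))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    t = np.linspace(0, 1, 128, endpoint=False)
    x = np.sin(2 * np.pi * 3 * t) + 0.2 * rng.standard_normal(128)
    x_hat = dft_approximation(x, n_coefficients=8)
    print("approximation error:", round(float(np.linalg.norm(x - x_hat)), 3))
```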
Recent work suggests that there is little to choose between the above in
terms of indexing power (Keogh and Kasetty 2002); however, the representa-
tions have other features that may act as strengths or weaknesses. As a simple
example, wavelets have the useful multiresolution property, but are only defined
for time series that are an integer power of two in length (Chan and Fu 1999).
One important feature of all the above representations is that they are real
valued. This limits the algorithms, data structures and definitions available for
them. For example, in anomaly detection we cannot meaningfully define the
probability of observing any particular set of wavelet coefficients, since the
probability of observing any real number is zero (Larsen and Marx 1986). Such
limitations have led researchers to consider using a symbolic representation
of time series.
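A minimal illustration of the contrast (ours): over a finite alphabet, the probability of a particular symbol can be estimated directly from counts, whereas the analogous estimate for an exact real value is always zero.

```python
from collections import Counter

def symbol_probabilities(symbolic_series: str) -> dict:
    """Estimate P(symbol) from a discretized series by simple counting."""
    counts = Counter(symbolic_series)
    total = len(symbolic_series)
    return {symbol: count / total for symbol, count in counts.items()}

if __name__ == "__main__":
    s = "aabbbccbaabbcc"                     # a toy symbolic series
    print(symbol_probabilities(s))           # roughly {'a': 0.29, 'b': 0.43, 'c': 0.29}
```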

References
Berndt DJ, Clifford J (1994) Using dynamic time warping to find patterns in time series. In: AAAI-94 workshop on knowledge discovery in databases
Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press
Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. In: Proceedings of the ACM SIGMOD international conference on management of data
Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD workshop on research issues in data mining and knowledge discovery