
Data Min Knowl Disc (2007) 15:107–144
DOI 10.1007/s10618-007-0064-z
Experiencing SAX: a novel symbolic representation
of time series
Jessica Lin · Eamonn Keogh · Li Wei ·
Stefano Lonardi
Received: 15 June 2006 / Accepted: 10 January 2007 / Published online: 3 April 2007
© Springer Science+Business Media, LLC 2007
Abstract Many high level representations of time series have been proposed
for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise
polynomial models, etc. Many researchers have also considered symbolic rep-
resentations of time series, noting that such representations would potentially
allow researchers to avail themselves of the wealth of data structures and algorithms from
the text processing and bioinformatics communities. While many symbolic rep-
resentations of time series have been introduced over the past decades, they all
suffer from two fatal flaws. First, the dimensionality of the symbolic represen-
tation is the same as the original data, and virtually all data mining algorithms
scale poorly with dimensionality. Second, although distance measures can be
defined on the symbolic approaches, these distance measures have little corre-
lation with distance measures defined on the original time series.
In this work we formulate a new symbolic representation of time series. Our
representation is unique in that it allows dimensionality/numerosity reduction,
Responsible editor: Johannes Gehrke.
J. Lin (✉)
Information and Software Engineering Department, George Mason University, Fairfax,
VA 22030, USA
e-mail: jessica@ise.gmu.edu
E. Keogh · L. Wei · S. Lonardi
Computer Science & Engineering Department, University of California-Riverside, Riverside,
CA 92521, USA
e-mail: eamonn@cs.ucr.edu
L. Wei
e-mail: wli@cs.ucr.edu
S. Lonardi
e-mail: stelo@cs.ucr.edu

and it also allows distance measures to be defined on the symbolic approach that
lower bound corresponding distance measures defined on the original series.
As we shall demonstrate, this latter feature is particularly exciting because it
allows one to run certain data mining algorithms on the efficiently manipulated
symbolic representation, while producing identical results to the algorithms
that operate on the original data. In particular, we will demonstrate the utility
of our representation on various data mining tasks of clustering, classification,
query by content, anomaly detection, motif discovery, and visualization.
Keywords Time series · Data mining · Symbolic representation · Discretize
1 Introduction
Many high level representations of time series have been proposed for data min-
ing. Figure 1 illustrates a hierarchy of all the various time series representations
in the literature (Andre-Jonsson and Badal 1997; Chan and Fu 1999; Faloutsos
et al. 1994; Geurts 2001; Huang and Yu 1999; Keogh et al. 2001a; Keogh and
Pazzani 1998; Roddick et al. 2001; Shahabi et al. 2000; Yi and Faloutsos 2000).
One representation that the data mining community has not considered in detail
is the discretization of the original data into symbolic strings. At first glance this
seems a surprising oversight. There is an enormous wealth of existing algorithms
and data structures that allow the efficient manipulations of strings. Such algo-
rithms have received decades of attention in the text retrieval community, and
more recent attention from the bioinformatics community (Apostolico et al.
2002; Durbin et al. 1998; Gionis and Mannila 2003; Reinert et al. 2000; Staden
et al. 1989; Tompa and Buhler 2001). Some simple examples of “tools” that are
not defined for real-valued sequences but are defined for symbolic approaches
include hashing, Markov models, suffix trees, decision trees, etc.
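As a concrete illustration of the kind of string machinery being alluded to, the short sketch below (our own example, not from the paper) hashes the k-grams of a toy symbolic series into a frequency table; such counts are the raw material of Markov models and suffix-based pattern queries, and have no direct counterpart for raw real-valued sequences.

```python
from collections import Counter

def kgram_counts(symbolic_series: str, k: int = 2) -> Counter:
    """Count every length-k substring (k-gram) of a symbolic series."""
    return Counter(symbolic_series[i:i + k]
                   for i in range(len(symbolic_series) - k + 1))

if __name__ == "__main__":
    s = "abcbccbaabccba"            # a toy symbolic series over the alphabet {a, b, c}
    print(kgram_counts(s, k=2))     # e.g. Counter({'bc': 3, 'cb': 3, 'ab': 2, ...})
```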
There is, however, a simple explanation for the data mining community’s
lack of interest in string manipulation as a supporting technique for mining
time series. If the data are transformed into virtually any of the other repre-
sentations depicted in Fig. 1, then it is possible to measure the similarity of two
time series in that representation space, such that the distance is guaranteed to
lower bound the true distance between the time series in the original space.¹
This simple fact is at the core of almost all algorithms in time series data mining
and indexing (Faloutsos et al. 1994). However, in spite of the fact that there are
dozens of techniques for producing different variants of the symbolic represen-
tation (Andre-Jonsson and Badal 1997; Daw et al. 2001; Huang and Yu 1999),
there is no known method for calculating the distance in the symbolic space,
while providing the lower bounding guarantee.
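To make the lower-bounding property concrete, the following sketch (our illustration, on arbitrary toy data) uses the Piecewise Aggregate Approximation (PAA) from Fig. 1, for which the scaled Euclidean distance between the reduced vectors is known to never exceed the Euclidean distance between the original series (Keogh et al. 2001a; Yi and Faloutsos 2000).

```python
import numpy as np

def paa(x: np.ndarray, w: int) -> np.ndarray:
    """Piecewise Aggregate Approximation: reduce n points to w segment means
    (for simplicity, n is assumed to be divisible by w)."""
    return x.reshape(w, -1).mean(axis=1)

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.sqrt(np.sum((a - b) ** 2)))

def paa_lower_bound(xa: np.ndarray, xb: np.ndarray, n: int) -> float:
    """Distance in the reduced space, scaled so it never exceeds the true distance."""
    w = len(xa)
    return float(np.sqrt(n / w) * euclidean(xa, xb))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, w = 128, 8
    q, c = rng.standard_normal(n), rng.standard_normal(n)
    d_true = euclidean(q, c)
    d_lb = paa_lower_bound(paa(q, w), paa(c, w), n)
    assert d_lb <= d_true           # the lower-bounding guarantee
    print(f"lower bound {d_lb:.3f} <= true distance {d_true:.3f}")
```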
In addition to allowing the creation of lower bounding distance measures,
there is one other highly desirable property of any time series representation,
¹ The exceptions are random mappings, which are only guaranteed to be within an epsilon of the true distance with a certain probability, trees, interpolation, and natural language.

[Figure 1 shows a tree rooted at Time Series Representations, divided into Data Adaptive approaches (Sorted Coefficients; Piecewise Polynomial: Piecewise Linear Approximation (Interpolation, Regression) and Adaptive Piecewise Constant Approximation; Singular Value Decomposition; Symbolic: Natural Language and Strings (Lower Bounding, Non-Lower Bounding); Trees) and Non Data Adaptive approaches (Wavelets: Orthonormal (Haar, Daubechies dbn n > 1) and Bi-Orthonormal (Coiflets, Symlets); Random Mappings; Spectral: Discrete Fourier Transform and Discrete Cosine Transform; Piecewise Aggregate Approximation).]
Fig. 1 A hierarchy of all the various time series representations in the literature. The leaf nodes refer to the actual representation, and the internal nodes refer to the classification of the approach. The contribution of this paper is to introduce a new representation, the lower bounding symbolic approach
including a symbolic one. Almost all time series datasets are very high dimen-
sional. This is a challenging fact because all non-trivial data mining and indexing
algorithms degrade exponentially with dimensionality. For example, above 16–
20 dimensions, index structures degrade to sequential scanning (Hellerstein
et al. 1997). None of the symbolic representations that we are aware of allow
dimensionality reduction (Andre-Jonsson and Badal 1997; Daw et al. 2001;
Huang and Yu 1999). There is some reduction in the storage space required,
since fewer bits are required for each value; however, the intrinsic dimension-
ality of the symbolic representation is the same as the original data.
There is no doubt that a new symbolic representation that remedies all
the problems mentioned above would be highly desirable. More specifically,
the symbolic representation should meet the following criteria: space effi-
ciency, time efficiency (fast indexing), and correctness of answer sets (no false
dismissals).
In this work we formally formulate a novel symbolic representation and
show its utility on other time series tasks.² Our representation is unique in
that it allows dimensionality/numerosity reduction, and it also allows distance
measures to be defined on the symbolic representation that lower bound corre-
sponding popular distance measures defined on the original data. As we shall
demonstrate, the latter feature is particularly exciting because it allows one
to run certain data mining algorithms on the efficiently manipulated symbolic
representation, while producing identical results to the algorithms that operate
on the original data. In particular, we will demonstrate the utility of our repre-
sentation on the classic data mining tasks of clustering (Kalpakis et al. 2001),
classification (Geurts 2001), indexing (Agrawal et al. 1995; Faloutsos et al. 1994;
Keogh et al. 2001a; Yi and Faloutsos 2000), and anomaly detection (Dasgupta
and Forrest 1999; Keogh et al. 2002; Shahabi et al. 2000).
The rest of this paper is organized as follows. Section 2 briefly discusses
background material on time series data mining and related work. Section 3
introduces our novel symbolic approach, and discusses its dimensionality reduc-
tion, numerosity reduction and lower bounding abilities. Section 4 contains an
experimental evaluation of the symbolic approach on a variety of data mining
tasks. Impact of the symbolic approach is also discussed. Finally, Section 5 offers
some conclusions and suggestions for future work.

² A preliminary version of this paper appears in Lin et al. (2003).
2 Background and related work
Time series data mining has attracted enormous attention in the last decade.
The review below is necessarily brief; we refer interested readers to (Keogh
and Kasetty 2002; Roddick et al. 2001) for a more in-depth review.
2.1 Time series data mining tasks
While making no pretence to be exhaustive, the following list summarizes the
areas that have seen the majority of research interest in time series data mining.
Indexing: Given a query time series Q, and some similarity/dissimilarity mea-
sure D(Q,C), find the most similar time series in database DB (Agrawal
et al. 1995; Chan and Fu 1999; Faloutsos et al. 1994; Keogh et al. 2001a; Yi
and Faloutsos 2000).
Clustering: Find natural groupings of the time series in database DB under
some similarity/dissimilarity measure D(Q,C) (Kalpakis et al. 2001; Keogh
and Pazzani 1998).
Classification: Given an unlabeled time series Q, assign it to one of two or
more predefined classes (Geurts 2001).
Summarization: Given a time series Q containing n datapoints where n is
an extremely large number, create a (possibly graphic) approximation of Q
which retains its essential features but fits on a single page, computer screen,
executive summary, etc. (Lin et al. 2002).
Anomaly detection: Given a time series Q, and some model of “normal”
behavior, find all sections of Q which contain anomalies or
“surprising/interesting/unexpected/novel” behavior (Dasgupta and Forrest
1999; Keogh et al. 2002; Shahabi et al. 2000).
Since the datasets encountered by data miners typically don’t fit in main mem-
ory, and disk I/O tends to be the bottleneck for any data mining task, a simple
generic framework for time series data mining has emerged (Faloutsos et al.
1994). The basic approach is outlined in Table 1.
It should be clear that the utility of this framework depends heavily on the
quality of the approximation created in Step 1. If the approximation is very
faithful to the original data, then the solution obtained in main memory is likely
to be the same as, or very close to, the solution we would have obtained on the
original data. The handful of disk accesses made in Step 3 to confirm or slightly
modify the solution will be inconsequential compared to the number of disk
accesses required had we worked on the original data. With this in mind, there
has been great interest in approximate representations of time series, which we
consider below.

Table 1 A generic time series data mining approach
1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest.
2. Approximately solve the task at hand in main memory.
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data.
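The sketch below (ours, and only one of many possible instantiations) shows how the framework of Table 1 might play out for an exact one-nearest-neighbour query: an in-memory PAA approximation and its lower-bounding distance act as the filter, and the original series on "disk" are consulted only while a candidate could still beat the best answer found so far, so no true match can be dismissed.

```python
import numpy as np

def paa(x: np.ndarray, w: int) -> np.ndarray:
    return x.reshape(w, -1).mean(axis=1)

def one_nearest_neighbor(query: np.ndarray, database: list, w: int = 8):
    """Exact 1-NN under Euclidean distance, following the framework of Table 1."""
    n = len(query)
    # Step 1: an approximation of the data that fits in main memory.
    approx_db = np.stack([paa(x, w) for x in database])
    q_approx = paa(query, w)
    # Step 2: solve the problem approximately, using a lower-bounding distance.
    lower_bounds = np.sqrt(n / w) * np.linalg.norm(approx_db - q_approx, axis=1)
    order = np.argsort(lower_bounds)        # most promising candidates first
    # Step 3: a (hopefully small) number of accesses to the original series.
    best_index, best_dist, accesses = -1, np.inf, 0
    for i in order:
        if lower_bounds[i] >= best_dist:    # nothing left can beat the best answer
            break
        d = np.linalg.norm(query - database[i])
        accesses += 1
        if d < best_dist:
            best_index, best_dist = int(i), float(d)
    return best_index, best_dist, accesses

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    db = [np.cumsum(rng.standard_normal(128)) for _ in range(1000)]  # random-walk data
    q = np.cumsum(rng.standard_normal(128))
    idx, dist, accesses = one_nearest_neighbor(q, db)
    print(f"best match {idx} at distance {dist:.2f}, raw series touched: {accesses}/1000")
```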
[Figure 2 shows four panels, each plotting a time series over 0–100 together with its approximation under one representation: Discrete Fourier Transform, Piecewise Linear Approximation, Haar Wavelet, and Adaptive Piecewise Constant Approximation.]
Fig. 2 The most common representations for time series data mining. Each can be visualized as an attempt to approximate the signal with a linear combination of basis functions
2.2 Time series representations
As with most problems in computer science, the suitable choice of representa-
tion greatly affects the ease and efficiency of time series data mining. With this
in mind, a great number of time series representations have been introduced,
including the Discrete Fourier Transform (DFT) (Faloutsos et al. 1994), the
Discrete Wavelet Transform (DWT) (Chan and Fu 1999), the Piecewise Linear
and Piecewise Constant models, including Piecewise Aggregate Approximation (PAA)
(Keogh et al. 2001a) and Adaptive Piecewise Constant Approximation (APCA)
(Geurts 2001; Keogh et al. 2001a), and Singular Value Decomposition (SVD)
(Keogh et al. 2001a). Figure 2 illustrates the most commonly used representations.
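The "linear combination of basis functions" view can be made concrete with the Discrete Fourier Transform; the brief sketch below (our example) keeps only a few low-frequency coefficients and reconstructs a smooth approximation of a noisy signal.

```python
import numpy as np

def dft_approximation(x: np.ndarray, n_coefficients: int = 8) -> np.ndarray:
    """Approximate x by keeping only its lowest-frequency Fourier coefficients."""
    coefficients = np.fft.rfft(x)
    coefficients[n_coefficients:] = 0        # drop the high-frequency terms
    return np.fft.irfft(coefficients, n=len(x))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    t = np.linspace(0, 1, 128, endpoint=False)
    x = np.sin(2 * np.pi * 3 * t) + 0.2 * rng.standard_normal(128)
    x_hat = dft_approximation(x, n_coefficients=8)
    print("approximation error:", round(float(np.linalg.norm(x - x_hat)), 3))
```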
Recent work suggests that there is little to choose between the above in
terms of indexing power (Keogh and Kasetty 2002); however, the representa-
tions have other features that may act as strengths or weaknesses. As a simple
example, wavelets have the useful multiresolution property, but are only defined
for time series that are an integer power of two in length (Chan and Fu 1999).
One important feature of all the above representations is that they are real
valued. This limits the algorithms, data structures and definitions available for
them. For example, in anomaly detection we cannot meaningfully define the
probability of observing any particular set of wavelet coefficients, since the
probability of observing any real number is zero (Larsen and Marx 1986). Such
limitations have led researchers to consider using a symbolic representation
of time series.
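A minimal illustration of the contrast (ours): over a finite alphabet, the probability of a particular symbol can be estimated directly from counts, whereas the analogous estimate for an exact real value is always zero.

```python
from collections import Counter

def symbol_probabilities(symbolic_series: str) -> dict:
    """Estimate P(symbol) from a discretized series by simple counting."""
    counts = Counter(symbolic_series)
    total = len(symbolic_series)
    return {symbol: count / total for symbol, count in counts.items()}

if __name__ == "__main__":
    s = "aabbbccbaabbcc"                     # a toy symbolic series
    print(symbol_probabilities(s))           # roughly {'a': 0.29, 'b': 0.43, 'c': 0.29}
```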

References
Berndt DJ, Clifford J (1994) Using dynamic time warping to find patterns in time series. In: AAAI-94 workshop on knowledge discovery in databases
Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press
Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. In: Proceedings of the ACM SIGMOD international conference on management of data
Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD workshop on research issues in data mining and knowledge discovery