Proceedings ArticleDOI

Efficient time series matching by wavelets

23 Mar 1999 - pp. 126-133
TL;DR: This paper proposes to use Haar Wavelet Transform for time series indexing and shows that Haar transform can outperform DFT through experiments, and proposes a two-phase method for efficient n-nearest neighbor query in time series databases.
Abstract: Time series stored as feature vectors can be indexed by multidimensional index trees like R-Trees for fast retrieval. Due to the dimensionality curse problem, transformations are applied to time series to reduce the number of dimensions of the feature vectors. Different transformations like Discrete Fourier Transform (DFT), Discrete Wavelet Transform (DWT), Karhunen-Loeve (K-L) transform or Singular Value Decomposition (SVD) can be applied. While the use of DFT and K-L transform or SVD has been studied in the literature, to our knowledge, there is no in-depth study on the application of DWT. In this paper we propose to use Haar Wavelet Transform for time series indexing. The major contributions are: (1) we show that Euclidean distance is preserved in the Haar transformed domain and no false dismissal will occur, (2) we show that Haar transform can outperform DFT through experiments, (3) a new similarity model is suggested to accommodate vertical shift of time series, and (4) a two-phase method is proposed for efficient n-nearest neighbor query in time series databases.

Summary (3 min read)

1. Introduction

  • Time series data are of growing importance in many new database applications, such as data warehousing and data mining [3, 8, 2, 12].
  • Indexing is used to support efficient retrieval and matching of time series.
  • The authors propose to apply this technique in time series for dimension reduction and content-based search.
  • Moreover, the authors show by experiments that Haar Wavelet Transform [9], which is a commonly used wavelet transform, can outperform DFT significantly.

2.1. Wavelet Transform

  • Wavelets are basis functions used in representing data or other functions.
  • Wavelet algorithms process data at different scales or resolutions in contrast with DFT where only frequency components are considered.
  • The origin of wavelets can be traced to the work of Karl Weierstrass [27] in 1873.
  • Haar basis is still a foundation of modern wavelet theory.
  • Another significant advance is the introduction of a nonorthogonal basis by Dennis Gabor in 1946 [16].

3. The Proposed Approach

  • Following a trend in the disciplines of signal and image processing, the authors propose to study the use of wavelet transformation for the time series indexing problem.
  • This definition can give a better estimation of the similarity between two time sequences with similar trends running at two completely different levels.

3.2. DFT versus Haar Transform

  • The authors' motivation for using Haar transform to replace DFT is based on several pieces of evidence and observations, some of which are also the reasons why wavelet transforms are considered instead of DFT in areas of image and signal processing.
  • Apart from Euclidean distance, their model can easily accommodate v-shift similarity of two time sequences (Definition 2) at a little more cost.
  • That is, the situation where vertically shifted signals can match is accommodated.
  • On the other hand, previous study on F-index did not make use of this similarity model.

3.3. Guarantee of no False Dismissal

  • For FT and DFT, it is shown by Parseval’s Theorem [23] that the energy of a signal is conserved in both the time and frequency domains.
  • Parseval’s Theorem also shows that this is true for wavelet transforms.
  • The expression of the Euclidean distance between time sequences in terms of their Haar coefficients is not sufficient for proper use in multi-dimensional index trees unless Euclidean distance is preserved in both the Haar and time domains, as it is for DFT in Equation (1).

4. The Overall Strategy

  • The authors present the overall strategy of their time series matching method and propose their own method for nearest neighbor query.
  • Before querying is performed, the authors shall do some pre-processing to extract the feature vectors with reduced dimensionality, and to build the index.
  • After the index is built, content-based search can be performed for two types of querying: range querying and n-nearest-neighbors querying.

4.1. Pre-processing

  • Step 1 - Similarity Model Selection: According to their applications users may choose to use either the simple Euclidean distance (Definition 1) or the v-shift similarity (Definition 2) as their similarity measurements.
  • For nearest neighbor query, the authors propose a two-phase evaluation as follows.
  • The effectiveness of this n-nearest neighbor search algorithm arises from the value found in Phase 1, which provides a sufficiently small query range to prune out a large number of candidates for Phase 2.

5. Performance Evaluation

  • Experiments using real stock data and synthetic random walk data have been carried out.
  • The enhancement in precision of Haar transform over DFT increases with the number of dimensions.
  • In Figure 4, the precision of Haar and Haar(V-shift) is shown.
  • The time series of financial data consist of a sequence of values fluctuating around a relatively constant level, which is the average value of that time sequence.
  • This agrees with the result depicted in Figure 5, where the page accesses of the best dimensions of DFT (dim. 5), Haar (dim. 7), and Haar(V-shift) (dim. 10) are shown.

5.1. Scalability Test

  • The authors study the scalability of their method by varying the size or length of synthetic time series database.
  • Different sizes of databases (5k to 30k) and different lengths of sequences (256 to 2048) are generated as described in the previous section separately.
  • Haar and Haar(V-shift) scale better than DFT as database size and sequence length increase.
  • Similar results have also been recorded for range queries.
  • The poorer precision of DFT creates more work in the post-processing step and this affects the overall performance, especially in terms of the amount of disk accesses for large databases with long sequences.

6. Conclusion

  • An efficient time series matching technique through dimension reduction by Haar Wavelet Transform is proposed.
  • The first few coefficients of the transformed sequences are indexed in an R-Tree for similarity search.
  • Experiments show that their method outperforms the F-index (Discrete Fourier Transform) method in terms of pruning power, number of page accesses, scalability, and complexity.
  • The authors can also try to apply wavelets that did not work well with stock data to other signals, e.g. sinusoidal signals and electrocardiographs (ECGs).


Efficient Time Series Matching by Wavelets
Kin-pong Chan and Ada Wai-chee Fu
Department of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, Hong Kong
{kpchan, adafu}@cse.cuhk.edu.hk
Abstract
Time series stored as feature vectors can be indexed by multi-
dimensional index trees like R-Trees for fast retrieval. Due to
the dimensionality curse problem, transformations are applied to
time series to reduce the number of dimensions of the feature vec-
tors. Different transformations like Discrete Fourier Transform
(DFT), Discrete Wavelet Transform (DWT), Karhunen-Loeve (K-
L) transform or Singular Value Decomposition (SVD) can be ap-
plied. While the use of DFT and K-L transform or SVD has been
studied in the literature, to our knowledge, there is no in-depth
study on the application of DWT. In this paper, we propose to use
Haar Wavelet Transform for time series indexing. The major con-
tributions are: (1) we show that Euclidean distance is preserved
in the Haar transformed domain and no false dismissal will occur,
(2) we show that Haar transform can outperform DFT through
experiments, (3) a new similarity model is suggested to accom-
modate vertical shift of time series, and (4) a two-phase method
is proposed for efficient n-nearest neighbor query in time series
databases.
1. Introduction
Time series data are of growing importance in many new
database applications, such as data warehousing and data mining
[3, 8, 2, 12]. A time series (or time sequence) is a sequence of
real numbers, each number representing a value at a time point.
Typical examples include stock prices or currency exchange rates,
biomedical measurements, weather data, etc., collected over
time. Therefore, time series databases supporting fast retrieval of
time series data and similarity queries are desired.
In order to depict the similarity between two time series,
we define a similarity measurement during the matching process.
Given two time series x = (x_0, x_1, ..., x_{n-1}) and
y = (y_0, y_1, ..., y_{n-1}), a standard approach is to compute
the Euclidean distance D(x, y) between time series x and y:

    D(x, y) = ( Σ_{i=0}^{n-1} (y_i − x_i)² )^{1/2}

By using this similarity model, we can retrieve similar time series
by considering distance D(x, y).
Indexing is used to support efficient retrieval and matching of
time series. Some important factors have to be considered: The
first factor is dimensionality reduction. Many multi-dimensional
indexing methods [13, 7, 5, 20] such as the R-Tree and R*-Tree
[20, 5, 11] scale exponentially for high dimensionalities, eventu-
ally reducing the performance to that of sequential scanning or
worse. Hence, transformation is applied to map the time sequences
to a new feature space of a lower dimensionality. Next we must
ensure completeness and effectiveness when the number of dimensions
is reduced. To avoid missing any qualifying object, the Euclidean
distance in the reduced k-dimensional space should be less
than or equal to the Euclidean distance between the two original
time sequences. Finally, we must also consider the nature of data
series since the effectiveness of power concentration of a partic-
ular transformation depends on the nature of the time series. It
is believed that only brown noise or random walks exist in real
signals. In particular, stock movements and exchange rates can be
modeled successfully as random walks in [10], for which a skewed
energy spectrum can be obtained.
Discrete Fourier Transform (DFT) has been one of the most
commonly used techniques. One problem with DFT is that it
misses the important feature of time localization. Piecewise
Fourier Transform has been proposed to mitigate this problem, but
the size of the pieces leads to other problems. While large pieces
reduce the power of multi-resolution, small pieces have weaknesses
in modeling low frequencies.
Wavelet Transform (WT), or Discrete Wavelet Transform
(DWT) [9, 18] has been found to be effective in replacing DFT
in many applications in computer graphics, image [26], speech [1],
and signal processing [6, 4]. We propose to apply this technique
in time series for dimension reduction and content-based search.
DWT is a discrete version of WT for numerical signals. Although
the potential application of DWT in this problem was pointed out
in [22], no further investigation has been reported to our knowl-
edge. Hence, it is of value to conduct studies and evaluations on
time series retrieval and matching by means of wavelets.
The advantage of using DWT is the multi-resolution representation
of signals. It has the time-frequency localization property. Thus,
DWT is able to give locations in both time and frequency. There-
fore, wavelet representations of signals bear more information than
those of DFT, in which only frequencies are considered. While DFT
extracts the lower harmonics which represent the general shape of

a time sequence, DWT encodes a coarser resolution of the origi-
nal time sequence with its preceding coefficients. We show that
Euclidean distance is preserved in the Haar transformed domain.
Moreover, we show by experiments that Haar Wavelet Transform
[9], which is a commonly used wavelet transform, can outper-
form DFT significantly.
We also suggest a similarity definition to handle the problem of
vertical shifts of time series. Finally we propose an algorithm on
n-nearest neighbor query for the proposed wavelet method. The
algorithm makes use of the range query and dynamically adjusts
the range by the property of Euclidean distance preservation of the
wavelet transformation.
2. Related Work
Discrete Fourier Transform (DFT) is often used for dimension
reduction [2, 15] to achieve efficient indexing. An index built by
means of DFT is also called an F-index [2]. Suppose the DFT of
a time sequence x is denoted by X. For many applications such as
stock data, the low frequency components are located at the preceding
coefficients of X, which represent the general trend of the
time sequence x. These coefficients can be indexed in an R-Tree
or R*-Tree for fast retrieval. In most previous works, range query-
ing is considered. A range query (or epsilon query) evaluation
returns sequences with Euclidean distance within ε from the query
point.
Parseval’s Theorem [23] shows that the Euclidean distance between
two signals x and y in the time domain is the same as their
Euclidean distance in the frequency domain:

    D(x, y) = D(X, Y)        (1)
Therefore, F-index may raise false alarms, but guarantees no false
dismissal. After a range query in the F-index, false alarms are fil-
tered by checking against the query sequence in the original time
domain in a post-processing step. F-index is further generalized
and subsequence matching is proposed in [15]. This is called the
ST-index which permits sequence query of varying length. Each
time sequence is broken up into pieces of subsequences by a sliding
window with a fixed length w for DFT. Feature points in nearby
offsets will form a trail due to the effect of stepwise sliding win-
dow; the minimum bounding rectangle (MBR) of a trail is then
indexed in an R-Tree instead of the feature points themselves.
When a query arrives, all MBRs that intersect the query region are
retrieved and their trails are matched.
New similarity models are applied to F-index based time se-
ries matching in [24]. It achieves time warping, moving average,
and reversing by applying transformations to feature points in the
frequencydomain. Given a query
, a new index is built by apply-
ing a transformation to all points in the original index and feature
points with a distance less than
from
are returned. However, a
lot of computations are involved in building the new index. which
has a great impact on the actual query performance.
In the above works, no efficient method for nearest neighbor
query, which can be more useful than range query, has been pro-
posed.
We shall use Haar wavelet transform and DWT interchangeably
throughout this paper, unless specified particularly.
Another method that has been employed for dimension reduc-
tion is Karhunen-Loeve (K-L) transform [28]. (This method is
also known as Singular Value Decomposition (SVD) [22], and
is called Principal Component Analysis in the statistical literature.)
Given a collection of n-dimensional points, we project them on a
k-dimensional sub-space where k < n, maximizing the variances
in the chosen dimensions. The key weakness of K-L transform is
the deterioration of performance upon incremental update of the
index. Therefore, a new projection matrix should be re-calculated
and the index tree has to be re-organized periodically to keep up
the search performance.
2.1. Wavelet Transform
Wavelets are basis functions used in representing data or other
functions. Wavelet algorithms process data at different scales or
resolutions in contrast with DFT where only frequency compo-
nents are considered. The origin of wavelets can be traced to the
work of Karl Weierstrass [27] in 1873. The construction of the
first orthonormal system by Haar [21] is an important milestone.
Haar basis is still a foundation of modern wavelet theory. Another
significant advance is the introduction of a nonorthogonal basis by
Dennis Gabor in 1946 [16]. In this work we shall advocate the use
of the Haar wavelets in the problem of time series retrieval.
3. The Proposed Approach
Following a trend in the disciplines of signal and image pro-
cessing, we propose to study the use of wavelet transformation for
the time series indexing problem. Before we go into the details of
our proposed techniques, we would first like to define the similarity
model used in sequence matching. The first definition is based
on the Euclidean distance D(x, y) between time sequences x and y.
Definition 1 Given a threshold ε, two time sequences x and y of
equal length n are said to be similar if

    D(x, y) = ( Σ_{i=0}^{n-1} (y_i − x_i)² )^{1/2} ≤ ε
A shortcoming of Definition 1 is demonstrated in Figure 1.
From human interpretation, x and y may be quite similar because x
can be shifted up vertically to obtain y, or vice versa. However,
they will be considered not similar by Definition 1 because errors
are accumulated at each pair of x_i and y_i. Therefore, we suggest
another similarity model.
Definition 2 Given a threshold ε, two time sequences x and y of
equal length n are said to be v-shift similar if

    ( Σ_{i=0}^{n-1} ( (y_i − x_i) − (ȳ − x̄) )² )^{1/2} ≤ ε

where x̄ = (1/n) Σ_{i=0}^{n-1} x_i and ȳ = (1/n) Σ_{i=0}^{n-1} y_i.

From Definition 2, any two time sequences are said to be v-shift
similar if the Euclidean distance is less than or equal to a threshold
ε, neglecting their vertical offsets from the x-axis. This definition
can give a better estimation of the similarity between two time
sequences with similar trends running at two completely different
levels.
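The two similarity measures can be sketched in a few lines (a minimal reconstruction assuming plain Python lists; the function names are ours, not from the paper):

```python
import math

def euclidean(x, y):
    # Definition 1: plain Euclidean distance between equal-length series.
    return math.sqrt(sum((yi - xi) ** 2 for xi, yi in zip(x, y)))

def v_shift_dist(x, y):
    # Definition 2: remove each series' mean before comparing, so two
    # series with the same shape at different vertical levels can match.
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    return math.sqrt(sum(((yi - xi) - (y_bar - x_bar)) ** 2
                         for xi, yi in zip(x, y)))

x = [1.0, 2.0, 3.0, 2.0]
y = [11.0, 12.0, 13.0, 12.0]   # x shifted up by 10
```

Here euclidean(x, y) is 20 while v_shift_dist(x, y) is 0, so the vertically shifted pair matches under Definition 2 for any ε > 0.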
Figure 1. Example of vertical shifts of time sequences
3.1. Haar Wavelets
We want to have a decomposition that is fast to compute and
requires little storage for each sequence. The Haar wavelet is cho-
sen for the following reasons: (1) it allows good approximation
with a subset of coefficients, (2) it can be computed quickly and
easily, requiring linear time in the length of the sequence and sim-
ple coding, and (3) it preserves Euclidean distance (see Section
3.3). The formal definition of Haar wavelets is given in Appendix
A. Concrete mathematical foundations can be found in [9, 19] and
related implementations in [14].
Haar transform can be seen as a series of averaging and differencing
operations on a discrete time function. We compute the average
and difference between every two adjacent values of x(t).
The procedure to find the Haar transform of a discrete function
x(t) = (9 7 3 5) is shown below.
Resolution Averages Coefficients
4 (9 7 3 5)
2 (8 4) (1 -1)
1 (6) (2)
Resolution 4 is the full resolution of the discrete function x(t).
In resolution 2, (8 4) are obtained by taking the average of (9 7)
and (3 5) at resolution 4 respectively. (1 -1) are the differences
of (9 7) and (3 5) divided by two respectively. This process is
continued until a resolution of 1 is reached. The Haar transform
H(x(t)) = (6 2 1 -1) is obtained, which is composed
of the last average value 6 and the coefficients found on the right-most
column: 2, 1 and -1. It should be pointed out that 6 is the
overall average value of the whole time sequence, which is equal
to (9 + 7 + 3 + 5)/4. Different resolutions can be obtained
by adding difference values back to, or subtracting differences from,
averages. For instance, (8 4) = (6+2 6-2), where 6 and 2 are the
first and second coefficients respectively. This process can be done
recursively until the full resolution is reached.
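The averaging-and-differencing procedure above can be sketched as follows (a reconstruction assuming sequences whose length is a power of 2; the function name is ours):

```python
def haar(seq):
    """Unnormalized Haar transform by repeated averaging and differencing.

    Each pass replaces adjacent pairs (a, b) by their average (a + b)/2
    and half-difference (a - b)/2; the averages are recursed on and the
    differences are kept as detail coefficients.
    """
    coeffs = []
    while len(seq) > 1:
        pairs = list(zip(seq[0::2], seq[1::2]))
        details = [(a - b) / 2 for a, b in pairs]
        coeffs = details + coeffs      # finer-level details go further right
        seq = [(a + b) / 2 for a, b in pairs]
    return seq + coeffs                # overall average comes first
```

For the running example, haar([9, 7, 3, 5]) returns [6.0, 2.0, 1.0, -1.0], matching H(x(t)) = (6 2 1 -1) above; each pass halves the work, which is the O(n) total cost shown in Lemma 1.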
Haar transform can be realized by a series of matrix multiplications,
as illustrated in Equation (2). Envisioning the example input
signal x as a column vector with length n = 4, an intermediate
transform vector w as another column vector, and a Haar transform
matrix M:

    w = M x,    M = (1/2) |  1  1  0  0 |
                          |  1 -1  0  0 |        (2)
                          |  0  0  1  1 |
                          |  0  0  1 -1 |

The factor 1/2 associated with the Haar transform matrix can be
varied according to different normalization conditions (the
normalization is described in Section 3.3). After the
first multiplication of M and x, half of the Haar transform coefficients
can be found, which are the differences 1 and -1 in w, interleaving with
some intermediate average coefficients 8 and 4. Actually, 1 and -1 are
the last two coefficients of the Haar transform. 8 and 4 are then
extracted from w and put into a new column vector u = [8 4 0 0]^T.
u is treated as the new input vector for transformation. This
process is done recursively until one element is left in u. In this
particular case, 6 and 2 can be found in the second iteration.
The complexity of Haar transform can be evaluated by considering
the number of operations involved in the recursion process.

Lemma 1 Given a time sequence of length n, where n is an integral
power of 2, the complexity of Haar transform is O(n).

Proof: There are totally n matrix additions or subtractions in the
first iteration of matrix operation. The size of the input vector is
halved in each iteration onwards. The total number of operations
is formulated as

    n + n/2 + n/4 + ... + 2 = 2n − 2

which is bounded by O(n). ∎
3.2. DFT versus Haar Transform
Our motivation of using Haar transform to replace DFT is
based on several evidences and observations, some of which are
also the reasonswhy the use of wavelet transforms instead of DFT
is considered in areas of image and signal processing.
The first reason is on the pruning power. The nature of the
Euclidean distance preserved by Haar transform and DFT are dif-
ferent. In DFT, comparison of two time sequences is based on
their low frequency components, where most energy is presumed
to be concentrated. On the other hand, the comparison of Haar
coefficients is matching a gradually refined resolution of the two
time sequences. From intuition, Euclidean distance can be highly
related to low resolution of signal rather than low frequency com-
ponents. This property can give rise to more effective pruning, i.e.
less false alarms will appear, which is confirmed by experiments
in Section 5.
Another reason is the complexity consideration. The complexity
of Haar transform is O(n), whilst O(n log n) computation is
required for Fast Fourier Transform (FFT) [17]. Both impose a restriction
on the length of time sequences, which must be an integral
power of 2. Although these computations are all involved in the
pre-processing stage, the complexity of the transformation can be
a concern, especially when the database is large. From our experiments,
the pre-processing time for DFT is about 3 to 4 times longer
than for Haar transform.
Finally, the proposed method provides better similarity model.
Apart from Euclidean distance, our model can easily accommo-
date v-shift similarity of two time sequences (Definition 2) at a lit-
tle more cost. That is, the situation where vertically shifted signals
can match is accommodated. On the other hand, previous study on
F-index did not make use of this similarity model.
Note that similar to DFT, DWT will not require massive index
re-organization because of database updating, which is a major
drawback in using the K-L transform or SVD approach.
3.3. Guarantee of no False Dismissal
For FT and DFT, it is shown by Parseval’s Theorem [23] that
the energy of a signal is conserved in both the time and frequency
domains. Parseval’s Theorem also shows that this is true for
wavelet transforms. On the other hand, the Euclidean distances of
both time and frequency domains are the same for DFT by Equa-
tion (1). This is a very important property in order that dimen-
sion reduction of sequence data is possible. It guarantees that no
qualified time sequence will be rejected, thus no false dismissal.
However, this property has not been shown for DWT in general,
and not for the Haar wavelets. Here we show such a relationship.
Lemma 2 Given a sequence x = (x_0, x_1) and a sequence y =
(y_0, y_1). The Haar transforms of x and y are H(x) = X =
(X_0, X_1) and H(y) = Y = (Y_0, Y_1) respectively. Lengths of x, y, X and Y are
all equal to 2. Then the Euclidean distance D(x, y) is √2 times the
Euclidean distance D(X, Y), i.e.

    D(x, y) = √2 · D(X, Y)

Proof: Express x in terms of X and y in terms of Y by applying
Equation (2) accordingly:

    x_0 = X_0 + X_1        x_1 = X_0 − X_1
    y_0 = Y_0 + Y_1        y_1 = Y_0 − Y_1

Square of Euclidean distance of x and y:

    D²(x, y) = (y_0 − x_0)² + (y_1 − x_1)²
             = ((Y_0 − X_0) + (Y_1 − X_1))² + ((Y_0 − X_0) − (Y_1 − X_1))²
             = 2 (Y_0 − X_0)² + 2 (Y_1 − X_1)²
             = 2 D²(X, Y)

Thus D²(x, y) = 2 D²(X, Y), and D(x, y) = √2 · D(X, Y). ∎
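Lemma 2 can be checked numerically for a pair of length-2 sequences (a sketch with arbitrary sample values; haar2 applies the 1/2-factor transform of Equation (2)):

```python
import math

def haar2(x0, x1):
    # Length-2 Haar transform with the 1/2 factor of Equation (2).
    return ((x0 + x1) / 2, (x0 - x1) / 2)

x, y = (9.0, 7.0), (4.0, 8.0)
X, Y = haar2(*x), haar2(*y)

d_time = math.hypot(y[0] - x[0], y[1] - x[1])   # D(x, y)
d_haar = math.hypot(Y[0] - X[0], Y[1] - X[1])   # D(X, Y)
# Lemma 2: D(x, y) = sqrt(2) * D(X, Y)
```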
Lemma 3 Given two sequences x and y, and the Haar transforms
of x, y are X and Y respectively. Lengths of x, y, X and Y are all
n (n ≥ 2 and n is a power of 2). Let (e_0 e_1 e_2 ... e_{n−1}) = Y − X
be the coefficient-wise difference of the two transforms. The
Euclidean distance D(x, y) = S_{log n} can be expressed in terms
of (e_0, e_1, ..., e_{n−1}) recursively by

    S_{i+1} = √( 2 ( S_i² + e_{2^i}² + e_{2^i+1}² + ... + e_{2^{i+1}−1}² ) )

    for 0 ≤ i ≤ log n − 1, with S_0 = |e_0|        (3)
Figure 2. Hierarchy of Haar wavelet transform of sequence x of length n
Proof: In Figure 2, the original sequence x is represented at level
S_{log n}. The values of x_{i,j} and d_{2^i+j} are defined by

    x_{i,j} = ( x_{i+1,2j} + x_{i+1,2j+1} ) / 2
    d_{2^i+j} = ( x_{i+1,2j} − x_{i+1,2j+1} ) / 2

The Haar transform of x, H(x), is represented by (x_{0,0} d_1 d_2
... d_{2^i+j} d_{2^i+j+1} ... d_{n−1}). A similar hierarchy exists for another
sequence y, with entries y_{i,j} and coefficients D_{2^i+j}. Denote
e_0 = y_{0,0} − x_{0,0} and e_m = D_m − d_m for 1 ≤ m < n.

We can treat the elements at each horizontal level of the hierarchy
to be a data sequence. Hence the sequence at level S_i contains
data (x_{i,0} x_{i,1} ... x_{i,2^i−1}). Let us define S_i to be

    S_i = √( Σ_{j=0}^{2^i−1} ( y_{i,j} − x_{i,j} )² )

S_i can be seen as the Euclidean distance between the data sequences
at level i (0 ≤ i ≤ log n) in the hierarchies for x and
y. Also, S_{log n} is the Euclidean distance between the given time
series.

Next we prove the following statement:

    S_{i+1} = √( 2 ( S_i² + e_{2^i}² + e_{2^i+1}² + ... + e_{2^{i+1}−1}² ) )

    for 0 ≤ i ≤ log n − 1        (4)

The base case is shown true by Lemma 2 when i = 0:

    S_1 = √( 2 ( S_0² + e_1² ) )

We next prove the case for i = k > 0. We first note that in the
given hierarchy, for a pair of adjacent elements at a level i + 1 > 0 of
the form ( x_{i+1,2j}, x_{i+1,2j+1} ), we have the following relation:

    ( y_{i+1,2j} − x_{i+1,2j} )² + ( y_{i+1,2j+1} − x_{i+1,2j+1} )²
        = 2 ( ( y_{i,j} − x_{i,j} )² + ( D_{2^i+j} − d_{2^i+j} )² )        (5)

where D_{2^i+j} is the element in the hierarchy for y corresponding
to d_{2^i+j}. This can be shown by repeating the
proof in Lemma 2, replacing x by ( x_{i+1,2j}, x_{i+1,2j+1} ), y
by ( y_{i+1,2j}, y_{i+1,2j+1} ), X by ( x_{i,j}, d_{2^i+j} ), and Y by
( y_{i,j}, D_{2^i+j} ). Note that ( D_{2^i+j} − d_{2^i+j} )² = e_{2^i+j}².

For i = k,

    S_{k+1}² = Σ_{j=0}^{2^{k+1}−1} ( y_{k+1,j} − x_{k+1,j} )²
             = Σ_{j=0}^{2^k−1} [ ( y_{k+1,2j} − x_{k+1,2j} )²
                                 + ( y_{k+1,2j+1} − x_{k+1,2j+1} )² ]

By Equation (5), we have

    S_{k+1}² = Σ_{j=0}^{2^k−1} 2 ( ( y_{k,j} − x_{k,j} )² + e_{2^k+j}² )
             = 2 ( S_k² + e_{2^k}² + e_{2^k+1}² + ... + e_{2^{k+1}−1}² )

Finally, by definition of S_0,

    S_0 = | y_{0,0} − x_{0,0} | = | e_0 |

which completes the proof. ∎
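The recursion of Lemma 3 can be verified numerically (a sketch with arbitrary sample values; both helpers use the 1/2-factor transform of Equation (2)):

```python
import math

def haar_halved(seq):
    # Haar transform with the 1/2 factor of Equation (2):
    # returns (overall average, d_1, ..., d_{n-1}).
    coeffs = []
    while len(seq) > 1:
        coeffs = [(a - b) / 2 for a, b in zip(seq[0::2], seq[1::2])] + coeffs
        seq = [(a + b) / 2 for a, b in zip(seq[0::2], seq[1::2])]
    return seq + coeffs

def levels(seq):
    # All averaging levels of the hierarchy in Figure 2;
    # result[i] is the level-i sequence (level 0 = overall average).
    out = [seq]
    while len(seq) > 1:
        seq = [(a + b) / 2 for a, b in zip(seq[0::2], seq[1::2])]
        out.append(seq)
    return out[::-1]

x = [9.0, 7.0, 3.0, 5.0]
y = [4.0, 8.0, 1.0, 6.0]
e = [Ym - Xm for Xm, Ym in zip(haar_halved(x), haar_halved(y))]
lx, ly = levels(x), levels(y)

def S(i):
    # Euclidean distance between the level-i sequences of x and y.
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(lx[i], ly[i])))

# Recursion (3): S_{i+1}^2 = 2 * (S_i^2 + sum of e_m^2, 2^i <= m < 2^{i+1})
```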
The expression of the Euclidean distance between time sequences
in terms of their Haar coefficients is not sufficient for
proper use in multi-dimensional index trees unless the Euclidean
distance is preserved in both the Haar and time domains, as for DFT in
(1). This can be achieved by a normalization step which changes
the scaling factor in Equation (2) from 1/2 to 1/√2 in the Haar
transformation. After the normalization step, the Euclidean distance
between sequences in the Haar domain will be equivalent to S_{log n}
in Equation (3). The preservation of Euclidean distance of Haar
transform ensures the completeness of feature extraction as in
DFT.
If only the first k dimensions (1 ≤ k ≤ n) of the Haar transform
are used in the calculation of the Euclidean distance in Equation (3),
then we replace the remaining coefficients by 0’s in the Haar transformed
sequences. This replacement starts from the (k+1)-th to the n-th
coefficients in the transformed sequences.
Lemma 4 If the first k (1 ≤ k ≤ n) dimensions of Haar transform
are used, no false dismissal will occur for range queries.

Proof: Considering the inequality in Definition 1 and Lemma 3,

    D(x, y) = S_{log n} ≤ ε        (6)

Using the first k dimensions as index, the value of e_i in Equation
(3) becomes zero for i ≥ k. Thus the Euclidean distance computed
from the truncated sequences is ≤ S_{log n} ≤ ε. This completes the
proof. ∎
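Lemmas 3 and 4 together justify the usual filter condition: with the 1/√2 normalization the Haar transform preserves Euclidean distance exactly, and truncating to the first k coefficients can only shrink it. A small numerical check (sample values are arbitrary):

```python
import math

def haar_normalized(seq):
    # Haar transform with the 1/sqrt(2) factor of Section 3.3 (orthonormal).
    coeffs, s = [], 1 / math.sqrt(2)
    while len(seq) > 1:
        coeffs = [(a - b) * s for a, b in zip(seq[0::2], seq[1::2])] + coeffs
        seq = [(a + b) * s for a, b in zip(seq[0::2], seq[1::2])]
    return seq + coeffs

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

x = [9.0, 7.0, 3.0, 5.0]
y = [4.0, 8.0, 1.0, 6.0]
X, Y = haar_normalized(x), haar_normalized(y)
# dist(x, y) == dist(X, Y) exactly, and dist(X[:k], Y[:k]) <= dist(x, y)
# for any k, so pruning on the first k coefficients never causes a
# false dismissal (Lemma 4).
```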
4. The Overall Strategy
In this section, we present the overall strategy of our time series
matching method and propose our own method for nearest
neighbor query. Before querying is performed, we shall do some
pre-processing to extract the feature vectors with reduced dimensionality,
and to build the index. After the index is built, content-based
search can be performed for two types of querying: range
querying and n-nearest-neighbors querying.
4.1. Pre-processing
Step 1 - Similarity Model Selection: According to their applications,
users may choose to use either the simple Euclidean distance
(Definition 1) or the v-shift similarity (Definition 2) as their
similarity measurement. For Definition 1, Haar transform is applied
to time series. For Definition 2, Haar transform is applied to time
series, but the first Haar coefficient will not be used in indexing, as
there is no need to match their average values.

Step 2 - Index Construction: Given a database of time series of
varying length, we pre-process the time series as follows. We obtain
the w-point Haar transform by applying Equation (2) with the
normalization factor to each subsequence extracted with a sliding
window of size w from each sequence in the database. An index
structure such as an R-Tree is built, using the first k Haar coefficients,
where k is an optimal value found by experiments based on the
number of page accesses. This is because of a trade-off between
post-processing cost and index dimension.
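The two pre-processing steps can be sketched as follows (a reconstruction, not the authors' code; w and k are the window size and index dimensionality discussed above):

```python
import math

def haar_normalized(seq):
    # Haar transform with the 1/sqrt(2) normalization of Section 3.3.
    coeffs, s = [], 1 / math.sqrt(2)
    while len(seq) > 1:
        coeffs = [(a - b) * s for a, b in zip(seq[0::2], seq[1::2])] + coeffs
        seq = [(a + b) * s for a, b in zip(seq[0::2], seq[1::2])]
    return seq + coeffs

def extract_features(series, w, k, v_shift=False):
    """Slide a window of size w (a power of 2) over the series and keep
    the first k Haar coefficients of each window as its feature vector.
    Under v-shift similarity the first coefficient (the scaled average)
    is skipped, so vertical offsets are ignored and one index dimension
    is saved."""
    feats = []
    for off in range(len(series) - w + 1):
        c = haar_normalized(series[off:off + w])
        feats.append(c[1:k + 1] if v_shift else c[:k])
    return feats
```

Each feature vector would then be inserted into the R-Tree keyed by its subsequence offset.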
4.2. Range Query
After we have built the index, we can carry out range query or
nearest neighbor query evaluation. For range queries, two steps
are involved:
1. Similar sequences with distances ≤ ε from the query are
looked up in the index and returned.

2. A post-processing step is applied on these sequences to find
the true distances in the time domain and remove all false alarms.
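The two steps amount to the usual filter-and-refine pattern; a linear-scan sketch (an R-Tree lookup would replace the scan in step 1; all names are ours):

```python
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def range_query(db, features, q, q_feat, eps):
    """Filter-and-refine range query.

    db       : list of original time sequences
    features : first-k normalized Haar coefficients of each sequence
    q, q_feat: query sequence and its feature vector
    """
    # Step 1: candidates whose feature-space distance is within eps
    # (may contain false alarms, but no false dismissals by Lemma 4).
    candidates = [i for i, f in enumerate(features) if dist(f, q_feat) <= eps]
    # Step 2: post-processing removes false alarms using true distances.
    return [i for i in candidates if dist(db[i], q) <= eps]
```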
4.3. Nearest Neighbor Query
For nearest neighbor query, we propose a two-phase evaluation
as follows.

Phase 1
In the first phase, the n nearest neighbors of query q are found
in the R-Tree index using the algorithm in [25]. The Euclidean
distances in the time domain (full dimension) are
computed between the query sequence and all n nearest
neighbors obtained, which are D(q, a_i), where a_i denotes
the i-th nearest neighbor (1 ≤ i ≤ n), with a_n farthest
from the query q.

Phase 2
A range query evaluation is then performed on the same index
by setting ε = D(q, a_n) initially. During the search,
Using Definition 2, one dimension can be saved in the index tree.
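A linear-scan sketch of the two-phase evaluation (scans stand in for the R-Tree traversals of both phases; correctness only needs the feature distance to lower-bound the true distance, as guaranteed by Lemma 4):

```python
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nn_query(db, features, q, q_feat, m):
    """Two-phase m-nearest-neighbor sketch.

    Phase 1 finds m approximate neighbors in feature space and computes
    their true distances; the largest true distance is a safe radius.
    Phase 2 runs a range query with that radius (no false dismissal) and
    refines the candidates to the exact m nearest neighbors.
    """
    # Phase 1: m nearest by feature-space distance, then true distances.
    approx = sorted(range(len(db)),
                    key=lambda i: dist(features[i], q_feat))[:m]
    eps = max(dist(db[i], q) for i in approx)
    # Phase 2: range query with radius eps, refined by true distance.
    cand = [i for i in range(len(db)) if dist(features[i], q_feat) <= eps]
    return sorted(cand, key=lambda i: dist(db[i], q))[:m]
```

The radius found in Phase 1 is what keeps the Phase 2 range query small enough to prune most candidates.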

Citations
More filters
Proceedings ArticleDOI
13 Jun 2003
TL;DR: A new symbolic representation of time series is introduced that is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measuresdefined on the original series.
Abstract: The parallel explosions of interest in streaming data, and data mining of time series have had surprisingly little intersection. This is in spite of the fact that time series data are typically streaming data. The main reason for this apparent paradox is the fact that the vast majority of work on streaming data explicitly assumes that the data is discrete, whereas the vast majority of time series data is real valued. Many researchers have also considered transforming real valued time series into symbolic representations, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities, in addition to allowing formerly "batch-only" problems to be tackled by the streaming community. While many symbolic representations of time series have been introduced over the past decades, they all suffer from three fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. Finally, most of these symbolic approaches require one to have access to all the data, before creating the symbolic representation. This last feature explicitly thwarts efforts to use the representations with streaming algorithms. In this work we introduce a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series.
As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. Finally, our representation allows the real valued data to be converted in a streaming fashion, with only an infinitesimal time and space overhead.We will demonstrate the utility of our representation on the classic data mining tasks of clustering, classification, query by content and anomaly detection.

1,922 citations


Cites background or methods from "Efficient time series matching by w..."

  • ...Indeed, it is in this context that most of the representations enumerated in Figure 1 were introduced [7, 14, 22, 35]....

    [...]

  • ...With this in mind, a great number of time series representations have been introduced, including the Discrete Fourier Transform (DFT) [14], the Discrete Wavelet Transform (DWT) [7], Piecewise Linear, and Piecewise Constant models (PAA) [22], (APCA) [16, 22], and Singular Value Decomposition (SVD) [22]....

    [...]

  • ...As a simple example, wavelets have the useful multiresolution property, but are only defined for time series that are an integer power of two in length [7]....

    [...]

  • ...Most applications assume that we have one very long time series T, and that manageable subsequences of length n are extracted by use of a sliding window, then stored in a matrix for further manipulation [7, 14, 22, 35]....

    [...]

  • ...To perform query by content, we built an index using SAX, and compared it to an index built using the Haar wavelet approach [7]....

    [...]

Journal ArticleDOI
TL;DR: The Analysis of Time Series: An Introduction, 4th edn., by C. Chatfield. Chapman and Hall, London, 1989. ISBN 0 412 31820 2.
Abstract: The Analysis of Time Series: An Introduction, 4th edn. By C. Chatfield. ISBN 0 412 31820 2. Chapman and Hall, London, 1989. 242 pp. £13.50.

1,583 citations

Journal ArticleDOI
TL;DR: This work introduces a new dimensionality reduction technique which it is called Piecewise Aggregate Approximation (PAA), and theoretically and empirically compare it to the other techniques and demonstrate its superiority.
Abstract: The problem of similarity search in large time series databases has attracted much attention recently. It is a non-trivial problem because of the inherent high dimensionality of the data. The most promising solutions involve first performing dimensionality reduction on the data, and then indexing the reduced data with a spatial access method. Three major dimensionality reduction techniques have been proposed: Singular Value Decomposition (SVD), the Discrete Fourier transform (DFT), and more recently the Discrete Wavelet Transform (DWT). In this work we introduce a new dimensionality reduction technique which we call Piecewise Aggregate Approximation (PAA). We theoretically and empirically compare it to the other techniques and demonstrate its superiority. In addition to being competitive with or faster than the other methods, our approach has numerous other advantages. It is simple to understand and to implement, it allows more flexible distance measures, including weighted Euclidean queries, and the index can be built in linear time.

1,550 citations


Cites background or methods from "Efficient time series matching by w..."

  • ...However it might be argued the inherent smoothing effect of dimensionality reduction using DFT, DWT or SVD would help smooth out the noise and produce better results....

    [...]

  • ...utilizes the Discrete Fourier Transform (DFT) to perform the dimensionality reduction, but other techniques have been suggested, including Singular Value Decomposition (SVD) [34] and the Discrete Wavelet Transform (DWT) [6]....

    [...]

  • ...Because we wished to include the DWT in our experiments, we are limited to query lengths that are an integer power of two....

    [...]

  • ...The Discrete Haar Wavelet Transform (DWT) can be calculated efficiently and an entire dataset can be indexed in O(mn)....

    [...]

  • ...and DRW is defined as: w̄_i = min(w_{i-1}, w_i, w_{i+1}), DRW(X̄, Ȳ, W̄) = sqrt(n/N) · sqrt(Σ_{i=1}^{N} [w̄_i (x̄_i − ȳ_i)]²) (9). Note that it is not possible to modify DFT, DWT or SVD in a similar manner, because each coefficient represents a signal that is added along the entire length of the query....

    [...]

Journal ArticleDOI
TL;DR: The utility of the new symbolic representation of time series formed is demonstrated, which allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measuresdefined on the original series.
Abstract: Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentiality allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Second, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation on various data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.

1,452 citations


Cites background or methods from "Efficient time series matching by w..."

  • ...Figure 1 illustrates a hierarchy of all the various time series representations in the literature (Andre-Jonsson and Badal 1997; Chan and Fu 1999; Faloutsos et al. 1994; Geurts 2001; Huang and Yu 1999; Keogh et al. 2001a; Keogh and Pazzani 1998; Roddick et al. 2001; Shahabi et al. 2000; Yi and…...

    [...]

  • ...• Indexing: Given a query time series Q, and some similarity/dissimilarity measure D(Q,C), find the most similar time series in database DB (Agrawal et al. 1995; Chan and Fu 1999; Faloutsos et al. 1994; Keogh et al. 2001a; Yi and Faloutsos 2000)....

    [...]

  • ...With this in mind, a great number of time series representations have been introduced, including the Discrete Fourier Transform (DFT) (Faloutsos et al. 1994), the Discrete Wavelet Transform (DWT) (Chan and Fu 1999), Piecewise Linear, and Piecewise Constant models (PAA) (Keogh et al. 2001a), (APCA) (Geurts 2001; Keogh et al. 2001a), and Singular Value Decomposition (SVD) (Keogh et al. 2001a)....

    [...]

  • ...To perform query by content, we build an index using SAX, and compare it to an index built using the Haar wavelet approach (Chan and Fu 1999)....

    [...]

  • ...1 were introduced (Chan and Fu 1999; Faloutsos et al. 1994; Keogh et al. 2001a; Yi and Faloutsos 2000)....

    [...]

Journal ArticleDOI
01 Aug 2008
TL;DR: An extensive set of time series experiments are conducted re-implementing 8 different representation methods and 9 similarity measures and their variants and testing their effectiveness on 38 time series data sets from a wide variety of application domains to provide a unified validation of some of the existing achievements.
Abstract: The last decade has witnessed a tremendous growth of interest in applications that deal with querying and mining of time series data. Numerous representation methods for dimensionality reduction and similarity measures geared towards time series have been introduced. Each individual work introducing a particular method has made specific claims and, aside from the occasional theoretical justifications, provided quantitative experimental observations. However, for the most part, the comparative aspects of these experiments were too narrowly focused on demonstrating the benefits of the proposed methods over some of the previously introduced ones. In order to provide a comprehensive validation, we conducted an extensive set of time series experiments re-implementing 8 different representation methods and 9 similarity measures and their variants, and testing their effectiveness on 38 time series data sets from a wide variety of application domains. In this paper, we give an overview of these different techniques and present our comparative experimental findings regarding their effectiveness. Our experiments have provided both a unified validation of some of the existing achievements, and in some cases, suggested that certain claims in the literature may be unduly optimistic.

1,387 citations


Additional excerpts

  • ...Many techniques have been proposed in the literature for representing time series with reduced dimensionality, such as Discrete Fourier Transformation (DFT) [13], Single Value Decomposition (SVD) [13], Discrete Cosine Transformation (DCT) [29], Discrete Wavelet Transformation (DWT) [33], Piecewise Aggregate Approximation (PAA) [24], Adaptive Piecewise Constant Approximation (APCA) [23], Chebyshev polynomials (CHEB) [6], Symbolic Aggregate approXimation (SAX) [30], Indexable Piecewise Linear Approximation (IPLA) [11] and etc....

    [...]

  • ...…(DFT) [13], Single Value Decomposition (SVD) [13], Discrete Cosine Transformation (DCT) [29], Discrete Wavelet Transformation (DWT) [33], Piecewise Aggregate Approximation (PAA) [24], Adaptive Piecewise Constant Approximation (APCA) [23], Chebyshev polynomials (CHEB) [6], Symbolic Aggregate…...

    [...]

References
Book
01 Jan 1993
TL;DR: This article presents bootstrap methods for estimation, using simple arguments, with Minitab macros for implementing these methods, as well as some examples of how these methods could be used for estimation purposes.
Abstract: This article presents bootstrap methods for estimation, using simple arguments. Minitab macros for implementing these methods are given.

37,183 citations

Proceedings ArticleDOI
01 Jun 1984
TL;DR: A dynamic index structure called an R-tree is described which meets this need, and algorithms for searching and updating it are given and it is concluded that it is useful for current database systems in spatial applications.
Abstract: In order to handle spatial data efficiently, as required in computer aided design and geo-data applications, a database system needs an index mechanism that will help it retrieve data items quickly according to their spatial locations. However, traditional indexing methods are not well suited to data objects of non-zero size located in multi-dimensional spaces. In this paper we describe a dynamic index structure called an R-tree which meets this need, and give algorithms for searching and updating it. We present the results of a series of tests which indicate that the structure performs well, and conclude that it is useful for current database systems in spatial applications.

7,336 citations


"Efficient time series matching by w..." refers methods in this paper

  • ...Many multi-dimensional indexing methods [13, 7, 5, 20] such as the R-Tree and R*-Tree [20, 5, 11] scale exponentially for high dimensionalities, eventually reducing the performance to that of sequential scanning or worse....

    [...]

  • ...These coefficients can be indexed in an R-Tree or R*-Tree for fast retrieval....

    [...]

Journal ArticleDOI
TL;DR: Statistical theory attacks the problem from both ends as discussed by the authors, and provides optimal methods for finding a real signal in a noisy background, and also provides strict checks against the overinterpretation of random patterns.
Abstract: Statistics is the science of learning from experience, especially experience that arrives a little bit at a time. The earliest information science was statistics, originating in about 1650. This century has seen statistical techniques become the analytic methods of choice in biomedical science, psychology, education, economics, communications theory, sociology, genetic studies, epidemiology, and other areas. Recently, traditional sciences like geology, physics, and astronomy have begun to make increasing use of statistical methods as they focus on areas that demand informational efficiency, such as the study of rare and exotic particles or extremely distant galaxies. Most people are not natural-born statisticians. Left to our own devices we are not very good at picking out patterns from a sea of noisy data. To put it another way, we are all too good at picking out non-existent patterns that happen to suit our purposes. Statistical theory attacks the problem from both ends. It provides optimal methods for finding a real signal in a noisy background, and also provides strict checks against the overinterpretation of random patterns.

6,361 citations

01 Jan 1946

5,910 citations

Proceedings ArticleDOI
06 Mar 1995
TL;DR: Three algorithms are presented to solve the problem of mining sequential patterns over databases of customer transactions, and empirically evaluating their performance using synthetic data shows that two of them have comparable performance.
Abstract: We are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, albeit AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions per customer and the number of items in a transaction.

5,663 citations


Additional excerpts

  • ...Time series data are of growing importance in many new database applications, such as data warehousing and data mining [3, 8, 2, 12]....

    [...]

Frequently Asked Questions (14)
Q1. What are the future works mentioned in the paper "Efficient time series matching by wavelets" ?

The authors have some suggestions for future work. The authors can study the possibility of using other wavelets like Symmlet [18] to boost the performance further. The authors can also try to apply wavelets that did not work well with stock data to other signals, e.g. sinusoidal signals, electrocardiograms (ECGs).

While the use of DFT and K-L transform or SVD have been studied in the literature, to their knowledge, there is no in-depth study on the application of DWT. In this paper, the authors propose to use the Haar Wavelet Transform for time series indexing. The major contributions are: (1) the authors show that Euclidean distance is preserved in the Haar transformed domain and no false dismissal will occur, (2) they show that the Haar transform can outperform DFT through experiments, (3) a new similarity model is suggested to accommodate vertical shifts of time series, and (4) a two-phase method is proposed for efficient n-nearest neighbor queries in time series databases.

The authors obtain the w-point Haar transform by applying Equation (2) with the normalization factor, for each subsequence extracted with a sliding window of size w from each sequence in the database.

To avoid missing any qualifying object, the Euclidean distance in the reduced k-dimensional space should be less than or equal to the Euclidean distance between the two original time sequences.

Experiments show that their method outperforms the F-index (Discrete Fourier Transform) method in terms of pruning power, number of page accesses, scalability, and complexity. 

After the first matrix multiplication, half of the Haar transform coefficients can be found, interleaved with some intermediate coefficients.

As most of the page accesses of a query are devoted to removing false alarms, the precision is crucial to the overall performance of query evaluation.

Feature points in nearby offsets will form a trail due to the effect of the stepwise sliding window; the minimum bounding rectangle (MBR) of a trail is then indexed in an R-Tree instead of the feature points themselves.

The poorer precision of DFT creates more work in the post-processing step and this affects the overall performance, especially in terms of the amount of disk accesses for large databases with long sequences. 

An index structure such as an R-Tree is built using the first k Haar coefficients, where k is an optimal value found by experiments based on the number of page accesses.

The extra step introduced in Phase 2 to update ε can enhance the performance by pruning more non-qualifying MBRs during the traversal of the R-Tree.

Apart from Euclidean distance, their model can easily accommodate v-shift similarity of two time sequences (Definition 2) at a little more cost. 

Time series data are of growing importance in many new database applications, such as data warehousing and data mining [3, 8, 2, 12]. 

From experiments, the authors find that the other wavelets seem to also preserve Euclidean distances; however, so far they have a proof of this property only for the Haar wavelets.