What future work is mentioned in the paper "Efficient time series matching by wavelets"?

The authors suggest two directions for future work: studying whether other wavelets, such as the Symmlet [18], can boost performance further, and applying wavelets that did not work well with stock data to other signals, e.g. sinusoidal signals and electrocardiographs (ECGs).

How can the authors avoid missing any qualifying object?

To avoid missing any qualifying object, the Euclidean distance in the reduced C -dimensional space should be less than or equal to the Euclidean distance between the two original time sequences.

What are the results of the experiments?

Experiments show that their method outperforms the F-index (Discrete Fourier Transform) method in terms of pruning power, number of page accesses, scalability, and complexity.

How many coefficients can be found in the Haar transform?

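After the first multiplication of the Haar transform matrix and the input vector, half of the Haar transform coefficients are found. They appear in the intermediate vector interleaved with intermediate (average) coefficients, which are transformed again in the next iteration.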

What is the importance of the precision in a query?

As most of the page accesses of a query are devoted to removing false alarms, the precision is crucial to the overall performance of query evaluation.

What is the effect of the poorer precision of DFT?

The poorer precision of DFT creates more work in the post-processing step and this affects the overall performance, especially in terms of the amount of disk accesses for large databases with long sequences.

What is the way to build an index?

An index structure such as an R-Tree is built using the first few Haar coefficients, where the number of coefficients is an optimal value found by experiments based on the number of page accesses.

How can the authors improve the performance of the R-Tree?

The extra step introduced in Phase 2 to update the query range can enhance performance by pruning more non-qualifying MBRs during the traversal of the R-Tree.

What are the other wavelets that the authors have found?

From experiments, the authors find that other wavelets also seem to preserve Euclidean distances; however, so far they have a proof of this property only for the Haar wavelets.

(Open Access) Efficient time series matching by wavelets (1999) | Kin-Pong Chan

Q: How do the authors obtain the Haar transform used for indexing?

The authors apply Equation (2), with the normalization factor, to each subsequence obtained by sliding a fixed-size window over each sequence in the database. The number of points in each transform equals the window size.

Efﬁcient Time Series Matching by Wavelets

Kin-pong Chan and Ada Wai-chee Fu

Department of Computer Science and Engineering

The Chinese University of Hong Kong

Shatin, Hong Kong



{kpchan, adafu}@cse.cuhk.edu.hk

Abstract

Time series stored as feature vectors can be indexed by multi-dimensional index trees like R-Trees for fast retrieval. Due to the dimensionality curse problem, transformations are applied to time series to reduce the number of dimensions of the feature vectors. Different transformations like Discrete Fourier Transform (DFT), Discrete Wavelet Transform (DWT), Karhunen-Loeve (K-L) transform or Singular Value Decomposition (SVD) can be applied. While the use of DFT and K-L transform or SVD has been studied in the literature, to our knowledge, there is no in-depth study on the application of DWT. In this paper, we propose to use Haar Wavelet Transform for time series indexing. The major contributions are: (1) we show that Euclidean distance is preserved in the Haar transformed domain and no false dismissal will occur, (2) we show that Haar transform can outperform DFT through experiments, (3) a new similarity model is suggested to accommodate vertical shift of time series, and (4) a two-phase method is proposed for efficient $n$-nearest neighbor query in time series databases.

1. Introduction

Time series data are of growing importance in many new

database applications, such as data warehousing and data mining

[3, 8, 2, 12]. A time series (or time sequence) is a sequence of

real numbers, each number representing a value at a time point.

Typical examples include stock prices or currency exchange rates, biomedical measurements, weather data, etc., collected over time. Therefore, time series databases supporting fast retrieval of time series data and similarity queries are desired.

In order to depict the similarity between two time series, we define a similarity measurement during the matching process. Given two time series $\vec{x} = (x_0, x_1, \ldots, x_{n-1})$ and $\vec{y} = (y_0, y_1, \ldots, y_{n-1})$, a standard approach is to compute the Euclidean distance $D(\vec{x}, \vec{y})$ between time series $\vec{x}$ and $\vec{y}$:

$$D(\vec{x}, \vec{y}) = \Big( \sum_{i=0}^{n-1} |x_i - y_i|^2 \Big)^{1/2}$$

By using this similarity model, we can retrieve similar time series by considering distance $D(\vec{x}, \vec{y})$.

Indexing is used to support efficient retrieval and matching of time series. Some important factors have to be considered. The first factor is dimensionality reduction. Many multi-dimensional indexing methods [13, 7, 5, 20] such as the R-Tree and R*-Tree [20, 5, 11] scale exponentially for high dimensionalities, eventually reducing the performance to that of sequential scanning or worse. Hence, a transformation is applied to map the time sequences to a new feature space of a lower dimensionality. Next, we must ensure completeness and effectiveness when the number of dimensions is reduced. To avoid missing any qualifying object, the Euclidean distance in the reduced-dimensional space should be less than or equal to the Euclidean distance between the two original time sequences. Finally, we must also consider the nature of the data series, since the effectiveness of power concentration of a particular transformation depends on the nature of the time series. It is believed that only brown noise or random walks exist in real signals. In particular, stock movements and exchange rates can be modeled successfully as random walks [10], for which a skewed energy spectrum can be obtained.

Discrete Fourier Transform (DFT) has been one of the most commonly used techniques. One problem with DFT is that it misses the important feature of time localization. Piecewise Fourier Transform has been proposed to mitigate this problem, but the size of the pieces leads to other problems. While large pieces reduce the power of multi-resolution, small pieces are weak at modeling low frequencies.

Wavelet Transform (WT), or Discrete Wavelet Transform (DWT) [9, 18], has been found to be effective in replacing DFT in many applications in computer graphics, image [26], speech [1], and signal processing [6, 4]. We propose to apply this technique in time series for dimension reduction and content-based search. DWT is a discrete version of WT for numerical signals. Although the potential application of DWT in this problem was pointed out in [22], no further investigation has been reported to our knowledge. Hence, it is of value to conduct studies and evaluations on time series retrieval and matching by means of wavelets.

The advantage of using DWT is multi-resolution representation of signals. It has the time-frequency localization property. Thus, DWT is able to give locations in both time and frequency. Therefore, wavelet representations of signals bear more information than those of DFT, in which only frequencies are considered. While DFT extracts the lower harmonics which represent the general shape of a time sequence, DWT encodes a coarser resolution of the original time sequence with its preceding coefficients. We show that Euclidean distance is preserved in the Haar transformed domain. Moreover, we show by experiments that Haar Wavelet Transform [9], which is a commonly used wavelet transform, can outperform DFT significantly.

We also suggest a similarity definition to handle the problem of vertical shifts of time series. Finally, we propose an algorithm for $n$-nearest neighbor query for the proposed wavelet method. The algorithm makes use of the range query and dynamically adjusts the range by the property of Euclidean distance preservation of the wavelet transformation.

2. Related Work

Discrete Fourier Transform (DFT) is often used for dimension reduction [2, 15] to achieve efficient indexing. An index built by means of DFT is also called an F-index [2]. Suppose the DFT of a time sequence $\vec{x}$ is denoted by $\vec{X}$. For many applications such as stock data, the low frequency components are located at the preceding coefficients of $\vec{X}$, which represent the general trend of the time sequence $\vec{x}$. These coefficients can be indexed in an R-Tree or R*-Tree for fast retrieval. In most previous works, range querying is considered. A range query (or epsilon query) evaluation returns sequences with Euclidean distance within $\epsilon$ from the query point.

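Parseval's Theorem [23] shows that the Euclidean distance between two signals $\vec{x}$ and $\vec{y}$ in time domain is the same as their Euclidean distance in frequency domain, where $\vec{X}$ and $\vec{Y}$ denote the DFTs of $\vec{x}$ and $\vec{y}$:

$$\|\vec{x} - \vec{y}\|^2 \equiv \|\vec{X} - \vec{Y}\|^2 \qquad (1)$$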

Because only the first few coefficients are indexed, the distance computed in the index never exceeds the true distance. Therefore, F-index may raise false alarms, but guarantees no false dismissal. After a range query in the F-index, false alarms are filtered by checking against the query sequence in the original time domain in a post-processing step. F-index is further generalized and subsequence matching is proposed in [15]. This is called the ST-index, which permits sequence queries of varying length. Each time sequence is broken up into subsequences by a sliding window with a fixed length for DFT. Feature points at nearby offsets will form a trail due to the effect of the stepwise sliding window; the minimum bounding rectangle (MBR) of a trail is then indexed in an R-Tree instead of the feature points themselves. When a query arrives, all MBRs that intersect the query region are retrieved and their trails are matched.
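The filtering idea behind the F-index can be illustrated with a minimal Python sketch. It is not the implementation used in [2] or [15]: the function names (`unitary_dft`, `distance`, `f_index_filter`) and the parameter `num_coeffs` are hypothetical, a naive DFT scaled by 1/sqrt(n) is assumed so that Parseval's identity holds exactly, and a linear scan stands in for the R-Tree.

```python
import cmath
import math


def unitary_dft(x):
    """Naive O(n^2) DFT scaled by 1/sqrt(n), so Euclidean distances are preserved."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]


def distance(a, b):
    """Euclidean distance between two equal-length real or complex vectors."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


def f_index_filter(database, query, eps, num_coeffs):
    """Keep sequences whose first num_coeffs DFT coefficients lie within eps of the query's.

    The result may contain false alarms but, because the truncated distance is a
    lower bound on the true distance, it never drops a qualifying sequence.
    """
    q_feat = unitary_dft(query)[:num_coeffs]
    return [s for s in database if distance(unitary_dft(s)[:num_coeffs], q_feat) <= eps]
```

The candidates returned by `f_index_filter` would still need the post-processing step in the time domain described above.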

New similarity models are applied to F-index based time series matching in [24]. It achieves time warping, moving average, and reversing by applying transformations to feature points in the frequency domain. Given a query, a new index is built by applying a transformation to all points in the original index, and feature points with a distance less than $\epsilon$ from the query are returned. However, a lot of computation is involved in building the new index, which has a great impact on the actual query performance.

In the above works, no efficient method for nearest neighbor query, which can be more useful than range query, has been proposed.

Footnote: We shall use Haar wavelet transform and DWT interchangeably throughout this paper, unless specified otherwise.

Another method that has been employed for dimension reduction is Karhunen-Loeve (K-L) transform [28]. (This method is also known as Singular Value Decomposition (SVD) [22], and is called Principal Component Analysis in statistical literature.) Given a collection of high-dimensional points, we project them on a sub-space of lower dimensionality, maximizing the variances in the chosen dimensions. The key weakness of K-L transform is the deterioration of performance upon incremental update of the index. Therefore, a new projection matrix must be re-calculated and the index tree has to be re-organized periodically to keep up the search performance.

2.1. Wavelet Transform

Wavelets are basis functions used in representing data or other functions. Wavelet algorithms process data at different scales or resolutions, in contrast with DFT where only frequency components are considered. The origin of wavelets can be traced to the work of Karl Weierstrass [27] in 1873. The construction of the first orthonormal system by Haar [21] is an important milestone. The Haar basis is still a foundation of modern wavelet theory. Another significant advance is the introduction of a nonorthogonal basis by Dennis Gabor in 1946 [16]. In this work we shall advocate the use of the Haar wavelets in the problem of time series retrieval.

3. The Proposed Approach

Following a trend in the disciplines of signal and image processing, we propose to study the use of wavelet transformation for the time series indexing problem. Before we go into the details of our proposed techniques, we would first like to define the similarity model used in sequence matching. The first definition is based on the Euclidean distance $D(\vec{x}, \vec{y})$ between time sequences $\vec{x}$ and $\vec{y}$.

Definition 1 Given a threshold $\epsilon$, two time sequences $\vec{x}$ and $\vec{y}$ of equal length $n$ are said to be similar if

$$D(\vec{x}, \vec{y}) = \Big( \sum_{i=0}^{n-1} (x_i - y_i)^2 \Big)^{1/2} \le \epsilon$$

A shortcoming of Definition 1 is demonstrated in Figure 1. From human interpretation, $\vec{x}$ and $\vec{y}$ may be quite similar because $\vec{y}$ can be shifted up vertically to obtain $\vec{x}$ or vice versa. However, they will be considered not similar by Definition 1 because errors are accumulated at each pair of $x_i$ and $y_i$. Therefore, we suggest another similarity model.

Definition 2 Given a threshold $\epsilon$, two time sequences $\vec{x}$ and $\vec{y}$ of equal length $n$ are said to be v-shift similar if

$$D(\vec{x} - x_A, \vec{y} - y_A) = \Big( \sum_{i=0}^{n-1} \big( (x_i - x_A) - (y_i - y_A) \big)^2 \Big)^{1/2} \le \epsilon$$

where $x_A = \frac{1}{n} \sum_{i=0}^{n-1} x_i$ and $y_A = \frac{1}{n} \sum_{i=0}^{n-1} y_i$.

From Definition 2, any two time sequences are said to be v-shift similar if the Euclidean distance is less than or equal to a threshold $\epsilon$, neglecting their vertical offsets from the x-axis. This definition can give a better estimation of the similarity between two time sequences with similar trends running at two completely different levels.
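The two similarity models can be compared with a small Python sketch. The function names are illustrative and not taken from the paper; the only assumption is that both sequences have the same length.

```python
import math


def euclidean(x, y):
    """Euclidean distance of Definition 1 between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def v_shift_distance(x, y):
    """Distance of Definition 2: subtract each sequence's own mean before comparing."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return euclidean([a - mean_x for a in x], [b - mean_y for b in y])


def is_similar(x, y, eps):
    return euclidean(x, y) <= eps


def is_v_shift_similar(x, y, eps):
    return v_shift_distance(x, y) <= eps


# Two sequences with the same shape running at different levels.
low = [1, 2, 3, 2]
high = [11, 12, 13, 12]
print(euclidean(low, high), v_shift_distance(low, high))
```

For `low` and `high`, the plain Euclidean distance is large because the offset of 10 accumulates at every point, while the v-shift distance vanishes once the means are removed.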

Figure 1. Example of vertical shifts of time sequences

3.1. Haar Wavelets

We want to have a decomposition that is fast to compute and requires little storage for each sequence. The Haar wavelet is chosen for the following reasons: (1) it allows good approximation with a subset of coefficients, (2) it can be computed quickly and easily, requiring linear time in the length of the sequence and simple coding, and (3) it preserves Euclidean distance (see Section 3.3). The formal definition of Haar wavelets is given in Appendix A. Concrete mathematical foundations can be found in [9, 19] and related implementations in [14].

Haar transform can be seen as a series of averaging and differencing operations on a discrete time function. We compute the average and difference between every two adjacent values of $f(x)$. The procedure to find the Haar transform of a discrete function $f(x)$ = (9 7 3 5) is shown below.

Resolution    Averages     Coefficients
4             (9 7 3 5)
2             (8 4)        (1 -1)
1             (6)          (2)

Resolution 4 is the full resolution of the discrete function $f(x)$. In resolution 2, (8 4) are obtained by taking the average of (9 7) and (3 5) at resolution 4 respectively. (1 -1) are the differences of (9 7) and (3 5) divided by two respectively. This process is continued until a resolution of 1 is reached. The Haar transform $H(f(x))$ = (6 2 1 -1) is obtained, which is composed of the last average value 6 and the coefficients found in the rightmost column, 2, 1 and -1. It should be pointed out that the first coefficient is the overall average value of the whole time sequence, which is equal to (9 + 7 + 3 + 5)/4 = 6. Different resolutions can be obtained by adding difference values back to, or subtracting differences from, averages. For instance, (8 4) = (6+2 6-2) where 6 and 2 are the first and second coefficients respectively. This process can be done recursively until the full resolution is reached.
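The averaging and differencing procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the factor 1/2 is used at every step as in the worked example, and the input length is assumed to be a power of 2.

```python
def haar_transform(values):
    """Unnormalized Haar transform by repeated pairwise averaging and differencing."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    averages = list(values)
    details = []  # difference coefficients, coarsest resolution first
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        # Coefficients of the finer level go behind those of coarser levels.
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


def inverse_haar(coeffs):
    """Rebuild the sequence by adding and subtracting differences, level by level."""
    averages = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(averages)]
        averages = [v for a, d in zip(averages, level) for v in (a + d, a - d)]
        pos += len(level)
    return averages


print(haar_transform([9, 7, 3, 5]))        # [6.0, 2.0, 1.0, -1.0]
print(inverse_haar([6.0, 2.0, 1.0, -1.0])) # [9.0, 7.0, 3.0, 5.0]
```

Each pass touches half as many values as the previous one, which is where the linear running time claimed for the Haar transform comes from.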

Haar transform can be realized by a series of matrix multiplications as illustrated in Equation (2). Envision the example input signal $\vec{x}$ as a column vector with length $n$ = 4, an intermediate transform vector $\vec{x}'$ as another column vector, and $H$ as the Haar transform matrix:

$$\begin{bmatrix} x'_0 \\ x'_1 \\ x'_2 \\ x'_3 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} \qquad (2)$$

The factor 1/2 associated with the Haar transform matrix can be varied according to different normalization conditions. After the first multiplication of $H$ and $\vec{x}$, half of the Haar transform coefficients can be found, namely $x'_1$ and $x'_3$ in $\vec{x}'$, interleaved with the intermediate coefficients $x'_0$ and $x'_2$. Actually, $x'_1$ and $x'_3$ are the last two coefficients of the Haar transform. $x'_0$ and $x'_2$ are then extracted from $\vec{x}'$ and put into a new column vector $\vec{x}'' = [x'_0 \; x'_2 \; 0 \; 0]^T$, which is treated as the new input vector for transformation. This process is done recursively until one element is left in $\vec{x}''$. In this particular case, the first two coefficients of the Haar transform can be found in the second iteration.

The complexity of Haar transform can be evaluated by considering the number of operations involved in the recursion process.

Lemma 1 Given a time sequence of length $n$ where $n$ is an integral power of 2, the complexity of Haar transform is $O(n)$.

Proof: There are in total $n$ additions or subtractions in the first iteration of the matrix operation. The size of the input vector is halved in each subsequent iteration. The total number of operations is therefore the geometric series

$$n + \frac{n}{2} + \frac{n}{4} + \cdots < 2n$$

which is bounded by $O(n)$.

3.2. DFT versus Haar Transform

Our motivation for using Haar transform to replace DFT is based on several pieces of evidence and observations, some of which are also the reasons why the use of wavelet transforms instead of DFT is considered in areas of image and signal processing.

The first reason concerns the pruning power. The nature of the Euclidean distance preserved by Haar transform and by DFT is different. In DFT, comparison of two time sequences is based on their low frequency components, where most energy is presumed to be concentrated. On the other hand, the comparison of Haar coefficients matches a gradually refined resolution of the two time sequences. From intuition, Euclidean distance can be highly related to low resolution of the signal rather than low frequency components. This property can give rise to more effective pruning, i.e. fewer false alarms will appear, which is confirmed by experiments in Section 5.

Another reason is the complexity consideration. The complexity of Haar transform is $O(n)$ whilst $O(n \log n)$ computation is required for Fast Fourier Transform (FFT) [17]. Both impose a restriction on the length of time sequences, which must be an integral power of 2. Although these computations are all involved in the pre-processing stage, the complexity of the transformation can be a concern, especially when the database is large. From our experiments, the pre-processing time for DFT is about 3 to 4 times longer than that for Haar transform.

Footnote: As for Fast Fourier Transform, the length of the signal is restricted to numbers which are powers of 2.

Footnote: The normalization is described in Section 3.3.

Finally, the proposed method provides a better similarity model. Apart from Euclidean distance, our model can easily accommodate v-shift similarity of two time sequences (Definition 2) at a little more cost. That is, the situation where vertically shifted signals can match is accommodated. On the other hand, previous study on F-index did not make use of this similarity model.

Note that, similar to DFT, DWT will not require massive index re-organization because of database updating, which is a major drawback in using the K-L transform or SVD approach.

3.3. Guarantee of no False Dismissal

For FT and DFT, it is shown by Parseval's Theorem [23] that the energy of a signal is conserved in both time and frequency domains. Parseval's Theorem also shows that this is true for wavelet transforms. On the other hand, the Euclidean distances in the time and frequency domains are the same for DFT by Equation (1). This is a very important property that makes dimension reduction of sequence data possible. It guarantees that no qualified time sequence will be rejected, and thus no false dismissal. However, this property has not been shown for DWT in general, nor for the Haar wavelets. Here we show such a relationship.

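Lemma 2 Given a sequence $\vec{x} = (x_0 \; x_1)$ and a sequence $\vec{y} = (y_0 \; y_1)$. The Haar transforms of $\vec{x}$ and $\vec{y}$ are $H(\vec{x}) = \vec{X} = (X_0 \; X_1)$ and $H(\vec{y}) = \vec{Y} = (Y_0 \; Y_1)$ respectively. The lengths of $\vec{x}$, $\vec{y}$, $\vec{X}$ and $\vec{Y}$ are all equal to 2. Then the Euclidean distance $D(\vec{x}, \vec{y})$ is $\sqrt{2}$ times the Euclidean distance $D(\vec{X}, \vec{Y})$, i.e. $D(\vec{x}, \vec{y}) = \sqrt{2} \, D(\vec{X}, \vec{Y})$.

Proof: Express $\vec{X}$ in terms of $\vec{x}$ and $\vec{Y}$ in terms of $\vec{y}$ by applying Equation (2) accordingly:

$$X_0 = \frac{x_0 + x_1}{2}, \quad X_1 = \frac{x_0 - x_1}{2}, \quad Y_0 = \frac{y_0 + y_1}{2}, \quad Y_1 = \frac{y_0 - y_1}{2}$$

Square of the Euclidean distance of $\vec{X}$ and $\vec{Y}$:

$$(X_0 - Y_0)^2 + (X_1 - Y_1)^2 = \frac{1}{4} \Big[ \big( (x_0 - y_0) + (x_1 - y_1) \big)^2 + \big( (x_0 - y_0) - (x_1 - y_1) \big)^2 \Big] = \frac{1}{2} \Big[ (x_0 - y_0)^2 + (x_1 - y_1)^2 \Big]$$

Thus, $D(\vec{X}, \vec{Y})^2 = \frac{1}{2} D(\vec{x}, \vec{y})^2$, and $D(\vec{x}, \vec{y}) = \sqrt{2} \, D(\vec{X}, \vec{Y})$.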
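Lemma 2 is easy to check numerically. The following Python sketch uses hypothetical helper names and two arbitrary length-2 sequences; it assumes the factor 1/2 of Equation (2).

```python
import math


def haar2(pair):
    """Haar transform of a length-2 sequence with the factor 1/2 of Equation (2)."""
    return ((pair[0] + pair[1]) / 2, (pair[0] - pair[1]) / 2)


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


x, y = (3.0, 8.0), (5.0, 1.0)
d_time = euclidean(x, y)
d_haar = euclidean(haar2(x), haar2(y))
print(d_time, math.sqrt(2) * d_haar)  # the two values agree up to rounding
```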









Lemma 3 Given two sequences $\vec{x}$ and $\vec{y}$, whose Haar transforms (with the factor 1/2 of Equation (2)) are $\vec{X}$ and $\vec{Y}$ respectively. The lengths of $\vec{x}$, $\vec{y}$, $\vec{X}$ and $\vec{Y}$ are all $n$ ($n \ge 2$ and $n$ is a power of 2). Let $(\vec{X} - \vec{Y}) = (e_0 \ldots e_{n-1})$. The Euclidean distance $D(\vec{x}, \vec{y}) \equiv S_{\log n}$ can be expressed in terms of $(e_0 \ldots e_{n-1})$ recursively by

$$S_{i+1}^2 = 2 \left( S_i^2 + e_{2^i}^2 + e_{2^i+1}^2 + \cdots + e_{2^{i+1}-1}^2 \right) \quad \text{for } 0 \le i \le \log n - 1, \qquad S_0^2 = e_0^2 \qquad (3)$$

Figure 2. Hierarchy of Haar wavelet transform of sequence $\vec{x}$ of length $n$ (levels $S_0$ to $S_{\log n}$; the adjacent elements $(i+1, 2j)$ and $(i+1, 2j+1)$ at level $i+1$ yield the element $(i, j)$ at level $i$ and the difference coefficient $d_{i,j}$ at position $2^i + j$ of the transform)

Proof: In Figure 2, the original sequence $\vec{x}$ is represented at level $\log n$. The values of $a_{i,j}$ and $d_{i,j}$ are defined by

$$a_{i,j} = \frac{a_{i+1,2j} + a_{i+1,2j+1}}{2}, \qquad d_{i,j} = \frac{a_{i+1,2j} - a_{i+1,2j+1}}{2}$$

The Haar transform of $\vec{x}$, $H(\vec{x})$, is represented by $(a_{0,0} \; d_{0,0} \; \ldots \; d_{i,j} \; d_{i,j+1} \; \ldots \; d_{\log n - 1, \, n/2 - 1})$. A similar hierarchy exists for the other sequence $\vec{y}$, with elements $b_{i,j}$ and $d'_{i,j}$. Denote $X_{2^i+j} = d_{i,j}$ of sequence $\vec{x}$ and $Y_{2^i+j} = d'_{i,j}$ of sequence $\vec{y}$, where $0 \le i \le \log n - 1$.

We can treat the elements at each horizontal level of the hierarchy as a data sequence. Hence the sequence at level $i$ contains the data $a_{i,0}, \ldots, a_{i,2^i-1}$. Let us define $S_i^2$ to be $\sum_{j=0}^{2^i-1} (a_{i,j} - b_{i,j})^2$. $S_i$ can be seen as the Euclidean distance between the data sequences at level $i$ ($0 \le i \le \log n$) in the hierarchies for $\vec{x}$ and $\vec{y}$. Also, $S_{\log n}$ is the Euclidean distance between the given time series.

Next we prove the following statement:

$$S_{i+1}^2 = 2 \left( S_i^2 + e_{2^i}^2 + e_{2^i+1}^2 + \cdots + e_{2^{i+1}-1}^2 \right) \quad \text{for } 0 \le i \le \log n - 1 \qquad (4)$$

The base case is shown true by Lemma 2 when $i$ = 0:

$$S_1^2 = 2 \left( S_0^2 + e_1^2 \right)$$

We next prove the case for $i \ge 1$. We first note that in the given hierarchy, for a pair of adjacent elements at a level $i+1$ of the form $a_{i+1,2j}, a_{i+1,2j+1}$, we have the following relation

$$(a_{i+1,2j} - b_{i+1,2j})^2 + (a_{i+1,2j+1} - b_{i+1,2j+1})^2 = 2 \left[ (a_{i,j} - b_{i,j})^2 + (d_{i,j} - d'_{i,j})^2 \right] \qquad (5)$$

where $d'_{i,j}$ is the element in the hierarchy for $\vec{y}$ corresponding to $d_{i,j}$. This can be shown by repeating the proof in Lemma 2, replacing the two length-2 sequences of Lemma 2 by $(a_{i+1,2j} \; a_{i+1,2j+1})$ and $(b_{i+1,2j} \; b_{i+1,2j+1})$, and their Haar transforms by $(a_{i,j} \; d_{i,j})$ and $(b_{i,j} \; d'_{i,j})$. Note that $d_{i,j} - d'_{i,j} = X_{2^i+j} - Y_{2^i+j} = e_{2^i+j}$.

For $S_{i+1}^2$:

$$S_{i+1}^2 = \sum_{j=0}^{2^{i+1}-1} (a_{i+1,j} - b_{i+1,j})^2 = \sum_{j=0}^{2^i-1} \left[ (a_{i+1,2j} - b_{i+1,2j})^2 + (a_{i+1,2j+1} - b_{i+1,2j+1})^2 \right]$$

By Equation (5), we have

$$S_{i+1}^2 = 2 \sum_{j=0}^{2^i-1} \left[ (a_{i,j} - b_{i,j})^2 + e_{2^i+j}^2 \right] = 2 \left( S_i^2 + e_{2^i}^2 + e_{2^i+1}^2 + \cdots + e_{2^{i+1}-1}^2 \right)$$

Finally, by definition of $S_i$, $D(\vec{x}, \vec{y}) = S_{\log n}$, which completes the proof.
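The recursion of Lemma 3 can be exercised with a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, the Haar transform uses the unnormalized factor 1/2, coefficients are ordered with the overall average first and coarser differences before finer ones, and the recursion is started from $S_0^2 = e_0^2$.

```python
import math


def haar_transform(values):
    """Unnormalized Haar transform (factor 1/2), coarsest coefficients first."""
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


def distance_from_coefficient_differences(e):
    """Apply S_{i+1}^2 = 2 (S_i^2 + e_{2^i}^2 + ... + e_{2^{i+1}-1}^2), with S_0^2 = e_0^2."""
    s_squared = e[0] ** 2
    width = 1  # equals 2^i, the number of difference coefficients at level i
    while width < len(e):
        s_squared = 2 * (s_squared + sum(v * v for v in e[width:2 * width]))
        width *= 2
    return math.sqrt(s_squared)


x = [9, 7, 3, 5, 2, 8, 6, 4]
y = [1, 3, 5, 7, 9, 2, 4, 6]
e = [a - b for a, b in zip(haar_transform(x), haar_transform(y))]
time_domain = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
print(time_domain, distance_from_coefficient_differences(e))
```

The two printed values should agree up to floating-point rounding, which is the statement $D(\vec{x}, \vec{y}) = S_{\log n}$.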

The expression of the Euclidean distance between time sequences in terms of their Haar coefficients is not sufficient for proper use in multi-dimensional index trees unless Euclidean distance is preserved between the Haar and time domains, as for DFT in (1). This can be achieved by a normalization step which replaces the scaling factor in Equation (2), changing it from $1/2$ to $1/\sqrt{2}$ in the Haar transformation. After the normalization step, the Euclidean distance between sequences in the Haar domain will be equivalent to $S_{\log n}$ in Equation (3), i.e. the Euclidean distance between the original time series. The preservation of Euclidean distance of Haar transform ensures the completeness of feature extraction as in DFT.

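If only the first few dimensions of the Haar transform are used in the calculation of Euclidean distance in Equation (3), then the remaining coefficients of the Haar transformed sequences should be replaced by 0's. This replacement runs from the coefficient immediately after the last retained dimension to the $n$-th coefficient in the transformed sequences.

Lemma 4 If only the first few dimensions (at most $n$) of the Haar transform are used, no false dismissal will occur for range queries.

Proof: Considering the inequality in Definition 1 and Lemma 3,

$$D(\vec{x}, \vec{y}) = S_{\log n} \le \epsilon \qquad (6)$$

Using only the first few dimensions as index, the value of $e_i$ (the $i$-th component of the difference between the two Haar transforms) in Equation (3) becomes zero for every $i$ beyond the retained dimensions. Since every term in Equation (3) is non-negative, the Euclidean distance between the two sequences computed from the retained dimensions is at most $S_{\log n} \le \epsilon$. This completes the proof.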
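The effect of the normalization step and of Lemma 4 can be observed with a minimal Python sketch. The names `haar_normalized` and `truncated_distance` are hypothetical; the sketch assumes the factor $1/\sqrt{2}$ is applied at every averaging and differencing step and that coefficients are ordered from coarse to fine.

```python
import math


def haar_normalized(values):
    """Haar transform with the factor 1/sqrt(2) at every step (distance preserving)."""
    averages = list(values)
    details = []
    root2 = math.sqrt(2)
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        details = [(a - b) / root2 for a, b in pairs] + details
        averages = [(a + b) / root2 for a, b in pairs]
    return averages + details


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def truncated_distance(x, y, num_coeffs):
    """Distance using only the first num_coeffs normalized Haar coefficients."""
    return euclidean(haar_normalized(x)[:num_coeffs], haar_normalized(y)[:num_coeffs])


x = [9, 7, 3, 5, 2, 8, 6, 4]
y = [1, 3, 5, 7, 9, 2, 4, 6]
for k in range(1, len(x) + 1):
    print(k, truncated_distance(x, y, k), euclidean(x, y))
```

The truncated distance grows with the number of retained coefficients and reaches the time-domain distance when all of them are kept, so a range filter on the truncated distance cannot dismiss a qualifying sequence.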

4. The Overall Strategy

In this section, we present the overall strategy of our time series matching method and propose our own method for nearest neighbor query. Before querying is performed, we shall do some pre-processing to extract the feature vectors with reduced dimensionality, and to build the index. After the index is built, content-based search can be performed for two types of querying: range querying and $n$-nearest-neighbors querying.

4.1. Pre-processing

Step 1 - Similarity Model Selection: According to their applications, users may choose to use either the simple Euclidean distance (Definition 1) or the v-shift similarity (Definition 2) as their similarity measurement. For Definition 1, Haar transform is applied to time series. For Definition 2, Haar transform is applied to time series, but the first Haar coefficient will not be used in indexing, as there is no need to match their average values.

Step 2 - Index Construction: Given a database of time series of varying length, we pre-process the time series as follows. We obtain the Haar transform, with as many points as the window size, by applying Equation (2) with the normalization factor to each subsequence obtained with a fixed-size sliding window over each sequence in the database. An index structure such as an R-Tree is built using the first few Haar coefficients, where the number of coefficients is an optimal value found by experiments based on the number of page accesses. This is because of a trade-off between post-processing cost and index dimension.
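The feature extraction of Step 2 can be sketched as follows. This is an assumption-laden illustration: `window_features`, `window`, `num_coeffs` and `v_shift` are hypothetical names, the window size must be a power of 2, and the result is a plain list of (offset, feature vector) pairs that would then be inserted into an R-Tree or a similar index.

```python
import math


def haar_normalized(values):
    """Haar transform with the factor 1/sqrt(2) at every step."""
    averages = list(values)
    details = []
    root2 = math.sqrt(2)
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        details = [(a - b) / root2 for a, b in pairs] + details
        averages = [(a + b) / root2 for a, b in pairs]
    return averages + details


def window_features(series, window, num_coeffs, v_shift=False):
    """Slide a fixed-size window over series and keep the leading Haar coefficients.

    With v_shift=True the first coefficient (proportional to the window mean)
    is skipped, following Step 1 for Definition 2.
    """
    start = 1 if v_shift else 0
    features = []
    for offset in range(len(series) - window + 1):
        coeffs = haar_normalized(series[offset:offset + window])
        features.append((offset, coeffs[start:start + num_coeffs]))
    return features


for offset, feature in window_features([9, 7, 3, 5, 1, 1], window=4, num_coeffs=2):
    print(offset, feature)
```

A larger `num_coeffs` reduces false alarms in post-processing but raises the index dimension, which is the trade-off mentioned above.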

4.2. Range Query

After we have built the index, we can carry out range query or nearest neighbor query evaluation. For range queries, two steps are involved:

1. Similar sequences with distances $\le \epsilon$ from the query are looked up in the index and returned.

2. A post-processing step is applied on these sequences to find the true distances in time domain to remove all false alarms.
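The two steps of a range query can be sketched in Python as a filter followed by a refinement. This is only a sketch: a linear scan over truncated Haar features stands in for the R-Tree lookup, whole sequences of equal power-of-2 length are assumed, and `range_query` and `num_coeffs` are hypothetical names.

```python
import math


def haar_normalized(values):
    """Haar transform with the factor 1/sqrt(2) at every step."""
    averages = list(values)
    details = []
    root2 = math.sqrt(2)
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        details = [(a - b) / root2 for a, b in pairs] + details
        averages = [(a + b) / root2 for a, b in pairs]
    return averages + details


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def range_query(database, query, eps, num_coeffs):
    """Return (answers, number of candidates) for a range query with threshold eps."""
    q_feat = haar_normalized(query)[:num_coeffs]
    # Step 1: candidates whose reduced-dimension distance is within eps.
    candidates = [
        s for s in database
        if euclidean(haar_normalized(s)[:num_coeffs], q_feat) <= eps
    ]
    # Step 2: post-processing in the time domain removes the false alarms.
    answers = [s for s in candidates if euclidean(s, query) <= eps]
    return answers, len(candidates)


database = [[9, 7, 3, 5], [8, 7, 3, 6], [1, 1, 1, 1], [9, 7, 5, 3]]
answers, num_candidates = range_query(database, [9, 7, 3, 5], eps=2, num_coeffs=2)
print(answers, num_candidates)
```

In this small example the sequence (9 7 5 3) shares its first two coefficients with the query, so it survives Step 1 as a false alarm and is removed in Step 2.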

4.3. Nearest Neighbor Query

For nearest neighbor query, we propose a two-phase evaluation as follows.

Phase 1

In the first phase, $n$ nearest neighbors of the query are found in the R-Tree index using the algorithm in [25]. The Euclidean distances in time domain (full dimension) are computed between the query sequence and all $n$ nearest neighbors obtained, and the neighbor farthest from the query among them is identified.

Phase 2

A range query evaluation is then performed on the same index by setting $\epsilon$ initially to the full-dimension distance between the query and that farthest neighbor. During the search,



Footnote: Using Definition 2, one dimension can be saved in the index tree.
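A minimal Python sketch of the two-phase idea follows. Several parts are assumptions rather than the authors' method: a sorted linear scan over truncated Haar features replaces both the R-Tree and the nearest neighbor algorithm of [25], the way the range is tightened during Phase 2 is one plausible reading of the dynamic range adjustment mentioned in the introduction, and all names are hypothetical.

```python
import math


def haar_normalized(values):
    """Haar transform with the factor 1/sqrt(2) at every step."""
    averages = list(values)
    details = []
    root2 = math.sqrt(2)
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        details = [(a - b) / root2 for a, b in pairs] + details
        averages = [(a + b) / root2 for a, b in pairs]
    return averages + details


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def nearest_neighbors(database, query, n, num_coeffs):
    """Two-phase n-nearest-neighbor search over truncated Haar features."""
    feats = [haar_normalized(s)[:num_coeffs] for s in database]
    q_feat = haar_normalized(query)[:num_coeffs]
    reduced = [euclidean(f, q_feat) for f in feats]

    # Phase 1: n nearest neighbors in the reduced space, then their true distances.
    phase1 = sorted(range(len(database)), key=lambda i: reduced[i])[:n]
    best = sorted((euclidean(database[i], query), i) for i in phase1)
    eps = best[-1][0]  # distance to the farthest Phase 1 neighbor

    # Phase 2: range query with eps, shrinking eps whenever a closer sequence appears.
    for i in range(len(database)):
        if i in phase1 or reduced[i] > eps:
            continue  # pruned: the reduced distance is a lower bound on the true one
        true_distance = euclidean(database[i], query)
        if true_distance < eps:
            best = sorted(best + [(true_distance, i)])[:n]
            eps = best[-1][0]
    return [database[i] for _, i in best]


database = [[9, 7, 3, 5], [8, 7, 3, 6], [1, 1, 1, 1], [9, 7, 5, 3]]
print(nearest_neighbors(database, [9, 7, 3, 5], n=2, num_coeffs=2))
```

Because the reduced distance never exceeds the true distance, any sequence skipped in Phase 2 is farther from the query than the current $n$-th best answer.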
