Efficient time series matching by wavelets
Summary
1. Introduction
- Time series data are of growing importance in many new database applications, such as data warehousing and data mining [3, 8, 2, 12].
- Indexing is used to support efficient retrieval and matching of time series.
- The authors propose to apply the wavelet transform to time series for dimension reduction and content-based search.
- Moreover, the authors show by experiments that the Haar Wavelet Transform [9], a commonly used wavelet transform, can outperform DFT significantly.
2.1. Wavelet Transform
- Wavelets are basis functions used in representing data or other functions.
- Wavelet algorithms process data at different scales or resolutions in contrast with DFT where only frequency components are considered.
- The origin of wavelets can be traced to the work of Karl Weierstrass [27] in 1873.
- The Haar basis is still a foundation of modern wavelet theory.
- Another significant advance is the introduction of a nonorthogonal basis by Dennis Gabor in 1946 [16].
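As an illustration of how a wavelet transform processes data at different resolutions, the following is a minimal sketch of a Haar transform in plain Python. The function name `haar_transform` and the orthonormal scaling (dividing by the square root of 2 at every level) are assumptions made for this example; the paper's own Equation (2) and normalization factor are not reproduced in this summary.

```python
import math

def haar_transform(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(v) for v in x]
    details = []  # detail coefficients, coarsest level first
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Scaled pairwise differences capture change at the current resolution.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        # Scaled pairwise sums form the next, coarser approximation.
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details

coeffs = haar_transform([9, 7, 3, 5])
```

The first coefficient summarizes the overall level of the sequence, and later coefficients describe progressively finer detail, which is why keeping only the first few coefficients acts as a dimension reduction.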
3. The Proposed Approach
- Following a trend in the disciplines of signal and image processing, the authors propose to study the use of wavelet transformation for the time series indexing problem.
- The v-shift similarity definition (Definition 2) can give a better estimate of the similarity between two time sequences that follow similar trends at two completely different levels.
3.2. DFT versus Haar Transform
- The authors' motivation for using the Haar transform to replace DFT is based on several pieces of evidence and observations, some of which are also the reasons why wavelet transforms are considered instead of DFT in image and signal processing.
- Apart from Euclidean distance, their model can easily accommodate v-shift similarity of two time sequences (Definition 2) at a little more cost.
- That is, the situation where vertically shifted signals can match is accommodated.
- On the other hand, the previous study on F-index did not make use of this similarity model.
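Definition 2 is not reproduced in this summary, so the sketch below assumes that v-shift similarity compares two sequences after subtracting each sequence's own mean. Under that assumption, and with an orthonormal Haar transform (also an assumption), the v-shift distance equals the Euclidean distance over all Haar coefficients except the first. The names `v_shift_distance` and `haar_transform` are illustrative only.

```python
import math

def haar_transform(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    approx, details = [float(v) for v in x], []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def v_shift_distance(x, y):
    """Euclidean distance after removing each sequence's mean (assumed definition)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return euclidean([v - mean_x for v in x], [v - mean_y for v in y])

x = [1.0, 3.0, 2.0, 4.0]
y = [101.0, 103.5, 102.0, 104.0]  # similar trend, much higher level

plain = euclidean(x, y)            # dominated by the vertical offset
shifted = v_shift_distance(x, y)   # small, because the trends are similar
# Dropping the first Haar coefficient removes the mean level of each sequence.
haar_shifted = euclidean(haar_transform(x)[1:], haar_transform(y)[1:])
```

Here `shifted` and `haar_shifted` agree up to rounding, which suggests why the model can accommodate vertical shifts at little extra cost: the index simply ignores the first coefficient.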
3.3. Guarantee of no False Dismissal
- For FT and DFT, Parseval’s Theorem [23] shows that the energy of a signal is conserved in both the time and frequency domains.
- Parseval’s Theorem also shows that this situation is true for wavelet transforms.
- Expressing the Euclidean distance between time sequences in terms of their Haar coefficients is not sufficient for use in multi-dimensional index trees unless Euclidean distance is preserved in both the Haar and time domains, as it is for DFT in (1).
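The no-false-dismissal argument can be checked numerically. The sketch below assumes an orthonormal Haar transform (the helper `haar_transform` is illustrative, not the paper's Equation (2)) and verifies two properties: the distance over all coefficients equals the time-domain distance, and the distance over the first few coefficients never exceeds it.

```python
import math

def haar_transform(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    approx, details = [float(v) for v in x], []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

x = [2.0, 4.0, 6.0, 8.0, 7.0, 5.0, 3.0, 1.0]
y = [1.0, 5.0, 5.0, 9.0, 8.0, 4.0, 4.0, 0.0]

d_time = euclidean(x, y)
hx, hy = haar_transform(x), haar_transform(y)
d_haar = euclidean(hx, hy)  # uses all coefficients

# Distance over the first k coefficients, for k = 1 .. len(x).
lower_bounds = [euclidean(hx[:k], hy[:k]) for k in range(1, len(x) + 1)]
```

Because every truncated distance is a lower bound on the true distance, a range query on the reduced features can return false alarms but cannot miss a qualifying sequence.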
4. The Overall Strategy
- The authors present the overall strategy of their time series matching method and propose their own method for nearest neighbor query.
- Before querying is performed, the authors carry out pre-processing to extract feature vectors with reduced dimensionality and to build the index.
- After the index is built, content-based search can be performed for two types of querying: range querying and nearest-neighbor querying.
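The range-query path of this strategy can be sketched as a filter-and-refine loop. This is a simplified illustration, not the authors' implementation: a linear scan over the feature vectors stands in for the R-Tree, the Haar transform is assumed orthonormal, and the names `range_query`, `random_walk`, `eps` and `k` are chosen for the example.

```python
import math
import random

def haar_transform(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    approx, details = [float(v) for v in x], []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def range_query(database, query, eps, k):
    """Return (candidate indices, answer indices) for a range query."""
    # Pre-processing: keep only the first k Haar coefficients per sequence.
    features = [haar_transform(s)[:k] for s in database]
    qf = haar_transform(query)[:k]
    # Filter step (stand-in for the index); tiny slack guards against rounding.
    candidates = [i for i, f in enumerate(features)
                  if euclidean(f, qf) <= eps + 1e-9]
    # Post-processing: remove false alarms using the full sequences.
    answers = [i for i in candidates if euclidean(database[i], query) <= eps]
    return candidates, answers

def random_walk(length, rng):
    value, walk = 0.0, []
    for _ in range(length):
        value += rng.uniform(-1.0, 1.0)
        walk.append(value)
    return walk

rng = random.Random(0)
database = [random_walk(16, rng) for _ in range(50)]
query = database[0]
candidates, answers = range_query(database, query, eps=3.0, k=4)
```

The candidate list may be longer than the answer list, but it always contains every true answer, since the reduced-space distance cannot exceed the true distance.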
4.1. Pre-processing
- Step 1 - Similarity Model Selection: Depending on their application, users may choose either the simple Euclidean distance (Definition 1) or the v-shift similarity (Definition 2) as their similarity measure.
- For nearest neighbor query, the authors propose a two-phase evaluation as follows.
- The effectiveness of this nearest neighbor search algorithm arises from the query range value found in Phase 1, which is small enough to prune a large number of candidates in Phase 2.
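A two-phase nearest neighbor search of this kind might look like the sketch below. Several points are assumptions for illustration: a sorted linear scan replaces the R-Tree, the Haar transform is orthonormal, the query range is taken as the largest true distance among the Phase 1 results, and the names `knn_two_phase`, `n` and `k` are hypothetical. The sketch also omits any refinement of the range during Phase 2.

```python
import math
import random

def haar_transform(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    approx, details = [float(v) for v in x], []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_two_phase(database, query, n, k):
    """Indices of the n nearest sequences, using k Haar coefficients as features."""
    features = [haar_transform(s)[:k] for s in database]
    qf = haar_transform(query)[:k]
    ids = range(len(database))
    # Phase 1: n nearest neighbors in feature space; their true distances
    # give a query range that is guaranteed to contain the real answers.
    seeds = sorted(ids, key=lambda i: euclidean(features[i], qf))[:n]
    eps = max(euclidean(database[i], query) for i in seeds)
    # Phase 2: range query in feature space, then rank by true distance.
    candidates = [i for i in ids if euclidean(features[i], qf) <= eps + 1e-9]
    return sorted(candidates, key=lambda i: euclidean(database[i], query))[:n]

def random_walk(length, rng):
    value, walk = 0.0, []
    for _ in range(length):
        value += rng.uniform(-1.0, 1.0)
        walk.append(value)
    return walk

rng = random.Random(1)
database = [random_walk(16, rng) for _ in range(60)]
query = random_walk(16, rng)
nearest = knn_two_phase(database, query, n=3, k=4)
```

The range from Phase 1 is safe because at least `n` sequences lie within it, so the true `n` nearest neighbors do too, and their feature-space distances are no larger than their true distances.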
5. Performance Evaluation
- Experiments using real stock data and synthetic random walk data have been carried out.
- The enhancement in precision of Haar transform over DFT increases with the number of dimensions.
- In Figure 4, the precision of Haar and Haar(V-shift) is shown.
- The time series of financial data consist of a sequence of values fluctuating around a relatively constant level, which is the average value of that time sequence.
- This agrees with the result depicted in Figure 5, where the page accesses of the best dimensions of DFT (dim. 5), Haar (dim. 7), and Haar(V-shift) (dim. 10) are shown.
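The summary does not state how precision is computed. Assuming it means the fraction of index candidates that survive post-processing, the sketch below measures it for a Haar-based filter at several dimensionalities on synthetic random walk data. The function `precision`, the data generator and the parameter values are illustrative assumptions and do not reproduce the paper's experiments.

```python
import math
import random

def haar_transform(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    approx, details = [float(v) for v in x], []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def precision(database, query, eps, k):
    """Qualifying sequences divided by candidates passed by the k-dim filter."""
    qf = haar_transform(query)[:k]
    candidates = [s for s in database
                  if euclidean(haar_transform(s)[:k], qf) <= eps + 1e-9]
    qualifying = [s for s in candidates if euclidean(s, query) <= eps]
    return len(qualifying) / len(candidates) if candidates else 1.0

def random_walk(length, rng):
    value, walk = 0.0, []
    for _ in range(length):
        value += rng.uniform(-1.0, 1.0)
        walk.append(value)
    return walk

rng = random.Random(2)
database = [random_walk(16, rng) for _ in range(80)]
query = database[0]
by_dim = {k: precision(database, query, eps=4.0, k=k) for k in (1, 2, 4, 8, 16)}
```

With more coefficients the filter passes fewer false alarms, so this notion of precision cannot decrease as the dimension grows; the cost is a larger index, which is why the best dimension is found experimentally.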
5.1. Scalability Test
- The authors study the scalability of their method by varying the size of the synthetic time series database and the length of its sequences.
- Databases of different sizes (5k to 30k) and sequences of different lengths (256 to 2048) are generated separately, as described in the previous section.
- Haar and Haar(V-shift) scale better than DFT as database size and sequence length increase.
- Similar results have also been recorded for range queries.
- The poorer precision of DFT creates more work in the post-processing step and this affects the overall performance, especially in terms of the amount of disk accesses for large databases with long sequences.
6. Conclusion
- An efficient time series matching technique through dimension reduction by Haar Wavelet Transform is proposed.
- The first few coefficients of the transformed sequences are indexed in an R-Tree for similarity search.
- Experiments show that their method outperforms the F-index (Discrete Fourier Transform) method in terms of pruning power, number of page accesses, scalability, and complexity.
- The authors could also try applying wavelets that did not work well with stock data to other signals, e.g. sinusoidal signals or electrocardiographs (ECGs).
Citations
1,922 citations
Cites background or methods from "Efficient time series matching by w..."
...Indeed, it is in this context that most of the representations enumerated in Figure 1 were introduced [7, 14, 22, 35]....
[...]
...With this in mind, a great number of time series representations have been introduced, including the Discrete Fourier Transform (DFT) [14], the Discrete Wavelet Transform (DWT) [7], Piecewise Linear, and Piecewise Constant models (PAA) [22], (APCA) [16, 22], and Singular Value Decomposition (SVD) [22]....
[...]
...As a simple example, wavelets have the useful multiresolution property, but are only defined for time series that are an integer power of two in length [7]....
[...]
...Most applications assume that we have one very long time series T, and that manageable subsequences of length n are extracted by use of a sliding window, then stored in a matrix for further manipulation [7, 14, 22, 35]....
[...]
...To perform query by content, we built an index using SAX, and compared it to an index built using the Haar wavelet approach [7]....
[...]
1,550 citations
Cites background or methods from "Efficient time series matching by w..."
...However it might be argued the inherent smoothing effect of dimensionality reduction using DFT, DWT or SVD would help smooth out the noise and produce better results....
[...]
...utilizes the Discrete Fourier Transform (DFT) to perform the dimensionality reduction, but other techniques have been suggested, including Singular Value Decomposition (SVD) [34] and the Discrete Wavelet Transform (DWT) [6]....
[...]
...Because we wished to include the DWT in our experiments, we are limited to query lengths that are an integer power of two....
[...]
...The Discrete Haar Wavelet Transform (DWT) can be calculated efficiently and an entire dataset can be indexed in O(mn)....
[...]
...7 and DRW is defined as: ( ) ),,min( 11 iii N n N n www +−= , ( )∑ = −= N i iiiN n yxwYWXDRW 1 2)],,([ (9) Note that it is not possible to modify DFT, DWT or SVD in a similar manner, because each coefficient represents a signal that is added along the entire length of the query....
[...]
1,452 citations
Cites background or methods from "Efficient time series matching by w..."
...Figure 1 illustrates a hierarchy of all the various time series representations in the literature (Andre-Jonsson and Badal 1997; Chan and Fu 1999; Faloutsos et al. 1994; Geurts 2001; Huang and Yu 1999; Keogh et al. 2001a; Keogh and Pazzani 1998; Roddick et al. 2001; Shahabi et al. 2000; Yi and…...
[...]
...• Indexing: Given a query time series Q, and some similarity/dissimilarity measure D(Q,C), find the most similar time series in database DB (Agrawal et al. 1995; Chan and Fu 1999; Faloutsos et al. 1994; Keogh et al. 2001a; Yi and Faloutsos 2000)....
[...]
...With this in mind, a great number of time series representations have been introduced, including the Discrete Fourier Transform (DFT) (Faloutsos et al. 1994), the Discrete Wavelet Transform (DWT) (Chan and Fu 1999), Piecewise Linear, and Piecewise Constant models (PAA) (Keogh et al. 2001a), (APCA) (Geurts 2001; Keogh et al. 2001a), and Singular Value Decomposition (SVD) (Keogh et al. 2001a)....
[...]
...To perform query by content, we build an index using SAX, and compare it to an index built using the Haar wavelet approach (Chan and Fu 1999)....
[...]
...1 were introduced (Chan and Fu 1999; Faloutsos et al. 1994; Keogh et al. 2001a; Yi and Faloutsos 2000)....
[...]
1,387 citations
Additional excerpts
...Many techniques have been proposed in the literature for representing time series with reduced dimensionality, such as Discrete Fourier Transformation (DFT) [13], Single Value Decomposition (SVD) [13], Discrete Cosine Transformation (DCT) [29], Discrete Wavelet Transformation (DWT) [33], Piecewise Aggregate Approximation (PAA) [24], Adaptive Piecewise Constant Approximation (APCA) [23], Chebyshev polynomials (CHEB) [6], Symbolic Aggregate approXimation (SAX) [30], Indexable Piecewise Linear Approximation (IPLA) [11] and etc....
[...]
...…(DFT) [13], Single Value Decomposition (SVD) [13], Discrete Cosine Transformation (DCT) [29], Discrete Wavelet Transformation (DWT) [33], Piecewise Aggregate Approximation (PAA) [24], Adaptive Piecewise Constant Approximation (APCA) [23], Chebyshev polynomials (CHEB) [6], Symbolic Aggregate…...
[...]
References
7,336 citations
"Efficient time series matching by w..." refers methods in this paper
...Many multi-dimensional indexing methods [13, 7, 5, 20] such as the R-Tree and R*-Tree [20, 5, 11] scale exponentially for high dimensionalities, eventually reducing the performance to that of sequential scanning or worse....
[...]
...These coefficients can be indexed in an R-Tree or R*-Tree for fast retrieval....
[...]
5,663 citations
Additional excerpts
...Time series data are of growing importance in many new database applications, such as data warehousing and data mining [3, 8, 2, 12]....
[...]
Frequently Asked Questions
Q2. What are the contributions in "Efficient time series matching by wavelets"?
While the use of DFT and the K-L transform or SVD has been studied in the literature, to the authors' knowledge there is no in-depth study on the application of DWT. In this paper, the authors propose to use the Haar Wavelet Transform for time series indexing. The major contributions are: (1) the authors show that Euclidean distance is preserved in the Haar transformed domain and no false dismissal will occur, (2) they show through experiments that the Haar transform can outperform DFT, (3) a new similarity model is suggested to accommodate vertical shift of time series, and (4) a two-phase method is proposed for efficient nearest neighbor query in time series databases.
Q3. How do the authors obtain the Haar transform?
The authors obtain the Haar transform by applying Equation (2), with the normalization factor, to each subsequence extracted by a sliding window from each sequence in the database.
Q4. How can the authors avoid missing any qualifying object?
To avoid missing any qualifying object, the Euclidean distance in the reduced-dimensional space should be less than or equal to the Euclidean distance between the two original time sequences.
Q5. What are the results of the experiments?
Experiments show that their method outperforms the F-index (Discrete Fourier Transform) method in terms of pruning power, number of page accesses, scalability, and complexity.
Q6. How many coefficients can be found in the Haar transform?
After the first multiplication, half of the Haar transform coefficients can be found, interleaved with some intermediate coefficients.
Q7. What is the importance of the precision in a query?
As most of the page accesses of a query are devoted to removing false alarm, the precision is crucial to the overall performance of query evaluation.
Q8. What is the effect of sliding window on a trail?
Feature points at nearby offsets form a trail because the sliding window advances one step at a time. The minimum bounding rectangle (MBR) of a trail is then indexed in an R-Tree instead of the feature points themselves.
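The trail idea can be illustrated with a short sketch. It assumes an orthonormal Haar transform and bounds the whole trail of one sequence with a single rectangle; how the authors divide a trail before bounding it is not stated in this summary. The names `trail_mbr`, `w` (window size) and `k` (number of coefficients kept) are hypothetical.

```python
import math

def haar_transform(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    approx, details = [float(v) for v in x], []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details

def trail_mbr(sequence, w, k):
    """Bounding box (low corner, high corner) of the k-dim feature trail."""
    # One feature point per window offset; consecutive points form the trail.
    points = [haar_transform(sequence[i:i + w])[:k]
              for i in range(len(sequence) - w + 1)]
    low = [min(p[d] for p in points) for d in range(k)]
    high = [max(p[d] for p in points) for d in range(k)]
    return low, high

series = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 8.0, 7.0, 9.0, 10.0]
low, high = trail_mbr(series, w=4, k=2)
```

Storing one rectangle per trail rather than one point per offset keeps the index much smaller, at the price of some extra false alarms that post-processing has to remove.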
Q9. What is the effect of the poorer precision of DFT?
The poorer precision of DFT creates more work in the post-processing step and this affects the overall performance, especially in terms of the amount of disk accesses for large databases with long sequences.
Q10. What is the way to build an index?
An index structure such as an R-Tree is built using the first few Haar coefficients, where the number of coefficients is an optimal value found by experiments based on the number of page accesses.
Q11. How can the authors improve the performance of the R-Tree?
The extra update step introduced in Phase 2 can enhance performance by pruning more non-qualifying MBRs during the traversal of the R-Tree.
Q12. What is the way to accommodate v-shift similarity?
Apart from Euclidean distance, their model can easily accommodate v-shift similarity of two time sequences (Definition 2) at a little more cost.
Q13. What are the main reasons why time series data are of growing importance in many new database applications?
Time series data are of growing importance in many new database applications, such as data warehousing and data mining [3, 8, 2, 12].
Q14. What are the other wavelets that the authors have found?
From experiments, the authors find that the other wavelets also seem to preserve Euclidean distances; however, so far they have a proof of this property only for the Haar wavelets.