Proceedings Article
ChainLink: Indexing Big Time Series Data For Long Subsequence Matching
Noura Alghamdi, Liang Zhang, Huayi Zhang, Elke A. Rundensteiner, Mohamed Y. Eltabakh +4 more
- pp 529-540
TL;DR: This work proposes a lightweight distributed indexing framework, called ChainLink, that supports approximate kNN queries over TB-scale time series data, and designs a novel hashing technique, called Single Pass Signature (SPS), that successfully tackles the above problem.
Abstract:
Scalable subsequence matching is critical for supporting analytics on big time series, from mining and prediction to hypothesis testing. However, state-of-the-art subsequence matching techniques do not scale well to TB-scale datasets. Not only does index construction become prohibitively expensive, but the query response time also deteriorates quickly as the length of the query subsequence exceeds several hundred data points. Although Locality Sensitive Hashing (LSH) has emerged as a promising solution for indexing long time series, it relies on expensive hash functions that perform multiple passes over the data and thus is impractical for big time series. In this work, we propose a lightweight distributed indexing framework, called ChainLink, that supports approximate kNN queries over TB-scale time series data. As a foundation of ChainLink, we design a novel hashing technique, called Single Pass Signature (SPS), that successfully tackles the above problem. In particular, we prove theoretically and demonstrate experimentally that the similarity proximity of the indexed subsequences is preserved by our proposed single-pass SPS scheme. Leveraging this SPS innovation, ChainLink then adopts a three-step approach for scalable index building: (1) in-place data re-organization within each partition to enable efficient record-level random access to all subsequences, (2) parallel building of hash-based local indices on top of the re-organized data using our SPS scheme for efficient search within each partition, and (3) efficient aggregation of the local indices to construct a centralized yet highly compact global index for effective pruning of irrelevant partitions during query processing. ChainLink achieves the above three steps in one single map-reduce process. Our experimental evaluation shows that ChainLink indices are compact at less than 2% of dataset size, while state-of-the-art index sizes tend to be almost the same size as the dataset. Better still, ChainLink is up to 2 orders of magnitude faster in its index construction time compared to state-of-the-art techniques, while improving both the final query response time by up to 10-fold and the result accuracy by 15%.
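The local-index and global-index steps described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the abstract does not specify how SPS computes its signatures, so the sketch substitutes a generic sign-random-projection LSH as a stand-in hash, and it omits the in-place re-organization step and the map-reduce execution. All function names (`make_hyperplanes`, `signature`, `build_local_index`, `build_global_index`, `knn_query`) and parameters are hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def make_hyperplanes(length, bits, seed=0):
    """Draw `bits` random Gaussian vectors of dimension `length` (stand-in hash, NOT SPS)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(bits)]


def signature(subseq, planes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, subseq)) >= 0 else 0
        for plane in planes
    )


def build_local_index(series, length, planes):
    """Local index for one partition: signature -> start offsets of subsequences."""
    index = defaultdict(list)
    for start in range(len(series) - length + 1):
        index[signature(series[start:start + length], planes)].append(start)
    return index


def build_global_index(local_indices):
    """Global index: signature -> ids of partitions holding that signature."""
    global_index = defaultdict(set)
    for pid, local in local_indices.items():
        for sig in local:
            global_index[sig].add(pid)
    return global_index


def knn_query(query, k, partitions, local_indices, global_index, planes):
    """Approximate kNN: visit only partitions whose signature bucket matches the query."""
    sig = signature(query, planes)
    candidates = []
    for pid in global_index.get(sig, ()):  # prune partitions without this signature
        for start in local_indices[pid][sig]:
            sub = partitions[pid][start:start + len(query)]
            dist = sum((a - b) ** 2 for a, b in zip(query, sub)) ** 0.5
            candidates.append((dist, pid, start))
    return sorted(candidates)[:k]
```

In this sketch, `partitions` is assumed to be a dict mapping a partition id to a list of floats. A caller would build one local index per partition with `build_local_index`, aggregate them with `build_global_index`, and then pass a query subsequence of the indexed length to `knn_query`. Because only the exactly matching signature bucket is probed, the result is approximate and may miss true neighbors that hash to a different bucket.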
Citations
Duality-based subsequence matching in time-series databases
TL;DR: In this article, the authors proposed a new subsequence matching method, Dual Match, which exploits duality in constructing windows and significantly improves the performance of the FRM algorithm by storing minimum bounding rectangles rather than individual points representing windows.
Proceedings Article
Analysis of current trends in relational database indexing
Michal Kvet, Karol Matiasko +1 more
TL;DR: This paper discusses the auto-indexing methods provided by DBS Oracle, highlights their limitations, and proposes its own techniques to remove the impact of peaks caused by adding new queries to the system for which no suitable index is present.
Book Chapter
Flower Master Index for Relational Database Selection and Joining
Michal Kvet, Karol Matiasko +1 more
TL;DR: In this article, the authors propose block identification objects stored in private or shared memory areas to improve the performance of the index in a relational database, which is one of the key features ensuring data retrieval performance.
Proceedings Article
Scalable Time Series Compound Infrastructure
TL;DR: This work introduces new similarity-match semantics as well as a compact misalignment-resilient representation for TSCs, and designs a TSC-aware distributed indexing infrastructure Sloth that supports scalable storage, indexing and querying of TB-scale TSC datasets.
Journal Article
The Inherent Time Complexity and An Efficient Algorithm for Subsequence Matching Problem
TL;DR: The inherent time complexity of the subsequence matching problem is studied, an efficient algorithm for solving the problem is proposed, and a new summarization method as well as a novel index for series data are designed.
References
Proceedings Article
Similarity Search in High Dimensions via Hashing
TL;DR: Experimental results indicate that the novel hashing-based scheme for approximate similarity search scales well even for a relatively large number of dimensions, and that the method improves running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition.
Journal Article
Universal classes of hash functions
TL;DR: An input-independent, average linear-time algorithm for storage and retrieval on keys that makes a random choice of hash function from a suitable class of hash functions.
Proceedings Article
Fast subsequence matching in time-series databases
TL;DR: An efficient indexing method to locate 1-dimensional subsequences within a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance.
Book
Mining of Massive Datasets
TL;DR: Determining relevant data is key to delivering value from massive amounts of data and big data is defined less by volume which is a constantly moving target than by its ever-increasing variety, velocity, variability and complexity.
Journal Article
Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality
TL;DR: Two algorithms for the approximate nearest neighbor problem in high-dimensional spaces for data sets of size n living in R^d are presented, achieving query times that are sub-linear in n and polynomial in d.
Related Papers (5)
Performance bottleneck of subsequence matching in time-series databases: Observation, solution, and performance evaluation
Sang-Wook Kim, Byeong-Soo Jeong +1 more