Proceedings Article

Similarity search in sets and categorical data using the signature tree

TLDR
The paper proposes a method that represents set data as bitmaps (signatures) and organizes them into a hierarchical index suitable for similarity search and related query types. The index is robust to different data characteristics, scalable with database size, and efficient for various queries.
Abstract
Data mining applications analyze large collections of set data and high-dimensional categorical data. Search on these data types is not restricted to the classic problems of mining association rules and classification; similarity search is also a frequently applied operation. Access methods for multidimensional numerical data are inappropriate for this problem, and specialized indexes are needed. We propose a method that represents set data as bitmaps (signatures) and organizes them into a hierarchical index, suitable for similarity search and other related query types. In contrast to a previous technique, the signature tree is dynamic and does not rely on hardwired constants. Experiments with synthetic and real datasets show that it is robust to different data characteristics, scalable to the database size, and efficient for various queries.
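The signature idea in the abstract can be illustrated with a minimal sketch. It assumes that each set is drawn from a fixed item domain numbered from 0, that a signature has one bit per item, that an internal index node stores the bitwise OR of the signatures beneath it, and that similarity is measured by Hamming distance. These choices, and every function name below, are illustrative assumptions and are not taken from the paper.

```python
from typing import Iterable, List


def make_signature(items: Iterable[int]) -> int:
    """Encode a set of item ids as a bitmap (bit i is set if item i is present)."""
    sig = 0
    for item in items:
        sig |= 1 << item
    return sig


def hamming(a: int, b: int) -> int:
    """Number of items that appear in exactly one of the two sets."""
    return bin(a ^ b).count("1")


def cover(signatures: List[int]) -> int:
    """Assumed node signature: the bitwise OR of all signatures below the node."""
    out = 0
    for s in signatures:
        out |= s
    return out


def min_hamming_to_subtree(query: int, node_sig: int) -> int:
    """Lower bound on the Hamming distance from the query to any signature
    covered by node_sig: query items missing from the cover are missing
    from every signature under it."""
    return bin(query & ~node_sig).count("1")


a = make_signature([0, 2, 5])
b = make_signature([2, 3])
node = cover([a, b])
q = make_signature([1, 2, 7])
bound = min_hamming_to_subtree(q, node)
```

Under these assumptions, a search can skip any subtree whose bound already exceeds the distance of the best match found so far, which is the usual way a hierarchical index prunes work during similarity search.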


Citations
Journal Article

When is nearest neighbor meaningful

TL;DR: In this article, the authors explore the effect of dimensionality on the nearest neighbor problem and show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point.
Journal Article

The Concentration of Fractional Distances

TL;DR: This paper justifies the use of alternative distances to fight concentration by showing that the concentration is indeed an intrinsic property of the distances and not an artifact of a finite sample, and it gives an estimation of the concentration as a function of the exponent of the distance and of the distribution of the data.
Proceedings Article

A Hybrid Prediction Model for Moving Objects

TL;DR: An object's trajectory patterns which have ad-hoc forms for prediction are discovered and then indexed by a novel access method for efficient query processing, which estimates an object's future locations based on its pattern information as well as existing motion functions using the object's recent movements.
Proceedings Article

Similarity evaluation on tree-structured data

TL;DR: This paper proposes to transform tree-structured data into an approximate numerical multidimensional vector which encodes the original structure information and proves that the L1 distance of the corresponding vectors, whose computational complexity is O(|T1| + |T2|), forms a lower bound for the edit distance between trees.
Proceedings Article

Indexing Uncertain Categorical Data

TL;DR: This paper proposes two index structures for efficiently searching uncertain categorical data, one based on the R-tree and another based on an inverted index structure, and provides a detailed description of the probabilistic equality queries they support.
References
Proceedings Article

R-trees: a dynamic index structure for spatial searching

TL;DR: A dynamic index structure called an R-tree is described which meets this need, and algorithms for searching and updating it are given and it is concluded that it is useful for current database systems in spatial applications.
Book Chapter

When Is "Nearest Neighbor" Meaningful?

TL;DR: The effect of dimensionality on the "nearest neighbor" problem is explored, and it is shown that under a broad set of conditions, as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point.