Sampling Large Databases for Association Rules

Open Access Proceedings Article

Hannu Toivonen

- pp 134-145

TLDR

Presents new algorithms that considerably reduce database activity: pick a random sample, use it to find all association rules that probably hold in the whole database, and then verify the results against the rest of the database.

Abstract:

Discovery of association rules is an important database mining problem. Current algorithms for finding association rules require several passes over the analyzed database, and the I/O overhead is clearly very significant for very large databases. We present new algorithms that reduce the database activity considerably. The idea is to pick a random sample, use it to find all association rules that probably hold in the whole database, and then verify the results against the rest of the database. The algorithms thus produce exact association rules, not approximations based on a sample. The approach is, however, probabilistic, and in those rare cases where our sampling method does not produce all association rules, the missing rules can be found in a second pass. Our experiments show that the proposed algorithms can find association rules very efficiently in only one database pass.
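The sample-then-verify idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the brute-force level-wise itemset search, and the `lowered_support` parameter (a smaller threshold applied to the sample so that fewer truly frequent sets are missed) are all assumptions made for illustration. The sketch covers only frequent itemsets, from which rules would be derived, and it omits the detection of missed sets and the second pass mentioned in the abstract.

```python
import random
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {itemset: support fraction} for frequent sets."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    result = {}
    while level:
        # Count how many transactions contain each candidate at this level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        frequent = {c: k / n for c, k in counts.items() if k / n >= min_support}
        result.update(frequent)
        # Build next-level candidates whose immediate subsets are all frequent.
        next_level = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == len(a) + 1 and all(
                frozenset(s) in frequent for s in combinations(union, len(a))
            ):
                next_level.add(union)
        level = list(next_level)
    return result


def sample_then_verify(db, min_support, sample_size, lowered_support, seed=0):
    """Mine a random sample, then check the candidates in one pass over db."""
    rng = random.Random(seed)
    sample = rng.sample(db, sample_size)  # sampling without replacement
    candidates = frequent_itemsets(sample, lowered_support)

    # Single pass over the full database: exact counts for every candidate.
    counts = dict.fromkeys(candidates, 0)
    for transaction in db:
        for candidate in counts:
            if candidate <= transaction:
                counts[candidate] += 1

    n = len(db)
    # Keep only candidates that are frequent in the whole database.
    return {c: k / n for c, k in counts.items() if k / n >= min_support}
```

A call such as `sample_then_verify(db, 0.5, 1000, 0.4)` expects `db` to be a list of `frozenset` transactions. The supports it returns are exact for the whole database, which mirrors the abstract's point that the output is not an approximation based on the sample.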

Citations

Book

Data Mining: Concepts and Techniques

Jiawei Han, +2 more

TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

Proceedings Article

Automatic subspace clustering of high dimensional data for data mining applications

Rakesh Agrawal, +3 more

TL;DR: CLIQUE is presented, a clustering algorithm that satisfies the requirements of data mining applications, including the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records.

Proceedings Article

CURE: an efficient clustering algorithm for large databases

Sudipto Guha, +2 more

TL;DR: This work proposes a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size, and demonstrates that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.

Data Mining: Concepts and Techniques (2nd edition)

Jiawei Han, +1 more

TL;DR: There have been many data mining books published in recent years, including Predictive Data Mining by Weiss and Indurkhya [WI98], Data Mining Solutions: Methods and Tools for Solving Real-World Problems by Westphal and Blaxton [WB98], and Mastering Data Mining: The Art and Science of Customer Relationship Management by Berry and Linoff [BL99].

Proceedings Article

Mining high-speed data streams

Pedro Domingos, +1 more

TL;DR: This paper describes and evaluates VFDT, an anytime system that builds decision trees using constant memory and constant time per example, and applies it to mining the continuous stream of Web access data from the whole University of Washington main campus.

References

Proceedings Article

Mining association rules between sets of items in large databases

Rakesh Agrawal, +2 more

TL;DR: An efficient algorithm is presented that generates all significant association rules between items in the database of customer transactions and incorporates buffer management and novel estimation and pruning techniques.

Proceedings Article

Fast Algorithms for Mining Association Rules in Large Databases

Rakesh Agrawal, +1 more

Book

The Probabilistic Method

Joel Spencer

TL;DR: A particular set of problems - all dealing with “good” colorings of an underlying set of points relative to a given family of sets - is explored.

Proceedings Article

Fast discovery of association rules

Rakesh Agrawal, +4 more

Book

Knowledge Discovery in Databases

Gregory Piatetsky-Shapiro, +1 more

TL;DR: Knowledge Discovery in Databases brings together current research on the exciting problem of discovering useful and interesting knowledge in databases, which spans many different approaches to discovery, including inductive learning, Bayesian statistics, semantic query optimization, knowledge acquisition for expert systems, information theory, and fuzzy sets.
