Scaling clustering algorithms to large databases

Open Access Proceedings Article

Paul S. Bradley, +2 more

- pp 9-15

TLDR

A scalable clustering framework, applicable to a wide class of iterative clustering algorithms, that requires at most one scan of the database. It is instantiated and numerically justified with the popular K-Means clustering algorithm.

Abstract:

Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework, applicable to a wide class of iterative clustering algorithms, that requires at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. The algorithm operates within the confines of a limited memory buffer. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory buffer and the fit of the current clustering model to the data. The framework extends naturally to updating multiple clustering models simultaneously. We empirically evaluate the framework on synthetic and publicly available data sets.
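
To make the single-scan idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that points summarized out of memory can be represented per cluster by a count and a vector sum, which is enough to recompute a K-Means centroid. The names `single_scan_kmeans`, `refine_centers`, and `assign` and the threshold `compress_radius` are hypothetical. The abstract does not state the criteria for deciding which regions are compressible or discardable, so the sketch substitutes a plain distance-to-centroid test and merges the "compress" and "discard" cases into one summary step. It also does not cap the number of retained points, which a real limited-buffer implementation would need to do.

```python
import numpy as np


def assign(points, centers):
    """Index of the nearest center (squared Euclidean distance) per point."""
    diffs = points[:, None, :] - centers[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def refine_centers(raw, centers, n, s, max_iters=20):
    """K-Means iterations over in-memory points plus per-cluster summaries.

    n[j] and s[j] are the count and vector sum of points already folded into
    cluster j; they contribute to the mean without being stored individually.
    """
    for _ in range(max_iters):
        labels = assign(raw, centers)
        total_n, total_s = n.copy(), s.copy()
        np.add.at(total_n, labels, 1.0)   # add in-memory points to the counts
        np.add.at(total_s, labels, raw)   # and to the vector sums
        updated = centers.copy()
        nonempty = total_n > 0
        updated[nonempty] = total_s[nonempty] / total_n[nonempty, None]
        converged = np.allclose(updated, centers)
        centers = updated
        if converged:
            break
    return centers, assign(raw, centers)


def single_scan_kmeans(chunks, k, compress_radius, seed=0):
    """Cluster data that is read exactly once, one buffer-sized chunk at a time.

    compress_radius is an assumed, illustrative threshold: points closer than
    this to their centroid are summarized and dropped from memory.
    """
    rng = np.random.default_rng(seed)
    centers = n = s = retained = None
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=float)
        if centers is None:
            dim = chunk.shape[1]
            # Initialize from k distinct points of the first chunk.
            centers = chunk[rng.choice(len(chunk), size=k, replace=False)]
            n, s = np.zeros(k), np.zeros((k, dim))
            retained = np.empty((0, dim))
        # The working buffer holds retained points plus the new chunk.
        raw = np.vstack([retained, chunk])
        centers, labels = refine_centers(raw, centers, n, s)
        # Fold points near their centroid into the summaries; keep the rest.
        dist = np.linalg.norm(raw - centers[labels], axis=1)
        close = dist <= compress_radius
        np.add.at(n, labels[close], 1.0)
        np.add.at(s, labels[close], raw[close])
        retained = raw[~close]
    return centers, n, retained
```

A call such as `single_scan_kmeans(np.array_split(data, 10), k=2, compress_radius=2.0)` reads each chunk once and returns the centroids, the per-cluster counts of summarized points, and the points still held at full resolution. Every input point ends up either counted in a summary or present in the retained set.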
