
Scaling EM (Expectation Maximization) Clustering to Large Databases

TL;DR: A scalable implementation of the Expectation-Maximization (EM) algorithm is presented, which constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data.
Abstract
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to the data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, these scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that are compressible and regions that must be maintained in memory. The approach operates within the confines of a limited main memory buffer and requires at most a single database scan. Data resolution is preserved to the extent possible based upon the size of the main memory buffer and the fit of the current clustering model to the data. We extend the method to efficiently update multiple models simultaneously. Computational tests indicate that this scalable scheme outperforms sampling-based approaches, the straightforward alternatives to “scaling” traditional in-memory implementations to large databases.

1. Preliminaries and Motivation

Data clustering is important in many fields, including data mining [FPSU96], statistical data analysis [KR89, BR93], compression [ZRL97], and vector quantization [DH73]. Applications include data analysis and modeling [FDW97, FHS96], image segmentation, marketing, fraud detection, predictive modeling, data summarization, general data reporting tasks, data cleaning, and exploratory data analysis [B*96]. Clustering is a crucial data mining step, and performing this task over large databases is essential.

A general view of clustering places it in the framework of density estimation [S86, S92, A73]: clustering can be viewed as identifying the dense regions of the data source. An efficient representation of the probability density function is the mixture model, which asserts that the data is a combination of k individual component densities, corresponding to the k clusters. Basically, the problem is this: given data records (observations), identify a set of k populations in the data, and provide a model (density distribution) of each of the populations. Since the model assumes a mixture of populations, it is often referred to as a mixture model.

The Expectation-Maximization (EM) algorithm [DLR77, CS96] is an effective and popular technique for estimating the mixture model parameters, i.e., fitting the model to the database. The EM algorithm iteratively refines an initial cluster model to better fit the data and terminates at a solution which is either locally optimal or a saddle point of the underlying clustering criterion [DLR77, B95]. The objective function is the log-likelihood of the data given the model, measuring how well the probabilistic model fits the data.
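For concreteness, here is a minimal sketch of one flavor of the classical in-memory EM iteration, a mixture of diagonal-covariance Gaussians in Python/NumPy. This is background for the discussion, not the paper's scalable method: it keeps every record in memory and rescans all of them on each iteration. The function name em_gmm, the diagonal-covariance restriction, and all parameter defaults are our own illustrative assumptions.

```python
import numpy as np

def em_gmm(X, k, n_iter=100, seed=0):
    """Classical in-memory EM for a mixture of k diagonal-covariance Gaussians.

    The objective being (locally) maximized is the data log-likelihood:
        L = sum_i log( sum_j w_j * N(x_i | mu_j, var_j) )
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, k, replace=False)]         # initialize at k random records
    variances = np.tile(X.var(axis=0), (k, 1)) + 1e-6  # start from the global variance
    weights = np.full(k, 1.0 / k)                      # uniform mixture weights

    for _ in range(n_iter):
        # E-step: per-record, per-cluster log density of a diagonal Gaussian,
        # plus the log mixture weight.
        log_p = -0.5 * (((X[:, None, :] - means) ** 2 / variances).sum(-1)
                        + np.log(2 * np.pi * variances).sum(-1)) + np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)      # stabilize before exponentiating
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)        # soft memberships: rows sum to 1

        # M-step: re-estimate parameters from the fractionally weighted records.
        nk = resp.sum(axis=0)                          # effective cluster sizes
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ X**2) / nk[:, None] - means**2 + 1e-6

    return weights, means, variances, resp
```

The E-step produces soft memberships, so each record belongs fractionally to every cluster; this is precisely the contrast with the hard assignments of K-means discussed next.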
Other similar iterative refinement clustering methods include the popular K-means-type algorithms [M67, DH73, F90, BMS97, SI84]. While these approaches have received attention in the database and data mining literature [NH94, ZRL97, BFR98], they are limited in their ability to compute correct statistical models of the data. The K-means algorithm minimizes the sum of squared Euclidean distances between the data records in a cluster and the cluster’s mean vector. This assignment criterion implicitly assumes that clusters are represented by spherical Gaussian distributions located at the k cluster means [BB95, B95]. Since the K-means algorithm uses the Euclidean metric, it does not generalize to the problem of clustering discrete or categorical data. The K-means algorithm also uses a membership function which assigns each data record to exactly one cluster. This hard assignment criterion does not allow for uncertainty in the membership of a data record in a cluster.

The mixture model framework relaxes these assumptions. Due to the probabilistic nature of the mixture model, arbitrarily shaped (i.e., non-spherical) clusters can be effectively represented by the choice of suitable component density functions (e.g., Poisson, non-spherical Gaussians, etc.). Categorical or discrete data is similarly handled by associating a discrete distribution (e.g., multinomial, binomial, etc.) with these attributes.

Consider a simple example with data consisting of two attributes: age and income. One may choose to model the data as a single cluster and report that the average age over the data records is 41 years and the average income is $26K/year (with associated variances). However, this may be rather deceptive and uninformative. The data may be a mixture of working people, retired people, and children. A more informative summary would identify these subsets, or clusters, and report the cluster parameters. Such results are shown in Table 1.1:

Table 1.1: Sample data summary by segment

  "name" (not given)   Size   Average Age   Average Income
  "working"            45%    38            $45K
  "retired"            30%    72            $20K
  "children"           20%    12            $0
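To make this example concrete, the hypothetical snippet below reuses the em_gmm sketch above on synthetic age/income data drawn to roughly match Table 1.1; the sample sizes, means, and spreads are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three synthetic populations: columns are (age, income in $K/year).
working  = np.column_stack([rng.normal(38, 6, 450), rng.normal(45, 10, 450)])
retired  = np.column_stack([rng.normal(72, 5, 300), rng.normal(20, 5, 300)])
children = np.column_stack([rng.normal(12, 3, 200), rng.normal(0, 0.5, 200)])
X = np.vstack([working, retired, children])

weights, means, variances, resp = em_gmm(X, k=3)
for w, m in sorted(zip(weights, means), key=lambda t: -t[0]):
    print(f"size {w:5.1%}   avg age {m[0]:4.1f}   avg income ${m[1]:.0f}K")
```

A single-cluster model of the same data would report only the global averages, which is exactly the uninformative summary the example above warns about.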

Citations
Journal Article

Model-Based Clustering, Discriminant Analysis, and Density Estimation

TL;DR: This work reviews a general methodology for model-based clustering that provides a principled statistical approach to important practical questions that arise in cluster analysis, such as how many clusters there are, which clustering method should be used, and how outliers should be handled.
Book Chapter

Multivariate Density Estimation

TL;DR: Exploring and identifying structure is even more important for multivariate data than for univariate data, given the difficulties in graphically presenting multivariate data and the comparative lack of parametric models to represent it.
Journal Article

Validity index for crisp and fuzzy clusters

TL;DR: A cluster validity index and its fuzzification are described, which can provide a measure of goodness of clustering on different partitions of a data set; results demonstrating the superiority of the PBM-index in appropriately determining the number of clusters are also provided.
Book Chapter

Knowledge Discovery in Databases: An Overview

TL;DR: In this paper, the authors define the basic notions in data mining and KDD, define the goals, present motivation, and give a high-level definition of the KDD Process and how it relates to Data Mining.
Journal Article

Clustering: A neural network approach

TL;DR: A comprehensive overview of competitive learning based clustering methods is given, and two examples demonstrate the use of these clustering methods.
References

Some methods for classification and analysis of multivariate observations

TL;DR: The k-means algorithm partitions an N-dimensional population into k sets on the basis of a sample; the procedure, a generalization of the ordinary sample mean, is shown to give partitions that are reasonably efficient in the sense of within-class variance.
Book

Neural networks for pattern recognition

TL;DR: This is the first comprehensive treatment of feed-forward neural networks from the perspective of statistical pattern recognition, and is designed as a text, with over 100 exercises, to benefit anyone involved in the fields of neural computation and pattern recognition.
Book

Density estimation for statistics and data analysis

TL;DR: The Kernel Method for Multivariate Data: Three Important Methods and Density Estimation in Action.