
Scaling EM (Expectation Maximization) Clustering to Large Databases

TL;DR: A scalable implementation of the Expectation-Maximization (EM) algorithm is presented, which constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data.
Abstract
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to the data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, these scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that are compressible and regions that must be maintained in memory. The approach operates within the confines of a limited main memory buffer and requires at most a single database scan. Data resolution is preserved to the extent possible based upon the size of the main memory buffer and the fit of the current clustering model to the data. We extend the method to efficiently update multiple models simultaneously. Computational tests indicate that this scalable scheme outperforms sampling-based approaches, the straightforward alternatives to “scaling” traditional in-memory implementations to large databases.

1. Preliminaries and Motivation

Data clustering is important in many fields, including data mining [FPSU96], statistical data analysis [KR89, BR93], compression [ZRL97], and vector quantization [DH73]. Applications include data analysis and modeling [FDW97, FHS96], image segmentation, marketing, fraud detection, predictive modeling, data summarization, general data reporting tasks, data cleaning, and exploratory data analysis [B*96]. Clustering is a crucial data mining step, and performing this task over large databases is essential.

A general view of clustering places it in the framework of density estimation [S86, S92, A73]: clustering can be viewed as identifying the dense regions of the data source. An efficient representation of the probability density function is the mixture model, which asserts that the data is a combination of k individual component densities, corresponding to the k clusters. Basically, the problem is this: given data records (observations), identify a set of k populations in the data, and provide a model (density distribution) of each of the populations. Since the model assumes a mixture of populations, it is often referred to as a mixture model.

The Expectation-Maximization (EM) algorithm [DLR77, CS96] is an effective and popular technique for estimating the mixture model parameters, i.e., fitting the model to the database. The EM algorithm iteratively refines an initial cluster model to better fit the data and terminates at a solution which is either locally optimal or a saddle point of the underlying clustering criterion [DLR77, B95]. The objective function is the log-likelihood of the data given the model, measuring how well the probabilistic model fits the data.
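For concreteness, here is a minimal sketch of one flavor of the classical in-memory EM iteration, a mixture of diagonal-covariance Gaussians in Python/NumPy. This is background for the discussion, not the paper's scalable method: it keeps every record in memory and rescans all of them on each iteration. The function name em_gmm, the diagonal-covariance restriction, and all parameter defaults are our own illustrative assumptions.

```python
import numpy as np

def em_gmm(X, k, n_iter=100, seed=0):
    """Classical in-memory EM for a mixture of k diagonal-covariance Gaussians.

    The objective being (locally) maximized is the data log-likelihood:
        L = sum_i log( sum_j w_j * N(x_i | mu_j, var_j) )
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, k, replace=False)]         # initialize at k random records
    variances = np.tile(X.var(axis=0), (k, 1)) + 1e-6  # start from the global variance
    weights = np.full(k, 1.0 / k)                      # uniform mixture weights

    for _ in range(n_iter):
        # E-step: per-record, per-cluster log density of a diagonal Gaussian,
        # plus the log mixture weight.
        log_p = -0.5 * (((X[:, None, :] - means) ** 2 / variances).sum(-1)
                        + np.log(2 * np.pi * variances).sum(-1)) + np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)      # stabilize before exponentiating
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)        # soft memberships: rows sum to 1

        # M-step: re-estimate parameters from the fractionally weighted records.
        nk = resp.sum(axis=0)                          # effective cluster sizes
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ X**2) / nk[:, None] - means**2 + 1e-6

    return weights, means, variances, resp
```

The E-step produces soft memberships, so each record belongs fractionally to every cluster; this is precisely the contrast with the hard assignments of K-means discussed next.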
Other similar iterative refinement clustering methods include the popular K-means-type algorithms [M67, DH73, F90, BMS97, SI84]. While these approaches have received attention in the database and data mining literature [NH94, ZRL97, BFR98], they are limited in their ability to compute correct statistical models of the data. The K-means algorithm minimizes the sum of squared Euclidean distances between the data records in a cluster and the cluster’s mean vector. This assignment criterion implicitly assumes that clusters are represented by spherical Gaussian distributions located at the k cluster means [BB95, B95]. Since the K-means algorithm uses the Euclidean metric, it does not generalize to the problem of clustering discrete or categorical data. The K-means algorithm also uses a membership function which assigns each data record to exactly one cluster. This hard assignment criterion does not allow for uncertainty in the membership of a data record in a cluster.

The mixture model framework relaxes these assumptions. Due to the probabilistic nature of the mixture model, arbitrarily shaped (i.e., non-spherical) clusters can be effectively represented by the choice of suitable component density functions (e.g., Poisson, non-spherical Gaussians, etc.). Categorical or discrete data is similarly handled by associating a discrete distribution (e.g., multinomial, binomial, etc.) with these attributes.

Consider a simple example with data consisting of two attributes: age and income. One may choose to model the data as a single cluster and report that the average age over the data records is 41 years and the average income is $26K/year (with associated variances). However, this may be rather deceptive and uninformative. The data may be a mixture of working people, retired people, and children. A more informative summary would identify these subsets, or clusters, and report the cluster parameters. Such results are shown in Table 1.1:

Table 1.1: Sample data summary by segment

  "name" (not given)   Size   Average Age   Average Income
  "working"            45%    38            $45K
  "retired"            30%    72            $20K
  "children"           20%    12            $0
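To make this example concrete, the hypothetical snippet below reuses the em_gmm sketch above on synthetic age/income data drawn to roughly match Table 1.1; the sample sizes, means, and spreads are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three synthetic populations: columns are (age, income in $K/year).
working  = np.column_stack([rng.normal(38, 6, 450), rng.normal(45, 10, 450)])
retired  = np.column_stack([rng.normal(72, 5, 300), rng.normal(20, 5, 300)])
children = np.column_stack([rng.normal(12, 3, 200), rng.normal(0, 0.5, 200)])
X = np.vstack([working, retired, children])

weights, means, variances, resp = em_gmm(X, k=3)
for w, m in sorted(zip(weights, means), key=lambda t: -t[0]):
    print(f"size {w:5.1%}   avg age {m[0]:4.1f}   avg income ${m[1]:.0f}K")
```

A single-cluster model of the same data would report only the global averages, which is exactly the uninformative summary the example above warns about.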

Citations
Journal Article

Model-Based Clustering, Discriminant Analysis, and Density Estimation

TL;DR: This work reviews a general methodology for model-based clustering that provides a principled statistical approach to important practical questions that arise in cluster analysis, such as how many clusters there are, which clustering method should be used, and how outliers should be handled.
Book Chapter

Multivariate Density Estimation

TL;DR: Exploring and identifying structure is even more important for multivariate data than for univariate data, given the difficulties in graphically presenting multivariate data and the comparative lack of parametric models to represent it.
Journal Article

Validity index for crisp and fuzzy clusters

TL;DR: A cluster validity index and its fuzzification are described, which can provide a measure of goodness of clustering on different partitions of a data set; results demonstrating the superiority of the PBM-index in appropriately determining the number of clusters are also provided.
Book Chapter

Knowledge Discovery in Databases: An Overview

TL;DR: In this paper, the authors define the basic notions in data mining and KDD, define the goals, present motivation, and give a high-level definition of the KDD Process and how it relates to Data Mining.
Journal Article

Clustering: A neural network approach

TL;DR: A comprehensive overview of competitive learning based clustering methods is given, and two examples demonstrate the use of these clustering methods.
References

Some methods for classification and analysis of multivariate observations

TL;DR: The k-means algorithm partitions an N-dimensional population into k sets on the basis of a sample; the procedure, a generalization of the ordinary sample mean, is shown to give partitions that are reasonably efficient in the sense of within-class variance.
Book

Neural networks for pattern recognition

TL;DR: This is the first comprehensive treatment of feed-forward neural networks from the perspective of statistical pattern recognition, and is designed as a text, with over 100 exercises, to benefit anyone involved in the fields of neural computation and pattern recognition.
Book

Density estimation for statistics and data analysis

TL;DR: The Kernel Method for Multivariate Data: Three Important Methods and Density Estimation in Action.