The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving, and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining streams, mining social networks, and mining spatial, multimedia, and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges.
* Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects.
* Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields.
* Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

/pdf/data-mining-concepts-and-techniques-4dtvdfkvmi.pdf

Data Mining: Concepts and Techniques

Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data and is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty in designing a general purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering.

Data clustering: 50 years beyond K-means

Data analysis plays an indispensable role in understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several closely related topics, such as proximity measures and cluster validation, are also discussed.

/pdf/survey-of-clustering-algorithms-6zezqhbjyo.pdf

Survey of clustering algorithms

In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.

/pdf/an-efficient-k-means-clustering-algorithm-analysis-and-1vgn3zr7cg.pdf

An efficient k-means clustering algorithm: analysis and implementation
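
As a rough illustration of the objective in the abstract above, the following is a minimal sketch of the plain Lloyd iteration: assign each point to its nearest center, then move each center to the mean of its points. It is not the paper's filtering algorithm and uses no kd-tree. The function names (`lloyd_kmeans`, `mean_squared_distance`), the random choice of initial centers, and the sample points are assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd iteration (illustrative; no kd-tree filtering)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k random data points
    assignment = None
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centers are stable
        assignment = new_assignment
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


def mean_squared_distance(points, centers):
    """The k-means objective: mean squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points) / len(points)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = lloyd_kmeans(points, k=2)
print(centers, assignment, mean_squared_distance(points, centers))
```

Each pass of this sketch compares every point with every center, which is the cost the filtering algorithm is designed to reduce.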

Recommendation algorithms are best known for their use on e-commerce Web sites, where they use input about a customer's interests to generate a list of recommended items. Many applications use only the items that customers purchase and explicitly rate to represent their interests, but they can also use other attributes, including items viewed, demographic data, subject interests, and favorite artists. At Amazon.com, we use recommendation algorithms to personalize the online store for each customer. The store radically changes based on customer interests, showing programming titles to a software engineer and baby toys to a new mother. There are three common approaches to solving the recommendation problem: traditional collaborative filtering, cluster models, and search-based methods. Here, we compare these methods with our algorithm, which we call item-to-item collaborative filtering. Unlike traditional collaborative filtering, our algorithm's online computation scales independently of the number of customers and number of items in the product catalog. Our algorithm produces recommendations in real-time, scales to massive data sets, and generates high quality recommendations.

Industry Report: Amazon.com Recommendations: Item-to-Item Collaborative Filtering.
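
The abstract above does not give the details of the algorithm, so the following is only a hedged sketch of the general item-to-item idea: precompute a table of similar items offline, then build recommendations online by looking up the items a customer already has. The cosine similarity over sets of purchasers, the function names (`build_similar_items`, `recommend`), and the sample data are all assumptions for illustration, not the method as deployed at Amazon.com.

```python
from collections import defaultdict
from math import sqrt


def build_similar_items(purchases):
    """Offline step: map each item to {other item: cosine similarity}.

    purchases maps a customer id to the set of items that customer bought.
    Only pairs bought together by at least one customer are scored.
    """
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)

    similar = defaultdict(dict)
    for items in purchases.values():
        for a in items:
            for b in items:
                if a != b and b not in similar[a]:
                    common = len(buyers[a] & buyers[b])
                    similar[a][b] = common / sqrt(len(buyers[a]) * len(buyers[b]))
    return similar


def recommend(similar, owned, top_n=3):
    """Online step: score unseen items by summed similarity to owned items."""
    scores = defaultdict(float)
    for item in owned:
        for other, sim in similar.get(item, {}).items():
            if other not in owned:
                scores[other] += sim
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


purchases = {
    "ann": {"book", "lamp"},
    "bob": {"book", "lamp", "pen"},
    "cat": {"pen", "ink"},
}
similar = build_similar_items(purchases)
print(recommend(similar, {"book"}))
```

In this sketch the online step touches only the similarity entries of the items a customer owns, which is one way to read the abstract's claim that online computation does not grow with the number of customers or catalog items.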

Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering algorithms. We require at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. The algorithm operates within the confines of a limited memory buffer. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory buffer and the fit of the current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empirically evaluate on synthetic and publicly available data sets.

/pdf/scaling-clustering-algorithms-to-large-databases-djuqveosy0.pdf

Scaling clustering algorithms to large databases

Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that are compressible and regions that must be maintained in memory. The approach operates within the confines of a limited main memory buffer and requires at most a single database scan. Data resolution is preserved to the extent possible based upon the size of the main memory buffer and the fit of the current clustering model to the data. We extend the method to efficiently update multiple models simultaneously. Computational tests indicate that this scalable scheme outperforms sampling-based approaches, the straightforward alternatives to “scaling” traditional in-memory implementations to large databases.

1 Preliminaries and Motivation

Data clustering is important in many fields, including data mining [FPSU96], statistical data analysis [KR89,BR93], compression [ZRL97], and vector quantization [DH73]. Applications include data analysis and modeling [FDW97,FHS96], image segmentation, marketing, fraud detection, predictive modeling, data summarization, general data reporting tasks, data cleaning, and exploratory data analysis [B*96]. Clustering is a crucial data mining step, and performing this task over large databases is essential.

Scaling EM Clustering to Large Databases. Bradley, Fayyad, and Reina.

A general view of clustering places it in the framework of density estimation [S86, S92, A73]. Clustering can be viewed as identifying the dense regions of the data source. An efficient representation of the probability density function is the mixture model, which asserts that the data is a combination of k individual component densities, corresponding to the k clusters. Basically, the problem is this: given data records (observations), identify a set of k populations in the data, and provide a model (density distribution) of each of the populations. Since the model assumes a mixture of populations, it is often referred to as a mixture model. The Expectation-Maximization (EM) algorithm [DLR77, CS96] is an effective and popular technique for estimating the mixture model parameters or fitting the model to the database. The EM algorithm iteratively refines an initial cluster model to better fit the data and terminates at a solution which is locally optimal or a saddle point of the underlying clustering criterion [DLR77, B95]. The objective function is the log-likelihood of the data given the model, which measures how well the probabilistic model fits the data. Other similar iterative refinement clustering methods include the popular K-Means-type algorithms [M67,DH73,F90,BMS97,SI84]. While these approaches have received attention in the database and data mining literature [NH94,ZRL97,BFR98], they are limited in their ability to compute correct statistical models of the data. The K-Means algorithm minimizes the sum of squared Euclidean distances between data records in a cluster and the cluster’s mean vector. This assignment criterion implicitly assumes that clusters are represented by spherical Gaussian distributions located at the k cluster means [BB95, B95]. Since the K-Means algorithm utilizes the Euclidean metric, it does not generalize to the problem of clustering discrete or categorical data. The K-Means algorithm also uses a membership function which assigns each data record to exactly one cluster. This harsh criterion does not allow for uncertainty in the membership of a data record in a cluster.

The mixture model framework relaxes these assumptions. Due to the probabilistic nature of the mixture model, arbitrarily shaped clusters (i.e., non-spherical, etc.) can be effectively represented by the choice of suitable component density functions (e.g., Poisson, non-spherical Gaussians, etc.). Categorical or discrete data is similarly handled by associating discrete data distributions over these attributes (e.g., Multinomial, Binomial, etc.). Consider a simple example with data consisting of 2 attributes: age and income. One may choose to model the data as a single cluster and report that the average age over the data records is 41 years and the average income is $26K/year (with associated variances). However, this may be rather deceptive and uninformative. The data may be a mixture of working people, retired people, and children. A more informative summary might identify these subsets or clusters, and report the cluster parameters. Such results are shown in Table 1.1:

Table 1.1: Sample data summary by segment

“name” (not given)   Size   Average Age   Average Income
“working”            45%    38            $45K
“retired”            30%    72            $20K
“children”           20%    12            $0

/pdf/scaling-em-expectation-maximization-clustering-to-large-1tdkbb532l.pdf

Scaling EM (Expectation Maximization) Clustering to Large Databases
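
To make the contrast between K-Means' hard assignment and the mixture model's soft membership concrete, the following is a minimal in-memory sketch of EM for a one-dimensional Gaussian mixture. It does not implement the paper's scalable, single-scan method. The function names (`gaussian_pdf`, `em_gmm_1d`), the shared initial variance, the variance floor `min_var`, and the sample ages are assumptions made for this example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, initial_means, iterations=50, min_var=1e-6):
    """Plain in-memory EM for a 1-D Gaussian mixture (illustrative only)."""
    k = len(initial_means)
    n = len(data)
    means = list(initial_means)
    # Assumption: every component starts with the overall data variance.
    overall = sum(data) / n
    variances = [max(sum((x - overall) ** 2 for x in data) / n, min_var)] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iterations):
        # E-step: soft membership of each record in each component.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances from the memberships.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                min_var,
            )
    return weights, means, variances, resp


ages = [10.0, 11.0, 12.0, 13.0, 70.0, 71.0, 72.0, 73.0]
weights, means, variances, resp = em_gmm_1d(ages, initial_means=[20.0, 60.0])
print(weights, means, variances)
```

Each row of `resp` sums to one, so a record can belong partly to several components, whereas K-Means would assign it wholly to the nearest mean.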

Iterative refinement clustering algorithms (e.g. K-Means, EM) converge to one of numerous local minima. It is known that they are especially sensitive to initial conditions. We present a procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution. The refined initial starting condition leads to convergence to "better" local minima. The procedure is applicable to a wide class of clustering algorithms for both discrete and continuous data. We demonstrate the application of this method to the Expectation Maximization (EM) clustering algorithm and show that refined initial points indeed lead to improved solutions. Refinement run time is considerably lower than the time required to cluster the full database. The method is scalable and can be coupled with a scalable clustering algorithm to address the large-scale clustering in data mining.

/pdf/initialization-of-iterative-refinement-clustering-algorithms-1jd12ozecv.pdf

Initialization of iterative refinement clustering algorithms

A data mining system for use in finding clusters of data items in a database or any other data storage medium. The clusters are used in categorizing the data in the database into K different clusters within each of M models. An initial set of estimates (or guesses) of the parameters of each cluster in each model to be explored (e.g. centroids in K-means) is provided from some source. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original guesses at the parameters of the model in each of the K clusters over all M models. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. If further data is needed to characterize the clusters, more data is gathered from the database and used in combination with already compressed data until the stopping criteria have been met.

Scalable system for clustering of large databases
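
The buffer-and-summarize loop described above can be sketched, in a much-simplified form, as follows. The sketch assumes a single K-means model (M = 1), compresses every buffered record into per-cluster sufficient statistics (a count and a coordinate sum) as soon as it is assigned, and stops once no centroid moves by more than a tolerance. The function name `buffered_kmeans`, the `tol` parameter, and this particular stopping rule are assumptions for illustration; the system described above also decides which records to keep uncompressed, which is omitted here.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def buffered_kmeans(chunks, initial_centers, tol=1e-3):
    """Single pass over buffered chunks, keeping only sufficient statistics."""
    k = len(initial_centers)
    dim = len(initial_centers[0])
    centers = [tuple(c) for c in initial_centers]
    counts = [0] * k                          # records summarized per cluster
    sums = [[0.0] * dim for _ in range(k)]    # coordinate sums per cluster
    for chunk in chunks:                      # each chunk is one buffer load
        for point in chunk:
            j = min(range(k), key=lambda c: squared_distance(point, centers[c]))
            counts[j] += 1
            for i, value in enumerate(point):
                sums[j][i] += value
        # Updated centroids come from the summaries, not from the raw records.
        new_centers = [
            tuple(s / counts[j] for s in sums[j]) if counts[j] else centers[j]
            for j in range(k)
        ]
        shift = max(squared_distance(a, b) for a, b in zip(centers, new_centers)) ** 0.5
        centers = new_centers
        if shift <= tol:                      # assumed stopping criterion
            break
    return centers, counts


chunks = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(0.0, 2.0), (10.0, 12.0)],
    [(2.0, 0.0), (12.0, 10.0)],
]
centers, counts = buffered_kmeans(chunks, initial_centers=[(1.0, 1.0), (9.0, 9.0)])
print(centers, counts)
```

Because `chunks` can be any iterable, including a generator that reads from storage, memory use in this sketch is bounded by one buffer plus the K summaries.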

In one exemplary embodiment the invention provides a data mining system for use in evaluating data in a database. Before the data evaluation begins, a choice is made of a cluster number K for use in categorizing the data in the database into K different clusters, and initial guesses at the means, or centroids, of each cluster are provided. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory. Data contained in the data portion is used to update the original guesses at the centroids of each of the K clusters. Some of the data belonging to a cluster is summarized or compressed and stored as a summarization of the data. More data is accessed from the database and assigned to a cluster. An updated mean for the clusters is determined from the summarized data and the newly acquired data. A stopping criterion is evaluated to determine if further data should be accessed from the database. If further data is needed to characterize the clusters, more data is gathered from the database and used in combination with already compressed data until the stopping criterion has been met.

Cory Reina

Papers

Scaling clustering algorithms to large databases

Scaling EM (Expectation Maximization) Clustering to Large Databases

Initialization of iterative refinement clustering algorithms

Scalable system for clustering of large databases

Scalable system for K-means clustering of large databases