Showing papers on "Fuzzy clustering" published in 2009

Journal Article•DOI•

Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering

Hans-Peter Kriegel¹, Peer Kröger¹, Arthur Zimek¹•Institutions (1)

23 Mar 2009-ACM Transactions on Knowledge Discovery From Data

TL;DR: This survey tries to clarify the different problem definitions related to subspace clustering in general; the specific difficulties encountered in this field of research; the varying assumptions, heuristics, and intuitions forming the basis of different approaches; and how several prominent solutions tackle different problems.

Abstract: As a prolific research area in data mining, subspace clustering and related problems induced a vast quantity of proposed solutions. However, many publications compare a new proposition—if at all—with one or two competitors, or even with a so-called “naive” ad hoc solution, but fail to clarify the exact problem definition. As a consequence, even if two solutions are thoroughly compared experimentally, it will often remain unclear whether both solutions tackle the same problem or, if they do, whether they agree in certain tacit assumptions and how such assumptions may influence the outcome of an algorithm. In this survey, we try to clarify: (i) the different problem definitions related to subspace clustering in general; (ii) the specific difficulties encountered in this field of research; (iii) the varying assumptions, heuristics, and intuitions forming the basis of different approaches; and (iv) how several prominent solutions tackle different problems.

1,206 citations

Book•

Introduction to Multivariate Statistical Analysis in Chemometrics

Kurt Varmuza, Peter Filzmoser¹•Institutions (1)

Vienna University of Technology¹

17 Feb 2009

TL;DR: Introduction Chemoinformatics-Chemometrics-Statistics This Book Historical Remarks about Chemometrics Bibliography Starting Examples Univariate Statistics-A Reminder Multivariate Data Definitions Basic Preprocessing Covariance and Correlation Distances and Similarities Multivariate Outlier Identification Linear Latent Variables

Abstract: Introduction Chemoinformatics-Chemometrics-Statistics This Book Historical Remarks about Chemometrics Bibliography Starting Examples Univariate Statistics-A Reminder Multivariate Data Definitions Basic Preprocessing Covariance and Correlation Distances and Similarities Multivariate Outlier Identification Linear Latent Variables Summary Principal Component Analysis (PCA) Concepts Number of PCA Components Centering and Scaling Outliers and Data Distribution Robust PCA Algorithms for PCA Evaluation and Diagnostics Complementary Methods for Exploratory Data Analysis Examples Summary Calibration Concepts Performance of Regression Models Ordinary Least Squares Regression Robust Regression Variable Selection Principal Component Regression Partial Least Squares Regression Related Methods Examples Summary Classification Concepts Linear Classification Methods Kernel and Prototype Methods Classification Trees Artificial Neural Networks Support Vector Machine Evaluation Examples Summary Cluster Analysis Concepts Distance and Similarity Measures Partitioning Methods Hierarchical Clustering Methods Fuzzy Clustering Model-Based Clustering Cluster Validity and Clustering Tendency Measures Examples Summary Preprocessing Concepts Smoothing and Differentiation Multiplicative Signal Correction Mass Spectral Features Appendix 1: Symbols and Abbreviations Appendix 2: Matrix Algebra Appendix 3: Introduction to R Index References appear at the end of each chapter

1,003 citations

Proceedings Article•DOI•

Multi-view clustering via canonical correlation analysis

Kamalika Chaudhuri¹, Sham M. Kakade², Karen Livescu², Karthik Sridharan²•Institutions (2)

University of California, San Diego¹, Toyota Technological Institute at Chicago²

14 Jun 2009

TL;DR: Under the assumption that the views are uncorrelated given the cluster label, it is shown that the separation conditions required for the algorithm to be successful are significantly weaker than prior results in the literature.

Abstract: Clustering data in high dimensions is believed to be a hard problem in general. A number of efficient clustering algorithms developed in recent years address this problem by projecting the data into a lower-dimensional subspace, e.g. via Principal Components Analysis (PCA) or random projections, before clustering. Here, we consider constructing such projections using multiple views of the data, via Canonical Correlation Analysis (CCA). Under the assumption that the views are uncorrelated given the cluster label, we show that the separation conditions required for the algorithm to be successful are significantly weaker than prior results in the literature. We provide results for mixtures of Gaussians and mixtures of log concave distributions. We also provide empirical support from audio-visual speaker clustering (where we desire the clusters to correspond to speaker ID) and from hierarchical Wikipedia document clustering (where one view is the words in the document and the other is the link structure).

765 citations

Book Chapter•DOI•

Parallel K-Means Clustering Based on MapReduce

Weizhong Zhao¹, Huifang Ma¹, Qing He¹•Institutions (1)

Chinese Academy of Sciences¹

22 Nov 2009

TL;DR: This paper proposes a parallel k-means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique, and demonstrates that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.

Abstract: Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volume of information produced by technological progress makes clustering very large-scale data a challenging task. To deal with this problem, many researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel k-means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.
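The abstract does not spell out how the map and reduce steps are divided, so the following is only a minimal single-process sketch of one common way to phrase k-means in MapReduce terms. The function names (`map_phase`, `reduce_phase`, `kmeans_mapreduce`) are hypothetical, and no real MapReduce framework is used: the map step assigns each point to its nearest centroid, and the reduce step averages the points that share a key.

```python
from collections import defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, (point, 1)) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        yield nearest, (p, 1)


def reduce_phase(pairs, centroids):
    """Reduce: sum points and counts per key, then average into new centroids."""
    sums = {}
    counts = defaultdict(int)
    for key, (p, n) in pairs:
        if key in sums:
            sums[key] = [s + x for s, x in zip(sums[key], p)]
        else:
            sums[key] = list(p)
        counts[key] += n
    # Assumption: a centroid that received no points keeps its old position.
    return [
        tuple(s / counts[j] for s in sums[j]) if j in sums else centroids[j]
        for j in range(len(centroids))
    ]


def kmeans_mapreduce(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds (no convergence check)."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids
```

In a real cluster the map step would run on separate partitions of the data and only the per-key partial sums and counts would be sent to the reducers; that detail is omitted here.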

626 citations

Proceedings Article•DOI•

Ranking-based clustering of heterogeneous information networks with star network schema

Yizhou Sun¹, Yintao Yu¹, Jiawei Han¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

28 Jun 2009

TL;DR: This paper studies clustering of multi-typed heterogeneous networks with a star network schema and proposes a novel algorithm, NetClus, that utilizes links across multi-typed objects to generate high-quality, informative net-clusters.

Abstract: A heterogeneous information network is an information network composed of multiple types of objects. Clustering on such a network may lead to better understanding of both hidden structures of the network and the individual role played by every object in each cluster. However, although clustering on homogeneous networks has been studied over decades, clustering on heterogeneous networks has not been addressed until recently. A recent study proposed a new algorithm, RankClus, for clustering on bi-typed heterogeneous networks. However, a real-world network may consist of more than two types, and the interactions among multi-typed objects play a key role at disclosing the rich semantics that a network carries. In this paper, we study clustering of multi-typed heterogeneous networks with a star network schema and propose a novel algorithm, NetClus, that utilizes links across multi-typed objects to generate high-quality net-clusters. An iterative enhancement method is developed that leads to effective ranking-based clustering in such heterogeneous networks. Our experiments on DBLP data show that NetClus generates more accurate clustering results than the baseline topic model algorithm PLSA and the recently proposed algorithm, RankClus. Further, NetClus generates informative clusters, presenting good ranking and cluster membership information for each attribute object in each net-cluster.

546 citations

Proceedings Article•DOI•

Fast approximate spectral clustering

Donghui Yan¹, Ling Huang², Michael I. Jordan¹•Institutions (2)

University of California, Berkeley¹, Intel²

28 Jun 2009

TL;DR: This work develops a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data, and develops two concrete instances of this framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP).

Abstract: Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of O(n³) in general, with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.

507 citations

Proceedings Article•DOI•

RankClus: integrating clustering with ranking for heterogeneous information network analysis

Yizhou Sun¹, Jiawei Han¹, Peixiang Zhao¹, Zhijun Yin¹, Hong Cheng², Tianyi Wu¹•Institutions (2)

University of Illinois at Urbana–Champaign¹, The Chinese University of Hong Kong²

24 Mar 2009

TL;DR: This paper addresses the problem of generating clusters for a specified type of objects, as well as ranking information for all types of objects based on these clusters in a multi-typed information network, and proposes a novel clustering framework called RankClus that directly generates clusters integrated with ranking.

Abstract: As information networks become ubiquitous, extracting knowledge from information networks has become an important task. Both ranking and clustering can provide overall views on information network data, and each has been a hot topic by itself. However, ranking objects globally without considering which clusters they belong to often leads to dumb results, e.g., ranking database and computer architecture conferences together may not make much sense. Similarly, clustering a huge number of objects (e.g., thousands of authors) in one huge cluster without distinction is dull as well. In this paper, we address the problem of generating clusters for a specified type of objects, as well as ranking information for all types of objects based on these clusters in a multi-typed (i.e., heterogeneous) information network. A novel clustering framework called RankClus is proposed that directly generates clusters integrated with ranking. Based on initial K clusters, ranking is applied separately, which serves as a good measure for each cluster. Then, we use a mixture model to decompose each object into a K-dimensional vector, where each dimension is a component coefficient with respect to a cluster, which is measured by rank distribution. Objects then are reassigned to the nearest cluster under the new measure space to improve clustering. As a result, quality of clustering and ranking are mutually enhanced, which means that the clusters are getting more accurate and the ranking is getting more meaningful. Such a progressive refinement process iterates until little change can be made. Our experiment results show that RankClus can generate more accurate clusters and in a more efficient way than the state-of-the-art link-based clustering methods. Moreover, the clustering results with ranks can provide more informative views of data compared with traditional clustering.

399 citations

Journal Article•DOI•

Semi-supervised graph clustering: a kernel approach

Brian Kulis¹, Sugato Basu², Inderjit S. Dhillon¹, Raymond J. Mooney¹•Institutions (2)

University of Texas at Austin¹, Google²

01 Jan 2009-Machine Learning

TL;DR: The proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective.

Abstract: Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.

384 citations

Improving the Accuracy and Efficiency of the k-means Clustering Algorithm

K. A. Abdul Nazeer, M. P. Sebastian

01 Jan 2009

TL;DR: A method is proposed for making the k-means clustering algorithm more effective and efficient, so as to get better clustering with reduced complexity.

Abstract: Emergence of modern techniques for scientific data collection has resulted in large scale accumulation of data pertaining to diverse fields. Conventional database querying methods are inadequate to extract useful information from huge data banks. Cluster analysis is one of the major data analysis methods and the k-means clustering algorithm is widely used for many practical applications. But the original k-means algorithm is computationally expensive and the quality of the resulting clusters heavily depends on the selection of initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means clustering algorithm. This paper proposes a method for making the algorithm more effective and efficient, so as to get better clustering with reduced complexity.

324 citations

Journal Article•DOI•

Evaluating clustering in subspace projections of high dimensional data

Emmanuel Müller¹, Stephan Günnemann¹, Ira Assent², Thomas Seidl¹•Institutions (2)

RWTH Aachen University¹, Aalborg University²

01 Aug 2009

TL;DR: In this paper, the authors take a systematic approach to evaluate the major clustering paradigms in a common framework and provide a benchmark set of results on a large variety of real world and synthetic data sets.

Abstract: Clustering high dimensional data is an emerging research field. Subspace clustering or projected clustering group similar objects in subspaces, i.e. projections, of the full space. In the past decade, several clustering paradigms have been developed in parallel, without thorough evaluation and comparison between these paradigms on a common basis. Conclusive evaluation and comparison is challenged by three major issues. First, there is no ground truth that describes the "true" clusters in real world data. Second, a large variety of evaluation measures have been used that reflect different aspects of the clustering result. Finally, in typical publications authors have limited their analysis to their favored paradigm only, while paying other paradigms little or no attention. In this paper, we take a systematic approach to evaluate the major paradigms in a common framework. We study representative clustering algorithms to characterize the different aspects of each paradigm and give a detailed comparison of their properties. We provide a benchmark set of results on a large variety of real world and synthetic data sets. Using different evaluation measures, we broaden the scope of the experimental analysis and create a common baseline for future developments and comparable evaluations in the field. For repeatability, all implementations, data sets and evaluation measures are available on our website.

294 citations

Proceedings Article•DOI•

Clustering the tagged web

Daniel Ramage¹, Paul Heymann¹, Christopher D. Manning¹, Hector Garcia-Molina¹•Institutions (1)

Stanford University¹

09 Feb 2009

TL;DR: It is demonstrated how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages.

Abstract: Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.

Proceedings Article•DOI•

Adapting the right measures for K-means clustering

Junjie Wu¹, Hui Xiong², Jian Chen³•Institutions (3)

Beihang University¹, Rutgers University², Tsinghua University³

28 Jun 2009

TL;DR: An organized study of 16 external validation measures for K-means clustering by introducing the importance of measure normalization in the evaluation of the clustering performance on data with imbalanced class distributions and revealing the interrelationships among these external measures.

Abstract: Clustering validation is a long standing challenge in the clustering literature. While many validation measures have been developed for evaluating the performance of clustering algorithms, these measures often provide inconsistent information about the clustering performance and the best suitable measures to use in practice remain unknown. This paper thus fills this crucial void by giving an organized study of 16 external validation measures for K-means clustering. Specifically, we first introduce the importance of measure normalization in the evaluation of the clustering performance on data with imbalanced class distributions. We also provide normalization solutions for several measures. In addition, we summarize the major properties of these external measures. These properties can serve as the guidance for the selection of validation measures in different application scenarios. Finally, we reveal the interrelationships among these external measures. By mathematical transformation, we show that some validation measures are equivalent. Also, some measures have consistent validation performances. Most importantly, we provide a guide line to select the most suitable validation measures for K-means clustering.

Proceedings Article•DOI•

Interactive visual clustering of large collections of trajectories

Gennady Andrienko¹, Natalia Andrienko¹, Salvatore Rinzivillo², Mirco Nanni², Dino Pedreschi³, Fosca Giannotti²•Institutions (3)

Fraunhofer Society¹, Istituto di Scienza e Tecnologie dell'Informazione², University of Pisa³

13 Nov 2009

TL;DR: This work proposes an approach to extracting meaningful clusters from large databases by combining clustering and classification, which are driven by a human analyst through an interactive visual interface.

Abstract: One of the most common operations in exploration and analysis of various kinds of data is clustering, i.e. discovery and interpretation of groups of objects having similar properties and/or behaviors. In clustering, objects are often treated as points in multi-dimensional space of properties. However, structurally complex objects, such as trajectories of moving entities and other kinds of spatio-temporal data, cannot be adequately represented in this manner. Such data require sophisticated and computationally intensive clustering algorithms, which are very hard to scale effectively to large datasets not fitting in the computer main memory. We propose an approach to extracting meaningful clusters from large databases by combining clustering and classification, which are driven by a human analyst through an interactive visual interface.

Proceedings Article•DOI•

Learning trajectory patterns by clustering: Experimental studies and comparative evaluation

Brendan Morris¹, Mohan M. Trivedi¹•Institutions (1)

University of California, San Diego¹

20 Jun 2009

TL;DR: This paper evaluates different similarity measures and clustering methodologies to catalog their strengths and weaknesses when utilized for the trajectory learning problem.

Abstract: Recently a large amount of research has been devoted to automatic activity analysis. Typically, activities have been defined by their motion characteristics and represented by trajectories. These trajectories are collected and clustered to determine typical behaviors. This paper evaluates different similarity measures and clustering methodologies to catalog their strengths and weaknesses when utilized for the trajectory learning problem. The clustering performance is measured by evaluating the correct clustering rate on different datasets with varying characteristics.

Journal Article•DOI•

K-Means Clustering Versus Validation Measures: A Data-Distribution Perspective

Hui Xiong¹, Junjie Wu², Jian Chen³•Institutions (3)

Rutgers University¹, Beihang University², Tsinghua University³

01 Apr 2009

TL;DR: This paper provides a formal and organized study of the effect of skewed data distributions on K-means clustering and provides the coefficient of variation (CV) as a necessary criterion to validate the clustering results.

Abstract: K-means is a well-known and widely used partitional clustering method. While there are considerable research efforts to characterize the key features of the K-means clustering algorithm, further investigation is needed to understand how data distributions can have impact on the performance of K-means clustering. To that end, in this paper, we provide a formal and organized study of the effect of skewed data distributions on K-means clustering. Along this line, we first formally illustrate that K-means tends to produce clusters of relatively uniform size, even if input data have varied “true” cluster sizes. In addition, we show that some clustering validation measures, such as the entropy measure, may not capture this uniform effect and provide misleading information on the clustering performance. Viewed in this light, we provide the coefficient of variation (CV) as a necessary criterion to validate the clustering results. Our findings reveal that K-means tends to produce clusters in which the variations of cluster sizes, as measured by CV, are in a range of about 0.3-1.0. Specifically, for data sets with large variation in “true” cluster sizes (e.g., CV > 1.0), K-means reduces variation in resultant cluster sizes to less than 1.0. In contrast, for data sets with small variation in “true” cluster sizes (e.g., CV < 0.3), K-means increases variation in resultant cluster sizes to greater than 0.3. In other words, for the earlier two cases, K-means produces the clustering results which are away from the “true” cluster distributions.
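As a rough illustration of the CV criterion, the coefficient of variation of cluster sizes can be computed directly from a list of cluster labels. This sketch assumes CV is the sample standard deviation of the cluster sizes divided by their mean; the abstract does not state which standard deviation is meant, and the function name `cluster_size_cv` is hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def cluster_size_cv(labels):
    """Coefficient of variation of cluster sizes for one clustering.

    Assumption: CV = sample standard deviation / mean of the cluster sizes.
    Requires at least two clusters, because stdev needs two data points.
    """
    sizes = list(Counter(labels).values())
    return stdev(sizes) / mean(sizes)


# Compare the CV of the "true" classes with the CV of a clustering result.
true_labels = ["a"] * 90 + ["b"] * 5 + ["c"] * 5
found_labels = [0] * 40 + [1] * 30 + [2] * 30
cv_true = cluster_size_cv(true_labels)
cv_found = cluster_size_cv(found_labels)
```

Comparing `cv_true` with `cv_found` shows how far the resulting cluster sizes have moved from the class sizes. The label lists above are made-up illustration data, not results from the paper.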

Journal Article•DOI•

A Comparison of Hierarchical Methods for Clustering Functional Data

Laura Ferreira¹, David B. Hitchcock¹•Institutions (1)

University of South Carolina¹

09 Dec 2009-Communications in Statistics - Simulation and Computation

TL;DR: A simulation study compares the performance of four major hierarchical methods for clustering functional data and yields concrete suggestions to help future researchers determine the best method for clustering their functional data.

Abstract: Functional data analysis (FDA)—the analysis of data that can be considered a set of observed continuous functions—is an increasingly common class of statistical analysis. One of the most widely used FDA methods is the cluster analysis of functional data; however, little work has been done to compare the performance of clustering methods on functional data. In this article, a simulation study compares the performance of four major hierarchical methods for clustering functional data. The simulated data varied in three ways: the nature of the signal functions (periodic, non periodic, or mixed), the amount of noise added to the signal functions, and the pattern of the true cluster sizes. The Rand index was used to compare the performance of each clustering method. As a secondary goal, clustering methods were also compared when the number of clusters has been misspecified. To illustrate the results, a real set of functional data was clustered where the true clustering structure is believed to be known. Compari...
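The Rand index mentioned above scores the agreement between two clusterings as the fraction of object pairs that both clusterings treat the same way (together in both, or apart in both). A minimal sketch of the plain, unadjusted index follows; the function name `rand_index` is hypothetical, and the quadratic pair loop is meant for small inputs only.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two clusterings agree.

    A pair counts as an agreement if both clusterings put the two objects
    in the same cluster, or both put them in different clusters.
    Requires at least two objects.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total
```

The index depends only on how objects are grouped, not on the label names, so relabelling the clusters of one partition leaves the score unchanged.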

Journal Article•DOI•

Generalized Fuzzy C-Means Clustering Algorithm With Improved Fuzzy Partitions

Lin Zhu¹, Fu-Lai Chung, Shitong Wang¹•Institutions (1)

Jiangnan University¹

01 Jun 2009

TL;DR: A recent advance of fuzzy clustering called fuzzy c-means clustering with improved fuzzy partitions (IFP-FCM) is extended in this paper, and a generalized algorithm for more effective clustering is proposed by introducing a novel membership constraint function.

Abstract: The fuzziness index m has important influence on the clustering result of fuzzy clustering algorithms, and it should not be forced to fix at the usual value m = 2. In view of its distinctive features in applications and its limitation in having m = 2 only, a recent advance of fuzzy clustering called fuzzy c-means clustering with improved fuzzy partitions (IFP-FCM) is extended in this paper, and a generalized algorithm called GIFP-FCM for more effective clustering is proposed. By introducing a novel membership constraint function, a new objective function is constructed, and furthermore, GIFP-FCM clustering is derived. Meanwhile, from the viewpoints of Lp norm distance measure and competitive learning, the robustness and convergence of the proposed algorithm are analyzed. Furthermore, the classical fuzzy c-means algorithm (FCM) and IFP-FCM can be taken as two special cases of the proposed algorithm. Several experimental results including its application to noisy image texture segmentation are presented to demonstrate its average advantage over FCM and IFP-FCM in both clustering and robustness capabilities.
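For orientation, the sketch below shows the classical fuzzy c-means (FCM) updates that the abstract names as a special case. It is not GIFP-FCM, whose membership constraint function is not given here. The function names are hypothetical. Memberships follow the standard rule u_ij = 1 / Σ_k (d_ij / d_ik)^(2/(m-1)), and each center is the mean of the points weighted by u_ij^m, with m > 1 the fuzziness index.

```python
def fcm_memberships(points, centers, m=2.0):
    """Return u[i][j], the membership of point i in cluster j (rows sum to 1)."""
    exponent = 2.0 / (m - 1.0)
    u = []
    for p in points:
        d = [sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5 for c in centers]
        if 0.0 in d:
            # Point coincides with one or more centers: share membership among them.
            hits = [1.0 if dj == 0.0 else 0.0 for dj in d]
            u.append([h / sum(hits) for h in hits])
        else:
            u.append([1.0 / sum((dj / dk) ** exponent for dk in d) for dj in d])
    return u


def fcm_centers(points, u, m=2.0):
    """Return centers as means of the points weighted by u[i][j] ** m."""
    dim = len(points[0])
    centers = []
    for j in range(len(u[0])):
        w = [row[j] ** m for row in u]
        total = sum(w)  # assumed non-zero: every cluster has some membership
        centers.append(
            tuple(sum(wi * p[k] for wi, p in zip(w, points)) / total for k in range(dim))
        )
    return centers


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate the two updates for a fixed number of rounds."""
    for _ in range(iterations):
        u = fcm_memberships(points, centers, m)
        centers = fcm_centers(points, u, m)
    return centers, fcm_memberships(points, centers, m)
```

Raising `m` makes the memberships softer and lowering it toward 1 makes them crisper, which is why the choice of fuzziness index affects the result.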

Journal Article•DOI•

Characterization and evaluation of similarity measures for pairs of clusterings

Darius Pfitzner¹, Richard Leibbrandt¹, David M. W. Powers¹•Institutions (1)

Flinders University¹

26 May 2009-Knowledge and Information Systems

TL;DR: A paradigm apparatus for the evaluation of clustering comparison techniques is introduced and a novel clustering similarity measure, the Measure of Concordance (MoC), is proposed; only MoC, Powers’s measure, Lopez and Rajski’s measure and various forms of Normalised Mutual Information are shown to exhibit the desired behaviour under each of the test scenarios.

Abstract: In evaluating the results of cluster analysis, it is common practice to make use of a number of fixed heuristics rather than to compare a data clustering directly against an empirically derived standard, such as a clustering empirically obtained from human informants. Given the dearth of research into techniques to express the similarity between clusterings, there is broad scope for fundamental research in this area. In defining the comparative problem, we identify two types of worst-case matches between pairs of clusterings, characterised as independently codistributed clustering pairs and conjugate partition pairs. Desirable behaviour for a similarity measure in either of the two worst cases is discussed, giving rise to five test scenarios in which characteristics of one of a pair of clusterings was manipulated in order to compare and contrast the behaviour of different clustering similarity measures. This comparison is carried out for previously-proposed clustering similarity measures, as well as a number of established similarity measures that have not previously been applied to clustering comparison. We introduce a paradigm apparatus for the evaluation of clustering comparison techniques and distinguish between the goodness of clusterings and the similarity of clusterings by clarifying the degree to which different measures confuse the two. Accompanying this is the proposal of a novel clustering similarity measure, the Measure of Concordance (MoC). We show that only MoC, Powers’s measure, Lopez and Rajski’s measure and various forms of Normalised Mutual Information exhibit the desired behaviour under each of the test scenarios.
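Normalised Mutual Information comes in several forms, and the abstract does not say which ones were tested. The sketch below picks one common normalisation as an assumption: mutual information divided by the arithmetic mean of the two entropies. The function names are hypothetical, and the Measure of Concordance itself is not reproduced because its definition is not given here.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of the cluster-size distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information between two labelings of the same objects."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    return sum((c / n) * log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())


def nmi(labels_a, labels_b):
    """NMI with arithmetic-mean normalisation (one of several variants)."""
    ha, hb = entropy(labels_a), entropy(labels_b)
    if ha == 0.0 and hb == 0.0:
        # Assumed convention: two single-cluster partitions are identical.
        return 1.0
    return 2.0 * mutual_information(labels_a, labels_b) / (ha + hb)
```

Other variants divide by the minimum, maximum, or geometric mean of the entropies instead, so scores from different forms are not directly comparable.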

A roadmap of clustering algorithms: finding a match for a biomedical

Xiaogang Wang, Michael Schroeder

01 Jan 2009

TL;DR: In this paper, the authors present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms and compare algorithms on the basis of these features, and outline algorithms' benefits and drawbacks as a basis for matching them to biomedical applications.

Abstract: Clustering is ubiquitously applied in bioinformatics with hierarchical clustering and k-means partitioning being the most popular methods. Numerous improvements of these two clustering methods have been introduced, as well as completely different approaches such as grid-based, density-based and model-based clustering. For improved bioinformatics analysis of data, it is important to match clusterings to the requirements of a biomedical application. In this article, we present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms. We review 40 different clustering algorithms of all approaches and datatypes. We compare algorithms on the basis of desirable clustering features, and outline algorithms’ benefits and drawbacks as a basis for matching them to biomedical applications.

Book Chapter•DOI•

A survey of Clustering Algorithms

Lior Rokach¹•Institutions (1)

Ben-Gurion University of the Negev¹

01 Jan 2009

TL;DR: This chapter presents a tutorial overview of the main clustering methods used in Data Mining, divided into hierarchical, partitioning, density-based, model-based, grid-based, and soft-computing methods.

Abstract: This chapter presents a tutorial overview of the main clustering methods used in Data Mining. The goal is to provide a self-contained review of the concepts and the mathematics underlying clustering techniques. The chapter begins by providing measures and criteria that are used for determining whether two objects are similar or dissimilar. Then the clustering methods are presented, divided into: hierarchical, partitioning, density-based, model-based, grid-based, and soft-computing methods. Following the methods, the challenges of performing clustering in large data sets are discussed. Finally, the chapter presents how to determine the number of clusters.

Journal Article•DOI•

A novel approach for classification of ECG arrhythmias: Type-2 fuzzy clustering neural network

Rahime Ceylan¹, Yüksel Özbay¹, Bekir Karlik²•Institutions (2)

Selçuk University¹, Fatih University²

01 Apr 2009-Expert Systems With Applications

TL;DR: An improved classifier for automated diagnostic systems of electrocardiogram (ECG) arrhythmias is presented; it combines the type-2 fuzzy c-means clustering (T2FCM) algorithm with a neural network to build a classification system with a high accuracy rate for ECG beats.

Abstract: This paper presents an improved classifier for automated diagnostic systems of electrocardiogram (ECG) arrhythmias. This diagnostic system consists of a combined Fuzzy Clustering Neural Network Algorithm for Classification of ECG Arrhythmias using the type-2 fuzzy c-means clustering (T2FCM) algorithm and a neural network. Type-2 fuzzy c-means clustering is used to improve the performance of the neural network. The aim of improving the classifier's performance is to constitute the best classification system with a high accuracy rate for ECG beats. Ten types of ECG arrhythmias (normal beat, sinus bradycardia, ventricular tachycardia, sinus arrhythmia, atrial premature contraction, paced beat, right bundle branch block, left bundle branch block, atrial fibrillation and atrial flutter) obtained from the MIT-BIH database were analyzed. However, the presented structure was tested on experimental ECG records of 92 patients (40 male and 52 female, average age 39.75 ± 19.06). The classification accuracy of the improved classifier in training and testing, namely the Type-2 Fuzzy Clustering Neural Network (T2FCNN), was compared with that of a neural network (NN) and a fuzzy clustering neural network (FCNN). In the T2FCNN architecture, decision making has two stages: forming the new training set, obtained by selecting the best arrhythmia for each arrhythmia class using T2FCM, and classification using a neural network trained on the new training set. The results demonstrate that the proposed diagnostic system achieved a high (99%) accuracy rate.

Proceedings Article•

A Game-Theoretic Approach to Hypergraph Clustering

Samuel Rota Bulò¹, Marcello Pelillo¹•Institutions (1)

Ca' Foscari University of Venice¹

07 Dec 2009

TL;DR: It is proved that the problem of finding the equilibria of the clustering game is equivalent to locally optimizing a polynomial function over the standard simplex, and a discrete-time high-order replicator dynamics based on the Baum-Eagon inequality is provided to perform this optimization.

Abstract: Hypergraph clustering refers to the process of extracting maximally coherent groups from a set of objects using high-order (rather than pairwise) similarities. Traditional approaches to this problem are based on the idea of partitioning the input data into a user-defined number of classes, thereby obtaining the clusters as a by-product of the partitioning process. In this paper, we provide a radically different perspective to the problem. In contrast to the classical approach, we attempt to provide a meaningful formalization of the very notion of a cluster and we show that game theory offers an attractive and unexplored perspective that serves well our purpose. Specifically, we show that the hypergraph clustering problem can be naturally cast into a non-cooperative multi-player "clustering game", whereby the notion of a cluster is equivalent to a classical game-theoretic equilibrium concept. From the computational viewpoint, we show that the problem of finding the equilibria of our clustering game is equivalent to locally optimizing a polynomial function over the standard simplex, and we provide a discrete-time dynamics to perform this optimization. Experiments are presented which show the superiority of our approach over state-of-the-art hypergraph clustering techniques.
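
To make the idea of optimizing a polynomial over the standard simplex concrete, the sketch below shows the ordinary pairwise (degree-2) replicator dynamics, which is only a special case and not the paper's high-order hypergraph method. It assumes a symmetric, non-negative similarity matrix `A` with zero diagonal; the function names and the toy matrix are hypothetical.

```python
def cohesion(A, x):
    """Quadratic objective f(x) = x^T A x on the simplex."""
    return sum(
        x[i] * A[i][j] * x[j] for i in range(len(x)) for j in range(len(x))
    )


def replicator_step(A, x):
    """One discrete-time replicator update: x_i <- x_i * (Ax)_i / (x^T A x).

    For symmetric non-negative A this keeps x on the simplex and,
    by the Baum-Eagon inequality, does not decrease x^T A x.
    """
    Ax = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
    f = sum(x_i * ax_i for x_i, ax_i in zip(x, Ax))
    if f <= 0:
        raise ValueError("x^T A x must be positive to normalise the update")
    return [x_i * ax_i / f for x_i, ax_i in zip(x, Ax)]


def find_cluster(A, iters=100):
    """Iterate from the simplex barycentre; the support of the result
    (entries clearly above zero) is read off as one cluster."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(iters):
        x = replicator_step(A, x)
    return x


# Toy data (made up): objects 0-2 are mutually similar, object 3 is not.
A = [
    [0.0, 1.0, 1.0, 0.1],
    [1.0, 0.0, 1.0, 0.1],
    [1.0, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
x = find_cluster(A)
```

In this toy case the weight of object 3 decays towards zero while objects 0 to 2 share the remaining mass, so the extracted group is {0, 1, 2}.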

Journal Article•DOI•

Automatic image pixel clustering with an improved differential evolution

Swagatam Das¹, Amit Konar¹•Institutions (1)

Jadavpur University¹

01 Jan 2009

TL;DR: Extensive performance comparison among the new method, a recently developed genetic-fuzzy clustering technique and the classical fuzzy c-means algorithm over a test suite comprising ordinary grayscale images and remote sensing satellite images reveals the superiority of the proposed technique in terms of speed, accuracy and robustness.

Abstract: This article proposes an evolutionary-fuzzy clustering algorithm for automatically grouping the pixels of an image into different homogeneous regions. The algorithm does not require prior knowledge of the number of clusters. The fuzzy clustering task in the intensity space of an image is formulated as an optimization problem. An improved variant of the differential evolution (DE) algorithm has been used to determine the number of naturally occurring clusters in the image as well as to refine the cluster centers. We report extensive performance comparison among the new method, a recently developed genetic-fuzzy clustering technique and the classical fuzzy c-means algorithm over a test suite comprising ordinary grayscale images and remote sensing satellite images. Such comparisons reveal, in a statistically meaningful way, the superiority of the proposed technique in terms of speed, accuracy and robustness.
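
For orientation, the classical fuzzy c-means baseline mentioned in this abstract alternates a membership update and a center update. The sketch below is a minimal illustration on 1-D intensities, not the differential-evolution method the article proposes; it assumes a fixed number of clusters, a fuzzifier `m = 2`, a fixed iteration count, and hand-picked initial centers, and all function names are hypothetical.

```python
def memberships(xs, centers, m=2.0):
    """Membership u_ij of point i in cluster j:
    u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))."""
    power = 2.0 / (m - 1.0)
    U = []
    for x in xs:
        d = [abs(x - c) for c in centers]
        if 0.0 in d:
            # Point coincides with a center: give it full membership there.
            hits = [1.0 if dj == 0.0 else 0.0 for dj in d]
            total = sum(hits)
            U.append([h / total for h in hits])
        else:
            U.append([1.0 / sum((dj / dk) ** power for dk in d) for dj in d])
    return U


def update_centers(xs, U, m=2.0):
    """Center v_j = sum_i u_ij^m x_i / sum_i u_ij^m."""
    k = len(U[0])
    return [
        sum(U[i][j] ** m * xs[i] for i in range(len(xs)))
        / sum(U[i][j] ** m for i in range(len(xs)))
        for j in range(k)
    ]


def fuzzy_c_means(xs, centers, m=2.0, iters=50):
    """Alternate the two updates for a fixed number of iterations."""
    for _ in range(iters):
        U = memberships(xs, centers, m)
        centers = update_centers(xs, U, m)
    return centers, memberships(xs, centers, m)


# Toy intensities (made up) with two obvious groups.
pixels = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
centers, U = fuzzy_c_means(pixels, centers=[0.0, 12.0])
```

Unlike this baseline, the method described in the abstract also searches over the number of clusters, which here has to be supplied through the initial `centers` list.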

Journal Article•DOI•

Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition.

D. N. Georgiou¹, Theodoros E. Karakasidis², Juan J. Nieto³, Ángela Torres³•Institutions (3)

University of Patras¹, University of Thessaly², University of Santiago de Compostela³

07 Mar 2009-Journal of Theoretical Biology

TL;DR: It turns out that one should use the properties that most strongly determine the behavior of the amino acids, and that the use of an appropriate metric can help in defining the groups.

Journal Article•DOI•

A Novel Density-Based Clustering Framework by Using Level Set Method

Xiao-Feng Wang¹, De-Shuang Huang¹•Institutions (1)

Chinese Academy of Sciences¹

01 Nov 2009-IEEE Transactions on Knowledge and Data Engineering

TL;DR: Comparisons with DBSCAN method, OPTICS method, and valley seeking clustering method further show that the proposed framework can successfully avoid the overfitting phenomenon and solve the confusion problem of cluster boundary points and outliers.

Abstract: In this paper, a new density-based clustering framework is proposed by adopting the assumption that the cluster centers in data space can be regarded as target objects in image space. First, the level set evolution is adopted to find an approximation of cluster centers by using a new initial boundary formation scheme. Accordingly, three types of initial boundaries are defined so that each of them can evolve to approach the cluster centers in different ways. To avoid the long iteration time of level set evolution in data space, an efficient termination criterion is presented to stop the evolution process in the circumstance that no more cluster centers can be found. Then, a new effective density representation called level set density (LSD) is constructed from the evolution results. Finally, the valley seeking clustering is used to group data points into corresponding clusters based on the LSD. The experiments on some synthetic and real data sets have demonstrated the efficiency and effectiveness of the proposed clustering framework. The comparisons with DBSCAN method, OPTICS method, and valley seeking clustering method further show that the proposed framework can successfully avoid the overfitting phenomenon and solve the confusion problem of cluster boundary points and outliers.

Journal Article•DOI•

Reinforcement Ant Optimized Fuzzy Controller for Mobile-Robot Wall-Following Control

Chia-Feng Juang¹, Chia-Hung Hsu¹•Institutions (1)

National Chung Hsing University¹

24 Mar 2009-IEEE Transactions on Industrial Electronics

TL;DR: This paper proposes a reinforcement ant optimized fuzzy controller (FC) design method, called RAOFC, and applies it to wheeled-mobile-robot wall-following control under reinforcement learning environments, and proposes an online aligned interval type-2 fuzzy clustering method to generate rules automatically.

Abstract: This paper proposes a reinforcement ant optimized fuzzy controller (FC) design method, called RAOFC, and applies it to wheeled-mobile-robot wall-following control under reinforcement learning environments. The inputs to the designed FC are range-finding sonar sensors, and the controller output is a robot steering angle. The antecedent part in each fuzzy rule uses interval type-2 fuzzy sets in order to increase FC robustness. No a priori assignment of fuzzy rules is necessary in RAOFC. An online aligned interval type-2 fuzzy clustering (AIT2FC) method is proposed to generate rules automatically. The AIT2FC not only flexibly partitions the input space but also reduces the number of fuzzy sets in each input dimension, which improves controller interpretability. The consequent part of each fuzzy rule is designed using Q-value aided ant colony optimization (QACO). The QACO approach selects the consequent part from a set of candidate actions according to ant pheromone trails and Q-values, both of whose values are updated using reinforcement signals. Simulations and experiments on mobile-robot wall-following control show the effectiveness and efficiency of the proposed RAOFC.

Journal Article•

A Survey: Clustering Ensembles Techniques

Reza Ghaemi, Md. Nasir Sulaiman, Hamidah Ibrahim, Norwati Mustapha

25 Feb 2009-World Academy of Science, Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering

TL;DR: Clustering ensembles combine multiple partitions generated by different clustering algorithms into a single clustering solution; this survey introduces clustering ensembles, the representation of multiple partitions and its challenges, and presents a taxonomy of combination algorithms.

Abstract: Clustering ensembles combine multiple partitions generated by different clustering algorithms into a single clustering solution. Clustering ensembles have emerged as a prominent method for improving the robustness, stability and accuracy of unsupervised classification solutions. So far, many contributions have been made towards finding a consensus clustering. One of the major problems in clustering ensembles is the consensus function. In this paper, we first introduce clustering ensembles, the representation of multiple partitions and its challenges, and present a taxonomy of combination algorithms. Second, we describe consensus functions in clustering ensembles, including hypergraph partitioning, the voting approach, mutual information, co-association based functions and the finite mixture model, and then explain their advantages, disadvantages and computational complexity. Finally, we compare the characteristics of clustering ensemble algorithms from previous work, such as computational complexity, robustness, simplicity and accuracy on different datasets.
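
Of the consensus functions listed in this abstract, the co-association approach is the simplest to illustrate. The sketch below is one possible reading rather than the survey's own formulation: it assumes hard partitions given as label lists, and it derives the consensus by linking objects that co-occur in more than half of the partitions and taking connected components. The function names and the 0.5 threshold are hypothetical choices.

```python
def co_association(partitions):
    """C[i][j] = fraction of partitions in which objects i and j
    are assigned to the same cluster."""
    n = len(partitions[0])
    C = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += 1.0 / len(partitions)
    return C


def consensus(partitions, threshold=0.5):
    """Consensus labels: connected components of the graph whose edges
    join pairs with co-association above the threshold."""
    C = co_association(partitions)
    n = len(C)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and C[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Three base clusterings of four objects (made-up example).
base = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
result = consensus(base)  # → [0, 0, 1, 1]
```

Building the matrix costs time quadratic in the number of objects per partition, which is the usual drawback cited for co-association methods.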

Proceedings Article•DOI•

Learning the Shared Subspace for Multi-task Clustering and Transductive Transfer Classification

Quanquan Gu¹, Jie Zhou¹•Institutions (1)

Tsinghua University¹

06 Dec 2009

TL;DR: A novel clustering paradigm, multi-task clustering, is studied; it performs multiple related clustering tasks together and utilizes the relation between these tasks to enhance clustering performance, and its extension to transductive transfer classification is comparable to or even better than several existing approaches.

Abstract: There are many clustering tasks which are closely related in the real world, e.g. clustering the web pages of different universities. However, existing clustering approaches neglect the underlying relation and treat these clustering tasks either individually or simply together. In this paper, we will study a novel clustering paradigm, namely multi-task clustering, which performs multiple related clustering tasks together and utilizes the relation of these tasks to enhance the clustering performance. We aim to learn a subspace shared by all the tasks, through which the knowledge of the tasks can be transferred to each other. The objective of our approach consists of two parts: (1) Within-task clustering: clustering the data of each task in its input space individually; and (2) Cross-task clustering: simultaneously learning the shared subspace and clustering the data of all the tasks together. We will show that it can be solved by alternating minimization, and its convergence is theoretically guaranteed. Furthermore, we will show that given the labels of one task, our multi-task clustering method can be extended to transductive transfer classification (a.k.a. cross-domain classification, domain adaptation). Experiments on several cross-domain text data sets demonstrate that the proposed multi-task clustering greatly outperforms traditional single-task clustering methods, and that the transductive transfer classification method is comparable to or even better than several existing transductive transfer classification approaches.

Journal Article•DOI•

Overlapping Multihop Clustering for Wireless Sensor Networks

Moustafa Youssef¹, Adel Amin Youssef², Mohamed Younis³•Institutions (3)

Nile University¹, Google², University of Maryland, Baltimore³

01 Dec 2009-IEEE Transactions on Parallel and Distributed Systems

TL;DR: The results show that KOCA produces approximately equal-sized clusters, which allow distributing the load evenly over different clusters, and is scalable; the clustering formation terminates in a constant time regardless of the network size.

Abstract: Clustering is a standard approach for achieving efficient and scalable performance in wireless sensor networks. Traditionally, clustering algorithms aim at generating a number of disjoint clusters that satisfy some criteria. In this paper, we formulate a novel clustering problem that aims at generating overlapping multihop clusters. Overlapping clusters are useful in many sensor network applications, including intercluster routing, node localization, and time synchronization protocols. We also propose a randomized, distributed multihop clustering algorithm (KOCA) for solving the overlapping clustering problem. KOCA aims at generating connected overlapping clusters that cover the entire sensor network with a specific average overlapping degree. Through analysis and simulation experiments, we show how to select the different values of the parameters to achieve the clustering process objectives. Moreover, the results show that KOCA produces approximately equal-sized clusters, which allow distributing the load evenly over different clusters. In addition, KOCA is scalable; the clustering formation terminates in a constant time regardless of the network size.

Journal Article•DOI•

A new approach for fuzzy risk analysis based on similarity measures of generalized fuzzy numbers

Shih-Hua Wei¹, Shyi-Ming Chen¹•Institutions (1)

National Taiwan University of Science and Technology¹

01 Jan 2009-Expert Systems With Applications

TL;DR: A new similarity measure between generalized fuzzy numbers is presented that combines the concepts of geometric distance, the perimeter and the height of generalized fuzzy numbers for calculating the degree of similarity between generalized fuzzy numbers.

Abstract: In this paper, we present a new method for fuzzy risk analysis based on similarity measures between generalized fuzzy numbers. First, we present a new similarity measure between generalized fuzzy numbers. It combines the concepts of geometric distance, the perimeter and the height of generalized fuzzy numbers for calculating the degree of similarity between generalized fuzzy numbers. We also prove some properties of the proposed similarity measure. We conduct an experiment using 15 sets of generalized fuzzy numbers to compare the results of the proposed method with those of the existing similarity measures. The proposed method can overcome the drawbacks of the existing similarity measures. Based on the proposed similarity measure between generalized fuzzy numbers, we present a new fuzzy risk analysis algorithm for dealing with fuzzy risk analysis problems, where the values of the evaluating items are represented by generalized fuzzy numbers. The proposed method provides a useful way to deal with fuzzy risk analysis problems.
