Showing papers on "Fuzzy clustering" published in 2009


Journal ArticleDOI
TL;DR: This survey tries to clarify the different problem definitions related to subspace clustering in general; the specific difficulties encountered in this field of research; the varying assumptions, heuristics, and intuitions forming the basis of different approaches; and how several prominent solutions tackle different problems.
Abstract: As a prolific research area in data mining, subspace clustering and related problems induced a vast quantity of proposed solutions. However, many publications compare a new proposition—if at all—with one or two competitors, or even with a so-called “naive” ad hoc solution, but fail to clarify the exact problem definition. As a consequence, even if two solutions are thoroughly compared experimentally, it will often remain unclear whether both solutions tackle the same problem or, if they do, whether they agree in certain tacit assumptions and how such assumptions may influence the outcome of an algorithm. In this survey, we try to clarify: (i) the different problem definitions related to subspace clustering in general; (ii) the specific difficulties encountered in this field of research; (iii) the varying assumptions, heuristics, and intuitions forming the basis of different approaches; and (iv) how several prominent solutions tackle different problems.

1,206 citations


Book
17 Feb 2009
TL;DR: Introduction: Chemoinformatics-Chemometrics-Statistics; This Book; Historical Remarks about Chemometrics; Bibliography; Starting Examples; Univariate Statistics-A Reminder. Multivariate Data: Definitions; Basic Preprocessing; Covariance and Correlation; Distances and Similarities; Multivariate Outlier Identification; Linear Latent Variables.
Abstract: Introduction: Chemoinformatics-Chemometrics-Statistics; This Book; Historical Remarks about Chemometrics; Bibliography; Starting Examples; Univariate Statistics-A Reminder. Multivariate Data: Definitions; Basic Preprocessing; Covariance and Correlation; Distances and Similarities; Multivariate Outlier Identification; Linear Latent Variables; Summary. Principal Component Analysis (PCA): Concepts; Number of PCA Components; Centering and Scaling; Outliers and Data Distribution; Robust PCA; Algorithms for PCA; Evaluation and Diagnostics; Complementary Methods for Exploratory Data Analysis; Examples; Summary. Calibration: Concepts; Performance of Regression Models; Ordinary Least Squares Regression; Robust Regression; Variable Selection; Principal Component Regression; Partial Least Squares Regression; Related Methods; Examples; Summary. Classification: Concepts; Linear Classification Methods; Kernel and Prototype Methods; Classification Trees; Artificial Neural Networks; Support Vector Machine; Evaluation; Examples; Summary. Cluster Analysis: Concepts; Distance and Similarity Measures; Partitioning Methods; Hierarchical Clustering Methods; Fuzzy Clustering; Model-Based Clustering; Cluster Validity and Clustering Tendency Measures; Examples; Summary. Preprocessing: Concepts; Smoothing and Differentiation; Multiplicative Signal Correction; Mass Spectral Features. Appendix 1: Symbols and Abbreviations. Appendix 2: Matrix Algebra. Appendix 3: Introduction to R. Index. References appear at the end of each chapter.

1,003 citations


Proceedings ArticleDOI
14 Jun 2009
TL;DR: Under the assumption that the views are uncorrelated given the cluster label, it is shown that the separation conditions required for the algorithm to be successful are significantly weaker than prior results in the literature.
Abstract: Clustering data in high dimensions is believed to be a hard problem in general. A number of efficient clustering algorithms developed in recent years address this problem by projecting the data into a lower-dimensional subspace, e.g. via Principal Components Analysis (PCA) or random projections, before clustering. Here, we consider constructing such projections using multiple views of the data, via Canonical Correlation Analysis (CCA). Under the assumption that the views are uncorrelated given the cluster label, we show that the separation conditions required for the algorithm to be successful are significantly weaker than prior results in the literature. We provide results for mixtures of Gaussians and mixtures of log-concave distributions. We also provide empirical support from audio-visual speaker clustering (where we desire the clusters to correspond to speaker ID) and from hierarchical Wikipedia document clustering (where one view is the words in the document and the other is the link structure).

765 citations


Book ChapterDOI
22 Nov 2009
TL;DR: This paper proposes a parallel k-means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique, and demonstrates that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.
Abstract: Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volume of information produced by the progress of technology makes clustering of very large-scale data a challenging task. In order to deal with the problem, many researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel k-means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.

626 citations
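The decomposition described in the abstract above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names (`map_phase`, `reduce_phase`, `kmeans_mapreduce`) are hypothetical, no MapReduce framework is involved, and the sketch only shows how one k-means iteration splits into a map step (assign each point to its nearest centroid) and a reduce step (average the points that share a key).

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda j: sq_dist(p, centroids[j]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points emitted under each centroid index.

    A centroid that received no points keeps its old position
    (an assumption made for this sketch).
    """
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    new_centroids = []
    for j, old in enumerate(centroids):
        members = groups.get(j)
        if not members:
            new_centroids.append(old)
            continue
        dim = len(old)
        new_centroids.append(tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)))
    return new_centroids


def kmeans_mapreduce(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds and return the centroids."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids
```

In a real deployment the map step would run on many workers over disjoint chunks of the data, and only the per-key sums and counts would need to travel to the reducers.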


Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper studies clustering of multi-typed heterogeneous networks with a star network schema and proposes a novel algorithm, NetClus, that utilizes links across multi-typed objects to generate high-quality net-clusters and generates informative clusters.
Abstract: A heterogeneous information network is an information network composed of multiple types of objects. Clustering on such a network may lead to better understanding of both hidden structures of the network and the individual role played by every object in each cluster. However, although clustering on homogeneous networks has been studied over decades, clustering on heterogeneous networks has not been addressed until recently. A recent study proposed a new algorithm, RankClus, for clustering on bi-typed heterogeneous networks. However, a real-world network may consist of more than two types, and the interactions among multi-typed objects play a key role in disclosing the rich semantics that a network carries. In this paper, we study clustering of multi-typed heterogeneous networks with a star network schema and propose a novel algorithm, NetClus, that utilizes links across multi-typed objects to generate high-quality net-clusters. An iterative enhancement method is developed that leads to effective ranking-based clustering in such heterogeneous networks. Our experiments on DBLP data show that NetClus generates more accurate clustering results than the baseline topic model algorithm PLSA and the recently proposed algorithm, RankClus. Further, NetClus generates informative clusters, presenting good ranking and cluster membership information for each attribute object in each net-cluster.

546 citations


Proceedings ArticleDOI
28 Jun 2009
TL;DR: This work develops a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data, and develops two concrete instances of this framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP).
Abstract: Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of O(n^3) in general, with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.

507 citations


Proceedings ArticleDOI
24 Mar 2009
TL;DR: This paper addresses the problem of generating clusters for a specified type of objects, as well as ranking information for all types of objects based on these clusters in a multi-typed information network, and proposes a novel clustering framework called RankClus that directly generates clusters integrated with ranking.
Abstract: As information networks become ubiquitous, extracting knowledge from information networks has become an important task. Both ranking and clustering can provide overall views on information network data, and each has been a hot topic by itself. However, ranking objects globally without considering which clusters they belong to often leads to dumb results, e.g., ranking database and computer architecture conferences together may not make much sense. Similarly, clustering a huge number of objects (e.g., thousands of authors) in one huge cluster without distinction is dull as well. In this paper, we address the problem of generating clusters for a specified type of objects, as well as ranking information for all types of objects based on these clusters in a multi-typed (i.e., heterogeneous) information network. A novel clustering framework called RankClus is proposed that directly generates clusters integrated with ranking. Based on initial K clusters, ranking is applied separately, which serves as a good measure for each cluster. Then, we use a mixture model to decompose each object into a K-dimensional vector, where each dimension is a component coefficient with respect to a cluster, which is measured by rank distribution. Objects then are reassigned to the nearest cluster under the new measure space to improve clustering. As a result, quality of clustering and ranking are mutually enhanced, which means that the clusters are getting more accurate and the ranking is getting more meaningful. Such a progressive refinement process iterates until little change can be made. Our experiment results show that RankClus can generate more accurate clusters and in a more efficient way than the state-of-the-art link-based clustering methods. Moreover, the clustering results with ranks can provide more informative views of data compared with traditional clustering.

399 citations


Journal ArticleDOI
TL;DR: The proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective.
Abstract: Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.

384 citations


01 Jan 2009
TL;DR: A method for making the k-means clustering algorithm more effective and efficient, so as to get better clustering with reduced complexity is proposed.
Abstract: Emergence of modern techniques for scientific data collection has resulted in large scale accumulation of data pertaining to diverse fields. Conventional database querying methods are inadequate to extract useful information from huge data banks. Cluster analysis is one of the major data analysis methods and the k-means clustering algorithm is widely used for many practical applications. But the original k-means algorithm is computationally expensive and the quality of the resulting clusters heavily depends on the selection of initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means clustering algorithm. This paper proposes a method for making the algorithm more effective and efficient, so as to get better clustering with reduced complexity.

324 citations


Journal ArticleDOI
01 Aug 2009
TL;DR: In this paper, the authors take a systematic approach to evaluate the major clustering paradigms in a common framework and provide a benchmark set of results on a large variety of real world and synthetic data sets.
Abstract: Clustering high dimensional data is an emerging research field. Subspace clustering or projected clustering groups similar objects in subspaces, i.e. projections, of the full space. In the past decade, several clustering paradigms have been developed in parallel, without thorough evaluation and comparison between these paradigms on a common basis. Conclusive evaluation and comparison is challenged by three major issues. First, there is no ground truth that describes the "true" clusters in real world data. Second, a large variety of evaluation measures have been used that reflect different aspects of the clustering result. Finally, in typical publications authors have limited their analysis to their favored paradigm only, while paying other paradigms little or no attention. In this paper, we take a systematic approach to evaluate the major paradigms in a common framework. We study representative clustering algorithms to characterize the different aspects of each paradigm and give a detailed comparison of their properties. We provide a benchmark set of results on a large variety of real world and synthetic data sets. Using different evaluation measures, we broaden the scope of the experimental analysis and create a common baseline for future developments and comparable evaluations in the field. For repeatability, all implementations, data sets and evaluation measures are available on our website.

294 citations


Proceedings ArticleDOI
09 Feb 2009
TL;DR: It is demonstrated how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages.
Abstract: Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.

Proceedings ArticleDOI
28 Jun 2009
TL;DR: An organized study of 16 external validation measures for K-means clustering by introducing the importance of measure normalization in the evaluation of the clustering performance on data with imbalanced class distributions and revealing the interrelationships among these external measures.
Abstract: Clustering validation is a long-standing challenge in the clustering literature. While many validation measures have been developed for evaluating the performance of clustering algorithms, these measures often provide inconsistent information about the clustering performance, and the most suitable measures to use in practice remain unknown. This paper thus fills this crucial void by giving an organized study of 16 external validation measures for K-means clustering. Specifically, we first introduce the importance of measure normalization in the evaluation of the clustering performance on data with imbalanced class distributions. We also provide normalization solutions for several measures. In addition, we summarize the major properties of these external measures. These properties can serve as the guidance for the selection of validation measures in different application scenarios. Finally, we reveal the interrelationships among these external measures. By mathematical transformation, we show that some validation measures are equivalent. Also, some measures have consistent validation performances. Most importantly, we provide a guideline to select the most suitable validation measures for K-means clustering.

Proceedings ArticleDOI
13 Nov 2009
TL;DR: This work proposes an approach to extracting meaningful clusters from large databases by combining clustering and classification, which are driven by a human analyst through an interactive visual interface.
Abstract: One of the most common operations in exploration and analysis of various kinds of data is clustering, i.e. discovery and interpretation of groups of objects having similar properties and/or behaviors. In clustering, objects are often treated as points in multi-dimensional space of properties. However, structurally complex objects, such as trajectories of moving entities and other kinds of spatio-temporal data, cannot be adequately represented in this manner. Such data require sophisticated and computationally intensive clustering algorithms, which are very hard to scale effectively to large datasets not fitting in the computer main memory. We propose an approach to extracting meaningful clusters from large databases by combining clustering and classification, which are driven by a human analyst through an interactive visual interface.

Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper evaluates different similarity measures and clustering methodologies to catalog their strengths and weaknesses when utilized for the trajectory learning problem.
Abstract: Recently a large amount of research has been devoted to automatic activity analysis. Typically, activities have been defined by their motion characteristics and represented by trajectories. These trajectories are collected and clustered to determine typical behaviors. This paper evaluates different similarity measures and clustering methodologies to catalog their strengths and weaknesses when utilized for the trajectory learning problem. The clustering performance is measured by evaluating the correct clustering rate on different datasets with varying characteristics.

Journal ArticleDOI
01 Apr 2009
TL;DR: This paper provides a formal and organized study of the effect of skewed data distributions on K-means clustering and provides the coefficient of variation (CV) as a necessary criterion to validate the clustering results.
Abstract: K-means is a well-known and widely used partitional clustering method. While there are considerable research efforts to characterize the key features of the K-means clustering algorithm, further investigation is needed to understand how data distributions can have impact on the performance of K-means clustering. To that end, in this paper, we provide a formal and organized study of the effect of skewed data distributions on K-means clustering. Along this line, we first formally illustrate that K-means tends to produce clusters of relatively uniform size, even if input data have varied “true” cluster sizes. In addition, we show that some clustering validation measures, such as the entropy measure, may not capture this uniform effect and provide misleading information on the clustering performance. Viewed in this light, we provide the coefficient of variation (CV) as a necessary criterion to validate the clustering results. Our findings reveal that K-means tends to produce clusters in which the variations of cluster sizes, as measured by CV, are in a range of about 0.3-1.0. Specifically, for data sets with large variation in “true” cluster sizes (e.g., CV > 1.0), K-means reduces variation in resultant cluster sizes to less than 1.0. In contrast, for data sets with small variation in “true” cluster sizes (e.g., CV < 0.3), K-means increases variation in resultant cluster sizes to greater than 0.3. In other words, for the earlier two cases, K-means produces the clustering results which are away from the “true” cluster distributions.
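The coefficient of variation of cluster sizes mentioned in the abstract above is straightforward to compute. The sketch below is only an illustration: the function name `cluster_size_cv` is hypothetical, and it assumes CV is the sample standard deviation (n - 1 denominator) of the cluster sizes divided by their mean, which the abstract itself does not spell out.

```python
from collections import Counter
from statistics import mean, stdev


def cluster_size_cv(labels):
    """Coefficient of variation of cluster sizes for a flat label list.

    Assumption: CV = sample standard deviation / mean of the sizes.
    Requires at least two distinct clusters.
    """
    sizes = list(Counter(labels).values())
    if len(sizes) < 2:
        raise ValueError("need at least two clusters to compute CV")
    return stdev(sizes) / mean(sizes)


# Three clusters of sizes 2, 4 and 6: mean 4, sample std 2, so CV = 0.5
labels = [0] * 2 + [1] * 4 + [2] * 6
print(cluster_size_cv(labels))
```

Comparing this value for the ground-truth labels and for the K-means output is one way to see whether the cluster sizes have been pulled toward the 0.3-1.0 range the abstract reports.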

Journal ArticleDOI
TL;DR: A simulation study compares the performance of four major hierarchical methods for clustering functional data and yields concrete suggestions to help future researchers determine the best method for clustering their functional data.
Abstract: Functional data analysis (FDA)—the analysis of data that can be considered a set of observed continuous functions—is an increasingly common class of statistical analysis. One of the most widely used FDA methods is the cluster analysis of functional data; however, little work has been done to compare the performance of clustering methods on functional data. In this article, a simulation study compares the performance of four major hierarchical methods for clustering functional data. The simulated data varied in three ways: the nature of the signal functions (periodic, non periodic, or mixed), the amount of noise added to the signal functions, and the pattern of the true cluster sizes. The Rand index was used to compare the performance of each clustering method. As a secondary goal, clustering methods were also compared when the number of clusters has been misspecified. To illustrate the results, a real set of functional data was clustered where the true clustering structure is believed to be known. Compari...
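For readers unfamiliar with the Rand index used in the study above, a minimal sketch follows. It counts the pairs of objects on which two clusterings agree (both place the pair together, or both place it apart) and divides by the total number of pairs. The function name `rand_index` is chosen here for illustration; this is the plain, unadjusted index.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Unadjusted Rand index between two clusterings of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if len(labels_a) < 2:
        raise ValueError("need at least two objects")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b  # both together or both apart
        total += 1
    return agree / total
```

The index is invariant to how clusters are named, so `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1.0.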

Journal ArticleDOI
01 Jun 2009
TL;DR: A recent advance of fuzzy clustering called fuzzy c-means clustering with improved fuzzy partitions (IFP-FCM) is extended in this paper, and a generalized algorithm for more effective clustering is proposed by introducing a novel membership constraint function.
Abstract: The fuzziness index m has an important influence on the clustering result of fuzzy clustering algorithms, and it should not be fixed at the usual value m = 2. In view of its distinctive features in applications and its limitation in having m = 2 only, a recent advance of fuzzy clustering called fuzzy c-means clustering with improved fuzzy partitions (IFP-FCM) is extended in this paper, and a generalized algorithm called GIFP-FCM for more effective clustering is proposed. By introducing a novel membership constraint function, a new objective function is constructed, and furthermore, GIFP-FCM clustering is derived. Meanwhile, from the viewpoints of L_p norm distance measure and competitive learning, the robustness and convergence of the proposed algorithm are analyzed. Furthermore, the classical fuzzy c-means algorithm (FCM) and IFP-FCM can be taken as two special cases of the proposed algorithm. Several experimental results including its application to noisy image texture segmentation are presented to demonstrate its average advantage over FCM and IFP-FCM in both clustering and robustness capabilities.
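To make the role of the fuzziness index m concrete, here is a minimal sketch of the update equations of classical fuzzy c-means (FCM), which the abstract names as a special case. It does not implement IFP-FCM or GIFP-FCM. The function names are hypothetical; the membership of point j in cluster i is taken as u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)) with Euclidean distances d, and each center is the u^m-weighted mean of the points.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Classical FCM membership update; m > 1 is the fuzziness index."""
    if m <= 1.0:
        raise ValueError("fuzziness index m must be greater than 1")
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        zeros = [d == 0.0 for d in dists]
        if any(zeros):
            # Point sits exactly on a center: share full membership
            # among the coincident centers (a common convention).
            row = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
        else:
            row = [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
                   for d_i in dists]
        memberships.append(row)
    return memberships


def fcm_centers(points, memberships, m=2.0):
    """Classical FCM center update: u^m-weighted mean of the points."""
    dim = len(points[0])
    centers = []
    for i in range(len(memberships[0])):
        weights = [row[i] ** m for row in memberships]
        total = sum(weights)
        centers.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)))
    return centers
```

For a point at distance 1 from one center and 2 from another, m = 2 gives memberships 0.8 and 0.2, while m = 3 gives 2/3 and 1/3, so a larger m yields a softer partition.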

Journal ArticleDOI
TL;DR: A paradigm apparatus for the evaluation of clustering comparison techniques is introduced, and a novel clustering similarity measure, the Measure of Concordance (MoC), is proposed; only MoC, Powers’s measure, Lopez and Rajski’s measure and various forms of Normalised Mutual Information are shown to exhibit the desired behaviour under each of the test scenarios.
Abstract: In evaluating the results of cluster analysis, it is common practice to make use of a number of fixed heuristics rather than to compare a data clustering directly against an empirically derived standard, such as a clustering empirically obtained from human informants. Given the dearth of research into techniques to express the similarity between clusterings, there is broad scope for fundamental research in this area. In defining the comparative problem, we identify two types of worst-case matches between pairs of clusterings, characterised as independently codistributed clustering pairs and conjugate partition pairs. Desirable behaviour for a similarity measure in either of the two worst cases is discussed, giving rise to five test scenarios in which characteristics of one of a pair of clusterings was manipulated in order to compare and contrast the behaviour of different clustering similarity measures. This comparison is carried out for previously-proposed clustering similarity measures, as well as a number of established similarity measures that have not previously been applied to clustering comparison. We introduce a paradigm apparatus for the evaluation of clustering comparison techniques and distinguish between the goodness of clusterings and the similarity of clusterings by clarifying the degree to which different measures confuse the two. Accompanying this is the proposal of a novel clustering similarity measure, the Measure of Concordance (MoC). We show that only MoC, Powers’s measure, Lopez and Rajski’s measure and various forms of Normalised Mutual Information exhibit the desired behaviour under each of the test scenarios.
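As an illustration of one of the measures named above, the sketch below computes a single common form of Normalised Mutual Information, in which the mutual information is divided by the geometric mean of the two clusterings' entropies. The abstract refers to "various forms" without listing them, so this particular normalisation is an assumption, as are the function names and the convention of returning 0.0 when either clustering has zero entropy. MoC and the other measures are not implemented here.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a clustering's label distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information between two clusterings of the same objects."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    return sum((c / n) * log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())


def nmi(labels_a, labels_b):
    """NMI with geometric-mean normalisation (one of several variants)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return mutual_information(labels_a, labels_b) / sqrt(h_a * h_b)
```

Two identical partitions with swapped cluster names score 1, while two statistically independent partitions such as `[0, 0, 1, 1]` and `[0, 1, 0, 1]` score 0.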

01 Jan 2009
TL;DR: In this paper, the authors present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms and compare algorithms on the basis of these features, and outline algorithms' benefits and drawbacks as a basis for matching them to biomedical applications.
Abstract: Clustering is ubiquitously applied in bioinformatics with hierarchical clustering and k-means partitioning being the most popular methods. Numerous improvements of these two clustering methods have been introduced, as well as completely different approaches such as grid-based, density-based and model-based clustering. For improved bioinformatics analysis of data, it is important to match clusterings to the requirements of a biomedical application. In this article, we present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms. We review 40 different clustering algorithms of all approaches and datatypes. We compare algorithms on the basis of desirable clustering features, and outline algorithms’ benefits and drawbacks as a basis for matching them to biomedical applications.

Book ChapterDOI
01 Jan 2009
TL;DR: This chapter presents a tutorial overview of the main clustering methods used in Data Mining, divided into hierarchical, partitioning, density-based, model-based, grid-based, and soft-computing methods.
Abstract: This chapter presents a tutorial overview of the main clustering methods used in Data Mining. The goal is to provide a self-contained review of the concepts and the mathematics underlying clustering techniques. The chapter begins by providing measures and criteria that are used for determining whether two objects are similar or dissimilar. Then the clustering methods are presented, divided into: hierarchical, partitioning, density-based, model-based, grid-based, and soft-computing methods. Following the methods, the challenges of performing clustering in large data sets are discussed. Finally, the chapter presents how to determine the number of clusters.

Journal ArticleDOI
TL;DR: An improved classifier for automated diagnostic systems of electrocardiogram (ECG) arrhythmias using type-2 fuzzy c-means clustering (T2FCM) algorithm and neural network to constitute the best classification system with high accuracy rate for ECG beats is presented.
Abstract: This paper presents an improved classifier for automated diagnostic systems of electrocardiogram (ECG) arrhythmias. This diagnostic system consists of a combined fuzzy clustering neural network algorithm for classification of ECG arrhythmias using the type-2 fuzzy c-means clustering (T2FCM) algorithm and a neural network. Type-2 fuzzy c-means clustering is used to improve the performance of the neural network. The aim of improving the classifier's performance is to constitute the best classification system with a high accuracy rate for ECG beats. Ten types of ECG arrhythmias (normal beat, sinus bradycardia, ventricular tachycardia, sinus arrhythmia, atrial premature contraction, paced beat, right bundle branch block, left bundle branch block, atrial fibrillation and atrial flutter) obtained from the MIT-BIH database were analyzed. However, the presented structure was tested on experimental ECG records of 92 patients (40 male and 52 female, average age 39.75 ± 19.06). The classification accuracy of the improved classifier in training and testing, namely the Type-2 Fuzzy Clustering Neural Network (T2FCNN), was compared with that of a neural network (NN) and a fuzzy clustering neural network (FCNN). In the T2FCNN architecture, decision making has two stages: forming the new training set obtained by selection of the best arrhythmia for each arrhythmia class using T2FCM, and classification using a neural network trained on the new training set. The results demonstrate that the proposed diagnostic system achieved a high (99%) accuracy rate.

Proceedings Article
07 Dec 2009
TL;DR: It is proved that the problem of finding the equilibria of the clustering game is equivalent to locally optimizing a polynomial function over the standard simplex, and a discrete-time high-order replicator dynamics to perform this optimization, based on the Baum-Eagon inequality is provided.
Abstract: Hypergraph clustering refers to the process of extracting maximally coherent groups from a set of objects using high-order (rather than pairwise) similarities. Traditional approaches to this problem are based on the idea of partitioning the input data into a user-defined number of classes, thereby obtaining the clusters as a by-product of the partitioning process. In this paper, we provide a radically different perspective to the problem. In contrast to the classical approach, we attempt to provide a meaningful formalization of the very notion of a cluster and we show that game theory offers an attractive and unexplored perspective that serves well our purpose. Specifically, we show that the hypergraph clustering problem can be naturally cast into a non-cooperative multi-player "clustering game", whereby the notion of a cluster is equivalent to a classical game-theoretic equilibrium concept. From the computational viewpoint, we show that the problem of finding the equilibria of our clustering game is equivalent to locally optimizing a polynomial function over the standard simplex, and we provide a discrete-time dynamics to perform this optimization. Experiments are presented which show the superiority of our approach over state-of-the-art hypergraph clustering techniques.
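The kind of discrete-time dynamics mentioned above can be sketched for the simpler pairwise case. This is a deliberate simplification and not the paper's high-order method: with a symmetric, nonnegative similarity matrix A, the classical replicator update x_i <- x_i (Ax)_i / (x^T A x) keeps x on the standard simplex and, by the Baum-Eagon inequality, does not decrease x^T A x. The function names and the example matrix are hypothetical.

```python
def replicator_step(x, A):
    """One discrete-time replicator update on the standard simplex."""
    Ax = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
    total = sum(x_i * ax_i for x_i, ax_i in zip(x, Ax))
    if total <= 0.0:
        raise ValueError("x^T A x must be positive")
    return [x_i * ax_i / total for x_i, ax_i in zip(x, Ax)]


def run_replicator(A, iterations=200):
    """Iterate from the simplex barycenter and return the final vector."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(iterations):
        x = replicator_step(x, A)
    return x


# Objects 0 and 1 are strongly similar; object 2 is only weakly
# similar to both (illustrative values, zero diagonal).
A = [[0.0, 1.0, 0.1],
     [1.0, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
x = run_replicator(A)
```

The support of the limit vector is read as the extracted cluster: here the weight concentrates on objects 0 and 1 while the weight of object 2 vanishes.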

Journal ArticleDOI
01 Jan 2009
TL;DR: Extensive performance comparison among the new method, a recently developed genetic-fuzzy clustering technique and the classical fuzzy c-means algorithm over a test suite comprising ordinary grayscale images and remote sensing satellite images reveals the superiority of the proposed technique in terms of speed, accuracy and robustness.
Abstract: This article proposes an evolutionary-fuzzy clustering algorithm for automatically grouping the pixels of an image into different homogeneous regions. The algorithm does not require prior knowledge of the number of clusters. The fuzzy clustering task in the intensity space of an image is formulated as an optimization problem. An improved variant of the differential evolution (DE) algorithm has been used to determine the number of naturally occurring clusters in the image as well as to refine the cluster centers. We report extensive performance comparison among the new method, a recently developed genetic-fuzzy clustering technique and the classical fuzzy c-means algorithm over a test suite comprising ordinary grayscale images and remote sensing satellite images. Such comparisons reveal, in a statistically meaningful way, the superiority of the proposed technique in terms of speed, accuracy and robustness.
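
For orientation, here is a minimal sketch of the classical fuzzy c-means baseline that this article compares against, applied to scalar pixel intensities. It is not the DE-based method from the article: the number of clusters is fixed by the caller, the fuzzifier is assumed to be m = 2, and the function name `fuzzy_c_means`, the sample `pixels`, and the initial centers are hypothetical.

```python
def fuzzy_c_means(data, centers, m=2.0, iters=50):
    """Classical FCM on scalar intensities.

    Returns the final centers and the memberships computed in the last iteration.
    """
    eps = 1e-12                 # avoids division by zero when a point sits on a center
    exponent = 2.0 / (m - 1.0)
    c = len(centers)
    u = []
    for _ in range(iters):
        # Membership update: u[k][i] = 1 / sum_j (d_ki / d_kj)^(2 / (m - 1))
        u = []
        for xk in data:
            d = [abs(xk - ci) + eps for ci in centers]
            u.append([
                1.0 / sum((d[i] / d[j]) ** exponent for j in range(c))
                for i in range(c)
            ])
        # Center update: mean of the data weighted by u^m
        centers = [
            sum((u[k][i] ** m) * data[k] for k in range(len(data)))
            / sum(u[k][i] ** m for k in range(len(data)))
            for i in range(c)
        ]
    return centers, u


# Hypothetical grayscale intensities forming a dark and a bright region.
pixels = [10, 12, 11, 13, 200, 205, 198, 202]
centers, u = fuzzy_c_means(pixels, centers=[0.0, 255.0])
```

Each row of `u` holds one pixel's degrees of membership in the two clusters and sums to 1. Choosing the number of clusters automatically, as the article does with differential evolution, is outside the scope of this sketch.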

Journal ArticleDOI
TL;DR: It turns out that one should use the properties that most strongly determine the behavior of the amino acids, and that an appropriate metric can help in defining the groups.

Journal ArticleDOI
TL;DR: Comparisons with DBSCAN method, OPTICS method, and valley seeking clustering method further show that the proposed framework can successfully avoid the overfitting phenomenon and solve the confusion problem of cluster boundary points and outliers.
Abstract: In this paper, a new density-based clustering framework is proposed by adopting the assumption that the cluster centers in data space can be regarded as target objects in image space. First, the level set evolution is adopted to find an approximation of cluster centers by using a new initial boundary formation scheme. Accordingly, three types of initial boundaries are defined so that each of them can evolve to approach the cluster centers in different ways. To avoid the long iteration time of level set evolution in data space, an efficient termination criterion is presented to stop the evolution process in the circumstance that no more cluster centers can be found. Then, a new effective density representation called level set density (LSD) is constructed from the evolution results. Finally, the valley seeking clustering is used to group data points into corresponding clusters based on the LSD. The experiments on some synthetic and real data sets have demonstrated the efficiency and effectiveness of the proposed clustering framework. The comparisons with DBSCAN method, OPTICS method, and valley seeking clustering method further show that the proposed framework can successfully avoid the overfitting phenomenon and solve the confusion problem of cluster boundary points and outliers.

Journal ArticleDOI
TL;DR: This paper proposes a reinforcement ant optimized fuzzy controller (FC) design method, called RAOFC, and applies it to wheeled-mobile-robot wall-following control under reinforcement learning environments, and proposes an online aligned interval type-2 fuzzy clustering method to generate rules automatically.
Abstract: This paper proposes a reinforcement ant optimized fuzzy controller (FC) design method, called RAOFC, and applies it to wheeled-mobile-robot wall-following control under reinforcement learning environments. The inputs to the designed FC are range-finding sonar sensors, and the controller output is a robot steering angle. The antecedent part in each fuzzy rule uses interval type-2 fuzzy sets in order to increase FC robustness. No a priori assignment of fuzzy rules is necessary in RAOFC. An online aligned interval type-2 fuzzy clustering (AIT2FC) method is proposed to generate rules automatically. The AIT2FC not only flexibly partitions the input space but also reduces the number of fuzzy sets in each input dimension, which improves controller interpretability. The consequent part of each fuzzy rule is designed using Q-value aided ant colony optimization (QACO). The QACO approach selects the consequent part from a set of candidate actions according to ant pheromone trails and Q-values, both of whose values are updated using reinforcement signals. Simulations and experiments on mobile-robot wall-following control show the effectiveness and efficiency of the proposed RAOFC.

Journal Article
TL;DR: Clustering ensembles combine multiple partitions generated by different clustering algorithms into a single clustering solution; this paper introduces clustering ensembles, the representation of multiple partitions and its challenges, and presents a taxonomy of combination algorithms.
Abstract: Clustering ensembles combine multiple partitions generated by different clustering algorithms into a single clustering solution. Clustering ensembles have emerged as a prominent method for improving the robustness, stability and accuracy of unsupervised classification solutions. So far, many contributions have been made to finding a consensus clustering. One of the major problems in clustering ensembles is the consensus function. In this paper, we first introduce clustering ensembles, the representation of multiple partitions and its challenges, and present a taxonomy of combination algorithms. Second, we describe consensus functions in clustering ensembles, including hypergraph partitioning, the voting approach, mutual information, co-association based functions and the finite mixture model, and then explain their advantages, disadvantages and computational complexity. Finally, we compare the characteristics of clustering ensemble algorithms from previous work, such as computational complexity, robustness, simplicity and accuracy, on different datasets.
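
One of the consensus-function families named in this abstract, co-association, can be sketched in a few lines: each pair of objects is scored by the fraction of base partitions that place the two objects in the same cluster. The function name `co_association` and the three sample partitions are hypothetical, and the survey itself may define the matrix with a different normalization.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[count / len(partitions) for count in row] for row in counts]


# Three hypothetical partitions of five objects; label values are arbitrary
# within each partition, so only co-membership matters.
partitions = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
C = co_association(partitions)
```

The resulting matrix can be passed as a similarity matrix to any similarity-based clustering method to obtain a consensus partition. Because only co-membership is counted, the label correspondence problem between partitions does not arise.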

Proceedings ArticleDOI
06 Dec 2009
TL;DR: A novel clustering paradigm, namely multi-task clustering, which performs multiple related clustering tasks together and utilizes the relation of these tasks to enhance the clustering performance, and which is comparable to or even better than several existing transductive transfer classification approaches.
Abstract: There are many clustering tasks which are closely related in the real world, e.g. clustering the web pages of different universities. However, existing clustering approaches neglect the underlying relation and treat these clustering tasks either individually or simply together. In this paper, we study a novel clustering paradigm, namely multi-task clustering, which performs multiple related clustering tasks together and utilizes the relation of these tasks to enhance the clustering performance. We aim to learn a subspace shared by all the tasks, through which the knowledge of the tasks can be transferred to each other. The objective of our approach consists of two parts: (1) Within-task clustering: clustering the data of each task in its input space individually; and (2) Cross-task clustering: simultaneously learning the shared subspace and clustering the data of all the tasks together. We show that it can be solved by alternating minimization, and its convergence is theoretically guaranteed. Furthermore, we show that given the labels of one task, our multi-task clustering method can be extended to transductive transfer classification (a.k.a. cross-domain classification, domain adaption). Experiments on several cross-domain text data sets demonstrate that the proposed multi-task clustering greatly outperforms traditional single-task clustering methods, and that the transductive transfer classification method is comparable to or even better than several existing transductive transfer classification approaches.

Journal ArticleDOI
TL;DR: The results show that KOCA produces approximately equal-sized clusters, which allow distributing the load evenly over different clusters, and is scalable; the clustering formation terminates in a constant time regardless of the network size.
Abstract: Clustering is a standard approach for achieving efficient and scalable performance in wireless sensor networks. Traditionally, clustering algorithms aim at generating a number of disjoint clusters that satisfy some criteria. In this paper, we formulate a novel clustering problem that aims at generating overlapping multihop clusters. Overlapping clusters are useful in many sensor network applications, including intercluster routing, node localization, and time synchronization protocols. We also propose a randomized, distributed multihop clustering algorithm (KOCA) for solving the overlapping clustering problem. KOCA aims at generating connected overlapping clusters that cover the entire sensor network with a specific average overlapping degree. Through analysis and simulation experiments, we show how to select the different values of the parameters to achieve the clustering process objectives. Moreover, the results show that KOCA produces approximately equal-sized clusters, which allow distributing the load evenly over different clusters. In addition, KOCA is scalable; the clustering formation terminates in a constant time regardless of the network size.

Journal ArticleDOI
TL;DR: A new similarity measure between generalized fuzzy numbers is presented that combines the concepts of geometric distance, the perimeter and the height of generalized fuzzy numbers for calculating the degree of similarity between generalized fuzzy numbers.
Abstract: In this paper, we present a new method for fuzzy risk analysis based on similarity measures between generalized fuzzy numbers. First, we present a new similarity measure between generalized fuzzy numbers. It combines the concepts of geometric distance, the perimeter and the height of generalized fuzzy numbers for calculating the degree of similarity between generalized fuzzy numbers. We also prove some properties of the proposed similarity measure. We use 15 sets of generalized fuzzy numbers to compare the experimental results of the proposed method with those of the existing similarity measures. The proposed method can overcome the drawbacks of the existing similarity measures. Based on the proposed similarity measure between generalized fuzzy numbers, we present a new fuzzy risk analysis algorithm for dealing with fuzzy risk analysis problems, where the values of the evaluating items are represented by generalized fuzzy numbers. The proposed method provides a useful way to deal with fuzzy risk analysis problems.