
Showing papers on "Fuzzy clustering published in 2013"


Journal ArticleDOI
TL;DR: In this article, a sparse subspace clustering algorithm is proposed to cluster high-dimensional data points that lie in a union of low-dimensional subspaces, where a sparse representation corresponds to selecting a few points from the same subspace.
Abstract: Many real-world problems deal with collections of high-dimensional data, such as images, videos, text, and web documents, DNA microarray data, and more. Often, such high-dimensional data lie close to low-dimensional structures corresponding to several classes or categories to which the data belong. In this paper, we propose and study an algorithm, called sparse subspace clustering, to cluster data points that lie in a union of low-dimensional subspaces. The key idea is that, among the infinitely many possible representations of a data point in terms of other points, a sparse representation corresponds to selecting a few points from the same subspace. This motivates solving a sparse optimization program whose solution is used in a spectral clustering framework to infer the clustering of the data into subspaces. Since solving the sparse optimization program is in general NP-hard, we consider a convex relaxation and show that, under appropriate conditions on the arrangement of the subspaces and the distribution of the data, the proposed minimization program succeeds in recovering the desired sparse representations. The proposed algorithm is efficient and can handle data points near the intersections of subspaces. Another key advantage of the proposed algorithm with respect to the state of the art is that it can deal directly with data nuisances, such as noise, sparse outlying entries, and missing entries, by incorporating the model of the data into the sparse optimization program. We demonstrate the effectiveness of the proposed algorithm through experiments on synthetic data as well as the two real-world problems of motion segmentation and face clustering.
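The key idea above — a sparse representation selects a few points from the same subspace — can be illustrated with a deliberately tiny sketch. This toy restricts each point to a 1-sparse representation (its most parallel neighbour) and reads clusters off as connected components, whereas the actual algorithm solves an l1 program and runs spectral clustering; all names here are illustrative.

```python
import math

def sparse_subspace_clusters(points):
    """Toy 1-sparse variant of sparse subspace clustering: each point is
    'represented' by the single other point most parallel to it, then
    clusters are read off as connected components of the resulting graph.
    (The real SSC solves an l1 program and applies spectral clustering.)"""
    n = len(points)

    def cos2(a, b):
        # squared cosine similarity: 1 for points on the same line through 0
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return (dot / (na * nb)) ** 2

    # Connect each point to its most parallel other point.
    adj = [set() for _ in range(n)]
    for i in range(n):
        j = max((j for j in range(n) if j != i),
                key=lambda j: cos2(points[i], points[j]))
        adj[i].add(j)
        adj[j].add(i)

    # Connected components = subspace clusters.
    labels, comp = [-1] * n, 0
    for s in range(n):
        if labels[s] == -1:
            stack = [s]
            while stack:
                v = stack.pop()
                if labels[v] == -1:
                    labels[v] = comp
                    stack.extend(adj[v])
            comp += 1
    return labels

# Two 1-D subspaces (the x-axis and the y-axis) in R^2.
pts = [(1, 0), (2, 0), (3, 0), (0, 1), (0, 2), (0, 3)]
print(sparse_subspace_clusters(pts))  # first three points share one label, last three another
```

Points on the same line have squared cosine 1 with each other and 0 across lines, so the sparsest representation never crosses subspaces — the property the paper's convex relaxation recovers under its stated conditions.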

2,298 citations


Book ChapterDOI
14 Apr 2013
TL;DR: This work proposes a theoretically and practically improved density-based, hierarchical clustering method, providing a clustering hierarchy from which a simplified tree of significant clusters can be constructed, and proposes a novel cluster stability measure.
Abstract: We propose a theoretically and practically improved density-based, hierarchical clustering method, providing a clustering hierarchy from which a simplified tree of significant clusters can be constructed. For obtaining a “flat” partition consisting of only the most significant clusters (possibly corresponding to different density thresholds), we propose a novel cluster stability measure, formalize the problem of maximizing the overall stability of selected clusters, and formulate an algorithm that computes an optimal solution to this problem. We demonstrate that our approach outperforms the current, state-of-the-art, density-based clustering methods on a wide variety of real world data.

1,132 citations


01 Jan 2013
TL;DR: Six different approaches to determining the right number of clusters in a dataset are explored, addressing the cluster number selection problem with the k-means method, a simple and fast clustering technique.
Abstract: Clustering is widely used in fields such as biology, psychology, and economics. Because the result of clustering varies as the number-of-clusters parameter changes, the main challenge of cluster analysis is that the number of clusters (or the number of model parameters) is seldom known, and it must be determined before clustering. Several clustering algorithms have been proposed; among them, the k-means method is a simple and fast clustering technique. We address the problem of cluster number selection using a k-means approach. One can ask end users to provide the number of clusters in advance, but this is not feasible, as it requires domain knowledge of each data set. Many methods are available to estimate the number of clusters, such as statistical indices, variance-based methods, information-theoretic criteria, and goodness-of-fit measures. The paper explores six different approaches to determine the right number of clusters in a dataset.
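One of the classical variance-based approaches the paper alludes to is the "elbow" heuristic: run k-means for increasing k and watch where the within-cluster sum of squared errors stops dropping sharply. A minimal 1-D sketch (the `kmeans_1d` helper and its naive spread-out initialisation are illustrative, not from the paper):

```python
def kmeans_1d(data, k, iters=20):
    """Plain Lloyd k-means on 1-D data; returns (centers, sse)."""
    # Naive init: pick k points spread across the sorted data.
    centers = sorted(data)[:: max(1, len(data) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in data:
            i = min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2)
            groups[i].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    sse = sum(min((x - c) ** 2 for c in centers) for x in data)
    return centers, sse

# Three well-separated groups around 1, 5 and 9.
data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
sses = [kmeans_1d(data, k)[1] for k in range(1, 6)]
# The "elbow": SSE falls steeply until k reaches the true cluster count (3),
# then flattens out.
print(sses)
```

Reading the printed list, the drop from k=2 to k=3 is orders of magnitude larger than from k=3 to k=4, which is exactly the signal the elbow heuristic exploits.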

927 citations


BookDOI
21 Aug 2013
TL;DR: Top researchers from around the world explore the characteristics of clustering problems in a variety of application areas and explain how to glean detailed insight from the clustering process including how to verify the quality of the underlying cluster through supervision, human intervention, or the automated generation of alternative clusters.
Abstract: Research on the problem of clustering tends to be fragmented across the pattern recognition, database, data mining, and machine learning communities. Addressing this problem in a unified way, Data Clustering: Algorithms and Applications provides complete coverage of the entire area of clustering, from basic methods to more refined and complex data clustering approaches. It pays special attention to recent issues in graphs, social networks, and other domains. The book focuses on three primary aspects of data clustering: Methods, describing key techniques commonly used for clustering, such as feature selection, agglomerative clustering, partitional clustering, density-based clustering, probabilistic clustering, grid-based clustering, spectral clustering, and nonnegative matrix factorization; Domains, covering methods used for different domains of data, such as categorical data, text data, multimedia data, graph data, biological data, stream data, uncertain data, time series clustering, high-dimensional clustering, and big data; and Variations and Insights, discussing important variations of the clustering process, such as semisupervised clustering, interactive clustering, multiview clustering, cluster ensembles, and cluster validation. In this book, top researchers from around the world explore the characteristics of clustering problems in a variety of application areas. They also explain how to glean detailed insight from the clustering process, including how to verify the quality of the underlying clusters, through supervision, human intervention, or the automated generation of alternative clusters.

759 citations


Journal ArticleDOI
Maoguo Gong1, Yan Liang1, Jiao Shi1, Wenping Ma1, Jingjing Ma1 
TL;DR: An improved fuzzy C-means (FCM) algorithm for image segmentation is presented by introducing a tradeoff weighted fuzzy factor and a kernel metric; experimental results show that the new algorithm is effective and efficient, and is relatively independent of the type of noise.
Abstract: In this paper, we present an improved fuzzy C-means (FCM) algorithm for image segmentation by introducing a tradeoff weighted fuzzy factor and a kernel metric. The tradeoff weighted fuzzy factor depends on the space distance of all neighboring pixels and their gray-level difference simultaneously. By using this factor, the new algorithm can accurately estimate the damping extent of neighboring pixels. In order to further enhance its robustness to noise and outliers, we introduce a kernel distance measure to its objective function. The new algorithm adaptively determines the kernel parameter by using a fast bandwidth selection rule based on the distance variance of all data points in the collection. Furthermore, the tradeoff weighted fuzzy factor and the kernel distance measure are both parameter free. Experimental results on synthetic and real images show that the new algorithm is effective and efficient, and is relatively independent of the type of noise.
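For reference, the baseline this paper improves on is standard FCM, which alternates a membership update and a weighted-mean center update. A minimal 1-D sketch of that baseline (no spatial factor, no kernel metric; the naive min/max initialisation is an assumption for the demo):

```python
def fuzzy_c_means(data, c, m=2.0, iters=50):
    """Standard FCM on 1-D data: alternate membership and center updates.
    Returns (centers, memberships). This is the plain baseline, without the
    paper's tradeoff weighted fuzzy factor or kernel metric."""
    # Naive init: for c=2 start at the data extremes, else the first c points.
    centers = [min(data), max(data)] if c == 2 else list(data[:c])
    u = [[0.0] * c for _ in data]
    for _ in range(iters):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        for i, x in enumerate(data):
            d = [abs(x - ck) + 1e-12 for ck in centers]
            for k in range(c):
                u[i][k] = 1.0 / sum((d[k] / dj) ** (2.0 / (m - 1.0)) for dj in d)
        # Center update: mean of the data weighted by u^m
        centers = [
            sum(u[i][k] ** m * x for i, x in enumerate(data))
            / sum(u[i][k] ** m for i in range(len(data)))
            for k in range(c)
        ]
    return centers, u

data = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
centers, u = fuzzy_c_means(data, 2)
print(sorted(centers))  # roughly one center near 0.1 and one near 1.0
```

The paper's contribution plugs a neighbourhood-dependent factor and a kernel-induced distance into this same alternating scheme, which is why the baseline structure is worth keeping in mind.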

546 citations


Proceedings Article
03 Aug 2013
TL;DR: This paper proposes a new robust large-scale multi-view clustering method to integrate heterogeneous representations of large-scale data, and evaluates it on six benchmark data sets against several commonly used clustering approaches as well as baseline multi-view clustering methods.
Abstract: In the past decade, more and more data are collected from multiple sources or represented by multiple views, where different views describe distinct perspectives of the data. Although each view could be used individually for finding patterns by clustering, the clustering performance can be improved by exploring the rich information among multiple views. Several multi-view clustering methods have been proposed to integrate different views of data in an unsupervised manner. However, they are graph-based approaches, e.g., based on spectral clustering, and thus cannot handle large-scale data. How to combine these heterogeneous features for unsupervised large-scale data clustering has become a challenging problem. In this paper, we propose a new robust large-scale multi-view clustering method to integrate heterogeneous representations of large-scale data. We evaluate the proposed method on six benchmark data sets and compare its performance with several commonly used clustering approaches as well as baseline multi-view clustering methods. In all experimental results, our proposed method consistently achieves superior clustering performance.

471 citations


Journal ArticleDOI
TL;DR: Interval-valued HFSs and the corresponding correlation coefficient formulas are developed, and their application to clustering with interval-valued hesitant fuzzy information is demonstrated through a specific numerical example.

449 citations


Journal ArticleDOI
TL;DR: Two important clustering algorithms namely centroid based K-means and representative object based FCM (Fuzzy C-Means) clustering algorithm are compared and performance is evaluated on the basis of the efficiency of clustering output.
Abstract: In the software arena, data mining technology has been considered a useful means for identifying patterns and trends in large volumes of data. The approach is basically used to extract unknown patterns from large data sets for business as well as real-time applications. It is a computational intelligence discipline that has emerged as a valuable tool for data analysis, new knowledge discovery, and autonomous decision making. Raw, unlabeled data from a large dataset can initially be classified in an unsupervised fashion by using cluster analysis, i.e., clustering: the assignment of a set of observations into clusters so that observations in the same cluster may in some sense be treated as similar. The outcome of the clustering process and the efficiency of its domain application are generally determined through algorithms, and various algorithms are used to solve this problem. In this research work, two important clustering algorithms, namely centroid-based K-Means and representative-object-based FCM (Fuzzy C-Means), are compared. These algorithms are applied and their performance is evaluated on the basis of the efficiency of the clustering output. The number of data points as well as the number of clusters are the factors upon which the behaviour patterns of both algorithms are analyzed. FCM produces results close to K-Means clustering but still requires more computation time than K-Means clustering. Keywords—clustering; k-means; fuzzy c-means; time complexity

408 citations


Journal ArticleDOI
TL;DR: It is shown that, under mild conditions, spectral clustering applied to the adjacency matrix of the network can consistently recover hidden communities even when the order of the maximum expected degree is as small as $\log n$ with $n$ the number of nodes.
Abstract: We analyze the performance of spectral clustering for community extraction in stochastic block models. We show that, under mild conditions, spectral clustering applied to the adjacency matrix of the network can consistently recover hidden communities even when the order of the maximum expected degree is as small as $\log n$, with $n$ the number of nodes. This result applies to some popular polynomial time spectral clustering algorithms and is further extended to degree corrected stochastic block models using a spherical $k$-median spectral clustering method. A key component of our analysis is a combinatorial bound on the spectrum of binary random matrices, which is sharper than the conventional matrix Bernstein inequality and may be of independent interest.
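The recovery result above can be reproduced on a toy scale: sample a two-community stochastic block model, take the eigendecomposition of the adjacency matrix, and split nodes by the sign pattern of the second-largest eigenvector (the parameters and seed below are illustrative, chosen well inside the recoverable regime):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_in, p_out = 60, 0.9, 0.05       # two communities of 30 nodes each
truth = np.array([0] * 30 + [1] * 30)

# Sample a symmetric SBM adjacency matrix with no self-loops.
P = np.where(truth[:, None] == truth[None, :], p_in, p_out)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T

# Spectral clustering on the adjacency matrix itself (as in the paper):
# the sign pattern of the second-largest eigenvector splits the communities.
vals, vecs = np.linalg.eigh(A)
labels = (vecs[:, -2] >= 0).astype(int)

# Agreement with the planted partition, up to label swapping.
agreement = max((labels == truth).mean(), (labels != truth).mean())
print(agreement)
```

With this separation the agreement is essentially perfect; the paper's contribution is showing that consistency survives down to maximum expected degree of order log n.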

386 citations


Journal ArticleDOI
01 Apr 2013
TL;DR: A fuzzy energy-aware unequal clustering algorithm (EAUCF) that addresses the hot spots problem is introduced and compared with popular clustering algorithms from the literature, namely Low Energy Adaptive Clustering Hierarchy, Cluster-Head Election Mechanism using Fuzzy Logic, and Energy-Efficient Unequal Clustering.
Abstract: In order to gather information more efficiently in terms of energy consumption, wireless sensor networks (WSNs) are partitioned into clusters. In clustered WSNs, each sensor node sends its collected data to the head of the cluster that it belongs to. The cluster-heads are responsible for aggregating the collected data and forwarding it to the base station through other cluster-heads in the network. This leads to a situation known as the hot spots problem where cluster-heads that are closer to the base station tend to die earlier because of the heavy traffic they relay. In order to solve this problem, unequal clustering algorithms generate clusters of different sizes. In WSNs that are clustered with unequal clustering, the clusters close to the base station have smaller sizes than clusters far from the base station. In this paper, a fuzzy energy-aware unequal clustering algorithm (EAUCF), that addresses the hot spots problem, is introduced. EAUCF aims to decrease the intra-cluster work of the cluster-heads that are either close to the base station or have low remaining battery power. A fuzzy logic approach is adopted in order to handle uncertainties in cluster-head radius estimation. The proposed algorithm is compared with some popular clustering algorithms in the literature, namely Low Energy Adaptive Clustering Hierarchy, Cluster-Head Election Mechanism using Fuzzy Logic and Energy-Efficient Unequal Clustering. The experiment results show that EAUCF performs better than the other algorithms in terms of first node dies, half of the nodes alive and energy-efficiency metrics in all scenarios. Therefore, EAUCF is a stable and energy-efficient clustering algorithm to be utilized in any WSN application.
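The core mechanism — shrinking a cluster-head's competition radius when it is close to the base station or low on energy, via fuzzy rules — can be sketched with a small Mamdani-style rule base. The membership functions, rule weights, and `r_max` below are invented for illustration and are not EAUCF's actual rule base:

```python
def tri(x, a, b, c):
    """Triangular membership function with peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def competition_radius(dist_ratio, energy_ratio, r_max=25.0):
    """Illustrative fuzzy rules in the spirit of EAUCF (not the paper's
    exact rule base): nodes close to the base station or with low remaining
    energy get a smaller cluster-head competition radius.
    Inputs are normalised to [0, 1]."""
    close = tri(dist_ratio, -0.5, 0.0, 0.6)
    far = tri(dist_ratio, 0.4, 1.0, 1.5)
    low = tri(energy_ratio, -0.5, 0.0, 0.6)
    high = tri(energy_ratio, 0.4, 1.0, 1.5)
    # Rule strength -> radius level (as a fraction of r_max).
    rules = [
        (min(close, low), 0.3),   # close & low energy  -> small radius
        (min(close, high), 0.5),
        (min(far, low), 0.6),
        (min(far, high), 1.0),    # far & high energy   -> full radius
    ]
    total = sum(w for w, _ in rules) or 1.0
    return r_max * sum(w * lvl for w, lvl in rules) / total

print(competition_radius(0.9, 0.9))  # far from BS, fresh battery: large radius
print(competition_radius(0.1, 0.2))  # near BS, drained battery: small radius
```

Smaller radii near the base station mean smaller clusters there, leaving those cluster-heads energy to spare for relaying traffic, which is how unequal clustering mitigates the hot spots problem.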

292 citations


Journal ArticleDOI
TL;DR: It is demonstrated in this paper that PCCA+ always delivers an optimal fuzzy clustering for nearly uncoupled, not necessarily reversible, Markov chains with transition states.
Abstract: Given a row-stochastic matrix describing pairwise similarities between data objects, spectral clustering makes use of the eigenvectors of this matrix to perform dimensionality reduction for clustering in fewer dimensions. One example from this class of algorithms is the Robust Perron Cluster Analysis (PCCA+), which delivers a fuzzy clustering. Originally developed for clustering the state space of Markov chains, the method became popular as a versatile tool for general data classification problems. The robustness of PCCA+, however, cannot be explained by previous perturbation results, because the matrices in typical applications do not comply with the two main requirements: reversibility and nearly decomposability. We therefore demonstrate in this paper that PCCA+ always delivers an optimal fuzzy clustering for nearly uncoupled, not necessarily reversible, Markov chains with transition states.

Journal ArticleDOI
TL;DR: A new internal clustering validation measure, named clustering validation index based on nearest neighbors (CVNN), is proposed; built on the notion of nearest neighbors, it can dynamically select multiple objects as representatives for different clusters in different situations.
Abstract: Clustering validation has long been recognized as one of the vital issues essential to the success of clustering applications. In general, clustering validation can be categorized into two classes, external clustering validation and internal clustering validation. In this paper, we focus on internal clustering validation and present a study of 11 widely used internal clustering validation measures for crisp clustering. The results of this study indicate that these existing measures have certain limitations in different application scenarios. As an alternative choice, we propose a new internal clustering validation measure, named clustering validation index based on nearest neighbors (CVNN), which is based on the notion of nearest neighbors. This measure can dynamically select multiple objects as representatives for different clusters in different situations. Experimental results show that CVNN outperforms the existing measures on both synthetic data and real-world data in different application scenarios.

Journal ArticleDOI
TL;DR: A fuzzy c-means clustering hybrid approach that combines support vector regression and a genetic algorithm yields sufficient and sensible imputation performance results.

Journal ArticleDOI
TL;DR: The application of a philosophy of cluster analysis to economic data from the 2007 US Survey of Consumer Finances demonstrates techniques and decisions required to obtain an interpretable clustering, and the clustering is shown to be significantly more structured than a suitable null model.
Abstract: Data with mixed type (metric/ordinal/nominal) variables are typical for social stratification, i.e., partitioning a population into social classes. Approaches to cluster such data are compared, namely a latent class mixture model assuming local independence and dissimilarity based methods such as k-medoids. The design of an appropriate dissimilarity measure and the estimation of the number of clusters are discussed as well, comparing the BIC with dissimilarity based criteria. The comparison is based on a philosophy of cluster analysis that connects the problem of a choice of a suitable clustering method closely to the application by considering direct interpretations of the implications of the methodology. According to this philosophy, model assumptions serve to understand such implications but are not taken to be true. It is emphasised that researchers implicitly define the “true” clustering and number of clusters by the choice of a particular methodology. The researcher has to take the responsibility to specify the criteria on which such a comparison can be made. The application of this philosophy to socioeconomic data from the 2007 US Survey of Consumer Finances demonstrates some techniques to obtain an interpretable clustering in an ambiguous situation.
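Dissimilarity design for mixed metric/ordinal/nominal variables, discussed above as a key modelling decision, is commonly handled with a Gower-style coefficient: range-normalised absolute differences for numeric variables, simple mismatch for nominal ones. A sketch of that standard construction (the variables and records are made up; the paper itself weighs such design choices rather than prescribing this one):

```python
def gower(x, y, kinds, ranges):
    """Gower-style dissimilarity for mixed-type records.
    kinds[i] is 'num' (metric/ordinal, range-normalised absolute difference)
    or 'cat' (nominal, 0/1 mismatch); ranges[i] is the variable's range
    for 'num' entries and is ignored for 'cat'."""
    parts = []
    for xi, yi, kind, r in zip(x, y, kinds, ranges):
        if kind == "num":
            parts.append(abs(xi - yi) / r)
        else:
            parts.append(0.0 if xi == yi else 1.0)
    return sum(parts) / len(parts)

kinds = ["num", "cat", "num"]          # e.g. income, occupation, education level
ranges = [100.0, None, 10.0]
a = (50.0, "clerical", 4)
b = (70.0, "clerical", 6)
print(gower(a, b, kinds, ranges))
```

How each variable type is encoded, and how the parts are weighted, implicitly defines what counts as a "true" social class here, which is precisely the responsibility the paper's philosophy places on the researcher.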

Proceedings ArticleDOI
01 Dec 2013
TL;DR: A method that learns the projection of data and finds the sparse coefficients in the low-dimensional latent space and applies spectral clustering to a similarity matrix built from these sparse coefficients.
Abstract: We propose a novel algorithm called Latent Space Sparse Subspace Clustering for simultaneous dimensionality reduction and clustering of data lying in a union of subspaces. Specifically, we describe a method that learns the projection of data and finds the sparse coefficients in the low-dimensional latent space. Cluster labels are then assigned by applying spectral clustering to a similarity matrix built from these sparse coefficients. An efficient optimization method is proposed and its non-linear extensions based on the kernel methods are presented. One of the main advantages of our method is that it is computationally efficient as the sparse coefficients are found in the low-dimensional latent space. Various experiments show that the proposed method performs better than the competitive state-of-the-art subspace clustering methods.

Journal ArticleDOI
24 Jan 2013-Energies
TL;DR: This paper checks the effect of similarity measures when clustering is applied to discover representatives in cases where correlation is supposed to be an important factor to consider, e.g., time series.
Abstract: Forecasting and modeling building energy profiles require tools able to discover patterns within large amounts of collected information. Clustering is the main technique used to partition data into groups based on internal and a priori unknown schemes inherent to the data. The adjustment and parameterization of the whole clustering task is complex and subject to several uncertainties, the similarity metric being one of the first decisions to be made in order to establish how the distance between two independent vectors must be measured. The present paper checks the effect of similarity measures in the application of clustering for discovering representatives in cases where correlation is supposed to be an important factor to consider, e.g., time series. This is a necessary step for the optimized design and development of efficient clustering-based models, predictors and controllers of time-dependent processes, e.g., building energy consumption patterns. In addition, clustered-vector balance is proposed as a validation technique to compare clustering performances.
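The point about correlation-aware similarity can be made concrete: a correlation-based dissimilarity (1 - Pearson r) treats two load profiles with the same shape but different magnitudes as near-identical, while Euclidean distance keeps them far apart. A minimal sketch (the example profiles are invented):

```python
import math

def pearson_distance(a, b):
    """Correlation-based dissimilarity d = 1 - r: profiles with the same
    shape but different magnitudes come out 'close'."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

day1 = [1.0, 2.0, 3.0, 2.0, 1.0]       # a daily load profile
day2 = [10.0, 20.0, 30.0, 20.0, 10.0]  # same shape, 10x the magnitude
print(pearson_distance(day1, day2))    # ~0: identical shape
print(euclidean(day1, day2))           # large: magnitudes differ
```

Which of the two behaviours is desirable depends on whether cluster representatives should capture consumption *patterns* or absolute consumption *levels* — exactly the choice the paper evaluates.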

Journal ArticleDOI
01 Feb 2013
TL;DR: This tutorial presents a simple yet powerful data clustering technique, k-means, through three different algorithms: the Forgy/Lloyd algorithm, the MacQueen algorithm, and the Hartigan & Wong algorithm, together with an implementation in Mathematica.
Abstract: Data clustering techniques are valuable tools for researchers working with large databases of multivariate data. In this tutorial, we present a simple yet powerful one: the k-means clustering technique, through three different algorithms: the Forgy/Lloyd algorithm, the MacQueen algorithm, and the Hartigan & Wong algorithm. We then present an implementation in Mathematica and various examples of the different options available to illustrate the application of the technique. Data clustering techniques are descriptive data analysis techniques that can be applied to multivariate data sets to uncover the structure present in the data. They are particularly useful when classical second-order statistics (the sample mean and covariance) cannot be used. Namely, in exploratory data analysis, one of the assumptions made is that no prior knowledge about the dataset, and therefore the dataset's distribution, is available. In such a situation, data clustering can be a valuable tool. Data clustering is a form of unsupervised classification, as the clusters are formed by evaluating similarities and dissimilarities of intrinsic characteristics between different cases, and the grouping of cases is based on those emergent similarities and not on an external criterion. These techniques are also useful for datasets of any dimensionality over three, as it is very difficult for humans to reliably compare items of such complexity without support to aid the comparison.
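The tutorial's implementation is in Mathematica; for readers who want the Forgy/Lloyd variant in a general-purpose language, here is a compact Python sketch: Forgy initialisation (k random data points as centers), then alternating assignment and mean-update steps until the partition stabilises:

```python
import random

def lloyd_kmeans(points, k, iters=20, seed=0):
    """Forgy/Lloyd k-means: Forgy initialisation (k random data points as
    centers), then alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # Forgy initialisation
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # Update step: each center moves to its cluster's mean.
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(sorted(lloyd_kmeans(pts, 2)))  # one center per well-separated group
```

MacQueen's variant updates the affected center immediately after each single-point assignment, and Hartigan & Wong moves a point only when doing so lowers the total within-cluster sum of squares; both share the skeleton above.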

Proceedings ArticleDOI
23 Jun 2013
TL;DR: An out-of-sample extension of SSC, named Scalable Sparse Subspace Clustering (SSSC), is proposed, which makes SSC feasible for clustering large-scale data sets; experiments demonstrate the effectiveness and efficiency of the method compared with state-of-the-art algorithms.
Abstract: In this paper, we address two problems in the Sparse Subspace Clustering algorithm (SSC), i.e., the scalability issue and the out-of-sample problem. SSC constructs a sparse similarity graph for spectral clustering using l1-minimization based coefficients, and has achieved state-of-the-art results for image clustering and motion segmentation. However, the time complexity of SSC is proportional to the cube of the problem size, so it is inefficient to apply SSC in large-scale settings. Moreover, SSC cannot handle out-of-sample data that were not used to construct the similarity graph. For each new datum, SSC needs to recalculate the cluster memberships of the whole data set, which makes SSC uncompetitive for fast online clustering. To address these problems, this paper proposes an out-of-sample extension of SSC, named Scalable Sparse Subspace Clustering (SSSC), which makes SSC feasible for clustering large-scale data sets. The solution of SSSC adopts a "sampling, clustering, coding, and classifying" strategy. Extensive experimental results on several popular data sets demonstrate the effectiveness and efficiency of our method compared with the state-of-the-art algorithms.
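The "sampling, clustering, coding, and classifying" strategy can be sketched in miniature: cluster only a small in-sample subset, then assign every remaining (out-of-sample) point without re-running the expensive step. Here plain k-means and nearest-center assignment stand in for SSC and sparse coding, which is a deliberate simplification of the paper's method:

```python
import random

def scalable_cluster(points, k, sample_size, seed=0):
    """Sketch of SSSC's 'sampling, clustering, coding, classifying' strategy.
    The real SSSC runs SSC on the sample and classifies out-of-sample points
    via sparse coding; k-means and nearest-center assignment stand in here."""
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)      # 1. sampling

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = sample[:k]                          # 2. clustering (on the sample only)
    for _ in range(20):
        groups = [[] for _ in range(k)]
        for p in sample:
            groups[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]

    # 3./4. coding + classifying, collapsed into nearest-center assignment:
    # every point (in-sample or not) is labelled without reclustering.
    return [min(range(k), key=lambda i: dist2(p, centers[i])) for p in points]

blobs = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (11, 11)]
print(scalable_cluster(blobs, 2, sample_size=6))
```

The expensive step runs only on the sample, so its cubic cost applies to the sample size rather than the full data set — the source of SSSC's scalability.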

Book ChapterDOI
20 Nov 2013
TL;DR: This paper proposes a new clustering method based on the auto-encoder network, which can learn a highly non-linear mapping function and can obtain stable and effective clustering.
Abstract: Linear or non-linear data transformations are widely used processing techniques in clustering. Usually, they are beneficial to enhancing data representation. However, if data have a complex structure, these techniques would be unsatisfying for clustering. In this paper, based on the auto-encoder network, which can learn a highly non-linear mapping function, we propose a new clustering method. Via simultaneously considering data reconstruction and compactness, our method can obtain stable and effective clustering. Experiments on three databases show that the proposed clustering model achieves excellent performance in terms of both accuracy and normalized mutual information.

Journal ArticleDOI
TL;DR: This work proposes to integrate meta-path selection with user-guided clustering to cluster objects in networks, where a user first provides a small set of object seeds for each cluster as guidance, and an effective and efficient iterative algorithm, PathSelClus, is proposed to learn the model, where the clustering quality and the meta- path weights mutually enhance each other.
Abstract: Real-world, multiple-typed objects are often interconnected, forming heterogeneous information networks. A major challenge for link-based clustering in such networks is their potential to generate many different results, carrying rather diverse semantic meanings. In order to generate desired clustering, we propose to use meta-path, a path that connects object types via a sequence of relations, to control clustering with distinct semantics. Nevertheless, it is easier for a user to provide a few examples (seeds) than a weighted combination of sophisticated meta-paths to specify her clustering preference. Thus, we propose to integrate meta-path selection with user-guided clustering to cluster objects in networks, where a user first provides a small set of object seeds for each cluster as guidance. Then the system learns the weight for each meta-path that is consistent with the clustering result implied by the guidance, and generates clusters under the learned weights of meta-paths. A probabilistic approach is proposed to solve the problem, and an effective and efficient iterative algorithm, PathSelClus, is proposed to learn the model, where the clustering quality and the meta-path weights mutually enhance each other. Our experiments with several clustering tasks in two real networks and one synthetic network demonstrate the power of the algorithm in comparison with the baselines.

Journal ArticleDOI
TL;DR: This survey presents enhanced approaches to subspace clustering by discussing the problems they are solving, their cluster definitions and algorithms, and the related works in high-dimensional clustering.
Abstract: Subspace clustering finds sets of objects that are homogeneous in subspaces of high-dimensional datasets, and has been successfully applied in many domains. In recent years, a new breed of subspace clustering algorithms, which we denote as enhanced subspace clustering algorithms, have been proposed to (1) handle the increasing abundance and complexity of data and to (2) improve the clustering results. In this survey, we present these enhanced approaches to subspace clustering by discussing the problems they are solving, their cluster definitions and algorithms. Besides enhanced subspace clustering, we also present the basic subspace clustering and the related works in high-dimensional clustering.

Journal ArticleDOI
TL;DR: This article compares k-means to fuzzy c-means and rough k-means as important representatives of soft clustering, and surveys important extensions and derivatives of these algorithms.

Journal ArticleDOI
TL;DR: An energy functional based on the fuzzy c-means objective function is designed, incorporating a bias field that accounts for the intensity inhomogeneity of real-world images, and a fuzzy external force is deduced for the LBM solver based on the model by Zhao.
Abstract: In the last decades, due to the development of parallel programming, the lattice Boltzmann method (LBM) has attracted much attention as a fast alternative approach for solving partial differential equations. In this paper, we first designed an energy functional based on the fuzzy c-means objective function, which incorporates a bias field that accounts for the intensity inhomogeneity of the real-world image. Using the gradient descent method, we obtained the corresponding level set equation, from which we deduce a fuzzy external force for the LBM solver based on the model by Zhao. The method is fast, robust against noise, independent of the position of the initial contour, effective in the presence of intensity inhomogeneity, highly parallelizable, and can detect objects with or without edges. Experiments on medical and real-world images demonstrate the performance of the proposed method in terms of speed and efficiency.

Journal ArticleDOI
TL;DR: A hybrid fuzzy time series approach with fuzzy c-means clustering method and artificial neural networks employed for fuzzification and defining fuzzy relationships, respectively is proposed to reach more accurate forecasts.
Abstract: In recent years, time series forecasting studies in which fuzzy time series approach is utilized have got more attentions. Various soft computing techniques such as fuzzy clustering, artificial neural networks and genetic algorithms have been used in fuzzy time series method to improve the method. While fuzzy clustering and genetic algorithms are being used for fuzzification, artificial neural networks method is being preferred for using in defining fuzzy relationships. In this study, a hybrid fuzzy time series approach is proposed to reach more accurate forecasts. In the proposed hybrid approach, fuzzy c-means clustering method and artificial neural networks are employed for fuzzification and defining fuzzy relationships, respectively. The enrollment data of University of Alabama is forecasted by using both the proposed method and the other fuzzy time series approaches. As a result of comparison, it is seen that the most accurate forecasts are obtained when the proposed hybrid fuzzy time series approach is used.

Proceedings ArticleDOI
23 Jun 2013
TL;DR: An efficient clustering framework specifically for face clustering in videos, exploiting the fact that faces in adjacent frames of the same face track are very similar, is introduced; the framework is applicable to other clustering algorithms to significantly reduce the computational cost.
Abstract: In this paper, we focus on face clustering in videos. Given the detected faces from real-world videos, we partition all faces into K disjoint clusters. Different from clustering on a collection of facial images, the faces from videos are organized as face tracks and the frame index of each face is also provided. As a result, many pair wise constraints between faces can be easily obtained from the temporal and spatial knowledge of the face tracks. These constraints can be effectively incorporated into a generative clustering model based on the Hidden Markov Random Fields (HMRFs). Within the HMRF model, the pair wise constraints are augmented by label-level and constraint-level local smoothness to guide the clustering process. The parameters for both the unary and the pair wise potential functions are learned by the simulated field algorithm, and the weights of constraints can be easily adjusted. We further introduce an efficient clustering framework specially for face clustering in videos, considering that faces in adjacent frames of the same face track are very similar. The framework is applicable to other clustering algorithms to significantly reduce the computational cost. Experiments on two face data sets from real-world videos demonstrate the significantly improved performance of our algorithm over state-of-the art algorithms.

Journal ArticleDOI
01 Feb 2013-Cities
TL;DR: An artificial intelligence approach integrated with geographical information systems (GISs) for modeling urban evolution, using fuzzy logic and neural networks to provide a synthetic spatiotemporal methodology for the analysis, prediction and interpretation of urban growth.

Journal ArticleDOI
TL;DR: A new method for selecting the most relevant attributes, termed Prominent attributes, is proposed and compared with an existing method that finds Significant attributes for unsupervised learning; multiple clusterings of the data are performed to find initial cluster centers.
Abstract: Partitional clustering of categorical data is normally performed with the K-modes clustering algorithm, which works well for large datasets. Even though the design and implementation of the K-modes algorithm are simple and efficient, it has the pitfall of choosing the initial cluster centers randomly on every execution, which may lead to non-repeatable clustering results. This paper addresses the randomized center initialization problem of the K-modes algorithm by proposing a cluster center initialization algorithm. The proposed algorithm performs multiple clusterings of the data based on attribute values in different attributes and yields deterministic modes that are used as initial cluster centers. In the paper, we propose a new method for selecting the most relevant attributes, namely Prominent attributes, compare it with another existing method that finds Significant attributes for unsupervised learning, and perform multiple clusterings of the data to find initial cluster centers. The proposed algorithm ensures fixed initial cluster centers and thus repeatable clustering results. The worst-case time complexity of the proposed algorithm is log-linear in the number of data objects. We evaluate the proposed algorithm on several categorical datasets, compare it against random initialization and two other initialization methods, and show that the proposed method performs better in terms of accuracy and time complexity. The initial cluster centers computed by the proposed approach are close to the actual cluster centers of the different datasets we tested, which leads to faster convergence of the K-modes clustering algorithm as well as better clustering results.
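The K-modes algorithm that the initialization scheme feeds into can be sketched compactly. Below is a standard textbook K-modes (Hamming dissimilarity, per-attribute mode update); the `init_modes` argument stands in for the deterministic initial centers the paper derives, whose construction is not reproduced here.

```python
from collections import Counter

def k_modes(data, init_modes, n_iter=20):
    """Basic K-modes: assign each row to the mode with the fewest
    attribute mismatches, then recompute each cluster's modes."""
    modes = [list(m) for m in init_modes]
    labels = [0] * len(data)
    for _ in range(n_iter):
        labels = [min(range(len(modes)),
                      key=lambda j: sum(a != b for a, b in zip(row, modes[j])))
                  for row in data]
        for j in range(len(modes)):
            members = [row for row, l in zip(data, labels) if l == j]
            if members:
                # Mode = most frequent value per attribute column.
                modes[j] = [Counter(col).most_common(1)[0][0]
                            for col in zip(*members)]
    return labels, modes
```

Because the only source of randomness in plain K-modes is the choice of `init_modes`, fixing them deterministically, as the paper proposes, makes the whole clustering repeatable.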

Proceedings Article
09 Jul 2013
TL;DR: This work proposes Multi-CLUS and GraphFuse, two multi-graph clustering techniques powered by Minimum Description Length and Tensor analysis, respectively, and demonstrates higher clustering accuracy than state-of-the-art baselines that do not exploit the multi-view nature of the network data.
Abstract: Given a co-authorship collaboration network, how well can we cluster the participating authors into communities? If we also consider their citation network, based on the same individuals, is it possible to do a better job? In general, given a network with multiple types (or views) of edges (e.g., collaboration, citation, friendship), can community detection and graph clustering benefit? In this work, we propose Multi-CLUS and GraphFuse, two multi-graph clustering techniques powered by Minimum Description Length and Tensor analysis, respectively. We conduct experiments both on real and synthetic networks, evaluating the performance of our approaches. Our results demonstrate higher clustering accuracy than state-of-the-art baselines that do not exploit the multi-view nature of the network data. Finally, we address the fundamental question posed in the title, and provide a comprehensive answer, based on our systematic analysis.
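Multi-CLUS and GraphFuse rely on Minimum Description Length and tensor decomposition respectively; as a much rougher illustration of why fusing views helps, one can average the views' adjacency matrices and bisect the fused graph spectrally. The sketch below is an assumption-laden stand-in, not either of the paper's methods: it splits the nodes on the sign of the Fiedler vector of the fused graph Laplacian.

```python
import numpy as np

def two_way_multiview_cut(adjacency_views):
    """Fuse views by averaging their adjacency matrices, then bisect the
    fused graph on the sign of the Fiedler vector (the eigenvector of the
    graph Laplacian with the second-smallest eigenvalue)."""
    A = np.mean(np.asarray(adjacency_views, dtype=float), axis=0)
    L = np.diag(A.sum(axis=1)) - A          # unnormalized Laplacian
    _, vecs = np.linalg.eigh(L)             # eigenvalues ascending
    return (vecs[:, 1] > 0).astype(int)     # sign split on Fiedler vector
```

A community boundary visible in one view (say, collaboration) but blurred in another (say, citation) survives in the averaged matrix, which is the intuition behind exploiting the multi-view structure.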

Journal ArticleDOI
TL;DR: An improved k-prototypes algorithm for clustering mixed data is proposed, together with a new measure of the dissimilarity between data objects and cluster prototypes that takes into account the significance of different attributes in the clustering process.
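The shape of such a mixed-data dissimilarity can be sketched from the classic k-prototypes form: squared Euclidean distance on numeric attributes plus a gamma-weighted mismatch count on categorical ones. The per-attribute `weights` parameter below is a hypothetical stand-in for the paper's attribute-significance terms, whose exact definition is not given here.

```python
def mixed_dissimilarity(x, y, numeric_idx, gamma=1.0, weights=None):
    """Classic k-prototypes dissimilarity with optional per-attribute
    weights: squared difference for numeric attributes, a gamma-scaled
    0/1 mismatch for categorical ones."""
    if weights is None:
        weights = [1.0] * len(x)
    d = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if i in numeric_idx:
            d += weights[i] * (a - b) ** 2      # numeric attribute
        else:
            d += gamma * weights[i] * (a != b)  # categorical attribute
    return d
```

Raising the weight of an attribute makes disagreements on it more costly, which is how attribute significance steers the resulting clustering.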

Journal ArticleDOI
TL;DR: A connection is established between the objective function and correlation clustering, yielding practical approximation algorithms for the problem of clustering probabilistic graphs; the practicality of the techniques is demonstrated on a large social network of Yahoo! users consisting of one billion edges.
Abstract: We study the problem of clustering probabilistic graphs. Similar to the problem of clustering standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes in probabilistic protein-protein interaction (PPI) networks and discovering groups of users in affiliation networks. We extend the edit-distance-based definition of graph clustering to probabilistic graphs. We establish a connection between our objective function and correlation clustering to propose practical approximation algorithms for our problem. A benefit of our approach is that our objective function is parameter-free. Therefore, the number of clusters is part of the output. We also develop methods for testing the statistical significance of the output clustering and study the case of noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show that our methods discover the correct number of clusters and identify established protein relationships. Finally, we show the practicality of our techniques using a large social network of Yahoo! users consisting of one billion edges.
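One natural instantiation of the edit-distance objective on probabilistic graphs, assumed here rather than taken verbatim from the paper, scores a candidate clustering by the expected number of edge edits needed to turn the probabilistic graph into the clustering's cluster graph: an intra-cluster pair costs (1 - p) for its possibly-missing edge, an inter-cluster pair costs p for its possibly-present edge.

```python
from itertools import combinations

def expected_edit_distance(nodes, edge_prob, clusters):
    """Expected edits turning a probabilistic graph into the deterministic
    cluster graph induced by `clusters` (a partition of `nodes`).
    `edge_prob` maps unordered node pairs to existence probabilities."""
    label = {v: i for i, cl in enumerate(clusters) for v in cl}
    cost = 0.0
    for u, v in combinations(nodes, 2):
        p = edge_prob.get((u, v), edge_prob.get((v, u), 0.0))
        cost += (1.0 - p) if label[u] == label[v] else p
    return cost
```

Note how the objective is parameter-free in the sense the abstract describes: comparing this cost across partitions with different numbers of clusters needs no preset K, so the number of clusters falls out of the minimization.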