Journal Article

Improved rough k-means clustering algorithm based on weighted distance measure with Gaussian function

TL;DR: This paper proposes an improved rough k-means clustering algorithm based on a weighted distance measure with a Gaussian function, and demonstrates its validity by simulation and experimental analysis.
Abstract: The rough k-means clustering algorithm and its extensions have been successfully applied to real-life data where clusters do not necessarily have crisp boundaries. Experiments have shown that rough k-means provides a reasonable set of lower and upper bounds for a given dataset. However, when the new centre of each cluster is computed, all data objects in a lower or upper approximation receive the same weight, so the different impacts of objects within the same approximation are ignored. This paper proposes an improved rough k-means clustering algorithm based on a weighted distance measure with a Gaussian function. The validity of the algorithm is demonstrated by simulation and experimental analysis.
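
The abstract does not give the weighting formula, so the following is only a minimal sketch of the idea, not the authors' method. It assumes each object is weighted by exp(-d^2 / (2 * sigma^2)), where d is its Euclidean distance to the current centre, and that the lower approximation and boundary region are combined with fixed weights. The names `gaussian_weight`, `weighted_mean`, `update_centre`, `sigma`, `w_lower` and `w_boundary` are all hypothetical.

```python
import math

def gaussian_weight(point, centre, sigma=1.0):
    """Assumed weight: decays with squared Euclidean distance to the centre."""
    d2 = sum((p - c) ** 2 for p, c in zip(point, centre))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def weighted_mean(points, centre, sigma=1.0):
    """Gaussian-weighted mean of the points, relative to the current centre."""
    weights = [gaussian_weight(p, centre, sigma) for p in points]
    total = sum(weights)
    dim = len(centre)
    if total == 0.0:
        # All weights underflowed to zero: fall back to the plain mean.
        return [sum(p[i] for p in points) / len(points) for i in range(dim)]
    return [sum(w * p[i] for w, p in zip(weights, points)) / total
            for i in range(dim)]

def update_centre(lower, boundary, centre,
                  w_lower=0.7, w_boundary=0.3, sigma=1.0):
    """New centre from the lower approximation and the boundary region.

    If one of the two sets is empty, only the other one is used; if both
    are empty, the centre is left unchanged.
    """
    if lower and boundary:
        lo = weighted_mean(lower, centre, sigma)
        bd = weighted_mean(boundary, centre, sigma)
        return [w_lower * a + w_boundary * b for a, b in zip(lo, bd)]
    members = lower or boundary
    if not members:
        return list(centre)
    return weighted_mean(members, centre, sigma)
```

With equal weights the new centre of `[(0, 0), (2, 0)]` would be `(1, 0)`; under the assumed Gaussian weighting around `(0, 0)` it is pulled towards the object that is already close to the centre, which is the effect the abstract describes.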
Citations
Journal Article
TL;DR: Experimental evaluations on UCI machine learning repository datasets verify the effectiveness of the proposed IFCERS method, and it is shown that the quality of the final solution has a weak correlation with the ensemble size, the parameter setting on the rough approximations construction is appropriate, and the proposed method is robust towards the diversity from hard clustering members.
Abstract: To deal with uncertainty, vagueness and overlapping distributions within data sets, a novel incremental fuzzy cluster ensemble method based on rough set theory (IFCERS) is proposed, which combines cluster analysis with classification techniques. Firstly, on the basis of soft clustering results, the positive region, boundary region and negative region of the clustering ensemble are obtained by constructing rough approximations from rough set theory, and a group structure over the data points of the positive region is then obtained with a fuzzy cluster ensemble method. Secondly, using a supervised ensemble learning method, e.g., random forests, the obtained group structure is used to train a random forests classifier that classifies the data points in the boundary region. Finally, all of the acquired group structure is used to train the random forests classifier to classify the data points of the negative region. Experimental evaluations on UCI machine learning repository datasets verify the effectiveness of the proposed method. It is also shown that the quality of the final solution has a weak correlation with the ensemble size, the parameter setting on the rough approximations construction is appropriate, and the proposed method is robust towards the diversity from hard clustering members.

49 citations

Journal Article
TL;DR: This paper proposes a parallel adaptive Canopy-K-means algorithm, which can be used in cloud computing framework to determine the distance threshold parameter T2 adaptively based on statistical method.
Abstract: Firstly, this paper surveys the types of clustering algorithm and describes the classical K-means algorithm and the Canopy algorithm in detail. Then, combining the MapReduce computing model and the Spark cloud computing framework, it introduces the parallel Canopy-K-means algorithm, in which the Canopy algorithm is used to optimize the initial values of K-means. However, the Canopy algorithm introduces a new distance threshold parameter T2 that has to be set from human experience, and this parameter is difficult to choose manually for large data. The paper therefore proposes a parallel adaptive Canopy-K-means algorithm, which can be used in a cloud computing framework to determine the distance threshold parameter T2 adaptively with a statistical method. Using the parallelism of the MapReduce computing model, the parallel Canopy-K-means algorithm is optimized by adaptive parameter estimation, which removes the dependence on manually selected parameters in the Canopy process. After presenting the relevant theory and the derivation of the algorithm, a cloud computing experiment platform is built on the Spark framework, and comparison experiments are performed using the Stanford Large Network Dataset Collection (SNAP) dataset and a self-built Dimension Networks dataset. The experimental results show that the proposed method is effective.

34 citations
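
The Canopy step described in the abstract above can be sketched serially as follows. This is an illustration only: the abstract does not say which statistic the paper uses for T2, so the sketch assumes the mean pairwise distance, and it omits the T1 threshold and the MapReduce/Spark parallelisation entirely. The function names `mean_pairwise_distance` and `canopy_centres` are hypothetical.

```python
import math
from itertools import combinations

def mean_pairwise_distance(points):
    """Assumed statistic for T2: mean Euclidean distance over all pairs."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def canopy_centres(points, t2):
    """Pick canopy centres: each chosen centre removes every remaining
    point within distance t2, so no two centres are closer than t2."""
    remaining = list(points)
    centres = []
    while remaining:
        centre = remaining.pop(0)  # take the next unassigned point
        centres.append(centre)
        remaining = [p for p in remaining if math.dist(centre, p) > t2]
    return centres

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
t2 = mean_pairwise_distance(points)
initial_centres = canopy_centres(points, t2)  # candidate K-means seeds
```

The resulting centres would then serve as the initial values for K-means, with their count suggesting k.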

Journal Article
TL;DR: To mitigate adverse effects of imbalanced clusters and decrease the computational cost, an interval type-2 fuzzy local measure for RKM clustering is proposed, on the basis of which a novel RKM clustering algorithm has been developed that specifically gives due consideration to imbalanced clusters.
Abstract: Rough K-Means (RKM) is an efficient clustering algorithm for overlapping datasets, and has captured increasing attention in recent years. RKM algorithms mainly focus on better describing the uncertain objects located in boundary regions in order to improve performance. However, most available RKM algorithms fail to pay attention to the influence of imbalanced clusters, together with imbalanced spatial distributions (i.e., the cluster density) and differing cluster sizes (i.e., the number of object ratios). This paper seeks to address this deficiency and examines in detail some adverse effects caused by imbalanced clusters. To mitigate adverse effects of imbalanced clusters and decrease the computational cost, an interval type-2 fuzzy local measure for RKM clustering is proposed, on the basis of which a novel RKM clustering algorithm has been developed that specifically gives due consideration to imbalanced clusters. The effectiveness and superiority of this algorithm are demonstrated through simulation and experimental analysis.

34 citations


Cites background from "Improved rough k-means clustering a..."

  • ...Reflecting the differing impacts of different objects within the same approximation set, a RKM clustering based on a weighted distance measure with Gaussian function was proposed in [19]....


  • ..., c-means clustering algorithm in some literature [3], [7], [9]–[16]) is still a topic of great interest to researchers, and has attained great popularity [1], [6], [17]–[19]....


  • ...Considering that the lower approximation set or the boundary may be empty in some cases, Peters [19] improved the above center iterative calculation formula....


  • ...Many researchers have made successive improvements by incorporating inter alia fuzzy set, probabilistic model, kernel methods on [4]–[7], [17], [19], [21], [22]....


Book Chapter
29 Jun 2020
TL;DR: A narrative review of the state of the art of applications of TWD in regard to the different areas of concern identified by the framework is presented, and in so doing it will highlight both the points of strength of the three-way methodology, and the opportunities for further research.
Abstract: In this work we introduce a framework, based on three-way decision (TWD) and the trisecting-acting-outcome model, to handle uncertainty in Machine Learning (ML). We distinguish between handling uncertainty affecting the input of ML models, when TWD is used to identify and properly take into account the uncertain instances; and handling the uncertainty lying in the output, where TWD is used to allow the ML model to abstain. We then present a narrative review of the state of the art of applications of TWD in regard to the different areas of concern identified by the framework, and in so doing, we will highlight both the points of strength of the three-way methodology, and the opportunities for further research.

33 citations
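
The output-side use of three-way decision mentioned in the abstract above, where a model is allowed to abstain, can be sketched with a pair of thresholds. This is a generic illustration rather than anything taken from the chapter: the function name `three_way_decide` and the default thresholds `alpha` and `beta` are assumptions.

```python
def three_way_decide(prob_positive, alpha=0.8, beta=0.2):
    """Trisect a predicted probability into accept / reject / abstain.

    alpha and beta are assumed thresholds with 0 <= beta < alpha <= 1.
    """
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= beta < alpha <= 1")
    if prob_positive >= alpha:
        return "accept"   # confident positive
    if prob_positive <= beta:
        return "reject"   # confident negative
    return "abstain"      # uncertain region: defer the decision

decisions = [three_way_decide(p) for p in (0.9, 0.5, 0.1)]
```

Widening the gap between `beta` and `alpha` makes the model abstain more often and commit only on instances it is more certain about.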

Journal Article
TL;DR: This article presents an overview and taxonomy of the K-means clustering algorithm and its variants, covering current trends, open issues and challenges, and recommended future research perspectives.

32 citations

References
Journal Article
TL;DR: Some extensions of the rough set approach are presented and a challenge for rough set based research is outlined; it is also argued that the current rough set based research paradigms are unsustainable.

1,161 citations


"Improved rough k-means clustering a..." refers methods in this paper

  • ...The rough set theory proposed by Pawlak and Skowron [23] is an important tool to deal with imprecise, incomplete and inconsistent data....


  • ...Rough set theory, as an important soft computing approach proposed by Pawlak and Skowron [23] for uncertain and vague data analysis, has been shown to be more promising and has been successfully incorporated in the k-means clustering framework by Lingras to develop the rough k-means (RKM) algorithm [9,14]....


Journal Article
01 Jul 2004
TL;DR: A variation of the K-means clustering algorithm based on properties of rough sets is proposed, which represents clusters as interval or rough sets.
Abstract: Data collection and analysis in web mining face certain unique challenges. Due to a variety of reasons inherent in web browsing and web logging, the likelihood of bad or incomplete data is higher than in conventional applications. The analytical techniques in web mining need to accommodate such data. Fuzzy and rough sets provide the ability to deal with incomplete and approximate information. Fuzzy set theory has been shown to be useful in three important aspects of web and data mining, namely clustering, association, and sequential analysis. There is increasing interest in research on clustering based on rough set theory. Clustering is an important part of web mining that involves finding natural groupings of web resources or web users. Researchers have pointed out some important differences between clustering in conventional applications and clustering in web mining. For example, the clusters and associations in web mining do not necessarily have crisp boundaries. As a result, researchers have studied the possibility of using fuzzy sets in web mining clustering applications. Recent attempts have used genetic algorithms based on rough set theory for clustering. However, clustering based on genetic algorithms may not be able to handle the large amount of data typical in a web mining application. This paper proposes a variation of the K-means clustering algorithm based on properties of rough sets. The proposed algorithm represents clusters as interval or rough sets. The paper also describes the design of an experiment including data collection and the clustering process. The experiment is used to create interval set representations of clusters of web visitors.

493 citations
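
Representing clusters as interval sets, as the abstract above describes, means each cluster has a lower approximation (objects that certainly belong) and an upper approximation (objects that possibly belong). A minimal sketch of the assignment step follows. It uses the distance-ratio rule commonly associated with rough K-means, which the abstract itself does not spell out, so treat the rule, the function name `rough_assign` and the `threshold` value as assumptions.

```python
import math

def rough_assign(points, centres, threshold=1.3):
    """Assign each point to lower and/or upper approximations.

    A point whose distance to some other centre is within `threshold`
    times its distance to the nearest centre is ambiguous: it goes into
    the upper approximations of all such clusters and into no lower
    approximation. Otherwise it goes into the lower (and therefore also
    the upper) approximation of its nearest cluster.
    """
    k = len(centres)
    lower = [[] for _ in range(k)]
    upper = [[] for _ in range(k)]
    for x in points:
        dists = [math.dist(x, c) for c in centres]
        nearest = min(range(k), key=dists.__getitem__)
        d_min = dists[nearest]
        close = [j for j in range(k)
                 if j != nearest and dists[j] <= threshold * d_min]
        if close:
            for j in [nearest] + close:
                upper[j].append(x)
        else:
            lower[nearest].append(x)
            upper[nearest].append(x)
    return lower, upper

centres = [(0, 0), (10, 0)]
points = [(1, 0), (9, 0), (5, 0)]
lower, upper = rough_assign(points, centres)
```

Here `(5, 0)` is equidistant from both centres, so it lands only in the two upper approximations, which gives the clusters overlapping, non-crisp boundaries.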

Journal Article
01 Oct 1997
TL;DR: This paper presents a problem of fuzzy clustering with partial supervision, i.e., unsupervised learning completed in the presence of some labeled patterns, and proposes two specific learning scenarios of complete and incomplete class assignment of the labeled patterns.
Abstract: Presented here is a problem of fuzzy clustering with partial supervision, i.e., unsupervised learning completed in the presence of some labeled patterns. The classification information is incorporated additively as a part of an objective function utilized in the standard FUZZY ISODATA. The algorithms proposed in the paper embrace two specific learning scenarios of complete and incomplete class assignment of the labeled patterns. Numerical examples including both synthetic and real-world data arising in the realm of software engineering are also provided.

327 citations

Journal Article
01 Aug 2006
TL;DR: A novel clustering architecture is introduced, in which several subsets of patterns can be processed together with an objective of finding a common structure, and the required communication links are established at the level of cluster prototypes and partition matrices.
Abstract: In this study, we introduce a novel clustering architecture, in which several subsets of patterns can be processed together with the objective of finding a common structure. The structure revealed at the global level is determined by exchanging prototypes of the subsets of data and by moving prototypes of the corresponding clusters toward each other. Thereby, the required communication links are established at the level of cluster prototypes and partition matrices, without compromising security. A detailed clustering algorithm is developed by integrating the advantages of both fuzzy sets and rough sets, and a measure of quantitative analysis of the experimental results is provided for synthetic and real-world data.

241 citations

Journal Article
01 Dec 2007
TL;DR: The RFPCM comprises a judicious integration of the principles of rough and fuzzy sets that incorporates both probabilistic and possibilistic memberships simultaneously to avoid the problems of noise sensitivity of fuzzy C-means and the coincident clusters of PCM.
Abstract: A generalized hybrid unsupervised learning algorithm, which is termed as rough-fuzzy possibilistic C-means (RFPCM), is proposed in this paper. It comprises a judicious integration of the principles of rough and fuzzy sets. While the concept of lower and upper approximations of rough sets deals with uncertainty, vagueness, and incompleteness in class definition, the membership function of fuzzy sets enables efficient handling of overlapping partitions. It incorporates both probabilistic and possibilistic memberships simultaneously to avoid the problems of noise sensitivity of fuzzy C-means and the coincident clusters of PCM. The concept of crisp lower bound and fuzzy boundary of a class, which is introduced in the RFPCM, enables efficient selection of cluster prototypes. The algorithm is generalized in the sense that all existing variants of C-means algorithms can be derived from the proposed algorithm as a special case. Several quantitative indices are introduced based on rough sets for the evaluation of performance of the proposed C-means algorithm. The effectiveness of the algorithm, along with a comparison with other algorithms, has been demonstrated both qualitatively and quantitatively on a set of real-life data sets.

220 citations