
Showing papers on "Dunn index published in 2020"


Journal ArticleDOI
TL;DR: An Ensemble Artificial Bee Colony based Anomaly Detection Scheme (En-ABC) for multi-class datasets in a cloud environment is proposed, and the performance of the proposed scheme is compared with existing schemes using various parameters such as detection, false alarm, and accuracy rates.

59 citations


Journal ArticleDOI
TL;DR: The comparative analysis, based on the modified Dunn index and silhouette validity ratio, shows that the proposed initialization algorithm performs better than the other initialization algorithms.

37 citations


Journal ArticleDOI
TL;DR: A new data-driven dissimilarity measure, called MADD, turns the distance concentration phenomenon to its advantage; as a result, clustering algorithms based on MADD usually perform well for high-dimensional data.
Abstract: Popular clustering algorithms based on usual distance functions (e.g., the Euclidean distance) often suffer in high dimension, low sample size (HDLSS) situations, where concentration of pairwise distances and violation of neighborhood structure have adverse effects on their performance. In this article, we use a new data-driven dissimilarity measure, called MADD, which takes care of these problems. MADD uses the distance concentration phenomenon to its advantage, and as a result, clustering algorithms based on MADD usually perform well for high dimensional data. We establish it using theoretical as well as numerical studies. We also address the problem of estimating the number of clusters. This is a challenging problem in cluster analysis, and several algorithms are available for it. We show that many of these existing algorithms have superior performance in high dimensions when they are constructed using MADD. We also construct a new estimator based on a penalized version of the Dunn index and prove its consistency in the HDLSS asymptotic regime. Several simulated and real data sets are analyzed to demonstrate the usefulness of MADD for cluster analysis of high dimensional data.

36 citations
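For reference, the classical Dunn index, of which the paper above constructs a penalized variant to estimate the number of clusters, is, for a partition into clusters $C_1, \dots, C_k$,

$$\mathrm{DI} = \frac{\min_{i \neq j} \delta(C_i, C_j)}{\max_{1 \le m \le k} \Delta(C_m)},$$

where $\delta(C_i, C_j)$ is the separation between clusters $i$ and $j$ and $\Delta(C_m)$ is the diameter of cluster $C_m$; larger values indicate compact, well-separated clusters.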


Journal ArticleDOI
Hu Shi1, Chunping Jiang1, Zongzhuo Yan1, Tao Tao1, Xuesong Mei1 
TL;DR: The results show that compared with the BP neural network and multiple linear regression model, the Bayesian neural network not only has higher prediction accuracy but also can guarantee excellent prediction performance under different working conditions.
Abstract: It is well known that thermal error has a significant impact on the accuracy of CNC machine tools. In order to decrease the thermally induced positioning error of machine tools, a novel thermal error modeling approach based on a Bayesian neural network is proposed in this paper. The relationship between the temperature rise and positioning error of the feed drive system is investigated by simultaneously measuring the thermal characteristics, which include the temperature field and positioning error of the CNC machine tool. Fuzzy c-means (FCM) clustering and correlation analysis are used to select temperature-sensitive points, and the Dunn index is introduced to determine the optimal number of clustering groups, which effectively inhibits the multicollinearity problem among temperature measuring points. Least-squares linear fitting is applied to explore the features of the positioning error data. The results show that, compared with the BP neural network and multiple linear regression model, the Bayesian neural network not only has higher prediction accuracy but also guarantees excellent prediction performance under different working conditions. The prediction results obtained under different operating conditions indicate that the maximum thermal error can be reduced from around 18.2 to 5.14 μm by using the Bayesian neural network, which represents a 71% reduction in the thermally induced error of the feed drive system of the machine tool.

27 citations
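The cluster-count selection step described above can be sketched as follows. This is a hedged illustration, not the authors' code: KMeans stands in for fuzzy c-means, the data are random placeholders for temperature measurements, and the Dunn computation follows the classical definition.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def dunn_index(X, labels):
    """Classical Dunn index: min inter-cluster separation / max cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    diameters = [cdist(c, c).max() for c in clusters]          # within-cluster spread
    separations = [cdist(a, b).min()                           # between-cluster gaps
                   for i, a in enumerate(clusters) for b in clusters[i + 1:]]
    return min(separations) / max(diameters)

X = np.random.rand(200, 8)   # placeholder for temperature measuring point data
scores = {k: dunn_index(X, KMeans(n_clusters=k, n_init=10,
                                  random_state=0).fit_predict(X))
          for k in range(2, 8)}
best_k = max(scores, key=scores.get)                           # highest Dunn wins
print(best_k, scores[best_k])
```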


Journal ArticleDOI
TL;DR: This paper proposes WOATS, a combination of WOA and TS, for data clustering: a meta-heuristic that uses memory components to explore and exploit the search space, with an objective function inspired by partitional clustering to maintain the quality of clustering solutions.

20 citations


Journal ArticleDOI
TL;DR: The results using this methodology showed a high classification accuracy and proved that both learning frameworks can be combined to optimize the selection of classification features.

19 citations


Journal ArticleDOI
TL;DR: This paper proposes to cluster and identify similar trajectories based on the paths traversed by moving objects, using a graph model; the approach has two phases, graph generation and clustering.

16 citations


Journal ArticleDOI
TL;DR: The proposed distributed clustering algorithm using multi-objective whale optimization (DMOWOA) for peer-to-peer networks outperforms existing techniques in terms of the statistical measures Minkowski score, Dunn index, and Silhouette index.

16 citations


Journal ArticleDOI
TL;DR: This paper designs an unsupervised clustering strategy in which the Silhouette coefficient and Dunn index are applied to parameter selection for DBSCAN, in order to cluster unknown protocol messages into classes with different formats.

14 citations
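A hedged sketch of the parameter-selection idea above (the paper's features and pipeline are not reproduced here): sweep DBSCAN's eps and keep the value whose clustering scores best on an internal index. scikit-learn's silhouette coefficient is used for scoring; the Dunn index would be an analogous custom scorer, and the message feature vectors are placeholders.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 16)          # placeholder protocol-message feature vectors
best_eps, best_score = None, -1.0
for eps in np.linspace(0.1, 1.0, 10):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    core = labels != -1              # drop noise points before scoring
    if len(np.unique(labels[core])) < 2:
        continue                     # silhouette needs at least two clusters
    score = silhouette_score(X[core], labels[core])
    if score > best_score:
        best_eps, best_score = eps, score
print(best_eps, best_score)
```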


Journal ArticleDOI
Jihwan Park1, Keon Vin Park1, Soohyun Yoo, Sang Ok Choi1, Sung Won Han1 
TL;DR: The results showed that clustering accuracy was best for the classification that used the hierarchical method, and the new grouping suggested by that method with four clusters can be utilized as a political decision-making tool.
Abstract: South Korea has been operating an extended producer responsibility (EPR) system since 2003 to collect, transport, and dispose of e-waste. Until 2019, the EPR system was operated with a total of 27 electronic products classified into five categories based on weight and volume, but 23 items will be added in 2020 along with a change to five categories based on the function of the products. In this study, which used actual operational data related to the collection, transport, and recycling steps from recycling plants in South Korea, we analyzed how well the new five-category grouping reflected actual recycling industrial conditions and provided optimal classification alternatives. The results showed that clustering accuracy was best for the classification that used the hierarchical method. In particular, the silhouette evaluation index showed the best accuracy with three clusters (0.4155), and the Dunn index indicated the best performance with four clusters (0.2333). Based on these results, ANOVA tests were implemented and showed that the three clusters in the relevant models were significantly different with regard to takt-time, weight, volume, and number of recycling processes (p ≤ 0.01) and to both recycling cost and value of material (p ≤ 0.05). In contrast, with regard to the grouping suggested by the South Korean government, the overall clustering accuracies using the silhouette and Dunn indices were –0.2028 and 0.058, respectively. In conclusion, the new grouping suggested by the hierarchical method with four clusters can be utilized as a political decision-making tool.

9 citations


Journal ArticleDOI
01 Jul 2020
TL;DR: The formalization of the medical data preprocessing stage was proposed in order to find personalized solutions based on current standards and pharmaceutical protocols and to determine deviations of parameters from the normative parameters of the group, as well as the average parameters.
Abstract: The study was conducted by applying machine learning and data mining methods to treatment personalization. This allows individual patient characteristics to be investigated. The personalization method was built on the clustering method and associative rules. It was suggested to determine the average distance between instances in order to find the optimal performance metrics. The formalization of the medical data preprocessing stage was proposed in order to find personalized solutions based on current standards and pharmaceutical protocols. The patient data model was built using time-dependent and time-independent parameters. Personalized treatment is usually based on the decision tree method. This approach requires significant computation time and cannot be parallelized. Therefore, it was proposed to group people by conditions and to determine deviations of parameters from the normative parameters of the group, as well as the average parameters. The novelty of the paper is the new clustering method, which was built from an ensemble of cluster algorithms, and the usage of the new distance measure with Hopkins metrics, which were 0.13 less than for the k-means method. The Dunn index was 0.03 higher than for the BIRCH (balanced iterative reducing and clustering using hierarchies) algorithm. The next stage was the mining of associative rules provided separately for each cluster. This allows a personalized approach to treatment to be created for each patient based on long-term monitoring. The correctness level of the proposed medical decisions is 86%, which was approved by experts.
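A hedged sketch of the second stage described above, mining association rules separately within each patient cluster. It is not the authors' implementation: it assumes the third-party mlxtend package and a toy one-hot table with hypothetical column names.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy one-hot table of patient attributes for a single cluster (hypothetical columns).
cluster_df = pd.DataFrame({"high_bp": [1, 1, 0, 1],
                           "drug_a":  [1, 1, 0, 1],
                           "drug_b":  [0, 1, 1, 0]}, dtype=bool)
itemsets = apriori(cluster_df, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

Running the same mining per cluster, as the paper does, then personalizes the rule set to each patient group rather than to the whole population.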

Book ChapterDOI
01 Jan 2020
TL;DR: In this article, the impact of applying dimensionality reduction during the data transformation phase of the clustering process has been investigated for three most common clustering algorithms k-means clustering, clustering large applications (CLARA), and agglomerative hierarchical clustering (AGNES).
Abstract: With the huge volume of data available as input, modern-day statistical analysis leverages clustering techniques to limit the volume of data to be processed. These input data are mainly sourced from social media channels and typically have high dimensionality due to the diverse features they represent. This is normally referred to as the curse of dimensionality, as it makes the clustering process computationally intensive and less efficient. Dimensionality reduction techniques are proposed as a solution to address this issue. This paper covers an empirical analysis of the impact of applying dimensionality reduction during the data transformation phase of the clustering process. We measured the impact in terms of clustering quality and clustering performance for three of the most common clustering algorithms: k-means clustering, clustering large applications (CLARA), and agglomerative hierarchical clustering (AGNES). Clustering quality is compared using four internal evaluation criteria, namely the Silhouette index, Dunn index, Calinski-Harabasz index, and Davies-Bouldin index, and average execution time is measured as a proxy for clustering performance.
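The experiment described above can be sketched as follows, a hedged illustration on assumed synthetic data: cluster the same dataset with and without PCA and compare internal quality indices and run time. scikit-learn ships the Silhouette, Calinski-Harabasz, and Davies-Bouldin indices; the Dunn index would need a custom helper, as it is not built in, and KMeans stands in for the chapter's three algorithms.

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X = np.random.rand(500, 50)                      # high-dimensional placeholder data
variants = {"raw": X, "pca": PCA(n_components=5).fit_transform(X)}
for name, data in variants.items():
    start = time.perf_counter()
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(data)
    elapsed = time.perf_counter() - start        # clustering-performance proxy
    print(name, round(elapsed, 3),
          silhouette_score(data, labels),        # higher is better
          calinski_harabasz_score(data, labels), # higher is better
          davies_bouldin_score(data, labels))    # lower is better
```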

Journal ArticleDOI
03 Mar 2020
TL;DR: This study provides practical evaluation frameworks for assessing clustering results on gene expression cancer datasets and determines that, among the four clustering algorithms compared, PAM is best for the Affymetrix datasets and DIANA is best for the cDNA datasets.
Abstract: Clustering plays a particularly fundamental role in exploring data, creating predictions, and overcoming anomalies in the data. Clusters that contain parallel, identical characteristics in a dataset are grouped using reiterative algorithms. As real-world data grow day by day, the challenge of perceiving and interpreting the resulting mass of data, which often consists of millions of measurements, is compounded by the intricacy of the huge number of genes in biological networks. To address this challenge, we use clustering algorithms. In this study, we provide a comparative study of the four most popular clustering algorithms: K-Means, PAM, Agglomerative Hierarchical, and DIANA, evaluated on eight real cancer gene expression datasets (four Affymetrix and four cDNA) and a simulated dataset. The comparative results are based upon seven popular cluster validity indices: Average Silhouette index, Corrected Rand index, Variation of Information, Dunn index, Calinski-Harabasz index, Separation index, and Pearson Gamma. We determine that PAM is best for the Affymetrix datasets and DIANA is best for the cDNA datasets among these four clustering algorithms. This study provides practical evaluation frameworks for assessing clustering results on gene expression cancer datasets.

Proceedings ArticleDOI
15 Dec 2020
TL;DR: In this article, the authors propose an approach for automatic clustering of text documents using a Self-Organizing Map (SOM), a type of unsupervised artificial neural network widely used for data analysis, data compression, clustering, and data mining.
Abstract: With the huge number of published research papers, retrieving relevant information is a difficult task for any researcher. Effective clustering algorithms can help improve and simplify the retrieval process. Here, we propose an approach for automatic clustering of text documents using a Self-Organizing Map (SOM), a type of unsupervised artificial neural network widely used for data analysis, data compression, clustering, and data mining. The quality and accuracy of a SOM algorithm depend on the values selected for some of its parameters: the initial learning rate, the SOM matrix dimensions, and the number of iterations. The best values are typically selected by trial and error; in the current paper, however, we suggest a more systematic approach to parameter optimization using a genetic algorithm. The proposed method is applied to cluster three scientific paper datasets using their keywords. Similar research papers were mapped closer to each other, and clustering results were validated using the Dunn index.
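A hedged sketch of the pipeline above (the genetic-algorithm parameter search is elided), assuming the third-party minisom package and toy keyword strings: papers are embedded as TF-IDF vectors, a SOM is trained, and each best-matching unit is treated as a cluster.

```python
import numpy as np
from minisom import MiniSom
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["clustering validation dunn index",       # toy paper keyword strings
        "neural network thermal error model",
        "cluster validity dunn silhouette"]
X = TfidfVectorizer().fit_transform(docs).toarray()

# 3x3 map; learning rate, grid size, and iteration count are exactly the
# parameters the paper tunes with a genetic algorithm (fixed here for brevity).
som = MiniSom(3, 3, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 500)
clusters = [som.winner(x) for x in X]             # best-matching unit per paper
print(clusters)
```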

DOI
01 Jan 2020
TL;DR: This paper examines the transferability of validation indices, such as the Gamma index, Average Silhouette Width or Dunn index to mixed-type data, and the R package clustMixType is extended by these indices and their application is demonstrated.
Abstract: For cluster analysis based on mixed-type data (i.e., data consisting of numerical and categorical variables), comparatively few clustering methods are available. One popular approach to dealing with this kind of problem is an extension of the k-means algorithm (Huang, 1998), the so-called k-prototypes algorithm, which is implemented in the R package clustMixType (Szepannek and Aschenbruck, 2019). It is further known that the selection of a suitable number of clusters k is particularly crucial in partitioning cluster procedures. Many implementations of cluster validation indices in R are not suitable for mixed-type data. This paper examines the transferability of validation indices, such as the Gamma index, Average Silhouette Width, or Dunn index, to mixed-type data. Furthermore, the R package clustMixType is extended by these indices and their application is demonstrated. Finally, the behaviour of the adapted indices is tested in a short simulation study using different data scenarios.
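The chapter's extension lives in the R package clustMixType; as a hedged, language-neutral illustration of the underlying idea (not the package's code), one can compute a Gower-style distance over mixed variables and evaluate the Dunn index directly from that distance matrix:

```python
import numpy as np

def gower_matrix(num, cat):
    """Gower-style distances: range-scaled numeric part plus simple matching."""
    rng = num.max(axis=0) - num.min(axis=0)
    rng[rng == 0] = 1.0                                  # guard constant columns
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / rng
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)
    return np.concatenate([d_num, d_cat], axis=2).mean(axis=2)

def dunn_from_distances(D, labels):
    """Dunn index computed from a precomputed distance matrix."""
    ids = np.unique(labels)
    diam = max(D[np.ix_(labels == c, labels == c)].max() for c in ids)
    sep = min(D[np.ix_(labels == a, labels == b)].min()
              for i, a in enumerate(ids) for b in ids[i + 1:])
    return sep / diam

num = np.random.rand(100, 3)                             # numeric variables
cat = np.random.randint(0, 3, size=(100, 2))             # categorical variables
labels = np.random.randint(0, 2, size=100)               # toy cluster assignment
print(dunn_from_distances(gower_matrix(num, cat), labels))
```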

Proceedings ArticleDOI
01 Aug 2020
TL;DR: A number of handcrafted features for handwritten text are considered, a feature space analysis is performed and an expert system for identifying and clustering handwriting styles using unsupervised methods based on this set of features is built.
Abstract: In this paper, we perform a detailed analysis of feature extraction approaches and existing solutions for the problem of handwriting style clustering. We consider a number of handcrafted features for handwritten text, perform a feature space analysis to find the best set of extracted features, and build an expert system for identifying and clustering handwriting styles using unsupervised methods based on this set. We observed an improvement in clustering results when analyzing the clustering evaluation metrics described in this paper (such as the Dunn index, silhouette, and within- and between-cluster sums of squared errors), so we conclude that clustering based on such a subset is much more efficient. The results outlined below can be used for handwriting style determination, an important step in designing systems for text recognition and localization and for authentication and verification tasks, and can also be applied to the problem of detecting mental or nervous disorders.

Journal ArticleDOI
TL;DR: A novel distance measure for microarray datasets is proposed; the benchmark k-medoids algorithm is used for the clustering task, and the Dunn index is used to validate the clusters obtained with the distance measure.

Journal ArticleDOI
26 Dec 2020
TL;DR: This study aims to group employees based on their level of discipline using the Self Organizing Map (SOM) and K-Means algorithm to make it easier to manage employee work discipline.
Abstract: Managing employee work discipline needs to be done to support the development of an organization. One way to make it easier to manage employee work discipline is to group employees based on their level of discipline. This study aims to group employees based on their level of discipline using the Self-Organizing Map (SOM) and K-Means algorithms. The grouping begins with collecting employee attendance data, then processing the attendance data, including determining the parameters to be used, and ends with implementing the SOM and K-Means clustering algorithms. The groupings obtained from the SOM and K-Means algorithms are then validated using internal validation tests consisting of the Dunn index, the Silhouette index, and the Connectivity index to obtain the best number of clusters and the best algorithm. The validation tests yielded 3 best clusters for the level of discipline, namely the disciplined, moderately disciplined, and undisciplined clusters.


Book ChapterDOI
01 Jan 2020
TL;DR: This paper uses the Firefly and Fuzzy Firefly algorithms separately along with algorithms like FCM, IFCM, and RFCM, and analyses their efficiency using two measures, DB and D, concluding that RFCM with the hyper-tangent kernel and fuzzy firefly produces the best results with the fastest convergence rate.
Abstract: In order to handle the problem of linear separability in early data clustering algorithms, Euclidean distance has been replaced with kernel functions as the measure of similarity. Another problem with these clustering algorithms is the random selection of initial centroids, which affects not only the final result but also the convergence rate. Optimal selection of initial centroids through optimization algorithms like the Firefly or Fuzzy Firefly algorithms provides a partial solution to this problem. In this paper, we focus on two kernels, Gaussian and hyper-tangent, and use the Firefly and Fuzzy Firefly algorithms separately along with algorithms like FCM, IFCM, and RFCM, analysing their efficiency using two measures, DB (Davies-Bouldin) and D (Dunn). Our analysis concludes that RFCM with the hyper-tangent kernel and fuzzy firefly produces the best results with the fastest convergence rate. We use two images, an MRI scan of a human brain and blood cancer cells, for our analysis.

Proceedings ArticleDOI
10 Dec 2020
TL;DR: In this paper, a tweet clustering system is presented that determines topics from many text documents through a text mining method using an ant clustering (AC) technique.
Abstract: The aspects of the life of a public figure, discussed by the community, are often exploited by the news media as topic information to create articles that can attract the attention of readers. Efficiently, the media only needs to pay attention to social media to obtain some of this information. The more information needed, the larger the amount of data involved, so the process becomes hard. In this paper, a tweet clustering system is developed to determine topics from many text documents through a text mining method using an ant clustering (AC) technique. AC is one of the swarm intelligence algorithms, inspired by the behavior of ant colonies in sorting corpses. Evaluation on a small dataset of text documents shows that four topics are successfully identified: economy, social, politics, and government. The developed AC-based tweet clustering system produces an average cluster quality, measured by the Dunn index, of up to 0.3455.

Proceedings ArticleDOI
06 Nov 2020
TL;DR: Wang et al. propose a random forest method of clustering ensemble selection with the Dunn index, based on cluster-ensemble selection and a personal indoor thermal preference model; considering the shortcomings of the irrevocable merging strategy of hierarchical clustering, a hybrid method combining hierarchical and k-medoids partitional clustering is also developed.
Abstract: The random forest algorithm is an ensemble learning method with the decision tree as its base classifier. In an ensemble model, it is not always true that more base classifiers yield a better classification effect, since base classifiers with poor performance may have a negative impact on the final classification result. In order to improve the random forest classification method while ensuring the diversity of the random forest model, this paper proposes a random forest method of clustering ensemble selection with the Dunn index, based on the random forest algorithm of cluster-ensemble selection and a personal indoor thermal preference model. Considering the shortcomings of the irrevocable merging strategy of the hierarchical clustering algorithm, a random forest method of hybrid clustering ensemble selection based on hierarchical clustering and k-medoids partitional clustering is developed. The effectiveness of the proposed methods is verified by classifying personal indoor thermal preferences.

Proceedings ArticleDOI
04 Nov 2020
TL;DR: In this paper, an unsupervised learning technique was used to identify subtypes of cancer from gene expression data obtained from cBioPortal; identifying cancer subtypes can help improve the efficacy and reduce the toxicity of treatments by providing clues for target therapeutics.
Abstract: This study was conducted to review and identify the unsupervised techniques that can be employed to analyze gene expression data in order to identify better subtypes of tumors. Identifying subtypes of cancer helps improve the efficacy and reduce the toxicity of treatments by providing clues for target therapeutics. The process of gene expression data analysis is described in three steps: preprocessing, clustering, and cluster validation. Gene expression data obtained from cBioPortal was analyzed in this research using unsupervised learning techniques. Partitioning around medoids, K-means, and hierarchical clustering techniques with different distance and linkage measures were used in the initial clustering of the expression data. After cluster identification, cluster validation was conducted using internal measures such as the Silhouette and Dunn indices. Relative measures were used to identify the optimal number of clusters. External validations, such as comparing the classes with clinical variables and visual analysis of the classes using heatmaps, were also conducted. After heatmap filtering, it was found that the three cluster analysis results contained meaningful clusters. The analysis with 3 clusters identified using k-means clustering shows significant expression patterns in each cluster.

Book ChapterDOI
20 Feb 2020
TL;DR: In this article, the authors explore the impact of dimensionality on existing standard data stream clustering algorithms and compare them for different stream dimensions using six performance parameters, namely the adjusted Rand index, Dunn index, entropy, F1 measure, purity, and within-cluster sum of squares.
Abstract: Handling stream data is a tedious task. Recently, numerous techniques have been presented for analysing stream data. Stream data clustering is one of the important tasks in stream data mining. A number of application programming interfaces (APIs) are available for implementing stream data clustering, and these APIs can handle stream data of any dimension. The objective of this paper is to explore the impact of dimensionality on existing standard data stream clustering algorithms. Selected standard data stream clustering algorithms are compared for different stream dimensions using six performance parameters, namely the adjusted Rand index, Dunn index, entropy, F1 measure, purity, and within-cluster sum of squares.

Book ChapterDOI
01 Jan 2020
TL;DR: An algorithm for selecting the optimal number of seed points for unknown data, based on two important internal cluster validity indices, namely the Dunn index and Silhouette index, is described; Shannon's entropy with a distance threshold is used to calculate the positions of the seed points.
Abstract: In the present world, clustering is considered one of the most important data mining tools, applied to huge datasets to support futuristic decision-making processes. It is an unsupervised classification technique by which data points are grouped into homogeneous entities. Cluster analysis is used to find the clusters in unlabeled data. The positions of the seed points primarily affect the performance of most partitional clustering techniques, and the correct number of clusters in a dataset plays an important role in judging the quality of a partitional clustering technique. Selecting the initial seeds of K-means clustering is a critical problem for forming the optimal number of clusters with the benefit of fast stability. In this paper, we describe an algorithm for selecting the optimal number of seed points for unknown data based on two important internal cluster validity indices, namely the Dunn index and Silhouette index. Here, Shannon's entropy with a distance threshold is used to calculate the positions of the seed points. The algorithm is applied to different datasets, and the results are comparatively better than those of other methods. Moreover, comparisons with other algorithms in terms of different parameters have been made to distinguish the novelty of our proposed method.

Book ChapterDOI
10 Aug 2020
TL;DR: In this paper, cycle based clustering technique using reversible cellular automata (CAs) where closeness among objects is represented as objects belonging to the same cycle, that is reachable from each other.
Abstract: This work proposes cycle based clustering technique using reversible cellular automata (CAs) where ‘closeness’ among objects is represented as objects belonging to the same cycle, that is reachable from each other. The properties of such CAs are exploited for grouping the objects with minimum intra-cluster distance while ensuring that limited number of cycles exist in the configuration-space. The proposed algorithm follows an iterative strategy where the clusters with closely reachable objects of previous level are merged in the present level using an unique auxiliary CA. Finally, it is observed that, our algorithm is at least at par with the best algorithm existing today.

Book ChapterDOI
26 Sep 2020
TL;DR: This article proposes a method for community partition based on information granularity, which optimizes the social relationship model using a link prediction method, establishes a similarity model of user social relationships, and obtains better I-index and Dunn-index evaluation results than K-means.
Abstract: Social network community partition is conducive to obtaining hidden and valuable knowledge and rules, and is currently a hot research topic. Traditional community mining often analyzes network structure information from a static point of view but ignores the initiative of individual actors, which limits the construction of the community concept model and the effect of community partition. This article proposes a method for community partition based on information granularity. First, we optimize the social relationship model using a link prediction method and establish a similarity model of user social relationships. Second, to address the deficiencies of the K-means clustering algorithm and the high dimensionality and sparsity of the data, the principle of information granularity is introduced into user clustering analysis, and the membership degree and generalized equivalence relation of the user equivalence relation are given. On this basis, we propose a social community partition method based on information granularity. Finally, experiments show that, owing to the effective integration of important information about users' social relations and the introduction of the information granularity method, the proposed model obtains better I-index and Dunn-index evaluation results than K-means.

Journal ArticleDOI
30 Apr 2020
Abstract: An earthquake is a shock or vibration of the earth's surface caused by shifting layers of rock beneath it. This natural phenomenon is common in Indonesia because the country lies at the junction of the Australian, Eurasian, and Pacific plates and is surrounded by the Ring of Fire. Therefore, this study aims to cluster earthquake events in Indonesia and describe the characteristics of each group based on the clustering results. The method used is Fuzzy K-Means clustering, with clusters formed from depth, longitude, and latitude. In this study, the data used are earthquake records with a magnitude greater than or equal to 5 SR, clustered only by depth. Based on the Davies-Bouldin and Dunn indices, the best clustering has 2 clusters, whereby the earthquake data in Indonesia are grouped into deep and shallow clusters.