
Showing papers on "Dunn index published in 2021"


Journal ArticleDOI
TL;DR: In this paper, a game-based k-means (GBK-means) algorithm is proposed, in which cluster centers compete with each other to attract the largest number of similar objects or entities to their cluster.
Abstract: Due to its simplicity, versatility and the diversity of applications to which it can be applied, k-means is one of the well-known algorithms for clustering data. The foundation of this algorithm is based on the distance measure. However, the traditional k-means has some weaknesses that appear in some data sets related to real applications, the most important of which is that it considers only the distance criterion for clustering. Various studies have been conducted to address each of these weaknesses to achieve a balance between quality and efficiency. In this paper, a novel variant of the original k-means algorithm is proposed. This approach leverages the power of bargaining game modelling in the k-means algorithm for clustering data. In this novel setting, cluster centres compete with each other to attract the largest number of similar objects or entities to their cluster. Thus, the centres keep changing their positions so that they have smaller distances to the maximum possible data than other cluster centres. We name this new algorithm the game-based k-means (GBK-means) algorithm. To show the superiority and efficiency of GBK-means over conventional clustering algorithms, namely k-means and fuzzy k-means, we use the following synthetic and real-world data sets: (1) a series of two-dimensional synthetic data sets; and (2) ten benchmark data sets that are widely used in different clustering studies. The evaluation criteria show that GBK-means is able to cluster data more accurately than the classical algorithms based on eight evaluation metrics, namely the F-measure, the Dunn index (DI), the Rand index (RI), the Jaccard index (JI), normalized mutual information (NMI), normalized variation of information (NVI), the measure of concordance and the error rate (ER).
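The Dunn index (DI) used as an evaluation metric above rewards partitions whose clusters are compact and well separated: it is the ratio of the smallest between-cluster distance to the largest within-cluster diameter. As a point of reference for the metric only (the game-based update of GBK-means is not reproduced here), a minimal NumPy/SciPy sketch evaluated on ordinary k-means labels might look like this:

```python
# Minimal Dunn index sketch; scikit-learn does not ship this index, so it is
# computed directly from pairwise distances.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def dunn_index(X, labels):
    """Dunn index = min inter-cluster distance / max intra-cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Largest within-cluster diameter (max pairwise distance inside a cluster).
    max_diameter = max(cdist(c, c).max() for c in clusters)
    # Smallest distance between points belonging to two different clusters.
    min_separation = min(
        cdist(ci, cj).min()
        for i, ci in enumerate(clusters)
        for cj in clusters[i + 1:]
    )
    return min_separation / max_diameter

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Dunn index:", dunn_index(X, labels))
```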

30 citations


Journal ArticleDOI
TL;DR: In this paper, a high-resolution, detailed electric load dataset, collected by smart meters from nearly a thousand households in Hungary (many of them single-family houses), was assessed.

23 citations


Journal ArticleDOI
TL;DR: A novel method is proposed in which an Energy Curve is used instead of a histogram, together with Otsu’s method and the Harmony Search Algorithm, to compute optimized gray levels; comparisons with various histogram-based optimization algorithms show that the proposed method is superior.

18 citations


Journal ArticleDOI
01 May 2021
TL;DR: A parallel and scalable model, referred to as S-DI (Scalable Dunn Index), is proposed to compute the Dunn Index for internal validation of clustering results; it shows good scalability and reliable validation compared to other existing measures when handling large-scale data.
Abstract: Parallelizing data clustering algorithms has attracted the interest of many researchers over the past few years. Many efficient parallel algorithms have been proposed to build partitionings over huge volumes of data. The effectiveness of these algorithms is attributed to the distribution of data among a cluster of nodes and to the parallel computation models. Despite the effectiveness of parallel models in dealing with increasing volumes of data, little work has been done on the validation of big clusters. To deal with this issue, we propose a parallel and scalable model, referred to as S-DI (Scalable Dunn Index), to compute the Dunn Index measure for internal validation of clustering results. Rather than computing the Dunn Index on a single machine in the clustering validation process, the newly proposed measure is computed by distributing the partitioning among a cluster of nodes using a customized parallel model under the Apache Spark framework. The proposed S-DI is also enhanced by a Sketch and Validate sampling technique which aims to approximate the Dunn Index value by using a small representative data sample. Different experiments on simulated and real datasets showed good scalability of our proposed measure and reliable validation compared to other existing measures when handling large-scale data.
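The distributed Spark implementation of S-DI is not reproduced here; the sketch below only illustrates the sampling idea behind Sketch and Validate, approximating the Dunn index from a small stratified subsample of each cluster. The sampling fraction and helper functions are assumptions, not the authors' code:

```python
# Hypothetical single-machine sketch of the "Sketch and Validate" idea:
# approximate the Dunn index from a stratified subsample of each cluster
# instead of the full partition. Not the authors' Spark-based S-DI.
import numpy as np
from scipy.spatial.distance import cdist

def dunn(X, labels):
    groups = [X[labels == c] for c in np.unique(labels)]
    diameter = max(cdist(g, g).max() for g in groups)          # largest within-cluster spread
    separation = min(cdist(gi, gj).min()                       # smallest between-cluster gap
                     for i, gi in enumerate(groups) for gj in groups[i + 1:])
    return separation / diameter

def approx_dunn(X, labels, sample_frac=0.05, seed=0):
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        size = min(len(members), max(2, int(sample_frac * len(members))))
        keep.append(rng.choice(members, size=size, replace=False))  # keep every cluster represented
    idx = np.concatenate(keep)
    return dunn(X[idx], labels[idx])
```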

15 citations


Journal ArticleDOI
TL;DR: A multi-objective automatic query plan recommendation method, combining incremental DBSCAN and NSGA-II, is proposed; it outperforms other well-known approaches for query processing and improves the accuracy of clustering.

12 citations


Journal ArticleDOI
TL;DR: This work presents a content-based advertisement viewability prediction framework using Artificial Intelligence (AI) methods and confirms that various in-content ad features, i.e., gender, type, discount, layout, and crowdedness play a vital role in predicting an ad’s viewability.
Abstract: In the current competitive corporate world, organizations rely on their products’ advertisements to surpass competitors in reaching out to a larger pool of customers. This forces companies to focus on advertisement quality. This work presents a content-based advertisement viewability prediction framework using Artificial Intelligence (AI) methods. The primary focus here is on the web advertisements available on various online shopping websites. Most of the past work in this domain emphasizes the scroll depth and dwell time of an ad. However, the features that directly influence the viewability of an ad have been overlooked in the past. Unlike other approaches, this work considers multiple in-ad features that directly influence its viewability. Some of these include color, urgency, language, offers, discount, type, and prominent gender. This work presents an AI-based framework for identifying the features contributing to increased viewability of ads. Feature selection techniques are executed on the dataset to extract important attributes. Afterward, clustering is applied to confirm the number of class labels assigned to the instances. To validate the clustering results, three validation indices are used here, namely the Davies-Bouldin Index, the Dunn Index, and the Silhouette Coefficient. Five classifiers, i.e., Support Vector Machine, k-Nearest Neighbors, Artificial Neural Network, Random Forest, and Gradient Regression Boosting Trees, are trained using multiple features, and the viewability of an ad is predicted. The obtained results confirm that various in-content ad features, i.e., gender, type, discount, layout, and crowdedness, play a vital role in predicting an ad’s viewability.
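As a hedged sketch of the general pipeline described above (cluster the instances, validate the grouping with internal indices, then train supervised classifiers on the confirmed labels), the code below uses random placeholder features rather than the paper's ad dataset and trains a single Random Forest as a representative classifier:

```python
# Hedged sketch of the pipeline: cluster, validate with internal indices,
# then train a classifier on the confirmed labels. Features are random
# placeholders, not the paper's in-ad attributes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 7)   # stand-in for in-ad features (color, discount, layout, ...)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Silhouette:", silhouette_score(X, labels))
print("Davies-Bouldin:", davies_bouldin_score(X, labels))

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("Held-out accuracy:", clf.score(X_te, y_te))
```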

12 citations


Journal ArticleDOI
TL;DR: This work introduces the friends-of-friends concept into the random walk process so that the edge weights are determined using an inclusive criterion; the focus of the work remains random walk-based clustering of graphs.

10 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider what happens if we treat internal cluster validity measures (such as the Calinski-Harabasz, Dunn, or Davies-Bouldin indices) as objective functions in unsupervised learning activities.

9 citations


DOI
03 Feb 2021
TL;DR: Results show that the PCA can effectively be employed in the clustering process as a check tool for the K-means and Hierarchical clustering.
Abstract: This paper addresses the use of clustering algorithms in customer segmentation to define the marketing strategy of a credit card company. Customer segmentation divides customers into groups based on common characteristics, which is useful for banks, businesses, and companies to improve their products or service opportunities. The analysis explores the applications of K-means, Hierarchical clustering, and Principal Component Analysis (PCA) in identifying the customer segments of a company based on their credit card transaction history. The dataset used in the project summarizes the usage behavior of 8950 active credit card holders in the last 6 months, and our aim is to perform customer segmentation in the most accurate way using clustering techniques. The project uses two approaches for customer segmentation: first, by considering all variables in the clustering algorithms using Hierarchical clustering and K-means; second, by applying dimensionality reduction through Principal Component Analysis (PCA) to the dataset, then identifying the optimal number of clusters, and repeating the clustering analysis with the updated number of clusters. Results show that PCA can effectively be employed in the clustering process as a check tool for K-means and Hierarchical clustering.
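As a hedged illustration of the two approaches described above (all variables versus PCA-reduced data, with the number of clusters re-estimated), the following scikit-learn sketch uses a random placeholder matrix in place of the real credit-card usage data; the column count and k range are assumptions:

```python
# Hedged sketch of both approaches with scikit-learn; `X` is a random
# placeholder for the credit-card usage matrix (8950 holders).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = StandardScaler().fit_transform(np.random.rand(8950, 17))

# Approach 1: cluster on all variables.
labels_full = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Approach 2: reduce with PCA, re-estimate the number of clusters, and cluster again.
X_pca = PCA(n_components=2).fit_transform(X)
scores = {k: silhouette_score(
              X_pca,
              KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)
labels_pca = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_pca)
```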

7 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel ensemble-clustering-based method for large probabilistic graphs that relies on co-occurrences of node pairs based on the probability of the corresponding common cluster graphs, and presents a probabilistic co-association matrix as a consensus function to integrate the base clustering results.
Abstract: Graphs are commonly used to express the communication of various data. Faced with uncertain data, we have probabilistic graphs. As a fundamental problem on such graphs, clustering has many applications in analyzing uncertain data. In this paper, we propose a novel method based on ensemble clustering for large probabilistic graphs. To generate the ensemble clusterings, we develop a set of probable possible worlds of the initial probabilistic graph. Then, we present a probabilistic co-association matrix as a consensus function to integrate the base clustering results. It relies on co-occurrences of node pairs based on the probability of the corresponding common cluster graphs. We also apply two improvements, one before and one after ensemble generation. Before generation, we append neighborhood information based on node features to the initial graph to achieve a more accurate estimation of the probability between nodes. After generation, we use a supervised metric-learning-based Mahalanobis distance to automatically learn a metric from the ensemble clusters, which aims to capture the crucial features of the base clustering results. We evaluate our work using five real-world datasets and three clustering evaluation metrics, namely the Dunn index, the Davies–Bouldin index, and the Silhouette coefficient. The results show the impressive performance of clustering large probabilistic graphs.
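The probability weighting and metric-learning steps are specific to the paper, but the core co-association consensus is simple to sketch: entry (i, j) counts the fraction of base clusterings in which nodes i and j fall in the same cluster, and 1 minus that value can be treated as a distance for a final consensus clustering. The toy labelings below are placeholders:

```python
# Minimal co-association consensus sketch (the paper's probability weighting
# and Mahalanobis metric learning are omitted).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def co_association(base_labelings):
    """base_labelings: list of 1-D label arrays, one per base clustering."""
    n = len(base_labelings[0])
    C = np.zeros((n, n))
    for labels in base_labelings:
        labels = np.asarray(labels)
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / len(base_labelings)

base = [[0, 0, 1, 1, 2], [0, 0, 1, 2, 2], [0, 1, 1, 2, 2]]   # toy base clusterings
C = co_association(base)

# Treat 1 - C as a dissimilarity and cut a hierarchy into a consensus clustering.
D = 1.0 - C
np.fill_diagonal(D, 0.0)
consensus = fcluster(linkage(squareform(D), method="average"), t=3, criterion="maxclust")
print(consensus)
```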

6 citations


Journal ArticleDOI
TL;DR: In this article, the Euclidean distance, the dynamic time warping (DTW) and the generalized summed discrete Frechet dissimilarity were implemented with three linkage strategies ("average," "complete," and "Ward").
Abstract: BACKGROUND Obstructive sleep apnea (OSA) is a chronic disease characterized by recurrent pharyngeal collapses during sleep. In the most severe cases, continuous positive airway pressure (CPAP) treatment consists of keeping the airways open by administering mild air pressure. This treatment faces adherence issues. OBJECTIVES Eight hundred and forty-eight subjects were equipped with CPAP prescribed at the Grenoble University Hospital between 2016 and 2018. Their daily CPAP use was recorded during the first 3 months. Our aim is to cluster these adherence time series. Using hierarchical agglomerative clustering, we focused on the choices of the dissimilarity measure and the internal cluster validation index (CVI). METHODS The Euclidean distance, dynamic time warping (DTW) and the generalized summed discrete Frechet dissimilarity were implemented with three linkage strategies ("average," "complete," and "Ward"). The performance of each method (dissimilarity and linkage) was evaluated in a simulation study through the adjusted Rand index (ARI). The Ward linkage with DTW dissimilarity provided the best ARI. Then six different internal CVIs (Silhouette, Calinski Harabasz, Davies Bouldin, Modified Davies Bouldin, Dunn, and COP) were compared on their ability to choose the best number of clusters. The Dunn index beat the others. RESULTS CPAP data were clustered with the Ward linkage, the DTW dissimilarity and the Dunn index. This identified six clusters, from a cluster of patients (N = 29 subjects) who stopped the therapy early on to a cluster (N = 105) with increasing adherence over time. The other clusters were extremely good users (N = 151), good users (N = 150), moderate users (N = 235), and poor adherers (N = 178).
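As a hedged sketch of the selected pipeline (DTW dissimilarity with Ward linkage, cut into six clusters), the code below uses a naive DTW implementation and random placeholder adherence series; note that Ward linkage is formally defined for Euclidean distances, so applying it to a precomputed DTW matrix, as here, is a pragmatic approximation:

```python
# Hedged sketch: naive O(len^2) DTW plus Ward-linkage hierarchical clustering.
# The adherence series are random placeholders, not the CPAP data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

series = np.random.rand(20, 90)                   # 20 patients x 90 nightly usage values
n = len(series)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw(series[i], series[j])

Z = linkage(squareform(dist), method="ward")       # Ward on a precomputed DTW matrix (approximation)
clusters = fcluster(Z, t=6, criterion="maxclust")  # six adherence profiles, as in the paper
```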

Journal ArticleDOI
31 Mar 2021
TL;DR: The best cluster analysis was Agglomerative Ward Linkage, which produced three clusters and enables better welfare policies to be made based on the dominant indicators found in each city.
Abstract: The National Medium Term Development Plan 2020-2024 states that one of the visions of national development is to accelerate the distribution of welfare and justice. Cluster analysis groups objects into several smaller groups such that the objects in one group have similar characteristics. This study was conducted to find the best clustering method and to classify cities based on the level of welfare in Java. The cluster analyses used were hard clustering methods such as K-Means, K-Medoids (PAM and CLARA), and Hierarchical Agglomerative clustering, as well as soft clustering, namely Fuzzy C-Means. The study uses the elbow method, the silhouette method, and gap statistics to determine the optimal number of clusters. From the evaluation results of the silhouette coefficient, Dunn index, connectivity coefficient, and Sw/Sb ratio, it was found that the best cluster analysis was Agglomerative Ward Linkage, which produced three clusters. The first cluster consists of 27 cities with moderate welfare, the second cluster consists of 16 cities with high welfare, and the third cluster consists of 76 cities with low welfare. With these clustering results, the governments of cities in Java can make better welfare policies based on the dominant indicators found in each cluster.

Journal ArticleDOI
TL;DR: In this article, an L2-weighted K-means clustering algorithm is proposed to estimate the drilling time and depth for different soil materials and land layers, and the proposed clustering scheme is evaluated using widely used evaluation metrics such as the Dunn Index, the Davies-Bouldin index (DBI), the Silhouette coefficient (SC), and the Calinski-Harabasz Index (CHI).
Abstract: Recently, groundwater scarcity has accelerated drilling operations worldwide, as drilled boreholes are essential for replenishing the needs of safe drinking water and achieving long-term sustainable development goals. However, the quest for optimal drilling efficiency is ever continuing. This paper aims to provide valuable insights into borehole drilling data by utilizing the potential of advanced analytics, employing several enhanced cluster analysis techniques to propel drilling efficiency optimization and knowledge discovery. The study proposes an L2-weighted K-means clustering algorithm in which the mean is computed from a transformed, weighted feature space. To verify the effectiveness of our proposed L2-weighted K-means algorithm, we performed a comparative analysis of the proposed work with traditional clustering algorithms to estimate the digging time and depth for different soil materials and land layers. The proposed clustering scheme is evaluated using widely used evaluation metrics such as the Dunn Index, the Davies–Bouldin index (DBI), the Silhouette coefficient (SC), and the Calinski–Harabasz Index (CHI). The study results highlight the significance of the proposed clustering algorithm, as it achieved better clustering results than conventional clustering approaches. Moreover, to facilitate subsequent learning and achieve reliable classification and generalization, we performed feature extraction based on the time interval of the drilling process according to soil material and land layer. We formulated the solution by grouping the extracted features into six different blocks to achieve our desired objective. Each block corresponds to various characteristics of soil materials and land layers. Extracted features are examined and visualized in point cloud space to analyze the water level patterns, depth, and days required to complete the drilling operations.
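The paper's exact L2-weighted update rule is not given above, so the sketch below illustrates only one plausible reading: scale each feature by a weight and L2-normalize the rows before running standard k-means, so that centroids are computed in a transformed weighted feature space. Feature names, weights, and data are assumptions:

```python
# Hedged sketch only: one plausible reading of computing means in a
# transformed, weighted feature space. Weights and features are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

X = np.random.rand(400, 4)            # placeholder drilling features (depth, layer, time, level)
w = np.array([2.0, 1.0, 1.5, 0.5])    # assumed per-feature weights

Xw = X * w                                            # weighted feature space
Xw = Xw / np.linalg.norm(Xw, axis=1, keepdims=True)   # L2 row normalisation (assumption)

labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(Xw)
print("DBI:", davies_bouldin_score(Xw, labels))
print("CHI:", calinski_harabasz_score(Xw, labels))
```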

Proceedings ArticleDOI
19 Feb 2021
TL;DR: In this paper, an advanced model for the derivation of cardiac hemodynamic parameters from the first derivative impedance signal was proposed, which is based on unsupervised learning of morphological features of impedance plethysmography (IPG).
Abstract: Nowadays, unsupervised learning presents a new approach to analyzing the various hidden patterns inside medical data. Still, it is a great challenge to apply unsupervised learning and produce valuable results, especially for the cardiac system. This paper proposes an advanced model for the derivation of cardiac hemodynamic parameters from the first-derivative impedance signal. This study aims to analyze the plethysmographic wave by non-invasive measurement of the electrical impedance of the limb. The proposed model is based on unsupervised learning of morphological features of impedance plethysmography (IPG). We conducted and compared performance evaluations of three clustering techniques on recorded impedance data to perceive the cardiac cycle characteristics. The findings can potentially assist in determining several vital health care variables such as blood pressure, arterial stiffness and respiration rate. The proposed model was tested on a recorded IPG dataset, and it achieved a DB index of 0.13 and a Dunn index of 0.87 with agglomerative clustering for an optimal number of clusters.

Journal ArticleDOI
TL;DR: A recommendation framework is proposed that uses registered water consumption values as input data and provides meter replacement recommendations; results show that the proposed framework detects more compact clusters with smaller variance.
Abstract: Due to their structure and usage conditions, water meters face degradation, breaking, freezing, and leakage problems. There are various studies intended to determine the appropriate time to replace degraded ones. Earlier studies have used several features, such as user meteorological parameters, usage conditions, water network pressure, and the structure of meters, to detect failed water meters. This article proposes a recommendation framework that uses registered water consumption values as input data and provides meter replacement recommendations. This framework takes time series of registered consumption values and preprocesses them in two rounds to extract effective features. Then, multiple un-/semi-supervised outlier detection methods are applied to the processed data and assign outlier/normal labels to them. At the final stage, a hypergraph-based ensemble method receives the labels and combines them to discover the most suitable label. Due to the unavailability of ground-truth labeled data for meter replacement, we compare our method with respect to its FPR and two internal metrics: the Dunn index and the Davies-Bouldin index. The results of our comparative experiments show that the proposed framework detects more compact clusters with smaller variance.
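The hypergraph-based consensus itself is not reproduced here; as a simplified, hedged stand-in, the sketch below combines the outlier/normal labels of several standard unsupervised detectors by majority vote, on placeholder consumption features:

```python
# Simplified stand-in for the ensemble step: majority vote over the labels of
# several unsupervised outlier detectors (the paper uses a hypergraph-based
# consensus instead). Consumption features are random placeholders.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

X = np.random.rand(500, 12)      # e.g. 12 monthly consumption features per meter

votes = np.stack([
    IsolationForest(random_state=0).fit_predict(X),
    LocalOutlierFactor().fit_predict(X),
    OneClassSVM(nu=0.05).fit_predict(X),
])                                # each detector returns -1 (outlier) or +1 (normal)

# A meter is flagged for replacement when most detectors call it an outlier.
flagged = (votes == -1).sum(axis=0) >= 2
print("Meters flagged:", np.where(flagged)[0])
```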

Journal ArticleDOI
01 Jan 2021
TL;DR: This work introduces an efficient model using Singular Value Decomposition (SVD) as a method for dimensionality reduction and K-means clustering as the classification method, and demonstrates that the proposed method is able to outperform other existing methods.
Abstract: A recommender system is a technique and tool for filtering massive, overloaded information and suggesting the most useful information to the user in a personalized manner. In the era of "Big Data", researchers experience many problems in processing big data accurately and efficiently. In this work, we introduce an efficient model using Singular Value Decomposition (SVD) as a method for dimensionality reduction and K-means clustering as the classification method. Our proposed method and its corresponding results have been evaluated and compared with other existing methods using metrics such as standard deviation (SD), mean absolute error (MAE), root mean square error (RMSE), t-value, Dunn index, average similarity and computational time on two publicly available datasets, the Flixter dataset and the MovieLens dataset. The results demonstrate that our proposed method is able to outperform other existing methods.
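A minimal sketch of the described pipeline, assuming a user-item rating matrix (random placeholder values here rather than the Flixter or MovieLens data), combines scikit-learn's TruncatedSVD for dimensionality reduction with k-means for grouping users:

```python
# Minimal sketch: reduce the user-item rating matrix with a truncated SVD,
# then group users with k-means. Ratings are random placeholders.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

ratings = np.random.randint(0, 6, size=(1000, 300)).astype(float)   # users x items

svd = TruncatedSVD(n_components=20, random_state=0)
user_factors = svd.fit_transform(ratings)            # low-dimensional user representation

user_clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(user_factors)
# Recommendations for a user can then be drawn from the top-rated items of
# the cluster that the user belongs to.
```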

Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this paper, the authors addressed the problem of automated detection of safe zone(s) for helicopter landing in hazardous environments from videos captured by an Unmanned Aerial Vehicle (UAV).
Abstract: In this paper, we have addressed the problem of automated detection of safe zone(s) for helicopter landing in hazardous environments from videos captured by an Unmanned Aerial Vehicle (UAV). The unconstrained motion of the video-capturing drone (the UAV in our case) makes the problem further difficult. The solution pipeline consists of natural landmark detection and tracking, stereo-pair generation using constrained graph clustering, digital terrain map construction and safe landing zone detection. The main methodological contribution lies in mathematically formulating the epipolar constraint and then using it in a Minimum Spanning Tree (MST) based graph clustering approach. We have also made publicly available the AHL (Autonomous Helicopter Landing) dataset, a new aerial video dataset captured by a drone, with annotated ground-truths. Experimental comparisons with other competing clustering methods, i) in terms of the Dunn Index and Davies Bouldin Index, as well as ii) for frame-level safe zone detection in terms of F-measure and confusion matrix, clearly demonstrate the effectiveness of the proposed formulation.
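The epipolar-constraint formulation is specific to the paper and is not reproduced here, but the underlying MST-based clustering idea can be sketched generically: build a minimum spanning tree over pairwise distances and cut its heaviest edges to obtain clusters. The feature points and the number of clusters below are placeholders:

```python
# Generic MST-based clustering sketch: remove the k-1 heaviest tree edges and
# read the clusters off the connected components. The paper's epipolar
# constraint weighting is omitted.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(50, 2)                       # placeholder feature points
k = 3                                           # desired number of clusters

dist = squareform(pdist(X))
mst = minimum_spanning_tree(dist).toarray()     # upper-triangular edge weights

edges = np.argwhere(mst > 0)
weights = mst[mst > 0]
for i, j in edges[np.argsort(weights)[-(k - 1):]]:   # drop the k-1 heaviest edges
    mst[i, j] = 0

n_components, labels = connected_components(mst, directed=False)
print(n_components, labels)
```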

Journal ArticleDOI
18 Feb 2021
TL;DR: In this paper, a decision theoretic rough set-based neighborhood selection process is developed for self-organizing maps, and the results are evaluated in terms of the DB index, Dunn index, quantization error, ARI, and NMI.
Abstract: A decision theoretic rough set-based neighborhood selection process is developed for self-organizing maps. While the neighborhood of the winner neuron is selected based on the probability of its associativity to the winner neuron, the selected neighborhood is updated using a new method which combines the probability of its associativity and the Gaussian function. This approach provides better results as compared to self-organizing map and other clustering algorithms on several real-life datasets. The results are evaluated in terms of DB index, Dunn index, quantization error, ARI, and NMI.

Proceedings ArticleDOI
08 Sep 2021
TL;DR: In this paper, the authors present a technical analysis of the Traversal Optimisation Algorithm (TOA) for clustering and of the K-means clustering algorithm, and rigorously test TOA against different data specifications.
Abstract: This research aims to present a technical analysis of the Traversal Optimisation Algorithm (TOA) for clustering and of the K-means clustering algorithm. The goal is to rigorously test this algorithm against different data specifications beyond what has previously been used with K-means, without artificially and subjectively setting the initial number of clusters. The experimental evaluation involves the use of diverse cluster optimisation techniques for K-means while applying a wider range of internal validation methods, such as the Davies-Bouldin Index, the Dunn Index and the Silhouette Method, for appraising the cluster quality of the Traversal Optimisation Algorithm, while at the same time not compromising the configuration of the default algorithm. The findings in this work show that the optimisation algorithm’s clustering quality, as calculated by multiple internal validity indices, can be very poor when operating on datasets with varying characteristics. This is owing to the algorithm’s lack of any add-on mechanism for computing a priori the optimal number of clusters that a dataset needs. The results reveal that in data processing contexts where the number of clusters is specified, the TOA yields a favourable cost-benefit in terms of run-time complexity and clustering quality.

Journal ArticleDOI
TL;DR: The method shows that EC can achieve better results and produce clusters with higher robustness and accuracy than other hierarchical clustering methods.
Abstract: Ensemble Clustering (EC) methods have become more popular in recent years. In these methods, several primary clustering algorithms are taken as inputs and a single clustering is generated by combining their results to achieve the best outcome. In this paper, we considered three hierarchical methods, namely single-link, average-link, and complete-link, as the primary clusterings, and their results were combined with each other. This combination was done based on a correlation matrix. The base algorithms were combined in pairs and as a triple, and the results were evaluated as well. The IMDB film dataset was clustered based on its existing features. The CH, Silhouette and Dunn Index criteria were used to evaluate the results. These criteria evaluate the clustering quality by calculating intra-cluster and inter-cluster distances. The CH index had the highest value when all three base clusterings were combined. Our method shows that EC can achieve better results and produce clusters with higher robustness and accuracy.