A Robust Distributed Big Data Clustering-based on Adaptive Density Partitioning using Apache Spark

doi:10.3390/SYM10080342

Open Access, Journal Article

Behrooz Koohmareh Hosseini, +1 more

- 15 Aug 2018 -

Symmetry

- Vol. 10, Iss: 8, pp 342

TLDR

A clustering approach developed on the Apache Spark framework shows superior precision and noise robustness compared with recent research, and comparison with similar approaches shows its advantages in scalability, performance, and computation cost.

Abstract:

Unsupervised machine learning and knowledge discovery from large-scale datasets have recently attracted a lot of research interest. The present paper proposes a distributed big data clustering approach based on adaptive density estimation. The proposed method is developed on the Apache Spark framework and tested on several prevalent datasets. In the first step of the algorithm, the input data is divided into partitions using a Bayesian type of Locality Sensitive Hashing (LSH). Partitioning makes the processing fully parallel and much simpler by avoiding unneeded calculations. Each step of the proposed algorithm is completely independent of the others, and no serial bottleneck exists anywhere in the clustering procedure. Locality preservation also filters out outliers and enhances the robustness of the proposed approach. Density is defined on the basis of the Ordered Weighted Averaging (OWA) distance, which makes clusters more homogeneous. According to the density of each node, the local density peaks are detected adaptively. Merging the local peaks yields the final cluster centers, and each remaining data point is assigned to the cluster with the nearest center. The proposed method has been implemented and compared with similar recently published work. The cluster validity indexes achieved by the proposed method show its superiority in precision and noise robustness over recent research. Comparison with similar approaches also shows the superiority of the proposed method in scalability, performance, and computation cost. The proposed method is a general clustering approach, and it has been applied to gene expression clustering as a sample application.
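The pipeline summarized in the abstract (partition, estimate density, find local peaks, merge, assign) can be illustrated with a small single-machine sketch. It is not the authors' implementation: plain random-hyperplane LSH stands in for the Bayesian LSH of the paper, density is assumed to be a simple neighbour count within a fixed OWA radius, and the peak and merge rules are simplified guesses. All function names and parameters (`radius`, `merge_radius`, `n_planes`) are hypothetical.

```python
import random
from collections import defaultdict


def owa_distance(x, y, weights):
    """OWA distance: weights are applied to per-dimension gaps sorted largest first."""
    gaps = sorted((abs(a - b) for a, b in zip(x, y)), reverse=True)
    return sum(w * g for w, g in zip(weights, gaps))


def lsh_partition(points, n_planes, seed=0):
    """Group point indices by the sign pattern of random projections.

    Assumption: ordinary random-hyperplane LSH, not the paper's Bayesian variant.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(sum(a * b for a, b in zip(plane, p)) >= 0 for plane in planes)
        buckets[key].append(idx)
    return list(buckets.values())


def local_peaks(points, bucket, weights, radius):
    """Return (density, index) for points that no neighbour in the bucket outranks.

    Assumption: density is the number of bucket members within `radius`.
    """
    neighbours = {
        i: [j for j in bucket
            if j != i and owa_distance(points[i], points[j], weights) <= radius]
        for i in bucket
    }
    density = {i: len(neighbours[i]) for i in bucket}

    def rank(i):
        # Higher density wins; ties go to the lower index so one point wins.
        return (density[i], -i)

    return [(density[i], i) for i in bucket
            if all(rank(i) > rank(j) for j in neighbours[i])]


def merge_peaks(points, peaks, weights, merge_radius):
    """Greedily keep the densest peaks, dropping any peak close to a kept one."""
    centers = []
    for _, i in sorted(peaks, key=lambda p: (-p[0], p[1])):
        if all(owa_distance(points[i], points[c], weights) > merge_radius
               for c in centers):
            centers.append(i)
    return centers


def cluster(points, weights, radius, merge_radius, n_planes=2, seed=0):
    """Run the full sketch; returns (center indices, label per point)."""
    peaks = []
    # Each bucket is processed independently, so this loop could run in parallel.
    for bucket in lsh_partition(points, n_planes, seed):
        peaks.extend(local_peaks(points, bucket, weights, radius))
    centers = merge_peaks(points, peaks, weights, merge_radius)
    labels = [
        min(range(len(centers)),
            key=lambda k: owa_distance(p, points[centers[k]], weights))
        for p in points
    ]
    return centers, labels
```

In this sketch an isolated point counts as its own peak, so it does not reproduce the outlier filtering described in the abstract; a minimum-density threshold would be one way to approximate it. The per-bucket loop is the part that a distributed framework such as Spark would presumably run in parallel.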
