A Robust Distributed Big Data Clustering-based on Adaptive Density Partitioning using Apache Spark
TLDR
The proposed clustering approach, built on the Apache Spark framework, outperforms recent work in precision and noise robustness, and comparison with similar approaches shows its advantages in scalability, performance, and computation cost.
Abstract:
Unsupervised machine learning and knowledge discovery from large-scale datasets have recently attracted a lot of research interest. The present paper proposes a distributed big data clustering approach based on adaptive density estimation. The proposed method is built on the Apache Spark framework and tested on several widely used datasets. In the first step of the algorithm, the input data is divided into partitions using a Bayesian type of Locality Sensitive Hashing (LSH). Partitioning makes the processing fully parallel and much simpler by avoiding unneeded calculations. Each step of the algorithm is completely independent of the others, so no serial bottleneck exists anywhere in the clustering procedure. Locality preservation also filters out outliers and enhances the robustness of the approach. Density is defined on the basis of the Ordered Weighted Averaging (OWA) distance, which makes clusters more homogeneous. The local density peaks are then detected adaptively according to the density of each node. Merging the local peaks yields the final cluster centers, and every other data point is assigned to the cluster with the nearest center. The proposed method has been implemented and compared with similar recently published methods. The cluster validity indexes obtained by the proposed method show its superiority in precision and noise robustness over recent work. Comparison with similar approaches also shows the advantages of the proposed method in scalability, performance, and computation cost. The proposed method is a general clustering approach, and it has been applied to gene expression clustering as a sample application.
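The pipeline described in the abstract (partition, estimate density, pick local peaks, merge, assign) can be sketched on a single machine as follows. This is only an illustrative sketch, not the authors' implementation: it uses plain sign-of-projection LSH instead of the Bayesian LSH variant, a fixed cutoff instead of adaptive peak detection, a greedy merge rule, and ordinary Python lists instead of Spark RDDs. All function names and the parameters `cutoff`, `merge_radius`, and `min_bucket` are hypothetical.

```python
from collections import defaultdict

def owa_distance(a, b, weights):
    """OWA distance: weights are applied to per-dimension gaps sorted in descending order."""
    gaps = sorted((abs(x - y) for x, y in zip(a, b)), reverse=True)
    return sum(w * g for w, g in zip(weights, gaps))

def lsh_key(point, hyperplanes):
    """Bucket key = sign pattern of the point against each hyperplane (plain LSH, an assumption)."""
    return tuple(sum(h * x for h, x in zip(plane, point)) >= 0 for plane in hyperplanes)

def local_peak(bucket, weights, cutoff):
    """Densest point of one bucket; density = number of bucket members within `cutoff`."""
    def density(p):
        return sum(1 for q in bucket if owa_distance(p, q, weights) < cutoff)
    return max(bucket, key=density)

def merge_peaks(peaks, weights, merge_radius):
    """Greedy merge (assumption): a peak becomes a center only if no earlier center is within `merge_radius`."""
    centers = []
    for p in peaks:
        if all(owa_distance(p, c, weights) >= merge_radius for c in centers):
            centers.append(p)
    return centers

def cluster(points, hyperplanes, weights, cutoff=1.0, merge_radius=3.0, min_bucket=2):
    # Step 1: partition the data by LSH key; each bucket could be processed independently.
    buckets = defaultdict(list)
    for p in points:
        buckets[lsh_key(p, hyperplanes)].append(p)
    # Step 2: one local density peak per bucket; tiny buckets are treated as outliers (assumption).
    peaks = [local_peak(b, weights, cutoff) for b in buckets.values() if len(b) >= min_bucket]
    # Step 3: merge nearby local peaks into the final cluster centers.
    centers = merge_peaks(peaks, weights, merge_radius)
    # Step 4: assign every point to the cluster with the nearest center.
    labels = [min(range(len(centers)), key=lambda k: owa_distance(p, centers[k], weights))
              for p in points]
    return centers, labels

# Two well-separated toy blobs, axis-aligned hyperplanes, uniform OWA weights.
points = [(5.0, 5.0), (5.2, 5.1), (4.9, 5.3), (5.1, 4.8),
          (-5.0, -5.0), (-5.2, -5.1), (-4.9, -5.3), (-5.1, -4.8)]
centers, labels = cluster(points, hyperplanes=[(1.0, 0.0), (0.0, 1.0)], weights=[0.5, 0.5])
```

In a Spark setting, step 1 would roughly correspond to keying records by their hash and steps 2 and 3 to per-partition work followed by a small merge on the collected peaks, but that mapping is an assumption and is not taken from the paper.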
Citations
Journal Article
A big data driven distributed density based hesitant fuzzy clustering using Apache spark with application to gene expression microarray
TL;DR: A Resilient Distributed Dataset (RDD) localized subclustering method is proposed to address the disk I/O burden of MapReduce-based clustering approaches, and comparison of the clustering results with similar works shows the superiority of the proposed algorithm in precision and cluster validity indexes.
Journal Article
Heat map visualisation of fire incidents based on transformed sigmoid risk model
TL;DR: The heat map visualisation of fire incidents showed that, compared to the TSRM, the LRM led to the overgeneralisation of results more easily, and an online and interactive fire-risk-analysing software should be developed that can be applied to fire risk analysis in other regions.
Journal Article
Big data clustering techniques based on Spark: a literature review.
TL;DR: This systematic survey investigates the existing Spark-based clustering methods in terms of their support for the characteristics of Big Data and proposes a new taxonomy for Spark-based clustering methods.
Journal Article
Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis.
TL;DR: In this article, a kernel based fuzzy clustering approach is proposed to deal with the non-linear separable problems by applying kernel Radial Basis Functions (RBF) which maps the input data space non-linearly into a high-dimensional feature space.
Journal Article
Improved k-Means Clustering Algorithm for Big Data Based on Distributed Smartphone Neural Engine Processor
Fouad H. Awad, Murtadha M. Hamad +1 more
TL;DR: The results showed that using a neural engine processor on a smartphone can speed up the clustering algorithm, running up to two times faster than traditional laptop/desktop processors.
References
Journal Article
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Proceedings Article
A density-based algorithm for discovering clusters in large spatial databases with noise
TL;DR: In this paper, a density-based notion of clusters is proposed to discover clusters of arbitrary shape, which can be used for class identification in large spatial databases and is shown to be more efficient than the well-known algorithm CLARANS.
Journal Article
Data clustering: 50 years beyond K-means
TL;DR: A brief overview of clustering is provided, well known clustering methods are summarized, the major challenges and key issues in designing clustering algorithms are discussed, and some of the emerging and useful research directions are pointed out.
Journal Article
Multivariate Density Estimation, Theory, Practice and Visualization
TL;DR: Representation and Geometry of Multivariate Data.
Journal Article
Clustering by fast search and find of density peaks
Alex Rodriguez, Alessandro Laio
TL;DR: A method in which the cluster centers are recognized as local density maxima that are far away from any points of higher density, and the algorithm depends only on the relative densities rather than their absolute values.
Related Papers (5)
An incremental density-based clustering framework using fuzzy local clustering
Sirisup Laohakiat, Vera Sa-ing +1 more
DBSCALE: An efficient density-based clustering algorithm for data mining in large databases
Cheng-Fa Tsai, Chun-Yi Sung +1 more