An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm

doi:10.1007/S40031-020-00485-2

Journal Article


Tanvir Habib Sardar, +1 more

- 01 Dec 2020 -

Journal of The Institution of Engineers ...

- Vol. 101, Iss: 6, pp 641-650


TLDR

The results show that the proposed algorithm is more efficient than traditional K-Means for document datasets of all sizes, and that it works more efficiently when the dataset size and Hadoop cluster size are large.

Abstract:

Clustering is considered one of the important data mining techniques, and document clustering is one of its many applications. Traditional clustering algorithms have proven inefficient for clustering large, rapidly generated real-world datasets. As a solution, traditional clustering algorithms are modified using a distributed programming paradigm. MapReduce is a popular distributed programming paradigm designed for the Hadoop distributed framework. This paper demonstrates a MapReduce-based modification of the K-Means clustering algorithm for document datasets. The results show that the proposed algorithm is more efficient than traditional K-Means for document datasets of all sizes. The experiments also show that MapReduce clustering works more efficiently when the dataset size and Hadoop cluster size are large.
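As a rough illustration of how one K-Means iteration can be split into map and reduce steps, the sketch below assumes documents have already been converted to numeric vectors (the abstract does not state the representation). The function names `map_assign`, `reduce_update`, `kmeans_iteration`, and `kmeans` are hypothetical, and the code runs in a single Python process rather than on Hadoop; it is not the authors' implementation.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available in Python 3.8+


def map_assign(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_update(pairs):
    """Reduce step: sum the vectors and counts for one centroid, return the mean."""
    total, count = None, 0
    for point, n in pairs:
        total = list(point) if total is None else [a + b for a, b in zip(total, point)]
        count += n
    return tuple(x / count for x in total)


def kmeans_iteration(points, centroids):
    """One iteration: map every point, group by key (the shuffle), reduce each group."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid with no assigned points keeps its old position.
    return [reduce_update(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


def kmeans(points, centroids, max_iter=20):
    """Repeat iterations until the centroids stop changing or max_iter is reached."""
    for _ in range(max_iter):
        new_centroids = kmeans_iteration(points, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

In a real MapReduce job the map calls would run on different nodes over different data splits, and each iteration would typically be a separate job that reads the centroids produced by the previous one.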

Citations


Journal Article

K-Means Clustering Algorithm and Its Simulation Based on Distributed Computing Platform

Chunqiong Wu, +6 more

- 19 Jun 2021 -

Complexity

TL;DR: In this paper, the authors study the parallel k-means algorithm in MapReduce and parallelize the distance calculation, exploiting the independence between data objects to perform cluster analysis in parallel.


Journal Article

MapReduce-based Fuzzy C-means Algorithm for Distributed Document Clustering

Tanvir Habib Sardar, +1 more

- 19 Jul 2021 -

Journal of The Institution of Engineers ...

TL;DR: In this paper, a MapReduce-based fuzzy C-means algorithm for big document data clustering is proposed and tested extensively on document datasets of different sizes, executed over Hadoop clusters of different sizes.


Journal Article

Distributed Big Data Clustering using MapReduce-based Fuzzy C-Medoids

Tanvir Habib Sardar, +1 more

- 27 Jul 2021 -

Journal of The Institution of Engineers ...

TL;DR: In this article, a MapReduce-based Fuzzy C-Medoids clustering algorithm is designed to cluster a big data repository of document datasets. The performance of the proposed algorithm is experimentally evaluated for Hadoop clusters of different sizes and document datasets of different sizes.


Journal Article

Data Optimization Analysis of Integrated Energy System Based on K-Means Algorithm

Hai-Qiang Guo, +4 more

- 26 May 2022 -

Wireless Communications and Mobile Compu...

TL;DR: In this article, a 300 MW circulating liquid bed boiler at a thermal power plant is taken as the research subject, with the aim of increasing the thermal efficiency of boiler combustion and reducing nitrogen oxide emissions.


Journal Article

Document clustering analysis with aid of adaptive Jaro Winkler with Jellyfish search clustering algorithm

Perumal Pitchandi, +1 more

- 01 Jan 2023 -

Advances in engineering software

TL;DR: In this article, a document clustering method using Adaptive Jaro Winkler with Jellyfish Search Clustering (AJWJSC) algorithm and Chimp Optimization Algorithm (COA) is proposed.


References


Proceedings Article

k-means++: the advantages of careful seeding

David Arthur, +1 more

TL;DR: By augmenting k-means with a very simple, randomized seeding technique, this work obtains an algorithm that is Θ(log k)-competitive with the optimal clustering.
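The seeding technique summarized above can be sketched as follows: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. This is a minimal sketch, assuming points are numeric tuples and that the input contains at least `k` distinct points; the name `kmeanspp_seed` and the optional `rng` parameter are hypothetical.

```python
import random
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers using squared-distance-weighted sampling."""
    rng = rng or random.Random()
    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(dist(p, c) ** 2 for c in centers) for p in points]
        # Next center: sampled with probability proportional to d2.
        # Already-chosen points have weight 0 and cannot be picked again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers
```

The returned centers would then serve as the starting centroids for an ordinary k-means run.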


Journal Article

An efficient k-means clustering algorithm: analysis and implementation

Tapas Kanungo, +5 more

- 01 Jul 2002 -

IEEE Transactions on Pattern Analysis an...

TL;DR: This work presents a simple and efficient implementation of Lloyd's k-means clustering algorithm, which it calls the filtering algorithm, and establishes the practical efficiency of the algorithm's running time.


A Comparison of Document Clustering Techniques

Michael Steinbach, +2 more

TL;DR: This paper compares the two main approaches to document clustering, agglomerative hierarchical clustering and K-means, and indicates that the bisecting K-means technique is better than the standard K-means approach and as good as or better than the hierarchical approaches that were tested for a variety of cluster evaluation metrics.


Journal Article

The global k-means clustering algorithm

Aristidis Likas, +2 more

- 01 Feb 2003 -

Pattern Recognition

TL;DR: The global k-means algorithm is presented, which is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N executions of the k-means algorithm from suitable initial positions.


Proceedings Article

Refining Initial Points for K-Means Clustering

Paul S. Bradley, +1 more

TL;DR: A procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution that allows the iterative algorithm to converge to a “better” local minimum.
