Journal ArticleDOI

An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm

TLDR
The results show that the proposed algorithm is more efficient than traditional K-Means for document datasets of all sizes, and that it works more efficiently when the dataset and the Hadoop cluster are large.
Abstract
Clustering is considered one of the important data mining techniques, and document clustering is among its many applications. Traditional clustering algorithms have proven inefficient for clustering large, rapidly generated real-world datasets. As a solution, traditional clustering algorithms are modified using a distributed programming paradigm. MapReduce is a popular distributed programming paradigm designed for the Hadoop distributed framework. This paper demonstrates a MapReduce-based modification of the K-Means clustering algorithm for document datasets. The results show that the proposed algorithm is more efficient than traditional K-Means for document datasets of all sizes. The experiments also show that MapReduce clustering works more efficiently when the dataset and the Hadoop cluster are large.
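
The abstract does not show the paper's implementation, so the following is only a minimal sketch of how one K-Means iteration is commonly split into a map step and a reduce step. It runs in a single Python process rather than on Hadoop. It assumes the documents have already been converted to numeric vectors (for example by TF-IDF), which the abstract does not confirm. The function names map_phase, reduce_phase and kmeans are hypothetical.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map step: emit (cluster_id, (vector, 1)) for every document vector."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(pairs, centroids):
    """Reduce step: average the vectors emitted for each cluster id."""
    sums = {}
    counts = defaultdict(int)
    for cid, (vec, n) in pairs:
        if cid in sums:
            sums[cid] = [s + v for s, v in zip(sums[cid], vec)]
        else:
            sums[cid] = list(vec)
        counts[cid] += n
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for cid, total in sums.items():
        new_centroids[cid] = tuple(s / counts[cid] for s in total)
    return new_centroids


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat map and reduce until no centroid moves by more than tol."""
    for _ in range(max_iter):
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(math.dist(a, b) for a, b in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids


if __name__ == "__main__":
    docs = [(0, 0), (0, 2), (10, 10), (10, 12)]  # toy 2-D "document vectors"
    print(kmeans(docs, [(0, 0), (10, 10)]))
```

In a real Hadoop job the map step would run on separate splits of the dataset and the framework would group the emitted pairs by cluster id before the reduce step. Here the generator and the dictionary stand in for that shuffle.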

Citations
Journal ArticleDOI

K-Means Clustering Algorithm and Its Simulation Based on Distributed Computing Platform

TL;DR: In this paper, the authors study the parallel k-means algorithm in MapReduce and parallelize the distance calculation, which is independent across data objects, so that cluster analysis can be performed in parallel.
Journal ArticleDOI

MapReduce-based Fuzzy C-means Algorithm for Distributed Document Clustering

TL;DR: In this paper, a MapReduce-based fuzzy C-means algorithm for big document data clustering is proposed and evaluated extensively on document datasets of different sizes, executed over Hadoop clusters of different sizes.
Journal ArticleDOI

Distributed Big Data Clustering using MapReduce-based Fuzzy C-Medoids

TL;DR: In this article, a MapReduce-based Fuzzy C-Medoids clustering algorithm is designed to cluster a big data repository of document datasets. The performance of the proposed algorithm is experimentally evaluated for Hadoop clusters and document datasets of different sizes.
Journal ArticleDOI

Data Optimization Analysis of Integrated Energy System Based on K-Means Algorithm

TL;DR: In this article, a 300 MW circulating liquid bed boiler for a thermal power plant as a research product was used to increase the thermal efficiency of boiler combustion and reduce nitrogen oxide emissions.
Journal ArticleDOI

Document clustering analysis with aid of adaptive Jaro Winkler with Jellyfish search clustering algorithm

TL;DR: In this article, a document clustering method using the Adaptive Jaro Winkler with Jellyfish Search Clustering (AJWJSC) algorithm and the Chimp Optimization Algorithm (COA) is proposed.
References
Proceedings ArticleDOI

k-means++: the advantages of careful seeding

TL;DR: By augmenting k-means with a very simple, randomized seeding technique, this work obtains an algorithm that is Θ(log k)-competitive with the optimal clustering.
Journal ArticleDOI

An efficient k-means clustering algorithm: analysis and implementation

TL;DR: This work presents a simple and efficient implementation of Lloyd's k-means clustering algorithm, which it calls the filtering algorithm, and establishes the practical efficiency of the algorithm's running time.

A Comparison of Document Clustering Techniques

TL;DR: This paper compares the two main approaches to document clustering, agglomerative hierarchical clustering and K-means, and indicates that the bisecting K-means technique is better than the standard K-means approach and as good as or better than the hierarchical approaches that were tested for a variety of cluster evaluation metrics.
Journal ArticleDOI

The global k-means clustering algorithm

TL;DR: The global k-means algorithm is presented, an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N executions of the k-means algorithm from suitable initial positions.
Proceedings Article

Refining Initial Points for K-Means Clustering

TL;DR: A procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution that allows the iterative algorithm to converge to a “better” local minimum.