Parallel K-Means Clustering Based on MapReduce

doi:10.1007/978-3-642-10665-1_71

Book Chapter

- Vol. 5931, pp 674-679

TLDR

This paper proposes a parallel k-means clustering algorithm based on MapReduce, a simple yet powerful parallel programming technique, and demonstrates that the proposed algorithm scales well and efficiently processes large datasets on commodity hardware.

Abstract:

Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volume of information produced by advances in technology makes clustering very large datasets a challenging task. To deal with this problem, many researchers have tried to design efficient parallel clustering algorithms. In this paper, we propose a parallel k-means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.
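
The abstract does not spell out the map and reduce functions, so the following is only a minimal sketch of one common way to cast a k-means iteration as MapReduce, not the authors' implementation. It assumes squared Euclidean distance, a map step that assigns each point to its nearest center, a combine step that pre-sums points per cluster within a split, and a reduce step that averages the partial sums into new centers. The names `map_phase`, `combine_phase`, `reduce_phase` and `kmeans_mapreduce` are hypothetical, and the whole thing runs in a single process to simulate the data flow rather than calling any real Hadoop API.

```python
import math
from collections import defaultdict


def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def map_phase(split, centers):
    """Map: emit (cluster_id, (point, 1)) for every point in one input split."""
    return [(nearest(p, centers), (p, 1)) for p in split]


def combine_phase(pairs):
    """Combine: sum points per cluster inside one split to shrink the shuffle."""
    partial = {}
    for cid, (vec, n) in pairs:
        if cid in partial:
            total, count = partial[cid]
            partial[cid] = ([a + b for a, b in zip(total, vec)], count + n)
        else:
            partial[cid] = (list(vec), n)
    return list(partial.items())


def reduce_phase(values):
    """Reduce: merge partial (sum, count) pairs and return the mean vector."""
    total = [0.0] * len(values[0][0])
    count = 0
    for vec, n in values:
        total = [a + b for a, b in zip(total, vec)]
        count += n
    return [t / count for t in total]


def kmeans_mapreduce(splits, centers, max_iter=20, tol=1e-9):
    """Run one simulated MapReduce job per iteration until the centers settle."""
    for _ in range(max_iter):
        shuffled = defaultdict(list)
        for split in splits:  # on a cluster, each split would go to its own mapper
            for cid, value in combine_phase(map_phase(split, centers)):
                shuffled[cid].append(value)
        # Assumption: a cluster that receives no points keeps its old center.
        new_centers = [
            reduce_phase(shuffled[i]) if i in shuffled else list(centers[i])
            for i in range(len(centers))
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers
```

Calling `kmeans_mapreduce(splits, initial_centers)` with `splits` as a list of point lists mimics how the input would be partitioned across mappers. Because only per-cluster sums and counts leave each split, the data shuffled per iteration depends on the number of clusters and splits rather than on the number of points.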

Citations

Journal Article

A Comprehensive Survey of Clustering Algorithms

Dongkuan Xu, +1 more

- 12 Aug 2015 -

Annals of Data Science

TL;DR: This review paper begins with the definition of clustering, considers the basic elements involved in the clustering process, such as the distance or similarity measurement and evaluation indicators, and analyzes the clustering algorithms from two perspectives, the traditional ones and the modern ones.

Journal Article

Social big data

Gema Bello-Orgaz, +2 more

- 01 Mar 2016 -

Information Fusion

TL;DR: This paper presents a revision of the new methodologies that are designed to allow for efficient data mining and information fusion from social media and of the new applications and frameworks that are currently appearing under the “umbrella” of the social networks, social media and big data paradigms.

Journal Article

Big data analytics: a survey

Chun-Wei Tsai, +5 more

- 01 Oct 2015 -

Journal of Big Data

TL;DR: The question that arises now is, how to develop a high performance platform to efficiently analyze big data and how to design an appropriate mining algorithm to find the useful things from big data.

Journal Article

Scalable k-means++

Bahman Bahmani, +4 more

TL;DR: In this article, the authors show how to reduce the number of passes needed to obtain, in parallel, a good initialization of k-means++ in both sequential and parallel settings.

Posted Content

Scalable K-Means++

Bahman Bahmani, +4 more

- 29 Mar 2012 -

arXiv: Databases

TL;DR: It is proved that the proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.

References

Some methods for classification and analysis of multivariate observations

James B. MacQueen

TL;DR: The k-means algorithm as mentioned in this paper partitions an N-dimensional population into k sets on the basis of a sample, which is a generalization of the ordinary sample mean, and it is shown to give partitions which are reasonably efficient in the sense of within-class variance.

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008 -

Communications of The ACM

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

Journal Article

The Google file system

Sanjay Ghemawat, +2 more

TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.

Proceedings Article

The Hadoop Distributed File System

Konstantin Shvachko, +3 more

TL;DR: The architecture of HDFS is described and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported on.

Related Papers (5)

MapReduce: simplified data processing on large clusters

Some methods for classification and analysis of multivariate observations

k-means++: the advantages of careful seeding

Least squares quantization in PCM

A density-based algorithm for discovering clusters in large spatial databases with noise