Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means
Satish Gopalani, Rohan Arora, et al.
TL;DR: A comparison of two frameworks - Hadoop MapReduce and the recently introduced Apache Spark - both of which provide a processing model for analyzing big data, and whose performance varies significantly based on the use case under implementation.
Abstract:
Data has long been a topic of fascination for computer science enthusiasts around the world, and has gained even more prominence in recent times with the continuous explosion of data from the likes of social media and the quest of tech giants for deeper analysis of their data. This paper discusses a comparison of two frameworks - Hadoop MapReduce and the recently introduced Apache Spark - both of which provide a processing model for analyzing big data. Although both options are built for Big Data, their performance varies significantly based on the use case under implementation. This is what makes the two worthy of analysis with respect to their variability and variety in the dynamic field of Big Data. In this paper we compare the two frameworks and provide a performance analysis using a standard machine learning algorithm for clustering (K-Means).
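The K-Means workload the paper benchmarks alternates between assigning points to their nearest centroid and recomputing centroids as group means - a pattern that maps naturally onto map and reduce phases. The following is a minimal, framework-free Python sketch of one such iteration (illustrative only; the data, function names, and structure are assumptions, not the authors' benchmark code):

```python
# One K-Means iteration expressed as map/shuffle/reduce phases.
# Illustrative sketch only -- the paper benchmarks real Hadoop and Spark jobs.

def assign(point, centroids):
    # "Map" step: find the index of the nearest centroid (squared distance).
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return min(range(len(centroids)), key=lambda i: dists[i])

def kmeans_iteration(points, centroids):
    # "Shuffle" step: group points by their assigned centroid index.
    groups = {}
    for pt in points:
        groups.setdefault(assign(pt, centroids), []).append(pt)
    # "Reduce" step: average each group into a new centroid;
    # a centroid with no assigned points keeps its old position.
    new_centroids = list(centroids)
    for idx, pts in groups.items():
        dim = len(pts[0])
        new_centroids[idx] = tuple(
            sum(p[d] for p in pts) / len(pts) for d in range(dim)
        )
    return new_centroids

# Hypothetical toy data: two obvious clusters in 2-D.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
print(kmeans_iteration(points, centroids))  # new centroids ~ (0.05, 0.1) and (5.1, 4.9)
```

Spark's performance advantage on this workload comes from keeping `points` cached in memory across iterations, whereas a Hadoop MapReduce implementation re-reads the dataset from disk on every iteration.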
Citations
Journal Article
Big data in healthcare: management, analysis and future prospects
TL;DR: To provide relevant solutions for improving public health, healthcare providers are required to be fully equipped with appropriate infrastructure to systematically generate and analyze big data.
Journal Article
Big data analytics on Apache Spark
TL;DR: This review shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing and highlights some research and development directions on Apache Spark for big data analytics.
Journal Article
Statistical Learning Theory and ELM for Big Social Data Analysis
TL;DR: This paper shows how to exploit the most recent technological tools and advances in Statistical Learning Theory (SLT) in order to efficiently build an Extreme Learning Machine (ELM) and assess the resultant model's performance when applied to big social data analysis.
Proceedings Article
Big data machine learning using apache spark MLlib
TL;DR: This contribution explores the expanding body of the Apache Spark MLlib 2.0 as an open-source, distributed, scalable, and platform independent machine learning library, and performs several real world machine learning experiments to examine the qualitative and quantitative attributes of the platform.
Journal Article
A three-way cluster ensemble approach for large-scale data
TL;DR: The experimental results show that the proposed three-way cluster ensemble approach can effectively deal with large-scale data, and the proposed consensus clustering algorithm has a lower time cost and does not sacrifice the clustering quality.
References
Journal Article
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Journal Article
The Google file system
TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Proceedings Article
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
Posted Content
Shark: SQL and Rich Analytics at Scale
TL;DR: Shark is a new data analysis system that marries query processing with complex analytics on large clusters and extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL.
Proceedings Article
Shark: SQL and rich analytics at scale
TL;DR: Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions at scale, and efficiently recovers from failures mid-query.