Open Access, Journal Article

Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means

Satish Gopalani, +1 more
18 Mar 2015, Vol. 113, Iss. 1, pp. 8-11
TLDR
This paper compares two frameworks, Hadoop Map Reduce and the recently introduced Apache Spark, both of which provide a processing model for analyzing big data and whose performance varies significantly based on the use case under implementation.
Abstract
Data has long been a topic of fascination for computer science enthusiasts around the world, and it has gained even more prominence in recent times with the continuous explosion of data from social media and the quest of tech giants to gain deeper analysis of their data. This paper discusses two frameworks, Hadoop Map Reduce and the recently introduced Apache Spark, both of which provide a processing model for analyzing big data. Although both are built around the concept of big data, their performance varies significantly based on the use case under implementation. This makes the two options worth analyzing with respect to their variability and variety in the dynamic field of big data. In this paper we compare the two frameworks and provide a performance analysis using a standard machine learning algorithm for clustering (K-Means).
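
For readers unfamiliar with the benchmark algorithm, the following is a minimal single-machine sketch of K-Means (Lloyd's algorithm) in plain Python. It is not the authors' implementation and uses neither Spark nor Hadoop; the function name `kmeans`, the choice to seed the centroids with the first k points, and the sample data are all assumptions made for illustration. The point to notice is that every iteration makes a full pass over the same data set.

```python
from math import dist


def kmeans(points, k, max_iters=100):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm.

    Assumption for illustration: the first k points seed the centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    return centroids, assignments


# Hypothetical sample data: two well-separated groups in 2-D.
sample = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(sample, k=2)
```

In a distributed setting the assignment step would roughly correspond to a map over the points and the update step to a per-cluster aggregation, though how each framework schedules and stores those passes is outside the scope of this sketch.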

