Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means
Satish Gopalani, Rohan Arora, et al.
TL;DR: A comparison of two frameworks - Hadoop MapReduce and the recently introduced Apache Spark - both of which provide a processing model for analyzing big data, and whose performance varies significantly based on the use case under implementation.
Abstract:
Data has long been a topic of fascination for computer science enthusiasts around the world, and has gained even more prominence in recent times with the continuous explosion of data from the likes of social media and the quest of tech giants for deeper analysis of their data. This paper discusses a comparison of two frameworks - Hadoop MapReduce and the recently introduced Apache Spark - both of which provide a processing model for analyzing big data. Although both options are built for Big Data, their performance varies significantly based on the use case under implementation. This is what makes the two worthy of analysis with respect to their variability and variety in the dynamic field of Big Data. In this paper we compare the two frameworks and provide a performance analysis using a standard machine learning algorithm for clustering (K-Means).
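The K-Means workload the paper benchmarks alternates between assigning points to their nearest centroid and recomputing centroids as group means - a pattern that maps naturally onto map and reduce phases. The following is a minimal, framework-free Python sketch of one such iteration (illustrative only; the data, function names, and structure are assumptions, not the authors' benchmark code):

```python
# One K-Means iteration expressed as map/shuffle/reduce phases.
# Illustrative sketch only -- the paper benchmarks real Hadoop and Spark jobs.

def assign(point, centroids):
    # "Map" step: find the index of the nearest centroid (squared distance).
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return min(range(len(centroids)), key=lambda i: dists[i])

def kmeans_iteration(points, centroids):
    # "Shuffle" step: group points by their assigned centroid index.
    groups = {}
    for pt in points:
        groups.setdefault(assign(pt, centroids), []).append(pt)
    # "Reduce" step: average each group into a new centroid;
    # a centroid with no assigned points keeps its old position.
    new_centroids = list(centroids)
    for idx, pts in groups.items():
        dim = len(pts[0])
        new_centroids[idx] = tuple(
            sum(p[d] for p in pts) / len(pts) for d in range(dim)
        )
    return new_centroids

# Hypothetical toy data: two obvious clusters in 2-D.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
print(kmeans_iteration(points, centroids))  # new centroids ~ (0.05, 0.1) and (5.1, 4.9)
```

Spark's performance advantage on this workload comes from keeping `points` cached in memory across iterations, whereas a Hadoop MapReduce implementation re-reads the dataset from disk on every iteration.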
Citations
Journal Article
Big data in healthcare: management, analysis and future prospects
TL;DR: To provide relevant solutions for improving public health, healthcare providers are required to be fully equipped with appropriate infrastructure to systematically generate and analyze big data.
Journal Article
Big data analytics on Apache Spark
TL;DR: This review shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing and highlights some research and development directions on Apache Spark for big data analytics.
Journal Article
Statistical Learning Theory and ELM for Big Social Data Analysis
TL;DR: This paper shows how to exploit the most recent technological tools and advances in Statistical Learning Theory (SLT) in order to efficiently build an Extreme Learning Machine (ELM) and assess the resultant model's performance when applied to big social data analysis.
Proceedings Article
Big data machine learning using apache spark MLlib
TL;DR: This contribution explores the expanding body of the Apache Spark MLlib 2.0 as an open-source, distributed, scalable, and platform independent machine learning library, and performs several real world machine learning experiments to examine the qualitative and quantitative attributes of the platform.
Journal Article
A three-way cluster ensemble approach for large-scale data
TL;DR: The experimental results show that the proposed three-way cluster ensemble approach can effectively deal with large-scale data, and the proposed consensus clustering algorithm has a lower time cost and does not sacrifice the clustering quality.
References
Journal Article
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Journal Article
The Google file system
TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Proceedings Article
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
Posted Content
Shark: SQL and Rich Analytics at Scale
TL;DR: Shark is a new data analysis system that marries query processing with complex analytics on large clusters and extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL.
Proceedings Article
Shark: SQL and rich analytics at scale
TL;DR: Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions at scale, and efficiently recovers from failures mid-query.