Proceedings ArticleDOI
SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark
Min Li,Jian Tan,Yandong Wang,Li Zhang,Valentina Salapura +4 more
- pp 53
TLDR
This paper presents SparkBench, a Spark-specific benchmarking suite that includes a comprehensive set of applications covering machine learning, graph computation, SQL queries, and streaming, and evaluates the performance impact of a key configuration parameter to guide the design and optimization of the Spark data analytic platform.
Abstract:
Spark has been increasingly adopted by industry in recent years for big data analysis because it provides a fault-tolerant, scalable, and easy-to-use in-memory abstraction. Moreover, the community has been actively developing a rich ecosystem around Spark, making it even more attractive. However, there is not yet a Spark-specific benchmark in the literature to guide the development and cluster deployment of Spark so that it better fits the resource demands of user applications. In this paper, we present SparkBench, a Spark-specific benchmarking suite that includes a comprehensive set of applications. SparkBench covers four main categories of applications: machine learning, graph computation, SQL queries, and streaming applications. We also characterize the resource consumption, data flow, and timing information of each application, and we evaluate the performance impact of a key configuration parameter to guide the design and optimization of the Spark data analytic platform.
Citations
Journal ArticleDOI
Big data analytics on Apache Spark
TL;DR: This review shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing and highlights some research and development directions on Apache Spark for big data analytics.
Proceedings ArticleDOI
A survey on reconfigurable accelerators for cloud computing
TL;DR: A thorough survey is presented of the frameworks for the efficient utilization of FPGAs in data centers and of the hardware accelerators that have been implemented for the most widely used cloud computing applications.
Proceedings ArticleDOI
Benchmarking Distributed Stream Data Processing Systems
Jeyhun Karimov,Tilmann Rabl,Asterios Katsifodimos,Roman Samarev,Henri Heiskanen,Volker Markl +5 more
TL;DR: In this paper, the authors propose a framework for benchmarking distributed stream processing engines and evaluate the performance of three widely used streaming data processing systems in detail, namely Apache Storm, Apache Spark, and Apache Flink.
Proceedings ArticleDOI
MEMTUNE: Dynamic Memory Management for In-Memory Data Analytic Platforms
TL;DR: MEMTUNE dynamically tunes computation/caching memory partitions at runtime based on workload memory demand and in-memory data cache needs, and if needed, the scheduling information from the analytic framework is leveraged to evict data that will not be needed in the near future.
References
Journal ArticleDOI
MapReduce: simplified data processing on large clusters
Jeffrey Dean,Sanjay Ghemawat +1 more
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Journal ArticleDOI
MapReduce: simplified data processing on large clusters
Jeffrey Dean,Sanjay Ghemawat +1 more
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Proceedings Article
The PageRank Citation Ranking : Bringing Order to the Web
TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.
BookDOI
An introduction to statistical learning
TL;DR: An introduction to statistical learning provides an accessible overview of the essential toolset for making sense of the vast and complex data sets that have emerged in science, industry, and other sectors in the past twenty years.
Proceedings Article
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
Matei Zaharia,Mosharaf Chowdhury,Tathagata Das,Ankur Dave,Justin Ma,Murphy McCauley,Michael J. Franklin,Scott Shenker,Ion Stoica +8 more
TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.