SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark

doi:10.1145/2742854.2747283

Proceedings ArticleDOI

Min Li, +4 more

- pp 53

TLDR

This paper presents SparkBench, a Spark-specific benchmarking suite covering machine learning, graph computation, SQL query and streaming applications, and evaluates the performance impact of a key configuration parameter to guide the design and optimization of the Spark data analytic platform.

Abstract:

Spark has been increasingly adopted by industry in recent years for big data analysis because it provides a fault-tolerant, scalable and easy-to-use in-memory abstraction. Moreover, the community has been actively developing a rich ecosystem around Spark, making it even more attractive. However, the literature does not yet contain a Spark-specific benchmark to guide the development and cluster deployment of Spark so that it better fits the resource demands of user applications. In this paper, we present SparkBench, a Spark-specific benchmarking suite that includes a comprehensive set of applications. SparkBench covers four main categories of applications: machine learning, graph computation, SQL query and streaming applications. We also characterize the resource consumption, data flow and timing information of each application, and evaluate the performance impact of a key configuration parameter to guide the design and optimization of the Spark data analytic platform.

Citations

Journal ArticleDOI

Big data analytics on Apache Spark

Salman Salloum, +4 more

- 13 Oct 2016 -

Journal of data science

TL;DR: This review shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing and highlights some research and development directions on Apache Spark for big data analytics.

Proceedings ArticleDOI

A survey on reconfigurable accelerators for cloud computing

Christoforos Kachris, +1 more

TL;DR: This paper presents a thorough survey of frameworks for the efficient utilization of FPGAs in data centers and of the hardware accelerators that have been implemented for the most widely used cloud computing applications.

Proceedings ArticleDOI

Benchmarking Distributed Stream Data Processing Systems

Jeyhun Karimov, +5 more

- 23 Feb 2018 -

arXiv: Databases

TL;DR: In this paper, the authors propose a framework for benchmarking distributed stream processing engines and evaluate the performance of three widely used streaming data processing systems in detail, namely Apache Storm, Apache Spark, and Apache Flink.

Proceedings ArticleDOI

MEMTUNE: Dynamic Memory Management for In-Memory Data Analytic Platforms

Luna Xu, +5 more

TL;DR: MEMTUNE dynamically tunes computation/caching memory partitions at runtime based on workload memory demand and in-memory data cache needs; if needed, it also leverages scheduling information from the analytic framework to evict data that will not be needed in the near future.

References

Journal ArticleDOI

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.

Journal ArticleDOI

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008 -

Communications of The ACM

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

Proceedings Article

The PageRank Citation Ranking: Bringing Order to the Web

Lawrence Page, +3 more

TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.

BookDOI

An introduction to statistical learning

Gareth M. James, +3 more

TL;DR: An introduction to statistical learning provides an accessible overview of the essential toolset for making sense of the vast and complex data sets that have emerged in science, industry, and other sectors in the past twenty years.

Proceedings Article

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Matei Zaharia, +8 more

TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.

Communications of The ACM

The Hadoop Distributed File System

Konstantin Shvachko, +3 more

Related Papers (5)

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Spark: cluster computing with working sets

MapReduce: simplified data processing on large clusters

The Hadoop Distributed File System

Apache Hadoop YARN: yet another resource negotiator