Proceedings ArticleDOI

SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark

TLDR
This paper presents SparkBench, a Spark-specific benchmarking suite whose applications span machine learning, graph computation, SQL queries, and streaming, and evaluates the performance impact of a key configuration parameter to guide the design and optimization of the Spark data analytic platform.
Abstract
Spark has been increasingly adopted by industry in recent years for big data analysis because it provides a fault-tolerant, scalable, and easy-to-use in-memory abstraction. Moreover, the community has been actively developing a rich ecosystem around Spark, making it even more attractive. However, the literature does not yet offer a Spark-specific benchmark to guide the development and cluster deployment of Spark so that it better fits the resource demands of user applications. In this paper, we present SparkBench, a Spark-specific benchmarking suite that includes a comprehensive set of applications. SparkBench covers four main categories of applications: machine learning, graph computation, SQL queries, and streaming. We also characterize the resource consumption, data flow, and timing information of each application, and evaluate the performance impact of a key configuration parameter to guide the design and optimization of the Spark data analytic platform.

Citations
Journal ArticleDOI

Big data analytics on Apache Spark

TL;DR: This review surveys what Apache Spark offers for designing and implementing big data algorithms and pipelines for machine learning, graph analysis, and stream processing, and highlights some research and development directions on Apache Spark for big data analytics.
Proceedings ArticleDOI

A survey on reconfigurable accelerators for cloud computing

TL;DR: This paper presents a thorough survey of the frameworks for the efficient utilization of FPGAs in data centers and of the hardware accelerators that have been implemented for the most widely used cloud computing applications.
Proceedings ArticleDOI

Benchmarking Distributed Stream Data Processing Systems

TL;DR: In this paper, the authors propose a framework for benchmarking distributed stream processing engines and evaluate the performance of three widely used streaming data processing systems in detail, namely Apache Storm, Apache Spark, and Apache Flink.
Proceedings ArticleDOI

MEMTUNE: Dynamic Memory Management for In-Memory Data Analytic Platforms

TL;DR: MEMTUNE dynamically tunes computation/caching memory partitions at runtime based on workload memory demand and in-memory data cache needs and, if needed, leverages scheduling information from the analytic framework to evict data that will not be needed in the near future.
References
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets, which runs on a large cluster of commodity machines and is highly scalable.
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Proceedings Article

The PageRank Citation Ranking: Bringing Order to the Web

TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.
BookDOI

An introduction to statistical learning

TL;DR: An introduction to statistical learning provides an accessible overview of the essential toolset for making sense of the vast and complex data sets that have emerged in science, industry, and other sectors in the past twenty years.
Proceedings Article

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.