Proceedings ArticleDOI
Sparkle: optimizing spark for large memory machines and analytics
Mijung Kim, Jun Li, Haris Volos, Manish Marwah, Alexander Ulanov, Kimberly Keeton, Joseph Tucek, Lucy Cherkasova, Le Xu, Pradeep Fernando +9 more
- pp 656-656
TLDR
This work leverages Spark, an existing memory-centric data analytics framework with widespread adoption among data scientists, to bring the performance benefits of in-memory processing on scale-up servers to an increasingly common class of data analytics applications that process small to medium size datasets.
Abstract: Given the growing availability of affordable scale-up servers, our goal is to bring the performance benefits of in-memory processing on scale-up servers to an increasingly common class of data analytics applications that process small to medium size datasets (up to a few hundred GBs) that can easily fit in the memory of a typical scale-up server. To achieve this goal, we leverage Spark, an existing memory-centric data analytics framework with widespread adoption among data scientists. Bringing Spark's data analytics capabilities to a scale-up system requires rethinking the original design assumptions, which, although effective for a scale-out system, are a poor match to a scale-up system, resulting in unnecessary communication and memory inefficiencies.
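The abstract's point about scale-out assumptions causing unnecessary communication can be illustrated with a toy sketch (plain Python, not the paper's implementation; all names below are hypothetical): merging per-partition word counts through a serialized "shuffle", versus merging them directly when all partitions share one address space on a scale-up machine.

```python
import pickle
from collections import Counter

# Per-partition word lists, a stand-in for partitioned input data.
partitions = [["spark", "memory"], ["memory", "scaleup"], ["spark", "spark"]]

def shuffle_merge(parts):
    # Scale-out style: per-partition results are serialized (simulating a
    # network shuffle), then deserialized and merged on a reducer.
    blobs = [pickle.dumps(Counter(p)) for p in parts]
    total = Counter()
    for blob in blobs:
        total.update(pickle.loads(blob))
    return total

def shared_merge(parts):
    # Scale-up style: everything lives in one shared address space, so
    # per-partition results can be merged directly, with no serialization.
    total = Counter()
    for p in parts:
        total.update(Counter(p))
    return total

assert shuffle_merge(partitions) == shared_merge(partitions)
print(shared_merge(partitions)["spark"])  # 3
```

Both paths compute the same answer; the scale-up path simply skips the serialize/deserialize round trip that the scale-out design assumes.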
Citations
Proceedings Article
Disaggregating Persistent Memory and Controlling Them Remotely: An Exploration of Passive Disaggregated Key-Value Stores.
TL;DR: This paper explores the design of disaggregating persistent memory (PM) and managing it remotely from compute servers, a model the authors call passive disaggregated persistent memory (pDPM), which significantly lowers monetary and energy costs and avoids scalability bottlenecks at storage servers.
Memory-Driven Computing.
TL;DR: This talk will discuss the technologies that comprise The Machine and their implications for systems software and application programs, as well as describe the work the team is doing at HPE to address some of these challenges and opportunities.
Journal ArticleDOI
A Survey on Spark Ecosystem for Big Data Processing.
TL;DR: A thorough review of optimization techniques for improving the generality and performance of Spark; it also introduces the Spark programming model and computing system and discusses their pros and cons.
Proceedings ArticleDOI
Characterizing the Scale-Up Performance of Microservices using TeaStore
TL;DR: A study of a publicly available microservice-based application on a state-of-the-art x86 server supporting 128 logical CPUs per socket highlights the significant performance opportunities that exist when the scaling properties of individual services and knowledge of the underlying processor topology are properly exploited.
References
Journal ArticleDOI
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat +1 more
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
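The model's two phases can be sketched in a few lines of plain Python (a single-process illustration of the idea, not the distributed system; the function names are hypothetical): map emits key/value pairs, and reduce merges all values sharing a key.

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in a document.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Shuffle + Reduce: group values by key, then sum each group.
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

docs = ["the quick fox", "the lazy dog", "the fox"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
counts = reduce_phase(pairs)
print(counts["the"])  # 3
```

In the real system, map tasks run in parallel across the cluster and the runtime handles the shuffle, fault tolerance, and scheduling that this sketch omits.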
Journal ArticleDOI
Apache Spark: a unified engine for big data processing
Matei Zaharia, Reynold Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph E. Gonzalez, Scott Shenker, Ion Stoica +13 more
TL;DR: This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
Proceedings ArticleDOI
Scaling Distributed Machine Learning with the Parameter Server
TL;DR: Views on newly identified challenges are shared, and application scenarios such as micro-blog data analysis and data processing for building next-generation search engines are covered.
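The parameter-server pattern itself can be sketched in a toy synchronous form (plain Python; a sketch of the idea, not the paper's system, and all names are hypothetical): workers pull the current weights, compute a gradient on their own data shard, and push the update back to the server.

```python
class ParameterServer:
    # Holds the shared model weights and applies pushed gradients.
    def __init__(self, dim, lr=0.01):
        self.weights = [0.0] * dim
        self.lr = lr

    def pull(self):
        return list(self.weights)

    def push(self, grad):
        # Gradient-descent update applied on behalf of a worker.
        for i, g in enumerate(grad):
            self.weights[i] -= self.lr * g

def worker_step(server, shard):
    w = server.pull()
    # Gradient of mean squared error for a 1-D linear model y = w0 * x.
    grad = [sum(2 * (w[0] * x - y) * x for x, y in shard) / len(shard)]
    server.push(grad)

server = ParameterServer(dim=1)
data = [(x, 3.0 * x) for x in range(1, 9)]  # points on the line y = 3x
shards = [data[:4], data[4:]]               # each worker owns one shard
for _ in range(200):
    for shard in shards:
        worker_step(server, shard)
print(round(server.weights[0], 2))  # 3.0
```

A production parameter server distributes the weights across many server nodes and lets workers push asynchronously; this sketch keeps only the pull/compute/push loop.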
Journal ArticleDOI
Region-based memory management
Mads Tofte, Jean-Pierre Talpin +1 more
TL;DR: A region-based dynamic semantics for a skeletal programming language extracted from Standard ML is defined, the inference system that specifies where regions can be allocated and de-allocated is presented, and a detailed proof that the system is sound with respect to a standard semantics is given.
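The core idea of a region is that every object allocated in it is released together when the region ends, rather than object by object. A toy Python sketch (hypothetical `Region` class; real region-based management is done by the compiler/runtime, not a library):

```python
class Region:
    # A toy region: objects allocated into it live until the region ends.
    def __init__(self):
        self._objects = []

    def alloc(self, value):
        self._objects.append(value)
        return value

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Deallocate the whole region in one shot.
        self._objects.clear()
        return False

with Region() as r:
    xs = [r.alloc(i * i) for i in range(5)]
    total = sum(xs)
print(total)  # 30
```

In Tofte and Talpin's system, the region inference analysis decides at compile time where such allocate/deallocate points go, with no programmer annotations or garbage collector.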
Proceedings ArticleDOI
Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks
TL;DR: Tachyon is a distributed file system enabling reliable data sharing at memory speed across cluster computing frameworks by introducing a checkpointing algorithm that guarantees bounded recovery cost and resource allocation strategies for recomputation under commonly used resource schedulers.