Apache Spark: a unified engine for big data processing

doi:10.1145/2934664

Journal Article

Matei Zaharia, +13 more

- 28 Oct 2016 -

Communications of The ACM

- Vol. 59, Iss: 11, pp 56-65

TLDR

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

Citations

Proceedings Article

Toward RDB to NoSQL: transforming data with metamorfose framework

Evandro Miguel Kuszera, +2 more

TL;DR: This paper presents a novel approach to converting relational databases (RDB) to document and column-family NoSQL databases, using a set of directed acyclic graphs (DAGs) to represent the target NoSQL model.

Proceedings Article

Dynamic Container-based Resource Management Framework of Spark Ecosystem

Nawab Muhammad Faseeh Qureshi, +6 more

TL;DR: This paper proposes a dynamic container-based resource management framework that shifts the coupled associations of job profiles to dynamically available resource containers, relieving static container allocations and treating them as fresh resource allocations for new job profiles.

Journal Article

Coffea -- Columnar Object Framework For Effective Analysis

Nicholas Smith, +10 more

- 28 Aug 2020 -

arXiv: Distributed, Parallel, and Cluste...

TL;DR: This work discusses the experience of implementing analyses of CMS data using the coffea framework, along with the user experience and future directions.

Proceedings Article

beHEALTHIER: A Microservices Platform for Analyzing and Exploiting Healthcare Data

Argyro Mavrogiorgou, +7 more

TL;DR: In this paper, the authors present a platform based on Microservice Architecture (MSA) that efficiently manages and analyzes vast amounts of healthcare data by using a newly proposed kind of electronic health record (the eXtended Health Record) and its corresponding networks.

Journal Article

BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments.

Maria Luiza Mondelli, +10 more

- 29 Aug 2018 -

PeerJ

TL;DR: This work presents BioWorkbench, a framework for managing and analyzing bioinformatics experiments, and shows that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%.

References

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets; the implementation runs on a large cluster of commodity machines and is highly scalable.

Proceedings Article

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Matei Zaharia, +8 more

TL;DR: Resilient Distributed Datasets (RDDs) are presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner; RDDs are implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.

Journal Article

A bridging model for parallel computation

Leslie G. Valiant

- 01 Aug 1990 -

Communications of The ACM

TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate bridging model for parallel computation, and results are given quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.

Proceedings Article

Pregel: a system for large-scale graph processing

Grzegorz Malewicz, +6 more

TL;DR: This paper presents a model for processing large graphs, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers; its implied synchronicity makes reasoning about programs easier.

Proceedings Article

Dryad: distributed data-parallel programs from sequential building blocks

Michael Isard, +4 more

TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.

Related Papers (5)

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008 -

Communications of The ACM

The Hadoop Distributed File System

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Spark: cluster computing with working sets

Scikit-learn: Machine Learning in Python