Apache Spark: a unified engine for big data processing

Big data in healthcare: management, analysis and future prospects

TL;DR: To deliver relevant solutions for improving public health, healthcare providers need appropriate infrastructure to systematically generate and analyze big data.

Proceedings Article

Ray: a distributed framework for emerging AI applications

Philipp Moritz, +10 more

TL;DR: Ray is a distributed system that implements a unified interface able to express both task-parallel and actor-based computations, supported by a single dynamic execution engine. It employs a distributed scheduler and a distributed, fault-tolerant store to manage the control state.

Journal Article

Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries.

Alexandra Olteanu, +3 more

TL;DR: Presents a framework for identifying a broad range of pitfalls in research and practice around social data, including biases and inaccuracies present at the source of the data as well as those introduced during processing.

Journal Article

A Survey on Distributed Machine Learning

Joost Verbraeken, +5 more

- 13 Mar 2020 -

ACM Computing Surveys

TL;DR: The authors give an extensive overview of the current state of the art in distributed machine learning, outlining its challenges and opportunities relative to conventional (centralized) machine learning.

Journal Article

CatBoost for big data: an interdisciplinary review

John Hancock, +1 more

- 19 Aug 2020 -

Journal of Big Data

TL;DR: This survey takes an interdisciplinary approach to cover studies related to CatBoost in a single work, giving researchers an in-depth understanding that helps clarify the proper application of CatBoost in solving problems.

References

"One Size Fits All": An Idea Whose Time Has Come and Gone?

Jens Dittrich, +5 more

TL;DR: In a later publication [St07], Michael Stonebraker even put forward the thesis that there are no applications for which traditional database systems are the best alternative.

Posted Content

MLI: An API for Distributed Machine Learning

Evan R. Sparks, +8 more

- 21 Oct 2013 -

arXiv: Learning

TL;DR: The initial results show that this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.

Proceedings Article

MLI: An API for Distributed Machine Learning

Evan R. Sparks, +8 more

TL;DR: MLI is an application programming interface designed to address the challenges of building machine learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms.

Proceedings Article

Rethinking Data-Intensive Science Using Scalable Analytics Systems

Frank Austin Nothaft, +12 more

TL;DR: Describes ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines while reducing cost by 63%.

Book

An Architecture for Fast and General Data Processing on Large Clusters

Matei Zaharia

TL;DR: This book proposes an architecture for cluster computing systems that can tackle emerging data processing workloads at scale. It introduces a simple extension to MapReduce that adds primitives for data sharing, called Resilient Distributed Datasets (RDDs), which is implemented in the open source Spark system.

Related Papers (5)

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008 -

Communications of The ACM

Citations

Big data in healthcare: management, analysis and future prospects

Related Papers (5)

MapReduce: simplified data processing on large clusters

The Hadoop Distributed File System

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Spark: cluster computing with working sets

Scikit-learn: Machine Learning in Python