Journal Article

Apache Spark: a unified engine for big data processing

Abstract
This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.


Citations
Journal Article

Big data in healthcare: management, analysis and future prospects

TL;DR: To provide relevant solutions for improving public health, healthcare providers are required to be fully equipped with appropriate infrastructure to systematically generate and analyze big data.
Proceedings Article

Ray: a distributed framework for emerging AI applications

TL;DR: Ray is a distributed system that implements a unified interface for expressing both task-parallel and actor-based computations, supported by a single dynamic execution engine. It employs a distributed scheduler and a distributed, fault-tolerant store to manage the control state.
Journal Article

Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries.

TL;DR: A framework for identifying a broad range of menaces in the research and practices around social data is presented, including biases and inaccuracies at the source of the data, but also introduced during processing.
Journal Article

A Survey on Distributed Machine Learning

TL;DR: In this article, the authors provide an extensive overview of the current state-of-the-art in the field by outlining the challenges and opportunities of distributed machine learning over conventional (centralized) machine learning.
Journal Article

CatBoost for big data: an interdisciplinary review

TL;DR: This survey takes an interdisciplinary approach to cover studies related to CatBoost in a single work, and gives researchers an in-depth understanding to help clarify the proper application of CatBoost in solving problems.
References

"One Size Fits All": An Idea Whose Time Has Come and Gone?

TL;DR: In a later publication [St07], Michael Stonebraker even put forward the thesis that there are no applications for which traditional database systems are the best alternative.
Posted Content

MLI: An API for Distributed Machine Learning

TL;DR: The initial results show that this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
Proceedings Article

MLI: An API for Distributed Machine Learning

TL;DR: MLI is an application programming interface designed to address the challenges of building machine learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms.
Proceedings Article

Rethinking Data-Intensive Science Using Scalable Analytics Systems

TL;DR: This paper describes ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines, while reducing cost by 63%.
Book

An Architecture for Fast and General Data Processing on Large Clusters

TL;DR: This book proposes an architecture for cluster computing systems that can tackle emerging data processing workloads at scale. It introduces a simple extension to MapReduce that adds primitives for data sharing, called Resilient Distributed Datasets (RDDs), which is implemented in the open source Spark system.