Journal Article
Apache Spark: a unified engine for big data processing
Matei Zaharia, Reynold Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph E. Gonzalez, Scott Shenker, Ion Stoica
TL;DR: This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
Citations
Journal Article
Big data in healthcare: management, analysis and future prospects
TL;DR: To provide relevant solutions for improving public health, healthcare providers are required to be fully equipped with appropriate infrastructure to systematically generate and analyze big data.
Proceedings Article
Ray: a distributed framework for emerging AI applications
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica
TL;DR: Ray is a distributed system that implements a unified interface able to express both task-parallel and actor-based computations, supported by a single dynamic execution engine. It employs a distributed scheduler and a distributed, fault-tolerant store to manage the control state.
Journal Article
Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries.
TL;DR: A framework for identifying a broad range of menaces in the research and practices around social data is presented, including biases and inaccuracies at the source of the data, but also introduced during processing.
Journal Article
A Survey on Distributed Machine Learning
Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, Jan S. Rellermeyer
TL;DR: In this article, the authors provide an extensive overview of the current state-of-the-art in the field by outlining the challenges and opportunities of distributed machine learning over conventional (centralized) machine learning.
Journal Article
CatBoost for big data: an interdisciplinary review
TL;DR: This survey takes an interdisciplinary approach to cover studies related to CatBoost in a single work, and gives researchers an in-depth understanding to help clarify the proper application of CatBoost in solving problems.
References
"One Size Fits All": An Idea Whose Time Has Come and Gone?
TL;DR: In a later publication [St07], Michael Stonebraker even put forward the thesis that there are no applications for which traditional database systems are the best alternative.
Posted Content
MLI: An API for Distributed Machine Learning
Evan R. Sparks, Ameet Talwalkar, Virginia Smith, Jey Kottalam, Xinghao Pan, Joseph E. Gonzalez, Michael J. Franklin, Michael I. Jordan, Tim Kraska
TL;DR: The initial results show that this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
Proceedings Article
MLI: An API for Distributed Machine Learning
Evan R. Sparks, Ameet Talwalkar, Virginia Smith, Jey Kottalam, Xinghao Pan, Joseph E. Gonzalez, Michael J. Franklin, Michael I. Jordan, Tim Kraska
TL;DR: MLI is an application programming interface designed to address the challenges of building machine learning algorithms in a distributed setting based on data-centric computing; its primary goal is to simplify the development of high-performance, scalable, distributed algorithms.
Proceedings Article
Rethinking Data-Intensive Science Using Scalable Analytics Systems
Frank Austin Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael D. Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson
TL;DR: Describes ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines, while reducing cost by 63%.
Book
An Architecture for Fast and General Data Processing on Large Clusters
TL;DR: This book proposes an architecture for cluster computing systems that can tackle emerging data processing workloads at scale. Its core is a simple extension to MapReduce that adds primitives for data sharing, called Resilient Distributed Datasets (RDDs), which is implemented in the open source Spark system.