Sparrow: distributed, low latency scheduling
Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica
pp. 69–84
TL;DR: It is demonstrated that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design.

Abstract: Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete in hundreds of milliseconds poses a major challenge for task schedulers, which will need to schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. We implement and deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12% of an ideal scheduler.
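The decentralized sampling the abstract describes builds on the "power of d choices" idea: probe a few randomly chosen workers per task and place the task on the least-loaded one. A minimal per-task sketch of that idea follows; the function name, the queue-length representation, and the default of two probes are assumptions of this sketch, not details taken from the paper (Sparrow itself refines this with batch sampling and late binding).

```python
import random

def sample_place(worker_queues, tasks, probes_per_task=2):
    """Place each task on the least-loaded of d randomly probed workers
    (the "power of d choices" idea underlying Sparrow-style sampling).

    worker_queues: list of ints, the current queue length at each worker
                   (mutated in place as tasks are enqueued).
    Returns the list of chosen worker indices, one per task.
    """
    placements = []
    for _ in tasks:
        # Probe d workers chosen uniformly at random...
        probed = random.sample(range(len(worker_queues)), probes_per_task)
        # ...and enqueue on the least-loaded of the probed workers.
        chosen = min(probed, key=lambda w: worker_queues[w])
        worker_queues[chosen] += 1
        placements.append(chosen)
    return placements
```

Even with d = 2 this keeps queue lengths close to uniform while requiring no central state, which is what lets the design scale out and survive scheduler failures.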
Citations
Proceedings Article
Large-scale cluster management at Google with Borg
TL;DR: A summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it are presented.
Proceedings Article
Ray: a distributed framework for emerging AI applications
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica
TL;DR: Ray is a distributed system that implements a unified interface capable of expressing both task-parallel and actor-based computations, supported by a single dynamic execution engine; it employs a distributed scheduler and a distributed, fault-tolerant store to manage the control state.
Proceedings Article
Occupy the cloud: distributed computing for the 99%
TL;DR: Based on recent trends in network bandwidth and the advent of disaggregated storage, the authors argue that stateless functions are a natural fit for data processing in future computing environments and represent a viable platform for such users, eliminating cluster management overhead and fulfilling the promise of elasticity.
Proceedings Article
An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems
Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, Christina Delimitrou
TL;DR: This paper presents DeathStarBench, a novel, open-source benchmark suite built with microservices that is representative of large end-to-end services, modular, and extensible; the suite is used to study the architectural characteristics of microservices, their implications for networking and operating systems, their challenges with respect to cluster management, and their trade-offs in terms of application design and programming frameworks.
Proceedings Article
Apollo: scalable and coordinated scheduling for cloud-scale computing
Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, Lidong Zhou
TL;DR: Apollo is a highly scalable and coordinated scheduling framework that has been deployed on production clusters at Microsoft to schedule thousands of computations with millions of tasks efficiently and effectively on tens of thousands of machines daily.
References
Proceedings Article
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
TL;DR: Resilient Distributed Datasets (RDDs) are presented: a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner, implemented in a system called Spark and evaluated through a variety of user applications and benchmarks.
Book
Hadoop: The Definitive Guide
TL;DR: This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
Journal Article
Analysis and simulation of a fair queueing algorithm
TL;DR: In this article, a fair gateway queueing algorithm, based on an earlier suggestion by Nagle, is proposed to control congestion in datagram networks.
Proceedings Article
Analysis and simulation of a fair queueing algorithm
TL;DR: It is found that fair queueing provides several important advantages over the usual first-come-first-served queueing algorithm: fair allocation of bandwidth, lower delay for sources using less than their full share of bandwidth, and protection from ill-behaved sources.
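The core mechanism behind those advantages is a finish-tag rule that approximates bit-by-bit round robin: each packet is stamped with a tag equal to its flow's previous finish tag plus the packet's size, and packets are served in increasing tag order. The sketch below is a simplification that assumes all packets arrive at virtual time 0, so the full virtual-clock bookkeeping of the actual algorithm is omitted.

```python
def fair_schedule(packets):
    """Simplified fair-queueing sketch: assign each packet a finish tag
    (the flow's previous tag plus the packet size) and serve packets in
    increasing tag order, approximating bit-by-bit round robin.

    packets: list of (flow_id, size) pairs in arrival order, all assumed
             to arrive at virtual time 0 (an assumption of this sketch).
    Returns the flow_ids in service order.
    """
    last_finish = {}
    tagged = []
    for seq, (flow, size) in enumerate(packets):
        # A flow's next packet cannot finish before its previous one.
        tag = last_finish.get(flow, 0.0) + size
        last_finish[flow] = tag
        tagged.append((tag, seq, flow))
    # Serve in finish-tag order; seq breaks ties by arrival order.
    return [flow for _, _, flow in sorted(tagged)]
```

Note how a flow sending small packets accumulates tags slowly and so is served ahead of a flow hogging the link with large packets, which is exactly the fairness and lower-delay property the summary describes.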
Proceedings Article
Improving MapReduce performance in heterogeneous environments
TL;DR: A new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity and can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.
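The LATE heuristic named above can be sketched roughly as follows: estimate each running task's remaining time from its observed progress rate and speculatively re-execute the task expected to finish last. The field names and selection loop below are assumptions of this sketch; the real scheduler adds caps on the number of speculative copies and thresholds on which nodes may run them, which are not shown here.

```python
def pick_speculative_task(tasks, now):
    """LATE-style sketch: estimate each task's time to completion from
    its progress rate and return the task with the Longest Approximate
    Time to End as the candidate for speculative re-execution.

    tasks: list of dicts with 'progress' in [0, 1] and a 'start' timestamp.
    Returns the task estimated to finish last, or None if no estimate exists.
    """
    best, best_eta = None, -1.0
    for t in tasks:
        elapsed = now - t['start']
        if t['progress'] <= 0 or elapsed <= 0:
            continue  # no progress-rate estimate yet
        rate = t['progress'] / elapsed       # progress per unit time
        eta = (1.0 - t['progress']) / rate   # estimated time to end
        if eta > best_eta:
            best, best_eta = t, eta
    return best
```

Ranking by estimated time to end, rather than by raw progress as earlier speculative-execution schemes did, is what makes the heuristic robust when machines in a heterogeneous cluster run at very different speeds.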