Sparrow: distributed, low latency scheduling
Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica
pp. 69–84
TL;DR: It is demonstrated that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design.

Abstract: Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete in hundreds of milliseconds poses a major challenge for task schedulers, which will need to schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. We implement and deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12% of an ideal scheduler.
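The decentralized sampling the abstract describes builds on the "power of d choices" idea: probe a few randomly chosen workers per task and place the task on the least-loaded one. A minimal per-task sketch of that idea follows; the function name, the queue-length representation, and the default of two probes are assumptions of this sketch, not details taken from the paper (Sparrow itself refines this with batch sampling and late binding).

```python
import random

def sample_place(worker_queues, tasks, probes_per_task=2):
    """Place each task on the least-loaded of d randomly probed workers
    (the "power of d choices" idea underlying Sparrow-style sampling).

    worker_queues: list of ints, the current queue length at each worker
                   (mutated in place as tasks are enqueued).
    Returns the list of chosen worker indices, one per task.
    """
    placements = []
    for _ in tasks:
        # Probe d workers chosen uniformly at random...
        probed = random.sample(range(len(worker_queues)), probes_per_task)
        # ...and enqueue on the least-loaded of the probed workers.
        chosen = min(probed, key=lambda w: worker_queues[w])
        worker_queues[chosen] += 1
        placements.append(chosen)
    return placements
```

Even with d = 2 this keeps queue lengths close to uniform while requiring no central state, which is what lets the design scale out and survive scheduler failures.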
Citations
Proceedings Article
Large-scale cluster management at Google with Borg
TL;DR: A summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it are presented.
Proceedings Article
Ray: a distributed framework for emerging AI applications
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica
TL;DR: Ray is a distributed system that implements a unified interface capable of expressing both task-parallel and actor-based computations, supported by a single dynamic execution engine; it employs a distributed scheduler and a distributed, fault-tolerant store to manage the control state.
Proceedings Article
Occupy the cloud: distributed computing for the 99%
TL;DR: Based on recent trends in network bandwidth and the advent of disaggregated storage, the authors argue that stateless functions are a natural fit for data processing in future computing environments and represent a viable platform for such users, eliminating cluster management overhead and fulfilling the promise of elasticity.
Proceedings Article
An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems
Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, Christina Delimitrou
TL;DR: This paper presents DeathStarBench, a novel, open-source benchmark suite built with microservices that is representative of large end-to-end services, modular, and extensible; the suite is used to study the architectural characteristics of microservices, their implications for networking and operating systems, their challenges with respect to cluster management, and their trade-offs in terms of application design and programming frameworks.
Proceedings Article
Apollo: scalable and coordinated scheduling for cloud-scale computing
Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, Lidong Zhou
TL;DR: Apollo is a highly scalable and coordinated scheduling framework that has been deployed on production clusters at Microsoft to schedule thousands of computations with millions of tasks efficiently and effectively on tens of thousands of machines daily.
References
Proceedings Article
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
TL;DR: Resilient Distributed Datasets (RDDs) are presented: a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner, implemented in a system called Spark and evaluated through a variety of user applications and benchmarks.
Book
Hadoop: The Definitive Guide
TL;DR: This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
Journal Article
Analysis and simulation of a fair queueing algorithm
TL;DR: In this article, a fair gateway queueing algorithm, based on an earlier suggestion by Nagle, is proposed to control congestion in datagram networks.
Proceedings Article
Analysis and simulation of a fair queueing algorithm
TL;DR: It is found that fair queueing provides several important advantages over the usual first-come-first-served queueing algorithm: fair allocation of bandwidth, lower delay for sources using less than their full share of bandwidth, and protection from ill-behaved sources.
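The core mechanism behind those advantages is a finish-tag rule that approximates bit-by-bit round robin: each packet is stamped with a tag equal to its flow's previous finish tag plus the packet's size, and packets are served in increasing tag order. The sketch below is a simplification that assumes all packets arrive at virtual time 0, so the full virtual-clock bookkeeping of the actual algorithm is omitted.

```python
def fair_schedule(packets):
    """Simplified fair-queueing sketch: assign each packet a finish tag
    (the flow's previous tag plus the packet size) and serve packets in
    increasing tag order, approximating bit-by-bit round robin.

    packets: list of (flow_id, size) pairs in arrival order, all assumed
             to arrive at virtual time 0 (an assumption of this sketch).
    Returns the flow_ids in service order.
    """
    last_finish = {}
    tagged = []
    for seq, (flow, size) in enumerate(packets):
        # A flow's next packet cannot finish before its previous one.
        tag = last_finish.get(flow, 0.0) + size
        last_finish[flow] = tag
        tagged.append((tag, seq, flow))
    # Serve in finish-tag order; seq breaks ties by arrival order.
    return [flow for _, _, flow in sorted(tagged)]
```

Note how a flow sending small packets accumulates tags slowly and so is served ahead of a flow hogging the link with large packets, which is exactly the fairness and lower-delay property the summary describes.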
Proceedings Article
Improving MapReduce performance in heterogeneous environments
TL;DR: A new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity and can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.
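The LATE heuristic named above can be sketched roughly as follows: estimate each running task's remaining time from its observed progress rate and speculatively re-execute the task expected to finish last. The field names and selection loop below are assumptions of this sketch; the real scheduler adds caps on the number of speculative copies and thresholds on which nodes may run them, which are not shown here.

```python
def pick_speculative_task(tasks, now):
    """LATE-style sketch: estimate each task's time to completion from
    its progress rate and return the task with the Longest Approximate
    Time to End as the candidate for speculative re-execution.

    tasks: list of dicts with 'progress' in [0, 1] and a 'start' timestamp.
    Returns the task estimated to finish last, or None if no estimate exists.
    """
    best, best_eta = None, -1.0
    for t in tasks:
        elapsed = now - t['start']
        if t['progress'] <= 0 or elapsed <= 0:
            continue  # no progress-rate estimate yet
        rate = t['progress'] / elapsed       # progress per unit time
        eta = (1.0 - t['progress']) / rate   # estimated time to end
        if eta > best_eta:
            best, best_eta = t, eta
    return best
```

Ranking by estimated time to end, rather than by raw progress as earlier speculative-execution schemes did, is what makes the heuristic robust when machines in a heterogeneous cluster run at very different speeds.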