Open Access, Proceedings Article

Sparrow: distributed, low latency scheduling

TL;DR
It is demonstrated that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design.
Abstract
Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete in hundreds of milliseconds poses a major challenge for task schedulers, which will need to schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. We implement and deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12% of an ideal scheduler.
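
The abstract names a decentralized, randomized sampling approach but does not spell out the mechanics. As a rough illustration of the general idea only, the sketch below probes a few randomly chosen workers and places each task on the least loaded one (the classic "power of d choices" technique). The function names, the queue-length load metric, and all parameters are assumptions made for this example; this is not Sparrow's implementation.

```python
import random


def place_task(queue_lengths, probes=2, rng=random):
    """Probe `probes` distinct random workers; return the index of the shortest queue.

    Hypothetical helper for illustration: load is modeled as a plain queue length.
    """
    candidates = rng.sample(range(len(queue_lengths)), probes)
    return min(candidates, key=lambda worker: queue_lengths[worker])


def simulate(num_workers=100, num_tasks=1000, probes=2, seed=0):
    """Place tasks one at a time with no central view of cluster load.

    Each placement only looks at the sampled workers, so any number of
    independent schedulers could run this logic in parallel.
    """
    rng = random.Random(seed)
    queues = [0] * num_workers
    for _ in range(num_tasks):
        worker = place_task(queues, probes, rng)
        queues[worker] += 1
    return queues


if __name__ == "__main__":
    for d in (1, 2):
        queues = simulate(probes=d)
        print(f"probes={d}: longest queue = {max(queues)}")
```

With `probes=1` this degenerates to purely random placement; raising `probes` trades a little extra probing for a more even spread of load, which is the intuition behind sampling-based schedulers.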


Citations
Proceedings Article

Large-scale cluster management at Google with Borg

TL;DR: A summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it are presented.
Proceedings Article

Ray: a distributed framework for emerging AI applications

TL;DR: Ray is a distributed system that implements a unified interface for expressing both task-parallel and actor-based computations, supported by a single dynamic execution engine. It employs a distributed scheduler and a distributed, fault-tolerant store to manage the control state.
Proceedings Article

Occupy the cloud: distributed computing for the 99%

TL;DR: Stateless functions represent a viable platform for users, eliminating cluster management overhead and fulfilling the promise of elasticity. Given recent trends in network bandwidth and the advent of disaggregated storage, they are a natural fit for data processing in future computing environments.
Proceedings Article

An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems

TL;DR: This paper presents DeathStarBench, a novel, open-source benchmark suite built with microservices that is representative of large end-to-end services, modular and extensible, and uses it to study the architectural characteristics of microservices, their implications in networking and operating systems, their challenges with respect to cluster management, and their trade-offs in terms of application design and programming frameworks.
Proceedings Article

Apollo: scalable and coordinated scheduling for cloud-scale computing

TL;DR: Apollo is a highly scalable and coordinated scheduling framework that has been deployed on production clusters at Microsoft to schedule thousands of computations with millions of tasks efficiently and effectively on tens of thousands of machines daily.
References
Proceedings Article

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
Book

Hadoop: The Definitive Guide

Tom White
TL;DR: This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
Journal Article

Analysis and simulation of a fair queueing algorithm

TL;DR: In this article, a fair queueing algorithm for gateways, based on an earlier suggestion by Nagle, is proposed to control congestion in datagram networks.
Proceedings Article

Analysis and simulation of a fair queueing algorithm

TL;DR: It is found that fair queueing provides several important advantages over the usual first-come-first-serve queueing algorithm: fair allocation of bandwidth, lower delay for sources using less than their full share of bandwidth and protection from ill-behaved sources.
Proceedings Article

Improving MapReduce performance in heterogeneous environments

TL;DR: A new scheduling algorithm, Longest Approximate Time to End (LATE), is presented that is highly robust to heterogeneity and can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.