Proceedings ArticleDOI

Scheduling multi-tenant cloud workloads on accelerator-based systems

TL;DR: The Strings scheduler realizes the vision of a dynamic model in which GPUs are treated as first-class schedulable entities by decomposing the GPU scheduling problem into a combination of load balancing and per-device scheduling.
Abstract: Accelerator-based systems are making rapid inroads into becoming platforms of choice for high-end cloud services. There is a need, therefore, to move from the current model, in which high-performance applications explicitly and programmatically select the GPU devices on which to run, to a dynamic model in which GPUs are treated as first-class schedulable entities. The Strings scheduler realizes this vision by decomposing the GPU scheduling problem into a combination of load balancing and per-device scheduling: (i) device-level scheduling efficiently uses all of a GPU's hardware resources, including its computational and data movement engines, and (ii) load balancing goes beyond obtaining high throughput to ensure fairness, by prioritizing GPU requests that have attained the least service. With these methods, Strings achieves improvements in system throughput and fairness of up to 8.70x and 13%, respectively, compared to the CUDA runtime.
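The least-attained-service policy the abstract describes can be sketched as a toy balancer: requests are dispatched to the least-loaded GPU for throughput, while request selection favors the tenant that has received the least service so far for fairness. This is an illustrative sketch only; the class and method names are invented here and are not from the Strings implementation.

```python
class LeastServiceBalancer:
    """Toy balancer: dispatch to the least-loaded GPU (throughput), and
    pick the pending request whose tenant has attained the least service
    so far (fairness). Illustrative; not the Strings code."""

    def __init__(self, gpu_ids):
        self.service = {}                            # attained service per tenant
        self.gpu_load = {g: 0.0 for g in gpu_ids}    # outstanding load per GPU

    def pick_request(self, pending_tenants):
        # Fairness: serve the tenant with the least attained service first.
        return min(pending_tenants, key=lambda t: self.service.get(t, 0.0))

    def dispatch(self, tenant, cost):
        # Throughput: send the request to the currently least-loaded GPU,
        # then account the cost as service attained by this tenant.
        gpu = min(self.gpu_load, key=self.gpu_load.get)
        self.gpu_load[gpu] += cost
        self.service[tenant] = self.service.get(tenant, 0.0) + cost
        return gpu
```

A real scheduler would feed back measured GPU time (compute and data movement) rather than a static cost estimate.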
Citations
Journal ArticleDOI
TL;DR: An extensive and in-depth survey of GPU virtualization techniques and their scheduling methods is presented and a perspective on the challenges and opportunities for virtualization of heterogeneous computing environments is delivered.
Abstract: The integration of graphics processing units (GPUs) on high-end compute nodes has established a new accelerator-based heterogeneous computing model, which now permeates high-performance computing. The same paradigm nevertheless has limited adoption in cloud computing or other large-scale distributed computing paradigms. Heterogeneous computing with GPUs can benefit the Cloud by reducing operational costs and improving resource and energy efficiency. However, such a paradigm shift would require effective methods for virtualizing GPUs, as well as other accelerators. In this survey article, we present an extensive and in-depth survey of GPU virtualization techniques and their scheduling methods. We review a wide range of virtualization techniques implemented at the GPU library, driver, and hardware levels. Furthermore, we review GPU scheduling methods that address performance and fairness issues between multiple virtual machines sharing GPUs. We believe that our survey delivers a perspective on the challenges and opportunities for virtualization of heterogeneous computing environments.

84 citations


Cites background from "Scheduling multi-tenant cloud workl..."

  • ...Strings [Sengupta et al. 2014] extends its previous work, Rain [Sengupta et al. 2013], by including more effective scheduling policies....

Proceedings ArticleDOI
15 Nov 2015
TL;DR: GraphReduce is presented, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device's internal memory capacity and significantly outperforms other competing out-of-memory approaches.
Abstract: Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device's internal memory capacity. GraphReduce adopts a combination of edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host and device. GraphReduce-based programming is performed via device functions that include gatherMap, gatherReduce, apply, and scatter, implemented by programmers for the graph algorithms they wish to realize. Extensive experimental evaluations for a wide variety of graph inputs and algorithms demonstrate that GraphReduce significantly outperforms other competing out-of-memory approaches.
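The Gather-Apply-Scatter flow that GraphReduce builds on can be illustrated with one synchronous superstep on the host. The function below is a CPU-side sketch only; GraphReduce itself runs gatherMap, gatherReduce, apply, and scatter as GPU device functions over graph partitions streamed between host and device.

```python
def gas_step(vertices, edges, gather_map, gather_reduce, apply_fn, scatter_fn, init):
    """One synchronous Gather-Apply-Scatter superstep (host-side sketch).
    vertices: {vertex_id: value}; edges: list of (src, dst) pairs."""
    # Gather: map over incoming edges, then reduce per destination vertex.
    acc = {v: init for v in vertices}
    for (src, dst) in edges:
        acc[dst] = gather_reduce(acc[dst], gather_map(vertices[src], (src, dst)))
    # Apply: combine each vertex's old value with its gathered accumulator.
    new_vals = {v: apply_fn(vertices[v], acc[v]) for v in vertices}
    # Scatter: mark vertices whose change should activate neighbors next step.
    active = {v for v in vertices if scatter_fn(vertices[v], new_vals[v])}
    return new_vals, active
```

For example, summing in-neighbor values is gather_map = source value, gather_reduce = addition, apply = replace with the accumulator.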

81 citations


Additional excerpts

  • ...Categories and Subject Descriptors: C.1.2 [Multiple Data Stream Architecture]: Single-Instruction, Multiple-Data Processors (SIMD); D.1.3 [Programming Techniques]: Concurrent Programming: Parallel Programming. General Terms: Design, Experimentation, Performance, Big Data. Keywords: Graph Analytics, Big Data, GPGPU, Performance Optimization, Data Movement Optimization....

Proceedings ArticleDOI
01 Oct 2020
TL;DR: This paper defines Planaria, a microarchitectural capability that can dynamically fission (break) into multiple smaller yet full-fledged DNN engines at runtime, enabling spatial co-location of multiple DNN inference services on the same hardware and thus simultaneous multi-tenant DNN acceleration.
Abstract: Deep Neural Networks (DNNs) have reinvigorated real-world applications that rely on learning patterns of data and are permeating into different industries and markets. Cloud infrastructure and accelerators that offer INFerence-as-a-Service (INFaaS) have become the enabler of this rather quick and invasive shift in the industry. To that end, mostly accelerator-based INFaaS (Google's TPU [1], NVIDIA T4 [2], Microsoft Brainwave [3], etc.) has become the backbone of many real-life applications. However, as the demand for such services grows, merely scaling out the number of accelerators is not economically cost-effective. Although multi-tenancy has propelled datacenter scalability, it has not been a primary factor in designing DNN accelerators, due to the arms race for higher speed and efficiency. This paper sets out to explore this timely requirement of multi-tenancy through a new dimension: dynamic architecture fission. To that end, we define Planaria, which can dynamically fission (break) into multiple smaller yet full-fledged DNN engines at runtime. This microarchitectural capability enables spatially co-locating multiple DNN inference services on the same hardware, offering simultaneous multi-tenant DNN acceleration. To realize this dynamic reconfigurability, we first devise breakable omni-directional systolic arrays for DNN acceleration that allow omni-directional flow of data. Second, Planaria uses this capability and a unique organization of on-chip memory, interconnection, and compute resources to enable fission in systolic-array-based DNN accelerators. Architecture fission and its associated flexibility enable an extra degree of freedom for task scheduling that even allows breaking the accelerator with regard to the server load, DNN topology, and task priority. As such, it can simultaneously co-locate DNNs to enhance utilization, throughput, QoS, and fairness.
We compare the proposed design to PREMA [4], a recent effort that offers multi-tenancy by time-multiplexing the DNN accelerator across multiple tasks. We use the same frequency and the same amount of compute and memory resources for both accelerators. The results show significant benefits under (soft, medium, hard) QoS requirements, in throughput (7.4×, 7.2×, 12.2×), SLA satisfaction rate (45%, 15%, 16%), and fairness (2.1×, 2.3×, 1.9×).
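The fission idea can be illustrated with a toy spatial-partitioning plan that splits a monolithic accelerator's compute pods among co-located tasks in proportion to priority. This is a hypothetical sketch only: the pod granularity, field names, and proportional rule are invented here, and Planaria's actual scheduler also accounts for QoS slack, server load, and DNN topology.

```python
def fission_plan(total_pods, tasks):
    """Toy spatial partitioning: split `total_pods` compute pods among
    co-located DNN tasks proportionally to priority. Illustrative only."""
    weight = sum(t["priority"] for t in tasks)
    plan, assigned = {}, 0
    for t in tasks[:-1]:
        share = (total_pods * t["priority"]) // weight
        plan[t["name"]] = max(1, share)   # every task gets a full-fledged engine
        assigned += plan[t["name"]]
    # Give the remainder to the last task so all pods stay allocated.
    plan[tasks[-1]["name"]] = total_pods - assigned
    return plan
```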

72 citations


Cites background from "Scheduling multi-tenant cloud workl..."

  • ...There is a large swath of related work on multi-tenancy for CPUs [36, 66, 97–103] and GPUs [29, 35, 37–42, 66, 104–110] due to its vitality for cloud-scale computing....


  • ...In fact, the broader research community invested more than a decade of efforts to develop solutions across the computing stack to bring forth seamless and scalable multi-tenant cloud execution models [26–48]....


Book ChapterDOI
24 Aug 2016
TL;DR: A dynamic graph analytics framework, GraphIn, that incrementally processes graphs on-the-fly using fixed-sized batches of updates and a novel programming model called I-GAS based on gather-apply-scatter programming paradigm that allows for implementing a large set of incremental graph processing algorithms seamlessly across multiple CPU cores are proposed.
Abstract: The massive explosion in social networks has led to a significant growth in graph analytics and specifically in dynamic, time-varying graphs. Most prior work processes dynamic graphs by first storing the updates and then repeatedly running static graph analytics on saved snapshots. To handle the extreme scale and fast evolution of real-world graphs, we propose a dynamic graph analytics framework, GraphIn, that incrementally processes graphs on-the-fly using fixed-sized batches of updates. As part of GraphIn, we propose a novel programming model called I-GAS, based on the gather-apply-scatter programming paradigm, that allows for implementing a large set of incremental graph processing algorithms seamlessly across multiple CPU cores. We further propose a property-based, dual-path execution model to choose between incremental or static computation. Our experiments show that for a variety of graph inputs and algorithms, GraphIn achieves up to 9.3 million updates/sec and over 400× speedup when compared to static graph recomputation.
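The dual-path execution model can be sketched as a property-based switch over fixed-sized update batches. Both function names, the "affected vertices" property, and the threshold below are illustrative assumptions, not GraphIn's actual heuristic.

```python
def batch_updates(updates, batch_size):
    """Group a stream of graph updates into the fixed-sized batches that
    an incremental framework processes on-the-fly (illustrative)."""
    return [updates[i:i + batch_size] for i in range(0, len(updates), batch_size)]

def choose_execution_path(num_affected, num_vertices, threshold=0.5):
    """Dual-path sketch: process a batch incrementally while it touches a
    small fraction of the graph, else fall back to static recomputation.
    The property and threshold are invented for illustration."""
    return "incremental" if num_affected < threshold * num_vertices else "static"
```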

58 citations


Additional excerpts

  • ...GraphReduce [22] framework can efficiently process graphs that cannot fit into the limited GPU memory [20,21] by mapping sub-graphs to the different memory abstractions of slow and fast memory [23]....


Proceedings ArticleDOI
23 May 2016
TL;DR: Mystic, an interference-aware scheduler for efficient co-execution of applications on GPU-based clusters and cloud servers is presented, which identifies the similarities between new applications and the executing applications, and guides the scheduler to minimize the interference and improve system throughput.
Abstract: GPUs have become the primary choice of accelerators for high-end data centers and cloud servers, which can host thousands of disparate applications. With the growing demands for GPUs on clusters, there arises a need for efficient co-execution of applications on the same accelerator device. However, the resource contention among co-executing applications causes interference which leads to degradation in execution performance, impacts QoS requirements of applications and lowers overall system throughput. While previous work has proposed techniques for detecting interference, the existing solutions are either developed for CPU clusters, or use static profiling approaches which can be computationally intensive and do not scale well. We present Mystic, an interference-aware scheduler for efficient co-execution of applications on GPU-based clusters and cloud servers. The most important feature of Mystic is the use of learning-based analytical models for detecting interference between applications. We leverage a collaborative filtering framework to characterize an incoming application with respect to the interference it may cause when co-executing with other applications while sharing GPU resources. Mystic identifies the similarities between new applications and the executing applications, and guides the scheduler to minimize the interference and improve system throughput. We train the learning model with 42 CUDA applications, and consider another separate set of 55 diverse, real-world GPU applications for evaluation. Mystic is evaluated on a live GPU cluster with 32 NVIDIA GPUs. Our framework achieves performance guarantees for 90.3% of the evaluated applications. When compared with state-of-the-art interference-oblivious schedulers, Mystic improves the system throughput by 27.5% on average, and achieves a 16.3% improvement on average in GPU utilization.
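The similarity step of such a collaborative-filtering approach can be approximated with a nearest-neighbor sketch: score a new application's resource profile against already-profiled applications and reuse the closest match's interference estimate. Mystic's actual model is richer (matrix factorization over partial profiles); the cosine similarity and field names here are illustrative assumptions.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length profile vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def predict_interference(new_profile, known_apps):
    """Nearest-neighbor sketch of interference prediction: borrow the
    interference estimate of the most similar profiled application.
    Illustrative only; not Mystic's actual algorithm."""
    best = max(known_apps, key=lambda app: cosine(new_profile, app["profile"]))
    return best["interference"]
```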

56 citations


Cites background from "Scheduling multi-tenant cloud workl..."

  • ...presents the Strings scheduler which decouples CPU and GPU execution, and guides scheduling using feedback about execution time, GPU utilization, data transfer, and memory bandwidth utilization from each GPU [32]....


  • ...Interference-aware scheduling on GPU servers has been attempted in the past [31, 32]....


  • ...which enables a single GPU to be shared in both space and time [13, 32]....


References
Posted Content
TL;DR: A quantitative measure called the Index of Fairness is proposed, applicable to any resource sharing or allocation problem and independent of the amount of the resource; its boundedness aids intuitive understanding of the fairness index.
Abstract: Fairness is an important performance criterion in all resource allocation schemes, including those in distributed computer systems. However, it is often specified only qualitatively. The quantitative measures proposed in the literature are either too specific to a particular application, or suffer from some undesirable characteristics. In this paper, we have introduced a quantitative measure called the Index of Fairness. The index is applicable to any resource sharing or allocation problem. It is independent of the amount of the resource. The fairness index always lies between 0 and 1. This boundedness aids intuitive understanding of the fairness index. For example, a distribution algorithm with a fairness of 0.10 means that it is unfair to 90% of the users. Also, the discrimination index can be defined as 1 - fairness index.
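The measure described here is Jain's fairness index, f(x) = (sum of x_i)^2 / (n * sum of x_i^2) for an allocation x over n users; a direct implementation:

```python
def jain_fairness_index(allocations):
    """Jain's fairness index: (sum x_i)^2 / (n * sum x_i^2).
    Equals 1.0 for a perfectly equal allocation and 1/n when a single
    user receives everything; always bounded in (0, 1]."""
    n = len(allocations)
    s = sum(allocations)
    sq = sum(x * x for x in allocations)
    return (s * s) / (n * sq) if sq else 0.0
```

The discrimination index from the abstract is then simply 1 - jain_fairness_index(x).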

4,476 citations

Journal ArticleDOI
01 Feb 2011
TL;DR: StarPU as mentioned in this paper is a runtime system that provides a high-level unified execution model for numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware and easily develop and tune powerful scheduling algorithms.
Abstract: In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE) or data-parallel accelerators (e.g. GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a main challenge. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run-time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements regarding execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. We eventually show that our dynamic approach competes with the highly optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way.
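The kind of performance-model-driven dispatch that such tunable scheduling strategies enable can be sketched as a minimize-expected-finish-time rule: each task carries a per-architecture cost estimate and lands on the worker that would finish it soonest. The worker/task dictionaries below are illustrative assumptions, not StarPU's API.

```python
def schedule_task(task, workers):
    """Heterogeneous dispatch sketch: place the task on the worker that
    minimizes estimated finish time, given per-architecture cost
    estimates. Illustrative only; not StarPU's actual interface."""
    def finish_time(w):
        # Worker is free at w["ready_at"]; the task's cost depends on
        # which architecture (CPU core, GPU, ...) executes it.
        return w["ready_at"] + task["cost"][w["arch"]]
    best = min(workers, key=finish_time)
    best["ready_at"] = finish_time(best)   # reserve the worker
    return best["name"]
```

Note how the rule naturally keeps feeding the GPU while its queue is short, yet spills onto CPU cores once queued GPU work would finish later than a slower CPU execution.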

1,116 citations

01 Jan 1998
TL;DR: The Index of Fairness, as mentioned in this paper, is a quantitative measure applicable to any resource sharing or allocation problem; it is independent of the amount of the resource, and the fairness index always lies between 0 and 1.
Abstract: Fairness is an important performance criterion in all resource allocation schemes, including those in distributed computer systems. However, it is often specified only qualitatively. The quantitative measures proposed in the literature are either too specific to a particular application, or suffer from some undesirable characteristics. In this paper, we have introduced a quantitative measure called the Index of Fairness. The index is applicable to any resource sharing or allocation problem. It is independent of the amount of the resource. The fairness index always lies between 0 and 1. This boundedness aids intuitive understanding of the fairness index. For example, a distribution algorithm with a fairness of 0.10 means that it is unfair to 90% of the users. Also, the discrimination index can be defined as 1 - fairness index.

1,064 citations

Proceedings ArticleDOI
03 Dec 1995
TL;DR: Contrary to the belief that μ-kernel based systems are inherently inefficient and inflexible, it is shown and supported by documentary evidence that the inefficiency and inflexibility of current μ-kernels is not inherent in the basic idea but stems mostly from overloading the kernel and/or from improper implementation.
Abstract: From a software-technology point of view, the μ-kernel concept is superior to large integrated kernels. On the other hand, it is widely believed that (a) μ-kernel based systems are inherently inefficient and (b) they are not sufficiently flexible. Contradictory to this belief, we show and support by documentary evidence that inefficiency and inflexibility of current μ-kernels is not inherited from the basic idea but mostly from overloading the kernel and/or from improper implementation. Based on functional reasons, we describe some concepts which must be implemented by a μ-kernel and illustrate their flexibility. Then, we analyze the performance critical points. We show what performance is achievable, that the efficiency is sufficient with respect to macro-kernels and why some published contradictory measurements are not evident. Furthermore, we describe some implementation techniques and illustrate why μ-kernels are inherently not portable, although they improve portability of the whole system.

674 citations

Journal ArticleDOI
12 Nov 2000
TL;DR: It is demonstrated that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler, and that a small sample of the possible schedules is sufficient to identify a good schedule quickly.
Abstract: Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speed up the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous execution, the operating system scheduler must choose the set of jobs to coschedule.
This paper demonstrates that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler. Thus, the full benefits of SMT hardware can only be achieved if the scheduler is aware of thread interactions. Here, a mechanism is presented that allows the scheduler to significantly raise the performance of SMT architectures. This is done without any advance knowledge of a workload's characteristics, using sampling to identify jobs which run well together.
We demonstrate an SMT jobscheduler called SOS. SOS combines an overhead-free sample phase which collects information about various possible schedules, and a symbiosis phase which uses that information to predict which schedule will provide the best performance. We show that a small sample of the possible schedules is sufficient to identify a good schedule quickly. On a system with random job arrivals and departures, response time is improved as much as 17% over a schedule which does not incorporate symbiosis.
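The sample-then-predict structure of SOS can be sketched as follows: try a few candidate coschedules, score each with a measurement callback, and keep the best. The random sampling and the `measure` callback are illustrative stand-ins for SOS's overhead-free sample phase and symbiosis phase.

```python
import random

def sos_schedule(jobs, contexts, measure, samples=5):
    """Sample-then-predict sketch: draw a few random coschedules that
    group `jobs` into sets of up to `contexts` hardware contexts, score
    each with `measure` (e.g., sampled IPC), and return the best one.
    Illustrative only; not the actual SOS mechanism."""
    best, best_score = None, float("-inf")
    for _ in range(samples):
        order = random.sample(jobs, len(jobs))   # one candidate coschedule
        schedule = [order[i:i + contexts] for i in range(0, len(order), contexts)]
        score = measure(schedule)                # symbiosis estimate
        if score > best_score:
            best, best_score = schedule, score
    return best
```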

619 citations