Proceedings Article

TimeGraph: GPU scheduling for real-time multi-tasking environments

TL;DR: TimeGraph, a real-time GPU scheduler implemented at the device-driver level, protects important GPU workloads from performance interference and supports two priority-based scheduling policies to address the tradeoff between response times and throughput introduced by the asynchronous and non-preemptive nature of GPU processing.
Abstract: The Graphics Processing Unit (GPU) is now commonly used for graphics and data-parallel computing. As more and more applications tend to accelerate on the GPU in multi-tasking environments where multiple tasks access the GPU concurrently, operating systems must provide prioritization and isolation capabilities in GPU resource management, particularly in real-time setups. We present TimeGraph, a real-time GPU scheduler at the device-driver level for protecting important GPU workloads from performance interference. TimeGraph adopts a new event-driven model that synchronizes the GPU with the CPU to monitor GPU commands issued from the user space and control GPU resource usage in a responsive manner. TimeGraph supports two priority-based scheduling policies in order to address the tradeoff between response times and throughput introduced by the asynchronous and non-preemptive nature of GPU processing. Resource reservation mechanisms are also employed to account for and enforce GPU resource usage, which prevents misbehaving tasks from exhausting GPU resources. Prediction of GPU command execution costs is further provided to enhance isolation. Our experiments using OpenGL graphics benchmarks demonstrate that TimeGraph maintains the frame-rates of primary GPU tasks at the desired level even in the face of extreme GPU workloads, whereas these tasks become nearly unresponsive without TimeGraph support. We also find that the performance overhead imposed by TimeGraph can be limited to 4-10%, and that its event-driven scheduler improves throughput by about 30 times over the existing tick-driven scheduler.
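
To make the scheduling model concrete, here is a minimal sketch of event-driven, priority-based scheduling of non-preemptive GPU command groups in the spirit of the abstract. It is not TimeGraph's actual driver code; the names (cmd_group, gpu_sched, submit_to_hw) are hypothetical. The key point is that a non-preemptive GPU only offers a scheduling decision at command-group boundaries, signaled by a GPU-to-CPU completion interrupt.

    /* A minimal sketch (not TimeGraph's actual code) of event-driven,
     * priority-based scheduling of non-preemptive GPU command groups. */
    #include <stdbool.h>

    struct cmd_group {
        int priority;                /* higher value = more important task */
        struct cmd_group *next;      /* singly linked, priority-ordered    */
    };

    struct gpu_sched {
        struct cmd_group *ready;     /* queue of waiting command groups    */
        bool gpu_busy;               /* a running group cannot be preempted */
    };

    static void submit_to_hw(struct cmd_group *g) { /* write to GPU ring */ }

    /* Insert a command group into the wait queue in priority order. */
    static void enqueue(struct gpu_sched *s, struct cmd_group *g)
    {
        struct cmd_group **p = &s->ready;
        while (*p && (*p)->priority >= g->priority)
            p = &(*p)->next;
        g->next = *p;
        *p = g;
    }

    /* User space issued a command group. */
    void on_submit(struct gpu_sched *s, struct cmd_group *g)
    {
        if (!s->gpu_busy) {          /* GPU idle: dispatch immediately */
            s->gpu_busy = true;
            submit_to_hw(g);
        } else {
            enqueue(s, g);           /* otherwise wait for a boundary  */
        }
    }

    /* GPU-to-CPU completion interrupt: the only point at which a
     * non-preemptive GPU can switch to another task's work. */
    void on_complete(struct gpu_sched *s)
    {
        struct cmd_group *next = s->ready;
        if (next) {
            s->ready = next->next;
            submit_to_hw(next);      /* highest-priority waiter runs */
        } else {
            s->gpu_busy = false;
        }
    }

The response-time/throughput tradeoff the abstract mentions shows up in on_submit: dispatching immediately whenever the GPU is idle favors throughput, while briefly holding back low-priority groups can shorten response times for high-priority tasks.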


Citations
Proceedings ArticleDOI
09 Apr 2013
TL;DR: MemGuard reserves guaranteed memory bandwidth to provide temporal isolation, efficiently reclaims unused reservations to maximize utilization of the reserved bandwidth, and further improves performance by exploiting best-effort bandwidth once each core's reservation is satisfied.
Abstract: Memory bandwidth in modern multi-core platforms is highly variable for many reasons and is a big challenge in designing real-time systems as applications are increasingly becoming more memory intensive. In this work, we propose, design, and implement an efficient memory bandwidth reservation system that we call MemGuard. MemGuard divides memory bandwidth into two parts: guaranteed and best effort. It provides bandwidth reservation for the guaranteed bandwidth for temporal isolation, with efficient reclaiming to maximally utilize the reserved bandwidth. It further improves performance by exploiting the best-effort bandwidth after satisfying each core's reserved bandwidth. MemGuard is evaluated with SPEC2006 benchmarks on a real hardware platform, and the results demonstrate that it is able to provide memory performance isolation with minimal impact on overall throughput.
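
As a rough illustration of the guaranteed/best-effort split, here is a minimal sketch of per-core bandwidth budgeting: each core's reserved budget is refilled every regulation period, a performance-counter overflow signals budget exhaustion, and an exhausted core either draws from a best-effort pool or is throttled until the next period. This is a simplification, not MemGuard's actual implementation; the structure and names are hypothetical.

    /* A minimal sketch (not MemGuard's implementation) of per-core
     * memory-bandwidth budgets enforced per regulation period. */
    #define NCORES 4

    struct core_budget {
        long guaranteed;   /* reserved memory accesses per period      */
        long remaining;    /* budget left in the current period        */
        int  throttled;    /* 1 if the core is stalled on its budget   */
    };

    static struct core_budget core[NCORES];
    static long best_effort_pool;   /* bandwidth left after reservations */

    /* Periodic timer: refill every budget and restart the counters. */
    void on_period_start(long total_budget)
    {
        long reserved = 0;
        for (int i = 0; i < NCORES; i++) {
            core[i].remaining = core[i].guaranteed;
            core[i].throttled = 0;
            reserved += core[i].guaranteed;
        }
        best_effort_pool = total_budget - reserved;
    }

    /* Performance-counter overflow on core i: guaranteed budget spent. */
    void on_budget_exhausted(int i)
    {
        if (best_effort_pool > 0) {        /* use best-effort bandwidth */
            long grant = best_effort_pool / 2;
            best_effort_pool -= grant;
            core[i].remaining = grant;
        } else {
            core[i].throttled = 1;         /* stall until next period   */
        }
    }

In the paper itself, reclaiming is driven by per-core usage prediction, with cores donating budget they do not expect to use, rather than the fixed pool shown here.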

265 citations


Cites methods from "TimeGraph: GPU scheduling for real-..."

  • ...Proposed techniques have been successfully applied to CPU management [10], [7], [6] and more recently to GPU management [14], [15]....

Proceedings ArticleDOI
23 Oct 2011
TL;DR: It is shown that the PTask API can provide important system-wide guarantees where there were previously none, and can enable significant performance improvements, for example gaining a 5× improvement in maximum throughput for the gestural interface.
Abstract: We propose a new set of OS abstractions to support GPUs and other accelerator devices as first-class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation, and can streamline data movement in ways that are impossible under current GPU programming models. Our experience developing the PTask API, along with a gestural interface on Windows 7 and a FUSE-based encrypted file system on Linux, shows that the PTask API can provide important system-wide guarantees where there were previously none, and can enable significant performance improvements, for example gaining a 5× improvement in maximum throughput for the gestural interface.
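
To illustrate what a kernel-visible dataflow graph buys, here is a minimal sketch in which a node fires only when every one of its input ports holds a datablock. It is illustrative only, not the real PTask API; all structure and function names are hypothetical.

    /* Minimal dataflow-graph sketch in the spirit of the abstract
     * (not the real PTask API). */
    #include <stddef.h>

    struct node;

    struct port {
        void *datablock;          /* NULL until data arrives          */
        struct node *owner;       /* node this input port belongs to  */
    };

    struct node {
        struct port *inputs;   int n_in;
        struct port **outputs; int n_out;   /* downstream input ports */
        void (*kernel)(struct node *);      /* GPU or CPU work        */
    };

    static int ready(const struct node *n)
    {
        for (int i = 0; i < n->n_in; i++)
            if (n->inputs[i].datablock == NULL)
                return 0;
        return 1;
    }

    /* Deliver a datablock to an input port; fire the node once all of
     * its inputs are present.  Because ports and nodes are OS-managed,
     * this is exactly where a kernel scheduler could reorder, throttle,
     * or batch firings to enforce fairness and isolation. */
    void push(struct port *p, void *datablock)
    {
        p->datablock = datablock;
        if (ready(p->owner))
            p->owner->kernel(p->owner);   /* a real system would enqueue */
    }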

256 citations

Journal ArticleDOI
14 Jun 2014
TL;DR: This paper argues for preemptive multitasking and designs two preemption mechanisms that can be used to implement GPU scheduling policies; it extends an NVIDIA GK110 (Kepler)-like GPU architecture to allow concurrent execution of GPU kernels from different user processes and implements a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities.
Abstract: GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems are usually running multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
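
As a rough sketch of the kind of policy described, the following distributes GPU cores (SMs) among concurrent kernels in proportion to their priorities. It ignores the hard parts the paper addresses, namely how an SM is drained or context-switched; the code and names are hypothetical, and it assumes positive priorities and at least as many SMs as kernels.

    /* Sketch: proportional-share partitioning of GPU cores (SMs)
     * among concurrently running kernels, by priority. */
    struct kernel_ctx {
        int priority;   /* higher = more important        */
        int sms;        /* cores assigned to this kernel  */
    };

    void partition_sms(struct kernel_ctx *k, int n, int total_sms)
    {
        int sum = 0, given = 0, hi = 0;

        for (int i = 0; i < n; i++)
            sum += k[i].priority;

        for (int i = 0; i < n; i++) {           /* proportional share */
            k[i].sms = total_sms * k[i].priority / sum;
            if (k[i].sms == 0)
                k[i].sms = 1;                   /* avoid starvation   */
            given += k[i].sms;
            if (k[i].priority > k[hi].priority)
                hi = i;
        }
        k[hi].sms += total_sms - given;  /* absorb rounding remainder */
    }

A policy like this would be re-evaluated whenever a kernel arrives or finishes; the paper's preemption mechanisms are what make shrinking a running kernel's partition possible at all.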

191 citations


Cites background from "TimeGraph: GPU scheduling for real-..."

  • ...Both the scheduling framework and scheduling policies are implemented in hardware to avoid the long latency of issuing commands to the GPU [17]....

  • ...Issues with GPU sharing, such as priority inversion and lack of fairness, have already been noticed by the operating systems [30, 17, 18, 27] and real-time [16, 6] research communities....

  • ...Because the latency of issuing a command to the GPU is significant [17], commands are sent to the GPU as soon as possible....

  • ...GERM [7] and TimeGraph [17] focus on graphics applications and provide GPU command schedulers integrated in the device driver....

Proceedings Article
13 Jun 2012
TL;DR: Gdev, a new ecosystem of GPU resource management in the operating system (OS), allows the user space as well as the OS itself to use GPUs as first-class computing resources.
Abstract: Graphics processing units (GPUs) have become a very powerful platform embracing a concept of heterogeneous many-core computing. However, application domains of GPUs are currently limited to specific systems, largely due to a lack of "first-class" GPU resource management for general-purpose multi-tasking systems. We present Gdev, a new ecosystem of GPU resource management in the operating system (OS). It allows the user space as well as the OS itself to use GPUs as first-class computing resources. Specifically, Gdev's virtual memory manager supports data swapping for excessive memory resource demands, and also provides a shared device memory functionality that allows GPU contexts to communicate with other contexts. Gdev further provides a GPU scheduling scheme to virtualize a physical GPU into multiple logical GPUs, enhancing isolation among working sets of multi-tasking systems. Our evaluation conducted on Linux and the NVIDIA GPU shows that the basic performance of our prototype implementation is reliable even compared to proprietary software. Further detailed experiments demonstrate that Gdev achieves a 2x speedup for an encrypted file system using the GPU in the OS. Gdev can also improve the makespan of dataflow programs by up to 49% by exploiting shared device memory, while the error in the utilization of virtualized GPUs can be limited to only 7%.
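
As a rough illustration of virtualizing one physical GPU into logical GPUs with utilization shares, here is a minimal credit-based sketch: each virtual GPU is charged for measured execution time, refilled in proportion to its share, and the runnable virtual GPU with the most remaining credit dispatches next. This is far simpler than Gdev's actual scheduler; the structure and names are hypothetical.

    /* Sketch: credit-based sharing of one physical GPU among
     * logical (virtual) GPUs.  Not Gdev's actual implementation. */
    struct vgpu {
        int  share;      /* percent of the physical GPU, sums to <= 100 */
        long credit;     /* decreased by measured execution time        */
        int  has_work;   /* 1 if this vGPU has queued command groups    */
    };

    /* Charge actual execution time to the vGPU that just ran. */
    void account(struct vgpu *v, long exec_time_us)
    {
        v->credit -= exec_time_us;
    }

    /* Refill credits once per replenishment period. */
    void replenish(struct vgpu *v, int n, long period_us)
    {
        for (int i = 0; i < n; i++)
            v[i].credit += period_us * v[i].share / 100;
    }

    /* Pick the runnable vGPU with the most remaining credit. */
    struct vgpu *pick_next(struct vgpu *v, int n)
    {
        struct vgpu *best = 0;
        for (int i = 0; i < n; i++)
            if (v[i].has_work && (!best || v[i].credit > best->credit))
                best = &v[i];
        return best;   /* NULL if no vGPU has queued work */
    }

Charging measured execution time after the fact, rather than relying on preemption, matches the non-preemptive nature of GPU execution noted in the TimeGraph abstract above.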

190 citations


Cites background or methods from "TimeGraph: GPU scheduling for real-..."

  • ...As discussed in previous work [15], it is very hard to analyze GPU commands and recognize the corresponding API calls in the OS....

  • ...Gdev uses a similar scheme to TimeGraph [15] for GPU scheduling....

  • ...Although we make use of some previous techniques [14, 15], Gdev...

  • ...GPU Resource Management: TimeGraph [15] and GERM [2] provide a GPU command-driven scheduler integrated in the device driver....

  • ...We further plan to integrate the configuration of priority and reserve for each single task into /proc, using the TimeGraph approach [15]....

Proceedings ArticleDOI
03 Nov 2013
TL;DR: Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution.
Abstract: Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and energy efficiency. Because heterogeneous systems typically comprise multiple execution contexts with different programming abstractions and runtimes, programming them remains extremely challenging. Dandelion is a system designed to address this programmability challenge for data-parallel applications. Dandelion provides a unified programming model for heterogeneous systems that span diverse execution contexts including CPUs, GPUs, FPGAs, and the cloud. It adopts the .NET LINQ (Language INtegrated Query) approach, integrating data-parallel operators into general-purpose programming languages such as C# and F#. It therefore provides an expressive data model and native language integration for user-defined functions, enabling programmers to write applications using standard high-level languages and development tools. Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution. To enable automatic execution of .NET code on GPUs, Dandelion cross-compiles .NET code to CUDA kernels and uses the PTask runtime [85] to manage GPU execution. This paper discusses the design and implementation of Dandelion, focusing on the distributed CPU and GPU implementation. We evaluate the system using a diverse set of workloads.
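
For intuition about why LINQ operators are amenable to GPU offload, consider that a Select (map) such as xs.Select(x => x * x + 1) is independent per element. The C loop below is only an illustration of the computation, not Dandelion's generated code; a cross-compiler can assign one GPU thread per iteration of such a loop.

    /* Sequential equivalent of the data-parallel Select above; in a
     * generated CUDA kernel, each GPU thread would compute one index i. */
    void select_square_plus_one(const float *xs, float *out, int n)
    {
        for (int i = 0; i < n; i++)       /* one GPU thread per i */
            out[i] = xs[i] * xs[i] + 1.0f;
    }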

160 citations

References
Book
03 Jan 1989
TL;DR: In this paper, the problem of multiprogram scheduling on a single processor is studied from the viewpoint of the characteristics peculiar to the program functions that need guaranteed service, and it is shown that an optimum fixed priority scheduler possesses an upper bound to processor utilization which may be as low as 70 percent for large task sets.
Abstract: The problem of multiprogram scheduling on a single processor is studied from the viewpoint of the characteristics peculiar to the program functions that need guaranteed service. It is shown that an optimum fixed priority scheduler possesses an upper bound to processor utilization which may be as low as 70 percent for large task sets. It is also shown that full processor utilization can be achieved by dynamically assigning priorities on the basis of their current deadlines. A combination of these two scheduling techniques is also discussed.
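
In notation added here (the abstract states the results in prose), with C_i the worst-case execution time and T_i the period of task i, the paper's two results read:

    U \;=\; \sum_{i=1}^{n} \frac{C_i}{T_i} \;\le\; n\,\bigl(2^{1/n} - 1\bigr)
    \qquad \text{(optimum fixed-priority, i.e. rate-monotonic, scheduling)}

    U \;\le\; 1
    \qquad \text{(dynamic deadline-driven, i.e. EDF, scheduling)}

Since n(2^{1/n} - 1) decreases monotonically to \ln 2 \approx 0.693, the fixed-priority bound can be "as low as 70 percent" for large task sets; for n = 2 it is 2(\sqrt{2} - 1) \approx 0.83.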

5,397 citations

Proceedings ArticleDOI
03 Dec 1997
TL;DR: This work presents an analytical model for QoS management in systems which must satisfy application needs along multiple dimensions such as timeliness, reliable delivery schemes, cryptographic security and data quality, and refers to this model as Q-RAM (QoS-based Resource Allocation Model).
Abstract: Quality of service (QoS) has been receiving wide attention in many research communities including networking, multimedia systems, real-time systems and distributed systems. In large distributed systems such as those used in defense systems, on-demand service and inter-networked systems, applications contending for system resources must satisfy timing, reliability and security constraints as well as application-specific quality requirements. Allocating sufficient resources to different applications in order to satisfy various requirements is a fundamental problem in these situations. A basic yet flexible model for performance-driven resource allocations can therefore be useful in making appropriate tradeoffs. We present an analytical model for QoS management in systems which must satisfy application needs along multiple dimensions such as timeliness, reliable delivery schemes, cryptographic security and data quality. We refer to this model as Q-RAM (QoS-based Resource Allocation Model). The model assumes a system with multiple concurrent applications, each of which can operate at different levels of quality based on the system resources available to it. The goal of the model is to be able to allocate resources to the various applications such that the overall system utility is maximized under the constraint that each application can meet its minimum needs. We identify resource profiles of applications which allow such decisions to be made efficiently and in real-time. We also identify application utility functions along different dimensions which are composable to form unique application requirement profiles. We use a video-conferencing system to illustrate the model.
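
The allocation problem the abstract describes can be sketched as a constrained utility maximization; the notation below is added here, not taken from the paper. With u_i the utility of application i as a function of its allocated resource vector r_i, R_j the capacity of resource j, and u_i^{\min} the utility of application i's minimum acceptable quality:

    \max_{r_1, \ldots, r_n} \; \sum_{i=1}^{n} u_i(r_i)
    \quad \text{subject to} \quad
    \sum_{i=1}^{n} r_{i,j} \,\le\, R_j \;\;\forall j,
    \qquad
    u_i(r_i) \,\ge\, u_i^{\min} \;\;\forall i.

The "resource profiles" mentioned in the abstract are compact representations of the u_i that make this optimization solvable efficiently enough for online use.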

517 citations


"TimeGraph: GPU scheduling for real-..." refers background or methods in this paper

  • ...The amount of limit is computed by a traditional resource-reservation model based on C and T of each reserve [26]....

  • ...CPU Scheduling: TimeGraph shares the concept of priority and reservation, which has been well-studied by the real-time systems community [13, 26], but there is a fundamental difference from these traditional studies in that TimeGraph is designed to address an arbitrarily-arriving non-preemptive GPU execution model, whereas the real-time systems community has often considered a periodic preemptive CPU execution model....

  • ...It should be noted that this enforcement mechanism is very different from traditional CPU reservation mechanisms [20, 26] that use timers or ticks to suspend tasks, since GPU command groups are non-preemptive, and hence we need to perform enforcement at GPU command group boundaries....

  • ...replenishment used in real-time systems [20, 26]....

Proceedings Article
01 Dec 1987

438 citations


"TimeGraph: GPU scheduling for real-..." refers background in this paper

  • ...Several bandwidth-preserving approaches [12, 29, 30] for an arbitrarily-arriving model exist, but a non-preemptive model has not been much studied yet....

Proceedings ArticleDOI
01 Oct 1997
TL;DR: This paper presents a system that can schedule multiple independent activities so that they can obtain minimum guaranteed execution rates with application-specified reservation granularities via CPU Reservations.
Abstract: Workstations and personal computers are increasingly being used for applications with real-time characteristics such as speech understanding and synthesis, media computations and I/O, and animation, often concurrently executed with traditional non-real-time workloads. This paper presents a system that can schedule multiple independent activities so that:

  • activities can obtain minimum guaranteed execution rates with application-specified reservation granularities via CPU Reservations; CPU Reservations, which are of the form "reserve X units of time out of every Y units", provide not just an average-case execution rate of X/Y over long periods of time, but the stronger guarantee that from any instant of time, by Y time units later, the activity will have executed for at least X time units;
  • applications can use Time Constraints to schedule tasks by deadlines, with on-time completion guaranteed for tasks with accepted constraints; and
  • both CPU Reservations and Time Constraints are implemented very efficiently; in particular, CPU scheduling overhead is bounded by a constant and is not a function of the number of schedulable tasks.

Other key scheduler properties are:

  • activities cannot violate other activities' guarantees;
  • time constraints and CPU reservations may be used together, separately, or not at all (which gives a round-robin schedule), with well-defined interactions between all combinations; and
  • spare CPU time is fairly shared among all activities.

The Rialto operating system, developed at Microsoft Research, achieves these goals by using a precomputed schedule, which is the fundamental basis of this work.
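
The reservation guarantee quoted above has a crisp formal reading; in notation added here (not the paper's), a CPU Reservation of X_i time units out of every Y_i for activity i guarantees, for every instant t,

    \mathrm{exec}_i(t, \, t + Y_i) \;\ge\; X_i \qquad \forall t,

where exec_i(a, b) denotes the CPU time activity i receives during [a, b]. A natural necessary admission condition for such reservations on one processor is \sum_i X_i / Y_i \le 1; the precomputed schedule the abstract mentions is what lets Rialto uphold the stronger any-instant guarantee across mixed reservation granularities.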

359 citations

Proceedings ArticleDOI
12 Dec 1999
TL;DR: Borrowed-Virtual-Time (BVT) Scheduling is presented, showing that it provides low latency for real-time and interactive applications yet weighted sharing of the CPU across applications according to system policy, even with thread failure at the real-time level, all with a low-overhead implementation on multiprocessors as well as uniprocessors.
Abstract: Systems need to run a larger and more diverse set of applications, from real-time to interactive to batch, on uniprocessor and multiprocessor platforms. However, most schedulers either do not address latency requirements or are specialized to complex real-time paradigms, limiting their applicability to general-purpose systems. In this paper, we present Borrowed-Virtual-Time (BVT) Scheduling, showing that it provides low latency for real-time and interactive applications yet weighted sharing of the CPU across applications according to system policy, even with thread failure at the real-time level, all with a low-overhead implementation on multiprocessors as well as uniprocessors. It makes minimal demands on application developers, and can be used with a reservation or admission control module for hard real-time applications.
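
The scheduling rule behind BVT can be sketched compactly: each thread has an actual virtual time (AVT) that advances inversely to its weight while it runs, a latency-sensitive thread may "warp" its effective virtual time backwards to borrow against its future CPU share, and the runnable thread with the smallest effective virtual time runs next. This is a simplified sketch; the field names are added here, and details such as the context-switch allowance are omitted.

    /* Minimal sketch of Borrowed-Virtual-Time scheduling. */
    struct bvt_thread {
        long avt;      /* actual virtual time                     */
        long warp;     /* warp amount for low-latency dispatch    */
        int  warped;   /* currently borrowing virtual time?       */
        int  weight;   /* CPU share relative to others; > 0       */
        int  runnable;
    };

    static long evt(const struct bvt_thread *t)
    {
        return t->avt - (t->warped ? t->warp : 0);
    }

    /* Charge a thread for having run; a heavier weight makes its
     * virtual time advance more slowly, giving it a larger share. */
    void charge(struct bvt_thread *t, long ran_us)
    {
        t->avt += ran_us / t->weight;
    }

    /* Dispatch the runnable thread with minimum effective virtual time. */
    struct bvt_thread *pick_next(struct bvt_thread *t, int n)
    {
        struct bvt_thread *best = 0;
        for (int i = 0; i < n; i++)
            if (t[i].runnable && (!best || evt(&t[i]) < evt(best)))
                best = &t[i];
        return best;
    }

Because a warped thread still accumulates AVT while it runs, borrowing only shifts when it is scheduled; over the long run each thread's share is still governed by its weight.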

303 citations