Enabling Preemptive Multiprogramming on GPUs

Ivan Tanasic (1,2), Isaac Gelado (3), Javier Cabezas (1,2), Alex Ramirez (1,2), Nacho Navarro (1,2), Mateo Valero (1,2)
(1) Barcelona Supercomputing Center, (2) Universitat Politecnica de Catalunya, (3) NVIDIA Research
first.last@bsc.es, nacho@ac.upc.edu, igelado@nvidia.com
Abstract
GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness, or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
1. Introduction
Graphics Processing Units (GPUs) have become fully programmable massively parallel processors [21, 38, 4] that are able to efficiently execute both traditional graphics workloads and certain types of general-purpose code [26]. GPUs have been designed to maximize the performance of a single application, and thus assume exclusive access from a single process. Accesses to the GPU from different applications are serialized. However, as the number of applications ported to GPUs grows, sharing scenarios are starting to appear. Issues with GPU sharing, such as priority inversion and lack of fairness, have already been noticed by the operating systems [30, 17, 18, 27] and real-time [16, 6] research communities. Moreover, with the integration of programmable GPUs into mobile SoCs [31, 5] and consumer CPUs [3, 15], the demand for GPU sharing is likely to increase. This leads us to believe that support for fine-grained sharing of GPUs must be implemented.
Today's GPUs contain execution and data transfer engines that receive commands from the CPU through command queues. The execution engine comprises all the GPU cores, which are treated as a whole. Commands from the same process targeting different engines can be executed concurrently (e.g., a data transfer can be performed in parallel with a GPU kernel execution). However, when a command is running, it has exclusive access to its engine and cannot be preempted (i.e., the command runs to completion). Hence, a long-latency GPU kernel can occupy the execution engine, preventing other kernels from the same or a different process from making progress. This limitation hinders true multiprogramming on GPUs.
The latest NVIDIA GPU architecture, GK110 (Kepler), improves the concurrency of commands coming from the same process by providing several hardware command queues (often referred to as NVIDIA Hyper-Q [23]). NVIDIA also provides a software solution [24] that acts as a proxy to allow several processes to use the GPU as one, at the cost of losing process isolation. Combining these two features is especially useful for improving the utilization of GPU engines in legacy MPI applications. However, they do not solve the problem of sharing the GPU among several applications.
To enable true sharing, GPUs need a hardware mechanism that can preempt the execution of GPU kernels, rather than waiting for the program to release the GPU. Such a mechanism would enable system-level scheduling policies that can control the execution resources, in a similar way to how multitasking operating systems manage CPUs today. The assumed reason [1, 27] for the lack of a preemption mechanism in GPUs is the expected high overhead of saving and restoring the context of GPU cores (up to 256 KB of register file and 48 KB of on-chip scratch-pad memory per GPU core), which can take up to 44 µs on GK110, assuming peak memory bandwidth. Compared to the context switch time of less than 1 µs on modern CPUs, this might seem to be a prohibitively high overhead.
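A back-of-the-envelope check of that figure (the SM count and bandwidth below are our assumptions for a GK110-class part, not values given in the paper): with 15 SMs, each holding 256 KB of register file and 48 KB of scratch-pad memory, saving and then restoring all of this state over a roughly 208 GB/s memory interface takes about

    t ≈ 2 × 15 × (256 + 48) KB / 208 GB/s ≈ 9.3 MB / 208 GB/s ≈ 45 µs,

which is consistent with the 44 µs quoted above.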
In this paper we show that preemptive multitasking is not only necessary, but also a feasible approach to multiprogramming on GPUs. We design two preemption mechanisms with different effectiveness and implementation costs. One is similar to classic operating system preemption, where the execution on GPU cores is stopped and their context is saved, to implement true preemptive multitasking. The other mechanism exploits the semantics of the GPU programming model and the nature of GPU applications to implement preemption by stopping the issue of new work to preempted GPU cores and draining them of currently running work. We show that both mechanisms provide improvements in system responsiveness and fairness at the expense of a small loss in throughput.
Still, exclusive access to the execution engine limits the possible sharing to time multiplexing. We propose further hardware extensions that remove the exclusive-access constraint and allow GPU cores to be utilized individually. These extensions enable different processes to concurrently execute GPU kernels on different sets of GPU cores. Furthermore, we implement Dynamic Spatial Sharing (DSS), a hardware scheduling policy that dynamically partitions the resources (GPU cores) and assigns them to different processes according to the priorities assigned by the OS.
The three main contributions of the paper are (1) the design of two preemption mechanisms that allow GPUs to implement scheduling policies, (2) extensions for concurrent execution of different processes on GPUs that allow implementing spatial sharing, and (3) a scheduling policy that dynamically assigns disjoint sets of GPU cores to different processes. Experimental evaluation shows that the hardware support for preemptive multitasking introduced in this paper allows scheduler implementations for multiprogrammed environments that, on average, improve the performance of high-priority applications by up to 15.6x over the baseline at the cost of a 12% degradation in throughput. Our DSS scheduling policy improves normalized turnaround time by up to 2x and system fairness by up to 3.4x at the cost of a throughput degradation of up to 35%.
2. Background and Motivation
In this section we provide the background on the GPU architecture and execution model necessary for understanding our proposals. Our base architecture is modeled after the NVIDIA GK110 chip, but we keep the discussion generic to cover architectures from other vendors, as well as fused CPU-GPU architectures.
2.1. GPU Program Execution
Typically, GPU applications consist of repetitive bursts of 1) CPU execution, which performs control, preprocessing, or I/O operations; 2) GPU execution (kernels), which performs the computationally demanding tasks; and 3) data transfers between CPU and GPU, which bring input data to the GPU memory and return the outputs to the CPU memory. The GPU device driver is in charge of performing the bookkeeping tasks for the GPU, as the OS does for the CPU (e.g., managing the GPU memory space). GPU kernel invocations (kernel launches in CUDA terminology), initiation of data transfers, and GPU memory allocations are typically performed in the CPU code (referred to as commands in the rest of the paper). Each kernel launch consists of a number of threads executing the same code. Threads are grouped into thread blocks that are independent of each other, and only threads from the same thread block can cooperate, using barrier synchronization and communication through local memory (shared memory in CUDA terminology).
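As a concrete illustration (our example, not code from the paper), the following CUDA kernel is launched as a grid of independent thread blocks whose threads cooperate through shared memory and barrier synchronization; the kernel name and launch configuration are made up for the example.

    __global__ void blockSum(const float *in, float *out) {
        __shared__ float partial[256];     // shared (local) memory, private to this thread block
        int tid = threadIdx.x;
        partial[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                   // barrier: only threads of the same block synchronize
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction within the block
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = partial[0];
    }

    // Host side: one kernel launch command creates a grid of 1024 independent thread blocks.
    // blockSum<<<1024, 256>>>(d_in, d_out);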
Programming models for GPUs provide software work queues (streams in CUDA terminology) that allow programmers to specify the dependences between commands. Commands in different streams are considered independent and may be executed concurrently by the hardware. Because the latency of issuing a command to the GPU is significant [17], commands are sent to the GPU as soon as possible. Each process that uses a GPU gets its own GPU context, which contains the page table of the GPU memory and the streams defined by the programmer.
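For instance (illustrative host code of ours, with hypothetical kernel and buffer names), commands placed in different streams carry no dependence and may be overlapped by the hardware, while commands in the same stream execute in order:

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Same stream: the copy, the kernel, and the copy back execute in order.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
    kernelA<<<grid, block, 0, s0>>>(d_a);
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s0);

    // Different stream: independent, so kernelB may run on the execution engine
    // while stream s0 uses the data transfer engine.
    kernelB<<<grid, block, 0, s1>>>(d_b);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);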
To overcome the inefficiencies introduced by multiple processes sharing the GPU [20], NVIDIA provides a software solution called Multi-Process Service (MPS) [24]. MPS instantiates a proxy process that receives requests from client processes (e.g., processes in an MPI application) and executes them on the GPU. Such a solution has two main limitations: (1) memory allocated by any of the client processes is accessible from any of the other client processes, thus breaking the memory isolation between processes; (2) it does not allow scheduling policies to be enforced across the client processes. Although MPS provides important performance improvements in the case of MPI applications, it is not a general solution for multiprogrammed workloads.
2.2. Base GPU Architecture
The base architecture assumed in this paper is depicted in Figure 1. The GPU is connected to the rest of the system through an interconnection network (1). In the case of discrete GPUs, the interconnect is the PCI Express bus and the GPU has its own physical RAM (2). In the case of fused CPU/GPU architectures [8], the interconnect is an on-chip network and the GPU and CPU share the same physical RAM (3). Current-generation GPUs, including our baseline, do not support demand paging (memory swap-out). Thus, in today's GPU systems, allocations from all contexts reside in the GPU physical memory. The GPU has an execution engine (4) with access to its memory, and a data transfer engine (5) for transferring data between CPU memory and GPU memory (in integrated designs GPUs have DMA engines for bulk memory transfers, too). All the GPU computation cores (Streaming Multiprocessors or SMs in CUDA terminology) belong to the execution engine and are treated as a whole for kernel scheduling purposes.
Figure 1: Baseline GPU architecture.

The interface to the CPU implements several hardware queues (i.e., NVIDIA Hyper-Q) used by the CPU to issue GPU commands. The GPU device driver maps streams from the applications onto the command queues. A command dispatcher (6) inspects the top of the command queues and issues the commands to the corresponding engine. Data transfer commands are issued to the data transfer engine via DMA queues, while kernel launch commands are issued to the execution engine via the execution queue (7). After issuing a command, the dispatcher stops inspecting that queue. After completing a command, the corresponding engine notifies the command dispatcher so that the queue is re-enabled for inspection. Thus, commands from different command queues that target different engines can be concurrently executed. Conversely, commands coming from the same command queue are executed sequentially, following the semantics of the stream abstraction defined by the programming model. Traditionally, a single command queue was provided, but newer GPUs from NVIDIA provide several of them, often referred to as Hyper-Q [23]. Using several queues increases the opportunities to overlap independent commands that target different engines.
The GPU also includes a set of global control registers (8)
that hold the GPU context information used by the engines.
These control registers hold process-specific information, such
as the location of the virtual memory structures (e.g., page
table), the GPU kernels registered by the process, or structures
used by the graphics pipeline.
2.3. Base GPU Execution Engine
The base GPU execution engine we assume in this paper is shaded in Figure 1. The SM driver (9) gets kernel launch commands from the execution queue (7) and sets up the Kernel Status Registers (KSR) (10) with control information such as the number of work units to execute and the kernel parameters (number and stack pointer). The SM driver uses the contents of these registers, as well as the global GPU control registers, to set up the SMs before the execution of a kernel starts. The execution queue can contain a number of independent kernel commands coming from the same context, which are scheduled for execution in a first-come, first-served (FCFS) manner.
A kernel launch command consists of thread blocks that are independent of each other and, therefore, are executed on SMs (11) independently. Each thread block is divided into fixed-size groups of threads that execute in lock-step (warps in CUDA terminology) [21]. A reconvergence stack tracks the execution of divergent threads in a warp by storing the program counter and mask of the threads that took the divergent branch [11]. The SM cycles through the warps from all thread blocks assigned to it and executes (12) the next instruction of a warp with ready operands.
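A hypothetical kernel where half of the lanes in every warp take a different branch; the hardware serializes the two paths, using the reconvergence stack described above to track the not-taken side:

    __global__ void divergent(int *data) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)
            data[tid] *= 2;     // executed with the odd lanes masked off
        else
            data[tid] += 1;     // executed with the even lanes masked off
        // the two halves of the warp reconverge here and continue in lock-step
    }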
Figure 2: Execution of a soft real-time application with (a) FCFS (current GPUs), (b) nonpreemptive priority, and (c) preemptive priority schedulers. K1 and K2 are low-priority kernels, while K3 is high-priority.

When a thread block is issued to an SM, it remains resident in that SM until all of its warps finish execution. An SM can execute more than one thread block in an interleaved fashion. Concurrent execution of thread blocks relies on static hardware partitioning, so the available hardware resources (13) (e.g., registers and shared memory) are split among all the thread blocks in the SM. The resource usage of all the thread blocks from a kernel is the same and is known at kernel launch time. The number of thread blocks that can run concurrently is thus determined by the first fully used hardware resource. Static hardware partitioning implies that only thread blocks from the same kernel can be scheduled to run concurrently on the same SM.
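As a rough numeric example of this partitioning (the per-SM limits are our assumptions, roughly GK110-class: 65,536 registers and 48 KB of shared memory per SM): a kernel whose 256-thread blocks use 64 registers per thread and 12 KB of shared memory is limited to 65,536 / (64 × 256) = 4 blocks by registers and 48 / 12 = 4 blocks by shared memory, so at most 4 of its thread blocks are resident per SM. The CUDA runtime exposes this limit for a given (here hypothetical) kernel:

    int blocksPerSM = 0;
    // How many 256-thread blocks of myKernel fit on one SM, given its register
    // and shared memory usage (0 bytes of dynamically allocated shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);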
After the setup of the SM is done, the SM driver issues thread blocks to each SM until it is fully utilized (i.e., runs out of hardware resources). Whenever an SM finishes executing a thread block, the SM driver gets notified and issues a new thread block from the same kernel to that SM. This policy, combined with the static partitioning of hardware resources, means that a kernel with enough thread blocks will occupy all the available SMs, forcing the other kernel commands in the execution queue to stall. As a result, concurrent execution among kernels is possible only if there are free resources after issuing all the work from previous kernels. This back-to-back execution happens when a kernel does not have enough thread blocks to occupy all SMs or when the scheduled kernel is finishing its execution and SMs are becoming free again. Today's GPUs, however, do not support concurrent execution of commands from different contexts on the same engine. That is, only kernels from the same process can be concurrently executed.
2.4. Arguments for Preemptive Execution
The main goal of this work is to enable fine-grained scheduling of multiprogrammed workloads running on the GPU. Figure 2 illustrates how the scheduling support in current GPUs is not sufficient when, for example, a soft real-time application is competing for resources with other applications. The execution on a modern GPU is shown in Figure 2a, where the kernel with a deadline (K3) does not get scheduled until all previously issued kernels (K1 and K2) have finished executing. A software implementation [16] or a modification to the GPU command scheduler could allow priorities to be assigned to processes, resulting in the timeline shown in Figure 2b.
A common characteristic of the previous approaches is that the execution latency of K3 depends on the execution time of previously launched kernels from other processes. This is undesirable behavior from both the system's and the user's perspective, and it limits the effectiveness of the GPU scheduler. To decouple the scheduling from the latency of the kernels running on the GPU, a preemption mechanism is needed. Figure 2c illustrates how the latency of kernel K3 could decrease even further if kernel K1 can be preempted. Allowing GPUs to be used for latency-sensitive applications is the first motivation of this paper.
Preemptive execution on GPUs is not only useful for speeding up high-priority tasks; it is also required to guarantee forward progress of applications in multiprogrammed environments. The persistent threads pattern of GPU computing, for instance, uses kernels that occupy the GPU and actively wait for work to be submitted from the CPU [2, 14]. Preventing starvation when this kind of application runs in a multiprogrammed system is the second motivation of this paper.
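A minimal sketch of such a persistent kernel (ours, simplified): its thread blocks never terminate until the host sets a flag, so a preemption scheme that waits for thread blocks to finish would wait indefinitely.

    __global__ void persistentWorker(volatile int *quit) {
        // Thread blocks stay resident on their SMs, actively polling for work
        // submitted by the CPU; they only exit when the host writes *quit = 1.
        while (*quit == 0) {
            // ... dequeue and process work items published by the host ...
        }
    }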
There is a widespread assumption that preemption on GPUs is not cost-effective due to the large cost of context switching [1, 27]. Even though it is clear that in some cases preemption is necessary [19], it is not clear whether the benefits can justify the disadvantages when preemption is used by fine-grained schedulers. Comparing the benefits and drawbacks of the context saving and restoring approach to preemption with an alternative approach, where no context is saved or restored at preemption points, is the third motivation of this paper.
3. Support for Multiprogramming in GPUs
Following the standard practice of systems design, we separate mechanisms from the policies that use them. We provide two generic preemption mechanisms and policies that are completely oblivious to the preemption mechanism used. To simplify the implementation of policies, we abstract the common hardware into a scheduling framework.
3.1. Concurrent Execution of Processes
To support multiprogramming, the memory hierarchy, the execution engine, and the SMs all have to be aware of multiple active contexts. The memory hierarchy of the GPU needs to support concurrent accesses from different processes using different address spaces. Modern GPUs implement two types of memory hierarchy [28]. In one, the shared levels of the memory hierarchy are accessed using virtual addresses and the address translation is performed in the memory controller. The cache lines and memory segments of such a hierarchy have to be tagged with an address space identifier. The other implementation uses address translation at the private levels of the memory hierarchy, and physical memory addresses to access the shared levels of the memory hierarchy. The mechanisms that we describe here are compatible with both approaches. We assume that the latter approach is implemented, hence no modifications are required to the memory subsystem.
Figure 3: Operation of the SM driver. Dashed objects are proposed extensions.

If only one GPU context executes kernels, SMs can easily get the context information from the global GPU control structures. We extend the execution engine to include a context table with information about all active contexts. The context information is sent to the SM during its setup, before it starts receiving thread blocks to execute. The SM is extended with a GPU context id register, a base page table register, and other context-specific information, such as the texture registers. The base page table register is used on a TLB miss to walk the per-process page table stored in the main memory of the GPU. This is in contrast to the base GPU architecture, where the same page table is used by all SMs, since they execute kernels from the same context. Similarly, the GPU context id register is used when accessing the objects associated with the GPU context (e.g., kernels) from the SM. We extend the context of the SM, rather than reading this information from the context table, which would otherwise require many read ports to allow concurrent accesses from the SMs.
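A behavioral sketch of the added per-SM context state (our C-style illustration with assumed field names, not the actual register layout):

    #include <stdint.h>

    // Per-SM registers added by the proposed extensions.
    struct SMContext {
        uint32_t context_id;       // GPU context whose kernel the SM is executing
        uint64_t page_table_base;  // walked on a TLB miss to translate per-process addresses
        uint64_t texture_state;    // other context-specific state (e.g., texture registers)
    };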
3.2. Preemptive Kernel Execution
The scheduling policy is always in charge of deciding when and which kernels should be scheduled to run. If there are no idle SMs in the system, it uses the preemption mechanism to free up some SMs. To provide generic preemption support to different policies, we need to be able to preempt the execution on each SM individually. We provide this support by extending the SM driver. Figure 3 shows the operation of the SM driver, with dashed objects showing our extensions. When there are kernels to execute, the SM driver looks for an idle SM, performs the setup, and starts issuing thread blocks until the SM is fully occupied. The SM driver then repeats the procedure until there are no more idle SMs. When there are thread blocks left, the baseline SM driver issues a new thread block every time an SM notifies the driver that it has finished executing a thread block.
We extend this operation and allow the scheduling policy to preempt the execution on an SM (independently of which preemption mechanism is used) by labeling it as reserved. After receiving a notification of a finished thread block from the SM, the SM driver checks whether the SM is reserved. If not, it proceeds with the normal operation (issuing new thread blocks). If reserved, the driver waits for the preemption to complete, sets up the SM for the kernel that reserved it, and continues with the normal operation. In Section 3.3 we describe the hardware extensions used by the SM driver to perform the bookkeeping of SMs and active kernels.
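The extended behavior of the SM driver on a thread-block-finished notification can be summarized by the following sketch (our simplification of the hardware in C++-style code, with invented names):

    struct Kernel;                          // opaque here

    struct SM {
        bool    reserved = false;           // set by the scheduling policy to request preemption
        int     resident_blocks = 0;        // thread blocks currently executing on this SM
        Kernel *reserving_kernel = nullptr; // kernel that will run once preemption completes
    };

    struct SMDriver {
        void issue_next_block(SM &sm) { sm.resident_blocks++; /* issue a block of the current kernel */ }
        bool preemption_done(SM &sm)  {
            // Draining: done when no blocks remain; context switch: done once the contexts are saved.
            return sm.resident_blocks == 0;
        }
        void setup(SM &sm, Kernel *k) { (void)sm; (void)k; /* load context table entry, KSR, etc. */ }
    };

    void on_thread_block_finished(SMDriver &driver, SM &sm) {
        sm.resident_blocks--;
        if (!sm.reserved) {
            driver.issue_next_block(sm);                // baseline operation: keep the SM full
        } else if (driver.preemption_done(sm)) {
            driver.setup(sm, sm.reserving_kernel);      // hand the SM to the kernel that reserved it
            sm.reserved = false;
            driver.issue_next_block(sm);
        }                                               // otherwise: stop issuing and keep waiting
    }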
The first preemption mechanism that we implement, context switch, follows the basic principle of preemption used by operating system schedulers. The execution contexts of all the thread blocks running on the preempted SM are saved to off-chip memory, and these thread blocks are issued again later on. Each active kernel has a preallocated memory region where the contexts of its preempted thread blocks are kept. When a preempted thread block is issued, its execution context is first restored so the computation can continue correctly. This context consists of the architectural registers used by each thread of the thread block, the private partition of the shared memory, and other state that defines the execution of the thread block (e.g., the pointer to the reconvergence stack and the state of the barrier unit). Saving and restoring the context is performed by a microprogrammed trap routine. Each thread saves all of its registers, while the shared memory of the thread block is collaboratively saved by its threads. This operation is very similar to the context save and restore performed on a device-side kernel launch when using the dynamic parallelism feature of GK110 [25]. Since preemption raises an asynchronous trap, precise exceptions are needed [32]. The simplest solution is to drain the pipeline of all in-flight instructions before jumping to the trap routine. The main drawback of the context switch mechanism is that during the context save and restore, thread blocks do not progress, leading to a complete underutilization of the SM. This underutilization could be reduced by using compiler-microarchitecture co-designed context minimization techniques, such as iGPU [22].
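For the shared memory portion of that context, the collaborative save could look like the following device-side sketch (our illustration; the actual mechanism is a microprogrammed trap routine, and register state is handled separately):

    __device__ void save_shared(const float *smem_partition, float *backup, int words) {
        // Each thread of the block stores a strided slice of the block's shared memory
        // partition into a preallocated per-thread-block buffer in off-chip memory.
        for (int i = threadIdx.x; i < words; i += blockDim.x)
            backup[i] = smem_partition[i];
        __syncthreads();  // ensure the whole partition is saved before the block is torn down
    }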
The second mechanism that we implement, SM draining, tries to avoid this underutilization by preempting the execution at a thread block boundary (i.e., when a thread block finishes execution). Since thread blocks are independent and each one has its own state, no context has to be saved or restored. This mechanism deals with the interleaved execution of multiple thread blocks in an SM by draining the whole SM when the preemption happens. To perform the preemption by draining, the SM driver stops issuing new thread blocks to the given SM. When all the thread blocks issued to that SM finish, the execution on that SM is preempted.
The context switch mechanism has a relatively predictable latency that mainly depends on the amount of data that has to be moved from the SM (register file and shared memory) to the off-chip memory. The draining mechanism, on the other hand, trades this predictable latency for higher utilization of the SM. Its latency depends on the execution time of the currently running thread blocks, but SMs still get to do some useful work while draining. The draining mechanism naturally fits current GPU architectures, as it only requires small modifications to the SM driver. Its biggest drawback is its inability to effectively preempt the execution of applications with very long-running thread blocks, or to preempt the execution of malicious or persistent kernels at all.

Figure 4: Scheduling framework. The rest of the execution engine (SM Driver and SMs) is shaded.
3.3. Scheduling Framework
We extract a generic set of functionalities into a scheduling framework that can be used to implement different scheduling policies. The framework provides the means to track the state of kernels and SMs and to allow the scheduling policy to trigger the preemption of any SM. The scheduling policy plugs into the framework and implements the logic of the concrete scheduling algorithm. Both the scheduling framework and the scheduling policies are implemented in hardware to avoid the long latency of issuing commands to the GPU [17]. Both the context switch and draining preemption mechanisms are supported by our framework. Scheduling policies performing prioritization, time multiplexing, spatial sharing, or some combination of these can be implemented on top of it. The OS can tweak the priorities on the fly, but it is not directly involved in the scheduling process. Thus, there is no impact on OS noise.
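The separation between framework and policy can be pictured as a narrow event interface that a policy implements (our sketch with invented hook names; the paper realizes the policies directly in hardware):

    // The framework delivers these events; a policy (e.g., DSS) reacts by reading
    // command buffers, allocating KSRT entries, and reserving SMs for preemption.
    struct SchedulingPolicy {
        virtual void on_kernel_arrived(int command_buffer_id) = 0; // a new kernel launch is available
        virtual void on_kernel_finished(int ksrt_entry) = 0;       // an active kernel completed
        virtual void on_sm_freed(int sm_id) = 0;                   // a reserved SM finished preemption
        virtual ~SchedulingPolicy() {}
    };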
Figure 4 shows the components of the scheduling framework. An example of the interaction between the scheduling policy and the framework is given in Section 3.4. Command Buffers receive the commands from the command dispatcher and separate the execution commands from different contexts. Each command buffer can store one command. The Active Queue stores the identifiers of the active (running or preempted) kernels. When there are free entries in the active queue, the scheduling policy can read a command (kernel launch) from one of the command buffers and allocate an entry in the Kernel Status Register Table (KSRT). The KSRT is used to track active kernels; each valid entry is the KSR of one active kernel, augmented with the identifier of its GPU context. The active queue is used by the policy to search for scheduling candidates by indexing the KSRT. The SM
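A compact sketch of these bookkeeping structures (our illustration with assumed fields, not the exact hardware layout):

    #include <stdint.h>

    struct KSRTEntry {                 // one Kernel Status Register, plus the owning context
        bool     valid;
        uint32_t context_id;           // GPU context that launched the kernel
        uint32_t blocks_remaining;     // work units (thread blocks) still to be issued
        uint64_t params_stack_ptr;     // kernel parameters: count and stack pointer
    };

    struct CommandBuffer {             // holds at most one pending kernel-launch command per context
        bool     has_command;
        uint32_t context_id;
    };

    // The active queue holds KSRT indices of running or preempted kernels, which
    // the scheduling policy walks to find candidates for scheduling.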

References

GPU Computing
NVIDIA Tesla: A Unified Graphics and Computing Architecture
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
System-Level Performance Metrics for Multiprogram Workloads