Enabling Preemptive Multiprogramming on GPUs

Ivan Tanasic (1,2), Isaac Gelado (3), Javier Cabezas (1,2), Alex Ramirez (1,2), Nacho Navarro (1,2), Mateo Valero (1,2)
(1) Barcelona Supercomputing Center, (2) Universitat Politecnica de Catalunya, (3) NVIDIA Research
first.last@bsc.es, nacho@ac.upc.edu, igelado@nvidia.com
Abstract
GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness, or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
1. Introduction
Graphics Processing Units (GPUs) have become fully programmable massively parallel processors [21, 38, 4] that are able to efficiently execute both traditional graphics workloads and certain types of general-purpose code [26]. GPUs have been designed to maximize the performance of a single application, and thus assume exclusive access from a single process. Accesses to the GPU from different applications are serialized. However, as the number of applications ported to GPUs grows, sharing scenarios are starting to appear. Issues with GPU sharing, such as priority inversion and lack of fairness, have already been noticed by the operating systems [30, 17, 18, 27] and real-time [16, 6] research communities. Moreover, with the integration of programmable GPUs into mobile SoCs [31, 5] and consumer CPUs [3, 15], the demand for GPU sharing is likely to increase. This leads us to believe that support for fine-grained sharing of GPUs must be implemented.
Today's GPUs contain execution and data transfer engines that receive commands from the CPU through command queues. The execution engine comprises all the GPU cores, which are treated as a whole. Commands from the same process targeting different engines can be executed concurrently (e.g., a data transfer can be performed in parallel with a GPU kernel execution). However, when a command is running, it has exclusive access to its engine and cannot be preempted (i.e., the command runs to completion). Hence, a long-latency GPU kernel can occupy the execution engine, preventing other kernels from the same or a different process from making progress. This limitation hinders true multiprogramming on GPUs.
The latest NVIDIA GPU architecture, GK110 (Kepler), improves the concurrency of commands coming from the same process by providing several hardware command queues (often referred to as NVIDIA Hyper-Q [23]). NVIDIA also provides a software solution [24] that acts as a proxy to allow several processes to use the GPU as one, at the cost of losing process isolation. Combining these two features is especially useful for improving the utilization of GPU engines in legacy MPI applications. However, they do not solve the problem of sharing the GPU among several applications.
To enable true sharing, GPUs need a hardware mechanism that can preempt the execution of GPU kernels, rather than waiting for the program to release the GPU. Such a mechanism would enable system-level scheduling policies that can control the execution resources, in a similar way to how multitasking operating systems manage CPUs today. The assumed reason [1, 27] for the lack of a preemption mechanism in GPUs is the expected high overhead of saving and restoring the context of GPU cores (up to 256 KB of register file and 48 KB of on-chip scratch-pad memory per GPU core), which can take up to 44 µs on GK110, assuming peak memory bandwidth. Compared to the context switch time of less than 1 µs on modern CPUs, this might seem to be a prohibitively high overhead.
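A back-of-the-envelope check of that figure (the SM count and bandwidth below are our assumptions for a GK110-class part, not values given in the paper): with 15 SMs, each holding 256 KB of register file and 48 KB of scratch-pad memory, saving and then restoring all of this state over a roughly 208 GB/s memory interface takes about

    t ≈ 2 × 15 × (256 + 48) KB / 208 GB/s ≈ 9.3 MB / 208 GB/s ≈ 45 µs,

which is consistent with the 44 µs quoted above.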
In this paper we show that preemptive multitasking is not only necessary, but also a feasible approach to multiprogramming on GPUs. We design two preemption mechanisms with different effectiveness and implementation costs. One is similar to classic operating system preemption, where the execution on GPU cores is stopped and their context is saved, to implement true preemptive multitasking. The other mechanism exploits the semantics of the GPU programming model and the nature of GPU applications to implement preemption by stopping the issue of new work to preempted GPU cores and draining them of currently running work. We show that both mechanisms provide improvements in system responsiveness and fairness at the expense of a small loss in throughput.
Still, exclusive access to the execution engine limits the possible sharing to time multiplexing. We propose further hardware extensions that remove the exclusive-access constraint and allow GPU cores to be utilized individually. These extensions enable different processes to concurrently execute GPU kernels on different sets of GPU cores. Furthermore, we implement Dynamic Spatial Sharing (DSS), a hardware scheduling policy that dynamically partitions the resources (GPU cores) and assigns them to different processes according to the priorities assigned by the OS.
The three main contributions of the paper are (1) the design of two preemption mechanisms that allow GPUs to implement scheduling policies, (2) extensions for concurrent execution of different processes on GPUs that allow implementing spatial sharing, and (3) a scheduling policy that dynamically assigns disjoint sets of GPU cores to different processes. Experimental evaluation shows that the hardware support for preemptive multitasking introduced in this paper allows scheduler implementations for multiprogrammed environments that, on average, improve the performance of high-priority applications by up to 15.6x over the baseline at the cost of a 12% degradation in throughput. Our DSS scheduling policy improves normalized turnaround time by up to 2x and system fairness by up to 3.4x at the cost of a throughput degradation of up to 35%.
2. Background and Motivation
In this section we provide the background on the GPU architecture and execution model necessary for understanding our proposals. Our base architecture is modeled after the NVIDIA GK110 chip, but we keep the discussion generic to cover architectures from other vendors, as well as fused CPU-GPU architectures.
2.1. GPU Program Execution
Typically, GPU applications consist of repetitive bursts of 1) CPU execution, which performs control, preprocessing, or I/O operations; 2) GPU execution (kernels), which performs the computationally demanding tasks; and 3) data transfers between CPU and GPU, which bring input data to the GPU memory and return the outputs to the CPU memory. The GPU device driver is in charge of performing the bookkeeping tasks for the GPU, as the OS does for the CPU (e.g., managing the GPU memory space). GPU kernel invocations (kernel launches in CUDA terminology), initiation of data transfers, and GPU memory allocations are typically performed in the CPU code (referred to as commands in the rest of the paper). Each kernel launch consists of a number of threads executing the same code. Threads are grouped into thread blocks that are independent of each other, and only threads from the same thread block can cooperate, using barrier synchronization and communication through local memory (shared memory in CUDA terminology).
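As a concrete illustration (our example, not code from the paper), the following CUDA kernel is launched as a grid of independent thread blocks whose threads cooperate through shared memory and barrier synchronization; the kernel name and launch configuration are made up for the example.

    __global__ void blockSum(const float *in, float *out) {
        __shared__ float partial[256];     // shared (local) memory, private to this thread block
        int tid = threadIdx.x;
        partial[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                   // barrier: only threads of the same block synchronize
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction within the block
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = partial[0];
    }

    // Host side: one kernel launch command creates a grid of 1024 independent thread blocks.
    // blockSum<<<1024, 256>>>(d_in, d_out);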
Programming models for GPUs provide software work queues (streams in CUDA terminology) that allow programmers to specify the dependences between commands. Commands in different streams are considered independent and may be executed concurrently by the hardware. Because the latency of issuing a command to the GPU is significant [17], commands are sent to the GPU as soon as possible. Each process that uses a GPU gets its own GPU context, which contains the page table of the GPU memory and the streams defined by the programmer.
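For instance (illustrative host code of ours, with hypothetical kernel and buffer names), commands placed in different streams carry no dependence and may be overlapped by the hardware, while commands in the same stream execute in order:

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Same stream: the copy, the kernel, and the copy back execute in order.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
    kernelA<<<grid, block, 0, s0>>>(d_a);
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s0);

    // Different stream: independent, so kernelB may run on the execution engine
    // while stream s0 uses the data transfer engine.
    kernelB<<<grid, block, 0, s1>>>(d_b);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);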
To overcome the inefficiencies introduced by multiple processes sharing the GPU [20], NVIDIA provides a software solution called Multi-Process Service (MPS) [24]. MPS instantiates a proxy process that receives requests from client processes (e.g., processes in an MPI application) and executes them on the GPU. Such a solution has two main limitations: (1) memory allocated by any of the client processes is accessible from any of the other client processes, thus breaking the memory isolation between processes; (2) it does not allow scheduling policies to be enforced across the client processes. Although MPS provides important performance improvements in the case of MPI applications, it is not a general solution for multiprogrammed workloads.
2.2. Base GPU Architecture
The base architecture assumed in this paper is depicted in Figure 1. The GPU is connected to the rest of the system through an interconnection network (1). In the case of discrete GPUs, the interconnect is the PCI Express bus and the GPU has its own physical RAM (2). In the case of fused CPU/GPU architectures [8], the interconnect is an on-chip network and the GPU and CPU share the same physical RAM (3). Current-generation GPUs, including our baseline, do not support demand paging (memory swap-out). Thus, in today's GPU systems, allocations from all contexts reside in the GPU physical memory. The GPU has an execution engine (4) with access to its memory, and a data transfer engine (5) for transferring data between CPU memory and GPU memory (in integrated designs GPUs have DMA engines for bulk memory transfers, too). All the GPU computation cores (Streaming Multiprocessors or SMs in CUDA terminology) belong to the execution engine and are treated as a whole for kernel scheduling purposes.
Figure 1: Baseline GPU architecture.

The interface to the CPU implements several hardware queues (i.e., NVIDIA Hyper-Q) used by the CPU to issue GPU commands. The GPU device driver maps streams from the applications onto the command queues. A command dispatcher (6) inspects the top of the command queues and issues the commands to the corresponding engine. Data transfer commands are issued to the data transfer engine via DMA queues, while kernel launch commands are issued to the execution engine via the execution queue (7). After issuing a command, the dispatcher stops inspecting that queue. After completing a command, the corresponding engine notifies the command dispatcher so that the queue is re-enabled for inspection. Thus, commands from different command queues that target different engines can be concurrently executed. Conversely, commands coming from the same command queue are executed sequentially, following the semantics of the stream abstraction defined by the programming model. Traditionally, a single command queue was provided, but newer GPUs from NVIDIA provide several of them, often referred to as Hyper-Q [23]. Using several queues increases the opportunities to overlap independent commands that target different engines.
The GPU also includes a set of global control registers (8)
that hold the GPU context information used by the engines.
These control registers hold process-specific information, such
as the location of the virtual memory structures (e.g., page
table), the GPU kernels registered by the process, or structures
used by the graphics pipeline.
2.3. Base GPU Execution Engine
The base GPU execution engine we assume in this paper is shaded in Figure 1. The SM driver (9) gets kernel launch commands from the execution queue (7) and sets up the Kernel Status Registers (KSR) (10) with control information such as the number of work units to execute and the kernel parameters (number and stack pointer). The SM driver uses the contents of these registers, as well as the global GPU control registers, to set up the SMs before the execution of a kernel starts. The execution queue can contain a number of independent kernel commands coming from the same context, which are scheduled for execution in a first-come, first-served (FCFS) manner.
A kernel launch command consists of thread blocks that are independent of each other and, therefore, are executed on SMs (11) independently. Each thread block is divided into fixed-size groups of threads that execute in lock-step (warps in CUDA terminology) [21]. A reconvergence stack tracks the execution of divergent threads in a warp by storing the program counter and mask of the threads that took the divergent branch [11]. The SM cycles through the warps from all thread blocks assigned to it and executes (12) the next instruction of a warp with ready operands.
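A hypothetical kernel where half of the lanes in every warp take a different branch; the hardware serializes the two paths, using the reconvergence stack described above to track the not-taken side:

    __global__ void divergent(int *data) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)
            data[tid] *= 2;     // executed with the odd lanes masked off
        else
            data[tid] += 1;     // executed with the even lanes masked off
        // the two halves of the warp reconverge here and continue in lock-step
    }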
Figure 2: Execution of a soft real-time application with (a) FCFS (current GPUs), (b) nonpreemptive priority, and (c) preemptive priority schedulers. K1 and K2 are low-priority kernels, while K3 is high-priority.

When a thread block is issued to an SM, it remains resident in that SM until all of its warps finish execution. An SM can execute more than one thread block in an interleaved fashion. Concurrent execution of thread blocks relies on static hardware partitioning, so the available hardware resources (13) (e.g., registers and shared memory) are split among all the thread blocks in the SM. The resource usage of all the thread blocks from a kernel is the same and is known at kernel launch time. The number of thread blocks that can run concurrently is thus determined by the first fully used hardware resource. Static hardware partitioning implies that only thread blocks from the same kernel can be scheduled to run concurrently on the same SM.
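As a rough numeric example of this partitioning (the per-SM limits are our assumptions, roughly GK110-class: 65,536 registers and 48 KB of shared memory per SM): a kernel whose 256-thread blocks use 64 registers per thread and 12 KB of shared memory is limited to 65,536 / (64 × 256) = 4 blocks by registers and 48 / 12 = 4 blocks by shared memory, so at most 4 of its thread blocks are resident per SM. The CUDA runtime exposes this limit for a given (here hypothetical) kernel:

    int blocksPerSM = 0;
    // How many 256-thread blocks of myKernel fit on one SM, given its register
    // and shared memory usage (0 bytes of dynamically allocated shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);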
After the setup of the SM is done, the SM driver issues thread blocks to each SM until it is fully utilized (i.e., runs out of hardware resources). Whenever an SM finishes executing a thread block, the SM driver gets notified and issues a new thread block from the same kernel to that SM. This policy, combined with the static partitioning of hardware resources, means that a kernel with enough thread blocks will occupy all the available SMs, forcing the other kernel commands in the execution queue to stall. As a result, concurrent execution among kernels is possible only if there are free resources after issuing all the work from previous kernels. This back-to-back execution happens when a kernel does not have enough thread blocks to occupy all SMs or when the scheduled kernel is finishing its execution and SMs are becoming free again. Today's GPUs, however, do not support concurrent execution of commands from different contexts on the same engine. That is, only kernels from the same process can be concurrently executed.
2.4. Arguments for Preemptive Execution
The main goal of this work is to enable fine-grained scheduling of multiprogrammed workloads running on the GPU. Figure 2 illustrates how the scheduling support in current GPUs is not sufficient when, for example, a soft real-time application is competing for resources with other applications. The execution on a modern GPU is shown in Figure 2a, where the kernel with a deadline (K3) does not get scheduled until all previously issued kernels (K1 and K2) have finished executing. A software implementation [16] or a modification to the GPU command scheduler could allow priorities to be assigned to processes, resulting in the timeline shown in Figure 2b.
A common characteristic of the previous approaches is that the execution latency of K3 depends on the execution time of previously launched kernels from other processes. This is undesirable behavior from both the system's and the user's perspective, and it limits the effectiveness of the GPU scheduler. To decouple the scheduling from the latency of the kernels running on the GPU, a preemption mechanism is needed. Figure 2c illustrates how the latency of kernel K3 could decrease even further if kernel K1 can be preempted. Allowing GPUs to be used for latency-sensitive applications is the first motivation of this paper.
Preemptive execution on GPUs is not only useful for speeding up high-priority tasks; it is also required to guarantee forward progress of applications in multiprogrammed environments. The persistent threads pattern of GPU computing, for instance, uses kernels that occupy the GPU and actively wait for work to be submitted from the CPU [2, 14]. Preventing starvation when this kind of application runs in a multiprogrammed system is the second motivation of this paper.
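A minimal sketch of such a persistent kernel (ours, simplified): its thread blocks never terminate until the host sets a flag, so a preemption scheme that waits for thread blocks to finish would wait indefinitely.

    __global__ void persistentWorker(volatile int *quit) {
        // Thread blocks stay resident on their SMs, actively polling for work
        // submitted by the CPU; they only exit when the host writes *quit = 1.
        while (*quit == 0) {
            // ... dequeue and process work items published by the host ...
        }
    }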
There is a widespread assumption that preemption on GPUs is not cost-effective due to the large cost of context switching [1, 27]. Even though it is clear that in some cases preemption is necessary [19], it is not clear whether the benefits can justify the disadvantages when preemption is used by fine-grained schedulers. Comparing the benefits and drawbacks of the context saving and restoring approach to preemption with an alternative approach, where no context is saved or restored at preemption points, is the third motivation of this paper.
3. Support for Multiprogramming in GPUs
Following the standard practice of systems design, we separate mechanisms from the policies that use them. We provide two generic preemption mechanisms and policies that are completely oblivious to the preemption mechanism used. To simplify the implementation of policies, we abstract the common hardware into a scheduling framework.
3.1. Concurrent Execution of Processes
To support multiprogramming, the memory hierarchy, the execution engine, and the SMs all have to be aware of multiple active contexts. The memory hierarchy of the GPU needs to support concurrent accesses from different processes using different address spaces. Modern GPUs implement two types of memory hierarchy [28]. In one, the shared levels of the memory hierarchy are accessed using virtual addresses and the address translation is performed in the memory controller. The cache lines and memory segments of such a hierarchy have to be tagged with an address space identifier. The other implementation uses address translation at the private levels of the memory hierarchy, and physical memory addresses to access the shared levels of the memory hierarchy. The mechanisms that we describe here are compatible with both approaches. We assume that the latter approach is implemented, hence no modifications are required to the memory subsystem.
Figure 3: Operation of the SM driver. Dashed objects are proposed extensions.

If only one GPU context executes kernels, SMs can easily get the context information from the global GPU control structures. We extend the execution engine to include a context table with information about all active contexts. The context information is sent to the SM during its setup, before it starts receiving thread blocks to execute. The SM is extended with a GPU context id register, a base page table register, and other context-specific information, such as the texture registers. The base page table register is used on a TLB miss to walk the per-process page table stored in the main memory of the GPU. This is in contrast to the base GPU architecture, where the same page table is used by all SMs, since they execute kernels from the same context. Similarly, the GPU context id register is used when accessing the objects associated with the GPU context (e.g., kernels) from the SM. We extend the context of the SM, rather than reading this information from the context table, which would otherwise require many read ports to allow concurrent accesses from the SMs.
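A behavioral sketch of the added per-SM context state (our C-style illustration with assumed field names, not the actual register layout):

    #include <stdint.h>

    // Per-SM registers added by the proposed extensions.
    struct SMContext {
        uint32_t context_id;       // GPU context whose kernel the SM is executing
        uint64_t page_table_base;  // walked on a TLB miss to translate per-process addresses
        uint64_t texture_state;    // other context-specific state (e.g., texture registers)
    };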
3.2. Preemptive Kernel Execution
The scheduling policy is always in charge of deciding when and which kernels should be scheduled to run. If there are no idle SMs in the system, it uses the preemption mechanism to free up some SMs. To provide generic preemption support to different policies, we need to be able to preempt the execution on each SM individually. We provide this support by extending the SM driver. Figure 3 shows the operation of the SM driver, with dashed objects showing our extensions. When there are kernels to execute, the SM driver looks for an idle SM, performs the setup, and starts issuing thread blocks until the SM is fully occupied. The SM driver then repeats the procedure until there are no more idle SMs. When there are thread blocks left, the baseline SM driver issues a new thread block every time an SM notifies the driver that it has finished executing a thread block.
We extend this operation and allow the scheduling policy to preempt the execution on an SM (independently of which preemption mechanism is used) by labeling it as reserved. After receiving a notification of a finished thread block from the SM, the SM driver checks whether the SM is reserved. If not, it proceeds with the normal operation (issuing new thread blocks). If reserved, the driver waits for the preemption to complete, sets up the SM for the kernel that reserved it, and continues with the normal operation. In Section 3.3 we describe the hardware extensions used by the SM driver to perform the bookkeeping of SMs and active kernels.
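The extended behavior of the SM driver on a thread-block-finished notification can be summarized by the following sketch (our simplification of the hardware in C++-style code, with invented names):

    struct Kernel;                          // opaque here

    struct SM {
        bool    reserved = false;           // set by the scheduling policy to request preemption
        int     resident_blocks = 0;        // thread blocks currently executing on this SM
        Kernel *reserving_kernel = nullptr; // kernel that will run once preemption completes
    };

    struct SMDriver {
        void issue_next_block(SM &sm) { sm.resident_blocks++; /* issue a block of the current kernel */ }
        bool preemption_done(SM &sm)  {
            // Draining: done when no blocks remain; context switch: done once the contexts are saved.
            return sm.resident_blocks == 0;
        }
        void setup(SM &sm, Kernel *k) { (void)sm; (void)k; /* load context table entry, KSR, etc. */ }
    };

    void on_thread_block_finished(SMDriver &driver, SM &sm) {
        sm.resident_blocks--;
        if (!sm.reserved) {
            driver.issue_next_block(sm);                // baseline operation: keep the SM full
        } else if (driver.preemption_done(sm)) {
            driver.setup(sm, sm.reserving_kernel);      // hand the SM to the kernel that reserved it
            sm.reserved = false;
            driver.issue_next_block(sm);
        }                                               // otherwise: stop issuing and keep waiting
    }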
The first preemption mechanism that we implement, context switch, follows the basic principle of preemption used by operating system schedulers. The execution contexts of all the thread blocks running on the preempted SM are saved to off-chip memory, and these thread blocks are issued again later on. Each active kernel has a preallocated memory region where the contexts of its preempted thread blocks are kept. When a preempted thread block is issued, its execution context is first restored so the computation can continue correctly. This context consists of the architectural registers used by each thread of the thread block, the private partition of the shared memory, and other state that defines the execution of the thread block (e.g., the pointer to the reconvergence stack and the state of the barrier unit). Saving and restoring the context is performed by a microprogrammed trap routine. Each thread saves all of its registers, while the shared memory of the thread block is collaboratively saved by its threads. This operation is very similar to the context save and restore performed on a device-side kernel launch when using the dynamic parallelism feature of GK110 [25]. Since preemption raises an asynchronous trap, precise exceptions are needed [32]. The simplest solution is to drain the pipeline of all in-flight instructions before jumping to the trap routine. The main drawback of the context switch mechanism is that during the context save and restore, thread blocks do not progress, leading to a complete underutilization of the SM. This underutilization could be reduced by using compiler-microarchitecture co-designed context minimization techniques, such as iGPU [22].
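For the shared memory portion of that context, the collaborative save could look like the following device-side sketch (our illustration; the actual mechanism is a microprogrammed trap routine, and register state is handled separately):

    __device__ void save_shared(const float *smem_partition, float *backup, int words) {
        // Each thread of the block stores a strided slice of the block's shared memory
        // partition into a preallocated per-thread-block buffer in off-chip memory.
        for (int i = threadIdx.x; i < words; i += blockDim.x)
            backup[i] = smem_partition[i];
        __syncthreads();  // ensure the whole partition is saved before the block is torn down
    }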
The second mechanism that we implement, SM draining, tries to avoid this underutilization by preempting the execution at a thread block boundary (i.e., when a thread block finishes execution). Since thread blocks are independent and each one has its own state, no context has to be saved or restored. This mechanism deals with the interleaved execution of multiple thread blocks in an SM by draining the whole SM when the preemption happens. To perform the preemption by draining, the SM driver stops issuing new thread blocks to the given SM. When all the thread blocks issued to that SM finish, the execution on that SM is preempted.
The context switch mechanism has a relatively predictable latency that mainly depends on the amount of data that has to be moved from the SM (register file and shared memory) to the off-chip memory. The draining mechanism, on the other hand, trades this predictable latency for higher utilization of the SM. Its latency depends on the execution time of the currently running thread blocks, but SMs still get to do some useful work while draining. The draining mechanism naturally fits current GPU architectures, as it only requires small modifications to the SM driver. Its biggest drawback is its inability to effectively preempt the execution of applications with very long-running thread blocks, or to preempt the execution of malicious or persistent kernels at all.

Figure 4: Scheduling framework. The rest of the execution engine (SM Driver and SMs) is shaded.
3.3. Scheduling Framework
We extract a generic set of functionalities into a scheduling framework that can be used to implement different scheduling policies. The framework provides the means to track the state of kernels and SMs and to allow the scheduling policy to trigger the preemption of any SM. The scheduling policy plugs into the framework and implements the logic of the concrete scheduling algorithm. Both the scheduling framework and the scheduling policies are implemented in hardware to avoid the long latency of issuing commands to the GPU [17]. Both the context switch and draining preemption mechanisms are supported by our framework. Scheduling policies performing prioritization, time multiplexing, spatial sharing, or some combination of these can be implemented on top of it. The OS can tweak the priorities on the fly, but it is not directly involved in the scheduling process. Thus, there is no impact on OS noise.
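The separation between framework and policy can be pictured as a narrow event interface that a policy implements (our sketch with invented hook names; the paper realizes the policies directly in hardware):

    // The framework delivers these events; a policy (e.g., DSS) reacts by reading
    // command buffers, allocating KSRT entries, and reserving SMs for preemption.
    struct SchedulingPolicy {
        virtual void on_kernel_arrived(int command_buffer_id) = 0; // a new kernel launch is available
        virtual void on_kernel_finished(int ksrt_entry) = 0;       // an active kernel completed
        virtual void on_sm_freed(int sm_id) = 0;                   // a reserved SM finished preemption
        virtual ~SchedulingPolicy() {}
    };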
Figure 4 shows the components of the scheduling framework. An example of the interaction between the scheduling policy and the framework is given in Section 3.4. Command Buffers receive the commands from the command dispatcher and separate the execution commands from different contexts. Each command buffer can store one command. The Active Queue stores the identifiers of the active (running or preempted) kernels. When there are free entries in the active queue, the scheduling policy can read a command (kernel launch) from one of the command buffers and allocate an entry in the Kernel Status Register Table (KSRT). The KSRT is used to track active kernels; each valid entry is the KSR of one active kernel, augmented with the identifier of its GPU context. The active queue is used by the policy to search for scheduling candidates by indexing the KSRT. The SM
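A compact sketch of these bookkeeping structures (our illustration with assumed fields, not the exact hardware layout):

    #include <stdint.h>

    struct KSRTEntry {                 // one Kernel Status Register, plus the owning context
        bool     valid;
        uint32_t context_id;           // GPU context that launched the kernel
        uint32_t blocks_remaining;     // work units (thread blocks) still to be issued
        uint64_t params_stack_ptr;     // kernel parameters: count and stack pointer
    };

    struct CommandBuffer {             // holds at most one pending kernel-launch command per context
        bool     has_command;
        uint32_t context_id;
    };

    // The active queue holds KSRT indices of running or preempted kernels, which
    // the scheduling policy walks to find candidates for scheduling.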

References

GPU Computing
NVIDIA Tesla: A Unified Graphics and Computing Architecture
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
System-Level Performance Metrics for Multiprogram Workloads