Design and evaluation of the gemtc framework for GPU-enabled many-task computing

TL;DR: GeMTC overcomes the obstacles to using GPUs in a many-task manner by scheduling and launching independent tasks on hardware designed for SIMD-style vector processing and provides a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled.
Abstract: We present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. With GeMTC, a broad class of such "many-task" applications can leverage the increasing number of accelerated and hybrid high-end computing systems. GeMTC overcomes the obstacles to using GPUs in a many-task manner by scheduling and launching independent tasks on hardware designed for SIMD-style vector processing. We demonstrate the use of a high-level MTC programming model (the Swift parallel dataflow language) to run tasks on many accelerators and thus provide a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled. While still in an experimental stage, GeMTC can already support tasks of fine (subsecond) granularity and execute concurrent heterogeneous tasks on 86,000 independent GPU warps spanning 2.7M GPU threads on the Blue Waters supercomputer.

Summary

1. INTRODUCTION

  • This work explores methods for, and potential benefits of, applying the increasingly abundant and economical general-purpose graphics processing units (GPGPUs) to a broader class of applications.
  • Tasks typically run to completion: they follow the simple input-process-output model of procedures, rather than retaining state as in web services or MPI processes.
  • Efficient MTC implementations are now commonplace on clusters, grids, and clouds.
  • Integration of GeMTC with Swift, enabling a broad class of dataflow-based scientific applications, and improving programmability for both hybrid multicore hosts and extreme scale systems.
  • Work is load balanced among large numbers of GPUs.

2. CHALLENGES OF MANY-TASK COMPUTING ON GPGPUS

  • The authors' GeMTC work is motivated by the fact that, with current mainstream programming models, a significant portion of GPU processing capability is left underutilized by MTC workloads.
  • The results presented here indicate that this approach enables higher utilization of GPU resources, greater concurrency, and hence higher many-task throughput.

2.1 NVIDIA GPUs and GPGPU Computing.

  • General-purpose computing on graphics processing units allows a host CPU to offload a wide variety of computation, not just graphics, to a graphics processing unit (GPU).
  • GPUs are designed for vector parallelism: they contain many lightweight cores designed to support parallel bulk processing of graphics data.
  • A SMX contains many warps, and each warp provides 32 concurrent threads of execution.
  • Ousterhout et al. [11] make a compelling argument for the pervasive use of tiny tasks in compute clusters.
  • The authors apply a similar argument to motivate the GeMTC model of running many small independent tasks on accelerators.
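
To make the warp-level granularity noted in the bullets above concrete, the following CUDA fragment is an illustrative sketch (not code from the paper) of how a kernel can derive a warp-level worker ID and a lane ID from the built-in thread indices; GeMTC schedules independent tasks at exactly this warp granularity.

// Illustrative only: identifying a 32-thread warp "worker" inside a kernel.
__device__ int warpWorkerId(void)
{
    int globalThread = blockIdx.x * blockDim.x + threadIdx.x;
    return globalThread / 32;   // one worker per 32-thread warp
}

__device__ int laneId(void)
{
    return threadIdx.x % 32;    // this thread's position within its warp
}
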

2.2 Mainstream GPU Support for MTC

  • The dominant CUDA and OpenCL GPGPU programming models both provide extensions to traditional programming languages such as C with added API calls to interact with accelerators.
  • OpenACC is an open standard and aims to provide the portability of OpenCL while requiring less detailed knowledge of accelerator architecture than is required in CUDA and OpenCL programming.
  • Concurrent Kernels [14] is a CUDA feature that enables the developer to launch parallel work on a GPU.
  • The current model of GeMTC and Swift relies on communication between the CPU and GPU to drive tasks to and from the Swift script.
  • In addition, to process workflows with complex dependencies, the developer must group tasks into batches and block on batch completion before executing dependent kernels, an inadequate approach for supporting heterogeneous concurrent tasks.

3. GEMTC ARCHITECTURE

  • Given that their target test bed consisted of NVIDIA GPUs and that the authors wanted to examine the GPU at the finest granularity possible, they opted to implement their framework using CUDA.
  • This decision allowed the authors to work at the finest granularity possible but limited their evaluation to NVIDIA-based hardware.
  • While GeMTC was originally developed on NVIDIA CUDA devices, its architecture is general, and has also been implemented on the Intel Xeon Phi [16] .
  • The Phi, however, represents a different accelerator architecture, meriting separate study, and is not addressed in this paper.
  • A work queue in GPU memory is populated from calls to a C-based API, and GPU workers pick up and execute these tasks.

3.1 Kernel Structure and Task Descriptions

  • A key element of GeMTC is the daemon launched on the GPU, named the Super Kernel, which enables many hardware level workers (at the warp level) on the GPU.
  • After a worker has completed a computation, the results are placed on an outgoing result queue and returned to the caller.
  • Within traditional GPU programming, a user defined function that runs on the GPU is called a kernel.
  • These concurrent kernels are a key technology in the GeMTC framework.
  • The Super Kernel gathers hardware information from the GPU and dynamically starts the maximum number of workers available on that GPU.

3.2 GeMTC API

  • Figure 5 uses a simple molecular dynamics (MD) example to demonstrate how a user can leverage the GeMTC API to launch a simulation on the GPU.
  • Once these parameters have been transferred into GPU memory the user pushes the task to the GPU along with all the information needed to create the task description on the device.
  • At this point the user can begin polling for a result.
  • When the gemtcPoll function returns a result, the user can then unpack the memory and move to the next operation.
  • It is expected that end users will utilize high-level Swift scripts to launch their tasks on GeMTC.

3.3 Queues, Tasks, and Memory Management

  • The Incoming Work Queue is populated by calls to the GeMTC API and contains tasks that are ready to execute.
  • The tasks in this queue contain a TaskDescription and the necessary parameters to execute the task.
  • With traditional CUDA programming models the current best practice is to allocate all memory needed by an application at launch time and then manually manage and reuse this memory as needed.
  • Pointers to these free chunks and their sizes are stored in a circular linked list on the CPU.
  • The main bottleneck for obtaining high task throughput through GeMTC is the latency associated with writing a task to the GPU DRAM memory.

4. SWIFT: DATAFLOW EXECUTION AND PROGRAMMING MODEL FOR MTC

  • Swift [4] is an implicitly parallel functional dataflow programming language that is proving increasingly useful to express the higher-level logic of scientific and engineering applications.
  • Many important application classes and programming techniques that are driving the requirements for such extremescale systems include branch and bound, stochastic programming, materials by design, and uncertainty quantification.
  • The dataflow programming model of the Swift parallel scripting language can elegantly express, through implicit parallelism, the massive concurrency demanded by these applications while retaining the productivity benefits of a high-level language.
  • When using its own resource provisioner [6].
  • This enables Swift to express a far broader set of applications, and makes it a productive coordination language for hybrid CPU+accelerator nodes and systems.

GeMTC Integration with Swift

  • The integration with Swift provides many mutual benefits for both Swift and GeMTC.
  • The final box on the right illustrates how GeMTC fits into the Swift/T stack.
  • Thus, the user's Swift application can simply call any function mapped to an AppKernel from the high level Swift program.
  • Data transfers overlap with ongoing GPU computations implicitly and automatically.
  • Because the GeMTC API calls are handled at the Turbine worker level, the Swift programmer is freed from the burden of writing complex memory management code for the GPU.
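
The following C sketch illustrates how a worker-level wrapper could drive the GeMTC API calls listed in Table 1 of the paper on behalf of a Swift task. It is an assumption-laden illustration, not the authors' integration code: the wrapper name runTaskOnGemtc, the polling loop, and the exact call signatures are simplified guesses based on the push/poll flow described in the paper.

/* Hypothetical sketch of a worker-side wrapper around the GeMTC API;
 * the wrapper name, polling loop, and exact signatures are assumptions. */
void runTaskOnGemtc(int appKernel, int taskID, void *hostParams, size_t size)
{
    void *devParams = gemtcGPUMalloc(size);                 /* device buffer     */
    gemtcMemcpyHostToDevice(devParams, hostParams, size);   /* stage parameters  */
    gemtcPush(appKernel, 32, taskID, devParams);            /* one warp per task */
    void *result = NULL;
    while (result == NULL)                                  /* simplified poll   */
        gemtcPoll(taskID, &result);
    gemtcMemcpyDeviceToHost(hostParams, result, size);      /* copy result back  */
    gemtcGPUFree(result);
}
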

5. PERFORMANCE EVALUATION

  • This section evaluates the GeMTC framework with a set of AppKernels from the GeMTC AppKernel Library.
  • AppKernels are CUDA device functions that have been tuned to deliver high performance under MTC workloads.
  • The authors work with a lightweight molecular dynamics simulation called MDLite.
  • The authors conclude with an analysis of MDLite over multiple XK7 nodes and examine a set of simple adder benchmarks to highlight throughput and efficiency.
  • Blue Waters contains ∼20K Cray XE6 CPU-based nodes and ∼4K Cray XK7 GPU nodes.

5.1 Molecular Dynamics

  • The user specifies the number of particles in a "universe" along with their starting positions, the number of dimensions, and a starting mass.
  • MDLite runs a simulation that determines how the potential and kinetic energy in the system changes as the particles change position.
  • By varying the number of active threads included in a warp computation, the authors show that the right application can indeed benefit from all 32 threads in a GPU warp (see the sketch after this list).
  • Figure 15 evaluates a varied number of MDLite simulations running over a K20X GPU.
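
A minimal CUDA sketch of the thread-scaling idea described above (an illustration, not the authors' MDLite code): an AppKernel-style function strides its work across a configurable number of active lanes within the warp, so performance can be compared as more of the 32 threads participate.

// Illustrative only: striding work across a chosen number of warp lanes.
__device__ void scaledWork(float *data, int n, int activeThreads)
{
    int lane = threadIdx.x % 32;              // lane within this warp
    if (lane >= activeThreads) return;        // idle the unused lanes
    for (int i = lane; i < n; i += activeThreads)
        data[i] = data[i] * 2.0f;             // placeholder per-element work
}
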

5.2 Throughput and Efficiency

  • Next, the authors evaluate GeMTC with a simple adder benchmark.
  • Afterwards, the authors can easily measure the efficiency and overhead of their system: efficiency = expected runtime / observed runtime (a worked example appears after this list).
  • First, a CPU version of the simple adder is executed through Swift/T on XE6 nodes.
  • Figure 20 highlights the single-node efficiency of GeMTC running with 168 active workers per GPU.
  • The authors attribute this drop in performance to greater worker contention on the device queues and the fact that Swift must now drive 168 times the amount of work per node.
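
As a hypothetical worked example of the efficiency metric above (the numbers are illustrative, not measurements from the paper): if 1,000 independent 50 ms tasks are spread over 168 warp workers, the expected runtime is roughly ceil(1000/168) x 50 ms = 300 ms, so an observed runtime of 333 ms corresponds to an efficiency of about 300/333 ≈ 90%.
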

5.3 Preliminary MTC Xeon Phi Results

  • The authors have also gathered preliminary results for supporting MTC workloads on the Intel Xeon Phi Coprocessor.
  • As shown in Figure 23 the authors can achieve the same level of efficiency with shorter running tasks (50% shorter) on a Xeon Phi compared with a GTX-680 NVIDIA GPU.
  • The authors highlight the fact that with GeMTC on its own they observe upwards of 90% efficiency with tasks lasting 5 ms.
  • This means that a fully general purpose framework would be capable of launching tasks an order of magnitude faster.
  • The authors will continue to improve performance to ensure all components of the system can keep up with these task dispatch rates.

7. CONCLUSIONS

  • The authors have presented GeMTC, a framework for enabling MTC workloads to run efficiently on NVIDIA GPUs.
  • The GeMTC framework is responsible for receiving work from a host through the use of the C API, and scheduling and running that work on many independent GPU workers.
  • Results are returned through the C API to the host and then to Swift.
  • Applications that can generate thousands of SIMD threads may prefer to use traditional CUDA programming techniques.
  • Under the current configurations, users are required to write their own AppKernels.


Design and Evaluation of the GeMTC Framework for
GPU-enabled Many-Task Computing
Scott J. Krieder, Justin M. Wozniak, Timothy Armstrong, Michael Wilde,
Daniel S. Katz, Benjamin Grimmer, Ian T. Foster, Ioan Raicu
Department of Computer Science, Illinois Institute of Technology
Mathematics and Computer Science Division, Argonne National Laboratory
Department of Computer Science, University of Chicago
Computation Institute, University of Chicago & Argonne National Laboratory
ABSTRACT
We present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. With GeMTC, a broad class of such “many-task” applications can leverage the increasing number of accelerated and hybrid high-end computing systems. GeMTC overcomes the obstacles to using GPUs in a many-task manner by scheduling and launching independent tasks on hardware designed for SIMD-style vector processing. We demonstrate the use of a high-level MTC programming model (the Swift parallel dataflow language) to run tasks on many accelerators and thus provide a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled. While still in an experimental stage, GeMTC can already support tasks of fine (subsecond) granularity and execute concurrent heterogeneous tasks on 86,000 independent GPU warps spanning 2.7M GPU threads on the Blue Waters supercomputer.

Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming

Keywords
Many-task computing; GPGPU; CUDA; Accelerators; Hybrid execution; Workflow; Programming models; Execution models.
1. INTRODUCTION
This work explores methods for, and potential benefits of,
applying the increasingly abundant and economical general-
purpose graphics processing units (GPGPU) to a broader
class of applications. It extends the utility of GPGPU from
the class of heavily vectorizable applications to irregularly-
structured many-task applications. Such applications are
increasingly common, stemming from both problem-solving
approaches (i.e., parameter sweeps, simulated annealing or
branch-and-bound optimizations, uncertainty quantification)
and application domains (climate modeling, rational mate-
rials design, molecular dynamics, bioinformatics).
In many-task computing (MTC) [1, 2], tasks may be of
short (even subsecond) duration or highly variable (ranging
from milliseconds to minutes). Their dependency and data
passing characteristics may range from many similar tasks to
complex, and possibly dynamically determined, dependency
patterns. Tasks typically run to completion: they follow
the simple input-process-output model of procedures, rather
than retaining state as in web services or MPI processes.
Efficient MTC implementations are now commonplace on
clusters, grids, and clouds. In recent years we have ex-
tended MTC to applications on homogeneous supercom-
puters, using tools such as Falkon [3], Swift [4], JETS [5],
and Coasters [6]. Other programming models and tools
that support MTC include MapReduce, volunteer comput-
ing [7], SLURM [8], and Cobalt [9], which allow super-
computer tasks to be subdivided into asynchronous sub-
tasks [10]. All these approaches can benefit from the MTC-
enabling accelerator work we describe here. The contribu-
tions of this work are as follows:
  • Design and implementation of GeMTC, a framework enabling MTC workloads to run efficiently on NVIDIA GPUs.
  • Improved dynamic GPU memory management, providing efficient scaling and a 10x improvement over native CUDA dynamic memory management.
  • Integration of GeMTC with Swift, enabling a broad class of dataflow-based scientific applications, and improving programmability for both hybrid multicore hosts and extreme scale systems. Work is load balanced among large numbers of GPUs.
  • Performance evaluation on synthetic benchmarks and a proxy code representing molecular dynamics simulation workloads.
This paper is organized as follows: Section 2 describes the
challenges of many-task computing on GPGPUs. Section 3
describes the GeMTC framework and its underlying archi-
tecture. Section 4 describes Swift and its integration as a
GeMTC programming model. Section 5 presents a performance
evaluation, and Section 6 discusses related work. We summarize
our contributions in Section 7 and briefly discuss related work.

Figure 1: Diagram of GPU architecture hierarchy. (A GPU comprises N SMXs; each SMX contains M warps of 32 threads.)
2. CHALLENGES OF MANY-TASK COMPUTING ON GPGPUS
Our GeMTC work is motivated by the fact that with cur-
rent mainstream programming models, a significant portion
of GPU processing capability is left underutilized by MTC
workloads. We advocate sending a larger number of smaller, con-
current, independent tasks to a GPU. The results presented
here indicate that this approach enables higher utilization
of GPU resources, greater concurrency, and hence higher
many-task throughput.
2.1 NVIDIA GPUs and GPGPU Computing.
General-purpose computing on graphics processing units
(GPGPU) allows a host CPU to offload a wide variety of
computation, not just graphics, to a graphics processing
unit (GPU). GPUs are designed for vector parallelism: they
contain many lightweight cores designed to support paral-
lel bulk processing of graphics data. GPGPU leverages this
parallel architecture for nongraphic computations such as
matrix multiplication. In the context of this paper, all ref-
erences to GPU refer to this GPGPU approach. In addition
to application speedup, other benefits to leveraging accel-
erators include power efficiency (improved Flops/watt) and
cost savings (improved Flops/$).
As shown in Figure 1, a NVIDIA GPU (which dominates
the GPGPU and HPC marketplace) is comprised of many
Streaming Multiprocessors (SMXs). A SMX contains many
warps, and each warp provides 32 concurrent threads of ex-
ecution. All threads within a warp run in a Single Instruc-
tion Multiple Thread (SIMT) fashion. As we describe below,
GeMTC schedules independent computations on the GPU
at the warp level, a level of independent task concurrency
not provided by any mainstream GPU programming model.
Our GeMTC work targets the latest generation of NVIDIA
GPUs, specifically the Kepler K20X. This device has 14
SMXs with 192 cores per SMX, a maximum of 168 warps,
and a total core count of 2,688. MTC workloads that send
only single tasks, or small numbers of large tasks, to acceler-
ator devices observe near-serialized performance, and leave
a significant portion of device processor capability unused.
Ousterhout et al. [11] make a compelling argument for the
pervasive use of tiny tasks in compute clusters. We apply a
similar argument to motivate the GeMTC model of running
many small independent tasks on accelerators. Driven by
this tiny-task motivation, GeMTC provides an architecture
for “overdecomposition” [12] of accelerator-resident tasks,
which can then be tightly packed into a GPU to maximize
efficiency and minimize time to solution. While Swift load
balances tasks and applies compiler optimizations in sup-
port of overdecomposition, the user must write applications
with suitably fine-grained tasks.
2.2 Mainstream GPU Support for MTC
The dominant CUDA and OpenCL GPGPU programming
models both provide extensions to traditional programming
languages such as C with added API calls to interact with
accelerators. CUDA is supported by NVIDIA and works on
NVIDIA GPUs. OpenCL is based on an open standard that
aims to provide improved portability across a variety of ac-
celerators and other compute devices. OpenACC is a newer
pragma-based technology that is gaining momentum. As in
OpenMP, OpenACC programmers provide hints to the com-
piler where they believe a computation would benefit from
being offloaded to an accelerator. OpenACC is an open stan-
dard and aims to provide the portability of OpenCL while
requiring less detailed knowledge of accelerator architecture
than is required in CUDA and OpenCL programming. In
many cases OpenACC may require significantly less coding,
but early measurements (e.g., by Wienke et al. [13]) suggest
that OpenACC is not yet capable of delivering equivalent
performance.
Concurrent Kernels [14] is a CUDA feature that enables
the developer to launch parallel work on a GPU. However,
the maximum number of concurrent kernels is limited to
32, far less than the number of 168 independent warps pro-
vided by the latest Kepler GPUs. HyperQ and Dynamic
Parallelism [15], recent CUDA enhancements introduced by
NVIDIA with the Kepler architecture, are a step toward
MTC support. HyperQ allows more parallel work to be
sent to the GPU, while Dynamic Parallelism allows threads
to spawn more threads on the device. The current model
of GeMTC and Swift relies on communication between the
CPU and GPU to drive tasks to and from the Swift script. If
a task sent to GeMTC from Swift was represented by com-
pact code and could be decomposed even further (e.g., loop
unrolling) it is possible that GeMTC could utilize Dynamic
Parallelism to dynamically launch new tasks and process the
parent task with even more improved performance, but we
leave that as future work. Most other programming models,
however, still treat the GPU as a solution to large vector-
oriented SIMD computations and do not adequately support
the potential for speedup of many-task applications.
A primary motivation for our work on GeMTC is that
none of these mainstream accelerator programming mod-
els provides the flexible support for independent concurrent
tasks required by many-task applications. In order to ef-
fectively utilize an accelerator, MTC applications with com-
plex task dependencies need task results rapidly returned
from device to host so that the application can process its
dataflow-driven dependencies. To the best of our knowledge,
no solution prior to GeMTC offers this capability.
Figure 2(A) illustrates why many-task computing work-
loads experience low efficiencies through Concurrent Kernels, the best available standard CUDA concurrency model
for independent tasks launched by the host. In this model,
tasks must be submitted at the same time, and no additional
tasks can be submitted until all tasks are complete. With
unbalanced task durations, a significant number of GPU pro-
cessor cores will be underutilized. In addition, to process
workflows with complex dependencies, the developer must
group tasks into batches and block on batch completion be-
fore executing dependent kernels, an inadequate approach
for supporting heterogeneous concurrent tasks. Figure 2(B)
demonstrates how GeMTC provides support for heteroge-
neous tasks by treating every warp worker as an indepen-
dently operating SIMD compute device. Because the warps
are operating independently they are able to pick up work
immediately rather than block on other warps for comple-
tion. Figure 2(C) demonstrates how overdecomposition can
be utilized by GeMTC to pack tiny tasks neatly into the
GPU, maximizing device core utilization and reducing ap-
plication time to solution.
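
For reference, the batched Concurrent Kernels pattern criticized above (Figure 2(A)) can be written with standard CUDA streams roughly as in the sketch below. This is an illustration, not GeMTC code; smallTaskKernel and taskArgs are hypothetical placeholders. All kernels in the batch are launched together, and the host must block on the whole batch before any dependent work can be issued.

// Illustration only: launching a fixed batch of 32 small kernels, then
// blocking until the entire batch completes (the limitation noted above).
void launchBatchOfTasks(void *taskArgs[32])   /* taskArgs is a placeholder */
{
    cudaStream_t streams[32];
    for (int i = 0; i < 32; i++)
        cudaStreamCreate(&streams[i]);
    for (int i = 0; i < 32; i++)
        smallTaskKernel<<<1, 32, 0, streams[i]>>>(taskArgs[i]);  // hypothetical kernel
    cudaDeviceSynchronize();   // no new tasks until the whole batch finishes
    for (int i = 0; i < 32; i++)
        cudaStreamDestroy(streams[i]);
}
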
3. GEMTC ARCHITECTURE
Given that our target test bed consisted of NVIDIA GPUs
and that we wanted to examine the GPU at the finest gran-
ularity possible, we opted to implement our framework us-
ing CUDA. This decision allowed us to work at the finest
granularity possible but limited our evaluation to NVIDIA
based hardware. While GeMTC was originally developed
on NVIDIA CUDA devices, its architecture is general, and
has also been implemented on the Intel Xeon Phi [16]. The
Phi, however, represents a different accelerator architecture,
meriting separate study, and is not addressed in this paper.
Figure 3 shows a high-level diagram of GeMTC driven
by tasks generated by the Swift parallel functional dataflow
language (described in Section IV). GeMTC launches a dae-
mon on the GPU that enables independent tasks to be mul-
tiplexed onto warp-level GPU workers. A work queue in
GPU memory is populated from calls to a C-based API,
and GPU workers pick up and execute these tasks. After a
worker has completed a computation, the results are placed
on an outgoing result queue and returned to the caller.
3.1 Kernel Structure and Task Descriptions
A key element of GeMTC is the daemon launched on the
GPU, named the Super Kernel, which enables many hard-
ware level workers (at the warp level) on the GPU. A work
queue in GPU memory is populated from calls to a C API,
and GPU workers pick up and execute these tasks. After a
worker has completed a computation, the results are placed
on an outgoing result queue and returned to the caller.
Within traditional GPU programming, a user defined func-
tion that runs on the GPU is called a kernel. An application
may define many GPU kernels, and application logic may be
written to execute some or all kernels in parallel. These con-
current kernels are a key technology in the GeMTC frame-
work. Once the GeMTC framework is initialized, the Super
Kernel daemon is started, the memory management system
is set up, and calls can begin to Application Kernels (App-
Kernels). The Super Kernel gathers hardware information
from the GPU and dynamically starts the maximum num-
ber of workers available on that GPU. A worker consists of a
single warp, and therefore the maximum number of workers
is equal to the maximum number of warps.
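
To illustrate the structure described above, the following schematic CUDA sketch shows a persistent super-kernel loop in which each 32-thread warp acts as an independent worker. The queue types, tryDequeue, runAppKernel, and enqueueResult are hypothetical placeholders for the behavior described in the text, not the authors' implementation.

// Schematic sketch of a warp-level worker loop; helper types and functions
// are hypothetical placeholders, not GeMTC source code.
#define MAX_WARPS_PER_BLOCK 32

__global__ void superKernelSketch(WorkQueue *inQ, ResultQueue *outQ,
                                  volatile int *shutdown)
{
    int lane = threadIdx.x % 32;                   // lane within this warp
    int warp = threadIdx.x / 32;                   // warp index within the block
    __shared__ int claimedSlot[MAX_WARPS_PER_BLOCK];

    while (!*shutdown) {
        if (lane == 0)                             // one lane claims the next task
            claimedSlot[warp] = tryDequeue(inQ);   // hypothetical; -1 if queue empty
        __syncwarp();                              // all 32 lanes see the claimed slot
        int slot = claimedSlot[warp];
        if (slot < 0) continue;                    // nothing ready; poll again

        runAppKernel(inQ, slot, lane);             // hypothetical dispatch; 32 lanes cooperate
        if (lane == 0)
            enqueueResult(outQ, slot);             // hypothetical; post result for the host
    }
}
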
Figure 2: (A) Concurrent Kernels with batched tasks; (B) GeMTC FIFO scheduler; (C) GeMTC overdecomposition. The GeMTC FIFO scheduler processes tasks as soon as they are available, rather than blocking on batches for completion. The warps required to execute cases (B) and (C) are provided by all the streaming multiprocessors within the shaded area of (A). While the available hardware remains the same, the number of parallel channels for concurrent parallel work is increased.

Figure 3: Flow of a task in GeMTC.
AppKernels are the computations that are executed by a
GeMTC worker. The AppKernels are modular in design,
and users can quickly contribute to the AppKernel Library
by writing their own AppKernels based on pre-existing tem-
plates. A major appeal of the GeMTC framework is the
decomposition of the GPU programming model. Instead of
an application launching hundreds or thousands of threads,
which could quickly become more challenging to manage,
GeMTC AppKernels are optimized at the warp level, mean-
ing the programmer and AppKernel logic are responsible for
managing only 32 threads in a given application. Further-
more, run-time logic can be used to control concurrency of
tasks to ensure that GPU cores are kept utilized without
exhausting the GPU memory.
The Task Description is a C struct that contains rele-
vant information for executing an AppKernel as a task on
GeMTC. The Task Description is passed from a client via
the GeMTC API (e.g., by Swift) to the GPU and queued
with parameters on the device to the input queue or queued
with task results on the outgoing result queue.
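
A sketch of what such a task description might look like in C appears below. The field names are assumptions drawn from the push parameters described in Section 3.2 (TaskType, TASK ID, numThreads, and a device pointer to parameters); the actual struct layout used by GeMTC is not reproduced in this text.

/* Illustrative sketch of a task description; field names are assumptions,
 * not GeMTC's actual struct definition. */
typedef struct {
    int   taskType;     /* which AppKernel to run, e.g. MDLite           */
    int   taskID;       /* unique integer for tracking the task          */
    int   numThreads;   /* threads the AppKernel uses (one warp = 32)    */
    void *params;       /* device pointer to the packed input parameters */
} TaskDescription;
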
Figure 4 shows how a sample AppKernel could be writ-
ten to compute a naive square matrix multiplication through
GeMTC. Swift stubs have marshaled AppKernel parameters
into a single boxed parameter. Therefore, after calibrating
for warp size, the first step is to unbox the parameters. Af-
ter executing an algorithm optimized for the warp size, the
result is stored in a location identified from unboxing the
input parameters. The result is then placed on an outgoing
result queue, and the warp is ready to pick up new work.
3.2 GeMTC API
The GeMTC API is a C-based API which consists of eight
major functions identified in Table 1. Figure 5 uses a sim-
ple molecular dynamics (MD) example to demonstrate how
a user can leverage the GeMTC API to launch a simula-
tion on the GPU. For the MD example, the user defines
the initial universe of molecules as a parameter to the MD
function. Once these parameters have been transferred into
GPU memory the user pushes the task to the GPU along
with all the information needed to create the task description
on the device. The push operation contains, as parameters,
the four pieces of data necessary to construct the task de-
scription; in this case, TaskType = MDLite, TASK ID is set
to a unique integer value (for tracking the task throughout
its lifetime), numThreads = 32, and *params = a pointer to
device memory where the task parameters are stored.
__device__ void MatrixMultiply(void *boxed_input)
{
    // calibrate for warp size
    int warp_size = 32;
    int thread = threadIdx.x % warp_size;
    // unbox host parameters
    float *inputParams = (float *)boxed_input;
    int matrixWidth = inputParams[0];
    int matrixSize = matrixWidth * matrixWidth;
    float *matrixA = inputParams + 1;
    float *matrixB = matrixA + matrixSize;
    float *matrixOut = matrixA + 2 * matrixSize;
    // compute matrix multiplication
    for (unsigned int i = thread; i < matrixWidth; i = i + warp_size) {
        for (unsigned int j = 0; j < matrixWidth; j++) {
            float sum = 0;
            for (unsigned int k = 0; k < matrixWidth; k++) {
                float a = matrixA[i * matrixWidth + k];
                float b = matrixB[k * matrixWidth + j];
                sum += a * b;
            }
            // result location from input parameters
            matrixOut[i * matrixWidth + j] = sum;
        }
    }
}
Figure 4: GeMTC Mat-Mul AppKernel
Table 1: GeMTC API
API Call                        Functionality Provided
gemtcSetup / gemtcCleanup       Start / stop GeMTC
gemtcPush / gemtcPoll           Submit / return tasks
gemtcMemcpyHostToDevice         Memory copy (host to device)
gemtcMemcpyDeviceToHost         Memory copy (device to host)
gemtcGPUMalloc / gemtcGPUFree   Allocate / free device memory
At this point the user can begin polling for a result. The
precompiled MD AppKernel already knows how to pack and
unpack the function parameters from memory; and once the
function completes, the result is packed into memory and
placed on the result queue. When the gemtcPoll function
returns a result, the user can then unpack the memory and
move to the next operation. The gemtcPoll function does
not block on a specific task, and it automatically pops any
completed task(s) off the result queue. This strategy is ex-
plained in further detail in the Task Bundling subsection.
In addition, the example shown in Figure 5 is specific to
users leveraging the C API. It is expected that end users
will utilize high-level Swift scripts to launch their tasks on
GeMTC. The calls described above are implicitly handled by
the GeMTC and Swift integration, as explained in further
detail in Section 4.
3.3 Queues, Tasks, and Memory Management
GeMTC manages two queues on the device. The Incom-
ing Work Queue is populated by calls to the GeMTC API
and contains tasks that are ready to execute. The tasks in
this queue contain a TaskDescription and the necessary pa-
rameters to execute the task. Both in-memory queues are
configured as circular linked-lists with pointers indicating
the front and rear of the queue. When a worker picks up
a task, it will dequeue from the front, and any new work
is placed at the rear. Figure 6 demonstrates how workers
interact with the queues.

#include "gemtc.cu"
int main() {
    // Start GeMTC
    gemtcSetup(QUEUE_SIZE);
    // Allocate device memory
    device_params = gemtcGPUMalloc(MALLOC_SIZE);
    // Populate device memory
    gemtcMemcpyHostToDevice(device_params, host_params, MALLOC_SIZE);
    // Push a task to the GPU
    gemtcPush(MD_Lite, NUM_THREADS, TaskID, device_params);
    // Poll for completed results
    gemtcPoll(TaskID, pointer);
    // Copy back results
    gemtcMemcpyDeviceToHost(host_params, pointer, MALLOC_SIZE);
    // Free GPU memory
    gemtcGPUFree(pointer);
    // Shutdown GeMTC
    gemtcCleanup();
}
Figure 5: Code sample of GeMTC API.
Figure 6: GPU Workers interacting with queues.
The GeMTC framework requires efficient device memory
allocation on a per task basis. Each task enqueued requires
at least two device allocations: the first for the task itself
and the second for parameters and results. The existing
CUDA memory management system was not designed for a
large number of independent memory allocations. With tra-
ditional CUDA programming models the current best prac-
tice is to allocate all memory needed by an application at
launch time and then manually manage and reuse this mem-
ory as needed.
To reduce the large overhead of individual memory alloca-
tions for MTC workloads, GeMTC includes a sub-allocator
designed to efficiently handle many requests for dynamic al-
location. The sub-allocator uses the existing CUDA malloc
to allocate large contiguous pieces of device memory, allo-
cating more as needed. Then pointers to these free chunks
and their sizes are stored in a circular linked list on the
CPU (see Figure 7). This list is ordered by increasing de-
vice address to allow for easy memory coalescing of adjacent
memory chunks.
Figure 7: Memory mapping of free memory available
to the device.
Figure 8: Result of gemtcMalloc on free memory.
When a GeMTC memory allocation request is sent from
the host to the GPU, the sub-allocator will traverse the list
and select the first chunk of free device memory meeting
the allocation requirements. Figure 8 demonstrates how the
header is then updated to reflect the remaining free device
memory available in that chunk. This operation runs in the
same order of time as a single memory copy to the device.
Upon freeing device memory, the header is read to iden-
tify the size of the chunk. Then it is added to the list of
free memory in the correct location. If there is any free con-
secutive memory, the chunk is coalesced to provide a single
larger contiguous chunk of memory. The operation to free
device memory takes roughly the same amount of time as
reading the header (i.e., a device memory copy).
Both malloc() and free() within GeMTC’s memory man-
agement run in O(n), where n is the length of the free mem-
ory list. In addition, the size of the list is proportional to
the amount of memory fragmentation since each element is
recorded as a separate chunk of memory. Because malloc
and free both need to write and read to the GPU memory,
these operations may scale poorly under workloads with high
fragmentation. However, the MTC workloads we examine
show no signs of high fragmentation. The original cudaMalloc ran in 100 microseconds, and our gemtcMalloc runs in 10 microseconds.
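
The following C sketch illustrates the first-fit, coalescing free-list behavior described above. FreeChunk, HEADER_SIZE, writeHeaderOnDevice, and gemtcMallocSketch are hypothetical names; the real gemtcMalloc/gemtcFree implementations are not reproduced here.

/* Illustrative first-fit sub-allocator over pre-allocated device memory;
 * names and details are assumptions, not the GeMTC source. */
typedef struct FreeChunk {
    char   *devPtr;            /* start of this free chunk in device memory */
    size_t  size;              /* bytes available in this chunk             */
    struct FreeChunk *next;    /* circular list ordered by device address   */
} FreeChunk;

static FreeChunk *freeList;    /* head of the host-side circular free list  */

void *gemtcMallocSketch(size_t bytes)
{
    FreeChunk *c = freeList;
    do {
        if (c->size >= bytes + HEADER_SIZE) {
            char *block = c->devPtr;
            writeHeaderOnDevice(block, bytes);      /* one small device write */
            c->devPtr += bytes + HEADER_SIZE;       /* shrink this free chunk */
            c->size   -= bytes + HEADER_SIZE;
            return block + HEADER_SIZE;             /* usable region          */
        }
        c = c->next;
    } while (c != freeList);
    return NULL;  /* no chunk fits: cudaMalloc another large slab and retry */
}
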
To optimize the GeMTC framework for fine-grained tasks,
we have implemented a task-bundling system to reduce the
amount of communication between the host and GPU. The
main bottleneck for obtaining high task throughput through
GeMTC is the latency associated with writing a task to the
GPU DRAM memory. This bundling system as shown in
Figure 9 creates a buffer of tasks that need to be written to
the GPU, and flushes it periodically or when it is full. This

Citations
Proceedings ArticleDOI
16 Nov 2014
TL;DR: This work presents a comprehensive set of compiler techniques for data-driven task parallelism, including novel compiler optimizations and intermediate representations, and demonstrates that these techniques greatly reduce communication overhead and enable extreme scalability.
Abstract: Swift/T is a high-level language for writing concise, deterministic scripts that compose serial or parallel codes implemented in lower-level programming models into large-scale parallel applications. It executes using a data-driven task parallel execution model that is capable of orchestrating millions of concurrently executing asynchronous tasks on homogeneous or heterogeneous resources. Producing code that executes efficiently at this scale requires sophisticated compiler transformations: poorly optimized code inhibits scaling with excessive synchronization and communication. We present a comprehensive set of compiler techniques for data-driven task parallelism, including novel compiler optimizations and intermediate representations. We report application benchmark studies, including unbalanced tree search and simulated annealing, and demonstrate that our techniques greatly reduce communication overhead and enable extreme scalability, distributing up to 612 million dynamically load balanced tasks per second at scales of up to 262,144 cores without explicit parallelism, synchronization, or load balancing in application code.

52 citations


Cites background from "Design and evaluation of the gemtc ..."

  • ...Swift/T focuses on high-performance fine-grained task parallelism, such as calling foreign functions (including C and Fortran) with in-memory data and launching kernels on GPUs and other accelerators [16]....

    [...]

Posted Content
Wei Tan1, Liangliang Cao2, Liana Fong1
TL;DR: In this article, a CUDA-based matrix factorization library that implements memory-optimized alternate least square (ALS) method to solve very large-scale MF is presented, which uses a variety set of techniques to maximize the performance on either single or multiple GPUs.
Abstract: Matrix factorization (MF) is employed by many popular algorithms, e.g., collaborative filtering. The emerging GPU technology, with massively multicore and high intra-chip memory bandwidth but limited memory capacity, presents an opportunity for accelerating MF much further when appropriately exploiting the GPU architectural characteristics. This paper presents cuMF, a CUDA-based matrix factorization library that implements memory-optimized alternate least square (ALS) method to solve very large-scale MF. CuMF uses a variety set of techniques to maximize the performance on either single or multiple GPUs. These techniques include smart access of sparse data leveraging GPU memory hierarchy, using data parallelism in conjunction with model parallelism, minimizing the communication overhead between computing units, and utilizing a novel topology-aware parallel reduction scheme. With only a single machine with four Nvidia GPU cards, cuMF can be 6-10 times as fast, and 33-100 times as cost-efficient, compared with the state-of-art distributed CPU solutions. Moreover, this cuMF can solve the largest matrix factorization problem ever reported yet in current literature, while maintaining impressively good performance.

40 citations


Cites background from "Design and evaluation of the gemtc ..."

  • ...Recently, the GPU emerges as an accelerator for parallel algorithms [14, 15]....

    [...]

Posted Content
TL;DR: Salus implements an efficient, consolidated execution service that exposes the GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues, and can be used to implement flexible sharing policies such as fairness, prioritization, and packing for various use cases.
Abstract: GPU computing is becoming increasingly more popular with the proliferation of deep learning (DL) applications. However, unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time sharing and preemption are expensive. Worse, when a DL application cannot completely use a GPU's resources, the GPU cannot be efficiently shared between multiple applications, leading to GPU underutilization. We present Salus to enable two GPU sharing primitives: fast job switching and memory sharing, in order to achieve fine-grained GPU sharing among multiple DL applications. Salus implements an efficient, consolidated execution service that exposes the GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues. We show that these primitives can then be used to implement flexible sharing policies such as fairness, prioritization, and packing for various use cases. Our integration of Salus with TensorFlow and evaluation on popular DL jobs show that Salus can improve the average completion time of DL training jobs by $3.19\times$, GPU utilization for hyper-parameter tuning by $2.38\times$, and GPU utilization of DL inference applications by $42\times$ over not sharing the GPU and $7\times$ over NVIDIA MPS with small overhead.

35 citations

Proceedings ArticleDOI
16 Nov 2014
TL;DR: The Strings scheduler realizes the vision of a dynamic model where GPUs are treated as first class schedulable entities by decomposing the GPU scheduling problem into a combination of load balancing and per-device scheduling.
Abstract: Accelerator-based systems are making rapid inroads into becoming platforms of choice for high end cloud services. There is a need therefore, to move from the current model in which high performance applications explicitly and programmatically select the GPU devices on which to run, to a dynamic model where GPUs are treated as first class schedulable entities. The Strings scheduler realizes this vision by decomposing the GPU scheduling problem into a combination of load balancing and per-device scheduling. (i) Device-level scheduling efficiently uses all of a GPU's hardware resources, including its computational and data movement engines, and (ii) load balancing goes beyond obtaining high throughput, to ensure fairness through prioritizing GPU requests that have attained least service. With its methods, Strings achieves improvements in system throughput and fairness of up to 8.70x and 13%, respectively, compared to the CUDA runtime.

34 citations

Proceedings ArticleDOI
26 Jan 2017
TL;DR: Pagoda is presented, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel, and achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of- the-art runtime GPU task scheduling system.
Abstract: Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., they contain This paper presents Pagoda, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. Experimental results demonstrate that Pagoda achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of- the-art runtime GPU task scheduling system.

34 citations


Cites background from "Design and evaluation of the gemtc ..."

  • ...Another way to spawn tasks is to use a batch-based mechanism [14], where CPU sends a batch of tasks to the GPU....

    [...]

  • ...Prior work has identified the issue of GPU underutilization [37, 8, 25, 14]....

    [...]

  • ...Prior work, GPU enabled Many-Task Computing (GeMTC) [14], presents a runtime task scheduling mechanism, where a task executes as a single threadblock....

    [...]

References
Proceedings ArticleDOI
12 Feb 2011
TL;DR: A novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users and significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance.
Abstract: Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffered heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture.We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single thread CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.

363 citations


"Design and evaluation of the gemtc ..." refers background or methods in this paper

  • ...Future work also includes performance evaluation of diverse application kernels; analysis of the ability of such kernels to effectively utilize concurrent warps; enabling of virtual warps [25] which can both subdivide and span physical warps; support for other accelerators such as the Xeon Phi; and continued performance refinement....

    [...]

  • ...[25] developed and evaluated methods that obtained 9X speedups of breadth-first search in graphs over prior GPU implementations by enabling each warp to run independent threads and even multiple “virtual” threads....

    [...]

Proceedings ArticleDOI
10 Nov 2007
TL;DR: Falkon's integration of multi-level scheduling and streamlined dispatchers delivers performance not provided by any other system, and large-scale astronomy and medical applications executed under Falkon by the Swift parallel programming system achieve up to 90% reduction in end-to-end run time.
Abstract: To enable the rapid execution of many tasks on compute clusters, we have developed Falkon, a Fast and Light-weight tasK executiON framework. Falkon integrates (1) multi-level scheduling to separate resource acquisition (via, e.g., requests to batch schedulers) from task dispatch, and (2) a streamlined dispatcher. Falkon's integration of multi-level scheduling and streamlined dispatchers delivers performance not provided by any other system. We describe Falkon architecture and implementation, and present performance results for both microbenchmarks and applications. Microbenchmarks show that Falkon throughput (487 tasks/sec) and scalability (to 54,000 executors and 2,000,000 tasks processed in just 112 minutes) are one to two orders of magnitude better than other systems used in production Grids. Large-scale astronomy and medical applications executed under Falkon by the Swift parallel programming system achieve up to 90% reduction in end-to-end run time, relative to versions that execute tasks via separate scheduler submissions.

350 citations


"Design and evaluation of the gemtc ..." refers methods in this paper

  • ...When using Falkon [3], Swift achieved over 1,000 tasks per second....

    [...]

  • ...In recent years we have extended MTC to applications on homogeneous supercomputers, using tools such as Falkon [3], Swift [4], JETS [5], and Coasters [6]....

    [...]

  • ...[3] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, and M. Wilde, Falkon: a Fast and Light-weight tasK executiON framework, in Proc. of the 2007 ACM/IEEE Conf. on Supercomputing (SC 07)....

    [...]

Book ChapterDOI
27 Aug 2012
TL;DR: This work presents the first experiences with OpenACC, an API consisting of compiler directives to offload loops and regions of C/C++ and Fortran code to accelerators and finds that OpenACC offers a promising ratio of development effort to performance and that a directive-based approach to program accelerators is more efficient than low-level APIs, even if suboptimal performance is achieved.
Abstract: Today's trend to use accelerators like GPGPUs in heterogeneous computer systems has entailed several low-level APIs for accelerator programming. However, programming these APIs is often tedious and therefore unproductive. To tackle this problem, recent approaches employ directive-based high-level programming for accelerators. In this work, we present our first experiences with OpenACC, an API consisting of compiler directives to offload loops and regions of C/C++ and Fortran code to accelerators. We compare the performance of OpenACC to PGI Accelerator and OpenCL for two real-world applications and evaluate programmability and productivity. We find that OpenACC offers a promising ratio of development effort to performance and that a directive-based approach to program accelerators is more efficient than low-level APIs, even if suboptimal performance is achieved.

279 citations


"Design and evaluation of the gemtc ..." refers background in this paper

  • ...[13]) suggest that OpenACC is not yet capable of delivering equivalent performance....

    [...]

Proceedings ArticleDOI
23 Oct 2011
TL;DR: It is shown that the PTask API can provide important system-wide guarantees where there were previously none, and can enable significant performance improvements, for example gaining a 5× improvement in maximum throughput for the gestural interface.
Abstract: We propose a new set of OS abstractions to support GPUs and other accelerator devices as first class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation, and can streamline data movement in ways that are impossible under current GPU programming models. Our experience developing the PTask API, along with a gestural interface on Windows 7 and a FUSE-based encrypted file system on Linux show that the PTask API can provide important system-wide guarantees where there were previously none, and can enable significant performance improvements, for example gaining a 5× improvement in maximum throughput for the gestural interface.

256 citations

Proceedings ArticleDOI
03 Nov 2013
TL;DR: Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution.
Abstract: Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and energy efficiency. Because heterogeneous systems typically comprise multiple execution contexts with different programming abstractions and runtimes, programming them remains extremely challenging.Dandelion is a system designed to address this programmability challenge for data-parallel applications. Dandelion provides a unified programming model for heterogeneous systems that span diverse execution contexts including CPUs, GPUs, FPGAs, and the cloud. It adopts the .NET LINQ (Language INtegrated Query) approach, integrating data-parallel operators into general purpose programming languages such as C# and F#. It therefore provides an expressive data model and native language integration for user-defined functions, enabling programmers to write applications using standard high-level languages and development tools.Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution. To enable automatic execution of .NET code on GPUs, Dandelion cross-compiles .NET code to CUDA kernels and uses the PTask runtime [85] to manage GPU execution. This paper discusses the design and implementation of Dandelion, focusing on the distributed CPU and GPU implementation. We evaluate the system using a diverse set of workloads.

160 citations


"Design and evaluation of the gemtc ..." refers methods in this paper

  • ...NET code into CUDA code that is then executed on the PTask runtime [29]....

    [...]

Frequently Asked Questions (18)
Q1. What are the contributions mentioned in the paper "Design and evaluation of the gemtc framework for gpu-enabled many-task computing" ?

The authors present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. The authors demonstrate the use of a high-level MTC programming model (the Swift parallel dataflow language) to run tasks on many accelerators and thus provide a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled.

GeMTC is currently optimized for executing within environments containing a single GPU per node, such as Blue Waters; future work aims to address heterogeneous accelerator environments. The authors leave this for future work. Future work also includes performance evaluation of diverse application kernels; analysis of the ability of such kernels to effectively utilize concurrent warps; enabling of virtual warps [25], which can both subdivide and span physical warps; support for other accelerators such as the Xeon Phi; and continued performance refinement.

Many important application classes and programming techniques that are driving the requirements for such extremescale systems include branch and bound, stochastic programming, materials by design, and uncertainty quantification. 

The dataflow programming model of the Swift parallel scripting language can elegantly express, through implicit parallelism, the massive concurrency demanded by these applications while retaining the productivity benefits of a high-level language. 

To optimize the GeMTC framework for fine-grained tasks, the authors have implemented a task-bundling system to reduce the amount of communication between the host and GPU. 

The GeMTC implementation on the Xeon Phi will benefit greatly from avoiding memory and thread oversubscription, as highlighted in this work.

Tasks typically run to completion: they follow the simple input-process-output model of procedures, rather than retaining state as in web services or MPI processes. 

Scaling an application down to the level of concurrency available within a single warp can provide the highest level of thread utilization for some applications. 

MTC workloads that send only single tasks, or small numbers of large tasks, to accelerator devices observe near-serialized performance, and leave a significant portion of device processor capability unused. 

Instead of an application launching hundreds or thousands of threads, which could quickly become more challenging to manage, GeMTC AppKernels are optimized at the warp level, meaning the programmer and AppKernel logic are responsible for managing only 32 threads in a given application. 

The Pegasus project runs at the hypervisor level and promotes GPU sharing across virtual machines, while including a custom DomA scheduler for GPU task scheduling. 

Each task enqueued requires at least two device allocations: the first for the task itself and the second for parameters and results. 

The main bottleneck for obtaining high task throughput through GeMTC is the latency associated with writing a task to the GPU DRAM memory. 

If the compiler is able to generate device code and parallel instructions, the developer may opt to write sequential code and benefit from accelerator speedup. 

The precompiled MD AppKernel already knows how to pack and unpack the function parameters from memory; and once the function completes, the result is packed into memory and placed on the result queue. 

With traditional CUDA programming models the current best practice is to allocate all memory needed by an application at launch time and then manually manage and reuse this memory as needed. 

While the walltime of MDLite successfully decreases as more threads are added, the speedup obtained is significantly less than ideal after 8 threads are active within a single warp.