
Design and evaluation of the GeMTC framework for GPU-enabled many-task computing

TL;DR: GeMTC overcomes the obstacles to using GPUs in a many-task manner by scheduling and launching independent tasks on hardware designed for SIMD-style vector processing and provides a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled.
Abstract: We present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. With GeMTC, a broad class of such "many-task" applications can leverage the increasing number of accelerated and hybrid high-end computing systems. GeMTC overcomes the obstacles to using GPUs in a many-task manner by scheduling and launching independent tasks on hardware designed for SIMD-style vector processing. We demonstrate the use of a high-level MTC programming model (the Swift parallel dataflow language) to run tasks on many accelerators and thus provide a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled. While still in an experimental stage, GeMTC can already support tasks of fine (subsecond) granularity and execute concurrent heterogeneous tasks on 86,000 independent GPU warps spanning 2.7M GPU threads on the Blue Waters supercomputer.

Summary (4 min read)

1. INTRODUCTION

  • This work explores methods for, and potential benefits of, applying the increasingly abundant and economical general-purpose graphics processing units to a broader class of applications.
  • Tasks typically run to completion: they follow the simple input-process-output model of procedures, rather than retaining state as in web services or MPI processes.
  • Efficient MTC implementations are now commonplace on clusters, grids, and clouds.
  • Integration of GeMTC with Swift, enabling a broad class of dataflow-based scientific applications, and improving programmability for both hybrid multicore hosts and extreme scale systems.
  • Work is load balanced among large numbers of GPUs.

2. CHALLENGES OF MANY-TASK COMPUTING ON GPGPUS

  • The authors' GeMTC work is motivated by the fact that, with current mainstream programming models, a significant portion of GPU processing capability remains underutilized by MTC workloads.
  • The results presented here indicate that this approach enables higher utilization of GPU resources, greater concurrency, and hence higher many-task throughput.

2.1 NVIDIA GPUs and GPGPU Computing.

  • General-purpose computing on graphics processing units allows a host CPU to offload a wide variety of computation, not just graphics, to a graphics processing unit (GPU).
  • GPUs are designed for vector parallelism: they contain many lightweight cores designed to support parallel bulk processing of graphics data.
  • An SMX contains many warps, and each warp provides 32 concurrent threads of execution.
  • Ousterhout et al. [11] make a compelling argument for the pervasive use of tiny tasks in compute clusters.
  • The authors apply a similar argument to motivate the GeMTC model of running many small independent tasks on accelerators.

2.2 Mainstream GPU Support for MTC

  • The dominant CUDA and OpenCL GPGPU programming models both provide extensions to traditional programming languages such as C with added API calls to interact with accelerators.
  • OpenACC is an open standard and aims to provide the portability of OpenCL while requiring less detailed knowledge of accelerator architecture than is required in CUDA and OpenCL programming.
  • Concurrent Kernels [14] is a CUDA feature that enables the developer to launch parallel work on a GPU.
  • The current model of GeMTC and Swift relies on communication between the CPU and GPU to drive tasks to and from the Swift script.
  • In addition, to process workflows with complex dependencies, the developer must group tasks into batches and block on batch completion before executing dependent kernels, an inadequate approach for supporting heterogeneous concurrent tasks.

3. GEMTC ARCHITECTURE

  • Given that their target test bed consisted of NVIDIA GPUs and that the authors wanted to examine the GPU at the finest granularity possible, they opted to implement their framework using CUDA.
  • This decision allowed them to work at the finest granularity possible but limited their evaluation to NVIDIA-based hardware.
  • While GeMTC was originally developed on NVIDIA CUDA devices, its architecture is general, and has also been implemented on the Intel Xeon Phi [16] .
  • The Phi, however, represents a different accelerator architecture, meriting separate study, and is not addressed in this paper.
  • A work queue in GPU memory is populated from calls to a C-based API, and GPU workers pick up and execute these tasks.

3.1 Kernel Structure and Task Descriptions

  • A key element of GeMTC is the daemon launched on the GPU, named the Super Kernel, which enables many hardware level workers (at the warp level) on the GPU.
  • After a worker has completed a computation, the results are placed on an outgoing result queue and returned to the caller.
  • Within traditional GPU programming, a user defined function that runs on the GPU is called a kernel.
  • These concurrent kernels are a key technology in the GeMTC framework.
  • The Super Kernel gathers hardware information from the GPU and dynamically starts the maximum number of workers available on that GPU.

3.2 GeMTC API

  • Figure 5 uses a simple molecular dynamics (MD) example to demonstrate how a user can leverage the GeMTC API to launch a simulation on the GPU.
  • Once these parameters have been transferred into GPU memory the user pushes the task to the GPU along with all the information needed to create the task description on the device.
  • At this point the user can begin polling for a result.
  • When the gemtcPoll function returns a result, the user can then unpack the memory and move to the next operation.
  • It is expected that end users will utilize high-level Swift scripts to launch their tasks on GeMTC.

3.3 Queues, Tasks, and Memory Management

  • The Incoming Work Queue is populated by calls to the GeMTC API and contains tasks that are ready to execute.
  • The tasks in this queue contain a TaskDescription and the necessary parameters to execute the task.
  • With traditional CUDA programming models the current best practice is to allocate all memory needed by an application at launch time and then manually manage and reuse this memory as needed.
  • The sub-allocator carves large CUDA allocations into chunks; pointers to these free chunks and their sizes are stored in a circular linked list on the CPU.
  • The main bottleneck for obtaining high task throughput through GeMTC is the latency associated with writing a task to the GPU DRAM memory.

4. SWIFT: DATAFLOW EXECUTION AND PROGRAMMING MODEL FOR MTC

  • Swift [4] is an implicitly parallel functional dataflow programming language that is proving increasingly useful to express the higher-level logic of scientific and engineering applications.
  • Many important application classes and programming techniques that are driving the requirements for such extreme-scale systems include branch and bound, stochastic programming, materials by design, and uncertainty quantification.
  • The dataflow programming model of the Swift parallel scripting language can elegantly express, through implicit parallelism, the massive concurrency demanded by these applications while retaining the productivity benefits of a high-level language.
  • When using its own resource provisioner [6], this enables Swift to express a far broader set of applications and makes it a productive coordination language for hybrid CPU+accelerator nodes and systems.

GeMTC Integration with Swift

  • The integration with Swift provides many mutual benefits for both Swift and GeMTC.
  • The final box on the right illustrates how GeMTC fits into the Swift/T stack.
  • Thus, the user's Swift application can simply call any function mapped to an AppKernel from the high level Swift program.
  • Data transfers overlap with ongoing GPU computations implicitly and automatically.
  • And because the GeMTC API calls are handled at the Turbine worker level, the Swift programmer is freed from the burden of writing complex memory management code for the GPU.

5. PERFORMANCE EVALUATION

  • This section evaluates the GeMTC framework with a set of AppKernels from the GeMTC AppKernel Library.
  • AppKernels are CUDA device functions that have been tuned to deliver high performance under MTC workloads.
  • The authors work with a lightweight molecular dynamics simulation called MDLite.
  • The authors conclude with an analysis of MDLite over multiple XK7 nodes and examine a set of simple adder benchmarks to highlight throughput and efficiency.
  • Blue Waters contains ∼20K Cray XE6 CPU-based nodes and ∼4K Cray XK7 GPU nodes.

5.1 Molecular Dynamics

  • The user specifies the number of particles in a "universe" along with their starting positions, the number of dimensions, and a starting mass.
  • MDLite runs a simulation that determines how the potential and kinetic energy in the system changes as the particles change position.
  • By varying the number of active threads included in a warp computation, the authors show that the right application can indeed benefit from all 32 threads in a GPU warp.
  • Figure 15 evaluates a varied number of MDLite simulations running over a K20X GPU.

5.2 Throughput and Efficiency

  • Next, the authors evaluate GeMTC with a simple adder benchmark.
  • Afterwards, the authors can easily measure the efficiency and overhead of their system: efficiency = expected runtime / observed runtime (a brief worked example follows this list).
  • First, a CPU version of the simple adder is executed through Swift/T on XE6 nodes.
  • Figure 20 highlights the single-node efficiency of GeMTC running with 168 active workers per GPU.
  • The authors attribute this drop in performance to greater worker contention on the device queues and the fact that Swift must now drive 168 times the amount of work per node.
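A hedged worked example of this efficiency metric (the task counts and durations below are illustrative, not the paper's measurements): if 1,000 tasks of 5 ms each are spread evenly over 168 warp workers, the expected runtime is ceil(1000/168) × 5 ms = 30 ms; an observed runtime of 40 ms would then give

  efficiency = expected runtime / observed runtime = 30 ms / 40 ms = 0.75 (75%).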

5.3 Preliminary MTC Xeon Phi Results

  • The authors have also gathered preliminary results for supporting MTC workloads on the Intel Xeon Phi Coprocessor.
  • As shown in Figure 23 the authors can achieve the same level of efficiency with shorter running tasks (50% shorter) on a Xeon Phi compared with a GTX-680 NVIDIA GPU.
  • The authors highlight the fact that with GeMTC on its own they observe upwards of 90% efficiency with tasks lasting 5 ms.
  • This means that a fully general purpose framework would be capable of launching tasks an order of magnitude faster.
  • The authors will continue to improve performance to ensure all components of the system can keep up with these task dispatch rates.

7. CONCLUSIONS

  • The authors have presented GeMTC, a framework for enabling MTC workloads to run efficiently on NVIDIA GPUs.
  • The GeMTC framework is responsible for receiving work from a host through the use of the C API, and scheduling and running that work on many independent GPU workers.
  • Results are returned through the C API to the host and then to Swift.
  • Applications that can generate thousands of SIMD threads may prefer to use traditional CUDA programming techniques.
  • Under the current configurations, users are required to write their own AppKernels.


Design and Evaluation of the GeMTC Framework for GPU-enabled Many-Task Computing

Scott J. Krieder, Justin M. Wozniak, Timothy Armstrong, Michael Wilde, Daniel S. Katz, Benjamin Grimmer, Ian T. Foster, Ioan Raicu

Department of Computer Science, Illinois Institute of Technology
Mathematics and Computer Science Division, Argonne National Laboratory
Department of Computer Science, University of Chicago
Computation Institute, University of Chicago & Argonne National Laboratory
ABSTRACT
We present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. With GeMTC, a broad class of such "many-task" applications can leverage the increasing number of accelerated and hybrid high-end computing systems. GeMTC overcomes the obstacles to using GPUs in a many-task manner by scheduling and launching independent tasks on hardware designed for SIMD-style vector processing. We demonstrate the use of a high-level MTC programming model (the Swift parallel dataflow language) to run tasks on many accelerators and thus provide a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled. While still in an experimental stage, GeMTC can already support tasks of fine (subsecond) granularity and execute concurrent heterogeneous tasks on 86,000 independent GPU warps spanning 2.7M GPU threads on the Blue Waters supercomputer.

Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming

Keywords
Many-task computing; GPGPU; CUDA; Accelerators; Hybrid execution; Workflow; Programming models; Execution models.
1. INTRODUCTION
This work explores methods for, and potential benefits of, applying the increasingly abundant and economical general-purpose graphics processing units (GPGPUs) to a broader class of applications. It extends the utility of GPGPU from the class of heavily vectorizable applications to irregularly-structured many-task applications.
Such applications are increasingly common, stemming from both problem-solving approaches (i.e., parameter sweeps, simulated annealing or branch-and-bound optimizations, uncertainty quantification) and application domains (climate modeling, rational materials design, molecular dynamics, bioinformatics).

In many-task computing (MTC) [1, 2], tasks may be of short (even subsecond) duration or highly variable (ranging from milliseconds to minutes). Their dependency and data passing characteristics may range from many similar tasks to complex, and possibly dynamically determined, dependency patterns. Tasks typically run to completion: they follow the simple input-process-output model of procedures, rather than retaining state as in web services or MPI processes.

Efficient MTC implementations are now commonplace on clusters, grids, and clouds. In recent years we have extended MTC to applications on homogeneous supercomputers, using tools such as Falkon [3], Swift [4], JETS [5], and Coasters [6]. Other programming models and tools that support MTC include MapReduce, volunteer computing [7], SLURM [8], and Cobalt [9], which allow supercomputer tasks to be subdivided into asynchronous subtasks [10]. All these approaches can benefit from the MTC-enabling accelerator work we describe here. The contributions of this work are as follows:

  • Design and implementation of GeMTC, a framework enabling MTC workloads to run efficiently on NVIDIA GPUs.
  • Improved dynamic GPU memory management, providing efficient scaling and a 10x improvement over native CUDA dynamic memory management.
  • Integration of GeMTC with Swift, enabling a broad class of dataflow-based scientific applications, and improving programmability for both hybrid multicore hosts and extreme-scale systems. Work is load balanced among large numbers of GPUs.
  • Performance evaluation on synthetic benchmarks and a proxy code representing molecular dynamics simulation workloads.
This paper is organized as follows: Section 2 describes the challenges of many-task computing on GPGPUs. Section 3 describes the GeMTC framework and its underlying architecture. Section 4 describes Swift and its integration as a GeMTC programming model. Section 5 presents a performance evaluation, and Section 6 discusses related work. We summarize our contributions in Section 7 and briefly discuss future work.

Figure 1: Diagram of GPU architecture hierarchy. A GPU contains N SMXs; each SMX contains M warps of 32 threads each.
2. CHALLENGES OF MANY-TASK COMPUTING ON GPGPUS
Our GeMTC work is motivated by the fact that, with current mainstream programming models, a significant portion of GPU processing capability remains underutilized by MTC workloads. We advocate sending a larger number of smaller, concurrent, independent tasks to a GPU. The results presented here indicate that this approach enables higher utilization of GPU resources, greater concurrency, and hence higher many-task throughput.
2.1 NVIDIA GPUs and GPGPU Computing
General-purpose computing on graphics processing units (GPGPU) allows a host CPU to offload a wide variety of computation, not just graphics, to a graphics processing unit (GPU). GPUs are designed for vector parallelism: they contain many lightweight cores designed to support parallel bulk processing of graphics data. GPGPU leverages this parallel architecture for nongraphic computations such as matrix multiplication. In the context of this paper, all references to GPU refer to this GPGPU approach. In addition to application speedup, other benefits of leveraging accelerators include power efficiency (improved Flops/watt) and cost savings (improved Flops/$).

As shown in Figure 1, an NVIDIA GPU (which dominates the GPGPU and HPC marketplace) is composed of many Streaming Multiprocessors (SMXs). An SMX contains many warps, and each warp provides 32 concurrent threads of execution. All threads within a warp run in a Single Instruction Multiple Thread (SIMT) fashion. As we describe below, GeMTC schedules independent computations on the GPU at the warp level, a level of independent task concurrency not provided by any mainstream GPU programming model.

Our GeMTC work targets the latest generation of NVIDIA GPUs, specifically the Kepler K20X. This device has 14 SMXs with 192 cores per SMX, a maximum of 168 warps, and a total core count of 2,688. MTC workloads that send only single tasks, or small numbers of large tasks, to accelerator devices observe near-serialized performance and leave a significant portion of device processor capability unused.

Ousterhout et al. [11] make a compelling argument for the pervasive use of tiny tasks in compute clusters. We apply a similar argument to motivate the GeMTC model of running many small independent tasks on accelerators. Driven by this tiny-task motivation, GeMTC provides an architecture for "overdecomposition" [12] of accelerator-resident tasks, which can then be tightly packed into a GPU to maximize efficiency and minimize time to solution. While Swift load balances tasks and applies compiler optimizations in support of overdecomposition, the user must write applications with suitably fine-grained tasks.
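As a consistency check on these figures (simple arithmetic, not data from the paper): a K20X exposes 14 SMXs × 192 cores = 2,688 cores and at most 168 warp workers, so the abstract's 86,000 concurrent warps correspond to roughly 512 fully populated GPUs (512 × 168 = 86,016), and 86,016 warps × 32 threads ≈ 2.75M, consistent with the reported 2.7M GPU threads.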
2.2 Mainstream GPU Support for MTC
The dominant CUDA and OpenCL GPGPU programming models both provide extensions to traditional programming languages such as C with added API calls to interact with accelerators. CUDA is supported by NVIDIA and works on NVIDIA GPUs. OpenCL is based on an open standard that aims to provide improved portability across a variety of accelerators and other compute devices. OpenACC is a newer pragma-based technology that is gaining momentum. As in OpenMP, OpenACC programmers provide hints to the compiler where they believe a computation would benefit from being offloaded to an accelerator. OpenACC is an open standard and aims to provide the portability of OpenCL while requiring less detailed knowledge of accelerator architecture than is required in CUDA and OpenCL programming. In many cases OpenACC may require significantly less coding, but early measurements (e.g., by Wienke et al. [13]) suggest that OpenACC is not yet capable of delivering equivalent performance.

Concurrent Kernels [14] is a CUDA feature that enables the developer to launch parallel work on a GPU. However, the maximum number of concurrent kernels is limited to 32, far fewer than the 168 independent warps provided by the latest Kepler GPUs. HyperQ and Dynamic Parallelism [15], recent CUDA enhancements introduced by NVIDIA with the Kepler architecture, are a step toward MTC support. HyperQ allows more parallel work to be sent to the GPU, while Dynamic Parallelism allows threads to spawn more threads on the device. The current model of GeMTC and Swift relies on communication between the CPU and GPU to drive tasks to and from the Swift script. If a task sent to GeMTC from Swift were represented by compact code and could be decomposed even further (e.g., by loop unrolling), GeMTC could potentially use Dynamic Parallelism to launch new tasks dynamically and process the parent task with improved performance, but we leave that as future work. Most other programming models, however, still treat the GPU as a solution to large vector-oriented SIMD computations and do not adequately support the potential for speedup of many-task applications.

A primary motivation for our work on GeMTC is that none of these mainstream accelerator programming models provides the flexible support for independent concurrent tasks required by many-task applications. In order to effectively utilize an accelerator, MTC applications with complex task dependencies need task results rapidly returned from device to host so that the application can process its dataflow-driven dependencies. To the best of our knowledge, no solution prior to GeMTC offers this capability.
Figure 2(A) illustrates why many-task computing workloads experience low efficiencies through Concurrent Kernels, the best available standard CUDA concurrency model for independent tasks launched by the host. In this model, tasks must be submitted at the same time, and no additional tasks can be submitted until all tasks are complete. With unbalanced task durations, a significant number of GPU processor cores will be underutilized. In addition, to process workflows with complex dependencies, the developer must group tasks into batches and block on batch completion before executing dependent kernels, an inadequate approach for supporting heterogeneous concurrent tasks. Figure 2(B) demonstrates how GeMTC provides support for heterogeneous tasks by treating every warp worker as an independently operating SIMD compute device. Because the warps are operating independently, they are able to pick up work immediately rather than block on other warps for completion. Figure 2(C) demonstrates how overdecomposition can be utilized by GeMTC to pack tiny tasks neatly into the GPU, maximizing device core utilization and reducing application time to solution.
3. GEMTC ARCHITECTURE
Given that our target test bed consisted of NVIDIA GPUs and that we wanted to examine the GPU at the finest granularity possible, we opted to implement our framework using CUDA. This decision allowed us to work at the finest granularity possible but limited our evaluation to NVIDIA-based hardware. While GeMTC was originally developed on NVIDIA CUDA devices, its architecture is general and has also been implemented on the Intel Xeon Phi [16]. The Phi, however, represents a different accelerator architecture, meriting separate study, and is not addressed in this paper.

Figure 3 shows a high-level diagram of GeMTC driven by tasks generated by the Swift parallel functional dataflow language (described in Section 4). GeMTC launches a daemon on the GPU that enables independent tasks to be multiplexed onto warp-level GPU workers. A work queue in GPU memory is populated from calls to a C-based API, and GPU workers pick up and execute these tasks. After a worker has completed a computation, the results are placed on an outgoing result queue and returned to the caller.
3.1 Kernel Structure and Task Descriptions
A key element of GeMTC is the daemon launched on the GPU, named the Super Kernel, which enables many hardware-level workers (at the warp level) on the GPU. A work queue in GPU memory is populated from calls to a C API, and GPU workers pick up and execute these tasks. After a worker has completed a computation, the results are placed on an outgoing result queue and returned to the caller.

Within traditional GPU programming, a user-defined function that runs on the GPU is called a kernel. An application may define many GPU kernels, and application logic may be written to execute some or all kernels in parallel. These concurrent kernels are a key technology in the GeMTC framework. Once the GeMTC framework is initialized, the Super Kernel daemon is started, the memory management system is set up, and calls can begin to Application Kernels (AppKernels). The Super Kernel gathers hardware information from the GPU and dynamically starts the maximum number of workers available on that GPU. A worker consists of a single warp, and therefore the maximum number of workers is equal to the maximum number of warps.
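The paper does not list the Super Kernel source, but the warp-level worker loop described above can be sketched in CUDA roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the TaskDescription fields, the atomically advanced queue index, and the appKernelStub placeholder stand in for GeMTC's circular in-memory queues and AppKernel dispatch, and the real daemon polls indefinitely and pushes results onto the outgoing queue rather than exiting.

// Hedged sketch of a persistent Super Kernel: each warp repeatedly dequeues one
// task and executes it cooperatively, independently of every other warp.
struct TaskDescription {
    int   taskType;    // which AppKernel to run (e.g., MDLite, MatrixMultiply)
    int   taskID;      // host-assigned identifier used to track the task
    int   numThreads;  // threads participating in the task (one warp = 32)
    void *params;      // device pointer to packed parameters and results
};

__device__ void appKernelStub(TaskDescription *t)
{
    // placeholder: a real AppKernel (see Figure 4) unboxes t->params, computes
    // with up to 32 threads, and stores its result for the outgoing result queue
}

__global__ void superKernel(TaskDescription *queue, int numTasks, int *nextTask)
{
    int lane = threadIdx.x % 32;                 // lane within this warp
    for (;;) {
        int idx = -1;
        if (lane == 0)                           // one dequeue per warp
            idx = atomicAdd(nextTask, 1);
        idx = __shfl_sync(0xffffffffu, idx, 0);  // broadcast the index to the warp
        if (idx >= numTasks)                     // work list drained; worker exits
            return;
        appKernelStub(&queue[idx]);              // all 32 lanes cooperate on one task
    }
}

On the K20X described above, launching such a kernel with, for example, 14 blocks of 384 threads would populate all 168 warp workers (14 × 12 warps).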
Figure 2: (A) Concurrent Kernels with batched tasks; (B) the GeMTC FIFO scheduler; (C) GeMTC overdecomposition. The GeMTC FIFO scheduler processes tasks as soon as they are available, rather than blocking on batches for completion. The warps required to execute cases (B) and (C) are provided by all the streaming multiprocessors within the shaded area of (A). While the hardware available remains the same, the number of parallel channels is increased for the amount of concurrent parallel work.

Figure 3: Flow of a task in GeMTC.
AppKernels are the computations that are executed by a GeMTC worker. The AppKernels are modular in design, and users can quickly contribute to the AppKernel Library by writing their own AppKernels based on pre-existing templates. A major appeal of the GeMTC framework is the decomposition of the GPU programming model. Instead of an application launching hundreds or thousands of threads, which could quickly become more challenging to manage, GeMTC AppKernels are optimized at the warp level, meaning the programmer and AppKernel logic are responsible for managing only 32 threads in a given application. Furthermore, run-time logic can be used to control concurrency of tasks to ensure that GPU cores are kept utilized without exhausting the GPU memory.

The Task Description is a C struct that contains relevant information for executing an AppKernel as a task on GeMTC. The Task Description is passed from a client via the GeMTC API (e.g., by Swift) to the GPU and queued with parameters on the device on the input queue, or queued with task results on the outgoing result queue.

Figure 4 shows how a sample AppKernel could be written to compute a naive square matrix multiplication through GeMTC. Swift stubs have marshaled AppKernel parameters into a single boxed parameter. Therefore, after calibrating for warp size, the first step is to unbox the parameters. After executing an algorithm optimized for the warp size, the result is stored in a location identified from unboxing the input parameters. The result is then placed on an outgoing result queue, and the warp is ready to pick up new work.
3.2 GeMTC API
The GeMTC API is a C-based API consisting of eight major functions, identified in Table 1. Figure 5 uses a simple molecular dynamics (MD) example to demonstrate how a user can leverage the GeMTC API to launch a simulation on the GPU. For the MD example, the user defines the initial universe of molecules as a parameter to the MD function. Once these parameters have been transferred into GPU memory, the user pushes the task to the GPU along with all the information needed to create the task description on the device. The push operation contains, as parameters, the four pieces of data necessary to construct the task description; in this case, TaskType = MDLite, TaskID is set to a unique integer value (for tracking the task throughout its lifetime), numThreads = 32, and *params is a pointer to device memory where the task parameters are stored.
__device__ void MatrixMultiply(void *boxed_input)
{
    // calibrate for warp size
    int warp_size = 32;
    int thread = threadIdx.x % warp_size;
    // unbox host parameters
    float *inputParams = (float *)boxed_input;
    int matrixWidth = inputParams[0];
    int matrixSize = matrixWidth * matrixWidth;
    float *matrixA = inputParams + 1;
    float *matrixB = matrixA + matrixSize;
    float *matrixOut = matrixA + 2 * matrixSize;
    // compute matrix multiplication
    for (unsigned int i = thread; i < matrixWidth; i = i + warp_size) {
        for (unsigned int j = 0; j < matrixWidth; j++) {
            float sum = 0;
            for (unsigned int k = 0; k < matrixWidth; k++) {
                float a = matrixA[i * matrixWidth + k];
                float b = matrixB[k * matrixWidth + j];
                sum += a * b;
            }
            // result location from input parameters
            matrixOut[i * matrixWidth + j] = sum;
        }
    }
}
Figure 4: GeMTC Mat-Mul AppKernel
Table 1: GeMTC API

API Call                      Functionality Provided
gemtc(Setup/Cleanup)          (Start/Stop) GeMTC
gemtc(Push/Poll)              (Submit/Return) Tasks
gemtcMemcpyHostToDevice       Memory Copy (Host to Device)
gemtcMemcpyDeviceToHost       Memory Copy (Device to Host)
gemtcGPU(Malloc/Free)         (Allocate/Free) Memory
At this point the user can begin polling for a result. The precompiled MD AppKernel already knows how to pack and unpack the function parameters from memory; and once the function completes, the result is packed into memory and placed on the result queue. When the gemtcPoll function returns a result, the user can then unpack the memory and move to the next operation. The gemtcPoll function does not block on a specific task, and it automatically pops any completed task(s) off the result queue. This strategy is explained in further detail in the Task Bundling subsection.

In addition, the example shown in Figure 5 is specific to users leveraging the C API. It is expected that end users will utilize high-level Swift scripts to launch their tasks on GeMTC. The calls described above are implicitly handled by the GeMTC and Swift integration, as explained in further detail in Section 4.
3.3 Queues, Tasks, and Memory Management
GeMTC manages two queues on the device. The Incoming Work Queue is populated by calls to the GeMTC API and contains tasks that are ready to execute. The tasks in this queue contain a TaskDescription and the necessary parameters to execute the task. Both in-memory queues are configured as circular linked lists with pointers indicating the front and rear of the queue. When a worker picks up a task, it will dequeue from the front, and any new work is placed at the rear. Figure 6 demonstrates how workers interact with the queues.
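A minimal sketch of the queue layout just described is given below; since the paper does not show these structs, the type and field names are assumptions.

/* Hedged sketch: device-resident circular queues with front and rear pointers.
 * Workers dequeue TaskDescriptions at the front; new work is enqueued at the rear. */
struct TaskDescription;               /* task struct described in Section 3.1 */

struct QueueEntry {
    struct TaskDescription *task;     /* task ready to execute, or a completed result */
    struct QueueEntry      *next;     /* circular linked list */
};

struct GemtcQueue {
    struct QueueEntry *front;         /* workers pick up work here */
    struct QueueEntry *rear;          /* the API appends new tasks or results here */
};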

#include "gemtc.cu"

int main() {
    // Start GeMTC
    gemtcSetup(QUEUE_SIZE);
    // Allocate device memory
    device_params = gemtcGPUMalloc(MALLOC_SIZE);
    // Populate device memory
    gemtcMemcpyHostToDevice(device_params, host_params, MALLOC_SIZE);
    // Push a task to the GPU
    gemtcPush(MD_Lite, NUM_THREADS, TaskID, device_params);
    // Poll for completed results
    gemtcPoll(TaskID, pointer);
    // Copy back results
    gemtcMemcpyDeviceToHost(host_params, pointer, MALLOC_SIZE);
    // Free GPU memory
    gemtcGPUFree(pointer);
    // Shutdown GeMTC
    gemtcCleanup();
    return 0;
}
Figure 5: Code sample of GeMTC API.
Figure 6: GPU Workers interacting with queues.
The GeMTC framework requires efficient device memory allocation on a per-task basis. Each task enqueued requires at least two device allocations: the first for the task itself and the second for parameters and results. The existing CUDA memory management system was not designed for a large number of independent memory allocations. With traditional CUDA programming models, the current best practice is to allocate all memory needed by an application at launch time and then manually manage and reuse this memory as needed.

To reduce the large overhead of individual memory allocations for MTC workloads, GeMTC includes a sub-allocator designed to efficiently handle many requests for dynamic allocation. The sub-allocator uses the existing CUDA malloc to allocate large contiguous pieces of device memory, allocating more as needed. Pointers to these free chunks and their sizes are then stored in a circular linked list on the CPU (see Figure 7). This list is ordered by increasing device address to allow for easy memory coalescing of adjacent memory chunks.
Figure 7: Memory mapping of free memory available to the device.
Figure 8: Result of gemtcMalloc on free memory.
When a GeMTC memory allocation request is sent from the host to the GPU, the sub-allocator will traverse the list and select the first chunk of free device memory meeting the allocation requirements. Figure 8 demonstrates how the header is then updated to reflect the remaining free device memory available in that chunk. This operation runs in the same order of time as a single memory copy to the device.

Upon freeing device memory, the header is read to identify the size of the chunk. Then it is added to the list of free memory in the correct location. If there is any free consecutive memory, the chunk is coalesced to provide a single larger contiguous chunk of memory. The operation to free device memory takes roughly the same amount of time as reading the header (i.e., a device memory copy).
Both malloc() and free() within GeMTC's memory management run in O(n), where n is the length of the free-memory list. In addition, the size of the list is proportional to the amount of memory fragmentation, since each element is recorded as a separate chunk of memory. Because malloc and free both need to write and read GPU memory, these operations may scale poorly under workloads with high fragmentation. However, the MTC workloads we examine show no signs of high fragmentation. The original cudaMalloc ran in 100 microseconds, and our gemtcMalloc runs in 10 microseconds.
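The following host-side sketch illustrates the kind of first-fit free list described above. It is a simplification under stated assumptions, not the GeMTC code: chunk sizes are passed to subFree rather than read back from a device-side header, the pool never grows, the list is NULL-terminated rather than circular, and coalescing is shown only with the following neighbor.

/* Hedged sketch of a first-fit sub-allocator over one large cudaMalloc'd pool,
 * tracked by a host-side free list ordered by increasing device address. */
#include <cuda_runtime.h>
#include <stdlib.h>

typedef struct FreeChunk {
    char             *devPtr;   /* start of this free chunk in device memory */
    size_t            size;     /* bytes available in the chunk */
    struct FreeChunk *next;     /* next chunk, ordered by device address */
} FreeChunk;

static FreeChunk *freeList = NULL;

void subInit(size_t poolBytes)                  /* one large allocation up front */
{
    char *pool = NULL;
    cudaMalloc((void **)&pool, poolBytes);
    freeList = (FreeChunk *)malloc(sizeof(FreeChunk));
    freeList->devPtr = pool;
    freeList->size   = poolBytes;
    freeList->next   = NULL;
}

void *subMalloc(size_t bytes)                   /* first fit: O(n) in list length */
{
    for (FreeChunk *c = freeList; c != NULL; c = c->next) {
        if (c->size >= bytes) {
            char *p = c->devPtr;
            c->devPtr += bytes;                 /* shrink the chunk from the front */
            c->size   -= bytes;
            return p;
        }
    }
    return NULL;                                /* real code would grow the pool */
}

void subFree(void *ptr, size_t bytes)           /* GeMTC reads the size from a device-side
                                                   header; it is passed here for brevity */
{
    FreeChunk *node = (FreeChunk *)malloc(sizeof(FreeChunk));
    node->devPtr = (char *)ptr;
    node->size   = bytes;
    FreeChunk **pp = &freeList;                 /* insert in address order */
    while (*pp != NULL && (*pp)->devPtr < node->devPtr)
        pp = &(*pp)->next;
    node->next = *pp;
    *pp = node;
    if (node->next != NULL &&                   /* coalesce with the following chunk */
        node->devPtr + node->size == node->next->devPtr) {
        FreeChunk *n = node->next;
        node->size += n->size;
        node->next  = n->next;
        free(n);
    }
}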
To optimize the GeMTC framework for fine-grained tasks, we have implemented a task-bundling system to reduce the amount of communication between the host and GPU. The main bottleneck for obtaining high task throughput through GeMTC is the latency associated with writing a task to the GPU DRAM memory. This bundling system, as shown in Figure 9, creates a buffer of tasks that need to be written to the GPU and flushes it periodically or when it is full.
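A hedged sketch of such a bundling layer follows. Figure 9 is not reproduced here, so the buffer size, the HostTask layout, and flushing the whole bundle in a single memory copy are assumptions about one way batched writes to GPU DRAM could be organized, not the authors' code.

/* Hedged sketch of host-side task bundling: buffer task descriptions and write
 * them to device memory in one transfer when the buffer fills (or on a timer). */
#include <cuda_runtime.h>

#define BUNDLE_CAPACITY 64                      /* assumed buffer size */

struct HostTask {
    int   taskType;
    int   taskID;
    int   numThreads;
    void *devParams;                            /* device pointer to packed parameters */
};

static struct HostTask bundle[BUNDLE_CAPACITY];
static int bundleCount = 0;

/* devQueueSlot is a device buffer large enough for BUNDLE_CAPACITY tasks */
void flushBundle(struct HostTask *devQueueSlot)
{
    if (bundleCount == 0) return;
    cudaMemcpy(devQueueSlot, bundle,            /* one DRAM write covers many tasks */
               bundleCount * sizeof(struct HostTask),
               cudaMemcpyHostToDevice);
    bundleCount = 0;
}

void bundleTask(struct HostTask t, struct HostTask *devQueueSlot)
{
    bundle[bundleCount++] = t;
    if (bundleCount == BUNDLE_CAPACITY)         /* flush when full; a periodic timer */
        flushBundle(devQueueSlot);              /* would also call flushBundle       */
}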

Citations
Journal ArticleDOI
TL;DR: Many-Task Computing (MTC) is a common scenario for multiple parallel systems, such as cluster, grids, cloud and supercomputers, but it is not so popular in shared memory parallel processors.
Abstract: Many-Task Computing (MTC) is a common scenario for multiple parallel systems, such as cluster, grids, cloud and supercomputers, but it is not so popular in shared memory parallel processors. In thi ...

15 citations

Proceedings ArticleDOI
16 May 2016
TL;DR: The 'Landrush' approach to GPU sharing proposes a solution that utilizes idle cycles on the GPU to provide an improved time-to-answer, that is, the total time to run the scientific simulation and analysis of the generated data.
Abstract: In-situ analysis on the output data of scientific simulations has been made necessary by ever-growing output data volumes and increasing costs of data movement as supercomputing is moving towards exascale. With hardware accelerators like GPUs becoming increasingly common in high end machines, new opportunities arise to co-locate scientific simulations and online analysis performed on the scientific data generated by the simulations. However, the asynchronous nature of GPGPU programming models and the limited context-switching capabilities on the GPU pose challenges to co-locating the scientific simulation and analysis on the same GPU. This paper dives deeper into these challenges to understand how best to co-locate analysis with scientific simulations on the GPUs in HPC clusters. Specifically, our 'Landrush' approach to GPU sharing proposes a solution that utilizes idle cycles on the GPU to provide an improved time-to-answer, that is, the total time to run the scientific simulation and analysis of the generated data. Landrush is demonstrated with experimental results obtained from leadership high-end applications on ORNL's Titan supercomputer, which show that (i) GPU-based scientific simulations have varying degrees of idle cycles to afford useful analysis task co-location, and (ii) the inability to context switch on the GPU at instruction granularity can be overcome by careful control of the analysis kernel launches and software-controlled early completion of analysis kernel executions. Results show that Landrush is superior in terms of time-to-answer compared to serially running simulations followed by analysis or by relying on the GPU driver and hardwired thread dispatcher to run analysis concurrently on a single GPU.

14 citations

Proceedings ArticleDOI
16 Nov 2014
TL;DR: The integration of the Swift/T high-performance parallel scripting language is demonstrated to enable high-level data flow programming in NAMD and VMD without modifying or recompiling either program.
Abstract: Tcl is the original embeddable dynamic language. Introduced in 1990, Tcl has been the foundation of the scripting interface of the popular biomolecular visualization and analysis program VMD since 1995 and was extended to the parallel molecular dynamics program NAMD in 1999. The two programs together have over 200,000 users who have enjoyed for nearly two decades the stability and flexibility provided by Tcl. VMD users can implement or extend parallel trajectory analysis and movie rendering on thousands of nodes of Blue Waters. NAMD users can implement or extend simulation protocols and multiple-copy algorithms that execute unmodified on any supercomputer without the need to recompile NAMD. We now demonstrate the integration of the Swift/T high-performance parallel scripting language to enable high-level data flow programming in NAMD and VMD. This integration is achieved without modifying or recompiling either program since the Turbine execution engine is itself based on Tcl and is dynamically loaded by the interpreter, as is the platform-specific MPI library on which it depends.

14 citations


Cites background from "Design and evaluation of the gemtc ..."

  • ...This rate makes it possible to consider using Swift to drive work to individual GPU warps over distributed memory [9]....


Proceedings ArticleDOI
27 Jun 2015
TL;DR: A lightweight distributed execution engine for spatial join query processing on large-scale geospatial data by integrating data parallel designs for single computing nodes, which is able to automatically dispatch data partitions to distributed computing nodes for efficient local execution on multi-core CPUs and GPUs.
Abstract: Existing Big Data systems are mostly designed for relational data. They are either incapable or inefficient in processing large-scale semi-structured data efficiently due to the inherent limitations on data abstraction, indexing support and exposure to native parallel programming tools. In this study, we report our work in developing a lightweight distributed execution engine for spatial join query processing on large-scale geospatial data. By integrating data parallel designs for single computing nodes, our execution engine is able to automatically dispatch data partitions to distributed computing nodes for efficient local execution on multi-core CPUs and GPUs. The execution engine supports asynchronous data transfer over network, asynchronous disk I/O and asynchronous computing. It also directly accesses distributed file systems to support creating and using indices conveniently and efficiently. In addition to be lightweight by design, which has less than 1,000 Lines Of Code (LOC), experiments using a real world application have demonstrated significant efficiency improvement over our previous works on extending a leading in-memory Big Data system (Impala) for spatial join query processing.

12 citations


Cites methods from "Design and evaluation of the gemtc ..."

  • ...GeMTC [24] shares similar goals as Dandelion with respect to automatically parallelizing and distributing tasks to computing nodes equipped with GPUs based on the scripts written in a parallel dataflow language (Swift)....


Journal ArticleDOI
TL;DR: CEDR is a capable environment for enabling research in exploring the boundaries of productive application development, resource management heuristic development, and hardware configuration analysis for heterogeneous architectures and provides insights into the trade-offs present in this design space.
Abstract: In this work, we present a Compiler-integrated, Extensible Domain Specific System on Chip Runtime (CEDR) ecosystem to facilitate research toward addressing the challenges of architecture, system software, and application development with distinct plug-and-play integration points in a unified compile time and runtime workflow. We demonstrate the utility of CEDR on the Xilinx Zynq MPSoC-ZCU102 for evaluating performance of pre-silicon hardware in the trade space of SoC configuration, scheduling policy and workload complexity based on dynamically arriving workload scenarios composed of real-life signal processing applications scaling to thousands of application instances with Fast Fourier Transform and matrix multiply accelerators. We provide insights into the tradeoffs present in this design space through a number of distinct case studies. CEDR is portable and has been deployed and validated on Odroid-XU3, X86, and Nvidia Jetson Xavier-based SoC platforms. Taken together, CEDR is a capable environment for enabling research in exploring the boundaries of productive application development, resource management heuristic development, and hardware configuration analysis for heterogeneous architectures.

11 citations

References
Proceedings ArticleDOI
08 Nov 2004
TL;DR: The goals of BOINC are described, the design issues that were confronted, and the solutions to these problems are described.
Abstract: BOINC (Berkeley Open Infrastructure for Network Computing) is a software system that makes it easy for scientists to create and operate public-resource computing projects. It supports diverse applications, including those with large storage or communication requirements. PC owners can participate in multiple BOINC projects, and can specify how their resources are allocated among these projects. We describe the goals of BOINC, the design issues that we confronted, and our solutions to these problems.

2,061 citations


"Design and evaluation of the gemtc ..." refers methods in this paper

  • ...Other programming models and tools that support MTC include MapReduce, volunteer computing [7], SLURM [8], and Cobalt [9], which allow supercomputer tasks to be subdivided into asynchronous subtasks [10]....


Book ChapterDOI
24 Jun 2003
TL;DR: A new cluster resource management system called Simple Linux Utility Resource Management (SLURM) is described in this paper, designed to be flexible and fault-tolerant and can be ported to other clusters of different size and architecture with minimal effort.
Abstract: A new cluster resource management system called Simple Linux Utility Resource Management (SLURM) is described in this paper. SLURM, initially developed for large Linux clusters at the Lawrence Livermore National Laboratory (LLNL), is a simple cluster manager that can scale to thousands of processors. SLURM is designed to be flexible and fault-tolerant and can be ported to other clusters of different size and architecture with minimal effort. We are certain that SLURM will benefit both users and system architects by providing them with a simple, robust, and highly scalable parallel job execution environment for their cluster system.

1,257 citations


"Design and evaluation of the gemtc ..." refers methods in this paper


  • ...Other programming models and tools that support MTC include MapReduce, volunteer computing [7], SLURM [8], and Cobalt [9], which allow supercomputer tasks to be subdivided into asynchronous subtasks [10]....


Journal ArticleDOI
01 Feb 2011
TL;DR: StarPU as mentioned in this paper is a runtime system that provides a high-level unified execution model for numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware and easily develop and tune powerful scheduling algorithms.
Abstract: In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE) or data-parallel accelerators (e.g. GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a main challenge. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run-time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements regarding execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. We eventually show that our dynamic approach competes with the highly optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way. Copyright © 2010 John Wiley & Sons, Ltd.

1,116 citations

Journal ArticleDOI
01 Sep 2011
TL;DR: This work presents Swift's implicitly parallel and deterministic programming model, which applies external applications to file collections using a functional style that abstracts and simplifies distributed parallel execution.
Abstract: Scientists, engineers, and statisticians must execute domain-specific application programs many times on large collections of file-based data. This activity requires complex orchestration and data management as data is passed to, from, and among application invocations. Distributed and parallel computing resources can accelerate such processing, but their use further increases programming complexity. The Swift parallel scripting language reduces these complexities by making file system structures accessible via language constructs and by allowing ordinary application programs to be composed into powerful parallel scripts that can efficiently utilize parallel and distributed resources. We present Swift's implicitly parallel and deterministic programming model, which applies external applications to file collections using a functional style that abstracts and simplifies distributed parallel execution.

421 citations


"Design and evaluation of the gemtc ..." refers background or methods in this paper

  • ...In recent years we have extended MTC to applications on homogeneous supercomputers, using tools such as Falkon [3], Swift [4], JETS [5], and Coasters [6]....


  • ...Swift [4] is an implicitly parallel functional dataflow programming language that is proving increasingly useful to express the higher-level logic of scientific and engineering applications....


Proceedings ArticleDOI
09 Jul 2007
TL;DR: Swift adopts and adapts ideas first explored in the GriPhyN virtual data system, improving on that system in many regards and describes application experiences and performance experiments that quantify the cost of Swift operations.
Abstract: We present Swift, a system that combines a novel scripting language called SwiftScript with a powerful runtime system based on CoG Karajan, Falkon, and Globus to allow for the concise specification, and reliable and efficient execution, of large loosely coupled computations. Swift adopts and adapts ideas first explored in the GriPhyN virtual data system, improving on that system in many regards. We describe the SwiftScript language and its use of XDTM to describe the logical structure of complex file system structures. We also present the Swift runtime system and its use of CoG Karajan, Falkon, and Globus services to dispatch and manage the execution of many tasks in parallel and grid environments. We describe application experiences and performance experiments that quantify the cost of Swift operations.

387 citations


"Design and evaluation of the gemtc ..." refers background in this paper

  • ...Swift was originally developed as a scripting language for executing scripts composed from the execution of ordinary application programs on distributed systems such as clusters, grids, and clouds [17]....


Frequently Asked Questions (18)
Q1. What are the contributions mentioned in the paper "Design and evaluation of the gemtc framework for gpu-enabled many-task computing" ?

The authors present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. The authors demonstrate the use of a high-level MTC programming model (the Swift parallel dataflow language) to run tasks on many accelerators and thus provide a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled.

GeMTC is currently optimized for executing within environments containing a single GPU per node, such as Blue Waters; but future work aims to address heterogeneous accelerator environments. The authors leave this for future work. Future work also includes performance evaluation of diverse application kernels; analysis of the ability of such kernels to effectively utilize concurrent warps; enabling of virtual warps [25] which can both subdivide and span physical warps; support for other accelerators such as the Xeon Phi; and continued performance refinement.

Many important application classes and programming techniques that are driving the requirements for such extreme-scale systems include branch and bound, stochastic programming, materials by design, and uncertainty quantification.

The dataflow programming model of the Swift parallel scripting language can elegantly express, through implicit parallelism, the massive concurrency demanded by these applications while retaining the productivity benefits of a high-level language. 

To optimize the GeMTC framework for fine-grained tasks, the authors have implemented a task-bundling system to reduce the amount of communication between the host and GPU. 

Future work also includes performance evaluation of diverse application kernels; analysis of the ability of such kernels to effectively utilize concurrent warps; enabling of virtual warps [25] which can both subdivide and span physical warps; support for other accelerators such as the Xeon Phi; and continued performance refinement. 

The GeMTC implementation on the Xeon-Phi will benefit greatly from avoiding memory and thread oversubscription, as highlighted in this work. 

Tasks typically run to completion: they follow the simple input-process-output model of procedures, rather than retaining state as in web services or MPI processes. 

Scaling an application down to the level of concurrency available within a single warp can provide the highest level of thread utilization for some applications. 

MTC workloads that send only single tasks, or small numbers of large tasks, to accelerator devices observe near-serialized performance, and leave a significant portion of device processor capability unused. 

Instead of an application launching hundreds or thousands of threads, which could quickly become more challenging to manage, GeMTC AppKernels are optimized at the warp level, meaning the programmer and AppKernel logic are responsible for managing only 32 threads in a given application. 

The Pegasus project runs at the hypervisor level and promotes GPU sharing across virtual machines, while including a custom DomA scheduler for GPU task scheduling. 

Each task enqueued requires at least two device allocations: the first for the task itself and the second for parameters and results. 

The main bottleneck for obtaining high task throughput through GeMTC is the latency associated with writing a task to the GPU DRAM memory. 

If the compiler is able to generate device code and parallel instructions, the developer may opt to write sequential code and benefit from accelerator speedup. 

The precompiled MD AppKernel already knows how to pack and unpack the function parameters from memory; and once the function completes, the result is packed into memory and placed on the result queue. 

With traditional CUDA programming models the current best practice is to allocate all memory needed by an application at launch time and then manually manage and reuse this memory as needed. 

While the walltime of MDLite successfully decreases as more threads are added, the speedup obtained is significantly less than ideal after 8 threads are active within a single warp.