Design and evaluation of the GeMTC framework for GPU-enabled many-task computing
Summary
1. INTRODUCTION
- This work explores methods for, and potential benefits of, applying the increasingly abundant and economical general-purpose graphics processing units to a broader class of applications.
- Tasks typically run to completion: they follow the simple input-process-output model of procedures, rather than retaining state as in web services or MPI processes.
- Efficient MTC implementations are now commonplace on clusters, grids, and clouds.
- Integration of GeMTC with Swift, enabling a broad class of dataflow-based scientific applications, and improving programmability for both hybrid multicore hosts and extreme-scale systems.
- Work is load balanced among large numbers of GPUs.
2. CHALLENGES OF MANY-TASK COMPUTING ON GPGPUS
- The authors' GeMTC work is motivated by the fact that, with current mainstream programming models, a significant portion of GPU processing capability is underutilized by MTC workloads.
- The results presented here indicate that this approach enables higher utilization of GPU resources, greater concurrency, and hence higher many-task throughput.
2.1 NVIDIA GPUs and GPGPU Computing
- General-purpose computing on graphics processing units allows a host CPU to offload a wide variety of computation, not just graphics, to a graphics processing unit (GPU).
- GPUs are designed for vector parallelism: they contain many lightweight cores designed to support parallel bulk processing of graphics data.
- An SMX contains many warps, and each warp provides 32 concurrent threads of execution.
- Ousterhout et al. [11] make a compelling argument for the pervasive use of tiny tasks in compute clusters.
- The authors apply a similar argument to motivate the GeMTC model of running many small independent tasks on accelerators.
2.2 Mainstream GPU Support for MTC
- The dominant CUDA and OpenCL GPGPU programming models both provide extensions to traditional programming languages such as C with added API calls to interact with accelerators.
- OpenACC is an open standard and aims to provide the portability of OpenCL while requiring less detailed knowledge of accelerator architecture than is required in CUDA and OpenCL programming.
- Concurrent Kernels [14] is a CUDA feature that enables the developer to launch parallel work on a GPU.
- The current model of GeMTC and Swift relies on communication between the CPU and GPU to drive tasks to and from the Swift script.
- In addition, to process workflows with complex dependencies, the developer must group tasks into batches and block on batch completion before executing dependent kernels, an inadequate approach for supporting heterogeneous concurrent tasks.
3. GEMTC ARCHITECTURE
- Given that their target test bed consisted of NVIDIA GPUs and that the authors wanted to examine the GPU at the finest granularity possible, they opted to implement their framework using CUDA.
- This decision allowed them to work at the finest granularity possible but limited their evaluation to NVIDIA-based hardware.
- While GeMTC was originally developed on NVIDIA CUDA devices, its architecture is general and has also been implemented on the Intel Xeon Phi [16].
- The Phi, however, represents a different accelerator architecture, meriting separate study, and is not addressed in this paper.
- A work queue in GPU memory is populated from calls to a C-based API, and GPU workers pick up and execute these tasks.
3.1 Kernel Structure and Task Descriptions
- A key element of GeMTC is the daemon launched on the GPU, named the Super Kernel, which enables many hardware level workers (at the warp level) on the GPU.
- After a worker has completed a computation, the results are placed on an outgoing result queue and returned to the caller.
- Within traditional GPU programming, a user defined function that runs on the GPU is called a kernel.
- These concurrent kernels are a key technology in the GeMTC framework.
- The Super Kernel gathers hardware information from the GPU and dynamically starts the maximum number of workers available on that GPU.
3.2 GeMTC API
- Figure 5 uses a simple molecular dynamics (MD) example to demonstrate how a user can leverage the GeMTC API to launch a simulation on the GPU.
- Once these parameters have been transferred into GPU memory the user pushes the task to the GPU along with all the information needed to create the task description on the device.
- At this point the user can begin polling for a result.
- When the gemtcPoll function returns a result, the user can then unpack the memory and move to the next operation.
- It is expected that end users will utilize high-level Swift scripts to launch their tasks on GeMTC.
3.3 Queues, Tasks, and Memory Management
- The Incoming Work Queue is populated by calls to the GeMTC API and contains tasks that are ready to execute.
- The tasks in this queue contain a TaskDescription and the necessary parameters to execute the task.
- With traditional CUDA programming models the current best practice is to allocate all memory needed by an application at launch time and then manually manage and reuse this memory as needed.
- Pointers to these free chunks and their sizes are then stored in a circular linked list on the CPU.
- The main bottleneck for obtaining high task throughput through GeMTC is the latency associated with writing a task to the GPU DRAM memory.
4. SWIFT: DATAFLOW EXECUTION AND PROGRAMMING MODEL FOR MTC
- Swift [4] is an implicitly parallel functional dataflow programming language that is proving increasingly useful to express the higher-level logic of scientific and engineering applications.
- Many important application classes and programming techniques that are driving the requirements for such extreme-scale systems include branch and bound, stochastic programming, materials by design, and uncertainty quantification.
- The dataflow programming model of the Swift parallel scripting language can elegantly express, through implicit parallelism, the massive concurrency demanded by these applications while retaining the productivity benefits of a high-level language.
- Swift can also acquire and manage computing resources through its own resource provisioner [6].
- This enables Swift to express a far broader set of applications, and makes it a productive coordination language for hybrid CPU+accelerator nodes and systems.
4.1 GeMTC Integration with Swift
- The integration with Swift provides many mutual benefits for both Swift and GeMTC.
- The final box on the right illustrates how GeMTC fits into the Swift/T stack.
- Thus, the user's Swift application can simply call any function mapped to an AppKernel from the high level Swift program.
- Data transfers overlap with ongoing GPU computations implicitly and automatically.
- And because the GeMTC API calls are handled at the Turbine worker level, the Swift programmer is freed from the burden of writing complex memory management code for the GPU.
5. PERFORMANCE EVALUATION
- This section evaluates the GeMTC framework with a set of AppKernels from the GeMTC AppKernel Library.
- AppKernels are CUDA device functions that have been tuned to deliver high performance under MTC workloads.
- The authors work with a lightweight molecular dynamics simulation called MDLite.
- The authors conclude with an analysis of MDLite over multiple XK7 nodes and examine a set of simple adder benchmarks to highlight throughput and efficiency.
- Blue Waters contains ∼20K Cray XE6 CPU-based nodes and ∼4K Cray XK7 GPU-based nodes.
5.1 Molecular Dynamics
- The user specifies the number of particles in a "universe" along with their starting positions, the number of dimensions, and a starting mass.
- MDLite runs a simulation that determines how the potential and kinetic energy in the system changes as the particles change position.
- By varying the number of active threads included in a warp computation, the authors show that the right application can indeed benefit from all 32 threads in a GPU warp.
- Figure 15 evaluates a varied number of MDLite simulations running over a K20X GPU.
5.2 Throughput and Efficiency
- Next, the authors evaluate GeMTC with a simple adder benchmark.
- Afterwards, the authors can easily measure the efficiency and overhead of their system: efficiency = (expected runtime / observed runtime).
- First, a CPU version of the simple adder is executed through Swift/T on XE6 nodes.
- Figure 20 highlights the single-node efficiency of GeMTC running with 168 active workers per GPU.
- The authors attribute this drop in performance to greater worker contention on the device queues and the fact that Swift must now drive 168 times the amount of work per node.
5.3 Preliminary MTC Xeon Phi Results
- The authors have also gathered preliminary results for supporting MTC workloads on the Intel Xeon Phi Coprocessor.
- As shown in Figure 23, the authors can achieve the same level of efficiency with 50% shorter tasks on a Xeon Phi compared with an NVIDIA GTX-680 GPU.
- The authors highlight the fact that with GeMTC on its own they observe upwards of 90% efficiency with tasks lasting 5 ms.
- This means that a fully general-purpose framework would be capable of launching tasks an order of magnitude faster.
- The authors will continue to improve performance to ensure all components of the system can keep up with these task dispatch rates.
7. CONCLUSIONS
- The authors have presented GeMTC, a framework for enabling MTC workloads to run efficiently on NVIDIA GPUs.
- The GeMTC framework is responsible for receiving work from a host through the use of the C API, and scheduling and running that work on many independent GPU workers.
- Results are returned through the C API to the host and then to Swift.
- Applications that can generate thousands of SIMD threads may prefer to use traditional CUDA programming techniques.
- Under the current configurations, users are required to write their own AppKernels.
Frequently Asked Questions (18)
Q2. What future work is mentioned in the paper "Design and evaluation of the GeMTC framework for GPU-enabled many-task computing"?
GeMTC is currently optimized for executing within environments containing a single GPU per node, such as Blue Waters; but future work aims to address heterogeneous accelerator environments. The authors leave this for future work. Future work also includes performance evaluation of diverse application kernels; analysis of the ability of such kernels to effectively utilize concurrent warps; enabling of virtual warps [25], which can both subdivide and span physical warps; support for other accelerators such as the Xeon Phi; and continued performance refinement.
Q3. What are the main requirements for extreme-scale systems?
Many important application classes and programming techniques that are driving the requirements for such extreme-scale systems include branch and bound, stochastic programming, materials by design, and uncertainty quantification.
Q4. What is the purpose of the Swift parallel scripting language?
The dataflow programming model of the Swift parallel scripting language can elegantly express, through implicit parallelism, the massive concurrency demanded by these applications while retaining the productivity benefits of a high-level language.
Q5. What is the purpose of the task-bundling system?
To optimize the GeMTC framework for fine-grained tasks, the authors have implemented a task-bundling system to reduce the amount of communication between the host and GPU.
Q6. What is the future work of GeMTC?
Future work also includes performance evaluation of diverse application kernels; analysis of the ability of such kernels to effectively utilize concurrent warps; enabling of virtual warps [25] which can both subdivide and span physical warps; support for other accelerators such as the Xeon Phi; and continued performance refinement.
Q7. What is the benefit of a GeMTC implementation on the Xeon Phi?
The GeMTC implementation on the Xeon Phi will benefit greatly from avoiding memory and thread oversubscription, as highlighted in this work.
Q8. What is the common way to run a task?
Tasks typically run to completion: they follow the simple input-process-output model of procedures, rather than retaining state as in web services or MPI processes.
Q9. How many threads can be used in a single warp?
Scaling an application down to the level of concurrency available within a single warp can provide the highest level of thread utilization for some applications.
Q10. What is the main argument for the use of small tasks in GPUs?
MTC workloads that send only single tasks, or small numbers of large tasks, to accelerator devices observe near-serialized performance, and leave a significant portion of device processor capability unused.
Q11. What is the main appeal of the GeMTC framework?
Instead of an application launching hundreds or thousands of threads, which could quickly become more challenging to manage, GeMTC AppKernels are optimized at the warp level, meaning the programmer and AppKernel logic are responsible for managing only 32 threads in a given application.
Q12. What is the main purpose of Pegasus?
The Pegasus project runs at the hypervisor level and promotes GPU sharing across virtual machines, while including a custom DomA scheduler for GPU task scheduling.
Q13. How many device allocations are required for each task?
Each task enqueued requires at least two device allocations: the first for the task itself and the second for parameters and results.
Q14. What is the main bottleneck for obtaining high task throughput through GeMTC?
The main bottleneck for obtaining high task throughput through GeMTC is the latency associated with writing a task to the GPU DRAM memory.
Q15. What is the way to avoid low level accelerator development?
If the compiler is able to generate device code and parallel instructions, the developer may opt to write sequential code and benefit from accelerator speedup.
Q16. What is the function that is loaded into memory?
The precompiled MD AppKernel already knows how to pack and unpack the function parameters from memory; and once the function completes, the result is packed into memory and placed on the result queue.
Q17. What is the practice for CUDA?
With traditional CUDA programming models the current best practice is to allocate all memory needed by an application at launch time and then manually manage and reuse this memory as needed.
Q18. How many threads are active in a single warp?
While the walltime of MDLite successfully decreases as more threads are added, the speedup obtained is significantly less than ideal after 8 threads are active within a single warp.