Design and evaluation of the GeMTC framework for GPU-enabled many-task computing
Citations
Adaptive Task Aggregation for High-Performance Sparse Solvers on GPUs
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Pagoda: A GPU Runtime System for Narrow Tasks
Highlights of X-Stack ExM Deliverable Swift/T
References
BOINC: A System for Public-Resource Computing and Storage
SLURM: Simple Linux Utility for Resource Management
StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
Swift: A language for distributed parallel scripting
Swift: Fast, Reliable, Loosely Coupled Parallel Computation
Frequently Asked Questions
Q2. What is the future work mentioned in the paper "Design and evaluation of the GeMTC framework for GPU-enabled many-task computing"?
GeMTC is currently optimized for environments with a single GPU per node, such as Blue Waters; addressing heterogeneous accelerator environments is left for future work. Future work also includes performance evaluation of diverse application kernels; analysis of the ability of such kernels to effectively utilize concurrent warps; enabling of virtual warps [25], which can both subdivide and span physical warps; support for other accelerators such as the Xeon Phi; and continued performance refinement.
Q3. What application classes are driving the requirements for extreme-scale systems?
Important application classes and programming techniques driving the requirements for such extreme-scale systems include branch and bound, stochastic programming, materials by design, and uncertainty quantification.
Q4. What is the purpose of the Swift parallel scripting language?
The dataflow programming model of the Swift parallel scripting language can elegantly express, through implicit parallelism, the massive concurrency demanded by these applications while retaining the productivity benefits of a high-level language.
Q5. What is the purpose of the task-bundling system?
To optimize the GeMTC framework for fine-grained tasks, the authors have implemented a task-bundling system to reduce the amount of communication between the host and GPU.
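As a minimal sketch of this idea (hypothetical descriptor and helper names, not the actual GeMTC API): rather than issuing one small host-to-device copy per fine-grained task, the host fills an array of task descriptors and ships the whole bundle in a single cudaMemcpy.

```cuda
#include <cuda_runtime.h>

// Hypothetical task descriptor; the actual GeMTC layout differs.
struct TaskDesc {
    int kernel_id;      // which AppKernel to run
    int param_offset;   // offset of this task's parameters in the device parameter buffer
    int result_offset;  // offset where this task's result should be written
};

// Copy a whole bundle of fine-grained tasks to the device in one transfer,
// instead of issuing one small cudaMemcpy per task.
void enqueue_bundle(const TaskDesc* host_tasks, int n, TaskDesc* dev_queue) {
    cudaMemcpy(dev_queue, host_tasks, n * sizeof(TaskDesc), cudaMemcpyHostToDevice);
}
```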
Q6. What is the future work of GeMTC?
Future work includes performance evaluation of diverse application kernels; analysis of the ability of such kernels to effectively utilize concurrent warps; enabling of virtual warps [25], which can both subdivide and span physical warps; support for other accelerators such as the Xeon Phi; and continued performance refinement.
Q7. What is the benefit of a GeMTC implementation on the Xeon Phi?
The GeMTC implementation on the Xeon Phi will benefit greatly from avoiding memory and thread oversubscription, as highlighted in this work.
Q8. What is the common way to run a task?
Tasks typically run to completion: they follow the simple input-process-output model of procedures, rather than retaining state as in web services or MPI processes.
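A tiny sketch of that model (hypothetical types, not from the paper): a task is a pure function of its inputs that writes its outputs and returns, with nothing carried over between invocations.

```cuda
// Hypothetical stateless task: read inputs, compute, write outputs, return.
struct Input  { double a, b; };
struct Output { double sum; };

__device__ void run_task(const Input* in, Output* out) {
    out->sum = in->a + in->b;   // no state is retained across invocations
}
```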
Q9. How many threads can be used in a single warp?
A warp comprises 32 threads; scaling an application down to the level of concurrency available within a single warp can provide the highest level of thread utilization for some applications.
Q10. What is the main argument for the use of small tasks on GPUs?
MTC workloads that send only single tasks, or small numbers of large tasks, to accelerator devices observe near-serialized performance, and leave a significant portion of device processor capability unused.
Q11. What is the main appeal of the GeMTC framework?
Instead of an application launching hundreds or thousands of threads, which can quickly become challenging to manage, GeMTC AppKernels are optimized at the warp level, meaning the programmer and AppKernel logic are responsible for managing only 32 threads in a given application.
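A minimal warp-level sketch (a hypothetical kernel, not an actual GeMTC AppKernel) might look like the following, where each 32-thread warp handles one task and the logic only deals with lanes 0-31.

```cuda
// Hypothetical warp-level AppKernel: each 32-thread warp handles one task,
// so the kernel logic only reasons about lane indices 0..31.
__global__ void warp_task_kernel(const float* inputs, float* outputs, int num_tasks) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one task per warp
    int lane    = threadIdx.x % 32;                              // this thread's lane in its warp
    if (warp_id >= num_tasks) return;

    // The 32 lanes cooperatively reduce 32 input values for this task.
    float v = inputs[warp_id * 32 + lane];
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (lane == 0) outputs[warp_id] = v;
}
```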
Q12. What is the main purpose of Pegasus?
The Pegasus project runs at the hypervisor level and promotes GPU sharing across virtual machines, while including a custom DomA scheduler for GPU task scheduling.
Q13. How many device allocations are required for each task?
Each task enqueued requires at least two device allocations: the first for the task itself and the second for parameters and results.
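In rough CUDA terms (a hypothetical helper, not the GeMTC call path), the enqueue step therefore implies something like:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical descriptor; see the bundling sketch above.
struct TaskDesc { int kernel_id; void* data; };

// Enqueuing one task implies at least two device allocations:
// one for the task descriptor itself, one for its parameters and results.
void enqueue_task(const void* params, size_t param_bytes, size_t result_bytes) {
    TaskDesc* d_task = nullptr;
    void*     d_data = nullptr;
    cudaMalloc(&d_task, sizeof(TaskDesc));            // allocation 1: the task itself
    cudaMalloc(&d_data, param_bytes + result_bytes);  // allocation 2: parameters + results
    cudaMemcpy(d_data, params, param_bytes, cudaMemcpyHostToDevice);
    // ...fill in the descriptor and place it on the device-side incoming queue...
}
```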
Q14. What is the main bottleneck for obtaining high task throughput through GeMTC?
The main bottleneck for obtaining high task throughput through GeMTC is the latency associated with writing a task to the GPU DRAM memory.
Q15. What is the way to avoid low-level accelerator development?
If the compiler is able to generate device code and parallel instructions, the developer may opt to write sequential code and benefit from accelerator speedup.
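Directive-based models such as OpenACC illustrate this approach (a generic example, not taken from the paper): the loop is written sequentially, and a supporting compiler generates the device code and parallel instructions.

```c
// Sequential-looking loop; an OpenACC-capable compiler generates the
// device code and parallel instructions for the annotated region.
void vector_add(const float* a, const float* b, float* c, int n) {
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```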
Q16. What is the function that is loaded into memory?
The precompiled MD AppKernel already knows how to pack and unpack the function parameters from memory; and once the function completes, the result is packed into memory and placed on the result queue.
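A minimal sketch of such a packing convention (a hypothetical layout; the real MD AppKernel defines its own): parameters sit contiguously in device memory, the kernel unpacks them by offset, and the result is written back into the same buffer before the task is moved to the result queue.

```cuda
// Hypothetical packed-parameter layout for a simple AppKernel:
// on input [ double x | double y ]; on completion the result overwrites the front.
__device__ void appkernel_body(char* packed) {
    double x = *reinterpret_cast<double*>(packed);                   // unpack parameter 1
    double y = *reinterpret_cast<double*>(packed + sizeof(double));  // unpack parameter 2
    double r = x * y;                                                // the "process" step
    *reinterpret_cast<double*>(packed) = r;                          // pack the result in place
    // ...the runtime then moves this task onto the result queue...
}
```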
Q17. What is the best practice for memory management in CUDA?
With traditional CUDA programming models, the current best practice is to allocate all memory needed by an application at launch time and then manually manage and reuse this memory as needed.
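A minimal sketch of that practice (a hypothetical bump-style sub-allocator, not GeMTC's actual memory manager): one large cudaMalloc at startup, with later requests carved out of the reserved region instead of calling cudaMalloc per task.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical sub-allocator: reserve device memory once at launch,
// then hand out slices without further cudaMalloc calls.
static char*  g_pool      = nullptr;
static size_t g_pool_size = 0;
static size_t g_offset    = 0;

void pool_init(size_t bytes) {
    cudaMalloc(&g_pool, bytes);   // single up-front allocation
    g_pool_size = bytes;
    g_offset    = 0;
}

void* pool_alloc(size_t bytes) {
    if (g_offset + bytes > g_pool_size) return nullptr;  // pool exhausted
    void* p = g_pool + g_offset;
    g_offset += bytes;   // bump the offset; a real allocator would also free and reuse
    return p;
}
```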
Q18. How does MDLite scale as more threads are used within a single warp?
While the walltime of MDLite decreases as more threads are added, the speedup obtained is significantly less than ideal beyond 8 active threads within a single warp.