XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures
Thierry Gautier, João V. F. Lima, Nicolas Maillard, Bruno Raffin +3 more
pp. 1299–1308
TL;DR
This work presents the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work-stealing scheduler, and shows performance results on two dense linear algebra kernels and a highly efficient Cholesky factorization.

Abstract
Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and accelerators, like GPUs. Programming such nodes is typically based on a combination of OpenMP and CUDA/OpenCL codes; scheduling relies on a static partitioning and cost model. We present the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work-stealing scheduler. XKaapi enables task multi-implementation on CPU or GPU and multi-level parallelism with different grain sizes. We show performance results on two dense linear algebra kernels, matrix product (GEMM) and Cholesky factorization (POTRF), to evaluate XKaapi on a heterogeneous architecture composed of two hexa-core CPUs and eight NVIDIA Fermi GPUs. Our conclusion is twofold. First, fine-grained parallelism and online scheduling achieve performance results as good as static strategies, and in most cases outperform them. This is due to an improved work-stealing strategy that includes locality information; a very light implementation of the tasks in XKaapi; and an optimized search for ready tasks. Next, the multi-level parallelism on multiple CPUs and GPUs enabled by XKaapi led to a highly efficient Cholesky factorization. Using eight NVIDIA Fermi GPUs and four CPUs, we measure up to 2.43 TFlop/s on double precision matrix product and 1.79 TFlop/s on Cholesky factorization; and respectively 5.09 TFlop/s and 3.92 TFlop/s in single precision.
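The abstract's central idea, a data-flow task model where the runtime derives task dependencies from declared data accesses, can be sketched as follows. This is a hypothetical illustration, not XKaapi's C++ API: tasks declare read (`R`) or write (`W`) access to named data, and the graph infers read-after-write, write-after-write, and write-after-read dependencies before executing tasks in a dependency-respecting order.

```python
class DataFlowGraph:
    """Minimal sketch of access-mode-based dependency inference."""

    def __init__(self):
        self.tasks = []          # list of (name, fn)
        self.last_writer = {}    # data name -> index of last writing task
        self.readers = {}        # data name -> readers since the last write
        self.deps = {}           # task index -> set of predecessor indices

    def spawn(self, name, fn, accesses):
        """accesses: list of (data_name, mode) with mode 'R' or 'W'."""
        i = len(self.tasks)
        self.tasks.append((name, fn))
        self.deps[i] = set()
        for data, mode in accesses:
            if data in self.last_writer:               # RAW / WAW dependency
                self.deps[i].add(self.last_writer[data])
            if mode == "W":
                for r in self.readers.get(data, []):   # WAR dependency
                    self.deps[i].add(r)
                self.last_writer[data] = i
                self.readers[data] = []
            else:
                self.readers.setdefault(data, []).append(i)
        return i

    def run(self):
        """Execute tasks once all their predecessors are done; return the order."""
        done, order = set(), []
        while len(done) < len(self.tasks):
            for i, (name, fn) in enumerate(self.tasks):
                if i not in done and self.deps[i] <= done:
                    fn()
                    done.add(i)
                    order.append(name)
        return order
```

A real runtime would execute ready tasks concurrently (with work stealing deciding which worker runs what); the sequential loop here only demonstrates the dependency semantics.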
Citations
Journal Article
Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
TL;DR: Kokkos’ abstractions are described, its application programmer interface (API) is summarized, performance results for unit-test kernels and mini-applications are presented, and an incremental strategy for migrating legacy C++ codes to Kokkos is outlined.
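The idea behind Kokkos' polymorphic memory access patterns can be illustrated with a small sketch (this is not the Kokkos API, and `View2D` is a hypothetical name): the same logical `(i, j)` indexing works over different memory layouts, so code can be retargeted to row-major (CPU cache-friendly) or column-major (GPU coalescing-friendly) storage without rewriting the kernel.

```python
class View2D:
    """2-D array whose memory layout is a runtime parameter, in the spirit
    of layout-polymorphic views (row-major 'right' vs column-major 'left')."""

    def __init__(self, rows, cols, layout="right"):
        self.rows, self.cols, self.layout = rows, cols, layout
        self.data = [0] * (rows * cols)

    def _index(self, i, j):
        if self.layout == "right":       # row-major: j is the stride-1 index
            return i * self.cols + j
        return j * self.rows + i         # "left": column-major, i is stride-1

    def __getitem__(self, ij):
        return self.data[self._index(*ij)]

    def __setitem__(self, ij, value):
        self.data[self._index(*ij)] = value
```

Kernel code written against `view[i, j]` is identical for both layouts; only the flat storage order differs.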
Journal Article
SuperGlue: a shared memory framework using data versioning for dependency-aware task-based parallelization
TL;DR: SuperGlue is a shared memory framework that uses data versioning for dependency-aware task-based parallelization, simplifying the expression of task dependencies in shared memory programs.
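The data-versioning idea can be sketched as follows (hypothetical names, not SuperGlue's API; this minimal version orders read-after-write and write-after-write, omitting write-after-read handling): each shared datum carries a version counter, a task records the version each of its operands must reach before it may run, and completing a write bumps the version.

```python
class Datum:
    """A shared datum with a produced-version counter and a counter of
    versions promised by writers submitted so far."""
    def __init__(self):
        self.version = 0
        self.scheduled = 0

class VersionScheduler:
    def __init__(self):
        self.pending = []  # (name, [(datum, required_version)], writes)

    def submit(self, name, reads=(), writes=()):
        # A task must wait until every operand reaches the version promised
        # by all writes submitted before it.
        reqs = [(d, d.scheduled) for d in list(reads) + list(writes)]
        for d in writes:
            d.scheduled += 1
        self.pending.append((name, reqs, list(writes)))

    def run(self):
        """Run pending tasks as their version requirements are met."""
        order = []
        while self.pending:
            for task in self.pending:
                name, reqs, writes = task
                if all(d.version >= v for d, v in reqs):
                    for d in writes:
                        d.version += 1     # completing a write bumps the version
                    order.append(name)
                    self.pending.remove(task)
                    break
        return order
```
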
Proceedings ArticleDOI
LAWS: locality-aware work-stealing for multi-socket multi-core architectures
Quan Chen, Minyi Guo, Haibing Guan +2 more
TL;DR: A Locality-Aware Work-Stealing (LAWS) scheduler is proposed, which better utilizes both the shared cache and the NUMA memory system and can improve the performance of memory-bound programs up to 54.2% compared with traditional work-stealing schedulers.
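A core ingredient of locality-aware work stealing is victim selection that prefers workers sharing the thief's socket (and thus its shared cache and NUMA node). The sketch below is a hypothetical illustration of that policy, not LAWS itself:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Worker:
    id: int
    socket: int
    deque: list = field(default_factory=list)  # pending tasks

def choose_victim(thief, workers, rng=random):
    """Prefer victims with work on the thief's socket; otherwise steal
    from any worker that has work; return None if everyone is idle."""
    local = [w for w in workers
             if w.id != thief.id and w.socket == thief.socket and w.deque]
    if local:
        return rng.choice(local)
    remote = [w for w in workers if w.id != thief.id and w.deque]
    return rng.choice(remote) if remote else None
```

Restricting steals to the local socket first keeps data in the shared last-level cache; the remote fallback preserves load balance when a whole socket runs dry.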
Journal Article
Scheduling independent tasks on multi-cores with GPU accelerators
Raphaël Bleuse, Safia Kedad-Sidhoum, Florence Monna, Grégory Mounié, Denis Trystram +6 more
TL;DR: This paper presents a new method for scheduling efficiently parallel applications with m CPUs and k GPUs, where each task of the application can be processed either on a core (CPU) or on a GPU.
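To make the problem setting concrete (this is a simple greedy baseline, not the paper's algorithm): each independent task has a CPU time and a GPU time, and a scheduler must pick, for each task, the processor that minimizes its finish time given current loads.

```python
def greedy_schedule(tasks, m_cpus, k_gpus):
    """tasks: list of (cpu_time, gpu_time) for independent tasks.
    Greedily place each task on the processor where it finishes earliest.
    Returns (makespan, assignment) with assignment entries ('cpu'|'gpu', index)."""
    cpu_load = [0.0] * m_cpus
    gpu_load = [0.0] * k_gpus
    assignment = []
    for cpu_t, gpu_t in tasks:
        c = min(range(m_cpus), key=lambda i: cpu_load[i])  # least-loaded CPU
        g = min(range(k_gpus), key=lambda i: gpu_load[i])  # least-loaded GPU
        if cpu_load[c] + cpu_t <= gpu_load[g] + gpu_t:
            cpu_load[c] += cpu_t
            assignment.append(("cpu", c))
        else:
            gpu_load[g] += gpu_t
            assignment.append(("gpu", g))
    return max(cpu_load + gpu_load), assignment
```

The paper's contribution is a method with better guarantees than this kind of greedy heuristic; the sketch only shows the input/output shape of the (m CPU, k GPU) scheduling problem.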
Proceedings Article
Simplifying programming and load balancing of data parallel applications on heterogeneous systems
TL;DR: Maat is a library for OpenCL programmers that allows for the efficient execution of a single data-parallel kernel using all the available devices and provides the programmer with an abstract view of the system to enable the management of heterogeneous environments regardless of the underlying architecture.
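One way to execute a single data-parallel kernel across all devices is to split its iteration range in proportion to each device's measured throughput, so all devices finish at roughly the same time. The function below sketches that idea (it is an assumption about the general technique, not Maat's API):

```python
def partition(n_items, throughputs):
    """Split the range [0, n_items) into one contiguous chunk per device,
    sized proportionally to that device's throughput."""
    total = sum(throughputs)
    shares = [int(n_items * t / total) for t in throughputs]
    shares[-1] += n_items - sum(shares)   # rounding remainder to the last device
    bounds, start = [], 0
    for s in shares:
        bounds.append((start, start + s))
        start += s
    return bounds
```

Each device then runs the kernel on its own sub-range; re-measuring throughput between launches allows the split to adapt to the actual hardware.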
References
Journal Article
Cilk: An Efficient Multithreaded Runtime System
Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, Yuli Zhou +5 more
TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.
Proceedings Article
The implementation of the Cilk-5 multithreaded language
TL;DR: Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler are presented.
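The ready deque's access discipline can be sketched in a few lines (this omits Cilk-5's THE mutual-exclusion protocol and all synchronization): the owning worker pushes and pops at the bottom, getting LIFO order and good locality, while thieves steal from the top, taking the oldest and typically largest piece of work.

```python
from collections import deque

class ReadyDeque:
    """Single-threaded sketch of a work-stealing ready deque:
    owner works at the bottom, thieves steal from the top."""

    def __init__(self):
        self._d = deque()

    def push_bottom(self, task):
        self._d.append(task)          # owner spawns a task

    def pop_bottom(self):
        return self._d.pop() if self._d else None      # owner: LIFO

    def steal_top(self):
        return self._d.popleft() if self._d else None  # thief: FIFO
```

In a real scheduler the bottom end is accessed without locking on the common path, and only top-end steals pay for synchronization, which is exactly what the paper's mutual-exclusion protocol makes cheap.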
Journal Article
StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
TL;DR: StarPU is a runtime system that provides a high-level unified execution model, giving numerical kernel designers a convenient way to generate parallel tasks over heterogeneous hardware and to easily develop and tune powerful scheduling algorithms.
Journal Article
A class of parallel tiled linear algebra algorithms for multicore architectures
TL;DR: Algorithms for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data are presented.
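For the Cholesky case, the decomposition into small tasks on square tiles follows a standard right-looking pattern, enumerated below as a sketch: a POTRF on each diagonal tile, TRSMs on the panel below it, then SYRK/GEMM updates on the trailing submatrix.

```python
def tiled_cholesky_tasks(n_tiles):
    """Enumerate the tasks of a tiled right-looking Cholesky factorization
    of an n_tiles x n_tiles grid of square blocks (lower-triangular case).
    Each task is (kernel, row_tile, col_tile)."""
    tasks = []
    for k in range(n_tiles):
        tasks.append(("POTRF", k, k))              # factor diagonal tile
        for i in range(k + 1, n_tiles):
            tasks.append(("TRSM", i, k))           # solve panel below it
        for i in range(k + 1, n_tiles):
            tasks.append(("SYRK", i, i))           # symmetric diagonal update
            for j in range(k + 1, i):
                tasks.append(("GEMM", i, j))       # off-diagonal update
    return tasks
```

For N tiles this yields N POTRF, N(N-1)/2 TRSM, N(N-1)/2 SYRK and N(N-1)(N-2)/6 GEMM tasks; the GEMM tasks dominate for large N, which is what gives runtimes like XKaapi abundant fine-grained parallelism to schedule.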
Journal Article
Thread Scheduling for Multiprogrammed Multiprocessors
TL;DR: This work presents a user-level thread scheduler for shared-memory multiprocessors that achieves linear speedup whenever P is small relative to the parallelism T1/T∞.
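The quantities in that bound can be made concrete with a small worked example: for a task DAG, the work T1 is the total task time, the span T∞ is the longest dependency chain, and a greedy scheduler on P processors achieves T_P ≤ T1/P + T∞, so speedup is near linear whenever P is much smaller than T1/T∞.

```python
def work_and_span(costs, deps):
    """costs: {task: time}; deps: {task: [predecessors]} for an acyclic DAG.
    Returns (T1, T_inf): total work and critical-path length."""
    work = sum(costs.values())
    memo = {}

    def finish(t):
        # Earliest finish time of t if every predecessor runs as soon as possible.
        if t not in memo:
            memo[t] = costs[t] + max((finish(p) for p in deps.get(t, [])),
                                     default=0.0)
        return memo[t]

    span = max(finish(t) for t in costs)
    return work, span
```

For a diamond DAG (a before b and c, both before d, unit costs) this gives T1 = 4 and T∞ = 3, so parallelism T1/T∞ = 4/3: adding processors beyond that yields no further speedup.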