XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures
Thierry Gautier, João V. F. Lima, Nicolas Maillard, Bruno Raffin +3 more
pp. 1299–1308
TL;DR
This work presents the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work-stealing scheduler, and shows performance results on two dense linear algebra kernels and a highly efficient Cholesky factorization.

Abstract
Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and accelerators, like GPUs. Programming such nodes is typically based on a combination of OpenMP and CUDA/OpenCL codes; scheduling relies on a static partitioning and cost model. We present the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work-stealing scheduler. XKaapi enables task multi-implementation on CPU or GPU and multi-level parallelism with different grain sizes. We show performance results on two dense linear algebra kernels, matrix product (GEMM) and Cholesky factorization (POTRF), to evaluate XKaapi on a heterogeneous architecture composed of two hexa-core CPUs and eight NVIDIA Fermi GPUs. Our conclusion is twofold. First, fine-grained parallelism and online scheduling achieve performance results as good as static strategies, and in most cases outperform them. This is due to an improved work-stealing strategy that includes locality information; a very light implementation of the tasks in XKaapi; and an optimized search for ready tasks. Next, the multi-level parallelism on multiple CPUs and GPUs enabled by XKaapi led to a highly efficient Cholesky factorization. Using eight NVIDIA Fermi GPUs and four CPUs, we measure up to 2.43 TFlop/s on double precision matrix product and 1.79 TFlop/s on Cholesky factorization; and respectively 5.09 TFlop/s and 3.92 TFlop/s in single precision.
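The abstract's central idea, a data-flow task model where the runtime derives task dependencies from declared data accesses, can be sketched as follows. This is a hypothetical illustration, not XKaapi's C++ API: tasks declare read (`R`) or write (`W`) access to named data, and the graph infers read-after-write, write-after-write, and write-after-read dependencies before executing tasks in a dependency-respecting order.

```python
class DataFlowGraph:
    """Minimal sketch of access-mode-based dependency inference."""

    def __init__(self):
        self.tasks = []          # list of (name, fn)
        self.last_writer = {}    # data name -> index of last writing task
        self.readers = {}        # data name -> readers since the last write
        self.deps = {}           # task index -> set of predecessor indices

    def spawn(self, name, fn, accesses):
        """accesses: list of (data_name, mode) with mode 'R' or 'W'."""
        i = len(self.tasks)
        self.tasks.append((name, fn))
        self.deps[i] = set()
        for data, mode in accesses:
            if data in self.last_writer:               # RAW / WAW dependency
                self.deps[i].add(self.last_writer[data])
            if mode == "W":
                for r in self.readers.get(data, []):   # WAR dependency
                    self.deps[i].add(r)
                self.last_writer[data] = i
                self.readers[data] = []
            else:
                self.readers.setdefault(data, []).append(i)
        return i

    def run(self):
        """Execute tasks once all their predecessors are done; return the order."""
        done, order = set(), []
        while len(done) < len(self.tasks):
            for i, (name, fn) in enumerate(self.tasks):
                if i not in done and self.deps[i] <= done:
                    fn()
                    done.add(i)
                    order.append(name)
        return order
```

A real runtime would execute ready tasks concurrently (with work stealing deciding which worker runs what); the sequential loop here only demonstrates the dependency semantics.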
Citations
Journal Article
Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
TL;DR: Kokkos’ abstractions are described, its application programmer interface (API) is summarized, performance results for unit-test kernels and mini-applications are presented, and an incremental strategy for migrating legacy C++ codes to Kokkos is outlined.
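The idea behind Kokkos' polymorphic memory access patterns can be illustrated with a small sketch (this is not the Kokkos API, and `View2D` is a hypothetical name): the same logical `(i, j)` indexing works over different memory layouts, so code can be retargeted to row-major (CPU cache-friendly) or column-major (GPU coalescing-friendly) storage without rewriting the kernel.

```python
class View2D:
    """2-D array whose memory layout is a runtime parameter, in the spirit
    of layout-polymorphic views (row-major 'right' vs column-major 'left')."""

    def __init__(self, rows, cols, layout="right"):
        self.rows, self.cols, self.layout = rows, cols, layout
        self.data = [0] * (rows * cols)

    def _index(self, i, j):
        if self.layout == "right":       # row-major: j is the stride-1 index
            return i * self.cols + j
        return j * self.rows + i         # "left": column-major, i is stride-1

    def __getitem__(self, ij):
        return self.data[self._index(*ij)]

    def __setitem__(self, ij, value):
        self.data[self._index(*ij)] = value
```

Kernel code written against `view[i, j]` is identical for both layouts; only the flat storage order differs.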
Journal Article
SuperGlue: a shared memory framework using data versioning for dependency-aware task-based parallelization
TL;DR: SuperGlue is a shared memory framework that uses data versioning for dependency-aware task-based parallelization, simplifying the expression of task dependencies in shared memory programs.
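The data-versioning idea can be sketched as follows (hypothetical names, not SuperGlue's API; this minimal version orders read-after-write and write-after-write, omitting write-after-read handling): each shared datum carries a version counter, a task records the version each of its operands must reach before it may run, and completing a write bumps the version.

```python
class Datum:
    """A shared datum with a produced-version counter and a counter of
    versions promised by writers submitted so far."""
    def __init__(self):
        self.version = 0
        self.scheduled = 0

class VersionScheduler:
    def __init__(self):
        self.pending = []  # (name, [(datum, required_version)], writes)

    def submit(self, name, reads=(), writes=()):
        # A task must wait until every operand reaches the version promised
        # by all writes submitted before it.
        reqs = [(d, d.scheduled) for d in list(reads) + list(writes)]
        for d in writes:
            d.scheduled += 1
        self.pending.append((name, reqs, list(writes)))

    def run(self):
        """Run pending tasks as their version requirements are met."""
        order = []
        while self.pending:
            for task in self.pending:
                name, reqs, writes = task
                if all(d.version >= v for d, v in reqs):
                    for d in writes:
                        d.version += 1     # completing a write bumps the version
                    order.append(name)
                    self.pending.remove(task)
                    break
        return order
```
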
Proceedings ArticleDOI
LAWS: locality-aware work-stealing for multi-socket multi-core architectures
Quan Chen, Minyi Guo, Haibing Guan +2 more
TL;DR: A Locality-Aware Work-Stealing (LAWS) scheduler is proposed, which better utilizes both the shared cache and the NUMA memory system and can improve the performance of memory-bound programs up to 54.2% compared with traditional work-stealing schedulers.
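A core ingredient of locality-aware work stealing is victim selection that prefers workers sharing the thief's socket (and thus its shared cache and NUMA node). The sketch below is a hypothetical illustration of that policy, not LAWS itself:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Worker:
    id: int
    socket: int
    deque: list = field(default_factory=list)  # pending tasks

def choose_victim(thief, workers, rng=random):
    """Prefer victims with work on the thief's socket; otherwise steal
    from any worker that has work; return None if everyone is idle."""
    local = [w for w in workers
             if w.id != thief.id and w.socket == thief.socket and w.deque]
    if local:
        return rng.choice(local)
    remote = [w for w in workers if w.id != thief.id and w.deque]
    return rng.choice(remote) if remote else None
```

Restricting steals to the local socket first keeps data in the shared last-level cache; the remote fallback preserves load balance when a whole socket runs dry.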
Journal Article
Scheduling independent tasks on multi-cores with GPU accelerators
Raphaël Bleuse, Safia Kedad-Sidhoum, Florence Monna, Grégory Mounié, Denis Trystram +6 more
TL;DR: This paper presents a new method for scheduling efficiently parallel applications with m CPUs and k GPUs, where each task of the application can be processed either on a core (CPU) or on a GPU.
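To make the problem setting concrete (this is a simple greedy baseline, not the paper's algorithm): each independent task has a CPU time and a GPU time, and a scheduler must pick, for each task, the processor that minimizes its finish time given current loads.

```python
def greedy_schedule(tasks, m_cpus, k_gpus):
    """tasks: list of (cpu_time, gpu_time) for independent tasks.
    Greedily place each task on the processor where it finishes earliest.
    Returns (makespan, assignment) with assignment entries ('cpu'|'gpu', index)."""
    cpu_load = [0.0] * m_cpus
    gpu_load = [0.0] * k_gpus
    assignment = []
    for cpu_t, gpu_t in tasks:
        c = min(range(m_cpus), key=lambda i: cpu_load[i])  # least-loaded CPU
        g = min(range(k_gpus), key=lambda i: gpu_load[i])  # least-loaded GPU
        if cpu_load[c] + cpu_t <= gpu_load[g] + gpu_t:
            cpu_load[c] += cpu_t
            assignment.append(("cpu", c))
        else:
            gpu_load[g] += gpu_t
            assignment.append(("gpu", g))
    return max(cpu_load + gpu_load), assignment
```

The paper's contribution is a method with better guarantees than this kind of greedy heuristic; the sketch only shows the input/output shape of the (m CPU, k GPU) scheduling problem.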
Proceedings Article
Simplifying programming and load balancing of data parallel applications on heterogeneous systems
TL;DR: Maat is a library for OpenCL programmers that allows for the efficient execution of a single data-parallel kernel using all the available devices and provides the programmer with an abstract view of the system to enable the management of heterogeneous environments regardless of the underlying architecture.
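One way to execute a single data-parallel kernel across all devices is to split its iteration range in proportion to each device's measured throughput, so all devices finish at roughly the same time. The function below sketches that idea (it is an assumption about the general technique, not Maat's API):

```python
def partition(n_items, throughputs):
    """Split the range [0, n_items) into one contiguous chunk per device,
    sized proportionally to that device's throughput."""
    total = sum(throughputs)
    shares = [int(n_items * t / total) for t in throughputs]
    shares[-1] += n_items - sum(shares)   # rounding remainder to the last device
    bounds, start = [], 0
    for s in shares:
        bounds.append((start, start + s))
        start += s
    return bounds
```

Each device then runs the kernel on its own sub-range; re-measuring throughput between launches allows the split to adapt to the actual hardware.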
References
Journal Article
Cilk: An Efficient Multithreaded Runtime System
Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, Yuli Zhou +5 more
TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.
Proceedings Article
The implementation of the Cilk-5 multithreaded language
TL;DR: Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler are presented.
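The ready deque's access discipline can be sketched in a few lines (this omits Cilk-5's THE mutual-exclusion protocol and all synchronization): the owning worker pushes and pops at the bottom, getting LIFO order and good locality, while thieves steal from the top, taking the oldest and typically largest piece of work.

```python
from collections import deque

class ReadyDeque:
    """Single-threaded sketch of a work-stealing ready deque:
    owner works at the bottom, thieves steal from the top."""

    def __init__(self):
        self._d = deque()

    def push_bottom(self, task):
        self._d.append(task)          # owner spawns a task

    def pop_bottom(self):
        return self._d.pop() if self._d else None      # owner: LIFO

    def steal_top(self):
        return self._d.popleft() if self._d else None  # thief: FIFO
```

In a real scheduler the bottom end is accessed without locking on the common path, and only top-end steals pay for synchronization, which is exactly what the paper's mutual-exclusion protocol makes cheap.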
Journal Article
StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
TL;DR: StarPU is a runtime system that provides a high-level unified execution model, giving numerical kernel designers a convenient way to generate parallel tasks over heterogeneous hardware and to easily develop and tune powerful scheduling algorithms.
Journal Article
A class of parallel tiled linear algebra algorithms for multicore architectures
TL;DR: Algorithms for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data are presented.
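For the Cholesky case, the decomposition into small tasks on square tiles follows a standard right-looking pattern, enumerated below as a sketch: a POTRF on each diagonal tile, TRSMs on the panel below it, then SYRK/GEMM updates on the trailing submatrix.

```python
def tiled_cholesky_tasks(n_tiles):
    """Enumerate the tasks of a tiled right-looking Cholesky factorization
    of an n_tiles x n_tiles grid of square blocks (lower-triangular case).
    Each task is (kernel, row_tile, col_tile)."""
    tasks = []
    for k in range(n_tiles):
        tasks.append(("POTRF", k, k))              # factor diagonal tile
        for i in range(k + 1, n_tiles):
            tasks.append(("TRSM", i, k))           # solve panel below it
        for i in range(k + 1, n_tiles):
            tasks.append(("SYRK", i, i))           # symmetric diagonal update
            for j in range(k + 1, i):
                tasks.append(("GEMM", i, j))       # off-diagonal update
    return tasks
```

For N tiles this yields N POTRF, N(N-1)/2 TRSM, N(N-1)/2 SYRK and N(N-1)(N-2)/6 GEMM tasks; the GEMM tasks dominate for large N, which is what gives runtimes like XKaapi abundant fine-grained parallelism to schedule.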
Journal Article
Thread Scheduling for Multiprogrammed Multiprocessors
TL;DR: This work presents a user-level thread scheduler for shared-memory multiprocessors that achieves linear speedup whenever P is small relative to the parallelism T1/T∞.
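The quantities in that bound can be made concrete with a small worked example: for a task DAG, the work T1 is the total task time, the span T∞ is the longest dependency chain, and a greedy scheduler on P processors achieves T_P ≤ T1/P + T∞, so speedup is near linear whenever P is much smaller than T1/T∞.

```python
def work_and_span(costs, deps):
    """costs: {task: time}; deps: {task: [predecessors]} for an acyclic DAG.
    Returns (T1, T_inf): total work and critical-path length."""
    work = sum(costs.values())
    memo = {}

    def finish(t):
        # Earliest finish time of t if every predecessor runs as soon as possible.
        if t not in memo:
            memo[t] = costs[t] + max((finish(p) for p in deps.get(t, [])),
                                     default=0.0)
        return memo[t]

    span = max(finish(t) for t in costs)
    return work, span
```

For a diamond DAG (a before b and c, both before d, unit costs) this gives T1 = 4 and T∞ = 3, so parallelism T1/T∞ = 4/3: adding processors beyond that yields no further speedup.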