StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
Cédric Augonnet, Samuel Thibault, Raymond Namyst, Pierre-André Wacrenier
Vol. 23, Iss. 2, pp. 187–198
TL;DR
StarPU, as described in this paper, is a runtime system that provides a high-level, unified execution model, giving numerical kernel designers a convenient way to generate parallel tasks over heterogeneous hardware and to easily develop and tune powerful scheduling algorithms.

Abstract
In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE) or data-parallel accelerators (e.g. GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a main challenge. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run-time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements regarding execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. We eventually show that our dynamic approach competes with the highly optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way. Copyright © 2010 John Wiley & Sons, Ltd.
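As a toy illustration of the dispatch idea in the abstract (this is not StarPU's actual C API), the sketch below greedily sends each task to the processing unit with the earliest predicted finish time, using per-unit cost estimates in the spirit of StarPU's performance-model-driven strategies. The task costs and unit names are invented for the example.

```python
def dispatch(tasks, units):
    """tasks: list of dicts mapping unit name -> predicted cost.
    units: list of unit names.
    Returns (per-unit schedule of task indices, overall makespan)."""
    ready_at = {u: 0.0 for u in units}   # time at which each unit becomes free
    schedule = {u: [] for u in units}
    for i, costs in enumerate(tasks):
        # pick the unit on which this task is predicted to finish earliest
        best = min(units, key=lambda u: ready_at[u] + costs[u])
        ready_at[best] += costs[best]
        schedule[best].append(i)
    return schedule, max(ready_at.values())

# Two task kinds: GPU-friendly tasks (cheap on the GPU) and CPU-friendly ones.
tasks = [{"cpu": 1.0, "gpu": 0.1}] * 8 + [{"cpu": 0.2, "gpu": 2.0}] * 8
schedule, makespan = dispatch(tasks, ["cpu", "gpu"])
```

Because each kind of task lands on the unit that runs it fastest, both units stay busy on work they are good at, which is the mechanism behind the superlinear speedups the abstract reports.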
Citations
Journal Article
Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
TL;DR: Kokkos’ abstractions are described, its application programmer interface (API) is summarized, performance results for unit-test kernels and mini-applications are presented, and an incremental strategy for migrating legacy C++ codes to Kokkos is outlined.
Journal Article
A Survey of CPU-GPU Heterogeneous Computing Techniques
Sparsh Mittal, Jeffrey S. Vetter
TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.
Journal Article
DAGuE: A generic distributed DAG engine for High Performance Computing
George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier, Jack Dongarra
TL;DR: DAGuE is presented, a generic framework for architecture aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures and uses a dynamic, fully-distributed scheduler based on cache awareness, data-locality and task priority.
Journal Article
PaRSEC: Exploiting Heterogeneity to Enhance Scalability
George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, Jack Dongarra
TL;DR: In this article, the authors present an approach based on task parallelism that reveals the application's parallelism by expressing its algorithm as a task flow, which allows the algorithm to be decoupled from the data distribution and the underlying hardware.
Book Chapter
A static task partitioning approach for heterogeneous systems using OpenCL
Dominik Grewe, Michael O'Boyle
TL;DR: This work proposes a portable partitioning scheme for OpenCL programs on heterogeneous CPU-GPU systems. The approach is purely static, based on predictive modelling and program features, and achieves speedups across a suite of 47 benchmarks.
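As a hedged sketch of the static-partitioning idea summarized above (not the paper's actual predictive model or program features), one can split N work items between CPU and GPU in proportion to their predicted throughputs so that both devices finish at the same time; the rates used here are invented.

```python
def static_partition(n_items, cpu_rate, gpu_rate):
    """Split n_items between CPU and GPU proportionally to predicted
    throughputs (items per time unit), so predicted finish times match."""
    gpu_share = gpu_rate / (cpu_rate + gpu_rate)
    n_gpu = round(n_items * gpu_share)
    return n_items - n_gpu, n_gpu

# A GPU predicted to be 4x faster gets 4/5 of the work.
n_cpu, n_gpu = static_partition(1000, cpu_rate=50.0, gpu_rate=200.0)
```

With these rates both devices are predicted to finish in 4.0 time units (200/50 on the CPU, 800/200 on the GPU), which is the load-balance condition a static partitioner aims for.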
References
Journal Article
Performance-effective and low-complexity task scheduling for heterogeneous computing
TL;DR: Two novel scheduling algorithms for a bounded number of heterogeneous processors with an objective to simultaneously meet high performance and fast scheduling time are presented, called the Heterogeneous Earliest-Finish-Time (HEFT) algorithm and the Critical-Path-on-a-Processor (CPOP) algorithm.
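A minimal sketch of the HEFT idea under simplifying assumptions (communication costs are ignored and there is no insertion-based gap filling): tasks are prioritized by upward rank, then each task is placed on the processor that minimizes its earliest finish time. The DAG and cost table below are invented for illustration.

```python
def heft(dag, costs):
    """dag: {task: set of successor tasks}; costs: {task: [cost per processor]}.
    Returns (task -> processor placement, makespan)."""
    n_proc = len(next(iter(costs.values())))
    memo = {}

    def rank(t):  # upward rank: average cost plus longest path to an exit task
        if t not in memo:
            avg = sum(costs[t]) / n_proc
            memo[t] = avg + max((rank(s) for s in dag[t]), default=0.0)
        return memo[t]

    order = sorted(costs, key=rank, reverse=True)   # decreasing upward rank
    preds = {t: [u for u in dag if t in dag[u]] for t in costs}
    ready = [0.0] * n_proc                          # per-processor availability
    finish, placement = {}, {}
    for t in order:
        data_ready = max((finish[p] for p in preds[t]), default=0.0)
        eft = [max(ready[p], data_ready) + costs[t][p] for p in range(n_proc)]
        p = min(range(n_proc), key=lambda i: eft[i])
        placement[t], finish[t], ready[p] = p, eft[p], eft[p]
    return placement, max(finish.values())

# Diamond DAG: A -> {B, C} -> D, two heterogeneous processors.
dag = {"A": {"B", "C"}, "B": {"D"}, "C": {"D"}, "D": set()}
costs = {"A": [2, 3], "B": [3, 1], "C": [1, 4], "D": [2, 2]}
placement, makespan = heft(dag, costs)
```

Ranking by upward rank guarantees every task is scheduled after its predecessors, and the per-task EFT choice is what lets B run on the processor where it is cheap (processor 1) while C runs on processor 0.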
Journal Article
A Survey of General-Purpose Computation on Graphics Hardware
John D. Owens, David Luebke, Naga K. Govindaraju, Mark J. Harris, Jens Krüger, Aaron Lefohn, Timothy John Purcell
TL;DR: This report describes, summarizes, and analyzes the latest research in mapping general-purpose computation to graphics hardware.
Proceedings Article
Automatically Tuned Linear Algebra Software
R. Clint Whaley, Jack Dongarra
TL;DR: An approach is presented for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units, targeting the widely used linear algebra kernels known as the Basic Linear Algebra Subroutines (BLAS).
Proceedings Article
Benchmarking GPUs to tune dense linear algebra
Vasily Volkov, James Demmel
TL;DR: In this article, the authors present performance results for dense linear algebra using recent NVIDIA GPUs and argue that modern GPUs should be viewed as multithreaded multicore vector units, and exploit blocking similarly to vector computers and heterogeneity of the system.
Journal Article
A class of parallel tiled linear algebra algorithms for multicore architectures
TL;DR: Algorithms for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data are presented.
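In the spirit of the tiled factorizations summarized above, the following hedged sketch enumerates how a t-by-t tiled Cholesky factorization decomposes into a sequence of small kernel tasks operating on square blocks; the kernel names (POTRF, TRSM, SYRK, GEMM) follow LAPACK/BLAS convention, and the numerical kernels themselves are elided.

```python
def cholesky_tasks(t):
    """Enumerate the block tasks of a tiled Cholesky factorization of a
    t x t grid of tiles, in the order a sequential algorithm generates them."""
    tasks = []
    for k in range(t):
        tasks.append(("POTRF", k))                 # factor diagonal tile (k, k)
        for i in range(k + 1, t):
            tasks.append(("TRSM", i, k))           # solve panel tile (i, k)
        for i in range(k + 1, t):
            tasks.append(("SYRK", i, k))           # update diagonal tile (i, i)
            for j in range(k + 1, i):
                tasks.append(("GEMM", i, j, k))    # update interior tile (i, j)
    return tasks

tasks = cholesky_tasks(4)
```

For a 4x4 tile grid this yields 4 POTRF, 6 TRSM, 6 SYRK, and 4 GEMM tasks (20 in total); each task reads and writes whole tiles, which is exactly the granularity a task-based runtime can schedule dynamically.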