scispace - formally typeset
Open Access · Journal Article (DOI)

StarPU: a unified platform for task scheduling on heterogeneous multicore architectures

TLDR
StarPU is a runtime system that provides a high-level, unified execution model, giving numerical kernel designers a convenient way to generate parallel tasks over heterogeneous hardware and to easily develop and tune powerful scheduling algorithms.
Abstract
In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE) or data-parallel accelerators (e.g. GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a major challenge. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run-time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements regarding execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. We eventually show that our dynamic approach competes with the highly optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way. Copyright © 2010 John Wiley & Sons, Ltd.
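The abstract's central idea — tasks carrying one kernel implementation per architecture, dispatched by a scheduler using per-device performance predictions — can be caricatured in a few lines. The following is a hypothetical Python sketch of the concept only, not StarPU's real interface (which is a C API built around codelets and pluggable scheduling policies); all names here are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Codelet:
    # One kernel implementation per architecture, e.g. {"cpu": f, "gpu": g}.
    impls: dict

@dataclass
class Worker:
    name: str
    arch: str
    busy_until: float = 0.0  # time at which this worker becomes free

def submit(cl, size, workers, perf_model):
    """Greedy dispatch: run the task on the worker whose predicted
    finish time (current load + modeled kernel duration) is earliest."""
    candidates = [w for w in workers if w.arch in cl.impls]
    best = min(candidates, key=lambda w: w.busy_until + perf_model[w.arch](size))
    best.busy_until += perf_model[best.arch](size)
    return best, cl.impls[best.arch](size)
```

With a latency-heavy GPU model, small tasks land on the CPU and large ones on the GPU — the kind of heterogeneity-aware decision the paper's scheduling strategies make dynamically.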



Citations
Journal Article (DOI)

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

TL;DR: Kokkos’ abstractions are described, its application programmer interface (API) is summarized, performance results for unit-test kernels and mini-applications are presented, and an incremental strategy for migrating legacy C++ codes to Kokkos is outlined.
Journal Article (DOI)

A Survey of CPU-GPU Heterogeneous Computing Techniques

TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.
Journal Article (DOI)

DAGuE: A generic distributed DAG engine for High Performance Computing

TL;DR: This paper presents DAGuE, a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures, using a dynamic, fully distributed scheduler based on cache awareness, data locality, and task priority.
Journal Article (DOI)

PaRSEC: Exploiting Heterogeneity to Enhance Scalability

TL;DR: In this article, the authors present an approach based on task parallelism that reveals the application's parallelism by expressing its algorithm as a task flow, which allows the algorithm to be decoupled from the data distribution and the underlying hardware.
Book Chapter (DOI)

A static task partitioning approach for heterogeneous systems using OpenCL

TL;DR: This work proposes a portable, purely static partitioning scheme for OpenCL programs on heterogeneous CPU-GPU systems, built on predictive modelling and program features, and achieves speedups across a suite of 47 benchmarks.
References
Journal Article (DOI)

Performance-effective and low-complexity task scheduling for heterogeneous computing

TL;DR: Two novel scheduling algorithms for a bounded number of heterogeneous processors with an objective to simultaneously meet high performance and fast scheduling time are presented, called the Heterogeneous Earliest-Finish-Time (HEFT) algorithm and the Critical-Path-on-a-Processor (CPOP) algorithm.
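HEFT's two phases — prioritizing tasks by upward rank, then placing each on the processor giving the earliest finish time — can be sketched compactly. This is a simplified illustration under my own conventions (DAG as successor lists, per-processor cost tables, pairwise communication costs), omitting the insertion-based slot search the real HEFT algorithm uses:

```python
def heft(tasks, succ, cost, comm):
    """tasks: list of task ids; succ: task -> list of successors;
    cost: (task, proc) -> execution time; comm: (parent, child) -> transfer time."""
    procs = sorted({p for (_, p) in cost})

    # Phase 1: upward rank = average cost + costliest path to an exit task.
    rank = {}
    def upward(t):
        if t not in rank:
            avg = sum(cost[(t, p)] for p in procs) / len(procs)
            rank[t] = avg + max((comm.get((t, c), 0) + upward(c)
                                 for c in succ.get(t, [])), default=0)
        return rank[t]
    for t in tasks:
        upward(t)

    # Phase 2: by decreasing rank, place each task where it finishes earliest.
    finish, placed, free_at = {}, {}, {p: 0.0 for p in procs}
    for t in sorted(tasks, key=lambda t: -rank[t]):
        preds = [u for u in tasks if t in succ.get(u, [])]
        best = None
        for p in procs:
            # Data is ready once every predecessor has finished and (if it ran
            # on another processor) its output has been transferred.
            data_ready = max((finish[u] + (0 if placed[u] == p else comm.get((u, t), 0))
                              for u in preds), default=0.0)
            eft = max(free_at[p], data_ready) + cost[(t, p)]
            if best is None or eft < best[0]:
                best = (eft, p)
        finish[t], placed[t] = best
        free_at[best[1]] = best[0]
    return placed, finish
```

The rank phase guarantees every predecessor is scheduled before its successors, so the placement loop can read their finish times directly.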
Journal Article (DOI)

A Survey of General-Purpose Computation on Graphics Hardware

TL;DR: This report describes, summarizes, and analyzes the latest research in mapping general-purpose computation to graphics hardware.
Proceedings Article (DOI)

Automatically Tuned Linear Algebra Software

TL;DR: An approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units, demonstrated on the widely used Basic Linear Algebra Subprograms (BLAS) kernels.
Proceedings Article (DOI)

Benchmarking GPUs to tune dense linear algebra

TL;DR: In this article, the authors present performance results for dense linear algebra using recent NVIDIA GPUs and argue that modern GPUs should be viewed as multithreaded multicore vector units, and exploit blocking similarly to vector computers and heterogeneity of the system.
Journal Article (DOI)

A class of parallel tiled linear algebra algorithms for multicore architectures

TL;DR: Algorithms for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data are presented.
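The "sequence of small tasks on square blocks" view can be made concrete by enumerating the task graph of a tiled Cholesky factorization. A minimal sketch with my own task naming (the four classic tile kernels), not code from the paper:

```python
def tiled_cholesky_tasks(T):
    """Enumerate the tasks of a tiled Cholesky factorization
    of a T x T grid of square tiles, in a valid sequential order."""
    tasks = []
    for k in range(T):
        tasks.append(("POTRF", k))                # factor diagonal tile (k, k)
        for m in range(k + 1, T):
            tasks.append(("TRSM", m, k))          # solve panel tile (m, k)
        for n in range(k + 1, T):
            tasks.append(("SYRK", n, k))          # rank-k update of diagonal (n, n)
            for m in range(n + 1, T):
                tasks.append(("GEMM", m, n, k))   # update trailing tile (m, n)
    return tasks
```

Each task touches only a few tiles, so its inputs and outputs define the dependency DAG that a runtime like StarPU can schedule dynamically; for T = 4 the factorization already yields 20 such tasks.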