scispace - formally typeset
Open Access · Journal Article (DOI)

StarPU: a unified platform for task scheduling on heterogeneous multicore architectures

TLDR
StarPU is a runtime system that provides a high-level, unified execution model, giving numerical kernel designers a convenient way to generate parallel tasks over heterogeneous hardware and to easily develop and tune powerful scheduling algorithms.
Abstract
In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE) or data-parallel accelerators (e.g. GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a major challenge. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run-time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements regarding execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. We eventually show that our dynamic approach competes with the highly optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way. Copyright © 2010 John Wiley & Sons, Ltd.
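The abstract's central idea — tasks carrying one kernel implementation per architecture, dispatched by a scheduler using per-device performance predictions — can be caricatured in a few lines. The following is a hypothetical Python sketch of the concept only, not StarPU's real interface (which is a C API built around codelets and pluggable scheduling policies); all names here are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Codelet:
    # One kernel implementation per architecture, e.g. {"cpu": f, "gpu": g}.
    impls: dict

@dataclass
class Worker:
    name: str
    arch: str
    busy_until: float = 0.0  # time at which this worker becomes free

def submit(cl, size, workers, perf_model):
    """Greedy dispatch: run the task on the worker whose predicted
    finish time (current load + modeled kernel duration) is earliest."""
    candidates = [w for w in workers if w.arch in cl.impls]
    best = min(candidates, key=lambda w: w.busy_until + perf_model[w.arch](size))
    best.busy_until += perf_model[best.arch](size)
    return best, cl.impls[best.arch](size)
```

With a latency-heavy GPU model, small tasks land on the CPU and large ones on the GPU — the kind of heterogeneity-aware decision the paper's scheduling strategies make dynamically.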



Citations
Journal Article (DOI)

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

TL;DR: Kokkos’ abstractions are described, its application programmer interface (API) is summarized, performance results for unit-test kernels and mini-applications are presented, and an incremental strategy for migrating legacy C++ codes to Kokkos is outlined.
Journal Article (DOI)

A Survey of CPU-GPU Heterogeneous Computing Techniques

TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.
Journal Article (DOI)

DAGuE: A generic distributed DAG engine for High Performance Computing

TL;DR: This paper presents DAGuE, a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures, using a dynamic, fully distributed scheduler based on cache awareness, data locality, and task priority.
Journal Article (DOI)

PaRSEC: Exploiting Heterogeneity to Enhance Scalability

TL;DR: In this article, the authors present an approach based on task parallelism that reveals the application's parallelism by expressing its algorithm as a task flow, which allows the algorithm to be decoupled from the data distribution and the underlying hardware.
Book Chapter (DOI)

A static task partitioning approach for heterogeneous systems using OpenCL

TL;DR: This work proposes a portable, purely static partitioning scheme for OpenCL programs on heterogeneous CPU-GPU systems, built on predictive modelling and program features, and achieves speedups across a suite of 47 benchmarks.
References
Journal Article (DOI)

Performance-effective and low-complexity task scheduling for heterogeneous computing

TL;DR: Two novel scheduling algorithms for a bounded number of heterogeneous processors with an objective to simultaneously meet high performance and fast scheduling time are presented, called the Heterogeneous Earliest-Finish-Time (HEFT) algorithm and the Critical-Path-on-a-Processor (CPOP) algorithm.
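HEFT's two phases — prioritizing tasks by upward rank, then placing each on the processor giving the earliest finish time — can be sketched compactly. This is a simplified illustration under my own conventions (DAG as successor lists, per-processor cost tables, pairwise communication costs), omitting the insertion-based slot search the real HEFT algorithm uses:

```python
def heft(tasks, succ, cost, comm):
    """tasks: list of task ids; succ: task -> list of successors;
    cost: (task, proc) -> execution time; comm: (parent, child) -> transfer time."""
    procs = sorted({p for (_, p) in cost})

    # Phase 1: upward rank = average cost + costliest path to an exit task.
    rank = {}
    def upward(t):
        if t not in rank:
            avg = sum(cost[(t, p)] for p in procs) / len(procs)
            rank[t] = avg + max((comm.get((t, c), 0) + upward(c)
                                 for c in succ.get(t, [])), default=0)
        return rank[t]
    for t in tasks:
        upward(t)

    # Phase 2: by decreasing rank, place each task where it finishes earliest.
    finish, placed, free_at = {}, {}, {p: 0.0 for p in procs}
    for t in sorted(tasks, key=lambda t: -rank[t]):
        preds = [u for u in tasks if t in succ.get(u, [])]
        best = None
        for p in procs:
            # Data is ready once every predecessor has finished and (if it ran
            # on another processor) its output has been transferred.
            data_ready = max((finish[u] + (0 if placed[u] == p else comm.get((u, t), 0))
                              for u in preds), default=0.0)
            eft = max(free_at[p], data_ready) + cost[(t, p)]
            if best is None or eft < best[0]:
                best = (eft, p)
        finish[t], placed[t] = best
        free_at[best[1]] = best[0]
    return placed, finish
```

The rank phase guarantees every predecessor is scheduled before its successors, so the placement loop can read their finish times directly.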
Journal Article (DOI)

A Survey of General-Purpose Computation on Graphics Hardware

TL;DR: This report describes, summarizes, and analyzes the latest research in mapping general-purpose computation to graphics hardware.
Proceedings Article (DOI)

Automatically Tuned Linear Algebra Software

TL;DR: An approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units, demonstrated on the widely used Basic Linear Algebra Subprograms (BLAS) kernels.
Proceedings Article (DOI)

Benchmarking GPUs to tune dense linear algebra

TL;DR: In this article, the authors present performance results for dense linear algebra using recent NVIDIA GPUs and argue that modern GPUs should be viewed as multithreaded multicore vector units, and exploit blocking similarly to vector computers and heterogeneity of the system.
Journal Article (DOI)

A class of parallel tiled linear algebra algorithms for multicore architectures

TL;DR: Algorithms for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data are presented.
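The "sequence of small tasks on square blocks" view can be made concrete by enumerating the task graph of a tiled Cholesky factorization. A minimal sketch with my own task naming (the four classic tile kernels), not code from the paper:

```python
def tiled_cholesky_tasks(T):
    """Enumerate the tasks of a tiled Cholesky factorization
    of a T x T grid of square tiles, in a valid sequential order."""
    tasks = []
    for k in range(T):
        tasks.append(("POTRF", k))                # factor diagonal tile (k, k)
        for m in range(k + 1, T):
            tasks.append(("TRSM", m, k))          # solve panel tile (m, k)
        for n in range(k + 1, T):
            tasks.append(("SYRK", n, k))          # rank-k update of diagonal (n, n)
            for m in range(n + 1, T):
                tasks.append(("GEMM", m, n, k))   # update trailing tile (m, n)
    return tasks
```

Each task touches only a few tiles, so its inputs and outputs define the dependency DAG that a runtime like StarPU can schedule dynamically; for T = 4 the factorization already yields 20 such tasks.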