
HAL Id: hal-01655383
https://hal.archives-ouvertes.fr/hal-01655383
Submitted on 5 Dec 2017
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Tightening Contention Delays While Scheduling Parallel
Applications on Multi-core Architectures
Benjamin Rouxel, Steven Derrien, Isabelle Puaut
To cite this version:
Benjamin Rouxel, Steven Derrien, Isabelle Puaut. Tightening Contention Delays While Scheduling Parallel Applications on Multi-core Architectures. ACM Transactions on Embedded Computing Systems (TECS), ACM, 2017, 16 (5s), pp.1-20. ⟨10.1145/3126496⟩. ⟨hal-01655383⟩

Tightening contention delays while scheduling parallel
applications on multi-core architectures
Benjamin Rouxel¹, Steven Derrien¹, and Isabelle Puaut¹
¹University of Rennes 1/IRISA
September 19, 2017
Abstract
Multi-core systems are increasingly interesting candidates for executing parallel real-time applications, in avionic, space or automotive industries, as they provide both computing capabilities and power efficiency. However, ensuring that timing constraints are met on such platforms is challenging, because some hardware resources are shared between cores.
Assuming worst-case contentions when analyzing the schedulability of applications may result in systems mistakenly declared unschedulable, although the worst-case level of contentions can never occur in practice. In this paper, we present two contention-aware scheduling strategies that produce a time-triggered schedule of the application's tasks. Based on knowledge of the application's structure, our scheduling strategies precisely estimate the effective contentions, in order to minimize the overall makespan of the schedule. An Integer Linear Programming (ILP) solution of the scheduling problem is presented, as well as a heuristic solution that generates schedules very close to ones of the ILP (5% longer on average), with a much lower time complexity. Our heuristic improves by 19% the overall makespan of the resulting schedules compared to a worst-case contention baseline.
Keywords: Real-time Systems, Contention-Aware Scheduling
This work was partially supported by ARGO (http://www.argo-project.eu/), funded by the European
Commission under Horizon 2020 Research and Innovation Action, Grant Agreement Number 688131.
1 Introduction
The increasing demand for computing power and low energy consumption is making multi-/many-core architectures increasingly attractive candidates for executing embedded critical systems. It is becoming more and more common to find such architectures in the automotive, avionic or space industries [2, 16].
Guaranteeing that timing constraints are met on multi-core platforms is a challenging issue. One difficulty lies in the estimation of the Worst-Case Execution Time (WCET) of tasks. Due to the presence of shared hardware resources (buses, shared last level of cache, ...), techniques designed for single-core architectures cannot be directly applied to multi-core ones. Since it is hard in general to guarantee the absence of resource conflicts during execution, current WCET techniques either produce pessimistic WCET estimates or constrain the execution to enforce the absence of conflicts, at the price of a significant hardware under-utilization.
benjamin.rouxel@irisa.fr
steven.derrien@irisa.fr
isabelle.puaut@irisa.fr

A second issue is the selection of a scheduling strategy, which decides where and when to execute tasks. Scheduling for multi-core platforms has been the subject of many research works, surveyed in [8]. We believe that static mapping of tasks to cores (partitioned scheduling) combined with time-triggered scheduling on each core provides control over the sharing of hardware resources, and thus allows worst-case contention delays to be estimated more tightly.
Some existing work on multi-core scheduling considers that the platform workload consists of independent tasks. As parallel execution is the most promising solution to improve performance, we envision that within only a few years from now, real-time workloads will evolve toward parallel programs. The timing behaviour of such programs is challenging to analyze because they consist of dependent tasks interacting through complex synchronization/communication mechanisms. We believe that models offering a high-level view of the behavior of parallel programs allow a precise estimation of shared resource conflicts. In this paper, we assume parallel applications modeled as directed acyclic task graphs (DAGs), and show that knowledge of the application's structure allows a precise estimation of which tasks effectively execute in parallel, and thus of contention delays. These DAGs do not necessarily need to be built from scratch, which would require a significant engineering effort. Automatic extraction of parallelism, for instance from a high-level description of applications in model-based design workflows [10], looks to us a much more promising direction.
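To make the task model concrete, the following is a minimal sketch of such a DAG representation (the class and field names are ours, for illustration only; the paper's formal model is given in Section 3):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    wcet: int          # worst-case execution time of the compute phase (cycles)
    reads: int = 0     # bytes fetched from main memory before executing
    writes: int = 0    # bytes written back to main memory afterwards

@dataclass
class TaskGraph:
    tasks: list[Task] = field(default_factory=list)
    edges: set[tuple[str, str]] = field(default_factory=set)  # (producer, consumer)

    def predecessors(self, task: Task) -> list[str]:
        """Names of the tasks that must complete before `task` may start."""
        return [src for (src, dst) in self.edges if dst == task.name]
```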
In this paper, we present two mapping and scheduling strategies featuring bus contention awareness. Both strategies apply to multi-core platforms where cores are connected to a round-robin bus. A safe (but pessimistic) bound for the access latency is to consider NbCores - 1 contending tasks being granted access to the bus (with NbCores the number of available cores). Our scheduling strategies take into consideration the application's structure and information on the schedule under construction to precisely estimate the effective degree of interference used to compute the access latency. The proposed scheduling strategies generate a non-preemptive time-triggered partitioned schedule and select the appropriate level of contention to minimize the schedule length.
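As a worked example of this baseline bound (our notation; the paper's own interference analysis appears in Section 4): if a transfer needs \(k\) bus slots of duration \(S\), then under round-robin arbitration each of those slots may be preceded by one slot granted to each of the \(NbCores - 1\) contenders, giving
\[
delay_{worst} = k \times (NbCores - 1) \times S + k \times S = k \times NbCores \times S .
\]
Replacing \(NbCores - 1\) by the number of tasks that can actually overlap with the request is precisely the tightening pursued in this paper.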
The first proposed scheduling method models the task mapping and scheduling problem as constraints on task assignment, task start times, and communications between tasks. We demonstrate that the optimal schedule can only be found using quadratic equations, due to the nature of the information required to model the communication cost. This model is then refined into an Integer Linear Programming (ILP) formulation that in some cases overestimates communication costs and thus may not find the shortest schedule. Since the scheduling problem being solved is NP-hard, the ILP formulation is shown not to scale properly when the number of tasks grows. We therefore developed a heuristic scheduling technique that scales much better with the number of tasks and is able to compute the accurate communication cost. Albeit not always finding the optimal solution, the ILP formulation is included in this paper because it gives an unambiguous description of the problem under study, and also serves as a baseline to evaluate the quality of the proposed heuristic technique.
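The basic idea of the heuristic is forward list scheduling: tasks are ordered and then added one by one to the schedule without backtracking. For intuition, here is a generic skeleton of that scheme, reusing the `TaskGraph` sketch above (this is our illustration, deliberately omitting the contention-aware communication delays that are the paper's contribution):

```python
def topological_order(g: TaskGraph) -> list[Task]:
    """Kahn's algorithm: return tasks in an order compatible with all edges."""
    indeg = {t.name: 0 for t in g.tasks}
    for (_, dst) in g.edges:
        indeg[dst] += 1
    by_name = {t.name: t for t in g.tasks}
    frontier = [n for n, d in indeg.items() if d == 0]
    order = []
    while frontier:
        n = frontier.pop()
        order.append(by_name[n])
        for (src, dst) in g.edges:
            if src == n:
                indeg[dst] -= 1
                if indeg[dst] == 0:
                    frontier.append(dst)
    return order

def list_schedule(g: TaskGraph, nb_cores: int) -> dict[str, tuple[int, int]]:
    """Place each task, in topological order, on the core where it can start
    earliest; no backtracking. Returns task name -> (core, start time)."""
    core_free = [0] * nb_cores          # earliest time each core becomes idle
    finish: dict[str, int] = {}
    placement: dict[str, tuple[int, int]] = {}
    for task in topological_order(g):
        ready = max((finish[p] for p in g.predecessors(task)), default=0)
        core = min(range(nb_cores), key=lambda c: max(core_free[c], ready))
        start = max(core_free[core], ready)
        end = start + task.wcet         # the real method adds read/write delays here
        core_free[core] = end
        finish[task.name] = end
        placement[task.name] = (core, start)
    return placement
```

Each task is greedily placed on the core where it can start earliest; the paper's heuristic additionally decides, per communication, whether to allow or forbid bus concurrency.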
The proposed scheduling techniques are evaluated experimentally. The length of the schedules generated by our heuristic is compared to that of an equivalent baseline scheduling technique accounting for worst-case contention. The experimental evaluation also studies the benefit of allowing concurrent bus accesses, as compared to related work where concurrent accesses are systematically avoided in the generated schedule. Finally, we study the time required by the proposed techniques, as well as how schedule lengths vary when changing architectural parameters such as the duration of one slot of the round-robin bus. The experimental evaluation uses a subset of the StreamIT streaming benchmarks [25] as well as synthetic task graphs produced with the TGFF graph generator [11].
The contributions of this work are threefold:
1. First, we propose a novel approach to derive precise bounds on worst-case contention on a shared round-robin bus. Compared to previous methods, we use knowledge of the application's structure (task graph) as well as knowledge of task placement and scheduling to precisely estimate which tasks execute in parallel, and thus tighten the worst-case bus access delay.

2. Second, we present two scheduling methods that calculate a time-triggered partitioned schedule, using an ILP formulation and a heuristic. The novelty with respect to existing scheduling techniques lies in the ability of the scheduler to select the best strategy regarding concurrent accesses to the shared bus (allow or forbid concurrency) to minimize the overall makespan of the schedule.
3. Third, we provide experimental data to evaluate the benefit of precise estimation of contentions, as compared to the baseline estimation where NbCores - 1 tasks are granted access to the bus for every memory access. Moreover, we discuss the interest of allowing concurrency (and thus interference) between tasks, as compared to state-of-the-art techniques such as [2] where contentions are systematically avoided.
The rest of this paper details the proposed techniques and is organized as follows. Section 2 presents related studies. Assumptions on the hardware and software are given in Section 3. Section 4 details the proposed method to calculate a precise worst-case degree of interference when accessing the shared bus. Section 5 then presents the two techniques for schedule generation, using an ILP formulation and a heuristic. Section 6 presents experimental results. Concluding remarks are given in Section 7.
2 Related work
Task scheduling on multi-core platforms consists of deciding where (mapping) and when (scheduling) each task is executed. The literature on mapping/scheduling of tasks on multi-cores is vast, as approaches differ in many properties, e.g., the input task set or the type of scheduling algorithm. According to the survey by Davis and Burns [8], the three main categories of scheduling algorithms are global, semi-partitioned, and partitioned scheduling. According to their terminology, the scheduling methods presented in this paper can be classified as static, partitioned, time-triggered and non-preemptive.
Shared resources in multi-core systems may be either shared software objects (such as variables) that have to be used in mutual exclusion, or shared hardware resources (such as buses or caches) that are shared between cores according to some resource arbitration policy (TDMA, round-robin, etc.). These two classes of resources lead to different analyses to ensure that there is neither starvation nor deadlock. Dealing with shared objects is not new, and several techniques adapted from single-core systems now exist, most of them based on priority inheritance. In particular, Jarrett et al. [17] apply priority inheritance to multi-cores and propose a resource management protocol which bounds the access latency to a shared resource. Negrean et al. [20] provide a method to compute the blocking time induced by concurrent tasks in order to determine their response time.
Beyond shared objects, multi-core processors feature hardware resources that may be accessed concurrently by tasks running on different cores. Typical hardware resources are the main memory, the memory bus, or a shared last-level cache. A contention analysis then has to be defined to determine the worst-case delay for a task to gain access to the resource (see [12] for a survey). Some shared resources directly implement a timing isolation mechanism between cores, such as Time Division Multiple Access (TDMA) buses, making contention analysis straightforward.
To avoid the resource under-utilization caused by TDMA, other resource sharing strategies such as round-robin offer a good trade-off between predictability and resource usage. Worst-case bounds on contention are similar to those of TDMA; however, knowledge about the system may help tighten estimated contention delays.
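To see why (our notation, for illustration): with slot length \(S\), a request issued at the worst instant waits at most
\[
wait_{TDMA} = (NbCores - 1)\, S,
\qquad
wait_{RR} \le \min(nbConcurrent,\ NbCores - 1)\, S,
\]
where \(nbConcurrent\) is the number of cores that can actually issue competing requests during that interval. A TDMA arbiter lets empty slots pass by, whereas a work-conserving round-robin arbiter skips idle cores, so any knowledge bounding \(nbConcurrent\) directly tightens the round-robin bound.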
Approaches to estimate contention delays for round-robin arbitration differ according to the nature and amount of information used to estimate contention delays. For architectures with caches, Dasari et al. [6, 7] only assume the task mapping known, whereas Rihani et al. [16] assume both the mapping and the execution order on each core known. Schliecker et al. [22] tightly determine the number of interfering bus requests. In comparison with these works, our technique jointly calculates task scheduling and contentions, with the objective of minimizing the schedule makespan, by letting the technique decide when it is necessary to avoid or to account for interference.
Further refinements of contention costs can be obtained by using specific task models. Pellizzoni et al. [21] introduced the PRedictable Execution Model (PREM), which splits a task into a read communication phase and an execute phase. This allows accesses to shared hardware resources to be precisely identified. In our work, we use a model very close to the PREM task model.
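A sketch of such a task, continuing the Python classes above (the three-phase split with an explicit write-back and all field names are illustrative, not the paper's exact definitions):

```python
from dataclasses import dataclass

@dataclass
class PremTask:
    name: str
    read_bytes: int    # read phase: copy code/input data from memory to the SPM
    wcet: int          # execute phase: compute from the SPM only, no bus access
    write_bytes: int   # write phase: copy modified data back to main memory

def phase_slots(n_bytes: int, bytes_per_slot: int) -> int:
    """Number of bus slots a communication phase occupies (ceiling division)."""
    return -(-n_bytes // bytes_per_slot)
```

Because all bus traffic is confined to the communication phases, only those phases need to be considered when bounding contention.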
The PREM task model, or similar ones, was used in several research works [1, 2]. Alhammad and Pellizzoni [1] proposed a heuristic to map and schedule a fork/join graph onto a multi-core in a contention-free manner. They split the graph into sequential or parallel segments, and then schedule each segment. In contrast to us, they consider only code and local data accesses in contention estimation, leaving global shared variables in the main external memory, with worst-case concurrency assumed when accessing them. Moreover, we deal with any DAG, not just fork/join graphs, and write modified data back to memory only when required. Becker et al. [2] proposed an ILP formulation and a heuristic aiming at scheduling periodic independent PREM-based tasks on one cluster of a Kalray MPPA processor. They systematically create a contention-free schedule. Our work differs in the considered task model as well as in the goal to reach: they consider sporadic independent tasks for which they aim at finding a valid schedule that meets each task's deadline, whereas we consider one iteration of a task graph and aim at finding the shortest schedule. In addition, our resulting schedule might include overlapping communications due to scheduler decisions, while [1, 2] only allow synchronized communication.
Giannopoulou et al. [13] proposed a combined analysis of computation, memory and communication scheduling in a mixed-criticality setting, for cluster-based architectures such as the Kalray MPPA. Similarly to our work, the authors aim, among other goals, at precisely estimating contention delays, in particular by identifying tasks that may run in parallel under the FTTS schedule, which uses synchronization barriers. However, to the best of our knowledge, they do not take advantage of the application structure, in particular dependencies between tasks, to further refine contention delays.
In order to reduce the impact of communication delays on schedules, [4, 14] hide communication requests while a computation task is running, exploiting the asynchrony offered by DMA requests. However, they assume worst-case contention, which could be refined by our study. In addition, shared-resource interference can be accounted for at scheduling time in order to tighten the overall makespan of the resulting schedule.
To quantify memory interference on DRAM banks, [19, 27] proposed two analyses: request-driven and job-driven. The former bounds memory request delays considering memory interference on the DRAM bank, while the latter adds the worst-case concurrency on the data bus of the DRAM. Their work is orthogonal to ours: their request-driven analysis would refine the access-time part of our delay, while our method could refine their job-driven analysis by decreasing the amount of concurrency they assume.
3 System model
3.1 Architecture model
We consider a multi-core architecture in which every core is connected to a main external memory through a bus. Each core either has private access to a ScratchPad Memory (SPM) (e.g., Patmos [23]) or there exists a mechanism for bank privatization (e.g., Kalray MPPA [9]). Such a private memory allows computations to be performed without any access to the shared bus, once data/code has first been fetched from the main memory. For each core, data is transferred between the SPM and the main memory through a shared bus.
Communications are blocking and indivisible. The sender core initiates a memory request, then waits until the request is fully complete (blocking communication), i.e., until the data is entirely transferred between the SPM and the main memory; an ongoing transfer cannot be interrupted (indivisible communication).
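A minimal sketch of what this blocking, indivisible semantics means for timing on a slot-based round-robin bus (the function and parameter names are ours; the paper's exact delay model is developed in Sections 4 and 5):

```python
def transfer_completion(request_time: int, n_slots: int,
                        n_contenders: int, slot_len: int) -> int:
    """Latest completion time of a blocking, indivisible transfer needing
    `n_slots` bus slots when at most `n_contenders` other cores may compete:
    each of our slots can be preceded by one slot per contender."""
    return request_time + n_slots * (n_contenders + 1) * slot_len
```

Under the worst-case baseline, `n_contenders` equals NbCores - 1 for every transfer; the scheduling strategies of this paper shrink it using the structure of the schedule under construction.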

References
A Theorem on Boolean Matrices (journal article). TL;DR: Proves the validity of an algorithm whose running time grows only slightly faster than the square of d, whereas the running times of competing algorithms increase, other things being equal, as the cube of d.
StreamIt: A Language for Streaming Applications (book chapter). TL;DR: The StreamIt language provides novel high-level representations to improve programmer productivity and program robustness within the streaming domain, and the StreamIt compiler aims to improve the performance of streaming applications via stream-specific analyses and optimizations.
TGFF: Task Graphs for Free. TL;DR: A user-controllable, general-purpose, pseudorandom task graph generator called Task Graphs For Free, which can generate independent tasks as well as task sets composed of partially ordered task graphs.
A Survey of Hard Real-Time Scheduling for Multiprocessor Systems (journal article). TL;DR: The survey outlines fundamental results about multiprocessor real-time scheduling that hold independent of the scheduling algorithms employed, provides a taxonomy of the different scheduling methods, and considers the various performance metrics that can be used for comparison purposes.