
HAL Id: hal-01655383
https://hal.archives-ouvertes.fr/hal-01655383
Submitted on 5 Dec 2017
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Tightening Contention Delays While Scheduling Parallel
Applications on Multi-core Architectures
Benjamin Rouxel, Steven Derrien, Isabelle Puaut
To cite this version:
Benjamin Rouxel, Steven Derrien, Isabelle Puaut. Tightening Contention Delays While Scheduling Parallel Applications on Multi-core Architectures. ACM Transactions on Embedded Computing Systems (TECS), ACM, 2017, 16 (5s), pp.1-20. ⟨10.1145/3126496⟩. ⟨hal-01655383⟩

Tightening contention delays while scheduling parallel
applications on multi-core architectures
Benjamin Rouxel¹, Steven Derrien¹, and Isabelle Puaut¹
¹University of Rennes 1/IRISA
September 19, 2017
Abstract
Multi-core systems are increasingly interesting candidates for executing parallel real-time applications, in avionic, space or automotive industries, as they provide both computing capabilities and power efficiency. However, ensuring that timing constraints are met on such platforms is challenging, because some hardware resources are shared between cores.
Assuming worst-case contentions when analyzing the schedulability of applications may result in systems mistakenly declared unschedulable, although the worst-case level of contentions can never occur in practice. In this paper, we present two contention-aware scheduling strategies that produce a time-triggered schedule of the application's tasks. Based on knowledge of the application's structure, our scheduling strategies precisely estimate the effective contentions, in order to minimize the overall makespan of the schedule. An Integer Linear Programming (ILP) solution of the scheduling problem is presented, as well as a heuristic solution that generates schedules very close to ones of the ILP (5% longer on average), with a much lower time complexity. Our heuristic improves by 19% the overall makespan of the resulting schedules compared to a worst-case contention baseline.
Keywords: Real-time Systems, Contention-Aware Scheduling
This work was partially supported by ARGO (http://www.argo-project.eu/), funded by the European
Commission under Horizon 2020 Research and Innovation Action, Grant Agreement Number 688131.
1 Introduction
The increasing demand for computing power and low energy consumption is making multi-/many-core architectures increasingly attractive candidates for executing embedded critical systems. It is becoming more and more common to find such architectures in the automotive, avionic or space industries [2, 16].
Guaranteeing that timing constraints are met on multi-core platforms is a challenging issue. One difficulty lies in the estimation of the Worst-Case Execution Time (WCET) of tasks. Due to the presence of shared hardware resources (buses, shared last level of cache, ...), techniques designed for single-core architectures cannot be directly applied to multi-core ones. Since it is hard in general to guarantee the absence of resource conflicts during execution, current WCET techniques either produce pessimistic WCET estimates or constrain the execution to enforce the absence of conflicts, at the price of a significant hardware under-utilization.
benjamin.rouxel@irisa.fr
steven.derrien@irisa.fr
isabelle.puaut@irisa.fr

A second issue is the selection of a scheduling strategy, which decides where and when to execute tasks. Scheduling for multi-core platforms has been the subject of many research works, surveyed in [8]. We believe that static mapping of tasks to cores (partitioned scheduling) combined with time-triggered scheduling on each core provides control over the sharing of hardware resources, and thus allows worst-case contention delays to be estimated more tightly.
Some existing work on multi-core scheduling considers that the platform workload consists of independent tasks. As parallel execution is the most promising solution to improve performance, we envision that within only a few years from now, real-time workloads will evolve toward parallel programs. The timing behaviour of such programs is challenging to analyze because they consist of dependent tasks interacting through complex synchronization/communication mechanisms. We believe that models offering a high-level view of the behavior of parallel programs allow a precise estimation of shared resource conflicts. In this paper, we assume parallel applications modeled as directed acyclic task graphs (DAGs), and show that knowledge of the application's structure allows a precise estimation of which tasks effectively execute in parallel, and thus of contention delays. These DAGs do not necessarily need to be built from scratch, which would require a significant engineering effort. Automatic extraction of parallelism, for instance from a high-level description of applications in model-based design workflows [10], looks to us a much more promising direction.
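To make the task model concrete, the following is a minimal sketch of such a DAG representation (the class and field names are ours, for illustration only; the paper's formal model is given in Section 3):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    wcet: int          # worst-case execution time of the compute phase (cycles)
    reads: int = 0     # bytes fetched from main memory before executing
    writes: int = 0    # bytes written back to main memory afterwards

@dataclass
class TaskGraph:
    tasks: list[Task] = field(default_factory=list)
    edges: set[tuple[str, str]] = field(default_factory=set)  # (producer, consumer)

    def predecessors(self, task: Task) -> list[str]:
        """Names of the tasks that must complete before `task` may start."""
        return [src for (src, dst) in self.edges if dst == task.name]
```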
In this paper, we present two mapping and scheduling strategies featuring bus contention awareness. Both strategies apply to multi-core platforms where cores are connected to a round-robin bus. A safe (but pessimistic) bound for the access latency is to consider NbCores - 1 contending tasks being granted access to the bus (with NbCores the number of available cores). Our scheduling strategies take into consideration the application's structure and information on the schedule under construction to precisely estimate the effective degree of interference used to compute the access latency. The proposed scheduling strategies generate a non-preemptive time-triggered partitioned schedule and select the appropriate level of contention to minimize the schedule length.
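As a worked example of this baseline bound (our notation; the paper's own interference analysis appears in Section 4): if a transfer needs \(k\) bus slots of duration \(S\), then under round-robin arbitration each of those slots may be preceded by one slot granted to each of the \(NbCores - 1\) contenders, giving
\[
delay_{worst} = k \times (NbCores - 1) \times S + k \times S = k \times NbCores \times S .
\]
Replacing \(NbCores - 1\) by the number of tasks that can actually overlap with the request is precisely the tightening pursued in this paper.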
The first proposed scheduling method models the task mapping and scheduling problem as constraints on task assignment, task start times, and communications between tasks. We demonstrate that the optimal schedule can only be found using quadratic equations, due to the nature of the information required to model the communication cost. This model is then refined into an Integer Linear Programming (ILP) formulation that in some cases overestimates communication costs and thus may not find the shortest schedule. Since the scheduling problem being solved is NP-hard, the ILP formulation is shown not to scale properly when the number of tasks grows. We therefore developed a heuristic scheduling technique that scales much better with the number of tasks and is able to compute the accurate communication cost. Albeit not always finding the optimal solution, the ILP formulation is included in this paper because it gives an unambiguous description of the problem under study, and also serves as a baseline to evaluate the quality of the proposed heuristic technique.
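The basic idea of the heuristic is forward list scheduling: tasks are ordered and then added one by one to the schedule without backtracking. For intuition, here is a generic skeleton of that scheme, reusing the `TaskGraph` sketch above (this is our illustration, deliberately omitting the contention-aware communication delays that are the paper's contribution):

```python
def topological_order(g: TaskGraph) -> list[Task]:
    """Kahn's algorithm: return tasks in an order compatible with all edges."""
    indeg = {t.name: 0 for t in g.tasks}
    for (_, dst) in g.edges:
        indeg[dst] += 1
    by_name = {t.name: t for t in g.tasks}
    frontier = [n for n, d in indeg.items() if d == 0]
    order = []
    while frontier:
        n = frontier.pop()
        order.append(by_name[n])
        for (src, dst) in g.edges:
            if src == n:
                indeg[dst] -= 1
                if indeg[dst] == 0:
                    frontier.append(dst)
    return order

def list_schedule(g: TaskGraph, nb_cores: int) -> dict[str, tuple[int, int]]:
    """Place each task, in topological order, on the core where it can start
    earliest; no backtracking. Returns task name -> (core, start time)."""
    core_free = [0] * nb_cores          # earliest time each core becomes idle
    finish: dict[str, int] = {}
    placement: dict[str, tuple[int, int]] = {}
    for task in topological_order(g):
        ready = max((finish[p] for p in g.predecessors(task)), default=0)
        core = min(range(nb_cores), key=lambda c: max(core_free[c], ready))
        start = max(core_free[core], ready)
        end = start + task.wcet         # the real method adds read/write delays here
        core_free[core] = end
        finish[task.name] = end
        placement[task.name] = (core, start)
    return placement
```

Each task is greedily placed on the core where it can start earliest; the paper's heuristic additionally decides, per communication, whether to allow or forbid bus concurrency.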
The proposed scheduling techniques are evaluated experimentally. The length of the schedules generated by our heuristic is compared to that of an equivalent baseline scheduling technique accounting for worst-case contention. The experimental evaluation also studies the benefit of allowing concurrent bus accesses, as compared to related work where concurrent accesses are systematically avoided in the generated schedule. Finally, we study the time required by the proposed techniques, as well as how schedule lengths vary when changing architectural parameters such as the duration of one slot of the round-robin bus. The experimental evaluation uses a subset of the StreamIT streaming benchmarks [25] as well as synthetic task graphs produced with the TGFF graph generator [11].
The contributions of this work are threefold:
1. First, we propose a novel approach to derive precise bounds on worst-case contention on a shared round-robin bus. Compared to previous methods, we use knowledge of the application's structure (task graph) as well as knowledge of task placement and scheduling to precisely estimate which tasks execute in parallel, and thus tighten the worst-case bus access delay.

2. Second, we present two scheduling methods that calculate a time-triggered partitioned schedule, using an ILP formulation and a heuristic. The novelty with respect to existing scheduling techniques lies in the ability of the scheduler to select the best strategy regarding concurrent accesses to the shared bus (allow or forbid concurrency) to minimize the overall makespan of the schedule.
3. Third, we provide experimental data to evaluate the benefit of precise estimation of contentions, as compared to the baseline estimation where NbCores - 1 tasks are granted access to the bus for every memory access. Moreover, we discuss the interest of allowing concurrency (and thus interference) between tasks, as compared to state-of-the-art techniques such as [2] where contentions are systematically avoided.
The rest of this paper details the proposed techniques and is organized as follows. Section 2 presents related studies. Assumptions on the hardware and software are given in Section 3. Section 4 details the proposed method to calculate a precise worst-case degree of interference when accessing the shared bus. Section 5 then presents the two techniques for schedule generation, using an ILP formulation and a heuristic. Section 6 presents experimental results. Concluding remarks are given in Section 7.
2 Related work
Task scheduling on multi-core platforms consists of deciding where (mapping) and when (scheduling) each task is executed. The literature on mapping/scheduling of tasks on multi-cores is vast, as approaches differ in many properties, e.g., the input task set or the type of scheduling algorithm. According to the survey by Davis and Burns [8], the three main categories of scheduling algorithms are global, semi-partitioned, and partitioned scheduling. According to their terminology, the scheduling methods presented in this paper can be classified as static, partitioned, time-triggered and non-preemptive.
Shared resources in multi-core systems may be either shared software objects (such as variables) that have to be used in mutual exclusion, or shared hardware resources (such as buses or caches) that are shared between cores according to some resource arbitration policy (TDMA, round-robin, etc.). These two classes of resources lead to different analyses to ensure that there is neither starvation nor deadlock. Dealing with shared objects is not new, and several techniques adapted from single-core systems now exist, most of them based on priority inheritance. In particular, Jarrett et al. [17] apply priority inheritance to multi-cores and propose a resource management protocol which bounds the access latency to a shared resource. Negrean et al. [20] provide a method to compute the blocking time induced by concurrent tasks in order to determine their response time.
Beyond shared objects, multi-core processors feature hardware resources that may be accessed concurrently by tasks running on different cores. Typical hardware resources are the main memory, the memory bus, or a shared last-level cache. A contention analysis then has to be defined to determine the worst-case delay for a task to gain access to the resource (see [12] for a survey). Some shared resources directly implement a timing isolation mechanism between cores, such as Time Division Multiple Access (TDMA) buses, making contention analysis straightforward.
To avoid the resource under-utilization caused by TDMA, other resource sharing strategies such as round-robin offer a good trade-off between predictability and resource usage. Worst-case bounds on contention are similar to those of TDMA; however, knowledge about the system may help tighten estimated contention delays.
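To see why (our notation, for illustration): with slot length \(S\), a request issued at the worst instant waits at most
\[
wait_{TDMA} = (NbCores - 1)\, S,
\qquad
wait_{RR} \le \min(nbConcurrent,\ NbCores - 1)\, S,
\]
where \(nbConcurrent\) is the number of cores that can actually issue competing requests during that interval. A TDMA arbiter lets empty slots pass by, whereas a work-conserving round-robin arbiter skips idle cores, so any knowledge bounding \(nbConcurrent\) directly tightens the round-robin bound.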
Approaches to estimate contention delays for round-robin arbitration differ according to the nature and amount of information used to estimate contention delays. For architectures with caches, Dasari et al. [6, 7] only assume the task mapping known, whereas Rihani et al. [16] assume both the mapping and the execution order on each core known. Schliecker et al. [22] tightly determine the number of interfering bus requests. In comparison with these works, our technique jointly calculates task scheduling and contentions, with the objective of minimizing the schedule makespan, by letting the technique decide when it is necessary to avoid or to account for interference.
Further refinements of contention costs can be obtained by using specific task models. Pellizzoni et al. [21] introduced the PRedictable Execution Model (PREM), which splits a task into a read communication phase and an execute phase. This allows accesses to shared hardware resources to be precisely identified. In our work, we use a model very close to the PREM task model.
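A sketch of such a task, continuing the Python classes above (the three-phase split with an explicit write-back and all field names are illustrative, not the paper's exact definitions):

```python
from dataclasses import dataclass

@dataclass
class PremTask:
    name: str
    read_bytes: int    # read phase: copy code/input data from memory to the SPM
    wcet: int          # execute phase: compute from the SPM only, no bus access
    write_bytes: int   # write phase: copy modified data back to main memory

def phase_slots(n_bytes: int, bytes_per_slot: int) -> int:
    """Number of bus slots a communication phase occupies (ceiling division)."""
    return -(-n_bytes // bytes_per_slot)
```

Because all bus traffic is confined to the communication phases, only those phases need to be considered when bounding contention.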
The PREM task model, or similar ones, was used in several research works [1, 2]. Alhammad and Pellizzoni [1] proposed a heuristic to map and schedule a fork/join graph onto a multi-core in a contention-free manner. They split the graph into sequential or parallel segments, and then schedule each segment. In contrast to us, they consider only code and local data accesses in contention estimation, leaving global shared variables in the main external memory, with worst-case concurrency assumed when accessing them. Moreover, we deal with any DAG, not just fork/join graphs, and write modified data back to memory only when required. Becker et al. [2] proposed an ILP formulation and a heuristic aiming at scheduling periodic independent PREM-based tasks on one cluster of a Kalray MPPA processor. They systematically create a contention-free schedule. Our work differs in the considered task model as well as in the goal to reach: they consider sporadic independent tasks for which they aim at finding a valid schedule that meets each task's deadline, whereas we consider one iteration of a task graph and aim at finding the shortest schedule. In addition, our resulting schedule might include overlapping communications due to scheduler decisions, while [1, 2] only allow synchronized communication.
Giannopoulou et al. [13] proposed a combined analysis of computation, memory and communication scheduling in a mixed-criticality setting, for cluster-based architectures such as the Kalray MPPA. Similarly to our work, the authors aim, among other goals, at precisely estimating contention delays, in particular by identifying tasks that may run in parallel under the FTTS schedule, which uses synchronization barriers. However, to the best of our knowledge, they do not take advantage of the application structure, in particular dependencies between tasks, to further refine contention delays.
In order to reduce the impact of communication delays on schedules, [4, 14] hide communication requests while a computation task is running, exploiting the asynchrony offered by DMA requests. However, they assume worst-case contention, which could be refined by our study. In addition, shared-resource interference can be accounted for at scheduling time in order to tighten the overall makespan of the resulting schedule.
To quantify memory interference on DRAM banks, [19, 27] proposed two analyses: request-driven and job-driven. The former bounds memory request delays considering memory interference on the DRAM bank, while the latter adds the worst-case concurrency on the data bus of the DRAM. Their work is orthogonal to ours: their request-driven analysis would refine the access-time part of our delay, while our method could refine their job-driven analysis by decreasing the amount of concurrency they assume.
3 System model
3.1 Architecture model
We consider a multi-core architecture in which every core is connected to a main external memory through a bus. Each core either has private access to a ScratchPad Memory (SPM) (e.g., Patmos [23]) or there exists a mechanism for bank privatization (e.g., Kalray MPPA [9]). Such a private memory allows computations to be performed without any access to the shared bus, once data/code has first been fetched from the main memory. For each core, data is transferred between the SPM and the main memory through a shared bus.
Communications are blocking and indivisible. The sender core initiates a memory request, then waits until the request is fully complete (blocking communication), i.e., until the data is entirely transferred between the SPM and the main memory; an ongoing transfer cannot be interrupted (indivisible communication).
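A minimal sketch of what this blocking, indivisible semantics means for timing on a slot-based round-robin bus (the function and parameter names are ours; the paper's exact delay model is developed in Sections 4 and 5):

```python
def transfer_completion(request_time: int, n_slots: int,
                        n_contenders: int, slot_len: int) -> int:
    """Latest completion time of a blocking, indivisible transfer needing
    `n_slots` bus slots when at most `n_contenders` other cores may compete:
    each of our slots can be preceded by one slot per contender."""
    return request_time + n_slots * (n_contenders + 1) * slot_len
```

Under the worst-case baseline, `n_contenders` equals NbCores - 1 for every transfer; the scheduling strategies of this paper shrink it using the structure of the schedule under construction.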

References
A Theorem on Boolean Matrices (journal article). TL;DR: Proves the validity of an algorithm whose running time grows only slightly faster than the square of d, whereas the running times of competing algorithms increase, other things being equal, as the cube of d.
StreamIt: A Language for Streaming Applications (book chapter). TL;DR: The StreamIt language provides novel high-level representations to improve programmer productivity and program robustness within the streaming domain, and the StreamIt compiler aims to improve the performance of streaming applications via stream-specific analyses and optimizations.
TGFF: Task Graphs for Free. TL;DR: A user-controllable, general-purpose, pseudorandom task graph generator called Task Graphs For Free, which can generate independent tasks as well as task sets composed of partially ordered task graphs.
A Survey of Hard Real-Time Scheduling for Multiprocessor Systems (journal article). TL;DR: The survey outlines fundamental results about multiprocessor real-time scheduling that hold independent of the scheduling algorithms employed, provides a taxonomy of the different scheduling methods, and considers the various performance metrics that can be used for comparison purposes.