

Fault-tolerant rate-monotonic first-fit scheduling in hard-real-time systems

01 Sep 1999-IEEE Transactions on Parallel and Distributed Systems (IEEE)-Vol. 10, Iss: 9, pp 934-945

TL;DR: In this paper, fault tolerance is implemented by using a novel duplication technique where each task scheduled on a processor has either an active backup copy or a passive backup copy planned on a different processor.

Abstract: Hard-real-time systems require predictable performance despite the occurrence of failures. In this paper, fault tolerance is implemented by using a novel duplication technique where each task scheduled on a processor has either an active backup copy or a passive backup copy scheduled on a different processor. An active copy is always executed, while a passive copy is executed only in the case of a failure. First, the paper considers the ability of the widely-used rate-monotonic scheduling algorithm to meet the deadlines of periodic tasks in the presence of a processor failure. In particular, the completion time test is extended so as to check the schedulability on a single processor of a task set including backup copies. Then, the paper extends the well-known rate-monotonic first-fit assignment algorithm, where all the task copies, including the backup copies, are considered by rate-monotonic priority order and assigned to the first processor in which they fit. The proposed algorithm determines which tasks must use the active duplication and which can use the passive duplication. Passive duplication is preferred whenever possible, so as to overbook each processor with many passive copies whose primary copies are assigned to different processors. Moreover, the space allocated to active copies is reclaimed as soon as a failure is detected. Passive copy overbooking and active copy deallocation allow many passive copies to be scheduled sharing the same time intervals on the same processor, thus reducing the total number of processors needed. Simulation studies reveal a remarkable saving of processors with respect to those needed by the usual active duplication approach in which the schedule of the non-fault-tolerant case is duplicated on two sets of processors.

Summary

1 INTRODUCTION

  • Throughout industrial computing, there is an increasing demand for more complex and sophisticated hard-real-time computing systems.
  • An active copy presents the advantages of requiring no synchronization with its primary copy (it can run before, after, or concurrently with the other copy) and of having a larger time window for execution, namely, the whole period of the task.
  • In particular, this paper extends the RMFF algorithm to tolerate failures under the assumption that processors fail in a fail-stop manner.

2 BACKGROUND

  • This section gives a formal definition of the scheduling problem and a precise specification of the fault tolerance model.
  • Moreover, important properties of the well-known RM, CTT, and RMFF algorithms are recalled.

2.1 The Scheduling Problem

  • The requests for τi are periodic, with constant interval Ti between every two consecutive requests, and τi's first request occurs at time 0.
  • The worst case execution time for all the requests of τi is constant and equal to Ci, with Ci ≤ Ti. Periodic tasks τ1, ..., τn are independent, that is, the requests of any task do not depend on the execution of the other tasks.
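The task model in these bullets can be sketched as a small record (an illustrative sketch, not code from the paper; the names `Task`, `load`, and `total_load` are ours):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    c: float  # worst-case execution time C_i
    t: float  # request period T_i (deadline = end of period)

    def load(self) -> float:
        # U_i = C_i / T_i
        return self.c / self.t

def total_load(tasks) -> float:
    # U = sum of the individual task loads U_i
    return sum(task.load() for task in tasks)

# Example: the two tasks used later in the paper's RM example.
tasks = [Task(1, 3), Task(3, 5)]
print(total_load(tasks))  # 1/3 + 3/5 ≈ 0.9333
```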

2.2 The Fault-Tolerant Model

  • The fault-tolerant scheduling problem consists of finding a schedule for the tasks so as to satisfy the following additional condition: (S4) fault tolerance is guaranteed, namely, conditions (S1)-(S3) are verified even in the presence of failures.
  • In order to achieve fault tolerance, two copies for each task are used, called primary and backup copies.
  • In practice, Di is smaller than Ci, since backup copies usually provide a reduced functionality in a smaller execution time than the primary copies.
  • This message is small: since it must contain the indices of the primary task and of the sender and receiver processors, its size is O(log n) bits.
  • The overhead needed for such processor failure detections is mainly given by the short-message latency of the communication subsystem employed.

2.3 The Rate-Monotonic Algorithm

  • Liu and Layland [10] proposed a fixed-priority scheduling algorithm, called Rate-Monotonic (RM), for solving the (nonfault-tolerant) problem stated in Section 2.1 on a single processor system, that is, when m = 1.
  • At any instant of time, a pending task with the highest priority is scheduled.
  • Liu and Layland proved the following two important results concerning fixed-priority scheduling algorithms.
  • The largest response time for any periodic request of τi occurs whenever τi is requested simultaneously with the requests for all higher priority tasks.
  • A critical instant occurs when all tasks are in phase at time zero, which is called critical instant phasing, because it is the phasing that results in the longest response time for the first request of each task.

2.4 The Completion Time Test

  • From Theorems 1 and 2, the following necessary and sufficient schedulability criterion was derived by Joseph and Pandya [5], as discussed also in [8].
  • This schedulability test is called Completion Time Test (CTT).
  • It is worth noting that, by Theorem 3, the schedulability of lower priority tasks does not guarantee the schedulability of higher priority tasks.
  • Therefore, in order to check the schedulability of a set of tasks, each task must get through the CTT when it is scheduled with all higher priority tasks.

2.5 The Rate-Monotonic First-Fit

  • Dhall and Liu [3] generalized the RM algorithm to accommodate multiprocessor systems.
  • In particular, they proposed the so called Rate-Monotonic First-Fit (RMFF) algorithm.
  • It is a partitioning algorithm, where tasks are first assigned to processors following the RM priority order and then all the tasks assigned to the same processor are scheduled with the RM algorithm.
  • Dhall and Liu showed that, using a schedulability condition weaker than CTT, RMFF uses about 2.33U processors in the worst case, where U is the load of the task set.
  • In practice, however, RMFF remains competitive, for its simplicity and efficiency.
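The first-fit rule described in these bullets can be sketched as follows (our own illustration, not the authors' pseudocode; `ctt_ok` and `rmff` are hypothetical names, and the schedulability check used here is the Completion Time Test rather than Dhall and Liu's weaker condition):

```python
import math

def ctt_ok(tasks):
    """tasks: (C, T) pairs in nondecreasing-period (RM priority) order.
    True if every task passes the Completion Time Test under RM."""
    for i, (_, t_i) in enumerate(tasks):
        hp = tasks[:i + 1]                     # task i plus higher-priority tasks
        s = sum(c for c, _ in hp)              # S_0
        while s <= t_i:
            w = sum(c * math.ceil(s / t) for c, t in hp)
            if w == s:                         # fixed point: completion time found
                break
            s = w
        if s > t_i:                            # deadline T_i missed
            return False
    return True

def rmff(tasks):
    """First-fit partitioning: each task goes to the first processor
    on which it is still schedulable; returns per-processor task lists."""
    procs = []
    for task in sorted(tasks, key=lambda ct: ct[1]):  # RM priority order
        for p in procs:
            if ctt_ok(p + [task]):
                p.append(task)
                break
        else:
            procs.append([task])               # open a new processor
    return procs

print(len(rmff([(1, 3), (3, 5), (2, 4)])))  # 2
```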

3 OVERVIEW OF THE FAULT-TOLERANT RMFF ALGORITHM

  • This section provides an informal high-level description of the proposed Fault-Tolerant Rate-Monotonic First-Fit algorithm.
  • The algorithm prefers to schedule a backup copy as a passive copy whenever possible, so as to overbook each processor with more passive copies whose primary copies are assigned to different processors.
  • Clearly, with another ordering, a higher priority task can be assigned to the same processor after τi.
  • τi must be schedulable together with all the primary and active backup copies already assigned to Pj.
  • These conditions are analogous to those of (A1), with the difference that the second one takes into account the situation where the failed processor is the one running the primary copy of τi.

4.1 Schedulability Criteria

  • In order to extend Theorem 3, consider a generic task set containing both primary and backup copies which must be scheduled all together on a single processor.
  • The task set considered for processor Pj includes the active backup copies assigned to that processor.
  • Let the set of periodic tasks assigned to processor Pj be given in priority order, and consider the case of a failure of a processor.
  • Hence, the proof follows from Theorem 3.

4.2 Fault-Tolerant CTT

  • Based on Theorems 4 and 5, two kinds of schedulability tests are needed, one to check for schedulability in the absence of failures, and the other to check for schedulability after a processor failure.
  • A test is needed for each processor Pj, since the recovery from the failure of any processor other than Pj must be taken into account.

4.3 Fault-Tolerant RMFF

  • The first step assigns the primary copy of τi to the first processor in which it fits.
  • The second step establishes the recovery time Bi and the status of τi's backup copy.
  • Thus, duplicating on two sets of processors the schedule for the nonfault-tolerant case requires at least four processors to tolerate one failure.
  • The proposed FTRMFF algorithm, instead, tolerates one failure using three processors only.
  • The procedure FTRMFF-Assignment is executed off-line and requires O(nm²) schedulability tests to be performed.

4.4 Recovery from a Processor Failure

  • Once an assignment is found by the FTRMFF algorithm, the uncompleted tasks assigned to a failed processor Pf are recovered by the remaining nonfaulty processors.
  • Note also that, in any case, all the active backup copies of primary tasks scheduled on the nonfaulty processors are deallocated; the FTRMFF-Recovery procedure performs these steps in parallel on all the processors.
  • The procedure FTRMFF-Recovery is executed on-line and is very fast, since all the required sets, including passiveRecover(Pj, Pf) and recover(Pj, Pf), were previously computed off-line by the FTRMFF-Assignment procedure, which already made all the schedulability tests, too.

4.5 Tolerating Many Processor Failures

  • In order to tolerate many processor failures, spare processors must be employed to replace failed processors on-line.
  • The failure of Pf is detected within the closest completion time of the task set primary(Pf) ∪ active(Pf), and the time interval between two consecutive failures is three times the largest task request period.
  • During the recovery phase, all the passive copies of the uncompleted tasks assigned to Pf are executed by the nonfaulty processors only once (step 1.1.2 of FTRMFF-Replacing), and the spare processor Ps inherits the assignment of the failed processor.
  • The reconfiguration phase is completed by time 2Tn.
  • If there are q spare processors, q faulty processors can be replaced by means of the FTRMFF-Replacing procedure, while one additional failure can be tolerated by means of the FTRMFF-Recovery procedure.

4.6 Tolerating Software Failures

  • In addition to processor failures, a hard-real-time system can also fail due to design faults in the software.
  • To explain the ideas of the approach, assume that two different implementations of the same task specification are provided.
  • Since the processors are assumed fail-stop, if the acceptance test fails, it signals the presence of an error in the software.
  • The time to execute the acceptance test is assumed to be included in the primary copy execution time.
  • One approach to implement the recovery from software failures is as follows.

5 SIMULATION EXPERIMENTS

  • In order to evaluate the number of processors used by the FTRMFF algorithm for scheduling both primary and backup copies, simulation experiments are performed.
  • For each chosen parameter setting, the experiment is repeated 30 times, and the average result is computed.
  • The performance metric in all the experiments is the number of processors required to assign a given task set.
  • In the outcome of the experiments, the authors denote with N the number of processors required by the FTRMFF algorithm for a task set consisting of both primary and backup task copies, and with M the number of processors required by the RMFF algorithm for a task set with identical primary copies and no backup copies.

6 CONCLUDING REMARKS

  • This paper has considered the problem of preemptively scheduling a set of independent periodic tasks under the assumption that each task deadline coincides with the next request of the same task.
  • The proposed FTRMFF algorithm extends the well-known Rate-Monotonic First-Fit scheduling algorithm to tolerate failures, uses a novel combined active/passive duplication scheme, and determines by itself which tasks should use passive duplication and which should use active duplication.
  • This optimization is left for further work.
  • It is worth noting that the proposed algorithm works also if some backup copies are forced to be active.
  • Finally, further research could deal with assignment strategies which are different from those considered in this paper.

ACKNOWLEDGMENTS

  • The C++ code used in the simulation experiments was written by Andrea Fusiello.
  • This work was supported by grants from the Ministero dell'Università e della Ricerca Scientifica e Tecnologica, the Consiglio Nazionale delle Ricerche, and the Università di Trento (Progetto Speciale 1997).


Fault-Tolerant Rate-Monotonic First-Fit
Scheduling in Hard-Real-Time Systems
Alan A. Bertossi, Luigi V. Mancini, and Federico Rossini
Index Terms: Fault tolerance, hard-real-time systems, multiprocessor systems, periodic tasks, rate-monotonic scheduling, task replication.
1 INTRODUCTION
THROUGHOUT industrial computing, there is an increasing demand for more complex and sophisticated hard-real-time computing systems. In particular, fault tolerance is one of the requirements that are playing a vital role in the design of new hard-real-time distributed systems.
A variety of schemes have been proposed to support
fault-tolerant computing in distributed systems; such
schemes can be partitioned into two broad classes. In the
first class, which employs the passive replication techniques,
a passive backup copy of a primary task is assigned to one
or more backup processors; when a primary task fails, the
passive copies of the task are restarted on the backup
processor, hence a passive copy is executed only in the
presence of a failure. In the second class, which employs the
active replication techniques, the same set of tasks is always
executed on two or more sets of processors; every primary
task has an active backup copy: if any task fails, its mirror
image will continue to execute.
Many hard-real-time scheduling problems have been
found to be NP-hard: most likely, there are no optimal
polynomial-time algorithms for them [2], [11]. In particular,
scheduling periodic tasks with arbitrary deadlines is NP-
hard, even if only a single processor is available [12].
Several heuristics for scheduling periodic tasks on uni-
processor and multiprocessor systems have been proposed.
Liu and Layland [10] introduced the Rate-Monotonic (RM)
algorithm for preemptively scheduling periodic tasks on a
single processor, under the assumption that task deadlines
are equal to their periods. Joseph and Pandya [5] later
derived the Completion Time Test (CTT) for checking
schedulability of a set of fixed-priority tasks on a single
processor. RM was generalized to multiprocessor systems
by Dhall and Liu [3], who proposed, among others, the
Rate-Monotonic First-Fit (RMFF) heuristic. More refined
heuristics for multiprocessors were proposed by Burchard,
Liebeherr, Oh, and Son [1].
It is worth noting that the RM algorithm is becoming an
industry standard because of its simplicity and flexibility. It
is a low overhead greedy algorithm, which is optimal
among all fixed-priority algorithms. Moreover, it possesses
certain advantages, for example, the implementation of
efficient schedulers for aperiodic tasks, and the retiming of
intervals in order to shed the load in a predictable fashion
[8].
As for fault-tolerant scheduling algorithms, a dynamic
programming algorithm for multiprocessors was presented
in [7] which ensures that backup schedules can be
efficiently embedded within the primary schedule. An
algorithm was proposed in [9] which generates optimal
schedules in a uniprocessor system by employing a passive
replication to tolerate software failures only. The algorithms
proposed in [14] are based on a bidding strategy and
dynamically recompute the schedule when a processor
. A.A. Bertossi is with the Dipartimento di Matematica, Università di Trento, Via Sommarive 14, 38050 Trento, Italy. E-mail: bertossi@science.unitn.it.
. L.V. Mancini is with the Dipartimento di Scienze dell'Informazione, Università di Roma "La Sapienza," Via Salaria 113, 00198 Roma, Italy. E-mail: lv.mancini@dsi.uniroma1.it.
. F. Rossini is with Telecom Italia Mobile, Area Applicazioni Informatiche, Via Tor Pagnotta 90, 00143 Roma, Italy.
Manuscript received 20 June 1997.
For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number 105271.
1045-9219/99/$10.00 © 1999 IEEE

fails, in order to redistribute the tasks among the remaining
nonfaulty processors. In [13], two algorithms are designed
which reserve time for the processing of backup tasks on
uniprocessors running fixed-priority schedulers. Finally,
the techniques of backup overbooking and backup deal-
location were introduced in [4] to achieve fault tolerance in
multiprocessor systems, but for aperiodic nonpreemptive
tasks only.
It is noted here that none of the fault-tolerant algorithms
discussed above has extended the RMFF algorithm or
combined in the same schedule both active and passive
replication of the tasks. However, the latter idea seems
potentially useful since it provides the ability to exploit the
advantages of both types of replication in the same system.
Indeed, the simplest way to achieve fault tolerance in
hard-real-time systems consists in using active duplication
for all tasks. An active copy presents the advantages of requiring no synchronization with its primary copy (it can run before, after, or concurrently with the other copy) and of having a larger time window for execution, namely, the whole period of the task. However, using active duplication
for all tasks doubles the number of processors required in
the nonfault-tolerant case. In contrast, a passive copy can be
executed only if a failure prevents the corresponding
primary copy from completing. A passive copy has the
disadvantages of having tighter timing constraints (in the worst case it is not activated until the scheduled completion time of the primary copy) and of requiring some time overhead for synchronization with the corresponding
primary copy. These drawbacks can be overcome by
choosing active replication when the scheduled completion
time of the primary copy is close to the deadline, that is to
the end of the period, and by having smaller execution
times for the backup copies. Moreover, since the time
overhead for synchronization is usually very small, it can be
included in the execution time of the primary task. Most
importantly, passive duplication has the great advantage of
overbooking the processors: many passive copies whose
primary copies are assigned to different processors can be
scheduled on the same processor so as to share the same
time interval. Indeed, under the assumption of a single
processor failure, only one of such passive copies will be
actually executed, namely, the passive copy whose primary
copy was prevented from completing because of the failure.
Moreover, if only one failure is tolerated, the space allocated
to active copies whose primary copy is not assigned to the
failed processor can be reclaimed as soon as a failure is
detected. Passive copy overbooking and active copy deal-
location allow fewer processors to be used with respect to
the case in which active duplication is used for all tasks.
The present paper considers the problem of preemp-
tively scheduling a set of independent periodic tasks on a
distributed system, such that each task deadline coincides
with the next request of the same task, and all tasks start in-
phase. In particular, this paper extends the RMFF algorithm
to tolerate failures under the assumption that processors fail
in a fail-stop manner. The algorithm determines by itself
which tasks must use active duplication and which can use
passive duplication, preferring passive duplication when-
ever possible. The rest of the paper is organized as follows.
Section 2 gives a formal definition of the scheduling
problem and a precise specification of the fault tolerance
model. Moreover, the classical RM, CTT, and RMFF
algorithms are recalled. Section 3 provides a high-level
description of the proposed Fault-Tolerant Rate-Monotonic
First-Fit (FTRMFF) algorithm. The algorithm analysis is
done in Section 4. In particular, the ability of RM to meet the
deadlines in the presence of one processor failure is
characterized in Section 4.1, and CTT is extended in Section
4.2 so as to check the schedulability on a single processor of
a task set including backup copies. Then, such an extended
CTT is used in Section 4.3 to assign task copies to processors
following a First-Fit heuristic which employs passive copy
overbooking and active copy space reclaiming. An algo-
rithm to recover from a single processor failure is shown in
Section 4.4, while extensions to tolerate both many
processor failures and software failures are presented in
Sections 4.5 and 4.6, respectively. In Section 5, simulation
experiments show that the proposed FTRMFF algorithm
requires fewer processors than the active duplication
approach. Finally, Section 6 summarizes the work and
discusses further possible extensions.
2 BACKGROUND
This section gives a formal definition of the scheduling
problem and a precise specification of the fault tolerance
model. Moreover, important properties of the well-known
RM, CTT, and RMFF algorithms are recalled.
2.1 The Scheduling Problem
A periodic task τi is completely identified by a pair (Ci, Ti), where Ci is τi's execution time and Ti is τi's request period. The requests for τi are periodic, with constant interval Ti between every two consecutive requests, and τi's first request occurs at time 0. The worst case execution time for all the (infinite) requests of τi is constant and equal to Ci, with Ci ≤ Ti. Periodic tasks τ1, ..., τn are independent, that is, the requests of any task do not depend on the execution of the other tasks. The load of a periodic task τi = (Ci, Ti) is Ui = Ci/Ti, while the load of the task set τ1, ..., τn is U = Σ_{1 ≤ i ≤ n} Ui.
Given n independent periodic tasks τ1, ..., τn and a set of identical processors, the scheduling problem consists of finding an order in which all the periodic requests of the tasks are to be executed on the processors so as to satisfy the following scheduling conditions:
(S1) integrity is preserved, that is, tasks and processors are
sequential: each task is executed by at most one
processor at a time and no processor executes more than
one task at a time;
(S2) deadlines are met, namely, each request of any task must
be completely executed before the next request of the
same task, that is, by the end of its period;
(S3) the number m of processors is minimized.
2.2 The Fault-Tolerant Model
It is assumed that the processors belong to a distributed
system and are connected by some kind of communication
BERTOSSI ET AL.: FAULT-TOLERANT RATE-MONOTONIC FIRST-FIT SCHEDULING IN HARD-REAL-TIME SYSTEMS 935

subsystem. The failure characteristics of the hardware are
the following:
(F1) Processors fail in a fail-stop manner, that is a processor
is either operational (i.e., nonfaulty) or ceases function-
ing;
(F2) All nonfaulty processors can communicate with each
other;
(F3) Hardware provides fault isolation in the sense that a
faulty processor cannot cause incorrect behavior in a
nonfaulty processor; in other words, processors are
independent as regard to failures;
(F4) The failure of a processor Pf is detected by the remaining nonfaulty processors after the failure, but within the instant corresponding to the closest task completion time of a task scheduled on Pf.
Note that assumption (F4) can be easily satisfied by a
specific failure detection protocol as explained below, since
by assumption (F1) all the processors are assumed to be fail-
stop.
The fault-tolerant scheduling problem consists of finding a
schedule for the tasks so as to satisfy the following
additional condition:
(S4) fault tolerance is guaranteed, namely, conditions (S1)-
(S3) are verified even in the presence of failures.
In order to achieve fault tolerance, two copies for each task are used, called primary and backup copies. The primary copy of τi has its request period equal to Ti and its execution time equal to Ci, while the backup copy has the same request period Ti but an execution time Di that in general differs from Ci. Although the fault-tolerant algorithm to be proposed works also when Di is greater than or equal to Ci, in practice Di is smaller than Ci, since backup copies usually provide a reduced functionality in a smaller execution time than the primary copies.
The primary copy of a task is always executed, while its backup copy is executed according to its status, which can be active or passive. If the status is active, then the backup copy is always executed, while if it is passive, then it is executed only when the primary copy fails. In other words, although both active and passive copies of the primary tasks are statically assigned to processors, passive backup copies are actually executed only when a failure of the corresponding primary copy occurs.
Each passive copy is informed of the completion of its primary copy at every occurrence of the periodic task by means of a message that the processor running the primary copy sends, in each period [hTi, (h+1)Ti), by the primary copy's completion time to the processor assigned to the passive copy. This message is small: since it must contain the indices of the primary task and of the sender and receiver processors, its size is O(log n) bits. If the message is not received by a certain due time (to be specified in Section 3), a failure on the processor running the primary copy is assumed and the passive copy is scheduled. The overhead needed for such processor failure detections is mainly given by the short-message latency of the communication subsystem employed. In particular, with the current off-the-shelf technology, this overhead can be estimated in the order of a few microseconds and is assumed to be included in the execution time of the primary copies. As for active copies, no implicit or explicit synchronization is assumed with their primary copies, since an active copy can run before, after, or concurrently with its primary copy.
2.3 The Rate-Monotonic Algorithm
Liu and Layland [10] proposed a fixed-priority scheduling algorithm, called Rate-Monotonic (RM), for solving the (nonfault-tolerant) problem stated in Section 2.1 on a single processor system, that is, when m = 1. In their algorithm, each task is assigned a priority according to its request rate (the inverse of its request period): tasks with short periods are assigned high priorities. At any instant of time, a pending task with the highest priority is scheduled. A currently running task with lower priority is preempted whenever a request of higher priority occurs, and the interrupted task is resumed later.
As an example, consider tasks τ1 and τ2 to be scheduled on one processor, and let (C1, T1) = (1, 3) and (C2, T2) = (3, 5). Task τ1 has higher priority than τ2, and the first request of τ1 is scheduled during the time interval [0, 1). Then the first request of τ2 is scheduled during [1, 3). At time 3, the second request of τ1 comes, τ2 is preempted, and τ1 is scheduled during [3, 4). Then τ2 is resumed and scheduled during [4, 5), and so on.
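The preemptive schedule just described can be reproduced with a minimal unit-time RM simulator (an illustrative sketch, not the paper's code; `rm_schedule` is a hypothetical helper):

```python
def rm_schedule(tasks, horizon):
    """tasks: (C, T) pairs; shorter period = higher RM priority.
    Returns, per unit time slot, the index of the running task (or None)."""
    remaining = [0] * len(tasks)           # work left for the current request
    timeline = []
    for now in range(horizon):
        for i, (c, period) in enumerate(tasks):
            if now % period == 0:          # a new periodic request arrives
                remaining[i] = c
        # run the pending task with the shortest period (RM priority)
        pending = [i for i in range(len(tasks)) if remaining[i] > 0]
        if pending:
            run = min(pending, key=lambda i: tasks[i][1])
            remaining[run] -= 1
            timeline.append(run)
        else:
            timeline.append(None)
    return timeline

# tau1 = (1, 3), tau2 = (3, 5): tau1 runs in [0,1), tau2 in [1,3),
# tau1 preempts tau2 at time 3, tau2 resumes in [4,5).
print(rm_schedule([(1, 3), (3, 5)], 5))  # [0, 1, 1, 0, 1]
```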
Liu and Layland proved the following two important results concerning fixed-priority scheduling algorithms.
Theorem 1. The largest response time for any periodic request of τi occurs whenever τi is requested simultaneously with the requests for all higher priority tasks.
Theorem 2. A periodic task set can be scheduled by a fixed-priority algorithm provided that the deadline of the first request of each task starting from a critical instant (i.e., an instant in which all tasks are simultaneously requested) is met.
For example, a critical instant occurs when all tasks are in phase at time zero, which is called critical instant phasing, because it is the phasing that results in the longest response time for the first request of each task. As a consequence, to check the schedulability of any task τi, it is sufficient to check whether τi is schedulable in its first period [0, Ti] when it is scheduled with all higher priority tasks.
2.4 The Completion Time Test
From Theorems 1 and 2, the following necessary and
sufficient schedulability criterion was derived by Joseph
and Pandya [5], as discussed also in [8].
Theorem 3. Let the periodic tasks τ1, ..., τn be given in priority order and scheduled by a fixed-priority algorithm. All the periodic requests of τi will meet the deadlines under all task phasings if and only if:

min_{0 < t ≤ Ti} ( Σ_{1 ≤ k ≤ i} Ck ⌈t/Tk⌉ ) / t ≤ 1.

The entire set of tasks τ1, ..., τn is schedulable under all task phasings if and only if:

max_{1 ≤ i ≤ n} min_{0 < t ≤ Ti} ( Σ_{1 ≤ k ≤ i} Ck ⌈t/Tk⌉ ) / t ≤ 1.
The minimization required in Theorem 3 is easy to
compute in the case of the Rate-Monotonic algorithm. In
fact, t needs to be checked only a finite number of times, as
explained below.
Let f
1
; ...;
i
g, with T
1
... T
i
, be a set of tasks
in phase at time zero, the cumulative work on a processor
required by tasks in during 0;t is:
Wt; 
X
k
2
C
k
dt=T
k
e:
Create the sequence of times S
0
;S
1
; ... with
S
0
P
k
2
C
k
, and with S
l1
WS
l
;. If for some l,
S
l
S
l1
T
i
, then
i
is schedulable. Otherwise, if T
i
S
l
for some l, task
i
is not schedulable. Note that S
l
is exactly
equal to the minimum t, 0 <t<T
i
,forwhich
P
1 k i
C
k
dt=T
k
et as required in Theorem 3. This
schedulability test is called Completion Time Test (CTT).
As an immediate consequence of the above theorems, the
following property holds:
Property 1. Let the Completion Time Test be satisfied for
1
; ...;
i
, and let S
l
S
l1
T
i
for some l. Then in any
period hT
i
; h 1 T
i
, with h integer,
i
will complete
no later than the instant hT
i
S
l
.
For the sake of clarity, the quantity S
l
will be denoted in
the following by
i
since such a quantity represents the
worst-case completion time of task
i
in any request period T
i
.
As an example of use of the CTT, consider again the tasks $\tau_1$ and $\tau_2$ with $(C_1, T_1) = (1, 3)$ and $(C_2, T_2) = (3, 5)$, and let us check the schedulability of $\tau_2$:

$$S_0 = 1 + 3 = 4;$$
$$S_1 = W(4, \{\tau_1, \tau_2\}) = 1\lceil 4/3 \rceil + 3\lceil 4/5 \rceil = 5;$$

and

$$S_2 = W(5, \{\tau_1, \tau_2\}) = 1\lceil 5/3 \rceil + 3\lceil 5/5 \rceil = 5.$$

Since $S_1 = S_2 = T_2 = 5$, all the periodic requests of $\tau_2$ will meet their deadlines.
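The fixed-point iteration just described is straightforward to implement. The following is a minimal Python sketch (ours, not the paper's pseudocode; the function name and the representation of tasks as $(C, T)$ pairs are our own):

```python
from math import ceil

def completion_time_test(tasks):
    """Completion Time Test (CTT) for tasks given in decreasing RM
    priority order, i.e., sorted by nondecreasing period.

    tasks: list of (C_i, T_i) pairs.
    Returns the list of worst-case completion times e_i, or None as
    soon as some task cannot meet its deadline.
    """
    completions = []
    for i, (C_i, T_i) in enumerate(tasks):
        # S_0 is the work of tau_1, ..., tau_i all released at time zero.
        s = sum(C for C, _ in tasks[: i + 1])
        while True:
            if s > T_i:          # T_i < S_l: tau_i is not schedulable
                return None
            # W(S_l, Gamma): cumulative work requested during [0, S_l].
            w = sum(C * ceil(s / T) for C, T in tasks[: i + 1])
            if w == s:           # fixed point: S_{l+1} = S_l <= T_i
                completions.append(s)
                break
            s = w
    return completions

# The example from the text: (C1, T1) = (1, 3), (C2, T2) = (3, 5).
print(completion_time_test([(1, 3), (3, 5)]))   # [1, 5]
```

For the example task set, the iteration reaches the fixed point $S_1 = S_2 = 5 = T_2$, so both tasks are schedulable and $e_2 = 5$.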
It is worth noting that, by Theorem 3, the schedulability
of lower priority tasks does not guarantee the schedulability
of higher priority tasks. Therefore, in order to check the
schedulability of a set of tasks, each task must get through
the CTT when it is scheduled with all higher priority tasks.
If tasks are picked by priority order, the schedulability test can proceed in an incremental way: the CTT is performed considering tasks $\tau_1, \ldots, \tau_i$ on the period $[0, T_i]$, for $i = 1, \ldots, n$, that is, by adding one task $\tau_i$ at a time to the preceding tasks $\tau_1, \ldots, \tau_{i-1}$, without the need to test again the schedulability of $\tau_1, \ldots, \tau_{i-1}$. In this way, as soon as $e_i$ is computed, $e_i$ will not change anymore, since only lower priority tasks will be considered later.
2.5 The Rate-Monotonic First-Fit Algorithm
Dhall and Liu [3] generalized the RM algorithm to
accommodate multiprocessor systems. In particular, they
proposed the so called Rate-Monotonic First-Fit (RMFF)
algorithm. It is a partitioning algorithm, where tasks are
first assigned to processors following the RM priority order
and then all the tasks assigned to the same processor are
scheduled with the RM algorithm. Let $T_1 \le T_2 \le \ldots \le T_n$; the algorithm acts as follows. For $i = 1, 2, \ldots, n$, the generic task $\tau_i$ is assigned to the first processor $P_j$ such that $\tau_i$ and the other tasks already assigned to $P_j$ can be scheduled on $P_j$
according to RM. If no such processor exists, the task is
assigned to a new processor. Dhall and Liu showed that,
using a schedulability condition weaker than CTT, RMFF
uses about 2.33U processors in the worst case, where U is
the load of the task set. The 2.33 worst case bound was
recently lowered to 1.75 by Burchard, Liebeherr, Oh, and
Son [1], using a schedulability condition stronger than that
used in [3], but without using the RM priority order for task
assignment, and partially using CTT. In practice, however,
RMFF remains competitive, for its simplicity and efficiency.
It employs the same priority order both for assigning tasks
to processors and scheduling tasks on each processor, and
requires on the average a number of processors very close
to U when CTT is used to check for schedulability on each
processor, as confirmed also by the simulation experiments
exhibited in Section 5. Moreover, as shown in Section 4, it
can be extended in a clean way to tolerate hardware and
software failures.
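A compact way to express RMFF is to reuse the CTT as the per-processor schedulability check. The sketch below is our own, not the paper's pseudocode; tasks are $(C, T)$ pairs, and each candidate set passed to the CTT is already sorted by period because tasks are processed in RM priority order:

```python
from math import ceil

def ctt_ok(tasks):
    """True iff the task set (sorted by nondecreasing period) passes
    the Completion Time Test on a single processor under RM."""
    for i, (_, T_i) in enumerate(tasks):
        s = sum(C for C, _ in tasks[: i + 1])   # S_0
        while True:
            if s > T_i:
                return False
            w = sum(C * ceil(s / T) for C, T in tasks[: i + 1])
            if w == s:                          # fixed point reached
                break
            s = w
    return True

def rmff(tasks):
    """Rate-Monotonic First-Fit: consider tasks by RM priority order and
    assign each one to the first processor on which it passes the CTT."""
    processors = []                        # each entry: tasks on that CPU
    for task in sorted(tasks, key=lambda t: t[1]):
        for assigned in processors:
            if ctt_ok(assigned + [task]):
                assigned.append(task)      # first fit
                break
        else:
            processors.append([task])      # open a new processor
    return processors

print(len(rmff([(1, 3), (3, 5), (1, 4)])))   # 2 processors
```

In the small example above, $(1, 3)$ and $(1, 4)$ fit together on the first processor, while $(3, 5)$ fails the CTT there and opens a second one.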
3 OVERVIEW OF THE FAULT-TOLERANT RMFF ALGORITHM
This section provides an informal high-level description of
the proposed Fault-Tolerant Rate-Monotonic First-Fit
(FTRMFF) algorithm. The analysis of the algorithm is given in the next section. For the sake of simplicity, only the extension to tolerate one processor failure is discussed hereafter. Extensions to support multiple processor failures or software failures are discussed in Sections 4.5 and 4.6, respectively.
In the FTRMFF algorithm, primary and backup copies of
different tasks can be assigned to the same processor. Of
course, in order to tolerate a processor failure, the primary
copy and the backup copy of the same task should not be
assigned to the same processor. The algorithm proposed
can be viewed as the RMFF algorithm applied to a task set
including both primary and backup copies. Task copies,
both primary and backup, are ordered by increasing
periods, namely, the priority of a copy is equal to the
inverse of its period. A tie between a primary copy $\tau_i$ and its backup copy $\beta_i$ is broken by giving higher priority to $\tau_i$. Thus, task copies are indexed by decreasing RM priorities and are assigned to the processors following the order:

$$\tau_1, \beta_1, \tau_2, \beta_2, \ldots, \tau_n, \beta_n.$$
CTT is used to check whether a task copy can be assigned to a processor. Thanks to Property 1 of Section 2, CTT also provides enough information to decide whether a backup copy should be active or passive. Indeed, while checking the schedulability of a primary copy $\tau_i$, CTT also computes its worst-case completion time $e_i$. If the schedulability test for $\tau_i$ succeeds, that is, when $e_i \le T_i$, then for
BERTOSSI ET AL.: FAULT-TOLERANT RATE-MONOTONIC FIRST-FIT SCHEDULING IN HARD-REAL-TIME SYSTEMS 937

each request period there are at least $T_i - e_i$ time units to schedule $\beta_i$ as a passive copy on another processor. Let $B_i = T_i - e_i$ be the recovery time of the backup copy $\beta_i$. If $B_i$ is not smaller than the execution time of $\beta_i$, then $\beta_i$ may be scheduled as a passive copy, since there is enough time to execute $\beta_i$ after $e_i$ if a processor failure prevents $\tau_i$ from being completed; otherwise, $\beta_i$ must be scheduled as an active copy. The algorithm prefers to schedule a backup copy as a passive copy whenever possible, so as to overbook each processor with more passive copies whose primary copies are assigned to different processors.
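Under this rule, the active/passive decision reduces to comparing the recovery time with the backup's execution time. A small sketch of our own follows; it assumes the backup copy has the same execution time $C_i$ as its primary, and takes the worst-case completion time $e_i$ produced by the CTT:

```python
def backup_mode(C_i, T_i, e_i):
    """Decide how the backup copy beta_i must be scheduled, given the
    worst-case completion time e_i of the primary tau_i (from the CTT).

    Assumes the backup's execution time equals the primary's C_i.
    """
    B_i = T_i - e_i              # recovery time left in every period
    return "passive" if B_i >= C_i else "active"

print(backup_mode(1, 3, 1))      # "passive": 2 units remain, C = 1
print(backup_mode(3, 5, 5))      # "active": no slack after e_2 = 5
```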
It is worth noting that, although tasks could be assigned
to processors following any order, considering task copies
by decreasing RM priorities greatly simplifies the algo-
rithm. Indeed, such an ordering is the same ordering used
by the RM algorithm to schedule the tasks assigned to each
processor. Therefore, when a task $\tau_i$ is assigned to a processor, only lower priority tasks will be assigned later to the same processor, and the time intervals for $\tau_i$'s execution on the processor will remain unchanged. In particular, the worst-case completion time $e_i$ will also remain unchanged. This makes it possible to determine whether the backup copy $\beta_i$ of $\tau_i$ can be scheduled as a passive copy. Clearly, with another ordering, a higher priority task could be assigned to the same processor after $\tau_i$. In this case, $e_i$ would need to be recomputed, and $\beta_i$ would have to be reassigned and rescheduled. This justifies the $\tau_1, \beta_1, \tau_2, \beta_2, \ldots, \tau_n, \beta_n$ order of assignment. Moreover, since the algorithm generalizes RMFF, it assigns a backup copy $\beta_i$, either passive or active, to the first processor $P_j$ such that $\tau_i$ is not assigned to $P_j$, and $\beta_i$ and the other primary and backup copies already assigned to $P_j$ can be scheduled on $P_j$ according to the RM algorithm for a single processor.
To find a processor to which a task copy can be assigned, however, several applications of the CTT are required, taking into account both the situation in which no processor fails and the situations in which some processor fails. The applications of the test depend on the kind (primary/backup) of the task copy to be assigned, as well as on its status (active/passive) if the copy is a backup copy. There are three main assignment cases.
(A1) To assign a primary copy $\tau_i$ to a processor $P_j$, two conditions have to be checked:

• $\tau_i$ must be schedulable together with all the primary and active backup copies already assigned to $P_j$;
• $\tau_i$ must be schedulable together with all the primary copies already assigned to $P_j$ and all the active and passive backup copies assigned to $P_j$ whose corresponding primary copies are all assigned to the same processor $P_f$; this condition must be checked for all $P_f \ne P_j$.
The first condition takes into account the situation in which no failure occurs, while the second one takes into account the situation in which some processor other than $P_j$ fails. Thus, as many applications of the CTT as the total number of processors are required, in the worst case, to determine whether $\tau_i$ can be assigned to $P_j$. Note that the second condition can use the space reserved on $P_j$ for active copies whose primary copies are not assigned to $P_f$, since only one processor failure is assumed to be tolerated.
(A2) To assign an active backup copy $\beta_i$ to a processor $P_j$, assume that the primary copy $\tau_i$ is already assigned to a processor $P_p \ne P_j$; two conditions have also to be checked:

• $\beta_i$ must be schedulable together with all the primary and active backup copies already assigned to $P_j$;
• $\beta_i$ must be schedulable together with all the primary copies already assigned to $P_j$ and all the active and passive backup copies assigned to $P_j$ whose corresponding primary copies are all assigned to $P_p$.

These conditions are analogous to those of (A1), with the difference that the second one takes into account the situation where the failed processor is the one running the primary copy $\tau_i$. Thus, only two applications of the CTT are required to determine whether $\beta_i$ can be assigned to $P_j$.
(A3) Finally, to assign a passive backup copy $\beta_i$ to a processor $P_j$, assuming again that the primary copy $\tau_i$ is already assigned to a processor $P_p \ne P_j$, only one condition has to be tested, which is identical to the second condition of (A2). Thus, only one application of the CTT is needed to determine whether $\beta_i$ can be assigned to $P_j$.
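Cases (A1)-(A3) differ only in which failure scenarios must be checked. As an illustration of case (A1), the sketch below (our own data layout, not the paper's) builds, for a candidate primary copy, the task set that the CTT must accept in the no-failure scenario and in each scenario where some other processor $P_f$ fails:

```python
def a1_scenarios(copies_on_pj, candidate, other_processors):
    """Task sets to be validated by the CTT before assigning the primary
    copy `candidate` (a (C, T) pair) to processor P_j -- case (A1).

    copies_on_pj: copies already on P_j, as dicts
        {"task": (C, T), "kind": "primary" | "active" | "passive",
         "primary_on": id of the processor holding a backup's primary}
    other_processors: ids of the processors P_f != P_j.
    Returns {scenario: task set}, where scenario None means "no failure".
    """
    primaries = [c["task"] for c in copies_on_pj if c["kind"] == "primary"]
    actives = [c["task"] for c in copies_on_pj if c["kind"] == "active"]
    # No failure: primaries + active backups + the candidate.
    scenarios = {None: primaries + actives + [candidate]}
    # Failure of P_f: primaries + every backup (active or passive)
    # whose primary runs on P_f + the candidate.
    for p_f in other_processors:
        backups = [c["task"] for c in copies_on_pj
                   if c["kind"] != "primary" and c["primary_on"] == p_f]
        scenarios[p_f] = primaries + backups + [candidate]
    return scenarios

copies = [{"task": (1, 4), "kind": "primary", "primary_on": None},
          {"task": (2, 8), "kind": "passive", "primary_on": "P1"}]
print(a1_scenarios(copies, (1, 6), ["P1"]))
```

Each resulting task set would then be sorted by period and fed to the CTT; with $m$ processors there are up to $m$ scenarios, matching the worst-case count of CTT applications stated for case (A1).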
As soon as task copies are assigned to processors, all the copies assigned to the same processor are scheduled with the RM algorithm. However, in the absence of failures, each processor executes only its primary copies and active backup copies. When the processor assigned to $\beta_i$ does not receive the synchronization message of $\tau_i$ by time $hT_i + e_i$, a failure of the processor running $\tau_i$ is assumed and the passive copy $\beta_i$ is executed. To understand how to recover from a failure, assume $\tau_i$ is assigned to a processor $P_f$ which is detected at time $t^*$ to be failed, with $t^*$ belonging to $[hT_i, (h+1)T_i)$ for some $h$. If $\beta_i$ is an active copy scheduled on a processor $P_j$, then $\beta_i$ will continue to be executed and no further action is needed for $\tau_i$. If $\beta_i$ is passive, then $\beta_i$ becomes active on $P_j$ starting either from $t^*$, if the execution of $\tau_i$ was not completed by $P_f$ before $t^*$, or from $(h+1)T_i$, if the execution of $\tau_i$ was already completed before $t^*$. In other words, if $t^* > hT_i + e_i$, then $\tau_i$ was completed before $P_f$'s failure and there is no need to schedule $\beta_i$ by time $(h+1)T_i$. If $t^* \le hT_i + e_i$ then, in order to recover the lost computation of $\tau_i$, $\beta_i$ must be executed for the first time during the interval $[t^*, (h+1)T_i)$, which in general is shorter than $T_i$. It will be shown in the next section that $\beta_i$, the primary copies of $P_j$, and the backup copies of $P_j$ meet their deadlines even in this case.
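The recovery rule for a passive copy can be summarized as a small decision function (our own sketch; `t_star` is the failure-detection instant and `h` identifies the request period containing it):

```python
def passive_recovery_start(t_star, h, T_i, e_i):
    """Instant from which the passive copy beta_i becomes active on P_j,
    when P_f (running tau_i) is detected failed at time t_star, with
    h*T_i <= t_star < (h + 1)*T_i.
    """
    if t_star > h * T_i + e_i:
        # tau_i had already completed in this period: beta_i is only
        # needed from the next request period onward.
        return (h + 1) * T_i
    # The current request of tau_i is lost: beta_i must run within the
    # shortened interval [t_star, (h + 1)*T_i).
    return t_star

print(passive_recovery_start(7, 1, 5, 3))    # 7: recover within [7, 10)
print(passive_recovery_start(9, 1, 5, 3))    # 10: tau_i already done
```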
4 ANALYSIS OF THE FAULT-TOLERANT RMFF ALGORITHM
In this section, necessary and sufficient schedulability
criteria are proved which extend Theorem 3 to schedule a
set of primary and backup copies to recover from one
processor failure. Based on the proposed criteria, a fault-
tolerant extension of RMFF is derived and proved to be
correct.
4.1 Schedulability Criteria
In order to extend Theorem 3, consider a generic task set
containing both primary and backup copies which must be