

Fault-tolerant rate-monotonic first-fit scheduling in hard-real-time systems

01 Sep 1999-IEEE Transactions on Parallel and Distributed Systems (IEEE)-Vol. 10, Iss: 9, pp 934-945

TL;DR: In this paper, fault tolerance is implemented by using a novel duplication technique where each task scheduled on a processor has either an active backup copy or a passive backup copy planned on a different processor.

Abstract: Hard-real-time systems require predictable performance despite the occurrence of failures. In this paper, fault tolerance is implemented by using a novel duplication technique where each task scheduled on a processor has either an active backup copy or a passive backup copy scheduled on a different processor. An active copy is always executed, while a passive copy is executed only in the case of a failure. First, the paper considers the ability of the widely-used rate-monotonic scheduling algorithm to meet the deadlines of periodic tasks in the presence of a processor failure. In particular, the completion time test is extended so as to check the schedulability on a single processor of a task set including backup copies. Then, the paper extends the well-known rate-monotonic first-fit assignment algorithm, where all the task copies, including the backup copies, are considered by rate-monotonic priority order and assigned to the first processor in which they fit. The proposed algorithm determines which tasks must use the active duplication and which can use the passive duplication. Passive duplication is preferred whenever possible, so as to overbook each processor with many passive copies whose primary copies are assigned to different processors. Moreover, the space allocated to active copies is reclaimed as soon as a failure is detected. Passive copy overbooking and active copy deallocation allow many passive copies to be scheduled sharing the same time intervals on the same processor, thus reducing the total number of processors needed. Simulation studies reveal a remarkable saving of processors with respect to those needed by the usual active duplication approach in which the schedule of the non-fault-tolerant case is duplicated on two sets of processors.

Summary

1 INTRODUCTION

  • Throughout industrial computing, there is an increasing demand for more complex and sophisticated hard-real-time computing systems.
  • An active copy presents the advantages of requiring no synchronization with its primary copy (it can run before, after, or concurrently with the other copy) and of having a larger time window for execution, namely, the whole period of the task.
  • In particular, this paper extends the RMFF algorithm to tolerate failures under the assumption that processors fail in a fail-stop manner.

2 BACKGROUND

  • This section gives a formal definition of the scheduling problem and a precise specification of the fault tolerance model.
  • Moreover, important properties of the well-known RM, CTT, and RMFF algorithms are recalled.

2.1 The Scheduling Problem

  • The requests for τi are periodic, with constant interval Ti between every two consecutive requests, and τi's first request occurs at time 0.
  • The worst case execution time for all the requests of τi is constant and equal to Ci, with Ci ≤ Ti. Periodic tasks τ1, ..., τn are independent, that is, the requests of any task do not depend on the execution of the other tasks.
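The task model in these bullets can be sketched as a small record (an illustrative sketch, not code from the paper; the names `Task`, `load`, and `total_load` are ours):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    c: float  # worst-case execution time C_i
    t: float  # request period T_i (deadline = end of period)

    def load(self) -> float:
        # U_i = C_i / T_i
        return self.c / self.t

def total_load(tasks) -> float:
    # U = sum of the individual task loads U_i
    return sum(task.load() for task in tasks)

# Example: the two tasks used later in the paper's RM example.
tasks = [Task(1, 3), Task(3, 5)]
print(total_load(tasks))  # 1/3 + 3/5 ≈ 0.9333
```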

2.2 The Fault-Tolerant Model

  • The fault-tolerant scheduling problem consists of finding a schedule for the tasks so as to satisfy the following additional condition: (S4) fault tolerance is guaranteed, namely, conditions (S1)-(S3) are verified even in the presence of failures.
  • In order to achieve fault tolerance, two copies for each task are used, called primary and backup copies.
  • In practice, Di is smaller than Ci, since backup copies usually provide a reduced functionality in a smaller execution time than the primary copies.
  • This message is small: since it must contain the indices of the primary task and of the sender and receiver processors, its size is O(log n) bits.
  • The overhead needed for such processor failure detections is mainly given by the short-message latency of the communication subsystem employed.

2.3 The Rate-Monotonic Algorithm

  • Liu and Layland [10] proposed a fixed-priority scheduling algorithm, called Rate-Monotonic (RM), for solving the (nonfault-tolerant) problem stated in Section 2.1 on a single processor system, that is, when m = 1.
  • At any instant of time, a pending task with the highest priority is scheduled.
  • Liu and Layland proved the following two important results concerning fixed-priority scheduling algorithms.
  • The largest response time for any periodic request of τi occurs whenever τi is requested simultaneously with the requests for all higher priority tasks.
  • A critical instant occurs when all tasks are in phase at time zero, which is called critical instant phasing, because it is the phasing that results in the longest response time for the first request of each task.

2.4 The Completion Time Test

  • From Theorems 1 and 2, the following necessary and sufficient schedulability criterion was derived by Joseph and Pandya [5], as discussed also in [8].
  • This schedulability test is called Completion Time Test (CTT).
  • It is worth noting that, by Theorem 3, the schedulability of lower priority tasks does not guarantee the schedulability of higher priority tasks.
  • Therefore, in order to check the schedulability of a set of tasks, each task must get through the CTT when it is scheduled with all higher priority tasks.

2.5 The Rate-Monotonic First-Fit

  • Dhall and Liu [3] generalized the RM algorithm to accommodate multiprocessor systems.
  • In particular, they proposed the so called Rate-Monotonic First-Fit (RMFF) algorithm.
  • It is a partitioning algorithm, where tasks are first assigned to processors following the RM priority order and then all the tasks assigned to the same processor are scheduled with the RM algorithm.
  • Dhall and Liu showed that, using a schedulability condition weaker than CTT, RMFF uses about 2.33U processors in the worst case, where U is the load of the task set.
  • In practice, however, RMFF remains competitive, for its simplicity and efficiency.
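The first-fit rule described in these bullets can be sketched as follows (our own illustration, not the authors' pseudocode; `ctt_ok` and `rmff` are hypothetical names, and the schedulability check used here is the Completion Time Test rather than Dhall and Liu's weaker condition):

```python
import math

def ctt_ok(tasks):
    """tasks: (C, T) pairs in nondecreasing-period (RM priority) order.
    True if every task passes the Completion Time Test under RM."""
    for i, (_, t_i) in enumerate(tasks):
        hp = tasks[:i + 1]                     # task i plus higher-priority tasks
        s = sum(c for c, _ in hp)              # S_0
        while s <= t_i:
            w = sum(c * math.ceil(s / t) for c, t in hp)
            if w == s:                         # fixed point: completion time found
                break
            s = w
        if s > t_i:                            # deadline T_i missed
            return False
    return True

def rmff(tasks):
    """First-fit partitioning: each task goes to the first processor
    on which it is still schedulable; returns per-processor task lists."""
    procs = []
    for task in sorted(tasks, key=lambda ct: ct[1]):  # RM priority order
        for p in procs:
            if ctt_ok(p + [task]):
                p.append(task)
                break
        else:
            procs.append([task])               # open a new processor
    return procs

print(len(rmff([(1, 3), (3, 5), (2, 4)])))  # 2
```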

3 OVERVIEW OF THE FAULT-TOLERANT RMFF ALGORITHM

  • This section provides an informal high-level description of the proposed Fault-Tolerant Rate-Monotonic First-Fit algorithm.
  • The algorithm prefers to schedule a backup copy as a passive copy whenever possible, so as to overbook each processor with more passive copies whose primary copies are assigned to different processors.
  • Clearly, with another ordering, a higher priority task can be assigned to the same processor after τi.
  • τi must be schedulable together with all the primary and active backup copies already assigned to Pj.
  • These conditions are analogous to those of (A1), with the difference that the second one takes into account the situation where the failed processor is the one running the primary copy of τi.

4.1 Schedulability Criteria

  • In order to extend Theorem 3, consider a generic task set containing both primary and backup copies which must be scheduled all together on a single processor.
  • The task set considered for processor Pj includes the active backup copies assigned to that processor.
  • Let the set of periodic tasks assigned to processor Pj be given in priority order, and consider the case of a failure of a processor.
  • Hence, the proof follows from Theorem 3.

4.2 Fault-Tolerant CTT

  • Based on Theorems 4 and 5, two kinds of schedulability tests are needed, one to check for schedulability in the absence of failures, and the other to check for schedulability after a processor failure.
  • A test is needed for each processor Pj, since the recovery from the failure of any processor other than Pj must be taken into account.

4.3 Fault-Tolerant RMFF

  • The first step assigns the primary copy of τi to the first processor in which it fits.
  • The second step establishes the recovery time Bi and the status of τi's backup copy.
  • Thus, duplicating on two sets of processors the schedule for the nonfault-tolerant case requires at least four processors to tolerate one failure.
  • The proposed FTRMFF algorithm, instead, tolerates one failure using three processors only.
  • The procedure FTRMFF-Assignment is executed off-line and requires O(nm²) schedulability tests to be performed.

4.4 Recovery from a Processor Failure

  • Once an assignment is found by the FTRMFF algorithm, the uncompleted tasks assigned to a failed processor Pf are recovered by the remaining nonfaulty processors.
  • Note also that, in any case, all the active backup copies of primary tasks scheduled on the nonfaulty processors are deallocated; the FTRMFF-Recovery procedure performs these steps in parallel on all the processors.
  • The procedure FTRMFF-Recovery is executed on-line and is very fast, since all the required sets, including passiveRecover(Pj, Pf) and recover(Pj, Pf), were previously computed off-line by the FTRMFF-Assignment procedure, which already made all the schedulability tests, too.

4.5 Tolerating Many Processor Failures

  • In order to tolerate many processor failures, spare processors must be employed to replace failed processors on-line.
  • The failure of Pf is detected within the closest completion time of the task set primary(Pf) ∪ active(Pf), and the time interval between two consecutive failures is three times the largest task request period.
  • During the recovery phase, all the passive copies of the uncompleted tasks assigned to Pf are executed by the nonfaulty processors only once (step 1.1.2 of FTRMFF-Replacing), and the spare processor Ps inherits the assignment of the failed processor.
  • The reconfiguration phase is completed by time 2Tn.
  • If there are q spare processors, q faulty processors can be replaced by means of the FTRMFF-Replacing procedure, while one additional failure can be tolerated by means of the FTRMFF-Recovery procedure.

4.6 Tolerating Software Failures

  • In addition to processor failures, a hard-real-time system can also fail due to design faults in the software.
  • To explain the ideas of the approach, assume that two different implementations of the same task specification are provided.
  • Since the processors are assumed fail-stop, if the acceptance test fails, it signals the presence of an error in the software.
  • The time to execute the acceptance test is assumed to be included in the primary copy execution time.
  • One approach to implement the recovery from software failures is as follows.

5 SIMULATION EXPERIMENTS

  • In order to evaluate the number of processors used by the FTRMFF algorithm for scheduling both primary and backup copies, simulation experiments are performed.
  • For each chosen parameter setting, the experiment is repeated 30 times, and the average result is computed.
  • The performance metric in all the experiments is the number of processors required to assign a given task set.
  • In the outcome of the experiments, the authors denote with N the number of processors required by the FTRMFF algorithm for a task set consisting of both primary and backup task copies, and with M the number of processors required by the RMFF algorithm for a task set with identical primary copies and no backup copies.

6 CONCLUDING REMARKS

  • This paper has considered the problem of preemptively scheduling a set of independent periodic tasks under the assumption that each task deadline coincides with the next request of the same task.
  • The proposed FTRMFF algorithm extends the well-known Rate-Monotonic First-Fit scheduling algorithm to tolerate failures, uses a novel combined active/passive duplication scheme, and determines by itself which tasks should use passive duplication and which should use active duplication.
  • This optimization is left for further work.
  • It is worth noting that the proposed algorithm works also if some backup copies are forced to be active.
  • Finally, further research could deal with assignment strategies which are different from those considered in this paper.

ACKNOWLEDGMENTS

  • The C++ code used in the simulation experiments was written by Andrea Fusiello.
  • This work was supported by grants from the Ministero dell'Università e della Ricerca Scientifica e Tecnologica, the Consiglio Nazionale delle Ricerche, and the Università di Trento (Progetto Speciale 1997).


Fault-Tolerant Rate-Monotonic First-Fit
Scheduling in Hard-Real-Time Systems
Alan A. Bertossi, Luigi V. Mancini, and Federico Rossini
Index Terms: Fault tolerance, hard-real-time systems, multiprocessor systems, periodic tasks, rate-monotonic scheduling, task replication.
1 INTRODUCTION
THROUGHOUT industrial computing, there is an increasing demand for more complex and sophisticated hard-real-time computing systems. In particular, fault tolerance is one of the requirements that are playing a vital role in the design of new hard-real-time distributed systems.
A variety of schemes have been proposed to support
fault-tolerant computing in distributed systems; such
schemes can be partitioned into two broad classes. In the
first class, which employs the passive replication techniques,
a passive backup copy of a primary task is assigned to one
or more backup processors; when a primary task fails, the
passive copies of the task are restarted on the backup
processor, hence a passive copy is executed only in the
presence of a failure. In the second class, which employs the
active replication techniques, the same set of tasks is always
executed on two or more sets of processors; every primary
task has an active backup copy: if any task fails, its mirror
image will continue to execute.
Many hard-real-time scheduling problems have been
found to be NP-hard: most likely, there are no optimal
polynomial-time algorithms for them [2], [11]. In particular,
scheduling periodic tasks with arbitrary deadlines is NP-
hard, even if only a single processor is available [12].
Several heuristics for scheduling periodic tasks on uni-
processor and multiprocessor systems have been proposed.
Liu and Layland [10] introduced the Rate-Monotonic (RM)
algorithm for preemptively scheduling periodic tasks on a
single processor, under the assumption that task deadlines
are equal to their periods. Joseph and Pandya [5] later
derived the Completion Time Test (CTT) for checking
schedulability of a set of fixed-priority tasks on a single
processor. RM was generalized to multiprocessor systems
by Dhall and Liu [3], who proposed, among others, the
Rate-Monotonic First-Fit (RMFF) heuristic. More refined
heuristics for multiprocessors were proposed by Burchard,
Liebeherr, Oh, and Son [1].
It is worth noting that the RM algorithm is becoming an
industry standard because of its simplicity and flexibility. It
is a low overhead greedy algorithm, which is optimal
among all fixed-priority algorithms. Moreover, it possesses
certain advantages, for example, the implementation of
efficient schedulers for aperiodic tasks, and the retiming of
intervals in order to shed the load in a predictable fashion
[8].
As for fault-tolerant scheduling algorithms, a dynamic
programming algorithm for multiprocessors was presented
in [7] which ensures that backup schedules can be
efficiently embedded within the primary schedule. An
algorithm was proposed in [9] which generates optimal
schedules in a uniprocessor system by employing a passive
replication to tolerate software failures only. The algorithms
proposed in [14] are based on a bidding strategy and
dynamically recompute the schedule when a processor
. A.A. Bertossi is with the Dipartimento di Matematica, Università di Trento, Via Sommarive 14, 38050 Trento, Italy. E-mail: bertossi@science.unitn.it.
. L.V. Mancini is with the Dipartimento di Scienze dell'Informazione, Università di Roma "La Sapienza," Via Salaria 113, 00198 Roma, Italy. E-mail: lv.mancini@dsi.uniroma1.it.
. F. Rossini is with Telecom Italia Mobile, Area Applicazioni Informatiche, Via Tor Pagnotta 90, 00143 Roma, Italy.
Manuscript received 20 June 1997.
For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number 105271.
1045-9219/99/$10.00 © 1999 IEEE

fails, in order to redistribute the tasks among the remaining
nonfaulty processors. In [13], two algorithms are designed
which reserve time for the processing of backup tasks on
uniprocessors running fixed-priority schedulers. Finally,
the techniques of backup overbooking and backup deal-
location were introduced in [4] to achieve fault tolerance in
multiprocessor systems, but for aperiodic nonpreemptive
tasks only.
It is noted here that none of the fault-tolerant algorithms
discussed above has extended the RMFF algorithm or
combined in the same schedule both active and passive
replication of the tasks. However, the latter idea seems
potentially useful since it provides the ability to exploit the
advantages of both types of replication in the same system.
Indeed, the simplest way to achieve fault tolerance in
hard-real-time systems consists in using active duplication
for all tasks. An active copy presents the advantages of requiring no synchronization with its primary copy (it can run before, after, or concurrently with the other copy) and of having a larger time window for execution, namely, the whole period of the task. However, using active duplication
for all tasks doubles the number of processors required in
the nonfault-tolerant case. In contrast, a passive copy can be
executed only if a failure prevents the corresponding
primary copy from completing. A passive copy has the
disadvantages of having tighter timing constraints (in the worst case it is not activated until the scheduled completion time of the primary copy) and of requiring some time overhead for synchronization with the corresponding
primary copy. These drawbacks can be overcome by
choosing active replication when the scheduled completion
time of the primary copy is close to the deadline, that is to
the end of the period, and by having smaller execution
times for the backup copies. Moreover, since the time
overhead for synchronization is usually very small, it can be
included in the execution time of the primary task. Most
importantly, passive duplication has the great advantage of
overbooking the processors: many passive copies whose
primary copies are assigned to different processors can be
scheduled on the same processor so as to share the same
time interval. Indeed, under the assumption of a single
processor failure, only one of such passive copies will be
actually executed, namely, the passive copy whose primary
copy was prevented from completing because of the failure.
Moreover, if only one failure is tolerated, the space allocated
to active copies whose primary copy is not assigned to the
failed processor can be reclaimed as soon as a failure is
detected. Passive copy overbooking and active copy deal-
location allow fewer processors to be used with respect to
the case in which active duplication is used for all tasks.
The present paper considers the problem of preemp-
tively scheduling a set of independent periodic tasks on a
distributed system, such that each task deadline coincides
with the next request of the same task, and all tasks start in-
phase. In particular, this paper extends the RMFF algorithm
to tolerate failures under the assumption that processors fail
in a fail-stop manner. The algorithm determines by itself
which tasks must use active duplication and which can use
passive duplication, preferring passive duplication when-
ever possible. The rest of the paper is organized as follows.
Section 2 gives a formal definition of the scheduling
problem and a precise specification of the fault tolerance
model. Moreover, the classical RM, CTT, and RMFF
algorithms are recalled. Section 3 provides a high-level
description of the proposed Fault-Tolerant Rate-Monotonic
First-Fit (FTRMFF) algorithm. The algorithm analysis is
done in Section 4. In particular, the ability of RM to meet the
deadlines in the presence of one processor failure is
characterized in Section 4.1, and CTT is extended in Section
4.2 so as to check the schedulability on a single processor of
a task set including backup copies. Then, such an extended
CTT is used in Section 4.3 to assign task copies to processors
following a First-Fit heuristic which employs passive copy
overbooking and active copy space reclaiming. An algo-
rithm to recover from a single processor failure is shown in
Section 4.4, while extensions to tolerate both many
processor failures and software failures are presented in
Sections 4.5 and 4.6, respectively. In Section 5, simulation
experiments show that the proposed FTRMFF algorithm
requires fewer processors than the active duplication
approach. Finally, Section 6 summarizes the work and
discusses further possible extensions.
2 BACKGROUND
This section gives a formal definition of the scheduling
problem and a precise specification of the fault tolerance
model. Moreover, important properties of the well-known
RM, CTT, and RMFF algorithms are recalled.
2.1 The Scheduling Problem
A periodic task τi is completely identified by a pair (Ci, Ti), where Ci is τi's execution time and Ti is τi's request period. The requests for τi are periodic, with constant interval Ti between every two consecutive requests, and τi's first request occurs at time 0. The worst case execution time for all the (infinite) requests of τi is constant and equal to Ci, with Ci ≤ Ti. Periodic tasks τ1, ..., τn are independent, that is, the requests of any task do not depend on the execution of the other tasks. The load of a periodic task τi = (Ci, Ti) is Ui = Ci/Ti, while the load of the task set τ1, ..., τn is U = Σ_{1 ≤ i ≤ n} Ui.
Given n independent periodic tasks τ1, ..., τn and a set of identical processors, the scheduling problem consists of finding an order in which all the periodic requests of the tasks are to be executed on the processors so as to satisfy the following scheduling conditions:
(S1) integrity is preserved, that is, tasks and processors are
sequential: each task is executed by at most one
processor at a time and no processor executes more than
one task at a time;
(S2) deadlines are met, namely, each request of any task must
be completely executed before the next request of the
same task, that is, by the end of its period;
(S3) the number m of processors is minimized.
2.2 The Fault-Tolerant Model
It is assumed that the processors belong to a distributed
system and are connected by some kind of communication
BERTOSSI ET AL.: FAULT-TOLERANT RATE-MONOTONIC FIRST-FIT SCHEDULING IN HARD-REAL-TIME SYSTEMS 935

subsystem. The failure characteristics of the hardware are
the following:
(F1) Processors fail in a fail-stop manner, that is a processor
is either operational (i.e., nonfaulty) or ceases function-
ing;
(F2) All nonfaulty processors can communicate with each
other;
(F3) Hardware provides fault isolation in the sense that a
faulty processor cannot cause incorrect behavior in a
nonfaulty processor; in other words, processors are
independent as regard to failures;
(F4) The failure of a processor Pf is detected by the remaining nonfaulty processors after the failure, but within the instant corresponding to the closest task completion time of a task scheduled on Pf.
Note that assumption (F4) can be easily satisfied by a
specific failure detection protocol as explained below, since
by assumption (F1) all the processors are assumed to be fail-
stop.
The fault-tolerant scheduling problem consists of finding a
schedule for the tasks so as to satisfy the following
additional condition:
(S4) fault tolerance is guaranteed, namely, conditions (S1)-
(S3) are verified even in the presence of failures.
In order to achieve fault tolerance, two copies for each task are used, called primary and backup copies. The primary copy of τi has its request period equal to Ti and its execution time equal to Ci, while the backup copy has the same request period Ti but an execution time Di that in general differs from Ci. Although the fault-tolerant algorithm to be proposed works also when Di is greater than or equal to Ci, in practice Di is smaller than Ci, since backup copies usually provide a reduced functionality in a smaller execution time than the primary copies.
The primary copy of a task is always executed, while its backup copy is executed according to its status, which can be active or passive. If the status is active, then the backup copy is always executed, while if it is passive, then it is executed only when the primary copy fails. In other words, although both active and passive copies of the primary tasks are statically assigned to processors, passive backup copies are actually executed only when a failure of the corresponding primary copy occurs.
Each passive copy is informed of the completion of its primary copy at every occurrence of the periodic task by means of a message that the processor running the primary copy sends, in each period [hTi, (h+1)Ti), by the primary copy's completion time to the processor assigned to the passive copy. This message is small: since it must contain the indices of the primary task and of the sender and receiver processors, its size is O(log n) bits. If the message is not received by a certain due time (to be specified in Section 3), a failure on the processor running the primary copy is assumed and the passive copy is scheduled. The overhead needed for such processor failure detections is mainly given by the short-message latency of the communication subsystem employed. In particular, with the current off-the-shelf technology, this overhead can be estimated in the order of a few microseconds and is assumed to be included in the execution time of the primary copies. As for active copies, no implicit or explicit synchronization is assumed with their primary copies, since an active copy can run before, after, or concurrently with its primary copy.
2.3 The Rate-Monotonic Algorithm
Liu and Layland [10] proposed a fixed-priority scheduling algorithm, called Rate-Monotonic (RM), for solving the (nonfault-tolerant) problem stated in Section 2.1 on a single processor system, that is, when m = 1. In their algorithm, each task is assigned a priority according to its request rate (the inverse of its request period): tasks with short periods are assigned high priorities. At any instant of time, a pending task with the highest priority is scheduled. A currently running task with lower priority is preempted whenever a request of higher priority occurs, and the interrupted task is resumed later.
As an example, consider tasks τ1 and τ2 to be scheduled on one processor, and let (C1, T1) = (1, 3) and (C2, T2) = (3, 5). Task τ1 has higher priority than τ2, and the first request of τ1 is scheduled during the time interval [0, 1). Then the first request of τ2 is scheduled during [1, 3). At time 3, the second request of τ1 comes, τ2 is preempted, and τ1 is scheduled during [3, 4). Then τ2 is resumed and scheduled during [4, 5), and so on.
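The preemptive schedule just described can be reproduced with a minimal unit-time RM simulator (an illustrative sketch, not the paper's code; `rm_schedule` is a hypothetical helper):

```python
def rm_schedule(tasks, horizon):
    """tasks: (C, T) pairs; shorter period = higher RM priority.
    Returns, per unit time slot, the index of the running task (or None)."""
    remaining = [0] * len(tasks)           # work left for the current request
    timeline = []
    for now in range(horizon):
        for i, (c, period) in enumerate(tasks):
            if now % period == 0:          # a new periodic request arrives
                remaining[i] = c
        # run the pending task with the shortest period (RM priority)
        pending = [i for i in range(len(tasks)) if remaining[i] > 0]
        if pending:
            run = min(pending, key=lambda i: tasks[i][1])
            remaining[run] -= 1
            timeline.append(run)
        else:
            timeline.append(None)
    return timeline

# tau1 = (1, 3), tau2 = (3, 5): tau1 runs in [0,1), tau2 in [1,3),
# tau1 preempts tau2 at time 3, tau2 resumes in [4,5).
print(rm_schedule([(1, 3), (3, 5)], 5))  # [0, 1, 1, 0, 1]
```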
Liu and Layland proved the following two important results concerning fixed-priority scheduling algorithms.
Theorem 1. The largest response time for any periodic request of τi occurs whenever τi is requested simultaneously with the requests for all higher priority tasks.
Theorem 2. A periodic task set can be scheduled by a fixed-priority algorithm provided that the deadline of the first request of each task starting from a critical instant (i.e., an instant in which all tasks are simultaneously requested) is met.
For example, a critical instant occurs when all tasks are in phase at time zero, which is called critical instant phasing, because it is the phasing that results in the longest response time for the first request of each task. As a consequence, to check the schedulability of any task τi, it is sufficient to check whether τi is schedulable in its first period [0, Ti] when it is scheduled with all higher priority tasks.
2.4 The Completion Time Test
From Theorems 1 and 2, the following necessary and
sufficient schedulability criterion was derived by Joseph
and Pandya [5], as discussed also in [8].
Theorem 3. Let the periodic tasks τ1, ..., τn be given in priority order and scheduled by a fixed-priority algorithm. All the periodic requests of τi will meet the deadlines under all task phasings if and only if:

min_{0 < t ≤ Ti} ( Σ_{1 ≤ k ≤ i} Ck ⌈t/Tk⌉ ) / t ≤ 1.

The entire set of tasks τ1, ..., τn is schedulable under all task phasings if and only if:

max_{1 ≤ i ≤ n} min_{0 < t ≤ Ti} ( Σ_{1 ≤ k ≤ i} Ck ⌈t/Tk⌉ ) / t ≤ 1.
The minimization required in Theorem 3 is easy to
compute in the case of the Rate-Monotonic algorithm. In
fact, t needs to be checked only a finite number of times, as
explained below.
Let f
1
; ...;
i
g, with T
1
... T
i
, be a set of tasks
in phase at time zero, the cumulative work on a processor
required by tasks in during 0;t is:
Wt; 
X
k
2
C
k
dt=T
k
e:
Create the sequence of times S
0
;S
1
; ... with
S
0
P
k
2
C
k
, and with S
l1
WS
l
;. If for some l,
S
l
S
l1
T
i
, then
i
is schedulable. Otherwise, if T
i
S
l
for some l, task
i
is not schedulable. Note that S
l
is exactly
equal to the minimum t, 0 <t<T
i
,forwhich
P
1 k i
C
k
dt=T
k
et as required in Theorem 3. This
schedulability test is called Completion Time Test (CTT).
As an immediate consequence of the above theorems, the
following property holds:
Property 1. Let the Completion Time Test be satisfied for
1
; ...;
i
, and let S
l
S
l1
T
i
for some l. Then in any
period hT
i
; h 1 T
i
, with h integer,
i
will complete
no later than the instant hT
i
S
l
.
For the sake of clarity, the quantity S
l
will be denoted in
the following by
i
since such a quantity represents the
worst-case completion time of task
i
in any request period T
i
.
As an example of use of the CTT, consider again the tasks $\tau_1$ and $\tau_2$ with $(C_1, T_1) = (1, 3)$ and $(C_2, T_2) = (3, 5)$, and let us check the schedulability of $\tau_2$:

$$S_0 = 1 + 3 = 4;$$
$$S_1 = W(4, \{\tau_1, \tau_2\}) = 1\lceil 4/3 \rceil + 3\lceil 4/5 \rceil = 5;$$

and

$$S_2 = W(5, \{\tau_1, \tau_2\}) = 1\lceil 5/3 \rceil + 3\lceil 5/5 \rceil = 5.$$

Since $S_1 = S_2 = T_2 = 5$, all the periodic requests of $\tau_2$ will meet their deadlines.
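The fixed-point iteration just described is straightforward to implement. The following is a minimal Python sketch (ours, not the paper's pseudocode; the function name and the representation of tasks as $(C, T)$ pairs are our own):

```python
from math import ceil

def completion_time_test(tasks):
    """Completion Time Test (CTT) for tasks given in decreasing RM
    priority order, i.e., sorted by nondecreasing period.

    tasks: list of (C_i, T_i) pairs.
    Returns the list of worst-case completion times e_i, or None as
    soon as some task cannot meet its deadline.
    """
    completions = []
    for i, (C_i, T_i) in enumerate(tasks):
        # S_0 is the work of tau_1, ..., tau_i all released at time zero.
        s = sum(C for C, _ in tasks[: i + 1])
        while True:
            if s > T_i:          # T_i < S_l: tau_i is not schedulable
                return None
            # W(S_l, Gamma): cumulative work requested during [0, S_l].
            w = sum(C * ceil(s / T) for C, T in tasks[: i + 1])
            if w == s:           # fixed point: S_{l+1} = S_l <= T_i
                completions.append(s)
                break
            s = w
    return completions

# The example from the text: (C1, T1) = (1, 3), (C2, T2) = (3, 5).
print(completion_time_test([(1, 3), (3, 5)]))   # [1, 5]
```

For the example task set, the iteration reaches the fixed point $S_1 = S_2 = 5 = T_2$, so both tasks are schedulable and $e_2 = 5$.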
It is worth noting that, by Theorem 3, the schedulability
of lower priority tasks does not guarantee the schedulability
of higher priority tasks. Therefore, in order to check the
schedulability of a set of tasks, each task must get through
the CTT when it is scheduled with all higher priority tasks.
If tasks are picked by priority order, the schedulability test can proceed in an incremental way: the CTT is performed considering tasks $\tau_1, \ldots, \tau_i$ on the period $[0, T_i]$, for $i = 1, \ldots, n$, that is, by adding one task $\tau_i$ at a time to the preceding tasks $\tau_1, \ldots, \tau_{i-1}$, without the need to test again the schedulability of $\tau_1, \ldots, \tau_{i-1}$. In this way, as soon as $e_i$ is computed, $e_i$ will not change anymore, since only lower priority tasks will be considered later.
2.5 The Rate-Monotonic First-Fit Algorithm
Dhall and Liu [3] generalized the RM algorithm to
accommodate multiprocessor systems. In particular, they
proposed the so called Rate-Monotonic First-Fit (RMFF)
algorithm. It is a partitioning algorithm, where tasks are
first assigned to processors following the RM priority order
and then all the tasks assigned to the same processor are
scheduled with the RM algorithm. Let $T_1 \le T_2 \le \ldots \le T_n$; the algorithm acts as follows. For $i = 1, 2, \ldots, n$, the generic task $\tau_i$ is assigned to the first processor $P_j$ such that $\tau_i$ and the other tasks already assigned to $P_j$ can be scheduled on $P_j$
according to RM. If no such processor exists, the task is
assigned to a new processor. Dhall and Liu showed that,
using a schedulability condition weaker than CTT, RMFF
uses about 2.33U processors in the worst case, where U is
the load of the task set. The 2.33 worst case bound was
recently lowered to 1.75 by Burchard, Liebeherr, Oh, and
Son [1], using a schedulability condition stronger than that
used in [3], but without using the RM priority order for task
assignment, and partially using CTT. In practice, however,
RMFF remains competitive, for its simplicity and efficiency.
It employs the same priority order both for assigning tasks
to processors and scheduling tasks on each processor, and
requires on the average a number of processors very close
to U when CTT is used to check for schedulability on each
processor, as confirmed also by the simulation experiments
exhibited in Section 5. Moreover, as shown in Section 4, it
can be extended in a clean way to tolerate hardware and
software failures.
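A compact way to express RMFF is to reuse the CTT as the per-processor schedulability check. The sketch below is our own, not the paper's pseudocode; tasks are $(C, T)$ pairs, and each candidate set passed to the CTT is already sorted by period because tasks are processed in RM priority order:

```python
from math import ceil

def ctt_ok(tasks):
    """True iff the task set (sorted by nondecreasing period) passes
    the Completion Time Test on a single processor under RM."""
    for i, (_, T_i) in enumerate(tasks):
        s = sum(C for C, _ in tasks[: i + 1])   # S_0
        while True:
            if s > T_i:
                return False
            w = sum(C * ceil(s / T) for C, T in tasks[: i + 1])
            if w == s:                          # fixed point reached
                break
            s = w
    return True

def rmff(tasks):
    """Rate-Monotonic First-Fit: consider tasks by RM priority order and
    assign each one to the first processor on which it passes the CTT."""
    processors = []                        # each entry: tasks on that CPU
    for task in sorted(tasks, key=lambda t: t[1]):
        for assigned in processors:
            if ctt_ok(assigned + [task]):
                assigned.append(task)      # first fit
                break
        else:
            processors.append([task])      # open a new processor
    return processors

print(len(rmff([(1, 3), (3, 5), (1, 4)])))   # 2 processors
```

In the small example above, $(1, 3)$ and $(1, 4)$ fit together on the first processor, while $(3, 5)$ fails the CTT there and opens a second one.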
3 OVERVIEW OF THE FAULT-TOLERANT RMFF ALGORITHM
This section provides an informal high-level description of
the proposed Fault-Tolerant Rate-Monotonic First-Fit
(FTRMFF) algorithm. The analysis of the algorithm is given in the next section. For the sake of simplicity, only the extension to tolerate one processor failure is discussed hereafter. Extensions to support multiple processor failures or software failures are discussed in Sections 4.5 and 4.6, respectively.
In the FTRMFF algorithm, primary and backup copies of
different tasks can be assigned to the same processor. Of
course, in order to tolerate a processor failure, the primary
copy and the backup copy of the same task should not be
assigned to the same processor. The algorithm proposed
can be viewed as the RMFF algorithm applied to a task set
including both primary and backup copies. Task copies,
both primary and backup, are ordered by increasing
periods, namely, the priority of a copy is equal to the
inverse of its period. A tie between a primary copy $\tau_i$ and its backup copy $\beta_i$ is broken by giving higher priority to $\tau_i$. Thus, task copies are indexed by decreasing RM priorities and are assigned to the processors following the order:

$$\tau_1, \beta_1, \tau_2, \beta_2, \ldots, \tau_n, \beta_n.$$
CTT is used to check whether a task copy can be assigned to a processor. Thanks to Property 1 of Section 2, CTT also provides enough information to decide whether a backup copy should be active or passive. Indeed, while checking the schedulability of a primary copy $\tau_i$, CTT also computes its worst-case completion time $e_i$. If the schedulability test for $\tau_i$ succeeds, that is, when $e_i \le T_i$, then for
BERTOSSI ET AL.: FAULT-TOLERANT RATE-MONOTONIC FIRST-FIT SCHEDULING IN HARD-REAL-TIME SYSTEMS 937

each request period there are at least $T_i - e_i$ time units to schedule $\beta_i$ as a passive copy on another processor. Let $B_i = T_i - e_i$ be the recovery time of the backup copy $\beta_i$. If $B_i$ is not smaller than the execution time of $\beta_i$, then $\beta_i$ may be scheduled as a passive copy, since there is enough time to execute $\beta_i$ after $e_i$ if a processor failure prevents $\tau_i$ from being completed; otherwise, $\beta_i$ must be scheduled as an active copy. The algorithm prefers to schedule a backup copy as a passive copy whenever possible, so as to overbook each processor with more passive copies whose primary copies are assigned to different processors.
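Under this rule, the active/passive decision reduces to comparing the recovery time with the backup's execution time. A small sketch of our own follows; it assumes the backup copy has the same execution time $C_i$ as its primary, and takes the worst-case completion time $e_i$ produced by the CTT:

```python
def backup_mode(C_i, T_i, e_i):
    """Decide how the backup copy beta_i must be scheduled, given the
    worst-case completion time e_i of the primary tau_i (from the CTT).

    Assumes the backup's execution time equals the primary's C_i.
    """
    B_i = T_i - e_i              # recovery time left in every period
    return "passive" if B_i >= C_i else "active"

print(backup_mode(1, 3, 1))      # "passive": 2 units remain, C = 1
print(backup_mode(3, 5, 5))      # "active": no slack after e_2 = 5
```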
It is worth noting that, although tasks could be assigned
to processors following any order, considering task copies
by decreasing RM priorities greatly simplifies the algo-
rithm. Indeed, such an ordering is the same ordering used
by the RM algorithm to schedule the tasks assigned to each
processor. Therefore, when a task $\tau_i$ is assigned to a processor, only lower priority tasks will be assigned later to the same processor, and the time intervals for $\tau_i$'s execution on the processor will remain unchanged. In particular, the worst-case completion time $e_i$ will also remain unchanged. This makes it possible to determine whether the backup copy $\beta_i$ of $\tau_i$ can be scheduled as a passive copy. Clearly, with another ordering, a higher priority task could be assigned to the same processor after $\tau_i$. In this case, $e_i$ would need to be recomputed, and $\beta_i$ would have to be reassigned and rescheduled. This justifies the $\tau_1, \beta_1, \tau_2, \beta_2, \ldots, \tau_n, \beta_n$ order of assignment. Moreover, since the algorithm generalizes RMFF, it assigns a backup copy $\beta_i$, either passive or active, to the first processor $P_j$ such that $\tau_i$ is not assigned to $P_j$, and $\beta_i$ and the other primary and backup copies already assigned to $P_j$ can be scheduled on $P_j$ according to the RM algorithm for a single processor.
To find a processor to which a task copy can be assigned, however, several applications of the CTT are required, taking into account both the situation in which no processor fails and the situations in which some processor fails. The applications of the test depend on the kind (primary/backup) of the task copy to be assigned, as well as on its status (active/passive) if the copy is a backup copy. There are three main assignment cases.
(A1) To assign a primary copy $\tau_i$ to a processor $P_j$, two conditions have to be checked:

• $\tau_i$ must be schedulable together with all the primary and active backup copies already assigned to $P_j$;
• $\tau_i$ must be schedulable together with all the primary copies already assigned to $P_j$ and all the active and passive backup copies assigned to $P_j$ whose corresponding primary copies are all assigned to the same processor $P_f$; this condition must be checked for all $P_f \ne P_j$.
The first condition takes into account the situation in which no failure occurs, while the second one takes into account the situation in which some processor other than $P_j$ fails. Thus, as many applications of the CTT as the total number of processors are required, in the worst case, to determine whether $\tau_i$ can be assigned to $P_j$. Note that the second condition can use the space reserved on $P_j$ for active copies whose primary copies are not assigned to $P_f$, since only one processor failure is assumed to be tolerated.
(A2) To assign an active backup copy $\beta_i$ to a processor $P_j$, assume that the primary copy $\tau_i$ is already assigned to a processor $P_p \ne P_j$; two conditions have also to be checked:

• $\beta_i$ must be schedulable together with all the primary and active backup copies already assigned to $P_j$;
• $\beta_i$ must be schedulable together with all the primary copies already assigned to $P_j$ and all the active and passive backup copies assigned to $P_j$ whose corresponding primary copies are all assigned to $P_p$.

These conditions are analogous to those of (A1), with the difference that the second one takes into account the situation where the failed processor is the one running the primary copy $\tau_i$. Thus, only two applications of the CTT are required to determine whether $\beta_i$ can be assigned to $P_j$.
(A3) Finally, to assign a passive backup copy $\beta_i$ to a processor $P_j$, assuming again that the primary copy $\tau_i$ is already assigned to a processor $P_p \ne P_j$, only one condition has to be tested, which is identical to the second condition of (A2). Thus, only one application of the CTT is needed to determine whether $\beta_i$ can be assigned to $P_j$.
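Cases (A1)-(A3) differ only in which failure scenarios must be checked. As an illustration of case (A1), the sketch below (our own data layout, not the paper's) builds, for a candidate primary copy, the task set that the CTT must accept in the no-failure scenario and in each scenario where some other processor $P_f$ fails:

```python
def a1_scenarios(copies_on_pj, candidate, other_processors):
    """Task sets to be validated by the CTT before assigning the primary
    copy `candidate` (a (C, T) pair) to processor P_j -- case (A1).

    copies_on_pj: copies already on P_j, as dicts
        {"task": (C, T), "kind": "primary" | "active" | "passive",
         "primary_on": id of the processor holding a backup's primary}
    other_processors: ids of the processors P_f != P_j.
    Returns {scenario: task set}, where scenario None means "no failure".
    """
    primaries = [c["task"] for c in copies_on_pj if c["kind"] == "primary"]
    actives = [c["task"] for c in copies_on_pj if c["kind"] == "active"]
    # No failure: primaries + active backups + the candidate.
    scenarios = {None: primaries + actives + [candidate]}
    # Failure of P_f: primaries + every backup (active or passive)
    # whose primary runs on P_f + the candidate.
    for p_f in other_processors:
        backups = [c["task"] for c in copies_on_pj
                   if c["kind"] != "primary" and c["primary_on"] == p_f]
        scenarios[p_f] = primaries + backups + [candidate]
    return scenarios

copies = [{"task": (1, 4), "kind": "primary", "primary_on": None},
          {"task": (2, 8), "kind": "passive", "primary_on": "P1"}]
print(a1_scenarios(copies, (1, 6), ["P1"]))
```

Each resulting task set would then be sorted by period and fed to the CTT; with $m$ processors there are up to $m$ scenarios, matching the worst-case count of CTT applications stated for case (A1).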
As soon as task copies are assigned to processors, all the copies assigned to the same processor are scheduled with the RM algorithm. However, in the absence of failures, each processor executes only its primary copies and active backup copies. When the processor assigned to $\beta_i$ does not receive the synchronization message of $\tau_i$ by time $hT_i + e_i$, a failure of the processor running $\tau_i$ is assumed and the passive copy $\beta_i$ is executed. To understand how to recover from a failure, assume $\tau_i$ is assigned to a processor $P_f$ which is detected at time $t^*$ to be failed, with $t^*$ belonging to $[hT_i, (h+1)T_i)$ for some $h$. If $\beta_i$ is an active copy scheduled on a processor $P_j$, then $\beta_i$ will continue to be executed and no further action is needed for $\tau_i$. If $\beta_i$ is passive, then $\beta_i$ becomes active on $P_j$ starting either from $t^*$, if the execution of $\tau_i$ was not completed by $P_f$ before $t^*$, or from $(h+1)T_i$, if the execution of $\tau_i$ was already completed before $t^*$. In other words, if $t^* > hT_i + e_i$, then $\tau_i$ was completed before $P_f$'s failure and there is no need to schedule $\beta_i$ by time $(h+1)T_i$. If $t^* \le hT_i + e_i$ then, in order to recover the lost computation of $\tau_i$, $\beta_i$ must be executed for the first time during the interval $[t^*, (h+1)T_i)$, which in general is shorter than $T_i$. It will be shown in the next section that $\beta_i$, the primary copies of $P_j$, and the backup copies of $P_j$ meet their deadlines even in this case.
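The recovery rule for a passive copy can be summarized as a small decision function (our own sketch; `t_star` is the failure-detection instant and `h` identifies the request period containing it):

```python
def passive_recovery_start(t_star, h, T_i, e_i):
    """Instant from which the passive copy beta_i becomes active on P_j,
    when P_f (running tau_i) is detected failed at time t_star, with
    h*T_i <= t_star < (h + 1)*T_i.
    """
    if t_star > h * T_i + e_i:
        # tau_i had already completed in this period: beta_i is only
        # needed from the next request period onward.
        return (h + 1) * T_i
    # The current request of tau_i is lost: beta_i must run within the
    # shortened interval [t_star, (h + 1)*T_i).
    return t_star

print(passive_recovery_start(7, 1, 5, 3))    # 7: recover within [7, 10)
print(passive_recovery_start(9, 1, 5, 3))    # 10: tau_i already done
```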
4 ANALYSIS OF THE FAULT-TOLERANT RMFF ALGORITHM
In this section, necessary and sufficient schedulability
criteria are proved which extend Theorem 3 to schedule a
set of primary and backup copies to recover from one
processor failure. Based on the proposed criteria, a fault-
tolerant extension of RMFF is derived and proved to be
correct.
4.1 Schedulability Criteria
In order to extend Theorem 3, consider a generic task set
containing both primary and backup copies which must be