Fault-tolerant rate-monotonic first-fit scheduling in hard-real-time systems
TL;DR: In this paper, fault tolerance is implemented by using a novel duplication technique where each task scheduled on a processor has either an active backup copy or a passive backup copy planned on a different processor.
Abstract: Hard-real-time systems require predictable performance despite the occurrence of failures. In this paper, fault tolerance is implemented by using a novel duplication technique where each task scheduled on a processor has either an active backup copy or a passive backup copy scheduled on a different processor. An active copy is always executed, while a passive copy is executed only in the case of a failure. First, the paper considers the ability of the widely-used rate-monotonic scheduling algorithm to meet the deadlines of periodic tasks in the presence of a processor failure. In particular, the completion time test is extended so as to check the schedulability on a single processor of a task set including backup copies. Then, the paper extends the well-known rate-monotonic first-fit assignment algorithm, where all the task copies, including the backup copies, are considered in rate-monotonic priority order and assigned to the first processor in which they fit. The proposed algorithm determines which tasks must use active duplication and which can use passive duplication. Passive duplication is preferred whenever possible, so as to overbook each processor with many passive copies whose primary copies are assigned to different processors. Moreover, the space allocated to active copies is reclaimed as soon as a failure is detected. Passive copy overbooking and active copy deallocation allow many passive copies to be scheduled sharing the same time intervals on the same processor, thus reducing the total number of processors needed. Simulation studies reveal a remarkable saving of processors with respect to those needed by the usual active duplication approach in which the schedule of the non-fault-tolerant case is duplicated on two sets of processors.
Summary (4 min read)
- Throughout industrial computing, there is an increasing demand for more complex and sophisticated hard-real-time computing systems.
- An active copy has two advantages: it requires no synchronization with its primary copy (it can run before, after, or concurrently with the other copy), and it has a larger time window for execution, namely the whole period of the task.
- In particular, this paper extends the RMFF algorithm to tolerate failures under the assumption that processors fail in a fail-stop manner.
- This section gives a formal definition of the scheduling problem and a precise specification of the fault tolerance model.
- Moreover, important properties of the well-known RM, CTT, and RMFF algorithms are recalled.
2.1 The Scheduling Problem
- The requests for $\tau_i$ are periodic, with constant interval $T_i$ between every two consecutive requests, and $\tau_i$'s first request occurs at time 0.
- The worst-case execution time for all the requests of $\tau_i$ is constant and equal to $C_i$, with $C_i \le T_i$. Periodic tasks $\tau_1, \ldots, \tau_n$ are independent, that is, the requests of any task do not depend on the execution of the other tasks.
2.2 The Fault-Tolerant Model
- The fault-tolerant scheduling problem consists of finding a schedule for the tasks so as to satisfy the following additional condition: (S4) fault tolerance is guaranteed, namely, conditions (S1)-(S3) are verified even in the presence of failures.
- In order to achieve fault tolerance, two copies for each task are used, called primary and backup copies.
- In practice, the backup execution time $D_i$ is smaller than $C_i$, since backup copies usually provide a reduced functionality in a smaller execution time than the primary copies.
- Since the message must contain the indices of the primary task and of the sender and receiver processors, its size is $O(\log n)$ bits, so the message is small.
- The overhead needed for such processor failure detections is mainly given by the short-message latency of the communication subsystem employed.
2.3 The Rate-Monotonic Algorithm
- Liu and Layland proposed a fixed-priority scheduling algorithm, called Rate-Monotonic (RM), for solving the (non-fault-tolerant) problem stated in Section 2.1 on a single-processor system, that is, when $m = 1$.
- At any instant of time, a pending task with the highest priority is scheduled.
- Liu and Layland proved the following two important results concerning fixed-priority scheduling algorithms.
- The largest response time for any periodic request of $\tau_i$ occurs whenever $\tau_i$ is requested simultaneously with the requests for all higher priority tasks.
- A critical instant occurs when all tasks are in phase at time zero, which is called critical instant phasing, because it is the phasing that results in the longest response time for the first request of each task.
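The critical-instant result can be illustrated with a small discrete-time simulation. The sketch below is only an illustration, not code from the paper: it assumes integer execution times and periods, represents each task as a hypothetical `(C, T)` pair, releases all tasks in phase at time 0, and always runs the pending task with the shortest period (the RM priority rule). It reports when the first request of each task completes.

```python
def first_response_times(tasks):
    """Simulate preemptive RM from the critical instant (all tasks in phase at 0).

    tasks: list of (C, T) pairs with integer C <= T (illustrative representation).
    Returns the completion time of the first request of each task, listed in
    RM priority order (shortest period first); None if it does not complete
    within the simulated horizon.
    """
    ordered = sorted(tasks, key=lambda ct: ct[1])   # RM: shorter period = higher priority
    horizon = max(T for _, T in ordered)
    n = len(ordered)
    pending = [0] * n        # outstanding work per task
    executed = [0] * n       # total work done per task so far
    first_done = [None] * n

    for t in range(horizon):
        # Release a new request of every task whose period divides t.
        for i, (C, T) in enumerate(ordered):
            if t % T == 0:
                pending[i] += C
        # Run the highest-priority pending task for one time unit.
        for i, (C, _) in enumerate(ordered):
            if pending[i] > 0:
                pending[i] -= 1
                executed[i] += 1
                if executed[i] == C:        # first request just finished
                    first_done[i] = t + 1
                break
    return first_done


print(first_response_times([(3, 12), (1, 4), (2, 6)]))
```

For this task set the lowest-priority task is preempted by later requests of the two higher-priority tasks, which is why its first response time is much larger than its execution time.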
2.4 The Completion Time Test
- From Theorems 1 and 2, the following necessary and sufficient schedulability criterion was derived by Joseph and Pandya.
- This schedulability test is called Completion Time Test (CTT).
- It is worth noting that, by Theorem 3, the schedulability of lower priority tasks does not guarantee the schedulability of higher priority tasks.
- Therefore, in order to check the schedulability of a set of tasks, each task must get through the CTT when it is scheduled with all higher priority tasks.
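A common way to carry out the Completion Time Test is the fixed-point iteration on the cumulative demand $W_i(t) = C_i + \sum_{h<i} \lceil t/T_h \rceil C_h$, accepting task $\tau_i$ if $W_i(t) = t$ for some $t \le T_i$. The following is a minimal sketch under that formulation; the function names and the `(C, T)` tuple representation are illustrative assumptions, not taken from the paper.

```python
import math


def completion_time(i, ordered):
    """Smallest t with W_i(t) == t for task i, or None if it exceeds T_i.

    ordered: list of (C, T) pairs already sorted by RM priority
    (shortest period first). Illustrative representation only.
    """
    C_i, T_i = ordered[i]
    t = sum(C for C, _ in ordered[: i + 1])          # initial guess: one request of each
    while t <= T_i:
        # Demand of task i plus all higher-priority requests released in [0, t).
        w = C_i + sum(math.ceil(t / T) * C for C, T in ordered[:i])
        if w == t:
            return t
        t = w
    return None


def ctt_schedulable(tasks):
    """Every task must pass the test together with all higher-priority tasks."""
    ordered = sorted(tasks, key=lambda ct: ct[1])
    return all(completion_time(i, ordered) is not None for i in range(len(ordered)))


print(ctt_schedulable([(1, 4), (2, 6), (3, 12)]))   # True
print(ctt_schedulable([(1, 2), (2, 3)]))            # False
```

Each task is tested separately because, as noted above, a lower-priority task passing the test says nothing about the tasks above it.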
2.5 The Rate-Monotonic First-Fit
- Dhall and Liu generalized the RM algorithm to accommodate multiprocessor systems.
- In particular, they proposed the so called Rate-Monotonic First-Fit (RMFF) algorithm.
- It is a partitioning algorithm, where tasks are first assigned to processors following the RM priority order and then all the tasks assigned to the same processor are scheduled with the RM algorithm.
- Dhall and Liu showed that, using a schedulability condition weaker than CTT, RMFF uses about $2.33U$ processors in the worst case, where $U$ is the load of the task set.
- In practice, however, RMFF remains competitive, for its simplicity and efficiency.
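The partitioning step described above can be sketched as a first-fit loop. This version is an illustration only: it uses the Completion Time Test as the "fits" check (Dhall and Liu's original analysis used a weaker condition), assumes every task satisfies $C_i \le T_i$, and represents tasks as hypothetical `(C, T)` pairs.

```python
import math


def fits(ordered):
    """Completion Time Test for a list of (C, T) pairs sorted by period."""
    for i, (C_i, T_i) in enumerate(ordered):
        t = sum(C for C, _ in ordered[: i + 1])
        while True:
            if t > T_i:
                return False
            w = C_i + sum(math.ceil(t / T) * C for C, T in ordered[:i])
            if w == t:
                break
            t = w
    return True


def rmff(tasks):
    """Assign tasks in RM priority order to the first processor where they fit."""
    processors = []                                   # one task list per processor
    for task in sorted(tasks, key=lambda ct: ct[1]):  # shortest period first
        for proc in processors:
            # Tasks arrive in RM order, so appending keeps the list sorted.
            if fits(proc + [task]):
                proc.append(task)
                break
        else:
            processors.append([task])                 # open a new processor
    return processors


print(rmff([(2, 6), (1, 4), (1, 3), (1, 2)]))
```

In this example the third task does not fit on the first processor, so a second processor is opened, and the fourth task lands there as well.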
3 OVERVIEW OF THE FAULT-TOLERANT RMFF ALGORITHM
- This section provides an informal high-level description of the proposed Fault-Tolerant Rate-Monotonic First-Fit algorithm.
- The algorithm prefers to schedule a backup copy as a passive copy whenever possible, so as to overbook each processor with more passive copies whose primary copies are assigned to different processors.
- Clearly, with another ordering a higher priority task can be assigned to the same processor after $\tau_i$.
- The copy of task $i$ being assigned must be schedulable together with all the primary and active backup copies already assigned to $P_j$.
- These conditions are analogous to those of (A1), with the difference that the second one takes into account the situation where the failed processor is the one running the primary copy $\tau_i$.
4.1 Schedulability Criteria
- In order to extend Theorem 3, consider a generic task set containing both primary and backup copies which must be scheduled all together on a single processor.
- The set $\mathrm{active}(P_j)$ includes the active backup copies assigned to processor $P_j$.
- Let the periodic tasks assigned to processor $P_j$ be given in priority order. Consider now the case that a processor failure occurs.
- Hence, the proof follows from Theorem 3.
4.2 Fault-Tolerant CTT
- Based on Theorems 4 and 5, two kinds of schedulability tests are needed, one to check for schedulability in the absence of failures, and the other to check for schedulability after a processor failure.
- Pj, since the recovery from the failure of any processor other than Pj must be taken into account.
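The two kinds of tests can be sketched as follows for a single processor $P_j$. This is a simplified illustration under stated assumptions, not the paper's exact test: every backup copy is treated as a periodic task with its own execution time and the full period as its window (the paper's handling of passive copies through the recovery time $B_i$ is not modeled), and the `Copy` record and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    task: str        # task name (illustrative)
    exec_time: int   # C_i for a primary copy, D_i for a backup copy
    period: int      # T_i
    kind: str        # "primary", "active" or "passive"
    primary_on: int  # index of the processor hosting the primary copy


def ctt(copies):
    """Completion Time Test on a set of copies, RM priority by period."""
    ordered = sorted(copies, key=lambda c: c.period)
    for i, c in enumerate(ordered):
        t = sum(x.exec_time for x in ordered[: i + 1])
        while True:
            if t > c.period:
                return False
            w = c.exec_time + sum(
                math.ceil(t / h.period) * h.exec_time for h in ordered[:i]
            )
            if w == t:
                break
            t = w
    return True


def ft_ctt(j, assigned, num_procs):
    """Check processor j without failures and after each other single failure."""
    # No failure: primaries and active backups run; passive copies stay idle.
    if not ctt([c for c in assigned if c.kind in ("primary", "active")]):
        return False
    # Failure of P_f: primaries on P_j plus the backups whose primary was on P_f.
    # Active copies of primaries on non-faulty processors are deallocated.
    for f in range(num_procs):
        if f == j:
            continue
        after = [c for c in assigned if c.kind == "primary" or c.primary_on == f]
        if not ctt(after):
            return False
    return True


p0 = [
    Copy("a", 2, 5, "primary", 0),
    Copy("b", 2, 5, "passive", 1),
    Copy("c", 2, 5, "passive", 2),
]
print(ft_ctt(0, p0, 3))   # True: the two passive copies never run together
print(ctt(p0))            # False: all three copies at once would not fit
```

The example shows the overbooking effect: the two passive copies share the same spare capacity because their primaries sit on different processors, so at most one of them is ever activated by a single failure.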
4.3 Fault-Tolerant RMFF
- The first step assigns the primary copy $\tau_i$ to the first processor in which it fits.
- The second step establishes the recovery time $B_i$ and the status of the backup copy of $\tau_i$.
- Thus, duplicating on two sets of processors the schedule for the nonfault-tolerant case requires at least four processors to tolerate one failure.
- The proposed FTRMFF algorithm, instead, tolerates one failure using three processors only.
- The procedure FTRMFF-Assignment is executed off-line and requires $O(nm^2)$ schedulability tests to be performed.
4.4 Recovery from a Processor Failure
- Once an assignment is found by the FTRMFF algorithm, each processor.
- The uncompleted tasks assigned to Pf are recovered by the remaining nonfaulty processors.
- Note also that, in any case, all the active backup copies of primary tasks scheduled on the nonfaulty processors are deallocated from $P_j$. The procedure FTRMFF-Recovery($P_f$) starts with step (1): do the following steps in parallel for all the processors.
- The procedure FTRMFF-Recovery is executed on-line and is very fast, since all the required sets, including passiveRecover($P_j$, $P_f$) and recover($P_j$, $P_f$), were previously computed off-line by the FTRMFF-Assignment procedure, which already performed all the schedulability tests, too.
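Because the recovery sets are computed off-line, the on-line step reduces to table lookups. The sketch below illustrates that idea only; the dictionary layout, the function name, and the assumption that `recover[(j, f)]` lists every backup copy on $P_j$ whose primary copy was on $P_f$ are hypothetical and not the paper's data structures.

```python
def ftrmff_recovery(failed, primary, active, recover):
    """Return what each non-faulty processor runs and drops after a failure.

    failed:  index of the failed processor P_f
    primary: {j: [primary copies on P_j]}
    active:  {j: [active backup copies on P_j]}
    recover: {(j, f): [backup copies on P_j to run if P_f fails]}, precomputed
             off-line (assumed layout, for illustration only)
    """
    run, dropped = {}, {}
    for j in primary:
        if j == failed:
            continue
        to_recover = recover.get((j, failed), [])
        # Keep the primaries and start the backups that replace P_f's tasks.
        run[j] = list(primary[j]) + list(to_recover)
        # Active backups of primaries on non-faulty processors are deallocated.
        dropped[j] = [c for c in active.get(j, []) if c not in to_recover]
    return run, dropped


primary = {0: ["a"], 1: ["b"], 2: ["c"]}
active = {0: ["c_backup"], 1: [], 2: []}
recover = {(0, 1): ["b_backup"], (0, 2): ["c_backup"], (1, 0): ["a_backup"]}

run, dropped = ftrmff_recovery(1, primary, active, recover)
print(run)      # {0: ['a', 'b_backup'], 2: ['c']}
print(dropped)  # {0: ['c_backup'], 2: []}
```

No schedulability test appears in this step, which matches the observation above that all tests were already performed during assignment.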
4.5 Tolerating Many Processor Failures
- In order to tolerate many processor failures, spare processors must be employed to replace failed processors on-line.
- The failure of $P_f$ is detected within the closest completion time of the task set $\mathrm{primary}(P_f) \cup \mathrm{active}(P_f)$, and the time interval between two consecutive failures is three times the largest task request period.
- During the recovery phase, all the passive copies of the uncompleted tasks assigned to $P_f$ are executed by the nonfaulty processors only once (step 1.1.2 of FTRMFF-Replacing), and the spare processor $P_s$ inherits.
- The reconfiguration phase is completed by time $2T_n$.
- If there are $q$ spare processors, $q$ faulty processors can be replaced by means of the FTRMFF-Replacing procedure, while one additional failure can be tolerated by means of the FTRMFF-Recovery procedure.
4.6 Tolerating Software Failures
- In addition to processor failures, a hard-real-time system can fail also due to design faults in the software.
- To explain the ideas of the approach, assume that two different implementations of the same task specification are provided.
- Since the processors are assumed fail-stop, if the acceptance test fails, it signals the presence of an error in the software.
- The time to execute the acceptance test is assumed to be included in the primary copy execution time.
- One approach to implement the recovery from software failures is as follows.
5 SIMULATION EXPERIMENTS
- In order to evaluate the number of processors used by the FTRMFF algorithm for scheduling both primary and backup copies, simulation experiments are performed.
- For each chosen setting of $n$ and the other experiment parameter, the experiment is repeated 30 times, and the average result is computed.
- The performance metric in all the experiments is the number of processors required to assign a given task set.
- In the outcome of the experiments, the authors denote with N the number of processors required by the FTRMFF algorithm for a task set consisting of both primary and backup task copies, and with M the number of processors required by the RMFF algorithm for a task set with identical primary copies and no backup copies.
6 CONCLUDING REMARKS
- This paper has considered the problem of preemptively scheduling a set of independent periodic tasks under the assumption that each task deadline coincides with the next request of the same task.
- The proposed FTRMFF algorithm extends the well-known Rate-Monotonic First-Fit scheduling algorithm to tolerate failures, uses a novel combined active/passive duplication scheme, and determines by itself which tasks should use passive duplication and which should use active duplication.
- This optimization is left for further work.
- It is worth noting that the proposed algorithm works also if some backup copies are forced to be active.
- Finally, further research could deal with assignment strategies which are different from those considered in this paper.
- The C++ code used in the simulation experiments was written by Andrea Fusiello.
- This work was supported by grants from the Ministero dell'Università e della Ricerca Scientifica e Tecnologica, the Consiglio Nazionale delle Ricerche, and the Università di Trento (Progetto Speciale 1997).
Cites methods from "Fault-tolerant rate-monotonic first..."
...As an added measure of fault-tolerance, the proposed algorithm also takes the reliability of the processors into account....
Cites background from "Fault-tolerant rate-monotonic first..."
...Literature examples include Krishna and Shin (1986) and Bertossi et al. (1999) on periodic tasks and Ghosh et al. (1994) on aperiodic tasks. For example, Figure 19 shows an example of scheduling four primary tasks (P1, P2, P3, P4) and their backups (B1, B2, B3, B4) on three processors, as indicated in Ghosh et al. (1994). When the original tasks complete execution, their backups are deallocated and the space can be used for scheduling other tasks....
Cites methods from "Fault-tolerant rate-monotonic first..."
...The well-known Rate-Monotonic First-Fit assignment algorithm was extended in ....
"Fault-tolerant rate-monotonic first..." refers background in this paper
...A periodic task $\tau_i$ is completely identified by a pair $(C_i, T_i)$, where $C_i$ is $\tau_i$'s execution time and $T_i$ is $\tau_i$'s request period....
"Fault-tolerant rate-monotonic first..." refers methods in this paper
...Liu and Layland introduced the Rate-Monotonic (RM) algorithm for preemptively scheduling periodic tasks on a single processor, under the assumption that task deadlines are equal to their periods....
...RM was generalized to multiprocessor systems by Dhall and Liu, who proposed, among others, the Rate-Monotonic First-Fit (RMFF) heuristic....
Frequently Asked Questions (2)
Q1. What are the contributions in "Fault-tolerant rate-monotonic first-fit scheduling in hard-real-time systems"?
In this paper, fault tolerance is implemented by using a novel duplication technique where each task scheduled on a processor has either an active backup copy or a passive backup copy scheduled on a different processor. First, the paper considers the ability of the widely-used Rate-Monotonic scheduling algorithm to meet the deadlines of periodic tasks in the presence of a processor failure. Then, the paper extends the well-known Rate-Monotonic First-Fit assignment algorithm, where all the task copies, including the backup copies, are considered in Rate-Monotonic priority order and assigned to the first processor in which they fit. Moreover, the space allocated to active copies is reclaimed as soon as a failure is detected.
Q2. What are the future works in "Fault-tolerant rate-monotonic first-fit scheduling in hard-real-time systems"?
However, further research is needed, e.g., to derive an analytical worst-case bound on the number of processors used by the proposed FTRMFF algorithm, or to devise schedulability conditions which are weaker but simpler than the Completion Time Test, e.g., those proposed in [1]. This optimization is left for further work. As a subject for future research, the combined duplication scheme proposed in the present paper could be used to extend the Rate-Monotonic First-Fit algorithm in order to tolerate failures also in the presence of resource reclaiming and task synchronization. Finally, further research could deal with assignment strategies which are different from those considered in this paper.