Journal ArticleDOI

Transparent recovery from intermittent faults in time-triggered distributed systems

TLDR
This work introduces the cluster-based failure recovery concept, which determines the best placement of slack within the FT schedule so as to minimize the resulting time overhead, and provides transparent failure recovery: a processor recovering from task failures does not disrupt the operation of other processors.
Abstract
The time-triggered model, with tasks scheduled in static (offline) fashion, provides a high degree of timing predictability in safety-critical distributed systems. Such systems must also tolerate transient and intermittent failures, which occur far more frequently than permanent ones. Software-based recovery using temporal redundancy, such as task reexecution and primary/backup, incurs performance overhead but is a cost-effective way of handling these failures. We present a constructive approach to integrating runtime recovery policies in a time-triggered distributed system. Furthermore, the method provides transparent failure recovery in that a processor recovering from task failures does not disrupt the operation of other processors. Given a general task graph with precedence and timing constraints and a specific fault model, the proposed method constructs the corresponding fault-tolerant (FT) schedule with sufficient slack to accommodate recovery. We introduce the cluster-based failure recovery concept, which determines the best placement of slack within the FT schedule so as to minimize the resulting time overhead. Contingency schedules, also generated offline, revise this FT schedule to mask task failures on individual processors while preserving precedence and timing constraints. We present simulation results which show that, for small-scale embedded systems having task graphs of moderate complexity, the proposed approach generates FT schedules that incur about 30-40 percent performance overhead compared to corresponding non-fault-tolerant ones.
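The slack-sharing idea behind cluster-based recovery can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's algorithm: tasks on one processor are grouped into clusters, and each cluster is followed by slack equal to its longest task, which is enough to re-execute any single task in the cluster after a transient fault. The cluster sizes, task names, and execution times below are invented.

```python
# Hypothetical sketch of cluster-based slack placement on one processor.
# Assumption: at most one transient fault per cluster, recovered by
# re-executing the failed task within the cluster's reserved slack.

def schedule_with_slack(clusters):
    """clusters: list of lists of (task_name, wcet) tuples, in static order.
    Returns (schedule, makespan), where schedule maps task -> start time."""
    schedule = {}
    t = 0
    for cluster in clusters:
        for name, wcet in cluster:
            schedule[name] = t
            t += wcet
        # Reserve slack for one re-execution of the longest task:
        # shared by every task in the cluster, not paid per task.
        t += max(wcet for _, wcet in cluster)
    return schedule, t

# Two invented clusters: slack is shared within each cluster, so the
# overhead is one worst-case re-execution per cluster, not per task.
sched, makespan = schedule_with_slack(
    [[("A", 4), ("B", 2), ("C", 3)], [("D", 5), ("E", 1)]]
)
print(makespan)  # 24 = (4+2+3)+4 + (5+1)+5
```

Placing slack per cluster rather than per task is what keeps the overhead bounded: a finer clustering recovers faster but reserves more total slack.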

Citations
Proceedings ArticleDOI

Design Optimization of Time- and Cost-Constrained Fault-Tolerant Distributed Embedded Systems

TL;DR: The design optimization approach decides the mapping of processes to processors and the assignment of fault-tolerant policies to processes such that transient faults are tolerated and the timing constraints of the application are satisfied.
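The design-space exploration this summary describes can be illustrated with a deliberately brute-force sketch (real approaches use heuristics, not enumeration): try every mapping of processes to processors and every per-process fault-tolerance policy, and keep the cheapest combination that still meets the deadline. The cost model, the policy names ("reexec", "replicate"), and all numbers are invented for illustration.

```python
from itertools import product

# Toy exploration: "reexec" reserves time for one re-execution on the
# mapped processor; "replicate" runs an active copy on every processor.
# Cost here simply counts replicated processes (a stand-in for resources).

def explore(wcet, n_proc, deadline):
    procs = range(n_proc)
    best = None
    for mapping in product(procs, repeat=len(wcet)):
        for policy in product(("reexec", "replicate"), repeat=len(wcet)):
            load = [0.0] * n_proc
            for (_, c), p, pol in zip(wcet.items(), mapping, policy):
                if pol == "reexec":
                    load[p] += 2 * c          # time for one re-execution
                else:
                    for q in procs:           # active replica everywhere
                        load[q] += c
            if max(load) <= deadline:
                cost = policy.count("replicate")
                if best is None or cost < best[0]:
                    best = (cost, mapping, policy)
    return best

best_cost, mapping, policy = explore({"a": 3.0, "b": 2.0}, 2, 5)
print(best_cost, policy)  # 2 ('replicate', 'replicate')
```

With the loose deadline 10, plain re-execution suffices (cost 0); tightening it to 5 forces replication of both processes, which is the kind of time/cost trade-off the optimization navigates.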
Journal ArticleDOI

Design Optimization of Time- and Cost-Constrained Fault-Tolerant Embedded Systems With Checkpointing and Replication

TL;DR: This work uses checkpointing with rollback recovery and active replication for tolerating transient faults, and presents several design optimization approaches which are able to find fault-tolerant implementations given a limited amount of resources.
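A standard back-of-the-envelope model (not necessarily the one used in this paper) shows why the number of checkpoints matters: each checkpoint adds overhead, but each fault then only re-executes the segment since the last checkpoint. A small sketch with invented numbers:

```python
def wcrt(C, k, chi, f, mu=0.0):
    """Worst-case response time: base WCET C, k equidistant checkpoints
    of overhead chi each, and up to f transient faults, each of which
    re-executes one segment of length C/k plus a fixed recovery cost mu."""
    return C + k * chi + f * (C / k + mu)

def best_k(C, chi, f, k_max=50):
    # Exhaustive search over k; the continuous optimum of this model
    # is k* = sqrt(f * C / chi).
    return min(range(1, k_max + 1), key=lambda k: wcrt(C, k, chi, f))

C, chi, f = 100.0, 2.0, 2
k = best_k(C, chi, f)
print(k, wcrt(C, k, chi, f))  # 10 140.0
```

Too few checkpoints make each rollback expensive; too many make the fault-free run expensive. The optimization approaches in the paper balance exactly this kind of trade-off under resource limits.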

Scheduling and Voltage Scaling for Energy/Reliability Tradeoffs in Fault-Tolerant Time-Triggered Embedded Systems

Proceedings ArticleDOI

Scheduling and voltage scaling for energy/reliability trade-offs in fault-tolerant time-triggered embedded systems

TL;DR: This paper presents a constraint logic programming-based approach to the scheduling and voltage scaling of low-power fault-tolerant hard real-time applications mapped on distributed heterogeneous embedded systems.
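The energy/reliability trade-off rests on a common simplification of CMOS dynamic power (not this paper's exact model): execution time stretches roughly inversely with supply voltage, while dynamic energy falls roughly with its square, so slowing a task saves energy but consumes slack that could otherwise absorb fault recovery. A toy sketch:

```python
def scaled(time_at_vmax, v, v_max=1.0):
    """Return (execution time, relative dynamic energy) at voltage v,
    assuming frequency ~ V (so time ~ 1/V) and energy ~ V^2."""
    t = time_at_vmax * v_max / v
    e = (v / v_max) ** 2
    return t, e

t, e = scaled(10.0, 0.5)
print(t, e)  # 20.0 0.25 -> twice as slow, a quarter of the dynamic energy
```

Under this model, halving the voltage quarters the energy but doubles the execution time, which is why voltage scaling competes directly with the slack needed for re-execution.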
Journal ArticleDOI

Reliability-Driven System-Level Synthesis for Mixed-Critical Embedded Systems

TL;DR: This paper proposes a design methodology that enhances the classical system-level design flow for embedded systems to introduce reliability-awareness, and allows the designer to specify that only some parts of the systems need to be hardened against faults.
References
Journal ArticleDOI

Static scheduling algorithms for allocating directed task graphs to multiprocessors

TL;DR: A taxonomy that classifies 27 scheduling algorithms and their functionalities into different categories is proposed, with each algorithm explained through an easy-to-understand description followed by an illustrative example to demonstrate its operation.
Journal ArticleDOI

DSC: scheduling parallel tasks on an unbounded number of processors

TL;DR: A low-complexity heuristic for scheduling parallel tasks on an unbounded number of completely connected processors, named the dominant sequence clustering (DSC) algorithm, which guarantees a performance within a factor of 2 of the optimum for general coarse-grain DAGs.
Journal ArticleDOI

A comparison of list schedules for parallel processing systems

TL;DR: The problem of scheduling two or more processors to minimize the execution time of a program consisting of partially ordered tasks is studied, and a dynamic-programming solution is presented for the case in which execution times are random variables.
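A minimal list scheduler in the spirit of the heuristics such comparisons cover: tasks are prioritized by static level (the longest sum of execution times from the task down to an exit node) and greedily assigned to the earliest-free processor. The four-task graph below is invented for illustration.

```python
# HLFET-style list scheduling sketch: highest static level first,
# earliest-free processor, precedence constraints respected.

def static_levels(wcet, succ):
    levels = {}
    def level(t):
        if t not in levels:
            levels[t] = wcet[t] + max(
                (level(s) for s in succ.get(t, [])), default=0)
        return levels[t]
    for t in wcet:
        level(t)
    return levels

def list_schedule(wcet, succ, m):
    pred = {t: set() for t in wcet}
    for t, ss in succ.items():
        for s in ss:
            pred[s].add(t)
    levels = static_levels(wcet, succ)
    finish = {}                 # task -> finish time
    proc_free = [0] * m         # next free instant per processor
    ready = [t for t in wcet if not pred[t]]
    while ready:
        ready.sort(key=lambda t: -levels[t])   # highest level first
        t = ready.pop(0)
        p = min(range(m), key=lambda i: proc_free[i])
        start = max([proc_free[p]] + [finish[q] for q in pred[t]])
        finish[t] = start + wcet[t]
        proc_free[p] = finish[t]
        for s in succ.get(t, []):
            if s not in finish and all(q in finish for q in pred[s]):
                ready.append(s)
    return finish

# Invented DAG: A -> C, B -> C, A -> D on two processors.
finish = list_schedule({"A": 2, "B": 1, "C": 3, "D": 2},
                       {"A": ["C", "D"], "B": ["C"]}, 2)
print(max(finish.values()))  # makespan 5
```

The choice of priority function is exactly what such comparison studies vary; static level is one of the simplest options.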
Journal ArticleDOI

Distributed fault-tolerant real-time systems: the Mars approach

TL;DR: The authors focus on the maintainability of the Mars architecture, the Mars operating system, and the control of a rolling mill that produces metal plates and bars, and discuss timing analysis.
Book

Task scheduling in parallel and distributed systems

TL;DR: This chapter discusses the relationship between Matching and Two-Processor Scheduling, Optimal Scheduling Algorithms, Static Scheduling Heuristics, and Dynamic Task Allocation in the SPMD Model.