Journal ArticleDOI

Transparent recovery from intermittent faults in time-triggered distributed systems

TLDR
This work introduces the cluster-based failure recovery concept, which determines the best placement of slack within the FT schedule so as to minimize the resulting time overhead, and provides transparent failure recovery: a processor recovering from task failures does not disrupt the operation of other processors.
Abstract
The time-triggered model, with tasks scheduled in static (offline) fashion, provides a high degree of timing predictability in safety-critical distributed systems. Such systems must also tolerate transient and intermittent failures, which occur far more frequently than permanent ones. Software-based recovery using temporal redundancy, such as task reexecution and primary/backup, incurs performance overhead but is a cost-effective way of handling these failures. We present a constructive approach to integrating runtime recovery policies in a time-triggered distributed system. Furthermore, the method provides transparent failure recovery in that a processor recovering from task failures does not disrupt the operation of other processors. Given a general task graph with precedence and timing constraints and a specific fault model, the proposed method constructs the corresponding fault-tolerant (FT) schedule with sufficient slack to accommodate recovery. We introduce the cluster-based failure recovery concept, which determines the best placement of slack within the FT schedule so as to minimize the resulting time overhead. Contingency schedules, also generated offline, revise this FT schedule to mask task failures on individual processors while preserving precedence and timing constraints. We present simulation results which show that, for small-scale embedded systems having task graphs of moderate complexity, the proposed approach generates FT schedules that incur about 30-40 percent performance overhead compared to corresponding non-fault-tolerant ones.
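The slack-sharing idea behind cluster-based recovery can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's algorithm: tasks on one processor are grouped into clusters, and each cluster is followed by slack equal to its longest task, which is enough to re-execute any single task in the cluster after a transient fault. The cluster sizes, task names, and execution times below are invented.

```python
# Hypothetical sketch of cluster-based slack placement on one processor.
# Assumption: at most one transient fault per cluster, recovered by
# re-executing the failed task within the cluster's reserved slack.

def schedule_with_slack(clusters):
    """clusters: list of lists of (task_name, wcet) tuples, in static order.
    Returns (schedule, makespan), where schedule maps task -> start time."""
    schedule = {}
    t = 0
    for cluster in clusters:
        for name, wcet in cluster:
            schedule[name] = t
            t += wcet
        # Reserve slack for one re-execution of the longest task:
        # shared by every task in the cluster, not paid per task.
        t += max(wcet for _, wcet in cluster)
    return schedule, t

# Two invented clusters: slack is shared within each cluster, so the
# overhead is one worst-case re-execution per cluster, not per task.
sched, makespan = schedule_with_slack(
    [[("A", 4), ("B", 2), ("C", 3)], [("D", 5), ("E", 1)]]
)
print(makespan)  # 24 = (4+2+3)+4 + (5+1)+5
```

Placing slack per cluster rather than per task is what keeps the overhead bounded: a finer clustering recovers faster but reserves more total slack.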

Citations
Proceedings ArticleDOI

Design Optimization of Time- and Cost-Constrained Fault-Tolerant Distributed Embedded Systems

TL;DR: The design optimization approach decides the mapping of processes to processors and the assignment of fault-tolerant policies to processes such that transient faults are tolerated and the timing constraints of the application are satisfied.
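The design-space exploration this summary describes can be illustrated with a deliberately brute-force sketch (real approaches use heuristics, not enumeration): try every mapping of processes to processors and every per-process fault-tolerance policy, and keep the cheapest combination that still meets the deadline. The cost model, the policy names ("reexec", "replicate"), and all numbers are invented for illustration.

```python
from itertools import product

# Toy exploration: "reexec" reserves time for one re-execution on the
# mapped processor; "replicate" runs an active copy on every processor.
# Cost here simply counts replicated processes (a stand-in for resources).

def explore(wcet, n_proc, deadline):
    procs = range(n_proc)
    best = None
    for mapping in product(procs, repeat=len(wcet)):
        for policy in product(("reexec", "replicate"), repeat=len(wcet)):
            load = [0.0] * n_proc
            for (_, c), p, pol in zip(wcet.items(), mapping, policy):
                if pol == "reexec":
                    load[p] += 2 * c          # time for one re-execution
                else:
                    for q in procs:           # active replica everywhere
                        load[q] += c
            if max(load) <= deadline:
                cost = policy.count("replicate")
                if best is None or cost < best[0]:
                    best = (cost, mapping, policy)
    return best

best_cost, mapping, policy = explore({"a": 3.0, "b": 2.0}, 2, 5)
print(best_cost, policy)  # 2 ('replicate', 'replicate')
```

With the loose deadline 10, plain re-execution suffices (cost 0); tightening it to 5 forces replication of both processes, which is the kind of time/cost trade-off the optimization navigates.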
Journal ArticleDOI

Design Optimization of Time- and Cost-Constrained Fault-Tolerant Embedded Systems With Checkpointing and Replication

TL;DR: This work uses checkpointing with rollback recovery and active replication for tolerating transient faults, and presents several design optimization approaches which are able to find fault-tolerant implementations given a limited amount of resources.
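A standard back-of-the-envelope model (not necessarily the one used in this paper) shows why the number of checkpoints matters: each checkpoint adds overhead, but each fault then only re-executes the segment since the last checkpoint. A small sketch with invented numbers:

```python
def wcrt(C, k, chi, f, mu=0.0):
    """Worst-case response time: base WCET C, k equidistant checkpoints
    of overhead chi each, and up to f transient faults, each of which
    re-executes one segment of length C/k plus a fixed recovery cost mu."""
    return C + k * chi + f * (C / k + mu)

def best_k(C, chi, f, k_max=50):
    # Exhaustive search over k; the continuous optimum of this model
    # is k* = sqrt(f * C / chi).
    return min(range(1, k_max + 1), key=lambda k: wcrt(C, k, chi, f))

C, chi, f = 100.0, 2.0, 2
k = best_k(C, chi, f)
print(k, wcrt(C, k, chi, f))  # 10 140.0
```

Too few checkpoints make each rollback expensive; too many make the fault-free run expensive. The optimization approaches in the paper balance exactly this kind of trade-off under resource limits.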

Scheduling and Voltage Scaling for Energy/Reliability Tradeoffs in Fault-Tolerant Time-Triggered Embedded Systems

Proceedings ArticleDOI

Scheduling and voltage scaling for energy/reliability trade-offs in fault-tolerant time-triggered embedded systems

TL;DR: This paper presents a constraint logic programming-based approach to the scheduling and voltage scaling of low-power fault-tolerant hard real-time applications mapped on distributed heterogeneous embedded systems.
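The energy/reliability trade-off rests on a common simplification of CMOS dynamic power (not this paper's exact model): execution time stretches roughly inversely with supply voltage, while dynamic energy falls roughly with its square, so slowing a task saves energy but consumes slack that could otherwise absorb fault recovery. A toy sketch:

```python
def scaled(time_at_vmax, v, v_max=1.0):
    """Return (execution time, relative dynamic energy) at voltage v,
    assuming frequency ~ V (so time ~ 1/V) and energy ~ V^2."""
    t = time_at_vmax * v_max / v
    e = (v / v_max) ** 2
    return t, e

t, e = scaled(10.0, 0.5)
print(t, e)  # 20.0 0.25 -> twice as slow, a quarter of the dynamic energy
```

Under this model, halving the voltage quarters the energy but doubles the execution time, which is why voltage scaling competes directly with the slack needed for re-execution.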
Journal ArticleDOI

Reliability-Driven System-Level Synthesis for Mixed-Critical Embedded Systems

TL;DR: This paper proposes a design methodology that enhances the classical system-level design flow for embedded systems to introduce reliability-awareness, and allows the designer to specify that only some parts of the systems need to be hardened against faults.
References
Journal ArticleDOI

Static scheduling algorithms for allocating directed task graphs to multiprocessors

TL;DR: A taxonomy that classifies 27 scheduling algorithms and their functionalities into different categories is proposed, with each algorithm explained through an easy-to-understand description followed by an illustrative example to demonstrate its operation.
Journal ArticleDOI

DSC: scheduling parallel tasks on an unbounded number of processors

TL;DR: A low-complexity heuristic for scheduling parallel tasks on an unbounded number of completely connected processors, named the dominant sequence clustering (DSC) algorithm, which guarantees a performance within a factor of 2 of the optimum for general coarse-grain DAGs.
Journal ArticleDOI

A comparison of list schedules for parallel processing systems

TL;DR: The problem of scheduling two or more processors to minimize the execution time of a program consisting of partially ordered tasks is studied, and a dynamic-programming solution is presented for the case in which execution times are random variables.
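A minimal list scheduler in the spirit of the heuristics such comparisons cover: tasks are prioritized by static level (the longest sum of execution times from the task down to an exit node) and greedily assigned to the earliest-free processor. The four-task graph below is invented for illustration.

```python
# HLFET-style list scheduling sketch: highest static level first,
# earliest-free processor, precedence constraints respected.

def static_levels(wcet, succ):
    levels = {}
    def level(t):
        if t not in levels:
            levels[t] = wcet[t] + max(
                (level(s) for s in succ.get(t, [])), default=0)
        return levels[t]
    for t in wcet:
        level(t)
    return levels

def list_schedule(wcet, succ, m):
    pred = {t: set() for t in wcet}
    for t, ss in succ.items():
        for s in ss:
            pred[s].add(t)
    levels = static_levels(wcet, succ)
    finish = {}                 # task -> finish time
    proc_free = [0] * m         # next free instant per processor
    ready = [t for t in wcet if not pred[t]]
    while ready:
        ready.sort(key=lambda t: -levels[t])   # highest level first
        t = ready.pop(0)
        p = min(range(m), key=lambda i: proc_free[i])
        start = max([proc_free[p]] + [finish[q] for q in pred[t]])
        finish[t] = start + wcet[t]
        proc_free[p] = finish[t]
        for s in succ.get(t, []):
            if s not in finish and all(q in finish for q in pred[s]):
                ready.append(s)
    return finish

# Invented DAG: A -> C, B -> C, A -> D on two processors.
finish = list_schedule({"A": 2, "B": 1, "C": 3, "D": 2},
                       {"A": ["C", "D"], "B": ["C"]}, 2)
print(max(finish.values()))  # makespan 5
```

The choice of priority function is exactly what such comparison studies vary; static level is one of the simplest options.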
Journal ArticleDOI

Distributed fault-tolerant real-time systems: the Mars approach

TL;DR: The authors focus on the maintainability of the Mars architecture, the Mars operating system, and the control of a rolling mill that produces metal plates and bars, and discuss timing analysis.
Book

Task scheduling in parallel and distributed systems

TL;DR: This chapter discusses the relationship between Matching and Two-Processor Scheduling, Optimal Scheduling Algorithms, Static Scheduling Heuristics, and Dynamic Task Allocation in the SPMD Model.