Journal ArticleDOI

Fault-tolerance through scheduling of aperiodic tasks in hard real-time multiprocessor systems

TL;DR: A scheme that provides fault tolerance through scheduling in real-time multiprocessor systems, using two techniques called deallocation and overloading to achieve a high acceptance ratio (the percentage of arriving tasks scheduled by the system).
Abstract: Real-time systems are increasingly used in applications that are time-critical in nature. Fault tolerance is an important requirement of such systems, because of the catastrophic consequences of an untolerated fault. We study a scheme that provides fault tolerance through scheduling in real-time multiprocessor systems. We schedule multiple copies of dynamic, aperiodic, nonpreemptive tasks in the system, and use two techniques, which we call deallocation and overloading, to achieve a high acceptance ratio (the percentage of arriving tasks scheduled by the system). The paper compares the performance of our scheme with that of other fault-tolerant scheduling schemes, and determines how much deallocation and overloading each contribute to the acceptance ratio of tasks. The paper also provides a technique that can help real-time system designers determine the number of processors required to provide fault tolerance in dynamic systems. Lastly, a formal model is developed for the analysis of systems with uniform tasks.
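The two techniques lend themselves to a compact illustration. Below is a hypothetical Python sketch (all names are invented; this is not the paper's algorithm) of the core invariant behind overloading and deallocation: backups may share a slot only when their primaries run on different processors, and a backup's space is reclaimed as soon as its primary completes successfully.

```python
# Toy primary/backup (PB) sketch, not the paper's scheduler. "Overloading":
# backups of tasks whose primaries run on DIFFERENT processors may share one
# backup slot, since a single processor failure activates at most one of
# them. "Deallocation": once a primary finishes successfully, its backup's
# slot space is reclaimed for future tasks.

class BackupSlot:
    def __init__(self):
        self.backups = []                # (task, primary_processor) pairs

    def can_add(self, primary_proc):
        # Overloading is safe only if no backup already in this slot has its
        # primary on the same processor.
        return all(p != primary_proc for _, p in self.backups)

    def add(self, task, primary_proc):
        self.backups.append((task, primary_proc))

    def deallocate(self, task):
        # The task's primary succeeded: reclaim the backup's space.
        self.backups = [(t, p) for t, p in self.backups if t != task]


slot = BackupSlot()
slot.add("T1", 0)              # T1's primary runs on processor 0
slot.can_add(0)                # False: T1's primary is also on processor 0
slot.can_add(1)                # True: different primary processor
slot.add("T2", 1)              # overload T2's backup into the same slot
slot.deallocate("T1")          # T1's primary succeeded; space reclaimed
```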


Citations
Journal ArticleDOI
TL;DR: This book is mainly oriented towards a final year undergraduate course on fault-tolerant computing, primarily with an implementation bias, and draws considerably on the author's experience in industry, particularly reflected in the projects accompanying chapter 5.
Abstract: Design and Analysis of Fault-Tolerant Digital Systems: B. W. JOHNSON (Addison Wesley, 1989, 577 pp., £41.35) The book provides an introduction to the important aspects of designing fault-tolerant systems, and an evaluation of how well the reliability goals have been achieved. The book is mainly oriented towards a final year undergraduate course on fault-tolerant computing, primarily with an implementation bias. In chapters 1 and 2, definitions and basic terminology are covered, which sets the stage for the remaining chapters, and provides the background and motivation for the remainder of the book. Chapter 3 provides a thorough analysis of fault-tolerance techniques and concepts. This chapter in particular is remarkably well written, covering the issues of hardware and information redundancy, which form the mainstay of fault-tolerant computing. Subsequent chapters on the use and evaluation of the various approaches illustrate the principles as they have been put into practice. At the end of chapter 5, small projects that allow the reader to apply the material presented in the preceding chapters are included. The resurgence of interest in fault-tolerance with the emergence of VLSI is the theme of chapter 6, focussing on designing fault-tolerant systems in a VLSI environment. The problems and opportunities presented by VLSI are discussed and the use of redundancy techniques in order to enhance manufacturing yield and to provide in-service reliability are reviewed. The final chapter covers testing, design for testability and testability analysis, which must be considered during each phase of the design process to guarantee that resulting designs can be thoroughly tested. Each chapter is followed by a summary of the key issues and concepts presented therein, and a separate list of references, which makes it easily readable. In addition, there is a reading list with more comprehensive and specialised references devoted to each chapter.
Overall, the book is well written, and contains a great deal of information in 577 pages. The book has a definite implementation bias, and draws considerably on the author's experience in industry, particularly reflected in the projects accompanying chapter 5. The book should be a useful addition to a library, and a suitable text to accompany a lecture course on fault-tolerant computing. R. RAMASWAMI, Department of Computation, UMIST

444 citations

01 May 1989
TL;DR: The Spring kernel as discussed by the authors is a research-oriented kernel designed to form the basis of a flexible, hard real-time operating system for such applications; it advocates a new paradigm based on the notion of predictability and a method for on-line dynamic guarantees of deadlines.
Abstract: Next generation real-time systems will require greater flexibility and predictability than is commonly found in today's systems. These future systems include the space station, integrated vision/robotics/AI systems, collections of humans/robots coordinating to achieve common objectives (usually in hazardous environments such as undersea exploration or chemical plants), and various command and control applications. The Spring kernel is a research oriented kernel designed to form the basis of a flexible, hard real-time operating system for such applications. Our approach challenges several basic assumptions upon which most current real-time operating systems are built and subsequently advocates a new paradigm based on the notion of predictability and a method for on-line dynamic guarantees of deadlines. The Spring kernel is being implemented on a network of (68020 based) multiprocessors called SpringNet.

177 citations

Journal ArticleDOI
TL;DR: In this article, the authors investigate the robustness of an allocation of resources to tasks in parallel and distributed systems and derive a robustness metric for an arbitrary system, which can help researchers evaluate a given resource allocation for robustness against uncertainties in specified perturbation parameters.
Abstract: Parallel and distributed systems may operate in an environment that undergoes unpredictable changes, causing certain system performance features to degrade. Such systems need robustness to guarantee limited degradation despite fluctuations in the behavior of their component parts or environment. This research investigates the robustness of an allocation of resources to tasks in parallel and distributed systems. The main contributions are 1) a mathematical description of a metric for the robustness of a resource allocation with respect to desired system performance features against multiple perturbations in multiple system and environmental conditions, and 2) a procedure for deriving a robustness metric for an arbitrary system. For illustration, this procedure is employed to derive robustness metrics for three example distributed systems. Such a metric can help researchers evaluate a given resource allocation for robustness against uncertainties in specified perturbation parameters.
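As a toy illustration (a hypothetical sketch; the makespan model and all numbers are invented, not taken from the paper), a robustness metric of this kind can be read as the smallest perturbation of an estimated parameter vector that pushes a performance feature past its acceptable bound:

```python
# Hypothetical robustness sketch. Performance feature: makespan, the maximum
# per-processor load. The allocation stays acceptable while every load is
# below `bound`; processor j's constraint is violated once its load grows by
# (bound - loads[j]). The robustness of the allocation is the smallest such
# slack over all processors.

def robustness_radius(loads, bound):
    return min(bound - l for l in loads)

# Estimated per-processor loads and a makespan bound of 10 time units:
robustness_radius([6.0, 8.0, 7.5], 10.0)   # -> 2.0 (the 8.0 load is critical)
```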

175 citations

Journal ArticleDOI
TL;DR: This work addresses the scheduling of dynamically arriving real-time tasks with primary-backup (PB) fault-tolerance requirements onto a set of processors, in such a way that both versions of each task are feasible in the schedule.
Abstract: Many time-critical applications require dynamic scheduling with predictable performance. Tasks corresponding to these applications have deadlines that must be met despite the presence of faults. In this paper, we propose an algorithm to dynamically schedule arriving real-time tasks with resource and fault-tolerance requirements onto multiprocessor systems. The tasks are assumed to be nonpreemptable, and each task has two copies (versions) which are mutually excluded in space, as well as in time, in the schedule, to handle permanent processor failures and to obtain better performance, respectively. Our algorithm can tolerate more than one fault at a time, and employs performance-improving techniques such as 1) the distance concept, which decides the relative position of the two copies of a task in the task queue, 2) flexible backup overloading, which introduces a trade-off between degree of fault tolerance and performance, and 3) resource reclaiming, which reclaims resources both from deallocated backups and from early-completing tasks. We quantify, through simulation studies, the effectiveness of each of these techniques in improving the guarantee ratio, defined as the percentage of tasks arriving in the system whose deadlines are met. We also compare, through simulation studies, the performance of our algorithm with that of the best-known algorithm for the problem, and show analytically the importance of the distance parameter in fault-tolerant dynamic scheduling in multiprocessor real-time systems.
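The distance concept can be made concrete with a small sketch (names invented; this is not the authors' algorithm): distance is the separation, in dispatch-queue positions, between a task's primary and backup copies. A larger distance gives the primary time to finish before the backup's slot begins, so the backup can be deallocated instead of executed.

```python
# Illustrative sketch of the "distance" parameter in a PB dispatch queue.

def copy_positions(queue):
    # Map each task to the queue indices of its "primary" and "backup" copies.
    pos = {}
    for i, (kind, task) in enumerate(queue):
        pos.setdefault(task, {})[kind] = i
    return pos

def distance(queue, task):
    p = copy_positions(queue)[task]
    return p["backup"] - p["primary"]

q = [("primary", "T1"), ("primary", "T2"), ("backup", "T1"),
     ("primary", "T3"), ("backup", "T2"), ("backup", "T3")]
distance(q, "T1")   # -> 2: T1's copies are two queue positions apart
distance(q, "T2")   # -> 3: more separation, better deallocation odds
```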

167 citations


Cites methods from "Fault-tolerance through scheduling ..."

  • ...In this paper, we made extensions to the original Myopic algorithm to support PB-based fault tolerance, with backups overloaded and deallocated using a dynamic logical grouping concept, for scheduling a set of real-time tasks....

Journal ArticleDOI
TL;DR: The results show that traditional periodic checkpointing is not the best policy for the combined purpose of conserving energy and guaranteeing recovery, and better energy savings are possible through a nonuniform distribution of checkpoints that takes into account the energy consumption and reliability factors.
Abstract: We describe how to exploit the scheduling slack in a real-time system to reduce energy consumption and achieve fault tolerance at the same time. During failure-free operation, a task takes checkpoints to enable recovery from failure. Additionally, the system exploits the slack to conserve energy by reducing the processor speed. If a task fails, it will restart from a saved checkpoint and execute at maximum speed to guarantee that the deadlines are met. We show that the number of checkpoints and their placements interact in subtle ways with the power management policy. We study two checkpoint placement policies for aperiodic tasks and analytically derive the optimal number of checkpoints to conserve energy under each. This optimal number allows the CPU speed to be slowed down to the level that yields minimum energy consumption, while still guaranteeing recoverability of tasks under each checkpointing policy. The results show that traditional periodic checkpointing is not the best policy for the combined purpose of conserving energy and guaranteeing recovery. Instead, better energy savings are possible through a nonuniform distribution of checkpoints that takes into account the energy consumption and reliability factors. Depending on the amount of slack and the checkpointing overhead, energy can be reduced by up to 68 percent under nonuniform checkpointing. We also demonstrate the applicability of these checkpoint placement policies to periodic tasks.
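The interaction between checkpoint count and speed scaling can be sketched with a toy model (a simplification under invented assumptions and numbers, not the paper's analysis): a task of W execution units takes k equally spaced checkpoints, each costing c time units; to meet deadline D even after one failure, the run at relative speed s must leave room for the checkpoints plus re-execution of one segment at full speed, W/s + k*c + W/(k+1) <= D. Dynamic energy is modeled as W * s**2 (power scaling roughly with s**3 and execution time with 1/s).

```python
# Toy checkpoint-count/energy trade-off. All parameters are illustrative.

def safe_speed(W, D, k, c):
    # Time budget for the failure-free run after reserving k checkpoints
    # and one segment's full-speed re-execution.
    budget = D - k * c - W / (k + 1)
    return W / budget if budget > 0 else None

def energy(W, D, k, c):
    s = safe_speed(W, D, k, c)
    if s is None or s > 1.0:
        return float("inf")        # k checkpoints cannot meet the deadline
    return W * s * s               # checkpointing energy itself ignored here

def best_k(W, D, c, k_max=20):
    # More checkpoints shrink the re-execution reserve (allowing a lower
    # speed) but add overhead, so some intermediate k minimizes energy.
    return min(range(k_max + 1), key=lambda k: energy(W, D, k, c))

best_k(W=10.0, D=20.0, c=0.5)      # -> 3 checkpoints for these made-up numbers
```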

162 citations

References
Journal ArticleDOI
TL;DR: In this article, the authors present a method for structuring complex computing systems by the use of what they term "recovery blocks", "conversations", and "fault-tolerant interfaces".
Abstract: The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
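The recovery-block structure can be rendered in a few lines (a simplified, hypothetical sketch, not the paper's notation): each alternate runs against a saved copy of the state, and the first result that passes the acceptance test is accepted.

```python
import copy

def recovery_block(acceptance_test, alternates, state):
    # Run each alternate on a private copy of the state; accept the first
    # result passing the acceptance test, otherwise discard it (an implicit
    # state restore) and try the next alternate.
    for alt in alternates:
        candidate = alt(copy.deepcopy(state))
        if acceptance_test(candidate):
            return candidate
    raise RuntimeError("all alternates failed the acceptance test")

# A faulty primary routine and a trusted alternate, guarded by an acceptance
# test (all three are invented examples):
buggy_sort = lambda xs: xs                      # primary with a design fault
safe_sort  = lambda xs: sorted(xs)              # simpler, slower alternate
is_sorted  = lambda xs: all(a <= b for a, b in zip(xs, xs[1:]))
recovery_block(is_sorted, [buggy_sort, safe_sort], [3, 1, 2])   # -> [1, 2, 3]
```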

1,610 citations

Proceedings ArticleDOI
01 Jan 1975
TL;DR: In this article, the authors present a method for structuring complex computing systems by the use of what they term "recovery blocks", "conversations", and "fault-tolerant interfaces".
Abstract: The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.

1,093 citations

Journal ArticleDOI
TL;DR: In this article, the authors present and discuss the rationale behind a method for structuring complex computing systems by the use of what they term "recovery blocks," "conversations," and "fault-tolerant interfaces."
Abstract: This paper presents and discusses the rationale behind a method for structuring complex computing systems by the use of what we term "recovery blocks," "conversations," and "fault-tolerant interfaces."...

915 citations