Journal ArticleDOI

Fault-tolerance through scheduling of aperiodic tasks in hard real-time multiprocessor systems

TL;DR: A scheme that provides fault tolerance through scheduling in real-time multiprocessor systems, using two techniques called deallocation and overloading to achieve a high acceptance ratio (the percentage of arriving tasks scheduled by the system).
Abstract: Real-time systems are increasingly used in applications that are time-critical in nature. Fault tolerance is an important requirement of such systems, because of the catastrophic consequences of an untolerated fault. We study a scheme that provides fault tolerance through scheduling in real-time multiprocessor systems. We schedule multiple copies of dynamic, aperiodic, nonpreemptive tasks in the system, and use two techniques, which we call deallocation and overloading, to achieve a high acceptance ratio (the percentage of arriving tasks scheduled by the system). The paper compares the performance of our scheme with that of other fault-tolerant scheduling schemes, and determines how much deallocation and overloading each contribute to the acceptance ratio of tasks. The paper also provides a technique that can help real-time system designers determine the number of processors required to provide fault tolerance in dynamic systems. Lastly, a formal model is developed for the analysis of systems with uniform tasks.
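The two techniques lend themselves to a compact illustration. Below is a hypothetical Python sketch (all names are invented; this is not the paper's algorithm) of the core invariant behind overloading and deallocation: backups may share a slot only when their primaries run on different processors, and a backup's space is reclaimed as soon as its primary completes successfully.

```python
# Toy primary/backup (PB) sketch, not the paper's scheduler. "Overloading":
# backups of tasks whose primaries run on DIFFERENT processors may share one
# backup slot, since a single processor failure activates at most one of
# them. "Deallocation": once a primary finishes successfully, its backup's
# slot space is reclaimed for future tasks.

class BackupSlot:
    def __init__(self):
        self.backups = []                # (task, primary_processor) pairs

    def can_add(self, primary_proc):
        # Overloading is safe only if no backup already in this slot has its
        # primary on the same processor.
        return all(p != primary_proc for _, p in self.backups)

    def add(self, task, primary_proc):
        self.backups.append((task, primary_proc))

    def deallocate(self, task):
        # The task's primary succeeded: reclaim the backup's space.
        self.backups = [(t, p) for t, p in self.backups if t != task]


slot = BackupSlot()
slot.add("T1", 0)              # T1's primary runs on processor 0
slot.can_add(0)                # False: T1's primary is also on processor 0
slot.can_add(1)                # True: different primary processor
slot.add("T2", 1)              # overload T2's backup into the same slot
slot.deallocate("T1")          # T1's primary succeeded; space reclaimed
```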


Citations
Journal ArticleDOI
TL;DR: This book is mainly oriented towards a final year undergraduate course on fault-tolerant computing, primarily with an implementation bias, and draws considerably on the author's experience in industry, particularly reflected in the projects accompanying chapter 5.
Abstract: Design and Analysis of Fault-Tolerant Digital Systems: B. W. JOHNSON (Addison Wesley, 1989, 577 pp., £41.35) The book provides an introduction to the important aspects of designing fault-tolerant systems, and an evaluation of how well the reliability goals have been achieved. The book is mainly oriented towards a final year undergraduate course on fault-tolerant computing, primarily with an implementation bias. In chapters 1 and 2, definitions and basic terminology are covered, which sets the stage for the remaining chapters, and provides the background and motivation for the remainder of the book. Chapter 3 provides a thorough analysis of fault-tolerance techniques and concepts. This chapter in particular is remarkably well written, covering the issues of hardware and information redundancy, which form the mainstay of fault-tolerant computing. Subsequent chapters on the use and evaluation of the various approaches illustrate the principles as they have been put into practice. At the end of chapter 5, small projects that allow the reader to apply the material presented in the preceding chapters are included. The resurgence of interest in fault-tolerance with the emergence of VLSI is the theme of chapter 6, focussing on designing fault-tolerant systems in a VLSI environment. The problems and opportunities presented by VLSI are discussed and the use of redundancy techniques in order to enhance manufacturing yield and to provide in-service reliability are reviewed. The final chapter covers testing, design for testability and testability analysis, which must be considered during each phase of the design process to guarantee that resulting designs can be thoroughly tested. Each chapter is followed by a summary of the key issues and concepts presented therein, and a separate list of references, which makes it easily readable. In addition, there is a reading list with more comprehensive and specialised references devoted to each chapter.
Overall, the book is well written, and contains a great deal of information in 577 pages. The book has a definite implementation bias, and draws considerably on the author's experience in industry, particularly reflected in the projects accompanying chapter 5. The book should be a useful addition to a library, and a suitable text to accompany a lecture course on fault-tolerant computing. R. RAMASWAMI, Department of Computation, UMIST

444 citations

01 May 1989
TL;DR: The Spring kernel as discussed by the authors is a research-oriented kernel designed to form the basis of a flexible, hard real-time operating system for such applications; it advocates a new paradigm based on the notion of predictability and a method for on-line dynamic guarantees of deadlines.
Abstract: Next generation real-time systems will require greater flexibility and predictability than is commonly found in today's systems. These future systems include the space station, integrated vision/robotics/AI systems, collections of humans/robots coordinating to achieve common objectives (usually in hazardous environments such as undersea exploration or chemical plants), and various command and control applications. The Spring kernel is a research oriented kernel designed to form the basis of a flexible, hard real-time operating system for such applications. Our approach challenges several basic assumptions upon which most current real-time operating systems are built and subsequently advocates a new paradigm based on the notion of predictability and a method for on-line dynamic guarantees of deadlines. The Spring kernel is being implemented on a network of (68020 based) multiprocessors called SpringNet.

177 citations

Journal ArticleDOI
TL;DR: In this article, the authors investigate the robustness of an allocation of resources to tasks in parallel and distributed systems and derive a robustness metric for an arbitrary system, which can help researchers evaluate a given resource allocation for robustness against uncertainties in specified perturbation parameters.
Abstract: Parallel and distributed systems may operate in an environment that undergoes unpredictable changes, causing certain system performance features to degrade. Such systems need robustness to guarantee limited degradation despite fluctuations in the behavior of their component parts or environment. This research investigates the robustness of an allocation of resources to tasks in parallel and distributed systems. The main contributions are 1) a mathematical description of a metric for the robustness of a resource allocation with respect to desired system performance features against multiple perturbations in multiple system and environmental conditions, and 2) a procedure for deriving a robustness metric for an arbitrary system. For illustration, this procedure is employed to derive robustness metrics for three example distributed systems. Such a metric can help researchers evaluate a given resource allocation for robustness against uncertainties in specified perturbation parameters.
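As a toy illustration (a hypothetical sketch; the makespan model and all numbers are invented, not taken from the paper), a robustness metric of this kind can be read as the smallest perturbation of an estimated parameter vector that pushes a performance feature past its acceptable bound:

```python
# Hypothetical robustness sketch. Performance feature: makespan, the maximum
# per-processor load. The allocation stays acceptable while every load is
# below `bound`; processor j's constraint is violated once its load grows by
# (bound - loads[j]). The robustness of the allocation is the smallest such
# slack over all processors.

def robustness_radius(loads, bound):
    return min(bound - l for l in loads)

# Estimated per-processor loads and a makespan bound of 10 time units:
robustness_radius([6.0, 8.0, 7.5], 10.0)   # -> 2.0 (the 8.0 load is critical)
```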

175 citations

Journal ArticleDOI
TL;DR: This work addresses the scheduling of dynamically arriving real-time tasks with primary-backup (PB) fault-tolerance requirements onto a set of processors, in such a way that both versions of each task are feasible in the schedule.
Abstract: Many time-critical applications require dynamic scheduling with predictable performance. Tasks corresponding to these applications have deadlines that must be met despite the presence of faults. In this paper, we propose an algorithm to dynamically schedule arriving real-time tasks with resource and fault-tolerance requirements onto multiprocessor systems. The tasks are assumed to be nonpreemptable, and each task has two copies (versions) which are mutually excluded in space, as well as in time, in the schedule, to handle permanent processor failures and to obtain better performance, respectively. Our algorithm can tolerate more than one fault at a time, and employs performance-improving techniques such as 1) the distance concept, which decides the relative position of the two copies of a task in the task queue, 2) flexible backup overloading, which introduces a trade-off between degree of fault tolerance and performance, and 3) resource reclaiming, which reclaims resources both from deallocated backups and from early-completing tasks. We quantify, through simulation studies, the effectiveness of each of these techniques in improving the guarantee ratio, defined as the percentage of tasks arriving in the system whose deadlines are met. We also compare, through simulation studies, the performance of our algorithm with that of the best-known algorithm for the problem, and show analytically the importance of the distance parameter in fault-tolerant dynamic scheduling in multiprocessor real-time systems.
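The distance concept can be made concrete with a small sketch (names invented; this is not the authors' algorithm): distance is the separation, in dispatch-queue positions, between a task's primary and backup copies. A larger distance gives the primary time to finish before the backup's slot begins, so the backup can be deallocated instead of executed.

```python
# Illustrative sketch of the "distance" parameter in a PB dispatch queue.

def copy_positions(queue):
    # Map each task to the queue indices of its "primary" and "backup" copies.
    pos = {}
    for i, (kind, task) in enumerate(queue):
        pos.setdefault(task, {})[kind] = i
    return pos

def distance(queue, task):
    p = copy_positions(queue)[task]
    return p["backup"] - p["primary"]

q = [("primary", "T1"), ("primary", "T2"), ("backup", "T1"),
     ("primary", "T3"), ("backup", "T2"), ("backup", "T3")]
distance(q, "T1")   # -> 2: T1's copies are two queue positions apart
distance(q, "T2")   # -> 3: more separation, better deallocation odds
```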

167 citations


Cites methods from "Fault-tolerance through scheduling ..."

  • ...In this paper, we made extensions to the original Myopic algorithm to support PB-based fault tolerance, with backups overloaded and deallocated using a dynamic logical grouping concept, for scheduling a set of real-time tasks....

Journal ArticleDOI
TL;DR: The results show that traditional periodic checkpointing is not the best policy for the combined purpose of conserving energy and guaranteeing recovery, and better energy savings are possible through a nonuniform distribution of checkpoints that takes into account the energy consumption and reliability factors.
Abstract: We describe how to exploit the scheduling slack in a real-time system to reduce energy consumption and achieve fault tolerance at the same time. During failure-free operation, a task takes checkpoints to enable recovery from failure. Additionally, the system exploits the slack to conserve energy by reducing the processor speed. If a task fails, it will restart from a saved checkpoint and execute at maximum speed to guarantee that the deadlines are met. We show that the number of checkpoints and their placements interact in subtle ways with the power management policy. We study two checkpoint placement policies for aperiodic tasks and analytically derive the optimal number of checkpoints to conserve energy under each. This optimal number allows the CPU speed to be slowed down to the level that yields minimum energy consumption, while still guaranteeing recoverability of tasks under each checkpointing policy. The results show that traditional periodic checkpointing is not the best policy for the combined purpose of conserving energy and guaranteeing recovery. Instead, better energy savings are possible through a nonuniform distribution of checkpoints that takes into account the energy consumption and reliability factors. Depending on the amount of slack and the checkpointing overhead, energy can be reduced by up to 68 percent under nonuniform checkpointing. We also demonstrate the applicability of these checkpoint placement policies to periodic tasks.
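The interaction between checkpoint count and speed scaling can be sketched with a toy model (a simplification under invented assumptions and numbers, not the paper's analysis): a task of W execution units takes k equally spaced checkpoints, each costing c time units; to meet deadline D even after one failure, the run at relative speed s must leave room for the checkpoints plus re-execution of one segment at full speed, W/s + k*c + W/(k+1) <= D. Dynamic energy is modeled as W * s**2 (power scaling roughly with s**3 and execution time with 1/s).

```python
# Toy checkpoint-count/energy trade-off. All parameters are illustrative.

def safe_speed(W, D, k, c):
    # Time budget for the failure-free run after reserving k checkpoints
    # and one segment's full-speed re-execution.
    budget = D - k * c - W / (k + 1)
    return W / budget if budget > 0 else None

def energy(W, D, k, c):
    s = safe_speed(W, D, k, c)
    if s is None or s > 1.0:
        return float("inf")        # k checkpoints cannot meet the deadline
    return W * s * s               # checkpointing energy itself ignored here

def best_k(W, D, c, k_max=20):
    # More checkpoints shrink the re-execution reserve (allowing a lower
    # speed) but add overhead, so some intermediate k minimizes energy.
    return min(range(k_max + 1), key=lambda k: energy(W, D, k, c))

best_k(W=10.0, D=20.0, c=0.5)      # -> 3 checkpoints for these made-up numbers
```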

162 citations

References
Journal ArticleDOI
TL;DR: In this article, the authors present a method for structuring complex computing systems by the use of what they term "recovery blocks", "conversations", and "fault-tolerant interfaces".
Abstract: The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
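The recovery-block structure can be rendered in a few lines (a simplified, hypothetical sketch, not the paper's notation): each alternate runs against a saved copy of the state, and the first result that passes the acceptance test is accepted.

```python
import copy

def recovery_block(acceptance_test, alternates, state):
    # Run each alternate on a private copy of the state; accept the first
    # result passing the acceptance test, otherwise discard it (an implicit
    # state restore) and try the next alternate.
    for alt in alternates:
        candidate = alt(copy.deepcopy(state))
        if acceptance_test(candidate):
            return candidate
    raise RuntimeError("all alternates failed the acceptance test")

# A faulty primary routine and a trusted alternate, guarded by an acceptance
# test (all three are invented examples):
buggy_sort = lambda xs: xs                      # primary with a design fault
safe_sort  = lambda xs: sorted(xs)              # simpler, slower alternate
is_sorted  = lambda xs: all(a <= b for a, b in zip(xs, xs[1:]))
recovery_block(is_sorted, [buggy_sort, safe_sort], [3, 1, 2])   # -> [1, 2, 3]
```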

1,610 citations

Proceedings ArticleDOI
01 Jan 1975
TL;DR: In this article, the authors present a method for structuring complex computing systems by the use of what they term "recovery blocks", "conversations", and "fault-tolerant interfaces".
Abstract: The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.

1,093 citations

Journal ArticleDOI
TL;DR: In this article, the authors present and discuss the rationale behind a method for structuring complex computing systems by the use of what they term "recovery blocks," "conversations," and "fault-tolerant interfaces."
Abstract: This paper presents and discusses the rationale behind a method for structuring complex computing systems by the use of what we term "recovery blocks," "conversations," and "fault-tolerant interfaces."...

915 citations