Proceedings Article

An integrated approach for increasing the soft-error detection capabilities in SoCs processors

03 Oct 2005-pp 445-453
TL;DR: An integrated (hardware and software) approach increases the fault detection capabilities of software techniques by introducing limited hardware redundancy; experimental results show its effectiveness in covering soft errors that affect processor memory elements and escape purely software approaches.
Abstract: Software implemented hardware fault tolerance (SIHFT) techniques are able to detect most transient and permanent faults during normal system operation. However, these techniques are not capable of detecting some transient faults affecting processor memory elements such as state registers inside the processor control unit, or temporary registers inside the arithmetic and logic unit. In this paper, we propose an integrated (hardware and software) approach to increase the fault detection capabilities of software techniques by introducing a limited hardware redundancy. Experimental results are reported showing the effectiveness of the proposed approach in covering soft errors that affect the processor memory elements and escape purely software approaches.
Citations
01 Jan 1975
TL;DR: The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
Abstract: The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.

62 citations

Proceedings Article
01 Jan 2007
TL;DR: This paper presents a new approach called symbolic fault injection which is targeted at validation of SIHFT mechanisms and is based on the concept of symbolic execution of programs, and demonstrates its viability by proving that a CRC implementation detects all possible single bit-flips.
Abstract: Computer systems that are dependable in the presence of faults are increasingly in demand. Among available fault tolerance mechanisms, software-implemented hardware fault tolerance (SIHFT) is constantly gaining in popularity, because of its cost efficiency and flexibility. Fault tolerance mechanisms are often validated using fault injection, comprising a variety of techniques for introducing faults into a system. Traditional fault injection techniques, however, suffer from a number of drawbacks, notably lack of coverage (impossibility to exhaust all test cases) and the failure to activate enough injected faults. In this paper we present a new approach called symbolic fault injection which is targeted at validation of SIHFT mechanisms and is based on the concept of symbolic execution of programs. It can be seen as the extension of a formal technique for formal program verification that makes it possible to evaluate the consequences of all possible faults (of a certain kind) in given memory locations for all possible system inputs. This makes it possible to formally prove properties of fault tolerance mechanisms. The new method for symbolic fault injection has been prototypically implemented on the basis of an industrial-strength formal verification system and we demonstrate its viability by proving that a CRC implementation detects all possible single bit-flips.

30 citations
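
The CRC claim in the abstract above can be illustrated with a much weaker, concrete experiment. The following is a minimal sketch, not the paper's symbolic method: it injects every single bit-flip into one fixed message plus checksum and checks that each is detected. The CRC-8 polynomial (0x07), the zero initial value, and the function names are assumptions chosen for illustration; the paper does not say which CRC implementation was verified.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 with polynomial x^8 + x^2 + x + 1 and initial value 0 (assumed parameters)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def undetected_single_bit_flips(message: bytes) -> list[int]:
    """Flip each bit of message+checksum in turn and return the bit positions that go undetected."""
    codeword = bytearray(message + bytes([crc8(message)]))
    missed = []
    for bit in range(len(codeword) * 8):
        corrupted = bytearray(codeword)
        corrupted[bit // 8] ^= 1 << (bit % 8)  # inject one bit-flip
        payload, checksum = bytes(corrupted[:-1]), corrupted[-1]
        if crc8(payload) == checksum:  # the check passed despite the fault
            missed.append(bit)
    return missed


if __name__ == "__main__":
    # An empty list means every injected single bit-flip was detected for this message.
    print(undetected_single_bit_flips(b"SIHFT"))
```

This brute-force loop covers one input only. The symbolic approach described in the abstract aims to establish the same property for all possible inputs at once, which exhaustive injection cannot do.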


Cites methods from "An integrated approach for increasi..."

  • ...The scenario, where mechanisms for handling hardware faults are implemented in software, is called SIHFT (Software-Implemented Hardware Fault Tolerance) (for example, [7])....


01 Jan 2012
TL;DR: To measure the reliability of a mitigated LEON3 softcore processor, an updated hardware fault-injection model is created and novel reliability metrics are employed; each of the mitigation techniques provides greater processor protection than a popular state-of-the-art rad-hard processor.
Abstract: Softcore processors are an attractive alternative to using expensive radiation-hardened processors for space-based applications. Since they can be implemented in the latest SRAM-based FPGA technologies, they are fast, flexible and significantly less expensive. However, unlike ASIC-based processors, the logic and routing of a softcore processor are vulnerable to the effects of single-event upsets (SEUs). To protect softcore processors from SEUs, this dissertation explores the processor design-space for the LEON3 softcore processor implemented in a commercial SRAM-based FPGA. The traditional mitigation techniques of triple modular redundancy (TMR) and duplication with compare (DWC) and checkpointing provide reliability to a softcore processor at great spatial cost. To reduce the spatial cost, terrestrial ASIC-based processor protection techniques are applied to the LEON3 processor. These techniques come at the cost of time instead of area. The software fault-tolerance techniques used to protect the logic and routing of the LEON3 softcore processor include a modified version of software implemented fault tolerance (SWIFT), consistency checks, software indications, and checkpointing. To measure the reliability of a mitigated LEON3 softcore processor, an updated hardware fault-injection model is created, and novel reliability metrics are employed. The improvement in reliability over an unmitigated LEON3 is measured using four metrics: architectural vulnerability factor (AVF), mean time to failure (MTTF), mean useful instructions to failure (MuITF), and reliability-area-performance (RAP). Traditional reliability techniques provide the best reliability: DWC with checkpointing improves the MTTF and MuITF by almost 35x and TMR with triplicated inputs and outputs improves the MTTF and MuITF by almost 6000x. Software fault-tolerance provides significant reliability for a much lower area cost. Each of these techniques provides greater processor protection than a popular state-of-the-art rad-hard processor.

16 citations
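
As a rough illustration of the triple modular redundancy (TMR) mentioned in the abstract above, the sketch below models the voter as a bitwise 2-of-3 majority over three replica outputs. It is a software model under stated assumptions, not the dissertation's FPGA implementation; the function names `tmr_vote` and `tmr_disagreement` are hypothetical.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two replicas."""
    return (a & b) | (a & c) | (b & c)


def tmr_disagreement(a: int, b: int, c: int) -> bool:
    """Report whether any replica differs, so a masked fault can still be logged."""
    return not (a == b == c)


if __name__ == "__main__":
    golden = 0b1011_0010
    upset = golden ^ (1 << 5)  # model a single-event upset in one replica
    assert tmr_vote(golden, upset, golden) == golden
    assert tmr_disagreement(golden, upset, golden)
```

A single corrupted replica is outvoted by the other two, which is why TMR masks the fault while duplication with compare (DWC) can only detect it.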

Journal Article
TL;DR: The paper presents the methodological approach adopted to achieve the complete fault coverage, the proposed resulting architecture, and the experimental results gathered from the analysis of the fault injection campaigns.
Abstract: In recent years, both software and hardware techniques have been adopted to carry out reliable designs, aimed at autonomously detecting the occurrence of faults, to allow discarding erroneous data and possibly performing the recovery of the system. The aim of this paper is the introduction of a combined use of software and hardware approaches to achieve complete fault coverage in generic IP processors, with respect to SEU faults. Software techniques are preferably adopted to reduce the necessity and costs of modifying the processor architecture; since complete fault coverage cannot be achieved by software alone, partial hardware redundancy techniques are then introduced to deal with the remaining, not covered, faults. The paper presents the methodological approach adopted to achieve the complete fault coverage, the proposed resulting architecture, and the experimental results gathered from the analysis of the fault injection campaigns.

14 citations


Cites methods from "An integrated approach for increasi..."

  • ...The main contribution of this paper is a revised version of the preliminary qualitative analysis of soft errors presented in [4], and a modified application of redundancy techniques with respect to those presented in [2], driven from the results of the presented analysis....


  • ...The approach here presented, starts from the conclusions drawn in [2] and proposes an enhancement based on a more systematic analysis of the fault/error relationship....


Proceedings Article
04 Oct 2006
TL;DR: The paper presents the methodological approach adopted to achieve the complete fault coverage, the proposed resulting architecture, and the experimental results gathered from the fault injection analysis campaign.
Abstract: In recent years, both software and hardware techniques have been adopted to carry out reliable designs, aimed at autonomously detecting the occurrence of faults, to allow discarding erroneous data and possibly performing the recovery of the system. The aim of this paper is the introduction of a combined use of software and hardware approaches to achieve complete fault coverage in generic IP processors, with respect to SEU faults. Software techniques are preferably adopted to reduce the necessity and costs of modifying the processor architecture; since complete fault coverage cannot be achieved by software alone, partial hardware redundancy techniques are then introduced to deal with the remaining, not covered, faults. The paper presents the methodological approach adopted to achieve the complete fault coverage, the proposed resulting architecture, and the experimental results gathered from the fault injection analysis campaign.

10 citations


Cites methods from "An integrated approach for increasi..."

  • ...The paper presents the methodological approach adopted to achieve the complete fault coverage, the proposed resulting architecture, and the experimental results gathered from the fault injection analysis campaign....


  • ...In the most recent years, given the relevant computational power of IP cores, the design of reliable systems by means of software techniques received a lot of attention, due to the interesting possibility to use a commercial processor core, without requiring any customization....


References
Journal Article
TL;DR: In this article, the authors present a method for structuring complex computing systems by the use of what they term "recovery blocks", "conversations", and "fault-tolerant interfaces".
Abstract: The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.

1,610 citations

Proceedings Article
01 Jan 1975
TL;DR: In this article, the authors present a method for structuring complex computing systems by the use of what they term "recovery blocks", "conversations", and "fault-tolerant interfaces".
Abstract: The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.

1,093 citations

Journal Article
TL;DR: Principal requirements for the implementation of N-version software are summarized and the DEDIX distributed supervisor and testbed for the execution of N-version software is described.
Abstract: Evolution of the N-version software approach to the tolerance of design faults is reviewed. Principal requirements for the implementation of N-version software are summarized and the DEDIX distributed supervisor and testbed for the execution of N-version software is described. Goals of current research are presented and some potential benefits of the N-version approach are identified.

1,093 citations


"An integrated approach for increasi..." refers methods in this paper

  • ..., Recovery Blocks and N-Version Programming) [4][5]....


  • ...In [5], a methodology is presented that proposes the replication (software redundancy) of the program and checks the results obtained executing the replicated versions....
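
The replicate-and-compare idea quoted above can be sketched in a few lines. This is a minimal illustration that assumes a deterministic computation run several times in sequence; the names `run_replicated`, `MismatchDetected`, and `make_faulty_sum` are hypothetical and do not come from the cited work. The injected fault is simulated in software purely to exercise the check.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class MismatchDetected(Exception):
    """Raised when replicated executions disagree, signalling a suspected fault."""


def run_replicated(compute: Callable[[], T], copies: int = 2) -> T:
    """Execute the same computation `copies` times and compare the results."""
    results = [compute() for _ in range(copies)]
    if any(r != results[0] for r in results[1:]):
        raise MismatchDetected(results)
    return results[0]


def make_faulty_sum(data: Sequence[int], faulty_call: int) -> Callable[[], int]:
    """Build a sum routine whose result has one bit flipped on the given call (simulated fault)."""
    calls = 0

    def compute() -> int:
        nonlocal calls
        calls += 1
        total = sum(data)
        return total ^ 1 if calls == faulty_call else total

    return compute


if __name__ == "__main__":
    print(run_replicated(lambda: sum([1, 2, 3])))  # fault-free run returns the agreed result
    try:
        run_replicated(make_faulty_sum([1, 2, 3], faulty_call=2))
    except MismatchDetected as exc:
        print("fault detected:", exc)
```

Comparison only detects faults that make the replicas diverge. A fault that corrupts state shared by both executions, such as the processor registers discussed in the main abstract, can escape this check.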


Book
01 Feb 1996
TL;DR: This new edition specifically deals with this dynamically changing computing environment, incorporating new topics such as fault-tolerance in multiprocessor and distributed systems.
Abstract: In the ten years since the publication of the first edition of this book, the field of fault-tolerant design has broadened in appeal, particularly with its emerging application in distributed computing. This new edition specifically deals with this dynamically changing computing environment, incorporating new topics such as fault-tolerance in multiprocessor and distributed systems.

662 citations


"An integrated approach for increasi..." refers background in this paper

  • ...As far as hardware is considered, the most commonly adopted approach exploits redundancy [1] applicable at different levels: at the system level by replicating the whole...


Journal Article
TL;DR: It is shown that a large number of errors can be detected by monitoring the control flow and memory-access behavior and two techniques for control-flow checking are discussed and compared with current error-detection techniques.
Abstract: Concurrent system-level error detection techniques using a watchdog processor are surveyed. A watchdog processor is a small and simple coprocessor that detects errors by monitoring the behavior of a system. Like replication, it does not depend on any fault model for error detection. However, it requires less hardware than replication. It is shown that a large number of errors can be detected by monitoring the control flow and memory-access behavior. Two techniques for control-flow checking are discussed and compared with current error-detection techniques. A scheme for memory-access checking based on capability-based addressing is described. The design of a watchdog for performing reasonable checks on the output of a main processor by executing assertions is discussed.

599 citations
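
The control-flow checking described in the abstract above can be modelled in software. The sketch below assumes the program is divided into named basic blocks with a known set of legal successors, and that the watchdog observes each block as it is entered. The class `ControlFlowWatchdog` and the example graph are hypothetical and do not reproduce either of the two techniques the survey compares.

```python
class ControlFlowWatchdog:
    """Checks each observed basic-block transition against an allowed-successor map."""

    def __init__(self, successors: dict[str, set[str]], entry: str) -> None:
        self.successors = successors
        self.current = entry
        self.error = False

    def observe(self, block: str) -> bool:
        """Record entry into `block`; latch an error if the transition is illegal."""
        if block not in self.successors.get(self.current, set()):
            self.error = True
        self.current = block
        return not self.error


if __name__ == "__main__":
    # Assumed control-flow graph: entry -> loop, loop -> loop or exit.
    cfg = {"entry": {"loop"}, "loop": {"loop", "exit"}, "exit": set()}

    good = ControlFlowWatchdog(cfg, "entry")
    print([good.observe(b) for b in ["loop", "loop", "exit"]])

    bad = ControlFlowWatchdog(cfg, "entry")
    print(bad.observe("exit"))  # a jump from entry straight to exit is not in the graph
```

The error flag is latched, so later legal transitions do not hide an earlier illegal jump.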


"An integrated approach for increasi..." refers background in this paper

  • ...solution consists in the introduction of a special-purpose module, called watchdog, which constantly monitors the activities carried out by the processor [2][3]....

    [...]