An integrated approach for increasing the soft-error detection capabilities in SoCs processors

doi:10.1109/DFTVS.2005.17

Proceedings Article

Paolo Bernardi¹, L. Bolzani¹, Maurizio Rebaudengo¹, Matteo Sonza Reorda¹, Massimo Violante¹

Polytechnic University of Turin¹

03 Oct 2005, pp. 445-453

TL;DR: Experimental results show that an integrated (hardware and software) approach, which adds limited hardware redundancy to software techniques, improves fault detection by covering soft errors that affect processor memory elements and escape purely software approaches.

Abstract: Software-implemented hardware fault tolerance (SIHFT) techniques can detect most transient and permanent faults during normal system operation. However, these techniques cannot detect some transient faults affecting processor memory elements, such as state registers inside the processor control unit or temporary registers inside the arithmetic and logic unit. In this paper, we propose an integrated (hardware and software) approach that increases the fault detection capabilities of software techniques by introducing limited hardware redundancy. Experimental results show the effectiveness of the proposed approach in covering soft errors that affect the processor memory elements and escape purely software approaches.
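To make the software side of this abstract concrete, here is a minimal sketch of one common SIHFT-style rule: keep two copies of each value and compare them on every read. It is an illustration only, not the technique evaluated in the paper; the names `Duplicated`, `SoftErrorDetected`, and `flip_bit` are hypothetical.

```python
class SoftErrorDetected(Exception):
    """Raised when the two copies of a value disagree."""


class Duplicated:
    """Hold two copies of an integer and compare them on every read."""

    def __init__(self, value: int) -> None:
        self.copy0 = value
        self.copy1 = value

    def write(self, value: int) -> None:
        # Every write updates both copies.
        self.copy0 = value
        self.copy1 = value

    def read(self) -> int:
        # A mismatch means one copy was corrupted after the last write.
        if self.copy0 != self.copy1:
            raise SoftErrorDetected("copies differ")
        return self.copy0


def flip_bit(value: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit of `value`."""
    return value ^ (1 << bit)
```

A check like this can only observe values the program itself stores. A bit-flip in a control-unit state register or an ALU temporary register never passes through such a variable, which is the kind of gap the abstract says limited hardware redundancy is meant to cover.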

Citations

System Structure for Software Fault Tolerance

Brian Randell¹

Newcastle University¹

01 Jan 1975

TL;DR: The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.

Abstract: The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
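The recovery block idea named in this abstract can be sketched roughly as follows: try a primary routine, check its result with an acceptance test, and fall back to an alternate from a clean copy of the state if the test fails. This is a simplified illustration under my own assumptions, not Randell's notation; `recovery_block` and `RecoveryBlockFailed` are hypothetical names.

```python
import copy
from typing import Any, Callable, Sequence


class RecoveryBlockFailed(Exception):
    """Raised when no alternate produces an acceptable result."""


def recovery_block(
    acceptance_test: Callable[[Any], bool],
    alternates: Sequence[Callable[[Any], Any]],
    state: Any,
) -> Any:
    """Run alternates in order until one passes the acceptance test."""
    for alternate in alternates:
        # Each attempt works on a fresh copy, so a failed attempt
        # cannot damage the state seen by the next alternate.
        checkpoint = copy.deepcopy(state)
        try:
            result = alternate(checkpoint)
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RecoveryBlockFailed("all alternates were rejected")
```

The deep copy stands in for the checkpoint-and-rollback step; a real system would restore a recovery point rather than copy the whole state.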

62 citations

Proceedings Article

Symbolic Fault Injection

Daniel Larsson¹, Reiner Hähnle¹

Chalmers University of Technology¹

01 Jan 2007

TL;DR: This paper presents a new approach called symbolic fault injection which is targeted at validation of SIHFT mechanisms and is based on the concept of symbolic execution of programs, and demonstrates its viability by proving that a CRC implementation detects all possible single bit-flips.

Abstract: Computer systems that are dependable in the presence of faults are increasingly in demand. Among available fault tolerance mechanisms, software-implemented hardware fault tolerance (SIHFT) is constantly gaining in popularity, because of its cost efficiency and flexibility. Fault tolerance mechanisms are often validated using fault injection, comprising a variety of techniques for introducing faults into a system. Traditional fault injection techniques, however, suffer from a number of drawbacks, notably lack of coverage (impossibility to exhaust all test cases) and the failure to activate enough injected faults. In this paper we present a new approach called symbolic fault injection which is targeted at validation of SIHFT mechanisms and is based on the concept of symbolic execution of programs. It can be seen as the extension of a formal technique for formal program verification that makes it possible to evaluate the consequences of all possible faults (of a certain kind) in given memory locations for all possible system inputs. This makes it possible to formally prove properties of fault tolerance mechanisms. The new method for symbolic fault injection has been prototypically implemented on the basis of an industrial-strength formal verification system and we demonstrate its viability by proving that a CRC implementation detects all possible single bit-flips.
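For contrast with the symbolic approach described here, the sketch below does concrete, exhaustive fault injection on a single input: it flips every bit of one protected frame and checks that the CRC notices. It uses Python's standard `zlib.crc32` (CRC-32), which is an assumption on my part, since the abstract does not say which CRC implementation was verified; the helper names are hypothetical.

```python
import zlib
from typing import List


def protect(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_intact(frame: bytes) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    payload, stored = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def undetected_single_bit_flips(frame: bytes) -> List[int]:
    """Return the bit positions whose inversion goes unnoticed."""
    missed = []
    for bit in range(len(frame) * 8):
        corrupted = bytearray(frame)
        corrupted[bit // 8] ^= 1 << (bit % 8)  # inject one bit-flip
        if is_intact(bytes(corrupted)):
            missed.append(bit)
    return missed
```

A loop like this covers every single bit-flip for one payload only. The symbolic method in the abstract aims at the stronger claim, for all possible inputs, which testing by enumeration cannot reach.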

30 citations

Cites methods from "An integrated approach for increasi..."

...The scenario, where mechanisms for handling hardware faults are implemented in software, is called SIHFT (Software-Implemented Hardware Fault Tolerance) (for example, [7])....

Hardware and software fault-tolerance of softcore processors implemented in SRAM-based FPGAs

Michael Wirthlin¹, Nathaniel Rollins¹

Brigham Young University¹

01 Jan 2012

TL;DR: To measure the reliability of a mitigated LEON3 softcore processor, an updated hardware fault-injection model is created, and novel reliability metrics are employed, and each of these techniques provides greater processor protection than a popular state-of-the-art rad-hard processor.

Abstract: Softcore processors are an attractive alternative to using expensive radiation-hardened processors for space-based applications. Since they can be implemented in the latest SRAM-based FPGA technologies, they are fast, flexible and significantly less expensive. However, unlike ASIC-based processors, the logic and routing of a softcore processor are vulnerable to the effects of single-event upsets (SEUs). To protect softcore processors from SEUs, this dissertation explores the processor design-space for the LEON3 softcore processor implemented in a commercial SRAM-based FPGA. The traditional mitigation techniques of triple modular redundancy (TMR) and duplication with compare (DWC) and checkpointing provide reliability to a softcore processor at great spatial cost. To reduce the spatial cost, terrestrial ASIC-based processor protection techniques are applied to the LEON3 processor. These techniques come at the cost of time instead of area. The software fault-tolerance techniques used to protect the logic and routing of the LEON3 softcore processor include a modified version of software implemented fault tolerance (SWIFT), consistency checks, software indications, and checkpointing. To measure the reliability of a mitigated LEON3 softcore processor, an updated hardware fault-injection model is created, and novel reliability metrics are employed. The improvement in reliability over an unmitigated LEON3 is measured using four metrics: architectural vulnerability factor (AVF), mean time to failure (MTTF), mean useful instructions to failure (MuITF), and reliability-area-performance (RAP). Traditional reliability techniques provide the best reliability: DWC with checkpointing improves the MTTF and MuITF by almost 35x and TMR with triplicated inputs and outputs improves the MTTF and MuITF by almost 6000x. Software fault-tolerance provides significant reliability for a much lower area cost. Each of these techniques provides greater processor protection than a popular state-of-the-art rad-hard processor.
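The difference between the two traditional techniques named in this abstract can be shown with a small software model. TMR masks a fault by majority vote over three copies, while DWC only detects a mismatch between two copies. This is a sketch on plain integers, not the FPGA implementation the dissertation describes; `tmr_vote` and `dwc_check` are hypothetical names.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: each output bit takes the value
    held by at least two of the three replicas."""
    return (a & b) | (a & c) | (b & c)


def dwc_check(a: int, b: int) -> bool:
    """Duplication with compare: report whether the two replicas agree.
    A mismatch is detected but cannot be corrected without a third copy."""
    return a == b
```

Because DWC can only flag an error, it is paired with checkpointing in the abstract so that the processor can roll back after a detected mismatch.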

16 citations

Journal Article

Software and Hardware Techniques for SEU Detection in IP Processors

Cristiana Bolchini¹, Antonio Miele¹, Maurizio Rebaudengo², Fabio Salice¹, Donatella Sciuto¹, Luca Sterpone², Massimo Violante²

Polytechnic University of Milan¹, Polytechnic University of Turin²

01 Jun 2008, Journal of Electronic Testing

TL;DR: The paper presents the methodological approach adopted to achieve the complete fault coverage, the proposed resulting architecture, and the experimental results gathered from the analysis of the fault injection campaigns.

Abstract: In recent years, both software and hardware techniques have been adopted to produce reliable designs that autonomously detect the occurrence of faults, so that erroneous data can be discarded and the system possibly recovered. The aim of this paper is to introduce a combined use of software and hardware approaches to achieve complete fault coverage in generic IP processors with respect to SEU faults. Software techniques are preferred because they reduce the need for, and cost of, modifying the processor architecture; since they cannot achieve complete fault coverage on their own, partial hardware redundancy techniques are then introduced to deal with the remaining uncovered faults. The paper presents the methodological approach adopted to achieve complete fault coverage, the resulting proposed architecture, and the experimental results gathered from the analysis of the fault injection campaigns.

14 citations

Cites methods from "An integrated approach for increasi..."

...The main contribution of this paper is a revised version of the preliminary qualitative analysis of soft errors presented in [4], and a modified application of redundancy techniques with respect to those presented in [2], driven from the results of the presented analysis....
...The approach here presented, starts from the conclusions drawn in [2] and proposes an enhancement based on a more systematic analysis of the fault/error relationship....

Proceedings Article

Combined software and hardware techniques for the design of reliable IP processors

Maurizio Rebaudengo¹, Luca Sterpone¹, Massimo Violante¹, Cristiana Bolchini¹, Antonio Miele¹, Donatella Sciuto¹

Polytechnic University of Turin¹

04 Oct 2006

10 citations

Cites methods from "An integrated approach for increasi..."

...The paper presents the methodological approach adopted to achieve the complete fault coverage, the proposed resulting architecture, and the experimental results gathered from the fault injection analysis campaign....
...In the most recent years, given the relevant computational power of IP cores, the design of reliable systems by means of software techniques received a lot of attention, due to the interesting possibility to use a commercial processor core, without requiring any customization....

References

Journal Article

System structure for software fault tolerance

Brian Randell¹

Newcastle University¹

01 Apr 1975, SIGPLAN Notices

TL;DR: In this article, the authors present a method for structuring complex computing systems by the use of what they term "recovery blocks", "conversations", and "fault-tolerant interfaces".

1,610 citations

Proceedings Article

System structure for software fault tolerance

Brian Randell¹

Newcastle University¹

01 Jan 1975

TL;DR: In this article, the authors present a method for structuring complex computing systems by the use of what they term "recovery blocks", "conversations", and "fault-tolerant interfaces".

1,093 citations

Journal Article

The N-Version Approach to Fault-Tolerant Software

Algirdas Avizienis¹

University of California, Berkeley¹

01 Dec 1985, IEEE Transactions on Software Engineering

TL;DR: Principal requirements for the implementation of N-version software are summarized and the DEDIX distributed supervisor and testbed for the execution of N-version software is described.

Abstract: Evolution of the N-version software approach to the tolerance of design faults is reviewed. Principal requirements for the implementation of N-version software are summarized and the DEDIX distributed supervisor and testbed for the execution of N-version software is described. Goals of current research are presented and some potential benefits of the N-version approach are identified.
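The core mechanism of N-version software, running independently written versions on the same input and voting on the outputs, can be sketched as below. This is a toy decision algorithm under my own assumptions and is unrelated to how DEDIX is built; `n_version_execute` and `NoMajority` are hypothetical names, and results must be hashable for the vote to work.

```python
from collections import Counter
from typing import Any, Callable, Sequence


class NoMajority(Exception):
    """Raised when no result is shared by a strict majority of versions."""


def n_version_execute(versions: Sequence[Callable[[Any], Any]], x: Any) -> Any:
    """Run every version on the same input and return the majority result."""
    results = [version(x) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    # Require a strict majority so that a tie is never silently accepted.
    if count * 2 <= len(results):
        raise NoMajority(f"results disagree: {results}")
    return value
```

With three versions, one faulty version is outvoted by the other two; if all three disagree, the vote fails and the caller has to handle the error.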

1,093 citations

"An integrated approach for increasi..." refers methods in this paper

..., Recovery Blocks and N-Version Programming) [4][5]....
...In [5], a methodology is presented that proposes the replication (software redundancy) of the program and checks the results obtained executing the replicated versions....

Book

Fault-tolerant computer system design

Dhiraj K. Pradhan¹

Texas A&M University¹

01 Feb 1996

TL;DR: This new edition specifically deals with this dynamically changing computing environment, incorporating new topics such as fault-tolerance in multiprocessor and distributed systems.

Abstract: In the ten years since the publication of the first edition of this book, the field of fault-tolerant design has broadened in appeal, particularly with its emerging application in distributed computing. This new edition specifically deals with this dynamically changing computing environment, incorporating new topics such as fault-tolerance in multiprocessor and distributed systems.

662 citations

"An integrated approach for increasi..." refers background in this paper

...As far as hardware is considered, the most commonly adopted approach exploits redundancy [1] applicable at different levels: at the system level by replicating the whole...

Journal Article

Concurrent error detection using watchdog processors-a survey

A. Mahmood¹, Edward J. McCluskey¹

Stanford University¹

01 Feb 1988, IEEE Transactions on Computers

TL;DR: It is shown that a large number of errors can be detected by monitoring the control flow and memory-access behavior and two techniques for control-flow checking are discussed and compared with current error-detection techniques.

Abstract: Concurrent system-level error detection techniques using a watchdog processor are surveyed. A watchdog processor is a small and simple coprocessor that detects errors by monitoring the behavior of a system. Like replication, it does not depend on any fault model for error detection. However, it requires less hardware than replication. It is shown that a large number of errors can be detected by monitoring the control flow and memory-access behavior. Two techniques for control-flow checking are discussed and compared with current error-detection techniques. A scheme for memory-access checking based on capability-based addressing is described. The design of a watchdog for performing reasonable checks on the output of a main processor by executing assertions is discussed.
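Control-flow checking of the kind surveyed here can be illustrated with a software model of a watchdog that knows the legal successors of each basic block and flags any transition outside that set. The sketch below is a simplification under my own assumptions: block names stand in for the signatures a hardware watchdog would observe, and `ControlFlowWatchdog` and `first_violation` are hypothetical names.

```python
from typing import Dict, Iterable, Optional, Set


class ControlFlowWatchdog:
    """Check an observed sequence of basic blocks against a control-flow graph."""

    def __init__(self, successors: Dict[str, Set[str]], entry: str) -> None:
        self.successors = successors  # legal next blocks for each block
        self.entry = entry            # the only legal first block
        self.current: Optional[str] = None

    def observe(self, block: str) -> bool:
        """Return True if entering `block` is a legal transition."""
        if self.current is None:
            legal = block == self.entry
        else:
            legal = block in self.successors.get(self.current, set())
        self.current = block
        return legal


def first_violation(
    watchdog: ControlFlowWatchdog, trace: Iterable[str]
) -> Optional[int]:
    """Return the index of the first illegal transition, or None if the trace is legal."""
    for index, block in enumerate(trace):
        if not watchdog.observe(block):
            return index
    return None
```

A check like this catches jumps to a block that is not a legal successor. It does not catch a wrong but legal branch, which is one reason the survey also discusses memory-access checks and assertions.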

599 citations

"An integrated approach for increasi..." refers background in this paper

...solution consists in the introduction of a special-purpose module, called watchdog, which constantly monitors the activities carried out by the processor [2][3]....