Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems

GEORGIA PSYCHOU (1), DIMITRIOS RODOPOULOS (2), MOHAMED M. SABRY (3), TOBIAS GEMMEKE (4), DAVID ATIENZA (3), TOBIAS G. NOLL (1), FRANCKY CATTHOOR (5)

(1) EECS, RWTH Aachen; (2) IMEC; (3) ESL, EPFL; (4) IDS, RWTH Aachen, formerly Holst Center/IMEC; (5) IMEC & KU Leuven
Nano-scale technology nodes bring reliability concerns back to the center stage of digital system design. A systematic classification of approaches that increase system resilience in the presence of functional hardware-induced errors is presented, dealing with higher system abstractions, i.e., the (micro-)architecture, the mapping, and the platform software. The field is surveyed in a systematic way based on non-overlapping categories, which add insight into the ongoing work by exposing similarities and differences. Hardware and software solutions are discussed in a similar fashion, so that interrelationships become apparent. Representative literature examples illustrate the properties of the presented categories. Moreover, it is shown how hybrid schemes can be decomposed into their primitive components.
Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems—
fault tolerance; reliability, availability, and serviceability; B.8.1 [Hardware]: Performance and Reliability—
reliability, testing, and fault tolerance
General Terms: Reliability, Design
Additional Key Words and Phrases: Resilience, Reliability, Mitigation, Fault Tolerance
ACM Reference Format:
Psychou G., Rodopoulos D., Sabry M. M., Gemmeke T., Atienza D., Noll T. G. and Catthoor F. 2017. Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems. ACM Comput. Surv. V, N, Article XX (January XXXX), 38 pages. DOI: http://dx.doi.org/10.1145/0000000.0000000
1. INTRODUCTION
The early concerns of John von Neumann [Von Neumann 1956] regarding building reliable computing entities out of unreliable components were largely forgotten with the gradual replacement of vacuum tubes by transistors and the subsequent large-scale transistor integration [Palem and Lingamneni 2012]. Now, after some decades, reliability has come back to the forefront in the context of modern CMOS technology.
The current reliability concerns originate from mechanisms that manifest both during the manufacturing process and during the system's operational lifetime. Inherent time-zero and time-dependent device variability, noise (e.g., supply voltage fluctuations), and particle strikes are some of the most prevalent causes of such concerns [Borkar 2005], [McPherson 2006], [Kuhn et al. 2011], [Aitken et al. 2013]. The anomalous physical conditions created by those effects are called faults. Depending on various conditions, faults can manifest as bit-level corruptions in the internal state or at the outputs of a digital system. The term functional errors is used to capture this class of errors, with the worst-case manifestation toward the end user being a complete failure of the expected system service.
The manifested errors can be temporary or permanent [Bondavalli et al. 2000], [Borkar 2005]. Temporary errors include transient and intermittent errors. Transient errors are non-deterministic (in both time and location), e.g., resulting from a fault due to a particle strike. Intermittent errors occur repeatedly but non-deterministically in time at the same location, and last from a single cycle up to a long (but finite) period of time. The main causes of intermittent errors are design weaknesses, aging, and wear-out mechanisms such as Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI). In contrast, permanent errors persist forever after their first occurrence. Causes of permanent errors are fabrication defects and aging.
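As a minimal sketch (illustrative only; the naming is ours, not the paper's), this error taxonomy can be encoded directly when logging observed errors, e.g., in a hypothetical fault-injection or monitoring framework:

    /* Hedged sketch: encoding the temporary/permanent error taxonomy above. */
    #include <stdio.h>

    typedef enum {
        ERR_TRANSIENT,    /* non-deterministic in time and location, e.g. a particle strike */
        ERR_INTERMITTENT, /* repeats non-deterministically in time at one location (BTI, HCI) */
        ERR_PERMANENT     /* persists forever after first occurrence (defects, wear-out) */
    } error_kind_t;

    typedef struct {
        unsigned     cycle;     /* cycle at which the error manifested */
        unsigned     location;  /* affected bit/module identifier */
        error_kind_t kind;
    } error_record_t;

    int main(void) {
        error_record_t r = { 1024u, 7u, ERR_TRANSIENT };
        printf("cycle=%u location=%u kind=%d\n", r.cycle, r.location, (int)r.kind);
        return 0;
    }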
The current work presents a classification scheme for organizing the research domain on the mitigation of functional errors that manifest themselves during the operational lifetime, addressed at the higher abstraction layers, and discusses representative work for each category. Given the multitude of reliability issues in modern digital systems, it is vital to set the boundaries of the current survey: it discusses resilience schemes at the architectural/microarchitectural layer and in platform software, which have increased in diversity during the last decades, following the evolution of computer architecture, parallel processing, the software stack, and general system design. Techniques at the application, circuit, and device layers can potentially complement the techniques presented here, but are not part of the current scope. Reliability-related errors that occur due to hardware-design errors, insufficiently specified systems, malicious attacks [Avizienis et al. 2004], or erroneous software interaction (i.e., manifestation of software bugs due to software of reduced quality [Lochmann and Goeb 2011]) are beyond the current scope. Techniques that mitigate permanent errors detected during testing in order to improve yield or lifetime are not included; techniques that tackle permanent errors due to device and wire wear-out are, however, incorporated.
The main contributions of this work are:
(i) An integrated overview of the domain of functional reliability techniques (at the higher layers of the system stack) is presented, using a systematic, hierarchical top-down splitting into sub-classes.
(ii) Multiple representative prior and state-of-the-art publications are mapped to these categories to illustrate the concepts involved.
(iii) Hardware and software solutions are discussed using a similar reasoning, to allow interrelations to become more visible.
(iv) The complementary nature of the splits allows hybrid schemes to be effectively decomposed and better understood, which is especially important in the era of growing cross-layer resilience design.
The current paper is organized as follows: Section 2 presents terminology regarding reliable system design, the abstraction layers that are addressed in this work, and information on the rationale of the proposed classification. The classification, along with the presentation of published literature, begins in Section 3 for techniques that operate at the (micro-)architectural layers of the system and continues in Section 4 with techniques at the mapping and software part of the platform. Section 5 illustrates ways of using the proposed framework and Section 6 discusses observations and trends in the domain. Finally, related work is presented in Section 7 and Section 8 concludes the paper. Moreover, from this point on, the symbol (s) will be used to refer the reader to the supplementary material (see the ACM CSUR website) for additional information.
2. CONTEXT AND USEFUL TERMINOLOGY
2.1. Resilient Digital System Design
This survey presents an organization of techniques that can be used to make a digital system more reliable at the functional level. Reliability is defined as the probability that over a specific period the system will satisfy its specification, i.e., the total set of requirements to be satisfied by the system. Functional reliability is defined as the probability that over a specific period of time the system will fulfill its functionality, i.e., the set of functions that the system should perform [IEEE_Std 1990]. Functional reliability relates to the correctness of binary digits, as opposed to parametric reliability, which deals with variations in operating margins [Rodopoulos et al. 2015]. Functionality is one of the major elements of the specification set. Others may be minimum performance (e.g., throughput [ops/s], computational power [MIPS]) and maximum costs (e.g., silicon area [mm^2], power [W], energy [J/op], latency [s/op]). In the following, the term reliability will be used to denote functional reliability. The term resilience describes the ability of a system to defer or avoid (functional) system failures in the presence of errors. When a system becomes more resilient, its reliability is increased. The terms reliable and resilient (system design) will be used interchangeably in this paper (s).
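As a concrete illustration (a standard reliability-engineering formulation, added here for clarity rather than taken from the surveyed literature), the probabilistic definition above can be written over a mission time t, with a common closed form under the assumption of a constant failure rate lambda:

    % Standard reliability-engineering illustration (not the paper's notation):
    R(t) = P\{\text{no functional failure in } [0, t]\},
    \qquad
    R(t) = e^{-\lambda t} \quad \text{(constant failure rate } \lambda\text{)}.

For example, a constant failure rate of lambda = 10^-4 failures/hour over a mission time of t = 1000 hours yields R(1000) = e^{-0.1}, which is approximately 0.905.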
2.2. Computing Terminology
2.2.1. Terminology on Abstraction Layers.
This survey includes techniques implemented at the microarchitecture and architecture layers, as well as at the mapping & software (SW) part of the system, as shown in Figure 1. The device, circuit, and application layers are not considered. In this survey, the term platform denotes a system composed of architectural and microarchitectural components together with the software required to run applications. When the system is not SW-programmable, as is the case for some small embedded systems, the term platform denotes only the hardware part.
Platform HW. Microarchitecture describes how the HW constituent parts are connected and inter-operate to implement the operations that the HW supports. It includes the memory system, the memory interconnect, and the internals of processors [Hennessy and Patterson 2011]. This applies both to very flexible SW-programmable processors, where an instruction set is present to control the operation sequence, and to dedicated HW processing components. Dedicated HW processors feature minimum to limited flexibility. Both SW-programmable and dedicated components can be mapped on highly reconfigurable fabrics, like field-programmable gate arrays (FPGAs). The primary difference compared with SW-programmable processors is that not only the control flow but also the data flow can be substantially changed/reconfigured. The microarchitecture together with the Instruction Set Architecture (ISA) constitute the computer architecture (although the term has recently been used to include other aspects of the design as well [Hennessy and Patterson 2011]).
In general, the term HW module denotes a subset of the digital system's HW, the internals of which cannot be observed (or are chosen not to be observed), analogous to the term black box [Rodopoulos et al. 2015]. To define a HW module, its functionality and its interface with the external world must be described. At the microarchitectural and architectural layer, examples of HW modules are a multiprocessor system, a single core, a functional unit, the row of a memory array, a pipeline stage, or a register (without exposing the internal circuit implementation). In the context of this survey, the term platform HW is an umbrella term that encompasses the microarchitecture and architecture layers of a system.
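As an illustration of this black-box view (a hypothetical sketch, not the paper's notation), describing a HW module requires only its interface and its observable functionality:

    /* Hypothetical sketch: a HW module as a black box, i.e. an interface plus
     * observable input/output behaviour; the internals remain hidden. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        const char *name;       /* e.g. "functional unit", "memory-array row" */
        unsigned    in_width;   /* interface: total input bit width  */
        unsigned    out_width;  /* interface: total output bit width */
        uint64_t  (*evaluate)(uint64_t inputs); /* functionality, I/O only */
    } hw_module_t;

    /* A 32-bit adder viewed as a black box: two 32-bit operands packed into
     * one 64-bit word; only the I/O behaviour is exposed. */
    static uint64_t adder_eval(uint64_t in) {
        uint32_t a = (uint32_t)(in >> 32), b = (uint32_t)in;
        return (uint64_t)(uint32_t)(a + b);
    }

    int main(void) {
        hw_module_t adder = { "32-bit adder", 64, 32, adder_eval };
        printf("%s: 2+3=%llu\n", adder.name,
               (unsigned long long)adder.evaluate(((uint64_t)2 << 32) | 3));
        return 0;
    }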
[Fig. 1: Scope of the current paper. The figure depicts the abstraction layers of a digital system (circuit/device, platform hardware, mapping & platform software, application layer), with the platform hardware and mapping & platform software layers marked as the scope of this work.]
Mapping. During mapping, the algorithmic-level specification is mapped onto a pre-selected datapath and control path that implements the required behaviour (s). Nowadays, the term is also used to denote how an application or an application set is split, distributed, and ordered in order to run on a multiprocessor design.
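A minimal sketch of this multiprocessor sense of mapping (our illustration; the task counts and the round-robin policy are assumed, not prescribed by the paper) shows the three ingredients of splitting, distributing, and ordering a task set:

    /* Illustrative sketch: statically splitting, distributing and ordering
     * NUM_TASKS tasks over NUM_CORES cores (round-robin policy assumed). */
    #include <stdio.h>

    #define NUM_TASKS 8
    #define NUM_CORES 4

    int main(void) {
        for (int t = 0; t < NUM_TASKS; ++t) {
            int core = t % NUM_CORES;  /* distribution: core assignment  */
            int slot = t / NUM_CORES;  /* ordering: local execution slot */
            printf("task %d -> core %d, slot %d\n", t, core, slot);
        }
        return 0;
    }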
Platform SW. In order to enable software-hardware interaction, an instruction set is selected initially. The instruction set defines the hardware-software interface [Hennessy and Patterson 2011]. Many application instances sharing specific characteristics (a "domain") can be mapped on the same instruction set. Each of the instructions in that set can then be implemented in the hardware in different ways.

Platform SW includes several sublayers that interpret or translate high-level operations (derived from the algorithmic description) into "primitive" instructions, which correspond to the instruction set and are ready to be executed by the hardware. Examples include system libraries, operating systems, and run-time managers (s).
2.2.2. Additional Terminology.
A Control Data Flow Graph (CDFG) is a graph representing all possible paths that the flow of data can follow during execution. An application corresponds to a separate CDFG in the system. A process is an instantiation of a program, or of a segment of code, under execution; it consists of its "own" memory space, containing an image of the executable code and data, resource descriptions, security attributes, and state information (register contents, physical memory addressing, etc.), i.e., all the information necessary to execute the program (s). Threads are sequences of instructions, or flows of control, in a program which can be executed concurrently. All threads in a given process share the private address space of that process (s). The term task is used quite ambiguously in the literature: on the one hand, the terms task and process are used synonymously; on the other hand, the terms process and thread are considered "mechanical", while the term task is considered more conceptual and is used in the context of scheduling as a set of program instructions loaded in memory for execution. The term task in this paper is used as an umbrella term, which can denote complete applications, sub-parts of the CDFG like processing kernels (e.g., for-loops), or even single computations (e.g., instructions), depending on the context.
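The process/thread distinction above can be made concrete with a short POSIX-threads sketch (illustrative only; it assumes a POSIX system): two threads in one process update the same global variable, demonstrating the shared address space:

    /* Sketch: threads are concurrent flows of control sharing the address
     * space of their process, so both workers see the same global counter. */
    #include <pthread.h>
    #include <stdio.h>

    static int shared_counter = 0;  /* lives in the process's memory space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000; ++i) {
            pthread_mutex_lock(&lock);
            ++shared_counter;       /* same location for every thread */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);  /* two flows of control */
        pthread_create(&t2, NULL, worker, NULL);  /* in a single process  */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared_counter = %d\n", shared_counter);  /* prints 2000 */
        return 0;
    }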
2.3. Rationale of the Classification and Its Presentation
The proposed classification tree is organized using a top-down splitting of the types of techniques that increase system resilience. It is accompanied by a mapping of related work (see Figure 2). The top-down splitting allows a comprehensive list of technique types to be reached, which can always be expanded further on demand. Splits are created based on properties of the techniques, which allow them to be grouped together. More specifically, the properties in the proposed framework concern: (1) the effect that the techniques have on the execution, and (2) the changes that are required to the system design for a technique to be implemented. The properties will be elaborated as the tree is presented.
[Fig. 2: Top-down splitting to create the classification tree and mapping of the related work. The figure sketches a tree with root A, subcategories A1 and A2, and leaves A1.a, A1.b, A2.a, A2.b; representative works (#1-#5) are mapped bottom-up onto the leaves, and groups of leaves are linked to the subsections (x.1, x.2) that discuss them.]
Other organizations are also possible, such as organizing the splits around the system functionality, hardware components, types of errors (transient, intermittent, permanent), types of resilience metrics, or the application domains. The aforementioned organization is chosen in order to stress the reusability of techniques, but also to enable a better understanding of hybrid combinations. This is especially supported through the complementarity of the categories. It is important to note that many actual approaches that increase resilience typically represent hybrids and do not fall strictly into only one of the categories.
For the presentation of the classification tree, the following structure is followed for each of the abstraction layers (platform hardware, mapping, and platform software). First, the main classes of techniques are presented. Within each class, subcategories are presented and illustrated with the help of a figure. Groups of nodes are chosen to be discussed together. For the visualization of the groups, bubbles with different colors are used, along with the subsection number and a small geometrical shape (see Figure 2). The colors and the geometrical shapes enable a more explicit link with the corresponding subsections in the text. The geometrical shapes, in particular, assist the reader with the black-and-white printed version. The order of the leaves, the colors, and the geometrical shapes do not indicate the significance or the maturity of the techniques. For each of the classes, pros and cons are discussed, based on general properties bound to each class. Among the aspects considered are: area and power overhead, performance degradation (in terms of additional execution cycles), mitigation latency (the delay until the scheme fulfils the intended mitigation function), error protection, general applicability, and storage overhead. An overview of these aspects for the different classes can be found in Tables II-VII in the Appendix (see supplementary material). In parallel, representative related work is discussed to further illustrate the subcategory concepts and to demonstrate the usefulness of the proposed classification scheme for classifying existing (and future) literature (s).

The other group of backward techniques includes the techniques that retry the execution by storing the state of the system at intermediate points.