Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems
Summary
1. INTRODUCTION
- Intermittent errors occur repeatedly but non-deterministically in time at the same location and last for one cycle or even for a long (but finite) period of time.
- The main contributions of this work are: (i) An integrated overview of the domain of functional reliability techniques (at the higher system level stack) is presented, using a systematic, hierarchical top-down splitting into sub-classes.
- Section 5 illustrates ways of using the proposed framework and Section 6 discusses observations and trends in the domain.
2.1. Resilient Digital System Design
- This survey presents an organization of techniques that can be used to make a digital system more reliable at functional level.
- Functional reliability is defined as the probability that over a specific period of time the system will fulfill its functionality, i.e. the set of functions that the system should perform [IEEE_Std 1990].
- Functional reliability is related to correcting binary digits, as opposed to parametric reliability, which deals with variations in operating margins [Rodopoulos et al. 2015].
- Functionality is one of the major elements of the specification set.
- When a system becomes more resilient, its reliability is increased.
2.2. Computing Terminology
- The term platform denotes a system composed of architectural and microarchitectural components together with the software required to run applications.
- This applies both to very flexible SW-programmable processors, where an instruction-set is present to control the operation sequence, and to dedicated HW processing components.
- The instruction set defines the hardware-software interface [Hennessy and Patterson 2011].
- The term task is used quite ambiguously in the literature.
2.3. Rationale of the classification and its presentation
- The proposed classification tree is organized using a top-down splitting of the types of techniques that increase the system resilience.
- [Figure: top-down classification into subsections x.1 and x.2 (classes A1.a, A1.b, A2.a, A2.b) versus bottom-up mapping of works #1–#5.] Published works that increase resilience typically represent hybrids and do not fall strictly into only one of the categories.
- The colors and the geometrical shapes are used to enable a more explicit link with the corresponding subsections in the text.
- For each of the classes, pros and cons are discussed, based on general properties bound to each class.
- Deterministic execution is required for replicas to work.
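The determinism requirement above can be made concrete with a minimal sketch (entirely our own illustration; `duplex_check` is a hypothetical name, not from the surveyed works): replica-based detection compares outputs, so any non-determinism in the computation is indistinguishable from an error.

```python
import random

def duplex_check(fn, x):
    """Run fn twice on the same input and flag a mismatch as a suspected error."""
    a, b = fn(x), fn(x)
    return a if a == b else None   # None models a raised error flag

# A deterministic computation passes the comparison.
assert duplex_check(lambda x: x * 2, 3) == 6

# A non-deterministic one triggers a false alarm, even though nothing failed.
random.seed(0)
assert duplex_check(lambda x: x + random.random(), 3) is None
```

This is why replicated schemes either enforce deterministic execution or record non-deterministic events so they can be replayed identically.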
3. PLATFORM HARDWARE
- To make digital systems more robust, functional capabilities need to be provided that would be unnecessary in a fault-free environment.
- This section focuses on techniques that modify the hardware capabilities for reliability purposes.
- The complete classification scheme is shown in Figure 12 in Subsection 3.5.
- Main criteria for further categorization include whether modifications are required in: existing functionalities, existing design implementations, resource allocation, operating conditions, the interaction with neighbouring modules, storage overhead.
- Leaves of the tree have an accompanying simple ordinal number for identification.
3.1. Forward execution - Additional HW modules provision
- This subsection discusses techniques that increase the resilience through adding HW modules on the platform.
- Within each pair, error detection is performed through a comparison circuit.
- Again, a distinction can be made between modules that are in parallel execution mode and modules that act as spares.
- These schemes exploit inherent redundancy in regularly structured systems such as arrays of PEs, memories and interconnection networks or even processors.
- Pros include low area and power overhead, general applicability (for systems with inherent redundancy).
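A small sketch of exploiting inherent redundancy in a regularly structured system (all names, e.g. `PEArray`, are illustrative, not taken from the surveyed works): an array of identical processing elements (PEs) keeps one spare, and a faulty active PE's slot is remapped onto it.

```python
class PEArray:
    """Array of identical PEs with spares; models spare-based reconfiguration."""

    def __init__(self, n_active, n_spare=1):
        self.pes = list(range(n_active + n_spare))  # PE ids
        self.active = self.pes[:n_active]           # PEs currently doing work
        self.spares = self.pes[n_active:]           # idle spare PEs
        self.faulty = set()

    def mark_faulty(self, pe):
        """Record a permanent fault and swap in a spare, if one is available."""
        self.faulty.add(pe)
        if pe in self.active and self.spares:
            idx = self.active.index(pe)
            self.active[idx] = self.spares.pop(0)

arr = PEArray(n_active=4)   # PEs 0-3 active, PE 4 is the spare
arr.mark_faulty(2)
assert arr.active == [0, 1, 4, 3]   # spare PE 4 has replaced faulty PE 2
assert arr.spares == []
```

The low overhead comes from the fact that the spare capacity is already part of the regular structure; only the remapping logic is added.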
3.3. Backward execution - Additional HW modules provision
- This subsection discusses techniques that increase the resilience of systems through rollback to an earlier point of execution and repetition of the execution.
- The corresponding categories and subsections are shown in Figure 8.
- This category discusses techniques that provide additional HW modules with the same functionality as the original ones.
- Spare modules would, for example, not only take over the execution after the primary module has failed but also repeat the failed execution.
- Pros include the flexibility to trade-off area, power, performance, latency with error protection depending on the selected functionality.
3.4. Backward execution - HW modules amount fixed
- The majority of the techniques proposed in the literature that employ backward execution reuse the already existing HW modules, since this avoids the additional area overhead of the previous category.
- The literature focuses on employing hardware-based threads in coupled execution mode.
- The first stream is serviced (by the operating system) and execution resumes.
- Pros include high error protection (for transient errors only) and general applicability.
- These checkpointing schemes can be characterized as global and local.
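The rollback-and-retry behavior described above can be sketched as follows (a minimal illustration under our own naming; real schemes checkpoint hardware state, not a Python value, and the fault model here is a detected transient error that does not recur on retry):

```python
import copy

def run_with_checkpoints(state, steps, step_fn, interval=2, fault_at=None):
    """Apply step_fn over steps; save a global checkpoint every `interval`
    steps and roll back to the last checkpoint when an error is detected."""
    checkpoint = copy.deepcopy(state)   # initial checkpoint
    i = 0
    while i < len(steps):
        try:
            state = step_fn(state, steps[i], inject_fault=(i == fault_at))
            i += 1
            if i % interval == 0:
                checkpoint = copy.deepcopy(state)   # global checkpoint
        except RuntimeError:
            state = copy.deepcopy(checkpoint)       # rollback...
            i = (i // interval) * interval          # ...and re-execute
            fault_at = None                         # transient: gone on retry
    return state

def add_step(state, value, inject_fault=False):
    if inject_fault:
        raise RuntimeError("transient error detected")
    return state + value

# A transient error at step 3 is absorbed by rolling back to the checkpoint.
assert run_with_checkpoints(0, [1, 2, 3, 4, 5], add_step,
                            interval=2, fault_at=3) == 15
```

The `interval` parameter exposes the latency/overhead trade-off mentioned in the pros and cons: coarser checkpointing stores less but re-executes more after a rollback.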
3.5. Overall platform hardware classification
- The sub-trees presented in the previous subsections are combined to form the overall classification tree for platform HW techniques, as shown in Figure 12.
- Starting from the top-level split of Figure 3, the intermediate nodes (colored pale green) are followed when necessary, to reach the final classes (colored darker green and numbered).
4. PLATFORM SOFTWARE
- Techniques that extend the platform software capabilities for reliability purposes are presented.
- These four classes are discussed in the following subsections, as shown in Figure 13. Main criteria for further categorization into classes include whether modifications are required in: existing functionalities, existing task implementations, the resource allocation, the interaction with neighbouring tasks, the execution mode (of additional tasks), and cooperation among HW modules.
- Pros include the limited storage, area, and performance overhead, low latency, high error protection (for instruction memory errors only), and general applicability.
- Or the added task performs some different function, like error correction.
- It must be noted that parallel execution in this context does not necessarily…
4.2. Forward execution - Tasks amount fixed
- This subsection discusses techniques that do not provide additional tasks in the system in order to make it more reliable.
- Techniques that are focused around the functionality of tasks can either operate within the task boundaries, by reusing the task functionality (internal functionality reuse) or operate outside the task boundaries by rearranging its interaction with the other tasks (I/O configuration modification).
- These schemes reorganize the application or instruction profile so that the re-ordered execution is more robust.
- Pros include the lack of storage, power overhead and latency.
- Cons include the limited error protection and rather system-specific applicability.
4.3. Backward execution - Retry without state storage
- Retry without state storage is further split along two axes: parallel versus sequential execution, and intra-module versus inter-module techniques.
- Sporadic tasks are aperiodic tasks that have hard deadlines.
- The task can be re-executed either on a single processor so that transient errors are removed or on a different processor so that permanent errors are avoided.
- Therefore, this category is further split into intra-module and inter-module techniques.
- Cons include the performance overhead, latency and limitation to transient errors.
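The intra-module versus inter-module retry described above can be sketched as follows (an illustration under assumed names; `Processor` and `retry_task` are ours): a failed task is first re-executed on the same processor, which removes transient errors, and only migrated to another processor when failures persist, which avoids permanent ones.

```python
class Processor:
    """Toy processor: fails a fixed number of times (transient) or always (permanent)."""

    def __init__(self, fail_first=0, permanent=False):
        self.fail_first = fail_first   # number of initial transient failures
        self.permanent = permanent     # permanently faulty processor

    def run(self, task):
        if self.permanent:
            return False, None
        if self.fail_first > 0:
            self.fail_first -= 1       # transient error: clears on retry
            return False, None
        return True, task()

def retry_task(task, processors, max_retries=2):
    """Retry on the same processor first (intra-module), then migrate (inter-module)."""
    for proc in processors:
        for _ in range(max_retries):
            ok, result = proc.run(task)
            if ok:
                return result
    raise RuntimeError("task failed on all processors")

# Transient error: the retry on the same processor succeeds.
assert retry_task(lambda: 42, [Processor(fail_first=1)]) == 42
# Permanent error: the task migrates to a healthy processor.
assert retry_task(lambda: 7, [Processor(permanent=True), Processor()]) == 7
```

No state is stored here: the task is simply restarted from its inputs, which is exactly why this category carries performance overhead and latency rather than storage overhead.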
4.4. Backward execution - Retry with state storage
- The other group of backward techniques includes the techniques that retry the execution by storing the state of the system at intermediate points.
- The inter-process dependencies are often recorded, so that the execution can be accurately repeated during the recovery phase.
- Such additional pieces of information…
- These surveys address both single-threaded and multi-threaded/multi-process applications. A number of techniques have been developed that do not explicitly bring the handling of non-deterministic events to the forefront.
- Checkpointing at user-level utilizes run-time libraries that are linked to the application program.
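User-level checkpointing can be sketched in a few lines (a deliberately minimal illustration; `CheckpointLib` is a hypothetical name, and real run-time libraries serialize the full process image to stable storage, not a single state object):

```python
import io
import pickle

class CheckpointLib:
    """Toy run-time "library" linked to the application: snapshot and restore state."""

    def __init__(self):
        self._store = io.BytesIO()   # stands in for stable storage

    def checkpoint(self, state):
        self._store = io.BytesIO(pickle.dumps(state))

    def restore(self):
        return pickle.loads(self._store.getvalue())

lib = CheckpointLib()
state = {"iter": 0, "acc": 0}
for i in range(1, 6):
    state = {"iter": i, "acc": state["acc"] + i}
    if i == 3:
        lib.checkpoint(state)        # application-chosen checkpoint location

# Simulate a crash after iteration 5: restart from the last checkpoint.
state = lib.restore()
assert state == {"iter": 3, "acc": 6}
```

Because the library is linked at user level, the application itself decides what state to save and where the checkpoint locations are, without kernel support.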
4.5. Overall mapping and platform software classification
- By combining the sub-trees of the previous subsections, the overall mapping and platform software classification tree is built, as shown in Figure 22.
- Starting from the top-level split of Figure 13, the intermediate nodes (colored pale yellow) are followed when necessary, to reach the final classes (colored darker yellow and numbered).
5. USAGE OF THE CLASSIFICATION FRAMEWORK
- Identifying the primitive components (each corresponding to a primitive category) and their position in the framework first makes it possible to handle the complexity of the sometimes highly sophisticated mitigation schemes.
- A “divide and conquer” view of the publication enables the reader to delve into the most relevant implementation details (when that is necessary) in a much more controlled way.
5.1. Mapping of hybrid schemes
- In reality, the resiliency and mitigation approaches, which are present in research papers, rarely belong to a single leaf of the previous, and indeed any, classification.
- The majority of the published work consists of hybrid combinations of the leaves.
- In particular, UDP-Lite packetization is (re-)designed in such a way that bits which are very sensitive to errors are better protected.
- The super-reliable cores execute the operations that are less resilient to errors.
- A run-time scheduler reassigns a task that has failed on a particular RRC to another RRC (leaf S.8 of the Mapping & platform SW classification).
6. DISCUSSION AND FUTURE CHALLENGES
- The proposed classification was illustrated through a representative list of schemes, to better absorb the related ideas and support the validity of the tree.
- The literature on fault tolerance and resilience techniques has evolved in accordance with the trends in computer architecture and software design development.
- Therefore, by properly propagating information among the different layers and providing a suitable degree of adaptivity (with design time and run time knobs), the most cost-effective solutions can be achieved.
- TDP mode 5 corresponds to a mode in which cores are operated at near-threshold voltages.
8. CONCLUSION
- Techniques that increase resilience and mitigate functional reliability errors were classified in a novel way.
- This was achieved through a framework with complementary splits, in which primitive mitigation concepts are defined.
- That allows every type of technique to be classified, by combining the appropriate components.
- The framework has been accompanied by a wide variety of sources from the published literature.
- Insight can be provided to the designers and researchers about the nature of existing schemes, since every node has some unique properties.
Frequently Asked Questions (17)
Q2. What are the future works in "Classification of resilience techniques against functional errors at higher abstraction layers of digital systems"?
The most prominent is that mapping and SW provide a lot of flexibility, due to the re-mapping possibilities of a given task sequence onto the "fixed" HW. Networked applications further expanded the deliverable functionality possibilities. The system behavior can be adapted at run time whenever significant environmental changes take place, or according to varying error rates. This is especially so, as errors can be masked as they propagate through the different hardware and software layers (including the application itself).
Q3. What are the cons of a checkpointing scheme?
Cons include latency (depending on the checkpointing granularity), performance (depending also on whether checkpointing is overlapped with normal execution) and the limitation to transient errors.
Q4. What are the cons of a hybrid?
Cons include the need for system-specific solutions, the low error protection (through isolation), the potential performance degradation.
Q5. What are the pros and cons of checking a system?
Cons include the potentially high storage and power overhead, the potentially very high latency and performance (depending also on whether checkpointing is overlapped with normal execution).
Q6. What are the challenges and opportunities for the fault tolerance techniques?
Further technology trends like 3D integration, incorporating heterogeneous technologies on a single platform and dark silicon pose new challenges and opportunities for the fault tolerance techniques.
Q7. What are some examples of emerging error-tolerant application domains?
Other examples of emerging error-tolerant application domains are Recognition, Mining and Synthesis (RMS) [Dubey 2005] as well as artificial neural networks (ANNs) [Temam 2012].
Q8. What are the pros and cons of the HW module?
Pros include the limited area and power, performance overhead as the new implementation will typically satisfy the system requirements, while minimizing additional cost.
Q9. What is the term task in this paper?
The term task in this paper is used as an umbrella term, which can denote…
Q10. What are the main criteria for further categorizing into classes?
These four classes are discussed in the following subsections, as shown in Figure 13. Main criteria for further categorization into classes include whether modifications are required in: existing functionalities, existing task implementations, the resource allocation, the interaction with neighbouring tasks, the execution mode (of additional tasks), and cooperation among HW modules.
Q11. What is the concept of storing checkpoints in a customized way?
Rather than saving checkpoints at fixed intervals, checkpoints can be stored in a customized way so that the amount of stored data is minimized.
Q12. What are the pros and cons of local schemes?
Compared to global schemes, local schemes reduce the amount of data to be stored during checkpointing but require typically a more complicated recovery algorithm.
Q13. What are the pros and cons of adding modules with different functionality?
Instead of adding modules with the same functionality, modules with different functionality can be added; the added modules play an active role in the recovery as in the previous category.
Q14. What is the difference between error recovery and repair?
Error recovery is further split into forward error recovery (FER), which includes redundancy such as triple modular redundancy, and backward error recovery (BER), which includes rolling back to a previously saved correct state of the system.
Q15. What are the types of systems that are amenable to non-deterministic events?
Beyond the earlier discussed types of systems, intra-module schemes may address applications that are amenable to numerous non-deterministic events: uncertain functions (like human input functions), interrupts, system calls, I/O operations due to communication with external devices.
Q16. What are the pros and cons of online multiprocessor checkpointing?
System-specific strategies have been developed which deal with events coming from the external environment, especially events due to communication with external devices. Online multiprocessor checkpointing can be broadly characterized as local and global.
Q17. What is the other group of backward techniques?
The other group of backward techniques includes the techniques that retry the execution by storing the state of the system at intermediate points.