Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems

GEORGIA PSYCHOU (1), DIMITRIOS RODOPOULOS (2), MOHAMED M. SABRY (3), TOBIAS GEMMEKE (4), DAVID ATIENZA (3), TOBIAS G. NOLL (1), FRANCKY CATTHOOR (5)

(1) EECS, RWTH Aachen; (2) IMEC; (3) ESL, EPFL; (4) IDS, RWTH Aachen, formerly Holst Center/IMEC; (5) IMEC & KU Leuven
Nano-scale technology nodes bring reliability concerns back to the center stage of digital system design. A systematic classification of approaches that increase system resilience in the presence of functional hardware-induced errors is presented, dealing with higher system abstractions, i.e., the (micro-)architecture, the mapping, and the platform software. The field is surveyed in a systematic way based on non-overlapping categories, which add insight into the ongoing work by exposing similarities and differences. Hardware and software solutions are discussed in a similar fashion, so that interrelationships become apparent. Representative literature examples illustrate the properties of the presented categories. Moreover, it is shown how hybrid schemes can be decomposed into their primitive components.
Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems—
fault tolerance; reliability, availability, and serviceability; B.8.1 [Hardware]: Performance and Reliability—
reliability, testing, and fault tolerance
General Terms: Reliability, Design
Additional Key Words and Phrases: Resilience, Reliability, Mitigation, Fault Tolerance
ACM Reference Format:
Psychou G., Rodopoulos D., Sabry M. M., Gemmeke T., Atienza D., Noll T. G. and Catthoor F. 2017. Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems. ACM Comput. Surv. V, N, Article XX (January XXXX), 38 pages. DOI: http://dx.doi.org/10.1145/0000000.0000000
1. INTRODUCTION
The early concerns of John von Neumann [Von Neumann 1956] regarding building reliable computing entities out of unreliable components were largely forgotten with the gradual replacement of vacuum tubes by transistors and the subsequent large-scale transistor integration [Palem and Lingamneni 2012]. Now, after some decades, reliability has come back to the forefront in the context of modern CMOS technology.
The current reliability concerns originate from mechanisms that manifest both during the manufacturing process and during the system's operational lifetime. Inherent time-zero and time-dependent device variability, noise (e.g., supply voltage fluctuations), and particle strikes are some of the most prevalent causes of such concerns [Borkar 2005], [McPherson 2006], [Kuhn et al. 2011], [Aitken et al. 2013]. The anomalous physical conditions created by those effects are called faults. Depending on various conditions, faults can manifest as bit-level corruptions in the internal state or at the outputs of a digital system. The term functional errors is used to capture this class of errors, with the worst-case manifestation toward the end user being a complete failure of the expected system service.
The manifested errors can be temporary or permanent [Bondavalli et al. 2000], [Borkar 2005]. Temporary errors include transient and intermittent errors. Transient errors are non-deterministic (in both time and location), e.g., resulting from a fault due to a particle strike. Intermittent errors occur repeatedly but non-deterministically in time at the same location, and last from a single cycle up to a long (but finite) period of time. The main causes of intermittent errors are design weaknesses, aging, and wear-out mechanisms such as Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI). In contrast, permanent errors persist forever after their first occurrence. Causes of permanent errors are fabrication defects and aging.
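As a minimal sketch (illustrative only; the naming is ours, not the paper's), this error taxonomy can be encoded directly when logging observed errors, e.g., in a hypothetical fault-injection or monitoring framework:

    /* Hedged sketch: encoding the temporary/permanent error taxonomy above. */
    #include <stdio.h>

    typedef enum {
        ERR_TRANSIENT,    /* non-deterministic in time and location, e.g. a particle strike */
        ERR_INTERMITTENT, /* repeats non-deterministically in time at one location (BTI, HCI) */
        ERR_PERMANENT     /* persists forever after first occurrence (defects, wear-out) */
    } error_kind_t;

    typedef struct {
        unsigned     cycle;     /* cycle at which the error manifested */
        unsigned     location;  /* affected bit/module identifier */
        error_kind_t kind;
    } error_record_t;

    int main(void) {
        error_record_t r = { 1024u, 7u, ERR_TRANSIENT };
        printf("cycle=%u location=%u kind=%d\n", r.cycle, r.location, (int)r.kind);
        return 0;
    }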
The current work presents a classification scheme for organizing the research domain on the mitigation of functional errors that manifest themselves during the operational lifetime, addressed at the higher abstraction layers, and discusses representative work for each category. Given the multitude of reliability issues in modern digital systems, it is vital to set the boundaries of the current survey: it discusses resilience schemes at the architectural/microarchitectural layer and in platform software, which have increased in diversity during the last decades, following the evolution of computer architecture, parallel processing, the software stack, and general system design. Techniques at the application, circuit, and device layers can potentially complement the techniques presented here, but are not part of the current scope. Reliability-related errors that occur due to hardware-design errors, insufficiently specified systems, malicious attacks [Avizienis et al. 2004], or erroneous software interaction (i.e., manifestation of software bugs due to software of reduced quality [Lochmann and Goeb 2011]) are beyond the current scope. Techniques that mitigate permanent errors detected during testing in order to improve yield or lifetime are not included; techniques that tackle permanent errors due to device and wire wear-out are, however, incorporated.
The main contributions of this work are:
(i) An integrated overview of the domain of functional reliability techniques (at the higher layers of the system stack) is presented, using a systematic, hierarchical top-down splitting into sub-classes.
(ii) Multiple representative prior and state-of-the-art publications are mapped to these categories to illustrate the concepts involved.
(iii) Hardware and software solutions are discussed using a similar reasoning, to allow interrelations to become more visible.
(iv) The complementary nature of the splits allows hybrid schemes to be effectively decomposed and better understood, which is especially important in the era of growing cross-layer resilience design.
The current paper is organized as follows: Section 2 presents terminology regarding reliable system design, the abstraction layers that are addressed in this work, and information on the rationale of the proposed classification. The classification, along with the presentation of published literature, begins in Section 3 for techniques that operate at the (micro-)architectural layers of the system and continues in Section 4 with techniques at the mapping and software part of the platform. Section 5 illustrates ways of using the proposed framework and Section 6 discusses observations and trends in the domain. Finally, related work is presented in Section 7 and Section 8 concludes the paper. Moreover, from this point on, the symbol (s) will be used to refer the reader to the supplementary material (see the ACM CSUR website) for additional information.
2. CONTEXT AND USEFUL TERMINOLOGY
2.1. Resilient Digital System Design
This survey presents an organization of techniques that can be used to make a digital system more reliable at the functional level. Reliability is defined as the probability that over a specific period the system will satisfy its specification, i.e., the total set of requirements to be satisfied by the system. Functional reliability is defined as the probability that over a specific period of time the system will fulfill its functionality, i.e., the set of functions that the system should perform [IEEE_Std 1990]. Functional reliability relates to the correctness of binary digits, as opposed to parametric reliability, which deals with variations in operating margins [Rodopoulos et al. 2015]. Functionality is one of the major elements of the specification set. Others may be minimum performance (e.g., throughput [ops/s], computational power [MIPS]) and maximum costs (e.g., silicon area [mm^2], power [W], energy [J/op], latency [s/op]). In the following, the term reliability will be used to denote functional reliability. The term resilience describes the ability of a system to defer or avoid (functional) system failures in the presence of errors. When a system becomes more resilient, its reliability is increased. The terms reliable and resilient (system design) will be used interchangeably in this paper (s).
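As a concrete illustration (a standard reliability-engineering formulation, added here for clarity rather than taken from the surveyed literature), the probabilistic definition above can be written over a mission time t, with a common closed form under the assumption of a constant failure rate lambda:

    % Standard reliability-engineering illustration (not the paper's notation):
    R(t) = P\{\text{no functional failure in } [0, t]\},
    \qquad
    R(t) = e^{-\lambda t} \quad \text{(constant failure rate } \lambda\text{)}.

For example, a constant failure rate of lambda = 10^-4 failures/hour over a mission time of t = 1000 hours yields R(1000) = e^{-0.1}, which is approximately 0.905.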
2.2. Computing Terminology
2.2.1. Terminology on Abstraction Layers.
This survey includes techniques implemented at the microarchitecture and architecture layers, as well as at the mapping & software (SW) part of the system, as shown in Figure 1. The device, circuit, and application layers are not considered. In this survey, the term platform denotes a system composed of architectural and microarchitectural components together with the software required to run applications. When the system is not SW-programmable, as is the case for some small embedded systems, the term platform denotes only the hardware part.
Platform HW. Microarchitecture describes how the HW constituent parts are connected and inter-operate to implement the operations that the HW supports. It includes the memory system, the memory interconnect, and the internals of processors [Hennessy and Patterson 2011]. This applies both to very flexible SW-programmable processors, where an instruction set is present to control the operation sequence, and to dedicated HW processing components. Dedicated HW processors feature minimum to limited flexibility. Both SW-programmable and dedicated components can be mapped on highly reconfigurable fabrics, like field-programmable gate arrays (FPGAs). The primary difference compared with SW-programmable processors is that not only the control flow but also the data flow can be substantially changed/reconfigured. The microarchitecture together with the Instruction Set Architecture (ISA) constitute the computer architecture (although the term has recently been used to include other aspects of the design as well [Hennessy and Patterson 2011]).
In general, the term HW module denotes a subset of the digital system's HW, the internals of which cannot be observed (or are chosen not to be observed), analogous to the term black box [Rodopoulos et al. 2015]. To define a HW module, its functionality and its interface with the external world must be described. At the microarchitectural and architectural layer, examples of HW modules are a multiprocessor system, a single core, a functional unit, the row of a memory array, a pipeline stage, or a register (without exposing the internal circuit implementation). In the context of this survey, the term platform HW is an umbrella term that encompasses the microarchitecture and architecture layers of a system.
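As an illustration of this black-box view (a hypothetical sketch, not the paper's notation), describing a HW module requires only its interface and its observable functionality:

    /* Hypothetical sketch: a HW module as a black box, i.e. an interface plus
     * observable input/output behaviour; the internals remain hidden. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        const char *name;       /* e.g. "functional unit", "memory-array row" */
        unsigned    in_width;   /* interface: total input bit width  */
        unsigned    out_width;  /* interface: total output bit width */
        uint64_t  (*evaluate)(uint64_t inputs); /* functionality, I/O only */
    } hw_module_t;

    /* A 32-bit adder viewed as a black box: two 32-bit operands packed into
     * one 64-bit word; only the I/O behaviour is exposed. */
    static uint64_t adder_eval(uint64_t in) {
        uint32_t a = (uint32_t)(in >> 32), b = (uint32_t)in;
        return (uint64_t)(uint32_t)(a + b);
    }

    int main(void) {
        hw_module_t adder = { "32-bit adder", 64, 32, adder_eval };
        printf("%s: 2+3=%llu\n", adder.name,
               (unsigned long long)adder.evaluate(((uint64_t)2 << 32) | 3));
        return 0;
    }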
[Fig. 1: Scope of the current paper. The figure depicts the abstraction layers of a digital system (circuit/device, platform hardware, mapping & platform software, application layer), with the platform hardware and mapping & platform software layers marked as the scope of this work.]
Mapping. During mapping, the algorithmic-level specification is mapped onto a pre-selected datapath and control path that implements the required behaviour (s). Nowadays, the term is also used to denote how an application or an application set is split, distributed, and ordered in order to run on a multiprocessor design.
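A minimal sketch of this multiprocessor sense of mapping (our illustration; the task counts and the round-robin policy are assumed, not prescribed by the paper) shows the three ingredients of splitting, distributing, and ordering a task set:

    /* Illustrative sketch: statically splitting, distributing and ordering
     * NUM_TASKS tasks over NUM_CORES cores (round-robin policy assumed). */
    #include <stdio.h>

    #define NUM_TASKS 8
    #define NUM_CORES 4

    int main(void) {
        for (int t = 0; t < NUM_TASKS; ++t) {
            int core = t % NUM_CORES;  /* distribution: core assignment  */
            int slot = t / NUM_CORES;  /* ordering: local execution slot */
            printf("task %d -> core %d, slot %d\n", t, core, slot);
        }
        return 0;
    }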
Platform SW. In order to enable software-hardware interaction, an instruction set is selected initially. The instruction set defines the hardware-software interface [Hennessy and Patterson 2011]. Many application instances sharing specific characteristics (a "domain") can be mapped on the same instruction set. Each of the instructions in that set can then be implemented in the hardware in different ways.

Platform SW includes several sublayers that interpret or translate high-level operations (derived from the algorithmic description) into "primitive" instructions, which correspond to the instruction set and are ready to be executed by the hardware. Examples include system libraries, operating systems, and run-time managers (s).
2.2.2. Additional Terminology.
A Control Data Flow Graph (CDFG) is a graph representing all possible paths that the flow of data can follow during execution. An application corresponds to a separate CDFG in the system. A process is an instantiation of a program, or of a segment of code, under execution; it consists of its "own" memory space, containing an image of the executable code and data, resource descriptions, security attributes, and state information (register contents, physical memory addressing, etc.), i.e., all the information necessary to execute the program (s). Threads are sequences of instructions, or flows of control, in a program which can be executed concurrently. All threads in a given process share the private address space of that process (s). The term task is used quite ambiguously in the literature: on the one hand, the terms task and process are used synonymously; on the other hand, the terms process and thread are considered "mechanical", while the term task is considered more conceptual and is used in the context of scheduling as a set of program instructions loaded in memory for execution. The term task in this paper is used as an umbrella term, which can denote complete applications, sub-parts of the CDFG like processing kernels (e.g., for-loops), or even single computations (e.g., instructions), depending on the context.
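The process/thread distinction above can be made concrete with a short POSIX-threads sketch (illustrative only; it assumes a POSIX system): two threads in one process update the same global variable, demonstrating the shared address space:

    /* Sketch: threads are concurrent flows of control sharing the address
     * space of their process, so both workers see the same global counter. */
    #include <pthread.h>
    #include <stdio.h>

    static int shared_counter = 0;  /* lives in the process's memory space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000; ++i) {
            pthread_mutex_lock(&lock);
            ++shared_counter;       /* same location for every thread */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);  /* two flows of control */
        pthread_create(&t2, NULL, worker, NULL);  /* in a single process  */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared_counter = %d\n", shared_counter);  /* prints 2000 */
        return 0;
    }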
2.3. Rationale of the Classification and Its Presentation
The proposed classification tree is organized using a top-down splitting of the types of techniques that increase system resilience. It is accompanied by a mapping of related work (see Figure 2). The top-down splitting allows a comprehensive list of technique types to be reached, which can always be expanded further on demand. Splits are created based on properties of the techniques, which allow them to be grouped together. More specifically, the properties in the proposed framework concern: (1) the effect that the techniques have on the execution, and (2) the changes that are required to the system design for a technique to be implemented. The properties will be elaborated as the tree is presented.
[Fig. 2: Top-down splitting to create the classification tree and mapping of the related work. The figure sketches a tree with root A, subcategories A1 and A2, and leaves A1.a, A1.b, A2.a, A2.b; representative works (#1-#5) are mapped bottom-up onto the leaves, and groups of leaves are linked to the subsections (x.1, x.2) that discuss them.]
Other organizations are also possible, such as organizing the splits around the system functionality, hardware components, types of errors (transient, intermittent, permanent), types of resilience metrics, or the application domains. The aforementioned organization is chosen in order to stress the reusability of techniques, but also to enable a better understanding of hybrid combinations. This is especially supported through the complementarity of the categories. It is important to note that many actual approaches that increase resilience typically represent hybrids and do not fall strictly into only one of the categories.
For the presentation of the classification tree, the following structure is followed for each of the abstraction layers (platform hardware, mapping, and platform software). First, the main classes of techniques are presented. Within each class, subcategories are presented and illustrated with the help of a figure. Groups of nodes are chosen to be discussed together. For the visualization of the groups, bubbles with different colors are used, along with the subsection number and a small geometrical shape (see Figure 2). The colors and the geometrical shapes enable a more explicit link with the corresponding subsections in the text. The geometrical shapes, in particular, assist the reader with the black-and-white printed version. The order of the leaves, the colors, and the geometrical shapes do not indicate the significance or the maturity of the techniques. For each of the classes, pros and cons are discussed, based on general properties bound to each class. Among the aspects considered are: area and power overhead, performance degradation (in terms of additional execution cycles), mitigation latency (the delay until the scheme fulfils the intended mitigation function), error protection, general applicability, and storage overhead. An overview of these aspects for the different classes can be found in Tables II-VII in the Appendix (see supplementary material). In parallel, representative related work is discussed to further illustrate the subcategory concepts and to demonstrate the usefulness of the proposed classification scheme for classifying existing (and future) literature (s).

The other group of backward techniques includes the techniques that retry the execution by storing the state of the system at intermediate points.