
Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems

TL;DR: A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro)architecture, the mapping, and platform software (SW).
Abstract: Nanoscale technology nodes bring reliability concerns back to the center stage of digital system design. A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro)architecture, the mapping, and platform software (SW). The field is surveyed in a systematic way based on nonoverlapping categories, which add insight into the ongoing work by exposing similarities and differences. HW and SW solutions are discussed in a similar fashion so that interrelationships become apparent. The presented categories are illustrated by representative literature examples to demonstrate their properties. Moreover, it is demonstrated how hybrid schemes can be decomposed into their primitive components.

Summary (5 min read)

1. INTRODUCTION

  • Intermittent errors occur repeatedly but non-deterministically in time at the same location and last for one cycle or even for a long (but finite) period of time.
  • The main contributions of this work are: (i) An integrated overview of the domain of functional reliability techniques (at the higher layers of the system stack) is presented, using a systematic, hierarchical top-down splitting into sub-classes.
  • Section 5 illustrates ways of using the proposed framework and Section 6 discusses observations and trends in the domain.

2.1. Resilient Digital System Design

  • This survey presents an organization of techniques that can be used to make a digital system more reliable at functional level.
  • Functional reliability is defined as the probability that over a specific period of time the system will fulfill its functionality, i.e. the set of functions that the system should perform [IEEE_Std 1990].
  • Functional reliability relates to the correctness of binary digits, as opposed to parametric reliability, which deals with variations in operation margins [Rodopoulos et al. 2015].
  • Functionality is one of the major elements of the specification set.
  • When a system becomes more resilient, its reliability is increased.

2.2. Computing Terminology

  • The term platform denotes a system composed of architectural and microarchitectural components together with the software required to run applications.
  • This applies both to very flexible SW-programmable processors, where an instruction-set is present to control the operation sequence, and to dedicated HW processing components.
  • The instruction set defines the hardware-software interface [Hennessy and Patterson 2011].
  • The term task is used quite ambiguously in the literature.

2.3. Rationale of the classification and its presentation

  • The proposed classification tree is organized using a top-down splitting of the types of techniques that increase the system resilience.
  • Many actual approaches that increase resilience typically represent hybrids and do not fall strictly into only one of the categories.
  • The colors and the geometrical shapes are used to enable a more explicit link with the corresponding subsections in the text.
  • For each of the classes, pros and cons are discussed, based on general properties bound to each class.
  • Deterministic execution is required for replicas to work.

3. PLATFORM HARDWARE

  • To make digital systems more robust, functional capabilities need to be provided that would be unnecessary in a fault-free environment.
  • This section focuses on techniques that modify the hardware capabilities for reliability purposes.
  • The complete classification scheme is shown in Figure 12 in Subsection 3.5.
  • Main criteria for further categorization include whether modifications are required in: existing functionalities, existing design implementations, resource allocation, operating conditions, the interaction with neighbouring modules, storage overhead.
  • Leaves of the tree are accompanied by a simple ordinal number for identification.

3.1. Forward execution - Additional HW modules provision

  • This subsection discusses techniques that increase the resilience through adding HW modules on the platform.
  • Within each pair, error detection is performed through a comparison circuit (see the sketch after this list).
  • Again, a distinction can be made between modules that are in parallel execution mode and modules that act as spares.
  • These schemes exploit inherent redundancy in regularly structured systems such as arrays of PEs, memories and interconnection networks or even processors.
  • Pros include low area and power overhead and general applicability (for systems with inherent redundancy).
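
To make the pairing idea concrete, a minimal sketch in C. This is an illustration added here, not code from any surveyed design: the module functions, the fault model, and all names are hypothetical. Two replicas compute the same function, a comparator flags a mismatch, and a spare produces the result only after detection.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical "HW modules": two replicas of the same function plus a spare.
 * In real designs these are physical units; here they are plain functions. */
static uint32_t module_a(uint32_t x) { return x * x + 1; }
static uint32_t module_b(uint32_t x) { return x * x + 1; }
static uint32_t spare(uint32_t x)    { return x * x + 1; }

/* Duplication with comparison: run the pair, compare outputs.
 * On mismatch, signal an error and let the spare produce the result. */
static uint32_t dmr_execute(uint32_t x, int *error)
{
    uint32_t ra = module_a(x);
    uint32_t rb = module_b(x);
    *error = (ra != rb);
    return *error ? spare(x) : ra;   /* spare acts only after detection */
}

int main(void)
{
    int err;
    uint32_t r = dmr_execute(7, &err);
    printf("result=%u, mismatch=%d\n", r, err);
    return 0;
}
```

Note that the comparison only detects a disagreement; with two modules it cannot tell which replica failed, which is why the sketch falls back to a third unit.
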

3.3. Backward execution - Additional HW modules provision

  • This subsection discusses techniques that increase the resilience of systems through rollback to an earlier point of execution and repetition of the execution.
  • The corresponding categories and subsections are shown in Figure 8.
  • This category discusses techniques that provide additional HW modules with the same functionality as the original ones.
  • Spare modules would, for example, not only take over the execution after the primary module has failed but also repeat the failed execution (a minimal sketch follows this list).
  • Pros include the flexibility to trade off area, power, performance and latency against error protection, depending on the selected functionality.
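
A hedged sketch of this backward variant, with all functions as hypothetical placeholders: when the primary module's result fails a check, the failed unit of work is rolled back to its inputs and repeated on a spare module.

```c
#include <stdio.h>
#include <stdint.h>

static uint32_t primary_module(uint32_t x) { return x + 1; }
static uint32_t spare_module(uint32_t x)   { return x + 1; }

/* Hypothetical error detector (a parity or range check in a real design). */
static int result_is_valid(uint32_t r) { return r != 0xDEADBEEF; }

/* Backward execution with an added module: the spare does not merely take
 * over future work, it repeats the computation that failed. */
static uint32_t execute_with_spare_retry(uint32_t x)
{
    uint32_t r = primary_module(x);
    if (!result_is_valid(r))
        r = spare_module(x);   /* roll back to the operation's inputs + retry */
    return r;
}

int main(void)
{
    printf("%u\n", execute_with_spare_retry(41));
    return 0;
}
```
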

3.4. Backward execution - HW modules amount fixed

  • The majority of the techniques proposed in the literature that employ backward execution reuse the already existing HW modules, since this avoids the additional area overhead of the previous category.
  • The literature focuses on employing hardware-based threads in coupled execution mode.
  • The first stream is serviced (by the operating system) and execution resumes.
  • Pros include the high error protection (for transient errors only) and the general applicability.
  • These checkpointing schemes can be characterized as global or local (see the rollback sketch after this list).
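
A minimal rollback sketch, under the assumption that the whole relevant state fits in one structure and that errors are detected by a hypothetical check; real schemes snapshot register/memory state in HW. One global checkpoint buffer is kept, and a detected error triggers restore plus re-execution of the step on the same module.

```c
#include <stdio.h>

/* Assumed system state: in HW schemes this is register/memory content. */
struct state { long acc; int step; };

static struct state checkpoint;                 /* global checkpoint buffer */

static void save_checkpoint(const struct state *s) { checkpoint = *s; }
static void rollback(struct state *s)              { *s = checkpoint; }

/* Hypothetical transient-error detector; it "detects" exactly once here,
 * purely to exercise the rollback path. */
static int error_detected(void)
{
    static int first = 1;
    int e = first; first = 0; return e;
}

int main(void)
{
    struct state s = { 0, 0 };
    while (s.step < 4) {
        save_checkpoint(&s);                    /* checkpoint interval = 1 step */
        s.acc += s.step;                        /* the protected computation  */
        if (error_detected()) { rollback(&s); continue; }  /* re-execute step */
        s.step++;
    }
    printf("acc=%ld\n", s.acc);
    return 0;
}
```

In a global scheme the buffer above would cover the state of all modules at once; a local scheme keeps per-module buffers and pays for that with a more involved recovery protocol.
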

3.5. Overall platform hardware classification

  • The sub-trees presented in the previous subsections are combined to form the overall classification tree for platform HW techniques, as shown in Figure 12.
  • Starting from the top-level split of Figure 3, the intermediate nodes (colored in pale green) are followed when necessary, to reach the final classes (colored in darker green and numbered).

4. PLATFORM SOFTWARE

  • Techniques that extend the platform software capabilities for reliability purposes are presented.
  • These four classes are discussed in the following subsections, as shown in Figure 13 s. Main criteria for further categorization into classes include whether modifications are required in: existing functionalities, existing task implementations, the resource allocation, the interaction with neighbouring tasks, the execution mode (of additional tasks), and cooperation among HW modules.
  • Pros include the limited storage, area and performance overhead, the limited latency, the high error protection (only for instruction memory errors) and the general applicability.
  • Alternatively, the added task performs a different function, such as error correction.
  • It must be noted that parallel execution in this context does not nec-…

4.2. Forward execution - Tasks amount fixed

  • This subsection discusses techniques that do not provide additional tasks in the system in order to make it more reliable.
  • Techniques that are focused on the functionality of tasks can either operate within the task boundaries, by reusing the task functionality (internal functionality reuse), or operate outside the task boundaries, by rearranging the task's interaction with the other tasks (I/O configuration modification); a sketch of internal functionality reuse follows this list.
  • These schemes reorganize the application or instruction profile so that the re-ordered execution is more robust.
  • Pros include the absence of storage overhead, power overhead and latency.
  • Cons include the limited error protection and rather system-specific applicability.
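
As a sketch in the spirit of internal functionality reuse (an assumption-laden illustration, not a specific scheme from the survey): the task's own adder is reused for a second run with commuted operands, and a mismatch between the two runs signals an error, without any added task or module.

```c
#include <stdio.h>

/* The task's existing functionality, reused unchanged for checking. */
static int add(int a, int b) { return a + b; }

/* Internal functionality reuse: recompute through the same code path with
 * transformed (here: swapped) operands and compare the two results. */
static int checked_add(int a, int b, int *error)
{
    int r1 = add(a, b);
    int r2 = add(b, a);       /* second pass reuses the task's own adder */
    *error = (r1 != r2);
    return r1;
}

int main(void)
{
    int err;
    int r = checked_add(2, 3, &err);
    printf("sum=%d, error=%d\n", r, err);
    return 0;
}
```
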

4.3. Backward execution - Retry without state storage

  • Retry without state storage is further split into parallel vs. sequential execution and intra-module vs. inter-module techniques.
  • Sporadic tasks are aperiodic tasks that have hard deadlines.
  • The task can be re-executed either on the same processor, so that transient errors are removed, or on a different processor, so that permanent errors are avoided (see the retry sketch after this list).
  • Therefore, this category is further split into intra-module and inter-module techniques.
  • Cons include the performance overhead, the latency and the limitation to transient errors.
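
A sketch of retry without state storage, with hypothetical names throughout: because the task is a pure function of its inputs, no checkpoint is needed. The run is first repeated on the same processor (which removes a transient error) and then handed to a different one (which sidesteps a permanent fault); processor placement is only simulated by a parameter here.

```c
#include <stdio.h>

/* The task is restartable from its inputs alone: no intermediate state
 * has to be stored for a retry. */
static int task(int input, int cpu) { (void)cpu; return input * 2; }

/* Hypothetical acceptance test that decides whether a run failed. */
static int run_failed(int result) { return result < 0; }

/* Intra-module retry first (same CPU, covers transient errors),
 * then inter-module retry (different CPU, escapes a permanent fault). */
static int retry_without_state(int input, int home_cpu, int spare_cpu)
{
    int r = task(input, home_cpu);
    if (run_failed(r))
        r = task(input, home_cpu);   /* intra-module re-execution */
    if (run_failed(r))
        r = task(input, spare_cpu);  /* inter-module re-execution */
    return r;
}

int main(void)
{
    printf("%d\n", retry_without_state(21, 0, 1));
    return 0;
}
```
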

4.4. Backward execution - Retry with state storage

  • The other group of backward techniques includes the techniques that retry the execution by storing the state of the system at intermediate points.
  • The inter-process dependencies are often recorded, so that the execution can be accurately repeated during the recovery phase.
  • Such additional pieces of information…
  • These surveys address both single-threaded and multithreaded/multi-process applications s. A number of techniques have been developed that do not explicitly bring the handling of non-deterministic events to the forefront.
  • Checkpointing at user level utilizes run-time libraries that are linked to the application program (a toy sketch follows this list).
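
A toy user-level sketch using setjmp/longjmp from the C standard library; the "run-time library" is reduced to two helpers linked with the application, which save a buffer of application state and restore it on rollback. Real checkpointing libraries persist far more (full address space, file descriptors, message logs), so this only sketches the control flow.

```c
#include <setjmp.h>
#include <string.h>
#include <stdio.h>

static jmp_buf restart_point;          /* execution context of the checkpoint */
static int saved_data[4];              /* saved copy of the application state */

/* "Library" calls linked with the application (user-level checkpointing). */
static void take_checkpoint(const int *data, size_t n)
{
    memcpy(saved_data, data, n * sizeof *data);
}
static void recover(int *data, size_t n)
{
    memcpy(data, saved_data, n * sizeof *data);
    longjmp(restart_point, 1);         /* resume from the checkpointed context */
}

int main(void)
{
    static int data[4] = { 1, 2, 3, 4 };       /* static: survives the longjmp */
    if (setjmp(restart_point) == 0) {          /* first pass                    */
        take_checkpoint(data, 4);
        data[0] += 10;                         /* the protected computation     */
        recover(data, 4);                      /* simulate one detected error   */
    } else {                                   /* resumed after rollback        */
        data[0] += 10;                         /* re-execute the computation    */
    }
    printf("data[0]=%d\n", data[0]);
    return 0;
}
```
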

4.5. Overall mapping and platform software classification

  • By combining the sub-trees of the previous subsections, the overall mapping and platform software classification tree is built, as shown in Figure 22.
  • Starting from the top-level split of Figure 13, the intermediate nodes (colored in pale yellow) are followed when necessary, to reach the final classes (colored in darker yellow and numbered).

5. USAGE OF THE CLASSIFICATION FRAMEWORK

  • Identifying the primitive components (corresponding to a primitive category) and their position in the framework first makes it possible to handle the complexity of the sometimes highly sophisticated mitigation schemes.
  • A “divide and conquer” view of the publication enables the reader to delve into the most relevant implementation details (when that is necessary) in a much more controlled way.

5.1. Mapping of hybrid schemes

  • In reality, the resiliency and mitigation approaches, which are present in research papers, rarely belong to a single leaf of the previous, and indeed any, classification.
  • The majority of the published work consists of hybrid combinations of the leaves.
  • In particular, UDP-Lite packetization is (re-)designed in such a way that bits which are very sensitive to errors are better protected (see the partial-checksum sketch after this list).
  • The super reliable cores execute operations that are less resilient to errors.
  • A run-time scheduler reassigns a task that has failed on a particular RRC to another RRC (S.8, a mapping & platform SW class).
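
For the UDP-Lite example, a simplified sketch of partial checksum coverage in C (cf. RFC 3828). This uses a plain 16-bit sum rather than the exact Internet checksum with pseudo-header, so it is illustrative only: the point is that only the first, error-sensitive octets are covered, so bit flips in the insensitive payload tail do not invalidate the packet.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* UDP-Lite-style partial coverage: the checksum protects only the first
 * 'coverage' octets of the packet. Simplified arithmetic, not RFC-exact. */
static uint16_t partial_checksum(const uint8_t *pkt, size_t len, size_t coverage)
{
    uint32_t sum = 0;
    if (coverage > len) coverage = len;
    for (size_t i = 0; i < coverage; i++)
        sum += pkt[i];
    return (uint16_t)(~sum);
}

int main(void)
{
    uint8_t pkt[64] = { 0x45, 0x00, 0x11 };   /* sensitive bytes come first */
    /* Cover only 8 "sensitive" header bytes; errors beyond that offset
     * leave the checksum valid and the packet deliverable. */
    printf("csum=0x%04x\n", partial_checksum(pkt, sizeof pkt, 8));
    return 0;
}
```
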

6. DISCUSSION AND FUTURE CHALLENGES

  • The proposed classification was illustrated through a representative list of schemes, to convey the related ideas and support the validity of the tree.
  • The literature on fault tolerance and resilience techniques has evolved in accordance with the trends in computer architecture and software design development.
  • Therefore, by properly propagating information among the different layers and providing a suitable degree of adaptivity (with design time and run time knobs), the most cost-effective solutions can be achieved.
  • The TDP mode 5 corresponds to a mode where cores are operated at near-threshold voltages.

8. CONCLUSION

  • Techniques that increase resilience and mitigate functional reliability errors were classified in a novel way.
  • This was achieved through a framework with complementary splits, in which primitive mitigation concepts are defined.
  • That allows every type of technique to be classified, by combining the appropriate components.
  • The framework has been accompanied by a wide variety of sources from the published literature.
  • Insight can be provided to designers and researchers about the nature of existing schemes, since every node has some unique properties.


Classification of Resilience Techniques Against Functional Errors at
Higher Abstraction Layers of Digital Systems
GEORGIA PSYCHOU¹, DIMITRIOS RODOPOULOS², MOHAMED M. SABRY³,
TOBIAS GEMMEKE⁴, DAVID ATIENZA³, TOBIAS G. NOLL¹, FRANCKY CATTHOOR⁵
¹EECS, RWTH Aachen, ²IMEC, ³ESL, EPFL, ⁴IDS, RWTH Aachen; formerly Holst Center/IMEC, ⁵IMEC & KU Leuven
Nano-scale technology nodes bring reliability concerns back to the center stage of digital system design. A systematic classification of approaches that increase system resilience in the presence of functional hardware-induced errors is presented, dealing with higher system abstractions: i.e. the (micro-)architecture, the mapping and platform software. The field is surveyed in a systematic way based on non-overlapping categories, which add insight into the ongoing work by exposing similarities and differences. Hardware and software solutions are discussed in a similar fashion, so that interrelationships become apparent. The presented categories are illustrated by representative literature examples to demonstrate their properties. Moreover, it is demonstrated how hybrid schemes can be decomposed into their primitive components.
Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems—fault tolerance; reliability, availability, and serviceability; B.8.1 [Hardware]: Performance and Reliability—reliability, testing, and fault tolerance
General Terms: Reliability, Design
Additional Key Words and Phrases: Resilience, Reliability, Mitigation, Fault Tolerance
ACM Reference Format:
Psychou G., Rodopoulos D., Sabry M. M., Gemmeke T., Atienza D., Noll T. G. and Catthoor F. 2017. Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems. ACM Comput. Surv. V, N, Article XX (January XXXX), 38 pages.
DOI: http://dx.doi.org/10.1145/0000000.0000000
1. INTRODUCTION
The early concerns of John von Neumann [Von Neumann 1956] regarding building reliable computing entities out of unreliable components were largely forgotten with the gradual replacement of vacuum tubes by transistors and the subsequent large-scale transistor integration [Palem and Lingamneni 2012]. Now, after some decades, reliability has come back to the forefront in the context of modern CMOS technology.
This research has received funding from the EU ARTEMIS Joint Undertaking under grant agreement no. 621429 (project EMC2) and from the Dutch national programmes/funding authorities. D. Rodopoulos and F. Catthoor acknowledge the support of the EU FP7-612069 Harpa project. M. M. Sabry and D. Atienza were partially supported by the BodyPoweredSenSE (grant no. 20NA21 143069) RTD project, evaluated by the Swiss NSF (SNSF) and funded by Nano-Tera.ch with Swiss Confederation financing, as well as by the E4Bio (no. 200021 159853) project of the Swiss NSF.
¹ G. Psychou, T. G. Noll, EECS, RWTH Aachen University, Schinkelstr. 2, D-52062, Aachen, Germany
²,⁵ D. Rodopoulos, F. Catthoor, IMEC, Kapeldreef 75, 3001 Leuven, Belgium
³ M. M. Sabry, D. Atienza, EPFL-STI-IEL-ESL ELG 130, Station 11, 1015 Lausanne, Switzerland
⁴ T. Gemmeke, IDS, RWTH Aachen University, Mies-v.-d.-Rohe-Str. 15, D-52074, Aachen, Germany
Contact email: psychou@eecs.rwth-aachen.de

The current reliability concerns originate from mechanisms that manifest both during the manufacturing process and during the system's operational lifetime. Inherent time-zero and time-dependent device variability, noise (e.g. supply voltage fluctuations) and particle strikes are some of the most prevalent causes of such concerns [Borkar 2005], [McPherson 2006], [Kuhn et al. 2011], [Aitken et al. 2013]. The anomalous physical conditions created by those effects are called faults. Depending on various conditions, faults can manifest as bit-level corruptions in the internal state or at the outputs of a digital system. The term functional errors is used to capture this class of errors, with the worst-case manifestation toward the end user being a complete failure of the expected system service.
The manifested errors can be temporary or permanent [Bondavalli et al. 2000], [Borkar 2005]. Temporary errors include transient and intermittent errors. Transient errors are non-deterministic (concerning time and location), e.g. as a result of a fault due to a particle strike. Intermittent errors occur repeatedly but non-deterministically in time at the same location and last for one cycle or even for a long (but finite) period of time. Main causes of intermittent errors are design weaknesses, aging and wear-out mechanisms such as Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI). In contrast, permanent errors persist forever after their first occurrence. Causes of permanent errors are fabrication defects and aging.
The current work presents a classification scheme for organizing the research domain on mitigation of functional errors at the higher abstraction layers that manifest themselves during the operational lifetime, and discusses representative work for each category. Given the multitude of reliability issues in modern digital systems, it is vital to set the boundaries of the current survey: This survey discusses resilience schemes at the architectural/microarchitectural layer and platform software, which have increased in diversity during the last decades, following the evolution of computer architecture, parallel processing, the software stack and general system design. Techniques at the application, circuit and device layers can potentially act complementarily to the techniques presented here, but are not part of the current scope. Reliability-related errors that occur due to hardware-design errors, insufficiently specified systems, malicious attacks [Avizienis et al. 2004] or erroneous software interaction (i.e. manifestation of software bugs due to software of reduced quality [Lochmann and Goeb 2011]) are beyond the current scope. Techniques to mitigate permanent errors that have been detected during testing in order to improve yield or lifetime are not included. Techniques to tackle permanent errors due to device and wire wear-out are incorporated, though.
The main contributions of this work are:
(i) An integrated overview of the domain of functional reliability techniques (at the higher layers of the system stack) is presented, using a systematic, hierarchical top-down splitting into sub-classes.
(ii) Multiple representative prior and state-of-the-art publications are mapped to these categories to illustrate the concepts involved.
(iii) Hardware and software solutions are discussed using a similar reasoning, to allow interrelations to become more visible.
(iv) The complementary nature of the splits allows hybrid schemes to be effectively decomposed and better understood. That is especially important in the era of growing cross-layer resilience design.
The current paper is organized as follows: Section 2 presents terminology regarding reliable system design, the abstraction layers that are addressed in this work and information on the rationale of the proposed classification. The classification along with the presentation of published literature begins in Section 3 for techniques that operate at the (micro-)architectural layers of the system and continues in Section 4 with techniques at the mapping and software part of the platform. Section 5 illustrates ways of using the proposed framework and Section 6 discusses observations and trends in the domain. Finally, related work is presented in Section 7 and Section 8 concludes the paper. Moreover, from this point on, the symbol s will be used to refer the reader to the supplementary material (see ACM CSUR website) for additional information.
2. CONTEXT AND USEFUL TERMINOLOGY
2.1. Resilient Digital System Design
This survey presents an organization of techniques that can be used to make a digital system more reliable at the functional level. Reliability is defined as the probability that over a specific period the system will satisfy its specification, i.e. the total set of requirements to be satisfied by the system. Functional reliability is defined as the probability that over a specific period of time the system will fulfill its functionality, i.e. the set of functions that the system should perform [IEEE_Std 1990]. Functional reliability relates to the correctness of binary digits, as opposed to parametric reliability, which deals with variations in operation margins [Rodopoulos et al. 2015]. Functionality is one of the major elements of the specification set. Others may be minimum performance (e.g. throughput [ops/s], computational power [MIPS]) or maximum costs (e.g. silicon area [mm²], power [W], energy [J/op], latency [s/op]). In the following, the term reliability will be used to denote the functional reliability. The term resilience describes the ability of a system to defer or avoid (functional) system failures in the presence of errors. When a system becomes more resilient, its reliability is increased. The terms reliable and resilient (system design) will be used interchangeably in this paper s.
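
For concreteness, a standard formalization of this definition (an addition here, not spelled out in the paper, under the common constant-failure-rate assumption): with T the time to the first functional failure,

```latex
% T: time to the first functional failure; \lambda: assumed constant failure rate
R(t) = \Pr\{T > t\} = e^{-\lambda t}
```
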
2.2. Computing Terminology
2.2.1. Terminology on Abstraction Layers.
This survey includes techniques implemented at the microarchitecture and architecture layers, as well as at the mapping & software (SW) of the system, as shown in Figure 1. The device, circuit and application layers are not considered. In this survey, the term platform denotes a system composed of architectural and microarchitectural components together with the software required to run applications. When the system is not SW-programmable, as is the case for some small embedded systems, the term platform denotes only the hardware part.
Platform HW. Microarchitecture describes how the HW constituent parts are connected and inter-operate to implement the operations that the HW supports. It includes the memory system, the memory interconnect and the internals of processors [Hennessy and Patterson 2011]. This applies both to very flexible SW-programmable processors, where an instruction-set is present to control the operation sequence, and to dedicated HW processing components. Dedicated HW processors feature minimum to limited flexibility. Both SW-programmable and dedicated components can be mapped on highly reconfigurable fabrics, like field-programmable gate arrays (FPGAs). The primary difference compared with the SW-programmable processors is that not only the control flow but also the data flow can be substantially changed/reconfigured. The microarchitecture together with the Instruction Set Architecture (ISA) constitute the computer architecture (although the term has recently been used to include also other aspects of the design [Hennessy and Patterson 2011]).
In general, the term HW module denotes a subset of the digital system's HW, the internals of which cannot be observed (or it is chosen that they are not observed), corresponding to the term black box [Rodopoulos et al. 2015]. To define a HW module, its functionality and its interface with the external world must be described. At the microarchitectural and architectural layer, examples of HW modules are a multiprocessor system, a single core, a functional unit, the row of a memory array, a pipeline stage, a register (without exposing the internal circuit implementation, though). In the context of this survey, the term platform HW is an umbrella term that encompasses the microarchitecture and architecture layers of a system.
Fig. 1: Scope of the current paper (abstraction-layer stack: application layer, mapping & platform software, platform hardware, circuit/device; the two middle layers are in scope).
Mapping. During mapping, the algorithmic level specification is mapped into a pre-selected datapath and control path that implements the required behaviour s. Nowadays, the term is also used to denote how an application or an application set is split, distributed and ordered in order to run in a multiprocessor design.
Platform SW. In order to enable software-hardware interaction, an instruction set is selected initially. The instruction set defines the hardware-software interface [Hennessy and Patterson 2011]. Many application instances sharing specific characteristics (a "domain") can be mapped on the same instruction set. Each of the instructions in that set can then be implemented in the hardware in different ways.
Platform SW includes several sublayers that interpret or translate high level operations (derived from the algorithmic description) into "primitive" instructions, which correspond to the instruction set and are ready to be executed by the hardware. Examples include: system libraries, operating systems and run-time managers s.
2.2.2. Additional Terminology.
A Control Data Flow Graph (CDFG) is a graph representing all possible paths the flow of data can follow during execution. An application corresponds to a separate CDFG in the system. A process is an instantiation of a program, or a segment of code, under execution, consisting of its "own" memory space containing an image of the executable code and data, resource descriptions, security attributes, and state information (register content, physical memory addressing, etc.), i.e. all the information necessary to execute the program s. Threads are sequences of instructions, or a flow of control, in a program which can be executed concurrently. All threads in a given process share the private address space of that process s. The term task is used quite ambiguously in the literature: On the one hand, the terms task and process are used synonymously. On the other hand, the terms process and thread are considered as "mechanic" while the term task is considered as being more conceptual, used in the context of scheduling as a set of program instructions loaded in memory for execution. The term task in this paper is used as an umbrella term, which can denote complete applications, sub-parts of the CDFG like processing kernels (e.g. for-loops) or even single computations (e.g. instructions), depending on the context.
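
As a purely illustrative aid (an assumption of this rewrite, not a structure from the paper), a CDFG node could be represented in C roughly as follows, holding both control-flow and data-flow successors:

```c
#include <stdio.h>
#include <stddef.h>

/* Minimal CDFG node: an operation plus its control- and data-flow edges.
 * Field names and fan-out sizes are illustrative assumptions. */
enum op_kind { OP_COMPUTE, OP_BRANCH, OP_LOAD, OP_STORE };

struct cdfg_node {
    enum op_kind kind;
    struct cdfg_node *ctrl_succ[2];   /* control-flow successors */
    struct cdfg_node *data_succ[4];   /* consumers of this node's result */
    size_t n_ctrl, n_data;
};

int main(void)
{
    struct cdfg_node load = { OP_LOAD,    { 0 }, { 0 }, 0, 0 };
    struct cdfg_node add  = { OP_COMPUTE, { 0 }, { 0 }, 0, 0 };
    load.data_succ[load.n_data++] = &add;   /* add consumes the loaded value   */
    load.ctrl_succ[load.n_ctrl++] = &add;   /* add follows load in control flow */
    printf("load has %zu data successor(s)\n", load.n_data);
    return 0;
}
```
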
2.3. Rationale of the classification and its presentation
The proposed classification tree is organized using a top-down splitting of the types of techniques that increase the system resilience. It is accompanied by a mapping of related work (see Figure 2). The top-down splitting allows one to reach a comprehensive list of types of techniques, which can always be expanded further on demand. Splits are created based on properties of the techniques, which allow them to be grouped together. More specifically, the properties in the proposed framework concern: (1) the effect that the techniques have on the execution and (2) the changes that are required in the system design for a technique to be implemented. The properties will be elaborated as the tree is being presented.
Fig. 2: Top-down splitting to create the classification tree and mapping of the related work.
Other organizations are also possible, like organizing the splits around the system functionality, hardware components, types of errors (transient, intermittent, permanent), types of resilience metrics or the application domains. The aforementioned organization is chosen in order to stress the reusability of techniques but also to enable the better understanding of hybrid combinations. This is especially supported through the complementarity of the categories. It is important to note that many actual approaches that increase resilience typically represent hybrids and do not fall strictly into only one of the categories.
For the presentation of the classification tree, the following structure is followed for each of the abstraction layers (platform hardware, mapping and platform software). First, the main classes are presented for the different techniques. Within each class, subcategories are presented, which are illustrated with the help of a figure. Groups of nodes are chosen to be discussed together. For the visualization of the groups, bubbles with different colors are used, along with the subsection number and a small geometrical shape (see Figure 2). The colors and the geometrical shapes are used to enable a more explicit link with the corresponding subsections in the text. The geometrical shapes, in particular, aid the reader in the black-and-white printed version. The order of the leaves, the colors and the geometrical shapes do not indicate the significance or the maturity of the techniques. For each of the classes, pros and cons are discussed, based on general properties bound to each class. Among the aspects considered are: area and power overhead, performance degradation (in terms of additional execution cycles), mitigation latency (delay until the scheme fulfils the intended mitigation function), error protection, general applicability, and storage overhead. An overview of those for the different classes can be found in Tables II-VII in the Appendix (see supplementary material). In parallel, representative related work is discussed to further illustrate the subcategory concept and demonstrate the usefulness of the proposed classification scheme for classifying existing (and future) literature s. Moreover, in Ta-…


References
Book, 01 Dec 1989: Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach.
TL;DR: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today.
Abstract: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today. In this edition, the authors bring their trademark method of quantitative analysis not only to high-performance desktop machine design, but also to the design of embedded and server systems. They have illustrated their principles with designs from all three of these domains, including examples from consumer electronics, multimedia and Web technologies, and high-performance computing.

11,671 citations

Journal ArticleDOI: Hamming, R. W. 1950. Error Detecting and Error Correcting Codes. Bell System Technical Journal 29(2), 147-160.
TL;DR: The author was led to the study given in this paper from a consideration of large scale computing machines in which a large number of operations must be performed without a single error in the end result.
Abstract: The author was led to the study given in this paper from a consideration of large scale computing machines in which a large number of operations must be performed without a single error in the end result. This problem of “doing things right” on a large scale is not essentially new; in a telephone central office, for example, a very large number of operations are performed while the errors leading to wrong numbers are kept well under control, though they have not been completely eliminated. This has been achieved, in part, through the use of self-checking circuits. The occasional failure that escapes routine checking is still detected by the customer and will, if it persists, result in customer complaint, while if it is transient it will produce only occasional wrong numbers. At the same time the rest of the central office functions satisfactorily. In a digital computer, on the other hand, a single failure usually means the complete failure, in the sense that if it is detected no more computing can be done until the failure is located and corrected, while if it escapes detection then it invalidates all subsequent operations of the machine. Put in other words, in a telephone central office there are a number of parallel paths which are more or less independent of each other; in a digital machine there is usually a single long path which passes through the same piece of equipment many, many times before the answer is obtained.

5,408 citations


"Classification of Resilience Techni..." refers background or methods in this paper

  • ...Literature examples on the aforementioned concepts include algorithmic noise tolerance (ANT) (Hegde and Shanbhag 2001) on modules with reduced functionality, and Hamming (1950) and Dutt et al. (2014) on ECC....

  • ...Figure 6 shows an example of a single bit correction with the Hamming code (Hamming 1950)....

Journal ArticleDOI: Avizienis, A., Laprie, J.-C., Randell, B. and Landwehr, C. 2004. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing 1(1), 11-33.
TL;DR: The aim is to explicate a set of general concepts, of relevance across a wide range of situations and, therefore, helping communication and cooperation among a number of scientific and technical communities, including ones that are concentrating on particular types of system, of system failures, or of causes of systems failures.
Abstract: This paper gives the main definitions relating to dependability, a generic concept including a special case of such attributes as reliability, availability, safety, integrity, maintainability, etc. Security brings in concerns for confidentiality, in addition to availability and integrity. Basic definitions are given first. They are then commented upon, and supplemented by additional definitions, which address the threats to dependability and security (faults, errors, failures), their attributes, and the means for their achievement (fault prevention, fault tolerance, fault removal, fault forecasting). The aim is to explicate a set of general concepts, of relevance across a wide range of situations and, therefore, helping communication and cooperation among a number of scientific and technical communities, including ones that are concentrating on particular types of system, of system failures, or of causes of system failures.

4,695 citations


"Classification of Resilience Techni..." refers background in this paper

  • ...Reliability-related errors that occur due to hardware (HW)-design errors, insufficiently specified systems or malicious attacks (Avizienis et al. 2004), or erroneous SW interaction (i.e., manifestation of SW bugs due to SW of reduced quality (Lochmann and Goeb 2011)) are beyond the current scope....

  • ...Reliability-related errors that occur due to hardware-design errors, insufficiently specified systems or malicious attacks [Avizienis et al. 2004] or erroneous software interaction (i....


Frequently Asked Questions (17)
Q1. What are the contributions mentioned in the paper "Xx classification of resilience techniques against functional errors at higher abstraction layers of digital systems" ?

A systematic classification of approaches that increase system resilience in the presence of functional hardware-induced errors is presented, dealing with higher system abstractions: i.e. the (micro-)architecture, the mapping and platform software. Hardware and software solutions are discussed in a similar fashion, so that interrelationships become apparent.

The most prominent observation is that mapping and SW provide a lot of flexibility due to the re-mapping possibilities of a given task sequence onto the "fixed" HW. Networked applications expanded further the deliverable functionality possibilities. The system behavior can be adapted at run time whenever significant environmental changes take place, or according to varying error rates. This is especially so, as errors can be masked as they propagate through the different hardware and software layers (including the application itself).

Cons include latency (depending on the checkpointing granularity), performance (depending also on whether checkpointing is overlapped with normal execution) and the limitation to transient errors. 

Cons include the need for system-specific solutions, the low error protection (through isolation), the potential performance degradation. 

Cons include the potentially high storage and power overhead, the potentially very high latency and performance (depending also on whether checkpointing is overlapped with normal execution). 

Further technology trends like 3D integration, incorporating heterogeneous technologies on a single platform and dark silicon pose new challenges and opportunities for the fault tolerance techniques. 

Other examples of emerging error-tolerant application domains are Recognition, Mining and Synthesis (RMS) [Dubey 2005] as well as artificial neural networks (ANNs) [Temam 2012]. 

Pros include the limited area, power and performance overhead, as the new implementation will typically satisfy the system requirements while minimizing additional cost.

The term task in this paper is used as an umbrella term, which can denote complete applications, sub-parts of the CDFG like processing kernels (e.g. for-loops) or even single computations (e.g. instructions), depending on the context.

These four classes are discussed in the following subsections, as shown in Figure 13 s. Main criteria for further categorization into classes include whether modifications are required in: existing functionalities, existing task implementations, the resource allocation, the interaction with neighbouring tasks, the execution mode (of additional tasks), and cooperation among HW modules.

Rather than saving checkpoints at fixed intervals, checkpoints can be stored in a customized way so that the amount of stored data is minimized. 

Compared to global schemes, local schemes reduce the amount of data to be stored during checkpointing but typically require a more complicated recovery algorithm.

Instead of adding modules with the same functionality, modules with different functionality can be added; the added modules play an active role in the recovery as in the previous category. 

Error recovery is further split into forward error recovery (FER), which includes redundancy, such as triple modular redundancy, and backward error recovery (BER), which includes rolling back to a previously saved correct state of the system.

Beyond the earlier discussed types of systems, intra-module schemes may address applications that are amenable to numerous non-deterministic events: uncertain functions (like human input functions), interrupts, system calls, I/O operations due to communication with external devices. 

System-specific strategies have been developed which deal with events coming from the external environment, especially events due to communication with external devices s. Online multiprocessor checkpointing can be broadly characterized as local and global.

The other group of backward techniques includes the techniques that retry the execution by storing the state of the system at intermediate points.