Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems
Summary
1. INTRODUCTION
- Intermittent errors occur repeatedly but non-deterministically in time at the same location and last for one cycle or even for a long (but finite) period of time.
- The main contributions of this work are: (i) An integrated overview of the domain of functional reliability techniques (at the higher system level stack) is presented, using a systematic, hierarchical top-down splitting into sub-classes.
- Section 5 illustrates ways of using the proposed framework and Section 6 discusses observations and trends in the domain.
2.1. Resilient Digital System Design
- This survey presents an organization of techniques that can be used to make a digital system more reliable at functional level.
- Functional reliability is defined as the probability that over a specific period of time the system will fulfill its functionality, i.e. the set of functions that the system should perform [IEEE_Std 1990].
- Functional reliability is related to correcting binary digits, as opposed to parametric reliability, which deals with variations in operating margins [Rodopoulos et al. 2015].
- Functionality is one of the major elements of the specification set.
- When a system becomes more resilient, its reliability is increased.
2.2. Computing Terminology
- The term platform denotes a system composed of architectural and microarchitectural components together with the software required to run applications.
- This applies both to very flexible SW-programmable processors, where an instruction-set is present to control the operation sequence, and to dedicated HW processing components.
- The instruction set defines the hardware-software interface [Hennessy and Patterson 2011].
- The term task is used quite ambiguously in the literature.
2.3. Rationale of the classification and its presentation
- The proposed classification tree is organized using a top-down splitting of the types of techniques that increase the system resilience.
- [Figure: top-down classification into subsections x.1 and x.2 (classes A1.a, A1.b, A2.a, A2.b) versus bottom-up mapping of works #1–#5.] Published works that increase resilience typically represent hybrids and do not fall strictly into only one of the categories.
- The colors and the geometrical shapes are used to enable a more explicit link with the corresponding subsections in the text.
- For each of the classes, pros and cons are discussed, based on general properties bound to each class.
- Deterministic execution is required for replicas to work.
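The determinism requirement above can be made concrete with a minimal sketch (entirely our own illustration; `duplex_check` is a hypothetical name, not from the surveyed works): replica-based detection compares outputs, so any non-determinism in the computation is indistinguishable from an error.

```python
import random

def duplex_check(fn, x):
    """Run fn twice on the same input and flag a mismatch as a suspected error."""
    a, b = fn(x), fn(x)
    return a if a == b else None   # None models a raised error flag

# A deterministic computation passes the comparison.
assert duplex_check(lambda x: x * 2, 3) == 6

# A non-deterministic one triggers a false alarm, even though nothing failed.
random.seed(0)
assert duplex_check(lambda x: x + random.random(), 3) is None
```

This is why replicated schemes either enforce deterministic execution or record non-deterministic events so they can be replayed identically.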
3. PLATFORM HARDWARE
- To make digital systems more robust, functional capabilities need to be provided that would be unnecessary in a fault-free environment.
- This section focuses on techniques that modify the hardware capabilities for reliability purposes.
- The complete classification scheme is shown in Figure 12 in Subsection 3.5.
- Main criteria for further categorization include whether modifications are required in: existing functionalities, existing design implementations, resource allocation, operating conditions, the interaction with neighbouring modules, storage overhead.
- Leaves of the tree have an accompanying simple ordinal number for identification.
3.1. Forward execution - Additional HW modules provision
- This subsection discusses techniques that increase the resilience through adding HW modules on the platform.
- Within each pair, error detection is performed through a comparison circuit.
- Again, a distinction can be made between modules that are in parallel execution mode and modules that act as spares.
- These schemes exploit inherent redundancy in regularly structured systems such as arrays of PEs, memories and interconnection networks or even processors.
- Pros include low area and power overhead, general applicability (for systems with inherent redundancy).
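A small sketch of exploiting inherent redundancy in a regularly structured system (all names, e.g. `PEArray`, are illustrative, not taken from the surveyed works): an array of identical processing elements (PEs) keeps one spare, and a faulty active PE's slot is remapped onto it.

```python
class PEArray:
    """Array of identical PEs with spares; models spare-based reconfiguration."""

    def __init__(self, n_active, n_spare=1):
        self.pes = list(range(n_active + n_spare))  # PE ids
        self.active = self.pes[:n_active]           # PEs currently doing work
        self.spares = self.pes[n_active:]           # idle spare PEs
        self.faulty = set()

    def mark_faulty(self, pe):
        """Record a permanent fault and swap in a spare, if one is available."""
        self.faulty.add(pe)
        if pe in self.active and self.spares:
            idx = self.active.index(pe)
            self.active[idx] = self.spares.pop(0)

arr = PEArray(n_active=4)   # PEs 0-3 active, PE 4 is the spare
arr.mark_faulty(2)
assert arr.active == [0, 1, 4, 3]   # spare PE 4 has replaced faulty PE 2
assert arr.spares == []
```

The low overhead comes from the fact that the spare capacity is already part of the regular structure; only the remapping logic is added.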
3.3. Backward execution - Additional HW modules provision
- This subsection discusses techniques that increase the resilience of systems through rollback to an earlier point of execution and repetition of the execution.
- The corresponding categories and subsections are shown in Figure 8.
- This category discusses techniques that provide additional HW modules with the same functionality as the original ones.
- Spare modules would, for example, not only take over the execution after the primary module has failed but also repeat the failed execution.
- Pros include the flexibility to trade-off area, power, performance, latency with error protection depending on the selected functionality.
3.4. Backward execution - HW modules amount fixed
- The majority of the techniques proposed in the literature that employ backward execution reuse the already existing HW modules, since this avoids the additional area overhead of the previous category.
- The literature focuses on employing hardware-based threads in coupled execution mode.
- The first stream is serviced (by the operating system) and execution resumes.
- Pros include high error protection (for transient errors only) and general applicability.
- These checkpointing schemes can be characterized as global and local.
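The rollback-and-retry behavior described above can be sketched as follows (a minimal illustration under our own naming; real schemes checkpoint hardware state, not a Python value, and the fault model here is a detected transient error that does not recur on retry):

```python
import copy

def run_with_checkpoints(state, steps, step_fn, interval=2, fault_at=None):
    """Apply step_fn over steps; save a global checkpoint every `interval`
    steps and roll back to the last checkpoint when an error is detected."""
    checkpoint = copy.deepcopy(state)   # initial checkpoint
    i = 0
    while i < len(steps):
        try:
            state = step_fn(state, steps[i], inject_fault=(i == fault_at))
            i += 1
            if i % interval == 0:
                checkpoint = copy.deepcopy(state)   # global checkpoint
        except RuntimeError:
            state = copy.deepcopy(checkpoint)       # rollback...
            i = (i // interval) * interval          # ...and re-execute
            fault_at = None                         # transient: gone on retry
    return state

def add_step(state, value, inject_fault=False):
    if inject_fault:
        raise RuntimeError("transient error detected")
    return state + value

# A transient error at step 3 is absorbed by rolling back to the checkpoint.
assert run_with_checkpoints(0, [1, 2, 3, 4, 5], add_step,
                            interval=2, fault_at=3) == 15
```

The `interval` parameter exposes the latency/overhead trade-off mentioned in the pros and cons: coarser checkpointing stores less but re-executes more after a rollback.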
3.5. Overall platform hardware classification
- The sub-trees presented in the previous subsections are combined to form the overall classification tree for platform HW techniques, as shown in Figure 12.
- Starting from the top-level split of Figure 3, the intermediate nodes (colored pale green) are followed when necessary, to reach the final classes (colored darker green and numbered).
4. PLATFORM SOFTWARE
- Techniques that extend the platform software capabilities for reliability purposes are presented.
- These four classes are discussed in the following subsections, as shown in Figure 13. Main criteria for further categorization into classes include whether modifications are required in: existing functionalities, existing task implementations, the resource allocation, the interaction with neighbouring tasks, the execution mode (of additional tasks), and cooperation among HW modules.
- Pros include the limited storage, area, and performance overhead, low latency, high error protection (for instruction memory errors only), and general applicability.
- Or the added task performs some different function, like error correction.
- It must be noted that parallel execution in this context does not necessarily…
4.2. Forward execution - Tasks amount fixed
- This subsection discusses techniques that do not provide additional tasks in the system in order to make it more reliable.
- Techniques that are focused around the functionality of tasks can either operate within the task boundaries, by reusing the task functionality (internal functionality reuse) or operate outside the task boundaries by rearranging its interaction with the other tasks (I/O configuration modification).
- These schemes reorganize the application or instruction profile so that the re-ordered execution is more robust.
- Pros include the lack of storage, power overhead and latency.
- Cons include the limited error protection and rather system-specific applicability.
4.3. Backward execution - Retry without state storage
- Retry without state storage is further split along two axes: parallel versus sequential execution, and intra-module versus inter-module techniques.
- Sporadic tasks are aperiodic tasks that have hard deadlines.
- The task can be re-executed either on a single processor so that transient errors are removed or on a different processor so that permanent errors are avoided.
- Therefore, this category is further split into intra-module and inter-module techniques.
- Cons include the performance overhead, latency and limitation to transient errors.
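The intra-module versus inter-module retry described above can be sketched as follows (an illustration under assumed names; `Processor` and `retry_task` are ours): a failed task is first re-executed on the same processor, which removes transient errors, and only migrated to another processor when failures persist, which avoids permanent ones.

```python
class Processor:
    """Toy processor: fails a fixed number of times (transient) or always (permanent)."""

    def __init__(self, fail_first=0, permanent=False):
        self.fail_first = fail_first   # number of initial transient failures
        self.permanent = permanent     # permanently faulty processor

    def run(self, task):
        if self.permanent:
            return False, None
        if self.fail_first > 0:
            self.fail_first -= 1       # transient error: clears on retry
            return False, None
        return True, task()

def retry_task(task, processors, max_retries=2):
    """Retry on the same processor first (intra-module), then migrate (inter-module)."""
    for proc in processors:
        for _ in range(max_retries):
            ok, result = proc.run(task)
            if ok:
                return result
    raise RuntimeError("task failed on all processors")

# Transient error: the retry on the same processor succeeds.
assert retry_task(lambda: 42, [Processor(fail_first=1)]) == 42
# Permanent error: the task migrates to a healthy processor.
assert retry_task(lambda: 7, [Processor(permanent=True), Processor()]) == 7
```

No state is stored here: the task is simply restarted from its inputs, which is exactly why this category carries performance overhead and latency rather than storage overhead.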
4.4. Backward execution - Retry with state storage
- The other group of backward techniques includes the techniques that retry the execution by storing the state of the system at intermediate points.
- The inter-process dependencies are often recorded, so that the execution can be accurately repeated during the recovery phase.
- Such additional pieces of information…
- These surveys address both single-threaded and multi-threaded/multi-process applications. A number of techniques have been developed that do not explicitly bring the handling of non-deterministic events to the forefront.
- Checkpointing at user-level utilizes run-time libraries that are linked to the application program.
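User-level checkpointing can be sketched in a few lines (a deliberately minimal illustration; `CheckpointLib` is a hypothetical name, and real run-time libraries serialize the full process image to stable storage, not a single state object):

```python
import io
import pickle

class CheckpointLib:
    """Toy run-time "library" linked to the application: snapshot and restore state."""

    def __init__(self):
        self._store = io.BytesIO()   # stands in for stable storage

    def checkpoint(self, state):
        self._store = io.BytesIO(pickle.dumps(state))

    def restore(self):
        return pickle.loads(self._store.getvalue())

lib = CheckpointLib()
state = {"iter": 0, "acc": 0}
for i in range(1, 6):
    state = {"iter": i, "acc": state["acc"] + i}
    if i == 3:
        lib.checkpoint(state)        # application-chosen checkpoint location

# Simulate a crash after iteration 5: restart from the last checkpoint.
state = lib.restore()
assert state == {"iter": 3, "acc": 6}
```

Because the library is linked at user level, the application itself decides what state to save and where the checkpoint locations are, without kernel support.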
4.5. Overall mapping and platform software classification
- By combining the sub-trees of the previous subsections, the overall mapping and platform software classification tree is built, as shown in Figure 22.
- Starting from the top-level split of Figure 13, the intermediate nodes (colored pale yellow) are followed when necessary, to reach the final classes (colored darker yellow and numbered).
5. USAGE OF THE CLASSIFICATION FRAMEWORK
- Identifying the primitive components (each corresponding to a primitive category) and their position in the framework first makes it possible to handle the complexity of the sometimes highly sophisticated mitigation schemes.
- A “divide and conquer” view of the publication enables the reader to delve into the most relevant implementation details (when that is necessary) in a much more controlled way.
5.1. Mapping of hybrid schemes
- In reality, the resiliency and mitigation approaches, which are present in research papers, rarely belong to a single leaf of the previous, and indeed any, classification.
- The majority of the published work consists of hybrid combinations of the leaves.
- In particular, UDP-Lite packetization is (re-)designed in such a way that bits which are very sensitive to errors are better protected.
- The super-reliable cores execute the operations that are less resilient to errors.
- A run-time scheduler reassigns a task that has failed on a particular RRC to another RRC (leaf S.8 of the Mapping & platform SW classification).
6. DISCUSSION AND FUTURE CHALLENGES
- The proposed classification was illustrated through a representative list of schemes, to better absorb the related ideas and support the validity of the tree.
- The literature on fault tolerance and resilience techniques has evolved in accordance with the trends in computer architecture and software design development.
- Therefore, by properly propagating information among the different layers and providing a suitable degree of adaptivity (with design time and run time knobs), the most cost-effective solutions can be achieved.
- TDP mode 5 corresponds to a mode in which cores are operated at near-threshold voltages.
8. CONCLUSION
- Techniques that increase resilience and mitigate functional reliability errors were classified in a novel way.
- This was achieved through a framework with complementary splits, in which primitive mitigation concepts are defined.
- That allows every type of technique to be classified, by combining the appropriate components.
- The framework has been accompanied by a wide variety of sources from the published literature.
- Insight can be provided to the designers and researchers about the nature of existing schemes, since every node has some unique properties.
Frequently Asked Questions (17)
Q2. What are the future works in "Classification of resilience techniques against functional errors at higher abstraction layers of digital systems"?
The most prominent is that mapping and SW provide a lot of flexibility, due to the re-mapping possibilities of a given task sequence onto the "fixed" HW. Networked applications further expanded the deliverable functionality possibilities. The system behavior can be adapted at run time whenever significant environmental changes take place, or according to varying error rates. This is especially so, as errors can be masked as they propagate through the different hardware and software layers (including the application itself).
Q3. What are the cons of a checkpointing scheme?
Cons include latency (depending on the checkpointing granularity), performance (depending also on whether checkpointing is overlapped with normal execution) and the limitation to transient errors.
Q4. What are the cons of a hybrid?
Cons include the need for system-specific solutions, the low error protection (through isolation), the potential performance degradation.
Q5. What are the pros and cons of checking a system?
Cons include the potentially high storage and power overhead, the potentially very high latency and performance (depending also on whether checkpointing is overlapped with normal execution).
Q6. What are the challenges and opportunities for the fault tolerance techniques?
Further technology trends like 3D integration, incorporating heterogeneous technologies on a single platform and dark silicon pose new challenges and opportunities for the fault tolerance techniques.
Q7. What are some examples of emerging error-tolerant application domains?
Other examples of emerging error-tolerant application domains are Recognition, Mining and Synthesis (RMS) [Dubey 2005] as well as artificial neural networks (ANNs) [Temam 2012].
Q8. What are the pros and cons of the HW module?
Pros include the limited area and power, performance overhead as the new implementation will typically satisfy the system requirements, while minimizing additional cost.
Q9. What is the term task in this paper?
The term task in this paper is used as an umbrella term, which can denote…
Q10. What are the main criteria for further categorizing into classes?
These four classes are discussed in the following subsections, as shown in Figure 13. Main criteria for further categorization into classes include whether modifications are required in: existing functionalities, existing task implementations, the resource allocation, the interaction with neighbouring tasks, the execution mode (of additional tasks), and cooperation among HW modules.
Q11. What is the concept of storing checkpoints in a customized way?
Rather than saving checkpoints at fixed intervals, checkpoints can be stored in a customized way so that the amount of stored data is minimized.
Q12. What are the pros and cons of local schemes?
Compared to global schemes, local schemes reduce the amount of data to be stored during checkpointing but require typically a more complicated recovery algorithm.
Q13. What are the pros and cons of adding modules with different functionality?
Instead of adding modules with the same functionality, modules with different functionality can be added; the added modules play an active role in the recovery as in the previous category.
Q14. What is the difference between error recovery and repair?
Error recovery is further split into forward error recovery (FER), which includes redundancy such as triple modular redundancy, and backward error recovery (BER), which includes rolling back to a previously saved correct state of the system.
Q15. What are the types of systems that are amenable to non-deterministic events?
Beyond the earlier discussed types of systems, intra-module schemes may address applications that are amenable to numerous non-deterministic events: uncertain functions (like human input functions), interrupts, system calls, I/O operations due to communication with external devices.
Q16. What are the pros and cons of online multiprocessor checkpointing?
System-specific strategies have been developed which deal with events coming from the external environment, especially events due to communication with external devices. Online multiprocessor checkpointing can be broadly characterized as local and global.
Q17. What is the other group of backward techniques?
The other group of backward techniques includes the techniques that retry the execution by storing the state of the system at intermediate points.