
Showing papers by "Matteo Sonza Reorda published in 2021"


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the authors combine the accuracy of register-transfer-level (RTL) fault injection with the efficiency of software fault injection to reduce the execution time of RTL evaluation.
Abstract: The complexity of both hardware and software makes GPU reliability evaluation extremely challenging. A low-level fault injection on a GPU model, despite being accurate, would take a prohibitively long time (months to years), while software fault injection, despite being quick, cannot access resources that are critical for GPUs and typically uses synthetic fault models (e.g., single bit-flips) that could result in unrealistic evaluations. This paper proposes to combine the accuracy of Register-Transfer Level (RTL) fault injection with the efficiency of software fault injection. First, on an RTL GPU model (FlexGripPlus), we inject over 1.5 million faults in low-level resources that are unprotected and hidden from the programmer, and characterize their effects on the output of common instructions. We create a pool of possible fault effects on the operation output based on the instruction opcode and input characteristics. We then inject these fault effects, at the application level, using an updated version of a software framework (NVBitFI). Our strategy reduces the fault injection time from the tens of years an RTL evaluation would need to tens of hours, thus allowing, for the first time on GPUs, the tracking of fault propagation from the hardware to the output of complex applications. Additionally, we provide a more realistic fault model and show that single bit-flip injection would underestimate the error rate of six HPC applications and two convolutional neural networks by up to 48% (18% on average). The RTL fault models and the injection framework we developed are made available in a public repository to enable third-party evaluations and ease the reproducibility of results.
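
As a rough illustration of the pool-based strategy described above, the following Python sketch samples corrupted results from an opcode-indexed pool of fault effects (the opcodes, input classes, and effect functions here are hypothetical placeholders; in the paper the pool is derived from the RTL campaign on FlexGripPlus and applied through NVBitFI):

    import random

    # Hypothetical pool of fault effects, keyed by (opcode, input class), as
    # would be derived from an RTL fault-injection campaign. Each entry maps
    # a correct result to a corrupted one (realistic multi-bit corruptions,
    # not just single bit-flips).
    FAULT_EFFECT_POOL = {
        ("FADD", "normal"): [
            lambda r: -r,                    # sign corruption
            lambda r: 0.0,                   # output forced to zero
        ],
        ("FMUL", "normal"): [
            lambda r: r + 2.0 ** -23,        # mantissa LSB perturbation
        ],
    }

    def inject(opcode, input_class, correct_result, rng=random):
        """Replace an instruction's correct result with a fault effect
        sampled from the RTL-derived pool (application-level injection)."""
        effects = FAULT_EFFECT_POOL.get((opcode, input_class))
        if not effects:                      # no characterized effect: leave untouched
            return correct_result
        return rng.choice(effects)(correct_result)

    # Example: corrupt one FADD result as a software-level injector would.
    print(inject("FADD", "normal", 3.5))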

19 citations


Proceedings ArticleDOI
TL;DR: System-level test, or SLT, is an increasingly important process step in today's integrated circuit testing flows; this paper consolidates available knowledge about SLT and outlines new and promising directions for methodical developments leveraging recent findings from software engineering.
Abstract: System-level test, or SLT, is an increasingly important process step in today's integrated circuit testing flows. Broadly speaking, SLT aims at executing functional workloads in operational modes. In this paper, we consolidate available knowledge about what SLT precisely is and why it is used despite its considerable costs and complexities. We discuss the types of failures covered by SLT, and outline approaches to quality assessment, test generation and root-cause diagnosis in the context of SLT. Observing that the theoretical understanding of all these questions has not yet reached the maturity of the more conventional structural and functional test methods, we outline new and promising directions for methodical developments leveraging recent findings from software engineering.

13 citations


Proceedings ArticleDOI
25 Apr 2021
TL;DR: In this article, the authors combine the accuracy of micro-architectural simulation with the speed of software fault injection to investigate the reliability of CNNs executed in GPUs, and they are able to analyze the impact of faults affecting GPUs' hidden modules on a whole CNN execution.
Abstract: Graphics Processing Units (GPUs) are commonly used to accelerate Convolutional Neural Networks (CNNs) for object detection and classification. As CNNs are employed in safety-critical applications, such as autonomous vehicles, their reliability must be carefully evaluated. In this work, we combine the accuracy of microarchitectural simulation with the speed of software fault injection to investigate the reliability of CNNs executed on GPUs. First, with a detailed microarchitectural fault injection on a GPU model (FlexGripPlus), we characterize the effects of faults in critical, user-hidden modules (such as the Warp Scheduler and the Pipeline Registers) on the computation of convolution over a suitably selected subset of tiles. Then, with software fault injection, we propagate the fault effects through the CNN. Thanks to our approach we are able, for the first time, to analyze the impact of faults affecting GPUs' hidden modules on the execution of a whole CNN (LeNet) without undermining the correctness of the reliability evaluation.
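
A minimal sketch of the tile-level propagation idea: compute a convolution tile by tile and replace one tile's output with a characterized fault effect before the result flows into the rest of the network (the 8x8 input, the zeroed-tile effect, and all function names are illustrative assumptions, not the paper's actual setup):

    import numpy as np

    def convolve2d_tiled(img, kern, tile=4, corrupt_tile=None, effect=None):
        """Valid 2-D convolution computed tile by tile; one tile's output can
        be replaced with a microarchitecturally characterized fault effect."""
        H = img.shape[0] - kern.shape[0] + 1
        W = img.shape[1] - kern.shape[1] + 1
        out = np.empty((H, W))
        for i in range(H):
            for j in range(W):
                out[i, j] = np.sum(img[i:i+kern.shape[0], j:j+kern.shape[1]] * kern)
        if corrupt_tile is not None:
            r, c = corrupt_tile
            sl = (slice(r, r + tile), slice(c, c + tile))
            out[sl] = effect(out[sl])    # fault effect from the microarchitectural campaign
        return out

    img = np.arange(64, dtype=float).reshape(8, 8)
    kern = np.ones((3, 3)) / 9.0
    golden = convolve2d_tiled(img, kern)
    faulty = convolve2d_tiled(img, kern, corrupt_tile=(0, 0),
                              effect=lambda t: np.zeros_like(t))  # e.g., tile zeroed
    print("corrupted elements:", int((golden != faulty).sum()))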

11 citations


Proceedings ArticleDOI
07 Apr 2021
TL;DR: In this article, the authors present a functional test method based on a Software-Based Self-Test (SBST) approach targeting the Special Function Units (SFUs) in GPUs.
Abstract: The usage of Graphics Processing Units (GPUs) has extended from graphics applications to others where their high computational power is exploited (e.g., to implement Artificial Intelligence algorithms). These complex applications usually need highly intensive computations based on floating-point transcendental functions. GPUs may efficiently compute these functions in hardware using ad hoc Special Function Units (SFUs). However, a permanent fault in such units could be very critical (e.g., in safety-critical automotive applications). Thus, test methodologies for SFUs are strictly required to achieve the target reliability and safety levels. In this work, we present a functional test method based on a Software-Based Self-Test (SBST) approach targeting the SFUs in GPUs. This method exploits different approaches to build a test program and applies several optimization strategies to exploit the GPU parallelism to speed up the test procedure and reduce the required memory. The effectiveness of this methodology was proven by resorting to an open-source GPU model (FlexGripPlus) compatible with NVIDIA GPUs. The experimental results show that the proposed technique achieves 90.75% fault coverage and up to 94.26% testable fault coverage, reducing the required memory and test duration with respect to pseudorandom strategies proposed by other authors.
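
The parallel, memory-frugal flavor of such a test procedure can be sketched as follows: thread-like workers each apply a slice of the stimuli to a transcendental function and compact the responses into a per-thread signature (the stimuli, the reciprocal stand-in for an SFU operation, and the XOR compaction are illustrative assumptions, not the paper's test programs):

    import struct
    from concurrent.futures import ThreadPoolExecutor

    def f32_bits(x):
        """Bit pattern of a value rounded to 32-bit float (SFUs work on FP32)."""
        return struct.unpack("<I", struct.pack("<f", x))[0]

    def thread_test(args):
        """One GPU-thread-like worker: apply its share of stimuli to a
        transcendental function and fold the results into a signature."""
        func, stimuli = args
        sig = 0
        for x in stimuli:
            sig ^= f32_bits(func(x))     # XOR compaction keeps memory needs low
        return sig

    # Hypothetical stimuli targeting a reciprocal-like special function;
    # real stimuli would be chosen to cover the SFU's internal structures.
    stimuli = [0.5 + 0.37 * i for i in range(1024)]
    chunks = [(lambda x: 1.0 / x, stimuli[i::32]) for i in range(32)]  # 32 "threads"
    with ThreadPoolExecutor(max_workers=8) as ex:
        signatures = list(ex.map(thread_test, chunks))
    print("per-thread signatures:", [hex(s) for s in signatures[:4]])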

8 citations


Proceedings ArticleDOI
28 Jun 2021
TL;DR: In this paper, the authors define a generic and systematic methodology to improve the observability of transition delay faults in existing Self-Test Library (STL) programs, analyzing previously devised STLs to highlight specific points within test programs where the final fault coverage can be increased.
Abstract: Testing digital integrated circuits is generally done using Design-for-Testability (DfT) solutions. Such solutions, however, introduce non-negligible area and timing overheads that can be overcome by adopting functional solutions. In particular, the functional test of integrated circuits plays a key role when guaranteeing the device's safety during the operative lifetime (in-field test) is required, as mandated by standards like ISO26262. This can be achieved via the execution of a Self-Test Library (STL) by the device under test (DUT). Nevertheless, developing such test programs requires a significant manual effort, and can be non-trivial when dealing with complex modules. This paper takes a first step toward defining a generic and systematic methodology to improve the observability of transition delay faults in existing STLs. To do so, we analyze previously devised STLs in order to highlight specific points within test programs to be improved, leading to an increase in the final fault coverage.
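
One plausible shape for such an observability improvement, sketched in Python: given analysis results flagging points where a fault effect sits in a register but never reaches an observable output, insert a store after each flagged instruction (the OpenRISC-like mnemonics, register names, and flagged points are hypothetical, not the STLs analyzed in the paper):

    # Hypothetical representation of an STL test program: a list of assembly
    # lines, plus analysis results telling at which points a fault effect
    # reaches a register without ever being propagated to an observable output.
    program = [
        "l.add  r3, r1, r2",
        "l.sub  r4, r3, r1",
        "l.and  r5, r4, r3",
    ]
    # (line index, register) pairs flagged by the analysis as unobserved.
    unobserved = [(1, "r4"), (2, "r5")]

    def add_observation_points(prog, unobserved, obs_base="r31"):
        """Insert store instructions right after the flagged lines so that the
        fault effects latched in those registers become observable in memory."""
        flagged = {}
        for idx, reg in unobserved:
            flagged.setdefault(idx, []).append(reg)
        out, offset = [], 0
        for idx, line in enumerate(prog):
            out.append(line)
            for reg in flagged.get(idx, ()):
                out.append(f"l.sw   {offset}({obs_base}), {reg}   # observation point")
                offset += 4
        return out

    print("\n".join(add_observation_points(program, unobserved)))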

6 citations


Proceedings ArticleDOI
28 Jun 2021
TL;DR: In this paper, the authors propose to load a pattern into the DUT not by shifting it in one bit at a time, but by loading the entire pattern at once; this procedure yields conservative stress measures and thus fits stress analysis purposes.
Abstract: Burn-In equipment provides both external and internal stress to the device under test. External stress, such as thermal stress, is provided by a climatic chamber or by socket-level local temperature forcing tools, and aims at aging the circuit material, while internal stress, such as electrical stress, consists of driving the circuit nodes so as to produce high internal activity. To support internal stress, Burn-In test equipment is usually characterized by the large memory capabilities required to store precomputed patterns, which are then sequenced to the circuit inputs. Because of the increasing complexity and density of the new generations of SoCs, evaluating the effectiveness of the patterns applied to a Device under Test (DUT) through a simulation phase requires long periods of time. Moreover, topology-related considerations are becoming more and more important in modern high-density designs, so a way to include this information in the evaluation has to be devised. In this paper we show a feasible solution to this problem: the idea is to load a pattern into the DUT not by shifting it in one bit at a time, but by loading the entire pattern at once; this procedure yields conservative stress measures and thus fits stress analysis purposes. Moreover, we propose a method to take the topology of the DUT into account when calculating the activity metrics, so as to obtain stress metrics that better represent the activity the circuit is subjected to. An automotive chip of about 20 million gates is considered as a case study, on which we show both the feasibility and the effectiveness of the proposed methodology.
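
A toy version of such a topology-aware metric: weight each node's toggle count by a layout-derived factor, so that activity concentrated in dense regions contributes more to the stress figure (the node names, toggle counts, and density weights below are invented for illustration):

    # Weighted toggle-activity sketch: each node's toggles are scaled by a
    # topology-derived factor (here, hypothetically, normalized local gate
    # density), so activity in dense regions counts more.
    toggles = {"n1": 120, "n2": 30, "n3": 75}     # toggles per node over the pattern
    density = {"n1": 0.9, "n2": 0.2, "n3": 0.5}   # normalized local density weights

    def weighted_activity(toggles, weights):
        """Topology-aware stress metric: per-node toggle counts scaled by
        their layout weights, normalized by the total weight."""
        total_w = sum(weights[n] for n in toggles)
        return sum(toggles[n] * weights[n] for n in toggles) / total_w

    print(f"weighted activity: {weighted_activity(toggles, density):.1f}")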

5 citations


Proceedings ArticleDOI
25 Apr 2021
TL;DR: In this paper, the authors consider the case where the circuit is a pipelined processor, discuss the specific challenges of this scenario and propose some techniques to automatically identify some of the uncontrollable lines.
Abstract: In several test and reliability problems (from test generation to FMECA and Burn-In) it is important to preliminarily identify those lines in a circuit netlist which cannot be controlled, i.e., cannot be toggled to both logic values regardless of the applied stimuli. Several techniques have been proposed in the past to attack this problem. In this paper we consider the case where the circuit is a pipelined processor, discuss the specific challenges of this scenario, and propose some techniques to automatically identify some of the uncontrollable lines. The approach we devised uses SAT solving as the underlying technology. We report the results gathered on the OR1200 processor, showing that our method allows trading off the required computational effort against the achieved results. When compared with the results produced by a commercial tool, our approach identifies a much higher number of uncontrollable lines with reasonable computational requirements.
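
The core SAT query can be sketched compactly: encode the netlist in CNF (Tseitin style), then ask whether the CNF plus the constraint line = v is satisfiable for v = 0 and v = 1; if either query is UNSAT, the line is uncontrollable to that value. The toy netlist and the minimal DPLL solver below are illustrative, not the tool used in the paper:

    def dpll(clauses):
        """Minimal DPLL SAT check; clauses are lists of signed variable ids."""
        if not clauses:
            return True
        if [] in clauses:
            return False
        lit = clauses[0][0]
        for choice in (lit, -lit):
            reduced = [[l for l in c if l != -choice]
                       for c in clauses if choice not in c]
            if dpll(reduced):
                return True
        return False

    # Tseitin-style CNF of a toy netlist: g = a AND b; na = NOT a; h = g AND na.
    # Variables: a=1, b=2, g=3, na=4, h=5.
    CIRCUIT = [[-3, 1], [-3, 2], [3, -1, -2],     # g = a AND b
               [-4, -1], [4, 1],                  # na = NOT a
               [-5, 3], [-5, 4], [5, -3, -4]]     # h = g AND na

    def controllable(circuit, var, value):
        """A line is controllable to `value` iff CNF + (line = value) is SAT."""
        lit = var if value else -var
        return dpll(circuit + [[lit]])

    for v in (0, 1):
        print(f"h controllable to {v}:", controllable(CIRCUIT, 5, v))
    # h can never be 1: it depends on both a and NOT a, so it is uncontrollable.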

3 citations


Journal ArticleDOI
TL;DR: A method to extend the usage of the new analog fault models to power devices, thus allowing the computation of a fault coverage figure for a given test, and a quantitative evaluation of the effectiveness of some test procedures commonly used at the PCB level to detect faults inside power devices.
Abstract: Power electronic systems using printed circuit boards (PCBs) are broadly used in many applications, including some safety-critical ones. Several standards (e.g., ISO26262 for the automotive sector and DO-178 for avionics) mandate the adoption of effective test procedures for all electronic systems. However, the metrics to be used to compute the effectiveness of the adopted test procedures are not so clearly defined for power devices and systems. In the past years, some commercial fault simulation tools (e.g., DefectSim by Mentor Graphics and TestMAX by Synopsys) for analog circuits have been introduced, together with some new fault models. With these new tools, systematic analog fault simulation finally became practically feasible. The aim of this article is twofold: first, we propose a method to extend the usage of the new analog fault models to power devices, thus making it possible to compute a fault coverage figure for a given test. Second, we adopt the method on a case study, for which we quantitatively evaluate the effectiveness of some test procedures commonly used at the PCB level for the detection of faults inside power devices. A typical power supply unit (PSU) used in industrial products, including power transistors and power diodes, is considered. The analysis of the gathered results shows that, using the new method, we can identify the main points of strength and weakness of the different test solutions in a quantitative and deterministic manner, and pinpoint the faults escaping each one.
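
In essence, the resulting figure is the fraction of modeled analog faults each test procedure detects; a small sketch with invented test names and fault lists (the paper derives the detection sets via analog fault simulation of the PSU):

    # Hypothetical per-test detection results for a list of analog fault ids
    # (e.g., shorts/opens injected into power transistors and diodes).
    faults = [f"F{i}" for i in range(10)]
    detected_by = {
        "visual_ICT": {"F0", "F1", "F4"},
        "power_up":   {"F1", "F2", "F3", "F5", "F6"},
        "load_step":  {"F2", "F5", "F7", "F8"},
    }

    def coverage(detected, total):
        return 100.0 * len(detected) / len(total)

    union = set()
    for test, det in detected_by.items():
        union |= det
        print(f"{test:10s}: {coverage(det, faults):5.1f}% covered")
    print(f"{'combined':10s}: {coverage(union, faults):5.1f}%  "
          f"escapes: {sorted(set(faults) - union)}")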

2 citations


Journal ArticleDOI
TL;DR: This work introduces a flexible solution to detect and mitigate permanent faults affecting the execution units in these parallel devices, based on adding some spare modules to perform two in-field operations: detecting and mitigating faults.
Abstract: General-purpose graphics processing units (GPGPUs) are extensively used in high-performance computing. However, it is well known that the reliability of these devices may be limited by the occurrence of faults at the hardware level. This work introduces a flexible solution to detect and mitigate permanent faults affecting the execution units of these parallel devices. The proposed solution is based on adding some spare modules to perform two in-field operations: detecting and mitigating faults. The solution takes advantage of the regularity of the execution units in the device to avoid significant design changes and reduce the overhead. The proposed solution was evaluated in terms of reliability improvement and of area, performance, and power overhead costs. For this purpose, we resorted to a micro-architectural open-source GPGPU model (FlexGripPlus). Experimental results show that the proposed solution can extend reliability by up to 57%, with overhead costs lower than 2% in area and 8% in power.
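
A behavioral sketch of the detect-and-mitigate idea with a single spare unit: operations are occasionally cross-checked against the spare, and a mismatching unit is marked faulty and remapped from then on (the integer-add stand-in for an execution unit and the duty-cycled check are illustrative assumptions, not the paper's hardware scheme):

    class Core:
        def __init__(self, n_units):
            self.units = [self._healthy] * n_units   # per-unit behavior
            self.spare = self._healthy
            self.faulty = set()

        @staticmethod
        def _healthy(a, b):
            return a + b                             # stand-in for the FP/INT op

        def execute(self, unit, a, b, check=False):
            if unit in self.faulty:                  # mitigation: remap to spare
                return self.spare(a, b)
            result = self.units[unit](a, b)
            if check and result != self.spare(a, b): # detection: compare with spare
                self.faulty.add(unit)
                return self.spare(a, b)
            return result

    core = Core(n_units=4)
    core.units[2] = lambda a, b: (a + b) ^ 1         # inject a permanent fault
    print(core.execute(2, 3, 4, check=True))         # detected and corrected -> 7
    print(core.execute(2, 5, 5))                     # subsequent ops use the spare -> 10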

2 citations


Proceedings ArticleDOI
01 Sep 2021
TL;DR: In this article, a technique based on an evolutionary approach is presented to automatically generate stress test programs, i.e., sequences of instructions achieving a high toggling activity in the target module.
Abstract: One key aspect to be considered during device testing is the minimization of the switching activity of the circuit under test (CUT), thus avoiding possible problems stemming from overheating. However, there are also scenarios where maximizing the switching activity of certain modules of the circuit can prove useful (e.g., during Burn-In), in order to exercise the circuit under extreme operating conditions in terms of temperature (and temperature gradients). Resorting to a functional approach based on Software-Based Self-Test guarantees that the high induced activity can neither damage the CUT nor produce any yield loss. However, the generation of suitable, effective test programs remains a challenging task. In this paper, we consider a scenario where the modules to be stressed are sub-modules of a fully pipelined processor. We present a technique, based on an evolutionary approach, able to automatically generate stress test programs, i.e., sequences of instructions achieving a high toggling activity in the target module. With respect to previous approaches, the generated sequences are short and repeatable, thus guaranteeing their easy usability to stress a module (and increase its temperature). The processor we used for our experiments is the OpenRISC 1200. Results demonstrate that the proposed method is effective in achieving a high value of sustained toggling activity with short (3-instruction) and repeatable sequences.
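
The evolutionary loop can be sketched as follows, with a stand-in fitness function in place of the logic simulation that actually measures toggling activity in the target module (the candidate opcodes and fitness scores are invented):

    import random

    ISA = ["l.add", "l.sub", "l.xor", "l.mul", "l.or"]   # hypothetical candidate ops

    def toggling_activity(program):
        """Stand-in fitness: the real flow measures toggles in the target
        module by logic-simulating the looped instruction sequence."""
        score = {"l.mul": 9, "l.xor": 7, "l.add": 4, "l.sub": 4, "l.or": 2}
        bonus = len(set(program))          # alternating ops tend to toggle more nodes
        return sum(score[i] for i in program) + 3 * bonus

    def evolve(seq_len=3, pop_size=20, generations=50, rng=random.Random(0)):
        pop = [[rng.choice(ISA) for _ in range(seq_len)] for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=toggling_activity, reverse=True)
            survivors = pop[: pop_size // 2]
            children = []
            for parent in survivors:
                child = parent[:]
                child[rng.randrange(seq_len)] = rng.choice(ISA)  # point mutation
                children.append(child)
            pop = survivors + children
        return max(pop, key=toggling_activity)

    best = evolve()
    print("best repeatable sequence:", best, "fitness:", toggling_activity(best))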

2 citations


Proceedings ArticleDOI
28 Jun 2021
TL;DR: In this paper, the authors propose an effective microarchitectural selective hardening of GPU modules to mitigate faults that affect the correct execution of instructions, reducing the hardware overhead by up to 65% when compared with traditional TMR.
Abstract: Graphics Processing Units (GPUs) are today adopted in several domains for which reliability is fundamental, such as self-driving cars and autonomous machines. Unfortunately, on the one hand GPUs have been shown to have a high error rate and, on the other hand, the constraints imposed by real-time safety-critical applications make traditional, costly, replication-based hardening solutions inadequate. This paper proposes an effective microarchitectural selective hardening of GPU modules to mitigate faults that affect the correct execution of instructions. We first characterize, through Register-Transfer Level (RTL) fault injections, the architectural vulnerabilities of a GPU model (FlexGripPlus). We specifically target transient faults in the functional units and pipeline registers of a GPU core. Then, we apply selective hardening by triplicating the locations in each module that we found to be most critical. The results show that selective hardening using Triple Modular Redundancy (TMR) can correct 85% to 99% of faults in the pipeline registers and from 50% to 100% of faults in the functional units. The proposed selective TMR strategy reduces the hardware overhead by up to 65% when compared with traditional TMR.
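
A sketch of the selection step: rank fields by the criticality measured in the fault-injection campaign and triplicate the most critical ones until an area budget is exhausted, with a standard majority voter per bit (the field names, criticality values, and budget are hypothetical):

    # Selective-TMR sketch: rank pipeline-register fields by the criticality
    # observed in the RTL fault-injection campaign, and triplicate only the
    # most critical ones.
    criticality = {"PC": 0.95, "opcode": 0.80, "dest_reg": 0.60,
                   "operand_a": 0.15, "operand_b": 0.12, "predicate": 0.40}

    def select_for_tmr(crit, budget_bits, bit_width=32):
        """Greedy selection: harden the most critical fields until the area
        budget (expressed in extra flip-flops from triplication) runs out."""
        hardened, used = [], 0
        for field, c in sorted(crit.items(), key=lambda kv: -kv[1]):
            cost = 2 * bit_width                 # TMR adds two copies per field
            if used + cost <= budget_bits:
                hardened.append(field)
                used += cost
        return hardened, used

    def majority(a, b, c):
        """TMR voter for one bit."""
        return (a & b) | (a & c) | (b & c)

    hardened, cost = select_for_tmr(criticality, budget_bits=256)
    print("hardened fields:", hardened, "| extra flip-flops:", cost)
    print("voter(1, 0, 1) =", majority(1, 0, 1))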

Posted Content
TL;DR: In this article, the authors propose an approach to compress functional test programs belonging to a Self-Test Library (STL) by analyzing the interaction between the micro-architectural operation performed by each instruction and its capacity to propagate fault effects to any observable output.
Abstract: In-field test of processor-based devices is a must when considering safety-critical systems (e.g., in robotics, aerospace, and automotive applications). During in-field testing, different solutions can be adopted, depending on the specific constraints of each scenario. In the last years, Self-Test Libraries (STLs) developed by IP or semiconductor companies have become widely adopted. Given the strict constraints of in-field test, the size and duration of an STL are crucial parameters. This work introduces a novel approach to compress the functional test programs belonging to an STL. The proposed approach is based on analyzing (via logic simulation) the interaction between the micro-architectural operation performed by each instruction and its capacity to propagate fault effects to any observable output, reducing the number of required fault simulations to one. The proposed compaction strategy was validated on a RISC-V processor and several test programs stemming from diverse generation strategies. Results showed that the proposed compaction approach can reduce the length of test programs by up to 93.9% and their duration by up to 95%, with minimal effect on fault coverage.
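
The analysis can be approximated as a liveness-style backward pass over a simulation trace: keep instructions whose effects reach an observable output, plus the producers they depend on, and drop the rest (the trace below is invented, and the real analysis tracks fault-effect propagation rather than plain dataflow):

    # Compaction sketch: one logic simulation records, per instruction, which
    # registers it reads/writes and whether its result reaches an observable
    # output; instructions contributing to neither are dropped.
    trace = [
        {"pc": 0x00, "reads": set(), "writes": {"r1"}, "observed": False},
        {"pc": 0x04, "reads": set(), "writes": {"r2"}, "observed": False},
        {"pc": 0x08, "reads": {"r1", "r2"}, "writes": {"r3"}, "observed": True},
        {"pc": 0x0c, "reads": set(), "writes": {"r4"}, "observed": False},  # dead
    ]

    def compact(trace):
        """Keep observed instructions plus the producers their inputs depend on."""
        keep, live = [], set()
        for entry in reversed(trace):
            if entry["observed"] or entry["writes"] & live:
                keep.append(entry["pc"])
                live -= entry["writes"]
                live |= entry["reads"]
        return sorted(keep)

    print("kept PCs:", [hex(pc) for pc in compact(trace)])  # 0x0c is dropped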