
Showing papers by "Matteo Sonza Reorda published in 2023"


Proceedings ArticleDOI
01 Apr 2023
TL;DR: In this article, the authors describe how to correctly specify statistical FIs for Convolutional Neural Networks and propose a data analysis of the CNN parameters that drastically reduces the number of FIs needed to achieve statistically significant results without compromising the validity of the method.
Abstract: Assessing the reliability of modern devices running CNN algorithms is a very difficult task. The complexity of state-of-the-art devices makes exhaustive Fault Injection (FI) campaigns impractical and typically beyond the available computational capabilities. A possible solution consists of resorting to statistical FI campaigns, which reduce the number of needed experiments by injecting only a carefully selected small subset of the possible faults. Under specific hypotheses, statistical FIs guarantee an accurate picture of the problem despite the reduced sample size. The main problems today concern the choice of the sample size, the location of the faults, and the correct understanding of the statistical assumptions. The intent of this paper is twofold: first, we describe how to correctly specify statistical FIs for Convolutional Neural Networks; second, we propose a data analysis of the CNN parameters that drastically reduces the number of FIs needed to achieve statistically significant results without compromising the validity of the proposed method. The methodology is experimentally validated on two CNNs, ResNet-20 and MobileNetV2, and the results show that a statistical FI campaign covering about 1.21% and 0.55% of the possible faults, respectively, provides very precise information on the CNN reliability. The statistical results have been confirmed by exhaustive FI campaigns on the same case studies.
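The sample-size choice the abstract discusses is commonly derived, in the statistical fault-injection literature, from the fault population size, a margin of error, and a confidence level. The sketch below illustrates the idea with the standard finite-population sample-size formula; it is a generic illustration under that assumption, not code or numbers taken from the paper.

```python
import math

def stat_fi_sample_size(population, margin=0.01, confidence_t=2.58, p=0.5):
    """Finite-population sample size for a statistical FI campaign.

    population   -- total number of possible faults (N)
    margin       -- acceptable error margin e (0.01 means 1%)
    confidence_t -- normal-distribution cut-off (2.58 ~ 99% confidence)
    p            -- estimated failure probability; 0.5 is the worst
                    case and therefore maximizes the sample size
    """
    n = population / (1 + margin ** 2 * (population - 1)
                      / (confidence_t ** 2 * p * (1 - p)))
    return math.ceil(n)

# Even at 99% confidence and a 1% error margin, a campaign over one
# million candidate faults needs only a few thousand injections.
sample = stat_fi_sample_size(1_000_000)
```

Note how weakly the result depends on the population size: this is what makes statistical FI campaigns tractable where exhaustive ones are not.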

5 citations


Journal ArticleDOI
TL;DR: This article assesses the fault coverage that can be attained through software self-test strategies for the in-field test of GPUs used in autonomous systems.
Abstract: Editor’s notes: GPUs have seen an increased adoption in autonomous systems. This article assesses the fault coverage that can be attained through software self-test strategies for in-field test of GPUs.—Nicola Nicolici, McMaster University

2 citations


Proceedings ArticleDOI
01 Jan 2023
TL;DR: In this paper, the effects of transition delay faults (TDFs) in GPUs are evaluated using an open-source model of a GPU (FlexGripPlus) and a set of workloads.
Abstract: This work proposes a method to evaluate the effects of transition delay faults (TDFs) in GPUs. The method takes advantage of low-level (i.e., RT- and gate-level) descriptions of a GPU to evaluate the effects of TDFs, thus paving the way to modeling them as errors at the instruction level, which can contribute to the resilience evaluation of large and complex applications. For this purpose, the paper describes a setup that efficiently simulates transition delay faults. The results allow us to compare their effects with those of stuck-at faults (SAFs) and perform an error classification correlating these faults with instruction-level errors. We resort to an open-source model of a GPU (FlexGripPlus) and a set of workloads for the evaluation. The experimental results show that, depending on the application code style, TDFs can compromise the operation of an application from 1.3 to 11.63 times less often than SAFs. Moreover, for all the analyzed applications, a considerable percentage of sites in the Integer (5.4% to 51.7%), Floating-point (0.9% to 2.4%), and Special Function units (17.0% to 35.6%) can become critical if affected by a SAF or TDF. Finally, a correlation between the impact of both fault models and the instructions executed by the applications reveals that, for all units, SAFs in the functional units are more prone (from 45.6% to 60.4%) to propagate errors to the software level than TDFs (from 17.9% to 58.8%).

1 citation


Proceedings ArticleDOI
22 May 2023
TL;DR: In this paper, the authors present a Validity Checker Module (VCM) architecture and a constraint set that allow building SBSTs that make minimal assumptions about the firmware, targeting hard-to-test faults in the ALU and register file of multiple scalar, in-order RISC-V processor families.
Abstract: Software-Based Self-Tests (SBST) allow at-speed, native online testing of processors by running software programs on the processor core, requiring no Design for Testability (DfT) infrastructure. The creation of such SBST programs often requires time-consuming manual labour that is expensive and demands in-depth knowledge of the processor's architecture to target hard-to-test faults. In contrast, encoding the SBST generation task as a Bounded Model Checking (BMC) problem allows using sophisticated, state-of-the-art BMC solvers to automatically generate an SBST. Constraints for the BMC problem are encoded in a circuit called Validity Checker Module (VCM) and applied during SBST generation. In this paper, we focus on presenting a VCM architecture and a constraint set that allow building SBSTs that make minimal assumptions about the firmware, targeting hard-to-test faults in the ALU and register file of multiple scalar, in-order RISC-V processor families. The VCM architecture consists of a processor-specific mapping layer and a generic constraint set connected via a well-defined interface. The generic constraint set enforces the desired SBST behaviour, including controlling the processor's pipeline state, memory accesses (and, with that, the executed instructions), register state, and fault propagation. Using a generic constraint set allows for rapid SBST generation targeting new RISC-V processor families while keeping the generic constraints untouched. Lastly, we evaluate this approach on two RISC-V processor families, namely the DarkRISCV and a proprietary, industrial core, showing the portability and strength of the approach and allowing new processors to be targeted rapidly.

1 citation


Journal ArticleDOI
TL;DR: In this paper, the authors present a methodology based on formal methods, able to automatically generate the best two-instruction stress-inducing sequence for the targeted processor module; the sequence is a short, arbitrarily-often repeatable pair of assembly instructions, thus guaranteeing the maximum possible constant switching activity.
Abstract: Throughout device testing, one key parameter to be considered is the switching activity (SWA) of the circuit under test (CUT). To avoid unwanted scenarios due to excessive power consumption during test, in most cases the SWA of the CUT must be kept at a minimal value when the test stimulus is applied. However, there are specific cases where the opposite, namely the SWA maximization within the CUT, or a certain sub-module of it, can prove beneficial. For example, during dynamic Burn-In testing we aim at maximizing the internal stress by applying suitable stimuli. This can be done in a functional manner by following the Software-Based Self-Test paradigm. However, generating such suitable programs represents a costly and arduous task for test engineers. We consider the case where the CUT is a pipelined processor core and we aim to maximize the SWA of certain core sub-modules. We present a comprehensive methodology based on formal methods, able to automatically generate the best two-instruction stress-inducing sequence for the targeted processor module. The generated stimulus is a short pair of assembly instructions that can be repeated for an arbitrarily long time, thus guaranteeing the maximum possible constant switching activity. The proposed method was applied to the OpenRISC 1200 and the RI5CY (PULP) processor cores, demonstrating its effectiveness when compared to other methods. We show that the time for generating the best repeatable instruction sequence is limited in most cases, while the generated sequence can always achieve a significantly higher repeatable and constant SWA than other solutions.

1 citation


Journal ArticleDOI
TL;DR: In this paper, the authors present a method, based on an evolutionary approach, that is able to automatically generate maximum-stress programs, i.e., sequences of instructions achieving a high switching activity in the target module.

1 citation


Journal ArticleDOI
TL;DR: In this article, a method to develop suitable self-test libraries for some GPU modules by resorting to high-level languages (HLLs) is described, thus reducing the complexity of the STL development process.
Abstract: Self-test libraries (STLs), widely used for in-field fault detection in processor-based systems, can also be used in graphics processing units (GPUs). This work describes a method to develop suitable STLs for some GPU modules by resorting to high-level languages (HLLs) such as CUDA, thus reducing the complexity of the STL development process. The main advantages, limitations and constraints of developing HLL STLs for GPUs are discussed. The libraries are evaluated using the FlexGripPlus GPU model. The experimental results show that HLL STLs can be effectively developed for regular modules in the GPU.

Proceedings ArticleDOI
21 Mar 2023
TL;DR: In this paper, the authors evaluate the impact of transient faults in the hardware structures of Special Function Unit (SFU) cores on GPU workloads and show that modular SFUs are less vulnerable to transient faults than fused SFUs.
Abstract: This work has been supported by the National Resilience and Recovery Plan (PNRR) through the National Center for HPC, Big Data and Quantum Computing. Graphics Processing Units (GPUs) are crucial in modern safety-critical systems to implement complex and dense algorithms, so their reliability plays an essential role in several domains (e.g., automotive and autonomous machines). In fact, reliability evaluations of GPUs and their internal units are of special interest due to their high parallelism and the need to identify vulnerable structures. In particular, Special Function Unit (SFU) cores inside GPUs are heavily used in multimedia, scientific computing, and the training of neural networks. However, reliability evaluations of SFUs have remained largely unexplored. This work evaluates the impact of transient faults in the hardware structures of SFUs for GPUs. We focus on evaluating and analyzing two SFU architectures ('fused' and 'modular') and their relation to energy, area, and reliability impact on GPU workloads. The evaluation resorts to a fine-grain analysis with experiments using an RTL open-source GPU (FlexGripPlus) instrumented with both SFUs. The experimental results on both SFU architectures indicate that modular SFUs are less vulnerable to transient faults (by up to 47% for the analyzed workloads) and more power efficient (by up to 36.6%), but require additional area (about 27%) in comparison with a fused SFU architecture (the base for commercial devices), which is more vulnerable to faults but area efficient.

Proceedings ArticleDOI
21 Mar 2023
TL;DR: In this paper, the authors investigate possible software architectures for integrating Software Test Libraries in RTOSes, with their pros and cons, and propose hardening mechanisms to overcome possible problems in case permanent or transient faults arise.
Abstract: The performance and complexity of Automotive Systems-on-Chip (SoC) have risen dramatically in the last decade thanks to technology scaling and the move to multicore capabilities. As a matter of fact, user requirements and the complexity of the scenarios handled by devices are growing dramatically. Therefore, bare-metal safety-critical applications have shifted to a new application paradigm on top of Real-Time Operating Systems (RTOS). Safety standards require runtime self-check procedures that the CPU executes from time to time. Such self-test procedures have strict requirements on their execution time and memory footprint. These self-test processes, also known as Software-Based Self-Test, are encapsulated in Software Test Libraries. Following the shift to applications written on top of an RTOS, Software Test Libraries must also be integrated. This paper investigates possible software architectures for integrating Software Test Libraries in RTOSes, with their pros and cons. Afterward, some hardening mechanisms are provided to overcome possible problems in case permanent or transient faults arise. In order to simulate critical conditions, fault injections are performed via debugger in the Software Test Library to investigate its behavior and how faults affect the system. A previously developed Software Test Library is integrated into a commercial RTOS called Micrium µC/OS-III. The fault injection campaign is performed on a real automotive System-on-Chip belonging to the SPC58 family from STMicroelectronics.

Proceedings ArticleDOI
22 May 2023
TL;DR: In this article, an Image Test Library (ITL) targeting the on-line test of GPU functional units is developed, which is able to achieve about 95% stuck-at test coverage on the floating-point multipliers in a GPU.
Abstract: The widespread use of artificial intelligence (AI)-based systems has raised several concerns about their deployment in safety-critical systems. Industry standards, such as ISO 26262 for automotive, require detecting hardware faults during the mission of the device. Similarly, new standards are being released concerning the functional safety of AI systems (e.g., ISO/IEC CD TR 5469). Hardware solutions have been proposed for the in-field testing of the hardware executing AI applications; however, when used in applications such as Convolutional Neural Networks (CNNs) for image processing tasks, they may increase the hardware cost and affect the application performance. In this paper, for the very first time, a methodology is proposed to develop high-quality test images to be interleaved with the normal inference process of the CNN application. An Image Test Library (ITL) is developed targeting the on-line test of GPU functional units. The proposed approach does not require changing the actual CNN (thus avoiding costly memory loading operations) since it is able to exploit the actual CNN structure. Experimental results show that a 6-image ITL is able to achieve about 95% stuck-at test coverage on the floating-point multipliers in a GPU. The obtained ITL requires a very low test application time, as well as very little memory space for storing the test images and the golden test responses.
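At a high level, interleaving an ITL with inference amounts to periodically feeding stored test images through the same deployed network and comparing its outputs against pre-recorded golden responses. The sketch below is a hypothetical illustration of that loop, not the paper's implementation; the names (`infer`, `itl_self_test`) are assumptions.

```python
def itl_self_test(infer, test_images, golden_responses):
    """Run the Image Test Library through the normal inference path.

    infer            -- the deployed CNN inference function
    test_images      -- the small library of crafted test images
    golden_responses -- fault-free outputs recorded at deployment
    Returns True if the hardware produced the expected responses.
    """
    for image, golden in zip(test_images, golden_responses):
        if infer(image) != golden:
            return False  # a mismatch reveals a hardware fault
    return True

# Interleaved use: after every batch of mission images, spend one
# inference slot on a test image and check its golden response.
```

Because the test images traverse the unchanged CNN, no extra model needs to be loaded into GPU memory, which is the cost the abstract says the approach avoids.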

Proceedings ArticleDOI
22 May 2023
TL;DR: In this article , the authors consider the case where the layout of the circuit is known and propose two novel methods able to automatically produce functional stimuli to switch pairs of neighboring nodes (i.e., nodes that are placed within a specified distance in the DUT) in short periods of time.
Abstract: In the domain of high reliability applications, Burn-In testing (BI) is always present since it is one of the prime countermeasures against the infant mortality phenomenon. Traditional static BI testing proves to be inefficient for modern circuit designs. As the devices’ feature size scales down and their structural and architectural complexity increases, so does the complexity and cost of the BI test. Different BI methods are employed by the industry where stimuli are also applied to the devices under test (DUTs) in order to effectively stress and stimulate all nets of the design. One known industry practice resorts to Design for Testability (DfT) infrastructures (e.g., scan) and is based on the application of test vectors at low frequency to excite the DUT as much as possible with the goal of switching each net of the design at least once. In this paper we consider the case where the layout of the circuit is known and propose two novel methods able to automatically produce functional stimuli to switch pairs of neighboring nodes (i.e., nodes that are placed within a specified distance in the DUT) in short periods of time. This solution has been shown to be able to trigger some latent defects in a circuit better than other methods. As a case study, we target functional units within a RISC-V processor (RI5CY). We show that the functional stimuli generated by the exact method described in the paper are able to achieve optimal results (i.e., the maximum functional switching of neighboring pairs), thus maximizing the chance that their at-speed application can activate weak points in the circuit.

Proceedings ArticleDOI
03 May 2023
TL;DR: In this paper, the authors propose a methodology to find the first failing pattern that causes the BIST signature to deviate, and a way to collect good signatures from in-field devices at key-on/off, where BISTs are programmed and executed by the firmware at maximum frequency, for an industrial case study produced by STMicroelectronics.
Abstract: Embedded nano-electronic devices have spread into daily life over the past ten years. Chip and embedded system manufacturing has thus become more challenging in recent years. When safety-critical sectors like automotive are considered, addressing system anomalies and faults is crucial. Therefore, it is necessary to develop and research innovative ways to maintain high reliability in safety-critical sectors despite the complexity of present Systems-on-Chip (SoCs). In order to ensure high reliability and be compliant with reliability standards, designers started to add additional circuitry to perform on-device tests. Built-In Self-Test (BIST) is a technology that allows conducting exhaustive tests within devices and, most importantly, without the need for external equipment. BIST can detect faults by outputting a signature at the end of the test, which can be compared with a known value. Such known signatures are therefore key, and in case of a signature mismatch it is not trivial to understand the root cause of the failure. This paper proposes a methodology to find the first failing pattern that causes the BIST's signature to deviate, and a way to collect good signatures from in-field devices at key-on/off, where BISTs are programmed and executed by the firmware at maximum frequency, for an industrial case study produced by STMicroelectronics. The transition delay fault model is the primary target of the described work.

Journal ArticleDOI
TL;DR: In this article, the authors propose a low-cost, System-on-Chip-based Burn-In tester that supplies the DUT with unlimited pseudo-random patterns created autonomously by the MCU firmware from any selected seed.
Abstract: Burn-In test equipment usually features extensive memory capabilities to store pre-computed patterns to be applied to the circuit inputs, as well as ad-hoc circuitry to drive and read the DUT pins during the BI phase. The solution proposed in this paper dramatically reduces the memory size requirement and just demands a generic microcontroller unit (MCU) equipped with a couple of embedded processors, some standard peripheral units, and a few KB of memory. Moreover, the proposed Burn-In tester could be integrated into System Level Test equipment, which is typically based on MCUs to communicate functionally with the DUT. This paper provides full details about the architecture of such a low-cost innovative tester, which can supply the DUT with unlimited pseudo-random patterns created autonomously by the MCU firmware from any selected seed. The tester prototype developed to collect experimental results includes a low-cost System-on-Chip based on a multi-core MCU and a set of peripheral cores, encompassing timers and Direct Memory Access modules. The tester prototype is used to stress an automotive chip accounting for about 20 million gates, 700 thousand scan flip-flops, and several scan modes. The combination of pseudo-random pattern generation with the ability to control different scan and Design for Testability (DfT) modes, including LBIST, makes it possible to reach higher coverage of stress metrics than the application of a limited set of pre-computed ATPG patterns. The toggle coverage reached is up to 95.89%. The application speed achieved by the tester with non-optimized connections is up to about 10 MHz.
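Pseudo-random scan patterns of the kind this tester generates are commonly produced in firmware with a linear-feedback shift register: storing a single seed replaces storing the whole pattern set. The sketch below is a generic 32-bit Galois LFSR shown purely for illustration (the paper does not disclose its actual generator, and the tap mask here is an assumption from standard LFSR tables).

```python
def lfsr32_stream(seed, count):
    """Generate `count` pseudo-random 32-bit words from `seed`
    using a Galois (right-shift) LFSR."""
    assert seed != 0, "an all-zero state would lock the LFSR"
    state = seed & 0xFFFFFFFF
    out = []
    for _ in range(count):
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0xA3000000  # tap mask from standard LFSR tables
        out.append(state)
    return out

# The same seed always reproduces the same sequence, so the tester
# stores only the seed, not megabytes of pre-computed patterns.
```

This reproducibility is also what lets the equipment re-run an identical stress campaign on every DUT without any pattern memory.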

Journal ArticleDOI
TL;DR: In this paper, the authors present a method to evaluate the effects of permanent faults affecting the GPU scheduler and control units, which are the most peculiar and stressed resources, along with the first figures that allow quantifying these effects.
Abstract: Graphics Processing Units (GPUs) are heavily stressed to accelerate High-Performance Computing applications and are used to accelerate Deep Neural Networks in several domains where they have a life expectancy of many years. These conditions expose the GPU hardware to (premature) aging, causing permanent faults to arise after the usual end-of-manufacturing test. Techniques to assess the impact of permanent faults in GPUs are thus strongly required, allowing the reliability risk to be estimated and possibly mitigated. In this paper, we present a method to evaluate the effects of permanent faults affecting the GPU scheduler and control units, which are the most peculiar and stressed resources, along with the first figures that allow quantifying these effects. We characterize over 5.83×10^5 permanent fault effects in the scheduler and controllers of a gate-level GPU model. Then, we map the observed error categories into software by instrumenting the code of 13 applications and two convolutional neural networks, injecting more than 1.65×10^5 permanent errors. Our two-level fault injection strategy reduces the evaluation time from hundreds of years of gate-level evaluation to hundreds of hours. We found that faults in the GPU parallelism management units can modify the opcode, the addresses, and the status of thread(s) and warp(s). The large majority (up to 99%) of these permanent hardware errors impact the running software execution. Errors affecting the instruction operation or resource management hang the code, while 45% of errors in the parallelism management or control flow induce silent data corruption.
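Mapping hardware error categories into software, as done above by instrumenting the application code, essentially means corrupting the affected field of an instruction's result every time that instruction executes, then comparing the run against a golden one. The sketch below is a generic, hypothetical instruction-level injector (the paper's instrumentation framework is not described in code form; all names here are assumptions).

```python
def inject_permanent_bitflip(value, bit, width=32):
    """Model a permanent fault by flipping `bit` of an
    instruction's integer result on every execution."""
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask

def instrumented_add(a, b, faulty_bit=None):
    """An ADD whose destination value is corrupted whenever a
    permanent error is mapped onto it (faulty_bit is not None)."""
    result = (a + b) & 0xFFFFFFFF
    if faulty_bit is not None:
        result = inject_permanent_bitflip(result, faulty_bit)
    return result

# Comparing the faulty run against a golden run classifies the
# outcome as masked, silent data corruption, or hang.
```

Working at this level is what buys the speed-up the abstract reports: each software-level injection replaces a full gate-level simulation of the same fault.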

Journal ArticleDOI
TL;DR: In this article, the authors present a systematic method for the development of very high fault coverage test programs for PDFs, which largely outperform test programs written for other fault models.
Abstract: New semiconductor technologies for advanced applications are more prone to defects and imperfections related, among several different causes, to the manufacturing process, aging, and cross-talk. These phenomena negatively affect the circuit's timing and can be effectively modeled by means of the path delay fault (PDF) model. While path delay testing is currently supported by commercial Automatic Test Pattern Generation tools for scan designs, functional testing covering PDFs is not widely adopted, mainly because of the high cost of test generation. On the other hand, functional test is already widely adopted for the in-field test of stuck-at faults, often performed by resorting to the execution of suitable test programs (Self-Test Libraries, or STLs). This approach is attractive, since it can be performed at-speed with limited time constraints and high flexibility, making it a suitable in-field test solution. Previous work assessed the feasibility and validity of functional approaches based on test programs targeting PDFs. In this work, we present the first systematic method for the development of very high fault coverage test programs for PDFs, which largely outperform test programs written for other fault models. Moreover, the proposed method allows the identification of functionally untestable faults. The effectiveness of the proposed approach was proven on an open-source RISC-V processor core, where 100% coverage of the functionally testable longest paths was achieved, compared with an initial coverage of 0.52% achieved with test programs targeting stuck-at faults. Results demonstrate that shorter paths are also effectively covered.

Proceedings ArticleDOI
03 May 2023
TL;DR: In this article, the authors present an environment dedicated to the analysis of the impact of permanent faults on GPU platforms, with the objective of exploiting the configuration features of this tool and thus analyzing the effects of faults when changing the target architecture.
Abstract: Nowadays, GPU platforms have gained wide importance in applications that require high processing power. Unfortunately, the advanced semiconductor technologies used for their manufacturing are prone to different types of faults. Hence, solutions are required to support the exploration of the fault resilience of different architectures. Based on this motivation, this work presents an environment dedicated to the analysis of the impact of permanent faults on GPU platforms. This environment is based on GPGPU-Sim, with the objective of exploiting the configuration features of this tool and, thus, analyzing the effects of faults when changing the target architecture. To validate the environment and show its usability, a fault campaign was carried out on three different GPU architectures (Kepler, Volta, and Turing). In addition, each GPU was modified with an arbitrary number of parallel processing cores (or SMs). Three representative applications (Vector Add, Scalar Product, and Matrix Multiply) were executed on each GPU, and the behavior of each architecture in the presence of permanent faults in the functional units (i.e., integer and floating-point) was analyzed. This fault campaign shows the usability of the environment and demonstrates its potential to support decisions on the best architectural parameters for a given application.

Proceedings ArticleDOI
22 May 2023
TL;DR: In this article , the use of RISC-V processors for safety-related applications is discussed and the essential techniques necessary to obtain safety both in the functional and in the timing domain.
Abstract: With the continued success of the open RISC-V architecture, practical deployment of RISC-V processors necessitates an in-depth consideration of their testability, safety and security aspects. This survey provides an overview of recent developments in this quickly-evolving field. We start with discussing the application of state-of-the-art functional and system-level test solutions to RISC-V processors. Then, we discuss the use of RISC-V processors for safety-related applications; to this end, we outline the essential techniques necessary to obtain safety both in the functional and in the timing domain and review recent processor designs with safety features. Finally, we survey the different aspects of security with respect to RISC-V implementations and discuss the relationship between cryptographic protocols and primitives on the one hand and the RISC-V processor architecture and hardware implementation on the other. We also comment on the role of a RISC-V processor for system security and its resilience against side-channel attacks.

Proceedings ArticleDOI
22 May 2023
TL;DR: In this paper, the authors evaluate the impact of soft errors arising in Special Function Unit (SFU) cores on the reliability of GPUs and conclude that workloads using SFUs are more vulnerable to faults (by up to 5 orders of magnitude for the analyzed applications).
Abstract: Currently, Graphics Processing Units (GPUs) are extensively used in several safety-critical domains to support the implementation of complex operations where reliability is a major concern. Some internal cores, such as Special Function Units (SFUs), are increasingly adopted, being crucial to achieving the necessary performance in multimedia, scientific computing, and neural network training. Unfortunately, these cores are largely unexplored in terms of their impact on reliability. In this work, we evaluate the incidence of SFUs on the reliability of GPUs when affected by soft errors. First, we analyze the impact of SFU cores on the GPU's reliability and the running workloads. We resort to applications configured to use or not use the SFU cores and evaluate the effect of soft errors by using a software-based fault injection environment (NVBITFI) on an NVIDIA Ampere GPU. Then, we focus on evaluating the impact of soft errors arising in the SFUs. A fine-grain RTL evaluation determines the soft error effects on two SFU architectures for GPUs ('fused' and 'modular'). The experiments use an open-source GPU (FlexGripPlus) instrumented with both SFU architectures. The results suggest that workloads using SFUs are more vulnerable to faults (from 1 up to 5 orders of magnitude for the analyzed applications). Moreover, the RTL results show that modular SFUs are less vulnerable to faults (by up to 47% for the analyzed workloads) in comparison with fused SFUs (the base of commercial devices), thus allowing us to identify the more robust SFU architecture.