Showing papers by "Matteo Sonza Reorda" published in 2020


Journal ArticleDOI
TL;DR: This work extended the capabilities of an open-source VHDL GPGPU model (FlexGrip) and developed a new version named FlexGripPlus to study and analyze the effects of SEUs in a GPGPU in a much more detailed manner, demonstrating the correlation between the number of execution units in a GPGPU and the system reliability.

44 citations


Proceedings ArticleDOI
23 Nov 2020
TL;DR: In this article, the authors consolidate available knowledge about what system-level test is precisely and why it is used despite its considerable costs and complexities, and outline approaches to quality assessment, test generation and root-cause diagnosis in the context of SLT.
Abstract: System-level test, or SLT, is an increasingly important process step in today’s integrated circuit testing flows. Broadly speaking, SLT aims at executing functional workloads in operational modes. In this paper, we consolidate available knowledge about what SLT is precisely and why it is used despite its considerable costs and complexities. We discuss the types of failures covered by SLT, and outline approaches to quality assessment, test generation and root-cause diagnosis in the context of SLT. Observing that the theoretical understanding for all these questions has not yet reached the level of maturity of the more conventional structural and functional test methods, we outline new and promising directions for methodical developments leveraging on recent findings from software engineering.

14 citations


Journal ArticleDOI
TL;DR: The proposed technique can translate fault primitives, which represent the effect of faults in a memory cell, into self-test functions and programs composed of a sequence of operations to excite the fault in the memory and to propagate its effects to a visible location, thus detecting its presence.
Abstract: The highly parallel processing capabilities and reduced power performance of General Purpose Graphics Processing Units (GPGPUs) have been crucial factors for their massive use in multiple fields, such as multimedia and high-performance computing applications. Nowadays, more demanding areas, such as automotive, employ GPGPU devices where safety and reliability are mandatory design constraints. Nevertheless, the structural complexity, the transistor density, and the implementation in the latest silicon technologies introduce challenges to match safety and reliability requirements. In these technologies, wear-out and aging are factors that may significantly increase the occurrence of permanent faults during the lifetime operation. Moreover, these faults may generate unacceptable misbehaviors during the execution of an application. These constraints require devising new methods for in-field fault detection, thus verifying the integrity and correct behavior of the device during its whole operational life. This work proposes a technique to generate functional self-test programs targeting the detection of permanent static faults in the memory of the warp scheduler of a GPGPU. The proposed technique can translate fault primitives, which represent the effect of faults in a memory cell, into self-test functions and programs composed of a sequence of operations to excite the fault in the memory and to propagate its effects to a visible location, thus detecting its presence. We focused on the memory in the warp scheduler because it represents a crucial module for the device operation. Furthermore, this memory is present in each Streaming Multiprocessor (SM) of a GPGPU. Some experimental results to validate the method have been gathered, resorting to the NVIDIA Visual Profiler and the Nsight Debugger using the NVIDIA-GEFORCE GTX GPU platform and a structural fault simulator. The CUDA programming environment was used to implement the test procedures.

11 citations
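
As a rough illustration of how a fault primitive can be turned into a sequence of memory operations, here is a minimal Python sketch. It assumes the conventional S/F/R notation for single-cell static faults and only uses the sensitizing part S (for example "0w1" for an up-transition fault). The function name `fault_primitive_to_steps` and the step encoding are hypothetical and are not taken from the paper, which targets the warp scheduler memory through CUDA test programs.

```python
def fault_primitive_to_steps(sensitizer):
    """Translate the sensitizing part S of a single-cell static fault
    primitive into (operation, value) steps.

    A ("w", v) step writes v; an ("r", v) step reads and expects v.
    Hypothetical helper for illustration only.
    """
    init = int(sensitizer[0])
    steps = [("w", init)]                 # bring the cell to its initial state
    if len(sensitizer) == 1:              # state fault, e.g. "0"
        steps.append(("r", init))         # the stored value itself must be observed
    elif sensitizer[1] == "w":            # e.g. "0w1": a write excites the fault
        value = int(sensitizer[2])
        steps += [("w", value), ("r", value)]
    else:                                 # e.g. "0r0": a read excites the fault
        # The second read exposes faults that corrupt the cell
        # while still returning the correct value on the first read.
        steps += [("r", init), ("r", init)]
    return steps


# Up-transition fault: the cell fails to go from 0 to 1.
transition_fault_steps = fault_primitive_to_steps("0w1")
```

Each read step both propagates the fault effect to an observable value and states the expected result, so a mismatch signals the fault.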


Proceedings ArticleDOI
05 Apr 2020
TL;DR: Results show how the approach can cover more than 90% of stuck-at faults on an open-source implementation of the standard, which is significantly more than what a usual functional test based on some sample application can achieve.
Abstract: The Controller Area Network (CAN) bus is a serial bus protocol widely used in the automotive domain to allow communication between different Electronic Control Units in the car. Being often part of safety-critical systems, the hardware implementing the CAN network must be constantly tested along the system lifetime, even during the operational phase. CAN controllers are relatively complex modules in charge of managing the sending and the receiving of packages through the CAN bus and defects affecting them can easily compromise the whole CAN network. In this work, the CAN controller is tested by test programs to be executed by the CPU connected to the device under test and by another unit connected to the same CAN bus. A fault grading with respect to structural permanent faults of a functional test based on the execution of a software test library for the CAN bus is presented for the first time. Results show how the approach can cover more than 90% of stuck-at faults on an open-source implementation of the standard, which is significantly more than what a usual functional test based on some sample application can achieve.

10 citations


Journal ArticleDOI
TL;DR: The proposed methodology for supporting the failure mode, effects, and criticality analysis (FMECA) aimed at identifying the critical faults and assessing their effects on the overall system improves the failure classification effectiveness.
Abstract: Complex systems are composed of numerous interconnected subsystems, each designed to perform specific functions. The different subsystems use many technological items that work together, as for the case of cyber-physical systems. Typically, a cyber-physical system is composed of different mechanical actuators driven by electrical power devices and monitored by sensors. Several approaches are available for designing and validating complex systems, and among them, behavioral-level modeling is becoming one of the most popular. When such cyber-physical systems are employed in mission- or safety-critical applications, it is mandatory to understand the impacts of faults on them and how failures in subsystems can propagate through the overall system. In this paper, we propose a methodology for supporting the failure mode, effects, and criticality analysis (FMECA) aimed at identifying the critical faults and assessing their effects on the overall system. The end goal is to analyze how a fault affecting a single subsystem possibly propagates through the whole cyber-physical system, considering also the embedded software and the mechanical elements. In particular, our approach allows the analysis of the propagation through the whole system (working at high level) of a fault injected at low level. This paper provides a solution to automate the FMECA process (until now mainly performed manually) for complex cyber-physical systems. It improves the failure classification effectiveness: considering our test case, it reduced the number of critical faults from 10 to 6. The remaining four faults are mitigated by the cyber-physical system architecture. The proposed approach has been tested on a real cyber-physical system in charge of driving a three-phase motor for industrial compressors, showing its feasibility and effectiveness.

10 citations


Proceedings ArticleDOI
25 May 2020
TL;DR: This paper proposes a systematic approach to identify faults that do not disrupt safety-critical functionalities and consequently can be considered Safe, by deploying code coverage and formal verification techniques, and enables the classification of faults that are unclassified by other technologies, improving ISO26262 compliance.
Abstract: The development of Integrated Circuits for the Automotive sector poses major challenges. ISO26262 compliance, as part of this process, entails complex analysis for the evaluation of potential random hardware faults. This paper proposes a systematic approach to identify faults that do not disrupt safety-critical functionalities and consequently can be considered Safe. By deploying code coverage and Formal verification techniques, our methodology enables the classification of faults that are unclassified by other technologies, improving ISO26262 compliance. Our results, in combination with Fault Simulation, achieved a Diagnostic Coverage of 93% in a CAN Controller. These figures allow an initial assessment for an ASIL B configuration of the IP.

6 citations


Proceedings ArticleDOI
19 Oct 2020
TL;DR: Experimental results showed that the presence of specific encryption mechanisms alone induces high fault detection rates in such applications, which may allow the designer to consider security and safety mechanisms together, achieving the same results with lower costs.
Abstract: Nowadays, many electronic systems store valuable Intellectual Property (IP) information inside Non-Volatile Memories (NVMs). Therefore, encryption mechanisms are widely used in order to protect such information from being stolen or modified by human attacks. Encryption techniques can be used for protecting the application code, or sensitive sets of data in the NVM. In particular, in machine-learning applications, the weights of an Artificial Neural Network (ANN) represent a highly valuable IP stemming from long time invested in training the system along the development phase. On the other side, systems implementing ANN applications are increasingly used in safety-critical domains (e.g., autonomous driving), where a high reliability level is required. In a previous paper, we have shown that encryption techniques, applied to the application code of generic systems, provide a significantly higher error detection rate. In this paper, we focus on an ANN application and we evaluate the detection rate induced by encryption mechanisms for transient faults possibly impacting the ANN weights. We performed experiments on a pre-trained ANN, whose weights represent the sensitive IP of our system. We executed fault injection campaigns to evaluate the ANN resilience when different encryption methods are used. Experimental results showed that the presence of specific encryption mechanisms alone induces high fault detection rates in such applications. This may allow the designer to consider security and safety mechanisms together, achieving the same results with lower costs.

6 citations


Proceedings ArticleDOI
01 Jan 2020
TL;DR: In this article, the authors analyze transient effects in the pipeline registers of a GPU architecture, considering the source of the fault, its effect on the GPU, and the use of software-based hardening techniques.
Abstract: Graphics processing units are available solutions for high-performance safety-critical applications, such as self-driving cars. In this application domain, functional-safety and reliability are major concerns. Thus, the adoption of fault tolerance techniques is mandatory to detect or correct faults, since these devices must work properly, even when faults are present. GPUs are designed and implemented with cutting-edge technologies, which makes them sensitive to faults caused by radiation interference, such as single event upsets. These effects can lead the system to a failure, which is unacceptable in safety-critical applications. Therefore, effective detection and mitigation strategies must be adopted to harden the GPU operation. In this paper, we analyze transient effects in the pipeline registers of a GPU architecture. We run four applications at three GPU configurations, considering the source of the fault, its effect on the GPU, and the use of software-based hardening techniques. The evaluation was performed using a general-purpose soft-core GPU based on the NVIDIA G80 architecture. Results can guide designers in building more resilient GPU architectures.

6 citations


Proceedings ArticleDOI
05 Apr 2020
TL;DR: The objective is to provide researchers with an industrial-grade automotive SoC that includes all essential components, is fully customizable, and enables analysis of functional safety solutions and automotive SoCs configurations.
Abstract: The current demands for autonomous driving generated momentum for an increase in research in the different technologies required for these applications. Nonetheless, the limited access to representative designs and industrial methodologies poses a challenge to the research community. Considering this scenario, there is a high demand for an open-source solution that could support development of research targeting automotive applications. This paper presents the current status of AutoSoC, an automotive SoC benchmark suite that includes hardware and software elements and is entirely open-source. The objective is to provide researchers with an industrial-grade automotive SoC that includes all essential components, is fully customizable, and enables analysis of functional safety solutions and automotive SoC configurations. This paper describes the available configurations of the benchmark including an initial assessment for ASIL B to D configurations.

5 citations


Journal ArticleDOI
TL;DR: A method for generating stimuli to precisely identify permanent high-level faults in an IEEE 1687 reconfigurable scan chain is proposed: the system is modeled as a finite state automaton where faults correspond to multiple incorrect transitions; then, a dynamic greedy algorithm is used to select a sequence of inputs able to distinguish between all possible faults.
Abstract: With the complexity of nanoelectronic devices rapidly increasing, an efficient way to handle a large number of embedded instruments has become a necessity. The IEEE 1687 standard was introduced to provide flexibility in accessing and controlling such instrumentation through a reconfigurable scan chain. Nowadays, together with testing the system for defects that may affect the scan chains themselves, the diagnosis of such faults is also important. This article proposes a method for generating stimuli to precisely identify permanent high-level faults in an IEEE 1687 reconfigurable scan chain: the system is modeled as a finite state automaton where faults correspond to multiple incorrect transitions; then, a dynamic greedy algorithm is used to select a sequence of inputs able to distinguish between all possible faults. Experimental results on the widely-adopted ITC'02 and ITC'16 benchmark suites, as well as on synthetically generated circuits, clearly demonstrate the applicability and effectiveness of the proposed approach: generated sequences are two orders of magnitude shorter compared to previous methodologies, while the computational resources required remain acceptable even for larger benchmarks.

5 citations
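
The idea of greedily picking inputs that tell faulty automata apart can be sketched on a toy example. The Python snippet below is a simplified illustration under stated assumptions, not the algorithm from the article: the two-state machine, the fault names, and the function `greedy_diagnostic_sequence` are all hypothetical, and the greedy rule simply maximizes the number of distinct output histories after each step.

```python
def greedy_diagnostic_sequence(machines, inputs, start, max_len=20):
    """Greedily build an input sequence that separates candidate machines.

    machines maps a name to a table {(state, input): (next_state, output)}.
    Returns the chosen inputs and the output history of every machine.
    """
    state = {name: start for name in machines}
    history = {name: () for name in machines}
    sequence = []
    # Stop when every machine has a unique output history (or at max_len).
    while len(set(history.values())) < len(machines) and len(sequence) < max_len:
        best_inp, best_score = None, -1
        for inp in inputs:
            # Score = number of distinct histories if inp were applied next.
            score = len({history[n] + (machines[n][(state[n], inp)][1],)
                         for n in machines})
            if score > best_score:
                best_inp, best_score = inp, score
        for n in machines:
            nxt, out = machines[n][(state[n], best_inp)]
            state[n] = nxt
            history[n] += (out,)
        sequence.append(best_inp)
    return sequence, history


# Toy fault-free automaton: the output is 1 when the next state is "B".
golden = {("A", "x"): ("B", 1), ("A", "y"): ("A", 0),
          ("B", "x"): ("A", 0), ("B", "y"): ("B", 1)}
# Each hypothetical fault replaces one transition of the golden table.
machines = {
    "golden": golden,
    "fault_1": {**golden, ("A", "x"): ("A", 0)},
    "fault_2": {**golden, ("B", "y"): ("A", 0)},
    "fault_3": {**golden, ("B", "x"): ("B", 1)},
}

sequence, history = greedy_diagnostic_sequence(machines, ["x", "y"], "A")
```

On this toy example the loop ends once all four machines have produced different output histories. A plain greedy rule like this one does not guarantee the shortest sequence, and it can stall when no single input improves the score, which is why the sketch caps the length with `max_len`.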


Proceedings ArticleDOI
13 Jul 2020
TL;DR: Functional test techniques based on parallel Software-Based Self-Test routines to test memory structures in the memory hierarchy of a GPGPU (FlexGripPlus) implementing the G80 architecture of Nvidia are developed.
Abstract: Nowadays, data-intensive processing applications, such as multimedia, high-performance computing and safety-critical ones (e.g., in automotive) employ General Purpose Graphics Processing Units (GPGPUs) due to their parallel processing capabilities and high performance. In these devices, multiple levels of memory are employed to hide latency and increase the performance during the operation of a kernel. Moreover, modern GPGPU architectures implement cutting-edge semiconductor technologies, reducing their size and power consumption. However, some studies proved that these technologies are prone to faults during the operative life of a device, thus compromising reliability. In this work, we developed functional test techniques based on parallel Software-Based Self-Test routines to test memory structures in the memory hierarchy of a GPGPU (FlexGripPlus) implementing the G80 architecture of Nvidia.

Proceedings ArticleDOI
01 Nov 2020
TL;DR: Gathered results show that there is no correlation between stuck-at and path delay fault coverage, and provide guidelines for developing more effective functional test, based on an open-source RISC-V-based processor core as benchmark device.
Abstract: Path Delay fault test currently exploits DfT-based techniques, mainly relying on scan chains, widely supported by commercial tools. However, functional testing may be a desirable choice in this context because it allows catching faults at-speed with no hardware overhead and it can be used both for end-of-manufacturing tests and for in-field test. The purpose of this article is to compare the results that can be achieved with both approaches. This work is based on an open-source RISC-V-based processor core as benchmark device. Gathered results show that there is no correlation between stuck-at and path delay fault coverage, and provide guidelines for developing more effective functional test.

Proceedings ArticleDOI
01 Mar 2020
TL;DR: Different software-based hardening techniques developed to detect SEUs in GPU general-purpose registers are evaluated, and optimizations to improve performance and memory utilization are proposed.
Abstract: Graphics Processing Units (GPUs) are considered a promising solution for high-performance safety-critical applications, such as self-driving cars. In this application domain, the use of fault tolerance techniques is mandatory to detect or correct faults, since they must work properly even in the presence of faults. GPUs are designed with aggressive technology scaling, which makes them susceptible to faults caused by radiation interference, such as the Single Event Upsets (SEUs), which can lead the system to a failure, and that is unacceptable in safety-critical applications. In this paper, we evaluate different software-based hardening techniques developed to detect SEUs in GPUs general-purpose registers and propose optimizations to improve performance and memory utilization. The techniques are implemented in three case-study applications and evaluated in a general-purpose soft-core GPU based on the NVIDIA G80 architecture. A fault injection campaign is performed at register transfer level to assess the fault detection potential of the implemented techniques. Results show that the proposed improvements can be tailored for different scenarios, helping engineers in navigating the design space of hardened GPGPU applications.

Proceedings ArticleDOI
01 Mar 2020
TL;DR: This paper proposes a strategy to quantitatively evaluate the effectiveness of a test for the dissipation system based on an electrothermal model of the cooling system, which allows one to identify the maximum size of the thermal fault tolerated by the dissipation system before the electrical device breaks down.
Abstract: In safety-critical systems, power electronics is widely used, e.g., for driving actuators. High currents and high voltages are often used in power electronics, which may cause considerable heating of the power devices. Hence, different mechanisms for heat dissipation and cooling of power devices are adopted. An excessive temperature increase in the power devices may lead to considerable electrical and mechanical stresses, and overheated electrical devices are subject to more rapid ageing. Therefore, an incorrect behaviour of the dissipation system can seriously damage or even block a safety-critical system. Hence, it is necessary to introduce test mechanisms to check the correct behaviour of the heatsinks. In this paper, we propose a strategy to quantitatively evaluate the effectiveness of a test for the dissipation system. The proposed approach is based on an electrothermal model of the cooling system. It allows one to identify the maximum size of the thermal fault tolerated by the dissipation system before the electrical device breaks down.

Proceedings ArticleDOI
06 Oct 2020
TL;DR: The design and functional verification of a Special Function Unit to execute transcendent and trigonometric operations in GPGPUs are reported, and the experimental results show the significant improvements in performance and accuracy achievable by using these modules in parallel applications running in a GPGPU.
Abstract: General Purpose Graphic Processing Units (GPGPUs) are widely used in data-intensive applications, such as multimedia and high-performance computing. These technologies are currently used also to support safety-critical applications (e.g., in the automotive and industrial domains) to implement computer vision, sensor fusion, or machine learning algorithms, which often require the processing of complex transcendent or trigonometric functions. In these cases, an integrated special function unit in the GPGPU is utilized, which is intended to increase the performance in parallel operations. However, this complex module is not present in most of the available architectural and micro-architectural open-source models of GPGPUs, thus limiting the characterization and analysis of applications using these units. In this work, we report on the design and functional verification of a Special Function Unit to execute transcendent and trigonometric operations in GPGPUs. We integrated the proposed module within an open-source GPGPU (FlexGripPlus) implementing the G80 micro-architecture. The experimental results show the significant improvements in performance and accuracy achievable by using these modules in parallel applications running in a GPGPU.

Book ChapterDOI
06 Oct 2020
TL;DR: In this article, a modular approach to develop functional testing solutions based on the non-invasive Software-Based Self-Test (SBST) strategy is described, where a scalar and modular mechanism is proposed to develop test programs based on schematic organizations of functions allowing the exploration of different solutions using software functions.
Abstract: Graphic Processing Units (GPUs) are promising solutions in safety-critical applications, e.g., in the automotive domain. In these applications, reliability and functional safety are relevant factors. Nowadays, many challenges are impacting the implementation of high-performance devices, including GPUs. Moreover, there is the need for effective fault detection solutions to guarantee the correct in-field operation. This work describes a modular approach to develop functional testing solutions based on the non-invasive Software-Based Self-Test (SBST) strategy. We propose a scalar and modular mechanism to develop test programs based on schematic organizations of functions allowing the exploration of different solutions using software functions. The FlexGripPlus model was employed to evaluate experimentally the proposed strategies, targeting the embedded memories in the GPU. Results show that the proposed strategies are effective to test the target structures and detect from 98% up to 100% of permanent stuck-at faults.

Journal ArticleDOI
TL;DR: Three possible test strategies are compared for testing the correct assembling of heatsinks and the effectiveness of the different test methods considered is assessed on a case study corresponding to a Power Supply Unit (PSU).
Abstract: Power electronics technology is widely used in several areas, such as in the railways, automotive, electric vehicles, and renewable energy sectors. Some of these applications are safety critical, e.g., in the automotive domain. The heat produced by power devices must be efficiently dissipated to allow them to work within their operational thermal limits. Moreover, numerous ageing effects are due to thermal stress, which causes mechanical issues. Therefore, the reliability of a circuit depends on its dissipation system, even if it consists of a simple passive heatsink mounted on the power device. During the Printed Circuit Board (PCB) production, an incorrect assembly of the heatsink can cause a worse heat dissipation with a significant increase of the junction temperatures (Tj). In this paper, three possible test strategies are compared for testing the correct assembling of heatsinks. The considered strategies are used at the PCB end-manufacturing. The effectiveness of the different test methods considered is assessed on a case study corresponding to a Power Supply Unit (PSU).

Proceedings ArticleDOI
30 Mar 2020
TL;DR: The presence of memory encryption alone is able to strongly reduce the probability of Silent Data Corruption, without the need of implementing expensive error detection, on the OpenRISC1200 microprocessor.
Abstract: In most safety-critical systems, the robustness and the confidentiality of the application code are crucial. Such code is generally stored in Non-Volatile Memories (NVMs) that are prone to faults (e.g., due to radiation effects). Unfortunately, faults affecting the instruction code very often result in Silent Data Corruption (SDC). This condition lets faults remain undetected and it can lead to undesirable errors that may compromise the system functionality. Thus, it is desirable that the system is able to detect faults affecting the code memory. To overcome this issue, designers often resort to expensive error detection/correction mechanisms. Furthermore, they also adopt memory encryption techniques to prevent unauthorized, hence malicious, access to the code or to protect it from any unauthorized copy. In this paper, we show that the presence of memory encryption alone is able to strongly reduce the probability of SDC, without the need of implementing expensive error detection. We have performed some experiments on the OpenRISC1200 microprocessor in order to evaluate the impact on reliability stemming from different encryption methods.