SRAM-Based FPGA Systems for Safety-Critical Applications: A Survey on Design Standards and Proposed Methodologies
Summary
1 Introduction and Motivations
- Since the first FPGA device was developed by Xilinx in 1984 with the XC2064 chip, FPGA technology has grown enormously in terms of flexibility, reliability and computational power.
- In [53], the maturity of reconfigurable FPGA technologies for safety-critical applications is discussed.
- As discussed in [11], testing alone cannot guarantee such a requirement; combining fault-tolerant approaches, such as replication and diversity, with testing and other techniques, such as Failure Mode and Effects Analysis (FMEA) and reliability analysis methods, can improve the reliability of the system, but the result is still far from the 10⁻⁹ goal.
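A rough calculation illustrates why testing alone falls short of such a goal. The sketch below assumes that the 10⁻⁹ figure is a failure rate per hour of operation and that failures follow an exponential model with a constant rate; both are illustrative assumptions, not statements taken from [11]. Under these assumptions, the failure-free test time needed to claim a rate λ with confidence C is T = -ln(1 - C) / λ. The function name is hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def failure_free_test_hours(target_rate_per_hour, confidence):
    """Hours of failure-free testing needed to claim the target rate.

    Assumes a constant failure rate (exponential model): if the true rate
    were as high as the target, the chance of seeing zero failures in T
    hours is exp(-rate * T); that chance must be at most 1 - confidence.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if target_rate_per_hour <= 0.0:
        raise ValueError("target rate must be positive")
    return -math.log(1.0 - confidence) / target_rate_per_hour


hours = failure_free_test_hours(1e-9, 0.99)
years = hours / HOURS_PER_YEAR
print(f"{hours:.2e} test hours, about {years:,.0f} years on a single unit")
```

Under these assumptions the required test time is on the order of billions of hours, which is why the text points to a combination of fault tolerance, analysis and testing instead of testing alone.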
2 The FPGA Technology
- Programmable blocks may be simple combinatorial logic (Soft Logic Blocks) or memories, multiplexers, ALUs and other kinds of prefabricated circuitry (Hard Logic Blocks).
- Logic blocks may be programmed to implement a certain functionality, the routing architecture may be programmed to interconnect various blocks, and I/O pads may be programmed to ensure off-chip connections.
- Three FPGA programming technologies exist: static memory (SRAM) based, non-volatile memory (flash and EEPROM) based and antifuse based [46].
- The antifuse programming technology does not allow any reprogramming of the device.
- Many embedded processors that can be placed on FPGA devices exist, among which the authors can mention the Xilinx MicroBlaze and PicoBlaze, and the Altera Nios and Nios II, provided by the FPGA vendors themselves.
3 Standards regulating the design of Hardware Systems in Safety-Critical Systems
- A general framework for the design and development of hardware and software safety-critical systems is the IEC 61508 standard, and in particular the IEC 61508-2 [42] and the IEC 61508-3 [43] for hardware and software systems respectively.
- In all the other application fields, the regulations in force require adopting the standard for the design and development of both software and hardware systems.
- Moreover, after each phase the intermediate products of the phase are verified against the requirements specified in the previous phase: for example, the adequacy of the hardware architecture in fulfilling the requirements specification must be verified, the adequacy of the designed modules and their integration in fulfilling the architecture must be verified and so on.
- Thus, a system breadboard shall be designed and used covering all the operating modes and conditions of the device.
- In the remainder of this section the authors present a brief summary of the requirements imposed by the previously mentioned safety standards for each phase of the V-shaped design lifecycle, placing particular emphasis on the requirements imposed by the ECSS-Q-ST-60-02C standard for FPGAs.
3.1.1 System safety requirements specification
- Starting from the requirements specification document of the whole system, requirements for the FPGA-based system are extracted and analyzed.
- In particular it is recommended to identify those requirements that involve functionalities that allow the system to reach and maintain a given safety level, those functions that allow the system to detect, identify and handle faults and those functions related to performance-and time-critical operations.
- The specification of the system requirements shall contain details relevant to the design, to achieve the safety integrity level and the required target failure measure for the safety function, as specified by the E/E/PE system safety integrity requirements specification.
- In particular, the ECSS-Q-ST-60-02C standard imposes the following additional requirements related to the occurrence of faults due to radiation:
• error handling
• test device on ground and flight
• proof of required fault coverage during tests
- Moreover, the standard imposes the production of a feasibility study in order to estimate the requested power consumption, speed and radiation tolerance.
- At the end of this phase a document that completely collects and defines the requirements for the FPGA-based system is produced.
- This document is required to be complete, unequivocal, clear and precise, verifiable and testable.
- An interesting point is that all the standards highly recommend the use of semi-formal methods, such as logic/function block diagrams, sequence diagrams, data flow diagrams, and of formal methods, such as finite state machines, timed Petri nets, LOTOS, OBJ and Z, for the specification, analysis and verification of high-level system requirements.
3.1.2 System Architecture
- In this phase the overall architecture of the system is defined.
- In particular, the high-level components that will compose the system are identified, the interfaces among them are specified and the input and output of the system are defined.
- Moreover, the decision on how to partition the system into its hardware and software components shall be taken during this design phase.
- Significant effort shall be devoted to identifying a hardware architecture able to fulfill the previously defined safety requirements: for example, architectural-level fault-tolerance schemes are selected in this phase.
- The produced architecture shall be verified according to the previously defined requirements.
3.1.3 System design and behavioral modeling
- In this phase the previously defined architecture is refined into a number of sub-components.
- The high-level behavioral specification of these components is defined in this phase.
- All the standards agree in requiring the use of hardware description languages (behavioral VHDL/Verilog) to describe the behavior of the components, and in requiring the observance of coding guidelines.
- Furthermore, proven-in-use design environments and simulators shall be used.
3.1.4 Module design
- During the module design the high level behavioral model of the design is translated into a structural description composed of the hardware modules in accordance with the architectural design.
- The ECSS-Q-ST-60-02C standard places particular emphasis on the definition of time constraints and of a detailed pin plan for FPGA designs.
- Static analysis tools are used to facilitate this process.
- After all modules have been designed and integrated in the complete system, integration testing shall be performed.
- Testing is usually black-box as the code is not directly checked for errors.
3.1.5 Synthesis, placement and routing
- After the detailed design has been completed, it must be synthesized so as to generate the gate-level netlist implementing the system.
- During this phase, proven-in-use simulation, synthesis tools and technological libraries must be used.
- In the placement and routing phase the synthesized netlist is placed on the chip and routing information is defined in order to meet the timing constraints.
- Moreover, the power and clock distribution is performed.
3.1.6 Final coding
- In the FPGA programming phase the placed and routed design is translated into the programming bitstream, the FPGA device is programmed and the resulting prototype is tested.
- The design validation will be performed on the produced prototypes of the system.
3.2 Validation Process
- After the development phase, the implemented design must be validated.
- At this stage (i) estimated delays shall be verified; (ii) gate-level simulations, formal verification and static timing analysis shall be performed; (iii) key parameters such as voltages, noise, frequencies, bandwidth, power consumption, shall be verified; (iv) functional verification shall be performed.
- Finally, the standard places particular emphasis on the use of IP-cores: when such modules are purchased and used, great attention must be paid to the verification of the IP-core itself and to the verification of the correct integration of the IP-core into the architecture under design.
- Complete system testing will compare the system specifications against the actual system implementation.
- The testers validate whether the requirements are completely and appropriately met.
4 FPGA research proposals, guidelines and lessons learned
- This section surveys research proposals, industrial and academic guidelines, and lessons learned from real-world projects regarding the design and verification of FPGA-based systems in safety-critical application fields.
- The authors have organized the survey following the structure of the V-shaped lifecycle previously presented.
- For each phase of the design process the authors report activities that should be carried out as well as proposed techniques and tools.
4.1 Safety Requirements Specifications
- In accordance with the standards, [35] and [56] confirm that the risk analysis should be carried out during the concept and requirements definition phase, along with the feasibility study and the requirements specification.
- The risk analysis should identify the critical issues of the design and the possible backup solutions, including but not limited to: (i) maturity of the foreseen FPGA device family, including CAD tools, libraries and vendor support; (ii) suitability of the chosen technology for the intended mission; (iii) undetermined I/O behavior and internal initial state during power-up.
- Moreover, the author points out that designers should assess and document the radiation threats to the circuit.
- Then, using the INFORMED design method, the boundary between hardware and software components of the system is identified.
- In [61], Sutton underlines the need for machine-readable formalisms for requirements specification in order to guarantee that all the requirements have been addressed during the design process.
4.2 System Architecture
- At this stage of the design life-cycle the target device shall be chosen and consequently the vendor's CAD tool shall be chosen and purchased.
- The use of hardware description languages, such as Verilog or VHDL, and of CAD tools to produce the architectural design is highly recommended [65].
- For outputs that are critical for the system operation, it is recommended that the corresponding flip-flops are reset asynchronously.
- Concerning power consumption, Habinc suggests avoiding clock signal manipulations that are in conflict with synchronous design methods.
- Finally, the state of the unused pins shall be properly documented.
4.3 Behavioral Modeling and Module Design
- In this phase the defined functionalities, interfaces, interconnections and interactions are implemented [35].
- A strict coding standard should be used to avoid systematic faults due to coding errors: it is suggested to avoid non-synthesizable code and coding instructions that would lead to the insertion of latches.
- In [19], VHDL guidance for safe and certifiable FPGA design is reported.
- A very large number of alternative high-level hardware programming languages has been proposed as intermediate languages between the architectural design and the description of the device structure in a hardware description language.
- Also boundary value tests shall be performed in order to evaluate the robustness of the design.
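As a minimal sketch of how boundary value test inputs might be derived, the Python helper below lists the values at and next to each edge of an integer input range. The function names and the 12-bit sensor range are hypothetical and serve only as an illustration; in a real flow the resulting values would be applied as stimuli to the HDL module under test.

```python
def boundary_values(lo, hi):
    """Return boundary test values for an integer input valid in [lo, hi].

    Covers both edges, their inside neighbours, and the first invalid value
    on each side, sorted and without duplicates.
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    candidates = {lo - 1, lo, lo + 1, hi - 1, hi, hi + 1}
    return sorted(candidates)


def in_range(value, lo, hi):
    """Reference behaviour of a hypothetical range-check module."""
    return lo <= value <= hi


# Hypothetical 12-bit sensor reading accepted only inside [100, 4000].
tests = boundary_values(100, 4000)
results = {v: in_range(v, 100, 4000) for v in tests}
print(results)
```

Using a set keeps the list free of duplicates when the range is very narrow, for example when `lo` equals `hi`.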
4.4 Synthesis, placement and routing, and final coding
- A number of works presenting alternative place-and-route algorithms able to increase the robustness of a given design against faults have been published in recent years.
- The work starts from the consideration that the XTMR tool from Xilinx fails in some cases to protect the design from single event upsets due to the presence of common causes of failure in the routing of the design.
- FPGA vendors do not provide any detail about the structure of the bitstream, and the problem of verifying third-party IP-cores is made harder by the fact that very often these cores are provided as obfuscated or encrypted netlists.
- Thus, designers generally perform testing activities on the programmed device, spending great effort in designing sufficiently effective test cases.
- Recently, Luna Inc. developed a software platform called Change Detection Platform (CDP) [31].
5 Radiation Effects Analysis and Mitigation
- Radiation may produce system malfunctions [6].
- In particular, radiation affecting digital circuits may cause changes in the contents of memory elements and in the value of signals.
- The above-mentioned effect is known as a Single Event Upset (SEU).
- Neither Total Ionizing Dose (TID) nor Single Event Transients (SETs) have been widely studied in SRAM-based FPGAs, since these devices are much more susceptible to SEUs, but they must be considered when other FPGA technologies are used [63, 52].
- Moreover, techniques for the mitigation of SEUs are either highly recommended or mandatory, depending on the safety level.
5.1 SEU Effects Analysis Techniques
- The sensitivity to SEUs of SRAM-based FPGA systems can be analyzed according to four main approaches: accelerated radiation ground testing, fault emulation boards, analytical computation, and fault simulation.
- Unlike radiation testing experiments, fault emulation allows focusing specifically on SEUs in the configuration memory of the FPGA, leaving out any other resources.
- Given the probability of occurrence of an SEU, the model estimates the probability of having a system failure after a given amount of time.
- Moreover, an even smaller number of simulators that specifically address the FPGA technology can be found.
- The only simulator targeting SEUs in FPGAs is SST [34], which works on the register transfer level representation of the system.
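The idea behind the analytical approach can be sketched with a deliberately simple model. The code below is not the model from the surveyed literature: it assumes that upsets arrive as a Poisson process with a constant per-bit rate, that bits are independent, that no scrubbing is performed, and that any upset of a sensitive configuration bit causes a system failure. The function name and all numeric values are hypothetical.

```python
import math


def failure_probability(upset_rate_per_bit_hour, sensitive_bits, hours):
    """Probability that at least one sensitive configuration bit is upset.

    Assumptions (illustrative only): Poisson upsets with a constant per-bit
    rate, independent bits, no scrubbing, and every upset of a sensitive
    bit counted as a system failure.
    """
    if upset_rate_per_bit_hour < 0 or sensitive_bits < 0 or hours < 0:
        raise ValueError("arguments must be non-negative")
    system_rate = upset_rate_per_bit_hour * sensitive_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-system_rate * hours)


# Hypothetical numbers chosen only to exercise the formula.
for t in (1, 24, 24 * 365):
    p = failure_probability(1e-10, 200_000, t)
    print(f"after {t:>5} h: P(failure) = {p:.3e}")
```

The number of sensitive bits is the key input here; in practice it would come from one of the other approaches, such as fault emulation.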
5.2 SEU Mitigation and Correction Techniques
- Many SEU mitigation techniques are discussed in the literature.
- Fabrication process-based techniques aim at reducing the effects of radiation through the use of non-standard CMOS logic gates, such as the Silicon-on-Insulator (SOI) technology from IBM [41] and radiation-hardened memory cells [13].
- A generalization of hardware redundancy is device redundancy, that is, using multiple independent FPGA devices performing the same functionality, whose outputs are then checked by a voting system.
- An additional advantage of design-based techniques is that they can be applied to different levels of design abstraction and can address different fault types.
- With blind scrubbing the whole bitstream is reloaded, irrespective of the occurrence of faults, whereas with selective scrubbing readback operations make it possible to identify faults and correct them with partial reconfigurations.
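The difference between the two scrubbing styles can be sketched with a toy model. The code below is not a vendor API: the frame layout, the helper names (`make_golden`, `inject_seu`, `blind_scrub`, `selective_scrub`) and the use of CRC-32 for the readback comparison are all assumptions made for illustration.

```python
import zlib


def make_golden(num_frames, frame_bytes=8):
    """Build a toy golden bitstream: a list of frames plus per-frame CRCs."""
    frames = [bytes((i * 7 + j) % 256 for j in range(frame_bytes))
              for i in range(num_frames)]
    crcs = [zlib.crc32(f) for f in frames]
    return frames, crcs


def inject_seu(config, frame, bit):
    """Flip one bit of one frame in the (mutable) configuration memory."""
    data = bytearray(config[frame])
    data[bit // 8] ^= 1 << (bit % 8)
    config[frame] = bytes(data)


def blind_scrub(config, golden):
    """Rewrite every frame from the golden copy, faulty or not."""
    for i, frame in enumerate(golden):
        config[i] = frame
    return len(golden)  # number of frames written


def selective_scrub(config, golden, golden_crcs):
    """Read back each frame and rewrite only those whose CRC mismatches."""
    repaired = []
    for i, frame in enumerate(config):
        if zlib.crc32(frame) != golden_crcs[i]:
            config[i] = golden[i]
            repaired.append(i)
    return repaired


golden, crcs = make_golden(16)
config = list(golden)            # configuration memory starts out correct
inject_seu(config, frame=5, bit=3)
repaired = selective_scrub(config, golden, crcs)
print("frames rewritten by selective scrubbing:", repaired)
print("frames rewritten by blind scrubbing:", blind_scrub(config, golden))
```

In this toy model selective scrubbing writes a single frame while blind scrubbing writes all of them, which mirrors the trade-off described above: readback and comparison effort versus unconditional reloading.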
6.1 Hydraulic Leakage Monitoring
- Hydraulic systems are used in aircraft to actuate highly critical components, such as control surfaces and landing gear.
- Leakages may cause pressure losses, which may lead to catastrophic failures, so a Hydraulic Leakage Monitoring (HLM) system is used to detect leakages and isolate defective sections of the hydraulic system by operating shut-off valves.
- Esterel modules are used both for the system and the fault model, thus allowing verification of safety properties in the presence of faults.
- The main safety property is that no more than one valve be closed at the same time, since this condition could block the hydraulic system.
- The Esterel model has then been automatically translated into VHDL, leading to the FPGA implementation.
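The kind of safety property checked here can be illustrated with a small exhaustive check. The sketch below is plain Python, not Esterel, and it does not reproduce the system described in this case study: the number of sections, the priority scheme in `valve_commands` and all function names are hypothetical. Enumerating every combination of leak indications also covers spurious indications caused by sensor faults.

```python
from itertools import product

NUM_SECTIONS = 3  # hypothetical number of isolatable hydraulic sections


def valve_commands(leak_detected):
    """Hypothetical controller: close only the first section reporting a leak.

    Returns a tuple of booleans, True meaning 'shut-off valve closed'.
    """
    closed = [False] * len(leak_detected)
    for i, leak in enumerate(leak_detected):
        if leak:
            closed[i] = True
            break  # priority scheme: never close a second valve
    return tuple(closed)


def at_most_one_closed(commands):
    """Safety property: no more than one valve closed at the same time."""
    return sum(commands) <= 1


def check_all_inputs(controller, n=NUM_SECTIONS):
    """Exhaustively check the property; return the list of violating inputs."""
    return [inputs for inputs in product((False, True), repeat=n)
            if not at_most_one_closed(controller(inputs))]


violations = check_all_inputs(valve_commands)
print("property holds" if not violations else f"violations: {violations}")
```

A naive controller that closes the valve of every section reporting a leak, for example `lambda leaks: tuple(leaks)`, would be flagged by the same check.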
6.2 Reactor Trip System
- Andrashov et al. [3] describe the development and V&V process used for the control logic of reactor trip systems (RTS) implemented with FPGA technology.
- The RTS is the central and most critical part of a nuclear power plant's protection system.
- Figure 5 shows the considered RTS, consisting of three signal channels feeding a two-out-of-three voter.
- The design phase consists of the preliminary electronic design subphase, where the system is modeled at the diagram level and verification is done by design review, and the detailed electronic design subphase, where the system is modeled at the schematics and VHDL level and verification is done by simulation and static analysis.
- Thirty-four algorithms have been identified and tested by simulation with a 100% coverage of input value combinations chosen with the boundary value criterion.
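A two-out-of-three voter of the kind mentioned above is small enough to be verified exhaustively. The following Python sketch is a behavioral illustration only, not the VHDL from [3]; the channel logic, the setpoint value and the function names are hypothetical.

```python
from itertools import product


def vote_2oo3(a, b, c):
    """Two-out-of-three majority vote on three boolean trip signals."""
    return (a and b) or (a and c) or (b and c)


def channel_trip(measurement, setpoint):
    """Hypothetical channel logic: demand a trip at or above the setpoint."""
    return measurement >= setpoint


# Exhaustive check of the voter against a simple counting reference.
for a, b, c in product((False, True), repeat=3):
    assert vote_2oo3(a, b, c) == (a + b + c >= 2)

# One channel stuck low (a hypothetical single fault) cannot mask a trip.
SETPOINT = 155
readings = (160, 158, 0)  # third channel has failed low
trips = [channel_trip(r, SETPOINT) for r in readings]
print("trip demanded:", vote_2oo3(*trips))
```

With only eight input combinations the voter can be covered completely; for the channel logic, values at and around the setpoint are natural candidates under a boundary value criterion.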
6.3 Car Body Controller
- Traub et al. [62] describe the development of an FPGA-based body controller unit (BCU), in charge of controlling a car's electrically operated windows, rear-view mirrors, and other components.
- The adopted development process is centered on model-based design, both for hardware and software.
- The BCU functions are modeled with Simulink and Stateflow diagrams, from which HDL code (for hardware modules) and C code (for software) are automatically generated.
- The code is then synthesized for the Xilinx Spartan 3 FPGA.
- The authors report data on resource requirements for different architectural approaches.
7 Open Issues
- Many issues are still unsolved and make the application of SRAM-based FPGA devices in the safety-related parts of systems still problematic.
- The lack of such tools, again, forces designers to rely on the correctness of the translation tool provided by the device vendor and on the trustworthiness of the IP-core provider.
- Finally, partial dynamic reconfiguration in safety-critical applications is still an open issue.
- This gets even worse when a number of iterations of design and sensitivity analysis is required before achieving an acceptably robust design.
8 Conclusions
- This paper summarizes the design standards for the development of FPGA-based systems in safety-critical applications together with the literature proposals, industrial and academic guidelines, and lessons learned from real projects.
- Three main points about the design of FPGA-based systems in safety-critical application fields can be identified.
- The first point is that it is strongly recommended to start the design of a safety-critical FPGA-based system only after a well structured and well documented design flow has been identified.
- The second recommendation is never to trust completely the CAD tools provided by the FPGA device vendor, and always to verify the intermediate products of all phases of the design process using external tools (both simulation tools and formal methods).
- Finally, even if the design and development process of an FPGA-based system is very much like the design and development process of a software system, the designer must know in depth all the technological details of the final target device that will host the system, such as special I/O pins, working frequency range, temperature, voltage and humidity ranges.
Frequently Asked Questions (16)
Q2. What is the purpose of the placement and routing phase?
In the placement and routing phase the synthesized netlist is placed on the chip and routing information is defined in order to meet the timing constraints.
Q3. What is the way to avoid asynchronous design?
Since the choice between synchronous or asynchronous design is made at the HDL description phase, the use of simple HDL source code templates that are available from the FPGA vendors is highly recommended in order to avoid coding errors that would lead to an asynchronous architecture when the desired architecture was synchronous, or vice versa.
Q4. How long does it take to fabricate a full custom design?
A full custom design may need up to three fabrication iterations and thus up to twelve or even eighteen months between the product conception and its availability to customers.
Q5. What is the importance of using self-checking test benches?
For complex designs it is important to use self-checking test benches, that can perform the test activity, automatically check the results and produce a test report, without requiring a visual inspection of the waveforms.
Q6. What is the effect of radiation on the configuration memory?
In particular, radiation affecting digital circuits may cause changes in the contents of memory elements and in the value of signals.
Q7. What are the main issues related to the design of FPGA-based systems?
The main issues related to the design of FPGA-based systems and to their adoption in safety-critical application fields are the lack of standards specifically addressing the FPGA technology and the severe susceptibility of FPGA devices to the effects of radiation.
Q8. What is the main advantage of design-based techniques?
An additional advantage of design-based techniques is that they can be applied to different levels of design abstraction and can address different fault types.
Q9. What does the FPGA need to store the configuration data while the device is not powered?
On the other hand, SRAM-based FPGAs need a supporting non-volatile memory to store the configuration data while the device is not powered.
Q10. What is the importance of the power up and down sequences?
Careful attention to the power up and down sequences of FPGA devices should always be paid, since some technologies exhibit an uncontrollable behavior on their input and output pins during these phases.
Q11. What is the general framework for the design and development of hardware and software safety-critical systems?
A general framework for the design and development of hardware and software safety-critical systems is the IEC 61508 standard, and in particular the IEC 61508-2 [42] and the IEC 61508-3 [43] for hardware and software systems respectively.
Q12. Why are SRAM-based FPGAs rarely used in safety-related systems?
SRAM-based FPGA devices are still seldom used in those parts of systems related to the safety of the system itself, due to the vulnerability to faults of the SRAM-based configuration memory [61].
Q13. Why is the solution proposed in [35] asynchronous?
Because of this, the solution proposed in [35] is to assert the internal reset signal asynchronously and to de-assert it synchronously.
Q14. What type of FPGA can be configured to host a complete microprocessor?
They can be configured to host a complete microprocessor, or even a System-on-Chip, i.e., a complete system, composed of processor, memory and peripherals, all placed on the same chip.
Q15. Why are FPGAs often perceived as easy to modify and correct late in the development process?
Because of this, FPGAs are often perceived by designers as easy to modify and correct late in the development process, thus FPGA based systems are often designed with development methods more similar to a code and fix approach than a true hardware design process, methods that would not be accepted for the design of more costly and less flexible technologies, such as ASICs or microprocessors [17].
Q16. What are the main reasons why FPGAs are becoming more widely used in safety-related applications?
They are more and more widely employed in all application fields, and the interest in using FPGA devices in safety-related applications, such as space missions or railway systems, is growing.