
An error-detection and self-repairing method for dynamically and partially reconfigurable systems

27 May 2013-Vol. 66, Iss: 6, pp 1-7
TL;DR: This paper proposes a self-repairing method for partially and dynamically reconfigurable systems, applied at a fine-grain granularity level, that is able to detect, correct and recover errors using the run-time capabilities offered by modern SRAM-based FPGAs.
Abstract: Reconfigurable systems are gaining an increasing interest in the domain of safety-critical applications, for example in space and avionic applications. In fact, the capability of reconfiguring the system during run-time execution and the high computational power of modern Field Programmable Gate Arrays (FPGAs) make these devices suitable for data processing. Moreover, such systems must also guarantee the abilities of self-awareness, self-diagnosis and self-repair in order to cope with errors due to the harsh conditions typically existing in some environments. In this paper we propose a self-repairing method for partially and dynamically reconfigurable systems applied at a fine-grain granularity level. Our method is able to recover and correct errors using the run-time partial reconfiguration capabilities offered by modern SRAM-based FPGAs. Fault injection experiments have been executed on a dynamically reconfigurable system embedding a number of benchmark circuits. Results demonstrate that the method can achieve full detection of single and multiple errors, while significantly improving the system availability with respect to traditional error detection and correction methods.

Summary (6 min read)

1 INTRODUCTION

  • Technology scaling in the nano-metric domain and beyond supports the increasing usage of high performance and miniaturized embedded systems.
  • Among the available technology solutions, the adoption of SRAM-based FPGAs is the most suitable for the realization of dynamically and partially reconfigurable systems; however, when used in harsh environments, SRAM-based FPGAs have to withstand the radiation effects in the form of Single Event Upsets (SEUs) and Multiple Event Upsets (MEUs), especially affecting their configuration memory [2].
  • On the contrary, the components in the dynamic region correspond to partially reconfigurable resources that can be configured in different ways depending on the system requirements [8].
  • The proposed approach provides significant advantages compared to already developed solutions [9][10], mainly because it increases the error detection and correction capabilities while introducing comparable area and performance overhead.
  • Section 3 describes the proposed method, while the developed design flow is illustrated in Section 4.

2 PREVIOUS WORKS

  • State-of-the-art SRAM-based FPGAs are heterogeneous devices containing several macro blocks, like Digital Signal Processors (DSPs), Block RAMs (BRAMs) and IO Blocks (IOBs), along with Configurable Logic Blocks (CLBs) inside the FPGA reconfiguration fabric.
  • Each of these resource types is arranged in columns that span from top to bottom of the device realizing a column of CLBs, IOBs and BRAM memories interconnected by a mesh of heterogeneous routing resources.
  • In detail, with dynamic reconfiguration, the FPGA configuration memory can be read back continuously without interfering with the circuit functionality and, if any upset is detected, it can be selectively re-written with the correct values, thus avoiding the accumulation of radiation-induced errors [12].
  • Spatial redundancy using Triple Modular Redundancy (TMR) is complementarily used with the read-back and correction techniques: on one side TMR can tolerate faults with the limitation of withstanding a single fault per voting group [13], on the other side read-back and correction avoids the accumulation of errors within the configuration memory.
  • The results achievable with this combined solution are computationally expensive and area hungry.
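The single-fault-per-voting-group limitation of TMR mentioned above can be illustrated with a minimal sketch. The function name `tmr_vote` and the 4-bit example values are illustrative assumptions, not taken from the paper.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three replica outputs."""
    return (a & b) | (a & c) | (b & c)


golden = 0b1011

# A single corrupted replica is outvoted by the two healthy ones.
single_fault = tmr_vote(golden ^ 0b0100, golden, golden)

# Two replicas corrupted in the same bit position defeat the voter.
double_fault = tmr_vote(golden ^ 0b0100, golden ^ 0b0100, golden)
```

Here `single_fault` equals `golden`, while `double_fault` does not, which is why accumulated upsets have to be removed before a second one lands in the same voting group.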

2.1 Main contribution

  • The main contribution of the present work, which is based on the platform preliminarily presented in [8], is the description of an autonomous recovery approach that can be applied to Partially Reconfigurable Modules (PRMs) when errors are detected inside them.
  • The approach is implemented by the static region providing effective capabilities of error detection and correction of faults within the dynamic region.
  • In detail, the proposed method is characterized by the ability to detect MEUs in the FPGA's configuration memory, as well as to recover from any number of faults in the dynamic partition, thus improving on previously developed approaches, such as the one presented in [9], that cannot deal with MEUs.
  • The authors' solution is adaptable to all modern SRAM-based FPGAs equipped with an Internal Configuration Access Port (ICAP) and based on a LUT-slice architecture.

3 THE PROPOSED METHOD

  • The proposed method consists of two flows: one applied to the dynamically reconfigurable region for implementing error detection, the other one for instrumenting the circuit mapped on the FPGA so that it supports the execution of the self-repairing method against single and multiple-bit errors.
  • The static region contains the main processor, which is in charge of controlling the partially reconfigurable system operational functionalities: therefore, it is very important to tolerate and recover errors in these modules.
  • Each RF contains a different number of "minor frames", each having a height equal to the clock region (row) and numbered from left to right.
  • Practically, the F-DWC approach can be adopted by acting at the Hardware Description Language (HDL) level: the combinational functions are duplicated and both copies of the circuit LUTs are placed in a single FPGA slice using two consecutive available LUT positions.
  • In the Multiple Bit Error (MBE) region each pair of LUTs generates a check flag and thus the authors have two check flags per slice.

3.1 Error Detection Method

  • In order to fully explain their proposal, in this section the authors will specifically refer to the architecture of Xilinx Virtex-5 FPGAs.
  • As described in the previous section, the error detection mechanism implemented in the reconfigurable region is based on LUT-based checkers and carry chains for propagating the check flags.
  • Please note that the LUT checkers are only deployed when the carry chain is unavailable for comparison purposes.
  • This allows reducing the performance degradation of the circuit implemented with their method, although in this case the detection mechanism is implemented at the modular level.
  • The authors focus on the method adopted for the error detection using the carry chains for comparison; a more detailed explanation of both the LUT checkers and the carry chains insertion inside the physical place and route description of the circuit will be given in Section 4.3.

3.1.1 Single-bit error detection

  • In order to detect single-bit errors, the authors propose to duplicate each original LUT function into two identical LUTs.
  • The multiplexer "M2" receives an inverted (through the AMUX_2_BX hardwired connection) and buffered copy of the LUT A output at its "0" and "1" inputs while the selection line is tied to LUT B (which is the copy of LUT A) thus effectively performing the EX-NOR function.
  • In case the CLB column contains empty slices the dedicated COUT connection cannot be used to propagate the flag signal upwards along the column.
  • Errors affecting flip-flops cannot be directly detected.
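The duplicate-and-compare idea described in this section can be sketched in software. This is a behavioral model only, under several assumptions: a LUT is modeled as an integer truth table, the check flag is taken as the complement of the EX-NOR of the two copies (so 1 means mismatch), the carry chain is modeled as a running OR, and every pair is driven with the same inputs. The names `lut_eval`, `check_flag` and `column_flag` are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def lut_eval(init: int, inputs: Sequence[int]) -> int:
    """Evaluate a LUT whose truth table is `init`; inputs[0] is the address LSB."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return (init >> address) & 1


def check_flag(init_a: int, init_b: int, inputs: Sequence[int]) -> int:
    """Return 1 when the original LUT and its copy disagree."""
    xnor = 1 - (lut_eval(init_a, inputs) ^ lut_eval(init_b, inputs))
    return 1 - xnor


def column_flag(pairs: Iterable[Tuple[int, int]], inputs: Sequence[int]) -> int:
    """Model the flag propagated along a column as the OR of all pair flags."""
    flag = 0
    for init_a, init_b in pairs:
        flag |= check_flag(init_a, init_b, inputs)
    return flag


AND6 = 1 << 63               # 6-input AND: only address 63 yields 1
UPSET = AND6 ^ (1 << 63)     # same LUT with one truth-table bit flipped

flag_clean = column_flag([(AND6, AND6)], [1] * 6)
flag_hit = column_flag([(AND6, AND6), (AND6, UPSET)], [1] * 6)
flag_masked = column_flag([(AND6, UPSET)], [0] * 6)
```

In this model `flag_hit` is 1 and `flag_masked` is 0: an upset truth-table bit raises the flag only when the applied inputs address the corrupted entry.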

3.1.2 Multiple-Bit error detection

  • Multiple bit errors can only be detected if the error detecting carry chain is inserted in a specific pattern that the authors will mention in this section.
  • In order to reduce the number of flags the authors propose the usage of 2 slices (out of the available 20) for merging the check flags by OR-ing them.
  • As the authors are producing two flags for each clock region (one for odd and one for even slices) they can have a maximum of 72 LUTs (out of 80 LUTs in an even or odd slice column) configured for computations in any slice column location (even or odd) within a single clock region.
  • Thus, the MBE regions require an overhead of 11.11 % for flag reduction.
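The 11.11% figure can be reproduced from the numbers quoted above. The sketch below assumes that the overhead is expressed relative to the 72 LUTs left for computation (relative to all 80 LUTs it would be 10%); the variable names are illustrative.

```python
SLICES_PER_COLUMN = 20   # slices in an even (or odd) slice column of one clock region
LUTS_PER_COLUMN = 80     # LUTs in that slice column
MERGE_SLICES = 2         # slices reserved for OR-ing the check flags

luts_per_slice = LUTS_PER_COLUMN // SLICES_PER_COLUMN
merge_luts = MERGE_SLICES * luts_per_slice
compute_luts = LUTS_PER_COLUMN - merge_luts

# Overhead of the flag-merging LUTs relative to the LUTs used for computation.
overhead_pct = round(100 * merge_luts / compute_luts, 2)
```

With these inputs, 8 LUTs are reserved for flag merging, 72 remain for computation, and `overhead_pct` evaluates to 11.11.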

3.2 Error Correction Method

  • Data errors affecting combinational logic or Flip-Flops are individuated by the error detection scheme previously described.
  • Secondly, the clock enabling signals should be de-activated to disable the propagation of errors to the next stages in the design.
  • This is possible since both static and dynamic regions have well-defined interfaces with clock enabling registers.
  • Lastly, the main processor controller enables the clock to re-start the normal operation in the DUT region involved in the correction.
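The ordering of the correction steps can be summarized as a small behavioral sketch. It is not the authors' controller code: `DutRegion`, `recover` and the dictionary-based frame storage are hypothetical, and the single-frame rewrite follows the paper's statement that correction is performed on a single reconfigurable frame.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DutRegion:
    golden: Dict[int, bytes]          # reference content of each frame
    frames: Dict[int, bytes]          # current (possibly corrupted) content
    clock_enabled: bool = True
    log: List[str] = field(default_factory=list)


def recover(region: DutRegion, faulty_frame: int) -> None:
    """Stop the clock, rewrite one frame from the reference copy, restart."""
    region.clock_enabled = False
    region.log.append("clock disabled")
    region.frames[faulty_frame] = region.golden[faulty_frame]
    region.log.append(f"frame {faulty_frame} rewritten")
    region.clock_enabled = True
    region.log.append("clock enabled")


golden = {0: b"\x00\x01", 1: b"\x02\x03"}
region = DutRegion(golden=golden, frames=dict(golden))
region.frames[1] = b"\x02\x07"        # emulate an upset in frame 1
recover(region, faulty_frame=1)
```

Disabling the clock before the rewrite mirrors the requirement that errors must not propagate to the next stages while the frame is being corrected.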

4 DESIGN FLOW

  • In this section the authors describe the tool flow they developed in order to insert fine-grain duplication with comparison using the built-in slice carry chains.
  • A pre-map step generates a number of constraints for directed packing, placement and sites prohibitions, while a post-map step inserts the error detecting carry chains and the convergence logic required to reduce the number of flag signals.
  • This post-map modification is implemented by modifying the XDL (Xilinx Design Language) file, i.e., the Xilinx interface for interacting with the Xilinx CAD flow.
  • The tool flow has been developed as a C++-based software environment making heavy use of the Boost library and Tools for Open Reconfigurable Computing (TORC).

4.1 Net-list Extraction

  • The flow starts by parsing the net-list description of the circuit implemented into the dynamic region, which was duplicated at the Hardware Description Language (HDL) level.
  • It is important that both instances of the design are labeled with "inst1" and "inst2" so that each synthesized element contains the hierarchical information of the top level instance to which it belongs.
  • Global reset/clock signals are not duplicated at the module-level, as will be explained in Section 4.2.
  • The post-synthesis Verilog file contains the circuit net-list using the Xilinx primitive cell library elements.
  • In detail, each node of the graph corresponds to a data structure with a number of fields including: functional string, instance name, inputs vector, outputs vector and type of primitive element (LUT or FF).
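The node structure listed above might look as follows. The paper's tool is written in C++, so this Python version is only a sketch: the field names paraphrase the listed fields, the "/" hierarchy separator is an assumption, and `pair_replicas` is a hypothetical helper showing why the "inst1"/"inst2" labels matter.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List


class PrimitiveType(Enum):
    LUT = "LUT"
    FF = "FF"


@dataclass
class NetlistNode:
    functional_string: str                       # e.g. the LUT truth table
    instance_name: str                           # hierarchical name
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    primitive: PrimitiveType = PrimitiveType.LUT


def pair_replicas(nodes: Iterable[NetlistNode]) -> Dict[str, Dict[str, NetlistNode]]:
    """Group nodes by their name with the top-level replica prefix removed."""
    pairs: Dict[str, Dict[str, NetlistNode]] = {}
    for node in nodes:
        replica, _, local_name = node.instance_name.partition("/")
        pairs.setdefault(local_name, {})[replica] = node
    return pairs


nodes = [
    NetlistNode("8000000000000000", "inst1/and6", ["a"], ["y1"]),
    NetlistNode("8000000000000000", "inst2/and6", ["a"], ["y2"]),
    NetlistNode("", "inst1/state_reg", ["d"], ["q"], PrimitiveType.FF),
]
pairs = pair_replicas(nodes)
```

Grouping by the local name gives, for each original element, the two copies that later have to be packed together and compared.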

4.2 DUT Regions Formation and Constraints Generation

  • Once the circuit net-list is created in the form of a graph, it is necessary to generate user constraints, represented within the User Constraints File (UCF) in order to perform the DUT physical space division into regions and for packing the primitive cells into slices.
  • Thirdly, LUTs with 6 inputs are grouped to form single bit error detection regions.
  • For this reason the global clock and reset signals were not duplicated due to the architectural limitation of state-of-the-art FPGA devices.
  • Slices in the single bit region use names like "SBESlice1", "SBESlice2" and so on.
  • The algorithm illustrated in figure 6 performs the generation of the constraints used for the floorplanning of the circuit including the mapping of the SBE and MBE regions.

4.3 Low-level Manipulations

  • Once the mapping is performed, the insertion of the carry chain and the definition of the comparator resources are implemented by modifying the physical place and route description of the circuit in order to properly use the hardwired combinational gates.
  • Each inserted carry chain is labeled with a unique reference to differentiate it with respect to the ones used for arithmetic computation.
  • It is also interesting to note that for each OR LUT an automatic procedure searches for an empty slice in the same CLB column and picks up the nearest one in terms of the slice site distance for the OR LUT placement.
  • The single bit error region flags are converged resulting in error detection carry chains of varying lengths.
  • Therefore, the placement should be such that an optimal balance between the usage of OR LUTs for flag convergence and the routing congestion is achieved.
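The search for the nearest empty slice in the same CLB column can be sketched as below. Representing slice sites by their row index within the column, and measuring distance as the absolute row difference, are simplifying assumptions; `nearest_empty_slice` is a hypothetical name.

```python
from typing import Iterable, Optional


def nearest_empty_slice(or_lut_row: int, empty_rows: Iterable[int]) -> Optional[int]:
    """Return the empty slice row closest to `or_lut_row`, or None if none exist."""
    return min(empty_rows, key=lambda row: abs(row - or_lut_row), default=None)


chosen = nearest_empty_slice(10, [2, 9, 15])
no_room = nearest_empty_slice(10, [])
```

When two empty slices are equally distant, this sketch keeps the first one encountered; a real placer would likely break the tie using routing congestion, which is the balance the last bullet refers to.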

5 EXPERIMENTAL RESULTS

  • The authors implemented the proposed method targeting a Xilinx Virtex-5 LX110T SRAM-based FPGA.
  • Based on an ad-hoc hardware unit), the authors adopted the Microblaze processor since it represents a state-of-the-art solution for a dynamically and partially reconfigurable system based on static and dynamic regions [4] .
  • Moreover, another GPIO port connected to the flags stemming from the DUT region and configured in interrupt mode is responsible for informing the Microblaze in case of errors.
  • The bit-stream for the C-DWC region is stored as a partial bit-stream by reading it with the ICAP from the start address to the end address.
  • In the following sections, the authors present several results mainly related to the ability of quick error detection, localization and repairing.

5.1 Area Overhead

  • The circuits include some relevant ITC'99 benchmark circuits with various complexity, two implementations of the CORDIC arithmetic processor, a miniMIPS processor, a lightweight 8080 SoC, an RS-Decoder and a DCT core from the opencores repository [24] [25] .
  • Please note that the authors did not include the amount of resources related to the static region within the area count since the static region remains the same in any Dynamically Reconfigurable system, no matter the adopted solution.
  • If compared with DMR, their approach requires 10% more resources on the average; however, DMR cannot correct errors, while their approach corrects errors and reduces the probability of single points of failure thanks to the developed fine-grain combinational logic infrastructure.
  • The authors underline that the area comparison has been performed directly on the basis of LUTs and FFs counts; if comparison is made considering the number of FPGA slices, the ratio may be slightly different due to stringent packing and placement requirements adopted for the fine-grain redundancy with comparison logic.
  • In particular, slices are used as a route-through and FFs may be placed in separate slices, since the FFs require different control signals that could not be packed together with LUTs.

5.2 Error Detection Latency

  • The measurement of the error detection latency is a key factor in building a self-repairing system able to autonomously repair itself while obeying real-time constraints.
  • The results the authors obtained are illustrated in Table II, which shows the maximum error detection latency for SBE and MBE regions.
  • In detail, the table reports the length of the carry chain detector, the delay latency with routing and logic contributions of the SBE region, as well as the distance from the detector and the delay latency for the MBE region.
  • It is notable that the SBE region latency is larger than for the MBE region because all the carry chains in each CLB that resides in the same column have been connected in a unique CLB column.
  • Two alternatives have been used in order to reduce the routing delay time.

5.3 Error Correction and Detection

  • The effectiveness of the proposed approach concerning the error correction and detection capabilities has been evaluated through the execution of a number of fault injection campaigns.
  • The experiments have been performed on the Xilinx Virtex-5 LX110T SRAM-based FPGAs by injecting transient faults into the FPGA's configuration memory and evaluating the circuit's response through the execution of circuit specific workloads.
  • Please note that the faulty bitstreams are generated by corrupting the FPGA's configuration memory bits belonging to the dynamic region, while the static region was kept fault free.
  • Table III shows the fault injection results, where for each circuit 10,000 Single Event Upsets (SEUs) have been randomly injected into the whole FPGA configuration memory bits related to the reconfigurable region.
  • All the circuits have been emulated at 50 MHz and SEUs are practically injected by downloading the corrupted bitstreams into the FPGA configuration memory.
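Generating a corrupted bitstream amounts to flipping randomly chosen configuration bits. The sketch below treats the dynamic region's bits as a flat byte string and ignores the frame and column structure, so it does not enforce the constraint that the two bits of an MEU fall in different slice columns; `inject_upsets` and `hamming_distance` are hypothetical helpers.

```python
import random
from typing import List, Tuple


def inject_upsets(bitstream: bytes, n_bits: int,
                  rng: random.Random) -> Tuple[bytes, List[int]]:
    """Flip `n_bits` distinct, randomly chosen bits and report their positions."""
    corrupted = bytearray(bitstream)
    positions = rng.sample(range(len(bitstream) * 8), n_bits)
    for pos in positions:
        corrupted[pos // 8] ^= 1 << (pos % 8)
    return bytes(corrupted), positions


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equally long byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


rng = random.Random(0)                 # fixed seed for repeatable campaigns
golden = bytes(16)                     # stand-in for the region's bitstream
seu_stream, seu_bits = inject_upsets(golden, 1, rng)   # single upset
meu_stream, meu_bits = inject_upsets(golden, 2, rng)   # two-bit upset
```

Each corrupted stream would then be downloaded to the device and the circuit outputs compared against a fault-free run of the same workload.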

Table III. Fault injection campaign experimental results

  • In detail, the Wrong Answer column reports the number of SEUs and MEUs provoking a wrong answer on the circuit outputs; the Corrected column reports the number of SEUs and MEUs properly corrected by their approach.
  • Please note that the MEU effect considered in their experiments always occurs in different slice columns involving the modification of two configuration memory bits.
  • These results show the effectiveness of their approach, which is able to correct more than 98% of the injected errors provoking wrong answers for all the considered circuits.
  • The authors also measured the recovery time; in Table IV they reported the worst recovery time measured for all the circuits during the execution of the fault injection campaigns.
  • The authors also computed the recovery time required by the redundancy approaches, such as TMR and DMR, using active configuration memory scrubbing of all the reconfigurable region area, which is about 1.2 ms; their approach shows an improvement of more than one order of magnitude, and the advantage provided by their approach is extremely large on all the considered circuits.

5.4 Timing Analysis

  • Finally, the authors evaluated the impact on the circuit maximal working frequency on all the benchmark circuits comparing their approach with the DMR and TMR redundancy based techniques.
  • In order to elaborate the timing data the authors used the static timing analysis tool provided by the Xilinx ISE environment.
  • This phenomenon is due to the unconventional block placement of logic resources on slice columns for different circuit regions.
  • This aspect affects the timing of the circuit because their technique does not include an optimal floorplan implementation of the different circuit regions.
  • In figure 8 , the authors illustrated the obtained results showing the percentage contribution of each design phase constraints on the overall circuit delay: LUT blocks, SBE region, MBE region and Detectors.


10 August 2022
POLITECNICO DI TORINO
Repository ISTITUZIONALE
An Error-Detection and Self-Repairing Method for Dynamically and Partially Reconfigurable Systems / SONZA REORDA, Matteo; Sterpone, Luca; Ullah, Anees. - In: IEEE TRANSACTIONS ON COMPUTERS. - ISSN 0018-9340. - ELECTRONIC. - 66:6(2017), pp. 1022-1033. [10.1109/TC.2016.2607749]
Original
An Error-Detection and Self-Repairing Method for Dynamically and Partially Reconfigurable Systems
IEEE postprint/Author's Accepted Manuscript
Publisher:
Published
DOI:10.1109/TC.2016.2607749
Terms of use:
openAccess
Publisher copyright
©2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any
current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating
new collecting works, for resale or lists, or reuse of any copyrighted component of this work in other works.
This article is made available under terms and conditions as specified in the corresponding bibliographic description in
the repository
Availability:
This version is available at: 11583/2658319 since: 2016-11-30T14:11:58Z
IEEE

0018-9340 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TC.2016.2607749, IEEE
Transactions on Computers
2 IEEE TRANSACTIONS ON COMPUTERS
An Error-Detection and Self-Repairing Method for
Dynamically and Partially Reconfigurable Systems
Abstract: Reconfigurable systems are gaining an increasing interest in the domain of safety-critical applications, for example in the space and avionic domains. In fact, the capability of reconfiguring the system during run-time execution and the high computational power of modern Field Programmable Gate Arrays (FPGAs) make these devices suitable for intensive data processing tasks. Moreover, such systems must also guarantee the abilities of self-awareness, self-diagnosis and self-repair in order to cope with errors due to the harsh conditions typically existing in some environments. In this paper we propose a self-repairing method for partially and dynamically reconfigurable systems applied at a fine-grain granularity level. Our method is able to detect, correct and recover errors using the run-time capabilities offered by modern SRAM-based FPGAs. Fault injection campaigns have been executed on a dynamically reconfigurable system embedding a number of benchmark circuits. Experimental results demonstrate that our method achieves full detection of single and multiple errors, while significantly improving the system availability with respect to traditional error detection and correction methods.
Index Terms: Self-Repair; Partial and Dynamic Reconfiguration; Single Event Upsets (SEUs); Multiple Event Upsets (MEUs)
1 INTRODUCTION
Technology scaling in the nano-metric domain and beyond supports the increasing usage of high performance and miniaturized embedded systems. However, the quest for pushing the limits of technology to the ultra-nano scale devices has exacerbated concerns related to power consumption and reliability that had not been envisioned before. In particular, one of the major issues in safety-critical applications (especially in the space and avionic domains) is the run-time mitigation of various radiation-induced fault effects, which may provoke transient and permanent modifications of the electronic circuit's behavior. The problem is widely known and various methods have been developed and proposed in the area during the last decade. The ubiquity of embedded systems for safety-critical applications operating in radiation environments demands continuous and successful operations of the system by autonomously overcoming possible malfunctions. This condition requires the abilities of autonomous error detection, self-diagnosis and self-repair [1].
Among the available technology solutions, the adoption of SRAM-based FPGAs is the most suitable for the realization of dynamically and partially reconfigurable systems; however, when used in harsh environments, SRAM-based FPGAs have to withstand the radiation effects in the form of Single Event Upsets (SEUs) and Multiple Event Upsets (MEUs), especially affecting their configuration memory [2].
The increased probability of MEUs hitting the configuration memory of an FPGA can limit the effectiveness of traditional redundancy-based fault-tolerance approaches [3]. In fact, particles can hit the same logic group of circuit replicas enabling erroneous results to propagate. To cope with this scenario, researchers have recently investigated the fine-grain redundancy and its resilience to MEUs [4][5][6]. However, for proper shielding against high failure rate while minimizing redundancy overhead in terms of area, speed and power consumption, systems should be designed with accurate mixed-grain redundancy and self-repair properties which are not feasible for fine-grain redundancy.
State-of-the-art SRAM-based FPGAs have the technology supporting run-time dynamic and partial reconfiguration (DPR), which can be used for adaptive behavior as well as for fault repairing [7]. A self-repairing system adopting the partial dynamic reconfiguration capabilities of SRAM-based FPGAs is often divided into two parts, called static region and dynamic region. The logic and routing resources and the corresponding configuration memory frames, identified by means of clock regions and major and minor columns as illustrated in figure 1, are organized in a static region, also called base region. The static region typically consists of a microprocessor, some memory modules and input/output ports, as described in figure 2. In general, these components are not reconfigured and their full functionality is constantly required for implementing the correct operations of the system; for this reason the static region is often hardened using a traditional redundancy-based approach, such as Triple Modular Redundancy (TMR) [7]. The static region is also responsible for the reconfiguration of the modules placed into the reconfigurable regions. On the contrary, the components in the dynamic region correspond to partially reconfigurable resources that can be configured in different ways depending on the system requirements [8]. The dynamically reconfigurable regions idea is extended in this paper so that the system is able to also correct the identified errors by applying internal reconfiguration.
M. Sonza Reorda, Fellow, IEEE, L. Sterpone, Member, IEEE, A. Ullah, Student Member, IEEE
M. Sonza Reorda, L. Sterpone and A. Ullah are with the Dipartimento di Automatica e Informatica (DAUIN), Politecnico di Torino, Torino, Italy. For any information please refer to: luca.sterpone@polito.it (contact author).

The proposed approach provides significant advantages compared to already developed solutions [9][10], mainly because it increases the error detection and correction capabilities while introducing comparable area and performance overhead.
In order to practically prove the effectiveness of the approach, we developed a complete set of tools for the automatic generation of the constraints used for the partitioning of the dynamic regions. The developed set of tools directly acts at the physical level, automatically inserting a carry chain into the physical net-list and adding comparator check flags into the circuitry; moreover, the tool is able to cleverly place the different partitions of the dynamic region into proper sub-regions, thus allowing SEUs and MEUs correction. The proposed approach drastically improves the solution in [9], which uses the built-in slice carry chain for error detection only.
Our approach introduces a minimal area overhead, which is strictly dependent upon the number of user-defined partitions. On average, the overhead introduced by our approach is around 11% with respect to the duplication-based approach; hence, the proposed technique uses far fewer computational resources than the standard TMR solution. Furthermore, correction is performed on a single reconfigurable frame, which is the smallest amount of reconfigurable information that can be read or written; therefore, we can achieve the highest availability limits offered by the current reconfigurable technology.
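As a rough worked comparison of the area figures in this paragraph, assume the unhardened circuit occupies area $A$, plain duplication costs about $2A$, and TMR costs at least $3A$ before voters are counted. These normalizations are illustrative assumptions rather than figures from the paper; under them, an overhead of about 11% on top of duplication gives:

```latex
\[
A_{\text{proposed}} \approx 1.11 \times 2A = 2.22\,A \;<\; 3A \le A_{\text{TMR}}
\]
```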
The paper is organized as follows. Section 2 gives an overview of the soft error detection and correction methods implemented with modern SRAM-based FPGAs and summarizes the major contributions of this paper. Section 3 describes the proposed method, while the developed design flow is illustrated in Section 4. Experimental results on the selected case study and their analysis are presented in Section 5. Finally, conclusions and future works are described in Section 6.
2 PREVIOUS WORKS
State-of-the-art SRAM-based FPGAs are heterogeneous devices containing several macro blocks, like Digital Signal Processors (DSPs), Block RAMs (BRAMs) and IO Blocks (IOBs), along with Configurable Logic Blocks (CLBs) inside the FPGA reconfiguration fabric. Each of these resource types is arranged in columns that span from top to bottom of the device realizing a column of CLBs, IOBs and BRAM memories interconnected by a mesh of heterogeneous routing resources. Each SRAM-based FPGA chip is organized in a number of rows dependent on the manufacturer families or specific part. The most advanced devices have CLB rows connected to global resources as well as local clock sources [11]. In order to harden circuits implemented on SRAM-based FPGAs, different architecture level techniques have been proposed in the past. However, we can broadly classify them into two main techniques, namely fault masking and fault correction. In the next part of this section we will present a detailed discussion of the previous research work in each category.
In recent years, two different mitigation approaches have been proposed to mitigate SEUs affecting the configuration memory of SRAM-based FPGAs. On one side, full hardware redundancy obtained thanks to Triple Modular Redundancy (TMR) is used to identify and correct logic values. This solution presents a large overhead in terms of area, power and especially delay, since it triplicates all the combinational and sequential logic, and the architecture introduces delay penalties for the voter propagation time and the routing congestion. On the other side, redundancy approaches are nowadays combined with scrubbing, which consists in periodically reloading the complete content of the FPGA’s configuration memory. A more complex system is used to correct the information in the configuration memory by using read-back and partial configuration procedures. Through the read-back process the content of the FPGA’s configuration memory is read and compared with the expected value, which is stored in a dedicated memory located outside of the FPGA. With the advent of modern SRAM-based FPGAs this operation may be performed through dynamic reconfiguration. In detail, with dynamic reconfiguration, the FPGA configuration memory can be read back continuously without interfering with the circuit functionality and, if any upset is detected, it can be selectively re-written with the correct values, thus avoiding the accumulation of radiation-induced errors [12]. However, the main drawback of this technique is the huge detection and correction time that makes it useless for real-time operations and ineffective versus the single point of failure induced by configuration memory bit-flips. Spatial redundancy using Triple Modular Redundancy (TMR) is complementarily used with the read-back and correction techniques: on one side TMR can tolerate faults with the limitation of withstanding a single fault per voting group [13], on the other side read-back and correction avoids the accumulation of errors within the configuration memory. The combination of TMR and self-healing using dynamic partial reconfiguration has been previously used in [14][15]. However, the results achievable with this combined solution are computationally expensive and area hungry.
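The read-back and selective rewrite procedure described above can be sketched as a loop over configuration frames. Modeling the configuration memory and the externally stored reference as lists of byte strings is an assumption made for illustration, and `scrub` is a hypothetical name rather than a vendor API.

```python
from typing import List


def scrub(config_memory: List[bytes], golden: List[bytes]) -> List[int]:
    """Compare every frame with its reference copy and rewrite the ones that differ.

    Returns the indices of the frames that had to be repaired.
    """
    repaired = []
    for index, (frame, expected) in enumerate(zip(config_memory, golden)):
        if frame != expected:
            config_memory[index] = expected   # selective rewrite of one frame
            repaired.append(index)
    return repaired


golden = [b"\x00\x00", b"\xff\x0f", b"\x12\x34"]
memory = list(golden)
memory[1] = b"\xff\x0e"                      # emulate a single upset in frame 1
repaired_frames = scrub(memory, golden)
```

The loop visits every frame regardless of where the upset is, which is one way to see why the detection time of this technique grows with the size of the scanned region.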
Reconfiguration at the gate level is used in fine-grain
approaches [16], which are particularly efficient in
terms of area overhead but suffer from a complex and
inflexible control mechanism. Furthermore, because of the
adopted fine granularity, this approach is infeasible for
system-level healing. A self-healing design methodology
based on partial dynamic reconfiguration has been
proposed in [17]. However, the method inserts control
circuitry by partitioning the circuit for error
Fig. 1. Resource and Frame Layout of Modern SRAM based FPGAs (figure labels: Rows (Clock Regions), Major Column, Minor Column, CLB Major Column, 20 CLB High, Frame 0, Frame 1, Frame 25, Frame 35)

0018-9340 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TC.2016.2607749, IEEE
Transactions on Computers
4 IEEE TRANSACTIONS ON COMPUTERS
localization and detection purposes. This requires a sig-
nificant overhead.
A methodology for fault-tolerant architectures using
on-line checkers for fault detection and localization was
introduced in [18]. On-line checkers for TMR- and
duplication-based systems were combined with partial
dynamic reconfiguration in [19], so that the checker
detects and localizes faults while reconfiguration
recovers from the error. However, the detection and
localization logic is implemented as a partially
reconfigurable module, which is itself subject to errors.
An earlier approach based on fine-granularity error
masking was developed in [20]; however, it can only be
applied to TMR schemes with majority voter logic.
Finally, a first overview of recovery architectures for
computationally intensive systems based on SRAM-based
FPGAs has been presented in [21].
2.1 Main contribution
The main contribution of the present work, which is
based on the platform preliminarily presented in [8], is
the description of an autonomous recovery approach that
can be applied to Partially Reconfigurable Modules
(PRMs) when errors are detected inside them. The approach
is implemented by the static region, which provides
effective error detection and correction capabilities
for faults within the dynamic region. Our approach
provides resilience to MEUs, since we adopt a static
region protected with a fine-grain redundancy approach,
as described in [3]. In particular, we propose a new
fine-grain fault detection mechanism applied to FPGA
resources: the approach compares the outputs of Look-Up
Tables (LUTs) using the dedicated carry-propagation
logic, which is generally used for fast arithmetic
computations and is mostly not inferred by design tools,
following the approach preliminarily introduced in [9]
for fault detection. In detail, the proposed method is
able to detect MEUs in the FPGA’s configuration memory
and to recover from any number of faults in the dynamic
partition, thus improving on previously developed
approaches, such as [9], that cannot deal with MEUs. Our
solution is adaptable to all modern SRAM-based FPGAs
equipped with an Internal Configuration Access Port
(ICAP) and based on a LUT-slice architecture.
3 THE PROPOSED METHOD
The proposed method consists of two flows: one applied
to the dynamically reconfigurable region for implement-
ing error detection, the other one for instrumenting the
circuit mapped on the FPGA so that it supports the execu-
tion of the self-repairing method against single and mul-
tiple-bit errors.
A dynamically reconfigurable system, from the architec-
tural perspective, is partitioned into static and dynamic
regions as illustrated in figure 2. The static region consists
of a processor with a static-RAM, some general purpose
IOs, flash memories, and hardware resources for manag-
ing the internal configuration access port connected to the
processor local bus. The static region contains the main
processor, which is in charge of controlling the
operation of the partially reconfigurable system:
therefore, it is very important to tolerate and recover
from errors in these modules. In this paper we assume
that this region is implemented using Triple Modular
Redundancy. By suitably mapping the three copies of the
circuit elements on the device, the static region can be
protected against any single point of failure.
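As a minimal sketch of the voting principle behind TMR (assuming a bitwise 2-out-of-3 majority voter; the function name `majority_vote` and the example values are illustrative only), the following Python model shows that one corrupted copy is outvoted, whereas two copies corrupted in the same bit are not:

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three replica outputs."""
    return (a & b) | (a & c) | (b & c)


correct = 0b1011
faulty = correct ^ 0b0100  # the same output with one bit flipped by an upset

# One faulty replica: the two healthy replicas outvote it.
single_fault = majority_vote(correct, faulty, correct)

# Two replicas hit in the same bit: the voter propagates the error.
double_fault = majority_vote(faulty, faulty, correct)
```

This is why the mapping of the three copies matters: a single upset must never be able to corrupt more than one copy feeding the same voter.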
The dynamic region consists of the resources imple-
menting the user’s circuit. The proposed approach mainly
focuses on the dynamic region, and exploits reconfigura-
tion at the individual frame level for error detection and
correction. The dynamic region can be organized into a
Single Bit Error (SBE) region, a Multiple Bit Error (MBE)
region and a Coarse-Grain Error region. It is first
necessary to introduce some definitions related to the
major characteristics of current SRAM-based FPGAs: modern
FPGAs are divided row-wise into a number of clock regions
for dynamic partial reconfiguration, while column-wise
they are organized in major columns of resources, such as
CLBs, DSPs or IOs. Each major column spans the whole
height of the device, but it is configured in each clock
region (row) by a separate reconfigurable frame (RF). The
number of “minor frames” in an RF depends on the resource
type; each minor frame has a height equal to the clock
region (row), and minor frames are numbered from left to
right. For example, in Xilinx Virtex-5 devices [11] a CLB
RF consists of 36 minor frames (hereafter simply referred
to as frames), which are responsible for the
configuration of LUTs and their routing, while the
configuration bits of a single LUT are distributed over
multiple frames.
From the point of view of the circuit architecture, the
proposed method is based on the Duplication With
Comparison (DWC) technique applied at two different
levels of granularity, herein called Coarse-grained DWC
(C-DWC) and Fine-grained DWC (F-DWC).
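The frame organization described above can be made concrete with a small data model. The sketch below is only an illustration: the `FrameAddress` triple is a hypothetical coordinate scheme, not the device's real frame address format, and only the figure of 36 minor frames per CLB RF is taken from the text.

```python
from dataclasses import dataclass
from typing import List

CLB_MINOR_FRAMES = 36  # Virtex-5 CLB reconfigurable frame, as stated in the text


@dataclass(frozen=True)
class FrameAddress:
    """Hypothetical frame coordinates (not the real address register layout)."""
    row: int    # clock region (row)
    major: int  # major column index
    minor: int  # minor frame inside the reconfigurable frame


def frames_of_rf(row: int, major: int,
                 minor_count: int = CLB_MINOR_FRAMES) -> List[FrameAddress]:
    """List every minor frame of one reconfigurable frame, left to right."""
    return [FrameAddress(row, major, minor) for minor in range(minor_count)]


clb_rf = frames_of_rf(row=2, major=5)
```

Because the configuration bits of one LUT are spread over several of these frames, repairing a LUT may require rewriting more than one entry of such a list.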
The C-DWC is applied for slices that use the carry
chain for computations such as fast additions or multipli-
cations. In this case, the duplication is performed at the
module level and the outputs are compared at the physi-
cal level by LUT elements configured to implement XOR
combinational functions. Our approach is able to directly
modify the circuit physical description in order to use the
XOR logic function to compare the module’s outputs. In
case of error, the software tools running on the reconfigu-
rable system partially rewrite the C-DWC region.
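The C-DWC comparison can be illustrated with a small behavioral model. The sketch below is an illustration under assumptions, not the physical-level implementation: the module is taken to be an 8-bit adder, the upset is emulated by a replica that flips one output bit, and all names (`adder`, `faulty_adder`, `cdwc_mismatch`) are hypothetical.

```python
from typing import Callable

WIDTH = 8
MASK = (1 << WIDTH) - 1

Module = Callable[[int, int], int]


def adder(a: int, b: int) -> int:
    """Reference module: an 8-bit adder (a typical carry-chain user)."""
    return (a + b) & MASK


def faulty_adder(a: int, b: int) -> int:
    """Replica with an emulated upset that flips output bit 0."""
    return adder(a, b) ^ 0b1


def cdwc_mismatch(module: Module, replica: Module, a: int, b: int) -> int:
    """XOR the two outputs bit by bit; any 1 marks a disagreement."""
    return (module(a, b) ^ replica(a, b)) & MASK


healthy = cdwc_mismatch(adder, adder, 3, 5)
corrupted = cdwc_mismatch(adder, faulty_adder, 3, 5)
needs_rewrite = corrupted != 0  # would trigger rewriting the C-DWC region
```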
F-DWC is applied at the place and route level, by suitably
duplicating each LUT function in two copies that are
placed in a single slice using two consecutive LUT posi-
tions. The outputs of the two LUTs are then compared
with hardwired physical resources built into the slice in
Fig. 2. Placement space division into Single Bit Error region, Multiple Bit Error region and the Coarse Grain Error region.

SONZA ET AL.: AN ERROR-DETECTION AND SELF-REPAIRING METHOD FOR DYNAMICALLY AND PARTIALLY RECONFIGURABLE SYSTEMS 5
the form of a carry chain, by using internal,
non-programmable resources such as hardwired MUXCYs and
XORCYs.
The outputs generated by the XORCY functions are
connected in a chain of OR logic functions in order to
provide a single error detection flag for each column. In
practice, the F-DWC approach can be adopted by acting at
the Hardware Description Language (HDL) level: the
combinational functions are duplicated and both copies of
the circuit LUTs are placed in a single FPGA slice using
two consecutive available LUT positions. Please note that
the outputs of each pair of LUTs pass through the carry
chain and, at the XORCY of each pair position, generate a
comparison signal called check flag. Since a check flag is
generated for each pair of LUTs, the number of check flags
may increase drastically. A considerable amount of routing
resources could therefore be required, because the check
flags have to be routed to the static region for the
detection and correction of possible errors. Moreover,
such a scheme would not only have a large overhead, but
would also be pointless, because the smallest unit of
reconfiguration is a frame: a localization finer than one
flag per frame cannot be exploited during repair.
In order to have a single check flag for each frame we
propose to merge the individual check flags in two differ-
ent ways. The check flags in the SBE region are merged
through the built-in slice carry chain as shown in figure 3
(further details will be provided in section 3.1.1) where
the hardwired resources XORCY and MUXCY are labeled
as Xi and Mi, respectively. Furthermore, a whole column
of slices is connected through the carry chains to produce
a single flag for each column of slices (see for example flag
3 in the SBE region of figure 2).
In this way we achieve a huge reduction in the number
of check flags, but we can only detect Single Event
Upsets (SEUs) in the SBE region: multiple LUT pairs are
connected together by a long chain of XORs and XNORs, and
thus an even number of errors will go undetected due to
the logic configuration of the detector.
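The parity effect can be seen in a simplified abstraction in which the merged flag is the XOR of the per-pair mismatch bits. This is an assumption made for illustration only: it does not reproduce the exact carry-chain wiring, and `xor_chain_flag` is a hypothetical name.

```python
from functools import reduce
from operator import xor
from typing import Sequence


def xor_chain_flag(mismatches: Sequence[int]) -> int:
    """Merge per-pair mismatch bits (1 = the two LUT copies disagree) by XOR."""
    return reduce(xor, mismatches, 0)


no_error = xor_chain_flag([0, 0, 0, 0])
one_error = xor_chain_flag([0, 1, 0, 0])   # flag raised
two_errors = xor_chain_flag([0, 1, 0, 1])  # the two contributions cancel out
```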
In the Multiple Bit Error (MBE) region each pair of
LUTs generates a check flag, and thus we have two check
flags per slice. The number of check flags can be reduced
by OR-ing some of the flags corresponding to the slices in
the same slice column, as shown in the magnified MBE
region in figure 4. Although this introduces a higher
overhead, it gives us the ability to detect Multiple
Event Upsets (MEUs) in the frames mapped on this region;
in fact, the individual check flags are not merged along
a carry chain passing through multiple XORs, as happens
in the SBE region.
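Unlike a merge through XORs, OR-ing keeps the flag raised for any number of mismatches. A minimal sketch, assuming each slice column is represented by a list of per-pair check flags (the dictionary layout and the names `or_merge` and `columns_to_repair` are hypothetical):

```python
from typing import Dict, List


def or_merge(check_flags: List[int]) -> int:
    """OR-reduce the check flags of one slice column into a single flag."""
    return int(any(check_flags))


def columns_to_repair(flags_by_column: Dict[int, List[int]]) -> List[int]:
    """Return the slice columns whose merged flag is raised."""
    return sorted(col for col, flags in flags_by_column.items() if or_merge(flags))


flags_by_column = {
    0: [0, 0, 0, 0],  # no mismatch
    1: [0, 1, 0, 1],  # two mismatches: still detected
    2: [1, 0, 0, 0],  # one mismatch
}
repair_list = columns_to_repair(flags_by_column)
```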
3.1 Error Detection Method
In order to fully explain our proposal, in this section
we will specifically refer to the architecture of Xilinx Vir-
tex-5 FPGAs. As described in the previous section, the
error detection mechanism implemented in the reconfigu-
rable region is based on LUT-based checkers and carry
chains for propagating the check flags. Please note that
the LUT checkers are only deployed when the carry chain
is unavailable for comparison purposes. This allows re-
ducing the performance degradation of the circuit im-
plemented with our method, although in this case the
detection mechanism is implemented at the module level.
In this section, we focus on the method adopted for
error detection using the carry chains for comparison; a
more detailed explanation of how both the LUT checkers
and the carry chains are inserted into the physical place
and route description of the circuit will be given in
Section 4.3.
3.1.1 Single-bit error detection
In order to detect single-bit errors, we propose to du-
plicate each original LUT function into two identical
LUTs. Furthermore, we place the two LUTs in a single
FPGA slice, where we set the Carry Input and the generic
AX inputs to 1 and 0, respectively, as illustrated in figure
3. Consequently, the hardwired XORCY logic gate at the
bottom of the slice acts as an inverter, while the MUXCY
multiplexer in the first (bottom) position simply acts as
a buffer that passes the value of LUT A.
The multiplexer “M2” receives an inverted copy (through
the AMUX_2_BX hardwired connection) and a buffered copy
of the LUT A output at its “0” and “1” inputs,
respectively, while the selection line is tied to LUT B
(which is the copy of LUT A), thus effectively performing
the EX-NOR function.
The XOR gate named “X2” receives LUT A and LUT B
outputs on its inputs. Similarly, LUT C and LUT D can
also be connected with such a scheme by extending the
EX-NORs and EX-ORs along the slice. In fact, this scheme
can be extended to an entire clock region covering 20
CLBs using the COUT and CIN of slices, thus generating
two flags for the even and odd slice columns of the same
CLB, respectively. This convergence strategy can only be
applied if the CLB column has no empty slices. If the CLB
column contains empty slices, the dedicated COUT
connection cannot be used to propagate the flag signal
upwards along the column. In such a case, an OR-ing LUT
is introduced in the CLB column and placed in an
available empty slice. This will be discussed in greater
detail in Section 4.3. It is interesting to investigate
an upper bound on the number of check flags that can be
generated for the most complex design. The flag signal is
generated per CLB tile column and is directly related to
the device rows and columns. For example, for the
Virtex-5 VLX110T device the maximum number of check flags
for any design cannot be greater than 1,280 (160x8) [22]
[23]. As the FPGA must also contain the control
processor, the actual number will be considerably lower
than 1,280 and will determine the size of the GPIO port
that is used by the controller to detect errors. It is
thus possible to pinpoint single-bit upsets in any of the
four LUTs in any slice column
in a clock region. However, errors affecting flip-flops
Fig. 3. Single-Bit Error detection scheme implemented in a single slice.
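The comparison of one LUT pair described in Section 3.1.1 can be checked with a small Boolean model. This is a sketch under assumptions: the MUXCY is modeled as selecting its “1” input when the select line is high, the first-position multiplexer is assumed to take AX on its “0” input and the Carry Input on its “1” input, and the function names are hypothetical.

```python
from itertools import product
from typing import Tuple


def muxcy(select: int, in0: int, in1: int) -> int:
    """2:1 multiplexer: returns in1 when select is 1, otherwise in0."""
    return in1 if select else in0


def xorcy(a: int, b: int) -> int:
    """Two-input XOR gate."""
    return a ^ b


def compare_pair(lut_a: int, lut_b: int,
                 carry_in: int = 1, ax: int = 0) -> Tuple[int, int]:
    """Return (M2, X2) for one LUT pair with Carry Input = 1 and AX = 0."""
    m1 = muxcy(lut_a, ax, carry_in)  # buffer: equals LUT A
    x1 = xorcy(lut_a, carry_in)      # inverter: NOT LUT A, fed to the BX input
    m2 = muxcy(lut_b, x1, m1)        # EX-NOR of LUT A and LUT B
    x2 = xorcy(m1, lut_b)            # EX-OR of LUT A and LUT B
    return m2, x2


truth_table = {(a, b): compare_pair(a, b) for a, b in product((0, 1), repeat=2)}
```

In this model M2 is 1 and X2 is 0 exactly when the two LUT copies agree, so either signal can serve as the comparison result for the pair.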

References
Proceedings ArticleDOI
07 Mar 2005
TL;DR: The experimental results presented in this paper demonstrate that the number and placement of voters in the TMR design can directly affect the fault tolerance, ranging from 4.03% to 0.98% the number of upsets in the routing able to cause an error in the TMR circuit.
Abstract: Triple modular redundancy (TMR) is a suitable fault tolerant technique for SRAM-based FPGA. However, one of the main challenges in achieving 100% robustness in designs protected by TMR running on programmable platforms is to prevent upsets in the routing from provoking undesirable connections between signals from distinct redundant logic parts, which can generate an error in the output. This paper investigates the optimal design of the TMR logic (e.g., by cleverly inserting voters) to ensure robustness. Four different versions of a TMR digital filter were analyzed by fault injection. Faults were randomly inserted straight into the bitstream of the FPGA. The experimental results presented in this paper demonstrate that the number and placement of voters in the TMR design can directly affect the fault tolerance, ranging from 4.03% to 0.98% the number of upsets in the routing able to cause an error in the TMR circuit.

243 citations


"An error-detection and self-repairi..." refers background in this paper

  • ...standing a single fault per voting group [13], on the other...

    [...]

Proceedings ArticleDOI
01 Sep 2007
TL;DR: In this paper, the authors proposed an automated software tool that uses the Partial TMR method to apply TMR incrementally until the specified percentage of resources are utilized, which gives the maximum reliability gain for the specified area cost.
Abstract: The mitigation of single-event upsets (SEUs) in field-programmable gate arrays (FPGAs) is an increasingly important subject as FPGAs are used in radiation environments such as space. Triple modular redundancy (TMR) is the most frequently used SEU mitigation technique but is very expensive in terms of area and power costs. These costs can be reduced by sacrificing some reliability and applying TMR to only part of the FPGA design. Our Partial TMR method focuses on the most critical sections of the design and increases reliability by applying TMR to continuous sections of the circuit. We introduce an automated software tool that uses the Partial TMR method to apply TMR incrementally until the specified percentage of resources are utilized. Thus the tool gives the maximum reliability gain for the specified area cost. The amount of mitigation applied can be chosen at a very fine level, giving the designer maximum flexibility when producing the final mitigated design.

108 citations


"An error-detection and self-repairi..." refers methods in this paper

  • ...A previous approach based on fine-granularity error masking has been developed in [20]; however, such solution can only be applied to a TMR technique with a majority voter logic scheme....

    [...]

01 Jan 2003
TL;DR: An evaluation of TMR on two different counter designs in the presence of SEUs shows that when feedback TMR is used with triplicated clocks, it is possible to have a counter design which is insensitive to any single configuration upset.
Abstract: Field programmable gate arrays (FPGAs) are sensitive to radiation-induced single event upsets (SEUs) within the configuration memory. Triple modular redundancy (TMR) is a technique commonly used to mitigate against design failures caused by SEUs. This paper evaluates the effectiveness and cost of TMR on two different counter designs in the presence of SEUs. The evaluation measures the reliability, area cost, and speed of different TMR styles. The tests show that when feedback TMR is used with triplicated clocks, it is possible to have a counter design which is insensitive to any single configuration upset.

91 citations


"An error-detection and self-repairi..." refers background or methods in this paper

  • ...The increased probability of MEUs hitting the configuration memory of an FPGA can limit the effectiveness of traditional redundancy-based fault-tolerance approaches [3]....

    [...]

  • ...Our approach allows resilience to MEUs, since we adopt a static region protected with a fine-grain redundancy approach, as described by [3]....

    [...]

Journal ArticleDOI
TL;DR: This paper presents an analytical approach (versus fault injection) for soft error rate estimation in FPGA-based designs and validates the projections produced by the analytical model using field error rates obtained from failure data from a large FPGA-based design used in the logical unit module board of a commercial information system.
Abstract: Soft errors due to cosmic particles are a growing reliability threat for VLSI systems. The vulnerability of FPGA-based designs to soft errors is higher than ASIC implementations since the majority of chip real estate is dedicated to memory bits, configuration bits, and user bits. Moreover, single event upsets (SEUs) in the configuration bits of SRAM-based FPGAs result in permanent errors in the mapped design. In this paper we analyze the soft error vulnerability of FPGAs used in information systems. Since the reliability requirements of these high performance information subsystems are very stringent, the reliability of the FPGA chips used in the design of such systems plays a critical role in overall system reliability. We present an analytical approach (versus fault injection) for soft error rate estimation in FPGA-based designs. We also validate the projections produced by our analytical model using field error rates obtained from failure data obtained from a large FPGA-based design used in the logical unit module board of a commercial information system. This comparison confirms that the projections obtained from our analytical tool are accurate (there is an 81% overlap in FIT rate range obtained with our analytical modeling framework and the field failure data studied).

77 citations


"An error-detection and self-repairi..." refers methods in this paper

  • ...A Previous approach based on fine-granularity error masking have been developed in [16], such solution however is only...

    [...]