An error-detection and self-repairing method for dynamically and partially reconfigurable systems

doi:10.1109/ETS.2013.6569377

Matteo Sonza Reorda, Luca Sterpone, Anees Ullah

27 May 2013-Vol. 66, Iss: 6, pp 1-7

TL;DR: This paper proposes a self-repairing method for partially and dynamically reconfigurable systems, applied at a fine-grain granularity level, that is able to detect, correct and recover from errors using the run-time capabilities offered by modern SRAM-based FPGAs.

Abstract: Reconfigurable systems are gaining an increasing interest in the domain of safety-critical applications, for example in space and avionic applications. In fact, the capability of reconfiguring the system during run-time execution and the high computational power of modern Field Programmable Gate Arrays (FPGAs) make these devices suitable for data processing. Moreover, such systems must also guarantee the abilities of self-awareness, self-diagnosis and self-repair in order to cope with errors due to the harsh conditions typically existing in some environments. In this paper we propose a self-repairing method for partially and dynamically reconfigurable systems applied at a fine-grain granularity level. Our method is able to correct and recover from errors using the run-time partial reconfiguration capabilities offered by modern SRAM-based FPGAs. Fault injection experiments have been executed on a dynamically reconfigurable system embedding a number of benchmark circuits. Results demonstrate that the method can achieve full detection of single and multiple errors, while significantly improving the system availability with respect to traditional error detection and correction methods.

1 INTRODUCTION

Technology scaling in the nano-metric domain and beyond supports the increasing usage of high performance and miniaturized embedded systems.
Among the available technology solutions, the adoption of SRAM-based FPGAs is the most suitable for the realization of dynamically and partially reconfigurable systems; however, when used in harsh environments, SRAM-based FPGAs have to withstand the radiation effects in the form of Single Event Upsets (SEUs) and Multiple Event Upsets (MEUs), especially affecting their configuration memory [2].
On the contrary, the components in the dynamic region correspond to partially reconfigurable resources that can be configured in different ways depending on the system requirements [8].
The proposed approach provides significant advantages compared to already developed solutions [9][10], mainly because it increases the error detection and correction capabilities while introducing comparable area and performance overhead.
Section 3 describes the proposed method, while the developed design flow is illustrated in Section 4.

2 PREVIOUS WORKS

State-of-the-art SRAM-based FPGAs are heterogeneous devices containing several macro blocks, like Digital Signal Processors (DSPs), Block RAMS and IO Blocks (IOBs), along with Configurable Logic Blocks (CLBs) inside the FPGA reconfiguration fabric.
Each of these resource types is arranged in columns that span from top to bottom of the device realizing a column of CLBs, IOBs and BRAM memories interconnected by a mesh of heterogeneous routing resources.
In detail, with dynamic reconfiguration, the FPGA configuration memory can be read-back continuously without interfering with the circuit functionality and if any upset is detected it can be selectively re-written with the correct values, thus avoiding the accumulation of radiation-induced errors [12].
Spatial redundancy using Triple Modular Redundancy (TMR) is complementarily used with the read-back and correction techniques: on one side TMR can tolerate faults with the limitation of withstanding a single fault per voting group [13], on the other side read-back and correction avoids the accumulation of errors within the configuration memory.
The results achievable with this combined solution are computationally expensive and area hungry.
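
The single-fault-per-voting-group limitation of TMR can be illustrated with a minimal Python sketch of a bitwise majority voter. The function name and the 4-bit example values are illustrative assumptions and are not taken from the paper.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over the outputs of three replicas."""
    return (a & b) | (a & c) | (b & c)


golden = 0b1011          # fault-free output (illustrative value)
upset = 0b0100           # bit flipped by an upset (illustrative value)

# One faulty replica in the voting group: the fault is masked.
single_fault = majority_vote(golden, golden ^ upset, golden)

# Two replicas of the same voting group hit on the same bit:
# the voter follows the wrong majority and the error propagates.
double_fault = majority_vote(golden ^ upset, golden ^ upset, golden)
```

With one corrupted replica the voter still returns the fault-free value; with two corrupted replicas in the same group it does not, which is the situation that read-back and correction are meant to prevent from accumulating.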

2.1 Main contribution

The main contribution of the present work, which is based on the platform preliminarily presented in [8], is the description of an autonomous recovery approach that can be applied to Partially Reconfigurable Modules (PRMs) when errors are detected inside them.
The approach is implemented by the static region providing effective capabilities of error detection and correction of faults within the dynamic region.
In detail, the proposed method is characterized by the ability to detect MEUs in the FPGA's configuration memory, as well as to recover from any number of faults in the dynamic partition, thus improving on previously developed approaches, such as the one presented in [9], that cannot deal with MEUs.
The authors' solution is adaptable to all modern SRAM-based FPGAs equipped with an Internal Configuration Access Port (ICAP) and based on a LUT-slice architecture.

3 THE PROPOSED METHOD

The proposed method consists of two flows: one applied to the dynamically reconfigurable region for implementing error detection, the other one for instrumenting the circuit mapped on the FPGA so that it supports the execution of the self-repairing method against single and multiple-bit errors.
The static region contains the main processor, which is in charge of controlling the partially reconfigurable system operational functionalities: therefore, it is very important to tolerate and recover from errors in these modules.
Each RF contains a different number of "minor frames", each having a height equal to the clock region (row) and numbered from left to right.
Practically, the F-DWC approach can be adopted by acting at the Hardware Description Language (HDL) level: the combinational functions are duplicated and both copies of the circuit LUTs are placed in a single FPGA slice using two consecutive available LUT positions.
In the Multiple Bit Error (MBE) region each pair of LUTs generates a check flag and thus the authors have two check flags per slice.

3.1 Error Detection Method

In order to fully explain their proposal, in this section the authors will specifically refer to the architecture of Xilinx Virtex-5 FPGAs.
As described in the previous section, the error detection mechanism implemented in the reconfigurable region is based on LUT-based checkers and carry chains for propagating the check flags.
Please note that the LUT checkers are only deployed when the carry chain is unavailable for comparison purposes.
This allows reducing the performance degradation of the circuit implemented with their method, although in this case the detection mechanism is implemented at the modular level.
The authors focus on the method adopted for the error detection using the carry chains for comparison; a more detailed explanation of both the LUT checkers and the carry chains insertion inside the physical place and route description of the circuit will be given in Section 4.3.

3.1.1 Single-bit error detection

In order to detect single-bit errors, the authors propose to duplicate each original LUT function into two identical LUTs.
The multiplexer "M2" receives an inverted (through the AMUX_2_BX hardwired connection) and buffered copy of the LUT A output at its "0" and "1" inputs while the selection line is tied to LUT B (which is the copy of LUT A) thus effectively performing the EX-NOR function.
In case the CLB column contains empty slices the dedicated COUT connection cannot be used to propagate the flag signal upwards along the column.
Errors affecting flip-flops cannot be directly detected.
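
The multiplexer-based comparison described above can be checked with a small truth-table sketch in Python. It assumes the inverted copy of LUT A feeds the "0" input and the buffered copy feeds the "1" input, as the sentence order suggests; the function names are hypothetical and the sketch models logic values only, not the slice wiring.

```python
def mux2(in0: int, in1: int, sel: int) -> int:
    """Two-input multiplexer: returns in1 when sel is 1, otherwise in0."""
    return in1 if sel else in0


def check_flag(lut_a: int, lut_b: int) -> int:
    """Compare LUT A with its copy LUT B through the multiplexer.

    Input "0" receives the inverted LUT A output, input "1" the buffered
    one, and LUT B drives the select line, which yields EX-NOR(A, B).
    """
    return mux2(in0=1 - lut_a, in1=lut_a, sel=lut_b)


# Enumerate all combinations of the two LUT outputs.
truth_table = {(a, b): check_flag(a, b) for a in (0, 1) for b in (0, 1)}
```

In this model the flag is 1 when both copies agree and 0 when they differ. How the flag polarity is encoded on the carry chain is not stated in the summary, so that part is left out.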

3.1.2 Multiple-Bit error detection

Multiple bit errors can only be detected if the error detecting carry chain is inserted in a specific pattern that the authors will mention in this section.
In order to reduce the number of flags the authors propose the usage of 2 slices (out of the available 20) for merging the check flags by OR-ing them.
As the authors are producing two flags for each clock region (one for odd and one for even slices) they can have a maximum of 72 LUTs (out of 80 LUTs in an even or odd slice column) configured for computations in any slice column location (even or odd) within a single clock region.
Thus, the MBE regions require an overhead of 11.11 % for flag reduction.
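
The 11.11 % figure can be reproduced with a short calculation, sketched here in Python. It assumes four LUTs per slice (consistent with 80 LUTs across 20 slices) and reads the overhead as the LUTs reserved for flag merging divided by the LUTs left for computation; that reading is an interpretation of the text, not a formula stated by the authors.

```python
SLICES_PER_COLUMN = 20   # even or odd slice column within one clock region
LUTS_PER_SLICE = 4       # assumption, consistent with 80 LUTs per column
MERGE_SLICES = 2         # slices reserved for OR-ing the check flags

total_luts = SLICES_PER_COLUMN * LUTS_PER_SLICE      # 80
merge_luts = MERGE_SLICES * LUTS_PER_SLICE           # 8
usable_luts = total_luts - merge_luts                # 72

# Overhead of flag reduction relative to the LUTs used for computation.
overhead_pct = 100 * merge_luts / usable_luts
```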

3.2 Error Correction Method

Data errors affecting combinational logic or Flip-Flops are identified by the error detection scheme previously described.
Secondly, the clock enabling signals should be de-activated to disable the propagation of errors to the next stages in the design.
This is possible since both static and dynamic regions have well-defined interfaces with clock enabling registers.
Lastly, the main processor controller enables the clock to re-start the normal operation in the DUT region involved in the correction.
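
The ordering of the correction steps can be sketched as a toy Python model. The summary elides part of the sequence, so the frame rewrite step below is an assumption based on the paper's statement that correction is performed on a single reconfigurable frame; `DutRegion` and `recover` are hypothetical names and no real ICAP access is performed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DutRegion:
    """Toy model of one DUT region (all fields are hypothetical)."""
    golden_frame: bytes              # known-good configuration frame
    frame: bytes                     # current, possibly corrupted, frame
    clock_enabled: bool = True
    log: List[str] = field(default_factory=list)


def recover(region: DutRegion) -> None:
    """Disable the clock, rewrite the frame, then re-enable the clock."""
    # Stop errors from propagating to the next stages of the design.
    region.clock_enabled = False
    region.log.append("clock_disabled")
    # Stand-in for rewriting the affected frame through the ICAP.
    region.frame = region.golden_frame
    region.log.append("frame_rewritten")
    # Restart normal operation in the corrected region.
    region.clock_enabled = True
    region.log.append("clock_enabled")


region = DutRegion(golden_frame=b"\x0f\xa5", frame=b"\x0f\xa4")
recover(region)
```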

4 DESIGN FLOW

In this section the authors describe the tool flow they developed in order to insert fine-grain duplication with comparison using the built-in slice carry chains.
A pre-map step generates a number of constraints for directed packing, placement and sites prohibitions, while a post-map step inserts the error detecting carry chains and the convergence logic required to reduce the number of flag signals.
This post-map modification is implemented by modifying the XDL file (i.e., the Xilinx interface for interacting with the Xilinx CAD flow).
The tool flow has been developed as a C++-based software environment making heavy use of the Boost library and Tools for Open Reconfigurable Computing (TORC).

4.1 Net-list Extraction

The flow starts by parsing the net-list description of the circuit implemented into the dynamic region, which was duplicated at the Hardware Description Language (HDL) level.
It is important that both instances of the design be labeled with "inst1" and "inst2" so that each synthesized element contains the hierarchical information of the top level instance to which it belongs.
Global reset/clock signals are not duplicated at the module-level, as will be explained in Section 4.2.
The post-synthesis Verilog file contains the circuit net-list using the Xilinx primitive cell library elements.
In detail, each node of the graph corresponds to a data structure with a number of fields including: functional string, instance name, inputs vector, outputs vector and type of primitive element (LUT or FF).
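
A graph node with the listed fields might look like the following Python sketch. The authors' tool is written in C++, so this is only an illustration; the class names, the "/" hierarchy separator and the example instance names are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class PrimitiveType(Enum):
    LUT = "LUT"
    FF = "FF"


@dataclass
class NetlistNode:
    function: str                    # functional string of the primitive
    instance_name: str               # hierarchical name, e.g. "inst1/..."
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    kind: PrimitiveType = PrimitiveType.LUT


def pair_replicas(nodes: List[NetlistNode]) -> Dict[str, Dict[str, NetlistNode]]:
    """Group nodes by local name so the "inst1"/"inst2" copies line up."""
    pairs: Dict[str, Dict[str, NetlistNode]] = {}
    for node in nodes:
        top, _, local = node.instance_name.partition("/")
        pairs.setdefault(local, {})[top] = node
    return pairs


nodes = [
    NetlistNode("E8", "inst1/u_add/lut3", ["a", "b", "c"], ["s"]),
    NetlistNode("E8", "inst2/u_add/lut3", ["a", "b", "c"], ["s_dup"]),
]
pairs = pair_replicas(nodes)
```

Keeping the top-level instance label in every node name is what lets a later step find the two copies of each LUT and pack them into the same slice.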

4.2 DUT Regions Formation and Constraints Generation

Once the circuit net-list is created in the form of a graph, it is necessary to generate user constraints, represented within the User Constraints File (UCF) in order to perform the DUT physical space division into regions and for packing the primitive cells into slices.
Thirdly, LUTs with 6 inputs are grouped to form single bit error detection regions.
For this reason the global clock and reset signals were not duplicated due to the architectural limitation of state-of-the-art FPGA devices.
Slices in the single bit region use names like "SBESlice1", "SBESlice2" and so on.
The algorithm illustrated in figure 6 performs the generation of the constraints used for the floorplanning of the circuit including the mapping of the SBE and MBE regions.

4.3 Low-level Manipulations

Once the mapping is performed, the insertion of the carry chain and the definition of the comparator resources are implemented by modifying the physical place and route description of the circuit in order to properly use the hardwired combinational gates.
Each inserted carry chain is labeled with a unique reference to differentiate it with respect to the ones used for arithmetic computation.
It is also interesting to note that for each OR LUT an automatic procedure searches for an empty slice in the same CLB column and picks up the nearest one in terms of the slice site distance for the OR LUT placement.
The single bit error region flags are converged resulting in error detection carry chains of varying lengths.
Therefore, the placement should be such that an optimal balance between the usage of OR LUTs for flag convergence and the routing congestion is achieved.
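
The search for the nearest empty slice for an OR LUT can be sketched in Python as follows. The sketch assumes slices in a CLB column are identified by an integer row index and breaks ties toward the lower index; both choices are assumptions, since the summary does not describe the distance metric in detail.

```python
from typing import List, Optional


def nearest_empty_slice(source_row: int, empty_rows: List[int]) -> Optional[int]:
    """Return the empty slice row closest to source_row in the same column.

    Returns None when the column has no empty slice. Ties are resolved
    toward the lower row index (an arbitrary choice for this sketch).
    """
    if not empty_rows:
        return None
    return min(empty_rows, key=lambda row: (abs(row - source_row), row))


# Example: OR LUT flags originate at row 7; rows 2, 9 and 15 are empty.
chosen = nearest_empty_slice(7, [2, 9, 15])
```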

5 EXPERIMENTAL RESULTS

The authors implemented the proposed method targeting a Xilinx Virtex-5 LX110T SRAM-based FPGA.
Based on an ad-hoc hardware unit), the authors adopted the Microblaze processor since it represents a state-of-the-art solution for a dynamically and partially reconfigurable system based on static and dynamic regions [4] .
Moreover, another GPIO port connected to the flags stemming from the DUT region and configured in interrupt mode is responsible for informing the Microblaze in case of errors.
The bit-stream for the C-DWC region is stored as a partial bit-stream by reading it with the ICAP from the start address to the end address.
In the following sections, the authors present several results mainly related to the ability of quick error detection, localization and repairing.

5.1 Area Overhead

The circuits include some relevant ITC'99 benchmark circuits of varying complexity, two implementations of the CORDIC arithmetic processor, a miniMIPS processor, a lightweight 8080 SoC, an RS-Decoder and a DCT core from the opencores repository [24][25].
Please note that the authors did not include the amount of resources related to the static region within the area count since the static region remains the same in any Dynamically Reconfigurable system, no matter the adopted solution.
If compared with DMR, their approach requires 10% more resources on average; however, DMR cannot correct errors, while their approach corrects errors and reduces the probability of single points of failure thanks to the developed fine-grain combinational logic infrastructure.
The authors underline that the area comparison has been performed directly on the basis of LUTs and FFs counts; if comparison is made considering the number of FPGA slices, the ratio may be slightly different due to the stringent packing and placement requirements adopted for the fine-grain redundancy with comparison logic.
In particular, slices are used as a route-through and FFs may be placed in separate slices, since the FFs require different control signals that could not be packed together with LUTs.

5.2 Error Detection Latency

The measurement of the error detection latency is the key factor for building a self-repairing system able to autonomously repair itself while obeying real-time constraints.
The results the authors obtained are illustrated in Table II, which shows the maximum error detection latency for the SBE and MBE regions.
In detail, the table reports the length of the carry chain detector, the delay latency with routing and logic contributions of the SBE region, as well as the distance from the detector and the delay latency for the MBE region.
It is notable that the SBE region latency is larger than for the MBE region because all the carry chains in each CLB that resides in the same column have been connected in a unique CLB column.
Two alternatives have been used in order to reduce the routing delay time.

5.3 Error Correction and Detection

The effectiveness of the proposed approach concerning the error correction and detection capabilities has been evaluated through the execution of a number of fault injection campaigns.
The experiments have been performed on the Xilinx Virtex-5 LX110T SRAM-based FPGAs by injecting transient faults into the FPGA's configuration memory and evaluating the circuit's response through the execution of circuit specific workloads.
Please note that the faulty bitstreams are generated by corrupting the FPGA's configuration memory bits belonging to the dynamic region, while the static region was kept fault free.
Table III shows the fault injection results, where for each circuit 10,000 Single Event Upsets (SEUs) have been randomly injected into the whole FPGA configuration memory bits related to the reconfigurable region.
All the circuits have been emulated at 50 MHz and SEUs are practically injected by downloading the corrupted bitstreams into the FPGA configuration memory.
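
Generating a corrupted bitstream for injection can be illustrated with a minimal Python sketch that flips randomly chosen bits in a byte string. The function names, the 64-byte all-zero "bitstream" and the fixed seed are assumptions for illustration; the sketch does not model the real bitstream format, nor the constraint that the two MEU bits fall in different slice columns.

```python
import random


def inject_upsets(bitstream: bytes, n_bits: int, rng: random.Random) -> bytes:
    """Return a copy of bitstream with n_bits distinct random bits flipped."""
    corrupted = bytearray(bitstream)
    for pos in rng.sample(range(len(bitstream) * 8), n_bits):
        corrupted[pos // 8] ^= 1 << (pos % 8)
    return bytes(corrupted)


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


rng = random.Random(0)            # fixed seed so the campaign is repeatable
golden = bytes(64)                # placeholder for the fault-free bitstream
seu = inject_upsets(golden, 1, rng)   # single-bit upset
meu = inject_upsets(golden, 2, rng)   # two-bit upset
```

Each corrupted copy would then be downloaded in place of the fault-free configuration and the circuit outputs compared against the expected answers.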

Table III. Fault injection campaign experimental results

In detail, the Wrong Answer column reports the number of SEUs and MEUs provoking a wrong answer on the circuit outputs; the Corrected column reports the number of SEUs and MEUs properly corrected by their approach.
Please note that the MEU effect considered in their experiments always occurs in different slice columns involving the modification of two configuration memory bits.
The results show the effectiveness of their approach, which is able to correct more than 98% of the injected errors provoking wrong answers for all the considered circuits.
The authors also measured the recovery time; in Table IV they reported the worst recovery time measured for all the circuits during the execution of the fault injection campaigns.
The authors also computed the recovery time required by the redundancy approaches, such as TMR and DMR, using active configuration memory scrubbing of all the reconfigurable region area, which is about 1.2 ms; their approach shows an improvement of more than one order of magnitude, and the advantage provided by their approach is extremely large on all the considered circuits.

5.4 Timing Analysis

Finally, the authors evaluated the impact on the maximum working frequency of all the benchmark circuits, comparing their approach with the DMR and TMR redundancy-based techniques.
In order to elaborate the timing data the authors used the static timing analysis tool provided by the Xilinx ISE environment.
This phenomenon is due to the unconventional block placement of logic resources on slice columns for different circuit regions.
This aspect affects the timing of the circuit because their technique does not include an optimal floorplan implementation of the different circuit regions.
In figure 8, the authors illustrate the obtained results, showing the percentage contribution of each design phase's constraints to the overall circuit delay: LUT blocks, SBE region, MBE region and Detectors.

Fig. 4. Multiple-Bit Error detection scheme implemented in a single slice.

Fig. 8. Percentage of influence of the approach implementation phases on the circuit dynamic region.

Fig. 7. Clock period comparison for the considered circuits using different error detection and correction approaches.

Table II. Error Detection Latency for carry chain detectors

Table IV. Recovery Time comparison (worst case)

Table III. Fault injection campaign experimental results

Fig. 1. Resource and Frame Layout of Modern SRAM based FPGAs

Fig. 6. The flow of the constraint generation algorithm.

Fig. 2. Placement space division into Single Bit Error region, Multiple Bit Error region and the Coarse Grain Error region.

Fig. 3. Single-Bit Error detection scheme implemented in a single slice.

10 August 2022

POLITECNICO DI TORINO Repository ISTITUZIONALE

An Error-Detection and Self-Repairing Method for Dynamically and Partially Reconfigurable Systems / SONZA REORDA, Matteo; Sterpone, Luca; Ullah, Anees. - In: IEEE TRANSACTIONS ON COMPUTERS. - ISSN 0018-9340. - ELETTRONICO. - 66:6(2017), pp. 1022-1033. [10.1109/TC.2016.2607749]

Original

An Error-Detection and Self-Repairing Method for Dynamically and Partially Reconfigurable Systems

IEEE postprint/Author's Accepted Manuscript

Publisher:

Published

DOI:10.1109/TC.2016.2607749

openAccess

Publisher copyright

current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating

new collecting works, for resale or lists, or reuse of any copyrighted component of this work in other works.

This article is made available under terms and conditions as specified in the corresponding bibliographic description in

the repository

Availability:

This version is available at: 11583/2658319 since: 2016-11-30T14:11:58Z

IEEE

0018-9340 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TC.2016.2607749, IEEE Transactions on Computers

An Error-Detection and Self-Repairing Method for Dynamically and Partially Reconfigurable Systems

Abstract: Reconfigurable systems are gaining an increasing interest in the domain of safety-critical applications, for example in the space and avionic domains. In fact, the capability of reconfiguring the system during run-time execution and the high computational power of modern Field Programmable Gate Arrays (FPGAs) make these devices suitable for intensive data processing tasks. Moreover, such systems must also guarantee the abilities of self-awareness, self-diagnosis and self-repair in order to cope with errors due to the harsh conditions typically existing in some environments. In this paper we propose a self-repairing method for partially and dynamically reconfigurable systems applied at a fine-grain granularity level. Our method is able to detect, correct and recover from errors using the run-time capabilities offered by modern SRAM-based FPGAs. Fault injection campaigns have been executed on a dynamically reconfigurable system embedding a number of benchmark circuits. Experimental results demonstrate that our method achieves full detection of single and multiple errors, while significantly improving the system availability with respect to traditional error detection and correction methods.

Index Terms— Self-Repair; Partial and Dynamic Reconfiguration; Single Event Upsets (SEUs); Multiple Event Upsets (MEUs)

1 INTRODUCTION

Technology scaling in the nano-metric domain and beyond supports the increasing usage of high performance and miniaturized embedded systems. However, the quest for pushing the limits of technology to the ultra-nano scale devices has exacerbated concerns related to power consumption and reliability that had not been envisioned before. In particular, one of the major issues in safety-critical applications (especially in the space and avionic domains) is the run-time mitigation of various radiation-induced fault effects, which may provoke transient and permanent modifications of the electronic circuit’s behavior. The problem is widely known and various methods have been developed and proposed in the area during the last decade. The ubiquity of embedded systems for safety-critical applications operating in radiation environments demands continuous and successful operations of the system by autonomously overcoming possible malfunctions. This condition requires the abilities of autonomous error detection, self-diagnosis and self-repair [1].

Among the available technology solutions, the adoption of SRAM-based FPGAs is the most suitable for the realization of dynamically and partially reconfigurable systems; however, when used in harsh environments, SRAM-based FPGAs have to withstand the radiation effects in the form of Single Event Upsets (SEUs) and Multiple Event Upsets (MEUs), especially affecting their configuration memory [2].

The increased probability of MEUs hitting the configuration memory of an FPGA can limit the effectiveness of traditional redundancy-based fault-tolerance approaches [3]. In fact, particles can hit the same logic group of circuit replicas enabling erroneous results to propagate. To cope with this scenario, researchers have recently investigated the fine-grain redundancy and its resilience to MEUs [4][5][6]. However, for proper shielding against high failure rate while minimizing redundancy overhead in terms of area, speed and power consumption, systems should be designed with accurate mixed-grain redundancy and self-repair properties which are not feasible for fine-grain redundancy.

State-of-the-art SRAM-based FPGAs have the technology supporting run-time dynamic and partial reconfiguration (DPR), which can be used for adaptive behavior as well as for fault repairing [7]. A self-repairing system adopting the partial dynamic reconfiguration capabilities of SRAM-based FPGAs is often divided into two parts, called static region and dynamic region. The logic and routing resources and the corresponding configuration memory frames, identified by means of clock regions and major and minor columns, illustrated in figure 1, are organized in a static region, also called base region. The static region typically consists of a microprocessor, some memory modules and input/output ports, as described in figure 2. In general, these components are not reconfigured and their full functionality is constantly required for implementing the correct operations of the system; for this reason the static region is often hardened using a traditional redundancy-based approach, such as Triple Modular Redundancy (TMR) [7]. The static region is also responsible for the reconfiguration of the modules placed into the reconfigurable regions. On the contrary, the components in the dynamic region correspond to partially reconfigurable resources that can be configured in different ways depending on the system requirements [8]. The dynamically reconfigurable regions idea is extended in this paper so that the system is able to also correct the identified errors by applying internal reconfiguration.

M. Sonza Reorda, Fellow, IEEE, L. Sterpone, Member, IEEE, A. Ullah, Student Member, IEEE

M. Sonza Reorda, L. Sterpone and A. Ullah are with the Dipartimento di Automatica e Informatica (DAUIN), Politecnico di Torino, Torino, Italy. For any information please refer to: luca.sterpone@polito.it (contact author).

The proposed approach provides significant advantages compared to already developed solutions [9][10], mainly because it increases the error detection and correction capabilities while introducing comparable area and performance overhead.

In order to practically prove the effectiveness of the approach, we developed a complete set of tools for the automatic generation of the constraints used for the partitioning of the dynamic regions. The developed set of tools directly acts at the physical level, automatically inserting a carry chain into the physical net-list and adding comparator check flags into the circuitry; moreover, the tool is able to cleverly place the different partitions of the dynamic region into proper sub-regions, thus allowing SEUs and MEUs correction. The proposed approach drastically improves on the solution in [9], which uses the built-in slice carry chain for error detection only.

Our approach introduces a minimal area overhead, which is strictly dependent upon the number of user-defined partitions. On average, the overhead introduced by our approach is around 11% with respect to the duplication-based approach; hence, the proposed technique uses far fewer computational resources than the standard TMR solution. Furthermore, correction is performed on a single reconfigurable frame, which is the smallest amount of reconfigurable information that can be read or written; therefore, we can achieve the highest availability limits offered by the current reconfigurable technology.

The paper is organized as follows. Section 2 gives an overview of the soft error detection and correction methods implemented with modern SRAM-based FPGAs and summarizes the major contributions of this paper. Section 3 describes the proposed method, while the developed design flow is illustrated in Section 4. Experimental results on the selected case study and their analysis are presented in Section 5. Finally, conclusions and future works are described in Section 6.

2 PREVIOUS WORKS

State-of-the-art SRAM-based FPGAs are heterogeneous devices containing several macro blocks, like Digital Signal Processors (DSPs), Block RAMs (BRAMs) and IO Blocks (IOBs), along with Configurable Logic Blocks (CLBs) inside the FPGA reconfiguration fabric. Each of these resource types is arranged in columns that span from top to bottom of the device realizing a column of CLBs, IOBs and BRAM memories interconnected by a mesh of heterogeneous routing resources. Each SRAM-based FPGA chip is organized in a number of rows dependent on the manufacturer families or specific part. The most advanced devices have CLB rows connected to global resources as well as local clock sources [11]. In order to harden circuits implemented on SRAM-based FPGAs, different architecture level techniques have been proposed in the past. However, we can broadly classify them into two main techniques, namely fault masking and fault correction. In the next part of this section we will present a detailed discussion of the previous research work in each category.

In recent years, two different approaches have been proposed to mitigate SEUs affecting the configuration memory of SRAM-based FPGAs. On one side, full hardware redundancy obtained through Triple Modular Redundancy (TMR) is used to identify and correct logic values. This solution presents a large overhead in terms of area, power and especially delay, since it triplicates all the combinational and sequential logic, and the architecture introduces delay penalties due to the voter propagation time and routing congestion. On the other side, redundancy approaches are nowadays combined with scrubbing, which consists of periodically reloading the complete content of the FPGA’s configuration memory. A more complex system corrects the information in the configuration memory by using read-back and partial configuration procedures. Through the read-back process, the content of the FPGA’s configuration memory is read and compared with the expected value, which is stored in a dedicated memory located outside the FPGA. With the advent of modern SRAM-based FPGAs, this operation may be performed through dynamic reconfiguration. In detail, with dynamic reconfiguration the FPGA configuration memory can be read back continuously without interfering with the circuit functionality and, if any upset is detected, it can be selectively rewritten with the correct values, thus avoiding the accumulation of radiation-induced errors [12]. However, the main drawback of this technique is the long detection and correction time, which makes it unsuitable for real-time operation and ineffective against single points of failure induced by configuration memory bit-flips. Spatial redundancy using TMR is used as a complement to the read-back and correction techniques: on one side, TMR can tolerate faults, with the limitation of withstanding a single fault per voting group [13]; on the other side, read-back and correction avoids the accumulation of errors within the configuration memory. The combination of TMR and self-healing using dynamic partial reconfiguration has been previously used in [14][15]. However, this combined solution is computationally expensive and area hungry.
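The read-back and selective rewrite loop described above can be illustrated with a minimal sketch. It assumes that the configuration memory is modeled as a plain list of integer frames and that a golden copy is held outside the device; the names `scrub_once`, `config_mem` and `golden` are hypothetical and do not correspond to any vendor API.

```python
from typing import List


def scrub_once(config_mem: List[int], golden: List[int]) -> List[int]:
    """Read back every frame, compare it with the golden copy, and
    rewrite only the frames that differ. Returns the repaired indices."""
    repaired = []
    for idx, (frame, expected) in enumerate(zip(config_mem, golden)):
        if frame != expected:            # upset detected by comparison
            config_mem[idx] = expected   # selective rewrite of this frame only
            repaired.append(idx)
    return repaired


# Illustrative 4-bit "frames"; real frames are far larger.
golden = [0b1010, 0b1111, 0b0001, 0b0110]
config_mem = list(golden)
config_mem[2] ^= 0b0100                  # inject a single bit-flip in frame 2
fixed = scrub_once(config_mem, golden)
```

In this model every frame is visited on each pass, so the time needed to find an upset grows with the size of the configuration memory, which is consistent with the latency drawback noted above.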

Reconfiguration at the gate level is used in fine-grain approaches [16] with particular efficiency from the point of view of the area overhead, although it suffers from a complex and inflexible control mechanism. Furthermore, because of the adopted fine granularity, this approach is infeasible for system-level healing. A self-healing partial dynamic reconfigurable design methodology has been proposed in [17]. However, the method inserts control circuitry by partitioning the circuit for error localization and detection purposes. This requires a significant overhead.

Fig. 1. Resource and Frame Layout of Modern SRAM based FPGAs. [Figure labels: Major Column; Minor Column; Rows (Clock Regions); Frame 0, Frame 1, Frame 25, Frame 35; CLB Major Column; 20 CLB High.]

A methodology for fault-tolerant architectures using on-line checkers for fault detection and localization was introduced in [18]. On-line checkers for TMR- and duplication-based systems were combined with partial dynamic reconfiguration in [19] in such a way that the checker performs the detection and localization of faults, while reconfiguration recovers from the error. The detection and localization of errors is implemented as a partially reconfigurable module, which is itself subject to errors. A previous approach based on fine-granularity error masking has been developed in [20]; however, such a solution can only be applied to a TMR technique with a majority voter logic scheme. Conversely, a first overview of recovery architectures for high computational systems based on SRAM-based FPGAs has been presented in [21].

2.1 Main contribution

The main contribution of the present work, which is based on the platform preliminarily presented in [8], is the description of an autonomous recovery approach that can be applied to Partially Reconfigurable Modules (PRMs) when errors are detected inside them. The approach is implemented by the static region, which provides effective capabilities for detecting and correcting faults within the dynamic region. Our approach allows resilience to MEUs, since we adopt a static region protected with a fine-grain redundancy approach, as described in [3]. In particular, we propose a new fine-grain fault detection mechanism applied to FPGA resources: the approach is based on the comparison of Look-Up Table (LUT) outputs by using the logic available for carry propagation, which is generally used for fast arithmetic computations and is mostly not inferred by design tools, following the approach preliminarily introduced in [9] for fault detection. In detail, the proposed method is characterized by the ability to detect MEUs in the FPGA’s configuration memory, as well as to recover from any number of faults in the dynamic partition, thus improving on previously developed approaches, such as the one presented in [9], that cannot deal with MEUs. Our solution is adaptable to all modern SRAM-based FPGAs equipped with an Internal Configuration Access Port (ICAP) and based on a LUT-slice architecture.

3 THE PROPOSED METHOD

The proposed method consists of two flows: one applied to the dynamically reconfigurable region for implementing error detection, the other for instrumenting the circuit mapped on the FPGA so that it supports the execution of the self-repairing method against single and multiple-bit errors.

A dynamically reconfigurable system, from the architectural perspective, is partitioned into static and dynamic regions, as illustrated in figure 2. The static region consists of a processor with a static RAM, some general-purpose IOs, flash memories, and hardware resources for managing the internal configuration access port connected to the processor local bus. Since the static region contains the main processor, which is in charge of controlling the operational functionalities of the partially reconfigurable system, it is very important to tolerate and recover from errors in these modules. In this paper we assume that this region is implemented using Triple Modular Redundancy. By suitably mapping the three copies of the circuit elements on the device, the static region can be protected against any single point of failure.
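As an illustration of why a TMR-protected region masks one faulty copy but not two, the following minimal sketch applies a bitwise two-out-of-three majority vote to three copies of a value. The function name `majority` and the 8-bit test patterns are hypothetical choices made for this example only.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three majority vote over three module copies."""
    return (a & b) | (a & c) | (b & c)


correct = 0b10110010
faulty = correct ^ 0b00001000        # one bit flipped by an upset

# One corrupted copy: the voter masks the fault.
voted_single = majority(correct, faulty, correct)

# Two copies corrupted on the same bit: the voter outputs the wrong value.
voted_double = majority(faulty, faulty, correct)
```

The second case shows the limitation of a single fault per voting group: once two copies agree on an erroneous bit, the vote follows them.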

The dynamic region consists of the resources implementing the user’s circuit. The proposed approach mainly focuses on the dynamic region, and exploits reconfiguration at the individual frame level for error detection and correction. The dynamic region can be organized into a Single Bit Error (SBE) region, a Multi Bit Error (MBE) region and a Coarse-Grain Error region. It is first necessary to introduce some definitions related to the major characteristics of current SRAM-based FPGAs: modern FPGAs are divided row-wise into a number of clock regions for dynamic partial reconfiguration, while column-wise they are organized in major columns of resources, such as CLBs, DSPs or IOs. Each major column spans the whole height of the device, but it is configured in each clock region (row) by a separate reconfigurable frame (RF). Each RF contains a different number of “minor frames”, each having a height equal to the clock region (row) and numbered from left to right. For example, in Xilinx Virtex-5 devices [11] a CLB RF consists of 36 minor frames (hereafter simply referred to as frames), which are responsible for the configuration of LUTs and their routing, while the configuration bits for a single LUT are distributed over multiple frames. From the point of view of the circuit architecture, the proposed method is based on the Duplication With Comparison (DWC) technique applied at two different levels of granularity, herein called Coarse-grained DWC (C-DWC) and Fine-grained DWC (F-DWC).

The C-DWC is applied to slices that use the carry chain for computations such as fast additions or multiplications. In this case, the duplication is performed at the module level and the outputs are compared at the physical level by LUT elements configured to implement XOR combinational functions. Our approach is able to directly modify the circuit physical description in order to use the XOR logic function to compare the module’s outputs. In case of error, the software tools running on the reconfigurable system partially rewrite the C-DWC region.
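The module-level comparison used by C-DWC can be sketched in software as follows. This is only a behavioral model under stated assumptions: the two module copies are represented by hypothetical Python functions (`adder` and `faulty_adder`), and the XOR-configured LUT comparators are modeled by a bitwise XOR of the two outputs.

```python
from typing import Callable


def dwc_check(copy_a: Callable[[int, int], int],
              copy_b: Callable[[int, int], int],
              x: int, y: int) -> bool:
    """Return True when the XOR of the two module outputs is non-zero,
    i.e. when the duplicated modules disagree."""
    return (copy_a(x, y) ^ copy_b(x, y)) != 0


def adder(x: int, y: int) -> int:
    """Fault-free 8-bit adder used as the duplicated module."""
    return (x + y) & 0xFF


def faulty_adder(x: int, y: int) -> int:
    """Same adder with output bit 0 inverted, modeling an upset."""
    return adder(x, y) ^ 0x01


error_free = dwc_check(adder, adder, 17, 25)
error_seen = dwc_check(adder, faulty_adder, 17, 25)
```

A raised flag tells the controller that one of the two copies is wrong but not which one, which is why the whole C-DWC region is rewritten on error.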

F-DWC is applied at the place and route level, by suitably duplicating each LUT function in two copies that are placed in a single slice using two consecutive LUT positions. The outputs of the two LUTs are then compared with hardwired physical resources built into the slice in the form of a carry chain by using internal and non-programmable resources, such as hardwired MUXCYs and XORCYs.

Fig. 2. Placement space division into Single Bit Error region, Multiple Bit Error region and the Coarse Grain Error region.

The outputs generated by the XORCY functions are connected in a chain of OR logic functions in order to provide a single error detection flag for each column. Practically, the F-DWC approach can be adopted by acting at the Hardware Description Language (HDL) level: the combinational functions are duplicated and both copies of the circuit LUTs are placed in a single FPGA slice using two consecutive available LUT positions. Please note that the outputs of any pair of LUTs pass through the carry chain and, at each pair position, the XORCY generates a comparison signal called check flag. Since we are generating a check flag for each pair of LUTs, the number of check flags may drastically increase. This means that a considerable amount of routing resources could be required by the implementation of these check flags, because they have to be routed to the static region for the detection and correction of possible errors. Moreover, any such scheme will not only have a large overhead, but it will also be fruitless, because the smallest unit of reconfiguration is a frame and a finer localization of the error cannot be exploited.

In order to have a single check flag for each frame we propose to merge the individual check flags in two different ways. The check flags in the SBE region are merged through the built-in slice carry chain as shown in figure 3 (further details will be provided in section 3.1.1), where the hardwired resources XORCY and MUXCY are labeled as Xi and Mi, respectively. Furthermore, a whole column of slices is connected through the carry chains to produce a single flag for each column of slices (see for example flag 3 in the SBE region of figure 2).

In this way we achieve a large reduction in the number of check flags, but we can only detect Single Event Upsets (SEUs) in the SBE region, because multiple LUT pairs are connected together by a long chain of XORs and XNORs, and thus an even number of errors will go undetected due to the logic configuration of the detector.

In the Multiple Bit Error (MBE) region each pair of LUTs generates a check flag and thus we have two check flags per slice. The number of check flags can be reduced by OR-ing some of the flags corresponding to the slices in the same slice column, as shown in the magnified MBE region in figure 4. Although a higher overhead is introduced in this way, we have the ability to detect Multiple Event Upsets (MEUs) in the frames mapped on this region; in fact, the individual check flags are not merged along the carry chain passing through multiple XORs, as happens in the SBE region.
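The difference between the two merging strategies can be illustrated with a simplified model. It assumes that each LUT pair contributes one mismatch bit, that the SBE carry-chain merge behaves as the parity (XOR) of those bits, and that the MBE merge is a plain OR; the function names `sbe_flag` and `mbe_flag` are hypothetical, and the parity model is an idealization of the XOR/XNOR chain rather than a gate-accurate description.

```python
from functools import reduce
from operator import xor
from typing import List


def sbe_flag(mismatch: List[int]) -> int:
    """Idealized SBE merge: parity of the per-pair mismatch bits."""
    return reduce(xor, mismatch, 0)


def mbe_flag(mismatch: List[int]) -> int:
    """MBE merge: OR of the per-pair check flags."""
    return int(any(mismatch))


no_error = [0, 0, 0, 0]
one_error = [0, 1, 0, 0]
two_errors = [0, 1, 0, 1]

sbe_results = [sbe_flag(m) for m in (no_error, one_error, two_errors)]
mbe_results = [mbe_flag(m) for m in (no_error, one_error, two_errors)]
```

Under this model the parity merge raises the flag for one mismatch but cancels out for two, whereas the OR merge raises it in both cases, at the cost of routing more individual flags.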

3.1 Error Detection Method

In order to fully explain our proposal, in this section we will specifically refer to the architecture of Xilinx Virtex-5 FPGAs. As described in the previous section, the error detection mechanism implemented in the reconfigurable region is based on LUT-based checkers and carry chains for propagating the check flags. Please note that the LUT checkers are only deployed when the carry chain is unavailable for comparison purposes. This reduces the performance degradation of the circuit implemented with our method, although in this case the detection mechanism is implemented at the modular level. In this section, we focus on the method adopted for error detection using the carry chains for comparison; a more detailed explanation of the insertion of both the LUT checkers and the carry chains inside the physical place and route description of the circuit will be given in Section 4.3.

3.1.1 Single-bit error detection

In order to detect single-bit errors, we propose to duplicate each original LUT function into two identical LUTs. Furthermore, we place the two LUTs in a single FPGA slice, where we set the Carry Input and the generic AX inputs to 1 and 0, respectively, as illustrated in figure 3. Consequently, the hardwired XORCY logic gate at the bottom of the slice acts as an inverter, while the MUXCY multiplexer in the bottom first position simply acts as a buffer that passes the value of LUT A.

The multiplexer “M2” receives an inverted (through the AMUX_2_BX hardwired connection) and a buffered copy of the LUT A output at its “0” and “1” inputs, respectively, while the selection line is tied to LUT B (which is the copy of LUT A), thus effectively performing the EX-NOR function. The XOR gate named “X2” receives the LUT A and LUT B outputs on its inputs. Similarly, LUT C and LUT D can also be connected with such a scheme by extending the EX-NORs and EX-ORs along the slice. In fact, this scheme can be extended to an entire clock region covering 20 CLBs using the COUT and CIN of slices, thus generating two flags for the even and odd slice columns of the same CLB, respectively. This convergence strategy can only be applied if the CLB column has no empty slices. If the CLB column contains empty slices, the dedicated COUT connection cannot be used to propagate the flag signal upwards along the column. For such a case, an ORing LUT is introduced in the CLB column and placed in an available empty slice. This will be discussed in greater detail in Section 4.3. It is interesting to investigate an upper bound on the number of check flags that can be generated for the most complex design. The flag signal is generated per CLB tile column and is directly related to the device rows and columns. For example, for the Virtex-5 VLX110T device the maximum number of check flags for any design cannot be greater than 1,280 (160x8) [22][23]. As the FPGA must contain the control processor, the actual number will be considerably lower than 1,280 and will determine the size of the GPIO port that is used by the controller to detect errors. Then, it is possible to pinpoint single bit upsets in any of the four LUTs in any slice column in a clock region. However, errors affecting flip-flops

Fig. 3. Single-Bit Error detection scheme implemented in a single slice.
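The behavior of the first LUT pair in the slice can be checked with a small truth-table model. The sketch below assumes the usual carry-chain semantics, in which XORCY computes the XOR of its LUT input and the incoming carry, and MUXCY passes the incoming carry when its select input is 1 and the direct input otherwise. The function and variable names (`first_pair`, `x1`, `m1`, `m2`, `x2`, `bx`) are hypothetical labels loosely following figure 3, and the BX wiring reflects one reading of the AMUX_2_BX connection described in the text.

```python
def xorcy(li: int, ci: int) -> int:
    """Hardwired carry-chain XOR: output = LUT input XOR carry input."""
    return li ^ ci


def muxcy(s: int, ci: int, di: int) -> int:
    """Hardwired carry multiplexer: passes CI when S is 1, DI when S is 0."""
    return ci if s else di


def first_pair(lut_a: int, lut_b: int):
    """Model the first two carry-chain positions of one slice."""
    cin, ax = 1, 0                # constants set as described in the text
    x1 = xorcy(lut_a, cin)        # acts as an inverter: NOT lut_a
    m1 = muxcy(lut_a, cin, ax)    # acts as a buffer: lut_a
    bx = x1                       # assumed AMUX-to-BX link carries NOT lut_a
    m2 = muxcy(lut_b, m1, bx)     # XNOR(lut_a, lut_b): 1 when the copies agree
    x2 = xorcy(lut_b, m1)         # XOR(lut_a, lut_b): 1 when the copies differ
    return m2, x2


# Exhaustive truth table over both LUT outputs.
table = {(a, b): first_pair(a, b) for a in (0, 1) for b in (0, 1)}
```

In this model `x2` is 1 exactly when the two LUT copies disagree, while `m2` carries the complementary agreement signal up the chain to the next pair.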
