Fault injection techniques and tools

doi:10.1109/2.585157

Mei-Chen Hsueh, Timothy K. Tsai, and Ravishankar K. Iyer

University of Illinois at Urbana-Champaign

NASA-CR-2068OI


Fault injection is important to evaluating the dependability of computer

systems. Researchers and engineers have created many novel methods to

inject faults, which can be implemented in both hardware and software.

Dependability evaluation involves the study of

failures and errors. The destructive nature of

a crash and long error latency make it difficult

to identify the causes of failures in the operational

environment. It is particularly hard to recreate a

failure scenario for a large, complex system.

To identify and understand potential failures, we

use an experiment-based approach for studying the

dependability of a system. Such an approach is

applied not only during the conception and design

phases, but also during the prototype and operational phases.

To take an experiment-based approach, we must

first understand a system's architecture, structure,

and behavior. Specifically, we need to know its tol-

erance for faults and failures, including its built-in

detection and recovery mechanisms,3 and we need

specific instruments and tools to inject faults, create

failures or errors, and monitor their effects.

DIFFERENT PHASES, DIFFERENT TECHNIQUES

Engineers most often use low-cost, simulation-

based fault injection to evaluate the dependability

of a system that is in the conceptual and design

phases. At this point, the system under study is only

a series of high-level abstractions; implementation

details have yet to be determined. Thus the system

is simulated on the basis of simplified assumptions.

Simulation-based fault injection, which assumes

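that errors or failures occur according to a predetermined

distribution, is useful for evaluating the effectiveness

of fault-tolerant mechanisms and a system's dependability.

However, it requires accurate input parameters, which are difficult to supply:

design and technology changes often complicate the use of past

measurements. Testing a prototype, on the other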

hand, allows us to evaluate the system without any

assumptions about system design, which yields more

accurate results. In prototype-based fault injection,

we inject faults into the system to

• identify dependability bottlenecks,

• study system behavior in the presence of faults,

• determine the coverage of error detection and

recovery mechanisms, and

• evaluate the effectiveness of fault tolerance

mechanisms (such as reconfiguration schemes)

and performance loss.
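One of the goals listed above, determining the coverage of error detection, is often reported as the fraction of injected faults that the system detects. The following is a minimal Python sketch under that assumption. The function name `coverage_estimate`, the campaign numbers, and the use of a normal-approximation confidence interval are illustrative choices, not something the article specifies.

```python
import math

def coverage_estimate(detected: int, injected: int, z: float = 1.96):
    """Return (point estimate, lower bound, upper bound) for detection coverage.

    Uses the normal approximation to a binomial proportion, which is only
    reasonable when both the detected and undetected counts are fairly large.
    """
    if injected <= 0 or not 0 <= detected <= injected:
        raise ValueError("need injected > 0 and 0 <= detected <= injected")
    p = detected / injected
    half_width = z * math.sqrt(p * (1.0 - p) / injected)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical campaign: 1,000 injected faults, of which 940 were detected.
p, lo, hi = coverage_estimate(940, 1000)
print(f"coverage = {p:.3f}, approximate 95% interval = [{lo:.3f}, {hi:.3f}]")
```

The interval narrows only with the square root of the number of injections, which is one reason injection campaigns tend to need many runs.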

To do prototype-based fault injection, faults are

injected either at the hardware level (logical or electrical

faults) or at the software level (code or data

corruption) and the effects are monitored. The sys-

tem used for evaluation can be either a prototype or

a fully operational system. Injecting faults into an

operational system can provide information about

the failure process. However, fault injection is suit-

able for studying emulated faults only. It also fails

to provide dependability measures such as mean

time between failures and availability.

Instead of injecting faults, engineers can directly

measure operational systems as they handle real

workloads.2 Measurement-based analysis uses actual

data, which contain much information about naturally

occurring errors and failures and sometimes about recovery

attempts. Analyzing these data can provide understanding

of actual error and failure characteristics and insight for

analytical models. Measurement-based analysis, however, is

limited to detected errors. Furthermore, data must be collected

over a long period, because errors and failures occur

infrequently. Field conditions can vary widely, thus

casting doubt on the statistical validity of the result.

Although each of the three experimental methods

has its limitations, their unique values complement

one another and allow for a wide spectrum of dependability

studies.

Figure 1. Basic components of a fault injection environment. (Diagram labels: fault injection system, controller, fault injector, monitor, target system, data collector, data analyzer.)

FAULT INJECTION TECHNIQUES

Engineers use fault injection to test fault-tolerant

systems or components. Fault injection tests fault

detection, fault isolation, and reconfiguration and

recovery capabilities.

The fault injection environment

Figure 1 shows a fault injection environment, which

typically consists of the target system plus a fault injector,

fault library, workload generator, workload library,

controller, monitor, data collector, and data analyzer.

The fault injector injects faults into the target system

as it executes commands from the workload generator

(applications, benchmarks, or synthetic workloads).

The monitor tracks the execution of the commands and

initiates data collection whenever necessary. The data

collector performs online data collection, and the data

analyzer, which can be offline, performs data processing

and analysis. The controller controls the experiment.

Physically, the controller is a program that can run

on the target system or on a separate computer. The

fault injector can be custom-built hardware or soft-

ware. The fault injector itself can support different

fault types, fault locations, fault times, and appropriate

hardware semantics or software structure, the

values of which are drawn from a fault library. The

fault library in Figure 1 is a separate component,

which allows for greater flexibility and portability.

The workload generator, monitor, and other compo-

nents can be implemented the same way.
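The division of labor among these components can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the structure of any real tool: the `Fault` record, the 8-bit memory target, the `workload` function, and the comparison against a fault-free run are all hypothetical names and choices introduced for illustration.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Fault:
    """One fault library entry: fault type, location, time, and bit."""
    kind: str       # fault type, here always "bit-flip"
    location: int   # index of the target memory word
    time: int       # workload step at which the fault is injected
    bit: int        # which bit of the word to flip

def workload(memory, step):
    """Toy workload: each step adds 1 to one memory word (8-bit wraparound)."""
    i = step % len(memory)
    memory[i] = (memory[i] + 1) & 0xFF

def run(fault, steps=32, size=4):
    """Controller: run the workload, inject at most one fault, return memory."""
    memory = [0] * size
    for step in range(steps):
        if fault is not None and step == fault.time:
            memory[fault.location] ^= 1 << fault.bit   # the fault injector
        workload(memory, step)
    return memory

def campaign(fault_library):
    """Data collector and analyzer: compare each run with a fault-free run."""
    golden = run(None)
    return [(fault, run(fault) != golden) for fault in fault_library]

rng = random.Random(0)  # fixed seed so the experiment is repeatable
fault_library = [
    Fault("bit-flip", rng.randrange(4), rng.randrange(32), rng.randrange(8))
    for _ in range(5)
]
results = campaign(fault_library)
for fault, corrupted in results:
    print(fault, "-> output differs" if corrupted else "-> masked")
```

Keeping the fault library as plain data, separate from the injector, mirrors the flexibility argument in the text: the same controller can replay a different set of faults without any change to the injection code.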

Choosing between hardware and software injection

Choosing between hardware and software fault

injection depends on the type of faults you are inter-

ested in and the effort required to create them. For

example, if you are interested in stuck-at faults (faults

that force a permanent value onto a point in a circuit),

a hardware injector is preferable because you can control

the location of the fault. The injection of permanent

faults using software methods either incurs a high

overhead or is impossible, depending on the fault.

However, if you are interested in data corruption, the

software approach might suffice. Some faults, such as

bit-flips in memory cells, can be injected by either

method. In a case like this, additional requirements,

such as cost, accuracy, intrusiveness, and repeatabil-

ity may guide the choice of approach. Table 1 sum-

marizes commonly studied faults and injection

methods.
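To illustrate why software handles a memory bit-flip easily but pays a high overhead for permanent faults, here is a minimal Python sketch. It assumes a byte-addressed memory modeled as a `bytearray`; the helper `flip_bit` and the class `StuckAtMemory` are hypothetical names introduced for this example.

```python
def flip_bit(memory: bytearray, addr: int, bit: int) -> None:
    """Transient fault: invert one bit once; a later write overwrites it."""
    memory[addr] ^= 1 << bit

class StuckAtMemory:
    """Emulates a stuck-at bit in software by re-forcing it after each write."""

    def __init__(self, size: int, addr: int, bit: int, value: int):
        self.data = bytearray(size)
        self.addr, self.mask, self.value = addr, 1 << bit, value
        self._force()

    def _force(self) -> None:
        if self.value:
            self.data[self.addr] |= self.mask
        else:
            self.data[self.addr] &= ~self.mask & 0xFF

    def write(self, addr: int, byte: int) -> None:
        self.data[addr] = byte & 0xFF
        if addr == self.addr:   # extra check on every write: the overhead
            self._force()

    def read(self, addr: int) -> int:
        return self.data[addr]

mem = bytearray(4)
flip_bit(mem, addr=2, bit=3)    # one-shot corruption of bit 3 at address 2
mem[2] = 0                      # an ordinary write erases the transient fault

stuck = StuckAtMemory(size=4, addr=2, bit=3, value=1)
stuck.write(2, 0x00)            # the stuck bit survives the write
print(hex(mem[2]), hex(stuck.read(2)))
```

The transient flip is a single operation, whereas the stuck-at emulation has to intercept every write to the affected location, which is the kind of overhead the text refers to.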

HARDWARE FAULT INJECTION

Hardware-implemented fault injection uses addi-

tional hardware to introduce faults into the target sys-

tem's hardware. Depending on the faults and their

locations, hardware-implemented fault injection meth-

ods fall into two categories:

• Hardware fault injection with contact. The injector

has direct physical contact with the target system,

producing voltage or current changes

externally to the target chip. Examples are meth-

ods that use pin-level probes and sockets.

• Hardware fault injection without contact. The

Computer

injector has no direct physical contact with the

target system. Instead, an external source pro-

duces some physical phenomenon, such as heavy-

ion radiation and electromagnetic interference,

causing spurious currents inside the target chip.

These methods are well suited for studying the

dependability characteristics of prototypes that

require high time-resolution for hardware triggering

and monitoring (fault latency in the CPU, for exam-

ple) or require access to locations that cannot be eas-

ily reached by other fault injection methods.

Engineers generally model hardware methods on

low-level fault models; for example, a bridging fault

might be a short circuit. Hardware also triggers faults

and monitors their impact, thus providing high time-

resolution and low perturbation. Normally, the hard-

ware triggers faults after a specified time has expired

on a hardware timer or after it has detected an event,

such as a specified address on the address bus.

Injection with contact

Hardware fault injection using direct contact with

circuit pins, often called pin-level injection, is probably

the most common method of hardware-implemented

fault injection. There are two main

techniques for altering electrical currents and volt-

ages at the pins:

• Active probes. This technique adds current via

the probes attached to the pins, altering their elec-

trical currents. The probe method is usually lim-

ited to stuck-at faults, although it is possible to

attain bridging faults by placing a probe across

two or more pins. Care must be taken when using

active probes to force additional current into the

target device, as an inordinate amount of current

can damage the target hardware.

• Socket insertion. This technique inserts a socket

between the target hardware and its circuit

board. The inserted socket injects stuck-at, open,

or more complex logic faults into the target hardware

by forcing the analog signals that represent

desired logic values onto the pins of the target

hardware. The pin signals can be inverted,

ANDed, or ORed with adjacent pin signals or

even with previous signals on the same pin.
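The pin-level fault types named in the two techniques above can be modeled in software as functions that map fault-free pin values to faulty ones. The Python sketch below is only a logical model, assuming each pin carries a single 0 or 1 value; the names `stuck_at`, `inverted`, and `combined` are hypothetical, and real injectors operate on analog signals in hardware.

```python
from typing import Callable, List

Pins = List[int]                    # one logic value (0 or 1) per pin
PinFault = Callable[[Pins], Pins]   # maps fault-free values to faulty ones

def stuck_at(pin: int, value: int) -> PinFault:
    """Force one pin to a fixed logic value."""
    def apply(pins: Pins) -> Pins:
        out = list(pins)
        out[pin] = value
        return out
    return apply

def inverted(pin: int) -> PinFault:
    """Invert the signal on one pin."""
    def apply(pins: Pins) -> Pins:
        out = list(pins)
        out[pin] = 1 - pins[pin]
        return out
    return apply

def combined(pin: int, neighbor: int, op: str) -> PinFault:
    """AND or OR a pin's signal with the signal on an adjacent pin."""
    def apply(pins: Pins) -> Pins:
        out = list(pins)
        if op == "and":
            out[pin] = pins[pin] & pins[neighbor]
        elif op == "or":
            out[pin] = pins[pin] | pins[neighbor]
        else:
            raise ValueError("op must be 'and' or 'or'")
        return out
    return apply

bus = [1, 0, 1, 1]                      # fault-free values on four pins
print(stuck_at(0, 0)(bus))              # pin 0 stuck at 0
print(inverted(1)(bus))                 # pin 1 inverted
print(combined(2, 1, "and")(bus))       # pin 2 ANDed with pin 1
print(combined(1, 2, "or")(bus))        # pin 1 ORed with pin 2
```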

Both of these methods provide good controllabil-

ity of fault times and locations with little or no per-

turbation to the target system. Note that because

faults are modeled at the pin level, they are not iden-

tical to traditional stuck-at and bridging fault models

that generally occur inside the chip. Nonetheless, you

can achieve many of the same effects, like the exercise

of error detection circuits, using these injection methods.

Active probes attached to the power supply hardware

inject power supply disturbance faults. However,

this can damage the injected device and so increases the risk

of destructive injection.

Table 1. Commonly studied faults and injection methods.

Hardware fault injection     Software fault injection
Open                         Storage data corruption (such as register, memory, and disk)
Bridging                     Communication data corruption (such as bus and communication network)
Bit-flip                     Manifestation of software defects (such as machine level and higher levels)
Spurious current
Power surge
Stuck-at

Injection without contact

These faults are injected by creating heavy-ion radi-

ation. An ion passes through the depletion region of

the target device and generates current. Placing the

target hardware in or near an electromagnetic field

also injects faults. Engineers like these methods

because they mimic natural physical phenomena.

However, it is difficult to exactly trigger the time and

location of a fault injection using this technique

because you cannot precisely control the exact

moment of heavy-ion emission or electromagnetic

field creation.

Hardware injection tools

Messaline,4 developed at LAAS-CNRS in Toulouse,

France, uses both active probes and sockets to conduct

pin-level fault injection. Figure 2

shows Messaline's general architecture and its envi-

ronment. Messaline can inject stuck-at, open, bridg-

ing, and complex logical faults, among others. It can

also control the length of fault existence and the fre-

quency. Signals collected from the target system can

provide feedback to the injector. Also, a device is associated

with each injection point to sense whether and when

each fault is activated and produces an error. Messaline can

also inject faults at up to 32 injection points simultaneously.

This tool has been used in experiments on a central-

ized, interlocking system employed in a computerized

railway control system and on a distributed system

for the Esprit Delta-4 Project.

FIST5 (Fault Injection System for Study of Transient

Fault Effect), developed at the Chalmers University of

Technology in Sweden, employs both contact and con-

tactless methods to create transient faults inside the

target system. This tool uses heavy-ion radiation to

create transient faults at random locations inside a

chip when the chip is exposed to the radiation and

can thus cause single- or multiple-bit-flips.

April 1997

Figure 2. General architecture of Messaline.

The radiation source is mounted inside a vacuum chamber

together with a small two-processor computer sys-

tem. The computer is positioned so that one of the

processors is exposed directly under the radiation.

The other processor is used as a reference for detect-

ing whether the radiation results in any bit-flips.

Figure 3 illustrates the FIST environment.
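The idea of running an exposed processor beside a reference processor and comparing the two can be sketched in Python. This is a toy model, not FIST's actual comparator logic: the 16-bit state update in `step`, the function `run_lockstep`, and the way the bit-flip is scheduled are all assumptions made for illustration.

```python
import random

def step(state: int) -> int:
    """Toy deterministic 'processor': one 16-bit state update per cycle."""
    return (state * 75 + 74) & 0xFFFF

def run_lockstep(cycles, flip_cycle, flip_bit, seed_state=1):
    """Run two identical machines; flip one bit in the exposed one.

    Returns the cycle at which the comparator first sees a mismatch,
    or None if the two machines never diverge.
    """
    exposed = reference = seed_state
    for cycle in range(cycles):
        exposed, reference = step(exposed), step(reference)
        if cycle == flip_cycle:
            exposed ^= 1 << flip_bit      # emulated radiation-induced bit-flip
        if exposed != reference:          # comparator on the machine state
            return cycle
    return None

rng = random.Random(1)                     # fixed seed for repeatability
flip_cycle, flip_bit = rng.randrange(100), rng.randrange(16)
detected_at = run_lockstep(100, flip_cycle, flip_bit)
print(f"flip at cycle {flip_cycle}, bit {flip_bit}; mismatch seen at {detected_at}")
```

In this sketch the comparator sees the whole state, so the mismatch shows up in the same cycle as the flip. A comparator that only observes external pins would typically see it later, or not at all if the error is masked.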

FIST can inject faults directly inside a chip, which

cannot be done with pin-level injections. It can produce

transient faults at random locations spread evenly across a

chip, which leads to a large variation in the errors seen

on the output pins. In addition to radiation, FIST

allows for the injection of power disturbance faults.

This is done by placing a MOS transistor between the

power supply and the Vcc pin of the processor chip to

control the amplitude of the voltage drop. Power sup-

ply disturbances usually affect multiple locations within

a chip and can cause gate propagation delay faults. The

experimental results show that the errors resulting from

both methods cause similar effects on program con-

trol-flow and data errors. However, heavy-ion radia-

tion causes mostly address bus errors, while power

supply disturbances affect mostly control signals.

MARS (Maintainable Real-Time System) is a distributed,

fault-tolerant architecture developed at the

Technical University of Vienna. In addition to using

heavy-ion radiation as is used in FIST, MARS uses

electromagnetic fields to conduct contactless fault

injection: A circuit board placed between two charged

plates or a chip placed near a charged probe causes

fault injection. Dangling wires that act as antennas

placed on individual chip pins accentuate the electro-

magnetic field effect on those pins. Researchers com-

pared these three methods (heavy-ion radiation,

pin-level injection, and electromagnetic interference)

in terms of their capability to exercise the MARS error

detection mechanisms. Results showed that the three

methods are complementary and generate different

types of errors. Pin-level injections cause error detection

mechanisms outside the CPU to be exercised more

effectively than heavy-ion radiation or electromag-

netic interference. The latter two methods were bet-

ter suited for exercising software and application-level

error detection mechanisms.

SOFTWARE FAULT INJECTION

In recent years, researchers have taken more interest

in developing software-implemented fault injection

tools. Software fault-injection techniques are

attractive because they don't require expensive hardware.

Furthermore, they can be used to target applications

and operating systems, which is difficult to do

with hardware fault injection.

If the target is an application, the fault injector is

inserted into the application itself or layered between

the application and the operating system. If the target

is the operating system, the fault injector must be

embedded in the operating system, as it is very difficult

to add a layer between the machine and the operating

system.

Although the software approach is flexible, it has

its shortcomings.

• It cannot inject faults into locations that are inaccessible

to software.

• The software instrumentation may disturb the

workload running on the target system and even

change the structure of original software. Careful

design of the injection environment can minimize

perturbation to the workload.

Figure 3. FIST environment. (Diagram labels: inside vacuum chamber, reference CPU, comparator, error flip-flops, logic analyzer, host computer, monitoring computer, external bus, serial port, memory, error data, trigger, reset, commands and program loading.)

• The poor time-resolution of the approach may

cause fidelity problems. For long latency faults,

such as memory faults, the low time-resolution

may not be a problem. For short latency faults,

such as bus and CPU faults, the approach may fail

to capture certain error behavior, like propagation.

Engineers can solve this problem by taking a

hybrid approach, which combines the versatility

of software fault injection and the accuracy of

hardware monitoring. The hybrid approach is well

suited for measuring extremely short latencies.

However, the hardware monitoring involved can

cost more and decrease flexibility by limiting

observation points and data storage size.

We can categorize software injection methods on

the basis of when the faults are injected: during compile-time

or during runtime.

Compile-time injection

To inject faults at compile-time, the program

instruction must be modified before the program

image is loaded and executed. Rather than injecting

faults into the hardware of the target system, this

method injects errors into the source code or assembly

code of the target program to emulate the effect

of hardware, software, and transient faults. The mod-

ified code alters the target program instructions, caus-

ing injection. Injection generates an erroneous soft-

ware image, and when the system executes the fault

image, it activates the fault.

This method requires the modification of the pro-

gram that will evaluate fault effect, and it requires no

additional software during runtime. In addition, it

causes no perturbation to the target system during

execution. Because the fault effect is hard-coded, engi-

neers can use it to emulate permanent faults. This

method's implementation is very simple, but it does

not allow the injection of faults as the workload pro-

gram runs.
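As a rough illustration of compile-time injection, the Python sketch below mutates a program's syntax tree before it is compiled, so the faulty image exists before execution starts and no injection code runs alongside the workload. The target function `checksum`, the `AddToSub` transformer, and the choice of fault (replacing one addition with a subtraction) are hypothetical examples, not the method of any tool named in this article.

```python
import ast

SOURCE = """
def checksum(values):
    total = 0
    for v in values:
        total = total + v
    return total
"""

class AddToSub(ast.NodeTransformer):
    """Replace the first '+' operator encountered with '-'."""

    def __init__(self):
        self.done = False

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if not self.done and isinstance(node.op, ast.Add):
            node.op = ast.Sub()       # the hard-coded fault
            self.done = True
        return node

def build(source: str, mutate: bool):
    """Parse, optionally inject the fault, compile, and return checksum()."""
    tree = ast.parse(source)
    if mutate:
        tree = AddToSub().visit(tree)
        ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, "<target>", "exec"), namespace)
    return namespace["checksum"]

golden = build(SOURCE, mutate=False)   # fault-free image
faulty = build(SOURCE, mutate=True)    # erroneous image, fault is permanent
print(golden([1, 2, 3]), faulty([1, 2, 3]))
```

Because the fault is fixed in the compiled image, it behaves like a permanent fault, and it cannot be switched on or off while the workload runs, which matches the trade-off described above.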

Runtime injection

During runtime, a mechanism is needed to trigger

fault injection. Commonly used triggering mecha-

nisms include:

• Time-out. In this simplest of techniques, a timer

expires at a predetermined time, triggering injec-

tion. Specifically, the time-out event generates an

interrupt to invoke fault injection. The timer

can be a hardware or software timer. This

method requires no modification to the applica-

tion or workload program. A hardware timer

must be linked to the system's interrupt handler

vector. Since it injects faults on the basis of time

rather than specific events or system state, it pro-
