Fault injection techniques and tools

Abstract
Fault injection is important to evaluating the dependability of computer systems. Researchers and engineers have created many novel methods to inject faults, which can be implemented in both hardware and software. The contrast between the hardware and software methods lies mainly in the fault injection points they can access, the cost and the level of perturbation. Hardware methods can inject faults into chip pins and internal components, such as combinational circuits and registers that are not software-addressable. On the other hand, software methods are convenient for directly producing changes at the software-state level. Thus, we use hardware methods to evaluate low-level error detection and masking mechanisms, and software methods to test higher level mechanisms. Software methods are less expensive, but they also incur a higher perturbation overhead because they execute software on the target system.


Mei-Chen Hsueh, Timothy K. Tsai, and Ravishankar K. Iyer
University of Illinois at Urbana-Champaign
NASA-CR-2068OI
Dependability evaluation involves the study of
failures and errors. The destructive nature of
a crash and long error latency make it difficult
to identify the causes of failures in the operational
environment. It is particularly hard to recreate a
failure scenario for a large, complex system.
To identify and understand potential failures, we
use an experiment-based approach for studying the
dependability of a system. Such an approach is
applied not only during the conception and design
phases, but also during the prototype and operational
phases.
To take an experiment-based approach, we must
first understand a system's architecture, structure,
and behavior. Specifically, we need to know its tol-
erance for faults and failures, including its built-in
detection and recovery mechanisms,3 and we need
specific instruments and tools to inject faults, create
failures or errors, and monitor their effects.
DIFFERENT PHASES, DIFFERENT TECHNIQUES
Engineers most often use low-cost, simulation-
based fault injection to evaluate the dependability
of a system that is in the conceptual and design
phases. At this point, the system under study is only
a series of high-level abstractions; implementation
details have yet to be determined. Thus the system
is simulated on the basis of simplified assumptions.
Simulation-based fault injection, which assumes
that errors or failures occur according to predetermined
distributions, is useful for evaluating the effectiveness
of fault-tolerant mechanisms and a system's dependability.
However, it requires input parameters, which are difficult
to supply from measurements. Testing a prototype, on the other
hand, allows us to evaluate the system without any
assumptions about system design, which yields more
accurate results. In prototype-based fault injection,
we inject faults into the system to
* identify dependability bottlenecks,
* study system behavior in the presence of faults,
* determine the coverage of error detection and
  recovery mechanisms, and
* evaluate the effectiveness of fault tolerance
  mechanisms (such as reconfiguration schemes)
  and performance loss.
To do prototype-based fault injection, faults are
injected either at the hardware level (logical or electrical
faults) or at the software level (code or data
corruption) and the effects are monitored. The sys-
tem used for evaluation can be either a prototype or
a fully operational system. Injecting faults into an
operational system can provide information about
the failure process. However, fault injection is suit-
able for studying emulated faults only. It also fails
to provide dependability measures such as mean
time between failures and availability.
Instead of injecting faults, engineers can directly
measure operational systems as they handle real
workloads.2 Measurement-based analysis uses actual
data, which contain much information about naturally
occurring errors and failures and sometimes
about recovery attempts. Analyzing these data can
provide understanding of actual error and failure
characteristics and insight for analytical models.
Furthermore, data must be collected over a long
period because errors and failures occur

Figure 1. Basic components of a fault injection environment: controller, fault injector, monitor, data collector, data analyzer, and target system.
infrequently. Field conditions can vary widely, thus
casting doubt on the statistical validity of the result.
Although each of the three experimental methods
has its limitations, their unique values complement
one another and allow for a wide spectrum of dependability
studies.
FAULT INJECTION TECHNIQUES
Engineers use fault injection to test fault-tolerant
systems or components. Fault injection tests fault
detection, fault isolation, and reconfiguration and
recovery capabilities.
The fault injection environment
Figure 1 shows a fault injection environment, which
typically consists of the target system plus a fault injector,
fault library, workload generator, workload library,
controller, monitor, data collector, and data analyzer.
The fault injector injects faults into the target system
as it executes commands from the workload generator
(applications, benchmarks, or synthetic workloads).
The monitor tracks the execution of the commands and
initiates data collection whenever necessary. The data
collector performs online data collection, and the data
analyzer, which can be offline, performs data processing
and analysis. The controller controls the experiment.
Physically, the controller is a program that can run
on the target system or on a separate computer. The
fault injector can be custom-built hardware or soft-
ware. The fault injector itself can support different
fault types, fault locations, fault times, and appropri-
ate hardware semantics or software structure--the
values of which are drawn from a fault library. The
fault library in Figure 1 is a separate component,
which allows for greater flexibility and portability.
The workload generator, monitor, and other compo-
nents can be implemented the same way.
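
The division of labor among these components can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the environment of any tool discussed here: the target is a toy 16-bit accumulator that reports its low byte, and every name (FaultSpec, workload_generator, run_once, run_experiment, analyze) is hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultSpec:
    """One fault library entry (hypothetical): flip `bit` before workload step `step`."""
    bit: int
    step: int


def workload_generator(n):
    """Synthetic workload: the commands 1..n, each added to the accumulator."""
    return list(range(1, n + 1))


def run_once(commands, fault=None):
    """Toy target system: a 16-bit accumulator whose visible output is its low byte."""
    acc = 0
    for step, value in enumerate(commands):
        if fault is not None and step == fault.step:
            acc ^= 1 << fault.bit          # fault injector: flip one bit of the state
        acc = (acc + value) & 0xFFFF       # target executes the next command
    return acc & 0xFF


def run_experiment(fault_library, commands):
    """Controller: one fault-free (golden) run, then one run per library entry.
    The comparison plays the monitor; the records list plays the data collector."""
    golden = run_once(commands)
    records = []
    for fault in fault_library:
        observed = run_once(commands, fault)
        records.append((fault, observed != golden))
    return golden, records


def analyze(records):
    """Data analyzer (offline): fraction of injected faults that changed the output."""
    return sum(1 for _, changed in records if changed) / len(records)


commands = workload_generator(10)
rng = random.Random(0)                     # seeded so the experiment is repeatable
fault_library = [FaultSpec(bit=rng.randrange(16), step=rng.randrange(10))
                 for _ in range(20)]
golden, records = run_experiment(fault_library, commands)
print("golden output:", golden, "fraction visible:", analyze(records))
```

Keeping the fault library as plain data, separate from the injector, mirrors the flexibility argument above: the same controller loop can be reused with a different library or workload. In this toy target, flips in the upper byte never reach the output, which gives a small example of fault masking.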
Choosing between hardware and software injection
Choosing between hardware and software fault
injection depends on the type of faults you are inter-
ested in and the effort required to create them. For
example, if you are interested in stuck-at faults (faults
that force a permanent value onto a point in a circuit),
a hardware injector is preferable because you can control
the location of the fault. The injection of permanent
faults using software methods either incurs a high
overhead or is impossible, depending on the fault.
However, if you are interested in data corruption, the
software approach might suffice. Some faults, such as
bit-flips in memory cells, can be injected by either
method. In a case like this, additional requirements,
such as cost, accuracy, intrusiveness, and repeatability,
may guide the choice of approach. Table 1 summarizes
commonly studied faults and injection
methods.
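
The difference between a stuck-at fault and a bit-flip can be made concrete with a short Python sketch. It is an illustration only, and the names stuck_at, bit_flip, and StuckAtCell are hypothetical. It also hints at why emulating a permanent fault in software is costly: the fault has to be reapplied on every access, whereas a bit-flip is applied once.

```python
def stuck_at(word, bit, value):
    """Permanent fault model: force one bit of a word to 0 or 1."""
    mask = 1 << bit
    return (word | mask) if value else (word & ~mask)


def bit_flip(word, bit):
    """Transient fault model: invert one bit, once."""
    return word ^ (1 << bit)


class StuckAtCell:
    """Memory cell with an emulated stuck-at fault (hypothetical helper).
    The fault is reapplied on every read, so it survives any later write."""

    def __init__(self, bit, value):
        self.bit = bit
        self.value = value
        self._data = 0

    def write(self, word):
        self._data = word

    def read(self):
        return stuck_at(self._data, self.bit, self.value)


cell = StuckAtCell(bit=0, value=1)
cell.write(0b1000)
first_read = cell.read()    # bit 0 reads as 1 although 0 was written
cell.write(0b0000)
second_read = cell.read()   # the fault persists after a fresh write
```

A bit-flip applied to stored data, by contrast, disappears as soon as the location is overwritten, which is why it can be injected by either method.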
HARDWARE FAULT INJECTION
Hardware-implemented fault injection uses addi-
tional hardware to introduce faults into the target sys-
tem's hardware. Depending on the faults and their
locations, hardware-implemented fault injection meth-
ods fall into two categories:
* Hardware fault injection with contact. The injector
  has direct physical contact with the target system,
  producing voltage or current changes
  externally to the target chip. Examples are methods
  that use pin-level probes and sockets.
* Hardware fault injection without contact. The
  injector has no direct physical contact with the
  target system. Instead, an external source produces
  some physical phenomenon, such as heavy-ion
  radiation or electromagnetic interference,
  causing spurious currents inside the target chip.
These methods are well suited for studying the
dependability characteristics of prototypes that
require high time-resolution for hardware triggering
and monitoring (fault latency in the CPU, for exam-
ple) or require access to locations that cannot be eas-
ily reached by other fault injection methods.
Engineers generally model hardware methods on
low-level fault models; for example, a bridging fault
might be a short circuit. Hardware also triggers faults
and monitors their impact, thus providing high time-
resolution and low perturbation. Normally, the hard-
ware triggers faults after a specified time has expired
on a hardware timer or after it has detected an event,
such as a specified address on the address bus.
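
The two triggering conditions just described, timer expiry and detection of a specified address on the address bus, can be modeled in software as follows. This is a simulation sketch only: real triggers are implemented in hardware, and the Trigger class, its fields, and the address trace are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trigger:
    """Fires once: after `timeout_cycles` bus cycles, or when `match_address`
    is seen on the address bus, whichever happens first."""
    timeout_cycles: Optional[int] = None
    match_address: Optional[int] = None
    cycles_seen: int = 0
    fired: bool = False

    def observe(self, address: int) -> bool:
        """Call once per bus cycle; returns True only on the cycle the trigger fires."""
        if self.fired:
            return False
        self.cycles_seen += 1
        timed_out = (self.timeout_cycles is not None
                     and self.cycles_seen >= self.timeout_cycles)
        matched = (self.match_address is not None
                   and address == self.match_address)
        self.fired = timed_out or matched
        return self.fired


def first_fire(trigger, trace):
    """Return the index of the bus cycle at which the trigger fires, or None."""
    for cycle, address in enumerate(trace):
        if trigger.observe(address):
            return cycle
    return None


# Hypothetical address-bus trace, one address per cycle.
trace = [0x100, 0x104, 0x108, 0x200, 0x10C]
on_address = first_fire(Trigger(match_address=0x200), trace)
on_timer = first_fire(Trigger(timeout_cycles=2), trace)
```

An address-match trigger ties the injection to a program event, while a timer ties it to elapsed time; which one is appropriate depends on whether the experiment needs to hit a particular point in execution.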
Injection with contact
Hardware fault injection using direct contact with
circuit pins, often called pin-level injection, is probably
the most common method of hardware-implemented
fault injection. There are two main
techniques for altering electrical currents and volt-
ages at the pins:
* Active probes. This technique adds current via
  the probes attached to the pins, altering their
  electrical currents. The probe method is usually
  limited to stuck-at faults, although it is possible to
  attain bridging faults by placing a probe across
  two or more pins. Care must be taken when using
  active probes to force additional current into the
  target device, as an inordinate amount of current
  can damage the target hardware.
* Socket insertion. This technique inserts a socket
  between the target hardware and its circuit
  board. The inserted socket injects stuck-at, open,
  or more complex logic faults into the target
  hardware by forcing the analog signals that represent
  desired logic values onto the pins of the target
  hardware. The pin signals can be inverted,
  ANDed, or ORed with adjacent pin signals or
  even with previous signals on the same pin.
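
The logic faults that socket insertion can force onto a pin can be modeled at the logic level as below. This is a sketch under assumptions: pins are idealized as 0/1 values, the "adjacent" pin is taken to be the next pin in the list, open faults are not modeled, and the function name and mode strings are hypothetical.

```python
def inject_pin_fault(pins, prev_pins, pin, mode):
    """Return the logic values the target chip sees after a fault on `pin`.

    pins      -- 0/1 values driven by the board in the current cycle
    prev_pins -- 0/1 values from the previous cycle
    mode      -- 'stuck0', 'stuck1', 'invert', 'and_adjacent',
                 'or_adjacent', or 'previous'
    """
    out = list(pins)                            # leave the caller's list untouched
    neighbor = pins[(pin + 1) % len(pins)]      # assumed adjacency: next pin, wrapping
    if mode == "stuck0":
        out[pin] = 0
    elif mode == "stuck1":
        out[pin] = 1
    elif mode == "invert":
        out[pin] = pins[pin] ^ 1
    elif mode == "and_adjacent":
        out[pin] = pins[pin] & neighbor
    elif mode == "or_adjacent":
        out[pin] = pins[pin] | neighbor
    elif mode == "previous":
        out[pin] = prev_pins[pin]               # replay last cycle's value on this pin
    else:
        raise ValueError(f"unknown fault mode: {mode}")
    return out


pins = [1, 0, 1, 1]
prev_pins = [0, 1, 0, 0]
seen_by_chip = inject_pin_fault(pins, prev_pins, pin=1, mode="invert")
```

Only the selected pin changes; every other pin passes through the socket unmodified, which is what keeps perturbation to the rest of the system low.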
Both of these methods provide good controllability
of fault times and locations with little or no perturbation
to the target system. Note that because
faults are modeled at the pin level, they are not identical
to traditional stuck-at and bridging fault models
that generally occur inside the chip. Nonetheless, you
can achieve many of the same effects, like the exercise
of error detection circuits, using these injection methods.
Active probes attached to the power supply hardware
inject power supply disturbance faults. However,
this increases the risk of destructive injection, that is,
of damaging the injected device.

Table 1. Commonly studied faults and injection methods.
Hardware            Software
Open                Storage data corruption (such as register, memory, and disk)
Bridging            Communication data corruption (such as bus and communication network)
Bit-flip            Manifestation of software defects (such as machine level and higher levels)
Spurious current
Power surge
Stuck-at
Injection without contact
These faults are injected by creating heavy-ion radi-
ation. An ion passes through the depletion region of
the target device and generates current. Placing the
target hardware in or near an electromagnetic field
also injects faults. Engineers like these methods
because they mimic natural physical phenomena.
However, it is difficult to exactly trigger the time and
location of a fault injection using this technique
because you cannot precisely control the exact
moment of heavy-ion emission or electromagnetic
field creation.
Hardware fault injection tools
Messaline,4 developed at LAAS-CNRS in Toulouse,
France, uses both active probes and sockets to conduct
pin-level fault injection. Figure 2
shows Messaline's general architecture and its environment.
Messaline can inject stuck-at, open, bridging,
and complex logical faults, among others. It can
also control the length of fault existence and the frequency.
Signals collected from the target system can
provide feedback to the injector. Also, a device is associated
with each injection point to sense when and if
each fault is activated and produces an error. Messaline can
also inject faults at up to 32 injection points simultaneously.
This tool has been used in experiments on a centralized
interlocking system employed in a computerized
railway control system and on a distributed system
for the Esprit Delta-4 Project.
Figure 2. General architecture of Messaline.

FIST5 (Fault Injection System for Study of Transient
Fault Effect), developed at the Chalmers University of
Technology in Sweden, employs both contact and contactless
methods to create transient faults inside the
target system. This tool uses heavy-ion radiation to
create transient faults at random locations inside a
chip that is exposed to the radiation, and
can thus cause single- or multiple-bit-flips. The radiation
source is mounted inside a vacuum chamber
together with a small two-processor computer system.
The computer is positioned so that one of the
processors is exposed directly to the radiation.
The other processor is used as a reference for detecting
whether the radiation results in any bit-flips.
Figure 3 illustrates the FIST environment.
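
The idea of using an unexposed processor as a reference can be illustrated with a toy lockstep comparison in Python. This is not a model of FIST itself: the "processor" is an assumed 8-bit state update, a seeded pseudo-random generator stands in for the unpredictable time and location of a particle strike, and the names step and run_lockstep are hypothetical.

```python
import random


def step(state):
    """One cycle of a toy 8-bit 'processor': a simple deterministic state update."""
    return (state * 5 + 1) & 0xFF


def run_lockstep(cycles, flip_cycle=None, flip_bit=0):
    """Run a reference copy and an exposed copy in lockstep.
    Returns the first cycle at which their states differ, or None."""
    reference = exposed = 0
    for cycle in range(cycles):
        reference = step(reference)
        exposed = step(exposed)
        if cycle == flip_cycle:
            exposed ^= 1 << flip_bit    # bit-flip in the exposed copy only
        if reference != exposed:        # comparator
            return cycle
    return None


rng = random.Random(1)                  # stand-in for an uncontrollable strike
strike_cycle = rng.randrange(50)
strike_bit = rng.randrange(8)
detected_at = run_lockstep(50, strike_cycle, strike_bit)
```

Because the two copies execute the same deterministic sequence, any divergence can be attributed to the injected fault, without the experimenter having to know in advance when or where the fault occurred.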
FIST can inject faults directly inside a chip, which
cannot be done with pin-level injections. It can pro-
duce transient faults at random locations evenly in a
chip, which leads to a large variation in the errors seen
on the output pins. In addition to radiation, FIST
allows for the injection of power disturbance faults.
This is done by placing a MOS transistor between the
power supply and the Vcc pin of the processor chip to
control the amplitude of the voltage drop. Power sup-
ply disturbances usually affect multiple locations within
a chip and can cause gate propagation delay faults. The
experimental results show that the errors resulting from
both methods cause similar effects on program con-
trol-flow and data errors. However, heavy-ion radia-
tion causes mostly address bus errors, while power
supply disturbances affect mostly control signals.
MARS (Maintainable Real-Time System) is a distributed,
fault-tolerant architecture developed at the
Technical University of Vienna. In addition to using
heavy-ion radiation as is used in FIST, MARS uses
electromagnetic fields to conduct contactless fault
injection: A circuit board placed between two charged
plates or a chip placed near a charged probe causes
fault injection. Dangling wires that act as antennas
placed on individual chip pins accentuate the electromagnetic
field effect on those pins. Researchers compared
these three methods (heavy-ion radiation,
pin-level injection, and electromagnetic interference)
in terms of their capability to exercise the MARS error
detection mechanisms. Results showed that the three
methods are complementary and generate different
types of errors. Pin-level injections cause error detection
mechanisms outside the CPU to be exercised more
effectively than heavy-ion radiation or electromagnetic
interference. The latter two methods were better
suited for exercising software and application-level
error detection mechanisms.
SOFTWARE FAULT INJECTION
In recent years, researchers have taken more interest
in developing software-implemented fault injection
tools. Software fault injection techniques are
attractive because they don't require expensive hardware.
Furthermore, they can be used to target applications
and operating systems, which is difficult to do
with hardware fault injection.
If the target is an application, the fault injector is
inserted into the application itself or layered between
the application and the operating system. If the target
is the operating system, the fault injector must be
embedded in the operating system, as it is very difficult
to add a layer between the machine and the operating
system.
Although the software approach is flexible, it has
its shortcomings.
* It cannot inject faults into locations that are inaccessible
  to software.
* The software instrumentation may disturb the
  workload running on the target system and even
  change the structure of the original software. Careful
  design of the injection environment can minimize
  perturbation to the workload.
* The poor time-resolution of the approach may
  cause fidelity problems. For long latency faults,
  such as memory faults, the low time-resolution
  may not be a problem. For short latency faults,
  such as bus and CPU faults, the approach may fail
  to capture certain error behavior, like propagation.
Engineers can solve this problem by taking a
hybrid approach, which combines the versatility
of software fault injection and the accuracy of
hardware monitoring. The hybrid approach is well
suited for measuring extremely short latencies.
However, the hardware monitoring involved can
cost more and decrease flexibility by limiting
observation points and data storage size.

Figure 3. FIST environment. Components shown: reference CPU inside the vacuum chamber, comparator, error flip-flops, logic analyzer, host computer, monitoring computer, memory, serial port, and external bus.
We can categorize software injection methods on
the basis of when the faults are injected: during com-
pile-time or during runtime.
Compile-time injection
To inject faults at compile-time, the program
instruction must be modified before the program
image is loaded and executed. Rather than injecting
faults into the hardware of the target system, this
method injects errors into the source code or assembly
code of the target program to emulate the effect
of hardware, software, and transient faults. The modified
code alters the target program instructions, causing
injection. Injection generates an erroneous software
image, and when the system executes the faulty
image, it activates the fault.
This method requires the modification of the pro-
gram that will evaluate fault effect, and it requires no
additional software during runtime. In addition, it
causes no perturbation to the target system during
execution. Because the fault effect is hard-coded, engi-
neers can use it to emulate permanent faults. This
method's implementation is very simple, but it does
not allow the injection of faults as the workload pro-
gram runs.
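
A small Python sketch can illustrate the principle of compile-time injection. It is an analogy rather than any tool's method: Python source stands in for the target's source or assembly code, the mutation (every `+` becomes `-`) is an arbitrary choice of fault, and the names SOURCE, AddToSub, and build are hypothetical.

```python
import ast

# Source of the target program (a hypothetical workload routine).
SOURCE = """
def checksum(values):
    total = 0
    for v in values:
        total = total + v
    return total
"""


class AddToSub(ast.NodeTransformer):
    """Emulated fault: replace every '+' with '-' before the image is built."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def build(source, inject=False):
    """Parse the program, optionally corrupt it, then compile and load it.
    Nothing is changed after this point, so the fault is hard-coded in the image."""
    tree = ast.parse(source)
    if inject:
        tree = ast.fix_missing_locations(AddToSub().visit(tree))
    namespace = {}
    exec(compile(tree, "<target>", "exec"), namespace)
    return namespace["checksum"]


good = build(SOURCE)
faulty = build(SOURCE, inject=True)
print(good([1, 2, 3]), faulty([1, 2, 3]))
```

The faulty image needs no injector at run time, which matches the lack of perturbation noted above, but it also shows the limitation: the fault is fixed when the image is built and cannot be chosen while the workload runs.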
Runtime injection
During runtime, a mechanism is needed to trigger
fault injection. Commonly used triggering mecha-
nisms include:
* Time-out. In this simplest of techniques, a timer
  expires at a predetermined time, triggering injection.
  Specifically, the time-out event generates an
  interrupt to invoke fault injection. The timer
  can be a hardware or software timer. This
  method requires no modification to the application
  or workload program. A hardware timer
  must be linked to the system's interrupt handler
  vector. Since it injects faults on the basis of time
  rather than specific events or system state, it pro-
Computer, April 1997

References
Fault injection: a method for validating computer-system dependability (journal article)
DOCTOR: an integrated software fault injection environment for distributed real-time systems (proceedings article)
FERRARI: a tool for the validation of system dependability properties (proceedings article)
Evaluation of error detection schemes using fault injection by heavy-ion radiation (proceedings article)
Fault injection for dependability validation of fault-tolerant computing systems (proceedings article)