On the Accuracy of Spectrum-based Fault Localization
Rui Abreu Peter Zoeteweij Arjan J.C. van Gemund
Software Technology Department
Faculty of Electrical Engineering, Mathematics, and Computer Science
Delft University of Technology
P.O. Box 5031, NL-2600 GA Delft, The Netherlands
{r.f.abreu, p.zoeteweij, a.j.c.vangemund}@tudelft.nl
Abstract
Spectrum-based fault localization shortens the test-diagnose-repair cycle by reducing the debugging effort. As a light-weight automated diagnosis technique it can easily be integrated with existing testing schemes. However, as no model of the system is taken into account, its diagnostic accuracy is inherently limited. Using the Siemens Set benchmark, we investigate this diagnostic accuracy as a function of several parameters (such as quality and quantity of the program spectra collected during the execution of the system), some of which directly relate to test design. Our results indicate that the superior performance of a particular similarity coefficient, used to analyze the program spectra, is largely independent of test design. Furthermore, near-optimal diagnostic accuracy (exonerating about 80% of the blocks of code on average) is already obtained for low-quality error observations and limited numbers of test cases. The influence of the number of test cases is of primary importance for continuous (embedded) processing applications, where only limited observation horizons can be maintained.
© 2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Keywords: Test data analysis, software fault diagnosis, program spectra.
1 Introduction
Testing, debugging, and verification represent a major expenditure in the software development cycle [12], which is to a large extent due to the labor-intensive task of diagnosing the faults (bugs) that cause tests to fail. Because under typical market conditions only those faults that affect the user most can be solved before the release deadline, the efficiency with which faults can be diagnosed and repaired directly influences software reliability. Automated diagnosis can help to improve this efficiency.

(This work has been carried out as part of the TRADER project under the responsibility of the Embedded Systems Institute. This project is partially supported by the Netherlands Ministry of Economic Affairs under the BSIK03021 program.)
Diagnosis techniques are complementary to testing in two ways. First, for tests designed to verify correct behavior, they generate information on the root cause of test failures, focusing the subsequent tests that are required to expose this root cause. Second, for tests designed to expose specific potential root causes, the extra information generated by diagnosis techniques can help to further reduce the set of remaining possible explanations. Given its incremental nature (i.e., taking into account the results of an entire sequence of tests), automated diagnosis alleviates much of the work of selecting tests in the latter category, and can hence have a profound impact on the test-diagnose-repair cycle.

An important part of diagnosis and repair consists of localizing faults, and several tools for automated debugging and system diagnosis implement an approach to fault localization based on an analysis of the differences in program spectra [20] for passed and failed runs. Passed runs are executions of a program that completed correctly, whereas failed runs are executions in which an error was detected. A program spectrum is an execution profile that indicates which parts of a program are active during a run. Fault localization entails identifying the part of the program whose activity correlates most with the detection of errors. Examples of tools that implement this approach are Pinpoint [6], which focuses on large, dynamic on-line transaction processing systems, Tarantula [17], which focuses on the analysis of C programs, and AMPLE [8], which focuses on object-oriented software (see Section 7 for a discussion).
Spectrum-based fault localization does not rely on a model of the system under investigation. It can easily be integrated with existing testing procedures, and because of its relatively small overhead in CPU time and memory requirements, it lends itself well to application in resource-constrained environments [24]. However, the efficiency of spectrum-based fault localization comes at the cost of a limited diagnostic accuracy. As an indication, in one of the experiments described in the present paper, on average 20% of a program still needs to be inspected after the diagnosis.
In spectrum-based fault localization, a similarity coefficient is used to rank potential fault locations. In earlier work [1], we obtained preliminary evidence that the Ochiai similarity coefficient, known from the biology domain, can improve diagnostic accuracy over eight other coefficients, including those used by the Pinpoint and Tarantula tools mentioned above. Extending as well as generalizing this previous result, in this paper we investigate the main factors that influence the accuracy of spectrum-based fault localization in a much wider setting. Apart from the influence of the similarity coefficient on diagnostic accuracy, we also study the influence of the quality and quantity of the (pass/fail) observations used in the analysis.
Quality of the observations relates to the classification of runs as passed or failed. Since most faults lead to errors only under specific input conditions, and not all errors propagate to system failures, this parameter is relevant because error detection mechanisms are usually not ideal. Quantity of the observations relates to the number of passed and failed runs available for the diagnosis. If fault localization has to be performed at run-time, e.g., as part of a recovery mechanism, one cannot wait for many observations to accumulate before diagnosing a potentially disastrous error. In addition, quality and quantity of the observations both relate to test coverage. Varying the observation context with respect to these two observational parameters allows a much more thorough investigation of the influence of similarity coefficients. Our study is based on the Siemens set [14] of benchmark faults (single fault locations).
The main contributions of our work are the following. We show that the Ochiai similarity coefficient consistently outperforms the other coefficients mentioned above. We establish this result across the entire quality space, and for varying numbers of runs involved. Furthermore, we show that near-optimum diagnostic accuracy (exonerating around 80% of all code on average) is already obtained for low-quality (ambiguous) error observations, while, in addition, only a few runs are required. In particular, maximum diagnostic performance is already reached at 6 failed runs on average. However, including up to 20 passed runs may improve but also degrade diagnostic performance, depending on the program and/or input data.
The remainder of this paper is organized as follows. In Section 2 we introduce some basic concepts and terminology, and explain the diagnosis technique in more detail. In Section 3 we describe our experimental setup. In Sections 4, 5, and 6 we describe the experiments on the similarity coefficient, and on the quality and quantity of the observations, respectively. Related work is discussed in Section 7. We conclude, and discuss possible directions for future work, in Section 8.
2 Preliminaries
In this section we introduce program spectra, and describe how they are used in software fault localization.

2.1 Failures, Errors, and Faults

As defined in [5], we use the following terminology. A failure is an event that occurs when delivered service deviates from correct service. An error is a system state that may cause a failure. A fault is the cause of an error in the system.

In this paper we apply this terminology to simple computer programs that transform an input file to an output file in a single run. Specifically in this setting, faults are bugs in the program code, and failures occur when the output for a given input deviates from the specified output for that input.
To illustrate these concepts, consider the C function in Figure 1. It is meant to sort, using the bubble sort algorithm, a sequence of n rational numbers whose numerators and denominators are stored in the parameters num and den, respectively. There is a fault (bug) in the swapping code within the body of the if statement: only the numerators of the rational numbers are swapped while the denominators are left in their original order. In this case, a failure occurs when RationalSort changes the contents of its argument arrays in such a way that the result is not a sorted version of the original. An error occurs after the code inside the conditional statement is executed, while den[j] ≠ den[j+1]. Such errors can be temporary, and do not automatically lead to failures. For example, if we apply RationalSort to the sequence ⟨4/1, 2/2, 0/1⟩, an error occurs after the first two numerators are swapped. However, this error is "canceled" by later swapping actions, and the sequence ends up being sorted correctly.
Error detection is a prerequisite for the fault localization technique studied in this paper: we must know that something is wrong before we can try to locate the responsible fault. Failures constitute a rudimentary form of error detection, but many errors remain latent and never lead to a failure. An example of a technique that increases the number of errors that can be detected is array bounds checking. Failure detection and array bounds checking are both examples of generic error detection mechanisms, which can be applied without detailed knowledge of a program. Other examples are the detection of null pointer handling, malloc problems, and deadlock detection in concurrent systems. Examples of program-specific mechanisms are precondition and postcondition checking, and the use of assertions.

void RationalSort(int n, int *num, int *den) {
    /* block 1 */
    int i, j, temp;
    for (i = n-1; i >= 0; i--) {
        /* block 2 */
        for (j = 0; j < i; j++) {
            /* block 3 */
            if (RationalGT(num[j], den[j],
                           num[j+1], den[j+1])) {
                /* block 4 */
                temp = num[j];
                num[j] = num[j+1];
                num[j+1] = temp;
            }
        }
    }
}

Figure 1. A faulty C function for sorting rational numbers
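The body of RationalGT is not shown in Figure 1 (it constitutes block 5, referred to in Section 2.3). For concreteness, a minimal hypothetical sketch, assuming positive denominators and ignoring overflow, could read as follows; this is our illustration, not the original code:

/* Hypothetical sketch of the comparison helper; the original is
   not shown in the paper. Tests num1/den1 > num2/den2 by
   cross-multiplication, assuming positive denominators. */
int RationalGT(int num1, int den1, int num2, int den2) {
    /* block 5 */
    return num1 * den2 > num2 * den1;
}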
2.2 Program Spectra

A program spectrum [20] is a collection of data that provides a specific view on the dynamic behavior of software. This data is collected at run-time, and typically consists of a number of counters or flags for the different parts of a program. Many different forms of program spectra exist; see [13] for an overview. In this paper we work with so-called block hit spectra.

A block hit spectrum contains a flag for every block of code in a program, indicating whether or not that block was executed in a particular run. By a block of code we mean a C language statement, where we do not distinguish between the individual statements of a compound statement, but where we do distinguish between the cases of a switch statement. (This is a slightly different notion than a basic block, which is a block of code that has no branch.) As an illustration, we have identified the blocks of code in Figure 1.
              N parts                  errors
          [ x_11  x_12  ...  x_1N ]   [ e_1 ]
   M      [ x_21  x_22  ...  x_2N ]   [ e_2 ]
 spectra  [  ...   ...        ... ]   [ ... ]
          [ x_M1  x_M2  ...  x_MN ]   [ e_M ]

            s_1    s_2   ...   s_N

Figure 2. The ingredients of fault diagnosis
2.3 Fault Localization

The hit spectra of M runs constitute a binary matrix, whose columns correspond to N different parts (blocks in our case) of the program (see Figure 2). The information in which runs an error was detected constitutes another column vector, the error vector. This vector can be thought to represent a hypothetical part of the program that is responsible for all observed errors. Fault localization essentially consists in identifying the part whose column vector resembles the error vector most.
In the field of data clustering, resemblances between vectors of binary, nominally scaled data, such as the columns in our matrix of program spectra, are quantified by means of similarity coefficients (see, e.g., [15]). Many similarity coefficients exist. As an example, below are three different similarity coefficients, namely the Jaccard coefficient s_J, which is used by the Pinpoint tool [6], the coefficient s_T used in the Tarantula fault localization tool [16], and the Ochiai coefficient s_O, used in the molecular biology domain:

s_J(j) = a_11(j) / (a_11(j) + a_01(j) + a_10(j))    (1)

s_T(j) = [a_11(j) / (a_11(j) + a_01(j))] / [a_11(j) / (a_11(j) + a_01(j)) + a_10(j) / (a_10(j) + a_00(j))]    (2)

s_O(j) = a_11(j) / sqrt((a_11(j) + a_01(j)) · (a_11(j) + a_10(j)))    (3)

where a_pq(j) = |{i | x_ij = p ∧ e_i = q}| and p, q ∈ {0, 1}. Here, x_ij = p indicates whether block j was touched (p = 1) in the execution of run i or not (p = 0). Similarly, e_i = q indicates whether run i failed (q = 1) or not (q = 0).
Under the assumption that a high similarity to the error vector indicates a high probability that the corresponding parts of the software cause the detected errors, the calculated similarity coefficients rank the parts of the program with respect to their likelihood of containing the faults.
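As an illustration of Eqs. (1)-(3), the following C sketch (ours, not the authors' tooling) computes the counters a_pq(j) for one block from its column of the spectra matrix and the error vector, and evaluates the three coefficients; guards against zero denominators are omitted for brevity:

#include <math.h>

/* Coefficients for one block over M runs:
   xj[i] = 1 iff the block was touched in run i;
   e[i]  = 1 iff an error was detected in run i.
   Illustrative sketch; zero-denominator guards omitted. */
void coefficients(int M, const int *xj, const int *e,
                  double *sJ, double *sT, double *sO)
{
    int a11 = 0, a10 = 0, a01 = 0, a00 = 0;
    for (int i = 0; i < M; i++) {
        if (xj[i] && e[i])        a11++;   /* touched, failed */
        else if (xj[i] && !e[i])  a10++;   /* touched, passed */
        else if (!xj[i] && e[i])  a01++;   /* missed,  failed */
        else                      a00++;   /* missed,  passed */
    }
    *sJ = (double)a11 / (a11 + a01 + a10);                  /* Eq. (1) */
    double f = (double)a11 / (a11 + a01);  /* share of failed runs hit */
    double p = (double)a10 / (a10 + a00);  /* share of passed runs hit */
    *sT = f / (f + p);                                      /* Eq. (2) */
    *sO = a11 / sqrt((double)(a11 + a01) * (a11 + a10));    /* Eq. (3) */
}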
To illustrate the approach, suppose that we apply the RationalSort function to the input sequences I_1, ..., I_6 (see below). The block hit spectra for these runs are as follows ('1' denotes a hit), where block 5 corresponds to the body of the RationalGT function, which is not shown in Figure 1 (a hypothetical sketch appears at the end of Section 2.1).
                                     block
  input                        1    2    3    4    5    error
  I_1 = ⟨⟩                     1    0    0    0    0      0
  I_2 = ⟨1/4⟩                  1    1    0    0    0      0
  I_3 = ⟨2/1, 1/1⟩             1    1    1    1    1      0
  I_4 = ⟨4/1, 2/2, 0/1⟩        1    1    1    1    1      0
  I_5 = ⟨3/1, 2/2, 4/3, 1/4⟩   1    1    1    1    1      1
  I_6 = ⟨1/4, 1/3, 1/2, 1/1⟩   1    1    1    0    1      0
  s_J                         .17  .20  .25  .33  .25
  s_T                         .50  .56  .62  .71  .62
  s_O                         .40  .44  .50  .58  .50
I_1, I_2, and I_6 are already sorted, and lead to passed runs. I_3 is not sorted, but the denominators in this sequence happen to be equal, hence no error occurs. I_4 is the example from Section 2.1: an error occurs during its execution, but goes undetected. For I_5 the program fails, since the calculated result is ⟨1/1, 2/2, 4/3, 3/4⟩ instead of ⟨1/4, 2/2, 4/3, 3/1⟩, which is a clear indication that an error has occurred. For this data, the calculated similarity coefficients s_x(1), ..., s_x(5), for x ∈ {J, T, O}, (correctly) identify block 4 as the most likely location of the fault.
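Feeding these spectra and this error vector into the coefficients sketch of Section 2.3 reproduces the table above up to rounding; a minimal driver (again ours, assuming that sketch) is:

#include <stdio.h>

/* Spectra of runs I_1..I_6 over blocks 1..5 and the error vector,
   copied from the table above; assumes the coefficients() sketch. */
int main(void)
{
    int x[6][5] = {
        {1,0,0,0,0},   /* I_1 */
        {1,1,0,0,0},   /* I_2 */
        {1,1,1,1,1},   /* I_3 */
        {1,1,1,1,1},   /* I_4 */
        {1,1,1,1,1},   /* I_5 */
        {1,1,1,0,1},   /* I_6 */
    };
    int e[6] = {0, 0, 0, 0, 1, 0};
    for (int j = 0; j < 5; j++) {
        int xj[6];
        double sJ, sT, sO;
        for (int i = 0; i < 6; i++) xj[i] = x[i][j];
        coefficients(6, xj, e, &sJ, &sT, &sO);
        printf("block %d: sJ=%.2f sT=%.2f sO=%.2f\n", j + 1, sJ, sT, sO);
    }
    return 0;   /* block 4 scores highest under all three coefficients */
}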
3 Experimental Setup

In this section we describe the benchmark set that we use in our experiments. We also detail how we extract the data of Figure 2, and define how we measure diagnostic accuracy.
3.1 Benchmark Set

In our study we worked with a widely-used set of test programs known as the Siemens set [14], which is composed of seven programs. Every program has a correct version and a set of faulty versions of the same program. Each faulty version contains exactly one fault, although the fault may span multiple statements and/or functions. Each program also has a set of inputs that ensures full code coverage. Table 1 provides more information about the programs in the package (for more detailed information refer to [14]).

In our experiments we were not able to use all the programs provided by the Siemens set. Because we conduct our experiments using block hit spectra, we cannot use programs that contain faults located outside a block, such as in the initialization of global variables. Versions 4 and 6 of print_tokens contain such faults and were therefore discarded. Version 9 of schedule2 and version 32 of replace were not considered in our experiments because no test case fails, and therefore the existence of a fault is never revealed. Furthermore, as we are comparing ranking techniques, we decided to limit our experiment to single-site faults. Hence, versions 12 and 21 of replace, versions 10, 11, 15, and 40 of tcas, version 7 of schedule, and version 1 of print_tokens were also discarded because the fault extends to more than one site. In total, we discarded 12 of the 132 versions provided by the suite, using 120 versions in our experiments.
3.2 Data Acquisition

Collecting Spectra. To obtain block hit spectra we automatically instrumented the source code of every program in the Siemens set using the parser generator Front [4], which is used in the development process of our industrial partner in the TRADER project [10]. A function call was inserted at the beginning of every block of code to log its execution (see [2] for details on the instrumentation process). The instrumentation overhead has been measured to be approximately 6% on average (with a standard deviation of 5%). The programs were compiled on a Fedora Core release 4 system with gcc-3.2.

Error Detection. As the Siemens set includes a correct version of each program, we use the output of the correct version as the error detection reference. We characterize a run as 'failed' if its output differs from the corresponding output of the correct version, and as 'passed' otherwise.
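As a hedged illustration of this scheme (the actual instrumentation generated with Front may differ), the inserted call can simply set a flag in the spectrum row of the current run, which is appended to the matrix of Figure 2 when the run ends:

/* Illustrative sketch only; the code generated with Front for the
   experiments may differ. One flag per block, reset before each run. */
#define N_BLOCKS 5                 /* e.g., the 5 blocks of Figure 1 */
static int spectrum[N_BLOCKS];     /* one row of the matrix in Figure 2 */

/* Call inserted at the beginning of every block of code: */
static void hit(int block)
{
    spectrum[block - 1] = 1;       /* block was executed in this run */
}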
3.3 Evaluation Metric

As spectrum-based fault localization creates a ranking of blocks in order of likelihood to be at fault, we can retrieve how many blocks we still need to inspect until we hit the faulty block. If two or more blocks rank with the same coefficient, we use the average ranking position for all of these blocks.

Let d ∈ {1, ..., N} be the index of the block that we know to contain the fault. For all j ∈ {1, ..., N}, let s_j denote the similarity coefficient calculated for block j. Then the ranking position is given by

τ = (|{j | s_j > s_d}| + |{j | s_j ≥ s_d}| − 1) / 2    (4)

We define accuracy, or quality of the diagnosis, as the effectiveness to pinpoint the faulty block. This metric represents the percentage of blocks that need not be considered when searching for the fault by traversing the ranking. It is defined as

q_d = (1 − τ / (N − 1)) · 100%    (5)
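In code, Eqs. (4) and (5) amount to the following sketch (ours, not the authors' evaluation scripts), with d given as a 1-based block index:

/* Diagnostic accuracy q_d (Eq. 5) from the ranking position tau
   (Eq. 4), given similarity coefficients s[0..N-1] and the 1-based
   index d of the faulty block. Illustrative sketch. */
double quality(int N, const double *s, int d)
{
    int greater = 0, geq = 0;
    double sd = s[d - 1];
    for (int j = 0; j < N; j++) {
        if (s[j] > sd)  greater++;
        if (s[j] >= sd) geq++;     /* includes the faulty block itself */
    }
    double tau = (greater + geq - 1) / 2.0;   /* Eq. (4): ties averaged */
    return (1.0 - tau / (N - 1)) * 100.0;     /* Eq. (5) */
}

For the example of Section 2.3 (N = 5, d = 4, Ochiai coefficients), greater = 0 and geq = 1, so τ = 0 and q_d = 100%.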

Program Faulty Versions Blocks Test Cases Description
print tokens 7 110 4 130 lexical analyzer
print tokens2 10 105 4 115 lexical analyzer
replace 32 124 5 542 pattern recognition
schedule 9 53 2 650 priority scheduler
schedule2 10 60 2 710 priority scheduler
tcas 41 20 1 608 altitude separation
tot info 23 44 1 052 information measure
Table 1. Set of programs used in the experiments
[Figure 3. Diagnostic accuracy q_d (%, scale 0-100) for the Tarantula, Jaccard, and Ochiai coefficients, per program of the Siemens set.]
4 Similarity Coefficient Impact

At the end of Section 2.3 we reduced the problem of spectrum-based fault localization to finding resemblances between binary vectors. The key element of this technique is the calculation of a similarity coefficient. Many different similarity coefficients are used in practice, and in this section we investigate the impact of the similarity coefficient on the diagnostic accuracy q_d.

For this purpose, we evaluate q_d on all faults in our benchmark set, using nine different similarity coefficients. We only report the results for the Jaccard coefficient of Eq. (1), the coefficient used in the Tarantula fault localization tool as defined in Eq. (2), and the Ochiai coefficient of Eq. (3). We experimentally identified the latter as giving the best results among all eight coefficients used in a data clustering study in molecular biology [7], which also included the Jaccard coefficient.

In addition to Eq. (2), the Tarantula tool uses a second coefficient, which amounts to the maximum of the two fractions in the denominator of Eq. (2). This second coefficient is interpreted as a brightness value for visualization purposes, but the experiments in [16] indicate that the above coefficient can be studied in isolation. For this reason, we have not taken the brightness coefficient into account.
Figure 3 shows the results of this experiment. It plots q_d, as defined by Eq. (5), for the three similarity coefficients mentioned above, averaged per program of the Siemens set. See [1] for more details on these experiments.

An important conclusion that we can draw from these results is that under the specific conditions of our experiment, the Ochiai coefficient gives a better diagnosis: it always performs at least as well as the other coefficients, with an average improvement of 5% over the second-best case, and improvements of up to 30% for individual faults. Factors that likely contribute to this effect are the following. First, for a_11(j) > 0 (the only relevant case: a_11(j) = 0 implies s_j = 0) the Tarantula coefficient can be written as 1 / (1 + c · a_10(j) / a_11(j)), with c the constant (a_11(j) + a_01(j)) / (a_00(j) + a_10(j)). This depends only on the presence of a block in passed and failed runs, while the Ochiai coefficient also takes the absence in failed runs into account. Second, compared to Jaccard (Eq. (1)), for the purpose of determining the ranking the denominator of the Ochiai coefficient contains an extra term a_01(j) · a_10(j) / a_11(j), which amplifies the differences in the column vectors of Figure 2. This can be seen by squaring Eq. (3), and dividing the numerator and denominator by a_11(j), which does not change the ranking.
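To make the second argument explicit: squaring Eq. (3), which preserves the ranking since all coefficients are nonnegative, and dividing numerator and denominator by a_11(j) gives, in our notation,

s_O(j)² = a_11(j)² / ((a_11(j) + a_01(j)) · (a_11(j) + a_10(j)))
        = a_11(j) / (a_11(j) + a_01(j) + a_10(j) + a_01(j) · a_10(j) / a_11(j))

which differs from the Jaccard coefficient of Eq. (1) only in the extra denominator term.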
5 Observation Quality Impact

Before reaching a definitive decision to prefer one similarity coefficient over another, as suggested by the results in Section 4, we want to verify that the effect of this decision is independent of the specific conditions of our experiments. Because of its relation to test coverage, and to the error detection mechanism used to characterize runs as passed or failed, an important condition in this respect is the quality of the error detection information used in the analysis.

In this section we define a measure of quality of the error observations, and show how it can be controlled as a parameter if the fault location is known, as is the case in our experimental setup. Thus, we verify the results of the previous section for varying observation quality values. Investigating the influence of this parameter will also help us to assess the potential gain of more powerful error detection mechanisms and better test coverage on diagnostic accuracy.
