On the Accuracy of Spectrum-based Fault Localization
Rui Abreu Peter Zoeteweij Arjan J.C. van Gemund
Software Technology Department
Faculty of Electrical Engineering, Mathematics, and Computer Science
Delft University of Technology
P.O. Box 5031, NL-2600 GA Delft, The Netherlands
{r.f.abreu, p.zoeteweij, a.j.c.vangemund}@tudelft.nl
Abstract
Spectrum-based fault localization shortens the test-diagnose-repair cycle by reducing the debugging effort. As a light-weight automated diagnosis technique it can easily be integrated with existing testing schemes. However, as no model of the system is taken into account, its diagnostic accuracy is inherently limited. Using the Siemens Set benchmark, we investigate this diagnostic accuracy as a function of several parameters (such as quality and quantity of the program spectra collected during the execution of the system), some of which directly relate to test design. Our results indicate that the superior performance of a particular similarity coefficient, used to analyze the program spectra, is largely independent of test design. Furthermore, near-optimal diagnostic accuracy (exonerating about 80% of the blocks of code on average) is already obtained for low-quality error observations and limited numbers of test cases. The influence of the number of test cases is of primary importance for continuous (embedded) processing applications, where only limited observation horizons can be maintained.
© 2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Keywords: Test data analysis, software fault diagnosis, program spectra.
1 Introduction
Testing, debugging, and verification represent a major expenditure in the software development cycle [12], which is to a large extent due to the labor-intensive task of diagnosing the faults (bugs) that cause tests to fail. Because under typical market conditions only those faults that affect the user most can be solved before the release deadline, the efficiency with which faults can be diagnosed and repaired directly influences software reliability. Automated diagnosis can help to improve this efficiency.

(This work has been carried out as part of the TRADER project under the responsibility of the Embedded Systems Institute. This project is partially supported by the Netherlands Ministry of Economic Affairs under the BSIK03021 program.)
Diagnosis techniques are complementary to testing in two ways. First, for tests designed to verify correct behavior, they generate information on the root cause of test failures, focusing the subsequent tests that are required to expose this root cause. Second, for tests designed to expose specific potential root causes, the extra information generated by diagnosis techniques can help to further reduce the set of remaining possible explanations. Given its incremental nature (i.e., taking into account the results of an entire sequence of tests), automated diagnosis alleviates much of the work of selecting tests in the latter category, and can hence have a profound impact on the test-diagnose-repair cycle.

An important part of diagnosis and repair consists of localizing faults, and several tools for automated debugging and system diagnosis implement an approach to fault localization based on an analysis of the differences in program spectra [20] for passed and failed runs. Passed runs are executions of a program that completed correctly, whereas failed runs are executions in which an error was detected. A program spectrum is an execution profile that indicates which parts of a program are active during a run. Fault localization entails identifying the part of the program whose activity correlates most with the detection of errors. Examples of tools that implement this approach are Pinpoint [6], which focuses on large, dynamic on-line transaction processing systems, Tarantula [17], which focuses on the analysis of C programs, and AMPLE [8], which focuses on object-oriented software (see Section 7 for a discussion).
Spectrum-based fault localization does not rely on a model of the system under investigation. It can easily be integrated with existing testing procedures, and because of its relatively small overhead in CPU time and memory requirements, it lends itself well to application in resource-constrained environments [24]. However, the efficiency of spectrum-based fault localization comes at the cost of a limited diagnostic accuracy. As an indication, in one of the experiments described in the present paper, on average 20% of a program still needs to be inspected after the diagnosis.
In spectrum-based fault localization, a similarity coefficient is used to rank potential fault locations. In earlier work [1], we obtained preliminary evidence that the Ochiai similarity coefficient, known from the biology domain, can improve diagnostic accuracy over eight other coefficients, including those used by the Pinpoint and Tarantula tools mentioned above. Extending as well as generalizing this previous result, in this paper we investigate the main factors that influence the accuracy of spectrum-based fault localization in a much wider setting. Apart from the influence of the similarity coefficient on diagnostic accuracy, we also study the influence of the quality and quantity of the (pass/fail) observations used in the analysis.
Quality of the observations relates to the classification of runs as passed or failed. Since most faults lead to errors only under specific input conditions, and not all errors propagate to system failures, this parameter is relevant because error detection mechanisms are usually not ideal. Quantity of the observations relates to the number of passed and failed runs available for the diagnosis. If fault localization has to be performed at run-time, e.g., as part of a recovery mechanism, one cannot wait for many observations to accumulate before diagnosing a potentially disastrous error. In addition, quality and quantity of the observations both relate to test coverage. Varying the observation context with respect to these two observational parameters allows a much more thorough investigation of the influence of similarity coefficients. Our study is based on the Siemens set [14] of benchmark faults (single fault locations).
The main contributions of our work are the following. We show that the Ochiai similarity coefficient consistently outperforms the other coefficients mentioned above. We establish this result across the entire quality space, and for varying numbers of runs involved. Furthermore, we show that near-optimum diagnostic accuracy (exonerating around 80% of all code on average) is already obtained for low-quality (ambiguous) error observations, while, in addition, only a few runs are required. In particular, maximum diagnostic performance is already reached at 6 failed runs on average. However, including up to 20 passed runs may improve but also degrade diagnostic performance, depending on the program and/or input data.
The remainder of this paper is organized as follows. In Section 2 we introduce some basic concepts and terminology, and explain the diagnosis technique in more detail. In Section 3 we describe our experimental setup. In Sections 4, 5, and 6 we describe the experiments on the similarity coefficient, and on the quality and quantity of the observations, respectively. Related work is discussed in Section 7. We conclude, and discuss possible directions for future work, in Section 8.
2 Preliminaries
In this section we introduce program spectra, and describe how they are used in software fault localization.

2.1 Failures, Errors, and Faults

As defined in [5], we use the following terminology. A failure is an event that occurs when delivered service deviates from correct service. An error is a system state that may cause a failure. A fault is the cause of an error in the system.

In this paper we apply this terminology to simple computer programs that transform an input file to an output file in a single run. Specifically in this setting, faults are bugs in the program code, and failures occur when the output for a given input deviates from the specified output for that input.
To illustrate these concepts, consider the C function in Figure 1. It is meant to sort, using the bubble sort algorithm, a sequence of n rational numbers whose numerators and denominators are stored in the parameters num and den, respectively. There is a fault (bug) in the swapping code within the body of the if statement: only the numerators of the rational numbers are swapped while the denominators are left in their original order. In this case, a failure occurs when RationalSort changes the contents of its argument arrays in such a way that the result is not a sorted version of the original. An error occurs after the code inside the conditional statement is executed, while den[j] ≠ den[j+1]. Such errors can be temporary, and do not automatically lead to failures. For example, if we apply RationalSort to the sequence ⟨4/1, 2/2, 0/1⟩, an error occurs after the first two numerators are swapped. However, this error is "canceled" by later swapping actions, and the sequence ends up being sorted correctly.
Error detection is a prerequisite for the fault localization technique studied in this paper: we must know that something is wrong before we can try to locate the responsible fault. Failures constitute a rudimentary form of error detection, but many errors remain latent and never lead to a failure. An example of a technique that increases the number of errors that can be detected is array bounds checking. Failure detection and array bounds checking are both examples of generic error detection mechanisms, which can be applied without detailed knowledge of a program. Other examples are the detection of null pointer handling, malloc problems, and deadlock detection in concurrent systems. Examples of program-specific mechanisms are precondition and postcondition checking, and the use of assertions.

void RationalSort(int n, int *num, int *den) {
    /* block 1 */
    int i, j, temp;
    for (i = n-1; i >= 0; i--) {
        /* block 2 */
        for (j = 0; j < i; j++) {
            /* block 3 */
            if (RationalGT(num[j], den[j],
                           num[j+1], den[j+1])) {
                /* block 4 */
                temp = num[j];
                num[j] = num[j+1];
                num[j+1] = temp;
            }
        }
    }
}

Figure 1. A faulty C function for sorting rational numbers
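The body of RationalGT is not shown in Figure 1 (it constitutes block 5, referred to in Section 2.3). For concreteness, a minimal hypothetical sketch, assuming positive denominators and ignoring overflow, could read as follows; this is our illustration, not the original code:

/* Hypothetical sketch of the comparison helper; the original is
   not shown in the paper. Tests num1/den1 > num2/den2 by
   cross-multiplication, assuming positive denominators. */
int RationalGT(int num1, int den1, int num2, int den2) {
    /* block 5 */
    return num1 * den2 > num2 * den1;
}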
2.2 Program Spectra

A program spectrum [20] is a collection of data that provides a specific view on the dynamic behavior of software. This data is collected at run-time, and typically consists of a number of counters or flags for the different parts of a program. Many different forms of program spectra exist; see [13] for an overview. In this paper we work with so-called block hit spectra.

A block hit spectrum contains a flag for every block of code in a program, indicating whether or not that block was executed in a particular run. By a block of code we mean a C language statement, where we do not distinguish between the individual statements of a compound statement, but where we do distinguish between the cases of a switch statement. (This is a slightly different notion than a basic block, which is a block of code that has no branch.) As an illustration, we have identified the blocks of code in Figure 1.
              N parts                  errors
          [ x_11  x_12  ...  x_1N ]   [ e_1 ]
   M      [ x_21  x_22  ...  x_2N ]   [ e_2 ]
 spectra  [  ...   ...        ... ]   [ ... ]
          [ x_M1  x_M2  ...  x_MN ]   [ e_M ]

            s_1    s_2   ...   s_N

Figure 2. The ingredients of fault diagnosis
2.3 Fault Localization

The hit spectra of M runs constitute a binary matrix, whose columns correspond to N different parts (blocks in our case) of the program (see Figure 2). The information in which runs an error was detected constitutes another column vector, the error vector. This vector can be thought to represent a hypothetical part of the program that is responsible for all observed errors. Fault localization essentially consists in identifying the part whose column vector resembles the error vector most.
In the field of data clustering, resemblances between vectors of binary, nominally scaled data, such as the columns in our matrix of program spectra, are quantified by means of similarity coefficients (see, e.g., [15]). Many similarity coefficients exist. As an example, below are three different similarity coefficients, namely the Jaccard coefficient s_J, which is used by the Pinpoint tool [6], the coefficient s_T used in the Tarantula fault localization tool [16], and the Ochiai coefficient s_O, used in the molecular biology domain:

s_J(j) = a_11(j) / (a_11(j) + a_01(j) + a_10(j))    (1)

s_T(j) = [a_11(j) / (a_11(j) + a_01(j))] / [a_11(j) / (a_11(j) + a_01(j)) + a_10(j) / (a_10(j) + a_00(j))]    (2)

s_O(j) = a_11(j) / sqrt((a_11(j) + a_01(j)) · (a_11(j) + a_10(j)))    (3)

where a_pq(j) = |{i | x_ij = p ∧ e_i = q}| and p, q ∈ {0, 1}. Here, x_ij = p indicates whether block j was touched (p = 1) in the execution of run i or not (p = 0). Similarly, e_i = q indicates whether run i failed (q = 1) or not (q = 0).
Under the assumption that a high similarity to the error vector indicates a high probability that the corresponding parts of the software cause the detected errors, the calculated similarity coefficients rank the parts of the program with respect to their likelihood of containing the faults.
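As an illustration of Eqs. (1)-(3), the following C sketch (ours, not the authors' tooling) computes the counters a_pq(j) for one block from its column of the spectra matrix and the error vector, and evaluates the three coefficients; guards against zero denominators are omitted for brevity:

#include <math.h>

/* Coefficients for one block over M runs:
   xj[i] = 1 iff the block was touched in run i;
   e[i]  = 1 iff an error was detected in run i.
   Illustrative sketch; zero-denominator guards omitted. */
void coefficients(int M, const int *xj, const int *e,
                  double *sJ, double *sT, double *sO)
{
    int a11 = 0, a10 = 0, a01 = 0, a00 = 0;
    for (int i = 0; i < M; i++) {
        if (xj[i] && e[i])        a11++;   /* touched, failed */
        else if (xj[i] && !e[i])  a10++;   /* touched, passed */
        else if (!xj[i] && e[i])  a01++;   /* missed,  failed */
        else                      a00++;   /* missed,  passed */
    }
    *sJ = (double)a11 / (a11 + a01 + a10);                  /* Eq. (1) */
    double f = (double)a11 / (a11 + a01);  /* share of failed runs hit */
    double p = (double)a10 / (a10 + a00);  /* share of passed runs hit */
    *sT = f / (f + p);                                      /* Eq. (2) */
    *sO = a11 / sqrt((double)(a11 + a01) * (a11 + a10));    /* Eq. (3) */
}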
To illustrate the approach, suppose that we apply the RationalSort function to the input sequences I_1, ..., I_6 (see below). The block hit spectra for these runs are as follows ('1' denotes a hit), where block 5 corresponds to the body of the RationalGT function, which is not shown in Figure 1 (a hypothetical sketch appears at the end of Section 2.1).
                                     block
  input                        1    2    3    4    5    error
  I_1 = ⟨⟩                     1    0    0    0    0      0
  I_2 = ⟨1/4⟩                  1    1    0    0    0      0
  I_3 = ⟨2/1, 1/1⟩             1    1    1    1    1      0
  I_4 = ⟨4/1, 2/2, 0/1⟩        1    1    1    1    1      0
  I_5 = ⟨3/1, 2/2, 4/3, 1/4⟩   1    1    1    1    1      1
  I_6 = ⟨1/4, 1/3, 1/2, 1/1⟩   1    1    1    0    1      0
  s_J                         .17  .20  .25  .33  .25
  s_T                         .50  .56  .62  .71  .62
  s_O                         .40  .44  .50  .58  .50
I_1, I_2, and I_6 are already sorted, and lead to passed runs. I_3 is not sorted, but the denominators in this sequence happen to be equal, hence no error occurs. I_4 is the example from Section 2.1: an error occurs during its execution, but goes undetected. For I_5 the program fails, since the calculated result is ⟨1/1, 2/2, 4/3, 3/4⟩ instead of ⟨1/4, 2/2, 4/3, 3/1⟩, which is a clear indication that an error has occurred. For this data, the calculated similarity coefficients s_x(1), ..., s_x(5), for x ∈ {J, T, O}, (correctly) identify block 4 as the most likely location of the fault.
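Feeding these spectra and this error vector into the coefficients sketch of Section 2.3 reproduces the table above up to rounding; a minimal driver (again ours, assuming that sketch) is:

#include <stdio.h>

/* Spectra of runs I_1..I_6 over blocks 1..5 and the error vector,
   copied from the table above; assumes the coefficients() sketch. */
int main(void)
{
    int x[6][5] = {
        {1,0,0,0,0},   /* I_1 */
        {1,1,0,0,0},   /* I_2 */
        {1,1,1,1,1},   /* I_3 */
        {1,1,1,1,1},   /* I_4 */
        {1,1,1,1,1},   /* I_5 */
        {1,1,1,0,1},   /* I_6 */
    };
    int e[6] = {0, 0, 0, 0, 1, 0};
    for (int j = 0; j < 5; j++) {
        int xj[6];
        double sJ, sT, sO;
        for (int i = 0; i < 6; i++) xj[i] = x[i][j];
        coefficients(6, xj, e, &sJ, &sT, &sO);
        printf("block %d: sJ=%.2f sT=%.2f sO=%.2f\n", j + 1, sJ, sT, sO);
    }
    return 0;   /* block 4 scores highest under all three coefficients */
}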
3 Experimental Setup

In this section we describe the benchmark set that we use in our experiments. We also detail how we extract the data of Figure 2, and define how we measure diagnostic accuracy.
3.1 Benchmark Set

In our study we worked with a widely-used set of test programs known as the Siemens set [14], which is composed of seven programs. Every program has a correct version and a set of faulty versions of the same program. Each faulty version contains exactly one fault, although the fault may span multiple statements and/or functions. Each program also has a set of inputs that ensures full code coverage. Table 1 provides more information about the programs in the package (for more detailed information refer to [14]).

In our experiments we were not able to use all the programs provided by the Siemens set. Because we conduct our experiments using block hit spectra, we cannot use programs that contain faults located outside a block, such as in the initialization of global variables. Versions 4 and 6 of print_tokens contain such faults and were therefore discarded. Version 9 of schedule2 and version 32 of replace were not considered in our experiments because no test case fails, and therefore the existence of a fault is never revealed. Furthermore, as we are comparing ranking techniques, we decided to limit our experiment to single-site faults. Hence, versions 12 and 21 of replace, versions 10, 11, 15, and 40 of tcas, version 7 of schedule, and version 1 of print_tokens were also discarded because the fault extends to more than one site. In total, we discarded 12 of the 132 versions provided by the suite, using 120 versions in our experiments.
3.2 Data Acquisition

Collecting Spectra. To obtain block hit spectra we automatically instrumented the source code of every program in the Siemens set using the parser generator Front [4], which is used in the development process of our industrial partner in the TRADER project [10]. A function call was inserted at the beginning of every block of code to log its execution (see [2] for details on the instrumentation process). The instrumentation overhead has been measured to be approximately 6% on average (with a standard deviation of 5%). The programs were compiled on a Fedora Core release 4 system with gcc-3.2.

Error Detection. As the Siemens set includes a correct version of each program, we use the output of the correct version as the error detection reference. We characterize a run as 'failed' if its output differs from the corresponding output of the correct version, and as 'passed' otherwise.
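As a hedged illustration of this scheme (the actual instrumentation generated with Front may differ), the inserted call can simply set a flag in the spectrum row of the current run, which is appended to the matrix of Figure 2 when the run ends:

/* Illustrative sketch only; the code generated with Front for the
   experiments may differ. One flag per block, reset before each run. */
#define N_BLOCKS 5                 /* e.g., the 5 blocks of Figure 1 */
static int spectrum[N_BLOCKS];     /* one row of the matrix in Figure 2 */

/* Call inserted at the beginning of every block of code: */
static void hit(int block)
{
    spectrum[block - 1] = 1;       /* block was executed in this run */
}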
3.3 Evaluation Metric

As spectrum-based fault localization creates a ranking of blocks in order of likelihood to be at fault, we can retrieve how many blocks we still need to inspect until we hit the faulty block. If two or more blocks rank with the same coefficient, we use the average ranking position for all of these blocks.

Let d ∈ {1, ..., N} be the index of the block that we know to contain the fault. For all j ∈ {1, ..., N}, let s_j denote the similarity coefficient calculated for block j. Then the ranking position is given by

τ = (|{j | s_j > s_d}| + |{j | s_j ≥ s_d}| − 1) / 2    (4)

We define accuracy, or quality of the diagnosis, as the effectiveness to pinpoint the faulty block. This metric represents the percentage of blocks that need not be considered when searching for the fault by traversing the ranking. It is defined as

q_d = (1 − τ / (N − 1)) · 100%    (5)
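In code, Eqs. (4) and (5) amount to the following sketch (ours, not the authors' evaluation scripts), with d given as a 1-based block index:

/* Diagnostic accuracy q_d (Eq. 5) from the ranking position tau
   (Eq. 4), given similarity coefficients s[0..N-1] and the 1-based
   index d of the faulty block. Illustrative sketch. */
double quality(int N, const double *s, int d)
{
    int greater = 0, geq = 0;
    double sd = s[d - 1];
    for (int j = 0; j < N; j++) {
        if (s[j] > sd)  greater++;
        if (s[j] >= sd) geq++;     /* includes the faulty block itself */
    }
    double tau = (greater + geq - 1) / 2.0;   /* Eq. (4): ties averaged */
    return (1.0 - tau / (N - 1)) * 100.0;     /* Eq. (5) */
}

For the example of Section 2.3 (N = 5, d = 4, Ochiai coefficients), greater = 0 and geq = 1, so τ = 0 and q_d = 100%.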

Program Faulty Versions Blocks Test Cases Description
print tokens 7 110 4 130 lexical analyzer
print tokens2 10 105 4 115 lexical analyzer
replace 32 124 5 542 pattern recognition
schedule 9 53 2 650 priority scheduler
schedule2 10 60 2 710 priority scheduler
tcas 41 20 1 608 altitude separation
tot info 23 44 1 052 information measure
Table 1. Set of programs used in the experiments
[Figure 3. Diagnostic accuracy q_d (%, scale 0-100) for the Tarantula, Jaccard, and Ochiai coefficients, per program of the Siemens set.]
4 Similarity Coefficient Impact

At the end of Section 2.3 we reduced the problem of spectrum-based fault localization to finding resemblances between binary vectors. The key element of this technique is the calculation of a similarity coefficient. Many different similarity coefficients are used in practice, and in this section we investigate the impact of the similarity coefficient on the diagnostic accuracy q_d.

For this purpose, we evaluate q_d on all faults in our benchmark set, using nine different similarity coefficients. We only report the results for the Jaccard coefficient of Eq. (1), the coefficient used in the Tarantula fault localization tool as defined in Eq. (2), and the Ochiai coefficient of Eq. (3). We experimentally identified the latter as giving the best results among all eight coefficients used in a data clustering study in molecular biology [7], which also included the Jaccard coefficient.

In addition to Eq. (2), the Tarantula tool uses a second coefficient, which amounts to the maximum of the two fractions in the denominator of Eq. (2). This second coefficient is interpreted as a brightness value for visualization purposes, but the experiments in [16] indicate that the above coefficient can be studied in isolation. For this reason, we have not taken the brightness coefficient into account.
Figure 3 shows the results of this experiment. It plots q_d, as defined by Eq. (5), for the three similarity coefficients mentioned above, averaged per program of the Siemens set. See [1] for more details on these experiments.

An important conclusion that we can draw from these results is that under the specific conditions of our experiment, the Ochiai coefficient gives a better diagnosis: it always performs at least as well as the other coefficients, with an average improvement of 5% over the second-best case, and improvements of up to 30% for individual faults. Factors that likely contribute to this effect are the following. First, for a_11(j) > 0 (the only relevant case: a_11(j) = 0 implies s_j = 0) the Tarantula coefficient can be written as 1 / (1 + c · a_10(j) / a_11(j)), with c the constant (a_11(j) + a_01(j)) / (a_00(j) + a_10(j)). This depends only on the presence of a block in passed and failed runs, while the Ochiai coefficient also takes the absence in failed runs into account. Second, compared to Jaccard (Eq. (1)), for the purpose of determining the ranking the denominator of the Ochiai coefficient contains an extra term a_01(j) · a_10(j) / a_11(j), which amplifies the differences in the column vectors of Figure 2. This can be seen by squaring Eq. (3), and dividing the numerator and denominator by a_11(j), which does not change the ranking.
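To make the second argument explicit: squaring Eq. (3), which preserves the ranking since all coefficients are nonnegative, and dividing numerator and denominator by a_11(j) gives, in our notation,

s_O(j)² = a_11(j)² / ((a_11(j) + a_01(j)) · (a_11(j) + a_10(j)))
        = a_11(j) / (a_11(j) + a_01(j) + a_10(j) + a_01(j) · a_10(j) / a_11(j))

which differs from the Jaccard coefficient of Eq. (1) only in the extra denominator term.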
5 Observation Quality Impact

Before reaching a definitive decision to prefer one similarity coefficient over another, as suggested by the results in Section 4, we want to verify that the effect of this decision is independent of the specific conditions of our experiments. Because of its relation to test coverage, and to the error detection mechanism used to characterize runs as passed or failed, an important condition in this respect is the quality of the error detection information used in the analysis.

In this section we define a measure of quality of the error observations, and show how it can be controlled as a parameter if the fault location is known, as is the case in our experimental setup. Thus, we verify the results of the previous section for varying observation quality values. Investigating the influence of this parameter will also help us to assess the potential gain of more powerful error detection mechanisms and better test coverage on diagnostic accuracy.
