Proceedings ArticleDOI

A Multiple Fault Localization Approach Based on Multicriteria Analytical Hierarchy Process

TL;DR: The proposed approach tackles the different metrics by aggregating them into a single metric using a weighted linear formulation based on the Analytic Hierarchy Process, which enables a more precise localization than existing spectrum-based metrics.
Abstract: The fault localization problem is one of the most difficult processes in software debugging. Several spectrum-based ranking metrics have been proposed, and none has been shown to be empirically optimal. In this paper, we consider the fault localization problem as a multicriteria decision making problem. The proposed approach tackles the different metrics by aggregating them into a single metric using a weighted linear formulation. A learning step is used to maintain the right expected weights of criteria. This approach is based on the Analytic Hierarchy Process (AHP), where a ranking is given to a statement in terms of suspiciousness according to a comparison of the ranks given by the different metrics. Experiments performed on standard benchmark programs show that our approach enables a more precise localization than existing spectrum-based metrics.

Summary

Introduction

  • A ranking metric is used to compute a suspiciousness score for each program statement based on observations of passing and failing test case execution.
  • Other approaches tackle fault localization as a supervised learning problem [16], [17].
  • As soon as multiple conflicting criteria are considered in the evaluation of a decision, the notion of optimality is not workable, since no criterion is systematically better than all the others.
  • Fortunately, there are some existing methods that tackle, either directly or indirectly, the weight elicitation problem, in particular the AHP method described in Section III-B.

A. Fault Localization

  • A test case tc_{i,j} is a tuple ⟨D_{i,j}, O_{i,j}⟩, where D_{i,j} is the input data and O_{i,j} is the expected output.
  • Given a statement e_{i,j}, pass(T_i) (resp. fail(T_i)) is the set of all passed (resp. all failed) test cases.
  • Moreover, all suspiciousness metrics share the same intuition: the more often a statement is executed by failing test cases, and the less often it is executed by passing test cases, the more suspicious the statement is considered.
  • OCHIAI came originally from molecular biology.

B. Analytical Hierarchy Process (AHP) [20]

  • AHP is a simple and easy to use structured process for organizing and eliciting criteria weights.
  • It involves three main steps: 1) The criteria are subjectively compared in a pairwise manner according to their respective weight wi.
  • For this reason, a consistency check is conducted by calculating the consistency ratio (CR).
  • The ROC method assumes that the true weights are uniformly distributed on the simplex of rank-order weights.

IV. AHP-LOC APPROACH

  • AHP requires a square matrix A[1..m, 1..m], where A[i, j] scores how much better criterion i is ranked than criterion j.
  • When running a single SBFL technique M_j on some faulty program context (P_i, L_i, T_i), it can return a set of equivalent statements in terms of suspiciousness (i.e., with the same suspiciousness degree).
  • For that reason, the authors consider two EXAM scores, the optimistic and the pessimistic one, denoted respectively R^O_{i,j} and R^P_{i,j}.
  • The performance of this passive approach depends on the quality of the given training set.

V. RUNNING EXAMPLE

  • To illustrate their approach, the authors consider the Power program given in Fig.2.
  • Third, the expert faces a challenging task, as their opinion heavily depends on the provided test cases.
  • The authors learn the AHP matrix from faulty programs contexts without the help of any expert.
  • The fault at e4 is ranked first except for GP13 and OCHIAI.
  • Here, AHP-LOC shows its best case by taking advantage of the differences that exist among the ranking metrics.

VI. EXPERIMENTS

  • This section describes the experimental settings (including benchmark programs, protocol and implementation), the experimental results and comparison with eight SBFL ranking metrics.
  • Experiments were performed on single and multiple fault programs.

A. Benchmark programs

  • The authors evaluate their approach by analyzing the performance of fault localization over 18 faulty programs coming from different real-world applications.
  • The authors have considered both the Siemens and Space datasets. A complete description of the Siemens suite and the Space dataset can be found in [24], [25].
  • For each program, the authors report the number of faulty versions (single fault 1F, two faults 2F and four faults 4F), the size of the program with its lines of code (LOC) and lines of executable code (LEC), the number of test cases.
  • The single fault versions are provided with the suites, while the multiple fault versions are produced by randomly combining the provided faults.
  • The authors have 5,386 single fault Java programs, 10⁴ programs with two faults and 10⁴ programs with four faults.

B. Experimental protocol

  • Here, the accuracy may vary depending on which statement to check first.
  • For this reason, the authors report two EXAM scores, the optimistic and the pessimistic one, denoted respectively in this section O-EXAM and P-EXAM.
  • The authors' approach AHP-LOC is trained on faulty programs to learn the coefficients of the AHP matrix (training set).
  • In each fold, the authors randomly select three faulty versions to form the training set and use the remaining 15 faulty versions for evaluation (testing set).
  • The authors compared their three AHP-LOC versions, Active AHP-LOC, Passive AHP-LOC and ROC-based AHP-LOC, with eight widely studied spectrum-based ranking metrics (TARANTULA, OCHIAI, JACCARD, M2 and AMPLE) and three recently proposed ranking metrics (GP13, NAISH2 and ER1B), which are proven optimal under theoretical assumptions (cf. [27]).

D. Single Fault Results

  • Table III reports an EXAM score comparison (O-EXAM, P-EXAM and ∆-EXAM) between Active AHP-LOC, Passive AHP-LOC, ROC AHP-LOC and SBFL metrics on single fault programs.
  • The first observation that the authors can draw is that, comparing their three versions, the Active AHP-LOC achieves the lowest O-EXAM and P-EXAM.
  • The authors also observe that AHPLOC (with its three versions) is more effective than SBFL approaches in most of the benchmarking instances.
  • It is also clear from the same figure that the metrics shown in red (i.e. (5) TARANTULA, (7) JACCARD and (10) AMPLE) perform poorly, as they generate large classes of suspicious statements (see also Table III, Δ-EXAM).
  • Interestingly, Passive AHP-LOC, AHP-ROC, ER1B and NAISH2 achieve similarly good accuracy.

E. Multiple Fault Results

  • The authors report the EXAM score comparison results on programs with multiple faults.
  • The results for two and four faults are shown in Table III.
  • Fig. 4 shows a comparison of P-EXAM (Fig. 4a) and O-EXAM (Fig. 4b) between the SBFL metrics and AHP-LOC.
  • Here, the authors notice that Passive AHP-LOC and AHP-ROC locate both faults better than the SBFL metrics.
  • Their approach is rather good in terms of effectiveness, since those metrics give a large class of equivalent suspicious statements.

VII. CONCLUSION

  • The authors propose passive and active learning versions to maintain the right expected weights of criteria.
  • The authors have compared their approach experimentally with state-of-the-art SBFL metrics on a set of multiple fault programs.
  • The results the authors obtained show that their approach aggregates the benefits of various SBFL metrics into a single efficient SBFL technique.


A Multiple Fault Localization Approach based on
Multicriteria Analytical Hierarchy Process
Noureddine Aribi
, Nadjib Lazaar
, Yahia Lebbah
, Samir Loudni
, Mehdi Maamar
LITIO University of Oran 1, 1524 El-M’Naouer, 31000 Oran Algeria
LIRMM University of Montpellier, CNRS, 161 rue Ada, 34090 Montpellier France
CNRS, UMR 6072 GREYC University of Caen Normandy, 14032 Caen France
Abstract—The fault localization problem is one of the most difficult processes in software debugging. Several spectrum-based ranking metrics have been proposed, and none has been shown to be empirically optimal. In this paper, we consider the fault localization problem as a multicriteria decision making problem. The proposed
approach tackles the different metrics by aggregating them into
a single metric using a weighted linear formulation. A learning
step is used to maintain the right expected weights of criteria.
This approach is based on Analytic Hierarchy Process (AHP),
where a ranking is given to a statement in terms of suspiciousness
according to a comparison of ranks given by the different metrics.
Experiments performed on standard benchmark programs show that our approach enables a more precise localization than existing spectrum-based metrics.
Index Terms—Fault Localization; Spectrum-based Fault Lo-
calization; Multiple Fault; Multicriteria decision making; AHP
I. INTRODUCTION
Developing software programs is universally acknowledged
as an error-prone task. The major bottleneck in software
debugging is identifying where the bugs are [1]; this is known as the fault localization problem. Indeed, locating a fault remains an extremely time-consuming and tedious task.
Over the last decade, several automated techniques have been
proposed to tackle this problem.
Spectrum-based approaches. Spectrum-based fault local-
ization (SBFL) (e.g. [2], [3]) is a class of popular fault
localization approaches that take as input a set of failing and
passing test case executions, and then highlight the suspicious
program statements that are likely responsible for the failures.
A ranking metric is used to compute a suspiciousness score
for each program statement based on observations of passing
and failing test case execution. The basic assumption is that
statements with high scores, i.e. those executed more often
by failed test cases but never or rarely by passing test cases,
are more likely to be faulty. Several ranking metrics have
been proposed to capture the notion of suspiciousness, such as
TARANTULA [4], OCHIAI [5], and JACCARD [5]. The ultimate
objective of SBFL is a metric able to always rank the faulty statements first. In practice, we are very far from this ideal [6]. SBFL metrics do not rely on a particular model of the program under test; thus, they are easy to use and practical in the presence of CPU time and memory constraints. However, SBFL metrics give different interpretations of the suspiciousness degree, and the semantics of statements and their dependencies are not taken into account. Thus, the accuracy of SBFL approaches is inherently limited.
Multiple fault programs. Most current localization techniques are based on the single fault hypothesis. When this assumption is dropped, faults can be tightly dependent in a program, giving rise to numerous behaviours [7]. This makes the localization process difficult for multiple fault approaches [8]–[10]. The main idea of these approaches is to partition the test cases into fault clusters, where each cluster contains the test cases covering the same fault. The drawback is that a test case can cover many faults, yielding overlapping clusters, which leads to a rough localization. Another idea consists in localizing one fault at a time [11]. Here, we start by locating a first fault, then correct it (which is an error-prone step), generate new test cases, and so on, until no fault remains in the program.
Artificial Intelligence based approaches. In the last decade,
fault localization was abstracted as a data mining (DM) prob-
lem. Podgurski et al. present a method to automatically group
faulty spectra with respect to the fault that leads to the failure
[9]. This method relies on cluster analysis. Cellier et al. [12],
[13] propose a combination of association rules and Formal
Concept Analysis (FCA) to assist in fault localization. In [14],
[15], the authors formalize the problem of fault localization as
a closed pattern mining problem. A constraint programming
model, with CLOSEDPATTERN global constraint, is used to
extract the k best patterns in terms of suspiciousness degree.
Other approaches tackle fault localization as a supervised
learning problem [16], [17].
Multicriteria decision making. Many real-world decision
problems involve several criteria. As soon as multiple con-
flicting criteria are considered in the evaluation of a decision,
the notion of optimality is not workable, since no criterion
is systematically better than all the others. In this context,
the Multicriteria decision-making (MCDM) [18] provides
a systematic approach to characterize and find the most-
preferred trade-off solutions. While many preference models
have been proposed in the literature, the additive preference
model is the most commonly used in MCDM. It consists
to aggregate additively the criteria into a single criterion, so
as to take advantage of all advanced techniques in solving
single-criterion optimization problems. A first difficult step in
this direction consists to find the right weights. Fortunately,
there are some existing methods that tackle, either directly or

indirectly, weights elicitation problem, in particular the AHP
method described in Section III-B.
In this paper, we consider the fault localization problem
as a multicriteria decision making problem. The proposed
approach tackles the different metrics by aggregating them
into a single metric using a weighted linear formulation.
We propose a passive/active learning step to maintain the right expected weights of criteria. This approach is based on the Analytic Hierarchy Process (AHP), where a ranking is given to a statement in terms of suspiciousness according to a comparison of the ranks given by the different metrics (Section III-B). The approach is implemented in AHP-LOC. Experiments performed on standard benchmark programs show that our approach enables a more precise localization than existing spectrum-based metrics.
This paper is organized as follows. Section 2 presents re-
lated work. Section 3 recalls preliminaries. Section 4 describes
our approach. Section 5 illustrates the approach on a small
program. Section 6 reports experimental results and a complete
comparison of AHP-LOC with SBFL metrics. Finally, we
conclude this work in Section 7.
II. RELATED WORK
To the best of our knowledge, Xuan and Monperrus proposed in 2014 the first and only work combining multiple ranking metrics [19]. MULTRIC is based on a passive learning process acting on multiple ranking metrics. MULTRIC consists of two phases: learning and ranking. It combines different ranking metrics in a weighted model. The learning phase is a passive one using a training set. The training set corresponds to a set of pairs of statements with their spectra. From each previously handled faulty program, only the faulty statements and their upper and lower neighbors in terms of suspiciousness are considered. Then, pairs of faulty/non-faulty statements are extracted and added to the training set. Considering all possible pairs from the faulty programs can lead to a very large training set. To bypass this limitation, MULTRIC uses a neighborhood strategy with a few statements above and below the faulty statement.
Learning the weight of each metric is based on the assumption that, given a pair of statements (s_f, s_n), where s_f is a faulty statement and s_n a non-faulty one, s_f should be ranked above s_n. The learning relies on a standard binary classifier from machine learning. The ranking phase then combines the scores of the different metrics in a simple weighting function using the learned weights.
AHP-LOC has two distinguishing elements w.r.t. MULTRIC. Firstly, AHP-LOC is an adaptive approach based on active learning, which makes it able to start with an empty training set. Secondly, AHP-LOC benefits from the multicriteria AHP in the aggregation of the different metrics.
III. BACKGROUND
A. Fault Localization
Let us consider a faulty program P_i having n_i lines, labeled e_{i,1} to e_{i,n_i}. A test case tc_{i,j} is a tuple ⟨D_{i,j}, O_{i,j}⟩, where D_{i,j} is the input data and O_{i,j} is the expected output. Let ⟨D_{i,j}, O_{i,j}⟩ be a test case and A_{i,j} the actual output returned by the program after executing its input D_{i,j}. If A_{i,j} = O_{i,j}, then tc_{i,j} is considered passing (i.e. positive), and failing (i.e. negative) otherwise. A test suite T_i = {tc_{i,1}, tc_{i,2}, ..., tc_{i,m_i}} is a set of m_i test cases checking whether the program P_i meets a given set of requirements.

Given a test case tc_{i,j} and a program P_i, the set of statements of P_i executed (at least once) by tc_{i,j} is called the test case coverage I_{i,j} = (I_{i,j,1}, ..., I_{i,j,n_i}), where I_{i,j,k} = 1 if the k-th statement is executed and 0 otherwise. I_{i,j} indicates which parts of the program are active during a specific execution.
SBFL techniques assign a suspiciousness score to each statement and rank the statements in descending order of suspiciousness. Most suspiciousness metrics are defined manually and analytically on the basis of multiple assumptions about programs, test cases and the introduced faults. Fig. 1 lists the formulas of three well-known metrics: TARANTULA [4], OCHIAI [5] and JACCARD [5]. Given a statement e_{i,j}, pass(T_i) (resp. fail(T_i)) is the set of all passing (resp. all failing) test cases, and pass(e_{i,j}) (resp. fail(e_{i,j})) is the set of passing (resp. failing) test cases covering e_{i,j}. The basic assumption is that the program fails when the faulty statement is executed. Moreover, all suspiciousness metrics share the same intuition: the more often a statement is executed by failing test cases, and the less often it is executed by passing test cases, the more suspicious the statement is considered. Fig. 1 shows the suspiciousness spectrum of the different metrics for up to 1,000 passing and/or failing test cases:

TARANTULA allows some tolerance for the fault to be executed by passing test cases (see Fig. 1a). However, this metric cannot differentiate between statements that are not executed by passing tests. For instance, consider two statements e_{i,j} and e_{i,k} with |pass(e_{i,j})| = |pass(e_{i,k})| = 0, |fail(e_{i,j})| = 1 and |fail(e_{i,k})| = 1000: e_{i,j} and e_{i,k} have the same suspiciousness degree according to TARANTULA.

OCHIAI came originally from molecular biology. The specificity of this metric is that it attaches particular importance to the presence of a statement in the failing test cases (see Fig. 1b).

JACCARD has been defined to find a proper balance between the impact of passing and failing test cases on the scoring measure [5] (see Fig. 1c).
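The three metrics can be computed directly from a coverage matrix and test verdicts. A minimal sketch, not the authors' implementation; the boolean encoding of coverage and verdicts is an assumption:

```python
import math

def suspiciousness(coverage, passed):
    """coverage[j][k] = 1 if test case j executes statement k;
    passed[j] = True if test case j passes.
    Returns, per statement, the three scores of Fig. 1."""
    n_stmts = len(coverage[0])
    total_fail = sum(1 for ok in passed if not ok)   # |fail(T_i)|
    total_pass = sum(1 for ok in passed if ok)       # |pass(T_i)|
    scores = []
    for k in range(n_stmts):
        # ef = |fail(e)|, ep = |pass(e)|: failing/passing tests covering k
        ef = sum(1 for j, ok in enumerate(passed) if not ok and coverage[j][k])
        ep = sum(1 for j, ok in enumerate(passed) if ok and coverage[j][k])
        fr = ef / total_fail if total_fail else 0.0
        pr = ep / total_pass if total_pass else 0.0
        scores.append({
            "tarantula": fr / (pr + fr) if (pr + fr) else 0.0,
            "ochiai": ef / math.sqrt((ef + ep) * total_fail) if ef else 0.0,
            "jaccard": ef / (ep + total_fail) if total_fail else 0.0,
        })
    return scores
```

For example, a statement covered by the single failing test and one of two passing tests scores higher under Tarantula than a statement covered by every test.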
B. Analytical Hierarchy Process (AHP) [20]
AHP is a simple and easy to use structured process for
organizing and eliciting criteria weights. It involves three main
steps:
1) The criteria are subjectively compared in a pairwise manner according to their respective weights w_i. The comparison is organized into a square matrix A[1..m, 1..m], where A[i, j] is the relative importance of criterion i w.r.t. criterion j. The i-th criterion is better than the j-th criterion if A[i, j] > 1. In AHP, we have 9 degrees of dominance, where A[i, j] = 1 indicates an indifference preference and A[i, j] = 9 a strong dominance. Note that A[i, j] = 1/A[j, i] and A[i, i] = 1.

[Fig. 1: Suspiciousness Degrees — three panels, (a) Tarantula, (b) Ochiai and (c) Jaccard, plot each metric's suspiciousness (in [0, 1]) against #failed and #passed test cases, each ranging from 0 to 1000. The formulas are:

    T(e_{i,j}) = (|fail(e_{i,j})| / |fail(T_i)|) / (|pass(e_{i,j})| / |pass(T_i)| + |fail(e_{i,j})| / |fail(T_i)|)

    O(e_{i,j}) = |fail(e_{i,j})| / sqrt((|fail(e_{i,j})| + |pass(e_{i,j})|) × |fail(T_i)|)

    J(e_{i,j}) = |fail(e_{i,j})| / (|pass(e_{i,j})| + |fail(T_i)|) ]

2) AHP assesses the criteria weighting vector w by solving the characteristic equation:

    A · w = λ_max · w    (1)

where λ_max is the largest eigenvalue of A.

3) Inconsistencies may occur in pairwise comparisons, because AHP does not enforce the preferences to be transitive. For this reason, a consistency check is conducted by calculating the consistency ratio (CR):

    CR = CI / RI,    CI = (λ_max − n) / (n − 1)    (2)

where RI is a constant taken from the Random Consistency Index table of AHP. The weighting vector w is considered reliable if CR < 0.1.
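Steps 2 and 3 can be sketched numerically; the sketch below uses power iteration for the principal eigenvector and Saaty's commonly tabulated Random Consistency Index values. The function name and matrix encoding are assumptions, not the authors' code:

```python
# Random Consistency Index values commonly tabulated for AHP (up to n = 9)
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(A, iters=200):
    """Principal eigenvector of the pairwise matrix A by power iteration,
    normalized to sum 1, plus the consistency ratio CR = CI / RI with
    CI = (lambda_max - n) / (n - 1)."""
    n = len(A)
    w = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
        s = sum(v)
        w = [x / s for x in v]
    # estimate lambda_max as the average of (A w)_i / w_i
    Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam_max = sum(Aw[i] / w[i] for i in range(n)) / n
    ci = (lam_max - n) / (n - 1)
    cr = ci / RI[n] if RI[n] else 0.0
    return w, cr
```

A perfectly consistent matrix (A[i, j] = w_i / w_j) yields CR = 0; the CR < 0.1 check flags matrices whose preferences are too intransitive to trust.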
The AHP can be combined with ROC (Rank-Order Centroid) [21], a popular direct weight elicitation method. ROC produces an estimation of the weights that minimizes the maximum error of each weight. The ROC method assumes that the true weights are uniformly distributed on the simplex of rank-order weights. That is, ROC is a function based on the average of the corners of the polytope defined by the simplex S_w = {w_1 > w_2 > ... > w_n, Σ_i w_i = 1, w_i > 0}, such that:

    w_i = (1/n) Σ_{k=i}^{n} 1/r_k    (3)

where r_i is the rank of the i-th criterion.
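With the criteria indexed in rank order (so r_k = k), equation (3) reduces to a short function; a sketch under that assumption:

```python
def roc_weights(n):
    """Rank-Order Centroid weights for n criteria ranked 1..n:
    w_i = (1/n) * sum_{k=i}^{n} 1/k (eq. 3 with r_k = k)."""
    return [sum(1.0 / k for k in range(i, n + 1)) / n for i in range(1, n + 1)]
```

The weights sum to 1 and decrease with rank, e.g. for n = 3 they are 11/18, 5/18 and 2/18.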
IV. AHP-LOC APPROACH
Let {P_1, ..., P_n} be a set of n faulty programs. A faulty program context is a triplet (P_i, L_i, T_i), i = 1..n, where P_i is a given faulty program whose k faults are located at lines L_i = {L_i^1, ..., L_i^k}, and T_i is a test suite. Let (M_1, M_2, ..., M_m) be some SBFL ranking metrics (e.g., Tarantula, Ochiai, Jaccard, etc.).

The aim of this paper is to propose an approach aggregating SBFL ranking metrics into a single SBFL ranking metric that takes advantage of their localization effectiveness. This is achieved thanks to the AHP technique. AHP requires a square matrix A[1..m, 1..m], where A[i, j] scores how much better criterion i is ranked than criterion j.
1) Scoring a single SBFL technique.
To evaluate localization accuracy, a suspiciousness score, called the EXAM score [11], is assigned to every faulty version of each subject program. The score gives the percentage of statements that need to be examined before the one locating the fault: lower is better. When running a single SBFL technique M_j on some faulty program context (P_i, L_i, T_i), it can return a set of equivalent statements in terms of suspiciousness (i.e., with the same suspiciousness degree). In this case, the effectiveness depends on which statement is checked first. For that reason, we consider two EXAM scores, the optimistic and the pessimistic one, denoted respectively R^O_{i,j} and R^P_{i,j}. We talk about the optimistic EXAM (resp. pessimistic EXAM) when the first (resp. last) statement to be checked in the set of equivalent statements is the faulty one. We also define a third metric, ΔEXAM_{i,j} = R^P_{i,j} − R^O_{i,j}, representing the margin of the EXAM score, and the middle EXAM R^M_{i,j} = (R^P_{i,j} + R^O_{i,j}) / 2. In other words, ΔEXAM (resp. the middle EXAM) represents the distance (resp. the average) between the optimistic and the pessimistic scores.

2) Comparing two SBFL techniques.
Let M_j and M_l be two SBFL techniques to be compared on some faulty program context (P_i, L_i, T_i), by considering their pessimistic and optimistic results. Computing their middle gap E^M_{i,j,l} = R^M_{i,j} − R^M_{i,l} tells us how much better M_j is than M_l: if E^M_{i,j,l} ≥ 0, then M_l is better; otherwise the converse holds. We could also consider the pessimistic or optimistic gaps, but the middle gap is preferred since it aggregates both. Finally, when running two SBFL techniques M_j and M_l on all the faulty program contexts (P_i, L_i, T_i), i = 1..n, we can estimate the most efficient technique by averaging their middle gaps over all programs: AVG_{j,l} = (1/n) Σ_{i=1..n} E^M_{i,j,l}.
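The optimistic and pessimistic EXAM scores and the middle gap can be sketched as follows; the dictionary encoding of a ranking and the convention of counting the faulty statement itself are assumptions, not the paper's exact definitions:

```python
def exam_scores(ranking, faulty):
    """ranking maps each statement to its suspiciousness score; faulty is
    the faulty statement. Returns (R_O, R_P): the fraction of statements
    examined when the fault is checked first (optimistic) or last
    (pessimistic) within its equivalence class."""
    n = len(ranking)
    s = ranking[faulty]
    better = sum(1 for v in ranking.values() if v > s)  # strictly more suspicious
    tied = sum(1 for v in ranking.values() if v == s)   # same class, incl. the fault
    return (better + 1) / n, (better + tied) / n

def middle_gap(rj, rl):
    """E^M: difference of the middle EXAMs R^M = (R^O + R^P)/2 of two metrics."""
    return (rj[0] + rj[1]) / 2 - (rl[0] + rl[1]) / 2
```

A negative middle gap means the first metric examines fewer statements on average, i.e. localizes better on that context.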
3) Computing AHP ranking matrix

Algorithm 1: Learning
Input: D = {⟨P_i, L_i, T_i⟩ | P_i: faulty program, L_i: fault locations, T_i: test cases}; M: set of m ranking metrics
InOut: A: AHP matrix

foreach metric M_j ∈ M do
    A[j, j] ← 1
    foreach ⟨P_i, T_i, L_i⟩ ∈ D do
        compute R^P_{i,j} using M_j
        compute R^O_{i,j} using M_j
n ← 0; AVG_{·,·} ← 0
foreach pair of metrics M_j, M_l ∈ M do
    foreach ⟨P_i, T_i, L_i⟩ ∈ D do
        E^P_{i,j,l} ← R^P_{i,j} − R^P_{i,l}
        E^O_{i,j,l} ← R^O_{i,j} − R^O_{i,l}
        AVG_{j,l} ← (n / (n+1)) · AVG_{j,l} + (1 / (n+1)) · (E^P_{i,j,l} + E^O_{i,j,l})
        n ← n + 1
foreach AVG_{j,l}, j, l ∈ {1, ..., m} do
    scale |AVG_{j,l}| to [1, 9]
    if AVG_{j,l} < 0 then A[l, j] ← AVG_{j,l}; A[j, l] ← 1/AVG_{j,l}
    else A[j, l] ← AVG_{j,l}; A[l, j] ← 1/AVG_{j,l}
return A
Algorithm 1 generates the AHP matrix without any prior ranking of the decision criteria, by exploiting pairwise comparisons. It loops over all pairs of SBFL techniques: for all j, l ∈ 1..m, it computes AVG_{j,l}. Then it scales the m × m comparisons AVG_{j,l} to the permitted values of the AHP matrix A. As explained in Section III-B, an indirect weight elicitation is performed, the weights being obtained by computing the eigenvector w associated with the largest eigenvalue λ_max. This technique is far superior to direct techniques owing to its ability to capture the decision maker's trade-offs between criteria. The consistency ratio CR (see formula 2) is checked to assess to what extent the learned coefficients of A are plausible.
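The structure of Algorithm 1 can be sketched as follows. Here `exam` is a hypothetical helper returning (R^P, R^O) for a metric on a context, and the linear scaling of |AVG| to [1, 9] is one plausible reading, since the paper does not fix the scaling function:

```python
def learn_ahp_matrix(contexts, metrics, exam):
    """Build the AHP matrix from averaged pairwise EXAM gaps, following
    the branch structure of Algorithm 1 (a sketch, not the authors' code)."""
    m = len(metrics)
    A = [[1.0] * m for _ in range(m)]                       # A[j, j] = 1
    R = {(i, j): exam(ctx, metrics[j])
         for i, ctx in enumerate(contexts) for j in range(m)}
    # AVG[j][l]: average of pessimistic + optimistic gaps over all contexts
    avg = [[0.0] * m for _ in range(m)]
    for j in range(m):
        for l in range(m):
            if j == l:
                continue
            gaps = [(R[(i, j)][0] - R[(i, l)][0]) + (R[(i, j)][1] - R[(i, l)][1])
                    for i in range(len(contexts))]
            avg[j][l] = sum(gaps) / len(gaps)
    # scale |AVG| linearly into Saaty's [1, 9] range and fill reciprocal pairs
    top = max((abs(v) for row in avg for v in row), default=0.0)
    for j in range(m):
        for l in range(j + 1, m):
            s = (1.0 + 8.0 * abs(avg[j][l]) / top) if top else 1.0
            # mirror Algorithm 1's branch: the sign of AVG decides which
            # side of the matrix receives the dominant (scaled) entry
            if avg[j][l] < 0:
                A[l][j], A[j][l] = s, 1.0 / s
            else:
                A[j][l], A[l][j] = s, 1.0 / s
    return A
```

The resulting matrix satisfies A[i, j] = 1/A[j, i] and A[i, i] = 1 by construction, so it can be fed directly to the weight-elicitation step.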
A. Passive version of AHP-LOC
Algorithm 1 learns the AHP matrix from a training set composed of faulty program contexts D = {(P_1, L_1, T_1), ..., (P_n, L_n, T_n)}. The passive version of AHP-LOC consists of running Algorithm 1 on an already fixed set of faulty program contexts. The performance of this passive approach depends on the quality of the given training set.
B. Active version of AHP-LOC
The passive approach is straightforward and simple, but it is challenging to ensure that the given training set is sufficient to start an efficient localization. In addition, it does not take benefit of current and future localizations. We therefore propose the active version of AHP-LOC, where learning is done dynamically. Its principle is the following: as long as the current learning dataset yields an AHP matrix less efficient than some SBFL ranking metric, a new faulty program context is added to the learning dataset D of Algorithm 1 and the AHP matrix is updated, and so on, until the AHP matrix achieves the best localization on the last added faulty program context.
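This active loop can be sketched as follows, with `learn` and `evaluate` as hypothetical stand-ins for Algorithm 1 and for the efficiency check against the individual SBFL metrics; the exact stopping policy is one plausible reading of the description above:

```python
def active_learn(candidate_contexts, learn, evaluate):
    """Sketch of the active AHP-LOC loop: learn(D) returns an AHP matrix
    from dataset D; evaluate(A, ctx) returns True when the AHP-based
    ranking on ctx is at least as good as every individual SBFL metric."""
    D = []                      # start with an empty training set
    A = learn(D)
    for ctx in candidate_contexts:
        if evaluate(A, ctx):    # current matrix already suffices here
            continue
        D.append(ctx)           # otherwise enrich the training set
        A = learn(D)            # and relearn the AHP matrix
    return A, D
```

Contexts on which the current matrix already performs well are skipped, so the training set only grows where the aggregation actually falls behind some metric.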
C. AHP ranking score
Once the AHP matrix A is learned in a passive or active way, we proceed to compute the metrics weighting vector w. For that, we solve equation (1). Afterward, we compute the score of a given statement e of a given faulty program using the following weighted aggregation function:

    score_AHP(e) = Σ_{i=1}^{m} w_i · scale(score_{M_i}(e))    (4)

The scale function adjusts the score values of the different metrics to a common [0, 1] scale.
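Equation (4) can be sketched as follows; min-max scaling across statements is an assumption, since the paper only requires each metric's scores to be mapped to [0, 1]:

```python
def score_ahp(stmt_scores, w):
    """stmt_scores[k][i] is metric i's raw score for statement k; w is the
    AHP weighting vector. Each metric is min-max scaled to [0, 1] across
    statements, then aggregated per statement with the weights (eq. 4)."""
    m = len(w)
    lo = [min(s[i] for s in stmt_scores) for i in range(m)]
    hi = [max(s[i] for s in stmt_scores) for i in range(m)]

    def scale(v, i):
        return (v - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0

    return [sum(w[i] * scale(s[i], i) for i in range(m)) for s in stmt_scores]
```

Statements are then ranked in descending order of the aggregated score, exactly as a single SBFL metric would be used.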
V. RUNNING EXAMPLE
To illustrate our approach, we consider the Power program given in Fig. 2. In this figure, we have six test cases, where tc_1 to tc_3 are failing test cases and tc_4 to tc_6 are passing test cases. According to the provided test cases, we report the suspiciousness ranking given by five ranking metrics ((1) AMPLE [22], (2) TARANTULA [4], (3) GP13 [23], (4) OCHIAI [5] and (5) JACCARD [5]) and our AHP-LOC approach. In this example, two faults are introduced at e_3 and e_4, where the correct statements are respectively “p = y;” and “p = y;”.
AHP-LOC learns the AHP matrix from faulty program examples in a passive/active way. In this example, we asked an expert to provide us with preferences/dominances between the different pairs of the five ranking metrics:

    A =
            (1)    (2)    (3)    (4)    (5)
     (1)     1      2      3      9      9
     (2)    1/2     1      2      3      7
     (3)    1/3    1/2     1      2      4        (5)
     (4)    1/9    1/3    1/2     1      2
     (5)    1/9    1/7    1/4    1/2     1

It is important to stress here that: First, it is not always possible to obtain the opinion of an expert. Second, an expert cannot provide such a matrix in an AHP context when the number of criteria exceeds 10 [20]. Third, the expert faces a challenging task, as their opinion heavily depends on the provided test cases. The learning step of AHP-LOC is proposed to bypass this limitation. We learn the AHP matrix from faulty program contexts without the help of any expert.

Returning to our example, we compute the weighting vector w according to equation (1). This leads to w = ⟨0.49, 0.25, 0.15, 0.07, 0.04⟩. Now, for each statement, we can compute a ranking AHP score using equation (4).
The weighting vector w_AHP = ⟨0.49, 0.25, 0.15, 0.07, 0.04⟩ was generated from the AHP matrix (5), with a consistency ratio CR = 0.018 < 0.1.
Fig. 2 shows how different the ranking metrics can be, with completely different rankings. The fault at e_4 is ranked first except by GP13 and OCHIAI: as they give greater prominence to the failing test cases, they rank e_4 in third position. The aggregation done by AHP-LOC ranks e_4 in first position as well. For the fault introduced at e_3, TARANTULA, OCHIAI and GP13 rank it in the next-to-bottom place. The localization with AMPLE ranks it in first position, but this localization is not accurate, since we have a large equivalence class. What is interesting here is that the AHP aggregation ranks it in second position. Here, AHP-LOC shows its best case by taking advantage of the differences that exist among the ranking metrics.
VI. EXPERIMENTS
This section describes the experimental settings (including
benchmark programs, protocol and implementation), the ex-
perimental results and comparison with eight SBFL ranking
metrics. Experiments were performed on single and multiple
fault programs.
A. Benchmark programs
We evaluate our approach by analyzing the performance of
fault localization over 18 faulty programs coming from differ-
ent real-world applications. As our approach does not require
a specific programming language in order to be applied, we
investigated two types of programming paradigms, namely C
and Java Object Oriented programs.
Siemens and Space datasets. We have considered both
Siemens and Space datasets.
1
A complete description of
Siemens suite and Space dataset can be found in [24], [25].
These C programs are the most common program suites used
to evaluate software testing and fault localization approaches.
The Siemens suite + Space are provided with eight C pro-
grams, each one has a correct version and a set of faulty
versions (one fault per version). The suite is also provided
with test suites for each faulty version.
Table I summarizes the 212 faulty programs. For each
program, we report the number of faulty versions (single fault
1F, two faults 2F and four faults 4F), the size of the program
with its lines of code (LOC) and lines of executable code
(LEC), the number of test cases. We have 139 versions with
single fault, 47 with two faults and 26 versions with four faults.
The single fault versions are provided, where multiple fault
versions are produced by combining randomly the provided
faults.
For C programs and to know the statements that are covered
by a given (passing/failing) test case, we used GCOV
2
profiler
tool to trace and save the coverage information of test cases
as a boolean matrix (e.g., see Fig.2). Then, each test case
is classified as positive/negative w.r.t. the provided correct
version.
Java Object-Oriented datasets. We evaluate our approach on
25,386 faults in ten open-source Java projects (see Table II).³
Here we used method granularity (rather than the more common
statement granularity we used for the Siemens suite). In
¹ sir.unl.edu/php/previewfiles.php
² https://gcc.gnu.org/onlinedocs/gcc/Gcov.html
³ http://www.feu.de/ps/prjs/EzUnit/eval/ISSTA13
TABLE I: Siemens suite and Space.

Program        #Versions (1F, 2F, 4F)   LOC      LEC     Test cases
Replace        (29, 6, 3)               514      245     5,542
PrintTokens2   (9, 6, 3)                358      200     4,056
PrintTokens    (4, 6, 3)                348      195     4,071
Schedule       (5, 6, 3)                294      152     2,650
Schedule2      (8, 6, 3)                265      128     2,680
TotInfo        (19, 6, 3)               272      123     1,052
Tcas           (37, 6, 3)               135      65      1,578
Space          (28, 5, 5)               9,126    3,657   13,585
Total          (139, 47, 26)            11,312   4,765   35,214

LOC: lines of code in the correct version
LEC: lines of executable code
TABLE II: Java Open-source projects.

Program               #Versions (1F, 2F, 4F)   Methods   MUT     Test cases
Daikon 4.6.4          (352, 10^3, 10^3)        14,387    1,936   157
Eventbus 1.4          (577, 10^3, 10^3)        859       338     91
Jaxen 1.1.5           (600, 10^3, 10^3)        1,689     961     695
Jester 1.37b          (411, 10^3, 10^3)        378       152     64
JExel 1.0.0b13        (537, 10^3, 10^3)        242       150     335
JParsec 2.0           (598, 10^3, 10^3)        1,011     893     510
AC Codec 1.3          (543, 10^3, 10^3)        265       229     188
AC Lang 3.0           (599, 10^3, 10^3)        5,373     2,075   1,666
Eclipse Draw2d 3.4.2  (570, 10^3, 10^3)        3,231     878     89
HTML Parser 1.6       (599, 10^3, 10^3)        1,925     785     600
Total                 (5,386, 10^4, 10^4)      29,360    8,397   4,395

MUT: methods under test
Methods: total number of methods, not counting JUnit test methods
fact, methods have the advantage of giving a natural context
for each fault, and of being the natural skip/step-into units of
contemporary debuggers (guided by the output of the fault
locator) [26].
Table II shows the details of the different projects, including
the number of faulty programs (single fault 1F, two faults
2F and four faults 4F), the number of methods (excluding
JUnit test methods), the number of methods under test, and
the number of test cases.
We have 5,386 single-fault Java programs, 10^4 programs with two
faults and 10^4 programs with four faults, all provided by the
distribution. The multiple faults are spread over different
methods.
B. Experimental protocol
Several statements can be equivalent in terms of suspiciousness,
and the measured accuracy then varies depending on which of them
is checked first. For this reason, we report two exam scores,
the optimistic and the pessimistic one, denoted in this section
O-EXAM and P-EXAM respectively. We recall that O-EXAM (resp.
P-EXAM) assumes that the first (resp. last) statement checked in
a set of equivalent statements is the faulty one. We also use
Δ-EXAM = O-EXAM − P-EXAM, representing the range of the EXAM
score.
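The tie handling can be sketched as follows (an illustrative reconstruction, not the paper's code; names are ours): O-EXAM counts only the statements ranked strictly above the fault, while P-EXAM also counts every statement tied with it:

```python
# Sketch: optimistic and pessimistic EXAM scores for one faulty statement.
# scores[i] is the suspiciousness of statement i; faulty is its index.

def exam_scores(scores, faulty):
    """Return (O-EXAM, P-EXAM) as percentages of statements examined."""
    n = len(scores)
    s = scores[faulty]
    higher = sum(1 for v in scores if v > s)  # always examined before the fault
    tied = sum(1 for v in scores if v == s)   # includes the fault itself
    o_exam = 100.0 * (higher + 1) / n         # fault checked first among ties
    p_exam = 100.0 * (higher + tied) / n      # fault checked last among ties
    return o_exam, p_exam
```

With scores (0.9, 0.5, 0.5, 0.1) and the fault at the first tied position, the two scores are 50% and 75%, and Δ-EXAM captures that 25-point spread.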
Our approach AHP-LOC is trained on a subset of the faulty
programs to learn the coefficients of the AHP matrix (the
training set). To evaluate how accurately the learned AHP matrix
performs, the resulting model is validated on the remaining
faulty programs (the validation set).
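What the learned matrix yields can be pictured with a minimal sketch of Saaty's eigenvector method and the weighted linear aggregation (a generic illustration, not AHP-LOC itself; all names are ours): the criterion weights are the normalized principal eigenvector of the pairwise comparison matrix, and a statement's aggregate suspiciousness is the weighted sum of its per-metric scores:

```python
# Sketch: criterion weights from an AHP pairwise comparison matrix
# (principal eigenvector via power iteration), then linear aggregation.

def ahp_weights(matrix, iters=100):
    """matrix[i][j] states how much criterion i is preferred to j (Saaty scale).
    Returns the normalized principal eigenvector as the weight vector."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iters):
        w = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        w = [v / total for v in w]
    return w

def aggregate(per_metric_scores, weights):
    """Weighted linear aggregation of one statement's scores across metrics."""
    return sum(s * w for s, w in zip(per_metric_scores, weights))
```

For a consistent 2x2 matrix stating that the first metric is twice as important, the weights come out as roughly (2/3, 1/3), and statements are then ranked by their aggregate score.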
References

  • J. Figueira, S. Greco, and M. Ehrgott (eds.), Multiple Criteria Decision Analysis: State of the Art Surveys, Springer.
  • T. L. Saaty, "Decision Making with the Analytic Hierarchy Process." (Cited as [20], the basis of the AHP matrix used in this paper.)
  • J. A. Jones and M. J. Harrold, "Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique," ASE 2005. (Cited as [4] for the TARANTULA metric.)
  • H. Do, S. Elbaum, and G. Rothermel, "Supporting Controlled Experimentation with Testing and Regression Testing Techniques: An Infrastructure and Its Potential Impact." (Cited in [24], [25] for the Siemens and Space datasets.)
  • J. A. Jones, M. J. Harrold, and J. Stasko, "Visualization of Test Information to Assist Fault Localization," ICSE 2002. (Cited as background on spectrum-based fault localization [2], [3].)