
A General Software Defect-Proneness Prediction Framework

01 May 2011-IEEE Transactions on Software Engineering (IEEE)-Vol. 37, Iss: 3, pp 356-370
TL;DR: The results show that the proposed framework for software defect prediction is more effective and less prone to bias than previous approaches, and that small details in how evaluations are conducted can completely reverse findings.
Abstract: BACKGROUND - Predicting defect-prone software components is an economically important activity and so has received a good deal of attention. However, making sense of the many, and sometimes seemingly inconsistent, results is difficult. OBJECTIVE - We propose and evaluate a general framework for software defect prediction that supports 1) unbiased and 2) comprehensive comparison between competing prediction systems. METHOD - The framework is comprised of 1) scheme evaluation and 2) defect prediction components. The scheme evaluation analyzes the prediction performance of competing learning schemes for given historical data sets. The defect predictor builds models according to the evaluated learning scheme and predicts software defects with new data according to the constructed model. In order to demonstrate the performance of the proposed framework, we use both simulation and publicly available software defect data sets. RESULTS - The results show that we should choose different learning schemes for different data sets (i.e., no scheme dominates), that small details in how evaluations are conducted can completely reverse findings, and last, that our proposed framework is more effective and less prone to bias than previous approaches. CONCLUSIONS - Failure to properly or fully evaluate a learning scheme can be misleading; however, these problems may be overcome by our proposed framework.

Summary (4 min read)

1 INTRODUCTION

  • Software defect prediction has been an important research topic in the software engineering field for more than 30 years.
  • Current defect prediction work focuses on (i) estimating the number of defects remaining in software systems, (ii) discovering defect associations, and (iii) classifying the defect-proneness of software components, typically into two classes: defect-prone and not defect-prone.
  • Defect association mining can, among other uses, assist managers in improving the software process through analysis of the reasons why some defects frequently occur together.
  • Using publicly available data sets from different organizations allows the authors to explore the impact of data from different sources on different processes for finding appropriate classification models, apart from evaluating these processes in a fair and reasonable way.
  • Google scholar (accessed February 6, 2010) indicates an impressive 132 citations to MGF [23] within the space of three years.

3.1 Overview of the framework

  • Generally, before building defect prediction model(s) and using them for prediction purposes, the authors first need to decide which learning scheme should be used to construct the model.
  • Consequently the authors propose a new software defect prediction framework that provides guidance to address these potential shortcomings.
  • At the scheme evaluation stage, the performances of the different learning schemes are evaluated with historical data to determine whether a certain learning scheme performs sufficiently well for prediction purposes or to select the best from a set of competing schemes.
  • It is very important that the test data are not used in any way to build the learners.
  • From Fig. 1 the authors observe that all the historical data are used to build the predictor here.

3.2 Scheme evaluation

  • The scheme evaluation is a fundamental part of the software defect prediction framework.
  • To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
  • After the training-test split is done in each round, both the training data and the learning scheme(s) are used to build a learner.
  • In their experiment, the authors use a log-filtering preprocessor, which replaces all numerics n with their logarithms ln(n), as in MGF.
  • Therefore attribute selection has to be performed on the training data.

3.3 Defect prediction

  • The defect prediction part of their framework is straightforward: it consists of predictor construction and defect prediction.
  • 2) A predictor is built with the selected learning scheme and the whole historical data.
  • Its final performance is the mean over all rounds.
  • A single round of cross-validation uses only one part of the data.
  • The detailed defect prediction process is described with pseudocode in the following Procedure Prediction.

3.4 Difference between our proposed framework and MGF

  • This should be performed only on the training data.
  • To recap, the essential problem in MGF’s study is that the test data were used for attribute selection, which violated the intention of the holdout strategy.
  • The authors’ framework focuses on the attribute selection method itself instead of a particular ‘best’ subset, as different training data may produce different best subsets.
  • After that, the ‘outer’ cross-validation assesses how well the learner built with such ‘best’ attributes performs on the test data, which is genuinely new to the learner.

4.1 Data sets

  • In addition, the AR data from the PROMISE repository was also used.
  • Table 1 provides some basic summary information.
  • After preprocessing, modules that contain one or more defects were labeled as defective.
  • A more detailed description of code attributes or the origin of the MDP data sets can be obtained from [23].

4.2 Performance measures

  • The receiver operating characteristic (ROC) curve is often used to evaluate the performance of binary predictors.
  • Nevertheless, this doesn’t necessarily mean that all predictors with the same balance value have the same practical usefulness.
  • Thus the ROC curve characterizes the performance of a binary predictor across varying thresholds.
  • AUC is one of the most informative and commonly used measures, so it is used as another performance measure in this paper.
  • Diff = (EvaPerf − PredPerf) / PredPerf × 100% (6), where EvaPerf represents the mean evaluation performance and PredPerf denotes the mean prediction performance; a sketch of this computation follows below.
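
As a concrete illustration of Equation (6), the sketch below (Python; the function name and the example numbers are ours, not the paper's) computes the relative difference between evaluation and prediction performance:

def diff_percent(eva_perf, pred_perf):
    # Diff = (EvaPerf - PredPerf) / PredPerf * 100%, as in Equation (6).
    return (eva_perf - pred_perf) / pred_perf * 100.0

# Hypothetical balance values: 0.72 from evaluation, 0.68 from prediction.
print(diff_percent(0.72, 0.68))  # ~5.88, i.e. the evaluation looks ~5.88% better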

4.3 Experiment Design

  • Two experiments are designed.
  • The first compares their framework with that of MGF; the second demonstrates their framework in practice and explores whether the authors should choose a particular learning scheme or not.

4.3.1 Framework comparison

  • This experiment was used to compare their framework with that of MGF, who reported that a Naïve Bayes data miner with a log-filtering preprocessor achieved a mean (pd, pf) = (71, 25).
  • In their experiment, the authors simulated the whole process of defect prediction to explore whether MGF’s evaluation result is misleading or not.
  • Then an iterative attribute subset selection as used in MGF’s study was performed.
  • Specifically, as described in the scheme evaluation procedure, the authors applied the learning scheme only to the training data, after which the final Naïve Bayes learner was built and the test data were used to evaluate the performance of the learner.
  • Then the predictor was used to predict defects on the new data, which was processed in the same way as the historical data.

4.3.2 Defect prediction with different learning schemes

  • This experiment is intended to demonstrate their framework, to illustrate that different elements of a learning scheme have different impacts on the predictions, and to confirm that the authors should choose the combination of a data preprocessor, an attribute selector, and a learning algorithm, rather than any one of them separately.
  • For this purpose, twelve different learning schemes were designed according to the following data preprocessors, attribute selectors, and learning algorithms (see the sketch after this list).
  • Forward Selection starts with the single best attribute and then tries each of the remaining attributes in conjunction with it to find the best pair of attributes, and so on.
  • For each pass, the authors took 90% of the data as historical data, and the remainder as new data.
  • The authors performed the whole process twice, with balance and AUC respectively.
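
For illustration only, the sketch below enumerates the twelve learning schemes as combinations of the components named in the results (NB, J48, OneR; Log/None; FS/BE); it is our reconstruction in Python, not code from the paper:

from itertools import product

preprocessors = ["None", "Log"]        # raw values vs. log-filtered ln(n)
selectors = ["FS", "BE"]               # forward selection vs. backward elimination
algorithms = ["NB", "J48", "OneR"]     # Naive Bayes, C4.5 decision tree, OneR

# A learning scheme is the combination of all three components (2 x 2 x 3 = 12).
schemes = [f"{a}+{p}+{s}" for p, s, a in product(preprocessors, selectors, algorithms)]
assert len(schemes) == 12
print(schemes)                         # e.g. 'NB+None+FS', 'J48+Log+BE', ...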

4.4.1 Framework comparison

  • The framework comparison results are summarized in Table 2, which shows the results in terms of balance.
  • Thus the authors see what a dramatic impact a seemingly small difference in a validation procedure can have.
  • 3) The largest absolute value of balance diff in the MGF framework is 25.7% on the AR1 data, for which the corresponding absolute value of balance diff in the proposed framework is just 3.16%.
  • Finally, a Wilcoxon signed-rank test of medians yields p = 0.0028 for the one-tailed hypothesis that the absolute balance diff of the new framework is significantly less than that of the MGF framework (see the sketch after this list).
  • On the other hand, the mean prediction performance of the proposed framework is higher than that of MGF.
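
The statistical comparison could be reproduced along the following lines (a sketch assuming SciPy; the per-data-set values below are invented placeholders, the real ones are in the paper's Table 2):

import numpy as np
from scipy.stats import wilcoxon

# Placeholder |balance diff| values (%) for the 17 data sets under each framework.
abs_diff_new = np.array([3.2, 1.5, 2.8, 0.9, 3.1, 2.2, 1.7, 2.5, 0.8,
                         1.9, 2.6, 3.0, 1.1, 2.3, 1.4, 2.0, 3.16])
abs_diff_mgf = np.array([9.5, 7.2, 12.1, 4.4, 15.8, 6.3, 8.9, 10.2, 5.1,
                         7.7, 11.4, 13.0, 6.8, 9.9, 5.6, 8.1, 25.7])

# Paired one-tailed test: is the new framework's |balance diff| smaller than MGF's?
stat, p_value = wilcoxon(abs_diff_new, abs_diff_mgf, alternative="less")
print(stat, p_value)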

4.4.2 Defect prediction with different learning schemes

  • The twelve different learning schemes were evaluated and then used to predict defect-prone modules across the same 17 data sets.
  • This reveals that different attribute selectors can be suitable for different learning algorithms.
  • 5) For both the evaluation and prediction, the AUCs of schemes NB+Log+FS, NB+Log+BE, NB+None+FS and NB+None+BE are much better than those of schemes J48+Log+FS, J48+Log+BE, J48+None+FS and J48+None+BE, respectively.
  • This means Forward Selection is more suitable for Naïve Bayes with the data preprocessor None.
  • Table 11 shows the p-values of the Wilcoxon signed-rank test on learning algorithms J48 vs. OneR over the 17 data sets.

5 CONCLUSION

  • The authors have presented a novel benchmark framework for software defect prediction.
  • In the evaluation stage, different learning schemes are evaluated and the best one is selected.
  • From their experimental results the authors observe that there is a bigger difference between the evaluation performance and the actual prediction performance in MGF’s study than with their framework.
  • Whilst this might seem like some small technicality the impact is profound.
  • When the authors perform statistical significance testing, they obtain dramatically different findings that are highly statistically significant but in opposite directions.


A General Software Defect-Proneness
Prediction Framework
Qinbao Song, Zihan Jia, Martin Shepperd, Shi Ying and Jin Liu
Abstract—BACKGROUND predicting defect-prone software components is an economically important activity and so has received
a good deal of attention. However, making sense of the many, and sometimes seemingly inconsistent, results is difficult.
OBJECTIVE we propose and evaluate a general framework for software defect prediction that supports (i) unbiased and (ii)
comprehensive comparison between competing prediction systems.
METHOD the framework comprises (i) scheme evaluation and (ii) defect prediction components. The scheme evaluation analyzes the
prediction performance of competing learning schemes for given historical data sets. The defect predictor builds models according to
the evaluated learning scheme and predicts software defects with new data according to the constructed model. In order to demonstrate
the performance of the proposed framework, we use both simulation and publicly available software defect data sets.
RESULTS the results show that we should choose different learning schemes for different data sets (i.e. no scheme dominates), that
small details in how evaluations are conducted can completely reverse findings, and lastly that our proposed framework is
more effective, and less prone to bias than previous approaches.
CONCLUSIONS failure to properly or fully evaluate a learning scheme can be misleading, however, these problems may be overcome
by our proposed framework.
Key Words—Software defect prediction, software defect-proneness prediction, machine learning, scheme evaluation.
1 INTRODUCTION
Software defect prediction has been an important re-
search topic in the software engineering field for more
than 30 years. Current defect prediction work focuses
on (i) estimating the number of defects remaining in
software systems, (ii) discovering defect associations,
and (iii) classifying the defect-proneness of software
components, typically into two classes defect-prone and
not defect-prone. This paper is concerned with the third
approach.
The first type of work employs statistical approaches
[1], [2], [3], capture-recapture (CR) models [4], [5], [6], [7],
and detection profile methods (DPM) [8] to estimate the
number of defects remaining in software systems with
inspection data and process quality data. The prediction
result can be used as an important measure for the
software developer [9], and can be used to control the
software process (i.e. decide whether to schedule further
inspections or pass the software artifacts to the next
development step [10]) and gauge the likely delivered
quality of a software system [11].
The second type of work borrows association rule min-
ing algorithms from the data mining community to re-
Q. Song and Z. Jia are with the Department of Computer Science and
Technology, Xi’an Jiaotong University, Xi’an, 710049 China.
E-mail: qbsong@mail.xjtu.edu.cn, jiazh.eden@stu.xjtu.edu.cn.
M. Shepperd is with the School of Information Science, Computing, and
Mathematics, Brunel University, Uxbridge, UB8 3PH UK.
E-mail: martin.shepperd@brunel.ac.uk.
S. Ying and J. Liu are with the State Key Laboratory of Software
Engineering, Wuhan University, Wuhan, 430072 China.
E-mail: yingshi@whu.edu.cn, mailjinliu@yahoo.com.
veal software defect associations [12], which can be used
for three purposes. First to find as many related defects
as possible to the detected defect(s) and consequently
make more effective corrections to the software. This
may be useful as it permits more directed testing and
more effective use of limited testing resources. Second,
to help evaluate reviewers’ results during an inspection.
Thus a recommendation might be that his/her work
should be reinspected for completeness. Third, to assist
managers in improving the software process through
analysis of the reasons why some defects frequently
occur together. If the analysis leads to the identification
of a process problem, managers can devise corrective
action.
The third type of work classifies software compo-
nents as defect-prone and non-defect-prone by means of
metric-based classification [13], [14], [15], [16], [17], [18],
[19], [20], [21], [22], [23], [24]. Being able to predict which
components are more likely to be defect-prone supports
better targeted testing resources and therefore improved
efficiency.
Unfortunately, classification remains a largely un-
solved problem. In order to address this researchers have
been using increasingly sophisticated techniques drawn
from machine learning. This sophistication has led to
challenges in how such techniques are configured and
how they should be validated. Incomplete or inappro-
priate validation can result in unintentionally misleading
results and over-optimism on the part of the researchers.
For this reason we propose a new and more general
framework within which to conduct such validations.
To reiterate a comment made in an earlier paper by one
of the authors [MS] and also quoted by Lessmann et al.
[24] “we need to develop more reliable research proce-
dures before we can have confidence in the conclusion
of comparative studies of software prediction models”
[25]. Thus we stress that the aim of this paper is to
consider how we evaluate different processes for finding
classification models, not any particular model itself. We
consider it most unlikely any useful, universal model
exists.
Much of this research activity has followed the path
of using software metrics extracted from the code as
candidate factors to reveal whether a software com-
ponent is defect-prone or not. To accomplish this a
variety of machine learning algorithms have been used
to inductively find patterns or rules within the data to
classify software components as either defect-prone or
not. Examples include [13], [26], [27], [28], [20], [23],
and [24]. In addition Wagner [29] and Runeson et al.
[30] provide useful overviews in the form of systematic
literature reviews.
In order to motivate the need for more systematic
and unbiased methods for comparing the performance
of machine learning based defect prediction we focus
on a recent paper published in this journal by Menzies,
Greenwald and Frank [23]. For brevity we will refer to
this as the MGF paper. We choose MGF for three reasons.
First, because it has been widely cited (see footnote 1) and is
therefore influential. Second, because the approach might be
regarded as state of the art for this kind of research.
Third, because the MGF analysis is based upon datasets
in the public domain thus we are able to replicate the
work. We should stress we are not singling this work
out for being particularly outrageous. Instead we wish
to respond to their challenge “that numerous researchers
repeat our experiments and discover learning methods
that are superior to the one proposed here” (MGF).
In the study, publicly available datasets from different
organizations are used. This allows us to explore the
impact of data from different sources on different pro-
cesses for finding appropriate classification models apart
from evaluating these processes in a fair and reasonable
way. Additionally, 12 learning schemes (see footnote 2) resulting from
two data preprocessors, two feature selectors, and three
classification algorithms are designed to assess the effects
of different elements of a learning scheme on defect
prediction. Although balance is an uncommon measure in
classification, the results of MGF were reported with it,
so it is still used here, whilst AUC, a more general measure
of predictive power, is employed in the paper as well.
This paper makes the following contributions: (i) a
new and more general software defect-proneness pre-
diction framework within which appropriate validations
can be conducted is proposed; (ii) the impacts of different
elements of a learning scheme on the evaluation and
prediction are explored, and we conclude that a learning
scheme should be evaluated holistically and that no learning
scheme dominates; consequently, the evaluation and
decision process is important; and (iii) the potential
bias and misleading results of the MGF framework are
explained and confirmed, and we demonstrate that the
performance of the MGF framework varies greatly
with data from different organizations.
1. Google scholar (accessed February 6, 2010) indicates an impressive
132 citations to MGF [23] within the space of three years.
2. Please see Section 2 for the details of a learning scheme.
The remainder of the paper is organized as follows.
Section 2 provides some further background on the
current state of the art for learning software defect pre-
diction systems with particular reference to MGF. Section
3 describes our framework in detail and analyzes differ-
ences between our approach and that of MGF. Section
4 is devoted to the extensive experiments to compare
our framework and that of MGF and to evaluate the
performance of the proposed framework. Conclusions
and consideration of the significance of this work are
given in the final section.
2 RELATED WORK
MGF [23] published a study in this journal in 2007 in
which they compared the performance of two machine
learning techniques (Rule Induction and Naïve Bayes) to
predict software components containing defects. To do
this they use the NASA MDP repository which at the
time of their research contained 10 separate data sets.
Traditionally many researchers have explored issues
like the relative merits of McCabe’s cyclomatic com-
plexity, Halstead’s software science measures and lines
of code counts for building defect predictors. However,
MGF claim that “such debates are irrelevant since how
the attributes are used to build predictors is much more
important than which particular attributes are used” and
“the choice of learning method is far more important
than which subset of the available data is used for
learning”. Their analysis found that a Naïve Bayes clas-
sifier, after log-filtering and attribute selection based on
InfoGain had a mean probability of detection of 71% and
mean false alarm rate of 25%. This significantly out-
performed the rule induction methods of J48 and OneR
(due to Quinlan [31]).
We argue that although how is more important than
which (see footnote 3), the choice of which attribute subset is used for
learning is not only circumscribed by the attribute subset
itself and available data, but also by attribute selectors,
learning algorithms and data preprocessors. It is well
known that there is an intrinsic relationship between
a learning method and an attribute selection method.
For example, Hall and Holmes [32] concluded that the
forward selection search was well suited to Naïve Bayes
but the backward elimination search is more suitable for
C4.5. Cardie [33] found using a decision tree to select at-
tributes helped the nearest neighbor algorithm to reduce
its prediction error. Kubat et al. [34] used a decision tree
3. That is, which attribute subset is more useful for defect prediction
not only depends on the attribute subset itself but also on the specific
data set.
filtering attributes for use with a Naïve Bayesian classi-
fier and obtained a similar result. However, Kibler and
Aha [35] reported more mixed results on two medical
classification tasks. Therefore, before building prediction
models, we should choose the combination of all three
of learning algorithm, data pre-processing and attribute
selection method, not merely one or two of them.
Lessmann et al. [24] have also conducted a follow-
up to MGF on defect predictions, providing additional
results as well as suggestions for a methodological
framework. However, they did not perform attribute
selection when building prediction models. Thus our
work has wider application.
We also argue that MGF’s attribute selection approach
is problematic and yielded a bias in the evaluation
results, despite the use of an M×N-way cross-validation
method. One reason is that they ranked attributes on
the entire data set including both the training and test
data, though the class labels of the test data should have
been unknown to the predictor. That is, they violated the
intention of the holdout strategy. The potential result is
that they overestimate the performance of their learning
model and thereby report a potentially misleading result.
Moreover, after ranking attributes, they evaluated each
individual attribute separately and chose those n features
with the highest scores. Unfortunately, this strategy can-
not consider features with complementary information,
and does not account for attribute dependence. It is
also incapable of removing redundant features because
redundant features are likely to have similar rankings.
As long as features are deemed relevant to the class,
they will all be selected even though many of them are
highly correlated to each other.
These seemingly minor issues motivate the develop-
ment of our general-purpose defect prediction frame-
work described in this paper. However, we will show
the large impact they can have and how researchers
may be completely misled. Our proposed framework
consists of two parts: scheme evaluation and defect
prediction. The scheme evaluation focuses on evaluating
the performance of a learning scheme, whilst the defect
prediction focuses on building a final predictor using
historical data according to the learning scheme and after
which the predictor is used to predict the defect-prone
components of a new (or unseen) software system.
A learning scheme comprises:
1) a data preprocessor,
2) an attribute selector,
3) a learning algorithm.
So to summarize, the main difference between our
framework and that of MGF lies in: (i) we choose the
entire learning scheme, not just one out of the learning
algorithm, attribute selector or data pre-processor; (ii) we
use the appropriate data to evaluate the performance of
a scheme. That is, we build a predictive model according
to a scheme with only ‘historical’ data and validate the
model on the independent ‘new’ data. We go on to
demonstrate why this has very practical implications.
3 PROPOSED SOFTWARE DEFECT PREDIC-
TION FRAMEWORK
3.1 Overview of the framework
Generally, before building defect prediction model(s)
and using them for prediction purposes, we first need
to decide which learning scheme should be used to
construct the model. Thus the predictive performance of
the learning scheme(s) should be determined, especially
for future data. However, this step is often neglected
and so the resultant prediction model may not be trust-
worthy. Consequently we propose a new software defect
prediction framework that provides guidance to address
these potential shortcomings. The framework consists of
two components: (i) scheme evaluation and (ii) defect
prediction. Fig. 1 contains the details.
Fig. 1. Proposed software defect prediction framework
At the scheme evaluation stage, the performances of
the different learning schemes are evaluated with histor-
ical data to determine whether a certain learning scheme
performs sufficiently well for prediction purposes or to
select the best from a set of competing schemes.
From Fig. 1 we can see that the historical data are
divided into two parts: a training set for building learn-
ers with the given learning schemes, and a test set for
evaluating the performances of the learners. It is very
important that the test data are not used in any way
to build the learners. This is a necessary condition to
assess the generalization ability of a learner that is built
according to a learning scheme, and further to determine
whether or not to apply the learning scheme, or select
one best scheme from the given schemes.
At the defect prediction stage, according to the per-
formance report of the first stage, a learning scheme
is selected and used to build a prediction model and
predict software defect. From Fig. 1 we observe that all
the historical data are used to build the predictor here.
This is very different from the first stage; it is very useful
for improving the generalization ability of the predictor.
After the predictor is built, it can be used to predict the
defect-proneness of new software components.
MGF proposed a baseline experiment and reported the
performance of the Naïve Bayes data miner with log-
filtering as well as attribute selection, which performed
the scheme evaluation but with inappropriate data. This
is because they used both the training (which can be
viewed as historical data) and test (which can be viewed
as new data) data to rank attributes, while the labels of
the new data are unavailable when choosing attributes
in practice.
3.2 Scheme evaluation
The scheme evaluation is a fundamental part of the
software defect prediction framework. At this stage,
different learning schemes are evaluated by building and
evaluating learners with them. Fig. 2 contains the details.
Fig. 2. Scheme evaluation of the proposed framework.
The first problem of scheme evaluation is how to
divide historical data into training and test data. As
mentioned above, the test data should be independent
of the learner construction. This is a necessary pre-
condition to evaluate the performance of a learner for
new data. Cross-validation is usually used to estimate
how accurately a predictive model will perform in prac-
tice. One round of cross-validation involves partitioning
a data set into complementary subsets, performing the
analysis on one subset, and validating the analysis on
the other subset. To reduce variability, multiple rounds of
cross-validation are performed using different partitions,
and the validation results are averaged over the rounds.
In our framework, an M×N-way cross-validation is used
for estimating the performance of each predictive model,
that is, each data set is first divided into N bins, and
after that a predictor is learned on (N-1) bins, and then
tested on the remaining bin. This is repeated for the N
folds so that each bin is used for training and testing
while minimizing the sampling bias. To overcome any
ordering effect and to achieve reliable statistics, each
holdout experiment is also repeated M times and in each
repetition the data sets are randomized. So overall, M×N
models are built in all during the evaluation period, and
thus M×N results are obtained on each data set for the
performance of each learning scheme.
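
A minimal sketch of generating these M×N train/test splits, assuming scikit-learn (our choice of tooling; the original study used its own scripts):

import numpy as np
from sklearn.model_selection import RepeatedKFold

M, N = 10, 10                          # M repetitions of N-fold cross-validation
X = np.arange(200).reshape(-1, 1)      # placeholder: 200 software modules

splitter = RepeatedKFold(n_splits=N, n_repeats=M, random_state=0)
splits = list(splitter.split(X))
assert len(splits) == M * N            # M x N learners are built in total
train_idx, test_idx = splits[0]
assert set(train_idx).isdisjoint(test_idx)   # the test bin never leaks into training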
After the training-test split is done in each round,
both the training data and learning scheme(s) are used to
build a learner. A learning scheme consists of a data pre-
processing method, an attribute selection method, and
a learning algorithm. The detailed learner construction
procedure is as follows:
1) Data preprocessing
This is an important part of building a practical
learner. In this step, the training data are prepro-
cessed, such as removing outliers, handling miss-
ing values, discretizing or transforming numeric
attributes. In our experiment, we use a log-filtering
preprocessor, which replaces all numerics n with
their logarithms ln(n), as was done in MGF.
2) Attribute selection
The data sets may not have originally been in-
tended for defect prediction, thus even if all the
attributes are useful for its original task, not all may
be helpful for defect prediction. Therefore attribute
selection has to be performed on the training data.
Attribute selection methods can be categorized as
either filters or wrappers [36]. It should be noted
that both ‘filter’ and ‘wrapper’ methods only op-
erate on the training data. A ‘filter’ uses general
characteristics of the data to evaluate attributes
and operates independently of any learning algo-
rithm. In contrast, a ‘wrapper’ method exists as a
wrapper around the learning algorithm, searching
for a good subset using the learning algorithm
itself as part of the function evaluating attribute
subsets. Wrappers generally give better results than
filters but are more computationally intensive. In
our proposed framework, the ‘wrapper’ attribute
selection method is employed. To make the most use
of the data, we use an M×N-way cross-validation
to evaluate the performance of different attribute
subsets.
3) Learner construction
Once attribute selection is finished, the prepro-
cessed training data are reduced to the best attribute
subset. Then the reduced training data and the
learning algorithm are used to build the learner.
Before the learner is tested, the original test data are
preprocessed in the same way and the dimension-
ality is reduced to the same best subset of attributes.
After comparing the predicted value and the actual
value of the test data, the performance of one pass
of validation is obtained. As mentioned previously,
the final ‘evaluation’ performance can be obtained
as the mean and variance values across the M×N
passes of such validation.
The detailed scheme evaluation process is described
with pseudocode in the following Procedure Evaluation
which consists of Function Learning and Function AttrSe-
lect. The Function Learning is used to build a learner with
a given learning scheme, and the Function AttrSelect
performs attribute selection with a learning algorithm.
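
The pseudocode itself is not reproduced in this excerpt. As a rough stand-in, the following Python sketch shows one pass of learner construction under an NB+Log+FS-style scheme using scikit-learn; the function name and tooling are our assumptions, and attribute values are assumed to be strictly positive so that the log-filter applies:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

def evaluate_one_pass(X_train, y_train, X_test, y_test):
    # 1) Data preprocessing: log-filter the training data only.
    Xtr = np.log(X_train)
    # 2) Attribute selection: a 'wrapper' around the learning algorithm itself,
    #    run exclusively on the (preprocessed) training data.
    selector = SequentialFeatureSelector(
        GaussianNB(), direction="forward", cv=5, scoring="roc_auc"
    ).fit(Xtr, y_train)
    # 3) Learner construction on the reduced training data.
    learner = GaussianNB().fit(selector.transform(Xtr), y_train)
    # The test data are preprocessed in the same way and reduced to the same
    # attribute subset; they play no part in building the learner.
    Xte = selector.transform(np.log(X_test))
    return roc_auc_score(y_test, learner.predict_proba(Xte)[:, 1])

Averaging this score over the M×N splits gives the evaluation performance of the scheme as a whole.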
3.3 Defect prediction
The defect prediction part of our framework is straight-
forward: it consists of predictor construction and defect
prediction.
During the period of the predictor construction,
1) A learning scheme is chosen according to the Per-
formance Report.
2) A predictor is built with the selected learning
scheme and the whole historical data. While eval-
uating a learning scheme, a learner is built with
the training data and tested on the test data. Its
final performance is the mean over all rounds. This
reveals that the evaluation indeed covers all the
data. However, a single round of cross-validation
uses only one part of the data. Therefore, as we
use all the historical data to build the predictor,
it is expected that the constructed predictor has
stronger generalization ability.
3) After the predictor is built, new data are prepro-
cessed in the same way as the historical data; then the
constructed predictor can be used to predict software
defects with the preprocessed new data.
The detailed defect prediction process is described
with pseudocode in the following Procedure Prediction.
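
Again the pseudocode is not reproduced here; a hedged sketch of the prediction stage under the same assumed NB+Log+FS-style scheme (function names are ours) might look as follows:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

def build_predictor(X_hist, y_hist):
    # The selected learning scheme is applied to ALL historical data,
    # with no train/test split at this stage.
    Xh = np.log(X_hist)
    selector = SequentialFeatureSelector(
        GaussianNB(), direction="forward", cv=5, scoring="roc_auc"
    ).fit(Xh, y_hist)
    learner = GaussianNB().fit(selector.transform(Xh), y_hist)
    return selector, learner

def predict_defect_proneness(selector, learner, X_new):
    # New data are preprocessed in exactly the same way as the historical data
    # before being fed to the constructed predictor.
    return learner.predict(selector.transform(np.log(X_new)))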
3.4 Difference between our proposed framework
and MGF
Although both MGF’s study and ours involve an
M×N-way cross-validation, there is, however, a signifi-
cant difference. In their study, for each data set, the
attributes were ranked by InfoGain which was calculated
on the whole data set, then the M×N-way validation was
wrapped inside scripts that explored different subsets of
attributes in the order suggested by InfoGain. In our
study, there is an M×N-way cross-validation for perfor-
mance estimation of the learner with attribute selection,
which lies outside the attribute selection procedure. We
only performed attribute selection on the training data.
When a ‘wrapper’ selection method is performed, an-
other cross-validation can be performed to evaluate the
performance of different attribute subsets. This should
be performed only on the training data.
To recap, the essential problem in MGF’s study is that
the test data were used for attribute selection, which
actually violated the intention of the holdout strategy. In
their study, the M×N-way cross-validation actually imple-
mented a holdout strategy to just select the ‘best’ subset
among the subsets recommended by InfoGain for each
data set. However, as the “test data” are unknown at
that point in time, the result obtained in that way
potentially overfits the current data set itself and cannot
be used to assess the future performance of the learner
built with such ‘best’ subset.
Our framework focuses on the attribute selection
method itself instead of a certain ‘best’ subset, as dif-
ferent training data may produce different best sub-
sets. We treat the attribute selection method as a part
of the learning scheme. The ‘inner’ cross-validation is
performed on the training data, which actually selects
the ‘best’ attribute set on the training data with the
basic learning algorithm. After that, the ‘outer’ cross-
validation assesses how well the learner built with such
‘best’ attributes performs on the test data, which is really
new to the learner. Thus our framework can properly
assess the future performance of the learning scheme as
a whole.
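
To make the ‘inner’/‘outer’ distinction concrete, the sketch below nests the wrapper selection inside a pipeline so that its inner cross-validation only ever sees the outer training folds (our reconstruction with scikit-learn on placeholder data, not the paper's code):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, RepeatedKFold

scheme = Pipeline([
    ("log", FunctionTransformer(np.log)),                     # data preprocessor
    ("select", SequentialFeatureSelector(GaussianNB(),        # wrapper attribute selector;
                                         direction="forward", # the 'inner' CV runs in here,
                                         cv=5)),              # on training data only
    ("learn", GaussianNB()),                                  # learning algorithm
])

rng = np.random.RandomState(1)
X = rng.rand(200, 20) + 1e-6          # placeholder metric data (strictly positive)
y = rng.randint(0, 2, 200)            # placeholder defect labels

# The 'outer' cross-validation assesses the whole scheme on genuinely unseen folds.
outer = RepeatedKFold(n_splits=10, n_repeats=2, random_state=1)  # repeats reduced for speed
scores = cross_val_score(scheme, X, y, cv=outer, scoring="roc_auc")
print(scores.mean(), scores.std())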
4 EMPIRICAL STUDY
4.1 Data sets
We used the data taken from the public NASA MDP
repository [37], which was also used by MGF and many
others, e.g. [24], [38], [39], and [22]. In addition, the AR
data from the PROMISE repository (see footnote 4) was also used. Thus
there are 17 data sets in total, 13 from NASA and the
remaining 4 from the PROMISE repository.
Table 1 provides some basic summary information.
Each data set comprises a number of software modules
(cases), each containing the corresponding number of
defects and various software static code attributes. After
preprocessing, modules that contain one or more defects
were labeled as defective. Besides LOC counts, the data
sets include Halstead attributes as well as McCabe com-
plexity measures (see footnote 5). A more detailed description of code
attributes or the origin of the MDP data sets can be
obtained from [23].
4.2 Performance measures
The receiver operating characteristic (ROC) curve is often
used to evaluate the performance of binary predictors.
A typical ROC curve is shown in Fig. 3. The y-axis
shows probability of detection (pd) and the x-axis shows
probability of false alarms (pf ).
Formal definitions for pd and pf are given in Equations
1 and 2 respectively. Obviously higher pds and lower
pfs are desired. The point (pf = 0, pd = 1) is the ideal
4. http://promise.site.uottowa.ca/SERepository
5. Whilst there is some disquiet concerning the value of such code
metrics recall that the purpose of this paper is to examine frameworks
for learning defect classifiers and not to find the ‘best’ classifier per
se. Moreover, since attribute selection is part of the framework we can
reasonably expect irrelevant or redundant attributes to be eliminated
from any final classifier.
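
The excerpt breaks off above before Equations 1 and 2 are shown. For completeness, the following sketch computes pd, pf, and balance from a confusion matrix using the standard definitions (the balance formula is our reconstruction of the measure as used by MGF):

import math

def pd_pf_balance(tp, fn, fp, tn):
    pd = tp / (tp + fn)          # probability of detection: defective modules caught
    pf = fp / (fp + tn)          # probability of false alarm: clean modules flagged
    # balance: normalized Euclidean distance from the ideal ROC point (pf=0, pd=1).
    balance = 1 - math.sqrt(pf ** 2 + (1 - pd) ** 2) / math.sqrt(2)
    return pd, pf, balance

# Hypothetical counts: 71 of 100 defective modules detected, 25 of 100 clean ones flagged.
print(pd_pf_balance(tp=71, fn=29, fp=25, tn=75))   # approx. (0.71, 0.25, 0.73)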

Citations
Journal ArticleDOI
TL;DR: Although there are a set of fault prediction studies in which confidence is possible, more studies are needed that use a reliable methodology and which report their context, methodology, and performance comprehensively.
Abstract: Background: The accurate prediction of where faults are likely to occur in code can help direct test effort, reduce costs, and improve the quality of software. Objective: We investigate how the context of models, the independent variables used, and the modeling techniques applied influence the performance of fault prediction models. Method: We used a systematic literature review to identify 208 fault prediction studies published from January 2000 to December 2010. We synthesize the quantitative and qualitative results of 36 studies which report sufficient contextual and methodological information according to the criteria we develop and apply. Results: The models that perform well tend to be based on simple modeling techniques such as Naive Bayes or Logistic Regression. Combinations of independent variables have been used by models that perform well. Feature selection has been applied to these combinations when models are performing particularly well. Conclusion: The methodology used to build models seems to be influential to predictive performance. Although there are a set of fault prediction studies in which confidence is possible, more studies are needed that use a reliable methodology and which report their context, methodology, and performance comprehensively.

1,012 citations


Cites background from "A General Software Defect-Proneness..."

  • ...This is an important finding as it suggests that a relatively high number of papers reporting fault prediction are not really doing any prediction (this finding is also reported by [6])....


  • ...trained and tested on different data [6]....


Journal ArticleDOI
01 Feb 2015
TL;DR: The machine learning techniques have the ability for predicting software fault proneness and can be used by software practitioners and researchers, however, the application of theMachine learning techniques in software fault prediction is still limited and more number of studies should be carried out in order to obtain well formed and generalizable results.
Abstract: Reviews studies from 1991-2013 to assess application of ML techniques for SFP.Identifies seven categories of the ML techniques.Identifies 64 studies to answer the established research questions.Selects primary studies according to the quality assessment of the studies.Systematic literature review performs the following:Summarize ML techniques for SFP models.Assess performance accuracy and capability of ML techniques for constructing SFP models.Provide comparison between the ML and statistical techniques.Provide comparison of performance accuracy of different ML techniques.Summarize the strength and weakness of the ML techniques.Provides future guidelines to software practitioners and researchers. BackgroundSoftware fault prediction is the process of developing models that can be used by the software practitioners in the early phases of software development life cycle for detecting faulty constructs such as modules or classes. There are various machine learning techniques used in the past for predicting faults. MethodIn this study we perform a systematic review of studies from January 1991 to October 2013 in the literature that use the machine learning techniques for software fault prediction. We assess the performance capability of the machine learning techniques in existing research for software fault prediction. We also compare the performance of the machine learning techniques with the statistical techniques and other machine learning techniques. Further the strengths and weaknesses of machine learning techniques are summarized. ResultsIn this paper we have identified 64 primary studies and seven categories of the machine learning techniques. The results prove the prediction capability of the machine learning techniques for classifying module/class as fault prone or not fault prone. The models using the machine learning techniques for estimating software fault proneness outperform the traditional statistical models. ConclusionBased on the results obtained from the systematic review, we conclude that the machine learning techniques have the ability for predicting software fault proneness and can be used by software practitioners and researchers. However, the application of the machine learning techniques in software fault prediction is still limited and more number of studies should be carried out in order to obtain well formed and generalizable results. We provide future guidelines to practitioners and researchers based on the results obtained in this work.

483 citations


Cites background from "A General Software Defect-Proneness..."

  • ...Hence, total number of 122 studies [1, 2, 20-139] were identified for further processing and analysis....


  • ...SF1 Sherer 1995 [20] SF33 Singh 2009b [51] SF2 Guo 2003 [21] SF34 Tosun 2009 [52] SF3 Guo 2004 [22] SF35 Zimmermann 2009 [53] SF4 Gyimothy 2005 [23] SF36 Afzal 2010 [54] SF5 Koru 2005 [24] SF37 Arisholm 2010 [55] SF6 Zhou 2006 [25] SF38 Carvalho 2010 [56] SF7 Arisholm 2007 [26] SF39 Liu 2010 [57] SF8 Catal 2007 [27] SF40 Malhotra 2010 [58] SF9 Jiang 2007 [28] SF41 Ostrand 2010 [59] SF10 Kanmani 2007 [29] SF42 Pendharkar 2010 [60] SF11 Li 2007 [30] SF43 Seliya 2010 [1] SF12 Menzies 2007 [31] SF44 Singh 2010 [61] SF13 Ma 2007 [32] SF45 Zhou 2010 [62] SF14 Pai 2007 [33] SF46 Azar 2011 [63] SF15 Turhan 2007 [34] SF47 Diri 2011 [64] SF16 Turhan 2007a [35] SF48 Malhotra 2011 [65] SF17 Carvalho 2008 [36] SF49 Martino 2011 [66] SF18 Elish 2008 [37] SF50 Mishra 2011 [67] SF19 Gondra 2008 [38] SF51 Misirh 2011 [68] SF20 Jiang 2008 [39] SF52 Ricca 2011 [69] SF21 Kaur 2008 [40] SF53 Rodriguez 2011 [70] SF22 Kim 2008 [41] SF54 Song 2011 [71] SF23 Lessmann 2008 [42] SF55 Twala 2011 [72] SF24 Menzies 2008 [43] SF56 Chen 2012 [73] SF25 Moser 2008 [44] SF57 Malhotra 2012 [74] SF26 Turhan 2008 [45] SF58 Okutan 2012 [75] SF27 Vandecruys 2008 [46] SF59 Yang 2012 [76] SF28 Bener 2009 [47] SF60 Yu 2012 [77] SF29 Catal 2009 [48] SF61 Zhou 2012 [78] SF30 Menzies 2009 [49] SF62 Cahill 2013 [79] SF31 Singh 2009 [50] SF63 Chen 2013 [80] SF32 Singh 2009a [2] SF64 Dejaeger 2013 [81] Table 4: Selected Primary Studies...


Journal ArticleDOI
TL;DR: This paper investigates different types of class imbalance learning methods, including resampling techniques, threshold moving, and ensemble algorithms, and concludes that AdaBoost.NC shows the best overall performance in terms of the measures including balance, G-mean, and Area Under the Curve (AUC).
Abstract: To facilitate software testing, and save testing costs, a wide range of machine learning methods have been studied to predict defects in software modules. Unfortunately, the imbalanced nature of this type of data increases the learning difficulty of such a task. Class imbalance learning specializes in tackling classification problems with imbalanced distributions, which could be helpful for defect prediction, but has not been investigated in depth so far. In this paper, we study the issue of if and how class imbalance learning methods can benefit software defect prediction with the aim of finding better solutions. We investigate different types of class imbalance learning methods, including resampling techniques, threshold moving, and ensemble algorithms. Among those methods we studied, AdaBoost.NC shows the best overall performance in terms of the measures including balance, G-mean, and Area Under the Curve (AUC). To further improve the performance of the algorithm, and facilitate its use in software defect prediction, we propose a dynamic version of AdaBoost.NC, which adjusts its parameter automatically during training. Without the need to pre-define any parameters, it is shown to be more effective and efficient than the original AdaBoost.NC.

457 citations


Cites background or methods from "A General Software Defect-Proneness..."

  • ...For example, the log filter was shown to improve the performance of Naive Bayes significantly, but contributed very little to decision trees [37]....


  • ...claimed that a high-PD predictor is still useful in practice, even if the other measures may not be good enough [37], [47]....


Journal ArticleDOI
TL;DR: The extent to which published analyses based on the NASA defect datasets are meaningful and comparable is investigated and it is recommended that researchers indicate the provenance of the datasets they use and invest effort in understanding the data prior to applying machine learners.
Abstract: Background--Self-evidently empirical analyses rely upon the quality of their data. Likewise, replications rely upon accurate reporting and using the same rather than similar versions of datasets. In recent years, there has been much interest in using machine learners to classify software modules into defect-prone and not defect-prone categories. The publicly available NASA datasets have been extensively used as part of this research. Objective--This short note investigates the extent to which published analyses based on the NASA defect datasets are meaningful and comparable. Method--We analyze the five studies published in the IEEE Transactions on Software Engineering since 2007 that have utilized these datasets and compare the two versions of the datasets currently in use. Results--We find important differences between the two versions of the datasets, implausible values in one dataset and generally insufficient detail documented on dataset preprocessing. Conclusions--It is recommended that researchers 1) indicate the provenance of the datasets they use, 2) report any preprocessing in sufficient detail to enable meaningful replication, and 3) invest effort in understanding the data prior to applying machine learners.

444 citations

Journal ArticleDOI
TL;DR: It is found that single-repetition holdout validation tends to produce estimates with 46-229 percent more bias and 53-863 percent more variance than the top-ranked model validation techniques, and out-of-sample bootstrap validation yields the best balance between the bias and variance.
Abstract: Defect prediction models help software quality assurance teams to allocate their limited resources to the most defect-prone modules. Model validation techniques, such as $k$ -fold cross-validation, use historical data to estimate how well a model will perform in the future. However, little is known about how accurate the estimates of model validation techniques tend to be. In this paper, we investigate the bias and variance of model validation techniques in the domain of defect prediction. Analysis of 101 public defect datasets suggests that 77 percent of them are highly susceptible to producing unstable results– - selecting an appropriate model validation technique is a critical experimental design choice. Based on an analysis of 256 studies in the defect prediction literature, we select the 12 most commonly adopted model validation techniques for evaluation. Through a case study of 18 systems, we find that single-repetition holdout validation tends to produce estimates with 46-229 percent more bias and 53-863 percent more variance than the top-ranked model validation techniques. On the other hand, out-of-sample bootstrap validation yields the best balance between the bias and variance of estimates in the context of our study. Therefore, we recommend that future defect prediction studies avoid single-repetition holdout validation, and instead, use out-of-sample bootstrap validation.

414 citations


Cites background from "A General Software Defect-Proneness..."

  • ...…validation technique is the number of Events Per Variable [2], [6], [82], [99], i.e., the ratio of the number of occurrences of the least frequently occurring class of the dependent variable (i.e., the events) to the number of independent variables used to train the model (i.e., the variables)....


References
Book
15 Oct 1992
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and over hitting.
Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use , the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and over hitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.

21,674 citations

Journal ArticleDOI
TL;DR: The wrapper method searches for an optimal feature subset tailored to a particular algorithm and a domain and compares the wrapper approach to induction without feature subset selection and to Relief, a filter approach tofeature subset selection.

8,610 citations

01 Jan 1994
TL;DR: In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments, which will be a welcome addition to the library of many researchers and students.
Abstract: Algorithms for constructing decision trees are among the most well known and widely used of all machine learning methods. Among decision tree algorithms, J. Ross Quinlan's ID3 and its successor, C4.5, are probably the most popular in the machine learning community. These algorithms and variations on them have been the subject of numerous research papers since Quinlan introduced ID3. Until recently, most researchers looking for an introduction to decision trees turned to Quinlan's seminal 1986 Machine Learning journal article [Quinlan, 1986]. In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments. As such, this book will be a welcome addition to the library of many researchers and students.

8,046 citations


"A General Software Defect-Proneness..." refers methods in this paper

  • ...This significantly outperformed the rule induction methods of J48 and OneR (due to Quinlan [31])....


Journal ArticleDOI
TL;DR: Several of Chidamber and Kemerer's OO metrics appear to be useful to predict class fault-proneness during the early phases of the life-cycle and are better predictors than "traditional" code metrics, which can only be collected at a later phase of the software development processes.
Abstract: This paper presents the results of a study in which we empirically investigated the suite of object-oriented (OO) design metrics introduced in (Chidamber and Kemerer, 1994). More specifically, our goal is to assess these metrics as predictors of fault-prone classes and, therefore, determine whether they can be used as early quality indicators. This study is complementary to the work described in (Li and Henry, 1993) where the same suite of metrics had been used to assess frequencies of maintenance changes to classes. To perform our validation accurately, we collected data on the development of eight medium-sized information management systems based on identical requirements. All eight projects were developed using a sequential life cycle model, a well-known OO analysis/design method and the C++ programming language. Based on empirical and quantitative analysis, the advantages and drawbacks of these OO metrics are discussed. Several of Chidamber and Kemerer's OO metrics appear to be useful to predict class fault-proneness during the early phases of the life-cycle. Also, on our data set, they are better predictors than "traditional" code metrics, which can only be collected at a later phase of the software development processes.

1,741 citations


"A General Software Defect-Proneness..." refers background in this paper

  • ...The third type of work classifies software components as defect-prone and non-defect-prone by means of metric-based classification [13], [14], [15], [16], [17], [18],...


01 Apr 1999
TL;DR: This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems and performs more feature selection than ReliefF does—reducing the data dimensionality by fifty percent in most cases.
Abstract: Algorithms for feature selection fall into two broad categories: wrappers that use the learning algorithm itself to evaluate the usefulness of features and filters that evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. The algorithm often outperforms the well-known ReliefF attribute estimator when used as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees. It performs more feature selection than ReliefF does—reducing the data dimensionality by fifty percent in most cases. Also, decision and model trees built from the preprocessed data are often significantly smaller.

1,653 citations


Additional excerpts

  • ...Examples include [13], [26], [27], [ 28 ], [20], [23], and [24]....


Frequently Asked Questions (4)
Q1. What contributions have the authors mentioned in the paper "A general software defect-proneness prediction framework" ?

OBJECTIVE – the authors propose and evaluate a general framework for software defect prediction that supports (i) unbiased and (ii) comprehensive comparison between competing prediction systems. RESULTS – the results show that the authors should choose different learning schemes for different data sets (i.e., no scheme dominates), that small details in how evaluations are conducted can completely reverse findings, and lastly that their proposed framework is more effective and less prone to bias than previous approaches.

12 learning schemes resulting from two data preprocessors, two feature selectors, and three classification algorithms are designed to assess the effects of different elements of a learning scheme on defect prediction.

3) After the predictor is built, new data are preprocessed in the same way as the historical data; then the constructed predictor can be used to predict software defects with the preprocessed new data.

3) The largest absolute value of balance diff in the MGF framework is 25.7% on the AR1 data, for which the corresponding absolute value of balance diff in the proposed framework is just 3.16%.