
A General Software Defect-Proneness Prediction Framework

01 May 2011-IEEE Transactions on Software Engineering (IEEE)-Vol. 37, Iss: 3, pp 356-370
TL;DR: The results show that the proposed framework for software defect prediction is more effective and less prone to bias than previous approaches, and that small details in how evaluations are conducted can completely reverse findings.
Abstract: BACKGROUND - Predicting defect-prone software components is an economically important activity and so has received a good deal of attention. However, making sense of the many, and sometimes seemingly inconsistent, results is difficult. OBJECTIVE - We propose and evaluate a general framework for software defect prediction that supports 1) unbiased and 2) comprehensive comparison between competing prediction systems. METHOD - The framework is comprised of 1) scheme evaluation and 2) defect prediction components. The scheme evaluation analyzes the prediction performance of competing learning schemes for given historical data sets. The defect predictor builds models according to the evaluated learning scheme and predicts software defects with new data according to the constructed model. In order to demonstrate the performance of the proposed framework, we use both simulation and publicly available software defect data sets. RESULTS - The results show that we should choose different learning schemes for different data sets (i.e., no scheme dominates), that small details in how evaluations are conducted can completely reverse findings, and last, that our proposed framework is more effective and less prone to bias than previous approaches. CONCLUSIONS - Failure to properly or fully evaluate a learning scheme can be misleading; however, these problems may be overcome by our proposed framework.

Summary (4 min read)

1 INTRODUCTION

  • Software defect prediction has been an important research topic in the software engineering field for more than 30 years.
  • Current defect prediction work focuses on (i) estimating the number of defects remaining in software systems, (ii) discovering defect associations, and (iii) classifying the defect-proneness of software components, typically into two classes: defect-prone and not defect-prone.
  • Defect association mining can, among other uses, assist managers in improving the software process through analysis of the reasons why some defects frequently occur together.
  • Using publicly available data sets from different organizations allows the authors to explore the impact of data from different sources on different processes for finding appropriate classification models, apart from evaluating these processes in a fair and reasonable way.
  • Google scholar (accessed February 6, 2010) indicates an impressive 132 citations to MGF [23] within the space of three years.

3.1 Overview of the framework

  • Generally, before building defect prediction model(s) and using them for prediction purposes, the authors first need to decide which learning scheme should be used to construct the model.
  • Consequently the authors propose a new software defect prediction framework that provides guidance to address these potential shortcomings.
  • At the scheme evaluation stage, the performances of the different learning schemes are evaluated with historical data to determine whether a certain learning scheme performs sufficiently well for prediction purposes or to select the best from a set of competing schemes.
  • It is very important that the test data are not used in any way to build the learners.
  • From Fig. 1 the authors observe that all the historical data are used to build the predictor here.

3.2 Scheme evaluation

  • The scheme evaluation is a fundamental part of the software defect prediction framework.
  • To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
  • After the training-test split is done in each round, both the training data and the learning scheme(s) are used to build a learner.
  • In their experiment, the authors use a log-filtering preprocessor, which replaces all numerics n with their logarithms ln(n), as in MGF.
  • Therefore attribute selection has to be performed on the training data.

3.3 Defect prediction

  • The defect prediction part of their framework is straightforward: it consists of predictor construction and defect prediction.
  • 2) A predictor is built with the selected learning scheme and the whole historical data.
  • Its final performance is the mean over all rounds.
  • A single round of cross-validation uses only one part of the data.
  • The detailed defect prediction process is described with pseudocode in the following Procedure Prediction.

3.4 Difference between our proposed framework and MGF

  • This should be performed only on the training data.
  • To recap, the essential problem in MGF’s study is that the test data were used for attribute selection, which violated the intention of the holdout strategy.
  • The authors’ framework focuses on the attribute selection method itself instead of a particular ‘best’ subset, as different training data may produce different best subsets.
  • After that, the ‘outer’ cross-validation assesses how well the learner built with such ‘best’ attributes performs on the test data, which is genuinely new to the learner.

4.1 Data sets

  • In addition, the AR data from the PROMISE repository was also used.
  • Table 1 provides some basic summary information.
  • After preprocessing, modules that contain one or more defects were labeled as defective.
  • A more detailed description of code attributes or the origin of the MDP data sets can be obtained from [23].

4.2 Performance measures

  • The receiver operating characteristic (ROC) curve is often used to evaluate the performance of binary predictors.
  • Nevertheless, this doesn’t necessarily mean that all predictors with the same balance value have the same practical usefulness.
  • Thus the ROC curve characterizes the performance of a binary predictor across varying thresholds.
  • AUC is one of the most informative and commonly used measures, so it is used as another performance measure in this paper.
  • Diff = (EvaPerf − PredPerf) / PredPerf × 100% (6), where EvaPerf represents the mean evaluation performance and PredPerf denotes the mean prediction performance; a sketch of this computation follows below.
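
As a concrete illustration of Equation (6), the sketch below (Python; the function name and the example numbers are ours, not the paper's) computes the relative difference between evaluation and prediction performance:

def diff_percent(eva_perf, pred_perf):
    # Diff = (EvaPerf - PredPerf) / PredPerf * 100%, as in Equation (6).
    return (eva_perf - pred_perf) / pred_perf * 100.0

# Hypothetical balance values: 0.72 from evaluation, 0.68 from prediction.
print(diff_percent(0.72, 0.68))  # ~5.88, i.e. the evaluation looks ~5.88% better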

4.3 Experiment Design

  • Two experiments are designed.
  • The first compares their framework with that of MGF; the second demonstrates their framework in practice and explores whether the authors should choose a particular learning scheme or not.

4.3.1 Framework comparison

  • This experiment was used to compare their framework with that of MGF, who reported that a Naïve Bayes data miner with a log-filtering preprocessor achieved a mean (pd, pf) = (71, 25).
  • In their experiment, the authors simulated the whole process of defect prediction to explore whether MGF’s evaluation result is misleading or not.
  • Then an iterative attribute subset selection as used in MGF’s study was performed.
  • Specifically, as described in the scheme evaluation procedure, the authors applied the learning scheme only to the training data, after which the final Naïve Bayes learner was built and the test data were used to evaluate the performance of the learner.
  • Then the predictor was used to predict defects on the new data, which was processed in the same way as the historical data.

4.3.2 Defect prediction with different learning schemes

  • This experiment is intended to demonstrate their framework, to illustrate that different elements of a learning scheme have different impacts on the predictions, and to confirm that the authors should choose the combination of a data preprocessor, an attribute selector, and a learning algorithm, rather than any one of them separately.
  • For this purpose, twelve different learning schemes were designed according to the following data preprocessors, attribute selectors, and learning algorithms (see the sketch after this list).
  • Forward Selection starts with the single best attribute and then tries each of the remaining attributes in conjunction with it to find the best pair of attributes, and so on.
  • For each pass, the authors took 90% of the data as historical data, and the remainder as new data.
  • The authors performed the whole process twice, with balance and AUC respectively.
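
For illustration only, the sketch below enumerates the twelve learning schemes as combinations of the components named in the results (NB, J48, OneR; Log/None; FS/BE); it is our reconstruction in Python, not code from the paper:

from itertools import product

preprocessors = ["None", "Log"]        # raw values vs. log-filtered ln(n)
selectors = ["FS", "BE"]               # forward selection vs. backward elimination
algorithms = ["NB", "J48", "OneR"]     # Naive Bayes, C4.5 decision tree, OneR

# A learning scheme is the combination of all three components (2 x 2 x 3 = 12).
schemes = [f"{a}+{p}+{s}" for p, s, a in product(preprocessors, selectors, algorithms)]
assert len(schemes) == 12
print(schemes)                         # e.g. 'NB+None+FS', 'J48+Log+BE', ...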

4.4.1 Framework comparison

  • The framework comparison results are summarized in Table 2, which shows the results in terms of balance.
  • Thus the authors see what a dramatic impact a seemingly small difference in a validation procedure can have.
  • 3) The largest absolute value of balance diff in the MGF framework is 25.7% on the AR1 data, for which the corresponding absolute value of balance diff in the proposed framework is just 3.16%.
  • Finally, a Wilcoxon signed-rank test of medians yields p = 0.0028 for the one-tailed hypothesis that the absolute balance diff of the new framework is significantly less than that of the MGF framework (see the sketch after this list).
  • On the other hand, the mean prediction performance of the proposed framework is higher than that of MGF.
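
The statistical comparison could be reproduced along the following lines (a sketch assuming SciPy; the per-data-set values below are invented placeholders, the real ones are in the paper's Table 2):

import numpy as np
from scipy.stats import wilcoxon

# Placeholder |balance diff| values (%) for the 17 data sets under each framework.
abs_diff_new = np.array([3.2, 1.5, 2.8, 0.9, 3.1, 2.2, 1.7, 2.5, 0.8,
                         1.9, 2.6, 3.0, 1.1, 2.3, 1.4, 2.0, 3.16])
abs_diff_mgf = np.array([9.5, 7.2, 12.1, 4.4, 15.8, 6.3, 8.9, 10.2, 5.1,
                         7.7, 11.4, 13.0, 6.8, 9.9, 5.6, 8.1, 25.7])

# Paired one-tailed test: is the new framework's |balance diff| smaller than MGF's?
stat, p_value = wilcoxon(abs_diff_new, abs_diff_mgf, alternative="less")
print(stat, p_value)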

4.4.2 Defect prediction with different learning schemes

  • The twelve different learning schemes were evaluated and then used to predict defect-prone modules across the same 17 data sets.
  • This reveals that different attribute selectors can be suitable for different learning algorithms.
  • 5) For both the evaluation and prediction, the AUCs of schemes NB+Log+FS, NB+Log+BE, NB+None+FS and NB+None+BE are much better than those of schemes J48+Log+FS, J48+Log+BE, J48+None+FS and J48+None+BE, respectively.
  • This means Forward Selection is more suitable for Naïve Bayes with the data preprocessor None.
  • Table 11 shows the p-values of the Wilcoxon signed-rank test on learning algorithms J48 vs. OneR over the 17 data sets.

5 CONCLUSION

  • The authors have presented a novel benchmark framework for software defect prediction.
  • In the evaluation stage, different learning schemes are evaluated and the best one is selected.
  • From their experimental results the authors observe that there is a bigger difference between the evaluation performance and the actual prediction performance in MGF’s study than with their framework.
  • Whilst this might seem like some small technicality the impact is profound.
  • When the authors perform statistical significance testing, they obtain dramatically different findings that are highly statistically significant but in opposite directions.


A General Software Defect-Proneness
Prediction Framework
Qinbao Song, Zihan Jia, Martin Shepperd, Shi Ying and Jin Liu
Abstract—BACKGROUND predicting defect-prone software components is an economically important activity and so has received
a good deal of attention. However, making sense of the many, and sometimes seemingly inconsistent, results is difficult.
OBJECTIVE we propose and evaluate a general framework for software defect prediction that supports (i) unbiased and (ii)
comprehensive comparison between competing prediction systems.
METHOD the framework comprises (i) scheme evaluation and (ii) defect prediction components. The scheme evaluation analyzes the
prediction performance of competing learning schemes for given historical data sets. The defect predictor builds models according to
the evaluated learning scheme and predicts software defects with new data according to the constructed model. In order to demonstrate
the performance of the proposed framework, we use both simulation and publicly available software defect data sets.
RESULTS the results show that we should choose different learning schemes for different data sets (i.e. no scheme dominates), that
small details in how evaluations are conducted can completely reverse findings, and lastly that our proposed framework is
more effective, and less prone to bias than previous approaches.
CONCLUSIONS failure to properly or fully evaluate a learning scheme can be misleading, however, these problems may be overcome
by our proposed framework.
Key Words—Software defect prediction, software defect-proneness prediction, machine learning, scheme evaluation.
1 INTRODUCTION
Software defect prediction has been an important re-
search topic in the software engineering field for more
than 30 years. Current defect prediction work focuses
on (i) estimating the number of defects remaining in
software systems, (ii) discovering defect associations,
and (iii) classifying the defect-proneness of software
components, typically into two classes defect-prone and
not defect-prone. This paper is concerned with the third
approach.
The first type of work employs statistical approaches
[1], [2], [3], capture-recapture (CR) models [4], [5], [6], [7],
and detection profile methods (DPM) [8] to estimate the
number of defects remaining in software systems with
inspection data and process quality data. The prediction
result can be used as an important measure for the
software developer [9], and can be used to control the
software process (i.e. decide whether to schedule further
inspections or pass the software artifacts to the next
development step [10]) and gauge the likely delivered
quality of a software system [11].
The second type of work borrows association rule min-
ing algorithms from the data mining community to re-
Q. Song and Z. Jia are with the Department of Computer Science and
Technology, Xi’an Jiaotong University, Xi’an, 710049 China.
E-mail: qbsong@mail.xjtu.edu.cn, jiazh.eden@stu.xjtu.edu.cn.
M. Shepperd is with the School of Information Science, Computing, and
Mathematics, Brunel University, Uxbridge, UB8 3PH UK.
E-mail: martin.shepperd@brunel.ac.uk.
S. Ying and J. Liu are with the State Key Laboratory of Software
Engineering, Wuhan University, Wuhan, 430072 China.
E-mail: yingshi@whu.edu.cn, mailjinliu@yahoo.com.
veal software defect associations [12], which can be used
for three purposes. First to find as many related defects
as possible to the detected defect(s) and consequently
make more effective corrections to the software. This
may be useful as it permits more directed testing and
more effective use of limited testing resources. Second,
to help evaluate reviewers’ results during an inspection.
Thus a recommendation might be that his/her work
should be reinspected for completeness. Third, to assist
managers in improving the software process through
analysis of the reasons why some defects frequently
occur together. If the analysis leads to the identification
of a process problem, managers can devise corrective
action.
The third type of work classifies software compo-
nents as defect-prone and non-defect-prone by means of
metric-based classification [13], [14], [15], [16], [17], [18],
[19], [20], [21], [22], [23], [24]. Being able to predict which
components are more likely to be defect-prone supports
better targeted testing resources and therefore improved
efficiency.
Unfortunately, classification remains a largely un-
solved problem. In order to address this researchers have
been using increasingly sophisticated techniques drawn
from machine learning. This sophistication has led to
challenges in how such techniques are configured and
how they should be validated. Incomplete or inappro-
priate validation can result in unintentionally misleading
results and over-optimism on the part of the researchers.
For this reason we propose a new and more general
framework within which to conduct such validations.
To reiterate a comment made in an earlier paper by one
of the authors [MS] and also quoted by Lessmann et al.
[24] “we need to develop more reliable research proce-
dures before we can have confidence in the conclusion
of comparative studies of software prediction models”
[25]. Thus we stress that the aim of this paper is to
consider how we evaluate different processes for finding
classification models, not any particular model itself. We
consider it most unlikely any useful, universal model
exists.
Much of this research activity has followed the path
of using software metrics extracted from the code as
candidate factors to reveal whether a software com-
ponent is defect-prone or not. To accomplish this a
variety of machine learning algorithms have been used
to inductively find patterns or rules within the data to
classify software components as either defect-prone or
not. Examples include [13], [26], [27], [28], [20], [23],
and [24]. In addition Wagner [29] and Runeson et al.
[30] provide useful overviews in the form of systematic
literature reviews.
In order to motivate the need for more systematic
and unbiased methods for comparing the performance
of machine learning based defect prediction we focus
on a recent paper published in this journal by Menzies,
Greenwald and Frank [23]. For brevity we will refer to
this as the MGF paper. We choose MGF for three reasons.
First, because it has been widely cited (see footnote 1) and is
therefore influential. Second, because the approach might be
regarded as state of the art for this kind of research.
Third, because the MGF analysis is based upon datasets
in the public domain thus we are able to replicate the
work. We should stress we are not singling this work
out for being particularly outrageous. Instead we wish
to respond to their challenge “that numerous researchers
repeat our experiments and discover learning methods
that are superior to the one proposed here” (MGF).
In the study, publicly available datasets from different
organizations are used. This allows us to explore the
impact of data from different sources on different pro-
cesses for finding appropriate classification models apart
from evaluating these processes in a fair and reasonable
way. Additionally, 12 learning schemes (see footnote 2) resulting from
two data preprocessors, two feature selectors, and three
classification algorithms are designed to assess the effects
of different elements of a learning scheme on defect
prediction. Although balance is an uncommon measure in
classification, the results of MGF were reported with it,
so it is still used here, whilst AUC, a more general measure
of predictive power, is employed in the paper as well.
This paper makes the following contributions: (i) a
new and more general software defect-proneness pre-
diction framework within which appropriate validations
can be conducted is proposed; (ii) the impacts of different
elements of a learning scheme on the evaluation and
prediction are explored, and we conclude that a learning
scheme should be evaluated holistically and that no learning
scheme dominates; consequently, the evaluation and
decision process is important; and (iii) the potential
bias and misleading results of the MGF framework are
explained and confirmed, and we demonstrate that the
performance of the MGF framework varies greatly
with data from different organizations.
1. Google scholar (accessed February 6, 2010) indicates an impressive
132 citations to MGF [23] within the space of three years.
2. Please see Section 2 for the details of a learning scheme.
The remainder of the paper is organized as follows.
Section 2 provides some further background on the
current state of the art for learning software defect pre-
diction systems with particular reference to MGF. Section
3 describes our framework in detail and analyzes differ-
ences between our approach and that of MGF. Section
4 is devoted to the extensive experiments to compare
our framework and that of MGF and to evaluate the
performance of the proposed framework. Conclusions
and consideration of the significance of this work are
given in the final section.
2 RELATED WORK
MGF [23] published a study in this journal in 2007 in
which they compared the performance of two machine
learning techniques (Rule Induction and Naïve Bayes) to
predict software components containing defects. To do
this they use the NASA MDP repository which at the
time of their research contained 10 separate data sets.
Traditionally many researchers have explored issues
like the relative merits of McCabe’s cyclomatic com-
plexity, Halstead’s software science measures and lines
of code counts for building defect predictors. However,
MGF claim that “such debates are irrelevant since how
the attributes are used to build predictors is much more
important than which particular attributes are used” and
“the choice of learning method is far more important
than which subset of the available data is used for
learning”. Their analysis found that a Naïve Bayes clas-
sifier, after log-filtering and attribute selection based on
InfoGain had a mean probability of detection of 71% and
mean false alarm rate of 25%. This significantly out-
performed the rule induction methods of J48 and OneR
(due to Quinlan [31]).
We argue that although how is more important than
which (see footnote 3), the choice of which attribute subset is used for
learning is not only circumscribed by the attribute subset
itself and available data, but also by attribute selectors,
learning algorithms and data preprocessors. It is well
known that there is an intrinsic relationship between
a learning method and an attribute selection method.
For example, Hall and Holmes [32] concluded that the
forward selection search was well suited to Naïve Bayes
but the backward elimination search is more suitable for
C4.5. Cardie [33] found using a decision tree to select at-
tributes helped the nearest neighbor algorithm to reduce
its prediction error. Kubat et al. [34] used a decision tree
3. That is, which attribute subset is more useful for defect prediction
not only depends on the attribute subset itself but also on the specific
data set.
filtering attributes for use with a Naïve Bayesian classi-
fier and obtained a similar result. However, Kibler and
Aha [35] reported more mixed results on two medical
classification tasks. Therefore, before building prediction
models, we should choose the combination of all three
of learning algorithm, data pre-processing and attribute
selection method, not merely one or two of them.
Lessmann et al. [24] have also conducted a follow-
up to MGF on defect predictions, providing additional
results as well as suggestions for a methodological
framework. However, they did not perform attribute
selection when building prediction models. Thus our
work has wider application.
We also argue that MGF’s attribute selection approach
is problematic and yielded a bias in the evaluation
results, despite the use of an M×N-way cross-validation
method. One reason is that they ranked attributes on
the entire data set including both the training and test
data, though the class labels of the test data should have
been unknown to the predictor. That is, they violated the
intention of the holdout strategy. The potential result is
that they overestimate the performance of their learning
model and thereby report a potentially misleading result.
Moreover, after ranking attributes, they evaluated each
individual attribute separately and chose those n features
with the highest scores. Unfortunately, this strategy can-
not consider features with complementary information,
and does not account for attribute dependence. It is
also incapable of removing redundant features because
redundant features are likely to have similar rankings.
As long as features are deemed relevant to the class,
they will all be selected even though many of them are
highly correlated to each other.
These seemingly minor issues motivate the develop-
ment of our general-purpose defect prediction frame-
work described in this paper. However, we will show
the large impact they can have and how researchers
may be completely misled. Our proposed framework
consists of two parts: scheme evaluation and defect
prediction. The scheme evaluation focuses on evaluating
the performance of a learning scheme, whilst the defect
prediction focuses on building a final predictor using
historical data according to the learning scheme and after
which the predictor is used to predict the defect-prone
components of a new (or unseen) software system.
A learning scheme comprises:
1) a data preprocessor,
2) an attribute selector,
3) a learning algorithm.
So to summarize, the main difference between our
framework and that of MGF lies in: (i) we choose the
entire learning scheme, not just one out of the learning
algorithm, attribute selector or data pre-processor; (ii) we
use the appropriate data to evaluate the performance of
a scheme. That is, we build a predictive model according
to a scheme with only ‘historical’ data and validate the
model on the independent ‘new’ data. We go on to
demonstrate why this has very practical implications.
3 PROPOSED SOFTWARE DEFECT PREDIC-
TION FRAMEWORK
3.1 Overview of the framework
Generally, before building defect prediction model(s)
and using them for prediction purposes, we first need
to decide which learning scheme should be used to
construct the model. Thus the predictive performance of
the learning scheme(s) should be determined, especially
for future data. However, this step is often neglected
and so the resultant prediction model may not be trust-
worthy. Consequently we propose a new software defect
prediction framework that provides guidance to address
these potential shortcomings. The framework consists of
two components: (i) scheme evaluation and (ii) defect
prediction. Fig. 1 contains the details.
Fig. 1. Proposed software defect prediction framework
At the scheme evaluation stage, the performances of
the different learning schemes are evaluated with histor-
ical data to determine whether a certain learning scheme
performs sufficiently well for prediction purposes or to
select the best from a set of competing schemes.
From Fig. 1 we can see that the historical data are
divided into two parts: a training set for building learn-
ers with the given learning schemes, and a test set for
evaluating the performances of the learners. It is very
important that the test data are not used in any way
to build the learners. This is a necessary condition to
assess the generalization ability of a learner that is built
according to a learning scheme, and further to determine
whether or not to apply the learning scheme, or select
one best scheme from the given schemes.
At the defect prediction stage, according to the per-
formance report of the first stage, a learning scheme
is selected and used to build a prediction model and
predict software defect. From Fig. 1 we observe that all
the historical data are used to build the predictor here.
This is very different from the first stage; it is very useful
for improving the generalization ability of the predictor.
After the predictor is built, it can be used to predict the
defect-proneness of new software components.
MGF proposed a baseline experiment and reported the
performance of the Naïve Bayes data miner with log-
filtering as well as attribute selection, which performed
the scheme evaluation but with inappropriate data. This
is because they used both the training (which can be
viewed as historical data) and test (which can be viewed
as new data) data to rank attributes, while the labels of
the new data are unavailable when choosing attributes
in practice.
3.2 Scheme evaluation
The scheme evaluation is a fundamental part of the
software defect prediction framework. At this stage,
different learning schemes are evaluated by building and
evaluating learners with them. Fig. 2 contains the details.
Fig. 2. Scheme evaluation of the proposed framework.
The first problem of scheme evaluation is how to
divide historical data into training and test data. As
mentioned above, the test data should be independent
of the learner construction. This is a necessary pre-
condition to evaluate the performance of a learner for
new data. Cross-validation is usually used to estimate
how accurately a predictive model will perform in prac-
tice. One round of cross-validation involves partitioning
a data set into complementary subsets, performing the
analysis on one subset, and validating the analysis on
the other subset. To reduce variability, multiple rounds of
cross-validation are performed using different partitions,
and the validation results are averaged over the rounds.
In our framework, an M×N-way cross-validation is used
for estimating the performance of each predictive model,
that is, each data set is first divided into N bins, and
after that a predictor is learned on (N-1) bins, and then
tested on the remaining bin. This is repeated for the N
folds so that each bin is used for training and testing
while minimizing the sampling bias. To overcome any
ordering effect and to achieve reliable statistics, each
holdout experiment is also repeated M times and in each
repetition the data sets are randomized. So overall, M×N
models are built in all during the evaluation period, and
thus M×N results are obtained on each data set for the
performance of each learning scheme.
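
A minimal sketch of generating these M×N train/test splits, assuming scikit-learn (our choice of tooling; the original study used its own scripts):

import numpy as np
from sklearn.model_selection import RepeatedKFold

M, N = 10, 10                          # M repetitions of N-fold cross-validation
X = np.arange(200).reshape(-1, 1)      # placeholder: 200 software modules

splitter = RepeatedKFold(n_splits=N, n_repeats=M, random_state=0)
splits = list(splitter.split(X))
assert len(splits) == M * N            # M x N learners are built in total
train_idx, test_idx = splits[0]
assert set(train_idx).isdisjoint(test_idx)   # the test bin never leaks into training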
After the training-test split is done in each round,
both the training data and learning scheme(s) are used to
build a learner. A learning scheme consists of a data pre-
processing method, an attribute selection method, and
a learning algorithm. The detailed learner construction
procedure is as follows:
1) Data preprocessing
This is an important part of building a practical
learner. In this step, the training data are prepro-
cessed, such as removing outliers, handling miss-
ing values, discretizing or transforming numeric
attributes. In our experiment, we use a log-filtering
preprocessor, which replaces all numerics n with
their logarithms ln(n), as was done in MGF.
2) Attribute selection
The data sets may not have originally been in-
tended for defect prediction, thus even if all the
attributes are useful for its original task, not all may
be helpful for defect prediction. Therefore attribute
selection has to be performed on the training data.
Attribute selection methods can be categorized as
either filters or wrappers [36]. It should be noted
that both ‘filter’ and ‘wrapper’ methods only op-
erate on the training data. A ‘filter’ uses general
characteristics of the data to evaluate attributes
and operates independently of any learning algo-
rithm. In contrast, a ‘wrapper’ method exists as a
wrapper around the learning algorithm, searching
for a good subset using the learning algorithm
itself as part of the function evaluating attribute
subsets. Wrappers generally give better results than
filters but are more computationally intensive. In
our proposed framework, the ‘wrapper’ attribute
selection method is employed. To make the most use
of the data, we use an M×N-way cross-validation
to evaluate the performance of different attribute
subsets.
3) Learner construction
Once attribute selection is finished, the prepro-
cessed training data are reduced to the best attribute
subset. Then the reduced training data and the
learning algorithm are used to build the learner.
Before the learner is tested, the original test data are
preprocessed in the same way and the dimension-
ality is reduced to the same best subset of attributes.
After comparing the predicted value and the actual
value of the test data, the performance of one pass
of validation is obtained. As mentioned previously,
the final ‘evaluation’ performance can be obtained
as the mean and variance values across the M×N
passes of such validation.
The detailed scheme evaluation process is described
with pseudocode in the following Procedure Evaluation
which consists of Function Learning and Function AttrSe-
lect. The Function Learning is used to build a learner with
a given learning scheme, and the Function AttrSelect
performs attribute selection with a learning algorithm.
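
The pseudocode itself is not reproduced in this excerpt. As a rough stand-in, the following Python sketch shows one pass of learner construction under an NB+Log+FS-style scheme using scikit-learn; the function name and tooling are our assumptions, and attribute values are assumed to be strictly positive so that the log-filter applies:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

def evaluate_one_pass(X_train, y_train, X_test, y_test):
    # 1) Data preprocessing: log-filter the training data only.
    Xtr = np.log(X_train)
    # 2) Attribute selection: a 'wrapper' around the learning algorithm itself,
    #    run exclusively on the (preprocessed) training data.
    selector = SequentialFeatureSelector(
        GaussianNB(), direction="forward", cv=5, scoring="roc_auc"
    ).fit(Xtr, y_train)
    # 3) Learner construction on the reduced training data.
    learner = GaussianNB().fit(selector.transform(Xtr), y_train)
    # The test data are preprocessed in the same way and reduced to the same
    # attribute subset; they play no part in building the learner.
    Xte = selector.transform(np.log(X_test))
    return roc_auc_score(y_test, learner.predict_proba(Xte)[:, 1])

Averaging this score over the M×N splits gives the evaluation performance of the scheme as a whole.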
3.3 Defect prediction
The defect prediction part of our framework is straight-
forward: it consists of predictor construction and defect
prediction.
During the period of the predictor construction,
1) A learning scheme is chosen according to the Per-
formance Report.
2) A predictor is built with the selected learning
scheme and the whole historical data. While eval-
uating a learning scheme, a learner is built with
the training data and tested on the test data. Its
final performance is the mean over all rounds. This
reveals that the evaluation indeed covers all the
data. However, a single round of cross-validation
uses only one part of the data. Therefore, as we
use all the historical data to build the predictor,
it is expected that the constructed predictor has
stronger generalization ability.
3) After the predictor is built, new data are prepro-
cessed in the same way as the historical data; then the
constructed predictor can be used to predict software
defects with the preprocessed new data.
The detailed defect prediction process is described
with pseudocode in the following Procedure Prediction.
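
Again the pseudocode is not reproduced here; a hedged sketch of the prediction stage under the same assumed NB+Log+FS-style scheme (function names are ours) might look as follows:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

def build_predictor(X_hist, y_hist):
    # The selected learning scheme is applied to ALL historical data,
    # with no train/test split at this stage.
    Xh = np.log(X_hist)
    selector = SequentialFeatureSelector(
        GaussianNB(), direction="forward", cv=5, scoring="roc_auc"
    ).fit(Xh, y_hist)
    learner = GaussianNB().fit(selector.transform(Xh), y_hist)
    return selector, learner

def predict_defect_proneness(selector, learner, X_new):
    # New data are preprocessed in exactly the same way as the historical data
    # before being fed to the constructed predictor.
    return learner.predict(selector.transform(np.log(X_new)))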
3.4 Difference between our proposed framework
and MGF
Although both MGF’s study and ours involve an
M×N-way cross-validation, there is, however, a signifi-
cant difference. In their study, for each data set, the
attributes were ranked by InfoGain which was calculated
on the whole data set, then the M×N-way validation was
wrapped inside scripts that explored different subsets of
attributes in the order suggested by InfoGain. In our
study, there is an M×N-way cross-validation for perfor-
mance estimation of the learner with attribute selection,
which lies outside the attribute selection procedure. We
only performed attribute selection on the training data.
When a ‘wrapper’ selection method is performed, an-
other cross-validation can be performed to evaluate the
performance of different attribute subsets. This should
be performed only on the training data.
To recap, the essential problem in MGF’s study is that
the test data were used for attribute selection, which
actually violated the intention of the holdout strategy. In
their study, the M×N-way cross-validation actually imple-
mented a holdout strategy to just select the ‘best’ subset
among the subsets recommended by InfoGain for each
data set. However, as the “test data” are unknown at
that point in time, the result obtained in that way
potentially overfits the current data set itself and cannot
be used to assess the future performance of the learner
built with such ‘best’ subset.
Our framework focuses on the attribute selection
method itself instead of a certain ‘best’ subset, as dif-
ferent training data may produce different best sub-
sets. We treat the attribute selection method as a part
of the learning scheme. The ‘inner’ cross-validation is
performed on the training data, which actually selects
the ‘best’ attribute set on the training data with the
basic learning algorithm. After that, the ‘outer’ cross-
validation assesses how well the learner built with such
‘best’ attributes performs on the test data, which is really
new to the learner. Thus our framework can properly
assess the future performance of the learning scheme as
a whole.
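
To make the ‘inner’/‘outer’ distinction concrete, the sketch below nests the wrapper selection inside a pipeline so that its inner cross-validation only ever sees the outer training folds (our reconstruction with scikit-learn on placeholder data, not the paper's code):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, RepeatedKFold

scheme = Pipeline([
    ("log", FunctionTransformer(np.log)),                     # data preprocessor
    ("select", SequentialFeatureSelector(GaussianNB(),        # wrapper attribute selector;
                                         direction="forward", # the 'inner' CV runs in here,
                                         cv=5)),              # on training data only
    ("learn", GaussianNB()),                                  # learning algorithm
])

rng = np.random.RandomState(1)
X = rng.rand(200, 20) + 1e-6          # placeholder metric data (strictly positive)
y = rng.randint(0, 2, 200)            # placeholder defect labels

# The 'outer' cross-validation assesses the whole scheme on genuinely unseen folds.
outer = RepeatedKFold(n_splits=10, n_repeats=2, random_state=1)  # repeats reduced for speed
scores = cross_val_score(scheme, X, y, cv=outer, scoring="roc_auc")
print(scores.mean(), scores.std())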
4 EMPIRICAL STUDY
4.1 Data sets
We used the data taken from the public NASA MDP
repository [37], which was also used by MGF and many
others, e.g. [24], [38], [39], and [22]. In addition, the AR
data from the PROMISE repository (see footnote 4) was also used. Thus
there are 17 data sets in total, 13 from NASA and the
remaining 4 from the PROMISE repository.
Table 1 provides some basic summary information.
Each data set comprises a number of software modules
(cases), each containing the corresponding number of
defects and various software static code attributes. After
preprocessing, modules that contain one or more defects
were labeled as defective. Besides LOC counts, the data
sets include Halstead attributes as well as McCabe com-
plexity measures (see footnote 5). A more detailed description of code
attributes or the origin of the MDP data sets can be
obtained from [23].
4.2 Performance measures
The receiver operating characteristic (ROC) curve is often
used to evaluate the performance of binary predictors.
A typical ROC curve is shown in Fig. 3. The y-axis
shows probability of detection (pd) and the x-axis shows
probability of false alarms (pf ).
Formal definitions for pd and pf are given in Equations
1 and 2 respectively. Obviously higher pds and lower
pfs are desired. The point (pf = 0, pd = 1) is the ideal
4. http://promise.site.uottowa.ca/SERepository
5. Whilst there is some disquiet concerning the value of such code
metrics recall that the purpose of this paper is to examine frameworks
for learning defect classifiers and not to find the ‘best’ classifier per
se. Moreover, since attribute selection is part of the framework we can
reasonably expect irrelevant or redundant attributes to be eliminated
from any final classifier.
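
The excerpt breaks off above before Equations 1 and 2 are shown. For completeness, the following sketch computes pd, pf, and balance from a confusion matrix using the standard definitions (the balance formula is our reconstruction of the measure as used by MGF):

import math

def pd_pf_balance(tp, fn, fp, tn):
    pd = tp / (tp + fn)          # probability of detection: defective modules caught
    pf = fp / (fp + tn)          # probability of false alarm: clean modules flagged
    # balance: normalized Euclidean distance from the ideal ROC point (pf=0, pd=1).
    balance = 1 - math.sqrt(pf ** 2 + (1 - pd) ** 2) / math.sqrt(2)
    return pd, pf, balance

# Hypothetical counts: 71 of 100 defective modules detected, 25 of 100 clean ones flagged.
print(pd_pf_balance(tp=71, fn=29, fp=25, tn=75))   # approx. (0.71, 0.25, 0.73)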

Citations
Journal ArticleDOI
TL;DR: Although there are a set of fault prediction studies in which confidence is possible, more studies are needed that use a reliable methodology and which report their context, methodology, and performance comprehensively.
Abstract: Background: The accurate prediction of where faults are likely to occur in code can help direct test effort, reduce costs, and improve the quality of software. Objective: We investigate how the context of models, the independent variables used, and the modeling techniques applied influence the performance of fault prediction models. Method: We used a systematic literature review to identify 208 fault prediction studies published from January 2000 to December 2010. We synthesize the quantitative and qualitative results of 36 studies which report sufficient contextual and methodological information according to the criteria we develop and apply. Results: The models that perform well tend to be based on simple modeling techniques such as Naive Bayes or Logistic Regression. Combinations of independent variables have been used by models that perform well. Feature selection has been applied to these combinations when models are performing particularly well. Conclusion: The methodology used to build models seems to be influential to predictive performance. Although there are a set of fault prediction studies in which confidence is possible, more studies are needed that use a reliable methodology and which report their context, methodology, and performance comprehensively.

1,012 citations


Cites background from "A General Software Defect-Proneness..."

  • ...This is an important finding as it suggests that a relatively high number of papers reporting fault prediction are not really doing any prediction (this finding is also reported by [6])....


  • ...trained and tested on different data [6]....


Journal ArticleDOI
01 Feb 2015
TL;DR: The machine learning techniques have the ability for predicting software fault proneness and can be used by software practitioners and researchers, however, the application of theMachine learning techniques in software fault prediction is still limited and more number of studies should be carried out in order to obtain well formed and generalizable results.
Abstract: Reviews studies from 1991-2013 to assess application of ML techniques for SFP.Identifies seven categories of the ML techniques.Identifies 64 studies to answer the established research questions.Selects primary studies according to the quality assessment of the studies.Systematic literature review performs the following:Summarize ML techniques for SFP models.Assess performance accuracy and capability of ML techniques for constructing SFP models.Provide comparison between the ML and statistical techniques.Provide comparison of performance accuracy of different ML techniques.Summarize the strength and weakness of the ML techniques.Provides future guidelines to software practitioners and researchers. BackgroundSoftware fault prediction is the process of developing models that can be used by the software practitioners in the early phases of software development life cycle for detecting faulty constructs such as modules or classes. There are various machine learning techniques used in the past for predicting faults. MethodIn this study we perform a systematic review of studies from January 1991 to October 2013 in the literature that use the machine learning techniques for software fault prediction. We assess the performance capability of the machine learning techniques in existing research for software fault prediction. We also compare the performance of the machine learning techniques with the statistical techniques and other machine learning techniques. Further the strengths and weaknesses of machine learning techniques are summarized. ResultsIn this paper we have identified 64 primary studies and seven categories of the machine learning techniques. The results prove the prediction capability of the machine learning techniques for classifying module/class as fault prone or not fault prone. The models using the machine learning techniques for estimating software fault proneness outperform the traditional statistical models. ConclusionBased on the results obtained from the systematic review, we conclude that the machine learning techniques have the ability for predicting software fault proneness and can be used by software practitioners and researchers. However, the application of the machine learning techniques in software fault prediction is still limited and more number of studies should be carried out in order to obtain well formed and generalizable results. We provide future guidelines to practitioners and researchers based on the results obtained in this work.

483 citations


Cites background from "A General Software Defect-Proneness..."

  • ...Hence, total number of 122 studies [1, 2, 20-139] were identified for further processing and analysis....


  • ...SF1 Sherer 1995 [20] SF33 Singh 2009b [51] SF2 Guo 2003 [21] SF34 Tosun 2009 [52] SF3 Guo 2004 [22] SF35 Zimmermann 2009 [53] SF4 Gyimothy 2005 [23] SF36 Afzal 2010 [54] SF5 Koru 2005 [24] SF37 Arisholm 2010 [55] SF6 Zhou 2006 [25] SF38 Carvalho 2010 [56] SF7 Arisholm 2007 [26] SF39 Liu 2010 [57] SF8 Catal 2007 [27] SF40 Malhotra 2010 [58] SF9 Jiang 2007 [28] SF41 Ostrand 2010 [59] SF10 Kanmani 2007 [29] SF42 Pendharkar 2010 [60] SF11 Li 2007 [30] SF43 Seliya 2010 [1] SF12 Menzies 2007 [31] SF44 Singh 2010 [61] SF13 Ma 2007 [32] SF45 Zhou 2010 [62] SF14 Pai 2007 [33] SF46 Azar 2011 [63] SF15 Turhan 2007 [34] SF47 Diri 2011 [64] SF16 Turhan 2007a [35] SF48 Malhotra 2011 [65] SF17 Carvalho 2008 [36] SF49 Martino 2011 [66] SF18 Elish 2008 [37] SF50 Mishra 2011 [67] SF19 Gondra 2008 [38] SF51 Misirh 2011 [68] SF20 Jiang 2008 [39] SF52 Ricca 2011 [69] SF21 Kaur 2008 [40] SF53 Rodriguez 2011 [70] SF22 Kim 2008 [41] SF54 Song 2011 [71] SF23 Lessmann 2008 [42] SF55 Twala 2011 [72] SF24 Menzies 2008 [43] SF56 Chen 2012 [73] SF25 Moser 2008 [44] SF57 Malhotra 2012 [74] SF26 Turhan 2008 [45] SF58 Okutan 2012 [75] SF27 Vandecruys 2008 [46] SF59 Yang 2012 [76] SF28 Bener 2009 [47] SF60 Yu 2012 [77] SF29 Catal 2009 [48] SF61 Zhou 2012 [78] SF30 Menzies 2009 [49] SF62 Cahill 2013 [79] SF31 Singh 2009 [50] SF63 Chen 2013 [80] SF32 Singh 2009a [2] SF64 Dejaeger 2013 [81] Table 4: Selected Primary Studies...


Journal ArticleDOI
TL;DR: This paper investigates different types of class imbalance learning methods, including resampling techniques, threshold moving, and ensemble algorithms, and concludes that AdaBoost.NC shows the best overall performance in terms of the measures including balance, G-mean, and Area Under the Curve (AUC).
Abstract: To facilitate software testing, and save testing costs, a wide range of machine learning methods have been studied to predict defects in software modules. Unfortunately, the imbalanced nature of this type of data increases the learning difficulty of such a task. Class imbalance learning specializes in tackling classification problems with imbalanced distributions, which could be helpful for defect prediction, but has not been investigated in depth so far. In this paper, we study the issue of if and how class imbalance learning methods can benefit software defect prediction with the aim of finding better solutions. We investigate different types of class imbalance learning methods, including resampling techniques, threshold moving, and ensemble algorithms. Among those methods we studied, AdaBoost.NC shows the best overall performance in terms of the measures including balance, G-mean, and Area Under the Curve (AUC). To further improve the performance of the algorithm, and facilitate its use in software defect prediction, we propose a dynamic version of AdaBoost.NC, which adjusts its parameter automatically during training. Without the need to pre-define any parameters, it is shown to be more effective and efficient than the original AdaBoost.NC.

457 citations


Cites background or methods from "A General Software Defect-Proneness..."

  • ...For example, the log filter was shown to improve the performance of Naive Bayes significantly, but contributed very little to decision trees [37]....


  • ...claimed that a high-PD predictor is still useful in practice, even if the other measures may not be good enough [37], [47]....


Journal ArticleDOI
TL;DR: The extent to which published analyses based on the NASA defect datasets are meaningful and comparable is investigated and it is recommended that researchers indicate the provenance of the datasets they use and invest effort in understanding the data prior to applying machine learners.
Abstract: Background--Self-evidently empirical analyses rely upon the quality of their data. Likewise, replications rely upon accurate reporting and using the same rather than similar versions of datasets. In recent years, there has been much interest in using machine learners to classify software modules into defect-prone and not defect-prone categories. The publicly available NASA datasets have been extensively used as part of this research. Objective--This short note investigates the extent to which published analyses based on the NASA defect datasets are meaningful and comparable. Method--We analyze the five studies published in the IEEE Transactions on Software Engineering since 2007 that have utilized these datasets and compare the two versions of the datasets currently in use. Results--We find important differences between the two versions of the datasets, implausible values in one dataset and generally insufficient detail documented on dataset preprocessing. Conclusions--It is recommended that researchers 1) indicate the provenance of the datasets they use, 2) report any preprocessing in sufficient detail to enable meaningful replication, and 3) invest effort in understanding the data prior to applying machine learners.

444 citations

Journal ArticleDOI
TL;DR: It is found that single-repetition holdout validation tends to produce estimates with 46-229 percent more bias and 53-863 percent more variance than the top-ranked model validation techniques, and out-of-sample bootstrap validation yields the best balance between the bias and variance.
Abstract: Defect prediction models help software quality assurance teams to allocate their limited resources to the most defect-prone modules. Model validation techniques, such as $k$ -fold cross-validation, use historical data to estimate how well a model will perform in the future. However, little is known about how accurate the estimates of model validation techniques tend to be. In this paper, we investigate the bias and variance of model validation techniques in the domain of defect prediction. Analysis of 101 public defect datasets suggests that 77 percent of them are highly susceptible to producing unstable results– - selecting an appropriate model validation technique is a critical experimental design choice. Based on an analysis of 256 studies in the defect prediction literature, we select the 12 most commonly adopted model validation techniques for evaluation. Through a case study of 18 systems, we find that single-repetition holdout validation tends to produce estimates with 46-229 percent more bias and 53-863 percent more variance than the top-ranked model validation techniques. On the other hand, out-of-sample bootstrap validation yields the best balance between the bias and variance of estimates in the context of our study. Therefore, we recommend that future defect prediction studies avoid single-repetition holdout validation, and instead, use out-of-sample bootstrap validation.

414 citations


Cites background from "A General Software Defect-Proneness..."

  • ...…validation technique is the number of Events Per Variable [2], [6], [82], [99], i.e., the ratio of the number of occurrences of the least frequently occurring class of the dependent variable (i.e., the events) to the number of independent variables used to train the model (i.e., the variables)....


References
Book
15 Oct 1992
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and over hitting.
Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use , the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and over hitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.

21,674 citations

Journal ArticleDOI
TL;DR: The wrapper method searches for an optimal feature subset tailored to a particular algorithm and a domain and compares the wrapper approach to induction without feature subset selection and to Relief, a filter approach tofeature subset selection.

8,610 citations

01 Jan 1994
TL;DR: In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments, which will be a welcome addition to the library of many researchers and students.
Abstract: Algorithms for constructing decision trees are among the most well known and widely used of all machine learning methods. Among decision tree algorithms, J. Ross Quinlan's ID3 and its successor, C4.5, are probably the most popular in the machine learning community. These algorithms and variations on them have been the subject of numerous research papers since Quinlan introduced ID3. Until recently, most researchers looking for an introduction to decision trees turned to Quinlan's seminal 1986 Machine Learning journal article [Quinlan, 1986]. In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments. As such, this book will be a welcome addition to the library of many researchers and students.

8,046 citations


"A General Software Defect-Proneness..." refers methods in this paper

  • ...This significantly outperformed the rule induction methods of J48 and OneR (due to Quinlan [31])....


Journal ArticleDOI
TL;DR: Several of Chidamber and Kemerer's OO metrics appear to be useful to predict class fault-proneness during the early phases of the life-cycle and are better predictors than "traditional" code metrics, which can only be collected at a later phase of the software development processes.
Abstract: This paper presents the results of a study in which we empirically investigated the suite of object-oriented (OO) design metrics introduced in (Chidamber and Kemerer, 1994). More specifically, our goal is to assess these metrics as predictors of fault-prone classes and, therefore, determine whether they can be used as early quality indicators. This study is complementary to the work described in (Li and Henry, 1993) where the same suite of metrics had been used to assess frequencies of maintenance changes to classes. To perform our validation accurately, we collected data on the development of eight medium-sized information management systems based on identical requirements. All eight projects were developed using a sequential life cycle model, a well-known OO analysis/design method and the C++ programming language. Based on empirical and quantitative analysis, the advantages and drawbacks of these OO metrics are discussed. Several of Chidamber and Kemerer's OO metrics appear to be useful to predict class fault-proneness during the early phases of the life-cycle. Also, on our data set, they are better predictors than "traditional" code metrics, which can only be collected at a later phase of the software development processes.

1,741 citations


"A General Software Defect-Proneness..." refers background in this paper

  • ...The third type of work classifies software components as defect-prone and non-defect-prone by means of metric-based classification [13], [14], [15], [16], [17], [18],...


01 Apr 1999
TL;DR: This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems and performs more feature selection than ReliefF does—reducing the data dimensionality by fifty percent in most cases.
Abstract: Algorithms for feature selection fall into two broad categories: wrappers that use the learning algorithm itself to evaluate the usefulness of features and filters that evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. The algorithm often outperforms the well-known ReliefF attribute estimator when used as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees. It performs more feature selection than ReliefF does—reducing the data dimensionality by fifty percent in most cases. Also, decision and model trees built from the preprocessed data are often significantly smaller.

1,653 citations


Additional excerpts

  • ...Examples include [13], [26], [27], [ 28 ], [20], [23], and [24]....


Frequently Asked Questions (4)
Q1. What contributions have the authors mentioned in the paper "A general software defect-proneness prediction framework" ?

OBJECTIVE – the authors propose and evaluate a general framework for software defect prediction that supports (i) unbiased and (ii) comprehensive comparison between competing prediction systems. RESULTS – the results show that the authors should choose different learning schemes for different data sets (i.e., no scheme dominates), that small details in how evaluations are conducted can completely reverse findings, and lastly that their proposed framework is more effective and less prone to bias than previous approaches.

12 learning schemes resulting from two data preprocessors, two feature selectors, and three classification algorithms are designed to assess the effects of different elements of a learning scheme on defect prediction.

3) After the predictor is built, new data are preprocessed in the same way as the historical data; then the constructed predictor can be used to predict software defects with the preprocessed new data.

3) The largest absolute value of balance diff in the MGF framework is 25.7% on the AR1 data, for which the corresponding absolute value of balance diff in the proposed framework is just 3.16%.