A General Software Defect-Proneness Prediction Framework
Summary
1 INTRODUCTION
- Software defect prediction has been an important research topic in the software engineering field for more than 30 years.
- Current defect prediction work focuses on (i) estimating the number of defects remaining in software systems, (ii) discovering defect associations, and (iii) classifying the defect-proneness of software components, typically into two classes: defect-prone and not defect-prone.
- Discovering defect associations can also assist managers in improving the software process through analysis of the reasons why some defects frequently occur together.
- This allows the authors to explore the impact of data from different sources on different processes for finding appropriate classification models, apart from evaluating these processes in a fair and reasonable way.
- Google Scholar (accessed February 6, 2010) indicates an impressive 132 citations to MGF [23] within the space of three years.
3.1 Overview of the framework
- Generally, before building a defect prediction model and using it for prediction purposes, one first needs to decide which learning scheme should be used to construct the model.
- Consequently the authors propose a new software defect prediction framework that provides guidance to address these potential shortcomings.
- At the scheme evaluation stage, the performances of the different learning schemes are evaluated with historical data to determine whether a certain learning scheme performs sufficiently well for prediction purposes or to select the best from a set of competing schemes.
- It is very important that the test data are not used in any way to build the learners.
- From Fig. 1, the authors observe that at the prediction stage all the historical data are used to build the predictor.
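The framework therefore has two stages: scheme evaluation on the historical data, and defect prediction on the new data. The following is a minimal, illustrative sketch of that split, assuming scikit-learn and assuming the historical metrics and labels (X_hist, y_hist) and the new modules (X_new) are already loaded as NumPy arrays; the scheme names and the k=5 attribute count are placeholders, not the paper's exact configuration.

```python
# Minimal sketch of the two-stage framework (illustrative only).
# Assumed inputs: X_hist, y_hist (historical module metrics and defect labels)
# and X_new (metrics of the new, unlabeled modules), all as NumPy arrays.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_selection import SelectKBest, f_classif

# A "learning scheme" = data preprocessor + attribute selector + learning algorithm.
schemes = {
    "NB+Log+FS": make_pipeline(FunctionTransformer(np.log1p),
                               SelectKBest(f_classif, k=5), GaussianNB()),
    "Tree+None+FS": make_pipeline(SelectKBest(f_classif, k=5),
                                  DecisionTreeClassifier(random_state=0)),
}

# Stage 1: scheme evaluation -- cross-validation on the historical data only.
scores = {name: cross_val_score(p, X_hist, y_hist, cv=10, scoring="roc_auc").mean()
          for name, p in schemes.items()}
best = max(scores, key=scores.get)

# Stage 2: defect prediction -- rebuild the best scheme on ALL historical data,
# then apply it to the new data, which was never touched during evaluation.
predictor = schemes[best].fit(X_hist, y_hist)
predicted_labels = predictor.predict(X_new)
```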
3.2 Scheme evaluation
- The scheme evaluation is a fundamental part of the software defect prediction framework.
- To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
- After the training-test split is made in each round, the training data and the learning scheme(s) are used to build a learner.
- In their experiment, the authors use a log-filtering preprocessor, which replaces each numeric value n with its logarithm ln(n), as used in MGF.
- Therefore, attribute selection has to be performed on the training data only.
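One evaluation round can be written out by hand to make the point concrete: the log filter and the attribute selection are derived from the training split only and merely applied to the held-out split. This is a sketch assuming scikit-learn, a NumPy feature matrix X and binary defect labels y; log1p and SelectKBest stand in for the paper's exact log filter and attribute selector, and the paper additionally repeats such rounds with different partitions and averages the results.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

def evaluate_one_round(X, y, n_splits=10, k=5, seed=0):
    aucs = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, y):
        # Log filter: replace each numeric n with its logarithm (log1p avoids ln(0)).
        X_tr, X_te = np.log1p(X[tr]), np.log1p(X[te])
        # Attribute selection is fitted on the training split only ...
        selector = SelectKBest(f_classif, k=k).fit(X_tr, y[tr])
        learner = GaussianNB().fit(selector.transform(X_tr), y[tr])
        # ... and merely applied to the held-out split.
        proba = learner.predict_proba(selector.transform(X_te))[:, 1]
        aucs.append(roc_auc_score(y[te], proba))
    return float(np.mean(aucs))
```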
3.3 Defect prediction
- The defect prediction part of their framework is straightforward: it consists of predictor construction and defect prediction.
- 2) A predictor is built with the selected learning scheme and the whole historical data.
- The final evaluation performance of a learning scheme is the mean over all rounds.
- A single round of cross-validation relies on only one particular partitioning of the data, which is why several rounds are averaged.
- The detailed defect prediction process is described with pseudocode in the paper's Procedure Prediction.
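The paper's Procedure Prediction pseudocode is not reproduced in this summary; the following is a rough stand-in, assuming the selected scheme is a pipeline such as the ones sketched above, so that fitting it on the historical data also fixes the preprocessing and attribute selection later applied to the new data.

```python
def predict_defects(scheme, X_hist, y_hist, X_new):
    # Predictor built with the selected learning scheme and ALL historical data.
    predictor = scheme.fit(X_hist, y_hist)
    # New data pass through the same fitted preprocessing and attribute selection,
    # then the predictor labels them as defect-prone or not.
    return predictor.predict(X_new)
```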
3.4 Difference between our proposed framework and MGF
- Attribute selection should be performed only on the training data.
- To recap, the essential problem in MGF’s study is that the test data were used for attribute selection, which violates the intent of the holdout strategy.
- The authors’ framework focuses on the attribute selection method itself rather than on a particular ‘best’ subset, as different training data may produce different best subsets.
- After that, the ‘outer’ cross-validation assesses how well the learner built with such ‘best’ attributes performs on the test data, which is genuinely new to the learner.
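The methodological difference can be made explicit with a small scikit-learn illustration (X and y are assumed historical data; the selector and learner are placeholders): selecting attributes on all the data before cross-validation lets the test folds leak into the selection, whereas placing the selector inside the pipeline re-fits it on the training part of every fold.

```python
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# MGF-style (biased): attributes are selected on ALL the data, so the test folds
# have already influenced the chosen subset before cross-validation starts.
selector = SelectKBest(f_classif, k=5).fit(X, y)
biased = cross_val_score(GaussianNB(), selector.transform(X), y,
                         cv=10, scoring="roc_auc")

# Framework-style (unbiased): the selector sits inside the pipeline, so it is
# re-fitted on the training part of every fold; each test fold stays unseen.
pipeline = make_pipeline(SelectKBest(f_classif, k=5), GaussianNB())
unbiased = cross_val_score(pipeline, X, y, cv=10, scoring="roc_auc")
```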
4.1 Data sets
- In addition, the AR data from the PROMISE repository were also used.
- Table 1 provides some basic summary information.
- After preprocessing, modules containing one or more defects were labeled as defective.
- A more detailed description of code attributes or the origin of the MDP data sets can be obtained from [23].
4.2 Performance measures
- The receiver operating characteristic (ROC) curve is often used to evaluate the performance of binary predictors.
- Nevertheless, this does not necessarily mean that all predictors with the same balance value are equally useful in practice.
- Thus the ROC curve characterizes the performance of a binary predictor across varying thresholds.
- AUC is regarded as one of the most informative and commonly used measures, so it is used as another performance measure in this paper.
- Diff = (EvaPerf − PredPerf) / PredPerf × 100% (6), where EvaPerf represents the mean evaluation performance and PredPerf denotes the mean prediction performance.
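For concreteness, small helpers computing these measures might look as follows; the balance definition follows MGF (distance of (pf, pd) from the ideal point pf = 0, pd = 1, rescaled to [0, 1]) and diff follows Eq. (6). Function and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def pd_pf(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    pd = tp / (tp + fn)   # probability of detection (recall)
    pf = fp / (fp + tn)   # probability of false alarm
    return pd, pf

def balance(pd, pf):
    # Distance from the ideal point (pf=0, pd=1), normalized so that 1 is best.
    return 1.0 - np.sqrt((0.0 - pf) ** 2 + (1.0 - pd) ** 2) / np.sqrt(2.0)

def diff(eva_perf, pred_perf):
    return (eva_perf - pred_perf) / pred_perf * 100.0   # Eq. (6), in percent
```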
4.3 Experiment Design
- Two experiments were designed.
- The first compares the authors' framework with that of MGF; the second demonstrates the framework in practice and explores whether a particular learning scheme should always be chosen.
4.3.1 Framework comparison
- This experiment was used to compare their framework with that of MGF, who reported that a Naïve Bayes data miner with a log-filtering preprocessor achieved a mean (pd, pf) = (71, 25).
- In their experiment, the authors simulated the whole process of defect prediction to explore whether MGF’s evaluation result is misleading or not.
- Then an iterative attribute subset selection as used in MGF’s study was performed.
- Specifically, as described in the scheme evaluation procedure, the authors applied the learning scheme only to the training data, after which the final Naïve Bayes learner was built and the test data were used to evaluate the performance of the learner.
- Then the predictor was used to predict defects on the new data, which was preprocessed in the same way as the historical data.
4.3.2 Defect prediction with different learning schemes
- This experiment is intended to demonstrate their framework, to illustrate that different elements of a learning scheme have different impacts on prediction, and to confirm that a data preprocessor, an attribute selector, and a learning algorithm should be chosen as a combination rather than separately.
- For this purpose, twelve different learning schemes were designed according to the following data preprocessors, attribute selectors and learning algorithms.
- Forward selection then tries each of the remaining attributes in conjunction with the best one to find the best pair of attributes (a sketch appears at the end of this subsection).
- For each pass, the authors took 90% of the data as historical data, and the remainder as new data.
- The authors performed the whole process twice, once with balance and once with AUC as the performance measure.
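A sketch of the wrapper-style forward selection described above, under the assumption of a scikit-learn estimator and a NumPy feature matrix: start from the empty set and repeatedly add the attribute that most improves the cross-validated score. The stopping rule (stop when no attribute improves the score) is a plausible choice, not necessarily the paper's exact one.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(estimator, X, y, scoring="roc_auc", cv=5):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # Score each candidate attribute together with those already selected.
        trials = [(cross_val_score(estimator, X[:, selected + [a]], y,
                                   cv=cv, scoring=scoring).mean(), a)
                  for a in remaining]
        score, attr = max(trials)
        if score <= best_score:      # no remaining attribute improves the score
            break
        best_score = score
        selected.append(attr)
        remaining.remove(attr)
    return selected
```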
4.4.1 Framework comparison
- The framework comparison results are summarized in Table 2, which shows the results in terms of balance.
- Thus the authors see what a dramatic impact a seemingly small difference in a validation procedure can have.
- 3) The largest absolute value of balance diff in the MGF framework is 25.7% on the AR1 data, for which the corresponding absolute value of balance diff in the proposed framework is just 3.16%.
- Finally, a Wilcoxon signed-rank test of medians yields p = 0.0028 for a one-tailed hypothesis that the absolute balance diff of the new framework is significantly less than that of the MGF framework (a sketch of this test appears below).
- On the other hand, the mean prediction performance of the proposed framework is higher than that of MGF.
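Assuming SciPy, the reported significance test could be reproduced along these lines, given the paired per-data-set |balance diff| values for the two frameworks (the arrays are placeholders for those measurements).

```python
from scipy.stats import wilcoxon

def compare_frameworks(abs_diff_new, abs_diff_mgf):
    """One-tailed test that the new framework's |balance diff| is smaller.

    abs_diff_new, abs_diff_mgf: paired arrays, one entry per data set.
    """
    stat, p = wilcoxon(abs_diff_new, abs_diff_mgf, alternative="less")
    return p
```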
4.4.2 Defect prediction with different learning schemes
- The twelve different learning schemes were evaluated and then used to predict defect-prone modules across the same 17 data sets.
- This reveals that different attribute selectors can be suitable for different learning algorithms.
- 5) For both the evaluation and prediction, the AUCs of schemes NB+Log+FS, NB+Log+BE, NB+None+FS and NB+None+BE are much better than those of schemes J48+Log+FS, J48+Log+BE, J48+None+FS and J48+None+BE, respectively.
- This means Forward Selection is more suitable for Naïve Bayes with the data preprocessor None.
- Table 11 shows the p-values of Wilcoxon signed-rank test on learning algorithms J48 vs. OneR over the 17 data sets.
5 CONCLUSION
- The authors have presented a novel benchmark framework for software defect prediction.
- In the evaluation stage, different learning schemes are evaluated and the best one is selected.
- From their experimental results the authors observe that there is a bigger difference between the evaluation performance and the actual prediction performance in MGF’s study than with their framework.
- Whilst this might seem like a small technicality, the impact is profound.
- When the authors perform statistical significance testing, they find dramatically different results that are highly statistically significant but in opposite directions.
Frequently Asked Questions (4)
Q2. How many learning schemes were used in the study?
Twelve learning schemes, resulting from two data preprocessors, two feature selectors, and three classification algorithms, were designed to assess the effects of different elements of a learning scheme on defect prediction.
Q3. What is the procedure for constructing a predictor?
After the predictor is built, the new data are preprocessed in the same way as the historical data, and the constructed predictor is then used to predict software defects on the preprocessed new data.
Q4. What is the largest absolute value of balance diff in the proposed framework?
The largest absolute value of balance diff in the MGF framework is 25.7% on the AR1 data, for which the corresponding absolute value of balance diff in the proposed framework is just 3.16%.