Heterogeneous Defect Prediction
Jaechang Nam and Sunghun Kim
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Hong Kong, China
{jcnam,hunkim}@cse.ust.hk
ABSTRACT
Software defect prediction is one of the most active research
areas in software engineering. We can build a prediction
model with defect data collected from a software project
and predict defects in the same project, i.e. within-project
defect prediction (WPDP). Researchers also proposed cross-
project defect prediction (CPDP) to predict defects for new
projects lacking in defect data by using prediction models
built by other projects. Recent studies have shown CPDP
to be feasible. However, CPDP requires projects that have
the same metric set, meaning the metric sets should be iden-
tical between projects. As a result, current techniques for
CPDP are difficult to apply across projects with heteroge-
neous metric sets.
To address the limitation, we propose heterogeneous de-
fect prediction (HDP) to predict defects across projects with
heterogeneous metric sets. Our HDP approach conducts
metric selection and metric matching to build a prediction
model between projects with heterogeneous metric sets. Our
empirical study on 28 subjects shows that about 68% of pre-
dictions using our approach outperform or are comparable
to WPDP with statistical significance.
Categories and Subject Descriptors
D.2.9 [Software Engineering]: Management—software qual-
ity assurance
General Terms
Algorithm, Experimentation
Keywords
Defect prediction, quality assurance, heterogeneous metrics
1. INTRODUCTION
Software defect prediction is one of the most active re-
search areas in software engineering [8, 9, 24, 25, 26, 36,
37, 43, 47, 58, 59]. If software quality assurance teams can
predict defects before releasing a software product, they can
effectively allocate limited resources for quality control [36,
38, 43, 58]. For example, Ostrand et al. applied defect pre-
diction in two large software systems of AT&T for effective
and efficient testing activities [38].
Most defect prediction models are based on machine learning; therefore, collecting defect datasets to train a prediction model is essential [8, 36]. The defect datasets consist of various software metrics and labels. Commonly used software metrics for defect prediction are complexity metrics (such as lines of code, Halstead metrics, McCabe's cyclomatic complexity, and CK metrics) and process metrics [2, 16, 32, 42].
Labels indicate whether the source code is buggy or clean
for binary classification [24, 37].
Most proposed defect prediction models have been evalu-
ated on within-project defect prediction (WPDP) settings [8,
24, 36]. In Figure 1a, each instance representing a source
code file or function consists of software metric values and is
labeled as buggy or clean. In the WPDP setting, a prediction model is trained using the labeled instances in Project A and predicts unlabeled ('?') instances in the same project as buggy or clean.
However, it is difficult to build a prediction model for new
software projects or projects with little historical informa-
tion [59] since they do not have enough training instances.
Since process metrics and label information are extracted from the historical data of software repositories such as version control and issue tracking systems [42], it is difficult to collect process metrics and instance labels in new projects or projects that have little historical data [9, 37, 59]. For example, without instances labeled using past defect data, it is not possible to build a prediction model.
To address this issue, researchers have proposed cross-
project defect prediction (CPDP) [19, 29, 37, 43, 51, 59].
CPDP approaches predict defects even for new projects lack-
ing in historical data by reusing prediction models built by
other project datasets. As shown in Figure 1b, a prediction
model is trained by labeled instances in Project A (source)
and predicts defects in Project B (target).
However, most CPDP approaches have a serious limita-
tion: CPDP is only feasible for projects which have exactly
the same metric set as shown in Figure 1b. Finding other
projects with exactly the same metric set can be challenging.
Publicly available defect datasets that are widely used in de-
fect prediction literature usually have heterogeneous metric
sets [8, 35, 37]. For example, many NASA datasets in the PROMISE repository have 37 metrics but AEEEM datasets used by D'Ambros et al. have 61 metrics [8, 35].

Figure 1: Various Defect Prediction Scenarios: (a) Within-Project Defect Prediction (WPDP), (b) Cross-Project Defect Prediction (CPDP) between projects with the same metric set, and (c) Heterogeneous Defect Prediction (HDP) between projects with heterogeneous metric sets.

The
only common metric between NASA and AEEEM datasets
is lines of code (LOC). CPDP between NASA and AEEEM
datasets with all metric sets is not feasible since they have
completely different metrics [51].
Some CPDP studies use only common metrics when source
and target datasets have heterogeneous metric sets [29, 51].
For example, Turhan et al. use only the 17 common metrics between the NASA and SOFTLAB datasets, which have heterogeneous metric sets [51]. However, finding other projects
with multiple common metrics can be challenging. As men-
tioned, there is only one common metric between NASA and
AEEEM. Also, only using common metrics may degrade the
performance of CPDP models. That is because some in-
formative metrics necessary for building a good prediction
model may not be in the common metrics across datasets.
For example, in the study of Turhan et al., the performance
of CPDP (0.35) by their approach did not outperform that
of WPDP (0.39) in terms of the average f-measure [51].
In this paper, we propose the heterogeneous defect predic-
tion (HDP) approach to predict defects across projects even
with heterogeneous metric sets. If the proposed approach is
feasible as in Figure 1c, we could reuse any existing defect
datasets to build a prediction model. For example, many PROMISE defect datasets, even though they have heterogeneous metric sets [35], could be used as training datasets to predict defects in any project.
The key idea of our HDP approach is matching metrics
that have similar distributions between source and target
datasets. In addition, we apply metric selection to remove less informative metrics from a source dataset before metric matching.
Our empirical study shows that HDP models are feasible
and their prediction performance is promising. About 68%
of HDP predictions outperform or are comparable to WPDP
predictions with statistical significance.
Our contributions are as follows:
- Propose the heterogeneous defect prediction models.
- Conduct an extensive, large-scale empirical study to evaluate the heterogeneous defect prediction models.
2. BACKGROUND AND RELATED WORK
CPDP approaches have recently been studied by many researchers [29, 37, 43, 51, 59]. Since the performance
of CPDP is usually very poor [59], researchers have proposed
various techniques to improve CPDP [29, 37, 51, 54].
Watanabe et al. proposed the metric compensation ap-
proach for CPDP [54]. The metric compensation transforms a target dataset to be similar to a source dataset by using average metric values [54]. To evaluate the performance of the
metric compensation, Watanabe et al. collected two defect
datasets with the same metric set (8 object-oriented metrics)
from two software projects and then conducted CPDP [54].
Rahman et al. evaluated the CPDP performance in terms
of cost-effectiveness and confirmed that the prediction per-
formance of CPDP is comparable to WPDP [43]. For the
empirical study, Rahman et al. collected 9 datasets with the
same process metric set [43].
Fukushima et al. conducted an empirical study of just-in-
time defect prediction in the CPDP setting [9]. They used
16 datasets with the same metric set [9]. Eleven of the datasets were provided by Kamei et al., and 5 projects were newly collected with the same metric set as the 11 datasets [9, 20].
However, collecting datasets with the same metric set
might limit CPDP. For example, if existing defect datasets
contain object-oriented metrics such as CK metrics [2], col-
lecting the same object-oriented metrics is impossible for
projects that are written in non-object-oriented languages.
Turhan et al. proposed the nearest-neighbour (NN) filter
to improve the performance of CPDP [51]. The basic idea of
the NN filter is that prediction models are built by source in-
stances that are nearest-neighbours of target instances [51].
To conduct CPDP, Turhan et al. used 10 NASA and SOFT-
LAB datasets in the PROMISE repository [35, 51].
Ma et al. proposed Transfer Naive Bayes (TNB) [29].
The TNB builds a prediction model by weighting source
instances similar to target instances [29]. Using the same
datasets used by Turhan et al., Ma et al. evaluated the
TNB models for CPDP [29, 51].
Since the datasets used in the empirical studies of Turhan
et al. and Ma et al. have heterogeneous metric sets, they
conducted CPDP using the common metrics [29, 51]. There
is another CPDP study with the top-K common metric sub-
set [17]. However, as explained in Section 1, CPDP using
common metrics is worse than WPDP [17, 51].
Nam et al. adapted a state-of-the-art transfer learning
technique called Transfer Component Analysis (TCA) and
proposed TCA+ [37]. They used 8 datasets in two groups,
ReLink and AEEEM, with 26 and 61 metrics respectively [37].

However, Nam et al. could not conduct CPDP between
ReLink and AEEEM because they have heterogeneous met-
ric sets. Since the pool of projects with the same metric set is very limited, conducting CPDP within such a group is limited as well. For example,
at most 18% of defect datasets in the PROMISE repository
have the same metric set [35]. In other words, we cannot di-
rectly conduct CPDP for the 18% of the defect datasets by
using the remaining (82%) datasets in the PROMISE repos-
itory [35]. CPDP studies conducted by Canfora et al. and Panichella et al. use only 10 Java projects with the same metric set from the PROMISE repository [4, 35, 39].
Zhang et al. proposed the universal model for CPDP [57].
The universal model is built using 1398 projects from Source-
Forge and Google code and leads to comparable prediction
results to WPDP in their experimental setting [57].
However, the universal defect prediction model may be difficult to apply to projects with heterogeneous met-
ric sets since the universal model uses 26 metrics including
code metrics, object-oriented metrics, and process metrics.
In other words, the model can only be applicable for target
datasets with the same 26 metrics. In the case where the
target project has not been developed in object-oriented lan-
guages, a universal model built using object-oriented metrics
cannot be used for the target dataset.
He et al. addressed the limitations due to heterogeneous
metric sets in CPDP studies listed above [18]. Their ap-
proach, CPDP-IFS, used distribution characteristic vectors
of an instance as metrics. The prediction performance of
their best approach is comparable to or helpful in improv-
ing regular CPDP models [18].
However, the approach by He et al. is not compared with
WPDP [18]. Although their best approach is helpful to im-
prove regular CPDP models, the evaluation might be weak
since the prediction performance of a regular CPDP is usu-
ally very poor [59]. In addition, He et al. conducted exper-
iments on only 11 projects in 3 dataset groups [18].
We propose HDP to address the above limitations caused
by projects with heterogeneous metric sets. Contrary to the
study by He et al. [18], we compare HDP to WPDP, and
HDP achieved better or comparable prediction performance
to WPDP in about 68% of predictions. In addition, we
conducted extensive experiments on 28 projects in 5 dataset
groups. In Section 3, we explain our approach in detail.
3. APPROACH
Figure 2 shows the overview of HDP based on metric se-
lection and metric matching. In the figure, we have two
datasets, Source and Target, with heterogeneous metric sets.
Each row and column of a dataset represents an instance
and a metric, respectively, and the last column represents
instance labels. As shown in the figure, the metric sets in
the source and target datasets are not identical (X1 to X4 and Y1 to Y7, respectively).
Given source and target datasets with heterogeneous metric sets, we first apply a feature selection technique to the source dataset. Feature selection is a common
approach used in machine learning for selecting a subset of
features by removing redundant and irrelevant features [13].
We apply widely used feature selection techniques for metric selection of a source dataset, as described in Section 3.1 [10, 47].
After that, source and target metrics are matched up based on their similarity, such as the similarity of their distributions or their correlation.

Figure 2: Heterogeneous defect prediction (overview). The source dataset (Project A) with metrics X1 to X4 goes through metric selection, the selected source metrics are matched to the target dataset (Project B) metrics Y1 to Y7, a prediction model is built (training) on the matched source metrics, and the model predicts (test) the labels of the target instances.

In Figure 2, three target metrics are
matched with the same number of source metrics.
After these processes, we finally arrive at a matched source
and target metric set. With the final source dataset, HDP
builds a model and predicts labels of target instances.
In the following subsections, we explain the metric selec-
tion and matching in detail.
3.1 Metric Selection in Source Datasets
For metric selection, we used various feature selection ap-
proaches widely used in defect prediction such as gain ra-
tio, chi-square, relief-F, and significance attribute evalua-
tion [10, 47]. According to benchmark studies about various
feature selection approaches, a single best feature selection
approach for all prediction models does not exist [5, 15, 28].
For this reason, we conduct experiments under different fea-
ture selection approaches. When applying feature selection
approaches, we select the top 15% of metrics, as suggested by Gao et al. [10]. In addition, we compare the prediction results with and without metric selection in the experiments.
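To make this step concrete, the following is a minimal Python sketch (not the authors' WEKA-based implementation); mutual information stands in for scorers such as gain ratio, and the names select_source_metrics, X_src, and y_src are illustrative.

    # Illustrative sketch of metric selection (Section 3.1): keep the top 15%
    # of source metrics by a feature-scoring function. The paper uses gain
    # ratio, chi-square, relief-F, and significance attribute evaluation;
    # mutual information is used here only as a stand-in scorer.
    import numpy as np
    from sklearn.feature_selection import SelectPercentile, mutual_info_classif

    def select_source_metrics(X_source, y_source, percentile=15):
        selector = SelectPercentile(score_func=mutual_info_classif,
                                    percentile=percentile)
        X_selected = selector.fit_transform(X_source, y_source)
        return X_selected, selector.get_support(indices=True)

    # Toy usage with random data standing in for a source defect dataset.
    rng = np.random.default_rng(0)
    X_src = rng.random((200, 40))          # 200 instances, 40 metrics
    y_src = rng.integers(0, 2, size=200)   # buggy (1) / clean (0) labels
    X_sel, kept = select_source_metrics(X_src, y_src)
    print(X_sel.shape, kept)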
3.2 Matching Source and Target Metrics
To match source and target metrics, we measure the sim-
ilarity of each source and target metric pair by using several
existing methods such as percentiles, Kolmogorov-Smirnov
Test, and Spearman’s correlation coefficient [30, 49]. We de-
fine the following three analyzers for metric matching:
Percentile based matching (PAnalyzer)
Kolmogorov-Smirnov Test based matching (KSAnalyzer)
Spearman’s correlation based matching (SCoAnalyzer)
The key idea of these analyzers is computing matching
scores for all pairs between the source and target metrics.
Figure 3 shows a sample matching. There are two source metrics (X1 and X2) and two target metrics (Y1 and Y2). Thus, there are four possible matching pairs: (X1,Y1), (X1,Y2), (X2,Y1), and (X2,Y2). The numbers in rectangles between matched source and target metrics in Figure 3 represent matching scores computed by an analyzer. For example, the matching score between the metrics X1 and Y1 is 0.8.

Figure 3: An example of metric matching between source and target datasets. The matching scores are (X1,Y1)=0.8, (X1,Y2)=0.3, (X2,Y1)=0.4, and (X2,Y2)=0.5.
From all pairs between the source and target metrics, we
remove poorly matched metrics whose matching score is not
greater than a specific cutoff threshold. For example, if the
matching score cutoff threshold is 0.3, we include only the
matched metrics whose matching score is greater than 0.3.
In Figure 3, the edge (X1,Y2) in matched metrics will be excluded when the cutoff threshold is 0.3. Thus, the candidate matching pairs we can consider include the edges (X1,Y1), (X2,Y2), and (X2,Y1) in this example. In Section 4,
we design our empirical study under different matching score
cutoff thresholds to investigate their impact on prediction.
We may not have any matched metrics based on the cutoff
threshold. In this case, we cannot conduct defect prediction.
In Figure 3, if the cutoff threshold is 0.9, none of the matched
metrics are considered for HDP so we cannot build a pre-
diction model for the target dataset. For this reason, we
investigate target prediction coverage (i.e. what percentage
of target datasets could be predicted?) in our experiments.
After applying the cutoff threshold, we used the maximum
weighted bipartite matching [31] technique to select a group
of matched metrics, whose sum of matching scores is highest,
without duplicated metrics. In Figure 3, after applying the
cutoff threshold of 0.30, we can form two groups of matched
metrics without duplicated metrics. The first group con-
sists of the edges, (X
1
,Y
1
) and (X
2
,Y
2
), and another group
consists of the edge (X
2
,Y
1
). In each group, there are no
duplicated metrics. The sum of matching scores in the first
group is 1.3 (=0.8+0.5) and that of the second group is 0.4.
The first group has a greater sum (1.3) of matching scores
than the second one (0.4). Thus, we select the first match-
ing group as the set of matched metrics for the given source
and target metrics with the cutoff threshold of 0.30 in this
example.
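As a concrete illustration of the cutoff-plus-matching step, here is a small Python sketch; it is not the authors' implementation (which uses a maximum weighted bipartite matching algorithm [31]) but approximates the selection with SciPy's assignment solver, and the name match_metrics is hypothetical.

    # Sketch of metric matching with a cutoff threshold (Section 3.2).
    # Scores not greater than the cutoff are zeroed out and filtered from
    # the result; the assignment solver then picks a duplicate-free set of
    # pairs with (near) maximal total score.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_metrics(score_matrix, cutoff=0.30):
        # score_matrix[i, j]: matching score between source metric i and
        # target metric j, as computed by an analyzer.
        scores = np.where(score_matrix > cutoff, score_matrix, 0.0)
        rows, cols = linear_sum_assignment(scores, maximize=True)
        return [(i, j) for i, j in zip(rows, cols) if score_matrix[i, j] > cutoff]

    # The Figure 3 example: rows are X1, X2; columns are Y1, Y2.
    scores = np.array([[0.8, 0.3],
                       [0.4, 0.5]])
    print(match_metrics(scores, cutoff=0.30))   # [(0, 0), (1, 1)] -> (X1,Y1), (X2,Y2)

For the Figure 3 scores, this selects (X1,Y1) and (X2,Y2), i.e. the group with the larger score sum (1.3) described above.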
Each analyzer for the metric matching scores is described
below.
3.2.1 PAnalyzer
PAnalyzer simply compares 9 percentiles (10th, 20th, ..., 90th) of the ordered values of source and target metrics.
First, we compare the n-th percentiles of the source and target metric values by the following comparison function:

P_{ij}(n) = \frac{sp_{ij}(n)}{bp_{ij}(n)}    (1)

where P_{ij}(n) is the comparison function for the n-th percentiles of the i-th source and j-th target metrics, and sp_{ij}(n) and bp_{ij}(n) are the smaller and bigger of the two n-th percentile values, respectively. For example, if the 10th percentile of the source metric values is 20 and that of the target metric values is 15, the comparison value is 0.75 (P_{ij}(10) = 15/20 = 0.75).
Using this percentile comparison function, a matching score between source and target metrics is calculated by the following equation:

M_{ij} = \frac{\sum_{k=1}^{9} P_{ij}(10 \times k)}{9}    (2)

where M_{ij} is the matching score between the i-th source and j-th target metrics. The best matching score of this equation is 1.0, which occurs when the values of the source and target metrics are the same at all 9 percentiles.
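A minimal sketch of this analyzer follows (illustrative only; panalyzer_score and the sample data are hypothetical, and nonnegative metric values are assumed, as is typical for software metrics).

    # Sketch of the PAnalyzer matching score (Eq. 1 and 2): average, over the
    # 10th..90th percentiles, of (smaller percentile / bigger percentile).
    import numpy as np

    def panalyzer_score(source_values, target_values):
        score = 0.0
        for n in range(10, 100, 10):
            sp = np.percentile(source_values, n)
            bp = np.percentile(target_values, n)
            lo, hi = min(sp, bp), max(sp, bp)
            # If both percentiles are 0, the metrics agree perfectly there.
            score += 1.0 if hi == 0 else lo / hi
        return score / 9

    src = np.random.exponential(scale=2.0, size=300)   # a source metric column
    tgt = np.random.exponential(scale=2.5, size=150)   # a target metric column
    print(panalyzer_score(src, tgt))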
3.2.2 KSAnalyzer
KSAnalyzer uses a p-value from the Kolmogorov-Smirnov
Test (KS-test) as a matching score between source and tar-
get metrics. The KS-test is a non-parametric two-sample test that is applicable when we cannot be sure about the normality or equal variance of the two samples [27, 30]. Since metrics in some defect datasets used in our empirical study have exponential distributions [36] and metrics in other datasets have unknown distributions and variances, the KS-test is a suitable statistical test to compute p-values for these datasets. In statistical testing, the p-value quantifies the evidence against the null hypothesis, here that the two samples are drawn from the same distribution. We used the KolmogorovSmirnovTest implemented in the Apache commons math library.
The matching score is:

M_{ij} = p_{ij}    (3)

where p_{ij} is the p-value from the KS-test of the i-th source and j-th target metrics. The p-value tends toward zero when two metrics are significantly different.
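A minimal sketch of this analyzer follows; the paper uses the Java Apache Commons Math class named above, and scipy.stats.ks_2samp plays the equivalent role here (ksanalyzer_score is a hypothetical name).

    # Sketch of the KSAnalyzer score (Eq. 3): the p-value of a two-sample
    # Kolmogorov-Smirnov test; values near 0 mean the distributions differ.
    import numpy as np
    from scipy.stats import ks_2samp

    def ksanalyzer_score(source_values, target_values):
        statistic, p_value = ks_2samp(source_values, target_values)
        return p_value

    src = np.random.exponential(scale=2.0, size=300)
    tgt = np.random.exponential(scale=2.0, size=150)
    print(ksanalyzer_score(src, tgt))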
3.2.3 SCoAnalyzer
In SCoAnalyzer, we used Spearman's rank correlation coefficient as a matching score for source and target metrics [49]. Spearman's rank correlation measures how strongly two samples are correlated [49]. To compute the coefficient, we used the SpearmansCorrelation class in the Apache commons math library. Since the two metric vectors must be the same size to compute the coefficient, we randomly sample values from the larger metric vector. For example, if the sizes of the source and target metric vectors are 110 and 100 respectively, we randomly select 100 metric values from the source metric so that the source and target sizes agree. All metric values are sorted before computing the coefficient.
The matching score is as follows:

M_{ij} = c_{ij}    (4)

where c_{ij} is the Spearman's rank correlation coefficient between the i-th source and j-th target metrics.
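A minimal sketch of this analyzer, following the subsample-and-sort procedure described above (scoanalyzer_score is a hypothetical name; scipy.stats.spearmanr stands in for the Apache Commons Math class):

    # Sketch of the SCoAnalyzer score (Eq. 4): Spearman correlation of two
    # metric vectors after subsampling the larger one and sorting both.
    import numpy as np
    from scipy.stats import spearmanr

    def scoanalyzer_score(source_values, target_values, seed=0):
        rng = np.random.default_rng(seed)
        src = np.asarray(source_values, dtype=float)
        tgt = np.asarray(target_values, dtype=float)
        n = min(src.size, tgt.size)
        if src.size > n:
            src = rng.choice(src, size=n, replace=False)
        if tgt.size > n:
            tgt = rng.choice(tgt, size=n, replace=False)
        coefficient, _ = spearmanr(np.sort(src), np.sort(tgt))
        return coefficient

    src = np.random.exponential(scale=2.0, size=110)
    tgt = np.random.exponential(scale=3.0, size=100)
    print(scoanalyzer_score(src, tgt))

Note that because both vectors are sorted before correlating, continuous metrics with distinct values tend to yield coefficients close to 1; this is a property of the procedure as described.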
3.3 Building Prediction Models
After applying metric selection and matching, we can fi-
nally build a prediction model using a source dataset with
selected and matched metrics. Then, as with a regular defect prediction model, we can predict defects in a target dataset whose metrics have been matched to the selected source metrics.
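Putting the pieces together, the following is a small end-to-end Python sketch under the assumptions of the earlier snippets (hdp_predict, matched_pairs, and the toy data are hypothetical; logistic regression is used here only as an example learner):

    # End-to-end sketch of Section 3.3: train on the source dataset restricted
    # to the matched source metrics, reorder the target columns to follow the
    # matched pairs, and predict labels of the target instances.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def hdp_predict(X_source, y_source, X_target, matched_pairs):
        # matched_pairs: list of (source_metric_index, target_metric_index)
        # produced by an analyzer plus the cutoff and matching step.
        src_idx = [i for i, _ in matched_pairs]
        tgt_idx = [j for _, j in matched_pairs]
        model = LogisticRegression(max_iter=1000)
        model.fit(X_source[:, src_idx], y_source)        # build (training)
        return model.predict(X_target[:, tgt_idx])       # predict (test)

    # Toy usage with the Figure 3 matching result [(0, 0), (1, 1)].
    rng = np.random.default_rng(1)
    X_src, y_src = rng.random((200, 2)), rng.integers(0, 2, 200)
    X_tgt = rng.random((50, 2))
    print(hdp_predict(X_src, y_src, X_tgt, [(0, 0), (1, 1)])[:10])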

Table 1: The 28 defect datasets from five groups.

Group            Dataset        # of instances          # of     Prediction
                                All     Buggy (%)        metrics  granularity
AEEEM [8, 37]    EQ             325     129 (39.69%)     61       Class
                 JDT            997     206 (20.66%)
                 LC             399     64 (9.26%)
                 ML             1862    245 (13.16%)
                 PDE            1492    209 (14.01%)
ReLink [56]      Apache         194     98 (50.52%)      26       File
                 Safe           56      22 (39.29%)
                 ZXing          399     118 (29.57%)
MORPH [40]       ant-1.3        125     20 (16.00%)      20       Class
                 arc            234     27 (11.54%)
                 camel-1.0      339     13 (3.83%)
                 poi-1.5        237     141 (59.49%)
                 redaktor       176     27 (15.34%)
                 skarbonka      45      9 (20.00%)
                 tomcat         858     77 (8.97%)
                 velocity-1.4   196     147 (75.00%)
                 xalan-2.4      723     110 (15.21%)
                 xerces-1.2     440     71 (16.14%)
NASA [35, 45]    cm1            327     42 (12.84%)      37       Function
                 mw1            253     27 (10.67%)
                 pc1            705     61 (8.65%)
                 pc3            1077    134 (12.44%)
                 pc4            1458    178 (12.21%)
SOFTLAB [51]     ar1            121     9 (7.44%)        29       Function
                 ar3            63      8 (12.70%)
                 ar4            107     20 (18.69%)
                 ar5            36      8 (22.22%)
                 ar6            101     15 (14.85%)
4. EXPERIMENTAL SETUP
4.1 Research Questions
To systematically evaluate heterogeneous defect predic-
tion (HDP) models, we set three research questions.
RQ1: Is heterogeneous defect prediction comparable to
WPDP (Baseline1)?
RQ2: Is heterogeneous defect prediction comparable to
CPDP using common metrics (CPDP-CM, Baseline2)?
RQ3: Is heterogeneous defect prediction comparable to
CPDP-IFS (Baseline3)?
RQ1, RQ2, and RQ3 lead us to investigate whether our HDP is comparable to WPDP (Baseline1), CPDP-CM (Baseline2), and CPDP-IFS (Baseline3) [18].
4.2 Benchmark Datasets
We collected publicly available datasets from previous stud-
ies [8, 37, 40, 51, 56]. Table 1 lists all dataset groups used in
our experiments. Each dataset group has a heterogeneous
metric set as shown in the table. Prediction Granularity in the last column of the table indicates the granularity of the instances. Since we focus on the distribution or correla-
tion of metric values when matching metrics, it is beneficial
to be able to apply the HDP approach on datasets even in
different granularity levels.
We used five groups with 28 defect datasets: AEEEM,
ReLink, MORPH, NASA, and SOFTLAB.
AEEEM was used to benchmark different defect predic-
tion models [8] and to evaluate CPDP techniques [18, 37].
Each AEEEM dataset consists of 61 metrics including object-
oriented (OO) metrics, previous-defect metrics, entropy met-
rics of change and code, and churn-of-source-code metrics [8].
Datasets in ReLink were used by Wu et al. [56] to improve
the defect prediction performance by increasing the quality
of the defect data and have 26 code complexity metrics ex-
tracted by the Understand tool [52].
The MORPH group contains defect datasets of several open source projects used in a study on dataset privacy for defect prediction [40]. The 20 metrics used
in MORPH are McCabe’s cyclomatic metrics, CK metrics,
and other OO metrics [40].
NASA and SOFTLAB contain proprietary datasets from
NASA and a Turkish software company, respectively [51].
We used five NASA datasets, which share the same metric
set in the PROMISE repository [35, 45]. We used cleaned
NASA datasets (DS′ version) [45]. For the SOFTLAB group,
we used all SOFTLAB datasets in the PROMISE reposi-
tory [35]. The metrics used in both NASA and SOFTLAB
groups are Halstead and McCabe’s cyclomatic metrics but
NASA has additional complexity metrics such as parameter
count and percentage of comments [35].
Predicting defects is conducted across different dataset
groups. For example, we build a prediction model with Apache in ReLink and test the model on velocity-1.4 in MORPH (Apache→velocity-1.4).[1]
We did not conduct defect prediction across projects in the
same group where datasets have the same metric set since
the focus of our study is on prediction across datasets with
heterogeneous metric sets. In total, we have 600 possible
prediction combinations from these 28 datasets.
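As a quick sanity check on this count, here is a small worked computation using the group sizes from Table 1 (the variable names are illustrative):

    # Ordered (source, target) dataset pairs that cross dataset-group
    # boundaries, using the group sizes from Table 1.
    group_sizes = {"AEEEM": 5, "ReLink": 3, "MORPH": 10, "NASA": 5, "SOFTLAB": 5}
    total = sum(group_sizes.values())                                  # 28 datasets
    all_ordered_pairs = total * (total - 1)                            # 756
    same_group_pairs = sum(n * (n - 1) for n in group_sizes.values())  # 156
    print(all_ordered_pairs - same_group_pairs)                        # 600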
4.3 Matching Score Cutoff Thresholds
To build HDP models, we apply various cutoff thresholds
for matching scores to observe how prediction performance
varies according to different cutoff values. Matched metrics
by analyzers have their own matching scores as explained in
Section 3. We apply different cutoff values (0.05, 0.10, 0.20, ..., 0.90) for the HDP models. If a matching score cutoff is 0.50, we remove matched metrics with a matching score ≤ 0.50 and build a prediction model with matched metrics whose score is > 0.50. The number of matched metrics varies by prediction combination. For example, when using KSAnalyzer with the cutoff of 0.05, the number of matched metrics is four in cm1→ar5 while it is one in ar6→pc3. The average number of matched metrics also varies by analyzer and cutoff value: 4 (PAnalyzer), 2 (KSAnalyzer), and 5 (SCoAnalyzer) at the cutoff of 0.05, but 1 (PAnalyzer), 1 (KSAnalyzer), and 4 (SCoAnalyzer) at the cutoff of 0.90, on average.
4.4 Baselines
We compare HDP to three baselines: WPDP (Baseline1),
CPDP using common metrics (CPDP-CM) between source
and target datasets (Baseline2), and CPDP-IFS (Baseline3).
We first compare HDP to WPDP. Comparing HDP to
WPDP will provide empirical evidence of whether our HDP
models are applicable in practice.
We conduct CPDP using only common metrics (CPDP-
CM) between source and target datasets as in previous CPDP
studies [18, 29, 51]. For example, AEEEM and MORPH
have OO metrics as common metrics so we use them to
build prediction models for datasets between AEEEM and
MORPH. Since using common metrics has been adopted to
address the limitation on heterogeneous metric sets in previ-
ous CPDP studies [18, 29, 51], we set CPDP-CM as a base-
line to evaluate our HDP models. The number of matched
metrics varies across the dataset group. Between AEEEM
[1] Hereafter a rightward arrow (→) denotes a prediction combination.
