Zurich Open Repository and Archive
University of Zurich
University Library
Strickhofstrasse 39
CH-8057 Zurich
www.zora.uzh.ch
Year: 2019
Branch Coverage Prediction in Automated Testing
Grano, Giovanni; Titov, Timofey V; Panichella, Sebastiano; Gall, Harald C
DOI: https://doi.org/10.1002/smr.2158
Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-169144
Journal Article
Accepted Version
Originally published at:
Grano, Giovanni; Titov, Timofey V; Panichella, Sebastiano; Gall, Harald C (2019). Branch Coverage Prediction in Automated Testing. Journal of Software: Evolution and Process, 31(9):1-22.
DOI: https://doi.org/10.1002/smr.2158

RESEARCH PAPER
Branch Coverage Prediction in Automated Testing
Giovanni Grano | Timofey V. Titov | Sebastiano Panichella | Harald C. Gall
Department of Informatics, University of
Zurich, Zurich, Switzerland
Correspondence
Giovanni Grano, Binzmühlestrasse 14, Zurich,
Switzerland
Email: grano@ifi.uzh.ch
Summary
Software testing is crucial in continuous integration (CI). Ideally, at every commit, all the test cases
should be executed and, moreover, new test cases should be generated for the new source code.
This is especially true in a Continuous Test Generation (CTG) environment, where the automatic
generation of test cases is integrated into the continuous integration pipeline. In this context,
developers want to achieve a certain minimum level of coverage for every software build. How-
ever, executing all the test cases and, moreover, generating new ones for all the classes at every
commit is not feasible. As a consequence, developers have to select which subset of classes
has to be tested and/or targeted by test-case generation. We argue that knowing a priori the
branch-coverage that can be achieved with test-data generation tools can help developers in
taking informed decisions about those issues. In this paper, we investigate the possibility of using
source-code metrics to predict the coverage achieved by test-data generation tools. We use four
different categories of source-code features and assess the prediction on a large dataset involv-
ing more than 3,000 Java classes. We compare different machine learning algorithms and conduct
a fine-grained feature analysis aimed at investigating the factors that most impact the prediction
accuracy. Moreover, we extend our investigation to four different search-budgets. Our evaluation
shows that the best model achieves an average 0.15 and 0.21 MAE on nested cross-validation
over the different budgets, respectively on EVOSUITE and RANDOOP. Finally, the discussion of the
results demonstrates the relevance of coupling-related features for the prediction accuracy.
KEYWORDS:
Machine Learning, Software Testing, Automated Software Testing, Coverage Prediction
1 INTRODUCTION
Software testing is widely recognized as a crucial task in any software development process 1, estimated to account for at least half of the entire
development cost 2,3. In recent years, we have witnessed a wide adoption of continuous integration (CI) practices, where new or changed code is inte-
grated extremely frequently into the main codebase. Testing plays an important role in such a pipeline: in an ideal world, at every single commit,
every system’s test case should be executed (regression testing). Moreover, additional test cases might be automatically generated to test all the
new —or modified— code introduced into the main codebase 4. This is especially true in a Continuous Test Generation (CTG) environment, where
the generation of test cases is directly integrated into the continuous integration cycle 4. However, due to the time constraints between frequent
commits, a complete regression testing is not feasible for large projects 5. Furthermore, even test suite augmentation 6, i.e., the automatic generation
of tests considering code changes and their effect on the previous codebase, is hardly doable due to the extensive amount of time needed to generate tests
for just a single class.
This work is an extension of the conference paper presented at the MaLTeSQuE 2018 workshop.

As developers want to ensure a certain minimum level of branch coverage for every build, these computational constraints cause many
challenges. For instance, developers have to select and rank a subset of classes for which to run test-data generation tools, or allocate a search-
budget (i.e., time) to devote to the test generation for each class. Knowing a priori the coverage that will be achieved by test-data generation tools
with a given search-budget can help developers in taking informed decisions to answer such questions: in the following, we give some practical examples
of decisions that can be taken by exploiting such a prediction. With the goal of maximizing the branch-coverage on the entire system, developers might
want to prioritize the test-data generation effort towards the classes for which they know a high branch-coverage can be achieved. On the contrary,
they would avoid spending precious computational time in running test-case generation tools against classes that will never reach a satisfying level
of coverage; for these cases, developers will likely manually write more effective tests. Similarly, knowing the achievable coverage given a certain
search-budget, developers might be able to allocate such a budget in a more efficient way, with the goal of maximizing the achieved coverage and
minimizing the time spent on the generation.
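As an illustration of this last scenario, the following minimal sketch shows one way a coverage predictor could drive budget allocation: classes predicted to gain little from a longer search keep the default budget, while the remaining generation time goes to classes where a longer search is predicted to pay off. The predict_coverage callable and the concrete budgets and threshold are hypothetical, not part of the original study.

```python
# Hypothetical sketch: allocate search-budgets based on predicted branch coverage.
# `predict_coverage(cls, budget)` stands in for a trained per-budget model; the
# thresholds below are illustrative values, not results from the paper.

DEFAULT_BUDGET = 60      # seconds (EvoSuite default)
EXTENDED_BUDGET = 300    # seconds

def allocate_budgets(classes, predict_coverage, min_gain=0.10):
    """Give the extended budget only to classes whose predicted coverage
    improves by at least `min_gain` over the default budget."""
    allocation = {}
    for cls in classes:
        base = predict_coverage(cls, DEFAULT_BUDGET)
        extended = predict_coverage(cls, EXTENDED_BUDGET)
        allocation[cls] = EXTENDED_BUDGET if extended - base >= min_gain else DEFAULT_BUDGET
    return allocation
```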
To address such questions, we built a machine learning (ML) model to predict the branch coverage that will be achieved by two test-data genera-
tion tools —EVOSUITE 7 and RANDOOP 8— on a given class under test (CUT). However, since the achievable branch-coverage strongly depends on
the search-budget (i.e., the time) allocated for the generation, we run each of the aforementioned tools with four different search-
budgets. It is important to note that the branch-coverage achieved with every budget represents the dependent variable our models try to predict.
Therefore, in this study we train and evaluate four different machine learning models for each tool, where every model is specialized in the pre-
diction of the achievable coverage given the allocated search-budget. It is worth noting that this specialization is needed to address particular
questions, like the choice of the search-budget for each test-case generation.
To select the features needed to train the aforementioned models, we investigate metrics able to represent —or measure— the complexity of
a class under test. Thus, we select a total of 79 factors coming from four different categories. We focus on source-code metrics for the following
reasons: (i) they can be obtained statically, without actually executing the code; (ii) they are easy to compute; and (iii) they usually come for free
in a continuous integration (CI) environment, where the code is constantly analyzed by several quality checkers. Among others, we rely on
well-established source-code metrics such as the Chidamber and Kemerer (CK) 9 and the Halstead metrics 10. To detect the best algorithm for the
branch-coverage prediction, we experiment with four distinct algorithms covering distinct algorithmic families. In the end, we find the Random
Forest Regressor algorithm to be the best performing one in the context of branch-coverage prediction. Our final model shows an average Mean
Absolute Error (MAE) of about 0.15 for EVOSUITE and of about 0.22 for RANDOOP, on average over the experimented budgets. Considering the
performance of the devised models, we argue that they can be practically useful to predict the coverage that will be achieved by test-data generation
tools in a real-case scenario. We believe that this approach can support developers in taking informed decisions when it comes to deploying and
practically using test-case generation.
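For readers who want to reproduce this kind of algorithm comparison, a minimal scikit-learn sketch is shown below. It assumes a feature matrix X and a coverage vector y are already available, and it uses default hyper-parameters rather than the tuned configurations of the study; it is not the paper's exact experimental pipeline.

```python
# Minimal sketch (not the paper's exact setup): compare the four regressor
# families on MAE using cross-validation with scikit-learn.
from sklearn.linear_model import HuberRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def compare_regressors(X, y):
    models = {
        "Huber Regression": HuberRegressor(),
        "Support Vector Regression": SVR(),
        "Multi-Layer Perceptron": MLPRegressor(max_iter=1000),
        "Random Forest Regressor": RandomForestRegressor(n_estimators=100),
    }
    for name, model in models.items():
        # neg_mean_absolute_error is negated so that higher is better; flip the sign back.
        scores = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
        print(f"{name}: MAE = {scores.mean():.3f} (+/- {scores.std():.3f})")
```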
Contributions of the Paper
In this paper we define and evaluate machine learning models with the goal of predicting the branch coverage achievable by test-data generation tools
like EVOSUITE 7 and RANDOOP 8. The main contributions of the paper are:
• We investigate four different categories of code-quality metrics, i.e., Package Level, CK and OO, Java Reserved Keyword and Halstead metrics 10, as features for the machine learning models;
• We evaluate the performance of four different machine learning algorithms, i.e., Huber Regression, Support Vector Regression, Multi-Layer Perceptron and Random Forest Regressor, for the branch prediction model;
• We perform a large scale study involving seven large open-source projects for a total of 3,105 Java classes;
• We extensively ran EVOSUITE and RANDOOP over all the classes of the study context, experimenting with four different budgets and multiple executions. The overall execution was parallelized over several multi-core servers, as the mere generation of such an amount of tests would take months on a single-CPU setup.
Novel Contribution of the Extension
This paper is an extension of our seminal work 11, in which we first proposed to predict the branch coverage that will be achieved by test-data
generation tools. In the following, we summarize the novel contributions of this paper with respect to the original one:
• We introduce a new set of features, i.e., the Halstead metrics 10. They aim at determining a quantitative measure of complexity directly from the operators and operands in a class;
• We introduce a Random Forest Regressor —an ensemble algorithm— for the branch prediction problem;

[Figure 1: 3,105 Java classes from 7 projects; metric extractors; automated testing tools (EvoSuite, Randoop) run with 4 budgets; achieved coverage measured with JaCoCo; observation frame; final dataset]
FIGURE 1 Construction process of the training dataset
• We add a fine-grained analysis aimed at understanding the importance of the employed features in the prediction;
• We build the models with three additional budgets beyond the default one. We generate a test suite multiple times for each class and budget, averaging the results; our preliminary findings were based on a single generation;
• We use a larger dataset to train and evaluate the proposed machine learning models.
Structure of the Paper
Section 2 describes the features we used to train and validate the proposed machine learning models; the extraction process is also
detailed in this section. Section 3.1 introduces the research questions of the empirical study together with its context: there, we present the subjects
of the study, the procedure we use to build the training set, and the machine learning algorithms employed. Section 4 describes the steps towards the
resolution of the proposed research questions, while the achieved results are presented in Section 5; the practical implications of the main findings
are then discussed in Section 6. Section 7 discusses the main threats, while related work is presented in Section 8. Finally, Section 9 concludes
the paper, outlining future work.
2 DATASET AND FEATURES DESCRIPTION
Supervised learning models —like the ones we employ in this study— are used to predict a certain output given a set of inputs, learning from exam-
ples of input/output pairs 12. More formally, we define x^(i) as the independent input variables —also called input features— and y^(i) as the output —also
called the dependent variable— that we are trying to predict. A pair (x^(i), y^(i)) is a single training example, and the entire dataset used to learn, i.e., the
training set, is a list of m training examples {(x^(i), y^(i)), i = 1, ..., m}. The learning problem consists of learning a function h : X → Y such that h predicts
the value of y with good accuracy.
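As a concrete illustration of this formulation, the toy sketch below builds a small set of (x^(i), y^(i)) pairs and fits a function h; the feature values and coverage numbers are made up solely to mirror the notation.

```python
# Toy illustration of the supervised-learning notation (made-up numbers):
# each x^(i) is a feature vector for a class, y^(i) its achieved branch coverage.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[10, 2, 0.3],   # x^(1)
              [45, 7, 0.8],   # x^(2)
              [23, 4, 0.5]])  # x^(3)
y = np.array([0.92, 0.41, 0.67])  # y^(1..3), branch coverage in [0, 1]

h = RandomForestRegressor(n_estimators=50).fit(X, y)  # learn h : X -> Y
print(h.predict([[30, 5, 0.6]]))  # predicted coverage for a new, unseen class
```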
In the context of this study, we build four different datasets, one for each investigated search-budget, consisting of a tuple (x^(i), y^(i)_{b,t}) for each
class i, where:
• x^(i) is a vector containing the values for all the factors we described in Section 2.2, for a given class i;
• y^(i)_{b,t} is the branch-coverage value —in the range [0, 1]— of the test suite generated by a test-data generator tool t, for the class i with the search-budget b. This is the dependent variable we want to predict.
It is worth noting that we investigate the prediction performance for two tools; since we experiment with four search-budgets, we build a total of eight
different training sets. The process we use for the construction of such training sets is depicted in Figure 1. At first, we execute the scripts needed
for the extraction of the factors described in Section 2.2 (step 2 in Figure 1); those values form the {x^(i), i = 1, ..., m} feature vectors, one for each
subject class i. It is worth noting that the feature vector is the same for all the budgets: indeed, all the factors refer to the classes under test;
thus, their value is not affected by the used search-budget. Then, we compute the dependent variables, as explained in Section 2.1, obtaining
the {y^(i)_{b,t}, i = 1, ..., m}, where y^(i)_{b,t} is the average coverage obtained for the class i by the tool t with a search budget b. At the end of the process,
we end up with a training dataset for each combination of tool and budget. More formally, for each budget b and tool t, we have a training dataset
{(x^(i), y^(i)_{b,t}), i = 1, ..., m} where x^(i) and y^(i)_{b,t} are respectively (i) the feature vector for the class i, and (ii) the average coverage achieved over
the independent runs by a test-data generator tool t, for a subject class i with a search-budget b. The procedure we use to calculate the dependent
variable is reported in Section 2.1.
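A hedged sketch of this construction step is shown below. It assumes the extracted features live in a CSV with one row per class and the averaged coverage values in another CSV with columns class, tool, budget, and coverage; the file and column names are illustrative assumptions, not the study's actual artifacts.

```python
# Illustrative sketch: build one training dataset per (tool, budget) combination.
# File and column names are assumptions made for this example.
import pandas as pd

features = pd.read_csv("features.csv")        # one row per class: 'class' + 79 factors
coverage = pd.read_csv("avg_coverage.csv")    # columns: class, tool, budget, coverage

datasets = {}
for (tool, budget), group in coverage.groupby(["tool", "budget"]):
    merged = features.merge(group[["class", "coverage"]], on="class")
    X = merged.drop(columns=["class", "coverage"])  # x^(i): feature vectors
    y = merged["coverage"]                          # y^(i)_{b,t}: average branch coverage
    datasets[(tool, budget)] = (X, y)               # eight datasets: 2 tools x 4 budgets
```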
2.1 Dependent Variable
As the dependent variable, we use the branch coverage achieved by the two experimented automated tools, i.e., EVOSUITE and RANDOOP. We run both
tools with four different budgets: default (i.e., 60 and 90 seconds, respectively for EVOSUITE and RANDOOP), 180 seconds (i.e., 3 minutes), 300
seconds (i.e., 5 minutes) and 600 seconds (i.e., 10 minutes). We select these budgets for the following reasons: 180 and 300 seconds have been
the budgets most exploited in the literature so far 13,14,15,16; 10 minutes is a longer budget that we select to get an intuition of what can be
expected with extra time allowed for the search. We do not experiment with longer budgets because of (i) the computation time needed to compute the
dependent variable for such longer budgets, and (ii) more importantly, the fact that the usage of test-data generation tools with such a long budget would be
hardly feasible in practice.
To collect the dependent variable, i.e., the variable we want to predict, we run both tools 10 times on the CUTs used in the study,
obtaining 10 different test suites per tool (i.e., 20 in total). We repeat this process for each budget we experiment with. It is worth noting that,
for the 10-minute budget, we only generate 3 test suites. To sum up, for each class in our dataset we run each tool 33 times. Thus, averaging the
branch coverage of those suites by class, tool and budget, we obtain the {y^(i)_{b,t}, i = 1, ..., m}, where y^(i)_{b,t} is the average coverage obtained for the
class i by a tool t with a search budget b. We use the dependent variable computed in this way to build the different training datasets as described above.
We average the results of different generations due to the non-deterministic nature of the algorithms underlying test-data generation tools. It is
worth noting that such a multiple execution represents an improvement over our previous work 11, where we executed the tools once per class and
with the default search-budget only.
To calculate the coverage for each run we proceed as follows (refer to box 1 in Figure 1): while EVOSUITE automatically reports such informa-
tion in its CSV report, we have to compile, execute and measure the achieved coverage for the tests generated by RANDOOP. For the measurement
step we rely on JaCoCo 1. It is worth noting that we discard the data points with branch coverage equal to 0. It is also important to underline how
time-consuming the described process is: we ran both EVOSUITE and RANDOOP multiple times for the 3,105 classes we use in the study, using four
different search-budgets, for about 820,000 executions. In a nutshell, we estimate the entire test generation process to take about 250 days on a sin-
gle-core machine. To speed up such a process, we ran the generation on an OPENSTACK cluster using three different 16-core Ubuntu servers, with
64 GB of RAM each. Figure 2 shows the distribution of the achieved branch-coverage for both EVOSUITE and RANDOOP over the four exper-
imented budgets. On the one hand, we can observe that EVOSUITE consistently reaches higher branch-coverage on the CUTs as the search-budget
increases: indeed, the mean of the achieved coverage ranges from 74% with the default budget (i.e., 60 seconds) to 81% with the 600-second
(i.e., 10 minutes) budget. On the other hand, the branch-coverage reached by RANDOOP does not seem particularly influenced by the budget given
to the search. It is worth noting that we measure the achieved coverage for RANDOOP over the regression test suites generated by the tool.
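The averaging and filtering just described could look roughly like the following sketch, assuming the per-run coverage values were collected into a CSV with columns class, tool, budget, run, and branch_coverage; these names are chosen here for illustration only.

```python
# Illustrative sketch: compute the dependent variable y^(i)_{b,t} from per-run results.
# Column names are assumptions made for this example.
import pandas as pd

runs = pd.read_csv("coverage_runs.csv")  # class, tool, budget, run, branch_coverage

# Average over the independent runs of the same class, tool, and budget.
avg = (runs.groupby(["class", "tool", "budget"], as_index=False)["branch_coverage"]
           .mean())

# Drop zero-coverage data points, mirroring the paper's filtering step.
avg = avg[avg["branch_coverage"] > 0]
avg.to_csv("avg_coverage.csv", index=False)
```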
2.2 Independent Variables
In this study, we consider 79 factors belonging to 4 different categories that might be correlated with the coverage that will be achieved by auto-
mated testing tools on a given target. We train our models on a set of features designed primarily to capture the code complexity of CUTs. The first
set of features comes from JDEPEND 17 and captures information about the outer context layer of a CUT. We then use the Chidamber and Kemerer
(CK) 9 metrics —such as depth of inheritance tree (DIT) and coupling between objects (CBO)— along with other object-oriented metrics, e.g., number
of static invocations (NOSI) and number of public methods (NOPM). These metrics have been computed using an open source tool provided by
Aniche 18. To capture even more fine-grained details, we include the counts for 52 Java keywords, including keywords such as synchronized, import
or instanceof. In addition to the features used in our preliminary study 11, we also include the Halstead metrics 10. These metrics aim at measuring the
complexity of a given program by looking at its operators and operands.
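As an example of how such keyword-count features could be extracted, the sketch below counts occurrences of a few Java reserved keywords in a class's source text. The keyword subset, file name, and naive tokenization (which also matches keywords inside comments and string literals) are simplified assumptions rather than the exact extraction scripts used in the paper.

```python
# Simplified sketch of keyword-count feature extraction (not the paper's actual script).
import re

# Small subset of the 52 Java reserved keywords used as features in the paper.
KEYWORDS = ["synchronized", "import", "instanceof", "static", "final", "try", "new"]

def keyword_counts(java_source: str) -> dict:
    """Count whole-word keyword occurrences in a class's source code."""
    counts = {}
    for kw in KEYWORDS:
        counts[kw] = len(re.findall(rf"\b{kw}\b", java_source))
    return counts

with open("MyClass.java") as f:   # hypothetical class under test
    print(keyword_counts(f.read()))
```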
Package Level Features
Table 1 summarizes the features computed at package level, calculated with JDepend 18. Such features were originally developed to give an indication
of the quality of a package. For instance, TotalClasses is a measure of the extensibility of a package. The features Ca and Ce are meant to capture the
responsibility and the independence of the package, respectively. In our application, both represent complexity indicators for the purpose
1 https://www.jacoco.org
