Zurich Open Repository and Archive
University of Zurich
University Library
Strickhofstrasse 39
CH-8057 Zurich
www.zora.uzh.ch
Year: 2019
Branch Coverage Prediction in Automated Testing
Grano, Giovanni; Titov, Timofey V; Panichella, Sebastiano; Gall, Harald C
DOI: https://doi.org/10.1002/smr.2158
Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-169144
Journal Article
Accepted Version
Originally published at:
Grano, Giovanni; Titov, Timofey V; Panichella, Sebastiano; Gall, Harald C (2019). Branch Coverage Prediction in Automated Testing. Journal of Software: Evolution and Process, 31(9):1-22.
DOI: https://doi.org/10.1002/smr.2158

RESEARCH PAPER
Branch Coverage Prediction in Automated Testing
Giovanni Grano | Timofey V. Titov | Sebastiano Panichella | Harald C. Gall
Department of Informatics, University of
Zurich, Zurich, Switzerland
Correspondence
Giovanni Grano, Binzmühlestrasse 14, Zurich,
Switzerland
Email: grano@ifi.uzh.ch
Summary
Software testing is crucial in continuous integration (CI). Ideally, at every commit, all the test cases
should be executed and, moreover, new test cases should be generated for the new source code.
This is especially true in a Continuous Test Generation (CTG) environment, where the automatic
generation of test cases is integrated into the continuous integration pipeline. In this context,
developers want to achieve a certain minimum level of coverage for every software build. How-
ever, executing all the test cases and, moreover, generating new ones for all the classes at every
commit is not feasible. As a consequence, developers have to select which subset of classes
has to be tested and/or targeted by test-case generation. We argue that knowing a priori the
branch-coverage that can be achieved with test-data generation tools can help developers in
taking informed decisions about those issues. In this paper, we investigate the possibility of using
source-code metrics to predict the coverage achieved by test-data generation tools. We use four
different categories of source-code features and assess the prediction on a large dataset involv-
ing more than 3,000 Java classes. We compare different machine learning algorithms and conduct
a fine-grained feature analysis aimed at investigating the factors that most impact the prediction
accuracy. Moreover, we extend our investigation to four different search-budgets. Our evaluation
shows that the best model achieves an average 0.15 and 0.21 MAE on nested cross-validation
over the different budgets, respectively on EVOSUITE and RANDOOP. Finally, the discussion of the
results demonstrates the relevance of coupling-related features for the prediction accuracy.
KEYWORDS:
Machine Learning, Software Testing, Automated Software Testing, Coverage Prediction
1 INTRODUCTION
Software testing is widely recognized as a crucial task in any software development process 1, estimated to account for at least half of the entire
development cost 2,3. In recent years, we have witnessed a wide adoption of continuous integration (CI) practices, where new or changed code is inte-
grated extremely frequently into the main codebase. Testing plays an important role in such a pipeline: in an ideal world, at every single commit,
every system’s test case should be executed (regression testing). Moreover, additional test cases might be automatically generated to test all the
new —or modified— code introduced into the main codebase 4. This is especially true in a Continuous Test Generation (CTG) environment, where
the generation of test cases is directly integrated into the continuous integration cycle 4. However, due to the time constraints between frequent
commits, a complete regression testing is not feasible for large projects 5. Furthermore, even test suite augmentation 6, i.e., the automatic generation
of tests considering code changes and their effect on the previous codebase, is hardly doable due to the extensive amount of time needed to generate tests
for just a single class.
This work is an extension of the conference paper presented at the MaLTeSQuE 2018 workshop.

As developers want to ensure a certain minimum level of branch coverage for every build, these computational constraints cause many
challenges. For instance, developers have to select and rank a subset of classes for which to run test-data generation tools, or allocate a search-
budget (i.e., time) to devote to the test generation for each class. Knowing a priori the coverage that will be achieved by test-data generation tools
with a given search-budget can help developers in taking informed decisions to answer such questions: in the following, we give some practical examples
of decisions that can be taken by exploiting such a prediction. With the goal of maximizing the branch-coverage on the entire system, developers might
want to prioritize the test-data generation effort towards the classes for which they know a high branch-coverage can be achieved. On the contrary,
they would avoid spending precious computational time in running test-case generation tools against classes that will never reach a satisfying level
of coverage; for these cases, developers will likely manually write more effective tests. Similarly, knowing the achievable coverage given a certain
search-budget, developers might be able to allocate such a budget in a more efficient way, with the goal of maximizing the achieved coverage and
minimizing the time spent on the generation.
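As an illustration of this last scenario, the following minimal sketch shows one way a coverage predictor could drive budget allocation: classes predicted to gain little from a longer search keep the default budget, while the remaining generation time goes to classes where a longer search is predicted to pay off. The predict_coverage callable and the concrete budgets and threshold are hypothetical, not part of the original study.

```python
# Hypothetical sketch: allocate search-budgets based on predicted branch coverage.
# `predict_coverage(cls, budget)` stands in for a trained per-budget model; the
# thresholds below are illustrative values, not results from the paper.

DEFAULT_BUDGET = 60      # seconds (EvoSuite default)
EXTENDED_BUDGET = 300    # seconds

def allocate_budgets(classes, predict_coverage, min_gain=0.10):
    """Give the extended budget only to classes whose predicted coverage
    improves by at least `min_gain` over the default budget."""
    allocation = {}
    for cls in classes:
        base = predict_coverage(cls, DEFAULT_BUDGET)
        extended = predict_coverage(cls, EXTENDED_BUDGET)
        allocation[cls] = EXTENDED_BUDGET if extended - base >= min_gain else DEFAULT_BUDGET
    return allocation
```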
To address such questions, we built a machine learning (ML) model to predict the branch coverage that will be achieved by two test-data genera-
tion tools —EVOSUITE 7 and RANDOOP 8— on a given class under test (CUT). However, since the achievable branch-coverage strongly depends on
the search-budget (i.e., the time) allocated for the generation, we run each of the aforementioned tools with four different search-
budgets. It is important to note that the branch-coverage achieved with every budget represents the dependent variable our models try to predict.
Therefore, in this study we train and evaluate four different machine learning models for each tool, where every model is specialized in the pre-
diction of the achievable coverage given the allocated search-budget. It is worth noting that this specialization is needed to address particular
questions, like the choice of the search-budget for each test-case generation.
To select the features needed to train the aforementioned models, we investigate metrics able to represent —or measure— the complexity of
a class under test. Thus, we select a total of 79 factors coming from four different categories. We focus on source-code metrics for the following
reasons: (i) they can be obtained statically, without actually executing the code; (ii) they are easy to compute; and (iii) they usually come for free
in a continuous integration (CI) environment, where the code is constantly analyzed by several quality checkers. Among others, we rely on
well-established source-code metrics such as the Chidamber and Kemerer (CK) 9 and the Halstead metrics 10. To detect the best algorithm for the
branch-coverage prediction, we experiment with four distinct algorithms covering distinct algorithmic families. In the end, we find the Random
Forest Regressor algorithm to be the best performing one in the context of branch-coverage prediction. Our final model shows an average Mean
Absolute Error (MAE) of about 0.15 for EVOSUITE and of about 0.22 for RANDOOP, on average over the experimented budgets. Considering the
performance of the devised models, we argue that they can be practically useful to predict the coverage that will be achieved by test-data generation
tools in a real-case scenario. We believe that this approach can support developers in taking informed decisions when it comes to deploying and
practically using test-case generation.
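For readers who want to reproduce this kind of algorithm comparison, a minimal scikit-learn sketch is shown below. It assumes a feature matrix X and a coverage vector y are already available, and it uses default hyper-parameters rather than the tuned configurations of the study; it is not the paper's exact experimental pipeline.

```python
# Minimal sketch (not the paper's exact setup): compare the four regressor
# families on MAE using cross-validation with scikit-learn.
from sklearn.linear_model import HuberRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def compare_regressors(X, y):
    models = {
        "Huber Regression": HuberRegressor(),
        "Support Vector Regression": SVR(),
        "Multi-Layer Perceptron": MLPRegressor(max_iter=1000),
        "Random Forest Regressor": RandomForestRegressor(n_estimators=100),
    }
    for name, model in models.items():
        # neg_mean_absolute_error is negated so that higher is better; flip the sign back.
        scores = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
        print(f"{name}: MAE = {scores.mean():.3f} (+/- {scores.std():.3f})")
```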
Contributions of the Paper
In this paper we define and evaluate machine learning models with the goal of predicting the branch coverage achievable by test-data generation tools
like EVOSUITE 7 and RANDOOP 8. The main contributions of the paper are:
• We investigate four different categories of code-quality metrics, i.e., Package Level, CK and OO, Java Reserved Keyword and Halstead metrics 10, as features for the machine learning models;
• We evaluate the performance of four different machine learning algorithms, i.e., Huber Regression, Support Vector Regression, Multi-Layer Perceptron and Random Forest Regressor, for the branch prediction model;
• We perform a large scale study involving seven large open-source projects for a total of 3,105 Java classes;
• We extensively ran EVOSUITE and RANDOOP over all the classes of the study context, experimenting with four different budgets and multiple executions. The overall execution was parallelized over several multi-core servers, as the mere generation of such an amount of tests would take months on a single-CPU setup.
Novel Contribution of the Extension
This paper is an extension of our seminal work 11, in which we first proposed to predict the branch coverage that will be achieved by test-data
generation tools. In the following, we summarize the novel contributions of this paper with respect to the original one:
• We introduce a new set of features, i.e., the Halstead metrics 10. They aim at determining a quantitative measure of complexity directly from the operators and operands in a class;
• We introduce a Random Forest Regressor —an ensemble algorithm— for the branch prediction problem;

[Figure 1: 3,105 Java classes from 7 projects; metric extractors; automated testing tools (EvoSuite, Randoop) run with 4 budgets; achieved coverage measured with JaCoCo; observation frame; final dataset]
FIGURE 1 Construction process of the training dataset
• We add a fine-grained analysis aimed at understanding the importance of the employed features in the prediction;
• We build the models with three additional budgets beyond the default one. We generate a test suite multiple times for each class and budget, averaging the results; our preliminary findings were based on a single generation;
• We use a larger dataset to train and evaluate the proposed machine learning models.
Structure of the Paper
Section 2 describes the features we used to train and validate the proposed machine learning models; the extraction process is also
detailed in this section. Section 3.1 introduces the research questions of the empirical study together with its context: there, we present the subjects
of the study, the procedure we use to build the training set, and the machine learning algorithms employed. Section 4 describes the steps towards the
resolution of the proposed research questions, while the achieved results are presented in Section 5; the practical implications of the main findings
are then discussed in Section 6. Section 7 discusses the main threats, while related work is presented in Section 8. Finally, Section 9 concludes
the paper, outlining future work.
2 DATASET AND FEATURES DESCRIPTION
Supervised learning models —like the ones we employ in this study— are used to predict a certain output given a set of inputs, learning from exam-
ples of input/output pairs 12. More formally, we define x^(i) as the independent input variables —also called input features— and y^(i) as the output —also
called the dependent variable— that we are trying to predict. A pair (x^(i), y^(i)) is a single training example, and the entire dataset used to learn, i.e., the
training set, is a list of m training examples {(x^(i), y^(i)), i = 1, ..., m}. The learning problem consists of learning a function h : X → Y such that h predicts
the value of y with good accuracy.
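As a concrete illustration of this formulation, the toy sketch below builds a small set of (x^(i), y^(i)) pairs and fits a function h; the feature values and coverage numbers are made up solely to mirror the notation.

```python
# Toy illustration of the supervised-learning notation (made-up numbers):
# each x^(i) is a feature vector for a class, y^(i) its achieved branch coverage.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[10, 2, 0.3],   # x^(1)
              [45, 7, 0.8],   # x^(2)
              [23, 4, 0.5]])  # x^(3)
y = np.array([0.92, 0.41, 0.67])  # y^(1..3), branch coverage in [0, 1]

h = RandomForestRegressor(n_estimators=50).fit(X, y)  # learn h : X -> Y
print(h.predict([[30, 5, 0.6]]))  # predicted coverage for a new, unseen class
```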
In the context of this study, we build four different datasets, one for each investigated search-budget, consisting of a tuple (x^(i), y^(i)_{b,t}) for each
class i, where:
• x^(i) is a vector containing the values for all the factors we described in Section 2.2, for a given class i;
• y^(i)_{b,t} is the branch-coverage value —in the range [0, 1]— of the test suite generated by a test-data generator tool t, for the class i with the search-budget b. This is the dependent variable we want to predict.
It is worth noting that we investigate the prediction performance for two tools; since we experiment with four search-budgets, we build a total of eight
different training sets. The process we use for the construction of such training sets is depicted in Figure 1. At first, we execute the scripts needed
for the extraction of the factors described in Section 2.2 (step 2 in Figure 1); those values form the {x^(i), i = 1, ..., m} feature vectors, one for each
subject class i. It is worth noting that the feature vector is the same for all the budgets: indeed, all the factors refer to the classes under test;
thus, their value is not affected by the used search-budget. Then, we compute the dependent variables, as explained in Section 2.1, obtaining
the {y^(i)_{b,t}, i = 1, ..., m}, where y^(i)_{b,t} is the average coverage obtained for the class i by the tool t with a search budget b. At the end of the process,
we end up with a training dataset for each combination of tool and budget. More formally, for each budget b and tool t, we have a training dataset
{(x^(i), y^(i)_{b,t}), i = 1, ..., m} where x^(i) and y^(i)_{b,t} are respectively (i) the feature vector for the class i, and (ii) the average coverage achieved over
the independent runs by a test-data generator tool t, for a subject class i with a search-budget b. The procedure we use to calculate the dependent
variable is reported in Section 2.1.
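A hedged sketch of this construction step is shown below. It assumes the extracted features live in a CSV with one row per class and the averaged coverage values in another CSV with columns class, tool, budget, and coverage; the file and column names are illustrative assumptions, not the study's actual artifacts.

```python
# Illustrative sketch: build one training dataset per (tool, budget) combination.
# File and column names are assumptions made for this example.
import pandas as pd

features = pd.read_csv("features.csv")        # one row per class: 'class' + 79 factors
coverage = pd.read_csv("avg_coverage.csv")    # columns: class, tool, budget, coverage

datasets = {}
for (tool, budget), group in coverage.groupby(["tool", "budget"]):
    merged = features.merge(group[["class", "coverage"]], on="class")
    X = merged.drop(columns=["class", "coverage"])  # x^(i): feature vectors
    y = merged["coverage"]                          # y^(i)_{b,t}: average branch coverage
    datasets[(tool, budget)] = (X, y)               # eight datasets: 2 tools x 4 budgets
```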
2.1 Dependent Variable
As the dependent variable, we use the branch coverage achieved by the two experimented automated tools, i.e., EVOSUITE and RANDOOP. We run both
tools with four different budgets: default (i.e., 60 and 90 seconds, respectively for EVOSUITE and RANDOOP), 180 seconds (i.e., 3 minutes), 300
seconds (i.e., 5 minutes) and 600 seconds (i.e., 10 minutes). We select these budgets for the following reasons: 180 and 300 seconds have been
the budgets most exploited in the literature so far 13,14,15,16; 10 minutes is a longer budget that we select to get an intuition of what can be
expected with extra time allowed for the search. We do not experiment with longer budgets because of (i) the computation time needed to compute the
dependent variable for such longer budgets, and (ii) more importantly, the fact that the usage of test-data generation tools with such a long budget would be
hardly feasible in practice.
To collect the dependent variable, i.e., the variable we want to predict, we run both tools 10 times on the CUTs used in the study,
obtaining 10 different test suites per tool (i.e., 20 in total). We repeat this process for each budget we experiment with. It is worth noting that,
for the 10-minute budget, we only generate 3 test suites. To sum up, for each class in our dataset we run each tool 33 times. Thus, averaging the
branch coverage of those suites by class, tool and budget, we obtain the {y^(i)_{b,t}, i = 1, ..., m}, where y^(i)_{b,t} is the average coverage obtained for the
class i by a tool t with a search budget b. We use the dependent variable computed in this way to build the different training datasets as described above.
We average the results of different generations due to the non-deterministic nature of the algorithms underlying test-data generation tools. It is
worth noting that such a multiple execution represents an improvement over our previous work 11, where we executed the tools once per class and
with the default search-budget only.
To calculate the coverage for each run we proceed as follows (refer to box 1 in Figure 1): while EVOSUITE automatically reports such informa-
tion in its CSV report, we have to compile, execute and measure the achieved coverage for the tests generated by RANDOOP. For the measurement
step we rely on JaCoCo 1. It is worth noting that we discard the data points with branch coverage equal to 0. It is also important to underline how
time-consuming the described process is: we ran both EVOSUITE and RANDOOP multiple times for the 3,105 classes we use in the study, using four
different search-budgets, for about 820,000 executions. In a nutshell, we estimate the entire test generation process to take about 250 days on a sin-
gle-core machine. To speed up such a process, we ran the generation on an OPENSTACK cluster using three different 16-core Ubuntu servers, with
64 GB of RAM each. Figure 2 shows the distribution of the achieved branch-coverage for both EVOSUITE and RANDOOP over the four exper-
imented budgets. On the one hand, we can observe that EVOSUITE consistently reaches higher branch-coverage on the CUTs as the search-budget
increases: indeed, the mean of the achieved coverage ranges from 74% with the default budget (i.e., 60 seconds) to 81% with the 600-second
(i.e., 10 minutes) budget. On the other hand, the branch-coverage reached by RANDOOP does not seem particularly influenced by the budget given
to the search. It is worth noting that we measure the achieved coverage for RANDOOP over the regression test suites generated by the tool.
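The averaging and filtering just described could look roughly like the following sketch, assuming the per-run coverage values were collected into a CSV with columns class, tool, budget, run, and branch_coverage; these names are chosen here for illustration only.

```python
# Illustrative sketch: compute the dependent variable y^(i)_{b,t} from per-run results.
# Column names are assumptions made for this example.
import pandas as pd

runs = pd.read_csv("coverage_runs.csv")  # class, tool, budget, run, branch_coverage

# Average over the independent runs of the same class, tool, and budget.
avg = (runs.groupby(["class", "tool", "budget"], as_index=False)["branch_coverage"]
           .mean())

# Drop zero-coverage data points, mirroring the paper's filtering step.
avg = avg[avg["branch_coverage"] > 0]
avg.to_csv("avg_coverage.csv", index=False)
```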
2.2 Independent Variables
In this study, we consider 79 factors belonging to 4 different categories that might be correlated with the coverage that will be achieved by auto-
mated testing tools on a given target. We train our models on a set of features designed primarily to capture the code complexity of CUTs. The first
set of features comes from JDEPEND 17 and captures information about the outer context layer of a CUT. We then use the Chidamber and Kemerer
(CK) 9 metrics —such as depth of inheritance tree (DIT) and coupling between objects (CBO)— along with other object-oriented metrics, e.g., number
of static invocations (NOSI) and number of public methods (NOPM). These metrics have been computed using an open source tool provided by
Aniche 18. To capture even more fine-grained details, we include the counts for 52 Java keywords, including keywords such as synchronized, import
or instanceof. In addition to the features used in our preliminary study 11, we also include the Halstead metrics 10. These metrics aim at measuring the
complexity of a given program by looking at its operators and operands.
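As an example of how such keyword-count features could be extracted, the sketch below counts occurrences of a few Java reserved keywords in a class's source text. The keyword subset, file name, and naive tokenization (which also matches keywords inside comments and string literals) are simplified assumptions rather than the exact extraction scripts used in the paper.

```python
# Simplified sketch of keyword-count feature extraction (not the paper's actual script).
import re

# Small subset of the 52 Java reserved keywords used as features in the paper.
KEYWORDS = ["synchronized", "import", "instanceof", "static", "final", "try", "new"]

def keyword_counts(java_source: str) -> dict:
    """Count whole-word keyword occurrences in a class's source code."""
    counts = {}
    for kw in KEYWORDS:
        counts[kw] = len(re.findall(rf"\b{kw}\b", java_source))
    return counts

with open("MyClass.java") as f:   # hypothetical class under test
    print(keyword_counts(f.read()))
```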
Package Level Features
Table 1 summarizes the features computed at package level, calculated with JDepend 18. Such features were originally developed to give an indication
of the quality of a package. For instance, TotalClasses is a measure of the extensibility of a package. The features Ca and Ce are meant to capture the
responsibility and the independence of the package, respectively. In our application, both represent complexity indicators for the purpose
1 https://www.jacoco.org
