Whole Test Suite Generation
Gordon Fraser, Member, IEEE and Andrea Arcuri, Member, IEEE.
Abstract—Not all bugs lead to program crashes, and not always is
there a formal specification to check the correctness of a software test’s
outcome. A common scenario in software testing is therefore that test
data is generated, and a tester manually adds test oracles. As this
is a difficult task, it is important to produce small yet representative
test sets, and this representativeness is typically measured using code
coverage. There is, however, a fundamental problem with the common
approach of targeting one coverage goal at a time: Coverage goals
are not independent, not equally difficult, and sometimes infeasible;
the result of test generation is therefore dependent on the order of
coverage goals and how many of them are feasible. To overcome this
problem, we propose a novel paradigm in which whole test suites are
evolved with the aim of covering all coverage goals at the same time,
while keeping the total size as small as possible. This approach has
several advantages, as for example its effectiveness is not affected by
the number of infeasible targets in the code. We have implemented this
novel approach in the EVOSUITE tool, and compared it to the common
approach of addressing one goal at a time. Evaluated on open source
libraries and an industrial case study for a total of 1,741 classes, we
show that EVOSUITE achieved up to 188 times the branch coverage of
a traditional approach targeting single branches, with up to 62% smaller
test suites.
Index Terms—Search based software engineering, length, branch coverage, genetic algorithm, infeasible goal, collateral coverage
1 INTRODUCTION
It is widely recognized that software testing is an essential
component of any successful software development process.
A software test consists of an input that executes the program
and a definition of the expected outcome. Many techniques
to automatically produce inputs have been proposed over the
years, and today are able to produce test suites with high code
coverage. Yet, the problem of the expected outcome persists,
and has become known as the oracle problem. Sometimes,
essential properties of programs are formally specified, or have
to hold universally such that no explicit oracles need to be
defined (e.g., programs should normally not crash). However,
in the general case one cannot assume the availability of an
automated oracle. This means that, if we produce test inputs,
then a human tester needs to specify the oracle in terms of
the expected outcome. To make this feasible, test generation
needs to aim not only at high code coverage, but also at small
test suites that make oracle generation as easy as possible.
Gordon Fraser is with Saarland University Computer Science,
Saarbrücken, Germany, email: fraser@cs.uni-saarland.de. Andrea Arcuri
is with the Certus Software V&V Center at Simula Research Laboratory,
P.O. Box 134, Lysaker, Norway, email: arcuri@simula.no
1  public class Stack {
2    int[] values = new int[3];
3    int size = 0;
4    void push(int x) {
5      if(size >= values.length)      // Requires a full stack
6        resize();
7      if(size < values.length)       // Else branch is infeasible
8        values[size++] = x;
9    }
10   int pop() {
11     if(size > 0)                   // May imply coverage in push and resize
12       return values[--size];
13     else
14       throw new EmptyStackException();
15   }
16   private void resize() {
17     int[] tmp = new int[values.length * 2];
18     for(int i = 0; i < values.length; i++)
19       tmp[i] = values[i];
20     values = tmp;
21   }
22 }
Fig. 1. Example stack implementation: Some branches
are more difficult to cover than others, some lead to
coverage of further branches, and some can be infeasible.
A common approach in the literature is to generate a test case for each coverage goal (e.g., branches in branch coverage), and then to combine them in a single test suite (e.g., see [43]). However, the size of a resulting test suite is difficult to predict, as a test case generated for one goal may implicitly also cover any number of further coverage goals. This is usually called collateral or serendipitous coverage (e.g., [25]). For example, consider the stack implementation in Figure 1: Covering the true branch in Line 11 is necessarily preceded by the true branch in Line 7, and may or may not also be preceded by the true branch in Line 5. The order in which each goal is selected can thus play a major role, as there can be dependencies among goals. Although there have been attempts to exploit collateral coverage to optimize test generation (e.g., [25]), to the best of our knowledge there is no conclusive evaluation in the literature of their effectiveness.

Stack stack0 = new Stack();
try {
  stack0.pop();
} catch(EmptyStackException e) {
}

Stack stack0 = new Stack();
int int0 = 510;
stack0.push(int0);
stack0.push(int0);
stack0.push(int0);
stack0.push(int0);
stack0.pop();
Fig. 2. Test suite consisting of two tests, produced by
EVOSUITE for the Stack class shown in Figure 1: All
feasible branches are covered.
There are further issues to the approach of targeting one test goal at a time: Some targets are more difficult to cover than others. For example, covering the true branch in Line 5 of the stack example is more difficult than covering the false branch of the same line, as the true branch requires a Stack object which has filled its internal array. Furthermore, coverage goals can be infeasible, such that there exists no input that would cover them. For example, in Figure 1 the false branch of the if condition in Line 7 is infeasible. Even if this particular infeasible branch may be easy to detect, this is not true in general (it is, in fact, an undecidable problem [23]), and thus targeting infeasible goals will by definition fail, and the effort would be wasted. This leads to the question of how to properly allocate how much of the testing budget (e.g., the maximum total time allowed for testing by the user) is used for each target, and how to redistribute such budget to other uncovered targets when the current target is covered before its budget is fully consumed. Although in the literature there has been preliminary work based on software metrics to predict the difficulty of coverage goals in procedural code [31], its evaluation and usefulness on object-oriented software is still an open research question.
In this paper we evaluate a novel approach for test data generation, which we call whole test suite generation, that improves upon the current approach of targeting one goal at a time. We use an evolutionary technique [1], [34] in which, instead of evolving each test case individually, we evolve all the test cases in a test suite at the same time, and the fitness function considers all the testing goals simultaneously. The technique starts with an initial population of randomly generated test suites, and then uses a Genetic Algorithm to optimize towards satisfying a chosen coverage criterion, while using the test suite size as a secondary objective. At the end, the best resulting test suite is minimized, giving us a test suite as shown in Figure 2 for the Stack example from Figure 1. With such an approach, most of the complications and downsides of the one-target-at-a-time approach either disappear or are significantly reduced. The technique is implemented as part of our testing tool EVOSUITE [18], which is freely available online.
This novel approach was first described in [17], and this paper extends that work in several directions, for example by using a much larger and more varied case study, by verifying that the presence of infeasible branches has no negative impact on performance, and by providing theoretical analyses to shed more light on the properties of the proposed approach. In particular, we demonstrate the effectiveness of EVOSUITE by applying it to 1,741 classes coming from open source libraries and an industrial case study (Section 5); to the best of our knowledge, this is the largest evaluation of search-based testing of object-oriented software to date. Because we had to develop specialized search operators to effectively address the problem of test suite generation, there is no a priori guarantee on the convergence properties of the resulting search algorithm. To cope with this problem, we formally prove the convergence of our proposed technique.
The results of our experiments show strong statistical evi-
dence that the EVOSUITE approach yields significantly better
results (i.e., either higher coverage or, if same coverage, then
smaller test suites) compared to the traditional approach of
targeting each testing goal independently. In some cases, EVO-
SUITE achieved up to 188 times higher coverage on average,
and test suites that were 62% smaller while maintaining the
same structural coverage. Furthermore, running EVOSUITE
with a constrained budget (one million statement executions
during the search, up to a maximum timeout of 10 minutes)
resulted in 83% coverage on average on our case study.
The paper is organized as follows. Section 2 provides background information. The novel approach of evolving whole test suites is described in Section 3, and the details of our EVOSUITE tool follow in Section 4. The empirical study we conducted to validate our approach is presented and discussed in Section 5. Convergence is formally proven in Section 6. Threats to validity of our study are analyzed in Section 7, and finally, Section 8 concludes the paper.
2 BACKGROUND
Coverage criteria are commonly used to guide test generation. A coverage criterion represents a finite set of coverage goals, and a common approach is to target one such goal at a time, generating test inputs either symbolically or with a search-based approach. The predominant criterion in the literature is branch coverage, but in principle any other coverage criterion or related techniques such as mutation testing [29] are amenable to automated test generation.
Solving path constraints generated with symbolic execution is a popular approach to generate test data [50] or unit tests [51], and dynamic symbolic execution as an extension can overcome a number of problems by combining concrete executions with symbolic execution (e.g., [22], [39]). This idea has been implemented in tools like DART [22] and CUTE [39], and is also applied in Microsoft's parametrized unit testing tool PEX [42] and in the Dsc [28] tool.
Meta-heuristic search techniques have been used as an alternative to symbolic execution based approaches (see [1], [34] for surveys on this topic). The application of search to test data generation can be traced back to the 1970s [35], when the key concepts of branch distance [30] and approach level [48] were introduced to help search techniques generate the right test data. A promising avenue also seems to be the combination of evolutionary methods with dynamic symbolic execution (e.g., [12], [27], [33]), alleviating some of the problems both approaches have.

Search-based techniques have also been applied to test object-oriented software using method sequences [21], [43] or strongly typed genetic programming [37], [47]. When generating test cases for object-oriented software, since the early work of Tonella [43], authors have tried to deal with the problem of handling the length of the test sequences, for example by penalizing the length directly in the fitness function. However, longer test sequences can lead to higher code coverage [5], yet properly handling their growth and reduction during the search requires special care [19].
Most approaches described in the literature aim to generate test suites that achieve as high branch coverage as possible. In principle, any other coverage criterion is amenable to automated test generation. For example, mutation testing [29] is often considered a worthwhile test goal, and has been used in a search-based test generation environment [21]. When test cases are sought for individual targets in such coverage-based approaches, it is important to keep track of the accidental collateral coverage of the remaining targets. Otherwise, it has been proven that random testing would fare better under some scalability models [10]. Recently, Harman et al. [25] proposed a search-based multi-objective approach in which, although each goal is still targeted individually, there is the secondary objective of maximizing the number of collateral targets that are accidentally covered. However, no particular heuristic is used to help cover these other targets.
All approaches mentioned so far target a single test goal at a time; this is the predominant method. There are some notable exceptions in search-based software testing. The works of Arcuri and Yao [11] and Baresi et al. [13] use a single sequence of function calls to maximize the number of covered branches while minimizing the length of such a test case. A drawback of such an approach is that there can be conflicting testing goals, and it might be impossible to cover all of them with a single test sequence regardless of its length.
Regarding the optimization of an entire test suite in which all test cases are considered at the same time, we are aware of only the work of Baudry et al. [14]. In that work, test suites are optimized with a search algorithm with respect to mutation analysis. However, it has the strong limitation of having to manually choose and fix the length of the test cases, which does not change during the search.
In the literature on testing object-oriented software, there are also techniques that do not directly aim at code coverage, as for example implemented in the Randoop [36] tool. In that work, sequences of function calls are generated incrementally using an extension of random testing (for details, see [36]), and the goal is to find test sequences for which the system under test (SUT) fails. This, however, is feasible if and only if automated oracles are available. Once a sequence of function calls is found for which at least one automated oracle is not passed, that sequence can be reduced to remove all the function calls that are unnecessary to trigger the failure. The software tester would usually get as output only the test cases for which failures are triggered. Notice that achieving higher coverage likely leads to a higher probability of finding faults, and so recent extensions such as Palus [52] aim to achieve this.
Although targeting path coverage, tools like DART [22] and CUTE [39] have a similar objective, assuming the availability of an automated oracle (e.g., does the SUT crash?) to check the generated test cases. This step is essential because, apart from trivial cases, the test suites generated following a path coverage criterion would be far too large to be manually evaluated by software testers in real industrial contexts.
The testing problem we address in this paper is very different from the one considered by tools such as Randoop, DART, or CUTE: Our goal is to target difficult faults for which automated oracles are not available, which is a common situation in practice. Because in these cases the outputs of the test cases have to be verified manually, the generated test suites need to be of manageable size. There are two contrasting objectives: the "quality" of the test suite (e.g., measured in its ability to trigger failures once manual oracles are provided) and its size. The approach we follow in this paper can be summarized as: satisfy the chosen coverage criterion (e.g., branch coverage) with the smallest possible test suite.
3 TEST SUITE OPTIMIZATION
To evolve test suites that optimize the chosen coverage criterion, we use a search algorithm, namely a Genetic Algorithm (GA), that is applied to a population of test suites. In this section, we describe the applied GA, the representation, the genetic operators, and the fitness function.
3.1 Genetic Algorithms
Genetic Algorithms (GAs) are a meta-heuristic search technique that attempts to imitate the mechanisms of natural adaptation in computer systems. A population of chromosomes is evolved using genetics-inspired operations, where each chromosome represents a possible problem solution.
The GA employed in this paper is depicted in Algorithm 1: Starting with a random population, evolution is performed until a solution is found that fulfills the coverage criterion, or the allocated resources (e.g., time, number of fitness evaluations) have been used up. In each iteration of the evolution, a new generation is created and initialized with the best individuals of the last generation (elitism). Then, the new generation is filled up with individuals produced by rank selection (Line 5), crossover (Line 7), and mutation (Line 10). Either the offspring or the parents are added to the new generation, depending on fitness and length constraints (see Section 3.4).
3.2 Problem Representation
To apply search algorithms to solve an engineering problem, the first step is to define a representation of the valid solutions for that problem. In our case, a solution is a test suite, which is represented as a set T of test cases t_i. Given |T| = n, we have T = {t_1, t_2, . . . , t_n}.
In a unit testing scenario, a test case t essentially is a program that executes the SUT. Consequently, a test case requires a reasonable subset of the target language (e.g., Java in our case) that allows one to encode optimal solutions for the addressed problem. In this paper, we use a test case representation similar to what has been used previously [21], [43]: A test case is a sequence of statements t = ⟨s_1, s_2, . . . , s_l⟩ of length l. The length of a test suite is defined as the sum of the lengths of its test cases, i.e., length(T) = Σ_{t∈T} l_t. Note that in this paper we only consider the problem of deriving test inputs. In practice, a test case usually also contains a test oracle, e.g., in terms of test assertions; the problem of deriving such oracles is addressed elsewhere (e.g., [21]).

Algorithm 1 The genetic algorithm applied in EVOSUITE
 1  current_population ← generate random population
 2  repeat
 3    Z ← elite of current_population
 4    while |Z| ≠ |current_population| do
 5      P_1, P_2 ← select two parents with rank selection
 6      if crossover probability then
 7        O_1, O_2 ← crossover(P_1, P_2)
 8      else
 9        O_1, O_2 ← P_1, P_2
10      mutate O_1 and O_2
11      f_P = min(fitness(P_1), fitness(P_2))
12      f_O = min(fitness(O_1), fitness(O_2))
13      l_P = length(P_1) + length(P_2)
14      l_O = length(O_1) + length(O_2)
15      T_B = best individual of current_population
16      if f_O < f_P ∨ (f_O = f_P ∧ l_O ≤ l_P) then
17        for O in {O_1, O_2} do
18          if length(O) ≤ 2 × length(T_B) then
19            Z ← Z ∪ {O}
20          else
21            Z ← Z ∪ {P_1 or P_2}
22      else
23        Z ← Z ∪ {P_1, P_2}
24    current_population ← Z
25 until solution found or maximum resources spent
Each statement in a test case represents one value v(s_i), which has a type τ(v(s_i)) ∈ T, where T is the finite set of types. We define five different kinds of statements:
Primitive statements represent numeric, Boolean, String,
and enumeration variables, as for example int var0 = 54.
Furthermore, primitive statements can also define arrays of any
type (e.g., Object[] var1 = new Object[10]). The
value and type of the statement are defined by the primitive
variable. In addition, an array definition also implicitly defines
a set of values of the component type of the array, according
to the length of the array.
Constructor statements generate new instances of any given class; e.g., Stack var2 = new Stack(). Value and type of the statement are defined by the object constructed in the call. Any parameters of the constructor call are assigned values out of the set {v(s_k) | 0 ≤ k < i}.
Field statements access public member variables of objects, e.g., int var3 = var2.size. Value and type of a field statement are defined by the member variable. If the field is non-static, then the source object of the field has to be in the set {v(s_k) | 0 ≤ k < i}.
Method statements invoke methods on objects or call static methods, e.g., int var4 = var2.pop(). Again, the source object or any of the parameters have to be values in {v(s_k) | 0 ≤ k < i}. Value and type of a method statement are defined by its return value.
Assignment statements assign values to array indices or to
public member variables of objects, e.g., var1[0] = new
Object() or var2.maxSize = 10. Assignment state-
ments do not define new values.
For a given SUT, the test cluster [47] defines the set of available classes, their public constructors, methods, and fields.
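For illustration, this representation can be sketched as a simple object model. The class and field names below are illustrative only and do not correspond to EVOSUITE's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the test-suite representation described above.
// Names are hypothetical; EvoSuite's real classes differ.
public class Representation {
    // The five statement kinds defined in the text.
    enum Kind { PRIMITIVE, CONSTRUCTOR, FIELD, METHOD, ASSIGNMENT }

    // One statement s_i defines (at most) one value v(s_i).
    record Statement(Kind kind, String code) { }

    // A test case t = <s_1, ..., s_l>; its length is l.
    static class TestCase {
        final List<Statement> statements = new ArrayList<>();
        int length() { return statements.size(); }
    }

    // A test suite T = {t_1, ..., t_n}; length(T) is the sum of test lengths.
    static class TestSuite {
        final List<TestCase> tests = new ArrayList<>();
        int length() {
            return tests.stream().mapToInt(TestCase::length).sum();
        }
    }

    public static void main(String[] args) {
        TestCase t = new TestCase();
        t.statements.add(new Statement(Kind.CONSTRUCTOR, "Stack var2 = new Stack()"));
        t.statements.add(new Statement(Kind.PRIMITIVE, "int var0 = 54"));
        t.statements.add(new Statement(Kind.METHOD, "var2.push(var0)"));
        TestSuite suite = new TestSuite();
        suite.tests.add(t);
        System.out.println(suite.length()); // prints 3
    }
}
```

Note how both the number of tests and the number of statements per test can vary, which is exactly the variable-size property discussed next.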
Note that the chosen representation has variable size. Not only can the number n of test cases in a test suite vary during the GA search, but also the number of statements l in the test cases. The motivation for having a variable length representation is that, for new software under test, we do not know the optimal number of test cases and their optimal length a priori; this needs to be searched for.
The entire search space of test suites is composed of all possible sets of sizes from 1 to N (i.e., n ∈ [1, N]). Each test case can have a size from 1 to L (i.e., l ∈ [1, L]). We need these constraints because, in the context addressed in this paper, we are not assuming the presence of an automated oracle. Therefore, we cannot expect software testers to manually check the outputs (i.e., write assert statements) of thousands of long test cases. For each position in the sequence of statements of a test case, there can be from I_min to I_max possible statements, depending on the SUT and the position (later statements can reuse objects instantiated in previous statements). The search space is hence extremely large, although finite because N, L, and I_max are finite.
3.3 Fitness Function
In this paper, we focus on branch coverage as the test criterion, although the EVOSUITE approach can be generalized to any test criterion. A program contains control structures such as if or while statements guarded by logical predicates; branch coverage requires that each of these predicates evaluates to both true and false. A branch is infeasible if there exists no program input that evaluates the predicate such that this particular branch is executed.
Let B denote the set of branches of the SUT, two for every control structure. For simplicity, each case of a switch/case construct is treated like an individual if condition with a true and a false branch. A method without any control structures consists of only one branch, and therefore we require that each method in the set of methods M is executed at least once.
An optimal solution T_o is defined as a solution that covers all the feasible branches/methods and is minimal in the total number of statements, i.e., no other test suite with the same coverage should exist that has a lower total number of statements in its test cases. Depending on the chosen test case representation, some branches might never be covered, even though they are potentially reachable if the entire grammar of the target language were used. As a very simple example, if the chosen representation only allows creating instances of the SUT and of no other classes, then it might not be possible to reach the branches in the methods of the SUT that take as input instances of other classes. Because without a formal proof it is not possible to state that a representation is fully adequate, for the sake of simplicity we tag those branches as infeasible for the given representation.
In order to guide the selection of parents for offspring generation, we use a fitness function that rewards better coverage. If two test suites have the same coverage, the selection mechanism rewards the test suite with fewer statements, i.e., the shorter one.
For a given test suite T, the fitness value is measured by executing all tests t ∈ T and keeping track of the set of executed methods M_T as well as the minimal branch distance d_min(b, T) for each branch b ∈ B. The branch distance is a common heuristic to guide the search for input data to solve the constraints in the logical predicates of the branches [30], [34]. The branch distance for any given execution of a predicate can be calculated by applying a recursively defined set of rules (see [30], [34] for details). For example, for the predicate x ≥ 10 and x having the value 5, the branch distance to the true branch is 10 − 5 + k, with k > 0. In practice, to determine the branch distance, each predicate of the SUT is instrumented to keep track of the distances for each execution.
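As an illustration of this instrumentation idea (the code below is a simplified sketch, not EVOSUITE's actual bytecode instrumentation), the predicate x ≥ 10 from the example above could be tracked as follows, using k = 1:

```java
// Illustrative branch-distance bookkeeping for the predicate "x >= 10",
// following the standard distance rules with constant k = 1.
public class BranchDistanceExample {
    static double minTrueDist = Double.MAX_VALUE;
    static double minFalseDist = Double.MAX_VALUE;

    // Instrumented version of "if (x >= 10) ...": records the minimal
    // distance to each branch over all executions of the predicate.
    static boolean evaluate(int x) {
        double trueDist = x >= 10 ? 0 : (10 - x) + 1;  // distance to true branch
        double falseDist = x < 10 ? 0 : (x - 10) + 1;  // distance to false branch
        minTrueDist = Math.min(minTrueDist, trueDist);
        minFalseDist = Math.min(minFalseDist, falseDist);
        return x >= 10;
    }

    public static void main(String[] args) {
        evaluate(5);                       // true branch missed: 10 - 5 + 1 = 6
        evaluate(7);                       // closer: 10 - 7 + 1 = 4
        System.out.println(minTrueDist);   // prints 4.0
        System.out.println(minFalseDist);  // prints 0.0 (false branch covered)
    }
}
```

The recorded minima d_min(b, T) are exactly what the fitness function below consumes.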
The fitness function estimates how close a test suite is to covering all branches of a program; therefore it is important to consider that each predicate has to be executed at least twice so that each branch can be taken. Consequently, we define the branch distance d(b, T) for branch b on test suite T as follows:

d(b, T) =
  0                  if the branch has been covered,
  ν(d_min(b, T))     if the predicate has been executed at least twice,
  1                  otherwise.
Here, ν(x) is a normalizing function in [0, 1]; we use the normalization function [4]: ν(x) = x/(x + 1). Notice that there is a non-trivial reason behind the choice of d(b, T) = ν(d_min(b, T)) being applied only when the predicate is executed at least twice [11]. For example, assume the case in which it is always applied. If the predicate is reached and branch b is not covered, then we would have d(b, T) > 0, while the opposite branch b_opp would be covered, and so d(b_opp, T) = 0. The search algorithm might be able to follow the gradient given by d(b, T) > 0 until b is covered, i.e., d(b, T) = 0. However, in that case b_opp would not be covered any more, and so its branch distance would increase, i.e., d(b_opp, T) > 0. Now, the search would have a gradient to cover b_opp but, if it does cover it, then necessarily b would not be covered any more (the predicate is reached only once), and so on. Forcing a predicate to be evaluated at least twice, before assigning ν(d_min(b, T)) to the distance of the non-covered branch, avoids this kind of circular behavior.
Finally, the resulting fitness function to minimize is as follows:

fitness(T) = |M| − |M_T| + Σ_{b_k ∈ B} d(b_k, T)
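For illustration, the fitness computation can be sketched as follows. The helper names are hypothetical, and the per-branch contributions d(b, T) are assumed to be computed by the caller according to the case distinction above:

```java
import java.util.Map;
import java.util.Set;

// Sketch of fitness(T) = |M| - |M_T| + sum over branches of d(b, T).
// Illustrative names only; not EvoSuite's actual implementation.
public class Fitness {
    // nu(x) = x / (x + 1), a normalization into [0, 1)
    static double normalize(double x) {
        return x / (x + 1);
    }

    // branchContribution maps each branch to its d(b, T): 0 if covered,
    // nu(d_min(b, T)) if the predicate ran at least twice, 1 otherwise.
    static double fitness(int totalMethods, Set<String> executedMethods,
                          Map<String, Double> branchContribution) {
        double sum = 0;
        for (double d : branchContribution.values()) sum += d;
        return (totalMethods - executedMethods.size()) + sum;
    }

    public static void main(String[] args) {
        // 3 methods in the SUT, 2 executed; one uncovered branch whose
        // predicate was executed at least twice with d_min = 1: nu(1) = 0.5.
        double f = fitness(3, Set.of("push", "pop"),
                           Map.of("line5-true", normalize(1.0)));
        System.out.println(f); // prints 1.5
    }
}
```

Since fitness is minimized, a fully covering suite reaches fitness 0, and the normalized distances provide the gradient towards it.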
3.4 Bloat Control
A variable size representation can lead to bloat, a problem that is very typical, for example, in Genetic Programming [41]: after each generation, test cases can become longer and longer, until all the memory is consumed, even if shorter sequences are better rewarded. Notice that bloat is an extremely complex phenomenon in evolutionary computation, and after many decades of research it is still an open problem whose dynamics and nature are not completely understood [41].
Bloat occurs when small, negligible improvements in the fitness value are obtained with larger solutions. This is very typical in classification/regression problems. When in software testing the fitness function is just the obtained coverage, then we would not expect bloat, because the fitness function would assume only a few possible values. However, when other metrics with large domains of possible values are introduced (e.g., branch distance, or mutation impact [21]), then bloat might occur.
In previous work [19], we have studied several bloat control methods from the Genetic Programming literature [41] applied in the context of testing object-oriented software. However, our previous study [19] covered only the case of targeting one branch at a time. In EVOSUITE we use the same methods analyzed in that study [19], although further analyses are required to study whether there are differences in their application to handle bloat when evolving test suites rather than single test cases. The employed bloat control methods are:
We put a limit N on the maximum number of test cases and a limit L on their maximum length. Even if we expect the length and number of test cases of an optimal test suite to have low values, we still need to choose comparatively larger N and L. In fact, allowing the search process to employ longer test sequences and then reducing their length during/after the search can provide staggering improvements in terms of achieved coverage [5].
In our GA we use rank selection [49] based on the fitness function (i.e., the obtained coverage and branch distance values). In case of ties, we assign better ranks to smaller test suites. Notice that including the length directly in the fitness function (as done, for example, in [11], [13]) might have side-effects, because we would need to put together and linearly combine two values of different units of measure. Furthermore, although we have two distinct objectives, coverage is more important than size.
Offspring with non-better coverage are never accepted into the next generation if they are larger than their parents (for the details, see Algorithm 1).
We use a dynamic size limit conceptually similar to the one presented by Silva and Costa [41]. If an offspring's coverage is not better than that of the best solution T_B in the current GA population, then it is not accepted into the new generation if it is longer than twice the length of T_B (see Line 18 in Algorithm 1).
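For illustration, the acceptance checks used in Algorithm 1 can be sketched as follows (illustrative names; fitness is minimized):

```java
// Sketch of the offspring-acceptance checks from Algorithm 1:
// offspring may replace their parents only if they do not worsen fitness,
// do not grow without a fitness gain, and stay below twice the length of
// the current best suite T_B. Names are illustrative only.
public class Acceptance {
    // Line 16: accept offspring if fitness improves, or ties without growth.
    static boolean offspringPreferred(double fO, double fP, int lO, int lP) {
        return fO < fP || (fO == fP && lO <= lP);
    }

    // Line 18: dynamic size limit relative to the best individual T_B.
    static boolean withinBloatLimit(int offspringLength, int bestLength) {
        return offspringLength <= 2 * bestLength;
    }

    public static void main(String[] args) {
        System.out.println(offspringPreferred(1.5, 2.0, 40, 30)); // better fitness: true
        System.out.println(offspringPreferred(2.0, 2.0, 40, 30)); // same fitness, longer: false
        System.out.println(withinBloatLimit(50, 20));             // 50 > 2 * 20: false
    }
}
```

Together, these two checks implement the "non-better offspring never grow" policy described above.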
3.5 Search Operators
The GA depicted in Algorithm 1 is at a high level of abstraction, and can be used for many engineering problems in which variable size representations are used. To adapt it to a specific engineering problem, we need to define search operators that manipulate

Citations
More filters
Journal ArticleDOI

The Oracle Problem in Software Testing: A Survey

TL;DR: This paper provides a comprehensive survey of current approaches to the test oracle problem and an analysis of trends in this important area of software testing research and practice.
Journal ArticleDOI

A Hitchhiker's guide to statistical tests for assessing randomized algorithms in software engineering

TL;DR: This paper provides guidelines on how to carry out and properly analyse randomized algorithms applied to solve software engineering tasks, with a particular focus on software testing, which is by far the most frequent application area of randomized algorithms within software engineering.
Proceedings ArticleDOI

Sapienz: multi-objective automated testing for Android applications

TL;DR: Sapienz, an approach to Android testing that uses multi-objective search-based testing to automatically explore and optimise test sequences, minimising length, while simultaneously maximising coverage and fault revelation, significantly outperforms both the state-of-the-art technique Dynodroid and the widely-used tool, Android Monkey.
Proceedings ArticleDOI

DeepGauge: multi-granularity testing criteria for deep learning systems

TL;DR: DeepGauge is proposed, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed and sheds light on the construction of more generic and robust DL systems.
Proceedings ArticleDOI

Fairness testing: testing software for discrimination

TL;DR: Themis is a testing-based method for measuring if and how much software discriminates; it focuses on causality in discriminatory behavior and generates efficient test suites to measure discrimination.
References
Book

An introduction to probability theory

TL;DR: The authors introduce probability theory for both advanced undergraduate students of statistics and scientists in related fields, drawing on real applications in the physical and biological sciences; it "makes probability exciting" (Journal of the American Statistical Association).
Journal ArticleDOI

DART: directed automated random testing

TL;DR: DART is a new tool for automatically testing software that combines three main techniques, automated extraction of the interface of a program with its external environment using static source-code parsing, and dynamic analysis of how the program behaves under random testing and automatic generation of new test inputs to direct systematically the execution along alternative program paths.
Frequently Asked Questions (10)
Q1. What are the contributions in "Whole test suite generation" ?

To overcome this problem, the authors propose a novel paradigm in which whole test suites are evolved with the aim of covering all coverage goals at the same time, while keeping the total size as small as possible. Evaluated on open source libraries and an industrial case study for a total of 1,741 classes, the authors show that EVOSUITE achieved up to 188 times the branch coverage of a traditional approach targeting single branches, with up to 62% smaller test suites.

However, the EVOSUITE approach could be easily applied to procedural software as well, although further research is needed to assess the potential benefits in such a context. 

The branch distance for any given execution of a predicate can be calculated by applying a recursively defined set of rules (see [30], [34] for details). 
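As a flavour of those rules, here are two of the classic base cases (an illustrative sketch of standard branch distance definitions from the search-based testing literature, not code from EvoSuite itself; K is the usual positive constant):

```python
K = 1  # positive constant added when the desired branch is missed

def bd_equals(a, b):
    # Distance to making the predicate `a == b` evaluate to true:
    # zero when already true, otherwise how far apart the operands are.
    return abs(a - b)

def bd_less_than(a, b):
    # Distance to making the predicate `a < b` evaluate to true.
    return 0 if a < b else (a - b) + K
```

The guidance this gives the search is the point: an input with a == 5 and b == 7 is "closer" to covering the branch `a == b` than one with a == 5 and b == 100, even though both miss it.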

Other examples of infeasible branches are given by private methods that are not called in any public methods, dead code, or methods of abstract classes that are overridden in all concrete subclasses without calling the abstract super-class. 

Before presenting the result to the user, test suites are minimized using a simple minimization algorithm [5] which attempts to remove each statement one at a time until all remaining statements contribute to the coverage; this minimization reduces both the number of test cases as well as their length, such that removing any statement in the resulting test suite will reduce its coverage. 
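The one-at-a-time removal described above can be sketched as follows (illustrative only; `covered_by` and the list-of-statements representation are assumed stand-ins for the tool's internals):

```python
def minimize(statements, covered_by):
    # `covered_by(stmts)` returns the set of coverage goals reached by the
    # given statements; minimization must not lose any goal from this set.
    target = covered_by(statements)
    i = 0
    while i < len(statements):
        candidate = statements[:i] + statements[i + 1:]
        if covered_by(candidate) >= target:   # superset check: nothing lost
            statements = candidate            # statement was redundant; drop it
        else:
            i += 1                            # statement contributes; keep it
    return statements
```

After this loop terminates, every remaining statement is necessary: removing any one of them would shrink the covered set, which is exactly the property the text describes.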

Once EVOSUITE is able to generate a test sequence that covers that difficult branch, this sequence can be extended (e.g., by adding function calls at its end) or copied in another test case in the test suite (e.g., through the crossover operator) to make it easier to cover the other nested branches. 

When a test case representation is complex and it is of variable length (as it happens in their case, see Section 3.2), it is often not possible to sample test cases with uniform distribution (i.e., each test case having the same probability of being sampled). 

Although the authors used both open source projects and industrial software as case studies, there is the threat to external validity regarding the generalization to other types of software, which is common for any empirical analysis. 

The statements of the second part are appended one at a time similarly to the insertion described in Section 3.5.2, except that whenever possible dependencies are satisfied using existing values. 

It could be explained by the fact that, when EVOSUITE is worse on a specific SUT, it is worse by only a little, whereas when it is better, it is better by a larger margin.