Whole Test Suite Generation
Gordon Fraser, Member, IEEE and Andrea Arcuri, Member, IEEE.
Abstract—Not all bugs lead to program crashes, and not always is
there a formal specification to check the correctness of a software test’s
outcome. A common scenario in software testing is therefore that test
data is generated, and a tester manually adds test oracles. As this
is a difficult task, it is important to produce small yet representative
test sets, and this representativeness is typically measured using code
coverage. There is, however, a fundamental problem with the common
approach of targeting one coverage goal at a time: Coverage goals
are not independent, not equally difficult, and sometimes infeasible;
the result of test generation is therefore dependent on the order of
coverage goals and how many of them are feasible. To overcome this
problem, we propose a novel paradigm in which whole test suites are
evolved with the aim of covering all coverage goals at the same time,
while keeping the total size as small as possible. This approach has
several advantages, as for example its effectiveness is not affected by
the number of infeasible targets in the code. We have implemented this
novel approach in the EVOSUITE tool, and compared it to the common
approach of addressing one goal at a time. Evaluated on open source
libraries and an industrial case study for a total of 1,741 classes, we
show that EVOSUITE achieved up to 188 times the branch coverage of
a traditional approach targeting single branches, with up to 62% smaller
test suites.
Index Terms—Search based software engineering, length, branch coverage, genetic algorithm, infeasible goal, collateral coverage
1 INTRODUCTION
It is widely recognized that software testing is an essential
component of any successful software development process.
A software test consists of an input that executes the program
and a definition of the expected outcome. Many techniques
to automatically produce inputs have been proposed over the
years, and today are able to produce test suites with high code
coverage. Yet, the problem of the expected outcome persists,
and has become known as the oracle problem. Sometimes,
essential properties of programs are formally specified, or have
to hold universally such that no explicit oracles need to be
defined (e.g., programs should normally not crash). However,
in the general case one cannot assume the availability of an
automated oracle. This means that, if we produce test inputs,
then a human tester needs to specify the oracle in terms of
the expected outcome. To make this feasible, test generation
needs to aim not only at high code coverage, but also at small
test suites that make oracle generation as easy as possible.
Gordon Fraser is with Saarland University Computer Science,
Saarbrücken, Germany, email: fraser@cs.uni-saarland.de. Andrea Arcuri
is with the Certus Software V&V Center at Simula Research Laboratory,
P.O. Box 134, Lysaker, Norway, email: arcuri@simula.no
1  public class Stack {
2    int[] values = new int[3];
3    int size = 0;
4    void push(int x) {
5      if(size >= values.length)      // Requires a full stack
6        resize();
7      if(size < values.length)       // Else branch is infeasible
8        values[size++] = x;
9    }
10   int pop() {
11     if(size > 0)                   // May imply coverage in push and resize
12       return values[--size];
13     else
14       throw new EmptyStackException();
15   }
16   private void resize() {
17     int[] tmp = new int[values.length * 2];
18     for(int i = 0; i < values.length; i++)
19       tmp[i] = values[i];
20     values = tmp;
21   }
22 }
Fig. 1. Example stack implementation: Some branches
are more difficult to cover than others, some lead to
coverage of further branches, and some can be infeasible.
A common approach in the literature is to generate a test case for each coverage goal (e.g., branches in branch coverage), and then to combine them in a single test suite (e.g., see [43]). However, the size of a resulting test suite is difficult to predict, as a test case generated for one goal may implicitly also cover any number of further coverage goals. This is usually called collateral or serendipitous coverage (e.g., [25]). For example, consider the stack implementation in Figure 1: Covering the true branch in Line 11 is necessarily preceded by the true branch in Line 7, and may or may not also be preceded by the true branch in Line 5. The order in which each goal is selected can thus play a major role, as there can be dependencies among goals. Although there have been attempts to exploit collateral coverage to optimize test generation (e.g., [25]), to the best of our knowledge there is no conclusive evaluation in the literature of their effectiveness.

Stack stack0 = new Stack();
try {
  stack0.pop();
} catch(EmptyStackException e) {
}

Stack stack0 = new Stack();
int int0 = 510;
stack0.push(int0);
stack0.push(int0);
stack0.push(int0);
stack0.push(int0);
stack0.pop();
Fig. 2. Test suite consisting of two tests, produced by
EVOSUITE for the Stack class shown in Figure 1: All
feasible branches are covered.
There are further issues to the approach of targeting one test goal at a time: Some targets are more difficult to cover than others. For example, covering the true branch in Line 5 of the stack example is more difficult than covering the false branch of the same line, as the true branch requires a Stack object which has filled its internal array. Furthermore, coverage goals can be infeasible, such that there exists no input that would cover them. For example, in Figure 1 the false branch of the if condition in Line 7 is infeasible. Even if this particular infeasible branch may be easy to detect, this is not true in general (it is, in fact, an undecidable problem [23]), and thus targeting infeasible goals will by definition fail, and the effort would be wasted. This leads to the question of how to properly allocate how much of the testing budget (e.g., the maximum total time allowed for testing by the user) is used for each target, and how to redistribute such budget to other uncovered targets when the current target is covered before its budget is fully consumed. Although in the literature there has been preliminary work based on software metrics to predict the difficulty of coverage goals in procedural code [31], its evaluation and usefulness on object-oriented software is still an open research question.
In this paper we evaluate a novel approach for test data generation, which we call whole test suite generation, that improves upon the current approach of targeting one goal at a time. We use an evolutionary technique [1], [34] in which, instead of evolving each test case individually, we evolve all the test cases in a test suite at the same time, and the fitness function considers all the testing goals simultaneously. The technique starts with an initial population of randomly generated test suites, and then uses a Genetic Algorithm to optimize towards satisfying a chosen coverage criterion, while using the test suite size as a secondary objective. At the end, the best resulting test suite is minimized, giving us a test suite as shown in Figure 2 for the Stack example from Figure 1. With such an approach, most of the complications and downsides of the one-target-at-a-time approach either disappear or are significantly reduced. The technique is implemented as part of our testing tool EVOSUITE [18], which is freely available online.
This novel approach was first described in [17], and this paper extends that work in several directions, for example by using a much larger and more varied case study, by verifying that the presence of infeasible branches has no negative impact on performance, and by providing theoretical analyses to shed more light on the properties of the proposed approach. In particular, we demonstrate the effectiveness of EVOSUITE by applying it to 1,741 classes coming from open source libraries and an industrial case study (Section 5); to the best of our knowledge, this is the largest evaluation of search-based testing of object-oriented software to date. Because we had to develop specialized search operators to effectively address the problem of test suite generation, there is no a priori guarantee on the convergence properties of the resulting search algorithm. To cope with this problem, we formally prove the convergence of our proposed technique.
The results of our experiments show strong statistical evi-
dence that the EVOSUITE approach yields significantly better
results (i.e., either higher coverage or, if same coverage, then
smaller test suites) compared to the traditional approach of
targeting each testing goal independently. In some cases, EVO-
SUITE achieved up to 188 times higher coverage on average,
and test suites that were 62% smaller while maintaining the
same structural coverage. Furthermore, running EVOSUITE
with a constrained budget (one million statement executions
during the search, up to a maximum timeout of 10 minutes)
resulted in 83% coverage on average on our case study.
The paper is organized as follows. Section 2 provides background information. The novel approach of evolving whole test suites is described in Section 3, and the details of our EVOSUITE tool follow in Section 4. The empirical study we conducted to validate our approach is presented and discussed in Section 5. Convergence is formally proven in Section 6. Threats to validity of our study are analyzed in Section 7, and finally, Section 8 concludes the paper.
2 BACKGROUND
Coverage criteria are commonly used to guide test generation. A coverage criterion represents a finite set of coverage goals, and a common approach is to target one such goal at a time, generating test inputs either symbolically or with a search-based approach. The predominant criterion in the literature is branch coverage, but in principle any other coverage criterion or related techniques such as mutation testing [29] are amenable to automated test generation.
Solving path constraints generated with symbolic execution is a popular approach to generate test data [50] or unit tests [51], and dynamic symbolic execution as an extension can overcome a number of problems by combining concrete executions with symbolic execution (e.g., [22], [39]). This idea has been implemented in tools like DART [22] and CUTE [39], and is also applied in Microsoft's parametrized unit testing tool PEX [42] and in the Dsc [28] tool.
Meta-heuristic search techniques have been used as an alternative to symbolic execution based approaches (see [1], [34] for surveys on this topic). The application of search to test data generation can be traced back to the 1970s [35], when the key concepts of branch distance [30] and approach level [48] were introduced to help search techniques generate the right test data. A promising avenue also seems to be the combination of evolutionary methods with dynamic symbolic execution (e.g., [12], [27], [33]), alleviating some of the problems both approaches have.

Search-based techniques have also been applied to test object-oriented software using method sequences [21], [43] or strongly typed genetic programming [37], [47]. When generating test cases for object-oriented software, since the early work of Tonella [43], authors have tried to deal with the problem of handling the length of the test sequences, for example by penalizing the length directly in the fitness function. However, longer test sequences can lead to higher code coverage [5], yet properly handling their growth and reduction during the search requires special care [19].
Most approaches described in the literature aim to generate test suites that achieve as high branch coverage as possible. In principle, any other coverage criterion is amenable to automated test generation. For example, mutation testing [29] is often considered a worthwhile test goal, and has been used in a search-based test generation environment [21]. When test cases are sought for individual targets in such coverage-based approaches, it is important to keep track of the accidental collateral coverage of the remaining targets. Otherwise, it has been proven that random testing would fare better under some scalability models [10]. Recently, Harman et al. [25] proposed a search-based multi-objective approach in which, although each goal is still targeted individually, there is the secondary objective of maximizing the number of collateral targets that are accidentally covered. However, no particular heuristic is used to help cover these other targets.
All approaches mentioned so far target a single test goal at a time; this is the predominant method. There are some notable exceptions in search-based software testing. The works of Arcuri and Yao [11] and Baresi et al. [13] use a single sequence of function calls to maximize the number of covered branches while minimizing the length of such a test case. A drawback of such an approach is that there can be conflicting testing goals, and it might be impossible to cover all of them with a single test sequence regardless of its length.
Regarding the optimization of an entire test suite in which all test cases are considered at the same time, we are aware of only the work of Baudry et al. [14]. In that work, test suites are optimized with a search algorithm with respect to mutation analysis. However, it has the strong limitation of having to manually choose and fix the length of the test cases, which does not change during the search.
In the literature on testing object-oriented software, there are also techniques that do not directly aim at code coverage, as for example implemented in the Randoop [36] tool. In that work, sequences of function calls are generated incrementally using an extension of random testing (for details, see [36]), and the goal is to find test sequences for which the system under test (SUT) fails. This, however, is feasible if and only if automated oracles are available. Once a sequence of function calls is found for which at least one automated oracle is not passed, that sequence can be reduced to remove all the function calls that are unnecessary to trigger the failure. The software tester would usually get as output only the test cases for which failures are triggered. Notice that achieving higher coverage likely leads to a higher probability of finding faults, and so recent extensions such as Palus [52] aim to achieve this.
Although targeting path coverage, tools like DART [22] and CUTE [39] have a similar objective, assuming the availability of an automated oracle (e.g., does the SUT crash?) to check the generated test cases. This step is essential because, apart from trivial cases, the test suites generated following a path coverage criterion would be far too large to be manually evaluated by software testers in real industrial contexts.
The testing problem we address in this paper is very different from the one considered by tools such as Randoop, DART, or CUTE: Our goal is to target difficult faults for which automated oracles are not available, which is a common situation in practice. Because in these cases the outputs of the test cases have to be verified manually, the generated test suites need to be of manageable size. There are two contrasting objectives: the "quality" of the test suite (e.g., measured in its ability to trigger failures once manual oracles are provided) and its size. The approach we follow in this paper can be summarized as: satisfy the chosen coverage criterion (e.g., branch coverage) with the smallest possible test suite.
3 TEST SUITE OPTIMIZATION
To evolve test suites that optimize the chosen coverage criterion, we use a search algorithm, namely a Genetic Algorithm (GA), that is applied to a population of test suites. In this section, we describe the applied GA, the representation, the genetic operators, and the fitness function.
3.1 Genetic Algorithms
Genetic Algorithms (GAs) are a meta-heuristic search technique that attempts to imitate the mechanisms of natural adaptation in computer systems. A population of chromosomes is evolved using genetics-inspired operations, where each chromosome represents a possible problem solution.
The GA employed in this paper is depicted in Algorithm 1: Starting with a random population, evolution is performed until a solution is found that fulfills the coverage criterion, or the allocated resources (e.g., time, number of fitness evaluations) have been used up. In each iteration of the evolution, a new generation is created and initialized with the best individuals of the last generation (elitism). Then, the new generation is filled up with individuals produced by rank selection (Line 5), crossover (Line 7), and mutation (Line 10). Either the offspring or the parents are added to the new generation, depending on fitness and length constraints (see Section 3.4).
3.2 Problem Representation
To apply search algorithms to solve an engineering problem, the first step is to define a representation of the valid solutions for that problem. In our case, a solution is a test suite, which is represented as a set T of test cases t_i. Given |T| = n, we have T = {t_1, t_2, . . . , t_n}.
In a unit testing scenario, a test case t essentially is a program that executes the SUT. Consequently, a test case requires a reasonable subset of the target language (e.g., Java in our case) that allows one to encode optimal solutions for the addressed problem. In this paper, we use a test case representation similar to what has been used previously [21], [43]: A test case is a sequence of statements t = ⟨s_1, s_2, . . . , s_l⟩ of length l. The length of a test suite is defined as the sum of the lengths of its test cases, i.e., length(T) = Σ_{t∈T} l_t. Note that in this paper we only consider the problem of deriving test inputs. In practice, a test case usually also contains a test oracle, e.g., in terms of test assertions; the problem of deriving such oracles is addressed elsewhere (e.g., [21]).

Algorithm 1 The genetic algorithm applied in EVOSUITE
 1  current_population ← generate random population
 2  repeat
 3    Z ← elite of current_population
 4    while |Z| ≠ |current_population| do
 5      P_1, P_2 ← select two parents with rank selection
 6      if crossover probability then
 7        O_1, O_2 ← crossover(P_1, P_2)
 8      else
 9        O_1, O_2 ← P_1, P_2
10      mutate O_1 and O_2
11      f_P = min(fitness(P_1), fitness(P_2))
12      f_O = min(fitness(O_1), fitness(O_2))
13      l_P = length(P_1) + length(P_2)
14      l_O = length(O_1) + length(O_2)
15      T_B = best individual of current_population
16      if f_O < f_P ∨ (f_O = f_P ∧ l_O ≤ l_P) then
17        for O in {O_1, O_2} do
18          if length(O) ≤ 2 × length(T_B) then
19            Z ← Z ∪ {O}
20          else
21            Z ← Z ∪ {P_1 or P_2}
22      else
23        Z ← Z ∪ {P_1, P_2}
24    current_population ← Z
25 until solution found or maximum resources spent
Each statement in a test case represents one value v(s_i), which has a type τ(v(s_i)) ∈ T, where T is the finite set of types. We define five different kinds of statements:
Primitive statements represent numeric, Boolean, String,
and enumeration variables, as for example int var0 = 54.
Furthermore, primitive statements can also define arrays of any
type (e.g., Object[] var1 = new Object[10]). The
value and type of the statement are defined by the primitive
variable. In addition, an array definition also implicitly defines
a set of values of the component type of the array, according
to the length of the array.
Constructor statements generate new instances of any given class; e.g., Stack var2 = new Stack(). Value and type of the statement are defined by the object constructed in the call. Any parameters of the constructor call are assigned values out of the set {v(s_k) | 0 ≤ k < i}.
Field statements access public member variables of objects, e.g., int var3 = var2.size. Value and type of a field statement are defined by the member variable. If the field is non-static, then the source object of the field has to be in the set {v(s_k) | 0 ≤ k < i}.
Method statements invoke methods on objects or call static methods, e.g., int var4 = var2.pop(). Again, the source object or any of the parameters have to be values in {v(s_k) | 0 ≤ k < i}. Value and type of a method statement are defined by its return value.
Assignment statements assign values to array indices or to
public member variables of objects, e.g., var1[0] = new
Object() or var2.maxSize = 10. Assignment state-
ments do not define new values.
For a given SUT, the test cluster [47] defines the set of available classes, their public constructors, methods, and fields.
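For illustration, this representation can be sketched as a simple object model. The class and field names below are illustrative only and do not correspond to EVOSUITE's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the test-suite representation described above.
// Names are hypothetical; EvoSuite's real classes differ.
public class Representation {
    // The five statement kinds defined in the text.
    enum Kind { PRIMITIVE, CONSTRUCTOR, FIELD, METHOD, ASSIGNMENT }

    // One statement s_i defines (at most) one value v(s_i).
    record Statement(Kind kind, String code) { }

    // A test case t = <s_1, ..., s_l>; its length is l.
    static class TestCase {
        final List<Statement> statements = new ArrayList<>();
        int length() { return statements.size(); }
    }

    // A test suite T = {t_1, ..., t_n}; length(T) is the sum of test lengths.
    static class TestSuite {
        final List<TestCase> tests = new ArrayList<>();
        int length() {
            return tests.stream().mapToInt(TestCase::length).sum();
        }
    }

    public static void main(String[] args) {
        TestCase t = new TestCase();
        t.statements.add(new Statement(Kind.CONSTRUCTOR, "Stack var2 = new Stack()"));
        t.statements.add(new Statement(Kind.PRIMITIVE, "int var0 = 54"));
        t.statements.add(new Statement(Kind.METHOD, "var2.push(var0)"));
        TestSuite suite = new TestSuite();
        suite.tests.add(t);
        System.out.println(suite.length()); // prints 3
    }
}
```

Note how both the number of tests and the number of statements per test can vary, which is exactly the variable-size property discussed next.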
Note that the chosen representation has variable size. Not only can the number n of test cases in a test suite vary during the GA search, but also the number of statements l in the test cases. The motivation for having a variable length representation is that, for new software under test, we do not know the optimal number of test cases and their optimal length a priori; this needs to be searched for.
The entire search space of test suites is composed of all possible sets of sizes from 1 to N (i.e., n ∈ [1, N]). Each test case can have a size from 1 to L (i.e., l ∈ [1, L]). We need these constraints because, in the context addressed in this paper, we are not assuming the presence of an automated oracle. Therefore, we cannot expect software testers to manually check the outputs (i.e., write assert statements) of thousands of long test cases. For each position in the sequence of statements of a test case, there can be from I_min to I_max possible statements, depending on the SUT and the position (later statements can reuse objects instantiated in previous statements). The search space is hence extremely large, although finite because N, L, and I_max are finite.
3.3 Fitness Function
In this paper, we focus on branch coverage as the test criterion, although the EVOSUITE approach can be generalized to any test criterion. A program contains control structures such as if or while statements guarded by logical predicates; branch coverage requires that each of these predicates evaluates to both true and false. A branch is infeasible if there exists no program input that evaluates the predicate such that this particular branch is executed.
Let B denote the set of branches of the SUT, two for every control structure. For simplicity, each case of a switch/case construct is treated like an individual if condition with a true and a false branch. A method without any control structures consists of only one branch, and therefore we require that each method in the set of methods M is executed at least once.
An optimal solution T_o is defined as a solution that covers all the feasible branches/methods and is minimal in the total number of statements, i.e., no other test suite with the same coverage should exist that has a lower total number of statements in its test cases. Depending on the chosen test case representation, some branches might never be covered, even though they are potentially reachable if the entire grammar of the target language were used. As a very simple example, if the chosen representation only allows creating instances of the SUT and of no other classes, then it might not be possible to reach the branches in the methods of the SUT that take as input instances of other classes. Because without a formal proof it is not possible to state that a representation is fully adequate, for the sake of simplicity we tag those branches as infeasible for the given representation.
In order to guide the selection of parents for offspring generation, we use a fitness function that rewards better coverage. If two test suites have the same coverage, the selection mechanism rewards the test suite with fewer statements, i.e., the shorter one.
For a given test suite T, the fitness value is measured by executing all tests t ∈ T and keeping track of the set of executed methods M_T as well as the minimal branch distance d_min(b, T) for each branch b ∈ B. The branch distance is a common heuristic to guide the search for input data to solve the constraints in the logical predicates of the branches [30], [34]. The branch distance for any given execution of a predicate can be calculated by applying a recursively defined set of rules (see [30], [34] for details). For example, for the predicate x ≥ 10 and x having the value 5, the branch distance to the true branch is 10 − 5 + k, with k > 0. In practice, to determine the branch distance, each predicate of the SUT is instrumented to keep track of the distances for each execution.
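As an illustration of this instrumentation idea (the code below is a simplified sketch, not EVOSUITE's actual bytecode instrumentation), the predicate x ≥ 10 from the example above could be tracked as follows, using k = 1:

```java
// Illustrative branch-distance bookkeeping for the predicate "x >= 10",
// following the standard distance rules with constant k = 1.
public class BranchDistanceExample {
    static double minTrueDist = Double.MAX_VALUE;
    static double minFalseDist = Double.MAX_VALUE;

    // Instrumented version of "if (x >= 10) ...": records the minimal
    // distance to each branch over all executions of the predicate.
    static boolean evaluate(int x) {
        double trueDist = x >= 10 ? 0 : (10 - x) + 1;  // distance to true branch
        double falseDist = x < 10 ? 0 : (x - 10) + 1;  // distance to false branch
        minTrueDist = Math.min(minTrueDist, trueDist);
        minFalseDist = Math.min(minFalseDist, falseDist);
        return x >= 10;
    }

    public static void main(String[] args) {
        evaluate(5);                       // true branch missed: 10 - 5 + 1 = 6
        evaluate(7);                       // closer: 10 - 7 + 1 = 4
        System.out.println(minTrueDist);   // prints 4.0
        System.out.println(minFalseDist);  // prints 0.0 (false branch covered)
    }
}
```

The recorded minima d_min(b, T) are exactly what the fitness function below consumes.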
The fitness function estimates how close a test suite is to covering all branches of a program; therefore it is important to consider that each predicate has to be executed at least twice so that each branch can be taken. Consequently, we define the branch distance d(b, T) for branch b on test suite T as follows:

d(b, T) =
  0                  if the branch has been covered,
  ν(d_min(b, T))     if the predicate has been executed at least twice,
  1                  otherwise.
Here, ν(x) is a normalizing function in [0, 1]; we use the normalization function [4]: ν(x) = x/(x + 1). Notice that there is a non-trivial reason behind the choice of d(b, T) = ν(d_min(b, T)) being applied only when the predicate is executed at least twice [11]. For example, assume the case in which it is always applied. If the predicate is reached and branch b is not covered, then we would have d(b, T) > 0, while the opposite branch b_opp would be covered, and so d(b_opp, T) = 0. The search algorithm might be able to follow the gradient given by d(b, T) > 0 until b is covered, i.e., d(b, T) = 0. However, in that case b_opp would not be covered any more, and so its branch distance would increase, i.e., d(b_opp, T) > 0. Now, the search would have a gradient to cover b_opp but, if it does cover it, then necessarily b would not be covered any more (the predicate is reached only once), and so on. Forcing a predicate to be evaluated at least twice, before assigning ν(d_min(b, T)) to the distance of the non-covered branch, avoids this kind of circular behavior.
Finally, the resulting fitness function to minimize is as follows:

fitness(T) = |M| − |M_T| + Σ_{b_k ∈ B} d(b_k, T)
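For illustration, the fitness computation can be sketched as follows. The helper names are hypothetical, and the per-branch contributions d(b, T) are assumed to be computed by the caller according to the case distinction above:

```java
import java.util.Map;
import java.util.Set;

// Sketch of fitness(T) = |M| - |M_T| + sum over branches of d(b, T).
// Illustrative names only; not EvoSuite's actual implementation.
public class Fitness {
    // nu(x) = x / (x + 1), a normalization into [0, 1)
    static double normalize(double x) {
        return x / (x + 1);
    }

    // branchContribution maps each branch to its d(b, T): 0 if covered,
    // nu(d_min(b, T)) if the predicate ran at least twice, 1 otherwise.
    static double fitness(int totalMethods, Set<String> executedMethods,
                          Map<String, Double> branchContribution) {
        double sum = 0;
        for (double d : branchContribution.values()) sum += d;
        return (totalMethods - executedMethods.size()) + sum;
    }

    public static void main(String[] args) {
        // 3 methods in the SUT, 2 executed; one uncovered branch whose
        // predicate was executed at least twice with d_min = 1: nu(1) = 0.5.
        double f = fitness(3, Set.of("push", "pop"),
                           Map.of("line5-true", normalize(1.0)));
        System.out.println(f); // prints 1.5
    }
}
```

Since fitness is minimized, a fully covering suite reaches fitness 0, and the normalized distances provide the gradient towards it.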
3.4 Bloat Control
A variable size representation can lead to bloat, a problem that is very typical, for example, in Genetic Programming [41]: after each generation, test cases can become longer and longer, until all the memory is consumed, even if shorter sequences are better rewarded. Notice that bloat is an extremely complex phenomenon in evolutionary computation, and after many decades of research it is still an open problem whose dynamics and nature are not completely understood [41].
Bloat occurs when small, negligible improvements in the fitness value are obtained with larger solutions. This is very typical in classification/regression problems. When in software testing the fitness function is just the obtained coverage, then we would not expect bloat, because the fitness function would assume only a few possible values. However, when other metrics with large domains of possible values are introduced (e.g., branch distance, or mutation impact [21]), then bloat might occur.
In previous work [19], we have studied several bloat control methods from the Genetic Programming literature [41] applied in the context of testing object-oriented software. However, our previous study [19] covered only the case of targeting one branch at a time. In EVOSUITE we use the same methods analyzed in that study [19], although further analyses are required to study whether there are differences in their application to handle bloat when evolving test suites rather than single test cases. The employed bloat control methods are:
We put a limit N on the maximum number of test cases and a limit L on their maximum length. Even if we expect the length and number of test cases of an optimal test suite to have low values, we still need to choose comparatively larger N and L. In fact, allowing the search process to employ longer test sequences and then reducing their length during/after the search can provide staggering improvements in terms of achieved coverage [5].
In our GA we use rank selection [49] based on the fitness function (i.e., the obtained coverage and branch distance values). In case of ties, we assign better ranks to smaller test suites. Notice that including the length directly in the fitness function (as done, for example, in [11], [13]) might have side-effects, because we would need to put together and linearly combine two values of different units of measure. Furthermore, although we have two distinct objectives, coverage is more important than size.
Offspring with non-better coverage are never accepted into the next generation if they are larger than their parents (for the details, see Algorithm 1).
We use a dynamic size limit conceptually similar to the one presented by Silva and Costa [41]. If an offspring's coverage is not better than that of the best solution T_B in the current GA population, then it is not accepted into the new generation if it is longer than twice the length of T_B (see Line 18 in Algorithm 1).
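For illustration, the acceptance checks used in Algorithm 1 can be sketched as follows (illustrative names; fitness is minimized):

```java
// Sketch of the offspring-acceptance checks from Algorithm 1:
// offspring may replace their parents only if they do not worsen fitness,
// do not grow without a fitness gain, and stay below twice the length of
// the current best suite T_B. Names are illustrative only.
public class Acceptance {
    // Line 16: accept offspring if fitness improves, or ties without growth.
    static boolean offspringPreferred(double fO, double fP, int lO, int lP) {
        return fO < fP || (fO == fP && lO <= lP);
    }

    // Line 18: dynamic size limit relative to the best individual T_B.
    static boolean withinBloatLimit(int offspringLength, int bestLength) {
        return offspringLength <= 2 * bestLength;
    }

    public static void main(String[] args) {
        System.out.println(offspringPreferred(1.5, 2.0, 40, 30)); // better fitness: true
        System.out.println(offspringPreferred(2.0, 2.0, 40, 30)); // same fitness, longer: false
        System.out.println(withinBloatLimit(50, 20));             // 50 > 2 * 20: false
    }
}
```

Together, these two checks implement the "non-better offspring never grow" policy described above.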
3.5 Search Operators
The GA depicted in Algorithm 1 is at a high level of abstraction, and can be used for many engineering problems in which variable size representations are used. To adapt it to a specific engineering problem, we need to define search operators that manipulate

Citations
More filters
Journal ArticleDOI

The Oracle Problem in Software Testing: A Survey

TL;DR: This paper provides a comprehensive survey of current approaches to the test oracle problem and an analysis of trends in this important area of software testing research and practice.
Journal ArticleDOI

A Hitchhiker's guide to statistical tests for assessing randomized algorithms in software engineering

TL;DR: This paper provides guidelines on how to carry out and properly analyse randomized algorithms applied to solve software engineering tasks, with a particular focus on software testing, which is by far the most frequent application area of randomized algorithms within software engineering.
Proceedings ArticleDOI

Sapienz: multi-objective automated testing for Android applications

TL;DR: Sapienz, an approach to Android testing that uses multi-objective search-based testing to automatically explore and optimise test sequences, minimising length, while simultaneously maximising coverage and fault revelation, significantly outperforms both the state-of-the-art technique Dynodroid and the widely-used tool, Android Monkey.
Proceedings ArticleDOI

DeepGauge: multi-granularity testing criteria for deep learning systems

TL;DR: DeepGauge is proposed, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed and sheds light on the construction of more generic and robust DL systems.
Proceedings ArticleDOI

Fairness testing: testing software for discrimination

TL;DR: Themis is a testing-based method for measuring if and how much software discriminates; it focuses on causality in discriminatory behavior and generates efficient test suites to measure discrimination.
References
Book

An introduction to probability theory

TL;DR: The authors introduce probability theory for both advanced undergraduate students of statistics and scientists in related fields, drawing on real applications in the physical and biological sciences; it "makes probability exciting" (Journal of the American Statistical Association).
Journal ArticleDOI

DART: directed automated random testing

TL;DR: DART is a new tool for automatically testing software that combines three main techniques, automated extraction of the interface of a program with its external environment using static source-code parsing, and dynamic analysis of how the program behaves under random testing and automatic generation of new test inputs to direct systematically the execution along alternative program paths.
Frequently Asked Questions (10)
Q1. What are the contributions in "Whole test suite generation" ?

To overcome this problem, the authors propose a novel paradigm in which whole test suites are evolved with the aim of covering all coverage goals at the same time, while keeping the total size as small as possible. Evaluated on open source libraries and an industrial case study for a total of 1,741 classes, the authors show that EVOSUITE achieved up to 188 times the branch coverage of a traditional approach targeting single branches, with up to 62% smaller test suites.

However, the EVOSUITE approach could be easily applied to procedural software as well, although further research is needed to assess the potential benefits in such a context. 

The branch distance for any given execution of a predicate can be calculated by applying a recursively defined set of rules (see [30], [34] for details). 
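As a flavour of those rules, here are two of the classic base cases (an illustrative sketch of standard branch distance definitions from the search-based testing literature, not code from EvoSuite itself; K is the usual positive constant):

```python
K = 1  # positive constant added when the desired branch is missed

def bd_equals(a, b):
    # Distance to making the predicate `a == b` evaluate to true:
    # zero when already true, otherwise how far apart the operands are.
    return abs(a - b)

def bd_less_than(a, b):
    # Distance to making the predicate `a < b` evaluate to true.
    return 0 if a < b else (a - b) + K
```

The guidance this gives the search is the point: an input with a == 5 and b == 7 is "closer" to covering the branch `a == b` than one with a == 5 and b == 100, even though both miss it.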

Other examples of infeasible branches are given by private methods that are not called in any public methods, dead code, or methods of abstract classes that are overridden in all concrete subclasses without calling the abstract super-class. 

Before presenting the result to the user, test suites are minimized using a simple minimization algorithm [5] which attempts to remove each statement one at a time until all remaining statements contribute to the coverage; this minimization reduces both the number of test cases as well as their length, such that removing any statement in the resulting test suite will reduce its coverage. 
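The one-at-a-time removal described above can be sketched as follows (illustrative only; `covered_by` and the list-of-statements representation are assumed stand-ins for the tool's internals):

```python
def minimize(statements, covered_by):
    # `covered_by(stmts)` returns the set of coverage goals reached by the
    # given statements; minimization must not lose any goal from this set.
    target = covered_by(statements)
    i = 0
    while i < len(statements):
        candidate = statements[:i] + statements[i + 1:]
        if covered_by(candidate) >= target:   # superset check: nothing lost
            statements = candidate            # statement was redundant; drop it
        else:
            i += 1                            # statement contributes; keep it
    return statements
```

After this loop terminates, every remaining statement is necessary: removing any one of them would shrink the covered set, which is exactly the property the text describes.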

Once EVOSUITE is able to generate a test sequence that covers that difficult branch, this sequence can be extended (e.g., by adding function calls at its end) or copied in another test case in the test suite (e.g., through the crossover operator) to make it easier to cover the other nested branches. 

When a test case representation is complex and it is of variable length (as it happens in their case, see Section 3.2), it is often not possible to sample test cases with uniform distribution (i.e., each test case having the same probability of being sampled). 

Although the authors used both open source projects and industrial software as case studies, there is the threat to external validity regarding the generalization to other types of software, which is common for any empirical analysis. 

The statements of the second part are appended one at a time similarly to the insertion described in Section 3.5.2, except that whenever possible dependencies are satisfied using existing values. 

It could be explained by the fact that, when EVOSUITE is worse on a specific SUT, it is worse by only a little, whereas when it is better, it is better by a larger margin.