
Showing papers on "Test suite" published in 2015


Proceedings ArticleDOI
30 Aug 2015
TL;DR: This paper evaluates two well-studied repair tools, GenProg and TrpAutoRepair, on a publicly available benchmark of bugs, each with a human-written patch, and finds that the tools are unlikely to improve the proportion of independent tests passed, and that the quality of the patches is proportional to the coverage of the test suite used during repair.
Abstract: Automated program repair has shown promise for reducing the significant manual effort debugging requires. This paper addresses a deficit of earlier evaluations of automated repair techniques caused by repairing programs and evaluating generated patches' correctness using the same set of tests. Since tests are an imperfect metric of program correctness, evaluations of this type do not discriminate between correct patches and patches that overfit the available tests and break untested but desired functionality. This paper evaluates two well-studied repair tools, GenProg and TrpAutoRepair, on a publicly available benchmark of bugs, each with a human-written patch. By evaluating patches using tests independent from those used during repair, we find that the tools are unlikely to improve the proportion of independent tests passed, and that the quality of the patches is proportional to the coverage of the test suite used during repair. For programs that pass most tests, the tools are as likely to break tests as to fix them. However, novice developers also overfit, and automated repair performs no worse than these developers. In addition to overfitting, we measure the effects of test suite coverage, test suite provenance, and starting program quality, as well as the difference in quality between novice-developer-written and tool-generated patches when quality is assessed with a test suite independent from the one used for patch generation.
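
To make the overfitting phenomenon concrete, here is a minimal, self-contained Java sketch. It is not the paper's harness (GenProg and TrpAutoRepair repair real C programs); the "program", "patch", and "tests" are toy stand-ins invented for illustration:

    import java.util.List;
    import java.util.function.IntUnaryOperator;
    import java.util.function.Predicate;

    // A "program" is a function, a "test" is a predicate on it. The patched
    // version hard-codes the one failing input from the repair suite, so it
    // overfits: held-out tests expose the still-broken behaviour.
    public class OverfitDemo {
        static double passRate(List<Predicate<IntUnaryOperator>> tests, IntUnaryOperator p) {
            return tests.stream().filter(t -> t.test(p)).count() / (double) tests.size();
        }

        public static void main(String[] args) {
            IntUnaryOperator buggy   = x -> x;                // intended behaviour: |x|
            IntUnaryOperator patched = x -> x == -2 ? 2 : x;  // "fixes" only the tested input

            List<Predicate<IntUnaryOperator>> repairSuite = List.of(
                p -> p.applyAsInt(3) == 3, p -> p.applyAsInt(-2) == 2);
            List<Predicate<IntUnaryOperator>> heldOut = List.of(
                p -> p.applyAsInt(-7) == 7, p -> p.applyAsInt(5) == 5);

            System.out.println("repair-suite pass rate: " + passRate(repairSuite, patched)); // 1.0
            System.out.println("held-out pass rate:     " + passRate(heldOut, patched));     // 0.5, same as buggy
        }
    }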

298 citations


Proceedings ArticleDOI
09 Nov 2015
TL;DR: Three state-of-the-art unit test generation tools for Java (Randoop, EvoSuite, and Agitar) are applied to the 357 real faults in the Defects4J dataset and investigated how well the generated test suites perform at detecting these faults.
Abstract: Rather than tediously writing unit tests manually, tools can be used to generate them automatically --- sometimes even resulting in higher code coverage than manual testing. But how good are these tests at actually finding faults? To answer this question, we applied three state-of-the-art unit test generation tools for Java (Randoop, EvoSuite, and Agitar) to the 357 real faults in the Defects4J dataset and investigated how well the generated test suites perform at detecting these faults. Although the automatically generated test suites detected 55.7% of the faults overall, only 19.9% of all the individual test suites detected a fault. By studying the effectiveness and problems of the individual tools and the tests they generate, we derive insights to support the development of automated unit test generators that achieve a higher fault detection rate. These insights include 1) improving the obtained code coverage so that faulty statements are executed in the first instance, 2) improving the propagation of faulty program states to an observable output, coupled with the generation of more sensitive assertions, and 3) improving the simulation of the execution environment to detect faults that are dependent on external factors such as date and time.
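
The pass/fail logic behind such a study can be sketched in a few lines: a generated suite detects a fault if it passes on the fixed version but has at least one failing test on the buggy version. Everything below is a toy stand-in, not actual Defects4J subjects or Randoop/EvoSuite output:

    import java.util.List;
    import java.util.function.IntUnaryOperator;
    import java.util.function.Predicate;

    // Defects4J-style fault detection check with toy program versions.
    public class FaultDetectionCheck {
        public static void main(String[] args) {
            IntUnaryOperator fixedVer = Math::abs;
            IntUnaryOperator buggyVer = x -> x; // fault: negatives unchanged

            // Regression tests generated from the fixed version's behaviour.
            List<Predicate<IntUnaryOperator>> suite = List.of(
                p -> p.applyAsInt(4) == 4,
                p -> p.applyAsInt(-4) == 4);

            boolean passesOnFixed = suite.stream().allMatch(t -> t.test(fixedVer));
            boolean failsOnBuggy  = suite.stream().anyMatch(t -> !t.test(buggyVer));
            System.out.println("fault detected: " + (passesOnFixed && failsOnBuggy));
        }
    }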

193 citations


Proceedings ArticleDOI
14 Jan 2015
TL;DR: K-Java is presented, a complete executable formal semantics of Java 1.4 that is applied to model-check multi-threaded programs and is generic and ready to be used in other Java-related projects.
Abstract: This paper presents K-Java, a complete executable formal semantics of Java 1.4. K-Java was extensively tested with a test suite developed alongside the project, following the Test Driven Development methodology. In order to maintain clarity while handling the great size of Java, the semantics was split into two separate definitions -- a static semantics and a dynamic semantics. The output of the static semantics is a preprocessed Java program, which is passed as input to the dynamic semantics for execution. The preprocessed program is a valid Java program, which uses a subset of the features of Java. The semantics is applied to model-check multi-threaded programs. Both the test suite and the static semantics are generic and ready to be used in other Java-related projects.

137 citations


Proceedings ArticleDOI
16 May 2015
TL;DR: This work proposes an approach to automated repair of software regressions, called relifix, that treats the regression repair problem as a problem of reconciling problematic changes, and derives a set of code transformations from a manual inspection of 73 real software regressions.
Abstract: Regression occurs when code changes introduce failures in previously passing test cases. As software evolves, regressions may be introduced. Fixing regression errors manually is time-consuming and error-prone. We propose an approach to automated repair of software regressions, called relifix, that considers the regression repair problem as a problem of reconciling problematic changes. Specifically, we derive a set of code transformations from our manual inspection of 73 real software regressions; this set of code transformations uses syntactical information from changed statements. Regression repair is then accomplished via a search over the code transformation operators: which operator to apply, and where. Our evaluation compares the repairability of relifix with GenProg on 35 real regression errors. relifix repairs 23 bugs, while GenProg only fixes five. We also measure the likelihood of both approaches introducing new regressions given a reduced test suite. Our experimental results show that our approach is less likely to introduce new regressions than GenProg.
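
A rough, string-level illustration of what change-derived repair operators might look like; relifix itself works on real program syntax and validates each candidate against the test suite, so every name and operator below is a simplification invented for demonstration:

    import java.util.List;
    import java.util.function.UnaryOperator;

    // Toy repair operators derived from the regression-introducing change,
    // e.g. reverting the changed statement or reintroducing an old guard.
    public class RegressionRepairOperators {
        public static void main(String[] args) {
            String oldStmt = "if (x > 0) update(x);";
            String newStmt = "update(x);";            // the change that caused the regression

            List<UnaryOperator<String>> operators = List.of(
                s -> oldStmt,                          // revert to previous version
                s -> "if (x > 0) " + s,                // re-guard with the old condition
                s -> s);                               // keep as-is (identity)

            // A search would try each operator at each changed location and
            // validate the candidates against the test suite (omitted here).
            operators.forEach(op -> System.out.println(op.apply(newStmt)));
        }
    }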

123 citations


Proceedings ArticleDOI
03 Jun 2015
TL;DR: It is argued that this work is the most comprehensive and complete semantic treatment of undefined behavior in C, and thus of the C language itself.
Abstract: We present a "negative" semantics of the C11 language: a semantics that does not just give meaning to correct programs, but also rejects undefined programs. We investigate undefined behavior in C and discuss the techniques and special considerations needed for formally specifying it. We have used these techniques to modify and extend a semantics of C into one that captures undefined behavior. The amount of semantic infrastructure and effort required to achieve this was unexpectedly high, in the end nearly doubling the size of the original semantics. From our semantics, we have automatically extracted an undefinedness checker, which we evaluate against other popular analysis tools, using our own test suite in addition to a third-party test suite. Our checker is capable of detecting examples of all 77 categories of core language undefinedness appearing in the C11 standard, more than any other tool we considered. Based on this evaluation, we argue that our work is the most comprehensive and complete semantic treatment of undefined behavior in C, and thus of the C language itself.

119 citations


Journal ArticleDOI
TL;DR: The search-based EvoSuite test generation tool integrates two novel optimizations: it avoids redundant test executions on mutants by monitoring state infection conditions, and it uses whole test suite generation to optimize test suites towards killing the highest number of mutants, rather than selecting individual mutants.
Abstract: Without a complete formal specification, automatically generated software tests need to be manually checked in order to detect faults. This makes it desirable to produce the strongest possible test set while keeping the number of tests as small as possible. As commonly applied coverage criteria like branch coverage are potentially weak, mutation testing has been proposed as a stronger criterion. However, mutation-based test generation is hampered because usually there are simply too many mutants, and too many of these are either trivially killed or equivalent. On such mutants, any effort spent on test generation would by definition be wasted. To overcome this problem, our search-based EvoSuite test generation tool integrates two novel optimizations: first, we avoid redundant test executions on mutants by monitoring state infection conditions, and second, we use whole test suite generation to optimize test suites towards killing the highest number of mutants, rather than selecting individual mutants. These optimizations allowed us to apply EvoSuite to a random sample of 100 open source projects, consisting of a total of 8,963 classes and more than two million lines of code, leading to a total of 1,380,302 mutants. The experiment demonstrates that our approach scales well, making mutation testing a viable test criterion for automated test case generation tools, and allowing us to analyze the relationship of branch coverage and mutation testing in detail.
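
The "whole test suite" idea can be sketched as a fitness function that counts the distinct mutants killed by the suite as a whole, rather than targeting mutants one at a time. The mutants and tests below are toy stand-ins, not EvoSuite internals:

    import java.util.List;
    import java.util.function.IntUnaryOperator;
    import java.util.function.Predicate;

    // Whole-suite mutation fitness: score a suite by total mutants killed.
    public class WholeSuiteMutationFitness {
        static long mutantsKilled(List<Predicate<IntUnaryOperator>> suite,
                                  List<IntUnaryOperator> mutants) {
            // A mutant is killed if at least one test in the suite fails on it.
            return mutants.stream()
                    .filter(m -> suite.stream().anyMatch(t -> !t.test(m)))
                    .count();
        }

        public static void main(String[] args) {
            List<IntUnaryOperator> mutants = List.of(   // mutations of the original x -> x * 2
                x -> x + 1,
                x -> x * 2 + 1,
                x -> -x * 2);
            List<Predicate<IntUnaryOperator>> suite = List.of(
                p -> p.applyAsInt(0) == 0,
                p -> p.applyAsInt(3) == 6);
            System.out.println("fitness (mutants killed): "
                    + mutantsKilled(suite, mutants) + "/" + mutants.size());
        }
    }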

112 citations


Proceedings ArticleDOI
03 Jun 2015
TL;DR: KJS is presented, the most complete and thoroughly tested formal semantics of JavaScript to date, which revealed that the ECMAScript 5.1 conformance test suite fails to cover several semantic rules.
Abstract: This paper presents KJS, the most complete and thoroughly tested formal semantics of JavaScript to date. Being executable, KJS has been tested against the ECMAScript 5.1 conformance test suite, and passes all 2,782 core language tests. Among the existing implementations of JavaScript, only Chrome V8 passes all of these tests, and no other semantics passes more than 90%. In addition to a reference implementation for JavaScript, KJS also yields a simple coverage metric for a test suite: the set of semantic rules it exercises. Our semantics revealed that the ECMAScript 5.1 conformance test suite fails to cover several semantic rules. Guided by the semantics, we wrote tests to exercise those rules. The new tests revealed bugs both in production JavaScript engines (Chrome V8, Safari WebKit, Firefox SpiderMonkey) and in other semantics. KJS is symbolically executable, thus it can be used for formal analysis and verification of JavaScript programs. We verified non-trivial programs and found a known security vulnerability.

108 citations


Proceedings ArticleDOI
13 Jul 2015
TL;DR: This work proposes a new methodology for testing by leveraging existing test suites such that each test case is systematically exposed to adverse conditions where certain unexpected events may interfere with the execution.
Abstract: Event-driven applications, such as mobile apps, are difficult to test thoroughly. The application programmers often put significant effort into writing end-to-end test suites. Even though such tests often have high coverage of the source code, we find that they often focus on the expected behavior, not on occurrences of unusual events. On the other hand, automated testing tools may be capable of exploring the state space more systematically, but this is mostly without knowledge of the intended behavior of the individual applications. As a consequence, many programming errors remain unnoticed until they are encountered by the users. We propose a new methodology for testing by leveraging existing test suites such that each test case is systematically exposed to adverse conditions where certain unexpected events may interfere with the execution. In this way, we explore the interesting execution paths and take advantage of the assertions in the manually written test suite, while ensuring that the injected events do not affect the expected outcome. The main challenge that we address is how to accomplish this systematically and efficiently. We have evaluated the approach by implementing a tool, Thor, working on Android. The results on four real-world apps with existing test suites demonstrate that apps are often fragile with respect to certain unexpected events and that our methodology effectively increases the testing quality: Of 507 individual tests, 429 fail when exposed to adverse conditions, which reveals 66 distinct problems that are not detected by ordinary execution of the tests.
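
A language-agnostic sketch of the replay idea (Thor itself targets Android lifecycle events): rerun an existing test while injecting "unexpected" events between its steps, and check whether the test's own assertion still holds. The toy app and its fragile pause handler are invented for illustration:

    import java.util.List;
    import java.util.function.Consumer;

    // Replay an existing test with adverse events injected between steps.
    public class AdverseConditionReplay {
        static class App {                  // toy app with a fragile pause handler
            int counter = 0;
            void tapIncrement() { counter++; }
            void onPause() { counter = 0; } // bug: state lost on pause
            void onResume() { }
        }

        public static void main(String[] args) {
            List<Consumer<App>> steps = List.of(App::tapIncrement, App::tapIncrement);

            App plain = new App();
            steps.forEach(s -> s.accept(plain));
            System.out.println("ordinary run ok: " + (plain.counter == 2));   // true

            App adverse = new App();
            for (Consumer<App> step : steps) {
                step.accept(adverse);
                adverse.onPause();          // injected adverse events
                adverse.onResume();
            }
            System.out.println("adverse run ok:  " + (adverse.counter == 2)); // false: fragile
        }
    }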

107 citations


Journal ArticleDOI
TL;DR: It is shown that the optimality of MOGAs can be significantly improved by diversifying the solutions (subsets of the test suite) generated during the search process, by introducing a new MOGA, coined DIversity-based Genetic Algorithm (DIV-GA), based on the mechanisms of orthogonal design and orthogonal evolution.
Abstract: A way to reduce the cost of regression testing consists of selecting or prioritizing subsets of test cases from a test suite according to some criteria. Besides greedy algorithms, cost-cognizant additional greedy algorithms, multi-objective optimization algorithms, and multi-objective genetic algorithms (MOGAs) have also been proposed to tackle this problem. However, previous studies have shown that there is no clear winner between greedy algorithms and MOGAs, and that their combination does not necessarily produce better results. In this paper we show that the optimality of MOGAs can be significantly improved by diversifying the solutions (subsets of the test suite) generated during the search process. Specifically, we introduce a new MOGA, coined DIversity-based Genetic Algorithm (DIV-GA), based on the mechanisms of orthogonal design and orthogonal evolution that increase diversity by injecting new orthogonal individuals during the search process. Results of an empirical study conducted on eleven programs show that DIV-GA outperforms both greedy algorithms and the traditional MOGAs from the optimality point of view. Moreover, the solutions (subsets of the test suite) provided by DIV-GA are able to detect more faults than the other algorithms, while keeping the same test execution cost.

106 citations


Journal ArticleDOI
TL;DR: The proposed high-level strategy utilizes a dynamic multi-armed bandit extreme-value-based reward as an online heuristic selection mechanism to select the appropriate heuristic to be applied at each iteration, instead of using human-designed criteria.
Abstract: Hyper-heuristics are search methodologies that aim to provide high-quality solutions across a wide variety of problem domains, rather than developing tailor-made methodologies for each problem instance/domain. A traditional hyper-heuristic framework has two levels, namely, the high-level strategy (heuristic selection mechanism and the acceptance criterion) and low-level heuristics (a set of problem-specific heuristics). Due to the different landscape structures of different problem instances, the high-level strategy plays an important role in the design of a hyper-heuristic framework. In this paper, we propose a new high-level strategy for a hyper-heuristic framework. The proposed high-level strategy utilizes a dynamic multi-armed bandit extreme-value-based reward as an online heuristic selection mechanism to select the appropriate heuristic to be applied at each iteration. In addition, we propose a gene expression programming framework to automatically generate the acceptance criterion for each problem instance, instead of using human-designed criteria. Two well-known, and very different, combinatorial optimization problems, one static (exam timetabling) and one dynamic (dynamic vehicle routing), are used to demonstrate the generality of the proposed framework. Compared with state-of-the-art hyper-heuristics and other bespoke methods, empirical results demonstrate that the proposed framework is able to generalize well across both domains. We obtain competitive, if not better, results when compared to the best known results obtained from other methods that have been presented in the scientific literature. We also compare our approach against the recently released hyper-heuristic competition test suite. We again demonstrate the generality of our approach when we compare against other methods that have utilized the same six benchmark datasets from this test suite.
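
A stripped-down sketch of bandit-driven heuristic selection; the reward here is a plain running average with a UCB selection rule, a deliberate simplification of the paper's dynamic multi-armed bandit with extreme-value-based reward:

    import java.util.Random;

    // A bandit picks which low-level heuristic to apply each iteration,
    // rewarding arms by the improvement they produce on the objective.
    public class BanditHeuristicSelection {
        public static void main(String[] args) {
            Random rnd = new Random(42);
            double[] meanReward = new double[3]; // one arm per low-level heuristic
            int[] pulls = new int[3];
            double objective = 100.0;            // objective to minimise

            for (int iter = 1; iter <= 50; iter++) {
                int arm = selectUcb(meanReward, pulls, iter);
                // Toy heuristics: higher-numbered arms tend to improve more.
                double delta = (arm + 1) * rnd.nextDouble() - 0.5;
                double reward = Math.max(0, delta);
                objective -= reward;
                pulls[arm]++;
                meanReward[arm] += (reward - meanReward[arm]) / pulls[arm];
            }
            System.out.println("final objective: " + objective);
        }

        static int selectUcb(double[] mean, int[] pulls, int t) {
            int best = 0; double bestScore = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < mean.length; a++) {
                double score = pulls[a] == 0 ? Double.POSITIVE_INFINITY
                        : mean[a] + Math.sqrt(2 * Math.log(t) / pulls[a]);
                if (score > bestScore) { bestScore = score; best = a; }
            }
            return best;
        }
    }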

99 citations


Journal ArticleDOI
TL;DR: This work evaluates the effectiveness of test suites generated to satisfy four coverage criteria through counterexample-based test generation and a random generation approach (where tests are randomly generated until coverage is achieved), contrasted against purely random test suites of equal size.
Abstract: A number of structural coverage criteria have been proposed to measure the adequacy of testing efforts. In the avionics and other critical systems domains, test suites satisfying structural coverage criteria are mandated by standards. With the advent of powerful automated test generation tools, it is tempting to simply generate test inputs to satisfy these structural coverage criteria. However, while techniques to produce coverage-providing tests are well established, the effectiveness of such approaches in terms of fault detection ability has not been adequately studied. In this work, we evaluate the effectiveness of test suites generated to satisfy four coverage criteria through counterexample-based test generation and a random generation approach—where tests are randomly generated until coverage is achieved—contrasted against purely random test suites of equal size. Our results yield three key conclusions. First, coverage criteria satisfaction alone can be a poor indication of fault finding effectiveness, with inconsistent results between the seven case examples (and random test suites of equal size often providing similar—or even higher—levels of fault finding). Second, the use of structural coverage as a supplement—rather than a target—for test generation can have a positive impact, with random test suites reduced to a coverage-providing subset detecting up to 13.5 percent more faults than test suites generated specifically to achieve coverage. Finally, Observable MC/DC, a criterion designed to account for program structure and the selection of the test oracle, can—in part—address the failings of traditional structural coverage criteria, allowing for the generation of test suites achieving higher levels of fault detection than random test suites of equal size. These observations point to risks inherent in the increase in test automation in critical systems, and the need for more research in how coverage criteria, test generation approaches, the test oracle used, and system structure jointly influence test effectiveness.
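
The "coverage as a supplement" reduction can be sketched as a greedy pass that keeps only tests contributing uncovered obligations; the coverage data below is invented, and real reductions typically aim for smaller subsets than this single pass produces:

    import java.util.*;

    // Reduce a large random suite to a coverage-providing subset.
    public class CoverageSubset {
        public static void main(String[] args) {
            // Each test covers a set of coverage obligations (toy data).
            Map<String, Set<Integer>> covers = new LinkedHashMap<>();
            covers.put("t1", Set.of(1, 2));
            covers.put("t2", Set.of(2, 3));
            covers.put("t3", Set.of(1));      // redundant given t1
            covers.put("t4", Set.of(4));

            Set<Integer> covered = new HashSet<>();
            List<String> reduced = new ArrayList<>();
            for (var e : covers.entrySet()) {          // greedy single pass
                if (!covered.containsAll(e.getValue())) {
                    reduced.add(e.getKey());
                    covered.addAll(e.getValue());
                }
            }
            System.out.println("coverage-providing subset: " + reduced); // [t1, t2, t4]
        }
    }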

Proceedings ArticleDOI
30 Aug 2015
TL;DR: This first extensive study compares test-suite reduction and regression test selection approaches individually and also evaluates a combination of the two approaches, and proposes a new criterion to measure the quality of tests with respect to software changes.
Abstract: Regression testing is widely used to check that changes made to software do not break existing functionality, but regression test suites grow, and running them fully can become costly. Researchers have proposed test-suite reduction and regression test selection as two approaches to reduce this cost by not running some of the tests from the test suite. However, previous research has not empirically evaluated how the two approaches compare to each other, and how well a combination of these approaches performs. We present the first extensive study that compares test-suite reduction and regression test selection approaches individually, and also evaluates a combination of the two approaches. We also propose a new criterion to measure the quality of tests with respect to software changes. Our experiments on 4,793 commits from 17 open-source projects show that regression test selection runs on average fewer tests (by 40.15pp) than test-suite reduction. However, test-suite reduction can have a high loss in fault-detection capability with respect to the changes, whereas a (safe) regression test selection has no loss. The experiments also show that a combination of the two approaches runs even fewer tests (on average 5.34pp) than regression test selection, but these tests still have a loss in fault-detection capability with respect to the changes.
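
For contrast with reduction, safe regression test selection can be sketched as rerunning exactly the tests whose covered entities overlap the change; the coverage map and changed set below are toy data:

    import java.util.*;

    // Select the tests whose coverage intersects the changed entities.
    public class RegressionTestSelection {
        public static void main(String[] args) {
            Map<String, Set<String>> coverage = Map.of(
                "t1", Set.of("Parser.parse", "Lexer.next"),
                "t2", Set.of("Printer.print"),
                "t3", Set.of("Lexer.next"));
            Set<String> changed = Set.of("Lexer.next");

            List<String> selected = coverage.entrySet().stream()
                .filter(e -> e.getValue().stream().anyMatch(changed::contains))
                .map(Map.Entry::getKey)
                .sorted()
                .toList();
            System.out.println("tests to rerun: " + selected); // [t1, t3]
        }
    }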

Proceedings ArticleDOI
30 Aug 2015
TL;DR: The number of assertions in a test suite strongly correlates with its effectiveness, and this factor directly influences the relationship between test suite size and effectiveness.
Abstract: Code coverage is a popular test adequacy criterion in practice. Code coverage, however, remains controversial as there is a lack of coherent empirical evidence for its relation with test suite effectiveness. More recently, test suite size has been shown to be highly correlated with effectiveness. However, previous studies treat test methods as the smallest unit of interest, and ignore potential factors influencing this relationship. We propose to go beyond test suite size, by investigating test assertions inside test methods. We empirically evaluate the relationship between a test suite’s effectiveness and the (1) number of assertions, (2) assertion coverage, and (3) different types of assertions. We compose 6,700 test suites in total, using 24,000 assertions of five real-world Java projects. We find that the number of assertions in a test suite strongly correlates with its effectiveness, and this factor directly influences the relationship between test suite size and effectiveness. Our results also indicate that assertion coverage is strongly correlated with effectiveness and different types of assertions can influence the effectiveness of their containing test suites.
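
A toy illustration of why assertion count and assertion coverage matter beyond statement coverage: both "tests" below execute the same code, but only the one asserting on more outputs exposes the fault (the faulty max function is invented for demonstration):

    import java.util.function.IntBinaryOperator;

    // Same statement coverage, different assertion coverage.
    public class AssertionEffectiveness {
        public static void main(String[] args) {
            IntBinaryOperator faultyMax = (a, b) -> a >= b ? a : a; // bug: never returns b

            // One assertion: passes, fault survives.
            boolean weak = faultyMax.applyAsInt(5, 3) == 5;
            // More assertions covering more outputs: fault exposed.
            boolean strong = faultyMax.applyAsInt(5, 3) == 5
                          && faultyMax.applyAsInt(3, 5) == 5;

            System.out.println("weak test passes:   " + weak);   // true
            System.out.println("strong test passes: " + strong); // false -> fault found
        }
    }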

Book ChapterDOI
05 Sep 2015
TL;DR: A search-based test generator could easily be extended with additional fitness functions to capture the properties developers may desire of their unit test suites, as standard structural coverage criteria do not fully capture these properties.
Abstract: Automated test generation techniques typically aim at maximising coverage of well-established structural criteria such as statement or branch coverage. In practice, generating tests only for one specific criterion may not be sufficient when testing object-oriented classes, as standard structural coverage criteria do not fully capture the properties developers may desire of their unit test suites. For example, covering a large number of statements could be easily achieved by just calling the main method of a class; yet, a good unit test suite would consist of smaller unit tests invoking individual methods, and checking return values and states with test assertions. There are several different properties that test suites should exhibit, and a search-based test generator could easily be extended with additional fitness functions to capture these properties.

Journal ArticleDOI
TL;DR: The effectiveness of this approach is shown and it is discussed how to gain a more expressive test suite by combining cheap but undirected random test case generation with the more expensive but directed mutation-based technique.
Abstract: This article presents the techniques and results of a novel model-based test case generation approach that automatically derives test cases from UML state machines. The main contribution of this article is the fully automated fault-based test case generation technique together with two empirical case studies derived from industrial use cases. Also, an in-depth evaluation of different fault-based test case generation strategies on each of the case studies is given and a comparison with plain random testing is conducted. The test case generation methodology supports a wide range of UML constructs and is grounded on the formal semantics of Back's action systems and the well-known input-output conformance relation. Mutation operators are employed on the level of the specification to insert faults and generate test cases that will reveal the faults inserted. The effectiveness of this approach is shown and it is discussed how to gain a more expressive test suite by combining cheap but undirected random test case generation with the more expensive but directed mutation-based technique. Finally, an extensive and critical discussion of the lessons learnt is given as well as a future outlook on the general usefulness and practicability of mutation-based test case generation.

Journal ArticleDOI
TL;DR: The results show that Random-Weighted Genetic Algorithm (RWGA) significantly outperforms the other algorithms since RWGA can balance all the objectives together by dynamically updating weights during each generation.

Journal ArticleDOI
TL;DR: The results show that GA is competitive only for pairwise testing for subjects with a small number of constraints; the results for the greedy algorithm are actually slightly superior. However, the results are critically dependent on the approach adopted for constraint handling.
Abstract: Combinatorial interaction testing (CIT) is important because it tests the interactions between the many features and parameters that make up the configuration space of software systems. Simulated Annealing (SA) and Greedy Algorithms have been widely used to find CIT test suites. From the literature, there is a widely-held belief that SA is slower, but produces more effective test suites than Greedy, and that SA cannot scale to higher strength coverage. We evaluated both algorithms on seven real-world subjects for the well-studied two-way up to the rarely-studied six-way interaction strengths. Our findings present evidence to challenge this current orthodoxy: real-world constraints allow SA to achieve higher strengths. Furthermore, there was no evidence that Greedy was less effective (in terms of time to fault revelation) compared to SA; the results for the greedy algorithm are actually slightly superior. However, the results are critically dependent on the approach adopted to constraint handling. Moreover, we have also evaluated a genetic algorithm for constrained CIT test suite generation. This is the first time strengths higher than 3 and constraint handling have been used to evaluate GA. Our results show that GA is competitive only for pairwise testing for subjects with a small number of constraints.
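
The greedy family the study evaluates can be sketched compactly for pairwise (2-way) coverage: repeatedly add the sampled candidate row that covers the most still-uncovered value pairs. This sketch has no constraint handling, which the paper shows is the critical ingredient:

    import java.util.*;

    // Greedy pairwise covering-array construction over toy parameters.
    public class GreedyPairwise {
        public static void main(String[] args) {
            int[] levels = {2, 2, 2};            // three two-valued parameters
            Set<String> uncovered = new HashSet<>();
            for (int i = 0; i < levels.length; i++)
                for (int j = i + 1; j < levels.length; j++)
                    for (int a = 0; a < levels[i]; a++)
                        for (int b = 0; b < levels[j]; b++)
                            uncovered.add(i + "=" + a + "," + j + "=" + b);

            Random rnd = new Random(1);
            List<int[]> suite = new ArrayList<>();
            while (!uncovered.isEmpty()) {
                int[] best = null; int bestGain = -1;
                for (int c = 0; c < 50; c++) {   // sample candidate rows
                    int[] row = new int[levels.length];
                    for (int i = 0; i < row.length; i++) row[i] = rnd.nextInt(levels[i]);
                    int gain = (int) pairs(row).stream().filter(uncovered::contains).count();
                    if (gain > bestGain) { bestGain = gain; best = row; }
                }
                uncovered.removeAll(pairs(best));
                suite.add(best);
            }
            suite.forEach(r -> System.out.println(Arrays.toString(r)));
        }

        static Set<String> pairs(int[] row) {    // all value pairs a row covers
            Set<String> ps = new HashSet<>();
            for (int i = 0; i < row.length; i++)
                for (int j = i + 1; j < row.length; j++)
                    ps.add(i + "=" + row[i] + "," + j + "=" + row[j]);
            return ps;
        }
    }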

Journal ArticleDOI
TL;DR: In this paper, a new approach based on Cuckoo Search concepts is used to generate combinatorial test suites for a new type of application and case study; the strategy is evaluated through different benchmarks and achieves comparative results.
Abstract: A new approach is used to generate combinatorial test suites. The application of Cuckoo Search is investigated for a new type of application and case study. The strategy opens a new approach in Search-Based Software Testing (SBST). The strategy is evaluated through different benchmarks and is able to get comparative results. The application of combinatorial optimization is also investigated in the current paper.

Context: Software has become an innovative solution nowadays for many applications and methods in science and engineering. Ensuring the quality and correctness of software is challenging because each program has different configurations and input domains. To ensure the quality of software, all possible configurations and input combinations need to be evaluated against their expected outputs. However, this exhaustive test is impractical because of time and resource constraints due to the large domain of input and configurations. Thus, different sampling techniques have been used to sample these input domains and configurations.

Objective: Combinatorial testing can be used to effectively detect faults in software-under-test. This technique uses combinatorial optimization concepts to systematically minimize the number of test cases by considering the combinations of inputs. This paper proposes a new strategy to generate combinatorial test suites by using Cuckoo Search concepts.

Method: Cuckoo Search is used in the design and implementation of a strategy to construct optimized combinatorial sets. The strategy consists of different algorithms for construction. These algorithms are combined to serve the Cuckoo Search.

Results: The efficiency and performance of the new technique were proven through different experiment sets. The effectiveness of the strategy is assessed by applying the generated test suites on a real-world case study for the purpose of functional testing.

Conclusion: Results show that the generated test suites can detect faults effectively. In addition, the strategy also opens a new direction for the application of Cuckoo Search in the context of software engineering.

Proceedings ArticleDOI
02 Nov 2015
TL;DR: The results of the study show that the proposed similarity-based quality measure is significantly more effective for prioritizing test cases compared to existing test case quality measures.
Abstract: Test case prioritization is a crucial element of software quality assurance in practice, especially in the context of regression testing. Typically, test cases are prioritized in a way that they detect the potential faults earlier. The effectiveness of test cases, in terms of fault detection, is estimated using quality metrics, such as code coverage, size, and historical fault detection. Prior studies have shown that previously failing test cases are highly likely to fail again in the next releases; therefore, they are ranked highly during prioritization. However, in practice, a failing test case may not be exactly the same as a previously failed test case, but quite similar, e.g., when the new failing test is a slightly modified version of an old failing one to catch an undetected fault. In this paper, we define a class of metrics that estimate the test cases' quality using their similarity to the previously failing test cases. We have conducted several experiments with five real-world open source software systems, with real faults, to evaluate the effectiveness of these quality metrics. The results of our study show that our proposed similarity-based quality measure is significantly more effective for prioritizing test cases compared to existing test case quality measures.
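
One member of the proposed class of metrics might look like the following: rank tests by their maximum similarity to previously failing tests. The Jaccard-over-name-tokens similarity and all test names here are assumptions for illustration:

    import java.util.*;

    // Prioritize tests most similar to past failures.
    public class SimilarityPrioritization {
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a); inter.retainAll(b);
            Set<String> union = new HashSet<>(a); union.addAll(b);
            return union.isEmpty() ? 0 : (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            List<Set<String>> pastFailures = List.of(Set.of("login", "empty", "password"));
            Map<String, Set<String>> tests = Map.of(
                "testLoginEmptyUser", Set.of("login", "empty", "user"),
                "testCheckoutTotal",  Set.of("checkout", "total"));

            tests.entrySet().stream()
                .sorted(Comparator.comparingDouble((Map.Entry<String, Set<String>> e) ->
                        pastFailures.stream().mapToDouble(f -> jaccard(f, e.getValue()))
                                    .max().orElse(0))
                    .reversed())
                .forEach(e -> System.out.println(e.getKey())); // most similar first
        }
    }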

Journal ArticleDOI
TL;DR: The Genetic Algorithm behind the EvoSuite test generation tool is extended into a Memetic Algorithm by equipping it with several local search operators designed to efficiently optimize primitive values and other aspects of a test suite that allow the search for test cases to function more effectively.

Proceedings ArticleDOI
13 Jul 2015
TL;DR: This work proposes a technique for finding tests that pollute state shared across tests: tests that modify some location on the heap shared across tests or on the file system, such that a subsequent test could fail if it assumes the shared location still has its initial value.
Abstract: Writing reliable test suites for large object-oriented systems is complex and time-consuming. One common cause of unreliable test suites is test dependencies that can cause tests to fail unexpectedly, exposing bugs not in the code under test but in the test code itself. Prior research has shown that the main reason for test dependencies is the "pollution" of state shared across tests. We propose a technique for finding tests that pollute the shared state. In a nutshell, the technique finds tests that modify some location on the heap shared across tests or on the file system; a subsequent test could fail if it assumes the shared location to have the initial value before the state was modified. To aid in inspecting the pollutions, the technique provides an access path through the heap that leads to the polluted value, or the name of the file that was modified. We implemented a prototype tool for Java and evaluated it on NumOfProjects projects, with a total of NumOfTests tests. Diaper reported PollutingTests, and our inspection found that NumOfTPsSpace of those are relevant pollutions that can easily affect other tests.
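
A minimal Java example of the kind of pollution such a technique hunts for, with a hand-rolled before/after check standing in for the tool's heap snapshotting; the class and field names are invented:

    // testA mutates static shared state; testB passes only if it runs first.
    public class PollutionExample {
        static final java.util.List<String> CACHE = new java.util.ArrayList<>();

        static void testA() { CACHE.add("warm"); }          // pollutes shared state
        static boolean testB() { return CACHE.isEmpty(); }  // assumes clean state

        public static void main(String[] args) {
            int before = CACHE.size();                      // snapshot before the test
            testA();
            if (CACHE.size() != before)                     // compare after the test
                System.out.println("testA polluted shared location: PollutionExample.CACHE");
            System.out.println("testB passes after testA: " + testB()); // false
        }
    }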

Journal ArticleDOI
TL;DR: It is found that, on one hand, tool support leads to clear improvements in commonly applied quality metrics such as code coverage (up to a 300% increase); on the other hand, there was no measurable improvement in the number of bugs actually found by developers.
Abstract: Work on automated test generation has produced several tools capable of generating test data which achieves high structural coverage over a program. In the absence of a specification, developers are expected to manually construct or verify the test oracle for each test input. Nevertheless, it is assumed that these generated tests ease the task of testing for the developer, as testing is reduced to checking the results of tests. While this assumption has persisted for decades, there has been no conclusive evidence to date confirming it. However, the limited adoption in industry indicates this assumption may not be correct, and calls into question the practical value of test generation tools. To investigate this issue, we performed two controlled experiments comparing a total of 97 subjects split between writing tests manually and writing tests with the aid of an automated unit test generation tool, EvoSuite. We found that, on one hand, tool support leads to clear improvements in commonly applied quality metrics such as code coverage (up to a 300% increase). However, on the other hand, there was no measurable improvement in the number of bugs actually found by developers. Our results not only cast some doubt on how the research community evaluates test generation tools, but also point to improvements and future work necessary before automated test generation tools will be widely adopted by practitioners.

Proceedings ArticleDOI
16 May 2015
TL;DR: This work uses association rule learning to identify patterns among failing test steps that are typical of false test alarms and can be used to automatically classify them, showing a mean precision between 0.85 and 0.90 while predicting between 34% and 48% of all false test alarms.
Abstract: Applying code changes to software systems and testing these code changes can be a complex task that involves many different types of software testing strategies, e.g. system and integration tests. However, not all test failures reported during code integration are hinting towards code defects. Testing large systems such as the Microsoft Windows operating system requires complex test infrastructures, which may lead to test failures caused by faulty tests and test infrastructure issues. Such false test alarms are particularly annoying as they raise engineer attention and require manual inspection without providing any benefit. The goal of this work is to use empirical data to minimize the number of false test alarms reported during system and integration testing. To achieve this goal, we use association rule learning to identify patterns among failing test steps that are typical of false test alarms and can be used to automatically classify them. A successful classification of false test alarms is particularly valuable for product teams as manual test failure inspection is an expensive and time-consuming process that not only costs engineering time and money but also slows down product development. We evaluated our approach on system and integration tests executed during Windows 8.1 and Microsoft Dynamics AX development. Performing more than 10,000 classifications for each product, our model shows a mean precision between 0.85 and 0.90, predicting between 34% and 48% of all false test alarms.
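
The classification step can be sketched as matching learned rules against the set of failing test steps; the rules, step names, and confidence threshold below are invented, since the paper mines them from historical Windows and Dynamics AX test runs:

    import java.util.*;

    // Classify a test failure as a false alarm via mined association rules.
    public class FalseAlarmRules {
        public static void main(String[] args) {
            // rule antecedent (failing steps) -> confidence it is a false alarm
            Map<Set<String>, Double> rules = Map.of(
                Set.of("CopyBinaries", "MountTestShare"), 0.92,
                Set.of("RebootHost"), 0.88);

            Set<String> failingSteps = Set.of("CopyBinaries", "MountTestShare", "RunSuite");

            double confidence = rules.entrySet().stream()
                .filter(r -> failingSteps.containsAll(r.getKey())) // rule applies
                .mapToDouble(Map.Entry::getValue)
                .max().orElse(0.0);

            System.out.println(confidence >= 0.85
                ? "classified as false test alarm (confidence " + confidence + ")"
                : "escalate to engineers");
        }
    }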

Proceedings ArticleDOI
13 Jul 2015
TL;DR: An extensive empirical study of the effectiveness of multi-objective test case prioritisation is presented, evaluating it on multiple versions of five widely-used benchmark programs and a much larger real-world system of over 1 million lines of code.
Abstract: The aim of test case prioritisation is to determine an ordering of test cases that maximises the likelihood of early fault revelation. Previous prioritisation techniques have tended to be single-objective, for which the additional greedy algorithm is the current state-of-the-art. Unlike test suite minimisation, multi-objective test case prioritisation has not been thoroughly evaluated. This paper presents an extensive empirical study of the effectiveness of multi-objective test case prioritisation, evaluating it on multiple versions of five widely-used benchmark programs and a much larger real-world system of over 1 million lines of code. The paper also presents a lossless coverage compaction algorithm that dramatically scales the performance of all algorithms studied by between 2 and 4 orders of magnitude, making prioritisation practical for even very demanding problems.
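
One plausible form of lossless compaction (an assumption about the general flavour, not necessarily the paper's exact algorithm): coverage goals covered by exactly the same tests are indistinguishable to any coverage-based prioritiser, so their columns can be merged:

    import java.util.*;

    // Collapse duplicate columns of a coverage matrix without losing information.
    public class CoverageCompaction {
        public static void main(String[] args) {
            // goal -> set of tests covering it (toy data)
            Map<String, Set<String>> matrix = Map.of(
                "line17", Set.of("t1", "t2"),
                "line18", Set.of("t1", "t2"),   // duplicate of line17's column
                "line40", Set.of("t3"));

            Map<Set<String>, List<String>> compacted = new LinkedHashMap<>();
            matrix.forEach((goal, tests) ->
                compacted.computeIfAbsent(tests, k -> new ArrayList<>()).add(goal));

            System.out.println(matrix.size() + " goals -> "
                    + compacted.size() + " distinct columns: " + compacted);
        }
    }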

Proceedings ArticleDOI
02 Mar 2015
TL;DR: This paper analyzes two large software systems to measure the relationship of code coverage and its effectiveness in killing real bugs from the software systems, and finds that there is indeed a statistically significant correlation between code coverage and bug kill effectiveness.
Abstract: During software maintenance, testing is a crucial activity to ensure the quality of program code as it evolves over time. With the increasing size and complexity of software, adequate software testing has become increasingly important. Code coverage is often used as a yardstick to gauge the comprehensiveness of test cases and the adequacy of testing. A test suite's quality is often measured by the number of bugs it can find (a.k.a. kill). Previous studies have analysed the quality of a test suite by its ability to kill mutants, i.e., artificially seeded faults. However, mutants do not necessarily represent real bugs. Moreover, many studies use small programs, which threatens the applicability of the results to large real-world systems. In this paper, we analyse two large software systems to measure the relationship of code coverage and its effectiveness in killing real bugs from the software systems. We use Randoop, a random test generation tool, to generate test suites with varying levels of coverage and run them to analyse if the test suites can kill each of the real bugs or not. In this preliminary study, we have performed an experiment on 67 and 92 real bugs from Apache HTTPClient and Mozilla Rhino, respectively. Our experiment finds that there is indeed a statistically significant correlation between code coverage and bug kill effectiveness. The strengths of the correlation, however, differ for the two software systems. For HTTPClient, the correlation is moderate for both statement and branch coverage. For Rhino, the correlation is strong for both statement and branch coverage.

Proceedings ArticleDOI
30 Aug 2015
TL;DR: A novel approach to detecting all dependencies between test cases in large projects that can enable safe exploitation of parallelism and test selection with a modest analysis cost is presented.
Abstract: Slow builds remain a plague for software developers. The frequency with which code can be built (compiled, tested and packaged) directly impacts the productivity of developers: longer build times mean a longer wait before determining if a change to the application being built was successful. We have discovered that in the case of some languages, such as Java, the majority of build time is spent running tests, where dependencies between individual tests are complicated to discover, making many existing test acceleration techniques unsound to deploy in practice. Without knowledge of which tests are dependent on others, we cannot safely parallelize the execution of the tests, nor can we perform incremental testing (i.e., execute only a subset of an application's tests for each build). The previous techniques for detecting these dependencies did not scale to large test suites: given a test suite that normally ran in two hours, the best-case running scenario for the previous tool would have taken over 422 CPU days to find dependencies between all test methods (and would not soundly find all dependencies) — on the same project the exhaustive technique (to find all dependencies) would have taken over 10^300 years. We present a novel approach to detecting all dependencies between test cases in large projects that can enable safe exploitation of parallelism and test selection with a modest analysis cost.

Proceedings ArticleDOI
01 Jan 2015
TL;DR: The results show that a system based on a reasonably-sized semantic lexicon and a manageable number of non-first-order axioms enables efficient logical inferences, including those concerned with generalized quantifiers and intensional operators, and outperforms the state-of-the-art first-order inference system.
Abstract: We present a higher-order inference system based on a formal compositional semantics and the wide-coverage CCG parser. We develop an improved method to bridge between the parser and semantic composition. The system is evaluated on the FraCaS test suite. In contrast to the widely held view that higher-order logic is unsuitable for efficient logical inferences, the results show that a system based on a reasonably-sized semantic lexicon and a manageable number of non-first-order axioms enables efficient logical inferences, including those concerned with generalized quantifiers and intensional operators, and outperforms the state-of-the-art first-order inference system.

Book ChapterDOI
03 Nov 2015
TL;DR: Using an adaptation of state-of-the-art algorithms for black-box automata learning, as implemented in the LearnLib tool, it was possible to learn a model of the Engine Status Manager, a software component that is used in printers and copiers of Océ.
Abstract: Using an adaptation of state-of-the-art algorithms for black-box automata learning, as implemented in the LearnLib tool, we succeeded in learning a model of the Engine Status Manager (ESM), a software component that is used in printers and copiers of Océ. The main challenge that we encountered was that LearnLib, although effective in constructing hypothesis models, was unable to find counterexamples for some hypotheses. In fact, none of the existing FSM-based conformance testing methods that we tried worked for this case study. We therefore implemented an extension of the algorithm of Lee and Yannakakis for computing an adaptive distinguishing sequence. Even when an adaptive distinguishing sequence does not exist, Lee and Yannakakis' algorithm produces an adaptive sequence that 'almost' identifies states. In combination with a standard algorithm for computing separating sequences for pairs of states, we managed to verify states with on average 3 test queries. Altogether, we needed around 60 million queries to learn a model of the ESM with 77 inputs and 3,410 states. We also constructed a model directly from the ESM software and established equivalence with the learned model. To the best of our knowledge, this is the first paper in which active automata learning has been applied to industrial control software.

Proceedings ArticleDOI
13 Apr 2015
TL;DR: This paper proposes a novel approach that maximizes both code coverage and diversity among the selected test cases using NSGA-II multi- objective optimization, and the results show a significant improvement in fault detection rate.
Abstract: Test case selection is a classic testing technique to choose a subset of existing test cases for execution, due to the limited budget and tight deadlines. While 'code coverage' is the state of practice among test case selection heuristics, recent literature has shown that 'test case diversity' is also a very promising approach. In this paper, we first compare these two heuristics for test case selection in several real-world case studies (Apache Ant, Derby, JBoss, NanoXML and Math). The results show that neither of the two techniques completely dominates the other, but they can potentially be complementary. Therefore, we next propose a novel approach that maximizes both code coverage and diversity among the selected test cases using NSGA-II multi-objective optimization, and the results show a significant improvement in fault detection rate. Specifically, sometimes this novel approach detects up to 16% (Ant), 10% (JBoss), and 14% (Math) more faults compared to either of the coverage- or diversity-based approaches, when the testing budget is less than 20% of the entire test suite execution cost.
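
The bi-objective core of such an approach reduces to Pareto dominance over (coverage, diversity) scores; a full NSGA-II run is out of scope here, and the candidate names and scores are toy values:

    import java.util.*;

    // Keep only non-dominated candidate test selections.
    public class CoverageDiversityPareto {
        record Candidate(String name, double coverage, double diversity) {
            boolean dominates(Candidate o) {
                return coverage >= o.coverage && diversity >= o.diversity
                    && (coverage > o.coverage || diversity > o.diversity);
            }
        }

        public static void main(String[] args) {
            List<Candidate> front = List.of(
                new Candidate("A", 0.90, 0.20),
                new Candidate("B", 0.70, 0.60),
                new Candidate("C", 0.65, 0.55));  // dominated by B

            List<Candidate> nonDominated = front.stream()
                .filter(c -> front.stream().noneMatch(o -> o != c && o.dominates(c)))
                .toList();
            System.out.println("non-dominated selections: " + nonDominated); // A and B
        }
    }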

Journal ArticleDOI
TL;DR: Using this new automation testing framework, testers can easily write their test cases efficiently and in less time, and the framework helps developers analyze their code thanks to its screenshot capture feature.