
Showing papers on "Test suite" published in 2017


Journal ArticleDOI
TL;DR: Empirical analysis on Nopol shows that the approach can effectively fix bugs with buggy if conditions and missing preconditions on two large open-source projects, namely Apache Commons Math and Apache Commons Lang.
Abstract: We propose Nopol, an approach to automatic repair of buggy conditional statements (i.e., if-then-else statements). This approach takes a buggy program as well as a test suite as input and generates a patch with a conditional expression as output. The test suite is required to contain passing test cases to model the expected behavior of the program and at least one failing test case that reveals the bug to be repaired. The process of Nopol consists of three major phases. First, Nopol employs angelic fix localization to identify expected values of a condition during test execution. Second, runtime trace collection is used to collect variables and their actual values, including primitive data types and object-oriented features (e.g., nullness checks), to serve as building blocks for patch generation. Third, Nopol encodes these collected data into an instance of a Satisfiability Modulo Theories (SMT) problem; a feasible solution to the SMT instance is then translated back into a code patch. We evaluate Nopol on 22 real-world bugs (16 bugs with buggy if conditions and 6 bugs with missing preconditions) on two large open-source projects, namely Apache Commons Math and Apache Commons Lang. Empirical analysis on these bugs shows that our approach can effectively fix bugs with buggy if conditions and missing preconditions. We illustrate the capabilities and limitations of Nopol using case studies of real bug fixes.
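To make the pipeline concrete, here is a minimal Java sketch of the condition-synthesis step, assuming angelic fix localization has already recorded, for each test run, the variable values at the buggy condition and the outcome that would make the test pass. Plain enumeration of candidate predicates stands in for Nopol's SMT encoding, and all names (Observation, synthesize) are illustrative, not Nopol's API.

```java
import java.util.*;
import java.util.function.Predicate;

public class ConditionSynthesisSketch {

    /** One runtime snapshot: variable values at the buggy condition, plus the
     *  "angelic" outcome that would make the enclosing test pass. */
    record Observation(Map<String, Integer> vars, boolean expected) {}

    /** Search a small space of candidate predicates for one that is
     *  consistent with every observation collected during the test runs. */
    static Optional<String> synthesize(List<Observation> obs) {
        Map<String, Predicate<Map<String, Integer>>> candidates = new LinkedHashMap<>();
        for (String a : obs.get(0).vars().keySet()) {
            for (String b : obs.get(0).vars().keySet()) {
                if (a.equals(b)) continue;
                candidates.put(a + " < " + b, v -> v.get(a) < v.get(b));
                candidates.put(a + " <= " + b, v -> v.get(a) <= v.get(b));
            }
            candidates.put(a + " == 0", v -> v.get(a) == 0);
        }
        return candidates.entrySet().stream()
                .filter(c -> obs.stream()
                        .allMatch(o -> c.getValue().test(o.vars()) == o.expected()))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        List<Observation> obs = List.of(
                new Observation(Map.of("i", 0, "len", 3), true),   // failing test needs true here
                new Observation(Map.of("i", 3, "len", 3), false),  // passing test needs false
                new Observation(Map.of("i", 0, "len", 0), false)); // rules out "i == 0"
        System.out.println(synthesize(obs).orElse("no patch"));    // prints: i < len
    }
}
```

The real system's SMT encoding explores a much richer expression space, but the consistency check against the collected observations is the same idea.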

354 citations


Journal ArticleDOI
TL;DR: This paper carefully selects (or modifies) 15 test problems with diverse properties to construct a benchmark test suite, aiming to promote research on evolutionary many-objective optimization (EMaO) by suggesting a set of test problems with a good representation of various real-world scenarios.
Abstract: In the real world, it is not uncommon to face an optimization problem with more than three objectives. Such problems, called many-objective optimization problems (MaOPs), pose great challenges to the area of evolutionary computation. The failure of conventional Pareto-based multi-objective evolutionary algorithms in dealing with MaOPs motivates various new approaches. However, in contrast to the rapid development of algorithm design, performance investigation and comparison of algorithms have received little attention. Test problem suites originally designed for multi-objective optimization are still dominantly used in many-objective optimization. In this paper, we carefully select (or modify) 15 test problems with diverse properties to construct a benchmark test suite, aiming to promote research on evolutionary many-objective optimization (EMaO) by suggesting a set of test problems with a good representation of various real-world scenarios. Also, an open-source software platform with a user-friendly GUI is provided to facilitate the experimental execution and data observation.
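As an illustration of what one entry in such a suite provides, the sketch below implements a scalable DTLZ2-style problem in Java, where the number of objectives m is a parameter; the paper's 15 problems are different, more varied constructions, so treat this only as the general shape of a benchmark problem.

```java
public class ManyObjectiveProblemSketch {

    /** Evaluate m objectives for decision vector x (all x[i] in [0,1]). */
    static double[] evaluate(double[] x, int m) {
        int k = x.length - m + 1;                  // number of "distance" variables
        double g = 0.0;                            // distance to the Pareto front
        for (int i = x.length - k; i < x.length; i++) {
            g += (x[i] - 0.5) * (x[i] - 0.5);
        }
        double[] f = new double[m];
        for (int j = 0; j < m; j++) {
            double v = 1.0 + g;
            for (int i = 0; i < m - 1 - j; i++) v *= Math.cos(x[i] * Math.PI / 2);
            if (j > 0) v *= Math.sin(x[m - 1 - j] * Math.PI / 2);
            f[j] = v;
        }
        return f;                                  // on the front, sum of f_j^2 == 1 when g == 0
    }

    public static void main(String[] args) {
        double[] x = {0.3, 0.7, 0.5, 0.5, 0.5};    // 2 position + 3 distance variables
        for (double fj : evaluate(x, 3)) System.out.printf("%.4f%n", fj);
    }
}
```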

268 citations


Journal ArticleDOI
01 Aug 2017
TL;DR: The result of the experiment shows that the considered state-of-the-art repair methods can generate patches for 47 out of 224 bugs; however, those patches are only test-suite adequate, which means that they pass the test suite and may potentially be incorrect beyond the test-suite satisfaction correctness criterion.
Abstract: Defects4J is a large, peer-reviewed, structured dataset of real-world Java bugs. Each bug in Defects4J comes with a test suite and at least one failing test case that triggers the bug. In this paper, we report on an experiment to explore the effectiveness of automatic test-suite based repair on Defects4J. The result of our experiment shows that the considered state-of-the-art repair methods can generate patches for 47 out of 224 bugs. However, those patches are only test-suite adequate, which means that they pass the test suite and may potentially be incorrect beyond the test-suite satisfaction correctness criterion. We have manually analyzed 84 different patches to assess their real correctness. In total, 9 real Java bugs can be correctly repaired with test-suite based repair. This analysis shows that test-suite based repair suffers from under-specified bugs, for which trivial or incorrect patches still pass the test suite. With respect to practical applicability, it takes on average 14.8 minutes to find a patch. The experiment was done on a scientific grid, totaling 17.6 days of computation time. All the repair systems and experimental results are publicly available on GitHub in order to facilitate future research on automatic repair.
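The notion of a merely test-suite-adequate patch is easiest to see on a toy under-specified bug. In the hypothetical Java sketch below (invented for illustration, not a Defects4J subject), a weak suite accepts a trivial patch that hard-codes the one failing input.

```java
import java.util.List;
import java.util.function.IntUnaryOperator;

public class OverfittingPatchSketch {

    record Test(int input, int expected) {}

    static boolean suitePasses(IntUnaryOperator abs, List<Test> suite) {
        return suite.stream().allMatch(t -> abs.applyAsInt(t.input()) == t.expected());
    }

    public static void main(String[] args) {
        // One failing test reveals the bug; nothing else constrains negative inputs.
        List<Test> suite = List.of(new Test(3, 3), new Test(0, 0), new Test(-2, 2));

        IntUnaryOperator buggy   = x -> x;                 // broken abs(): fails Test(-2, 2)
        IntUnaryOperator trivial = x -> x == -2 ? 2 : x;   // overfits: hard-codes the failing input
        IntUnaryOperator correct = x -> x < 0 ? -x : x;    // the real fix

        System.out.println(suitePasses(buggy, suite));     // false
        System.out.println(suitePasses(trivial, suite));   // true, yet wrong for e.g. -5
        System.out.println(suitePasses(correct, suite));   // true
    }
}
```

Both the trivial and the correct patch are "test-suite adequate" here, which is precisely why the paper's manual correctness analysis was necessary.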

186 citations


Proceedings ArticleDOI
30 Oct 2017
TL;DR: The results show that ssFix successfully repaired 20 bugs with valid patches generated and that it outperformed five other repair techniques for Java.
Abstract: We present our automated program repair technique ssFix which leverages existing code (from a code database) that is syntax-related to the context of a bug to produce patches for its repair. Given a faulty program and a fault-exposing test suite, ssFix performs fault localization to identify suspicious statements that are likely to be faulty. For each such statement, ssFix identifies a code chunk (or target chunk) including the statement and its local context. ssFix works on the target chunk to produce patches. To do so, it first performs syntactic code search to find candidate code chunks that are syntax-related, i.e., structurally similar and conceptually related, to the target chunk from a code database (or codebase) consisting of the local faulty program and an external code repository. ssFix assumes the correct fix to be contained in the candidate chunks, and it leverages each candidate chunk to produce patches for the target chunk. To do so, ssFix translates the candidate chunk by unifying the names used in the candidate chunk with those in the target chunk; matches the chunk components (expressions and statements) between the translated candidate chunk and the target chunk; and produces patches for the target chunk based on the syntactic differences that exist between the matched components and in the unmatched components. ssFix finally validates the generated patched programs against the test suite and reports the first one that passes the test suite. We evaluated ssFix on 357 bugs in the Defects4J bug dataset. Our results show that ssFix successfully repaired 20 bugs with valid patches generated and that it outperformed five other repair techniques for Java.
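A toy stand-in for the syntactic code search step: rank codebase chunks by token overlap with the buggy target chunk. ssFix's real matching is structure- and concept-aware, so Jaccard similarity over identifiers is only a rough proxy for the idea.

```java
import java.util.*;

public class CodeSearchSketch {

    static Set<String> tokens(String code) {
        return new HashSet<>(Arrays.asList(code.split("\\W+")));
    }

    /** Jaccard similarity between the token sets of two code chunks. */
    static double similarity(String a, String b) {
        Set<String> inter = new HashSet<>(tokens(a));
        inter.retainAll(tokens(b));
        Set<String> union = new HashSet<>(tokens(a));
        union.addAll(tokens(b));
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        String target = "if (idx <= buf.length) { return buf[idx]; }";
        List<String> codebase = List.of(
                "if (idx < buf.length) { return buf[idx]; }",   // likely fix source
                "for (int i = 0; i < n; i++) sum += a[i];");
        // Most similar candidate chunks first.
        codebase.stream()
                .sorted(Comparator.comparingDouble(c -> -similarity(target, c)))
                .forEach(System.out::println);
    }
}
```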

167 citations


Journal ArticleDOI
TL;DR: A new benchmark generator is proposed that is able to tune a number of challenging characteristics, including mixed Pareto-optimal front (convexity-concavity), nonmonotonic and time-varying variable linkages, mixed types of changes, and randomness in type of change, which have rarely, if ever, been considered or tested in the literature.
Abstract: Dynamic multiobjective optimization (DMO) has received growing research interest in recent years since many real-world optimization problems appear to not only have multiple objectives that conflict with each other but also change over time. The time-varying characteristics of these DMO problems (DMOPs) pose new challenges to evolutionary algorithms. Considering the importance of a representative and diverse set of benchmark functions for DMO, in this paper, we propose a new benchmark generator that is able to tune a number of challenging characteristics, including mixed Pareto-optimal front (convexity–concavity), nonmonotonic and time-varying variable linkages, mixed types of changes, and randomness in type of change, which have rarely, if ever, been considered or tested in the literature. A test suite of ten instances with different dynamic features is produced from the generator. Additionally, a few new performance measures are proposed to evaluate algorithms for DMOPs with different characteristics. Six representative multiobjective evolutionary algorithms from the literature are investigated based on the proposed DMO test suite and performance measures. The experimental results facilitate a better understanding of strengths and weaknesses of these compared algorithms for DMOPs.
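For a flavour of the kind of instance such a generator emits, here is a Java sketch of the classic FDA1-style dynamic bi-objective problem, in which the optimum of the distance variables shifts with discretized time. The paper's ten instances are richer, tunable constructions; this only shows the time-dependence mechanism.

```java
public class DynamicProblemSketch {

    /** Discrete time: tauT generations per change, nT distinct severities. */
    static double time(int generation, int tauT, int nT) {
        return (1.0 / nT) * Math.floor(generation / (double) tauT);
    }

    /** f = {f1, f2}; the optimum of the distance variables follows G(t). */
    static double[] evaluate(double[] x, double t) {
        double gt = Math.sin(0.5 * Math.PI * t);     // moving optimum G(t)
        double g = 1.0;
        for (int i = 1; i < x.length; i++) g += (x[i] - gt) * (x[i] - gt);
        double f1 = x[0];
        double f2 = g * (1.0 - Math.sqrt(f1 / g));
        return new double[] {f1, f2};
    }

    public static void main(String[] args) {
        double[] x = {0.5, 0.0, 0.0};
        for (int gen : new int[] {0, 50, 100}) {
            double t = time(gen, 25, 10);
            double[] f = evaluate(x, t);
            System.out.printf("gen=%d t=%.1f f=(%.3f, %.3f)%n", gen, t, f[0], f[1]);
        }
    }
}
```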

155 citations


Proceedings ArticleDOI
10 Jul 2017
TL;DR: In this article, the Retecs method uses reinforcement learning to select and prioritize test cases according to their duration, time of last execution, and failure history, in a constantly changing environment.
Abstract: Testing in Continuous Integration (CI) involves test case prioritization, selection, and execution at each cycle. Selecting the most promising test cases to detect bugs is hard if there are uncertainties about the impact of committed code changes or if traceability links between code and tests are not available. This paper introduces Retecs, a new method for automatically learning test case selection and prioritization in CI, with the goal of minimizing the round-trip time between code commits and developer feedback on failed test cases. The Retecs method uses reinforcement learning to select and prioritize test cases according to their duration, time of last execution, and failure history. In a constantly changing environment, where new test cases are created and obsolete test cases are deleted, the Retecs method learns to prioritize error-prone test cases higher under guidance of a reward function and by observing previous CI cycles. By applying Retecs on data extracted from three industrial case studies, we show for the first time that reinforcement learning enables fruitful automatic adaptive test case selection and prioritization in CI and regression testing.
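A stripped-down version of the learning loop, assuming only failure history as state: each test carries a learned value that is updated from a reward after every CI cycle and used to order the next cycle. The real method also uses duration and time of last execution, plus function approximation; the names below are illustrative, not Retecs's API.

```java
import java.util.*;

public class RlPrioritizationSketch {

    static final double ALPHA = 0.3;                           // learning rate
    static final Map<String, Double> value = new HashMap<>();  // per-test estimate

    /** Higher learned value means scheduled earlier in the next cycle. */
    static List<String> prioritize(Collection<String> tests) {
        List<String> order = new ArrayList<>(tests);
        order.sort(Comparator.comparingDouble(t -> -value.getOrDefault(t, 0.5)));
        return order;
    }

    /** Reward 1 for a test that failed (i.e., found a bug), 0 otherwise. */
    static void observe(String test, boolean failed) {
        double v = value.getOrDefault(test, 0.5);
        value.put(test, v + ALPHA * ((failed ? 1.0 : 0.0) - v));
    }

    public static void main(String[] args) {
        List<String> suite = List.of("testA", "testB", "testC");
        // Cycle 1: testB fails; cycle 2: it is scheduled first.
        observe("testA", false); observe("testB", true); observe("testC", false);
        System.out.println(prioritize(suite));  // [testB, testA, testC]
    }
}
```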

131 citations


Proceedings ArticleDOI
20 May 2017
TL;DR: The results demonstrate challenges that need to be addressed in order to improve fault detection in test generation tools, as well as requirements for successful technology transfer to industrial practice, such as the need to integrate with popular build tools and to improve the readability of the generated tests.
Abstract: Automated unit test generation has been extensively studied in the literature in recent years. Previous studies on open source systems have shown that test generation tools are quite effective at detecting faults, but how effective and applicable are they in an industrial application? In this paper, we investigate this question using a life insurance and pension products calculator engine owned by SEB Life & Pension Holding AB Riga Branch. To study fault-finding effectiveness, we extracted 25 real faults from the version history of this software project, and applied two up-to-date unit test generation tools for Java, EVOSUITE and RANDOOP, which implement search-based and feedback-directed random test generation, respectively. Automatically generated test suites detected up to 56.40% (EVOSUITE) and 38.00% (RANDOOP) of these faults. The analysis of our results demonstrates challenges that need to be addressed in order to improve fault detection in test generation tools. In particular, classification of the undetected faults shows that 97.62% of them depend on either "specific primitive values" (50.00%) or the construction of "complex state configuration of objects" (47.62%). To study applicability, we surveyed the developers of the application under test on their experience and opinions about the test generation tools and the generated test cases. This leads to insights on requirements for academic prototypes for successful technology transfer from academic research to industrial practice, such as a need to integrate with popular build tools, and to improve the readability of the generated tests.
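The dominant category of undetected faults, "specific primitive values", is easy to picture: the hypothetical snippet below contains a fault that only the input 67 triggers, which random or search-based input generation is unlikely to produce. The code is invented for illustration and is not taken from the studied calculator engine.

```java
public class SpecificValueFaultSketch {

    static double premium(int customerAge) {
        if (customerAge == 67) {            // boundary handled with the wrong rate
            return 100.0 * 0.9;             // bug: should be 100.0 * 0.8
        }
        return customerAge >= 65 ? 100.0 * 0.8 : 100.0;
    }

    public static void main(String[] args) {
        // A generated suite that never tries exactly 67 cannot detect the bug.
        System.out.println(premium(66));   // 80.0
        System.out.println(premium(67));   // 90.0, expected 80.0 -> fault
    }
}
```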

122 citations


Proceedings ArticleDOI
10 Jul 2017
TL;DR: DiffTGen is proposed, which identifies a patched program as overfitting by first generating new test inputs that uncover semantic differences between the original faulty program and the patched program, then testing the patched program based on those differences, and finally generating test cases from the differences it uncovers.
Abstract: A typical automatic program repair technique that uses a test suite as the correctness criterion can produce a patched program that is test-suite-overfitted, or overfitting, which passes the test suite but does not actually repair the bug. In this paper, we propose DiffTGen, which identifies a patched program as overfitting by first generating new test inputs that uncover semantic differences between the original faulty program and the patched program, then testing the patched program based on those differences, and finally generating test cases from the uncovered differences. Such a test case could be added to the original test suite to make it stronger and could prevent the repair technique from generating a similar overfitting patch again. We evaluated DiffTGen on 89 patches generated by four automatic repair techniques for Java, with 79 of them being likely to be overfitting and incorrect. DiffTGen identifies in total 39 (49.4%) overfitting patches and yields the corresponding test cases. We further show that an automatic repair technique, if configured with DiffTGen, could avoid yielding overfitting patches and potentially produce correct ones.
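The core operation is differencing two versions of a program on newly generated inputs. In the toy Java sketch below, an overfitting patch for abs() is compared against a reference version (a role the developer's own fix can play when checking patch correctness), and the first disagreeing input becomes a new test case. All names and programs are illustrative.

```java
import java.util.Optional;
import java.util.function.IntUnaryOperator;

public class SemanticDiffSketch {

    /** Scan a small input range for a behavioural difference. */
    static Optional<Integer> differencingInput(IntUnaryOperator p1, IntUnaryOperator p2) {
        for (int x = -1000; x <= 1000; x++) {
            if (p1.applyAsInt(x) != p2.applyAsInt(x)) return Optional.of(x);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        IntUnaryOperator overfit = x -> x == -2 ? 2 : x;   // passes the weak suite
        IntUnaryOperator oracle  = x -> x < 0 ? -x : x;    // reference behaviour
        differencingInput(overfit, oracle).ifPresent(x ->
                System.out.println("add test: assertEquals(" + oracle.applyAsInt(x)
                        + ", abs(" + x + "))"));           // witnesses input -1000
    }
}
```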

104 citations


Journal ArticleDOI
TL;DR: It is found that keeping an archive of already covered goals along with the tests covering them, and focusing the search on uncovered goals, overcomes the small drawback of missing a few individually targeted goals on larger classes, leading to an improved overall effectiveness of whole test suite generation.
Abstract: A common application of search-based software testing is to generate test cases for all goals defined by a coverage criterion (e.g., lines, branches, mutants). Rather than generating one test case at a time for each of these goals individually, whole test suite generation optimizes entire test suites towards satisfying all goals at the same time. There is evidence that the overall coverage achieved with this approach is superior to that of targeting individual coverage goals. Nevertheless, there remains some uncertainty on (a) whether the results generalize beyond branch coverage, (b) whether the whole test suite approach might be inferior to a more focused search for some particular coverage goals, and (c) whether generating whole test suites could be optimized by only targeting coverage goals not already covered. In this paper, we perform an in-depth analysis to study these questions. An empirical study on 100 Java classes using three different coverage criteria reveals that indeed there are some testing goals that are only covered by the traditional approach, although their number is only very small in comparison with those which are exclusively covered by the whole test suite approach. We find that keeping an archive of already covered goals along with the tests covering them and focusing the search on uncovered goals overcomes this small drawback on larger classes, leading to an improved overall effectiveness of whole test suite generation.
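A minimal sketch of the archive bookkeeping described here: once a goal is covered it leaves the search's fitness computation, and the test covering it is stored for the final suite. The actual implementation in a tool like EvoSuite is considerably more involved; names below are illustrative.

```java
import java.util.*;

public class CoverageArchiveSketch {

    record TestCase(String name, Set<String> coveredGoals) {}

    private final Map<String, TestCase> archive = new HashMap<>(); // goal -> covering test
    private final Set<String> remaining;

    CoverageArchiveSketch(Set<String> allGoals) {
        this.remaining = new HashSet<>(allGoals);
    }

    /** Register a candidate; only still-uncovered goals count as progress. */
    void offer(TestCase t) {
        for (String goal : t.coveredGoals()) {
            if (remaining.remove(goal)) archive.put(goal, t);
        }
    }

    /** Fitness signal for the ongoing search: goals left to cover. */
    int remainingGoals() { return remaining.size(); }

    /** Final suite = archived tests (one test may cover several goals). */
    Set<TestCase> suite() { return new HashSet<>(archive.values()); }

    public static void main(String[] args) {
        CoverageArchiveSketch a = new CoverageArchiveSketch(Set.of("b1", "b2", "b3"));
        a.offer(new TestCase("t1", Set.of("b1", "b2")));
        a.offer(new TestCase("t2", Set.of("b2")));             // b2 already archived
        System.out.println(a.remainingGoals());                // 1 (b3 uncovered)
        a.suite().forEach(t -> System.out.println(t.name()));  // t1
    }
}
```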

96 citations


Proceedings ArticleDOI
21 Aug 2017
TL;DR: This paper studied 61 projects that use Travis CI, a cloud-based continuous integration tool, in order to examine real test failures that were encountered by the developers of those projects, and found that 18% of test suite executions fail and that 13% of these failures are flaky.
Abstract: Software defects cost time and money to diagnose and fix. Consequently, developers use a variety of techniques to avoid introducing defects into their systems. However, these techniques have costs of their own; the benefit of using a technique must outweigh the cost of applying it. In this paper we investigate the costs and benefits of automated regression testing in practice. Specifically, we studied 61 projects that use Travis CI, a cloud-based continuous integration tool, in order to examine real test failures that were encountered by the developers of those projects. We determined how the developers resolved the failures they encountered and used this information to classify the failures as being caused by a flaky test, by a bug in the system under test, or by a broken or obsolete test. We consider that test failures caused by bugs represent a benefit of the test suite, while failures caused by broken or obsolete tests represent a test suite maintenance cost. We found that 18% of test suite executions fail and that 13% of these failures are flaky. Of the non-flaky failures, only 74% were caused by a bug in the system under test; the remaining 26% were due to incorrect or obsolete tests. In addition, we found that, in the failed builds, only 0.38% of the test case executions failed and 64% of failed builds contained more than one failed test. Our findings contribute to a wider understanding of the unforeseen costs that can impact the overall cost effectiveness of regression testing in practice. They can also inform research into test case selection techniques, as we have provided an approximate empirical bound on the practical value that could be extracted from such techniques. This value appears to be large, as the 61 systems under study contained nearly 3 million lines of test code and yet over 99% of test case executions could have been eliminated with a perfect oracle.

86 citations


Proceedings ArticleDOI
Goran Petrovic, Marko Ivankovic
01 May 2017
TL;DR: This work presents a diff-based probabilistic approach to mutation analysis that drastically reduces the number of mutants by omitting lines of code without statement coverage and lines that are determined to be uninteresting, which are dubbed arid lines.
Abstract: Mutation testing assesses test suite efficacy by inserting small faults into programs and measuring the ability of the test suite to detect them. It is widely considered the strongest test criterion in terms of finding the most faults, and it subsumes a number of other coverage criteria. Traditional mutation analysis is computationally prohibitive, which hinders its adoption as an industry standard. In order to alleviate the computational issues, we present a diff-based probabilistic approach to mutation analysis that drastically reduces the number of mutants by omitting lines of code without statement coverage and lines that are determined to be uninteresting; we dub these arid lines. Furthermore, by reducing the number of mutants and carefully selecting only the most interesting ones, we make it easier for humans to understand and evaluate the result of mutation analysis. We propose a heuristic for judging whether a node is arid or not, conditioned on the programming language. We focus on a code-review based approach and consider the effects of surfacing mutation results on developer attention. The described system is used by 6,000 engineers at Google on all code changes they author or review, affecting in total more than 13,000 code authors as part of the mandatory code review process. The system processes about 30% of all diffs across Google that have statement coverage calculated. About 15% of coverage statement calculations fail across Google.
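A toy version of the filtering step: mutate a line only if it has statement coverage and is not judged arid. The heuristic below only guesses at the flavour of the arid judgment; the paper's heuristic operates on AST nodes and is conditioned on the programming language, and these hint strings are not Google's actual rules.

```java
import java.util.List;
import java.util.Set;

public class AridLineFilterSketch {

    // E.g., logging and housekeeping calls rarely yield useful mutants.
    static final Set<String> ARID_HINTS = Set.of("LOG.", "logger.", "System.out", "toString(");

    static boolean isArid(String line) {
        return ARID_HINTS.stream().anyMatch(line::contains);
    }

    /** Keep a line as a mutation target only if covered and non-arid. */
    static boolean shouldMutate(String line, boolean hasStatementCoverage) {
        return hasStatementCoverage && !isArid(line);
    }

    public static void main(String[] args) {
        List<String> diff = List.of(
                "int total = price * quantity;",
                "LOG.info(\"computed total\");");
        for (String line : diff) {
            System.out.println(shouldMutate(line, true) + "  <- " + line);
        }  // true for the computation, false for the log statement
    }
}
```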

Journal ArticleDOI
TL;DR: This paper describes the experience with four hyper-heuristic selection and acceptance mechanisms, namely Exponential Monte Carlo with counter (EMCQ), Choice Function (CF), Improvement Selection Rules (ISR), and the newly developed Fuzzy Inference Selection (FIS), using the t-way test generation problem as a case study.

Proceedings ArticleDOI
21 Aug 2017
TL;DR: In this article, a taxonomy of 262 types of Android faults, grouped into 14 categories, is systematically devised by manually analyzing 2,023 software artifacts from different sources (e.g., bug reports, commits).
Abstract: Mutation testing has been widely used to assess the fault-detection effectiveness of a test suite, as well as to guide test case generation or prioritization. Empirical studies have shown that, while mutants are generally representative of real faults, an effective application of mutation testing requires “traditional” operators designed for programming languages to be augmented with operators specific to an application domain and/or technology. This paper proposes MDroid+, a framework for effective mutation testing of Android apps. First, we systematically devise a taxonomy of 262 types of Android faults grouped in 14 categories by manually analyzing 2,023 software artifacts from different sources (e.g., bug reports, commits). Then, we identify a set of 38 mutation operators, and implement an infrastructure to automatically seed mutations in Android apps with 35 of the identified operators. The taxonomy and the proposed operators have been evaluated in terms of stillborn/trivial mutants generated as compared to well-known mutation tools, and in terms of their capacity to represent real faults in Android apps.
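Mutation operators in such a framework are essentially pattern-based rewrite rules over app source. The operator sketched below (swapping an Intent action) is invented for illustration and is not necessarily one of the paper's 38 operators.

```java
import java.util.ArrayList;
import java.util.List;

public class MutationOperatorSketch {

    record Mutant(int line, String original, String mutated) {}

    /** Seed the fault wherever the source pattern occurs in a file. */
    static List<Mutant> apply(List<String> source) {
        List<Mutant> mutants = new ArrayList<>();
        for (int i = 0; i < source.size(); i++) {
            String line = source.get(i);
            if (line.contains("Intent.ACTION_VIEW")) {
                mutants.add(new Mutant(i + 1, line,
                        line.replace("Intent.ACTION_VIEW", "Intent.ACTION_EDIT")));
            }
        }
        return mutants;
    }

    public static void main(String[] args) {
        List<String> src = List.of(
                "Intent i = new Intent(Intent.ACTION_VIEW, uri);",
                "startActivity(i);");
        apply(src).forEach(m -> System.out.println(m.line() + ": " + m.mutated()));
    }
}
```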

Proceedings ArticleDOI
20 May 2017
TL;DR: This paper proposes the idea of learning to test, which learns the characteristics of bug-revealing test programs from previous test programs that triggered bugs, and builds on it LET, an approach to prioritizing test programs for compiler testing acceleration.
Abstract: Compiler testing is a crucial way of guaranteeing the reliability of compilers (and software systems in general). Many techniques have been proposed to facilitate automated compiler testing. These techniques rely on a large number of test programs (which are test inputs of compilers) generated by some test-generation tools (e.g., CSmith). However, these compiler testing techniques have serious efficiency problems as they usually take a long period of time to find compiler bugs. To accelerate compiler testing, it is desirable to prioritize the generated test programs so that the test programs that are more likely to trigger compiler bugs are executed earlier. In this paper, we propose the idea of learning to test, which learns the characteristics of bug-revealing test programs from previous test programs that triggered bugs. Based on the idea of learning to test, we propose LET, an approach to prioritizing test programs for compiler testing acceleration. LET consists of a learning process and a scheduling process. In the learning process, LET identifies a set of features of test programs, trains a capability model to predict the probability that a new test program triggers compiler bugs, and trains a time model to predict the execution time of a test program. In the scheduling process, LET prioritizes new test programs according to their bug-revealing probabilities in unit time, which is calculated based on the two trained models. Our extensive experiments show that LET significantly accelerates compiler testing. In particular, LET reduces more than 50% of the testing time in 24.64% of the cases, and reduces between 25% and 50% of the testing time in 36.23% of the cases.
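The scheduling process reduces to a simple rule: run first the programs with the highest predicted bug-revealing probability per unit time. The sketch below assumes the two learned models have already produced their predictions; record and field names are illustrative.

```java
import java.util.*;

public class LetSchedulingSketch {

    record TestProgram(String id, double bugProbability, double predictedSeconds) {
        double score() { return bugProbability / predictedSeconds; }
    }

    public static void main(String[] args) {
        List<TestProgram> generated = new ArrayList<>(List.of(
                new TestProgram("p1", 0.02, 1.0),
                new TestProgram("p2", 0.09, 10.0),
                new TestProgram("p3", 0.05, 2.0)));
        // Highest probability per second first: p3 (0.025), p1 (0.020), p2 (0.009).
        generated.sort(Comparator.comparingDouble(TestProgram::score).reversed());
        generated.forEach(p -> System.out.println(p.id()));
    }
}
```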

Journal ArticleDOI
TL;DR: This work investigates the application of temporal planners to the problem of compiling quantum circuits to newly emerging quantum hardware, and generates a test suite of compilation problems for QAOA circuits of various sizes to a realistic hardware architecture.
Abstract: To run quantum algorithms on emerging gate-model quantum hardware, quantum circuits must be compiled to take into account constraints on the hardware. For near-term hardware, with only limited means to mitigate decoherence, it is critical to minimize the duration of the circuit. We investigate the application of temporal planners to the problem of compiling quantum circuits to newly emerging quantum hardware. While our approach is general, we focus on compiling to superconducting hardware architectures with nearest-neighbor constraints. Our initial experiments focus on compiling Quantum Alternating Operator Ansatz (QAOA) circuits, whose high number of commuting gates allows great flexibility in the order in which the gates can be applied. That freedom makes it more challenging to find optimal compilations, but also means there is a greater potential win from more optimized compilation than for less flexible circuits. We map this quantum circuit compilation problem to a temporal planning problem, and generate a test suite of compilation problems for QAOA circuits of various sizes to a realistic hardware architecture. We report compilation results from several state-of-the-art temporal planners on this test set. This early empirical evaluation demonstrates that temporal planning is a viable approach to quantum circuit compilation.

Proceedings ArticleDOI
21 Aug 2017
TL;DR: This paper presents an energy-aware mutation testing framework, called μDROID, that can be used by developers to assess the adequacy of their test suite for revealing energy-related defects; it relies on a novel, automatic oracle to determine if a mutant can be killed by a test.
Abstract: The rising popularity of mobile apps deployed on battery-constrained devices underlines the need for effectively evaluating their energy properties. However, currently there is a lack of testing tools for evaluating the energy properties of apps. As a result, for energy testing, developers are relying on tests intended for evaluating the functional correctness of apps. Such tests may not be adequate for revealing energy defects and inefficiencies in apps. This paper presents an energy-aware mutation testing framework, called μDROID, that can be used by developers to assess the adequacy of their test suite for revealing energy-related defects. μDROID implements fifty energy-aware mutation operators and relies on a novel, automatic oracle to determine if a mutant can be killed by a test. Our evaluation on real-world Android apps shows the ability of proposed mutation operators for evaluating the utility of tests in revealing energy defects. Moreover, our automated oracle can detect whether tests kill the energy mutants with an overall accuracy of 94%, thereby making it possible to apply μDROID automatically.

Proceedings ArticleDOI
20 May 2017
TL;DR: This paper proposes a metric, called DDU, aimed at complementing adequacy measurements by quantifying a test-suite's diagnosability, i.e., the effectiveness of applying spectrum-based fault localization to pinpoint faults in the code in the event of test failures.
Abstract: Current metrics for assessing the adequacy of a test-suite plainly focus on the number of components (be it lines, branches, paths) covered by the suite, but do not explicitly check how the tests actually exercise these components and whether they provide enough information so that spectrum-based fault localization techniques can perform accurate fault isolation. We propose a metric, called DDU, aimed at complementing adequacy measurements by quantifying a test-suite's diagnosability, i.e., the effectiveness of applying spectrum-based fault localization to pinpoint faults in the code in the event of test failures. Our aim is to increase the value generated by creating thorough test-suites, so they are not only regarded as error detection mechanisms but also as effective diagnostic aids that help widely-used fault-localization techniques to accurately pinpoint the location of bugs in the system. Our experiments show that optimizing a test suite with respect to DDU yields a 34% gain in spectrum-based fault localization report accuracy when compared to the standard branch-coverage metric.
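A sketch of the metric's shape, computed from a coverage (activity) matrix with tests as rows and components as columns: normalized density times test diversity times component uniqueness. The diversity term here (fraction of distinct rows) simplifies the paper's formulation, so treat this as an approximation of DDU rather than its exact definition.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DduSketch {

    /** DDU-like score from an activity matrix a[test][component]. */
    static double ddu(boolean[][] a) {
        int n = a.length, m = a[0].length;
        // Density rho, normalized so the optimum sits at rho = 0.5.
        double ones = 0;
        for (boolean[] row : a) for (boolean hit : row) if (hit) ones++;
        double rho = ones / (n * m);
        double density = 1 - Math.abs(1 - 2 * rho);
        // Diversity: fraction of distinct test activity rows (simplification).
        Set<String> rows = new HashSet<>();
        for (boolean[] row : a) rows.add(Arrays.toString(row));
        double diversity = (double) rows.size() / n;
        // Uniqueness: fraction of components with distinct columns.
        Set<String> cols = new HashSet<>();
        for (int j = 0; j < m; j++) {
            StringBuilder c = new StringBuilder();
            for (boolean[] row : a) c.append(row[j] ? '1' : '0');
            cols.add(c.toString());
        }
        double uniqueness = (double) cols.size() / m;
        return density * diversity * uniqueness;
    }

    public static void main(String[] args) {
        boolean[][] activity = {
                {true,  true,  false},
                {false, true,  true},
                {true,  false, true}};
        System.out.printf("DDU = %.3f%n", ddu(activity));  // 0.667 * 1.0 * 1.0
    }
}
```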

Proceedings ArticleDOI
21 Aug 2017
TL;DR: A technique to improve mutation testing that is called wild-caught mutants is introduced, which provides a method for creating potential faults that are more closely coupled with changes made by actual programmers in real-world cases.
Abstract: Mutation testing of a test suite and a program provides a way to measure the quality of the test suite. In essence, mutation testing is a form of sensitivity testing: by running mutated versions of the program against the test suite, mutation testing measures the suite's sensitivity for detecting bugs that a programmer might introduce into the program. This paper introduces a technique to improve mutation testing that we call wild-caught mutants; it provides a method for creating potential faults that are more closely coupled with changes made by actual programmers. This technique allows the mutation tester to have more certainty that the test suite is sensitive to the kind of changes that have been observed to have been made by programmers in real-world cases.
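The harvesting idea in miniature: each one-line change in a real bug-fix diff yields a (fixed -> faulty) rewrite rule that can later be applied in reverse to seed a realistic mutant. The diff parsing below is deliberately naive and the names are illustrative, not the paper's tooling.

```java
import java.util.*;

public class WildCaughtMutantsSketch {

    record Pattern(String fixed, String faulty) {}   // apply fixed -> faulty to mutate

    /** Pair consecutive -/+ lines of a unified diff into rewrite patterns. */
    static List<Pattern> harvest(List<String> diffLines) {
        List<Pattern> patterns = new ArrayList<>();
        for (int i = 0; i + 1 < diffLines.size(); i++) {
            if (diffLines.get(i).startsWith("-") && diffLines.get(i + 1).startsWith("+")) {
                String faulty = diffLines.get(i).substring(1).trim();
                String fixed = diffLines.get(i + 1).substring(1).trim();
                patterns.add(new Pattern(fixed, faulty));
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        List<String> diff = List.of(
                "- if (i <= list.size())",
                "+ if (i < list.size())");
        for (Pattern p : harvest(diff)) {
            // Mutation: wherever the fixed form occurs, reintroduce the fault.
            System.out.println(p.fixed() + "  ==>  " + p.faulty());
        }
    }
}
```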

Journal ArticleDOI
TL;DR: This paper proposes a new dynamic multiobjective benchmark test suite which contains a number of component functions with clearly defined properties to assess the diversity maintenance and tracking ability of a dynamic multiobjective evolutionary algorithm (MOEA).
Abstract: The growing trend of dynamic multiobjective optimization research in the evolutionary computation community has increased the need for a challenging and conceptually simple benchmark test suite to assess the optimization performance of an algorithm. This paper proposes a new dynamic multiobjective benchmark test suite which contains a number of component functions with clearly defined properties to assess the diversity maintenance and tracking ability of a dynamic multiobjective evolutionary algorithm (MOEA). Time-varying fitness landscape modality, tradeoff connectedness, and tradeoff degeneracy are considered, as these properties rarely exist in current benchmark test instances. A cross-problem comparative study is presented to analyze the sensitivity of a given algorithm to certain fitness landscape properties. To demonstrate the use of the proposed benchmark test suite, three evolutionary multiobjective algorithms, namely the nondominated sorting genetic algorithm, a decomposition-based MOEA, and a recently proposed Kalman-filter-based prediction approach, are analyzed and compared. In addition, two problem-specific performance metrics are designed to assess the convergence and diversity performances, respectively. By applying the proposed test suite and performance metrics, microscopic performance details of these algorithms are uncovered to provide insightful guidance to the algorithm designer.

Journal ArticleDOI
TL;DR: A novel encryption scheme is presented, wherein an encryption key is generated by two distant complex nonlinear units, forced into synchronization by a chaotic driver, and the obtained bit series fulfill the randomness conditions as defined by the National Institute of Standards and Technology (NIST) test suite.
Abstract: We present a novel encryption scheme, wherein an encryption key is generated by two distant complex nonlinear units, forced into synchronization by a chaotic driver. The concept is sufficiently generic to be implemented on either photonic, optoelectronic or electronic platforms. The method for generating the key bitstream from the chaotic signals is reconfigurable. Although derived from a deterministic process, the obtained bit series fulfill the randomness conditions as defined by the National Institute of Standards and Technology (NIST) test suite. We demonstrate the feasibility of our concept on an electronic delay oscillator circuit and test the robustness against attacks using a state-of-the-art system identification method.
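As a worked example of the randomness conditions involved, below is the first check of the NIST SP 800-22 suite (the frequency, or monobit, test) in Java: the p-value is erfc(|S_n| / sqrt(2n)), and p >= 0.01 is the customary pass threshold. The bit source here is a seeded PRNG standing in for the chaotic key generator.

```java
import java.util.Random;

public class MonobitTestSketch {

    /** NIST SP 800-22 frequency (monobit) test p-value. */
    static double monobitPValue(int[] bits) {
        long sum = 0;
        for (int b : bits) sum += (b == 1) ? 1 : -1;       // map {0,1} -> {-1,+1}
        double sObs = Math.abs((double) sum) / Math.sqrt(bits.length);
        return erfc(sObs / Math.sqrt(2));
    }

    /** Complementary error function, Abramowitz-Stegun 7.1.26 approximation. */
    static double erfc(double x) {
        double t = 1 / (1 + 0.3275911 * x);
        double poly = t * (0.254829592 + t * (-0.284496736
                + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
        return poly * Math.exp(-x * x);
    }

    public static void main(String[] args) {
        int[] bits = new int[1_000_000];
        Random rng = new Random(42);                       // stand-in for the chaotic key source
        for (int i = 0; i < bits.length; i++) bits[i] = rng.nextBoolean() ? 1 : 0;
        System.out.printf("p = %.4f%n", monobitPValue(bits)); // a pass requires p >= 0.01
    }
}
```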

Proceedings ArticleDOI
17 Apr 2017
TL;DR: It is argued that future performance testing frameworks should provide better support for low-friction testing, for instance via non-parameterized methods or performance test generation, as well as focus on a tight integration with standard continuous integration tooling.
Abstract: The usage of open source (OS) software is widespread across many industries. While the functional quality of OS projects is considered to be similar to closed-source software, much is unknown about the quality in terms of performance. One challenge for OS developers is that, unlike for functional testing, there is a lack of accepted best practices for performance testing. To reveal the state of practice of performance testing in OS projects, we conduct an exploratory study on 111 Java-based OS projects from GitHub. We study the performance tests of these projects from five perspectives: (1) developers, (2) size, (3) test organization, (4) types of performance tests and (5) used tooling. We show that writing performance tests is not a popular task in OS projects: performance tests form only a small portion of the test suite, are rarely updated, and are usually maintained by a small group of core project developers. Further, even though many projects are aware that they need performance tests, developers appear to struggle implementing them. We argue that future performance testing frameworks should provide better support for low-friction testing, for instance via non-parameterized methods or performance test generation, as well as focus on a tight integration with standard continuous integration tooling.
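For readers unfamiliar with what a dedicated performance test looks like, here is a minimal JMH microbenchmark; JMH is a common choice of benchmarking tool in the Java ecosystem, and the benchmarked method is an arbitrary placeholder rather than anything from the studied projects. It runs under the JMH harness rather than as a plain unit test, and the @Param field shows the kind of parameterized setup whose friction the authors discuss.

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class StringConcatBenchmark {

    @Param({"10", "1000"})   // parameterization: one source of the setup friction discussed above
    int size;

    @Benchmark
    public String concatWithBuilder() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < size; i++) sb.append('x');
        return sb.toString();
    }
}
```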

Proceedings ArticleDOI
10 Jul 2017
TL;DR: This paper presents an automated approach that generates descriptive names for automatically generated unit tests by summarizing API-level coverage goals; the names are optimized to be short, descriptive of the test, clearly related to the covered code under test, and able to let developers uniquely distinguish tests in a test suite.
Abstract: The name of a unit test helps developers to understand the purpose and scenario of the test, and test names support developers when navigating amongst sets of unit tests. When unit tests are generated automatically, however, they tend to be given non-descriptive names such as “test0”, which provide none of the benefits a descriptive name can give a test. The underlying challenge is that automatically generated tests typically do not represent real scenarios and have no clear purpose other than covering code, which makes naming them difficult. In this paper, we present an automated approach which generates descriptive names for automatically generated unit tests by summarizing API-level coverage goals. The tests are optimized to be short, descriptive of the test, have a clear relation to the covered code under test, and allow developers to uniquely distinguish tests in a test suite. An empirical evaluation with 47 participants shows that developers agree with the synthesized names, and the synthesized names are equally descriptive as manually written names. Study participants were even more accurate and faster at matching code and tests with synthesized names compared to manually derived names.
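A plausible reduction of the idea to code: build a name from the method under test, the observed outcome, and the distinguishing input state of the covered API-level goal. The template below is illustrative, not the paper's exact synthesis grammar.

```java
public class TestNameSynthesisSketch {

    record CoverageGoal(String methodUnderTest, String outcome, String inputState) {}

    /** Compose a readable test name from the parts of a covered goal. */
    static String synthesizeName(CoverageGoal goal) {
        return "test" + capitalize(goal.methodUnderTest())
                + capitalize(goal.outcome())
                + (goal.inputState().isEmpty() ? "" : "With" + capitalize(goal.inputState()));
    }

    static String capitalize(String s) {
        return s.isEmpty() ? s : Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    public static void main(String[] args) {
        CoverageGoal g = new CoverageGoal("pop", "throwsException", "emptyStack");
        System.out.println(synthesizeName(g));  // testPopThrowsExceptionWithEmptyStack
    }
}
```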

Proceedings ArticleDOI
01 Feb 2017
TL;DR: This paper identifies three test objectives that aim to increase test suite diversity and uses a search-based algorithm to generate diversified but small test suites, and develops a prediction model to stop test generation when adding test cases is unlikely to improve fault localization.
Abstract: One promising way to improve the accuracy of fault localization based on statistical debugging is to increase diversity among test cases in the underlying test suite. In many practical situations, adding test cases is not a cost-free option because test oracles are developed manually or running test cases is expensive. Hence, to improve debugging, we need test suites that are both diverse and small. In this paper, we focus on improving fault localization of Simulink models by generating test cases. We identify three test objectives that aim to increase test suite diversity. We use these objectives in a search-based algorithm to generate diversified but small test suites. To further minimize test suite sizes, we develop a prediction model to stop test generation when adding test cases is unlikely to improve fault localization. We evaluate our approach using three industrial subjects. Our results show (1) the three selected test objectives are able to significantly improve the accuracy of fault localization for small test suite sizes, and (2) our prediction model is able to maintain almost the same fault localization accuracy while reducing the average number of newly generated test cases by more than half.

Proceedings ArticleDOI
21 Aug 2017
TL;DR: This paper presents PATDroid, an approach for efficiently testing an Android app while taking the impact of permissions on its behavior into account; it significantly reduces the testing effort, yet achieves code coverage and fault detection capability comparable to exhaustively testing the app under all permission combinations.
Abstract: The recent introduction of a dynamic permission system in Android, allowing users to grant and revoke permissions after the installation of an app, has made it harder to properly test apps. Since an app's behavior may change depending on the granted permissions, it needs to be tested under a wide range of permission combinations. Currently, in the absence of automated tool support, a developer needs to either manually determine the interaction of tests and app permissions, or exhaustively re-execute tests for all possible permission combinations, thereby increasing the time and resources required to test apps. This paper presents an automated approach, called PATDroid, for efficiently testing an Android app while taking the impact of permissions on its behavior into account. PATDroid performs a hybrid program analysis on both an app under test and its test suite to determine which tests should be executed on what permission combinations. Our experimental results show that PATDroid significantly reduces the testing effort, yet achieves comparable code coverage and fault detection capability as exhaustively testing an app under all permission combinations.
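The saving is combinatorial: a test that touches only k of an app's n permissions needs 2^k rather than 2^n grant/revoke combinations. A minimal sketch, assuming the hybrid analysis has already mapped the test to its relevant permissions:

```java
import java.util.*;

public class PermissionCombosSketch {

    /** All grant/revoke combinations over the given permissions. */
    static List<Set<String>> combos(List<String> perms) {
        List<Set<String>> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << perms.size()); mask++) {
            Set<String> granted = new HashSet<>();
            for (int i = 0; i < perms.size(); i++) {
                if ((mask & (1 << i)) != 0) granted.add(perms.get(i));
            }
            result.add(granted);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> appPerms = List.of("CAMERA", "LOCATION", "CONTACTS", "STORAGE");
        List<String> testPerms = List.of("CAMERA");           // from the analysis step
        System.out.println(combos(appPerms).size());          // exhaustive: 16 runs
        System.out.println(combos(testPerms).size());         // relevant-only: 2 runs
    }
}
```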

Journal ArticleDOI
TL;DR: This paper presents a comprehensive study investigating the impact of cloning failed test cases on the effectiveness of spectrum-based fault localization (SBFL) techniques, and shows that for 22 popular SBFL techniques the fault-localization accuracy can be significantly improved when the failed test cases are cloned in the single-fault, double-fault, and triple-fault scenarios.

Journal ArticleDOI
TL;DR: This work proposes specific approaches to white-box test prioritization, selection, and minimization that take the reuse context into account when reordering or selecting test cases, by leveraging possible constraints delimiting the scope of the new input domain.

Journal ArticleDOI
TL;DR: An integration of metamorphic testing (MT) with automated program repair (APR) is presented that enables the application of APR without the need for a test oracle, and thus successfully extends APR techniques to a broader application domain.

Journal ArticleDOI
TL;DR: By utilizing the diverse timing characteristics of different initial states, a staged-running Self-timed Ring (STR) architecture, which is able to suppress the degree of bias, is proposed; it passes the National Institute of Standards and Technology test suite with high p-values.
Abstract: The bias phenomenon has been a ubiquitous problem in the design of digital True Random Number Generators (TRNGs). Circuit performance can be improved with auxiliary modules such as analog circuits and post-processing components, but these usually involve compromises in cost, compatibility, throughput, and security, so that in some cases only sub-optimal designs can be achieved. In this paper, by utilizing the diverse timing characteristics of different initial states, a staged-running Self-timed Ring (STR) architecture, which is able to suppress the degree of bias, is proposed. The proposed architecture is compared with some conventional free-running architectures using a Xilinx Zynq-7000 Field Programmable Gate Array (FPGA) platform at a throughput of 100 Mbps. With increasing ring size, the bias degree of the newly proposed structure stays within a negligible level of less than 1%, whereas those of the conventional architectures can exceed 10%. Statistical tests were also conducted, and the results show that the quality of randomness rises as the complexity of the initial-state mapping and the number of ring nodes of the proposed structure increase. The generated output passes the National Institute of Standards and Technology (NIST) test suite with high p-values.

Book ChapterDOI
09 Sep 2017
TL;DR: This study empirically evaluates six different algorithms and shows that the use of a test archive makes evolutionary algorithms clearly better than random testing, and it confirms that the many-objective search is the most effective.
Abstract: Evolutionary algorithms have been shown to be effective at generating unit test suites optimised for code coverage. While many aspects of these algorithms have been evaluated in detail (e.g., test length and different kinds of techniques aimed at improving performance, like seeding), the influence of the specific algorithms has to date seen less attention in the literature. As it is theoretically impossible to design an algorithm that is best on all possible problems, a common approach in software engineering problems is to first try a Genetic Algorithm, and only afterwards try to refine it or compare it with other algorithms to see if any of them is more suited for the addressed problem. This is particularly important in test generation, since recent work suggests that random search may in practice be equally effective, whereas the reformulation as a many-objective problem seems to be more effective. To shed light on the influence of the search algorithms, we empirically evaluate six different algorithms on a selection of non-trivial open source classes. Our study shows that the use of a test archive makes evolutionary algorithms clearly better than random testing, and it confirms that the many-objective search is the most effective.

Proceedings ArticleDOI
15 May 2017
TL;DR: This paper finds that assertions (i.e., a typical type of test oracle) are significantly correlated with coverage-based test-suite reduction, and proposes an assertion-aware test-suite reduction technique which outperforms traditional test-suite reduction in terms of cost-effectiveness.
Abstract: Code coverage is the dominant criterion in test-suite reduction. Typically, most test-suite reduction techniques repeatedly remove tests covering code that has been covered by other tests from the test suite. However, test-suite reduction based on code coverage alone may incur fault-detection capability loss, because a test detects faults if and only if its execution covers buggy code and its test oracle catches the buggy state. In other words, test oracles may also affect test-suite reduction. However, to our knowledge, their impacts have never been studied before. In this paper, we conduct the first empirical study on such impacts by using 10 real-world GitHub Java projects, and find that assertions (i.e., a typical type of test oracle) are significantly correlated with coverage-based test-suite reduction. Based on our preliminary study results, we also propose an assertion-aware test-suite reduction technique which outperforms traditional test-suite reduction in terms of cost-effectiveness.
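A minimal sketch of what "assertion-aware" could mean operationally: a greedy reduction that breaks coverage ties in favour of tests with more assertions, since a test reveals a fault only if an oracle checks the corrupted state. The paper's actual weighting may differ; names below are illustrative.

```java
import java.util.*;

public class AssertionAwareReductionSketch {

    record Test(String name, Set<String> covered, int assertions) {}

    /** Greedy reduction: maximize new coverage, break ties by assertion count. */
    static List<Test> reduce(List<Test> suite) {
        Set<String> uncovered = new HashSet<>();
        suite.forEach(t -> uncovered.addAll(t.covered()));
        List<Test> kept = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            Test best = suite.stream()
                    .max(Comparator
                            .<Test>comparingLong(t -> t.covered().stream()
                                    .filter(uncovered::contains).count())
                            .thenComparingInt(Test::assertions))
                    .orElseThrow();
            if (best.covered().stream().noneMatch(uncovered::contains)) break;
            kept.add(best);
            uncovered.removeAll(best.covered());
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Test> suite = List.of(
                new Test("t1", Set.of("l1", "l2"), 1),
                new Test("t2", Set.of("l1", "l2"), 4),   // same coverage, stronger oracle
                new Test("t3", Set.of("l3"), 2));
        reduce(suite).forEach(t -> System.out.println(t.name()));  // t2, t3
    }
}
```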