scispace - formally typeset
Search or ask a question

Showing papers on "Test suite published in 2018"


Proceedings ArticleDOI
03 Sep 2018
TL;DR: The experimental results demonstrate that DeepRoad can detect thousands of inconsistent behaviors for DNN-based autonomous driving systems, and effectively validate input images to potentially enhance the system robustness as well.
Abstract: While Deep Neural Networks (DNNs) have established the fundamentals of image-based autonomous driving systems, they may exhibit erroneous behaviors and cause fatal accidents. To address the safety issues in autonomous driving systems, a recent set of testing techniques have been designed to automatically generate artificial driving scenes to enrich test suite, e.g., generating new input images transformed from the original ones. However, these techniques are insufficient due to two limitations: first, many such synthetic images often lack diversity of driving scenes, and hence compromise the resulting efficacy and reliability. Second, for machine-learning-based systems, a mismatch between training and application domain can dramatically degrade system accuracy, such that it is necessary to validate inputs for improving system robustness. In this paper, we propose DeepRoad, an unsupervised DNN-based framework for automatically testing the consistency of DNN-based autonomous driving systems and online validation. First, DeepRoad automatically synthesizes large amounts of diverse driving scenes without using image transformation rules (e.g. scale, shear and rotation). In particular, DeepRoad is able to produce driving scenes with various weather conditions (including those with rather extreme conditions) by applying Generative Adversarial Networks (GANs) along with the corresponding real-world weather scenes. Second, DeepRoad utilizes metamorphic testing techniques to check the consistency of such systems using synthetic images. Third, DeepRoad validates input images for DNN-based systems by measuring the distance of the input and training images using their VGGNet features. We implement DeepRoad to test three well-recognized DNN-based autonomous driving systems in Udacity self-driving car challenge. The experimental results demonstrate that DeepRoad can detect thousands of inconsistent behaviors for these systems, and effectively validate input images to potentially enhance the system robustness as well.

445 citations


Proceedings ArticleDOI
16 Nov 2018
TL;DR: This paper proposes a mutation testing framework specialized for DL systems to measure the quality of test data, and designs a set of model-level mutation operators that directly inject faults into DL models without a training process.
Abstract: Deep learning (DL) defines a new data-driven programming paradigm where the internal system logic is largely shaped by the training data. The standard way of evaluating DL models is to examine their performance on a test dataset. The quality of the test dataset is of great importance to gain confidence of the trained models. Using an inadequate test dataset, DL models that have achieved high test accuracy may still lack generality and robustness. In traditional software testing, mutation testing is a well-established technique for quality evaluation of test suites, which analyzes to what extent a test suite detects the injected faults. However, due to the fundamental difference between traditional software and deep learning-based software, traditional mutation testing techniques cannot be directly applied to DL systems. In this paper, we propose a mutation testing framework specialized for DL systems to measure the quality of test data. To do this, by sharing the same spirit of mutation testing in traditional software, we first define a set of source-level mutation operators to inject faults to the source of DL (i.e., training data and training programs). Then we design a set of model-level mutation operators that directly inject faults into DL models without a training process. Eventually, the quality of test data could be evaluated from the analysis on to what extent the injected faults could be detected. The usefulness of the proposed mutation testing techniques is demonstrated on two public datasets, namely MNIST and CIFAR-10, with three DL models.

252 citations


Journal ArticleDOI
TL;DR: The performance statistics demonstrate that the lion algorithm is equivalent to certain optimization algorithms, while outperforming majority of the optimization algorithms and the trade-off maintainability of the lion algorithms over the traditional algorithms.
Abstract: Nature-inspired optimization algorithms, especially evolutionary computation-based and swarm intelligence-based algorithms are being used to solve a variety of optimization problems. Motivated by the obligation of having optimization algorithms, a novel optimization algorithm based on a lion’s unique social behavior had been presented in our previous work. Territorial defense and territorial takeover were the two most popular lion’s social behaviors. This paper takes the algorithm forward on rigorous and diverse performance tests to demonstrate the versatility of the algorithm. Four different test suites are presented in this paper. The first two test suites are benchmark optimization problems. The first suite had comparison with published results of evolutionary and few renowned optimization algorithms, while the second suite leads to a comparative study with state-of-the-art optimization algorithms. The test suite 3 takes the large-scale optimization problems, whereas test suite 4 considers benchmark engineering problems. The performance statistics demonstrate that the lion algorithm is equivalent to certain optimization algorithms, while outperforming majority of the optimization algorithms. The results also demonstrate the trade-off maintainability of the lion algorithm over the traditional algorithms.

138 citations


Posted Content
TL;DR: Zhang et al. as mentioned in this paper proposed a mutation testing framework for DL systems to measure the quality of test data, by defining a set of source-level mutation operators to inject faults to the source of DL (i.e., training data and training programs).
Abstract: Deep learning (DL) defines a new data-driven programming paradigm where the internal system logic is largely shaped by the training data. The standard way of evaluating DL models is to examine their performance on a test dataset. The quality of the test dataset is of great importance to gain confidence of the trained models. Using an inadequate test dataset, DL models that have achieved high test accuracy may still lack generality and robustness. In traditional software testing, mutation testing is a well-established technique for quality evaluation of test suites, which analyzes to what extent a test suite detects the injected faults. However, due to the fundamental difference between traditional software and deep learning-based software, traditional mutation testing techniques cannot be directly applied to DL systems. In this paper, we propose a mutation testing framework specialized for DL systems to measure the quality of test data. To do this, by sharing the same spirit of mutation testing in traditional software, we first define a set of source-level mutation operators to inject faults to the source of DL (i.e., training data and training programs). Then we design a set of model-level mutation operators that directly inject faults into DL models without a training process. Eventually, the quality of test data could be evaluated from the analysis on to what extent the injected faults could be detected. The usefulness of the proposed mutation testing techniques is demonstrated on two public datasets, namely MNIST and CIFAR-10, with three DL models.

132 citations


Journal ArticleDOI
21 Feb 2018
TL;DR: In this article, the authors investigate the application of temporal planners to the problem of compiling quantum circuits to newly emerging quantum hardware, focusing on compiling to superconducting hardware architectures with nearest neighbor constraints.
Abstract: To run quantum algorithms on emerging gate-model quantum hardware, quantum circuits must be compiled to take into account constraints on the hardware. For near-term hardware, with only limited means to mitigate decoherence, it is critical to minimize the duration of the circuit. We investigate the application of temporal planners to the problem of compiling quantum circuits to newly emerging quantum hardware. While our approach is general, we focus on compiling to superconducting hardware architectures with nearest neighbor constraints. Our initial experiments focus on compiling Quantum Alternating Operator Ansatz (QAOA) circuits whose high number of commuting gates allow great flexibility in the order in which the gates can be applied. That freedom makes it more challenging to find optimal compilations but also means there is a greater potential win from more optimized compilation than for less flexible circuits. We map this quantum circuit compilation problem to a temporal planning problem, and generated a test suite of compilation problems for QAOA circuits of various sizes to a realistic hardware architecture. We report compilation results from several state-of-the-art temporal planners on this test set. This early empirical evaluation demonstrates that temporal planning is a viable approach to quantum circuit compilation.

127 citations


Proceedings ArticleDOI
27 May 2018
TL;DR: It is found that achieving higher mutation scores improves significantly the fault detection, and both independent variables significantly influence the dependent one, with significantly better fits, but overall with relative low prediction power.
Abstract: Empirical validation of software testing studies is increasingly relying on mutants. This practice is motivated by the strong correlation between mutant scores and real fault detection that is reported in the literature. In contrast, our study shows that correlations are the results of the confounding effects of the test suite size. In particular, we investigate the relation between two independent variables, mutation score and test suite size, with one dependent variable the detection of (real) faults. We use two data sets, CoreBench and De-fects4J, with large C and Java programs and real faults and provide evidence that all correlations between mutation scores and real fault detection are weak when controlling for test suite size. We also found that both independent variables significantly influence the dependent one, with significantly better fits, but overall with relative low prediction power. By measuring the fault detection capability of the top ranked, according to mutation score, test suites (opposed to randomly selected test suites of the same size), we found that achieving higher mutation scores improves significantly the fault detection. Taken together, our data suggest that mutants provide good guidance for improving the fault detection of test suites, but their correlation with fault detection are weak.

96 citations


Proceedings ArticleDOI
01 Oct 2018
TL;DR: This paper presents a test suite of contrastive translations focused specifically on the translation of pronouns and shows that, while gains in BLEU are moderate for those systems, they outperform baselines by a large margin in terms of accuracy on the contrastive test set.
Abstract: The translation of pronouns presents a special challenge to machine translation to this day, since it often requires context outside the current sentence. Recent work on models that have access to information across sentence boundaries has seen only moderate improvements in terms of automatic evaluation metrics such as BLEU. However, metrics that quantify the overall translation quality are ill-equipped to measure gains from additional context. We argue that a different kind of evaluation is needed to assess how well models translate inter-sentential phenomena such as pronouns. This paper therefore presents a test suite of contrastive translations focused specifically on the translation of pronouns. Furthermore, we perform experiments with several context-aware models. We show that, while gains in BLEU are moderate for those systems, they outperform baselines by a large margin in terms of accuracy on our contrastive test set. Our experiments also show the effectiveness of parameter tying for multi-encoder architectures.

87 citations


Proceedings ArticleDOI
26 Oct 2018
TL;DR: ReproDroid as mentioned in this paper ) is a framework allowing the accurate comparison of Android taint analysis tools, which supports researchers in inferring the ground truth for data leaks in apps, automatically applying tools to benchmarks, and in evaluating the obtained results.
Abstract: In recent years, researchers have developed a number of tools to conduct taint analysis of Android applications. While all the respective papers aim at providing a thorough empirical evaluation, comparability is hindered by varying or unclear evaluation targets. Sometimes, the apps used for evaluation are not precisely described. In other cases, authors use an established benchmark but cover it only partially. In yet other cases, the evaluations differ in terms of the data leaks searched for, or lack a ground truth to compare against. All those limitations make it impossible to truly compare the tools based on those published evaluations. We thus present ReproDroid, a framework allowing the accurate comparison of Android taint analysis tools. ReproDroid supports researchers in inferring the ground truth for data leaks in apps, in automatically applying tools to benchmarks, and in evaluating the obtained results. We use ReproDroid to comparatively evaluate on equal grounds the six prominent taint analysis tools Amandroid, DIALDroid, DidFail, DroidSafe, FlowDroid and IccTA. The results are largely positive although four tools violate some promises concerning features and accuracy. Finally, we contribute to the area of unbiased benchmarking with a new and improved version of the open test suite DroidBench.

84 citations


Proceedings ArticleDOI
26 Oct 2018
TL;DR: Themis is presented, an automated test suite generator to measure two types of discrimination, including causal relationships between sensitive inputs and program behavior, and its effectiveness on open-source software.
Abstract: Bias in decisions made by modern software is becoming a common and serious problem. We present Themis, an automated test suite generator to measure two types of discrimination, including causal relationships between sensitive inputs and program behavior. We explain how Themis can measure discrimination and aid its debugging, describe a set of optimizations Themis uses to reduce test suite size, and demonstrate Themis' effectiveness on open-source software. Themis is open-source and all our evaluation data are available at http://fairness.cs.umass.edu/. See a video of Themis in action: https://youtu.be/brB8wkaUesY

80 citations


Proceedings ArticleDOI
27 May 2018
TL;DR: The FAST family of test case prioritization techniques are introduced that radically changes this landscape by borrowing algorithms commonly exploited in the big data domain to find similar items and outperform white-box approaches, including greedy ones, if preparation time is not counted.
Abstract: Many test case prioritization criteria have been proposed for speeding up fault detection. Among them, similarity-based approaches give priority to the test cases that are the most dissimilar from those already selected. However, the proposed criteria do not scale up to handle the many thousands or even some millions test suite sizes of modern industrial systems and simple heuristics are used instead. We introduce the FAST family of test case prioritization techniques that radically changes this landscape by borrowing algorithms commonly exploited in the big data domain to find similar items. FAST techniques provide scalable similarity-based test case prioritization in both white-box and black-box fashion. The results from experimentation on real world C and Java subjects show that the fastest members of the family outperform other black-box approaches in efficiency with no significant impact on effectiveness, and also outperform white-box approaches, including greedy ones, if preparation time is not counted. A simulation study of scalability shows that one FAST technique can prioritize a million test cases in less than 20 minutes.

66 citations


Journal ArticleDOI
17 May 2018-PLOS ONE
TL;DR: Experimental results reveal that the QLSCA is statistically superior with regard to test suite size reduction compared to recent state-of-the-art strategies, including the original SCA, the particle swarm test generator (PSTG), adaptive particle swarm optimization (APSO) and the cuckoo search strategy (CS) at the 95% confidence level.
Abstract: The sine-cosine algorithm (SCA) is a new population-based meta-heuristic algorithm. In addition to exploiting sine and cosine functions to perform local and global searches (hence the name sine-cosine), the SCA introduces several random and adaptive parameters to facilitate the search process. Although it shows promising results, the search process of the SCA is vulnerable to local minima/maxima due to the adoption of a fixed switch probability and the bounded magnitude of the sine and cosine functions (from -1 to 1). In this paper, we propose a new hybrid Q-learning sine-cosine- based strategy, called the Q-learning sine-cosine algorithm (QLSCA). Within the QLSCA, we eliminate the switching probability. Instead, we rely on the Q-learning algorithm (based on the penalty and reward mechanism) to dynamically identify the best operation during runtime. Additionally, we integrate two new operations (Levy flight motion and crossover) into the QLSCA to facilitate jumping out of local minima/maxima and enhance the solution diversity. To assess its performance, we adopt the QLSCA for the combinatorial test suite minimization problem. Experimental results reveal that the QLSCA is statistically superior with regard to test suite size reduction compared to recent state-of-the-art strategies, including the original SCA, the particle swarm test generator (PSTG), adaptive particle swarm optimization (APSO) and the cuckoo search strategy (CS) at the 95% confidence level. However, concerning the comparison with discrete particle swarm optimization (DPSO), there is no significant difference in performance at the 95% confidence level. On a positive note, the QLSCA statistically outperforms the DPSO in certain configurations at the 90% confidence level.

Posted ContentDOI
21 Jun 2018-bioRxiv
TL;DR: For example, Memote as mentioned in this paper is an open-source software containing a community-maintained, standardized set of metabolic model tests, which can be extended to include experimental datasets for automatic model validation.
Abstract: Several studies have shown that neither the formal representation nor the functional requirements of genome-scale metabolic models (GEMs) are precisely defined. Without a consistent standard, comparability, reproducibility, and interoperability of models across groups and software tools cannot be guaranteed. Here, we present memote (https://github.com/opencobra/memote) an open-source software containing a community-maintained, standardized set of metabolic model tests. The tests cover a range of aspects from annotations to conceptual integrity and can be extended to include experimental datasets for automatic model validation. In addition to testing a model once, memote can be configured to do so automatically, i.e., while building a GEM. A comprehensive report displays the model9s performance parameters, which supports informed model development and facilitates error detection. Memote provides a measure for model quality that is consistent across reconstruction platforms and analysis software and simplifies collaboration within the community by establishing workflows for publicly hosted and version controlled models.

Proceedings ArticleDOI
27 May 2018
TL;DR: This work conducts the largest study of automatically fixing undiscovered bugs in real-world code to date, and presents a new automated program repair technique using Separation Logic that finds and then accurately fixes real bugs without test cases.
Abstract: Static analysis tools have demonstrated effectiveness at finding bugs in real world code. Such tools are increasingly widely adopted to improve software quality in practice. Automated Program Repair (APR) has the potential to further cut down on the cost of improving software quality. However, there is a disconnect between these effective bug-finding tools and APR. Recent advances in APR rely on test cases, making them inapplicable to newly discovered bugs or bugs difficult to test for deterministically (like memory leaks). Additionally, the quality of patches generated to satisfy a test suite is a key challenge. We address these challenges by adapting advances in practical static analysis and verification techniques to enable a new technique that finds and then accurately fixes real bugs without test cases. We present a new automated program repair technique using Separation Logic. At a high-level, our technique reasons over semantic effects of existing program fragments to fix faults related to general pointer safety properties: resource leaks, memory leaks, and null dereferences. The procedure automatically translates identified fragments into source-level patches, and verifies patch correctness with respect to reported faults. In this work we conduct the largest study of automatically fixing undiscovered bugs in real-world code to date. We demonstrate our approach by correctly fixing 55 bugs, including 11 previously undiscovered bugs, in 11 real-world projects.

Journal ArticleDOI
TL;DR: A novel Hybrid Bat Algorithm (HBA) is proposed to improve the performance of BA and three modification methods are incorporated into the standard BA to enhance the local search capability and the ability to escape from local optimum traps.

Proceedings ArticleDOI
27 May 2018
TL;DR: This work uses a lightweight approach based on test suite failure and execution history that is highly effcient and "continuously" prioritizes commits that are waiting for execution in response to the arrival of each new commit and the completion of each previously scheduled commit.
Abstract: Continuous integration (CI) development environments allow soft-ware engineers to frequently integrate and test their code. While CI environments provide advantages, they also utilize non-trivial amounts of time and resources. To address this issue, researchers have adapted techniques for test case prioritization (TCP) to CI environments. To date, however, the techniques considered have operated on test suites, and have not achieved substantial improvements. Moreover, they can be inappropriate to apply when system build costs are high. In this work we explore an alternative: prioritization of commits. We use a lightweight approach based on test suite failure and execution history that is highly effcient; our approach "continuously" prioritizes commits that are waiting for execution in response to the arrival of each new commit and the completion of each previously scheduled commit. We have evaluated our approach on three non-trivial CI data sets. Our results show that our approach can be more effective than prior techniques.

Journal ArticleDOI
TL;DR: Assessment of the quality of ground‐motion predictions at end‐user sites is described by comparing predicted shaking intensities to ShakeMaps for historic events and a threshold‐based approach is implemented that assesses how often end users initiate the appropriate action, based on their ground‐shaking threshold.
Abstract: Earthquake early warning systems provide warnings to end users of incoming moderate to strong ground shaking from earthquakes. An earthquake early warning system, ShakeAlert, is providing alerts to beta end users in the western United States, specifically California, Oregon, and Washington. An essential aspect of the earthquake early warning system is the development of a framework to test modifications to code to ensure functionality and assess performance. In 2016, a Testing and Certification Platform (TCP) was included in the development of the Production Prototype version of ShakeAlert. The purpose of the TCP is to evaluate the robustness of candidate code that is proposed for deployment on ShakeAlert Production Prototype servers. TCP consists of two main components: a real‐time in situ test that replicates the real‐time production system and an offline playback system to replay test suites. The real‐time tests of system performance assess code optimization and stability. The offline tests comprise a stress test of candidate code to assess if the code is production ready. The test suite includes over 120 events including local, regional, and teleseismic historic earthquakes, recentering and calibration events, and other anomalous and potentially problematic signals. Two assessments of alert performance are conducted. First, point‐source assessments are undertaken to compare magnitude, epicentral location, and origin time with the Advanced National Seismic System Comprehensive Catalog, as well as to evaluate alert latency. Second, we describe assessment of the quality of ground‐motion predictions at end‐user sites by comparing predicted shaking intensities to ShakeMaps for historic events and implement a threshold‐based approach that assesses how often end users initiate the appropriate action, based on their ground‐shaking threshold. TCP has been developed to be a convenient streamlined procedure for objectively testing algorithms, and it has been designed with flexibility to accommodate significant changes in development of new or modified system code. It is expected that the TCP will continue to evolve along with the ShakeAlert system, and the framework we describe here provides one example of how earthquake early warning systems can be evaluated.

Journal ArticleDOI
02 May 2018-PLOS ONE
TL;DR: Four hybrid variants of the flower pollination algorithm (FPA) are proposed and demonstrate that FPA hybrids overcome the problems of slow convergence in the original FPA and offers statistically superior performance compared with existing t-way strategies in terms of test suite size.
Abstract: The application of meta-heuristic algorithms for t-way testing has recently become prevalent. Consequently, many useful meta-heuristic algorithms have been developed on the basis of the implementation of t-way strategies (where t indicates the interaction strength). Mixed results have been reported in the literature to highlight the fact that no single strategy appears to be superior compared with other configurations. The hybridization of two or more algorithms can enhance the overall search capabilities, that is, by compensating the limitation of one algorithm with the strength of others. Thus, hybrid variants of the flower pollination algorithm (FPA) are proposed in the current work. Four hybrid variants of FPA are considered by combining FPA with other algorithmic components. The experimental results demonstrate that FPA hybrids overcome the problems of slow convergence in the original FPA and offers statistically superior performance compared with existing t-way strategies in terms of test suite size.

Posted Content
TL;DR: This work proposes to assess reliability of author and automated annotations on patch correctness assessment by constructing a gold set of correctness labels for 189 randomly selected patches generated by 8 state-of-the-art ASR techniques through a user study involving 35 professional developers as independent annotators.
Abstract: Current state-of-the-art automatic software repair (ASR) techniques rely heavily on incomplete specifications, e.g., test suites, to generate repairs. This, however, may render ASR tools to generate incorrect repairs that do not generalize. To assess patch correctness, researchers have been following two typical ways separately: (1) Automated annotation, wherein patches are automatically labeled by an independent test suite (ITS) - a patch passing the ITS is regarded as correct or generalizable, and incorrect otherwise, (2) Author annotation, wherein authors of ASR techniques annotate correctness labels of patches generated by their and competing tools by themselves. While automated annotation fails to prove that a patch is actually correct, author annotation is prone to subjectivity. This concern has caused an on-going debate on appropriate ways to assess the effectiveness of numerous ASR techniques proposed recently. To address this concern, we propose to assess reliability of author and automated annotations on patch correctness assessment. We do this by first constructing a gold set of correctness labels for 189 randomly selected patches generated by 8 state-of-the-art ASR techniques through a user study involving 35 professional developers as independent annotators. By measuring inter-rater agreement as a proxy for annotation quality - as commonly done in the literature - we demonstrate that our constructed gold set is on par with other high-quality gold sets. We then compare labels generated by author and automated annotations with this gold set to assess reliability of the patch assessment methodologies. We subsequently report several findings and highlight implications for future studies.

Journal ArticleDOI
TL;DR: The novel presented MIO algorithm is a step forward in the automation of test suite generation for system testing, and there are still properties of system testing that can be exploited to achieve even better results.
Abstract: Context: Automatically generating test suites is intrinsically a multi-objective problem, as any of the testing targets (e.g., statements to execute or mutants to kill) is an objective on its own. Test suite generation has peculiarities that are quite different from other more regular optimization problems. For example, given an existing test suite, one can add more tests to cover the remaining objectives. One would like the smallest number of small tests to cover as many objectives as possible, but that is a secondary goal compared to covering those targets in the first place. Furthermore, the amount of objectives in software testing can quickly become unmanageable, in the order of (tens/hundreds of) thousands, especially for system testing of industrial size systems. Objective: To overcome these issues, different techniques have been proposed, like for example the Whole Test Suite (WTS) approach and the Many-Objective Sorting Algorithm (MOSA). However, those techniques might not scale well to very large numbers of objectives and limited search budgets (a typical case in system testing). In this paper, we propose a novel algorithm, called Many Independent Objective (MIO) algorithm. This algorithm is designed and tailored based on the specific properties of test suite generation. Method: An empirical study was carried out for test suite generation on a series of artificial examples and seven RESTful API web services. The EvoMaster system test generation tool was used, where MIO, MOSA, WTS and random search were compared. Results: The presented MIO algorithm resulted having the best overall performance, but was not the best on all problems. Conclusion: The novel presented MIO algorithm is a step forward in the automation of test suite generation for system testing. However, there are still properties of system testing that can be exploited to achieve even better results.

Proceedings Article
15 Aug 2018
TL;DR: It is demonstrated that FP-SCANNER can also reveal the original value of altered fingerprint attributes, such as the browser or the operating system, and it is believed that this result can be exploited by fingerprinters to more accurately target browsers with countermeasures.
Abstract: By exploiting the diversity of device and browser configurations, browser fingerprinting established itself as a viable technique to enable stateless user tracking in production Companies and academic communities have responded with a wide range of countermeasures However , the way these countermeasures are evaluated does not properly assess their impact on user privacy, in particular regarding the quantity of information they may indirectly leak by revealing their presence In this paper, we investigate the current state of the art of browser fingerprinting countermeasures to study the inconsistencies they may introduce in altered fingerprints , and how this may impact user privacy To do so, we introduce FP-SCANNER as a new test suite that explores browser fingerprint inconsistencies to detect potential alterations, and we show that we are capable of detecting countermeasures from the inconsistencies they introduce Beyond spotting altered browser fingerprints, we demonstrate that FP-SCANNER can also reveal the original value of altered fingerprint attributes, such as the browser or the operating system We believe that this result can be exploited by fingerprinters to more accurately target browsers with countermeasures

Journal ArticleDOI
TL;DR: In this article, the authors propose and evaluate an approach called UnsatGuided, which aims to alleviate the overfitting problem for synthesis-based repair techniques with automatic test case generation.
Abstract: Among the many different kinds of program repair techniques, one widely studied family of techniques is called test suite based repair. However, test suites are in essence input-output specifications and are thus typically inadequate for completely specifying the expected behavior of the program under repair. Consequently, the patches generated by test suite based repair techniques can just overfit to the used test suite, and fail to generalize to other tests. We deeply analyze the overfitting problem in program repair and give a classification of this problem. This classification will help the community to better understand and design techniques to defeat the overfitting problem. We further propose and evaluate an approach called UnsatGuided, which aims to alleviate the overfitting problem for synthesis-based repair techniques with automatic test case generation. The approach uses additional automatically generated tests to strengthen the repair constraint used by synthesis-based repair techniques. We analyze the effectiveness of UnsatGuided: 1) analytically with respect to alleviating two different kinds of overfitting issues; 2) empirically based on an experiment over the 224 bugs of the Defects4J repository. The main result is that automatic test generation is effective in alleviating one kind of overfitting issue--regression introduction, but due to oracle problem, has minimal positive impact on alleviating the other kind of overfitting issue--incomplete fixing.

Journal ArticleDOI
Donghwan Shin1, Shin Yoo1, Doo-Hwan Bae1
TL;DR: A novel, diversity-aware mutation adequacy criterion is proposed, which is fully satisfied when each of the considered mutants can be identified by the set of tests that kill it, thereby encouraging inclusion of more diverse range of tests.
Abstract: Diversity has been widely studied in software testing as a guidance towards effective sampling of test inputs in the vast space of possible program behaviors. However, diversity has received relatively little attention in mutation testing. The traditional mutation adequacy criterion is a one-dimensional measure of the total number of killed mutants. We propose a novel, diversity-aware mutation adequacy criterion called distinguishing mutation adequacy criterion , which is fully satisfied when each of the considered mutants can be identified by the set of tests that kill it, thereby encouraging inclusion of more diverse range of tests. This paper presents the formal definition of the distinguishing mutation adequacy and its score. Subsequently, an empirical study investigates the relationship among distinguishing mutation score, fault detection capability, and test suite size. The results show that the distinguishing mutation adequacy criterion detects 1.33 times more unseen faults than the traditional mutation adequacy criterion, at the cost of a 1.56 times increase in test suite size, for adequate test suites that fully satisfies the criteria. The results show a better picture for inadequate test suites; on average, 8.63 times more unseen faults are detected at the cost of a 3.14 times increase in test suite size.

Journal ArticleDOI
01 Dec 2018
TL;DR: This paper is based on the development of an automated testing tool which includes two major automated components of software testing, test suite generation and test suite optimization, and the proposed method is able to provide a set of minimal test cases with maximum path coverage as compared to other algorithms.
Abstract: Automated testing mitigates the risk of test maintenance failure, selects the optimized test suite, improves efficiency and hence reduces cost and time consumption. This paper is based on the development of an automated testing tool which includes two major automated components of software testing, test suite generation and test suite optimization. The control flow of the software under test has been represented by a flow graph. There are five test suite generation methods which are made available in the tool, namely boundary value testing, robustness testing, worst-case testing, robust worst-case testing and random testing. The generated test suite is further optimized to a desired fitness level using the artificial bee colony algorithm or the cuckoo search algorithm. The proposed method is able to provide a set of minimal test cases with maximum path coverage as compared to other algorithms. Finally, the generated optimal test suite is used for automated fault detection.

Journal ArticleDOI
TL;DR: This paper proposes an efficient uniform and variable t-way minimal test suite generation approach to address these problems using GA, called Genetic Strategy (GS), and experimental results show that GS can compete against the existing AI-based and computational-based strategies in terms of efficiency and performance.
Abstract: Context To improve the quality and correctness of a software product it is necessary to test different aspects of the software system. Among different approaches for software testing, combinatorial testing along with covering array is a proper testing method. The most challenging problem in combinatorial testing strategies like t-way, is the combinatorial explosion which considers all combinations of input parameters. Many evolutionary and meta-heuristic strategies have been proposed to address and mitigate this problem. Objective Genetic Algorithm (GA) is an evolutionary search-based technique that has been used in t-way interaction testing by different approaches. Although useful, all of these approaches can produce test suite with small interaction strengths (i.e. t ≤ 6). Additionally, most of them suffer from expensive computations. Even though there are other strategies which use different meta-heuristic algorithms to solve these problems, in this paper, we propose an efficient uniform and variable t-way minimal test suite generation approach to address these problems using GA, called Genetic Strategy (GS). Method By changing the bit structure and accessing test cases quickly, GS improves performance of the fitness function. These adjustments and reduction of the complexities of GA in the proposed GS decreases the test suite size and increases the speed of test suite generation up to t = 20 . Results To evaluate the efficiency and performance of the proposed GS, various experiments are performed on different set of benchmarks. Experimental results show that not only GS supports higher interaction strengths in comparison with the existing GA-based strategies, but also its supported interaction strength is higher than most of other AI-based and computational-based strategies. Conclusion Furthermore, experimental results show that GS can compete against the existing (both AI-based and computational-based) strategies in terms of efficiency and performance in most of the case studies.

Journal ArticleDOI
TL;DR: An initial set of recommendations is provided that are useful for performing well-designed experiments in the field of TSR and a number of future research directions based on current trends in this field of research are provided.
Abstract: Regression testing aims at testing a system under test (SUT) in the presence of changes. As a SUT changes, the number of test cases increases to handle the modifications, and ultimately, it becomes practically impossible to execute all of them within limited testing budget. Test suite reduction (TSR) approaches are widely used to improve the regression testing costs by selecting representative test suite without compromising effectiveness, such as fault-detection capability, within allowed time budget. The aim of this systematic review is to identify state-of-the-art TSR approaches categories, assess the quality of experiments reported on this subject, and provide a set of guidelines for conducting future experiments in this area of research. After applying a two-facet study selection procedure, we finalized 113 most relevant studies from an initial pool of 4230 papers published in the field of TSR between 1993 and 2016. The TSR approaches are broadly classified into four main categories based on the literature including greedy, clustering, search, and hybrid approaches. It is noted that majority of the experiments in TSR do not follow any specific guidelines for planning, conducting, and reporting the experiments, which may pose validity threats related to their results. Thus, we recommend conducting experiments that are better designed for the future. In this direction, an initial set of recommendations is provided that are useful for performing well-designed experiments in the field of TSR. Furthermore, we provide a number of future research directions based on current trends in this field of research.

Journal ArticleDOI
TL;DR: An approach based on test classification is proposed to enable the use of unlabelled test cases in localizing faults and shows that with the utilization of these newly labelled test cases, the effectiveness of fault localization can indeed be improved.

Journal ArticleDOI
TL;DR: A conceptual multi-objective regression-test selection framework (called MORTO) is selected and adopted and a custom-built genetic algorithm (GA) is developed based on that conceptual framework that is able to eliminate the subjectivity of regression testing and its dependency on expert opinions.
Abstract: Context Executing an entire regression test-suite after every code change is often costly in large software projects. To cope with this challenge, researchers have proposed various regression test-selection techniques. Objective This paper was motivated by a real industrial need to improve regression-testing practices in the context of a safety-critical industrial software in the defence domain in Turkey. To address our objective, we set up and conducted an “action-research” collaborative project between industry and academia. Method After a careful literature review, we selected a conceptual multi-objective regression-test selection framework (called MORTO) and adopted it to our industrial context by developing a custom-built genetic algorithm (GA) based on that conceptual framework. GA is able to provide full coverage of the affected (changed) requirements while considering multiple cost and benefit factors of regression testing. e.g., minimizing the number of test cases, and maximizing cumulative number of detected faults by each test suite. Results The empirical results of applying the approach on the Software Under Test (SUT) demonstrate that this approach yields a more efficient test suite (in terms of costs and benefits) compared to the old (manual) test-selection approach, used in the company, and another applicable approach chosen from the literature. With this new approach, regression selection process in the project under study is not ad-hoc anymore. Furthermore, we have been able to eliminate the subjectivity of regression testing and its dependency on expert opinions. Conclusion Since the proposed approach has been beneficial in saving the costs of regression testing, it is currently in active use in the company. We believe that other practitioners can apply our approach in their regression-testing contexts too, when applicable. Furthermore, this paper contributes to the body of evidence in regression testing by offering a success story of successful implementation and application of multi-objective regression testing in practice.

Journal ArticleDOI
TL;DR: This paper empirically investigates whether traditional test-suite metrics such as statement/branch coverage and mutation score are effective in controlling the reliability of generated repairs, and conducts the largest-scale experiments to date with real-world software.
Abstract: Automated program repair is increasingly gaining traction, due to its potential to reduce debugging cost greatly. The feasibility of automated program repair has been shown in a number of works, and the research focus is gradually shifting toward the quality of generated patches. One promising direction is to control the quality of generated patches by controlling the quality of test-suites used for automated program repair. In this paper, we ask the following research question: “Can traditional test-suite metrics proposed for the purpose of software testing also be used for the purpose of automated program repair?” We empirically investigate whether traditional test-suite metrics such as statement/branch coverage and mutation score are effective in controlling the reliability of generated repairs (the likelihood that repairs cause regression errors). We conduct the largest-scale experiments of this kind to date with real-world software, and for the first time perform a correlation study between various test-suite metrics and the reliability of generated repairs. Our results show that in general, with the increase of traditional test suite metrics, the reliability of repairs tend to increase. In particular, such a trend is most strongly observed in statement coverage. Our results imply that the traditional test suite metrics proposed for software testing can also be used for automated program repair to improve the reliability of repairs.

Proceedings ArticleDOI
02 Jul 2018
TL;DR: This paper proposes a cost-effective approach for test case selection that relies on black-box data related to inputs and outputs of the system and demonstrates that this approach managed to improve Random Search by 26% on average in terms of the Hypervolume quality indicator.
Abstract: In many domains, engineers build simulation models (e.g., Simulink) before developing code to simulate the behavior of complex systems (e.g., Cyber-Physical Systems). Those models are commonly heavy to simulate which makes it difficult to execute the entire test suite. Furthermore, it is often difficult to measure white-box coverage of test cases when employing such models. In addition, the historical data related to failures might not be available. This paper proposes a cost-effective approach for test case selection that relies on black-box data related to inputs and outputs of the system. The approach defines in total five effectiveness measures and one cost measure followed by deriving in total 15 objective combinations and integrating them within Non-Dominated Sorting Genetic Algorithm-II (NSGA-II). We empirically evaluated our approach with all these 15 combinations using four case studies by employing mutation testing to assess the fault revealing capability. The results demonstrated that our approach managed to improve Random Search by 26% on average in terms of the Hypervolume quality indicator.

Proceedings ArticleDOI
27 May 2018
TL;DR: This paper tackles the question of whether automated repair techniques are capable of repairing defects that developers consider important or that are hard for developers to repair manually.
Abstract: Automated program repair techniques use a buggy program and a partial specification (typically a test suite) to produce a program variant that satisfies the specification. While prior work has studied patch quality [10, 11] and maintainability [2], it has not examined whether automated repair techniques are capable of repairing defects that developers consider important or that are hard for developers to repair manually. This paper tackles those questions.