
Showing papers in "IEEE Transactions on Software Engineering in 2016"


Journal ArticleDOI
TL;DR: A comprehensive overview of a broad spectrum of fault localization techniques is provided, each of which aims to streamline the fault localization process and make it more effective by attacking the problem in a unique way.
Abstract: Software fault localization, the act of identifying the locations of faults in a program, is widely recognized to be one of the most tedious, time consuming, and expensive – yet equally critical – activities in program debugging. Due to the increasing scale and complexity of software today, manually locating faults when failures occur is rapidly becoming infeasible, and consequently, there is a strong demand for techniques that can guide software developers to the locations of faults in a program with minimal human intervention. This demand in turn has fueled the proposal and development of a broad spectrum of fault localization techniques, each of which aims to streamline the fault localization process and make it more effective by attacking the problem in a unique way. In this article, we catalog and provide a comprehensive overview of such techniques and discuss key issues and concerns that are pertinent to software fault localization as a whole.

822 citations


Journal ArticleDOI
TL;DR: This article provides a comprehensive survey on metamorphic testing, which summarises the research results and application areas, and analyses common practice in empirical studies of metamorphic testing as well as the main open challenges.
Abstract: A test oracle determines whether a test execution reveals a fault, often by comparing the observed program output to the expected output. This is not always practical, for example when a program's input-output relation is complex and difficult to capture formally. Metamorphic testing provides an alternative, where correctness is not determined by checking an individual concrete output, but by applying a transformation to a test input and observing how the program output “morphs” into a different one as a result. Since the introduction of such metamorphic relations in 1998, many contributions on metamorphic testing have been made, and the technique has seen successful applications in a variety of domains, ranging from web services to computer graphics. This article provides a comprehensive survey on metamorphic testing: It summarises the research results and application areas, and analyses common practice in empirical studies of metamorphic testing as well as the main open challenges.
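An illustrative sketch (not from the article): a minimal metamorphic test in Python, assuming a hypothetical numerical program under test and the relation sin(x) = sin(π − x). A violation reveals a fault without ever needing the expected output of any individual input.

```python
import math
import random

def program_under_test(x):
    # Stand-in for a program whose exact expected output is hard to specify.
    return math.sin(x)

def check_metamorphic_relation(trials=1000):
    """Check the relation sin(x) == sin(pi - x) on random source/follow-up pairs.

    A violation signals a fault without requiring the 'true' value of sin(x).
    """
    for _ in range(trials):
        x = random.uniform(-100.0, 100.0)
        source_output = program_under_test(x)
        follow_up_output = program_under_test(math.pi - x)
        if not math.isclose(source_output, follow_up_output, abs_tol=1e-9):
            return False, x
    return True, None

if __name__ == "__main__":
    ok, counterexample = check_metamorphic_relation()
    print("relation held" if ok else f"violated at x={counterexample}")
```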

362 citations


Journal ArticleDOI
TL;DR: The improvements of HYDRA over the other baseline approaches, both in terms of F1-score and when inspecting the top 20 percent of lines of code, are substantial, and in most cases the improvements are significant and have large effect sizes across the 29 datasets.
Abstract: Most software defect prediction approaches are trained and applied on data from the same project. However, often a new project does not have enough training data. Cross-project defect prediction, which uses data from other projects to predict defects in a particular project, provides a new perspective to defect prediction. In this work, we propose a HYbrid moDel Reconstruction Approach (HYDRA) for cross-project defect prediction, which includes two phases: a genetic algorithm (GA) phase and an ensemble learning (EL) phase. These two phases create a massive composition of classifiers. To examine the benefits of HYDRA, we perform experiments on 29 datasets from the PROMISE repository, containing a total of 11,196 instances (i.e., Java classes) labeled as defective or clean. We experiment with logistic regression as the underlying classification algorithm of HYDRA. We compare our approach with the most recently proposed cross-project defect prediction approaches: TCA+ by Nam et al., Peters filter by Peters et al., GP by Liu et al., MO by Canfora et al., and CODEP by Panichella et al. Our results show that HYDRA achieves an average F1-score of 0.544. On average, across the 29 datasets, these results correspond to an improvement in the F1-scores of 26.22, 34.99, 47.43, 28.61, and 30.14 percent over TCA+, Peters filter, GP, MO, and CODEP, respectively. In addition, HYDRA on average can discover 33 percent of all bugs if developers inspect the top 20 percent of lines of code, which improves on the best baseline approach (TCA+) by 44.41 percent. We also find that HYDRA improves the F1-score of Zero-R, which predicts all instances to be defective, by 5.42 percent, but improves on Zero-R by 58.65 percent when inspecting the top 20 percent of lines of code. In practice, Zero-R can be hard to use since it simply predicts all of the instances to be defective, and thus developers have to inspect all of the instances to find the defective ones. Moreover, we notice that the improvements of HYDRA over the other baseline approaches, in terms of F1-score and when inspecting the top 20 percent of lines of code, are substantial, and in most cases the improvements are significant and have large effect sizes across the 29 datasets.
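A much-simplified sketch of the general idea, not the authors' exact algorithm: one logistic-regression classifier per source project, combined with weights and a classification threshold tuned against a small labelled sample of the target project. Plain random search stands in for HYDRA's GA phase, the ensemble-learning (boosting) phase is omitted, and all names and defaults are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_weighted_ensemble(source_projects, target_X, target_y,
                            n_candidates=200, seed=0):
    """source_projects: list of (X, y) NumPy arrays, one pair per source project.
    target_X, target_y: small labelled sample of the target project (NumPy arrays).
    Returns the per-project models plus the best weights/threshold found."""
    rng = np.random.default_rng(seed)
    models = [LogisticRegression(max_iter=1000).fit(X, y) for X, y in source_projects]

    def f1(weights, threshold):
        # Weighted vote of the per-project defect probabilities.
        score = sum(w * m.predict_proba(target_X)[:, 1] for w, m in zip(weights, models))
        pred = (score / weights.sum() >= threshold).astype(int)
        tp = int(np.sum((pred == 1) & (target_y == 1)))
        fp = int(np.sum((pred == 1) & (target_y == 0)))
        fn = int(np.sum((pred == 0) & (target_y == 1)))
        if tp == 0:
            return 0.0
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    # Random search over weight vectors and thresholds stands in for the GA.
    candidates = [(rng.random(len(models)) + 1e-6, rng.random()) for _ in range(n_candidates)]
    best_weights, best_threshold = max(candidates, key=lambda c: f1(*c))
    return models, best_weights, best_threshold
```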

228 citations


Journal ArticleDOI
TL;DR: A source code summarization technique that writes English descriptions of Java methods by analyzing how those methods are invoked is proposed; while it does not reach the quality of human-written summaries, it improves over the state-of-the-art summarization tool in several dimensions by a statistically significant margin.
Abstract: Source code summarization is the task of creating readable summaries that describe the functionality of software. Source code summarization is a critical component of documentation generation, for example as Javadocs formed from short paragraphs attached to each method in a Java program. At present, a majority of source code summarization is manual, in that the paragraphs are written by human experts. However, new automated technologies are becoming feasible. These automated techniques have been shown to be effective in select situations, though a key weakness is that they do not explain the source code's context. That is, they can describe the behavior of a Java method, but not why the method exists or what role it plays in the software. In this paper, we propose a source code summarization technique that writes English descriptions of Java methods by analyzing how those methods are invoked. We then performed two user studies to evaluate our approach. First, we compared our generated summaries to summaries written manually by experts. Then, we compared our summaries to summaries written by a state-of-the-art automatic summarization tool. We found that while our approach does not reach the quality of human-written summaries, we do improve over the state-of-the-art summarization tool in several dimensions by a statistically-significant margin.

152 citations


Journal ArticleDOI
TL;DR: The results are packaged in the Greenfield Startup Model (GSM), which explains the priority of startups to release the product as quickly as possible, and the need to shorten time-to-market, by speeding up the development through low-precision engineering activities.
Abstract: Software startups are newly created companies with no operating history and oriented towards producing cutting-edge products. However, despite the increasing importance of startups in the economy, few scientific studies attempt to address software engineering issues, especially for early-stage startups. If anything, startups need engineering practices of the same level or better than those of larger companies, as their time and resources are more scarce, and one failed project can put them out of business. In this study we aim to improve understanding of the software development strategies employed by startups. We performed this state-of-practice investigation using a grounded theory approach. We packaged the results in the Greenfield Startup Model (GSM), which explains the priority of startups to release the product as quickly as possible. This strategy allows startups to verify product and market fit, and to adjust the product trajectory according to early collected user feedback. The need to shorten time-to-market, by speeding up the development through low-precision engineering activities, is counterbalanced by the need to restructure the product before targeting further growth. The resulting implications of the GSM outline challenges and gaps, pointing out opportunities for future research to develop and validate engineering practices in the startup context.

141 citations


Journal ArticleDOI
TL;DR: An approach, namely cHRev, is presented to automatically recommend reviewers who are best suited to participate in a given review, based on their historical contributions as demonstrated in their prior reviews; it is evaluated on three open source systems as well as a commercial codebase at Microsoft.
Abstract: Code review is an important part of the software development process. Recently, many open source projects have begun practicing code review through “modern” tools such as GitHub pull-requests and Gerrit. Many commercial software companies use similar tools for code review internally. These tools enable the owner of a source code change to request individuals to participate in the review, i.e., reviewers. However, this task comes with a challenge. Prior work has shown that the benefits of code review are dependent upon the expertise of the reviewers involved. Thus, a common problem faced by authors of source code changes is that of identifying the best reviewers for their source code change. To address this problem, we present an approach, namely cHRev, to automatically recommend reviewers who are best suited to participate in a given review, based on their historical contributions as demonstrated in their prior reviews. We evaluate the effectiveness of cHRev on three open source systems as well as a commercial codebase at Microsoft and compare it to the state of the art in reviewer recommendation. We show that by leveraging the specific information in previously completed reviews (i.e., quantification of review comments and their recency), we are able to improve dramatically on the performance of prior approaches, which (limitedly) operate on generic review information (i.e., reviewers of similar source code file and path names) or source code repository data. We also present insights into why our approach cHRev outperforms the existing approaches.
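A hedged sketch of the kind of scoring the abstract describes: credit each candidate reviewer for past review comments on the files touched by the change, discounted by recency. The data layout, field names, and half-life decay are assumptions for illustration, not the paper's exact model.

```python
from collections import defaultdict
from datetime import datetime

def recommend_reviewers(past_reviews, changed_files, now=None,
                        half_life_days=180, top_k=3):
    """Rank candidate reviewers for a change.

    past_reviews: iterable of dicts like
        {"reviewer": "alice", "file": "src/net/Http.java",
         "comments": 4, "date": datetime(2015, 11, 2)}
    Reviewers earn credit for each past review comment on a changed file,
    discounted by how long ago the comment was written.
    """
    now = now or datetime.utcnow()
    scores = defaultdict(float)
    for review in past_reviews:
        if review["file"] not in changed_files:
            continue
        age_days = (now - review["date"]).days
        decay = 0.5 ** (age_days / half_life_days)   # newer reviews weigh more
        scores[review["reviewer"]] += review["comments"] * decay
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```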

131 citations


Journal ArticleDOI
TL;DR: This paper extends metamorphic testing into a user-oriented approach to software verification, validation, and quality assessment, and conducts large scale empirical studies with four major web search engines.
Abstract: Metamorphic testing is a testing technique that can be used to verify the functional correctness of software in the absence of an ideal oracle. This paper extends metamorphic testing into a user-oriented approach to software verification, validation, and quality assessment, and conducts large scale empirical studies with four major web search engines: Google, Bing, Chinese Bing, and Baidu. These search engines are very difficult to test and assess using conventional approaches owing to the lack of an objective and generally recognized oracle. The results are useful for both search engine developers and users, and demonstrate that our approach can effectively alleviate the oracle problem and challenges surrounding a lack of specifications when verifying, validating, and evaluating large and complex software systems.
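An illustrative relation of the kind such studies rely on (not necessarily one used in the paper): narrowing a query with an extra conjunct should not increase the hit count. The in-memory corpus below is a hypothetical stand-in for a real search engine.

```python
# Hypothetical stand-in for a search engine: counts matching documents in a
# tiny in-memory corpus.  A real harness would query the engine's web interface.
CORPUS = [
    "metamorphic testing of search engines",
    "testing web services without an oracle",
    "search engine ranking and indexing",
]

def search_result_count(query):
    terms = query.lower().split()
    return sum(all(t in doc for t in terms) for doc in CORPUS)

def narrowing_relation_holds(term_a, term_b):
    """Metamorphic check: adding a conjunct should not increase the hit count.
    No oracle for the 'correct' count is needed; only the relation between the
    two counts is checked."""
    return search_result_count(f"{term_a} {term_b}") <= search_result_count(term_a)

print(narrowing_relation_holds("testing", "oracle"))  # -> True
```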

129 citations


Journal ArticleDOI
TL;DR: The main contribution is the description of a mathematical framework for run-time efficient probabilistic model checking, which statically generates a set of verification conditions that can be efficiently evaluated at run time as soon as changes occur.
Abstract: Modern software-intensive systems often interact with an environment whose behavior changes over time, often unpredictably. The occurrence of changes may jeopardize their ability to meet the desired requirements. It is therefore desirable to design software in a way that it can self-adapt to the occurrence of changes with limited, or even without, human intervention. Self-adaptation can be achieved by bringing software models and model checking to run time, to support perpetual automatic reasoning about changes. Once a change is detected, the system itself can predict if requirements violations may occur and enable appropriate counter-actions. However, existing mainstream model checking techniques and tools were not conceived for run-time usage; hence they hardly meet the constraints imposed by on-the-fly analysis in terms of execution time and memory usage. This paper addresses this issue and focuses on perpetual satisfaction of non-functional requirements, such as reliability or energy consumption. Its main contribution is the description of a mathematical framework for run-time efficient probabilistic model checking. Our approach statically generates a set of verification conditions that can be efficiently evaluated at run time as soon as changes occur. The proposed approach also supports sensitivity analysis, which enables reasoning about the effects of changes and can drive effective adaptation strategies.
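A toy illustration of the pre-computation idea, under the simplifying assumption of a service composed of three sequential operations with monitored success probabilities; the paper's framework targets richer probabilistic models, but the run-time check similarly reduces to substituting fresh estimates into a formula derived once, offline.

```latex
% Design time: reliability of a service of three sequential operations,
% derived symbolically from the model once, offline.
R(p_1, p_2, p_3) = p_1 \, p_2 \, p_3
% Verification condition generated at design time:
\Phi : \quad p_1 \, p_2 \, p_3 \ge 0.99
% Run time: monitoring updates the estimates, and checking \Phi is a
% constant-time substitution, with no state-space exploration:
\hat{p}_1 \, \hat{p}_2 \, \hat{p}_3 = 0.998 \times 0.995 \times 0.999 \approx 0.992 \ge 0.99
```

The expensive symbolic work thus happens before deployment, while the run-time check is a cheap evaluation that can keep pace with on-the-fly changes.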

112 citations


Journal ArticleDOI
TL;DR: This article introduces ethnography, explains its origin, context, strengths and weaknesses, and presents a set of dimensions that position ethnography as a useful and usable approach to empirical software engineering research.
Abstract: Ethnography is a qualitative research method used to study people and cultures. It is largely adopted in disciplines outside software engineering, including different areas of computer science. Ethnography can provide an in-depth understanding of the socio-technological realities surrounding everyday software development practice, i.e., it can help to uncover not only what practitioners do, but also why they do it. Despite its potential, ethnography has not been widely adopted by empirical software engineering researchers, and receives little attention in the related literature. The main goal of this paper is to explain how empirical software engineering researchers would benefit from adopting ethnography. This is achieved by explicating four roles that ethnography can play in furthering the goals of empirical software engineering: to strengthen investigations into the social and human aspects of software engineering; to inform the design of software engineering tools; to improve method and process development; and to inform research programmes. This article introduces ethnography, explains its origin, context, strengths and weaknesses, and presents a set of dimensions that position ethnography as a useful and usable approach to empirical software engineering research. Throughout the paper, relevant examples of ethnographic studies of software practice are used to illustrate the points being made.

94 citations


Journal ArticleDOI
TL;DR: It is concluded that crossover experiments can yield valid results, provided they are properly designed and analysed, and that, if correctly addressed, carryover is no worse than other validity threats.
Abstract: In experiments with crossover design subjects apply more than one treatment. Crossover designs are widespread in software engineering experimentation: they require fewer subjects and control the variability among subjects. However, some researchers disapprove of crossover designs. The main criticisms are: the carryover threat and its troublesome analysis. Carryover is the persistence of the effect of one treatment when another treatment is applied later. It may invalidate the results of an experiment. Additionally, crossover designs are often not properly designed and/or analysed, limiting the validity of the results. In this paper, we aim to make SE researchers aware of the perils of crossover experiments and provide risk avoidance good practices. We study how another discipline (medicine) runs crossover experiments. We review the SE literature and discuss which good practices tend not to be adhered to, giving advice on how they should be applied in SE experiments. We illustrate the concepts discussed analysing a crossover experiment that we have run. We conclude that crossover experiments can yield valid results, provided they are properly designed and analysed, and that, if correctly addressed, carryover is no worse than other validity threats.

88 citations


Journal ArticleDOI
TL;DR: From this empirical study, both the optimal technique and the additional technique significantly outperform the ideal technique in terms of coverage, but the latter significantly outperforms the former two techniques in terms of fault detection.
Abstract: Software testing aims to assure the quality of software under test. To improve the efficiency of software testing, especially regression testing, test-case prioritization is proposed to schedule the execution order of test cases in software testing. Among various test-case prioritization techniques, the simple additional coverage-based technique, which is a greedy strategy, achieves surprisingly competitive empirical results. To investigate how much difference there is between the order produced by the additional technique and the optimal order in terms of coverage, we conduct a study on various empirical properties of optimal coverage-based test-case prioritization. To enable us to achieve the optimal order in acceptable time for our object programs, we formulate optimal coverage-based test-case prioritization as an integer linear programming (ILP) problem. Then we conduct an empirical study for comparing the optimal technique with the simple additional coverage-based technique. From this empirical study, the optimal technique can only slightly outperform the additional coverage-based technique with no statistically significant difference in terms of coverage, and the latter significantly outperforms the former in terms of either fault detection or execution time. As the optimal technique schedules the execution order of test cases based on their structural coverage rather than detected faults, we further implement the ideal optimal test-case prioritization technique, which schedules the execution order of test cases based on their detected faults. Taking this ideal technique as the upper bound of test-case prioritization, we conduct another empirical study for comparing the optimal technique and the simple additional technique with this ideal technique. From this empirical study, both the optimal technique and the additional technique significantly outperform the ideal technique in terms of coverage, but the latter significantly outperforms the former two techniques in terms of fault detection. Our findings indicate that researchers may need to take caution when pursuing optimal techniques in test-case prioritization with intermediate goals.
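For illustration, one plausible assignment-based ILP encoding of "cover every program unit as early as possible"; it is not necessarily the paper's exact formulation. Binary $x_{tk}$ places test $t$ at position $k$, and $y_{uk}$ records whether unit $u$ is covered within the first $k$ positions.

```latex
\max \sum_{u} \sum_{k} y_{uk}
\quad \text{s.t.} \quad
\sum_{k} x_{tk} = 1 \;\; \forall t, \qquad
\sum_{t} x_{tk} = 1 \;\; \forall k, \qquad
y_{uk} \le \sum_{j \le k} \; \sum_{t \,:\, u \in \mathrm{cov}(t)} x_{tj} \;\; \forall u, k, \qquad
x_{tk}, \, y_{uk} \in \{0, 1\}
```

Maximising the objective rewards covering each unit at the earliest possible position; the greedy additional technique approximates this optimum at a fraction of the cost.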

Journal ArticleDOI
TL;DR: A machine learning approach for discovering and visualizing architectural tactics in code, mapping these code segments to tactic traceability patterns, and monitoring sensitive areas of the code for modification events in order to provide users with up-to-date information about underlying architectural concerns is presented.
Abstract: Software architectures are often constructed through a series of design decisions. In particular, architectural tactics are selected to satisfy specific quality concerns such as reliability, performance, and security. However, the knowledge of these tactical decisions is often lost, resulting in a gradual degradation of architectural quality as developers modify the code without fully understanding the underlying architectural decisions. In this paper we present a machine learning approach for discovering and visualizing architectural tactics in code, mapping these code segments to tactic traceability patterns, and monitoring sensitive areas of the code for modification events in order to provide users with up-to-date information about underlying architectural concerns. Our approach utilizes a customized classifier which is trained using code extracted from fifty performance-centric and safety-critical open source software systems. Its performance is compared against seven off-the-shelf classifiers. In a controlled experiment all classifiers performed well; however our tactic detector outperformed the other classifiers when used within the larger context of the Hadoop Distributed File System. We further demonstrate the viability of our approach for using the automatically detected tactics to generate viable and informative messages in a simulation of maintenance events mined from Hadoop's change management system.

Journal ArticleDOI
TL;DR: An adaptive ranking approach is introduced that leverages project knowledge through functional decomposition of source code, API descriptions of library components, the bug-fixing history, the code change history, and the file dependency graph, and it outperforms three recent state-of-the-art methods.
Abstract: When a new bug report is received, developers usually need to reproduce the bug and perform code reviews to find the cause, a process that can be tedious and time consuming. A tool for ranking all the source files with respect to how likely they are to contain the cause of the bug would enable developers to narrow down their search and improve productivity. This paper introduces an adaptive ranking approach that leverages project knowledge through functional decomposition of source code, API descriptions of library components, the bug-fixing history, the code change history, and the file dependency graph. Given a bug report, the ranking score of each source file is computed as a weighted combination of an array of features, where the weights are trained automatically on previously solved bug reports using a learning-to-rank technique. We evaluate the ranking system on six large scale open source Java projects, using the before-fix version of the project for every bug report. The experimental results show that the learning-to-rank approach outperforms three recent state-of-the-art methods. In particular, our method makes correct recommendations within the top 10 ranked source files for over 70 percent of the bug reports in the Eclipse Platform and Tomcat projects.
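A minimal sketch of the scoring side of such a ranker, assuming illustrative feature names and made-up weights; in the paper the weights are learned from previously solved bug reports with a learning-to-rank technique.

```python
import numpy as np

# Illustrative feature names; the actual feature set in the paper is richer.
FEATURES = ["lexical_similarity", "api_description_similarity",
            "recently_fixed_similar_bug", "recent_change_count", "dependency_proximity"]

def rank_files(bug_report_features, weights):
    """Score and rank candidate source files for one bug report.

    bug_report_features: dict mapping file path -> feature vector (len == len(FEATURES))
    weights: weight vector; here hand-picked, but it would come from a
             learning-to-rank method fit on previously solved bug reports.
    """
    scored = [(path, float(np.dot(weights, vec)))
              for path, vec in bug_report_features.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Example usage with made-up numbers:
weights = np.array([0.45, 0.15, 0.20, 0.10, 0.10])
candidates = {
    "ui/Editor.java":   np.array([0.62, 0.10, 1.0, 0.30, 0.25]),
    "core/Parser.java": np.array([0.80, 0.40, 0.0, 0.05, 0.60]),
}
print(rank_files(candidates, weights)[:10])  # top-10 recommendation
```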

Journal ArticleDOI
TL;DR: This paper presents a multi-objective test case prioritization technique that determines sequences of test cases maximizing the number of discovered faults that are both technical and business critical.
Abstract: While performing regression testing, an appropriate choice for test case ordering allows the tester to discover faults in source code early. To this end, test case prioritization techniques can be used. Several existing test case prioritization techniques leave out the execution cost of test cases and exploit a single objective function (e.g., code or requirements coverage). In this paper, we present a multi-objective test case prioritization technique that determines the ordering of test cases that maximizes the number of discovered faults that are both technical and business critical. In other words, our new technique aims both at discovering faults early and at reducing the execution cost of test cases. To this end, we automatically recover links among software artifacts (i.e., requirements specifications, test cases, and source code) and apply a metric-based approach to automatically identify critical and fault-prone portions of software artifacts, thus making it possible to give them more importance during test case prioritization. We experimentally evaluated our technique on 21 Java applications. The obtained results support our hypotheses on the efficiency and effectiveness of our new technique and on the use of automatic artifact analysis and weighting in test case prioritization.

Journal ArticleDOI
TL;DR: The relationship between the research group and the performance of a defect prediction model is more likely due to the tendency of researchers to reuse experimental components (e.g., datasets and metrics).
Abstract: Shepperd et al. find that the reported performance of a defect prediction model shares a strong relationship with the group of researchers who construct the models. In this paper, we perform an alternative investigation of Shepperd et al.'s data. We observe that (a) research group shares a strong association with other explanatory variables (i.e., the dataset and metric families that are used to build a model); (b) the strong association among these explanatory variables makes it difficult to discern the impact of the research group on model performance; and (c) after mitigating the impact of this strong association, we find that the research group has a smaller impact than the metric family. These observations lead us to conclude that the relationship between the research group and the performance of a defect prediction model is more likely due to the tendency of researchers to reuse experimental components (e.g., datasets and metrics). We recommend that researchers experiment with a broader selection of datasets and metrics to combat any potential bias in their results.

Journal ArticleDOI
TL;DR: The results suggest that the different artefact types used as safety evidence co-evolve, that the evolution of safety cases should probably be better managed, that the level of automation in safety evidence change impact analysis is low, and that the state of the practice can benefit from over 20 improvement areas.
Abstract: Context. In many application domains, critical systems must comply with safety standards. This involves gathering safety evidence in the form of artefacts such as safety analyses, system specifications, and testing results. These artefacts can evolve during a system's lifecycle, creating a need for change impact analysis to guarantee that system safety and compliance are not jeopardised. Objective. We aim to provide new insights into how safety evidence change impact analysis is addressed in practice. The knowledge about this activity is limited despite the extensive research that has been conducted on change impact analysis and on safety evidence management. Method. We conducted an industrial survey on the circumstances under which safety evidence change impact analysis is addressed, the tool support used, and the challenges faced. Results. We obtained 97 valid responses representing 16 application domains, 28 countries, and 47 safety standards. The respondents had most often performed safety evidence change impact analysis during system development, from system specifications, and fully manually. No commercial change impact analysis tool was reported as used for all artefact types and insufficient tool support was the most frequent challenge. Conclusion. The results suggest that the different artefact types used as safety evidence co-evolve. In addition, the evolution of safety cases should probably be better managed, the level of automation in safety evidence change impact analysis is low, and the state of the practice can benefit from over 20 improvement areas.

Journal ArticleDOI
TL;DR: This work presents Relda2, a light-weight and precise static resource leak detection tool; applied to 103 apps downloaded from popular app stores and an open source community, it found 67 real resource leaks, which were confirmed manually.
Abstract: Android devices include many embedded resources such as Camera, Media Player and Sensors. These resources require programmers to explicitly request and release them. Missing release operations might cause serious problems such as performance degradation or system crash. This kind of defect is called a resource leak. Despite a large body of existing work on testing and analyzing Android apps, there still remain several challenging problems. In this work, we present Relda2, a light-weight and precise static resource leak detection tool. We first systematically collected a resource table, which includes the resources that the Android reference requires developers to release manually. Based on this table, we designed a general approach to automatically detect resource leaks. To make a more precise inter-procedural analysis, we construct a Function Call Graph for each Android application, which handles function calls of user-defined methods and the callbacks invoked by the Android framework at the same time. To evaluate Relda2's effectiveness and practical applicability, we downloaded 103 apps from popular app stores and an open source community, and found 67 real resource leaks, which we have confirmed manually.
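A greatly simplified sketch of the resource-table idea: match acquisitions against releases along a single call trace. Relda2's actual analysis is static and interprocedural over a Function Call Graph and models Android callbacks; the table entries below are illustrative only.

```python
# Toy resource table: acquire call -> matching release call.
RESOURCE_TABLE = {
    "Camera.open": "Camera.release",
    "MediaPlayer.prepare": "MediaPlayer.release",
    "SensorManager.registerListener": "SensorManager.unregisterListener",
}

def find_leaks(call_trace):
    """Flag acquisitions that are never released along one call trace.

    call_trace: ordered list of fully qualified API calls observed on one path.
    A real analysis would do this over all paths of an interprocedural CFG.
    """
    pending = []
    for call in call_trace:
        if call in RESOURCE_TABLE:
            pending.append(call)
        else:
            # release the most recent matching acquisition, if any
            for i in range(len(pending) - 1, -1, -1):
                if RESOURCE_TABLE[pending[i]] == call:
                    del pending[i]
                    break
    return pending  # acquisitions with no matching release on this path

print(find_leaks(["Camera.open", "MediaPlayer.prepare", "MediaPlayer.release"]))
# -> ['Camera.open']
```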

Journal ArticleDOI
TL;DR: This work proposes micro interaction metrics (MIMs), metrics that leverage developer interaction information; MIMs significantly improve overall defect prediction accuracy when combined with existing software measures, perform well in a cost-effective manner, and provide intuitive feedback that enables developers to recognize their own inefficient behaviors during software development.
Abstract: To facilitate software quality assurance, defect prediction metrics, such as source code metrics, change churns, and the number of previous defects, have been actively studied. Despite the common understanding that developer behavioral interaction patterns can affect software quality, these widely used defect prediction metrics do not consider developer behavior. We therefore propose micro interaction metrics (MIMs), which are metrics that leverage developer interaction information. The developer interactions, such as file editing and browsing events in task sessions, are captured and stored as information by Mylyn, an Eclipse plug-in. Our experimental evaluation demonstrates that MIMs significantly improve overall defect prediction accuracy when combined with existing software measures, perform well in a cost-effective manner, and provide intuitive feedback that enables developers to recognize their own inefficient behaviors during software development.
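A small sketch of computing micro interaction metrics from Mylyn-style task-session events; the metric names and event encoding here are illustrative, not necessarily the ones defined in the paper.

```python
def micro_interaction_metrics(events):
    """Compute two illustrative per-file metrics from one task-session log.

    events: list of (kind, file) tuples, kind in {"edit", "selection"},
    in the order the interaction monitor recorded them.
    """
    metrics = {}
    files = {f for _, f in events}
    for f in files:
        edits = sum(1 for kind, file in events if file == f and kind == "edit")
        browses = sum(1 for kind, file in events if file == f and kind == "selection")
        metrics[f] = {
            "NumEditEvents": edits,                       # how often the file was edited
            "EditToBrowseRatio": edits / (browses or 1),  # editing vs. merely browsing
        }
    return metrics
```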

Journal ArticleDOI
TL;DR: A multi-objective evolutionary algorithm based proactive-rescheduling method is proposed, which generates a robust schedule predictively and adapts the previous schedule in response to critical dynamic events during the project execution, and achieves a very good overall performance in a dynamic environment.
Abstract: Software project scheduling in dynamic and uncertain environments is of significant importance to real-world software development. Yet most studies schedule software projects by considering static and deterministic scenarios only, which may cause performance deterioration or even infeasibility when facing disruptions. In order to capture more dynamic features of software project scheduling than the previous work, this paper formulates the project scheduling problem by considering uncertainties and dynamic events that often occur during software project development, and constructs a mathematical model for the resulting multi-objective dynamic project scheduling problem (MODPSP), where the four objectives of project cost, duration, robustness and stability are considered simultaneously under a variety of practical constraints. In order to solve MODPSP appropriately, a multi-objective evolutionary algorithm based proactive-rescheduling method is proposed, which generates a robust schedule predictively and adapts the previous schedule in response to critical dynamic events during the project execution. Extensive experimental results on 21 problem instances, including three instances derived from real-world software projects, show that our novel method is very effective. By introducing the robustness and stability objectives, and incorporating the dynamic optimization strategies specifically designed for MODPSP, our proactive-rescheduling method achieves a very good overall performance in a dynamic environment.

Journal ArticleDOI
TL;DR: An automated approach is proposed to locate redundant data problems and to assess their impact and prioritize the resolution efforts; by resolving the redundant data problems, the system response time for the studied systems can be improved by an average of 17 percent.
Abstract: Developers usually leverage Object-Relational Mapping (ORM) to abstract complex database accesses for large-scale systems. However, since ORM frameworks operate at a lower-level (i.e., data access), ORM frameworks do not know how the data will be used when returned from database management systems (DBMSs). Therefore, ORM cannot provide an optimal data retrieval approach for all applications, which may result in accessing redundant data and significantly affect system performance. Although ORM frameworks provide ways to resolve redundant data problems, due to the complexity of modern systems, developers may not be able to locate such problems in the code; hence, may not proactively resolve the problems. In this paper, we propose an automated approach, which we implement as a Java framework, to locate redundant data problems. We apply our framework on one enterprise and two open source systems. We find that redundant data problems exist in 87 percent of the exercised transactions. Due to the large number of detected redundant data problems, we propose an automated approach to assess the impact and prioritize the resolution efforts. Our performance assessment result shows that by resolving the redundant data problems, the system response time for the studied systems can be improved by an average of 17 percent.
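A minimal sketch of the core comparison, assuming the fetched and accessed columns per transaction have already been collected; the actual framework instruments Java/ORM executions to obtain this information.

```python
def find_redundant_columns(transactions):
    """Report columns that were fetched from the DBMS but never read.

    transactions: list of dicts like
        {"entity": "User", "fetched": {"id", "name", "address", "avatar_blob"},
         "accessed": {"id", "name"}}
    """
    report = []
    for t in transactions:
        unused = t["fetched"] - t["accessed"]
        if unused:
            report.append((t["entity"], sorted(unused)))
    return report

print(find_redundant_columns([
    {"entity": "User", "fetched": {"id", "name", "address", "avatar_blob"},
     "accessed": {"id", "name"}},
]))
# -> [('User', ['address', 'avatar_blob'])]
```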

Journal ArticleDOI
TL;DR: A hybrid strategy H is designed that starts with R and switches to S0 when S0 is expected to discover more errors per unit time; experiments show that H performs similarly to or better than the more efficient of the two, and that S0 may need to be significantly faster than the authors' bounds suggest to retain efficiency over R.
Abstract: We study the relative efficiencies of the random and systematic approaches to automated software testing. Using a simple but realistic set of assumptions, we propose a general model for software testing and define sampling strategies for random ($\mathcal{R}$) and systematic ($\mathcal{S}_0$) testing, where each sampling is associated with a sampling cost: 1 and $\mathsf{c}$ units of time, respectively. The two most important goals of software testing are: (i) achieving in minimal time a given degree of confidence $x$ in a program's correctness and (ii) discovering a maximal number of errors within a given time bound $\hat{n}$. For both (i) and (ii), we show that there exists a bound on $\mathsf{c}$ beyond which $\mathcal{R}$ performs better than $\mathcal{S}_0$ on average. Moreover, for (i), this bound depends asymptotically only on $x$. We also show that the efficiency of $\mathcal{R}$ can be fitted to the exponential curve. Using these results we design a hybrid strategy $\mathcal{H}$ that starts with $\mathcal{R}$ and switches to $\mathcal{S}_0$ when $\mathcal{S}_0$ is expected to discover more errors per unit time. In our experiments we find that $\mathcal{H}$ performs similarly to or better than the more efficient of the two, and that $\mathcal{S}_0$ may need to be significantly faster than our bounds suggest to retain efficiency over $\mathcal{R}$.
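A hedged sketch of the switching rule described in the abstract: fit $\mathcal{R}$'s cumulative discoveries to an exponential curve and switch to $\mathcal{S}_0$ once $\mathcal{R}$'s predicted discovery rate drops below what $\mathcal{S}_0$ achieves per unit time at cost $\mathsf{c}$. The estimation details below are simplifications, not the paper's exact procedure.

```python
import math

def should_switch(errors_found_by_R, time_spent_on_R, total_errors_estimate,
                  errors_per_S0_sample, cost_c):
    """Decide whether the hybrid strategy H should switch from R to S0.

    Assumes R's cumulative discoveries follow k(t) = N * (1 - exp(-lambda * t)),
    so its instantaneous rate is lambda * (N - k).  S0 discovers roughly
    `errors_per_S0_sample` new errors per sample at a cost of `cost_c` time units.
    """
    remaining = max(total_errors_estimate - errors_found_by_R, 0)
    if remaining == 0:
        return True
    # crude estimate of lambda from the fitted exponential
    lam = -math.log(1 - errors_found_by_R / total_errors_estimate) / time_spent_on_R
    rate_R = lam * remaining                 # errors per unit time if we keep using R
    rate_S0 = errors_per_S0_sample / cost_c  # errors per unit time with S0
    return rate_S0 > rate_R

print(should_switch(errors_found_by_R=30, time_spent_on_R=100.0,
                    total_errors_estimate=50, errors_per_S0_sample=0.2, cost_c=10.0))
# -> False (R is still the faster way to find errors in this made-up scenario)
```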

Journal ArticleDOI
TL;DR: Results indicate that the proposed heuristic for breaking ties in coverage-based techniques can resolve ties and in turn noticeably increase the rate of fault detection.
Abstract: Test case prioritization aims at ordering test cases to increase the rate of fault detection, which quantifies how fast faults are detected during the testing phase. A common approach for test case prioritization is to use the information of previously executed test cases, such as coverage information, resulting in an iterative (greedy) prioritization algorithm. Current research in this area validates the fact that using coverage information can improve the rate of fault detection in prioritization algorithms. The performance of such iterative prioritization schemes degrades as the number of ties encountered in prioritization steps increases. In this paper, using the notion of lexicographical ordering, we propose a new heuristic for breaking ties in coverage-based techniques. Performance of the proposed technique in terms of the rate of fault detection is empirically evaluated using a wide range of programs. Results indicate that the proposed technique can resolve ties and in turn noticeably increase the rate of fault detection.
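A sketch of the additional-coverage greedy prioritization with one plausible reading of lexicographic tie-breaking (preferring, among tied tests, the one whose statements have been exercised least often, compared lexicographically); the paper's exact tie-breaking rule may differ.

```python
def prioritize(test_coverage):
    """Order tests by the 'additional coverage' greedy rule, breaking ties
    lexicographically.

    test_coverage: dict test_name -> set of covered statements.
    """
    times_covered = {}          # statement -> how many selected tests cover it
    remaining = dict(test_coverage)
    order = []
    while remaining:
        def key(test):
            stmts = remaining[test]
            additional = sum(1 for s in stmts if s not in times_covered)
            # sorted vector of how often this test's statements were covered so far;
            # when no test adds new coverage, these vectors still discriminate
            tie_break = sorted(times_covered.get(s, 0) for s in stmts)
            return (-additional, tie_break, test)  # maximize coverage, then lexicographic
        chosen = min(remaining, key=key)
        order.append(chosen)
        for s in remaining.pop(chosen):
            times_covered[s] = times_covered.get(s, 0) + 1
    return order

print(prioritize({"t1": {1, 2, 3}, "t2": {3, 4}, "t3": {2, 3}}))
# -> ['t1', 't2', 't3']
```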

Journal ArticleDOI
TL;DR: This work presents ScrIpT repAireR (SITAR), a technique to automatically repair unusable low-level test scripts; SITAR amortizes the cost of human intervention across multiple scripts by accumulating human knowledge as annotations on the EFG.
Abstract: System testing of a GUI-based application requires that test cases, consisting of sequences of user actions/events, be executed and the software's output be verified. To enable automated re-testing, such test cases are increasingly being coded as low-level test scripts, to be replayed automatically using test harnesses. Whenever the GUI changes—widgets get moved around, windows get merged—some scripts become unusable because they no longer encode valid input sequences. Moreover, because the software's output may have changed, their test oracles—assertions and checkpoints—encoded in the scripts may no longer correctly check the intended GUI objects. We present ScrIpT repAireR (SITAR), a technique to automatically repair unusable low-level test scripts. SITAR uses reverse engineering techniques to create an abstract test for each script, maps it to an annotated event-flow graph (EFG), uses repairing transformations and human input to repair the test, and synthesizes a new “repaired” test script. During this process, SITAR also repairs the references to the GUI objects used in the checkpoints, yielding a final test script that can be executed automatically to validate the revised software. SITAR amortizes the cost of human intervention across multiple scripts by accumulating human knowledge as annotations on the EFG. An experiment using QTP test scripts suggests that SITAR is effective in that 41-89 percent of unusable test scripts were repaired. Annotations significantly reduced human cost after 20 percent of the test scripts had been repaired.

Journal ArticleDOI
TL;DR: An empirical study is conducted to examine the usefulness of the semantic and ontological variability analysis (SOVA) by comparing it to an existing tool which is mainly based on a latent semantic analysis measurement.
Abstract: Adoption of Software Product Line Engineering (SPLE) to support systematic reuse of software-related artifacts within product families is challenging, time-consuming and error-prone. Analyzing the variability of existing artifacts needs to reflect different perspectives and preferences of stakeholders in order to facilitate decisions in SPLE adoption. Considering that requirements drive many development methods and activities, we introduce an approach to analyze variability of behaviors as presented in functional requirements. The approach, called semantic and ontological variability analysis (SOVA), uses ontological and semantic considerations to automatically analyze differences between initial states (preconditions), external events (triggers) that act on the system, and final states (post-conditions) of behaviors. The approach generates feature diagrams typically used in SPLE to model variability. Those diagrams are organized according to perspective profiles, reflecting the needs and preferences of the potential stakeholders for given tasks. We conducted an empirical study to examine the usefulness of the approach by comparing it to an existing tool which is mainly based on a latent semantic analysis measurement. SOVA appears to create outputs that are more comprehensible in significantly shorter times. These results demonstrate SOVA's potential to allow for flexible, behavior-oriented variability analysis.

Journal ArticleDOI
TL;DR: Role-Based UI Simplification (RBUIS) is presented as a mechanism for increasing usability through adaptive behavior by providing end-users with a minimal feature-set and an optimal layout, based on the context-of-use.
Abstract: Software applications that are very large-scale can encompass hundreds of complex user interfaces (UIs). Such applications are commonly sold as feature-bloated off-the-shelf products to be used by people with varying needs regarding the required features and layout preferences. Although many UI adaptation approaches have been proposed, several gaps and limitations, including extensibility and integration in legacy systems, still need to be addressed in state-of-the-art adaptive UI development systems. This paper presents Role-Based UI Simplification (RBUIS) as a mechanism for increasing usability through adaptive behavior by providing end-users with a minimal feature-set and an optimal layout, based on the context-of-use. RBUIS uses an interpreted runtime model-driven approach based on the Cedar Architecture, and is supported by the integrated development environment (IDE) Cedar Studio. RBUIS was evaluated by integrating it into OFBiz, an open-source ERP system. The integration method was assessed and measured by establishing and applying technical metrics. Afterwards, a usability study was carried out to evaluate whether UIs simplified with RBUIS show an improvement over their initial counterparts. This study leveraged questionnaires, checking task completion times and output quality, and eye-tracking. The results showed that UIs simplified with RBUIS significantly improve end-user efficiency, effectiveness, and perceived usability.

Journal ArticleDOI
TL;DR: In this research, black-box string test case generation methods are investigated and an empirical study is performed indicating that the generated string test cases outperform test cases generated by other methods.
Abstract: String test cases are required by many real-world applications to identify defects and security risks. Random Testing (RT) is a low-cost and easy-to-implement testing approach to generate strings. However, its effectiveness is not satisfactory. In this research, black-box string test case generation methods are investigated. Two objective functions are introduced to produce effective test cases. The diversity of the test cases is the first objective; it can be measured through string distance functions. The second objective is guiding the string length distribution towards a Benford distribution, based on the hypothesis that the population of strings is right-skewed within its range. When both objectives are applied via a multi-objective optimization algorithm, superior string test sets are produced. An empirical study is performed with several real-world programs, indicating that the generated string test cases outperform test cases generated by other methods.
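A small sketch of the two objective functions, assuming Levenshtein distance for diversity and the leading digit of string lengths for the Benford fit; the distance functions and optimizer used in the paper may differ.

```python
import math
from itertools import combinations

def edit_distance(a, b):
    """Plain Levenshtein distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def diversity_objective(strings):
    """Average pairwise edit distance (to be maximized)."""
    pairs = list(combinations(strings, 2))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)

def benford_objective(strings):
    """Deviation of the leading digit of string lengths from Benford's law
    (to be minimized)."""
    leading = [int(str(len(s))[0]) for s in strings if len(s) > 0]
    deviation = 0.0
    for d in range(1, 10):
        observed = leading.count(d) / len(leading)
        expected = math.log10(1 + 1 / d)
        deviation += abs(observed - expected)
    return deviation

suite = ["", "a", "abc", "a1b2c3", "hello world"]
print(diversity_objective(suite), benford_objective(suite))
```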

Journal ArticleDOI
TL;DR: The proposed solution is based on the analysis of regeneration points in model executions: a regeneration is encountered after a discrete event if the future evolution depends only on the current marking and not on its previous history, thus satisfying the Markov property.
Abstract: We consider the problem of verifying quantitative reachability properties in stochastic models of concurrent activities with generally distributed durations. Models are specified as stochastic time Petri nets and checked against Boolean combinations of interval until operators imposing bounds on the probability that the marking process will satisfy a goal condition at some time in the interval $[\alpha, \beta ]$ after an execution that never violates a safety property. The proposed solution is based on the analysis of regeneration points in model executions: a regeneration is encountered after a discrete event if the future evolution depends only on the current marking and not on its previous history, thus satisfying the Markov property. We analyze systems in which multiple generally distributed timers can be started or stopped independently, but regeneration points are always encountered with probability 1 after a bounded number of discrete events. Leveraging the properties of regeneration points in probability spaces of execution paths, we show that the problem can be reduced to a set of Volterra integral equations, and we provide algorithms to compute their parameters through the enumeration of finite sequences of stochastic state classes encoding the joint probability density function (PDF) of generally distributed timers after each discrete event. The computation of symbolic PDFs is limited to discrete events before the first regeneration, and the repetitive structure of the stochastic process is exploited also before the lower bound $\alpha$ , providing crucial benefits for large time bounds. A case study is presented through the probabilistic formulation of Fischer's mutual exclusion protocol, a well-known real-time verification benchmark.
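For reference, the generic form of the Markov renewal (Volterra) equations that such regenerative analyses reduce to; the notation here is generic rather than the paper's.

```latex
% P_{ij}(t): probability of being in marking j at time t, given a regeneration in i at time 0.
P_{ij}(t) \;=\; L_{ij}(t) \;+\; \sum_{k \in \mathcal{R}} \int_0^{t} P_{kj}(t-u)\; \mathrm{d}G_{ik}(u)
% L_{ij}(t): probability of being in j at t before the next regeneration (local kernel);
% G_{ik}(u): probability that the next regeneration has occurred in k by time u (global kernel).
```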

Journal ArticleDOI
TL;DR: There are no significant relationships between perceived readability and the other measures taken in the present study; perceived readability might therefore be insufficient as the sole measure of software readability or comprehension.
Abstract: Software readability and comprehension are important factors in software maintenance. There is a large body of research on software measurement, but the actual factors that make software easier to read or easier to comprehend are not well understood. In the present study, we investigate the role of method chains and code comments in software readability and comprehension. Our analysis comprises data from 104 students with varying programming experience. Readability and comprehension were measured by perceived readability, reading time and performance on a simple cloze test. Regarding perceived readability, our results show statistically significant differences between comment variants, but not between method chain variants. Regarding comprehension, there are no significant differences between method chain or comment variants. Student groups with low and high experience, respectively, show significant differences in perceived readability and performance on the cloze tests. Our results do not show any significant relationships between perceived readability and the other measures taken in the present study. Perceived readability might therefore be insufficient as the sole measure of software readability or comprehension. We also did not find any statistically significant relationships between size and perceived readability, reading time and comprehension.

Journal ArticleDOI
TL;DR: Reliability Assessment and Improvement (RELAI) testing is proposed, a new technique thought to improve the delivered reliability by an adaptive testing scheme, while providing, at the same time, a continuous assessment of reliability attained through testing and fault removal.
Abstract: Testing software to assess or improve reliability presents several practical challenges. Conventional operational testing is a fundamental strategy that simulates the real usage of the system in order to expose failures with the highest occurrence probability. However, practitioners find it unsuitable for assessing/achieving very high reliability levels; also, they do not see the adoption of a “real” usage profile estimate as a sensible idea, as it is a source of non-quantifiable uncertainty. In contrast, debug testing aims to expose as many failures as possible, but regardless of their impact on runtime reliability. These strategies are used either to assess or to improve reliability, but cannot improve and assess reliability in the same testing session. This article proposes Reliability Assessment and Improvement (RELAI) testing, a new technique conceived to improve the delivered reliability by an adaptive testing scheme, while providing, at the same time, a continuous assessment of reliability attained through testing and fault removal. The technique also quantifies the impact of a partial knowledge of the operational profile. RELAI is positively evaluated on four software applications and compared, in separate experiments, with techniques conceived either for reliability improvement or for reliability assessment, demonstrating substantial improvements in both cases.

Journal ArticleDOI
TL;DR: A new approach that combines symbolic execution and symbolic reachability analysis to improve the effectiveness of branch testing and can both find test inputs that exercise rare execution conditions that are not identified with state-of-the-art approaches and eliminate many infeasible branches from the coverage measurement.
Abstract: Structural coverage metrics, and in particular branch coverage, are popular approaches to measure the thoroughness of test suites. Unfortunately, the presence of elements that are not executable in the program under test and the difficulty of generating test cases for rare conditions impact on the effectiveness of the coverage obtained with current approaches. In this paper, we propose a new approach that combines symbolic execution and symbolic reachability analysis to improve the effectiveness of branch testing. Our approach embraces the ideal definition of branch coverage as the percentage of executable branches traversed with the test suite, and proposes a new bidirectional symbolic analysis for both testing rare execution conditions and eliminating infeasible branches from the set of test objectives. The approach is centered on a model of the analyzed execution space. The model identifies the frontier between symbolic execution and symbolic reachability analysis, to guide the alternation and the progress of bidirectional analysis towards the coverage targets. The experimental results presented in the paper indicate that the proposed approach can both find test inputs that exercise rare execution conditions that are not identified with state-of-the-art approaches and eliminate many infeasible branches from the coverage measurement. It can thus produce a modified branch coverage metric that indicates the amount of feasible branches covered during testing, and helps team leaders and developers in estimating the amount of not-yet-covered feasible branches. The approach proposed in this paper suffers less than the other approaches from particular cases that may trap the analysis in unbounded loops.