
Showing papers in "IEEE Transactions on Software Engineering in 2020"


Journal ArticleDOI
TL;DR: A comprehensive survey of machine learning testing can be found in this article, which covers 138 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (i.e., the data, learning program, and framework), testing workflow, and application scenarios.
Abstract: This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 138 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper also analyses trends concerning datasets, research trends, and research focus, concluding with research challenges and promising research directions in machine learning testing.

343 citations


Journal ArticleDOI
TL;DR: It is concluded that the impact of class rebalancing techniques on the performance of defect prediction models depends on the performance measure and the classification technique used.
Abstract: Defect models that are trained on class-imbalanced datasets (i.e., datasets in which defective and clean modules are not equally represented) are highly susceptible to producing inaccurate predictions. Prior research compares the impact of class rebalancing techniques on the performance of defect models but arrives at contradictory conclusions due to different choices of datasets, classification techniques, and performance measures. Such contradictory conclusions make it hard to derive practical guidelines for whether class rebalancing techniques should be applied in the context of defect models. In this paper, we investigate the impact of class rebalancing techniques on the performance measures and interpretation of defect models. We also investigate the experimental settings in which class rebalancing techniques are beneficial for defect models. Through a case study of 101 datasets that span proprietary and open-source systems, we conclude that the impact of class rebalancing techniques on the performance of defect prediction models depends on the performance measure and the classification technique used. We observe that the optimized SMOTE technique and the under-sampling technique are beneficial when quality assurance teams wish to increase AUC and Recall, respectively, but they should be avoided when deriving knowledge and understanding from defect models.
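
A minimal sketch of how class rebalancing could be applied before training a defect model, assuming the imbalanced-learn and scikit-learn libraries; the feature matrix and labels are synthetic placeholders, not the paper's 101 datasets.

```python
# Hedged sketch: apply SMOTE or under-sampling to the training data only, then
# compare AUC and Recall of the resulting defect models. Data is synthetic.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for name, sampler in [("SMOTE", SMOTE(random_state=0)),
                      ("UnderSampling", RandomUnderSampler(random_state=0))]:
    X_bal, y_bal = sampler.fit_resample(X_train, y_train)   # rebalance training data only
    model = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
    print(name,
          "AUC=%.3f" % roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
          "Recall=%.3f" % recall_score(y_test, model.predict(X_test)))
```

Consistent with the paper's caution, a rebalanced model like this would be used for ranking or recall goals, not for interpreting which features drive defect-proneness.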

179 citations


Journal ArticleDOI
TL;DR: This work proposes leveraging a powerful representation-learning algorithm, deep learning, to learn the semantic representations of programs automatically from source code files and code changes; the results indicate that the DBN-based semantic features can significantly improve the examined defect prediction tasks.
Abstract: Software defect prediction, which predicts defective code regions, can assist developers in finding bugs and prioritizing their testing efforts. Traditional defect prediction features often fail to capture the semantic differences between different programs. This degrades the performance of the prediction models built on these traditional features. Thus, the capability to capture the semantics in programs is required to build accurate prediction models. To bridge the gap between semantics and defect prediction features, we propose leveraging a powerful representation-learning algorithm, deep learning, to learn the semantic representations of programs automatically from source code files and code changes. Specifically, we leverage a deep belief network (DBN) to automatically learn semantic features using token vectors extracted from the programs’ abstract syntax trees (AST) (for file-level defect prediction models) and source code changes (for change-level defect prediction models). We examine the effectiveness of our approach on two file-level defect prediction tasks (i.e., file-level within-project defect prediction and file-level cross-project defect prediction) and two change-level defect prediction tasks (i.e., change-level within-project defect prediction and change-level cross-project defect prediction). Our experimental results indicate that the DBN-based semantic features can significantly improve the examined defect prediction tasks. Specifically, the improvements of semantic features against existing traditional features (in F1) range from 2.1 to 41.9 percentage points for file-level within-project defect prediction, from 1.5 to 13.4 percentage points for file-level cross-project defect prediction, from 1.0 to 8.6 percentage points for change-level within-project defect prediction, and from 0.6 to 9.9 percentage points for change-level cross-project defect prediction.
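
A small illustrative sketch of the feature-extraction front end described above: parse a source file, collect AST node tokens, and map them to an integer vector that a representation learner such as a DBN could consume. Python's ast module is used here only for brevity; the paper works on Java ASTs, and the DBN training step itself is omitted.

```python
# Illustrative sketch (not the paper's pipeline): turn a source file into a
# sequence of AST-derived tokens and encode them as integers.
import ast

def ast_tokens(source: str) -> list[str]:
    """Walk the AST and record selected node types plus declared/called names."""
    tokens = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            tokens.append(f"{type(node).__name__}:{node.name}")
        elif isinstance(node, (ast.If, ast.For, ast.While, ast.Try)):
            tokens.append(type(node).__name__)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            tokens.append(f"Call:{node.func.id}")
    return tokens

def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Map tokens to integer ids, growing the vocabulary on the fly."""
    return [vocab.setdefault(t, len(vocab) + 1) for t in tokens]

vocab: dict[str, int] = {}
source = "def demo(x):\n    if x > 0:\n        print(x)\n"
print(encode(ast_tokens(source), vocab))   # [1, 2, 3]
```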

177 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present an approach that enables accurate prototyping of graphical user interfaces (GUIs) via three tasks: detection, classification, and assembly, where logical components of a GUI are detected from a mock-up artifact using either computer vision techniques or mock-up metadata.
Abstract: It is common practice for developers of user-facing software to transform a mock-up of a graphical user interface (GUI) into code. This process takes place both at an application's inception and in an evolutionary context as GUI changes keep pace with evolving features. Unfortunately, this practice is challenging and time-consuming. In this paper, we present an approach that automates this process by enabling accurate prototyping of GUIs via three tasks: detection , classification , and assembly . First, logical components of a GUI are detected from a mock-up artifact using either computer vision techniques or mock-up metadata. Then, software repository mining, automated dynamic analysis, and deep convolutional neural networks are utilized to accurately classify GUI-components into domain-specific types (e.g., toggle-button). Finally, a data-driven, K-nearest-neighbors algorithm generates a suitable hierarchical GUI structure from which a prototype application can be automatically assembled . We implemented this approach for Android in a system called ReDraw . Our evaluation illustrates that ReDraw achieves an average GUI-component classification accuracy of 91 percent and assembles prototype applications that closely mirror target mock-ups in terms of visual affinity while exhibiting reasonable code structure. Interviews with industrial practitioners illustrate ReDraw's potential to improve real development workflows.
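
The classification step can be pictured with a small hedged sketch: detected GUI components, represented as feature vectors, are assigned domain-specific types by a nearest-neighbour model. The random vectors below merely stand in for the CNN-derived representations ReDraw uses.

```python
# Hedged sketch of the classification idea; feature vectors and labels are
# random placeholders for CNN embeddings of GUI-component screenshots.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
component_types = ["Button", "ToggleButton", "EditText", "ImageView"]

X_train = rng.normal(size=(200, 64))                  # 200 components, 64-d embeddings
y_train = rng.choice(component_types, size=200)

classifier = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

new_component = rng.normal(size=(1, 64))              # a component detected in a mock-up
print(classifier.predict(new_component))              # e.g. ['ToggleButton']
```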

141 citations


Journal ArticleDOI
TL;DR: A mixed qualitative and quantitative study investigates how practitioners think and behave, and what they expect, in contrast to research findings on defect prediction, and identifies reasons why practitioners are reluctant to adopt defect prediction tools.
Abstract: Defect prediction has been an active research area for over four decades. Despite numerous studies on defect prediction, the potential value of defect prediction in practice remains unclear. To address this issue, we performed a mixed qualitative and quantitative study to investigate how practitioners think and behave, and what they expect, in contrast to research findings on defect prediction. We collected hypotheses from open-ended interviews and a literature review of defect prediction papers that were published at ICSE, ESEC/FSE, ASE, TSE and TOSEM in the last 6 years (2012-2017). We then conducted a validation survey where the hypotheses became statements or options of our survey questions. We received 395 responses from practitioners from over 33 countries across five continents. Some of our key findings include: 1) Over 90 percent of respondents are willing to adopt defect prediction techniques. 2) There exists a disconnect between practitioners’ perceptions and well-supported research evidence regarding defect density distribution and the relationship between file size and defectiveness. 3) 7.2 percent of the respondents reveal an inconsistency between their behavior and perception regarding defect prediction. 4) Defect prediction at the feature level is the most preferred level of granularity by practitioners. 5) During bug fixing, more than 40 percent of the respondents acknowledged that they would make a “work-around” fix rather than correct the actual error-causing code. Through a qualitative analysis of free-form text responses, we identified reasons why practitioners are reluctant to adopt defect prediction tools. We also noted features that practitioners expect defect prediction tools to deliver. Based on our findings, we highlight future research directions and provide recommendations for practitioners.

122 citations


Journal ArticleDOI
TL;DR: ARJA as discussed by the authors is a genetic programming based approach for automated repair of Java programs, which decouples the search subspaces of likely-buggy locations, operation types and potential fix ingredients, enabling GP to explore the search space more effectively.
Abstract: Automated program repair is the problem of automatically fixing bugs in programs in order to significantly reduce debugging costs and improve software quality. To address this problem, test-suite based repair techniques regard a given test suite as an oracle and modify the input buggy program to make the entire test suite pass. GenProg is well recognized as a prominent repair approach of this kind, which uses genetic programming (GP) to rearrange the statements already extant in the buggy program. However, recent empirical studies show that the performance of GenProg is not fully satisfactory, particularly for Java. In this paper, we propose ARJA, a new GP based repair approach for automated repair of Java programs. To be specific, we present a novel lower-granularity patch representation that properly decouples the search subspaces of likely-buggy locations, operation types and potential fix ingredients, enabling GP to explore the search space more effectively. Based on this new representation, we formulate automated program repair as a multi-objective search problem and use NSGA-II to look for simpler repairs. To reduce the computational effort and search space, we introduce a test filtering procedure that can speed up the fitness evaluation of GP and three types of rules that can be applied to avoid unnecessary manipulations of the code. Moreover, we also propose a type matching strategy that can create new potential fix ingredients by exploiting the syntactic patterns of existing statements. We conduct a large-scale empirical evaluation of ARJA along with its variants on both seeded bugs and real-world bugs in comparison with several state-of-the-art repair approaches. Our results verify the effectiveness and efficiency of the search mechanisms employed in ARJA and also show its superiority over the other approaches. In particular, compared to jGenProg (an implementation of GenProg for Java), an ARJA version fully following the redundancy assumption can generate a test-suite-adequate patch for more than twice as many bugs (from 27 to 59), and a correct patch for nearly four times as many (from 5 to 18), on the 224 real-world bugs considered in Defects4J. Furthermore, ARJA is able to correctly fix several real multi-location bugs that are hard to repair with most existing repair approaches.
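
The key representational idea can be sketched as follows: a candidate patch is encoded as three decoupled parts that select, per suspicious statement, whether to edit it, which operation to apply, and which fix ingredient to use. The names and the random decoding below are illustrative assumptions, not ARJA's actual encoding.

```python
# Hedged sketch of a decoupled, lower-granularity patch representation in the
# spirit of ARJA; names and decoding are illustrative only.
import random
from dataclasses import dataclass

OPERATIONS = ["delete", "replace", "insert_before"]

@dataclass
class Patch:
    edit_flags: list[int]       # 1 if the i-th suspicious statement is edited
    operation_ids: list[int]    # index into OPERATIONS for each statement
    ingredient_ids: list[int]   # index into that statement's candidate fix ingredients

def random_patch(n_locations: int, n_ingredients: int) -> Patch:
    return Patch(
        edit_flags=[random.randint(0, 1) for _ in range(n_locations)],
        operation_ids=[random.randrange(len(OPERATIONS)) for _ in range(n_locations)],
        ingredient_ids=[random.randrange(n_ingredients) for _ in range(n_locations)],
    )

patch = random_patch(n_locations=5, n_ingredients=10)
edits = [(i, OPERATIONS[op], ing)
         for i, (flag, op, ing) in enumerate(zip(patch.edit_flags,
                                                 patch.operation_ids,
                                                 patch.ingredient_ids)) if flag]
print(edits)   # e.g. [(0, 'replace', 3), (4, 'delete', 7)]
```

Because the three parts vary independently, a multi-objective search such as NSGA-II can recombine locations, operations and ingredients without re-encoding whole statements.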

115 citations


Journal ArticleDOI
TL;DR: It is concluded that model-agnostic techniques are needed to explain individual predictions of defect models and that more than half of the practitioners perceive that the contrastive explanations are necessary and useful to understand the predictions of defect models.
Abstract: Software analytics have empowered software organisations to support a wide range of improved decision-making and policy-making. However, such predictions made by software analytics have to date not been explained and justified. Specifically, current defect prediction models still fail to explain why they make a particular prediction, and fail to uphold privacy laws with respect to the requirement to explain any decision made by an algorithm. In this paper, we empirically evaluate three model-agnostic techniques, i.e., two state-of-the-art techniques, Local Interpretable Model-agnostic Explanations (LIME) and BreakDown, and our improvement of LIME with Hyper Parameter Optimisation (LIME-HPO). Through a case study of 32 highly-curated defect datasets that span 9 open-source software systems, we conclude that (1) model-agnostic techniques are needed to explain individual predictions of defect models; (2) instance explanations generated by model-agnostic techniques mostly overlap (but are not exactly the same) with the global explanation of defect models and are reliable when they are re-generated; (3) model-agnostic techniques take less than a minute to generate instance explanations; and (4) more than half of the practitioners perceive that the contrastive explanations are necessary and useful to understand the predictions of defect models. Since the implementation of the studied model-agnostic techniques is available in both Python and R, we recommend that model-agnostic techniques be used in the future.
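
A minimal sketch of generating an instance explanation for a defect model with the lime package is shown below; the dataset and feature names are synthetic placeholders, and the paper's LIME-HPO variant and BreakDown are not reproduced.

```python
# Hedged sketch: explain one prediction of a defect model with LIME.
# Data and features are placeholders, not the paper's 32 defect datasets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
feature_names = ["LOC", "CyclomaticComplexity", "NumCommits", "NumAuthors"]
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] + X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=feature_names,
    class_names=["clean", "defective"], mode="classification")

# Why does the model classify this particular module the way it does?
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())   # (feature condition, weight) pairs for this instance
```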

95 citations


Journal ArticleDOI
TL;DR: Feedback showed these contract defects are harmful and removing them would improve the quality and robustness of smart contracts; the defects were manually identified in 587 real-world smart contracts and the dataset was publicly released.
Abstract: Smart contracts are programs running on a blockchain. They are immutable once deployed, and hence cannot be patched for bugs. Thus it is critical to ensure they are bug-free and well-designed before deployment. Code smells are symptoms in source code that may indicate, e.g., security, architecture, and/or usability problems. The detection of code smells is a method to avoid potential bugs and improve the design of existing code. However, traditional code smell patterns are designed for centralized OO programs, e.g., Java or C++. Smart contracts, however, are decentralized and contain numerous distinctive features, such as the gas system. To fill this gap, we collected smart-contract-related posts from Ethereum Stack Exchange, as well as real-world smart contracts. We manually analyzed these posts and used them to define 20 kinds of code smells for smart contracts. We categorized these as indicating potential security, architecture, or usability problems. To validate whether practitioners consider these contract smells harmful, we created an online survey and received 96 responses from 24 different countries. Feedback showed these code smells are harmful and removing them would improve the quality and robustness of smart contracts. We manually identified our defined code smells in 587 contract accounts and publicly released our dataset. Finally, we summarized 5 impacts caused by contract code smells. These help developers better understand the symptoms of the smells and their removal priority.

90 citations


Journal ArticleDOI
TL;DR: Flash as mentioned in this paper is a sequential model-based method that explores the configuration space by reflecting on the configurations evaluated so far to determine the next best configuration to explore, which reduces the effort required to find a better configuration.
Abstract: Finding good configurations of a software system is often challenging since the number of configuration options can be large. Software engineers often make poor choices about configuration or, even worse, use a sub-optimal configuration in production, which leads to inadequate performance. To assist engineers in finding a better configuration, this article introduces Flash, a sequential model-based method that explores the configuration space by reflecting on the configurations evaluated so far to determine the next best configuration to explore. Flash scales up to software systems that defeat the prior state-of-the-art model-based methods in this area. Flash runs much faster than existing methods and can solve both single-objective and multi-objective optimization problems. The central insight of this article is to use the prior knowledge of the configuration space (gained from prior runs) to choose the next promising configuration. This strategy reduces the effort (i.e., number of measurements) required to find a better configuration. We evaluate Flash using 30 scenarios based on 7 software systems to demonstrate that Flash saves effort in 100 and 80 percent of cases in single-objective and multi-objective problems respectively, by up to several orders of magnitude compared to state-of-the-art techniques.
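
The central loop can be sketched under stated assumptions: fit a decision-tree (CART) surrogate on the configurations measured so far, then spend the next measurement on the configuration the surrogate predicts to be best. The configuration space and performance function below are synthetic stand-ins, not Flash's exact algorithm or benchmarks.

```python
# Hedged sketch of a sequential model-based configuration search.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
weights = rng.normal(size=10)                                  # hidden "true" performance model
configs = rng.integers(0, 2, size=(500, 10)).astype(float)     # 500 binary configurations

def measure(config: np.ndarray) -> float:
    """Stand-in for actually benchmarking the system under one configuration."""
    return float(config @ weights + 0.1 * rng.normal())

# Start from a small random sample of measured configurations.
scores = {i: measure(configs[i]) for i in rng.choice(len(configs), 20, replace=False)}

budget = 50
while len(scores) < budget:
    surrogate = DecisionTreeRegressor().fit(configs[list(scores)], list(scores.values()))
    remaining = [i for i in range(len(configs)) if i not in scores]
    # Measure the configuration the surrogate predicts to be best (lower = better).
    best_guess = remaining[int(np.argmin(surrogate.predict(configs[remaining])))]
    scores[best_guess] = measure(configs[best_guess])

best = min(scores, key=scores.get)
print("best measured configuration:", configs[best], "score: %.2f" % scores[best])
```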

86 citations


Journal ArticleDOI
TL;DR: Empirical studies reveal previously unknown failures in some of the most popular applications in the world, and show how the proposed approach can help users to better understand and better use the systems.
Abstract: Modern information technology paradigms, such as online services and off-the-shelf products, often involve a wide variety of users with different or even conflicting objectives. Every software output may satisfy some users, but may also fail to satisfy others. Furthermore, users often do not know the internal working mechanisms of the systems. This situation is quite different from bespoke software, where developers and users typically know each other. This paper proposes an approach to help users to better understand the software that they use, and thereby more easily achieve their objectives—even when they do not fully understand how the system is implemented. Our approach borrows the concept of metamorphic relations from the field of metamorphic testing (MT), using it in an innovative way that extends beyond MT. We also propose a “symmetry” metamorphic relation pattern and a “change direction” metamorphic relation input pattern that can be used to derive multiple concrete metamorphic relations. Empirical studies reveal previously unknown failures in some of the most popular applications in the world, and show how our approach can help users to better understand and better use the systems. The empirical results provide strong evidence of the simplicity, applicability, and effectiveness of our methodology.
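
As a concrete illustration of the "symmetry" pattern, a user can check a metamorphic relation against a black-box system without knowing its internals. The system under test below is a toy distance function, and the relation (swapping source and destination should not change the result) is an invented stand-in for the paper's relations.

```python
# Illustrative sketch of checking a symmetry-style metamorphic relation from the
# user's side; the "system" is a toy function, not one of the studied applications.
import math

def system_under_test(src: tuple[float, float], dst: tuple[float, float]) -> float:
    """Pretend black-box service, e.g. a distance query in a maps application."""
    return math.dist(src, dst)

def check_symmetry(src, dst, tolerance=1e-6) -> bool:
    source_result = system_under_test(src, dst)
    follow_up_result = system_under_test(dst, src)   # metamorphic follow-up input
    return abs(source_result - follow_up_result) <= tolerance

print(check_symmetry((52.52, 13.40), (48.85, 2.35)))   # True if the relation holds
```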

84 citations


Journal ArticleDOI
TL;DR: This work created one of the most accurate, complete, and representative refactoring oracles to date, and showed that the approach achieves the highest average precision and recall among all competitive tools, and on median is 2.6 times faster than the second fastest competitive tool.
Abstract: Refactoring detection is crucial for a variety of applications and tasks: (i) empirical studies about code evolution, (ii) tools for library API migration, (iii) code reviews and change comprehension. However, recent research has questioned the accuracy of the state-of-the-art refactoring mining tools, which poses threats to the reliability of the detected refactorings. Moreover, the majority of refactoring mining tools depend on code similarity thresholds. Finding universal threshold values that can work well for all projects, regardless of their architectural style, application domain, and development practices, is extremely challenging. Therefore, in a previous work [1], we introduced the first refactoring mining tool that does not require any code similarity thresholds to operate. In this work, we extend our tool to support low-level refactorings that take place within the body of methods. To evaluate our tool, we created one of the most accurate, complete, and representative refactoring oracles to date, including 7,223 true instances for 40 different refactoring types detected by one (minimum) up to six (maximum) different tools, and validated by one up to four refactoring experts. Our evaluation showed that our approach achieves the highest average precision (99.6%) and recall (94%) among all competitive tools, and on median is 2.6 times faster than the second fastest competitive tool.
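
The evaluation boils down to set comparison against the oracle: each true or detected refactoring is a (commit, refactoring) pair. A small sketch with invented instances:

```python
# Sketch of computing precision and recall against a refactoring oracle.
# The instances below are made up, not entries from the paper's oracle.
oracle = {("c1", "Extract Method foo"), ("c1", "Rename Class A->B"),
          ("c2", "Move Method bar"), ("c3", "Inline Method baz")}
detected = {("c1", "Extract Method foo"), ("c2", "Move Method bar"),
            ("c2", "Extract Variable tmp")}   # the last one is a false positive

true_positives = detected & oracle
precision = len(true_positives) / len(detected)
recall = len(true_positives) / len(oracle)
print(f"precision={precision:.2f} recall={recall:.2f}")   # precision=0.67 recall=0.50
```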

Journal ArticleDOI
TL;DR: The textual features of a bug report and the reporter's experience are the most important factors for distinguishing valid bug reports from invalid ones, and the approach statistically significantly outperforms two baselines using features proposed by Zanetti et al.
Abstract: Developers use bug reports to triage and fix bugs. When triaging a bug report, developers must decide whether the bug report is valid (i.e., a real bug). A large number of bug reports are submitted every day, and many of them end up being invalid. Manually determining whether a bug report is valid is a difficult and tedious task. Thus, an approach that can automatically analyze the validity of a bug report and determine whether it is valid can help developers prioritize their triaging tasks and avoid wasting time and effort on invalid bug reports. In this study, motivated by the above needs, we propose an approach which can determine whether a newly submitted bug report is valid. Our approach first extracts 33 features from bug reports. The extracted features are grouped along 5 dimensions, i.e., reporter experience, collaboration network, completeness, readability and text. Based on these features, we use a random forest classifier to identify valid bug reports. To evaluate the effectiveness of our approach, we experiment on large-scale datasets containing a total of 560,697 bug reports from five open source projects (i.e., Eclipse, Netbeans, Mozilla, Firefox and Thunderbird). On average, across the five datasets, our approach achieves F1-scores of 0.74 and 0.67 for valid and invalid bug reports, respectively. Moreover, our approach achieves an average AUC of 0.81. In terms of AUC and F1-scores for valid and invalid bug reports, our approach statistically significantly outperforms two baselines using features that are proposed by Zanetti et al. [104]. We also study the most important features that distinguish valid bug reports from invalid ones. We find that the textual features of a bug report and the reporter's experience are the most important factors for distinguishing valid bug reports from invalid ones.
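
The modelling step amounts to training a classifier over report-level features and evaluating it with F1 and AUC. A hedged sketch with synthetic placeholders for the paper's 33 features:

```python
# Hedged sketch of the modelling step; features and data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = ["reporter_valid_rate", "degree_centrality", "has_steps_to_reproduce",
                 "readability_score", "description_length"]        # placeholder features
X = rng.normal(size=(5000, len(feature_names)))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.8, size=5000) > 0).astype(int)  # 1 = valid

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
print("F1 (valid):   %.2f" % f1_score(y_te, pred, pos_label=1))
print("F1 (invalid): %.2f" % f1_score(y_te, pred, pos_label=0))
print("AUC:          %.2f" % roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```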

Journal ArticleDOI
TL;DR: In this article, a tree-based neural machine translation model is proposed to learn the probability distribution of changes in code and a change suggestion engine, CODIT, is trained with more than 30k real-world changes and evaluated on 6k patches.
Abstract: The way developers edit day-to-day code tends to be repetitive, often using existing code elements. Many researchers have tried to automate repetitive code changes by learning from specific change templates that are applied to a limited scope. The advancement of deep neural networks and the availability of vast open-source evolutionary data open up the possibility of automatically learning those templates from the wild. However, deep neural network based modeling for code changes, and for code in general, introduces some specific problems that need specific attention from the research community. For instance, compared to natural language, source code vocabulary can be significantly larger. Further, good changes in code do not break its syntactic structure. Thus, deploying state-of-the-art neural network models without adapting the methods to the source code domain yields sub-optimal results. To this end, we propose a novel tree-based neural network system to model source code changes and learn code change patterns from the wild. Specifically, we propose a tree-based neural machine translation model to learn the probability distribution of changes in code. We realize our model with a change suggestion engine, CODIT, train the model with more than 30k real-world changes, and evaluate it on 6k patches. Our evaluation shows the effectiveness of CODIT in learning and suggesting patches. CODIT can also learn specific bug-fix patterns from bug-fixing patches and can fix 27 of 75 one-line bugs in Defects4J.

Journal ArticleDOI
TL;DR: This paper presents a new code summarization approach using hierarchical attention network by incorporating multiple code features, including type-augmented abstract syntax trees and program control flows into a deep reinforcement learning (DRL) framework for comment generation.
Abstract: Code summarization (aka comment generation) provides a high-level natural language description of the function performed by code, which can benefit software maintenance, code categorization and retrieval. To the best of our knowledge, the state-of-the-art approaches follow an encoder-decoder framework which encodes source code into a hidden space and later decodes it into a natural language space. Such approaches suffer from the following drawbacks: (a) they mainly represent code as a sequence of tokens while ignoring code hierarchy; (b) most of the encoders only consider simple features (e.g., tokens) while ignoring the features that can help capture the correlations between comments and code; (c) the decoders are typically trained to predict subsequent words by maximizing the likelihood of subsequent ground-truth words, while in the real world they are expected to generate the entire word sequence from scratch. As a result, such drawbacks lead to inferior and inconsistent comment generation accuracy. To address the above limitations, this paper presents a new code summarization approach using a hierarchical attention network that incorporates multiple code features, including type-augmented abstract syntax trees and program control flows. Such features, along with plain code sequences, are injected into a deep reinforcement learning (DRL) framework (e.g., actor-critic network) for comment generation. Our approach assigns weights (pays “attention”) to tokens and statements when constructing the code representation to reflect the hierarchical code structure under different contexts regarding code features (e.g., control flows and abstract syntax trees). Our reinforcement learning mechanism further strengthens the prediction results through the actor network and the critic network, where the actor network provides the confidence of predicting subsequent words based on the current state, and the critic network computes the reward values of all the possible extensions of the current state to provide global guidance for explorations. Eventually, we employ an advantage reward to train both networks and conduct a set of experiments on a real-world dataset. The experimental results demonstrate that our approach outperforms the baselines by around 22% to 45% in BLEU-1 and outperforms the state-of-the-art approaches by around 5% to 60% in terms of S-BLEU and C-BLEU.
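
The reported gains are in BLEU variants; below is a small sketch of computing BLEU-1 for a generated comment against a reference using NLTK. The sentences are invented, and the paper's S-BLEU and C-BLEU variants are not reproduced.

```python
# Minimal BLEU-1 computation for one generated comment (illustrative only).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "returns the absolute path of the given file".split()
candidate = "returns the absolute path for a file".split()

bleu_1 = sentence_bleu([reference], candidate,
                       weights=(1.0, 0, 0, 0),                     # unigram precision only
                       smoothing_function=SmoothingFunction().method1)
print(f"BLEU-1 = {bleu_1:.3f}")
```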

Journal ArticleDOI
TL;DR: In this article, the authors systematically map, aggregate and synthesize the literature on cognitive biases in software engineering to generate a comprehensive body of knowledge, understand state-of-the-art research and provide guidelines for future research and practise.
Abstract: One source of software project challenges and failures is the systematic errors introduced by human cognitive biases. Although extensively explored in cognitive psychology, investigations concerning cognitive biases have only recently gained popularity in software engineering research. This paper therefore systematically maps, aggregates and synthesizes the literature on cognitive biases in software engineering to generate a comprehensive body of knowledge, understand state-of-the-art research and provide guidelines for future research and practise. Focusing on bias antecedents, effects and mitigation techniques, we identified 65 articles (published between 1990 and 2016), which investigate 37 cognitive biases. Despite strong and increasing interest, the results reveal a scarcity of research on mitigation techniques and poor theoretical foundations in understanding and interpreting cognitive biases. Although bias-related research has generated many new insights in the software engineering community, specific bias mitigation techniques are still needed for software professionals to overcome the deleterious effects of cognitive biases on their work.

Journal ArticleDOI
TL;DR: A survey among 327 practitioners to gain their insights into various categories of automated bug report management techniques and summarized some potential research directions in developing techniques to help developers better manage bug reports.
Abstract: Bug reports play an important role in the process of debugging and fixing bugs. To reduce the burden of bug report managers and facilitate the process of bug fixing, a great amount of software engineering research has been invested toward automated bug report management techniques. However, the verdict is still open whether such techniques are actually required and applicable outside the domain of theoretical research. To fill this gap, we conducted a survey among 327 practitioners to gain their insights into various categories of automated bug report management techniques. Specifically, we asked the respondents to rate the importance of such techniques and provide the rationale. To get deeper insights into practitioners’ perspective, we conducted follow-up interviews with 25 interviewees selected from the survey respondents. Through the survey and the interviews, we gained a better understanding of the perceived usefulness (or its lack) of different categories of automated bug report management techniques. Based on our findings, we summarized some potential research directions in developing techniques to help developers better manage bug reports.

Journal ArticleDOI
TL;DR: A novel commit message generation model, named ATOM, which explicitly incorporates the abstract syntax tree for representing code changes and integrates both retrieved and generated messages through hybrid ranking, which demonstrates the effectiveness of ATOM in generating accurate code commit messages.
Abstract: Commit messages record code changes (e.g., feature modifications and bug repairs) in natural language, and are useful for program comprehension. Due to the frequent updates of software and the time cost involved, developers are generally unmotivated to write commit messages for code changes. Therefore, automating the message writing process is necessary. Previous studies on commit message generation have benefited from generation models or retrieval models, but the structure of the changed code, i.e., the AST, which can be important for capturing code semantics, has not been explicitly involved. Moreover, although generation models have the advantage of synthesizing commit messages for new code changes, it is not easy for them to bridge the semantic gap between code and natural language, which retrieval models can mitigate. In this paper, we propose a novel commit message generation model, named ATOM, which explicitly incorporates the abstract syntax tree for representing code changes and integrates both retrieved and generated messages through hybrid ranking. Specifically, the hybrid ranking module can prioritize the most accurate message from both retrieved and generated messages for a given code change. We evaluate the proposed model ATOM on our dataset crawled from 56 popular Java repositories. Experimental results demonstrate that ATOM improves on the state-of-the-art models by 30.72% in terms of BLEU-4 (an accuracy measure that is widely used to evaluate text generation systems). Qualitative analysis also demonstrates the effectiveness of ATOM in generating accurate code commit messages.
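
The hybrid-ranking idea can be sketched in a few lines: for one code change, score both a retrieved and a generated candidate message and keep the better one. The token-overlap score below is a simple stand-in for ATOM's learned ranking module, and the example inputs are invented.

```python
# Hedged sketch of hybrid ranking over retrieved and generated candidates.
def overlap_score(change_tokens: set[str], message: str) -> float:
    message_tokens = set(message.lower().split())
    return len(change_tokens & message_tokens) / max(len(message_tokens), 1)

def hybrid_rank(change_tokens: set[str], candidates: dict[str, str]) -> str:
    """Return the candidate message (retrieved or generated) with the highest score."""
    return max(candidates.values(),
               key=lambda msg: overlap_score(change_tokens, msg))

change_tokens = {"null", "check", "getuser", "return"}      # tokens from the diff/AST
candidates = {
    "retrieved": "Fix crash when user is missing",
    "generated": "Add null check before return in getUser",
}
print(hybrid_rank(change_tokens, candidates))   # "Add null check before return in getUser"
```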

Journal ArticleDOI
TL;DR: A convolutional neural network (CNN)-based approach to automatically classify sentences into different categories of intentions is proposed, investigated, and applied to improve an automated software engineering task, in which it is used to rectify misclassified issue reports and thus reduce the bias introduced by such data to other studies.
Abstract: Developers frequently discuss aspects of the systems they are developing online. The comments they post to discussions form a rich information source about the system. Intention mining, a process introduced by Di Sorbo et al., classifies sentences in developer discussions to enable further analysis. As one example of use, intention mining has been used to help build various recommenders for software developers. The technique introduced by Di Sorbo et al. to categorize sentences is based on linguistic patterns derived from two projects. The limited number of data sources used in this earlier work introduces questions about the comprehensiveness of intention categories and whether the linguistic patterns used to identify the categories are generalizable to developer discussions recorded in other kinds of software artifacts (e.g., issue reports). To assess the comprehensiveness of the previously identified intention categories and the generalizability of the linguistic patterns for category identification, we manually created a new dataset, categorizing 5,408 sentences from issue reports of four projects in GitHub. Based on this manual effort, we refined the previous categories. We assess Di Sorbo et al.'s patterns on this dataset, finding that the accuracy rate achieved is low (0.31). To address the deficiencies of Di Sorbo et al.'s patterns, we propose and investigate a convolutional neural network (CNN)-based approach to automatically classify sentences into different categories of intentions. Our approach optimizes CNN by integrating batch normalization to accelerate the training speed, and an automatic hyperparameter tuning approach to tune appropriate hyperparameters of CNN. Our approach achieves an accuracy of 0.84 on the new dataset, improving Di Sorbo et al.'s approach by 171 percent. We also apply our approach to improve an automated software engineering task, in which we use our proposed approach to rectify misclassified issue reports, thus reducing the bias introduced by such data to other studies. A case study on four open source projects with 2,076 issue reports shows that our approach achieves an average AUC score of 0.687, which improves other baselines by at least 16 percent.
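
A hedged PyTorch sketch of a 1-D CNN sentence classifier with batch normalization, in the spirit of the approach above; the vocabulary size, number of intention categories and hyperparameters are placeholders, and the automatic hyperparameter tuning is not reproduced.

```python
# Illustrative CNN text classifier with batch normalization (placeholder sizes).
import torch
import torch.nn as nn

class IntentionCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, num_classes=7):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm1d(64)           # batch normalization after the convolution
        self.fc = nn.Linear(64, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # -> (batch, embed_dim, seq_len)
        x = torch.relu(self.bn(self.conv(x)))
        x = x.max(dim=2).values                           # max-pool over the sequence
        return self.fc(x)                                  # class logits

model = IntentionCNN()
fake_batch = torch.randint(0, 10000, (8, 40))   # 8 sentences of 40 token ids each
print(model(fake_batch).shape)                  # torch.Size([8, 7])
```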

Journal ArticleDOI
TL;DR: The results suggest that the proposed Line-DP can effectively identify defective lines that contain common defects while requiring a smaller amount of inspection effort and a manageable computation cost.
Abstract: Defect prediction models are proposed to help a team prioritize the source code files that need Software Quality Assurance (SQA) based on the likelihood of having defects. However, developers may waste unnecessary effort on a whole file while only a small fraction of its source code lines are defective. Indeed, we find that as little as 1%-3% of the lines of a file are defective. Hence, in this work, we propose a novel framework (called LINE-DP) to identify defective lines using a model-agnostic technique, i.e., an Explainable AI technique that provides information about why the model makes such a prediction. Broadly speaking, our LINE-DP first builds a file-level defect model using code token features. Then, our LINE-DP uses a state-of-the-art model-agnostic technique (i.e., LIME) to identify risky tokens, i.e., code tokens that lead the file-level defect model to predict that the file will be defective. Then, the lines that contain risky tokens are predicted as defective lines. Through a case study of 32 releases of nine Java open source systems, our evaluation results show that our LINE-DP achieves an average recall of 0.61, a false alarm rate of 0.47, a top 20%LOC recall of 0.27, and an initial false alarm of 16, which are statistically better than six baseline approaches. Our evaluation shows that our LINE-DP requires an average computation time of 10 seconds including model construction and defect identification time. In addition, we find that 63% of defective lines that can be identified by our LINE-DP are related to common defects (e.g., argument change, condition change). These results suggest that our LINE-DP can effectively identify defective lines that contain common defects while requiring a smaller amount of inspection effort and a manageable computation cost. The contribution of this paper constitutes an important step towards line-level defect prediction by leveraging a model-agnostic technique.
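
The final step of the idea can be sketched simply: once an explainer such as LIME yields risky tokens for a file predicted defective, flag the lines that contain them. The risky tokens and file content below are invented examples.

```python
# Sketch of mapping risky tokens (e.g., from LIME) to candidate defective lines.
import re

risky_tokens = {"strcpy", "bufferSize", "unlock"}   # hypothetical explainer output

source_lines = [
    "void copy(char *dst, const char *src) {",
    "    strcpy(dst, src);",
    "    int bufferSize = 1024;",
    "    log(bufferSize);",
    "}",
]

def defective_lines(lines, tokens):
    """Return 1-based line numbers whose identifiers intersect the risky set."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        identifiers = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line))
        if identifiers & tokens:
            flagged.append(number)
    return flagged

print(defective_lines(source_lines, risky_tokens))   # [2, 3, 4]
```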

Journal ArticleDOI
TL;DR: Whether functional magnetic resonance imaging (fMRI) is feasible for soundly measuring program comprehension is explored, and a clear, distinct activation of five brain regions related to working memory, attention, and language processing is shown.
Abstract: Program comprehension is an important, but hard to measure cognitive process. This makes it difficult to provide suitable programming languages, tools, or coding conventions to support developers in their everyday work. Here, we explore whether functional magnetic resonance imaging (fMRI) is feasible for soundly measuring program comprehension. To this end, we observed 17 participants inside an fMRI scanner while they were comprehending source code. The results show a clear, distinct activation of five brain regions, which are related to working memory, attention, and language processing, which all fit well to our understanding of program comprehension. Furthermore, we found reduced activity in the default mode network, indicating the cognitive effort necessary for program comprehension. We also observed that familiarity with Java as underlying programming language reduced cognitive effort during program comprehension. To gain confidence in the results and the method, we replicated the study with 11 new participants and largely confirmed our findings. Our results encourage us and, hopefully, others to use fMRI to observe programmers and, in the long run, answer questions, such as: How should we train programmers? Can we train someone to become an excellent programmer? How effective are new languages and tools for program comprehension?

Journal ArticleDOI
TL;DR: A qualitative study that combines a survey of 66 developers and a case study of 223 logging-related issue reports draws a comprehensive picture of the benefits and costs of logging from developers’ perspectives.
Abstract: Software developers insert logging statements in their source code to collect important runtime information of software systems. In practice, logging appropriately is a challenge for developers. Prior studies aimed to improve logging by proactively inserting logging statements in certain code snippets or by learning where to log from existing logging code. However, there exists no work that systematically studies developers' logging considerations, i.e., the benefits and costs of logging from developers' perspectives. Without understanding developers' logging considerations, automated approaches for logging decisions are based primarily on researchers' intuition which may not be convincing to developers. In order to fill the gap between developers' logging considerations and researchers' intuition, we performed a qualitative study that combines a survey of 66 developers and a case study of 223 logging-related issue reports. The findings of our qualitative study draw a comprehensive picture of the benefits and costs of logging from developers' perspectives. We observe that developers consider a wide range of logging benefits and costs, while most of the uncovered benefits and costs have never been observed nor discussed in prior work. We also observe that developers use ad hoc strategies to balance the benefits and costs of logging. Developers need to be fully aware of the benefits and costs of logging, in order to better benefit from logging (e.g., leveraging logging to enable users to solve problems by themselves) and avoid unnecessary negative impact (e.g., exposing users' sensitive information). Future research needs to consider such a wide range of logging benefits and costs when developing automated logging strategies. Our findings also inspire opportunities for researchers and logging library providers to help developers balance the benefits and costs of logging, for example, to support different log levels for different parts of a logging statement, or to help developers estimate and reduce the negative impact of logging statements.

Journal ArticleDOI
TL;DR: Why and how developers use linters are studied, along with the challenges they face while using such tools, and several feature suggestions for tool makers as well as future work for researchers are proposed.
Abstract: A linter is a static analysis tool that warns software developers about possible code errors or violations to coding standards. By using such a tool, errors can be surfaced early in the development process when they are cheaper to fix. For a linter to be successful, it is important to understand the needs and challenges of developers when using a linter. In this paper, we examine developers’ perceptions on JavaScript linters. We study why and how developers use linters along with the challenges they face while using such tools. For this purpose we perform a case study on ESLint, the most popular JavaScript linter. We collect data with three different methods where we interviewed 15 developers from well-known open source projects, analyzed over 9,500 ESLint configuration files, and surveyed 337 developers from the JavaScript community. Our results provide practitioners with reasons for using linters in their JavaScript projects as well as several configuration strategies and their advantages. We also provide a list of linter rules that are often enabled and disabled, which can be interpreted as the most important rules to reason about when configuring linters. Finally, we propose several feature suggestions for tool makers and future work for researchers.
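
The configuration-file analysis described above can be pictured with a hedged sketch that counts which rules are switched on or off across a set of .eslintrc.json files; the corpus path is a placeholder, and only the JSON flavour of ESLint configuration is handled (ESLint configs can also be YAML or JavaScript).

```python
# Hedged sketch: tally enabled/disabled ESLint rules across JSON config files.
import json
from collections import Counter
from pathlib import Path

enabled, disabled = Counter(), Counter()

for config_path in Path("corpus").rglob(".eslintrc.json"):   # hypothetical corpus directory
    rules = json.loads(config_path.read_text()).get("rules", {})
    for rule, setting in rules.items():
        # ESLint rule settings are "off"/0, "warn"/1, "error"/2, or an options array.
        level = setting[0] if isinstance(setting, list) else setting
        if level in ("off", 0):
            disabled[rule] += 1
        else:
            enabled[rule] += 1

print("most enabled: ", enabled.most_common(5))
print("most disabled:", disabled.most_common(5))
```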

Journal ArticleDOI
TL;DR: The Theory of Motivation and Satisfaction of Software Engineers (TMS-SE), presented in this article, combines elements from well established theories with new findings, and translates them into the software engineering context.
Abstract: Context : The proper management of people can help software organisations to achieve higher levels of success. However, the limited attention paid to the appropriate use of theories to underpin the research in this area leaves it unclear how to deal with human aspects of software engineers, such as motivation and satisfaction. Objectives : This article aims to expose what drives the motivation and satisfaction of software engineers at work. Methods : A multiple case study was conducted at four software organisations in Brazil. For 11 months, data was collected using semi-structured interviews, diary studies, and document analyses. Results : The Theory of Motivation and Satisfaction of Software Engineers (TMS-SE), presented in this article, combines elements from well established theories with new findings, and translates them into the software engineering context. Conclusion : The TMS-SE advances the understanding of people management in the software engineering field and presents a strong conceptual framework for future investigations in this area.

Journal ArticleDOI
TL;DR: An automated approach to learn characteristics of smart contracts in Solidity, which is useful for clone detection, bug detection and contract validation on smart contracts, based on word embeddings and vector space comparison is proposed.
Abstract: Smart contracts have been increasingly used together with blockchains to automate financial and business transactions. However, many bugs and vulnerabilities have been identified in many contracts, which raises serious concerns about smart contract security, not to mention that the blockchain systems on which the smart contracts are built can themselves be buggy. Thus, there is a significant need to better maintain smart contract code and ensure its high reliability. In this paper, we propose an automated approach to learn characteristics of smart contracts in Solidity, useful for clone detection, bug detection and contract validation. Our new approach is based on word embeddings and vector space comparison. We parse smart contract code into word streams with code structural information, convert code elements (e.g., statements, functions) into numerical vectors that are supposed to encode the code syntax and semantics, and compare the similarities among the vectors encoding code and known bugs, to identify potential issues. We have implemented the approach in a prototype, named SmartEmbed, and evaluated it with more than 22,000 smart contracts collected from the Ethereum blockchain. Results show that our tool can effectively identify many repetitive instances of Solidity code, where the clone ratio is around 90%. Code clones such as type-III or even type-IV semantic clones can also be detected. Our tool can identify more than 500 clone-related bugs based on our bug databases efficiently and accurately. Our tool can also help to efficiently validate any given smart contract against the known set of bugs, which can help to improve the users' confidence in the reliability of the contract.
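
The vector-space comparison at the core of the approach can be sketched as follows: embed code fragments and flag a fragment whose embedding is close to that of a known buggy fragment. The vectors below are random placeholders; SmartEmbed derives them from token streams of Solidity code.

```python
# Hedged sketch of embedding-based clone/bug matching with cosine similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
known_buggy = {f"bug_{i}": rng.normal(size=100) for i in range(50)}    # placeholder bug database
new_fragment = known_buggy["bug_7"] + 0.05 * rng.normal(size=100)       # near-clone of a known bug

THRESHOLD = 0.95
matches = [(bug_id, round(cosine_similarity(new_fragment, vec), 3))
           for bug_id, vec in known_buggy.items()
           if cosine_similarity(new_fragment, vec) >= THRESHOLD]
print(matches)   # e.g. [('bug_7', 0.999)]
```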

Journal ArticleDOI
TL;DR: According to the results on 82 projects in terms of 11 performance measures, it is found that the supervised CPDP methods are more promising than the unsupervised method in practical application scenarios, since the limitation of testing resource and the impact on developers cannot be ignored in these scenarios.
Abstract: Cross-project defect prediction (CPDP), aiming to apply defect prediction models built on source projects to a target project, has been an active research topic. A variety of supervised CPDP methods and some simple unsupervised CPDP methods have been proposed. In a recent study, Zhou et al. found that simple unsupervised CPDP methods (i.e., ManualDown and ManualUp) have a prediction performance comparable or even superior to complex supervised CPDP methods. Therefore, they suggested that ManualDown should be treated as the baseline when considering non-effort-aware performance measures (NPMs) and ManualUp should be treated as the baseline when considering effort-aware performance measures (EPMs) in future CPDP studies. However, in that work, these unsupervised methods are only compared with existing supervised CPDP methods in terms of one or two NPMs, and the prediction results of the baselines are directly collected from the primary literature. Besides, the comparison has not considered other recently proposed EPMs, which account for context switches and developer fatigue due to initial false alarms. These limitations may prevent a holistic comparison between the supervised and unsupervised methods. In this paper, we aim to revisit Zhou et al.'s study. To the best of our knowledge, we are the first to compare the existing supervised CPDP methods and the unsupervised methods proposed by Zhou et al. in the same experimental setting, considering both NPMs and EPMs. We also propose an improved supervised CPDP method EASC and make a further comparison between this method and the unsupervised methods. According to the results on 82 projects in terms of 12 performance measures, we find that when considering NPMs, EASC can achieve similar results to the unsupervised method ManualDown without statistically significant differences in most cases. However, when considering EPMs, our proposed supervised method EASC can statistically significantly outperform the unsupervised method ManualUp with a large improvement in terms of Cliff's delta in most cases. Therefore, the supervised CPDP methods are more promising than the unsupervised methods in practical application scenarios, since the limitation of testing resources and the impact on developers cannot be ignored in these scenarios.
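
For reference, the two unsupervised baselines are commonly described as pure size-based rankings: ManualDown inspects larger modules first, ManualUp smaller modules first (for effort-aware settings). A small sketch with invented module sizes:

```python
# Sketch of the ManualDown/ManualUp baselines as size-based rankings.
modules = {"A.java": 1200, "B.java": 85, "C.java": 430, "D.java": 20}   # invented LOC values

def manual_down(sizes: dict[str, int]) -> list[str]:
    return sorted(sizes, key=sizes.get, reverse=True)    # largest modules first

def manual_up(sizes: dict[str, int]) -> list[str]:
    return sorted(sizes, key=sizes.get)                  # smallest modules first

print("ManualDown:", manual_down(modules))   # ['A.java', 'C.java', 'B.java', 'D.java']
print("ManualUp:  ", manual_up(modules))     # ['D.java', 'B.java', 'C.java', 'A.java']
```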

Journal ArticleDOI
TL;DR: A study of feature use and misuse in 9,312 open source systems that use Travis CI reveals that explicit deployment code is rare and proposes Gretel—an anti-pattern removal tool for Travis CI specifications, which can remove 69.60 percent of the most frequently occurring anti- pattern automatically.
Abstract: Continuous Integration (CI) is a popular practice where software systems are automatically compiled and tested as changes appear in the version control system of a project. Like other software artifacts, CI specifications require maintenance effort. Although there are several service providers like Travis CI offering various CI features, it is unclear which features are being (mis)used. In this paper, we present a study of feature use and misuse in 9,312 open source systems that use Travis CI . Analysis of the features that are adopted by projects reveals that explicit deployment code is rare—48.16 percent of the studied Travis CI specification code is instead associated with configuring job processing nodes. To analyze feature misuse, we propose Hansel —an anti-pattern detection tool for Travis CI specifications. We define four anti-patterns and Hansel detects anti-patterns in the Travis CI specifications of 894 projects in the corpus (9.60 percent), and achieves a recall of 82.76 percent in a sample of 100 projects. Furthermore, we propose Gretel —an anti-pattern removal tool for Travis CI specifications, which can remove 69.60 percent of the most frequently occurring anti-pattern automatically. Using Gretel , we have produced 36 accepted pull requests that remove Travis CI anti-patterns automatically.
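
Since the abstract does not enumerate the four anti-patterns, the sketch below is only a hypothetical example of how a Hansel-style detector could scan .travis.yml files; the rule shown (flagging top-level keys that are not recognised Travis CI sections) is an invented illustration of the mechanism, not one of the paper's rules, and the corpus path is a placeholder.

```python
# Hypothetical Hansel-style check over Travis CI specifications (illustrative rule only).
import yaml
from pathlib import Path

KNOWN_SECTIONS = {"language", "os", "dist", "jdk", "node_js", "python", "env",
                  "install", "before_install", "script", "before_script",
                  "after_success", "after_failure", "deploy", "matrix", "jobs",
                  "cache", "branches", "notifications", "services", "addons",
                  "stages", "sudo"}

def suspicious_keys(spec_path: Path) -> list[str]:
    spec = yaml.safe_load(spec_path.read_text()) or {}
    return [key for key in spec if key not in KNOWN_SECTIONS]

for path in Path("corpus").rglob(".travis.yml"):          # hypothetical corpus directory
    flagged = suspicious_keys(path)
    if flagged:
        print(path, "-> unrecognised top-level keys:", flagged)
```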

Journal ArticleDOI
TL;DR: The findings of this study highlight the need for changes to the current badge system in order to provide a better balance between the quality and quantity of revisions.
Abstract: To ensure the quality of its shared knowledge, Stack Overflow encourages users to revise answers through a badge system, which is based on quantitative measures (e.g., a badge is awarded after revising more than 500 answers). Prior studies show that badges can positively steer user behavior on Stack Overflow (e.g., increasing user participation). However, little is known about whether revision-related badges have a negative impact on the quality of revisions, since some studies show that certain users may game incentive systems to gain rewards. In this study, we analyze 3,871,966 revision records that are collected from 2,377,692 Stack Overflow answers. We find that: 1) Users performed many more revisions than usual on badge-awarding days compared to normal days; 25% of the users did not make any more revisions once they received their first revision-related badge. 2) Performing more revisions than usual in a single day increased the likelihood of such revisions being rolled back (e.g., due to undesired or incorrect revisions). 3) Users were more likely to perform text and small revisions if they performed many revisions in a single day. The Stack Overflow community concurs with our findings, which highlight the need for changes to the current badge system in order to provide a better balance between the quality and quantity of revisions.

Journal ArticleDOI
TL;DR: An in-depth study of the merge conflicts found in the histories of 2,731 open source Java projects gives rise to three primary recommendations for future merge techniques, that could on one hand help in automatically resolving certain types of conflicts and on the other hand provide the developer with tool-based assistance to more easily resolve other types of conflict.
Abstract: When multiple developers change a software system in parallel, these concurrent changes need to be merged to all appear in the software being developed. Numerous merge techniques have been proposed to support this task, but none of them can fully automate the merge process. Indeed, it has been reported that as much as 10 to 20 percent of all merge attempts result in a merge conflict, meaning that a developer has to manually complete the merge. To date, we have little insight into the nature of these merge conflicts. What do they look like, in detail? How do developers resolve them? Do any patterns exist that might suggest new merge techniques that could reduce the manual effort? This paper contributes an in-depth study of the merge conflicts found in the histories of 2,731 open source Java projects. Seeded by the manual analysis of the histories of five projects, our automated analysis of all 2,731 projects: (1) characterizes the merge conflicts in terms of number of chunks, size, and programming language constructs involved, (2) classifies the manual resolution strategies that developers use to address these merge conflicts, and (3) analyzes the relationships between various characteristics of the merge conflicts and the chosen resolution strategies. Our results give rise to three primary recommendations for future merge techniques, that – when implemented – could on one hand help in automatically resolving certain types of conflicts and on the other hand provide the developer with tool-based assistance to more easily resolve other types of conflicts that cannot be automatically resolved.
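
Characterising conflicts in terms of chunks can be illustrated by scanning the standard Git conflict markers in a file left in a conflicted state; the file content below is invented.

```python
# Sketch: count conflict chunks and their sizes from Git conflict markers.
conflicted_file = """\
public int total(int a, int b) {
<<<<<<< HEAD
    return a + b;
=======
    return Math.addExact(a, b);
>>>>>>> feature-branch
}
"""

def conflict_chunks(text: str) -> list[dict]:
    chunks, current = [], None
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            current = {"ours": [], "theirs": [], "side": "ours"}
        elif line.startswith("=======") and current is not None:
            current["side"] = "theirs"
        elif line.startswith(">>>>>>>") and current is not None:
            chunks.append(current)
            current = None
        elif current is not None:
            current[current["side"]].append(line)
    return chunks

chunks = conflict_chunks(conflicted_file)
print(len(chunks), "chunk(s); sizes:",
      [(len(c["ours"]), len(c["theirs"])) for c in chunks])   # 1 chunk(s); sizes: [(1, 1)]
```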

Journal ArticleDOI
TL;DR: It is found that developers face a variety of challenges for all nine configuration engineering activities, starting from the creation of options, which generally is not planned beforehand and increases the complexity of a software system, to the non-trivial comprehension and debugging of configurations, and ending with the risky maintenance of configuration options, since developers avoid touching and changing configuration options in a mature system.
Abstract: Modern software applications are adapted to different situations (e.g., memory limits, enabling/disabling features, database credentials) by changing the values of configuration options, without any source code modifications. According to several studies, this flexibility is expensive as configuration failures represent one of the most common types of software failures. They are also hard to debug and resolve as they require a lot of effort to detect which options are misconfigured among a large number of configuration options and values, while comprehension of the code also is hampered by sprinkling conditional checks of the values of configuration options. Although researchers have proposed various approaches to help debug or prevent configuration failures, especially from the end users’ perspective, this paper takes a step back to understand the process required by practitioners to engineer the run-time configuration options in their source code, the challenges they experience as well as best practices that they have or could adopt. By interviewing 14 software engineering experts, followed by a large survey on 229 Java software engineers, we identified 9 major activities related to configuration engineering, 22 challenges faced by developers, and 24 expert recommendations to improve software configuration quality. We complemented this study by a systematic literature review to enrich the experts’ recommendations, and to identify possible solutions discussed and evaluated by the research community for the developers’ problems and challenges. We find that developers face a variety of challenges for all nine configuration engineering activities, starting from the creation of options, which generally is not planned beforehand and increases the complexity of a software system, to the non-trivial comprehension and debugging of configurations, and ending with the risky maintenance of configuration options, since developers avoid touching and changing configuration options in a mature system. We also find that researchers thus far focus primarily on testing and debugging configuration failures, leaving a large range of opportunities for future work.

Journal ArticleDOI
TL;DR: A JIT defect localization approach that leverages software naturalness with the N-gram model is proposed; it achieves a reasonable performance and outperforms the two baselines by a substantial margin in terms of four ranking measures.
Abstract: Defect localization aims to locate buggy program elements (e.g., buggy files, methods or lines of code) based on defect symptoms, e.g., bug reports or program spectrum. However, by the time we receive the defect symptoms, the defect has already been exposed and negative impacts have been introduced. Thus, one challenging task is whether we can locate buggy program elements at an early time, prior to the appearance of the defect symptom (e.g., when buggy program elements are being checked in). We refer to this type of defect localization as “Just-In-Time (JIT) defect localization”. Although many prior studies have proposed various JIT defect identification methods to identify whether a new change is buggy, these prior methods do not locate the suspicious positions. Thus, JIT defect localization is the next step after JIT defect identification (i.e., after a buggy change is identified, suspicious source code lines are located). To address this problem, we propose a two-phase framework, i.e., JIT defect identification and JIT defect localization. Given a new change, JIT defect identification first identifies it as a buggy change or a clean change. If a new change is identified as buggy, JIT defect localization ranks the source code lines introduced by the new change according to their suspiciousness scores. The source code lines ranked at the top of the list are estimated to be the defect location. For the JIT defect identification phase, we use 14 change-level features to build a classifier by following an existing approach. For the JIT defect localization phase, we propose a JIT defect localization approach that leverages software naturalness with the N-gram model. To evaluate the proposed framework, we conduct an empirical study on 14 open source projects with a total of 177,250 changes. The results show that software naturalness is effective for our JIT defect localization. Our model achieves a reasonable performance, and outperforms the two baselines (i.e., random guess and a static bug finder (i.e., PMD)) by a substantial margin in terms of four ranking measures.
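
The localization phase's core idea can be sketched under assumptions: train a simple n-gram language model on the project's historical code and rank a change's added lines by how surprising (high cross-entropy) they are. The add-one-smoothed bigram model and the toy corpus below stand in for the paper's actual N-gram implementation.

```python
# Hedged sketch: rank added lines by bigram cross-entropy ("unnaturalness").
import math
import re
from collections import Counter

def tokens(line: str) -> list[str]:
    return re.findall(r"[A-Za-z_]\w*|\S", line)

history = [                                           # toy stand-in for project history
    "if (user == null) { return null; }",
    "return userRepository.findById(id);",
    "if (id == null) { throw new IllegalArgumentException(); }",
]

unigrams, bigrams = Counter(), Counter()
for line in history:
    toks = ["<s>"] + tokens(line)
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))
vocab_size = len(unigrams)

def cross_entropy(line: str) -> float:
    toks = ["<s>"] + tokens(line)
    log_prob = sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
                   for a, b in zip(toks, toks[1:]))   # add-one smoothing
    return -log_prob / max(len(toks) - 1, 1)

added_lines = [
    "return userRepository.findById(id);",            # similar to history -> natural
    "gc.collect(cache, FORCE_EVICT | 0x7f);",         # unusual -> more suspicious
]
for line in sorted(added_lines, key=cross_entropy, reverse=True):
    print(f"{cross_entropy(line):.2f}  {line}")       # most suspicious line first
```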