scispace - formally typeset
Search or ask a question

Showing papers in "arXiv: Software Engineering in 2018"


Proceedings ArticleDOI
TL;DR: DeepGauge as discussed by the authors proposes a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed.
Abstract: Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems.

307 citations


Posted Content
Miltiadis Allamanis1
TL;DR: The effects of code duplication on machine learning models are explored, showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora which more accurately represent how machineLearning models of code are used by software engineers.
Abstract: The field of big code relies on mining large corpora of code to perform some learning task. A significant threat to this approach has been recently identified by Lopes et al. (2017) who found a large amount of near-duplicate code on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this work, we explore the effects of code duplication on machine learning models showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora which more accurately represent how machine learning models of code are used by software engineers. We present a duplication index for widely used datasets, list best practices for collecting code corpora and evaluating machine learning models on them. Finally, we release tools to help the community avoid this problem in future research.

166 citations


Journal ArticleDOI
TL;DR: In this paper, the authors provide a study on the meaning, characteristics, and dynamic growth of GitHub stars and propose four patterns to describe stars growth, which are derived after clustering the time series representing the number of stars of the studied repositories.
Abstract: Besides a git-based version control system, GitHub integrates several social coding features Particularly, GitHub users can star a repository, presumably to manifest interest or satisfaction with an open source project However, the real and practical meaning of starring a project was never the subject of an in-depth and well-founded empirical investigation Therefore, we provide in this paper a throughout study on the meaning, characteristics, and dynamic growth of GitHub stars First, by surveying 791 developers, we report that three out of four developers consider the number of stars before using or contributing to a GitHub project Then, we report a quantitative analysis on the characteristics of the top-5,000 most starred GitHub repositories We propose four patterns to describe stars growth, which are derived after clustering the time series representing the number of stars of the studied repositories; we also reveal the perception of 115 developers about these growth patterns To conclude, we provide a list of recommendations to open source project managers (eg, on the importance of social media promotion) and to GitHub users and Software Engineering researchers (eg, on the risks faced when selecting projects by GitHub stars)

147 citations


Journal ArticleDOI
TL;DR: This article considers behavioral repair where test suites, contracts, models, and crashing inputs are taken as oracle, and state repair, also known as runtime repair or runtime recovery, with techniques such as checkpoint and restart, reconfiguration, and invariant restoration.
Abstract: This article presents a survey on automatic software repair. Automatic software repair consists of automatically finding a solution to software bugs without human intervention. This article considers all kinds of repairs. First, it discusses behavioral repair where test suites, contracts, models, and crashing inputs are taken as oracle. Second, it discusses state repair, also known as runtime repair or runtime recovery, with techniques such as checkpoint and restart, reconfiguration, and invariant restoration. The uniqueness of this article is that it spans the research communities that contribute to this body of knowledge: software engineering, dependability, operating systems, programming languages, and security. It provides a novel and structured overview of the diversity of bug oracles and repair operators used in the literature.

147 citations


Posted Content
TL;DR: In this article, the authors perform an empirical study to assess the feasibility of using Neural Machine Translation (NMT) techniques for learning bug-fixing patches for real defects, and show that NMT is capable of predicting fixed patches generated by developers in 9-50% of the cases.
Abstract: Millions of open-source projects with numerous bug fixes are available in code repositories. This proliferation of software development histories can be leveraged to learn how to fix common programming bugs. To explore such a potential, we perform an empirical study to assess the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects. First, we mine millions of bug-fixes from the change histories of projects hosted on GitHub, in order to extract meaningful examples of such bug-fixes. Next, we abstract the buggy and corresponding fixed code, and use them to train an Encoder-Decoder model able to translate buggy code into its fixed version. In our empirical investigation we found that such a model is able to fix thousands of unique buggy methods in the wild. Overall, this model is capable of predicting fixed patches generated by developers in 9-50% of the cases, depending on the number of candidate patches we allow it to generate. Also, the model is able to emulate a variety of different Abstract Syntax Tree operations and generate candidate patches in a split second.

146 citations


Posted Content
TL;DR: Zhang et al. as mentioned in this paper proposed a mutation testing framework for DL systems to measure the quality of test data, by defining a set of source-level mutation operators to inject faults to the source of DL (i.e., training data and training programs).
Abstract: Deep learning (DL) defines a new data-driven programming paradigm where the internal system logic is largely shaped by the training data. The standard way of evaluating DL models is to examine their performance on a test dataset. The quality of the test dataset is of great importance to gain confidence of the trained models. Using an inadequate test dataset, DL models that have achieved high test accuracy may still lack generality and robustness. In traditional software testing, mutation testing is a well-established technique for quality evaluation of test suites, which analyzes to what extent a test suite detects the injected faults. However, due to the fundamental difference between traditional software and deep learning-based software, traditional mutation testing techniques cannot be directly applied to DL systems. In this paper, we propose a mutation testing framework specialized for DL systems to measure the quality of test data. To do this, by sharing the same spirit of mutation testing in traditional software, we first define a set of source-level mutation operators to inject faults to the source of DL (i.e., training data and training programs). Then we design a set of model-level mutation operators that directly inject faults into DL models without a training process. Eventually, the quality of test data could be evaluated from the analysis on to what extent the injected faults could be detected. The usefulness of the proposed mutation testing techniques is demonstrated on two public datasets, namely MNIST and CIFAR-10, with three DL models.

132 citations


Posted Content
TL;DR: A data-driven approach to vulnerability detection using machine learning, specifically applied to C and C++ programs is presented and it is found that source-based models perform better than traditional models.
Abstract: Thousands of security vulnerabilities are discovered in production software each year, either reported publicly to the Common Vulnerabilities and Exposures database or discovered internally in proprietary code. Vulnerabilities often manifest themselves in subtle ways that are not obvious to code reviewers or the developers themselves. With the wealth of open source code available for analysis, there is an opportunity to learn the patterns of bugs that can lead to security vulnerabilities directly from data. In this paper, we present a data-driven approach to vulnerability detection using machine learning, specifically applied to C and C++ programs. We first compile a large dataset of hundreds of thousands of open-source functions labeled with the outputs of a static analyzer. We then compare methods applied directly to source code with methods applied to artifacts extracted from the build process, finding that source-based models perform better. We also compare the application of deep neural network models with more traditional models such as random forests and find the best performance comes from combining features learned by deep models with tree-based models. Ultimately, our highest performing model achieves an area under the precision-recall curve of 0.49 and an area under the ROC curve of 0.87.

116 citations


Posted Content
TL;DR: A novel test adequacy criterion is proposed, called Surprise Adequacy for Deep Learning Systems (SADL), which is based on the behaviour of DL systems with respect to their training data, and shows that systematic sampling of inputs based on their surprise can improve classification accuracy ofDL systems against adversarial examples by up to 77.5% via retraining.
Abstract: Deep Learning (DL) systems are rapidly being adopted in safety and security critical domains, urgently calling for ways to test their correctness and robustness. Testing of DL systems has traditionally relied on manual collection and labelling of data. Recently, a number of coverage criteria based on neuron activation values have been proposed. These criteria essentially count the number of neurons whose activation during the execution of a DL system satisfied certain properties, such as being above predefined thresholds. However, existing coverage criteria are not sufficiently fine grained to capture subtle behaviours exhibited by DL systems. Moreover, evaluations have focused on showing correlation between adversarial examples and proposed criteria rather than evaluating and guiding their use for actual testing of DL systems. We propose a novel test adequacy criterion for testing of DL systems, called Surprise Adequacy for Deep Learning Systems (SADL), which is based on the behaviour of DL systems with respect to their training data. We measure the surprise of an input as the difference in DL system's behaviour between the input and the training data (i.e., what was learnt during training), and subsequently develop this as an adequacy criterion: a good test input should be sufficiently but not overtly surprising compared to training data. Empirical evaluation using a range of DL systems from simple image classifiers to autonomous driving car platforms shows that systematic sampling of inputs based on their surprise can improve classification accuracy of DL systems against adversarial examples by up to 77.5% via retraining.

115 citations


Posted Content
TL;DR: DeepBugs as discussed by the authors is a learning approach to name-based bug detection, which reasons about names based on a semantic representation and automatically learns bug detectors instead of manually writing them.
Abstract: Natural language elements in source code, e.g., the names of variables and functions, convey useful information. However, most existing bug detection tools ignore this information and therefore miss some classes of bugs. The few existing name-based bug detection approaches reason about names on a syntactic level and rely on manually designed and tuned algorithms to detect bugs. This paper presents DeepBugs, a learning approach to name-based bug detection, which reasons about names based on a semantic representation and which automatically learns bug detectors instead of manually writing them. We formulate bug detection as a binary classification problem and train a classifier that distinguishes correct from incorrect code. To address the challenge that effectively learning a bug detector requires examples of both correct and incorrect code, we create likely incorrect code examples from an existing corpus of code through simple code transformations. A novel insight learned from our work is that learning from artificially seeded bugs yields bug detectors that are effective at finding bugs in real-world code. We implement our idea into a framework for learning-based and name-based bug detection. Three bug detectors built on top of the framework detect accidentally swapped function arguments, incorrect binary operators, and incorrect operands in binary operations. Applying the approach to a corpus of 150,000 JavaScript files yields bug detectors that have a high accuracy (between 89% and 95%), are very efficient (less than 20 milliseconds per analyzed file), and reveal 102 programming mistakes (with 68% true positive rate) in real-world code.

114 citations


Posted Content
TL;DR: An abstract syntax tree structure as well as sequential content of code snippets into a deep reinforcement learning framework (i.e., actor-critic network) which provides the confidence of predicting the next word according to current state and an advantage reward composed of BLEU metric to train both networks.
Abstract: Code summarization provides a high level natural language description of the function performed by code, as it can benefit the software maintenance, code categorization and retrieval. To the best of our knowledge, most state-of-the-art approaches follow an encoder-decoder framework which encodes the code into a hidden space and then decode it into natural language space, suffering from two major drawbacks: a) Their encoders only consider the sequential content of code, ignoring the tree structure which is also critical for the task of code summarization, b) Their decoders are typically trained to predict the next word by maximizing the likelihood of next ground-truth word with previous ground-truth word given. However, it is expected to generate the entire sequence from scratch at test time. This discrepancy can cause an \textit{exposure bias} issue, making the learnt decoder suboptimal. In this paper, we incorporate an abstract syntax tree structure as well as sequential content of code snippets into a deep reinforcement learning framework (i.e., actor-critic network). The actor network provides the confidence of predicting the next word according to current state. On the other hand, the critic network evaluates the reward value of all possible extensions of the current state and can provide global guidance for explorations. We employ an advantage reward composed of BLEU metric to train both networks. Comprehensive experiments on a real-world dataset show the effectiveness of our proposed model when compared with some state-of-the-art methods.

114 citations


Proceedings ArticleDOI
TL;DR: This research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis, which has many applications, such as cross-architecture vulnerability discovery and code plagiarism detection.
Abstract: Binary code analysis allows analyzing binary code without having access to the corresponding source code. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from Natural Language Processing (NLP), a rich area focused on processing text of various natural languages. We notice that binary code analysis and NLP share a lot of analogical topics, such as semantics extraction, summarization, and classification. This work utilizes these ideas to address two important code similarity comparison problems. (I) Given a pair of basic blocks for different instruction set architectures (ISAs), determining whether their semantics is similar or not; and (II) given a piece of code of interest, determining if it is contained in another piece of assembly code for a different ISA. The solutions to these two problems have many applications, such as cross-architecture vulnerability discovery and code plagiarism detection. We implement a prototype system INNEREYE and perform a comprehensive evaluation. A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in terms of accuracy, efficiency and scalability. And the case studies utilizing the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis.

Posted Content
TL;DR: The large body of available literature is reviewed in order to distill a list of the main factors influencing productivity investigated so far and can be used to guide further analysis and as basis for building productivity models.
Abstract: Analysing and improving productivity has been one of the main goals of software engineering research since its beginnings. A plethora of studies has been conducted on various factors that resulted in several models for analysis and prediction of productivity. However, productivity is still an issue in current software development and not all factors and their relationships are known. This paper reviews the large body of available literature in order to distill a list of the main factors influencing productivity investigated so far. The measure for importance here is the number of articles a factor is mentioned in. Special consideration is given to soft or human-related factors in software engineering that are often not analysed with equal detail as more technical factors. The resulting list can be used to guide further analysis and as basis for building productivity models.

Posted Content
TL;DR: Huang et al. as discussed by the authors presented a comprehensive evaluation study on automated log parsing and further released the tools and benchmarks for easy reuse, and evaluated 13 log parsers on a total of 16 log datasets.
Abstract: Logs are imperative in the development and maintenance process of many software systems. They record detailed runtime information that allows developers and support engineers to monitor their systems and dissect anomalous behaviors and errors. The increasing scale and complexity of modern software systems, however, make the volume of logs explodes. In many cases, the traditional way of manual log inspection becomes impractical. Many recent studies, as well as industrial tools, resort to powerful text search and machine learning-based analytics solutions. Due to the unstructured nature of logs, a first crucial step is to parse log messages into structured data for subsequent analysis. In recent years, automated log parsing has been widely studied in both academia and industry, producing a series of log parsers by different techniques. To better understand the characteristics of these log parsers, in this paper, we present a comprehensive evaluation study on automated log parsing and further release the tools and benchmarks for easy reuse. More specifically, we evaluate 13 log parsers on a total of 16 log datasets spanning distributed systems, supercomputers, operating systems, mobile systems, server applications, and standalone software. We report the benchmarking results in terms of accuracy, robustness, and efficiency, which are of practical importance when deploying automated log parsing in production. We also share the success stories and lessons learned in an industrial application at Huawei. We believe that our work could serve as the basis and provide valuable guidance to future research and deployment of automated log parsing.

Posted Content
TL;DR: This study builds the AVATAR APR system, which exploits fix patterns of static analysis violations as ingredients for patch generation, and highlights the relevance of static bug finding tools as indirect contributors of fix ingredients for addressing code defects identified with functional test cases.
Abstract: Fix pattern-based patch generation is a promising direction in Automated Program Repair (APR). Notably, it has been demonstrated to produce more acceptable and correct patches than the patches obtained with mutation operators through genetic programming. The performance of pattern-based APR systems, however, depends on the fix ingredients mined from fix changes in development histories. Unfortunately, collecting a reliable set of bug fixes in repositories can be challenging. In this paper, we propose to investigate the possibility in an APR scenario of leveraging code changes that address violations by static bug detection tools. To that end, we build the AVATAR APR system, which exploits fix patterns of static analysis violations as ingredients for patch generation. Evaluated on the Defects4J benchmark, we show that, assuming a perfect localization of faults, AVATAR can generate correct patches to fix 34/39 bugs. We further find that AVATAR yields performance metrics that are comparable to that of the closely-related approaches in the literature. While AVATAR outperforms many of the state-of-the-art pattern-based APR systems, it is mostly complementary to current approaches. Overall, our study highlights the relevance of static bug finding tools as indirect contributors of fix ingredients for addressing code defects identified with functional test cases.

Journal ArticleDOI
TL;DR: Aroma as mentioned in this paper is a tool and technique for code recommendation via structural code search, which takes a partial code snippet as input, searches the corpus for method bodies containing the partial code snippets, and clusters and intersects the results of the search to recommend a small set of succinct code snippets which both contain the query snippet and appear as part of several methods in the corpus.
Abstract: Programmers often write code that has similarity to existing code written somewhere. A tool that could help programmers to search such similar code would be immensely useful. Such a tool could help programmers to extend partially written code snippets to completely implement necessary functionality, help to discover extensions to the partial code which are commonly included by other programmers, help to cross-check against similar code written by other programmers, or help to add extra code which would fix common mistakes and errors. We propose Aroma, a tool and technique for code recommendation via structural code search. Aroma indexes a huge code corpus including thousands of open-source projects, takes a partial code snippet as input, searches the corpus for method bodies containing the partial code snippet, and clusters and intersects the results of the search to recommend a small set of succinct code snippets which both contain the query snippet and appear as part of several methods in the corpus. We evaluated Aroma on 2000 randomly selected queries created from the corpus, as well as 64 queries derived from code snippets obtained from Stack Overflow, a popular website for discussing code. We implemented Aroma for 4 different languages, and developed an IDE plugin for Aroma. Furthermore, we conducted a study where we asked 12 programmers to complete programming tasks using Aroma, and collected their feedback. Our results indicate that Aroma is capable of retrieving and recommending relevant code snippets efficiently.

Posted Content
TL;DR: This paper implements the first practical bytecode-level APR technique, PraPR, and presents the first extensive study on fixing real-world bugs using JVM bytecode mutation, and demonstrates the overfitting problem of recent advanced APR tools for the first time.
Abstract: Software debugging is tedious, time-consuming, and even error-prone by itself. So, various automated debugging techniques have been proposed in the literature to facilitate the debugging process. Automated Program Repair (APR) is one of the most recent advances in automated debugging, and can directly produce patches for buggy programs with minimal human intervention. Although various advanced APR techniques (including those that are either search-based or semantic-based) have been proposed, the simplistic mutation-based APR technique, which simply uses pre-defined mutation operators (e.g., changing a>=b into a>b) to mutate programs for finding patches, has not yet been thoroughly studied. In this paper, we implement the first practical bytecode-level APR technique, PraPR, and present the first extensive study on fixing real-world bugs (e.g., Defects4J bugs) using bytecode mutation. The experimental results show that surprisingly even PraPR with only the basic traditional mutators can produce genuine patches for 18 bugs. Furthermore, with our augmented mutators, PraPR is able to produce genuine patches for 43 bugs, significantly outperforming state-of-the-art APR. It is also an order of magnitude faster, indicating a promising future for bytecode-mutation-based APR.

Posted Content
TL;DR: In this paper, the authors present requirements for the representation of scenarios in different process steps defined by the ISO 26262 standard, and demonstrate how scenarios can be systematically evolved along the phases of the development process outlined in the ISO26262 standard.
Abstract: The ISO 26262 standard from 2016 represents the state of the art for a safety-guided development of safety-critical electric/electronic vehicle systems. These vehicle systems include advanced driver assistance systems and vehicle guidance systems. The development process proposed in the ISO 26262 standard is based upon multiple V-models, and defines activities and work products for each process step. In many of these process steps, scenario based approaches can be applied to achieve the defined work products for the development of automated driving functions. To accomplish the work products of different process steps, scenarios have to focus on various aspects like a human understandable notation or a description via time-space variables. This leads to contradictory requirements regarding the level of detail and way of notation for the representation of scenarios. In this paper, the authors present requirements for the representation of scenarios in different process steps defined by the ISO 26262 standard, propose a consistent terminology based on prior publications for the identified levels of abstraction, and demonstrate how scenarios can be systematically evolved along the phases of the development process outlined in the ISO 26262 standard.

Posted Content
TL;DR: MonkeyLab as mentioned in this paper is based on the Record-Mine-Generate-Validate framework, which relies on recording app usages that yield execution (event) traces, mining those event traces and generating execution scenarios using statistical language modeling, static and dynamic analyses, and validating the resulting scenarios using an interactive execution of the app on a real device.
Abstract: GUI-based models extracted from Android app execution traces, events, or source code can be extremely useful for challenging tasks such as the generation of scenarios or test cases. However, extracting effective models can be an expensive process. Moreover, existing approaches for automatically deriving GUI-based models are not able to generate scenarios that include events which were not observed in execution (nor event) traces. In this paper, we address these and other major challenges in our novel hybrid approach, coined as MonkeyLab. Our approach is based on the Record-Mine-Generate-Validate framework, which relies on recording app usages that yield execution (event) traces, mining those event traces and generating execution scenarios using statistical language modeling, static and dynamic analyses, and validating the resulting scenarios using an interactive execution of the app on a real device. The framework aims at mining models capable of generating feasible and fully replayable (i.e., actionable) scenarios reflecting either natural user behavior or uncommon usages (e.g., corner cases) for a given app. We evaluated MONKEYLAB in a case study involving several medium-to-large open-source Android apps. Our results demonstrate that MonkeyLab is able to mine GUI-based models that can be used to generate actionable execution scenarios for both natural and unnatural sequences of events on Google Nexus 7 tablets.

Book ChapterDOI
TL;DR: In this paper, a taxonomy of scale for agile software development projects is presented, which has the potential to clarify what topics researchers are studying and ease discussion of research priorities in the area.
Abstract: Positive experience of agile development methods in smaller projects has created interest in the applicability of such methods in larger scale projects. However, there is a lack of conceptual clarity regarding what large-scale agile software development is. This inhibits effective collaboration and progress in the research area. In this paper, we suggest a taxonomy of scale for agile software development projects that has the potential to clarify what topics researchers are studying and ease discussion of research priorities.

Posted Content
TL;DR: DeepRoad is proposed, an unsupervised framework to automatically generate large amounts of accurate driving scenes to test the consistency of DNN-based autonomous driving systems across different scenes, and can detect thousands of behavioral inconsistencies in these systems.
Abstract: While Deep Neural Networks (DNNs) have established the fundamentals of DNN-based autonomous driving systems, they may exhibit erroneous behaviors and cause fatal accidents. To resolve the safety issues of autonomous driving systems, a recent set of testing techniques have been designed to automatically generate test cases, e.g., new input images transformed from the original ones. Unfortunately, many such generated input images often render inferior authenticity, lacking accurate semantic information of the driving scenes and hence compromising the resulting efficacy and reliability. In this paper, we propose DeepRoad, an unsupervised framework to automatically generate large amounts of accurate driving scenes to test the consistency of DNN-based autonomous driving systems across different scenes. In particular, DeepRoad delivers driving scenes with various weather conditions (including those with rather extreme conditions) by applying the Generative Adversarial Networks (GANs) along with the corresponding real-world weather scenes. Moreover, we have implemented DeepRoad to test three well-recognized DNN-based autonomous driving systems. Experimental results demonstrate that DeepRoad can detect thousands of behavioral inconsistencies in these systems.

Proceedings ArticleDOI
TL;DR: This work presents the solution approach, based on the concept of Metamorphic Testing, which aims to identify implementation bugs in ML based image classifiers and developed metamorphic relations for an application based on Support Vector Machine and a Deep Learning based application.
Abstract: We have recently witnessed tremendous success of Machine Learning (ML) in practical applications. Computer vision, speech recognition and language translation have all seen a near human level performance. We expect, in the near future, most business applications will have some form of ML. However, testing such applications is extremely challenging and would be very expensive if we follow today's methodologies. In this work, we present an articulation of the challenges in testing ML based applications. We then present our solution approach, based on the concept of Metamorphic Testing, which aims to identify implementation bugs in ML based image classifiers. We have developed metamorphic relations for an application based on Support Vector Machine and a Deep Learning based application. Empirical validation showed that our approach was able to catch 71% of the implementation bugs in the ML applications.

Journal ArticleDOI
TL;DR: In this article, a case study focused on how knowledge work is coordinated in large-scale agile development programs by providing a rich description of the coordination practices used and how these practices change over time in a four year development program with 12 development teams.
Abstract: Software development projects have undergone remarkable changes with the arrival of agile development methods. While intended for small, self-managing teams, these methods are increasingly used also for large development programs. A major challenge in programs is to coordinate the work of many teams, due to high uncertainty in tasks, a high degree of interdependence between tasks and because of the large number of people involved. This revelatory case study focuses on how knowledge work is coordinated in large-scale agile development programs by providing a rich description of the coordination practices used and how these practices change over time in a four year development program with 12 development teams. The main findings highlight the role of coordination modes based on feedback, the use of a number of mechanisms far beyond what is described in practitioner advice, and finally how coordination practices change over time. The findings are important to improve the outcome of large knowledge-based development programs by tailoring coordination practices to needs and ensuring adjustment over time.

Posted Content
TL;DR: In this paper, the authors investigate the impact of four popularly-used class rebalancing techniques on 10 commonly-used performance measures and the interpretation of defect prediction models and construct statistical models to better understand in which experimental design settings that class rebaling techniques are beneficial for defect prediction model.
Abstract: Defect prediction models that are trained on class imbalanced datasets (ie, the proportion of defective and clean modules is not equally represented) are highly susceptible to produce inaccurate prediction models Prior research compares the impact of class rebalancing techniques on the performance of defect prediction models Prior research efforts arrive at contradictory conclusions due to the use of different choice of datasets, classification techniques, and performance measures Such contradictory conclusions make it hard to derive practical guidelines for whether class rebalancing techniques should be applied in the context of defect prediction models In this paper, we investigate the impact of 4 popularly-used class rebalancing techniques on 10 commonly-used performance measures and the interpretation of defect prediction models We also construct statistical models to better understand in which experimental design settings that class rebalancing techniques are beneficial for defect prediction models Through a case study of 101 datasets that span across proprietary and open-source systems, we recommend that class rebalancing techniques are necessary when quality assurance teams wish to increase the completeness of identifying software defects (ie, Recall) However, class rebalancing techniques should be avoided when interpreting defect prediction models We also find that class rebalancing techniques do not impact the AUC measure Hence, AUC should be used as a standard measure when comparing defect prediction models

Proceedings ArticleDOI
TL;DR: In this article, the authors provide a comparative study between successful and unsuccessful pull requests made to 78 GitHub base projects by 20,142 developers from 103,192 forked projects, analyzing pull request discussion texts, project specific information (e.g., domain, maturity), and developer specific information(e. g., experience) in order to report useful insights.
Abstract: Given the increasing number of unsuccessful pull requests in GitHub projects, insights into the success and failure of these requests are essential for the developers. In this paper, we provide a comparative study between successful and unsuccessful pull requests made to 78 GitHub base projects by 20,142 developers from 103,192 forked projects. In the study, we analyze pull request discussion texts, project specific information (e.g., domain, maturity), and developer specific information (e.g., experience) in order to report useful insights, and use them to contrast between successful and unsuccessful pull requests. We believe our study will help developers overcome the issues with pull requests in GitHub, and project administrators with informed decision making.

Posted Content
TL;DR: A novel prediction model is developed which is capable of automatically learning features for representing source code and using them for defect prediction, built upon the powerful deep learning, tree-structured Long Short Term Memory network which directly matches with the Abstract Syntax Tree representation of source code.
Abstract: Defects are common in software systems and can potentially cause various problems to software users Different methods have been developed to quickly predict the most likely locations of defects in large code bases Most of them focus on designing features (eg complexity metrics) that correlate with potentially defective code Those approaches however do not sufficiently capture the syntax and different levels of semantics of source code, an important capability for building accurate prediction models In this paper, we develop a novel prediction model which is capable of automatically learning features for representing source code and using them for defect prediction Our prediction system is built upon the powerful deep learning, tree-structured Long Short Term Memory network which directly matches with the Abstract Syntax Tree representation of source code An evaluation on two datasets, one from open source projects contributed by Samsung and the other from the public PROMISE repository, demonstrates the effectiveness of our approach for both within-project and cross-project predictions

Proceedings ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a code reviewer recommendation technique that considers not only the relevant cross-project work history (e.g., external library experience) but also the experience of a developer in certain specialized technologies associated with a pull request for determining her expertise as a potential code reviewer.
Abstract: Peer code review locates common coding rule violations and simple logical errors in the early phases of software development, and thus reduces overall cost. However, in GitHub, identifying an appropriate code reviewer for a pull request is a non-trivial task given that reliable information for reviewer identification is often not readily available. In this paper, we propose a code reviewer recommendation technique that considers not only the relevant cross-project work history (e.g., external library experience) but also the experience of a developer in certain specialized technologies associated with a pull request for determining her expertise as a potential code reviewer. We first motivate our technique using an exploratory study with 10 commercial projects and 10 associated libraries external to those projects. Experiments using 17,115 pull requests from 10 commercial projects and six open source projects show that our technique provides 85%--92% recommendation accuracy, about 86% precision and 79%--81% recall in code reviewer recommendation, which are highly promising. Comparison with the state-of-the-art technique also validates the empirical findings and the superiority of our recommendation technique.

Posted Content
TL;DR: An exploratory study of CT on DL systems and proposes a set of coverage criteria for DL systems, as well as a CT coverage guided test generation technique, demonstrating that CT provides a promising avenue for testing DL systems.
Abstract: Deep learning (DL) has achieved remarkable progress over the past decade and been widely applied to many safety-critical applications. However, the robustness of DL systems recently receives great concerns, such as adversarial examples against computer vision systems, which could potentially result in severe consequences. Adopting testing techniques could help to evaluate the robustness of a DL system and therefore detect vulnerabilities at an early stage. The main challenge of testing such systems is that its runtime state space is too large: if we view each neuron as a runtime state for DL, then a DL system often contains massive states, rendering testing each state almost impossible. For traditional software, combinatorial testing (CT) is an effective testing technique to reduce the testing space while obtaining relatively high defect detection abilities. In this paper, we perform an exploratory study of CT on DL systems. We adapt the concept in CT and propose a set of coverage criteria for DL systems, as well as a CT coverage guided test generation technique. Our evaluation demonstrates that CT provides a promising avenue for testing DL systems. We further pose several open questions and interesting directions for combinatorial testing of DL systems.

Proceedings ArticleDOI
TL;DR: This paper introduces Retecs, a new method for automatically learning test case selection and prioritization in CI with the goal to minimize the round-trip time between code commits and developer feedback on failed test cases.
Abstract: Testing in Continuous Integration (CI) involves test case prioritization, selection, and execution at each cycle. Selecting the most promising test cases to detect bugs is hard if there are uncertainties on the impact of committed code changes or, if traceability links between code and tests are not available. This paper introduces Retecs, a new method for automatically learning test case selection and prioritization in CI with the goal to minimize the round-trip time between code commits and developer feedback on failed test cases. The Retecs method uses reinforcement learning to select and prioritize test cases according to their duration, previous last execution and failure history. In a constantly changing environment, where new test cases are created and obsolete test cases are deleted, the Retecs method learns to prioritize error-prone test cases higher under guidance of a reward function and by observing previous CI cycles. By applying Retecs on data extracted from three industrial case studies, we show for the first time that reinforcement learning enables fruitful automatic adaptive test case selection and prioritization in CI and regression testing.

Posted Content
TL;DR: In this article, the authors identify and investigate a practical bias caused by the fault localization (FL) step in a repair pipeline, and explore the performance variations that can be achieved by ''tweaking'' the FL step.
Abstract: Properly benchmarking Automated Program Repair (APR) systems should contribute to the development and adoption of the research outputs by practitioners. To that end, the research community must ensure that it reaches significant milestones by reliably comparing state-of-the-art tools for a better understanding of their strengths and weaknesses. In this work, we identify and investigate a practical bias caused by the fault localization (FL) step in a repair pipeline. We propose to highlight the different fault localization configurations used in the literature, and their impact on APR systems when applied to the Defects4J benchmark. Then, we explore the performance variations that can be achieved by `tweaking' the FL step. Eventually, we expect to create a new momentum for (1) full disclosure of APR experimental procedures with respect to FL, (2) realistic expectations of repairing bugs in Defects4J, as well as (3) reliable performance comparison among the state-of-the-art APR systems, and against the baseline performance results of our thoroughly assessed kPAR repair tool. Our main findings include: (a) only a subset of Defects4J bugs can be currently localized by commonly-used FL techniques; (b) current practice of comparing state-of-the-art APR systems (i.e., counting the number of fixed bugs) is potentially misleading due to the bias of FL configurations; and (c) APR authors do not properly qualify their performance achievement with respect to the different tuning parameters implemented in APR systems.

Proceedings ArticleDOI
TL;DR: SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks and connects SO posts to other platforms by aggregating URLs from text blocks and by collecting references from GitHub files to SO posts.
Abstract: Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and by collecting references from GitHub files to SO posts. In this paper, we describe how we built SOTorrent, and in particular how we evaluated 134 different string similarity metrics regarding their applicability for reconstructing the version history of text and code blocks. Based on a first analysis using the dataset, we present insights into the evolution of SO posts, e.g., that post edits are usually small, happen soon after the initial creation of the post, and that code is rarely changed without also updating the surrounding text. Further, our analysis revealed a close relationship between post edits and comments. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.