scispace - formally typeset
Search or ask a question

Showing papers in "arXiv: Software Engineering in 2021"


Journal ArticleDOI
TL;DR: A survey of the research literature that has addressed this topic in the period 1996–2006 is provided and a new classification framework is presented that represents an abstracted and synthesized view of the types of factors that have been asserted as influencing project outcomes.
Abstract: Determining the factors that have an influence on software systems development and deployment project outcomes has been the focus of extensive and ongoing research for more than 30 years. We provide here a survey of the research literature that has addressed this topic in the period 1996-2006, with a particular focus on empirical analyses. On the basis of this survey we present a new classification framework that represents an abstracted and synthesized view of the types of factors that have been asserted as influencing project outcomes.

199 citations


Proceedings ArticleDOI
TL;DR: CURE as mentioned in this paper pre-trains a programming language (PL) model on a large software codebase to learn developer-like source code before the APR task, and uses a subword tokenization technique to generate a smaller search space that contains more correct fixes.
Abstract: Automatic program repair (APR) is crucial to improve software reliability. Recently, neural machine translation (NMT) techniques have been used to fix software bugs automatically. While promising, these approaches have two major limitations. Their search space often does not contain the correct fix, and their search strategy ignores software knowledge such as strict code syntax. Due to these limitations, existing NMT-based techniques underperform the best template-based approaches. We propose CURE, a new NMT-based APR technique with three major novelties. First, CURE pre-trains a programming language (PL) model on a large software codebase to learn developer-like source code before the APR task. Second, CURE designs a new code-aware search strategy that finds more correct fixes by focusing on compilable patches and patches that are close in length to the buggy code. Finally, CURE uses a subword tokenization technique to generate a smaller search space that contains more correct fixes. Our evaluation on two widely-used benchmarks shows that CURE correctly fixes 57 Defects4J bugs and 26 QuixBugs bugs, outperforming all existing APR techniques on both benchmarks.

108 citations


Posted ContentDOI
TL;DR: In this paper, the authors investigate the impact of snoring on classifiers' accuracy and the effectiveness of a possible countermeasure, i.e., dropping too recent data from a training set.
Abstract: Defect prediction models can be beneficial to prioritize testing, analysis, or code review activities, and has been the subject of a substantial effort in academia, and some applications in industrial contexts. A necessary precondition when creating a defect prediction model is the availability of defect data from the history of projects. If this data is noisy, the resulting defect prediction model could result to be unreliable. One of the causes of noise for defect datasets is the presence of "dormant defects", i.e., of defects discovered several releases after their introduction. This can cause a class to be labeled as defect-free while it is not, and is, therefore "snoring". In this paper, we investigate the impact of snoring on classifiers' accuracy and the effectiveness of a possible countermeasure, i.e., dropping too recent data from a training set. We analyze the accuracy of 15 machine learning defect prediction classifiers, on data from more than 4,000 defects and 600 releases of 19 open source projects from the Apache ecosystem. Our results show that on average across projects: (i) the presence of dormant defects decreases the recall of defect prediction classifiers, and (ii) removing from the training set the classes that in the last release are labeled as not defective significantly improves the accuracy of the classifiers. In summary, this paper provides insights on how to create defects datasets by mitigating the negative effect of dormant defects on defect prediction.

106 citations


Journal ArticleDOI
TL;DR: In this paper, the authors examine multiple platform sponsors from an industrial manufacturing context and demarcate three platform archetypes: product platform, supply chain platform, and platform ecosystem, and find that each platform archetype is characterized by a specific innovation mechanism that contributes to the platform service discovery and expands the platform value.
Abstract: Industrial manufacturers increasingly develop digital platforms in the business-to-business (B2B) context. This emergent form of digital platforms requires a profound yet little understood holistic perspective that encompasses the co-evolution of platform architecture, platform services, and platform governance. To address this research gap, our study examines multiple platform sponsors from an industrial manufacturing context. The study demarcates three platform archetypes: product platform, supply chain platform, and platform ecosystem. We argue that each platform archetype involves a gradual development of platform architecture, platform services, and platform governance, which mirror each other. We also find that each platform archetype is characterized by a specific innovation mechanism that contributes to the platform service discovery and expands the platform value. Our study extends the co-evolution perspective of platform ecosystem literature and digital servitization literature.

45 citations


Posted Content
TL;DR: CodeXGLUE as mentioned in this paper is a benchmark dataset to foster machine learning research for program understanding and generation, which includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison.
Abstract: Benchmark datasets have a significant impact on accelerating research in programming language tasks In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems

38 citations


Journal ArticleDOI
TL;DR: In this paper, a mixed-methods study was conducted with 106 survey responses and 6 interviews from microservices practitioners to gain a deep understanding of how microservices systems are designed, monitored, and tested in the industry.
Abstract: Context: Microservices Architecture (MSA) has received significant attention in the software industry. However, little empirical evidence exists on design, monitoring, and testing of microservices systems. Objective: This research aims to gain a deep understanding of how microservices systems are designed, monitored, and tested in the industry. Method: A mixed-methods study was conducted with 106 survey responses and 6 interviews from microservices practitioners. Results: The main findings are: (1) a combination of domain-driven design and business capability is the most used strategy to decompose an application into microservices, (2) over half of the participants used architecture evaluation and architecture implementation when designing microservices systems, (3) API gateway and Backend for frontend patterns are the most used MSA patterns, (4) resource usage and load balancing as monitoring metrics, log management and exception tracking as monitoring practices are widely used, (5) unit and end-to-end testing are the most used testing strategies, and (6) the complexity of microservices systems poses challenges for their design, monitoring, and testing, for which there are no dedicated solutions. Conclusions: Our findings reveal that more research is needed to (1) deal with microservices complexity at the design level, (2) handle security in microservices systems, and (3) address the monitoring and testing challenges through dedicated solutions.

35 citations


Proceedings ArticleDOI
TL;DR: Xia et al. as discussed by the authors proposed a method to automatically collect and curate a comprehensive vulnerability dataset from Common Vulnerabilities and Exposures (CVE) records in the public National Vulnerability Database (NVD).
Abstract: Data-driven research on the automated discovery and repair of security vulnerabilities in source code requires comprehensive datasets of real-life vulnerable code and their fixes. To assist in such research, we propose a method to automatically collect and curate a comprehensive vulnerability dataset from Common Vulnerabilities and Exposures (CVE) records in the public National Vulnerability Database (NVD). We implement our approach in a fully automated dataset collection tool and share an initial release of the resulting vulnerability dataset named CVEfixes. The CVEfixes collection tool automatically fetches all available CVE records from the NVD, gathers the vulnerable code and corresponding fixes from associated open-source repositories, and organizes the collected information in a relational database. Moreover, the dataset is enriched with meta-data such as programming language, and detailed code and security metrics at five levels of abstraction. The collection can easily be repeated to keep up-to-date with newly discovered or patched vulnerabilities. The initial release of CVEfixes spans all published CVEs up to 9 June 2021, covering 5365 CVE records for 1754 open-source projects that were addressed in a total of 5495 vulnerability fixing commits. CVEfixes supports various types of data-driven software security research, such as vulnerability prediction, vulnerability classification, vulnerability severity prediction, analysis of vulnerability-related code changes, and automated vulnerability repair.

35 citations


Posted Content
TL;DR: A system, named ATVHunter, which can pinpoint the precise vulnerable in-app TPL versions and provide detailed information about the vulnerabilities and TPLs is proposed and Experimental results show AtVHunter outperforms state-of-the-art TPL detection tools.
Abstract: We propose a system, named ATVHunter, which can pinpoint the precise vulnerable in-app TPL versions and provide detailed information about the vulnerabilities and TPLs. We propose a two-phase detection approach to identify specific TPL versions. Specifically, we extract the Control Flow Graphs as the coarse-grained feature to match potential TPLs in the pre-defined TPL database, and then extract opcode in each basic block of CFG as the fine-grained feature to identify the exact TPL versions. We build a comprehensive TPL database (189,545 unique TPLs with 3,006,676 versions) as the reference database. Meanwhile, to identify the vulnerable in-app TPL versions, we also construct a comprehensive and known vulnerable TPL database containing 1,180 CVEs and 224 security bugs. Experimental results show ATVHunter outperforms state-of-the-art TPL detection tools, achieving 90.55% precision and 88.79% recall with high efficiency, and is also resilient to widely-used obfuscation techniques and scalable for large-scale TPL detection. Furthermore, to investigate the ecosystem of the vulnerable TPLs used by apps, we exploit ATVHunter to conduct a large-scale analysis on 104,446 apps and find that 9,050 apps include vulnerable TPL versions with 53,337 vulnerabilities and 7,480 security bugs, most of which are with high risks and are not recognized by app developers.

34 citations


Proceedings ArticleDOI
TL;DR: In this article, the authors explored two popular online developer communities, Stack Overflow (SO) and Reddit, to provide insights on the characteristics and challenges of low-code development from a practitioners' perspective.
Abstract: Background: In recent years, Low-code development (LCD) is growing rapidly, and Gartner and Forrester have predicted that the use of LCD is very promising. Giant companies, such as Microsoft, Mendix, and Outsystems have also launched their LCD platforms. Aim: In this work, we explored two popular online developer communities, Stack Overflow (SO) and Reddit, to provide insights on the characteristics and challenges of LCD from a practitioners' perspective. Method: We used two LCD related terms to search the relevant posts in SO and extracted 73 posts. Meanwhile, we explored three LCD related subreddits from Reddit and collected 228 posts. We extracted data from these posts and applied the Constant Comparison method to analyze the descriptions, benefits, and limitations and challenges of LCD. For platforms and programming languages used in LCD, implementation units in LCD, supporting technologies of LCD, types of applications developed by LCD, and domains that use LCD, we used descriptive statistics to analyze and present the results. Results: Our findings show that: (1) LCD may provide a graphical user interface for users to drag and drop with little or even no code; (2) the equipment of out-of-the-box units (e.g., APIs and components) in LCD platforms makes them easy to learn and use as well as speeds up the development; (3) LCD is particularly favored in the domains that have the need for automated processes and workflows; and (4) practitioners have conflicting views on the advantages and disadvantages of LCD. Conclusions: Our findings suggest that researchers should clearly define the terms when they refer to LCD, and developers should consider whether the characteristics of LCD are appropriate for their projects.

29 citations


Journal Article
TL;DR: This work uses the ASDL grammar of programming language to define the node and edge types of program graphs and uses heterogeneous graph neural networks to learn on these graphs, showing that leveraging the type information of nodes and edges in program graphs can help in learning program semantics.
Abstract: Program source code contains complex structure information, which can be represented in structured data forms like trees or graphs. To acquire the structural information in source code, most existing researches use abstract syntax trees (AST). A group of works add additional edges to ASTs to convert source code into graphs and use graph neural networks to learn representations for program graphs. Although these works provide additional control or data flow information to ASTs for downstream tasks, they neglect an important aspect of structure information in AST itself: the different types of nodes and edges. In ASTs, different nodes contain different kinds of information like variables or control flow, and the relation between a node and all its children can also be different. To address the information of node and edge types, we bring the idea of heterogeneous graph mining to learning on source code and present a new formula of building heterogeneous program graphs from ASTs with additional type information for nodes and edges. We use the ASDL grammar of programming language to define the node and edge types of program graphs. Then we use heterogeneous graph neural networks to learn on these graphs. We evaluate our approach on two tasks: code comment generation and method naming. Both tasks require reasoning on the semantics of complete code snippets. Experiment results show that our approach outperforms baseline models, including homogeneous graph-based models, showing that leveraging the type information of nodes and edges in program graphs can help in learning program semantics.

28 citations


Proceedings ArticleDOI
TL;DR: SIVAND as discussed by the authors uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model, and finds that the models in their experiments often rely heavily on just a few syntactic features in input programs.
Abstract: A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of "transparent/interpretable-AI". However, these approaches are often specific to a particular set of network architectures, even requiring access to the network's parameters. This makes them difficult to use for the average programmer, which hinders the reliable adoption of neural CI systems. In this paper, we propose a simple, model-agnostic approach to identify critical input features for models in CI systems, by drawing on software debugging research, specifically delta debugging. Our approach, SIVAND, uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model. We show that this approach yields remarkably small outputs and is broadly applicable across many model architectures and problem domains. We find that the models in our experiments often rely heavily on just a few syntactic features in input programs. We believe that SIVAND's extracted features may help understand neural CI systems' predictions and learned behavior.

Proceedings ArticleDOI
TL;DR: Comfort as discussed by the authors is a compiler fuzzing framework for detecting JS engine bugs and behaviors that deviate from the ECMAScript standard, which leverages the recent advance in deep learning-based language models.
Abstract: JavaScript (JS) is a popular, platform-independent programming language. To ensure the interoperability of JS programs across different platforms, the implementation of a JS engine should conform to the ECMAScript standard. However, doing so is challenging as there are many subtle definitions of API behaviors, and the definitions keep evolving. We present COMFORT, a new compiler fuzzing framework for detecting JS engine bugs and behaviors that deviate from the ECMAScript standard. COMFORT leverages the recent advance in deep learning-based language models to automatically generate JS test code. As a departure from prior fuzzers, COMFORT utilizes the well-structured ECMAScript specifications to automatically generate test data along with the test programs to expose bugs that could be overlooked by the developers or manually written test cases. COMFORT then applies differential testing methodologies on the generated test cases to expose standard conformance bugs. We apply COMFORT to ten mainstream JS engines. In 200 hours of automated concurrent testing runs, we discover bugs in all tested JS engines. We had identified 158 unique JS engine bugs, of which 129 have been verified, and 115 have already been fixed by the developers. Furthermore, 21 of the Comfort-generated test cases have been added to Test262, the official ECMAScript conformance test suite.

Journal ArticleDOI
TL;DR: The Reproducible Builds project as discussed by the authors is an approach that can determine whether generated binaries correspond with their original source code, in order to increase confidence in Free and Open Source Software (FOSS).
Abstract: Although it is possible to increase confidence in Free and Open Source Software (FOSS) by reviewing its source code, trusting code is not the same as trusting its executable counterparts. These are typically built and distributed by third-party vendors, with severe security consequences if their supply chains are compromised. In this paper, we present reproducible builds, an approach that can determine whether generated binaries correspond with their original source code. We first define the problem, and then provide insight into the challenges of making real-world software build in a "reproducible" manner-this is, when every build generates bit-for-bit identical results. Through the experience of the Reproducible Builds project making the Debian Linux distribution reproducible, we also describe the affinity between reproducibility and quality assurance (QA).

Proceedings ArticleDOI
TL;DR: In this paper, the authors present an in-depth case study by comparing the analysis reports of 9 industry-leading SCA tools on a large web application, OpenMRS, composed of Maven (Java) and npm (JavaScript) projects.
Abstract: Background: Modern software uses many third-party libraries and frameworks as dependencies. Known vulnerabilities in these dependencies are a potential security risk. Software composition analysis (SCA) tools, therefore, are being increasingly adopted by practitioners to keep track of vulnerable dependencies. Aim: The goal of this study is to understand the difference in vulnerability reporting by various SCA tools. Understanding if and how existing SCA tools differ in their analysis may help security practitioners to choose the right tooling and identify future research needs. Method: We present an in-depth case study by comparing the analysis reports of 9 industry-leading SCA tools on a large web application, OpenMRS, composed of Maven (Java) and npm (JavaScript) projects. Results: We find that the tools vary in their vulnerability reporting. The count of reported vulnerable dependencies ranges from 17 to 332 for Maven and from 32 to 239 for npm projects across the studied tools. Similarly, the count of unique known vulnerabilities reported by the tools ranges from 36 to 313 for Maven and from 45 to 234 for npm projects. Our manual analysis of the tools' results suggest that accuracy of the vulnerability database is a key differentiator for SCA tools. Conclusion: We recommend that practitioners should not rely on any single tool at the present, as that can result in missing known vulnerabilities. We point out two research directions in the SCA space: i) establishing frameworks and metrics to identify false positives for dependency vulnerabilities; and ii) building automation technologies for continuous monitoring of vulnerability data from open source package ecosystems.

Posted Content
TL;DR: In this article, the authors present a plugin for the IDE that implements a hybrid of code generation and code retrieval functionality, and orchestrate virtual environments to enable collection of many user events.
Abstract: A great part of software development involves conceptualizing or communicating the underlying procedures and logic that needs to be expressed in programs. One major difficulty of programming is turning concept into code, especially when dealing with the APIs of unfamiliar libraries. Recently, there has been a proliferation of machine learning methods for code generation and retrieval from natural language queries, but these have primarily been evaluated purely based on retrieval accuracy or overlap of generated code with developer-written code, and the actual effect of these methods on the developer workflow is surprisingly unattested. We perform the first comprehensive investigation of the promise and challenges of using such technology inside the IDE, asking "at the current state of technology does it improve developer productivity or accuracy, how does it affect the developer experience, and what are the remaining gaps and challenges?" We first develop a plugin for the IDE that implements a hybrid of code generation and code retrieval functionality, and orchestrate virtual environments to enable collection of many user events. We ask developers with various backgrounds to complete 14 Python programming tasks ranging from basic file manipulation to machine learning or data visualization, with or without the help of the plugin. While qualitative surveys of developer experience are largely positive, quantitative results with regards to increased productivity, code quality, or program correctness are inconclusive. Analysis identifies several pain points that could improve the effectiveness of future machine learning based code generation/retrieval developer assistants, and demonstrates when developers prefer code generation over code retrieval and vice versa. We release all data and software to pave the road for future empirical studies and development of better models.

Posted Content
TL;DR: Test case selection and prioritization (TSP) techniques have been proposed to improve regression testing by selecting and prioritizing test cases in order to provide early feedback to developers as discussed by the authors.
Abstract: Regression testing is an essential activity to assure that software code changes do not adversely affect existing functionalities. With the wide adoption of Continuous Integration (CI) in software projects, which increases the frequency of running software builds, running all tests can be time-consuming and resource-intensive. To alleviate that problem, Test case Selection and Prioritization (TSP) techniques have been proposed to improve regression testing by selecting and prioritizing test cases in order to provide early feedback to developers. In recent years, researchers have relied on Machine Learning (ML) techniques to achieve effective TSP (ML-based TSP). Such techniques help combine information about test cases, from partial and imperfect sources, into accurate prediction models. This work conducts a systematic literature review focused on ML-based TSP techniques, aiming to perform an in-depth analysis of the state of the art, thus gaining insights regarding future avenues of research. To that end, we analyze 29 primary studies published from 2006 to 2020, which have been identified through a systematic and documented process. This paper addresses five research questions addressing variations in ML-based TSP techniques and feature sets for training and testing ML models, alternative metrics used for evaluating the techniques, the performance of techniques, and the reproducibility of the published studies.

Journal ArticleDOI
TL;DR: In this paper, a conceptual framework is developed for understanding and analyzing evaluations of project success, both formal and informal, highlighting how different stakeholder perspectives influence the perceived outcome(s) of a project, and how project evaluations may differ between stakeholders and across time.
Abstract: Answering the call for alternative approaches to researching project management, we explore the evaluation of project success from a subjectivist perspective. An in-depth, longitudinal case study of information systems development in a large manufacturing company was used to investigate how various project stakeholders subjectively perceived the project outcome and what evaluation criteria they drew on in doing so. A conceptual framework is developed for understanding and analyzing evaluations of project success, both formal and informal. The framework highlights how different stakeholder perspectives influence the perceived outcome(s) of a project, and how project evaluations may differ between stakeholders and across time.

Proceedings ArticleDOI
TL;DR: TDCleaner as mentioned in this paper identifies obsolete TODO comments in software projects by automatically establishing training samples and leveraging three neural encoders to capture the semantic features of TODO comment, code change and commit message respectively.
Abstract: TODO comments are very widely used by software developers to describe their pending tasks during software development. However, after performing the task developers sometimes neglect or simply forget to remove the TODO comment, resulting in obsolete TODO comments. These obsolete TODO comments can confuse development teams and may cause the introduction of bugs in the future, decreasing the software's quality and maintainability. In this work, we propose a novel model, named TDCleaner (TODO comment Cleaner), to identify obsolete TODO comments in software projects. TDCleaner can assist developers in just-in-time checking of TODO comments status and avoid leaving obsolete TODO comments. Our approach has two main stages: offline learning and online prediction. During offline learning, we first automatically establish training samples and leverage three neural encoders to capture the semantic features of TODO comment, code change and commit message respectively. TDCleaner then automatically learns the correlations and interactions between different encoders to estimate the final status of the TODO comment. For online prediction, we check a TODO comment's status by leveraging the offline trained model to judge the TODO comment's likelihood of being obsolete. We built our dataset by collecting TODO comments from the top-10,000 Python and Java Github repositories and evaluated TDCleaner on them. Extensive experimental results show the promising performance of our model over a set of benchmarks. We also performed an in-the-wild evaluation with real-world software projects, we reported 18 obsolete TODO comments identified by TDCleaner to Github developers and 9 of them have already been confirmed and removed by the developers, demonstrating the practical usage of our approach.

Posted Content
TL;DR: Concerns of statisticians that the F1 is a problematic metric outside of an information retrieval context are reiterated, since they are concerned about both classes (defect-prone and not defect-prone units).
Abstract: Context: Software engineering researchers have undertaken many experiments investigating the potential of software defect prediction algorithms. Unfortunately, some widely used performance metrics are known to be problematic, most notably F1, but nevertheless F1 is widely used. Objective: To investigate the potential impact of using F1 on the validity of this large body of research. Method: We undertook a systematic review to locate relevant experiments and then extract all pairwise comparisons of defect prediction performance using F1 and the un-biased Matthews correlation coefficient (MCC). Results: We found a total of 38 primary studies. These contain 12,471 pairs of results. Of these, 21.95% changed direction when the MCC metric is used instead of the biased F1 metric. Unfortunately, we also found evidence suggesting that F1 remains widely used in software defect prediction research. Conclusions: We reiterate the concerns of statisticians that the F1 is a problematic metric outside of an information retrieval context, since we are concerned about both classes (defect-prone and not defect-prone units). This inappropriate usage has led to a substantial number (more than one fifth) of erroneous (in terms of direction) results. Therefore we urge researchers to (i) use an unbiased metric and (ii) publish detailed results including confusion matrices such that alternative analyses become possible.

Proceedings ArticleDOI
TL;DR: In this paper, the authors present a method to express and compare different definitions of accountability using Structural Causal Models (SCLMs) and present a small use case based on an autonomous car.
Abstract: Accountability is an often called for property of technical systems. It is a requirement for algorithmic decision systems, autonomous cyber-physical systems, and for software systems in general. As a concept, accountability goes back to the early history of Liberalism and is suggested as a tool to limit the use of power. This long history has also given us many, often slightly differing, definitions of accountability. The problem that software developers now face is to understand what accountability means for their systems and how to reflect it in a system's design. To enable the rigorous study of accountability in a system, we need models that are suitable for capturing such a varied concept. In this paper, we present a method to express and compare different definitions of accountability using Structural Causal Models. We show how these models can be used to evaluate a system's design and present a small use case based on an autonomous car.

Proceedings ArticleDOI
Anup K. Kalia1, Jin Xiao1, Rahul Krishna1, Saurabh Sinha1, Maja Vukovic1, Debasish Banerjee1 
TL;DR: Mono2Micro as mentioned in this paper performs spatio-temporal decomposition, leveraging well-defined business use cases and runtime call relations to create functionally cohesive partitioning of application classes.
Abstract: In migrating production workloads to cloud, enterprises often face the daunting task of evolving monolithic applications toward a microservice architecture. At IBM, we developed a tool called Mono2Micro to assist with this challenging task. Mono2Micro performs spatio-temporal decomposition, leveraging well-defined business use cases and runtime call relations to create functionally cohesive partitioning of application classes. Our preliminary evaluation of Mono2Micro showed promising results. How well does Mono2Micro perform against other decomposition techniques, and how do practitioners perceive the tool? This paper describes the technical foundations of Mono2Micro and presents results to answer these two questions. To answer the first question, we evaluated Mono2Micro against four existing techniques on a set of open-source and proprietary Java applications and using different metrics to assess the quality of decomposition and tool's efficiency. Our results show that Mono2Micro significantly outperforms state-of-the-art baselines in specific metrics well-defined for the problem domain. To answer the second question, we conducted a survey of twenty-one practitioners in various industry roles who have used Mono2Micro. This study highlights several benefits of the tool, interesting practitioner perceptions, and scope for further improvements. Overall, these results show that Mono2Micro can provide a valuable aid to practitioners in creating functionally cohesive and explainable microservice decompositions.

Journal ArticleDOI
TL;DR: In this paper, the authors leverage the mature social construct of incivility as a lens to understand confrontational conflicts in open source code review discussions and identify various causes and consequences of such uncivil communication.
Abstract: Code review is an important quality assurance activity for software development. Code review discussions among developers and maintainers can be heated and sometimes involve personal attacks and unnecessary disrespectful comments, demonstrating, therefore, incivility. Although incivility in public discussions has received increasing attention from researchers in different domains, the knowledge about the characteristics, causes, and consequences of uncivil communication is still very limited in the context of software development, and more specifically, code review. To address this gap in the literature, we leverage the mature social construct of incivility as a lens to understand confrontational conflicts in open source code review discussions. For that, we conducted a qualitative analysis on 1,545 emails from the Linux Kernel Mailing List (LKML) that were associated with rejected changes. We found that more than half 66.66% of the non-technical emails included uncivil features. Particularly, frustration, name calling, and impatience are the most frequent features in uncivil emails. We also found that there are civil alternatives to address arguments, while uncivil comments can potentially be made by any people when discussing any topic. Finally, we identified various causes and consequences of such uncivil communication. Our work serves as the first study about the phenomenon of in(civility) in open source software development, paving the road for a new field of research about collaboration and communication in the context of software engineering activities.

Posted Content
TL;DR: In a large study of 457 bugs in 46 open source C programs, it is found that dynamic slicing is more effective for programs with single fault, but statistical debugging performs better on multiple faults.
Abstract: Statistical fault localization is an easily deployed technique for quickly determining candidates for faulty code locations. If a human programmer has to search the fault beyond the top candidate locations, though, more traditional techniques of following dependencies along dynamic slices may be better suited. In a large study of 457 bugs (369 single faults and 88 multiple faults) in 46 open source C programs, we compare the effectiveness of statistical fault localization against dynamic slicing. For single faults, we find that dynamic slicing was eight percentage points more effective than the best performing statistical debugging formula; for 66% of the bugs, dynamic slicing finds the fault earlier than the best performing statistical debugging formula. In our evaluation, dynamic slicing is more effective for programs with single fault, but statistical debugging performs better on multiple faults. Best results, however, are obtained by a hybrid approach: If programmers first examine at most the top five most suspicious locations from statistical debugging, and then switch to dynamic slices, on average, they will need to examine 15% (30 lines) of the code. These findings hold for 18 most effective statistical debugging formulas and our results are independent of the number of faults (i.e. single or multiple faults) and error type (i.e. artificial or real errors).

Journal ArticleDOI
TL;DR: In this article, the state of the art on modern code review (MCR) is identified, providing a structured overview and an in-depth analysis of the research done in this field.
Abstract: Modern Code Review (MCR) is a widely known practice of software quality assurance. However, the existing body of knowledge of MCR is currently not understood as a whole. Objective: Our goal is to identify the state of the art on MCR, providing a structured overview and an in-depth analysis of the research done in this field. Method: We performed a systematic literature review, selecting publications from four digital libraries. Results: A total of 139 papers were selected and analyzed in three main categories. Foundational studies are those that analyze existing or collected data from the adoption of MCR. Proposals consist of techniques and tools to support MCR, while evaluations are studies to assess an approach or compare a set of them. Conclusion: The most represented category is foundational studies, mainly aiming to understand the motivations for adopting MCR, its challenges and benefits, and which influence factors lead to which MCR outcomes. The most common types of proposals are code reviewer recommender and support to code checking. Evaluations of MCR-supporting approaches have been done mostly offline, without involving human subjects. Five main research gaps have been identified, which point out directions for future work in the area.

Posted Content
TL;DR: CCAG as mentioned in this paper models the flattened sequence of a partial AST as an AST graph and uses graph attention block to capture different dependencies in the AST graph for representation learning in code completion.
Abstract: Code completion has become an essential component of integrated development environments. Contemporary code completion methods rely on the abstract syntax tree (AST) to generate syntactically correct code. However, they cannot fully capture the sequential and repetitive patterns of writing code and the structural information of the AST. To alleviate these problems, we propose a new code completion approach named CCAG, which models the flattened sequence of a partial AST as an AST graph. CCAG uses our proposed AST Graph Attention Block to capture different dependencies in the AST graph for representation learning in code completion. The sub-tasks of code completion are optimized via multi-task learning in CCAG, and the task balance is automatically achieved using uncertainty without the need to tune task weights. The experimental results show that CCAG has superior performance than state-of-the-art approaches and it is able to provide intelligent code completion.

Posted Content
TL;DR: BisaFinder as mentioned in this paper automatically curates suitable templates based on the pieces of text from a large corpus, using various Natural Language Processing (NLP) techniques to identify words that describe demographic characteristics.
Abstract: Artificial Intelligence (AI) software systems, such as Sentiment Analysis (SA) systems, typically learn from large amounts of data that may reflect human biases. Consequently, the machine learning model in such software systems may exhibit unintended demographic bias based on specific characteristics (e.g., gender, occupation, country-of-origin, etc.). Such biases manifest in an SA system when it predicts a different sentiment for similar texts that differ only in the characteristic of individuals described. Existing studies on revealing bias in SA systems rely on the production of sentences from a small set of short, predefined templates. To address this limitation, we present BisaFinder, an approach to discover biased predictions in SA systems via metamorphic testing. A key feature of BisaFinder is the automatic curation of suitable templates based on the pieces of text from a large corpus, using various Natural Language Processing (NLP) techniques to identify words that describe demographic characteristics. Next, BisaFinder instantiates new text from these templates by filling in placeholders with words associated with a class of a characteristic (e.g., gender-specific words such as female names, "she", "her"). These texts are used to tease out bias in an SA system. BisaFinder identifies a bias-uncovering test case when it detects that the SA system exhibits demographic bias for a pair of texts, i.e., it predicts a different sentiment for texts that differ only in words associated with a different class (e.g., male vs. female) of a target characteristic (e.g., gender). Our empirical evaluation showed that BisaFinder can effectively create a large number of realistic and diverse test cases that uncover various biases in an SA system with a high true positive rate of up to 95.8\%.

Proceedings ArticleDOI
TL;DR: Fuzzy-SAT as mentioned in this paper uses Satisfiability Modulo Theories (SMT) solvers to check whether symbolic formulas are satisfiable in the context of concolic and hybrid fuzzing engines, providing a viable alternative to classic SMT solving techniques.
Abstract: Recent years have witnessed a wide array of results in software testing, exploring different approaches and methodologies ranging from fuzzers to symbolic engines, with a full spectrum of instances in between such as concolic execution and hybrid fuzzing. A key ingredient of many of these tools is Satisfiability Modulo Theories (SMT) solvers, which are used to reason over symbolic expressions collected during the analysis. In this paper, we investigate whether techniques borrowed from the fuzzing domain can be applied to check whether symbolic formulas are satisfiable in the context of concolic and hybrid fuzzing engines, providing a viable alternative to classic SMT solving techniques. We devise a new approximate solver, FUZZY-SAT, and show that it is both competitive with and complementary to state-of-the-art solvers such as Z3 with respect to handling queries generated by hybrid fuzzers.

Posted Content
TL;DR: This work evaluates the memorization and generalization tendencies in neural code intelligence models through a case study across several benchmarks and model families by leveraging established approaches from other fields that use DNNs, such as introducing targeted noise into the training dataset.
Abstract: Deep Neural Networks (DNN) are increasingly commonly used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, training DNNs means walking a knife's edges, because their large capacity also renders them prone to memorizing data points. While traditionally thought of as an aspect of over-training, recent work suggests that the memorization risk manifests especially strongly when the training datasets are noisy and memorization is the only recourse. Unfortunately, most code intelligence tasks rely on rather noise-prone and repetitive data sources, such as GitHub, which, due to their sheer size, cannot be manually inspected and evaluated. We evaluate the memorization and generalization tendencies in neural code intelligence models through a case study across several benchmarks and model families by leveraging established approaches from other fields that use DNNs, such as introducing targeted noise into the training dataset. In addition to reinforcing prior general findings about the extent of memorization in DNNs, our results shed light on the impact of noisy dataset in training.

Posted Content
TL;DR: In this paper, a deep learning model is proposed to assist a modeler in the design of a metamodel by recommending relevant domain concepts in several modeling scenarios, such as concept renaming.
Abstract: The design of conceptually sound metamodels that embody proper semantics in relation to the application domain is particularly tedious in Model-Driven Engineering. As metamodels define complex relationships between domain concepts, it is crucial for a modeler to define these concepts thoroughly while being consistent with respect to the application domain. We propose an approach to assist a modeler in the design of a metamodel by recommending relevant domain concepts in several modeling scenarios. Our approach does not require to extract knowledge from the domain or to hand-design completion rules. Instead, we design a fully data-driven approach using a deep learning model that is able to abstract domain concepts by learning from both structural and lexical metamodel properties in a corpus of thousands of independent metamodels. We evaluate our approach on a test set containing 166 metamodels, unseen during the model training, with more than 5000 test samples. Our preliminary results show that the trained model is able to provide accurate top-$5$ lists of relevant recommendations for concept renaming scenarios. Although promising, the results are less compelling for the scenario of the iterative construction of the metamodel, in part because of the conservative strategy we use to evaluate the recommendations.

Posted Content
TL;DR: In this paper, the authors provide an empirical analysis of the defect labels created with the SZZ algorithm and the impact of commonly used features on results, finding that only half of the bug fixing commits determined by SZZ are actually bug fixing.
Abstract: Context: The SZZ algorithm is the de facto standard for labeling bug fixing commits and finding inducing changes for defect prediction data. Recent research uncovered potential problems in different parts of the SZZ algorithm. Most defect prediction data sets provide only static code metrics as features, while research indicates that other features are also important. Objective: We provide an empirical analysis of the defect labels created with the SZZ algorithm and the impact of commonly used features on results. Method: We used a combination of manual validation and adopted or improved heuristics for the collection of defect data. We conducted an empirical study on 398 releases of 38 Apache projects. Results: We found that only half of the bug fixing commits determined by SZZ are actually bug fixing. If a six-month time frame is used in combination with SZZ to determine which bugs affect a release, one file is incorrectly labeled as defective for every file that is correctly labeled as defective. In addition, two defective files are missed. We also explored the impact of the relatively small set of features that are available in most defect prediction data sets, as there are multiple publications that indicate that, e.g., churn related features are important for defect prediction. We found that the difference of using more features is not significant. Conclusion: Problems with inaccurate defect labels are a severe threat to the validity of the state of the art of defect prediction. Small feature sets seem to be a less severe threat.