Showing papers in "ACM Transactions on Software Engineering and Methodology in 2022"


Journal ArticleDOI
TL;DR: Tests that fail inconsistently, without changes to the code under test, are referred to as flaky tests. They do not give a clear indication of the presence of software bugs and thus limit the reliab...
Abstract: Tests that fail inconsistently, without changes to the code under test, are described as flaky. Flaky tests do not give a clear indication of the presence of software bugs and thus limit the reliab...

39 citations


Journal ArticleDOI
TL;DR: In this paper, the authors conducted a systematic mapping study on software engineering approaches for building, operating, and maintaining AI-based systems and identified multiple SE approaches for AI-based systems, which they classified according to the SWEBOK areas.
Abstract: AI-based systems are software systems with functionalities enabled by at least one AI component (e.g., for image- and speech-recognition, and autonomous driving). AI-based systems are becoming pervasive in society due to advances in AI. However, there is limited synthesized knowledge on Software Engineering (SE) approaches for building, operating, and maintaining AI-based systems. To collect and analyze state-of-the-art knowledge about SE for AI-based systems, we conducted a systematic mapping study. We considered 248 studies published between January 2010 and March 2020. SE for AI-based systems is an emerging research area, where more than 2/3 of the studies have been published since 2018. The most studied properties of AI-based systems are dependability and safety. We identified multiple SE approaches for AI-based systems, which we classified according to the SWEBOK areas. Studies related to software testing and software quality are very prevalent, while areas like software maintenance seem neglected. Data-related issues are the most recurrent challenges. Our results are valuable for: researchers, to quickly understand the state of the art and learn which topics need more research; practitioners, to learn about the approaches and challenges that SE entails for AI-based systems; and, educators, to bridge the gap among SE and AI in their curricula.

24 citations


Journal ArticleDOI
TL;DR: This study implements the idea of context-aware code change embedding considering program structures for patch correctness assessment as Cache and demonstrates that it can achieve overall higher performance than existing APCA techniques while even being more precise than certain dynamic ones including PATCH-SIM.
Abstract: Despite the capability in successfully fixing more and more real-world bugs, existing Automated Program Repair (APR) techniques are still challenged by the long-standing overfitting problem (i.e., a generated patch that passes all tests is actually incorrect). Plenty of approaches have been proposed for automated patch correctness assessment (APCA). Nonetheless, dynamic ones (i.e., those that needed to execute tests) are time-consuming while static ones (i.e., those built on top of static code features) are less precise. Therefore, embedding techniques have been proposed recently, which assess patch correctness via embedding token sequences extracted from the changed code of a generated patch. However, existing techniques rarely considered the context information and program structures of a generated patch, which are crucial for patch correctness assessment as revealed by existing studies. In this study, we explore the idea of context-aware code change embedding considering program structures for patch correctness assessment. Specifically, given a patch, we not only focus on the changed code but also take the correlated unchanged part into consideration, through which the context information can be extracted and leveraged. We then utilize the AST path technique for representation where the structure information from AST node can be captured. Finally, based on several pre-defined heuristics, we build a deep learning based classifier to predict the correctness of the patch. We implemented this idea as Cache and performed extensive experiments to assess its effectiveness. 
Our results demonstrate that Cache can (1) perform better than previous representation learning based techniques (e.g., Cache relatively outperforms existing techniques by ≈6%, ≈3%, and ≈16%, respectively under three diverse experiment settings), and (2) achieve overall higher performance than existing APCA techniques while even being more precise than certain dynamic ones including PATCH-SIM (92.9% vs. 83.0%). Further results reveal that the context information and program structures leveraged by Cache contributed significantly to its outstanding performance.

23 citations


Journal ArticleDOI
TL;DR: Security patches in open source software, which provide security fixes to identified vulnerabilities, are crucial in protecting against cyber attacks. Security advisories and announcements are often pu...
Abstract: Security patches in open source software, providing security fixes to identified vulnerabilities, are crucial in protecting against cyber attacks. Security advisories and announcements are often pu...

19 citations


Journal ArticleDOI
TL;DR: The key finding is that weighted search reaches a certain level of solution quality by consuming relatively fewer resources at the early stage of the search; however, Pareto search is significantly better than its weighted counterpart the majority of the time, as long as a sufficient, but not unrealistic, search budget is allowed.
Abstract: In the presence of multiple objectives to be optimized in Search-Based Software Engineering (SBSE), Pareto search has been commonly adopted. It searches for a good approximation of the problem’s Pareto-optimal solutions, from which the stakeholders choose the most preferred solution according to their preferences. However, when clear preferences of the stakeholders (e.g., a set of weights that reflect relative importance between objectives) are available prior to the search, weighted search is believed to be the first choice, since it simplifies the search via converting the original multi-objective problem into a single-objective one and enables the search to focus only on what the stakeholders are interested in. This article questions such a “weighted search first” belief. We show that the weights can, in fact, be harmful to the search process even in the presence of clear preferences. Specifically, we conduct a large-scale empirical study that consists of 38 systems/projects from three representative SBSE problems, together with two types of search budget and nine sets of weights, leading to 604 cases of comparisons. Our key finding is that weighted search reaches a certain level of solution quality by consuming relatively fewer resources at the early stage of the search; however, Pareto search is significantly better than its weighted counterpart the majority of the time (up to 77% of the cases), as long as we allow a sufficient, but not unrealistic search budget. This is a beneficial result, as it discovers a potentially new “rule-of-thumb” for the SBSE community: Even when clear preferences are available, it is recommended to always consider Pareto search by default for multi-objective SBSE problems, provided that solution quality is more important. Weighted search, in contrast, should only be preferred when the resource/search budget is limited, especially for expensive SBSE problems. 
This, together with other findings and actionable suggestions in the article, allows us to codify pragmatic and comprehensive guidance on choosing weighted and Pareto search for SBSE under the circumstance that clear preferences are available. All code and data can be accessed at https://github.com/ideas-labo/pareto-vs-weight-for-sbse.
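
The difference between the two search styles described in this abstract can be illustrated with a minimal sketch. It is not the authors' code (that lives in the repository linked above); the candidate values, the weights, and all function names are hypothetical, and every objective is assumed to be minimized.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, ...]  # assumption: all objectives are minimized


def weighted_score(objs: Objectives, weights: Sequence[float]) -> float:
    """Collapse several objectives into one number using stakeholder weights."""
    return sum(w * o for w, o in zip(weights, objs))


def dominates(a: Objectives, b: Objectives) -> bool:
    """a Pareto-dominates b: no worse on every objective, strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates: List[Objectives]) -> List[Objectives]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical (cost, latency) values for four candidate solutions.
candidates = [(1.0, 9.0), (4.0, 4.0), (9.0, 1.0), (6.0, 6.0)]
weights = (0.5, 0.5)

# Weighted search compares candidates through a single scalar.
best_weighted = min(candidates, key=lambda c: weighted_score(c, weights))

# Pareto search keeps every non-dominated trade-off.
front = pareto_front(candidates)
```

With a Pareto approach, the weights can still be applied afterwards, for example `min(front, key=lambda c: weighted_score(c, weights))`, so the stakeholder preference is used to pick from the front rather than to steer the search.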

17 citations


Journal ArticleDOI
TL;DR: The evaluation shows that Crucial obtains superior or comparable performance to Apache Spark at similar cost (18%–40% faster) and can rival the performance of a single-machine, multi-threaded implementation of a complex coordination problem.
Abstract: Serverless computing greatly simplifies the use of cloud resources. In particular, Function-as-a-Service (FaaS) platforms enable programmers to develop applications as individual functions that can run and scale independently. Unfortunately, applications that require fine-grained support for mutable state and synchronization, such as machine learning (ML) and scientific computing, are notoriously hard to build with this new paradigm. In this work, we aim at bridging this gap. We present Crucial, a system to program highly-parallel stateful serverless applications. Crucial retains the simplicity of serverless computing. It is built upon the key insight that FaaS resembles concurrent programming at the scale of a datacenter. Accordingly, a distributed shared memory layer is the natural answer to the need for fine-grained state management and synchronization. Crucial allows a multi-threaded code base to be ported effortlessly to serverless, where it can benefit from the scalability and pay-per-use model of FaaS platforms. We validate Crucial with the help of micro-benchmarks and by considering various stateful applications. Beyond classical parallel tasks (e.g., a Monte Carlo simulation), these applications include representative ML algorithms such as k-means and logistic regression. Our evaluation shows that Crucial obtains superior or comparable performance to Apache Spark at similar cost (18%–40% faster). We also use Crucial to port (part of) a state-of-the-art multi-threaded ML library to serverless. The ported application is up to 30% faster than with a dedicated high-end server. Finally, we attest that Crucial can rival the performance of a single-machine, multi-threaded implementation of a complex coordination problem. Overall, Crucial delivers all these benefits with less than 6% of changes in the code bases of the evaluated applications.

17 citations


Journal ArticleDOI
TL;DR: This study considered the complexity of the security decision space of developers using theory from cognitive and social psychology to provide conceptual underpinnings for three categories of impediments to achieving security goals and suggests “adaptive security interventions” as a solution that responds to the changing security needs of individual developers.
Abstract: Despite the availability of various methods and tools to facilitate secure coding, developers continue to write code that contains common vulnerabilities. It is important to understand why technological advances do not sufficiently facilitate developers in writing secure code. In order to widen our understanding of developers' behaviour, we considered the complexity of the security decision space of developers using theory from cognitive and social psychology. Our interdisciplinary study reported in this paper (1) draws on the psychology literature to provide conceptual underpinnings for three categories of impediments to achieving security goals, (2) reports on an in-depth meta-analysis of existing software security literature which identified a catalogue of factors that influence developers' security decisions, and (3) characterises the landscape of existing security interventions that are available to the developer during coding and identifies gaps. Collectively, these show that different forms of impediments to achieving security goals arise from different contributing factors. Interventions will be more effective where they reflect psychological factors more sensitively and marry technical sophistication, psychological frameworks, and usability. Our analysis suggests 'adaptive security interventions' as a solution that responds to the changing security needs of individual developers, and we present a proof-of-concept tool to substantiate our suggestion.

15 citations


Journal ArticleDOI
TL;DR: The empirical results indicate that tuning specific hyperparameters has heterogeneous impact on the performance of DNN models across different models and different performance properties, and that model optimization has a confounding effect on the impact of hyperparameters on DNN model performance.
Abstract: Deep neural network (DNN) models typically have many hyperparameters that can be configured to achieve optimal performance on a particular dataset. Practitioners usually tune the hyperparameters of their DNN models by training a number of trial models with different configurations of the hyperparameters, to find the optimal hyperparameter configuration that maximizes the training accuracy or minimizes the training loss. As such hyperparameter tuning usually focuses on the model accuracy or the loss function, it is not clear and remains under-explored how the process impacts other performance properties of DNN models, such as inference latency and model size. On the other hand, standard DNN models are often large in size and computing-intensive, prohibiting them from being directly deployed in resource-bounded environments such as mobile devices and Internet of Things (IoT) devices. To tackle this problem, various model optimization techniques (e.g., pruning or quantization) are proposed to make DNN models smaller and less computing-intensive so that they are better suited for resource-bounded environments. However, it is neither clear how the model optimization techniques impact other performance properties of DNN models such as inference latency and battery consumption, nor how the model optimization techniques impact the effect of hyperparameter tuning (i.e., the compounding effect). Therefore, in this paper, we perform a comprehensive study on four representative and widely-adopted DNN models, i.e., CNN image classification, Resnet-50, CNN text classification, and LSTM sentiment classification, to investigate how different DNN model hyperparameters affect the standard DNN models, as well as how the hyperparameter tuning combined with model optimization affect the optimized DNN models, in terms of various performance properties (e.g., inference latency or battery consumption). 
Our empirical results indicate that tuning specific hyperparameters has heterogeneous impact on the performance of DNN models across different models and different performance properties. In particular, although the top tuned DNN models usually have very similar accuracy, they may have significantly different performance in terms of other aspects (e.g., inference latency). We also observe that model optimization has a confounding effect on the impact of hyperparameters on DNN model performance. For example, two sets of hyperparameters may result in standard models with similar performance but their performance may become significantly different after they are optimized and deployed on the mobile device. Our findings highlight that practitioners can benefit from paying attention to a variety of performance properties and the confounding effect of model optimization when tuning and optimizing their DNN models.

15 citations


Journal ArticleDOI
TL;DR: DAT, a novel distribution-aware test selection metric that effectively alleviates the impact of distribution shifts and outperforms the compared metrics by up to five times and 30.09% accuracy improvement for model enhancement on simulated and in-the-wild distribution shift scenarios, respectively.
Abstract: Similar to traditional software that is constantly under evolution, deep neural networks need to evolve upon the rapid growth of test data for continuous enhancement (e.g., adapting to distribution shift in a new environment for deployment). However, it is labor intensive to manually label all of the collected test data. Test selection solves this problem by strategically choosing a small set to label. Via retraining with the selected set, deep neural networks will achieve competitive accuracy. Unfortunately, existing selection metrics involve three main limitations: (1) using different retraining processes, (2) ignoring data distribution shifts, and (3) being insufficiently evaluated. To fill this gap, we first conduct a systematic empirical study to reveal the impact of the retraining process and data distribution on model enhancement. Then based on our findings, we propose DAT, a novel distribution-aware test selection metric. Experimental results reveal that retraining using both the training and selected data outperforms using only the selected data. None of the selection metrics perform the best under various data distributions. By contrast, DAT effectively alleviates the impact of distribution shifts and outperforms the compared metrics by up to five times and 30.09% accuracy improvement for model enhancement on simulated and in-the-wild distribution shift scenarios, respectively.

14 citations


Journal ArticleDOI
TL;DR: In this paper, the authors develop a plugin for the PyCharm IDE that implements a hybrid of code generation and code retrieval functionality, and orchestrate virtual environments to enable collection of many user events (e.g., web browsing, keystrokes, fine-grained code edits).
Abstract: A great part of software development involves conceptualizing or communicating the underlying procedures and logic that needs to be expressed in programs. One major difficulty of programming is turning concept into code, especially when dealing with the APIs of unfamiliar libraries. Recently, there has been a proliferation of machine learning methods for code generation and retrieval from natural language queries, but these have primarily been evaluated purely based on retrieval accuracy or overlap of generated code with developer-written code, and the actual effect of these methods on the developer workflow is surprisingly unattested. In this article, we perform the first comprehensive investigation of the promise and challenges of using such technology inside the PyCharm IDE, asking, “At the current state of technology does it improve developer productivity or accuracy, how does it affect the developer experience, and what are the remaining gaps and challenges?” To facilitate the study, we first develop a plugin for the PyCharm IDE that implements a hybrid of code generation and code retrieval functionality, and we orchestrate virtual environments to enable collection of many user events (e.g., web browsing, keystrokes, fine-grained code edits). We ask developers with various backgrounds to complete 7 varieties of 14 Python programming tasks ranging from basic file manipulation to machine learning or data visualization, with or without the help of the plugin. While qualitative surveys of developer experience are largely positive, quantitative results with regard to increased productivity, code quality, or program correctness are inconclusive. Further analysis identifies several pain points that could improve the effectiveness of future machine learning-based code generation/retrieval developer assistants and demonstrates when developers prefer code generation over code retrieval and vice versa. 
We release all data and software to pave the road for future empirical studies on this topic, as well as development of better code generation models.

13 citations


Journal ArticleDOI
TL;DR: This work conducts a large-scale study of 1,000 bugs from four popular and diverse DL frameworks and obtains 12 major findings for a comprehensive understanding of DL framework bugs and the current status of existing DL framework testing practice.
Abstract: DL frameworks are the basis of constructing all DL programs and models, and thus their bugs could lead to the unexpected behaviors of any DL program or model relying on them. Such a wide effect demonstrates the necessity and importance of guaranteeing DL frameworks’ quality. Understanding the characteristics of DL framework bugs is a fundamental step for this quality assurance task, facilitating designing effective bug detection and debugging approaches. Hence, in this work we conduct the most large-scale study on 1,000 bugs from four popular and diverse DL frameworks (i.e., TensorFlow, PyTorch, MXNet, and DL4J). By analyzing the root causes and symptoms of DL framework bugs associated with 5 components decomposed from DL frameworks, as well as measuring test coverage achieved by three state-of-the-art testing techniques, we obtain 12 major findings for the comprehensive understanding of DL framework bugs and the current status of existing DL framework testing practice, and then provide a series of actionable guidelines for better DL framework bug detection and debugging. Finally, based on the guidelines, we design and implement a prototype DL-framework testing tool, called TenFuzz, which is evaluated to be effective and finds 3 unknown bugs on the latest TensorFlow framework in a preliminary study, indicating the significance of our guidelines.

Journal ArticleDOI
TL;DR: This article proposes a method to automatically generate system-level test cases for REST web services using search techniques, and shows how this method can be used to improve the quality of search-based testing of ERP systems.
Abstract: REST web services are widely popular in industry, and search techniques have been successfully used to automatically generate system-level test cases for those systems. In this article, we propose ...

Journal ArticleDOI
TL;DR: A taxonomy of the faults found based on the manual analysis of 415 faults in the eight case studies is presented and a method to support the classification using clustering of the resulting test cases is proposed.
Abstract: RESTful web services are often used for building a wide variety of enterprise applications. The diversity and increased number of applications using RESTful APIs means that increasing amounts of resources are spent developing and testing these systems. Automation in test data generation provides a useful way of generating test data in a fast and efficient manner. However, automated test generation often results in large test suites that are hard to evaluate and investigate manually. This article proposes a taxonomy of the faults we have found using search-based software testing techniques applied on RESTful APIs. The taxonomy is a first step in understanding, analyzing, and ultimately fixing software faults in web services and enterprise applications. We propose to apply a density-based clustering algorithm to the test cases evolved during the search to allow a better separation between different groups of faults. This is needed to enable engineers to highlight and focus on the most serious faults. Tests were automatically generated for a set of eight case studies, seven open-source and one industrial. The test cases generated during the search are clustered based on the reported last executed line and based on the error messages returned, when such error messages were available. The tests were manually evaluated to determine their root causes and to obtain additional information. The article presents a taxonomy of the faults found based on the manual analysis of 415 faults in the eight case studies and proposes a method to support the classification using clustering of the resulting test cases.

Journal ArticleDOI
TL;DR: In this paper, the authors define the overall process of knowledge graph development and its key constituent steps and propose a unified approach for both researchers and practitioners when constructing and managing knowledge graphs.
Abstract: Knowledge graphs are widely used in industry and studied within the academic community. However, the models applied in the development of knowledge graphs vary. Analysing and providing a synthesis of the commonly used approaches to knowledge graph development would provide researchers and practitioners a better understanding of the overall process and methods involved. Hence, this article aims at defining the overall process of knowledge graph development and its key constituent steps. For this purpose, a systematic review and a conceptual analysis of the literature was conducted. The resulting process was compared to case studies to evaluate its applicability. The proposed process suggests a unified approach and provides guidance for both researchers and practitioners when constructing and managing knowledge graphs.

Journal ArticleDOI
TL;DR: COTest first adopts machine-learning (the XGBoost algorithm) to model the relationship between test programs and optimization settings, to predict the bug-triggering probability of a test program under an optimization setting, and designs a diversity augmentation strategy to select a set of diverse candidate optimization settings for prediction for a test program.
Abstract: Compilers are a kind of important software, and similar to the quality assurance of other software, compiler testing is one of the most widely-used ways of guaranteeing their quality. Compiler bugs tend to occur in compiler optimizations. Detecting optimization bugs needs to consider two main factors: (1) the optimization flags controlling the accessibility of the compiler buggy code should be turned on; and (2) the test program should be able to trigger the buggy code. However, existing compiler testing approaches only consider the latter to generate effective test programs, but just run them under several pre-defined optimization levels (e.g., -O0, -O1, -O2, -O3, -Os in GCC). To better understand the influence of compiler optimizations on compiler testing, we conduct the first empirical study, and find that (1) all the bugs detected under the widely-used optimization levels are also detected under the explored optimization settings (we call a combination of optimization flags turned on for compilation an optimization setting), while 83.54% of bugs are only detected under the latter; (2) there exist both inhibition effect and promotion effect among optimization flags for compiler testing, indicating the necessity and challenges of considering the factor of compiler optimizations in compiler testing. We then propose the first approach, called COTest, by considering both factors to test compilers. Specifically, COTest first adopts machine-learning (the XGBoost algorithm) to model the relationship between test programs and optimization settings, to predict the bug-triggering probability of a test program under an optimization setting. Then, it designs a diversity augmentation strategy to select a set of diverse candidate optimization settings for prediction for a test program. Finally, Top-K optimization settings are selected for compiler testing according to the predicted bug-triggering probabilities. 
The experiments on GCC and LLVM demonstrate its effectiveness; in particular, COTest detects 17 previously unknown bugs, 11 of which have been fixed or confirmed by developers.
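
The final Top-K step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: `select_top_k` and `toy_predictor` are hypothetical names, the predictor is a stand-in for the trained XGBoost model (which is not reproduced here), and the diversity augmentation step that produces the candidate settings is assumed to have already run.

```python
import heapq
from typing import Callable, FrozenSet, List

Setting = FrozenSet[str]  # one optimization setting = the set of flags turned on


def select_top_k(
    program: str,
    candidate_settings: List[Setting],
    predict_probability: Callable[[str, Setting], float],
    k: int,
) -> List[Setting]:
    """Return the k settings with the highest predicted bug-triggering probability."""
    return heapq.nlargest(
        k, candidate_settings, key=lambda s: predict_probability(program, s)
    )


def toy_predictor(program: str, setting: Setting) -> float:
    """Hypothetical stand-in for the trained model: more flags, higher score."""
    return len(setting) / 10.0


# Candidate settings built from a few GCC optimization flags.
candidates = [
    frozenset({"-ftree-vectorize"}),
    frozenset({"-finline-functions", "-funroll-loops"}),
    frozenset({"-ftree-vectorize", "-finline-functions", "-funroll-loops"}),
]
chosen = select_top_k("test_program.c", candidates, toy_predictor, k=2)
```

Each setting in `chosen` would then be used to compile and run the test program, in place of the fixed -O levels.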

Journal ArticleDOI
Zhen Yang, Jacky Keung, Xiao Yu, Yan Xiao, Zhi Jin 
TL;DR: A composite approach named CBS (i.e., Classifying Before Synchronizing) is proposed to further improve the code-comment synchronization performance, which combines the advantages of CUP and HebCUP with the assistance of inferred categories of Code-Comment Inconsistent (CCI) samples.
Abstract: Software comments sometimes are not promptly updated in sync when the associated code is changed. The inconsistency between code and comments may mislead the developers and result in future bugs. Thus, studies on code-comment synchronization, which aims to automatically synchronize comments with code changes, have become highly important. Existing code-comment synchronization approaches mainly fall into two types, i.e., (1) deep learning-based (e.g., CUP), and (2) heuristic-based (e.g., HebCUP). The former constructs a neural machine translation-structured semantic model, which has a more generalized capability on synchronizing comments with software evolution and growth. In contrast, the latter designs a series of rules for performing token-level replacements on old comments, which can generate the completely correct comments for the samples fully covered by their fine-designed heuristic rules. In this article, we propose a composite approach named CBS (i.e., Classifying Before Synchronizing) to further improve the code-comment synchronization performance, which combines the advantages of CUP and HebCUP with the assistance of inferred categories of Code-Comment Inconsistent (CCI) samples. Specifically, we first define two categories (i.e., heuristic-prone and non-heuristic-prone) for CCI samples and propose five features to assist category prediction. The samples whose comments can be correctly synchronized by HebCUP are heuristic-prone, while others are non-heuristic-prone. Then, CBS employs our proposed Multi-Subsets Ensemble Learning (MSEL) classification algorithm to alleviate the class imbalance problem and construct the category prediction model. Next, CBS uses the trained MSEL to predict the category of the new sample. If the predicted category is heuristic-prone, CBS employs HebCUP to conduct the code-comment synchronization for the sample; otherwise, CBS allocates CUP to handle it. 
Our extensive experiments demonstrate that CBS statistically significantly outperforms CUP and HebCUP, and obtains an average improvement of 23.47%, 22.84%, 3.04%, 3.04%, 1.64%, and 19.39% in terms of Accuracy, Recall@5, Average Edit Distance (AED), Relative Edit Distance (RED), BLEU-4, and Effective Synchronized Sample (ESS) ratio, respectively, which highlights that category prediction for CCI samples can boost the code-comment synchronization performance.
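
The "classify, then dispatch" control flow described in this abstract might look roughly like the sketch below. None of it is taken from CBS, CUP, or HebCUP: the classifier and both synchronizers are hypothetical toy stand-ins (the real system uses the MSEL model and five features that the abstract does not detail), and the "old->new" change format is invented purely for illustration.

```python
from typing import Callable

Synchronizer = Callable[[str, str], str]  # (old_comment, code_change) -> new comment


def cbs_synchronize(
    old_comment: str,
    code_change: str,
    is_heuristic_prone: Callable[[str, str], bool],
    heuristic_sync: Synchronizer,
    learned_sync: Synchronizer,
) -> str:
    """Classify the sample first, then dispatch it to the matching synchronizer."""
    if is_heuristic_prone(old_comment, code_change):
        return heuristic_sync(old_comment, code_change)
    return learned_sync(old_comment, code_change)


# Hypothetical stand-ins so the sketch runs end to end.
def toy_classifier(old_comment: str, code_change: str) -> bool:
    # A change written as "old->new" counts as heuristic-prone when the
    # old identifier appears verbatim in the comment.
    old_name, _, _ = code_change.partition("->")
    return old_name in old_comment


def toy_heuristic(old_comment: str, code_change: str) -> str:
    old_name, _, new_name = code_change.partition("->")
    return old_comment.replace(old_name, new_name)  # token-level replacement


def toy_learned(old_comment: str, code_change: str) -> str:
    return "<comment generated by a learned model>"


renamed = cbs_synchronize("Returns the userId.", "userId->accountId",
                          toy_classifier, toy_heuristic, toy_learned)
other = cbs_synchronize("Parses the header.", "userId->accountId",
                        toy_classifier, toy_heuristic, toy_learned)
```

Passing the classifier and synchronizers in as parameters keeps the routing logic separate from whichever models sit behind it.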

Journal ArticleDOI
TL;DR: A novel method called NIRVANA (uNcertaInty pRediction ValidAtor iN Ai) is proposed for prediction validation based on uncertainty metrics to address uncertainty in deep learning models applied to CPS data.
Abstract: The use of Deep learning in Cyber-Physical Systems (CPSs) is gaining popularity due to its ability to bring intelligence to CPS behaviors. However, both CPSs and deep learning have inherent uncertainty. Such uncertainty, if not handled adequately, can lead to unsafe CPS behavior. The first step toward addressing such uncertainty in deep learning is to quantify uncertainty. Hence, we propose a novel method called NIRVANA (uNcertaInty pRediction ValidAtor iN Ai) for prediction validation based on uncertainty metrics. To this end, we first employ prediction-time Dropout-based Neural Networks to quantify uncertainty in deep learning models applied to CPS data. Second, such quantified uncertainty is taken as the input to predict wrong labels using a support vector machine, with the aim of building a highly discriminating prediction validator model with uncertainty values. In addition, we investigated the relationship between uncertainty quantification and prediction performance and conducted experiments to obtain optimal dropout ratios. We conducted all the experiments with four real-world CPS datasets. Results show that uncertainty quantification is negatively correlated to prediction performance of a deep learning model of CPS data. Also, our dropout ratio adjustment approach is effective in reducing uncertainty of correct predictions while increasing uncertainty of wrong predictions.
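
Prediction-time dropout, as mentioned in this abstract, can be sketched with a toy linear classifier. This is an assumption-laden illustration rather than NIRVANA itself: the abstract does not say which uncertainty metrics are used, so predictive entropy is chosen here as one common option, and the model, weights, and function names are all hypothetical.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(x: List[float], weights: List[List[float]],
                       drop_rate: float, rng: random.Random) -> List[float]:
    """One forward pass of a toy linear classifier with dropout kept on.

    Requires 0 <= drop_rate < 1.
    """
    keep = 1.0 - drop_rate
    # Inverted dropout: zero each input with probability drop_rate, rescale the rest.
    masked = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    logits = [sum(w * xi for w, xi in zip(row, masked)) for row in weights]
    return softmax(logits)


def predictive_entropy(x: List[float], weights: List[List[float]],
                       drop_rate: float, passes: int, rng: random.Random) -> float:
    """Entropy of the class distribution averaged over several dropout passes."""
    runs = [stochastic_forward(x, weights, drop_rate, rng) for _ in range(passes)]
    mean = [sum(run[c] for run in runs) / passes for c in range(len(weights))]
    return -sum(p * math.log(p) for p in mean if p > 0.0)


rng = random.Random(0)
weights = [[2.0, -1.0], [-1.0, 2.0]]  # hypothetical two-class model
uncertainty = predictive_entropy([1.0, 1.0], weights,
                                 drop_rate=0.5, passes=50, rng=rng)
```

In the approach the abstract describes, uncertainty values of this kind are then given to a support vector machine that predicts whether a label is wrong; that second stage is not shown here.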

Journal ArticleDOI
TL;DR: There is increasing demand in both industry and academia for exploiting Deep Learning (DL) to solve complex real-world problems, and a DL program encodes the network structure of a neural network.
Abstract: Nowadays, we are witnessing an increasing demand in both industry and academia for exploiting Deep Learning (DL) to solve complex real-world problems. A DL program encodes the network structure o...

Journal ArticleDOI
TL;DR: The results of the study serve as references to choose suitable opinion mining tools for software development activities and provide critical insights for the further development of opinion mining techniques in the SE domain.
Abstract: Opinion mining, sometimes referred to as sentiment analysis, has gained increasing attention in software engineering (SE) studies. SE researchers have applied opinion mining techniques in various contexts, such as identifying developers’ emotions expressed in code comments and extracting users’ critics toward mobile apps. Given the large amount of relevant studies available, it can take considerable time for researchers and developers to figure out which approaches they can adopt in their own studies and what perils these approaches entail. We conducted a systematic literature review involving 185 papers. More specifically, we present (1) well-defined categories of opinion mining-related software development activities, (2) available opinion mining approaches, whether they are evaluated when adopted in other studies, and how their performance is compared, (3) available datasets for performance evaluation and tool customization, and (4) concerns or limitations SE researchers might need to take into account when applying/customizing these opinion mining techniques. The results of our study serve as references to choose suitable opinion mining tools for software development activities and provide critical insights for the further development of opinion mining techniques in the SE domain.

Journal ArticleDOI
TL;DR: Search-based software testing has been shown to be an effective technique to generate test cases automatically and its effectiveness strongly depends on the guidance of the fitness function.
Abstract: Search-based software testing (SBST) has been shown to be an effective technique to generate test cases automatically. Its effectiveness strongly depends on the guidance of the fitness function. Un...

Journal ArticleDOI
TL;DR: This article proposes an optimization-based attack technique, CARROTA, to generate valid adversarial source code examples effectively and efficiently, defines robustness metrics, and proposes a robustness measurement toolkit, CARROTM, which approximates worst-case performance under the allowable perturbations.
Abstract: Deep learning (DL) has recently been widely applied to diverse source code processing tasks in the software engineering (SE) community, which achieves competitive performance (e.g., accuracy). However, the robustness, which requires the model to produce consistent decisions given minorly perturbed code inputs, still lacks systematic investigation as an important quality indicator. This article initiates an early step and proposes a framework CARROT for robustness detection, measurement, and enhancement of DL models for source code processing. We first propose an optimization-based attack technique CARROTA to generate valid adversarial source code examples effectively and efficiently. Based on this, we define the robustness metrics and propose robustness measurement toolkit CARROTM, which employs the worst-case performance approximation under the allowable perturbations. We further propose to improve the robustness of the DL models by adversarial training (CARROTT) with our proposed attack techniques. Our in-depth evaluations on three source code processing tasks (i.e., functionality classification, code clone detection, defect prediction) containing more than 3 million lines of code and the classic or SOTA DL models, including GRU, LSTM, ASTNN, LSCNN, TBCNN, CodeBERT, and CDLH, demonstrate the usefulness of our techniques for ❶ effective and efficient adversarial example detection, ❷ tight robustness estimation, and ❸ effective robustness enhancement.

Journal ArticleDOI
TL;DR: This paper proposes a set of seeding strategies for the test case selection problem that generate the initial population of Pareto-based multi-objective algorithms, with the goals of helping to find an overall better set of solutions and enhancing the convergence of the algorithms.
Abstract: The time it takes software systems to be tested is usually long. Search-based test selection has been a widely investigated technique to optimize the testing process. In this article, we propose a set of seeding strategies for the test case selection problem that generates the initial population of Pareto-based multi-objective algorithms, with the goals of (1) helping to find an overall better set of solutions and (2) enhancing the convergence of the algorithms. The seeding strategies were integrated with four state-of-the-art multi-objective search algorithms and applied into two contexts where regression-testing is paramount: (1) Simulation-based testing of Cyber-physical Systems and (2) Continuous Integration. For the first context, we evaluated our approach by using six fitness function combinations and six independent case studies, whereas in the second context, we derived a total of six fitness function combinations and employed four case studies. Our evaluation suggests that some of the proposed seeding strategies are indeed helpful for solving the multi-objective test case selection problem. Specifically, the proposed seeding strategies provided a higher convergence of the algorithms towards optimal solutions in 96% of the studied scenarios and an overall cost-effectiveness with a standard search budget in 85% of the studied scenarios.

Journal ArticleDOI
TL;DR: A novel learning-based framework for automated security tool API Recommendation for security Orchestration, automation, and response, APIRO, which consists of an API-specific word embedding model and a Convolutional Neural Network model that are used for prediction of top 3 relevant APIs for a task.
Abstract: Security Orchestration, Automation, and Response (SOAR) platforms integrate and orchestrate a wide variety of security tools to accelerate the operational activities of Security Operation Center (SOC). Integration of security tools in a SOAR platform is mostly done manually using APIs, plugins, and scripts. SOC teams need to navigate through API calls of different security tools to find a suitable API to define or update an incident response action. Analyzing various types of API documentation with diverse API format and presentation structure involves significant challenges such as data availability, data heterogeneity, and semantic variation for automatic identification of security tool APIs specific to a particular task. Given these challenges can have negative impact on SOC team’s ability to handle security incident effectively and efficiently, we consider it important to devise suitable automated support solutions to address these challenges. We propose a novel learning-based framework for automated security tool API Recommendation for security Orchestration, automation, and response, APIRO. To mitigate data availability constraint, APIRO enriches security tool API description by applying a wide variety of data augmentation techniques. To learn data heterogeneity of the security tools and semantic variation in API descriptions, APIRO consists of an API-specific word embedding model and a Convolutional Neural Network (CNN) model that are used for prediction of top three relevant APIs for a task. We experimentally demonstrate the effectiveness of APIRO in recommending APIs for different tasks using three security tools and 36 augmentation techniques. Our experimental results demonstrate the feasibility of APIRO for achieving 91.9% Top-1 Accuracy. Compared to the state-of-the-art baseline, APIRO is 26.93%, 23.03%, and 20.87% improved in terms of Top-1, Top-2, and Top-3 Accuracy and outperforms the baseline by 23.7% in terms of Mean Reciprocal Rank (MRR).
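The evaluation metrics named in this abstract (Top-k Accuracy and Mean Reciprocal Rank) can be sketched as follows. The function names and the toy ranked lists are hypothetical and chosen only for illustration; they are not taken from APIRO or its dataset.

```python
def top_k_accuracy(ranked_lists, truths, k):
    """Fraction of queries whose true item appears in the first k recommendations."""
    hits = sum(1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k])
    return hits / len(truths)


def mean_reciprocal_rank(ranked_lists, truths):
    """Average of 1/rank of the true item; a query contributes 0 if the item is absent."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(truths)


# Toy example: three tasks, each with a ranked list of three candidate APIs.
ranked_demo = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truth_demo = ["a", "a", "a"]

top1 = top_k_accuracy(ranked_demo, truth_demo, 1)   # true API ranked first in 1 of 3 tasks
top3 = top_k_accuracy(ranked_demo, truth_demo, 3)   # true API within top 3 in all tasks
mrr = mean_reciprocal_rank(ranked_demo, truth_demo)  # (1 + 1/2 + 1/3) / 3
```

Top-k Accuracy only checks whether the correct API is among the first k recommendations, while MRR also rewards placing it nearer the top of the list.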

Journal ArticleDOI
TL;DR: This article proposes the LingLong Synthesis Framework (L2S) for finding the most likely program that meets a specification under a local context, and proves that the probability of a program is the product of the probabilities of the chosen expansion rules, regardless of the order in which they are chosen.
Abstract: In many scenarios, we need to find the most likely program that meets a specification under a local context, where the local context can be an incomplete program, a partial specification, natural language description, and so on. We call such a problem program estimation. In this article, we propose a framework, LingLong Synthesis Framework (L2S), to address this problem. Compared with existing work, our work is novel in the following aspects. (1) We propose a theory of expansion rules to describe how to decompose a program into choices. (2) We propose an approach based on abstract interpretation to efficiently prune off the program sub-space that does not satisfy the specification. (3) We prove that the probability of a program is the product of the probabilities of choosing expansion rules, regardless of the choosing order. (4) We reduce the program estimation problem to a pathfinding problem, enabling existing pathfinding algorithms to solve this problem. L2S has been applied to program generation and program repair. In this article, we report our instantiation of this framework for synthesizing conditional expressions (L2S-Cond) and repairing conditional statements (L2S-Hanabi). The experiments on L2S-Cond show that each option enabled by L2S, including the expansion rules, the pruning technique, and the use of different pathfinding algorithms, plays a major role in the performance of the approach. The default configuration of L2S-Cond correctly predicts nearly 60% of the conditional expressions in the top 5 candidates. Moreover, we evaluate L2S-Hanabi on 272 bugs from two real-world Java defects benchmarks, namely Defects4J and Bugs.jar. L2S-Hanabi correctly fixes 32 bugs with a high precision of 84%. In terms of repairing conditional statement bugs, L2S-Hanabi significantly outperforms all existing approaches in both precision and recall.
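The reduction from program estimation to pathfinding mentioned in this abstract can be illustrated with a small sketch. It is not L2S itself: the toy grammar `RULES`, its fixed rule probabilities, the leftmost expansion order, and the function `most_likely_program` are assumptions for illustration. Because a program's probability is the product of its rule probabilities, summing the negative log-probabilities turns "most likely program" into "cheapest path", which uniform-cost search can find.

```python
import heapq
import math

# Hypothetical toy grammar: nonterminal -> list of (probability, expansion).
RULES = {
    "COND": [(1.0, ("VAR", "OP", "CONST"))],
    "VAR": [(0.7, ("x",)), (0.3, ("y",))],
    "OP": [(0.6, ("<",)), (0.4, (">",))],
    "CONST": [(0.8, ("0",)), (0.2, ("1",))],
}


def most_likely_program(start, spec):
    """Uniform-cost search over partial programs.

    Path cost is the sum of -log(p) over the applied rules, so the first complete
    program popped that satisfies `spec` has the highest probability.
    Returns (program_tokens, probability), or None if nothing satisfies `spec`.
    """
    heap = [(0.0, (start,))]
    while heap:
        cost, form = heapq.heappop(heap)
        # Index of the leftmost symbol that can still be expanded, if any.
        idx = next((i for i, sym in enumerate(form) if sym in RULES), None)
        if idx is None:
            if spec(form):
                return form, math.exp(-cost)
            continue  # complete program that violates the specification
        for prob, expansion in RULES[form[idx]]:
            new_form = form[:idx] + expansion + form[idx + 1:]
            heapq.heappush(heap, (cost - math.log(prob), new_form))
    return None


# Usage: require the condition to mention variable "y".
program_demo, prob_demo = most_likely_program("COND", lambda toks: toks[0] == "y")
```

This sketch checks the specification only on complete programs; the abstract describes pruning unsatisfying parts of the search space earlier, via abstract interpretation, which is not modeled here.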

Journal ArticleDOI
TL;DR: This article reports a study of the critical challenges and benefits of incorporating accessibility into software development and design. It suggests that development teams treat accessibility as a first-class consideration throughout the software development process, proposes remedies to resolve the gaps between practitioner groups, and highlights key future research directions.
Abstract: Being able to access software in daily life is vital for everyone, and thus accessibility is a fundamental challenge for software development. However, given the number of accessibility issues reported by many users, e.g., in app reviews, it is not clear if accessibility is widely integrated into current software projects and how software projects address accessibility issues. In this article, we report a study of the critical challenges and benefits of incorporating accessibility into software development and design. We applied a mixed qualitative and quantitative approach for gathering data from 15 interviews and 365 survey respondents from 26 countries across five continents to understand how practitioners perceive accessibility development and design in practice. We got 44 statements grouped into eight topics on accessibility from practitioners’ viewpoints and different software development stages. Our statistical analysis reveals substantial gaps between groups, e.g., practitioners have Direct vs. Indirect accessibility relevant work experience when they reviewed the summarized statements. These gaps might hinder the quality of accessibility development and design, and we use our findings to establish a set of guidelines to help practitioners be aware of accessibility challenges and benefit factors. We suggest development teams put accessibility as a first-class consideration throughout the software development process, and we also propose some remedies to resolve the gaps between groups and to highlight key future research directions to incorporate accessibility into software design and development.

Journal ArticleDOI
TL;DR: This article proposes a neural-network-based approach called PMA to predict the missing key aspects of a vulnerability based on its known aspects, and validates the prediction performance of key aspect augmentation of CVEs against manually augmented CVE data collected from NVD, which confirms the practicality of the approach.
Abstract: Security vulnerabilities have been continually disclosed and documented. For the effective understanding, management, and mitigation of the fast-growing number of vulnerabilities, an important practice in documenting vulnerabilities is to describe the key vulnerability aspects, such as vulnerability type, root cause, affected product, impact, attacker type, and attack vector. In this article, we first investigate 133,639 vulnerability reports in the Common Vulnerabilities and Exposures (CVE) database over the past 20 years. We find that 56%, 85%, 38%, and 28% of CVEs miss vulnerability type, root cause, attack vector, and attacker type, respectively. By comparing the differences of the latest updated CVE reports across different databases, we observe that 1,476 missing key aspects in 1,320 CVE descriptions were augmented manually in the National Vulnerability Database (NVD), which indicates that the vulnerability database maintainers try to complete the vulnerability descriptions in practice to mitigate such a problem. To help complete the missing information of key vulnerability aspects and reduce human efforts, we propose a neural-network-based approach called PMA to predict the missing key aspects of a vulnerability based on its known aspects. We systematically explore the design space of the neural network models and empirically identify the most effective model design in the scenario. Our ablation study reveals the prominent correlations among vulnerability aspects when predicting. Trained with historical CVEs, our model achieves 88%, 71%, 61%, and 81% in F1 for predicting the missing vulnerability type, root cause, attacker type, and attack vector of 8,623 “future” CVEs across 3 years, respectively. Furthermore, we validate the predicting performance of key aspect augmentation of CVEs based on the manually augmented CVE data collected from NVD, which confirms the practicality of our approach. We finally highlight that PMA has the ability to reduce human efforts by recommending and augmenting missing key aspects for vulnerability databases, and to facilitate other research works such as severity level prediction of CVEs based on the vulnerability descriptions.

Journal ArticleDOI
TL;DR: A meaningful and deep understanding of the human aspects of software engineering (SE) requires psychological constructs to be considered.
Abstract: A meaningful and deep understanding of the human aspects of software engineering (SE) requires psychological constructs to be considered. Psychology theory can facilitate the systematic and sound d...

Journal ArticleDOI
TL;DR: This article proposes the interpretable coverage criteria through constructing the decision structure of a DNN, and proposes two variants of path coverage to measure the adequacy of the test cases in exercising the decision logic.
Abstract: Deep learning has recently been widely applied to many applications across different domains, e.g., image classification and audio recognition. However, the quality of Deep Neural Networks (DNNs) still raises concerns in the practical operational environment, which calls for systematic testing, especially in safety-critical scenarios. Inspired by software testing, a number of structural coverage criteria are designed and proposed to measure the test adequacy of DNNs. However, due to the blackbox nature of DNN, the existing structural coverage criteria are difficult to interpret, making it hard to understand the underlying principles of these criteria. The relationship between the structural coverage and the decision logic of DNNs is unknown. Moreover, recent studies have further revealed the non-existence of correlation between the structural coverage and DNN defect detection, which further posts concerns on what a suitable DNN testing criterion should be. In this article, we propose the interpretable coverage criteria through constructing the decision structure of a DNN. Mirroring the control flow graph of the traditional program, we first extract a decision graph from a DNN based on its interpretation, where a path of the decision graph represents a decision logic of the DNN. Based on the control flow and data flow of the decision graph, we propose two variants of path coverage to measure the adequacy of the test cases in exercising the decision logic. The higher the path coverage, the more diverse decision logic the DNN is expected to be explored. Our large-scale evaluation results demonstrate that: The path in the decision graph is effective in characterizing the decision of the DNN, and the proposed coverage criteria are also sensitive with errors, including natural errors and adversarial examples, and strongly correlate with the output impartiality.

Journal ArticleDOI
TL;DR: To accelerate software development, developers frequently search and reuse existing code snippets from large-scale codebases such as GitHub.
Abstract: To accelerate software development, developers frequently search and reuse existing code snippets from a large-scale codebase, e.g., GitHub. Over the years, researchers proposed many information re...

Journal ArticleDOI
TL;DR: This paper compares seven state-of-the-art fuzzers on 18 open-source, one industrial, and one artificial RESTful APIs, then analyzes the source code of the parts of these APIs for which the fuzzers fail to generate tests, pointing to clear limitations of current fuzzers.
Abstract: RESTful APIs are a type of web service that are widely used in industry. In the last few years, a lot of effort in the research community has been spent in designing novel techniques to automatically fuzz those APIs to find faults in them. Many real faults were automatically found in a large variety of RESTful APIs. However, usually the analyzed fuzzers treat the APIs as black-box, and no analysis of what is actually covered in these systems is done. Therefore, although these fuzzers are clearly useful for practitioners, we do not know what are their current limitations and actual effectiveness. Solving this is a necessary step to be able to design better, more efficient and effective techniques. To address this issue, in this paper we compare seven state-of-the-art fuzzers on 18 open-source, one industrial and one artificial RESTful APIs. We then analyzed the source code of which parts of these APIs the fuzzers fail to generate tests for. This analysis points to clear limitations of these current fuzzers, listing concrete challenges for the research community to follow up on.