
Showing papers in "Empirical Software Engineering in 2020"


Journal ArticleDOI
TL;DR: In this article, the authors analyzed the projected changes in temperature and precipitation over six South Asian countries during the twenty-first century, and showed that the CMIP6 models display higher sensitivity to greenhouse gas emissions over South Asia compared with the CMIP5 models.
Abstract: The latest Coupled Model Intercomparison Project phase 6 (CMIP6) dataset was analyzed to examine the projected changes in temperature and precipitation over six South Asian countries during the twenty-first century. The CMIP6 model simulations reveal biases in annual mean temperature and precipitation over South Asia in the present climate. In the historical period, the median of the CMIP6 model ensemble systematically underestimates the annual mean temperature for all the South Asian countries, while a mixed behavior is shown in the case of precipitation. In the future climate, the CMIP6 models display higher sensitivity to greenhouse gas emissions over South Asia compared with the CMIP5 models. The multimodel ensemble from 27 CMIP6 models projects a continuous increase in the annual mean temperature over South Asia during the twenty-first century under three future scenarios. The projected temperature shows a large increase (over 6 °C under the SSP5-8.5 scenario) over the northwestern parts of South Asia, comprising the complex Karakorum and Himalayan mountain ranges. Any large increase in the mean temperature over this region will most likely result in a faster rate of glacier melting. By the end of the twenty-first century, the annual mean temperature (uncertainty range) over South Asia is projected to increase by 1.2 (0.7–2.1) °C, 2.1 (1.5–3.3) °C, and 4.3 (3.2–6.6) °C under the SSP1-2.6, SSP2-4.5, and SSP5-8.5 scenarios, respectively, relative to the present (1995–2014) climate. The warming over South Asia is also continuous on the seasonal time scale. The CMIP6 models projected higher warming in the winter season than in the summer over South Asia, which if verified will have repercussions for snow/ice accumulations as well as winter cropping patterns. The annual mean precipitation is also projected to increase over South Asia during the twenty-first century under all scenarios. The rate of change in the projected annual mean precipitation varies considerably between the South Asian countries. By the end of the twenty-first century, the country-averaged annual mean precipitation (uncertainty range) is projected to increase by 17.1 (2.2–49.1)% in Bangladesh, 18.9 (−4.9 to 72)% in Bhutan, 27.3 (5.3–160.5)% in India, 19.5 (−5.9 to 95.6)% in Nepal, 26.4 (6.4–159.7)% in Pakistan, and 25.1 (−8.5 to 61.0)% in Sri Lanka under the SSP5-8.5 scenario. The seasonal precipitation projections also show large variability. The projected winter precipitation reveals a robust increase over the western Himalayas, with a corresponding decrease over the eastern Himalayas. On the other hand, the summer precipitation shows a robust increase over most of the South Asia region, with the largest increase over the arid region of southern Pakistan and adjacent areas of India, under the high-emission scenario. The results presented in this study give detailed insights into CMIP6 model performance over the South Asia region, which could be extended further to develop adaptation strategies, and may act as a guideline document for climate change related policymaking in the region.
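
As an illustration of how such multimodel "median (uncertainty range)" figures are typically derived, the following minimal Python sketch computes an ensemble median and an assumed 5th–95th percentile range from hypothetical per-model warming values; the numbers, the percentile choice, and the variable names are illustrative assumptions, not the paper's data.

```python
import numpy as np

# Hypothetical per-model changes in South Asian annual mean temperature (°C),
# future period minus the 1995-2014 baseline, one value per CMIP6 model.
ssp585_changes = np.array([3.2, 4.1, 4.3, 5.0, 3.8, 6.6, 4.5, 3.9, 4.4])

median = np.median(ssp585_changes)
low, high = np.percentile(ssp585_changes, [5, 95])  # assumed uncertainty bounds
print(f"SSP5-8.5 warming: {median:.1f} ({low:.1f}-{high:.1f}) °C")
```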

211 citations


Journal ArticleDOI
TL;DR: Experimental results demonstrate that the method Hybrid-DeepCom outperforms the state-of-the-art by a substantial margin and the results show that reducing the out-of-vocabulary tokens improves the accuracy effectively.
Abstract: During software maintenance, developers spend a lot of time understanding the source code. Existing studies show that code comments help developers comprehend programs and reduce additional time spent on reading and navigating source code. Unfortunately, these comments are often mismatched, missing or outdated in software projects. Developers have to infer the functionality from the source code. This paper proposes a new approach named Hybrid-DeepCom to automatically generate code comments for the functional units of the Java language, namely, Java methods. The generated comments aim to help developers understand the functionality of Java methods. Hybrid-DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments from learned features. It formulates the comment generation task as a machine translation problem. Hybrid-DeepCom exploits a deep neural network that combines the lexical and structural information of Java methods for better comment generation. We conduct experiments on a large-scale Java corpus built from 9,714 open source projects on GitHub. We evaluate the experimental results on both machine translation metrics and information retrieval metrics. Experimental results demonstrate that our method Hybrid-DeepCom outperforms the state-of-the-art by a substantial margin. In addition, we evaluate the influence of out-of-vocabulary tokens on comment generation. The results show that reducing the out-of-vocabulary tokens improves the accuracy effectively.
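
Since the paper evaluates generated comments with machine translation metrics, here is a small, hedged example of computing a sentence-level BLEU score with NLTK; the reference and hypothesis comments are made up, and the smoothing choice is an assumption rather than the paper's exact setup.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Reference: a developer-written comment; hypothesis: a generated one.
reference = "returns the maximum of two integers".split()
hypothesis = "return the maximum of two int values".split()

# Smoothing avoids zero scores when short sentences miss higher-order n-grams.
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"Sentence BLEU: {score:.3f}")
```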

175 citations


Journal ArticleDOI
TL;DR: A systematic mapping study of testing techniques for MLSs, driven by 33 research questions, investigating multiple aspects of the testing approaches, such as the used/proposed adequacy criteria, the algorithms for test input generation, and the test oracles.
Abstract: A Machine Learning based System (MLS) is a software system including one or more components that learn how to perform a task from a given data set. The increasing adoption of MLSs in safety critical domains such as autonomous driving, healthcare, and finance has fostered much attention towards the quality assurance of such systems. Despite the advances in software testing, MLSs bring novel and unprecedented challenges, since their behaviour is defined jointly by the code that implements them and the data used for training them. Our objectives are to identify the existing solutions for functional testing of MLSs and classify them from three different perspectives: (1) the context of the problem they address, (2) their features, and (3) their empirical evaluation; to report demographic information about the ongoing research; and to identify open challenges for future research. We conducted a systematic mapping study about testing techniques for MLSs driven by 33 research questions. We followed existing guidelines when defining our research protocol so as to increase the repeatability and reliability of our results. We identified 70 relevant primary studies, mostly published in recent years. We identified 11 problems addressed in the literature. We investigated multiple aspects of the testing approaches, such as the used/proposed adequacy criteria, the algorithms for test input generation, and the test oracles. The most active research areas in MLS testing address automated scenario/input generation and test oracle creation. MLS testing is a rapidly growing and developing research area, with many open challenges, such as the generation of realistic inputs and the definition of reliable evaluation metrics and benchmarks.

116 citations


Journal ArticleDOI
TL;DR: An exploratory study of smart contracts, which elucidates how frequently the contracts are used, what they do, and how complex they are, and proposes an open research agenda to drive and foster future studies in the area.
Abstract: Ethereum is a blockchain platform that supports smart contracts. Smart contracts are pieces of code that perform general-purpose computations. For instance, smart contracts have been used to implement crowdfunding initiatives that raised a total of US$6.2 billion from January to June of 2018. In this paper, we conduct an exploratory study of smart contracts. Differently from prior studies that focused on particular aspects of a subset of smart contracts, our goal is to have a broader understanding of all contracts that are currently deployed in Ethereum. In particular, we elucidate how frequently the contracts are used (activity level), what they do (category), and how complex they are (source code complexity). To conduct this study, we mined and cross-linked data from four sources: the Ethereum dataset on the Google BigQuery platform, Etherscan, State of the DApps, and CoinMarketCap. Our study period runs from July 2015 (inception of Ethereum) until September 2018. With regards to activity level, we notice that it is concentrated on a very small subset of the contracts. More specifically, only 0.05% of the smart contracts are the target of 80% of the transactions that are sent to contracts. New solutions to cope with Ethereum's limited scalability should take such an activity imbalance into consideration. With regards to categories, we highlight that the new and widely advertised rich programming model of smart contracts is currently being used to develop very simple applications that tend to be token-centric (e.g., ICOs, Crowdsales, etc). Finally, with regards to code complexity, we observe that the source code of high-activity verified contracts is small, with at most 211 instructions in 80% of the cases. These contracts also commonly include at least two subcontracts and libraries in their source code. The comment ratio of these contracts is also significantly higher than that of GitHub top-starred projects written in Java, C++, and C#. Hence, the source code of high-activity verified smart contracts exhibits particular complexity characteristics compared to that of other popular programming languages. Further studies are necessary to uncover the actual reasons behind such differences. Finally, based on our findings, we propose an open research agenda to drive and foster future studies in the area.
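
To make the reported activity imbalance concrete (0.05% of contracts receiving 80% of transactions), this sketch shows one way such a concentration figure can be computed from transactions-per-contract counts; the heavy-tailed synthetic data is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical heavy-tailed transactions-per-contract counts.
tx_counts = rng.pareto(0.8, size=100_000)

sorted_counts = np.sort(tx_counts)[::-1]           # busiest contracts first
cumulative = np.cumsum(sorted_counts) / sorted_counts.sum()
k = np.searchsorted(cumulative, 0.80) + 1          # contracts covering 80% of traffic
print(f"{k / len(tx_counts):.2%} of contracts receive 80% of transactions")
```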

112 citations


Journal ArticleDOI
TL;DR: In this paper, a questionnaire survey was created mainly from existing, validated scales and translated into 12 languages, and the data was analyzed using nonparametric inferential statistics and structural equation modeling.
Abstract: Context: As a novel coronavirus swept the world in early 2020, thousands of software developers began working from home. Many did so on short notice, under difficult and stressful conditions. Objective: This study investigates the effects of the pandemic on developers’ wellbeing and productivity. Method: A questionnaire survey was created mainly from existing, validated scales and translated into 12 languages. The data was analyzed using non-parametric inferential statistics and structural equation modeling. Results: The questionnaire received 2225 usable responses from 53 countries. Factor analysis supported the validity of the scales and the structural model achieved a good fit (CFI = 0.961, RMSEA = 0.051, SRMR = 0.067). Confirmatory results include: (1) the pandemic has had a negative effect on developers’ wellbeing and productivity; (2) productivity and wellbeing are closely related; (3) disaster preparedness, fear related to the pandemic and home office ergonomics all affect wellbeing or productivity. Exploratory analysis suggests that: (1) women, parents and people with disabilities may be disproportionately affected; (2) different people need different kinds of support. Conclusions: To improve employee productivity, software companies should focus on maximizing employee wellbeing and improving the ergonomics of employees’ home offices. Women, parents and disabled persons may require extra support.
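
The abstract mentions non-parametric inferential statistics; as a hedged illustration (not the paper's actual analysis), a Mann-Whitney U test in SciPy could compare two groups of wellbeing scores like so, with entirely hypothetical data:

```python
from scipy.stats import mannwhitneyu

# Hypothetical 5-point wellbeing scores before and during the pandemic
# (the paper's scales and data are not reproduced here).
before = [4, 5, 3, 4, 4, 5, 3, 4, 2, 4]
during = [3, 4, 2, 3, 4, 3, 2, 3, 2, 3]

# One-sided test: is wellbeing before stochastically greater than during?
stat, p = mannwhitneyu(before, during, alternative="greater")
print(f"U = {stat}, p = {p:.4f}")
```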

105 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a systematic and automated approach to mine relevant and actionable fix patterns based on an iterative clustering strategy applied to atomic changes within patches, which can be leveraged to extract generic fix actions.
Abstract: Patching is a common activity in software development. It is generally performed on a source code base to address bugs or add new functionalities. In this context, given the recurrence of bugs across projects, the associated similar patches can be leveraged to extract generic fix actions. While the literature includes various approaches leveraging similarity among patches to guide program repair, these approaches often do not yield fix patterns that are tractable and reusable as actionable input to APR systems. In this paper, we propose FixMiner, a systematic and automated approach to mining relevant and actionable fix patterns based on an iterative clustering strategy applied to atomic changes within patches. The goal of FixMiner is to infer separate and reusable fix patterns that can be leveraged in other patch generation systems. FixMiner leverages Rich Edit Scripts, a specialized tree structure of edit scripts that captures the AST-level context of code changes. FixMiner uses different tree representations of Rich Edit Scripts for each round of clustering to identify similar changes: abstract syntax trees, edit actions trees, and code context trees. We have evaluated FixMiner on thousands of software patches collected from open source projects. Preliminary results show that we are able to mine accurate patterns, efficiently exploiting change information in Rich Edit Scripts. We further integrated the mined patterns into an automated program repair prototype, PARFixMiner, with which we are able to correctly fix 26 bugs of the Defects4J benchmark. Beyond this quantitative performance, we show that the mined fix patterns are sufficiently relevant to produce patches with a high probability of correctness: 81% of PARFixMiner's generated plausible patches are correct.
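
FixMiner's iterative clustering over Rich Edit Script representations is far richer than anything shown here, but the following toy Python sketch conveys the core idea of one clustering round: patches whose abstracted edit-action sequences are identical fall into the same cluster, and recurring clusters suggest candidate fix patterns. All identifiers and action tuples are invented for illustration.

```python
from collections import defaultdict

# Hypothetical atomic changes, each abstracted into an edit-action sequence
# (a crude stand-in for FixMiner's tree representations).
patches = {
    "p1": ("UPD", "InfixExpression", "==", "!="),
    "p2": ("UPD", "InfixExpression", "==", "!="),
    "p3": ("INS", "IfStatement", "null-check"),
}

clusters = defaultdict(list)
for patch_id, actions in patches.items():
    clusters[actions].append(patch_id)   # identical sequences share a cluster

for actions, members in clusters.items():
    if len(members) > 1:                 # a recurring change = candidate fix pattern
        print(actions, "->", members)
```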

83 citations


Journal ArticleDOI
TL;DR: This study confirms previous findings on the unwillingness of developers to configure ASATs and emphasizes the necessity to improve existing strategies for the selection and prioritization of ASAT warnings that are shown to developers.
Abstract: Automatic static analysis tools (ASATs) are instruments that support code quality assessment by automatically detecting defects and design issues. Despite their popularity, they are characterized by (i) a high false positive rate and (ii) the low comprehensibility of the generated warnings. However, no prior studies have investigated the usage of ASATs in different development contexts (e.g., code reviews, regular development), nor how open source projects integrate ASATs into their workflows. These perspectives are paramount to improve the prioritization of the identified warnings. To shed light on actual ASAT usage practices, in this paper we first survey 56 developers (66% from industry and 34% from open source projects) and interview 11 industrial experts leveraging ASATs in their workflow with the aim of understanding how they use ASATs in different contexts. Furthermore, to investigate how ASATs are being used in the workflows of open source projects, we manually inspect the contribution guidelines of 176 open-source systems and extract the ASATs' configuration and build files from their corresponding GitHub repositories. Our study highlights that (i) 71% of developers do pay attention to different warning categories depending on the development context; (ii) 63% of our respondents rely on specific factors (e.g., team policies and composition) when prioritizing warnings to fix during their programming; and (iii) 66% of the projects define how to use specific ASATs, but only 37% enforce their usage for new contributions. The perceived relevance of ASATs varies between different projects and domains, which is a sign that ASAT use is still not a common practice. In conclusion, this study confirms previous findings on the unwillingness of developers to configure ASATs and it emphasizes the necessity to improve existing strategies for the selection and prioritization of ASAT warnings that are shown to developers.

78 citations


Journal ArticleDOI
TL;DR: This work applies Latent Dirichlet Allocation (LDA) topic modeling techniques to derive the discussion topics related to three popular deep learning frameworks, namely, Tensorflow, PyTorch and Theano, and makes a comparison of topics between the two platforms.
Abstract: Deep learning has gained tremendous traction from the developer and researcher communities. It plays an increasingly significant role in a number of application domains. Deep learning frameworks are proposed to help developers and researchers easily leverage deep learning technologies, and they attract a great number of discussions on popular platforms, i.e., Stack Overflow and GitHub. To understand and compare the insights from these two platforms, we mine the topics of interest from both. Specifically, we apply Latent Dirichlet Allocation (LDA) topic modeling techniques to derive the discussion topics related to three popular deep learning frameworks, namely, Tensorflow, PyTorch and Theano. Within each platform, we compare the topics across the three deep learning frameworks. Moreover, we make a comparison of topics between the two platforms. Our observations include: 1) a wide range of topics is discussed about the three deep learning frameworks on both platforms, and the most popular workflow stages are Model Training and Preliminary Preparation; 2) the topic distributions at the workflow level and topic category level on Tensorflow and PyTorch are consistently similar, while the topic distribution pattern on Theano is quite different (the topic trends at the workflow level and topic category level of the three frameworks are also quite different); 3) the topics at the workflow level show different trends across the two platforms, e.g., the trend of the Preliminary Preparation stage topic on Stack Overflow becomes relatively stable after 2016, while on GitHub it shows a stronger upward trend after 2016. Besides, the Model Training stage topic achieves the highest impact scores on both platforms. Based on the findings, we also discuss implications for practitioners and researchers.
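
As a rough illustration of the LDA step (not the paper's pipeline or parameters), one could derive topics from a small corpus of post titles with scikit-learn as follows; the posts, topic count, and preprocessing are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical post titles; the study uses Stack Overflow and GitHub corpora.
posts = [
    "tensorflow session crashes during model training",
    "pytorch dataloader preprocessing for image dataset",
    "theano gpu configuration error on import",
    "how to save and restore a trained tensorflow model",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]   # top words per topic
    print(f"topic {i}: {top}")
```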

66 citations


Journal ArticleDOI
TL;DR: In this article, a machine learning approach, named FaRM, is proposed to select and rank killable and fault revealing mutants, i.e., mutants that lead to test cases that uncover unknown program faults.
Abstract: Mutant selection refers to the problem of choosing, among a large number of mutants, the (few) ones that should be used by the testers. In view of this, we investigate the problem of selecting the fault revealing mutants, i.e., the mutants that are killable and lead to test cases that uncover unknown program faults. We formulate two variants of this problem: the fault revealing mutant selection and the fault revealing mutant prioritization. We argue and show that these problems can be tackled through a set of 'static' program features and propose a machine learning approach, named FaRM, that learns to select and rank killable and fault revealing mutants. Experimental results involving 1,692 real faults show the practical benefits of our approach in both examined problems. Our results show that FaRM achieves a good trade-off between application cost and effectiveness (measured in terms of faults revealed). We also show that FaRM outperforms all the existing mutant selection methods, i.e., the random mutant sampling, the selective mutation and defect prediction (mutating the code areas pointed by defect prediction). In particular, our results show that with respect to mutant selection, our approach reveals 23% to 34% more faults than any of the baseline methods, while, with respect to mutant prioritization, it achieves a higher average percentage of revealed faults with a median difference between 4% and 9% (from the random mutant orderings).
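
FaRM's actual features and learner are described in the paper; the sketch below only illustrates the general pattern of prioritizing mutants by a classifier's predicted fault-revealing probability. The feature matrix, labels, and model choice are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical static features per mutant (e.g., mutation operator id,
# nesting depth, statement type); labels: 1 = fault revealing.
X = np.array([[0, 2, 1], [1, 1, 0], [2, 3, 1], [0, 1, 1], [1, 4, 0]])
y = np.array([1, 0, 1, 0, 1])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Prioritization: rank unseen mutants by predicted fault-revealing probability.
candidates = np.array([[2, 2, 1], [1, 1, 1], [0, 4, 0]])
scores = clf.predict_proba(candidates)[:, 1]
ranking = np.argsort(scores)[::-1]
print("mutant order:", ranking, "scores:", scores[ranking])
```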

59 citations


Journal ArticleDOI
TL;DR: This multiple-case study analyzes the current adoption of variability management techniques in twelve medium- to large-scale industrial cases in domains such as automotive, aerospace or railway systems to understand the current state of adoption and shed light on gaps to address in industrial practice.
Abstract: Handling large-scale software variability is still a challenge for many organizations. After decades of research on variability management concepts, many industrial organizations have introduced techniques known from research, but still lament that pure textbook approaches are not applicable or efficient. For instance, software product line engineering—an approach to systematically develop portfolios of products—is difficult to adopt given the high upfront investments; and even when adopted, organizations are challenged by evolving their complex product lines. Consequently, the research community now mainly focuses on re-engineering and evolution techniques for product lines; yet, understanding the current state of adoption and the industrial challenges for organizations is necessary to conceive effective techniques. In this multiple-case study, we analyze the current adoption of variability management techniques in twelve medium- to large-scale industrial cases in domains such as automotive, aerospace or railway systems. We identify the current state of variability management, emphasizing the techniques and concepts they adopted. We elicit the needs and challenges expressed for these cases, triangulated with results from a literature review. We believe our results help to understand the current state of adoption and shed light on gaps to address in industrial practice.

58 citations


Journal ArticleDOI
TL;DR: In this article, the authors used the Standardized Precipitation Index (SPI) to quantify drought characteristics; the SPI is widely used for its simplicity and its variable approaches to quantifying a drought.
Abstract: Deficiency in rainfall introduces drought phenomena with temporal and spatial variability in terms of intensity and magnitude. Studying drought at different scales is necessary for successful planning in a country such as India, where the agricultural sector contributes the most to the economy. Drought indices (DI) are a tool to quantify the nature of a drought and express it as a single number, which helps characterize the drought. The Standardized Precipitation Index (SPI) is one such tool, widely used for its simplicity and its variable approaches to quantifying a drought. Therefore, the present study uses the SPI to analyse drought phenomena at pre-monsoon, monsoon, post-monsoon and monthly time steps in three relatively drought-prone districts (Purulia, Bankura, Midnapore) of West Bengal in India, based on 117 years of rainfall data (1901–2017). From the SPI values, drought frequency is analysed using Gumbel's type 1 distribution and the trend is calculated using the Mann–Kendall test (M–K test). Droughts with negative SPI values occur frequently in these districts, with dry events increasing and wet and normal events decreasing. More intensive study of hydrological and agricultural drought is necessary to implement any plan given this increasing aggravation of drought.
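
For readers unfamiliar with the SPI, the following minimal sketch shows the standard computation idea: fit a gamma distribution to precipitation totals and map each observation's cumulative probability onto a standard normal deviate. The rainfall values are invented, and the zero-precipitation correction used in practice is omitted.

```python
import numpy as np
from scipy.stats import gamma, norm

# Hypothetical seasonal rainfall totals (mm) for one district, one value per year.
rain = np.array([812, 645, 903, 560, 731, 498, 840, 610, 700, 655])

# Fit a gamma distribution (location fixed at 0, as is conventional for SPI) ...
shape, loc, scale = gamma.fit(rain, floc=0)
# ... and map each year's cumulative probability onto a standard normal deviate.
spi = norm.ppf(gamma.cdf(rain, shape, loc=loc, scale=scale))
print(np.round(spi, 2))   # negative values indicate drier-than-median years
```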

Journal ArticleDOI
TL;DR: This paper empirically investigates the bad practices experienced by developers applying Continuous Integration and compiles a catalog of 79 CI bad smells belonging to 7 categories related to different dimensions of CI pipeline management and process.
Abstract: Continuous Integration (CI) has been claimed to introduce several benefits in software development, including high software quality and reliability. However, recent work pointed out challenges, barriers and bad practices characterizing its adoption. This paper empirically investigates the bad practices experienced by developers applying CI. The investigation has been conducted by leveraging semi-structured interviews of 13 experts and mining more than 2,300 Stack Overflow posts. As a result, we compiled a catalog of 79 CI bad smells belonging to 7 categories related to different dimensions of CI pipeline management and process. We have also investigated the perceived importance of the identified bad smells through a survey involving 26 professional developers, and discussed how the results of our study relate to existing knowledge about CI bad practices. Whilst some results, such as the poor usage of branches, confirm existing literature, the study also highlights bad practices not previously covered, e.g., related to static analysis tools or the abuse of shell scripts, and contradicts knowledge from existing literature, e.g., about avoiding nightly builds. We discuss the implications of our catalog of CI bad smells for (i) practitioners, e.g., favor specific, portable tools over hacking, and do not ignore nor hide build failures, (ii) educators, e.g., teach CI culture, not just technology, and teach CI by providing examples of what not to do, and (iii) researchers, e.g., developing support for failure analysis, as well as automated CI bad smell detectors.

Journal ArticleDOI
TL;DR: This paper discusses when and why researchers should use eye trackers as well as how they should use them, and compiles a list of typical use cases—real and anticipated—of eye trackers, as well as metrics, visualizations, and statistical analyses to analyze and report eye-tracking data.
Abstract: For several years, the software engineering research community used eye trackers to study program comprehension, bug localization, pair programming, and other software engineering tasks. Eye trackers provide researchers with insights on software engineers’ cognitive processes, data that can augment those acquired through other means, such as on-line surveys and questionnaires. While there are many ways to take advantage of eye trackers, advancing their use requires defining standards for experimental design, execution, and reporting. We begin by presenting the foundations of eye tracking to provide context and perspective. Based on previous surveys of eye tracking for programming and software engineering tasks and our collective, extensive experience with eye trackers, we discuss when and why researchers should use eye trackers as well as how they should use them. We compile a list of typical use cases—real and anticipated—of eye trackers, as well as metrics, visualizations, and statistical analyses to analyze and report eye-tracking data. We also discuss the pragmatics of eye tracking studies. Finally, we offer lessons learned about using eye trackers to study software engineering tasks. This paper is intended to be a one-stop resource for researchers interested in designing, executing, and reporting eye tracking studies of software engineering tasks.
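
As a trivial example of the kind of eye-tracking metrics the paper surveys, the snippet below aggregates total and mean fixation duration per area of interest from a hypothetical fixation log; real studies use richer metrics and dedicated tooling.

```python
from collections import defaultdict

# Hypothetical fixation log: (area of interest, duration in ms).
fixations = [("identifier", 210), ("identifier", 180), ("comment", 95),
             ("identifier", 250), ("comment", 120)]

totals, counts = defaultdict(int), defaultdict(int)
for aoi, ms in fixations:
    totals[aoi] += ms
    counts[aoi] += 1

for aoi in totals:
    print(f"{aoi}: total {totals[aoi]} ms, mean {totals[aoi] / counts[aoi]:.0f} ms")
```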

Journal ArticleDOI
TL;DR: A model for defining criteria to identify the first snapshot of an evolving software system that exhibits a bug is proposed, based on the perfect test idea; the study shows empirical evidence that the prevalent assumption, "a bug was introduced by the lines of code that were modified to fix it", is just one case of how bugs are introduced in a software system.
Abstract: When identifying the origin of software bugs, many studies assume that "a bug was introduced by the lines of code that were modified to fix it". However, this assumption does not always hold and at least in some cases, these modified lines are not responsible for introducing the bug, for example, when the bug was caused by a change in an external API. The lack of empirical evidence makes it impossible to assess how important these cases are and therefore, to which extent the assumption is valid. To advance in this direction, and better understand how bugs "are born", we propose a model for defining criteria to identify the first snapshot of an evolving software system that exhibits a bug. This model, based on the perfect test idea, decides whether a bug is observed after a change to the software. Furthermore, we studied the model's criteria by carefully analyzing how 116 bugs were introduced in two different open source software projects. The manual analysis helped classify the root cause of those bugs and created manually curated datasets with bug-introducing changes and with bugs that were not introduced by any change in the source code. Finally, we used these datasets to evaluate the performance of four existing SZZ-based algorithms for detecting bug-introducing changes. We found that SZZ-based algorithms are not very accurate, especially when multiple commits are found; the F-Score varies from 0.44 to 0.77, while the percentage of true positives does not exceed 63%. Our results show empirical evidence that the prevalent assumption, "a bug was introduced by the lines of code that were modified to fix it", is just one case of how bugs are introduced in a software system. Finding what introduced a bug is not trivial: bugs can be introduced by the developers and be in the code, or be created irrespective of the code. Thus, further research towards a better understanding of the origin of bugs in software projects could help improve the design of integration tests and of other procedures that make software development more robust.
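
The SZZ family of algorithms evaluated here roughly works by blaming the lines a fix deletes. A minimal sketch of that step, assuming a local git repository and already-known deleted-line numbers (both hypothetical inputs), might look like this:

```python
import subprocess

def blame_deleted_lines(repo, fix_commit, path, line_numbers):
    """Minimal SZZ step: blame the fix's parent to find the commits that last
    touched the lines deleted by the fix (candidate bug-introducing changes)."""
    candidates = set()
    for line in line_numbers:
        out = subprocess.check_output(
            ["git", "-C", repo, "blame", "-L", f"{line},{line}",
             "--porcelain", f"{fix_commit}^", "--", path],
            text=True)
        candidates.add(out.split()[0])   # first token is the blamed commit hash
    return candidates
```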

Journal ArticleDOI
TL;DR: This paper aims at automating the classification of SO question posts into seven question categories, and finds that the categories API usage, Conceptual, and Discrepancy are the most frequently assigned question categories and that they also occur together frequently.
Abstract: On question and answer sites, such as Stack Overflow (SO), developers use tags to label the content of a post and to support developers in question searching and browsing. However, these tags mainly refer to technological aspects instead of the purpose of the question. Tagging questions with their purpose can add a new dimension to the identification of discussed topics in posts on SO. In this paper, we aim at automating the classification of SO question posts into seven question categories. As a first step, we harmonized existing taxonomies of question categories and then manually classified 1,000 SO questions according to our new taxonomy. In addition to the question category, we marked the phrases that indicate a question category for each of the posts. We then used this data set to automate the classification of posts using two approaches. For the first approach, we manually analyzed the phrases to find patterns and derived regular expressions from them. Based on these regular expressions, we implemented, for each category, a classifier that determines whether a post belongs to that category. In the second approach, we use the curated data set to train classification models of supervised machine learning algorithms (Random Forest and Support Vector Machines). For the machine learning algorithms, we experimented with 1,312 different configurations regarding the preprocessing of the text and the representation of the input data. Then, we compared the performance of the regex approach with the performance of the best configuration that uses machine learning algorithms on a validation set of 110 posts. The results show that using the regular expression approach, we can classify posts into the correct question category with an average precision and recall of 0.90, and an MCC of 0.68. Additionally, we applied the regex approach to all questions of SO that deal with Android app development and investigated the co-occurrence of question categories in posts. We found that the categories API usage, Conceptual, and Discrepancy are the most frequently assigned question categories and that they also occur together frequently. Our approach can be used to support developers in browsing SO discussions or researchers in building recommender systems based on SO.
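
To illustrate the regex-based approach (with invented patterns, not the ones derived in the paper), a category classifier can be as simple as:

```python
import re

# Hypothetical indicator patterns, one per question category (the paper derives
# its regexes from manually marked phrases in 1,000 labeled posts).
CATEGORY_PATTERNS = {
    "API usage":   re.compile(r"\bhow (do|can|to) .* use\b", re.I),
    "Discrepancy": re.compile(r"\b(doesn'?t|not) work(ing)?\b|\bexpected\b", re.I),
    "Errors":      re.compile(r"\b(exception|error|stack ?trace)\b", re.I),
}

def classify(post: str) -> list:
    # A post may match several categories; fall back to "Other" if none match.
    return [c for c, p in CATEGORY_PATTERNS.items() if p.search(post)] or ["Other"]

print(classify("Getting an error when parsing JSON - how can I use Gson here?"))
```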

Journal ArticleDOI
TL;DR: In a large-scale empirical study using six open source software projects, the results show that context metrics that consider the context lines of added lines achieve the best median value in all cases in terms of a statistical test.
Abstract: Traditional just-in-time defect prediction approaches have been using changed lines of software to predict defective changes in software development. However, they disregard information around the changed lines. Our main hypothesis is that such information has an impact on the likelihood that the change is defective. To take advantage of this information in defect prediction, we consider the n lines (n = 1,2,…) that precede and follow the changed lines (which we call context lines), and propose metrics that measure them, which we call "context metrics." Specifically, these context metrics are defined as the number of words/keywords in the context lines. In a large-scale empirical study using six open source software projects, we compare the performance of our context metrics, traditional code churn metrics (e.g., the number of modified subsystems), our extended context metrics which measure not only context lines but also changed lines, and combination metrics that use two extended context metrics, in a prediction model for defect prediction. The results show that context metrics that consider the context lines of added lines achieve the best median value in all cases in terms of a statistical test. Moreover, using a small number of context lines is suitable for the context metric that considers words, while using a larger number of context lines is suitable for the context metric that considers keywords. Finally, the combination metrics of two extended context metrics significantly outperform all studied metrics in all studied projects with respect to the area under the receiver operating characteristic curve (AUC) and the Matthews correlation coefficient (MCC).
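
A minimal sketch of the proposed context metrics might look as follows: count the words, or only the keywords, appearing in the n lines around each changed line. The source lines, keyword set, and function shape are illustrative assumptions.

```python
def context_metric(file_lines, changed_idx, n, keywords=None):
    """Count words (or only keywords) in the n lines before and after each
    changed line -- a sketch of the paper's 'context metrics'."""
    count = 0
    for i in changed_idx:
        lo, hi = max(0, i - n), min(len(file_lines), i + n + 1)
        for j in range(lo, hi):
            if j in changed_idx:
                continue                      # context lines only
            for token in file_lines[j].split():
                if keywords is None or token in keywords:
                    count += 1
    return count

src = ["int total = 0;", "for (Item i : items) {", "total += i.price;", "}"]
print(context_metric(src, changed_idx={2}, n=1))                       # word variant
print(context_metric(src, {2}, 1, keywords={"for", "int", "return"}))  # keyword variant
```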

Journal ArticleDOI
TL;DR: The lessons learned when maturing Eclipse Steady from a research prototype to an industrial-grade solution are reported, and an empirical study compares its detection capabilities with those of OWASP Dependency Check.
Abstract: Open source software (OSS) libraries are widely used in the industry to speed up the development of software products. However, these libraries are subject to an ever-increasing number of vulnerabilities that are publicly disclosed. It is thus crucial for application developers to detect dependencies on vulnerable libraries in a timely manner, to precisely assess their impact, and to mitigate any potential risk. This paper presents a novel method to detect, assess and mitigate OSS vulnerabilities. Differently from state-of-the-art approaches that depend on metadata to identify vulnerable OSS dependencies, our solution is code-centric, and combines static and dynamic analyses to determine the reachability of the vulnerable portion of libraries, in the context of a given application. Our approach also supports developers in choosing among the existing non-vulnerable library versions, with the goal to determine and minimize incompatibilities. Eclipse Steady, the open source implementation of our code-centric and usage-based approach is the tool recommended to scan Java software products at SAP; it has been successfully used to perform more than one million scans of about 1500 applications. In this paper we report on the lessons learned when maturing the tool from a research prototype to an industrial-grade solution. To evaluate Eclipse Steady, we conducted an empirical study to compare its detection capabilities with those of OWASP Dependency Check (OWASP DC), scanning 300 large enterprise applications under development with a total of 78,165 dependencies. Reviewing a sample of the findings reported only by one of the two tools revealed that all Steady findings are true positives, while 88.8% of the findings of OWASP DC for vulnerabilities covered by our code-centric approach are false positives. For vulnerabilities not caused by code but due, e.g., to erroneous configuration, 63.3% of OWASP DC findings are true positives.

Journal ArticleDOI
TL;DR: The design of InsighTD, which has the primary goal of replication at large scale, is presented, with the results of the study in Brazil as a small part of the larger puzzle.
Abstract: Studying the causes of technical debt (TD) could aid in TD prevention, thus easing the job of TD management. On the other hand, better understanding of the effects of TD could also aid in TD management by facilitating more informed decisions about incurring and paying off debt. Our objective is to create a deeper understanding of, and confirm existing evidence about, the causes and effects of TD by collecting new evidence from real-world TD examples. InsighTD is a globally distributed family of industrial surveys on the causes and effects of TD. It is designed to run as a large-scale study based on continuous and independent replications in different countries. The survey instrument asks practitioners to describe in detail a real example of TD from their experience. We present in this paper the design of InsighTD, which has the primary goal of replication at large scale, with the results of the study in Brazil as a small part of the larger puzzle. The first iteration of the InsighTD survey, carried out in Brazil, yielded 107 responses. We identified a total of 78 causes and 66 effects, which confirm and also extend the current knowledge on causes and effects of TD. Then, we organized the identified set of causes and effects in probabilistic cause-effect diagrams. The proposed diagrams highlight the causes that can most contribute to the occurrence of TD as well as the most common effects that occur as a result of debt. We intend to reduce the problem of isolated TD investigations that are not yet representative and build a continuous and generalizable empirical basis for understanding practical problems and challenges of TD.

Journal ArticleDOI
TL;DR: This paper exploits the so-called intensity index, a previously defined metric that captures the severity of a code smell, and evaluates its contribution when added as additional feature in the context of three state of the art change prediction models based on product, process, and developer-based features.
Abstract: Code smells are sub-optimal implementation choices applied by developers that have the effect of negatively impacting, among others, the change-proneness of the affected classes. Based on this consideration, in this paper we conjecture that code smell-related information can be effectively exploited to improve the performance of change prediction models, i.e., models having the goal of indicating which classes are more likely to change in the future. We exploit the so-called intensity index—a previously defined metric that captures the severity of a code smell—and evaluate its contribution when added as additional feature in the context of three state of the art change prediction models based on product, process, and developer-based features. We also compare the performance achieved by the proposed model with a model based on previously defined antipattern metrics, a set of indicators computed considering the history of code smells in files. Our results report that (i) the prediction performance of the intensity-including models is statistically better than the baselines and, (ii) the intensity is a better predictor than antipattern metrics. We observed some orthogonality between the set of change-prone and non-change-prone classes correctly classified by the models relying on intensity and antipattern metrics: for this reason, we also devise and evaluate a smell-aware combined change prediction model including product, process, developer-based, and smell-related features. We show that the F-Measure of this model is notably higher than that of the other models.

Journal ArticleDOI
TL;DR: This work investigated the reasons behind library reuse and re-implementation and provided a few suggestions to improve current library recommendation systems: tailored recommendation according to users' preferences, detection of external code that is similar to a part of the users' code (to avoid duplication or re-implementation), grouping similar recommendations for developers to compare and select the one they prefer, and disrecommendation of poor-quality libraries.
Abstract: Nowadays, with the rapid growth of open source software (OSS), library reuse becomes more and more popular since a large number of third-party libraries are available to download and reuse. A deeper understanding of why developers reuse a library (i.e., replacing self-implemented code with an external library) or re-implement a library (i.e., replacing an imported external library with self-implemented code) could help researchers better understand the factors that developers are concerned with when reusing code. This understanding can then be used to improve existing libraries and API recommendation tools for researchers and practitioners by using the developers' concerns identified in this study as design criteria. In this work, we investigated the reasons behind library reuse and re-implementation. To achieve this goal, we first crawled data from two popular sources, F-Droid and GitHub. Then, potential instances of library reuse and re-implementation were found automatically based on certain heuristics. Next, for each instance, we further manually identified whether it was valid or not. For library re-implementation, we obtained 82 instances distributed in 75 repositories. We then conducted two types of surveys (i.e., an individual survey sent to the corresponding developers of the validated instances and another open survey) for library reuse and re-implementation. For the library reuse individual survey, we received 36 responses out of 139 contacted developers. For the re-implementation individual survey, we received 13 responses out of 71 contacted developers. In addition, we received 56 responses from the open survey. Finally, we performed qualitative and quantitative analysis on the survey responses and commit logs of the validated instances. The results suggest that library reuse occurs mainly because developers were initially unaware of the library or the library had not been introduced. Re-implementation occurs mainly because the used library method is only a small part of the library, the library dependencies are too complicated, or the library method is deprecated. Finally, based on all findings obtained from analyzing the surveys and commit messages, we provide a few suggestions to improve current library recommendation systems: tailored recommendation according to users' preferences, detection of external code that is similar to a part of the users' code (to avoid duplication or re-implementation), grouping similar recommendations for developers to compare and select the one they prefer, and disrecommendation of poor-quality libraries.

Journal ArticleDOI
TL;DR: In this article, the authors developed a socio-technical research framework to capture the main beneficiary of a research study (the who), the main type of research contribution produced (the what), and the research strategies used in the study (how we methodologically approach delivering relevant results given the who and what of our studies).
Abstract: Software engineering is a socio-technical endeavor, and while many of our contributions focus on technical aspects, human stakeholders such as software developers are directly affected by and can benefit from our research and tool innovations. In this paper, we question how much of our research addresses human and social issues, and explore how much we study human and social aspects in our research designs. To answer these questions, we developed a socio-technical research framework to capture the main beneficiary of a research study (the who), the main type of research contribution produced (the what), and the research strategies used in the study (how we methodologically approach delivering relevant results given the who and what of our studies). We used this Who-What-How framework to analyze 151 papers from two well-cited publishing venues—the main technical track at the International Conference on Software Engineering, and the Empirical Software Engineering Journal by Springer—to assess how much this published research explicitly considers human aspects. We find that although a majority of these papers claim the contained research should benefit human stakeholders, most focus predominantly on technical contributions. Although our analysis is scoped to two venues, our results suggest a need for more diversification and triangulation of research strategies. In particular, there is a need for strategies that aim at a deeper understanding of human and social aspects of software development practice to balance the design and evaluation of technical innovations. We recommend that the framework should be used in the design of future studies in order to steer software engineering research towards explicitly including human and social concerns in their designs, and to improve the relevance of our research for human stakeholders.

Journal ArticleDOI
TL;DR: This paper quantifies the amount of clones in Ethereum (RQ1), characterizes key properties of clone clusters (RQ2), determines whether smart contracts contain pieces of code that are identical to those published by OpenZeppelin (RQ3), and concludes that these findings yield implications for the security, development, and usage of smart contracts.
Abstract: Ethereum is a blockchain platform that hosts and executes smart contracts. Smart contracts have been used to implement cryptocurrencies and crowdfunding initiatives (ICOs). A major concern in Ethereum is the security of smart contracts. Different from traditional software development, smart contracts are immutable once deployed. Hence, vulnerabilities and bugs in smart contracts can lead to catastrophic financial losses. In order to avoid taking the risk of writing buggy code, smart contract developers are encouraged to reuse pieces of code from reputable sources (e.g., OpenZeppelin). In this paper, we study code cloning in Ethereum. Our goal is to quantify the amount of clones in Ethereum (RQ1), understand key characteristics of clone clusters (RQ2), and determine whether smart contracts contain pieces of code that are identical to those published by OpenZeppelin (RQ3). We applied Deckard, a tree-based clone detector, to all Ethereum contracts for which the source code was available. We observe that developers frequently clone contracts. In particular, 79.2% of the studied contracts are clones and we note an upward trend in the number of cloned contracts per quarter. With regards to the characteristics of clone clusters, we observe that: (i) 9 out of the top-10 largest clone clusters are token managers, (ii) most of the activity of a cluster tends to be concentrated on a few contracts, and (iii) contracts in a cluster tend to be created by several authors. Finally, we note that the studied contracts have different ratios of code blocks that are identical to those provided by the OpenZeppelin project. Due to the immutability of smart contracts, as well as the impossibility of reverting transactions once they are deemed final, we conclude that the aforementioned findings yield implications for the security, development, and usage of smart contracts.
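
The study's clone detection relies on Deckard, a tree-based detector; the much simpler sketch below only illustrates the RQ3-style check for blocks identical to reference code, using whitespace normalization and hashing. The code snippets are placeholders, not actual OpenZeppelin source.

```python
import hashlib
import re

def normalize(block: str) -> str:
    # Collapse whitespace so formatting differences don't hide identical code.
    return re.sub(r"\s+", " ", block).strip()

def block_hash(block: str) -> str:
    return hashlib.sha256(normalize(block).encode()).hexdigest()

# Placeholder reference block standing in for OpenZeppelin source code.
reference = {block_hash(
    "function transfer(address to, uint256 value) public returns (bool) "
    "{ _transfer(msg.sender, to, value); return true; }")}

contract_blocks = [
    "function transfer(address to,\n    uint256 value) public returns (bool) "
    "{ _transfer(msg.sender, to, value); return true; }",
]
identical = sum(block_hash(b) in reference for b in contract_blocks)
print(f"{identical}/{len(contract_blocks)} blocks identical to reference code")
```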

Journal ArticleDOI
TL;DR: It is found that the subsets of metrics produced by the commonly-used feature selection techniques (except for AutoSpearman) are often inconsistent and correlated; hence, these techniques should be avoided when interpreting defect models.
Abstract: The interpretation of defect models heavily relies on software metrics that are used to construct them. Prior work often uses feature selection techniques to remove metrics that are correlated and irrelevant in order to improve model performance. Yet, conclusions that are derived from defect models may be inconsistent if the selected metrics are inconsistent and correlated. In this paper, we systematically investigate 12 automated feature selection techniques with respect to the consistency, correlation, performance, computational cost, and the impact on the interpretation dimensions. Through an empirical investigation of 14 publicly-available defect datasets, we find that (1) 94–100% of the selected metrics are inconsistent among the studied techniques; (2) 37–90% of the selected metrics are inconsistent among training samples; (3) 0–68% of the selected metrics are inconsistent when the feature selection techniques are applied repeatedly; (4) 5–100% of the produced subsets of metrics contain highly correlated metrics; and (5) while the most important metrics are inconsistent among correlation threshold values, such inconsistent most important metrics are highly-correlated with the Spearman correlation of 0.85–1. Since we find that the subsets of metrics produced by the commonly-used feature selection techniques (except for AutoSpearman) are often inconsistent and correlated, these techniques should be avoided when interpreting defect models. In addition to introducing AutoSpearman which mitigates correlated metrics better than commonly-used feature selection techniques, this paper opens up new research avenues in the automated selection of features for defect models to optimise for interpretability as well as performance.
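
In the spirit of (but far simpler than) AutoSpearman, a correlation-aware selection step can greedily drop metrics that are highly Spearman-correlated with an already-kept metric; this sketch uses synthetic metric columns and an assumed 0.7 threshold.

```python
import numpy as np
from scipy.stats import spearmanr

def prune_correlated(X, names, threshold=0.7):
    """Keep a metric only if its |Spearman rho| with every kept metric is
    below the threshold -- a toy correlation-based selection step."""
    kept = []
    for j in range(X.shape[1]):
        if all(abs(spearmanr(X[:, j], X[:, k])[0]) < threshold for k in kept):
            kept.append(j)
    return [names[k] for k in kept]

rng = np.random.default_rng(1)
loc = rng.normal(size=200)
metrics = np.column_stack([loc,                                  # size metric
                           loc * 1.1 + rng.normal(0, 0.1, 200),  # near-duplicate
                           rng.normal(size=200)])                # independent
print(prune_correlated(metrics, ["loc", "statements", "fan_in"]))
```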

Journal ArticleDOI
TL;DR: fNIRS and eye tracking devices are proposed as an objective measure of program comprehension that allows researchers to conduct studies in environments close to real world settings, at the identifier level of granularity. The study also observes that self-reported task difficulty, cognitive load, and fixation duration do not correlate and appear to measure different aspects of task difficulty.
Abstract: A large portion of the cost of any software lies in the time spent by developers in understanding a program’s source code before any changes can be undertaken. Measuring program comprehension is not a trivial task. In fact, different studies use self-reported and various psycho-physiological measures as proxies. In this research, we propose a methodology using functional Near Infrared Spectroscopy (fNIRS) and eye tracking devices as an objective measure of program comprehension that allows researchers to conduct studies in environments close to real world settings, at identifier level of granularity. We validate our methodology and apply it to study the impact of lexical, structural, and readability issues on developers’ cognitive load during bug localization tasks. Our study involves 25 undergraduate and graduate students and 21 metrics. Results show that the existence of lexical inconsistencies in the source code significantly increases the cognitive load experienced by participants not only on identifiers involved in the inconsistencies but also throughout the entire code snippet. We did not find statistical evidence that structural inconsistencies increase the average cognitive load that participants experience, however, both types of inconsistencies result in lower performance in terms of time and success rate. Finally, we observe that self-reported task difficulty, cognitive load, and fixation duration do not correlate and appear to be measuring different aspects of task difficulty.

Journal ArticleDOI
TL;DR: In this paper, a time-aware evaluation of cross-project defect prediction models is presented, where models are trained only on the past and evaluations are executed only on future software projects.
Abstract: Researchers in empirical software engineering often make claims based on observable data such as defect reports. Unfortunately, in many cases, these claims are generalized beyond the data sets that have been evaluated. Will the researcher's conclusions hold a year from now for the same software projects? Perhaps not. Recent studies show that in the area of Software Analytics, conclusions over different data sets are usually inconsistent. In this article, we empirically investigate whether conclusions in the area of cross-project defect prediction truly exhibit stability throughout time or not. Our investigation applies a time-aware evaluation approach where models are trained only on the past, and evaluations are executed only on the future. Through this time-aware evaluation, we show that depending on which time period we evaluate defect predictors, their performance, in terms of F-Score, the area under the curve (AUC), and the Matthews correlation coefficient (MCC), varies and their results are not consistent. The next release of a product, which is significantly different from its prior release, may drastically change defect prediction performance. Therefore, without knowing about the conclusion stability, empirical software engineering researchers should limit their claims of performance within the contexts of evaluation, because broad claims about defect prediction performance might be contradicted by the next upcoming release of a product under analysis.
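
The core of a time-aware evaluation is the split itself: train strictly on data before a cutoff and test strictly after it. A minimal pandas sketch with hypothetical defect data (the dates, metrics, and cutoff are assumptions):

```python
import pandas as pd

# Hypothetical defect data with commit timestamps; a time-aware evaluation
# trains only on the past and tests only on the future.
df = pd.DataFrame({
    "date":  pd.to_datetime(["2015-03-01", "2015-09-10", "2016-02-05",
                             "2016-08-20", "2017-01-15", "2017-06-30"]),
    "loc":   [120, 300, 80, 500, 60, 210],
    "churn": [10, 45, 5, 80, 3, 30],
    "defective": [0, 1, 0, 1, 0, 1],
})

cutoff = pd.Timestamp("2016-12-31")
train, test = df[df.date <= cutoff], df[df.date > cutoff]
print(len(train), "training rows (past),", len(test), "test rows (future)")
```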

Journal ArticleDOI
TL;DR: An automated, machine-learning-based approach for demarcating requirements in free-form requirements specifications is proposed, which can be applied to a wide variety of specifications in different domains and with different writing styles.
Abstract: A simple but important task during the analysis of a textual requirements specification is to determine which statements in the specification represent requirements. In principle, by following suitable writing and markup conventions, one can provide an immediate and unequivocal demarcation of requirements at the time a specification is being developed. However, neither the presence nor a fully accurate enforcement of such conventions is guaranteed. The result is that, in many practical situations, analysts end up resorting to after-the-fact reviews for sifting requirements from other material in a requirements specification. This is both tedious and time-consuming. We propose an automated approach for demarcating requirements in free-form requirements specifications. The approach, which is based on machine learning, can be applied to a wide variety of specifications in different domains and with different writing styles. We train and evaluate our approach over an independently labeled dataset comprised of 33 industrial requirements specifications. Over this dataset, our approach yields an average precision of 81.2% and an average recall of 95.7%. Compared to simple baselines that demarcate requirements based on the presence of modal verbs and identifiers, our approach leads to an average gain of 16.4% in precision and 25.5% in recall. We collect and analyze expert feedback on the demarcations produced by our approach for industrial requirements specifications. The results indicate that experts find our approach useful and efficient in practice. We developed a prototype tool, named DemaRQ, in support of our approach. To facilitate replication, we make available to the research community this prototype tool alongside the non-proprietary portion of our training data.
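
The paper's baselines demarcate requirements by the presence of modal verbs and identifiers; the toy sketch below implements only the modal-verb part, with an assumed verb list and invented statements.

```python
import re

# Baseline heuristic: a statement is a requirement if it contains a modal
# verb (this exact verb list is an assumption, not the paper's).
MODALS = re.compile(r"\b(shall|must|will|should)\b", re.I)

statements = [
    "The system shall log every failed login attempt.",
    "This section describes the operating environment.",
    "Operators must be able to export reports as PDF.",
]
for s in statements:
    print("REQ " if MODALS.search(s) else "INFO", s)
```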

Journal ArticleDOI
TL;DR: In this article, the authors applied a multi-criteria decision-making technique to assess the block-level risk to cyclones in the Sundarban region, India, and found that nearness to the coastline and distance from cyclone tracks have the greatest vulnerability exposure; windspeed has the highest hazard score; and nearness to cyclone shelters and cyclone awareness have the greatest mitigation capacity.
Abstract: Cyclones are among the most devastating natural hazards on earth, with extensive consequences for both life and livelihood. Assessment of cyclone risk is important not only for the survival of people but also for designing adaptation strategies in major cyclone-prone areas. Risk can be defined as the integrated function of vulnerability, hazard, and mitigation capacity. Risk assessment is a protracted task requiring analysis of multiple factors, which also change with geographical location. Thus, we applied a multi-criteria decision-making technique to assess the block-level risk to cyclones in the Sundarban region, India. The weight-based analysis reveals that nearness to the coastline and distance from cyclone tracks have the greatest vulnerability exposure; windspeed has the highest hazard score; and nearness to cyclone shelters and cyclone awareness, on the other hand, have the greatest mitigation capacity in the cyclone risk analysis. The risk analysis shows that, of the 19 blocks in the Sundarban, nearly half are at very high to moderate cyclone risk with the least mitigation capacity to cope with such natural hazards. Blocks such as Gosaba and Kultali, located in the southern portions near the coast, are exposed to higher cyclone risk, while those situated farther from the coast in the central and northern parts are exposed to lower risk. The findings of the present study may have high implications for developing more mitigation capacity in very-high- to moderate-cyclone-risk areas. Therefore, the applied approach can help the local authorities in identifying vulnerable and hazard-prone areas and building actionable strategies for the mitigation and management of cyclone hazards in the Sundarban Biosphere Reserve.
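
As a hedged illustration of the weighted multi-criteria idea (the actual criteria, weights, and scores in the paper differ), a block-level risk score could be composed as follows:

```python
# Toy weighted-sum MCDM sketch: risk grows with vulnerability and hazard and
# shrinks with mitigation capacity. All weights and scores are illustrative.
weights = dict(vulnerability=0.4, hazard=0.4, mitigation=0.2)
blocks = {
    "Gosaba":  dict(vulnerability=0.9, hazard=0.8, mitigation=0.3),
    "Kultali": dict(vulnerability=0.8, hazard=0.9, mitigation=0.2),
    "Haroa":   dict(vulnerability=0.3, hazard=0.4, mitigation=0.6),
}
for name, s in blocks.items():
    risk = (weights["vulnerability"] * s["vulnerability"]
            + weights["hazard"] * s["hazard"]
            - weights["mitigation"] * s["mitigation"])
    print(f"{name}: risk score {risk:.2f}")
```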

Journal ArticleDOI
TL;DR: A Systematic Mapping Study of the literature on tools for DSL development, conducted to identify and map the tools, Language Workbenches (LW), or frameworks proposed to develop DSLs that were discussed and referenced in publications between 2012 and 2019.
Abstract: Domain-specific languages (DSL) are programming or modeling languages devoted to a given application domain. There are many tools used to support the implementation of a DSL, which makes the decision-making process for one or another hard. In this sense, identifying and mapping their features is relevant for decision-making by academic and industrial initiatives on DSL development. Objective: The goal of this work is to identify and map the tools, Language Workbenches (LW), or frameworks that were proposed to develop DSLs, as discussed and referenced in publications between 2012 and 2019. Method: A Systematic Mapping Study (SMS) of the literature on tools for DSL development. Results: We identified 59 tools, including 9 under a commercial license and 41 with non-commercial licenses, and analyzed their features from 230 papers. Conclusion: There is a substantial number of tools covering a large number of features. Furthermore, we observed that developers usually adopt one type of notation to implement a DSL: textual or graphical. We also discuss research gaps, such as a lack of tools that allow meta-meta model transformations and that support modeling tool interoperability.

Journal ArticleDOI
TL;DR: The design science lens helps emphasize the theoretical contribution of research output—in terms of technological rules—and reflect on the practical relevance, novelty and rigor of the rules proposed by the research.
Abstract: Assessing and communicating software engineering research can be challenging. Design science is recognized as an appropriate research paradigm for applied research, but is rarely explicitly used as a way to present planned or achieved research contributions in software engineering. Applying the design science lens to software engineering research may improve the assessment and communication of research contributions. The aim of this study is 1) to understand whether the design science lens helps summarize and assess software engineering research contributions, and 2) to characterize different types of design science contributions in the software engineering literature. In previous research, we developed a visual abstract template, summarizing the core constructs of the design science paradigm. In this study, we use this template in a review of a set of 38 award winning software engineering publications to extract, analyze and characterize their design science contributions. We identified five clusters of papers, classifying them according to their different types of design science contributions. The design science lens helps emphasize the theoretical contribution of research output—in terms of technological rules—and reflect on the practical relevance, novelty and rigor of the rules proposed by the research.

Journal ArticleDOI
TL;DR: The paper mines 3,073 open-source repositories containing more than 118 million lines of C# code and empirically investigates the relationships between seven architecture and 19 design smells, finding that smell density does not depend on repository size.
Abstract: The architecture of a software system represents the key design decisions and therefore its quality plays an important role in keeping the software maintainable. Code smells are indicators of quality issues in a software system and are classified based on their granularity, scope, and impact. Despite a plethora of existing work on smells, a detailed exploration of architecture smells, their characteristics, and their relationships with smells at other granularities is missing. The paper aims to study architecture smell characteristics and to investigate correlation, collocation, and causation relationships between architecture and design smells. We implement smell detection support for seven architecture smells. We mine 3,073 open-source repositories containing more than 118 million lines of C# code and empirically investigate the relationships between seven architecture and 19 design smells. We find that smell density does not depend on repository size. Cumulatively, architecture smells are highly correlated with design smells. Our collocation analysis finds that the majority of design and architecture smell pairs do not exhibit collocation. Finally, our causality analysis reveals that design smells cause architecture smells.