
Showing papers in "Empirical Software Engineering in 2020"


Journal ArticleDOI
TL;DR: In this article, the authors analyzed the projected changes in temperature and precipitation over six South Asian countries during the twenty-first century, and showed that the CMIP6 models display higher sensitivity to greenhouse gas emissions over South Asia compared with the CMIP5 models.
Abstract: The latest Coupled Model Intercomparison Project phase 6 (CMIP6) dataset was analyzed to examine the projected changes in temperature and precipitation over six South Asian countries during the twenty-first century. The CMIP6 model simulations reveal biases in annual mean temperature and precipitation over South Asia in the present climate. In the historical period, the median of the CMIP6 model ensemble systematically underestimates the annual mean temperature for all the South Asian countries, while a mixed behavior is shown in the case of precipitation. In the future climate, the CMIP6 models display higher sensitivity to greenhouse gas emissions over South Asia compared with the CMIP5 models. The multimodel ensemble from 27 CMIP6 models projects a continuous increase in the annual mean temperature over South Asia during the twenty-first century under three future scenarios. The projected temperature shows a large increase (over 6 °C under the SSP5-8.5 scenario) over the northwestern parts of South Asia, comprising the complex Karakorum and Himalayan mountain ranges. Any large increase in the mean temperature over this region will most likely result in a faster rate of glacier melting. By the end of the twenty-first century, the annual mean temperature (uncertainty range) over South Asia is projected to increase by 1.2 (0.7–2.1) °C, 2.1 (1.5–3.3) °C, and 4.3 (3.2–6.6) °C under the SSP1-2.6, SSP2-4.5, and SSP5-8.5 scenarios, respectively, relative to the present (1995–2014) climate. The warming over South Asia is also continuous on the seasonal time scale. The CMIP6 models projected higher warming in the winter season than in the summer over South Asia, which if verified will have repercussions for snow/ice accumulations as well as winter cropping patterns. The annual mean precipitation is also projected to increase over South Asia during the twenty-first century under all scenarios. The rate of change in the projected annual mean precipitation varies considerably between the South Asian countries. By the end of the twenty-first century, the country-averaged annual mean precipitation (uncertainty range) is projected to increase by 17.1 (2.2–49.1)% in Bangladesh, 18.9 (−4.9 to 72)% in Bhutan, 27.3 (5.3–160.5)% in India, 19.5 (−5.9 to 95.6)% in Nepal, 26.4 (6.4–159.7)% in Pakistan, and 25.1 (−8.5 to 61.0)% in Sri Lanka under the SSP5-8.5 scenario. The seasonal precipitation projections also show large variability. The projected winter precipitation reveals a robust increase over the western Himalayas, with a corresponding decrease over the eastern Himalayas. On the other hand, the summer precipitation shows a robust increase over most of the South Asia region, with the largest increase over the arid region of southern Pakistan and adjacent areas of India, under the high-emission scenario. The results presented in this study give detailed insights into CMIP6 model performance over the South Asia region, which could be extended further to develop adaptation strategies, and may act as a guideline document for climate change related policymaking in the region.
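
As an illustration of how such multimodel "median (uncertainty range)" figures are typically derived, the following minimal Python sketch computes an ensemble median and an assumed 5th–95th percentile range from hypothetical per-model warming values; the numbers, the percentile choice, and the variable names are illustrative assumptions, not the paper's data.

```python
import numpy as np

# Hypothetical per-model changes in South Asian annual mean temperature (°C),
# future period minus the 1995-2014 baseline, one value per CMIP6 model.
ssp585_changes = np.array([3.2, 4.1, 4.3, 5.0, 3.8, 6.6, 4.5, 3.9, 4.4])

median = np.median(ssp585_changes)
low, high = np.percentile(ssp585_changes, [5, 95])  # assumed uncertainty bounds
print(f"SSP5-8.5 warming: {median:.1f} ({low:.1f}-{high:.1f}) °C")
```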

211 citations


Journal ArticleDOI
TL;DR: Experimental results demonstrate that the method Hybrid-DeepCom outperforms the state-of-the-art by a substantial margin and the results show that reducing the out-of-vocabulary tokens improves the accuracy effectively.
Abstract: During software maintenance, developers spend a lot of time understanding the source code. Existing studies show that code comments help developers comprehend programs and reduce additional time spent on reading and navigating source code. Unfortunately, these comments are often mismatched, missing or outdated in software projects. Developers have to infer the functionality from the source code. This paper proposes a new approach named Hybrid-DeepCom to automatically generate code comments for the functional units of the Java language, namely, Java methods. The generated comments aim to help developers understand the functionality of Java methods. Hybrid-DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments from learned features. It formulates the comment generation task as a machine translation problem. Hybrid-DeepCom exploits a deep neural network that combines the lexical and structural information of Java methods for better comment generation. We conduct experiments on a large-scale Java corpus built from 9,714 open source projects on GitHub. We evaluate the experimental results on both machine translation metrics and information retrieval metrics. Experimental results demonstrate that our method Hybrid-DeepCom outperforms the state-of-the-art by a substantial margin. In addition, we evaluate the influence of out-of-vocabulary tokens on comment generation. The results show that reducing the out-of-vocabulary tokens improves the accuracy effectively.
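
Since the paper evaluates generated comments with machine translation metrics, here is a small, hedged example of computing a sentence-level BLEU score with NLTK; the reference and hypothesis comments are made up, and the smoothing choice is an assumption rather than the paper's exact setup.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Reference: a developer-written comment; hypothesis: a generated one.
reference = "returns the maximum of two integers".split()
hypothesis = "return the maximum of two int values".split()

# Smoothing avoids zero scores when short sentences miss higher-order n-grams.
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"Sentence BLEU: {score:.3f}")
```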

175 citations


Journal ArticleDOI
TL;DR: A systematic mapping study of testing techniques for MLSs, driven by 33 research questions, investigating multiple aspects of the testing approaches, such as the used/proposed adequacy criteria, the algorithms for test input generation, and the test oracles.
Abstract: A Machine Learning based System (MLS) is a software system including one or more components that learn how to perform a task from a given data set. The increasing adoption of MLSs in safety critical domains such as autonomous driving, healthcare, and finance has fostered much attention towards the quality assurance of such systems. Despite the advances in software testing, MLSs bring novel and unprecedented challenges, since their behaviour is defined jointly by the code that implements them and the data used for training them. Our objectives are to identify the existing solutions for functional testing of MLSs and classify them from three different perspectives: (1) the context of the problem they address, (2) their features, and (3) their empirical evaluation; to report demographic information about the ongoing research; and to identify open challenges for future research. We conducted a systematic mapping study about testing techniques for MLSs driven by 33 research questions. We followed existing guidelines when defining our research protocol so as to increase the repeatability and reliability of our results. We identified 70 relevant primary studies, mostly published in recent years. We identified 11 problems addressed in the literature. We investigated multiple aspects of the testing approaches, such as the used/proposed adequacy criteria, the algorithms for test input generation, and the test oracles. The most active research areas in MLS testing address automated scenario/input generation and test oracle creation. MLS testing is a rapidly growing and developing research area, with many open challenges, such as the generation of realistic inputs and the definition of reliable evaluation metrics and benchmarks.

116 citations


Journal ArticleDOI
TL;DR: An exploratory study of smart contracts, which elucidates how frequently the contracts are used, what they do, and how complex they are, and proposes an open research agenda to drive and foster future studies in the area.
Abstract: Ethereum is a blockchain platform that supports smart contracts. Smart contracts are pieces of code that perform general-purpose computations. For instance, smart contracts have been used to implement crowdfunding initiatives that raised a total of US$6.2 billion from January to June of 2018. In this paper, we conduct an exploratory study of smart contracts. Differently from prior studies that focused on particular aspects of a subset of smart contracts, our goal is to have a broader understanding of all contracts that are currently deployed in Ethereum. In particular, we elucidate how frequently the contracts are used (activity level), what they do (category), and how complex they are (source code complexity). To conduct this study, we mined and cross-linked data from four sources: the Ethereum dataset on the Google BigQuery platform, Etherscan, State of the DApps, and CoinMarketCap. Our study period runs from July 2015 (inception of Ethereum) until September 2018. With regards to activity level, we notice that it is concentrated on a very small subset of the contracts. More specifically, only 0.05% of the smart contracts are the target of 80% of the transactions that are sent to contracts. New solutions to cope with Ethereum's limited scalability should take such an activity imbalance into consideration. With regards to categories, we highlight that the new and widely advertised rich programming model of smart contracts is currently being used to develop very simple applications that tend to be token-centric (e.g., ICOs, Crowdsales, etc). Finally, with regards to code complexity, we observe that the source code of high-activity verified contracts is small, with at most 211 instructions in 80% of the cases. These contracts also commonly include at least two subcontracts and libraries in their source code. The comment ratio of these contracts is also significantly higher than that of GitHub top-starred projects written in Java, C++, and C#. Hence, the source code of high-activity verified smart contracts exhibits particular complexity characteristics compared to that of other popular programming languages. Further studies are necessary to uncover the actual reasons behind such differences. Finally, based on our findings, we propose an open research agenda to drive and foster future studies in the area.
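
To make the reported activity imbalance concrete (0.05% of contracts receiving 80% of transactions), this sketch shows one way such a concentration figure can be computed from transactions-per-contract counts; the heavy-tailed synthetic data is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical heavy-tailed transactions-per-contract counts.
tx_counts = rng.pareto(0.8, size=100_000)

sorted_counts = np.sort(tx_counts)[::-1]           # busiest contracts first
cumulative = np.cumsum(sorted_counts) / sorted_counts.sum()
k = np.searchsorted(cumulative, 0.80) + 1          # contracts covering 80% of traffic
print(f"{k / len(tx_counts):.2%} of contracts receive 80% of transactions")
```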

112 citations


Journal ArticleDOI
TL;DR: In this paper, a questionnaire survey was created mainly from existing, validated scales and translated into 12 languages, and the data was analyzed using nonparametric inferential statistics and structural equation modeling.
Abstract: Context: As a novel coronavirus swept the world in early 2020, thousands of software developers began working from home. Many did so on short notice, under difficult and stressful conditions. Objective: This study investigates the effects of the pandemic on developers’ wellbeing and productivity. Method: A questionnaire survey was created mainly from existing, validated scales and translated into 12 languages. The data was analyzed using non-parametric inferential statistics and structural equation modeling. Results: The questionnaire received 2225 usable responses from 53 countries. Factor analysis supported the validity of the scales and the structural model achieved a good fit (CFI = 0.961, RMSEA = 0.051, SRMR = 0.067). Confirmatory results include: (1) the pandemic has had a negative effect on developers’ wellbeing and productivity; (2) productivity and wellbeing are closely related; (3) disaster preparedness, fear related to the pandemic and home office ergonomics all affect wellbeing or productivity. Exploratory analysis suggests that: (1) women, parents and people with disabilities may be disproportionately affected; (2) different people need different kinds of support. Conclusions: To improve employee productivity, software companies should focus on maximizing employee wellbeing and improving the ergonomics of employees’ home offices. Women, parents and disabled persons may require extra support.
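
The abstract mentions non-parametric inferential statistics; as a hedged illustration (not the paper's actual analysis), a Mann-Whitney U test in SciPy could compare two groups of wellbeing scores like so, with entirely hypothetical data:

```python
from scipy.stats import mannwhitneyu

# Hypothetical 5-point wellbeing scores before and during the pandemic
# (the paper's scales and data are not reproduced here).
before = [4, 5, 3, 4, 4, 5, 3, 4, 2, 4]
during = [3, 4, 2, 3, 4, 3, 2, 3, 2, 3]

# One-sided test: is wellbeing before stochastically greater than during?
stat, p = mannwhitneyu(before, during, alternative="greater")
print(f"U = {stat}, p = {p:.4f}")
```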

105 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a systematic and automated approach to mine relevant and actionable fix patterns based on an iterative clustering strategy applied to atomic changes within patches, which can be leveraged to extract generic fix actions.
Abstract: Patching is a common activity in software development. It is generally performed on a source code base to address bugs or add new functionalities. In this context, given the recurrence of bugs across projects, the associated similar patches can be leveraged to extract generic fix actions. While the literature includes various approaches leveraging similarity among patches to guide program repair, these approaches often do not yield fix patterns that are tractable and reusable as actionable input to APR systems. In this paper, we propose FixMiner, a systematic and automated approach to mining relevant and actionable fix patterns based on an iterative clustering strategy applied to atomic changes within patches. The goal of FixMiner is to infer separate and reusable fix patterns that can be leveraged in other patch generation systems. FixMiner leverages Rich Edit Scripts, a specialized tree structure of edit scripts that captures the AST-level context of code changes. FixMiner uses different tree representations of Rich Edit Scripts for each round of clustering to identify similar changes: abstract syntax trees, edit actions trees, and code context trees. We have evaluated FixMiner on thousands of software patches collected from open source projects. Preliminary results show that we are able to mine accurate patterns, efficiently exploiting change information in Rich Edit Scripts. We further integrated the mined patterns into an automated program repair prototype, PARFixMiner, with which we are able to correctly fix 26 bugs of the Defects4J benchmark. Beyond this quantitative performance, we show that the mined fix patterns are sufficiently relevant to produce patches with a high probability of correctness: 81% of PARFixMiner's generated plausible patches are correct.
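
FixMiner's iterative clustering over Rich Edit Script representations is far richer than anything shown here, but the following toy Python sketch conveys the core idea of one clustering round: patches whose abstracted edit-action sequences are identical fall into the same cluster, and recurring clusters suggest candidate fix patterns. All identifiers and action tuples are invented for illustration.

```python
from collections import defaultdict

# Hypothetical atomic changes, each abstracted into an edit-action sequence
# (a crude stand-in for FixMiner's tree representations).
patches = {
    "p1": ("UPD", "InfixExpression", "==", "!="),
    "p2": ("UPD", "InfixExpression", "==", "!="),
    "p3": ("INS", "IfStatement", "null-check"),
}

clusters = defaultdict(list)
for patch_id, actions in patches.items():
    clusters[actions].append(patch_id)   # identical sequences share a cluster

for actions, members in clusters.items():
    if len(members) > 1:                 # a recurring change = candidate fix pattern
        print(actions, "->", members)
```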

83 citations


Journal ArticleDOI
TL;DR: This study confirms previous findings on the unwillingness of developers to configure ASATs and emphasizes the necessity to improve existing strategies for the selection and prioritization of ASAT warnings that are shown to developers.
Abstract: Automatic static analysis tools (ASATs) are instruments that support code quality assessment by automatically detecting defects and design issues. Despite their popularity, they are characterized by (i) a high false positive rate and (ii) the low comprehensibility of the generated warnings. However, no prior studies have investigated the usage of ASATs in different development contexts (e.g., code reviews, regular development), nor how open source projects integrate ASATs into their workflows. These perspectives are paramount to improve the prioritization of the identified warnings. To shed light on actual ASAT usage practices, in this paper we first survey 56 developers (66% from industry and 34% from open source projects) and interview 11 industrial experts leveraging ASATs in their workflow with the aim of understanding how they use ASATs in different contexts. Furthermore, to investigate how ASATs are being used in the workflows of open source projects, we manually inspect the contribution guidelines of 176 open-source systems and extract the ASATs' configuration and build files from their corresponding GitHub repositories. Our study highlights that (i) 71% of developers do pay attention to different warning categories depending on the development context; (ii) 63% of our respondents rely on specific factors (e.g., team policies and composition) when prioritizing warnings to fix during their programming; and (iii) 66% of the projects define how to use specific ASATs, but only 37% enforce their usage for new contributions. The perceived relevance of ASATs varies between different projects and domains, which is a sign that ASAT use is still not a common practice. In conclusion, this study confirms previous findings on the unwillingness of developers to configure ASATs and it emphasizes the necessity to improve existing strategies for the selection and prioritization of ASAT warnings that are shown to developers.

78 citations


Journal ArticleDOI
TL;DR: This work applies Latent Dirichlet Allocation (LDA) topic modeling techniques to derive the discussion topics related to three popular deep learning frameworks, namely, Tensorflow, PyTorch and Theano, and makes a comparison of topics between the two platforms.
Abstract: Deep learning has gained tremendous traction from the developer and researcher communities. It plays an increasingly significant role in a number of application domains. Deep learning frameworks are proposed to help developers and researchers easily leverage deep learning technologies, and they attract a great number of discussions on popular platforms, i.e., Stack Overflow and GitHub. To understand and compare the insights from these two platforms, we mine the topics of interest from both. Specifically, we apply Latent Dirichlet Allocation (LDA) topic modeling techniques to derive the discussion topics related to three popular deep learning frameworks, namely, Tensorflow, PyTorch and Theano. Within each platform, we compare the topics across the three deep learning frameworks. Moreover, we make a comparison of topics between the two platforms. Our observations include: 1) a wide range of topics is discussed about the three deep learning frameworks on both platforms, and the most popular workflow stages are Model Training and Preliminary Preparation; 2) the topic distributions at the workflow level and topic category level on Tensorflow and PyTorch are consistently similar, while the topic distribution pattern on Theano is quite different (the topic trends at the workflow level and topic category level of the three frameworks are also quite different); 3) the topics at the workflow level show different trends across the two platforms, e.g., the trend of the Preliminary Preparation stage topic on Stack Overflow becomes relatively stable after 2016, while on GitHub it shows a stronger upward trend after 2016. Besides, the Model Training stage topic achieves the highest impact scores on both platforms. Based on the findings, we also discuss implications for practitioners and researchers.
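
As a rough illustration of the LDA step (not the paper's pipeline or parameters), one could derive topics from a small corpus of post titles with scikit-learn as follows; the posts, topic count, and preprocessing are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical post titles; the study uses Stack Overflow and GitHub corpora.
posts = [
    "tensorflow session crashes during model training",
    "pytorch dataloader preprocessing for image dataset",
    "theano gpu configuration error on import",
    "how to save and restore a trained tensorflow model",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]   # top words per topic
    print(f"topic {i}: {top}")
```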

66 citations


Journal ArticleDOI
TL;DR: In this article, a machine learning approach, named FaRM, is proposed to select and rank killable and fault revealing mutants, i.e., mutants that lead to test cases that uncover unknown program faults.
Abstract: Mutant selection refers to the problem of choosing, among a large number of mutants, the (few) ones that should be used by the testers. In view of this, we investigate the problem of selecting the fault revealing mutants, i.e., the mutants that are killable and lead to test cases that uncover unknown program faults. We formulate two variants of this problem: the fault revealing mutant selection and the fault revealing mutant prioritization. We argue and show that these problems can be tackled through a set of 'static' program features and propose a machine learning approach, named FaRM, that learns to select and rank killable and fault revealing mutants. Experimental results involving 1,692 real faults show the practical benefits of our approach in both examined problems. Our results show that FaRM achieves a good trade-off between application cost and effectiveness (measured in terms of faults revealed). We also show that FaRM outperforms all the existing mutant selection methods, i.e., the random mutant sampling, the selective mutation and defect prediction (mutating the code areas pointed by defect prediction). In particular, our results show that with respect to mutant selection, our approach reveals 23% to 34% more faults than any of the baseline methods, while, with respect to mutant prioritization, it achieves a higher average percentage of revealed faults with a median difference between 4% and 9% (from the random mutant orderings).
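
FaRM's actual features and learner are described in the paper; the sketch below only illustrates the general pattern of prioritizing mutants by a classifier's predicted fault-revealing probability. The feature matrix, labels, and model choice are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical static features per mutant (e.g., mutation operator id,
# nesting depth, statement type); labels: 1 = fault revealing.
X = np.array([[0, 2, 1], [1, 1, 0], [2, 3, 1], [0, 1, 1], [1, 4, 0]])
y = np.array([1, 0, 1, 0, 1])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Prioritization: rank unseen mutants by predicted fault-revealing probability.
candidates = np.array([[2, 2, 1], [1, 1, 1], [0, 4, 0]])
scores = clf.predict_proba(candidates)[:, 1]
ranking = np.argsort(scores)[::-1]
print("mutant order:", ranking, "scores:", scores[ranking])
```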

59 citations


Journal ArticleDOI
TL;DR: This multiple-case study analyzes the current adoption of variability management techniques in twelve medium- to large-scale industrial cases in domains such as automotive, aerospace or railway systems to understand the current state of adoption and shed light on gaps to address in industrial practice.
Abstract: Handling large-scale software variability is still a challenge for many organizations. After decades of research on variability management concepts, many industrial organizations have introduced techniques known from research, but still lament that pure textbook approaches are not applicable or efficient. For instance, software product line engineering—an approach to systematically develop portfolios of products—is difficult to adopt given the high upfront investments; and even when adopted, organizations are challenged by evolving their complex product lines. Consequently, the research community now mainly focuses on re-engineering and evolution techniques for product lines; yet, understanding the current state of adoption and the industrial challenges for organizations is necessary to conceive effective techniques. In this multiple-case study, we analyze the current adoption of variability management techniques in twelve medium- to large-scale industrial cases in domains such as automotive, aerospace or railway systems. We identify the current state of variability management, emphasizing the techniques and concepts they adopted. We elicit the needs and challenges expressed for these cases, triangulated with results from a literature review. We believe our results help to understand the current state of adoption and shed light on gaps to address in industrial practice.

58 citations


Journal ArticleDOI
TL;DR: In this article, the authors used the Standardized Precipitation Index (SPI) to quantify drought characteristics; the SPI is widely used for its simplicity and its variable approaches to quantifying a drought.
Abstract: Deficiency in rainfall introduces drought phenomena with temporal and spatial variability in terms of intensity and magnitude. Studying drought at different scales is necessary for successful planning in a country such as India, where the agricultural sector contributes the most to the economy. Drought indices (DI) are a tool to quantify the nature of a drought and express it as a single number, which helps characterize the drought. The Standardized Precipitation Index (SPI) is one such tool, widely used for its simplicity and its variable approaches to quantifying a drought. Therefore, the present study uses the SPI to analyse drought phenomena at pre-monsoon, monsoon, post-monsoon and monthly time steps in three relatively drought-prone districts (Purulia, Bankura, Midnapore) of West Bengal in India, based on 117 years of rainfall data (1901–2017). From the SPI values, drought frequency is analysed using Gumbel's type 1 distribution and the trend is calculated using the Mann–Kendall test (M–K test). Droughts with negative SPI values occur frequently in these districts, with dry events increasing and wet and normal events decreasing. More intensive study of hydrological and agricultural drought is necessary to implement any plan given this increasing aggravation of drought.
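
For readers unfamiliar with the SPI, the following minimal sketch shows the standard computation idea: fit a gamma distribution to precipitation totals and map each observation's cumulative probability onto a standard normal deviate. The rainfall values are invented, and the zero-precipitation correction used in practice is omitted.

```python
import numpy as np
from scipy.stats import gamma, norm

# Hypothetical seasonal rainfall totals (mm) for one district, one value per year.
rain = np.array([812, 645, 903, 560, 731, 498, 840, 610, 700, 655])

# Fit a gamma distribution (location fixed at 0, as is conventional for SPI) ...
shape, loc, scale = gamma.fit(rain, floc=0)
# ... and map each year's cumulative probability onto a standard normal deviate.
spi = norm.ppf(gamma.cdf(rain, shape, loc=loc, scale=scale))
print(np.round(spi, 2))   # negative values indicate drier-than-median years
```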

Journal ArticleDOI
TL;DR: This paper empirically investigates the bad practices experienced by developers applying Continuous Integration and compiles a catalog of 79 CI bad smells belonging to 7 categories related to different dimensions of CI pipeline management and process.
Abstract: Continuous Integration (CI) has been claimed to introduce several benefits in software development, including high software quality and reliability. However, recent work pointed out challenges, barriers and bad practices characterizing its adoption. This paper empirically investigates the bad practices experienced by developers applying CI. The investigation has been conducted by leveraging semi-structured interviews of 13 experts and mining more than 2,300 Stack Overflow posts. As a result, we compiled a catalog of 79 CI bad smells belonging to 7 categories related to different dimensions of CI pipeline management and process. We have also investigated the perceived importance of the identified bad smells through a survey involving 26 professional developers, and discussed how the results of our study relate to existing knowledge about CI bad practices. Whilst some results, such as the poor usage of branches, confirm existing literature, the study also highlights bad practices not previously covered, e.g., related to static analysis tools or the abuse of shell scripts, and contradicts knowledge from existing literature, e.g., about avoiding nightly builds. We discuss the implications of our catalog of CI bad smells for (i) practitioners, e.g., favor specific, portable tools over hacking, and do not ignore nor hide build failures, (ii) educators, e.g., teach CI culture, not just technology, and teach CI by providing examples of what not to do, and (iii) researchers, e.g., developing support for failure analysis, as well as automated CI bad smell detectors.

Journal ArticleDOI
TL;DR: This paper discusses when and why researchers should use eye trackers as well as how they should use them, and compiles a list of typical use cases—real and anticipated—of eye trackers, as well as metrics, visualizations, and statistical analyses to analyze and report eye-tracking data.
Abstract: For several years, the software engineering research community used eye trackers to study program comprehension, bug localization, pair programming, and other software engineering tasks. Eye trackers provide researchers with insights on software engineers’ cognitive processes, data that can augment those acquired through other means, such as on-line surveys and questionnaires. While there are many ways to take advantage of eye trackers, advancing their use requires defining standards for experimental design, execution, and reporting. We begin by presenting the foundations of eye tracking to provide context and perspective. Based on previous surveys of eye tracking for programming and software engineering tasks and our collective, extensive experience with eye trackers, we discuss when and why researchers should use eye trackers as well as how they should use them. We compile a list of typical use cases—real and anticipated—of eye trackers, as well as metrics, visualizations, and statistical analyses to analyze and report eye-tracking data. We also discuss the pragmatics of eye tracking studies. Finally, we offer lessons learned about using eye trackers to study software engineering tasks. This paper is intended to be a one-stop resource for researchers interested in designing, executing, and reporting eye tracking studies of software engineering tasks.
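
As a trivial example of the kind of eye-tracking metrics the paper surveys, the snippet below aggregates total and mean fixation duration per area of interest from a hypothetical fixation log; real studies use richer metrics and dedicated tooling.

```python
from collections import defaultdict

# Hypothetical fixation log: (area of interest, duration in ms).
fixations = [("identifier", 210), ("identifier", 180), ("comment", 95),
             ("identifier", 250), ("comment", 120)]

totals, counts = defaultdict(int), defaultdict(int)
for aoi, ms in fixations:
    totals[aoi] += ms
    counts[aoi] += 1

for aoi in totals:
    print(f"{aoi}: total {totals[aoi]} ms, mean {totals[aoi] / counts[aoi]:.0f} ms")
```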

Journal ArticleDOI
TL;DR: A model for defining criteria to identify the first snapshot of an evolving software system that exhibits a bug is proposed, based on the perfect test idea; the study shows empirical evidence that the prevalent assumption, "a bug was introduced by the lines of code that were modified to fix it", is just one case of how bugs are introduced in a software system.
Abstract: When identifying the origin of software bugs, many studies assume that "a bug was introduced by the lines of code that were modified to fix it". However, this assumption does not always hold and at least in some cases, these modified lines are not responsible for introducing the bug, for example, when the bug was caused by a change in an external API. The lack of empirical evidence makes it impossible to assess how important these cases are and therefore, to which extent the assumption is valid. To advance in this direction, and better understand how bugs "are born", we propose a model for defining criteria to identify the first snapshot of an evolving software system that exhibits a bug. This model, based on the perfect test idea, decides whether a bug is observed after a change to the software. Furthermore, we studied the model's criteria by carefully analyzing how 116 bugs were introduced in two different open source software projects. The manual analysis helped classify the root cause of those bugs and created manually curated datasets with bug-introducing changes and with bugs that were not introduced by any change in the source code. Finally, we used these datasets to evaluate the performance of four existing SZZ-based algorithms for detecting bug-introducing changes. We found that SZZ-based algorithms are not very accurate, especially when multiple commits are found; the F-Score varies from 0.44 to 0.77, while the percentage of true positives does not exceed 63%. Our results show empirical evidence that the prevalent assumption, "a bug was introduced by the lines of code that were modified to fix it", is just one case of how bugs are introduced in a software system. Finding what introduced a bug is not trivial: bugs can be introduced by the developers and be in the code, or be created irrespective of the code. Thus, further research towards a better understanding of the origin of bugs in software projects could help improve the design of integration tests and of other procedures that make software development more robust.
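
The SZZ family of algorithms evaluated here roughly works by blaming the lines a fix deletes. A minimal sketch of that step, assuming a local git repository and already-known deleted-line numbers (both hypothetical inputs), might look like this:

```python
import subprocess

def blame_deleted_lines(repo, fix_commit, path, line_numbers):
    """Minimal SZZ step: blame the fix's parent to find the commits that last
    touched the lines deleted by the fix (candidate bug-introducing changes)."""
    candidates = set()
    for line in line_numbers:
        out = subprocess.check_output(
            ["git", "-C", repo, "blame", "-L", f"{line},{line}",
             "--porcelain", f"{fix_commit}^", "--", path],
            text=True)
        candidates.add(out.split()[0])   # first token is the blamed commit hash
    return candidates
```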

Journal ArticleDOI
TL;DR: This paper aims at automating the classification of SO question posts into seven question categories, and finds that the categories API usage, Conceptual, and Discrepancy are the most frequently assigned question categories and that they also occur together frequently.
Abstract: On question and answer sites, such as Stack Overflow (SO), developers use tags to label the content of a post and to support developers in question searching and browsing. However, these tags mainly refer to technological aspects instead of the purpose of the question. Tagging questions with their purpose can add a new dimension to the identification of discussed topics in posts on SO. In this paper, we aim at automating the classification of SO question posts into seven question categories. As a first step, we harmonized existing taxonomies of question categories and then manually classified 1,000 SO questions according to our new taxonomy. In addition to the question category, we marked the phrases that indicate a question category for each of the posts. We then used this data set to automate the classification of posts using two approaches. For the first approach, we manually analyzed the phrases to find patterns and derived regular expressions from them. Based on these regular expressions, we implemented, for each category, a classifier that determines whether a post belongs to that category. In the second approach, we use the curated data set to train classification models of supervised machine learning algorithms (Random Forest and Support Vector Machines). For the machine learning algorithms, we experimented with 1,312 different configurations regarding the preprocessing of the text and the representation of the input data. Then, we compared the performance of the regex approach with the performance of the best configuration that uses machine learning algorithms on a validation set of 110 posts. The results show that using the regular expression approach, we can classify posts into the correct question category with an average precision and recall of 0.90, and an MCC of 0.68. Additionally, we applied the regex approach to all questions of SO that deal with Android app development and investigated the co-occurrence of question categories in posts. We found that the categories API usage, Conceptual, and Discrepancy are the most frequently assigned question categories and that they also occur together frequently. Our approach can be used to support developers in browsing SO discussions or researchers in building recommender systems based on SO.
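
To illustrate the regex-based approach (with invented patterns, not the ones derived in the paper), a category classifier can be as simple as:

```python
import re

# Hypothetical indicator patterns, one per question category (the paper derives
# its regexes from manually marked phrases in 1,000 labeled posts).
CATEGORY_PATTERNS = {
    "API usage":   re.compile(r"\bhow (do|can|to) .* use\b", re.I),
    "Discrepancy": re.compile(r"\b(doesn'?t|not) work(ing)?\b|\bexpected\b", re.I),
    "Errors":      re.compile(r"\b(exception|error|stack ?trace)\b", re.I),
}

def classify(post: str) -> list:
    # A post may match several categories; fall back to "Other" if none match.
    return [c for c, p in CATEGORY_PATTERNS.items() if p.search(post)] or ["Other"]

print(classify("Getting an error when parsing JSON - how can I use Gson here?"))
```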

Journal ArticleDOI
TL;DR: In a large-scale empirical study using six open source software projects, the results show that context metrics that consider the context lines of added lines achieve the best median value in all cases in terms of a statistical test.
Abstract: Traditional just-in-time defect prediction approaches have been using changed lines of software to predict defective changes in software development. However, they disregard information around the changed lines. Our main hypothesis is that such information has an impact on the likelihood that the change is defective. To take advantage of this information in defect prediction, we consider the n lines (n = 1,2,…) that precede and follow the changed lines (which we call context lines), and propose metrics that measure them, which we call "context metrics." Specifically, these context metrics are defined as the number of words/keywords in the context lines. In a large-scale empirical study using six open source software projects, we compare the performance of our context metrics, traditional code churn metrics (e.g., the number of modified subsystems), our extended context metrics which measure not only context lines but also changed lines, and combination metrics that use two extended context metrics, in a prediction model for defect prediction. The results show that context metrics that consider the context lines of added lines achieve the best median value in all cases in terms of a statistical test. Moreover, using a small number of context lines is suitable for the context metric that considers words, while using a larger number of context lines is suitable for the context metric that considers keywords. Finally, the combination metrics of two extended context metrics significantly outperform all studied metrics in all studied projects with respect to the area under the receiver operating characteristic curve (AUC) and the Matthews correlation coefficient (MCC).
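
A minimal sketch of the proposed context metrics might look as follows: count the words, or only the keywords, appearing in the n lines around each changed line. The source lines, keyword set, and function shape are illustrative assumptions.

```python
def context_metric(file_lines, changed_idx, n, keywords=None):
    """Count words (or only keywords) in the n lines before and after each
    changed line -- a sketch of the paper's 'context metrics'."""
    count = 0
    for i in changed_idx:
        lo, hi = max(0, i - n), min(len(file_lines), i + n + 1)
        for j in range(lo, hi):
            if j in changed_idx:
                continue                      # context lines only
            for token in file_lines[j].split():
                if keywords is None or token in keywords:
                    count += 1
    return count

src = ["int total = 0;", "for (Item i : items) {", "total += i.price;", "}"]
print(context_metric(src, changed_idx={2}, n=1))                       # word variant
print(context_metric(src, {2}, 1, keywords={"for", "int", "return"}))  # keyword variant
```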

Journal ArticleDOI
TL;DR: The lessons learned when maturing Eclipse Steady from a research prototype to an industrial-grade solution are reported, and an empirical study compares its detection capabilities with those of OWASP Dependency Check.
Abstract: Open source software (OSS) libraries are widely used in the industry to speed up the development of software products. However, these libraries are subject to an ever-increasing number of vulnerabilities that are publicly disclosed. It is thus crucial for application developers to detect dependencies on vulnerable libraries in a timely manner, to precisely assess their impact, and to mitigate any potential risk. This paper presents a novel method to detect, assess and mitigate OSS vulnerabilities. Differently from state-of-the-art approaches that depend on metadata to identify vulnerable OSS dependencies, our solution is code-centric, and combines static and dynamic analyses to determine the reachability of the vulnerable portion of libraries, in the context of a given application. Our approach also supports developers in choosing among the existing non-vulnerable library versions, with the goal to determine and minimize incompatibilities. Eclipse Steady, the open source implementation of our code-centric and usage-based approach is the tool recommended to scan Java software products at SAP; it has been successfully used to perform more than one million scans of about 1500 applications. In this paper we report on the lessons learned when maturing the tool from a research prototype to an industrial-grade solution. To evaluate Eclipse Steady, we conducted an empirical study to compare its detection capabilities with those of OWASP Dependency Check (OWASP DC), scanning 300 large enterprise applications under development with a total of 78,165 dependencies. Reviewing a sample of the findings reported only by one of the two tools revealed that all Steady findings are true positives, while 88.8% of the findings of OWASP DC for vulnerabilities covered by our code-centric approach are false positives. For vulnerabilities not caused by code but due, e.g., to erroneous configuration, 63.3% of OWASP DC findings are true positives.

Journal ArticleDOI
TL;DR: The design of InsighTD, which has the primary goal of replication at large scale, is presented, with the results of the study in Brazil as a small part of the larger puzzle.
Abstract: Studying the causes of technical debt (TD) could aid in TD prevention, thus easing the job of TD management. On the other hand, better understanding of the effects of TD could also aid in TD management by facilitating more informed decisions about incurring and paying off debt. Our objective is to create a deeper understanding of, and confirm existing evidence about, the causes and effects of TD by collecting new evidence from real-world TD examples. InsighTD is a globally distributed family of industrial surveys on the causes and effects of TD. It is designed to run as a large-scale study based on continuous and independent replications in different countries. The survey instrument asks practitioners to describe in detail a real example of TD from their experience. We present in this paper the design of InsighTD, which has the primary goal of replication at large scale, with the results of the study in Brazil as a small part of the larger puzzle. The first iteration of the InsighTD survey, carried out in Brazil, yielded 107 responses. We identified a total of 78 causes and 66 effects, which confirm and also extend the current knowledge on causes and effects of TD. Then, we organized the identified set of causes and effects in probabilistic cause-effect diagrams. The proposed diagrams highlight the causes that can most contribute to the occurrence of TD as well as the most common effects that occur as a result of debt. We intend to reduce the problem of isolated TD investigations that are not yet representative and build a continuous and generalizable empirical basis for understanding practical problems and challenges of TD.

Journal ArticleDOI
TL;DR: This paper exploits the so-called intensity index, a previously defined metric that captures the severity of a code smell, and evaluates its contribution when added as additional feature in the context of three state of the art change prediction models based on product, process, and developer-based features.
Abstract: Code smells are sub-optimal implementation choices applied by developers that have the effect of negatively impacting, among others, the change-proneness of the affected classes. Based on this consideration, in this paper we conjecture that code smell-related information can be effectively exploited to improve the performance of change prediction models, i.e., models having the goal of indicating which classes are more likely to change in the future. We exploit the so-called intensity index—a previously defined metric that captures the severity of a code smell—and evaluate its contribution when added as additional feature in the context of three state of the art change prediction models based on product, process, and developer-based features. We also compare the performance achieved by the proposed model with a model based on previously defined antipattern metrics, a set of indicators computed considering the history of code smells in files. Our results report that (i) the prediction performance of the intensity-including models is statistically better than the baselines and, (ii) the intensity is a better predictor than antipattern metrics. We observed some orthogonality between the set of change-prone and non-change-prone classes correctly classified by the models relying on intensity and antipattern metrics: for this reason, we also devise and evaluate a smell-aware combined change prediction model including product, process, developer-based, and smell-related features. We show that the F-Measure of this model is notably higher than that of the other models.

Journal ArticleDOI
TL;DR: This work investigated the reasons behind library reuse and re-implementation and provided a few suggestions to improve current library recommendation systems: tailored recommendation according to users' preferences, detection of external code that is similar to a part of the users' code (to avoid duplication or re-implementation), grouping similar recommendations for developers to compare and select the one they prefer, and disrecommendation of poor-quality libraries.
Abstract: Nowadays, with the rapid growth of open source software (OSS), library reuse becomes more and more popular since a large number of third-party libraries are available to download and reuse. A deeper understanding of why developers reuse a library (i.e., replacing self-implemented code with an external library) or re-implement a library (i.e., replacing an imported external library with self-implemented code) could help researchers better understand the factors that developers are concerned with when reusing code. This understanding can then be used to improve existing libraries and API recommendation tools for researchers and practitioners by using the developers' concerns identified in this study as design criteria. In this work, we investigated the reasons behind library reuse and re-implementation. To achieve this goal, we first crawled data from two popular sources, F-Droid and GitHub. Then, potential instances of library reuse and re-implementation were found automatically based on certain heuristics. Next, for each instance, we further manually identified whether it was valid or not. For library re-implementation, we obtained 82 instances distributed in 75 repositories. We then conducted two types of surveys (i.e., an individual survey sent to the corresponding developers of the validated instances and another open survey) for library reuse and re-implementation. For the library reuse individual survey, we received 36 responses out of 139 contacted developers. For the re-implementation individual survey, we received 13 responses out of 71 contacted developers. In addition, we received 56 responses from the open survey. Finally, we performed qualitative and quantitative analysis on the survey responses and commit logs of the validated instances. The results suggest that library reuse occurs mainly because developers were initially unaware of the library or the library had not been introduced. Re-implementation occurs mainly because the used library method is only a small part of the library, the library dependencies are too complicated, or the library method is deprecated. Finally, based on all findings obtained from analyzing the surveys and commit messages, we provide a few suggestions to improve current library recommendation systems: tailored recommendation according to users' preferences, detection of external code that is similar to a part of the users' code (to avoid duplication or re-implementation), grouping similar recommendations for developers to compare and select the one they prefer, and disrecommendation of poor-quality libraries.

Journal ArticleDOI
TL;DR: In this article, the authors developed a socio-technical research framework to capture the main beneficiary of a research study (the who), the main type of research contribution produced (the what), and the research strategies used in the study (how we methodologically approach delivering relevant results given the who and what of our studies).
Abstract: Software engineering is a socio-technical endeavor, and while many of our contributions focus on technical aspects, human stakeholders such as software developers are directly affected by and can benefit from our research and tool innovations. In this paper, we question how much of our research addresses human and social issues, and explore how much we study human and social aspects in our research designs. To answer these questions, we developed a socio-technical research framework to capture the main beneficiary of a research study (the who), the main type of research contribution produced (the what), and the research strategies used in the study (how we methodologically approach delivering relevant results given the who and what of our studies). We used this Who-What-How framework to analyze 151 papers from two well-cited publishing venues—the main technical track at the International Conference on Software Engineering, and the Empirical Software Engineering Journal by Springer—to assess how much this published research explicitly considers human aspects. We find that although a majority of these papers claim the contained research should benefit human stakeholders, most focus predominantly on technical contributions. Although our analysis is scoped to two venues, our results suggest a need for more diversification and triangulation of research strategies. In particular, there is a need for strategies that aim at a deeper understanding of human and social aspects of software development practice to balance the design and evaluation of technical innovations. We recommend that the framework should be used in the design of future studies in order to steer software engineering research towards explicitly including human and social concerns in their designs, and to improve the relevance of our research for human stakeholders.

Journal ArticleDOI
TL;DR: This paper quantifies the amount of clones in Ethereum (RQ1), characterizes key properties of clone clusters (RQ2), determines whether smart contracts contain pieces of code that are identical to those published by OpenZeppelin (RQ3), and concludes that these findings yield implications for the security, development, and usage of smart contracts.
Abstract: Ethereum is a blockchain platform that hosts and executes smart contracts. Smart contracts have been used to implement cryptocurrencies and crowdfunding initiatives (ICOs). A major concern in Ethereum is the security of smart contracts. Different from traditional software development, smart contracts are immutable once deployed. Hence, vulnerabilities and bugs in smart contracts can lead to catastrophic financial losses. In order to avoid taking the risk of writing buggy code, smart contract developers are encouraged to reuse pieces of code from reputable sources (e.g., OpenZeppelin). In this paper, we study code cloning in Ethereum. Our goal is to quantify the amount of clones in Ethereum (RQ1), understand key characteristics of clone clusters (RQ2), and determine whether smart contracts contain pieces of code that are identical to those published by OpenZeppelin (RQ3). We applied Deckard, a tree-based clone detector, to all Ethereum contracts for which the source code was available. We observe that developers frequently clone contracts. In particular, 79.2% of the studied contracts are clones and we note an upward trend in the number of cloned contracts per quarter. With regards to the characteristics of clone clusters, we observe that: (i) 9 out of the top-10 largest clone clusters are token managers, (ii) most of the activity of a cluster tends to be concentrated on a few contracts, and (iii) contracts in a cluster tend to be created by several authors. Finally, we note that the studied contracts have different ratios of code blocks that are identical to those provided by the OpenZeppelin project. Due to the immutability of smart contracts, as well as the impossibility of reverting transactions once they are deemed final, we conclude that the aforementioned findings yield implications for the security, development, and usage of smart contracts.
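
The study's clone detection relies on Deckard, a tree-based detector; the much simpler sketch below only illustrates the RQ3-style check for blocks identical to reference code, using whitespace normalization and hashing. The code snippets are placeholders, not actual OpenZeppelin source.

```python
import hashlib
import re

def normalize(block: str) -> str:
    # Collapse whitespace so formatting differences don't hide identical code.
    return re.sub(r"\s+", " ", block).strip()

def block_hash(block: str) -> str:
    return hashlib.sha256(normalize(block).encode()).hexdigest()

# Placeholder reference block standing in for OpenZeppelin source code.
reference = {block_hash(
    "function transfer(address to, uint256 value) public returns (bool) "
    "{ _transfer(msg.sender, to, value); return true; }")}

contract_blocks = [
    "function transfer(address to,\n    uint256 value) public returns (bool) "
    "{ _transfer(msg.sender, to, value); return true; }",
]
identical = sum(block_hash(b) in reference for b in contract_blocks)
print(f"{identical}/{len(contract_blocks)} blocks identical to reference code")
```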

Journal ArticleDOI
TL;DR: It is found that the subsets of metrics produced by the commonly-used feature selection techniques (except for AutoSpearman) are often inconsistent and correlated; hence, these techniques should be avoided when interpreting defect models.
Abstract: The interpretation of defect models heavily relies on software metrics that are used to construct them. Prior work often uses feature selection techniques to remove metrics that are correlated and irrelevant in order to improve model performance. Yet, conclusions that are derived from defect models may be inconsistent if the selected metrics are inconsistent and correlated. In this paper, we systematically investigate 12 automated feature selection techniques with respect to the consistency, correlation, performance, computational cost, and the impact on the interpretation dimensions. Through an empirical investigation of 14 publicly-available defect datasets, we find that (1) 94–100% of the selected metrics are inconsistent among the studied techniques; (2) 37–90% of the selected metrics are inconsistent among training samples; (3) 0–68% of the selected metrics are inconsistent when the feature selection techniques are applied repeatedly; (4) 5–100% of the produced subsets of metrics contain highly correlated metrics; and (5) while the most important metrics are inconsistent among correlation threshold values, such inconsistent most important metrics are highly-correlated with the Spearman correlation of 0.85–1. Since we find that the subsets of metrics produced by the commonly-used feature selection techniques (except for AutoSpearman) are often inconsistent and correlated, these techniques should be avoided when interpreting defect models. In addition to introducing AutoSpearman which mitigates correlated metrics better than commonly-used feature selection techniques, this paper opens up new research avenues in the automated selection of features for defect models to optimise for interpretability as well as performance.
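
In the spirit of (but far simpler than) AutoSpearman, a correlation-aware selection step can greedily drop metrics that are highly Spearman-correlated with an already-kept metric; this sketch uses synthetic metric columns and an assumed 0.7 threshold.

```python
import numpy as np
from scipy.stats import spearmanr

def prune_correlated(X, names, threshold=0.7):
    """Keep a metric only if its |Spearman rho| with every kept metric is
    below the threshold -- a toy correlation-based selection step."""
    kept = []
    for j in range(X.shape[1]):
        if all(abs(spearmanr(X[:, j], X[:, k])[0]) < threshold for k in kept):
            kept.append(j)
    return [names[k] for k in kept]

rng = np.random.default_rng(1)
loc = rng.normal(size=200)
metrics = np.column_stack([loc,                                  # size metric
                           loc * 1.1 + rng.normal(0, 0.1, 200),  # near-duplicate
                           rng.normal(size=200)])                # independent
print(prune_correlated(metrics, ["loc", "statements", "fan_in"]))
```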

Journal ArticleDOI
TL;DR: fNIRS and eye tracking devices are proposed as an objective measure of program comprehension that allows researchers to conduct studies in environments close to real world settings, at the identifier level of granularity. The study also observes that self-reported task difficulty, cognitive load, and fixation duration do not correlate and appear to measure different aspects of task difficulty.
Abstract: A large portion of the cost of any software lies in the time spent by developers in understanding a program’s source code before any changes can be undertaken. Measuring program comprehension is not a trivial task. In fact, different studies use self-reported and various psycho-physiological measures as proxies. In this research, we propose a methodology using functional Near Infrared Spectroscopy (fNIRS) and eye tracking devices as an objective measure of program comprehension that allows researchers to conduct studies in environments close to real world settings, at identifier level of granularity. We validate our methodology and apply it to study the impact of lexical, structural, and readability issues on developers’ cognitive load during bug localization tasks. Our study involves 25 undergraduate and graduate students and 21 metrics. Results show that the existence of lexical inconsistencies in the source code significantly increases the cognitive load experienced by participants not only on identifiers involved in the inconsistencies but also throughout the entire code snippet. We did not find statistical evidence that structural inconsistencies increase the average cognitive load that participants experience, however, both types of inconsistencies result in lower performance in terms of time and success rate. Finally, we observe that self-reported task difficulty, cognitive load, and fixation duration do not correlate and appear to be measuring different aspects of task difficulty.

Journal ArticleDOI
TL;DR: In this paper, a time-aware evaluation of cross-project defect prediction models is presented, where models are trained only on the past and evaluations are executed only on future software projects.
Abstract: Researchers in empirical software engineering often make claims based on observable data such as defect reports. Unfortunately, in many cases, these claims are generalized beyond the data sets that have been evaluated. Will the researcher's conclusions hold a year from now for the same software projects? Perhaps not. Recent studies show that in the area of Software Analytics, conclusions over different data sets are usually inconsistent. In this article, we empirically investigate whether conclusions in the area of cross-project defect prediction truly exhibit stability throughout time or not. Our investigation applies a time-aware evaluation approach where models are trained only on the past, and evaluations are executed only on the future. Through this time-aware evaluation, we show that depending on which time period we evaluate defect predictors, their performance, in terms of F-Score, the area under the curve (AUC), and the Matthews correlation coefficient (MCC), varies and their results are not consistent. The next release of a product, which is significantly different from its prior release, may drastically change defect prediction performance. Therefore, without knowing about the conclusion stability, empirical software engineering researchers should limit their claims of performance within the contexts of evaluation, because broad claims about defect prediction performance might be contradicted by the next upcoming release of a product under analysis.
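
The core of a time-aware evaluation is the split itself: train strictly on data before a cutoff and test strictly after it. A minimal pandas sketch with hypothetical defect data (the dates, metrics, and cutoff are assumptions):

```python
import pandas as pd

# Hypothetical defect data with commit timestamps; a time-aware evaluation
# trains only on the past and tests only on the future.
df = pd.DataFrame({
    "date":  pd.to_datetime(["2015-03-01", "2015-09-10", "2016-02-05",
                             "2016-08-20", "2017-01-15", "2017-06-30"]),
    "loc":   [120, 300, 80, 500, 60, 210],
    "churn": [10, 45, 5, 80, 3, 30],
    "defective": [0, 1, 0, 1, 0, 1],
})

cutoff = pd.Timestamp("2016-12-31")
train, test = df[df.date <= cutoff], df[df.date > cutoff]
print(len(train), "training rows (past),", len(test), "test rows (future)")
```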

Journal ArticleDOI
TL;DR: An automated, machine-learning-based approach for demarcating requirements in free-form requirements specifications is proposed, which can be applied to a wide variety of specifications in different domains and with different writing styles.
Abstract: A simple but important task during the analysis of a textual requirements specification is to determine which statements in the specification represent requirements. In principle, by following suitable writing and markup conventions, one can provide an immediate and unequivocal demarcation of requirements at the time a specification is being developed. However, neither the presence nor a fully accurate enforcement of such conventions is guaranteed. The result is that, in many practical situations, analysts end up resorting to after-the-fact reviews for sifting requirements from other material in a requirements specification. This is both tedious and time-consuming. We propose an automated approach for demarcating requirements in free-form requirements specifications. The approach, which is based on machine learning, can be applied to a wide variety of specifications in different domains and with different writing styles. We train and evaluate our approach over an independently labeled dataset comprised of 33 industrial requirements specifications. Over this dataset, our approach yields an average precision of 81.2% and an average recall of 95.7%. Compared to simple baselines that demarcate requirements based on the presence of modal verbs and identifiers, our approach leads to an average gain of 16.4% in precision and 25.5% in recall. We collect and analyze expert feedback on the demarcations produced by our approach for industrial requirements specifications. The results indicate that experts find our approach useful and efficient in practice. We developed a prototype tool, named DemaRQ, in support of our approach. To facilitate replication, we make available to the research community this prototype tool alongside the non-proprietary portion of our training data.
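
The paper's baselines demarcate requirements by the presence of modal verbs and identifiers; the toy sketch below implements only the modal-verb part, with an assumed verb list and invented statements.

```python
import re

# Baseline heuristic: a statement is a requirement if it contains a modal
# verb (this exact verb list is an assumption, not the paper's).
MODALS = re.compile(r"\b(shall|must|will|should)\b", re.I)

statements = [
    "The system shall log every failed login attempt.",
    "This section describes the operating environment.",
    "Operators must be able to export reports as PDF.",
]
for s in statements:
    print("REQ " if MODALS.search(s) else "INFO", s)
```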

Journal ArticleDOI
TL;DR: In this article, the authors applied a multi-criteria decision-making technique to assess the block-level risk to cyclones in the Sundarban region, India, and found that nearness to the coastline and distance from cyclone tracks have the greatest vulnerability exposure; windspeed has the highest hazard score; and nearness to cyclone shelters and cyclone awareness have the greatest mitigation capacity.
Abstract: Cyclones are among the most devastating natural hazards on earth, with extensive consequences for both life and livelihood. Assessment of cyclone risk is important not only for the survival of people but also for designing adaptation strategies in major cyclone-prone areas. Risk can be defined as the integrated function of vulnerability, hazard, and mitigation capacity. Risk assessment is a protracted task requiring analysis of multiple factors, which also change with geographical location. Thus, we applied a multi-criteria decision-making technique to assess the block-level risk to cyclones in the Sundarban region, India. The weight-based analysis reveals that nearness to the coastline and distance from cyclone tracks have the greatest vulnerability exposure; windspeed has the highest hazard score; and nearness to cyclone shelters and cyclone awareness, on the other hand, have the greatest mitigation capacity in the cyclone risk analysis. The risk analysis shows that, of the 19 blocks in the Sundarban, nearly half are at very high to moderate cyclone risk with the least mitigation capacity to cope with such natural hazards. Blocks such as Gosaba and Kultali, located in the southern portions near the coast, are exposed to higher cyclone risk, while those situated farther from the coast in the central and northern parts are exposed to lower risk. The findings of the present study may have high implications for developing more mitigation capacity in very-high- to moderate-cyclone-risk areas. Therefore, the applied approach can help the local authorities in identifying vulnerable and hazard-prone areas and building actionable strategies for the mitigation and management of cyclone hazards in the Sundarban Biosphere Reserve.
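
As a hedged illustration of the weighted multi-criteria idea (the actual criteria, weights, and scores in the paper differ), a block-level risk score could be composed as follows:

```python
# Toy weighted-sum MCDM sketch: risk grows with vulnerability and hazard and
# shrinks with mitigation capacity. All weights and scores are illustrative.
weights = dict(vulnerability=0.4, hazard=0.4, mitigation=0.2)
blocks = {
    "Gosaba":  dict(vulnerability=0.9, hazard=0.8, mitigation=0.3),
    "Kultali": dict(vulnerability=0.8, hazard=0.9, mitigation=0.2),
    "Haroa":   dict(vulnerability=0.3, hazard=0.4, mitigation=0.6),
}
for name, s in blocks.items():
    risk = (weights["vulnerability"] * s["vulnerability"]
            + weights["hazard"] * s["hazard"]
            - weights["mitigation"] * s["mitigation"])
    print(f"{name}: risk score {risk:.2f}")
```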

Journal ArticleDOI
TL;DR: A Systematic Mapping Study of the literature on tools for DSL development, conducted to identify and map the tools, Language Workbenches (LW), or frameworks proposed to develop DSLs that were discussed and referenced in publications between 2012 and 2019.
Abstract: Domain-specific languages (DSL) are programming or modeling languages devoted to a given application domain. There are many tools used to support the implementation of a DSL, which makes the decision-making process for one or another hard. In this sense, identifying and mapping their features is relevant for decision-making by academic and industrial initiatives on DSL development. Objective: The goal of this work is to identify and map the tools, Language Workbenches (LW), or frameworks that were proposed to develop DSLs, as discussed and referenced in publications between 2012 and 2019. Method: A Systematic Mapping Study (SMS) of the literature on tools for DSL development. Results: We identified 59 tools, including 9 under a commercial license and 41 with non-commercial licenses, and analyzed their features from 230 papers. Conclusion: There is a substantial number of tools covering a large number of features. Furthermore, we observed that developers usually adopt one type of notation to implement a DSL: textual or graphical. We also discuss research gaps, such as a lack of tools that allow meta-meta model transformations and that support modeling tool interoperability.

Journal ArticleDOI
TL;DR: The design science lens helps emphasize the theoretical contribution of research output—in terms of technological rules—and reflect on the practical relevance, novelty and rigor of the rules proposed by the research.
Abstract: Assessing and communicating software engineering research can be challenging. Design science is recognized as an appropriate research paradigm for applied research, but is rarely explicitly used as a way to present planned or achieved research contributions in software engineering. Applying the design science lens to software engineering research may improve the assessment and communication of research contributions. The aim of this study is 1) to understand whether the design science lens helps summarize and assess software engineering research contributions, and 2) to characterize different types of design science contributions in the software engineering literature. In previous research, we developed a visual abstract template, summarizing the core constructs of the design science paradigm. In this study, we use this template in a review of a set of 38 award winning software engineering publications to extract, analyze and characterize their design science contributions. We identified five clusters of papers, classifying them according to their different types of design science contributions. The design science lens helps emphasize the theoretical contribution of research output—in terms of technological rules—and reflect on the practical relevance, novelty and rigor of the rules proposed by the research.

Journal ArticleDOI
TL;DR: The paper mines 3,073 open-source repositories containing more than 118 million lines of C# code and empirically investigates the relationships between seven architecture and 19 design smells, finding that smell density does not depend on repository size.
Abstract: The architecture of a software system represents the key design decisions and therefore its quality plays an important role in keeping the software maintainable. Code smells are indicators of quality issues in a software system and are classified based on their granularity, scope, and impact. Despite a plethora of existing work on smells, a detailed exploration of architecture smells, their characteristics, and their relationships with smells at other granularities is missing. The paper aims to study architecture smell characteristics and to investigate correlation, collocation, and causation relationships between architecture and design smells. We implement smell detection support for seven architecture smells. We mine 3,073 open-source repositories containing more than 118 million lines of C# code and empirically investigate the relationships between seven architecture and 19 design smells. We find that smell density does not depend on repository size. Cumulatively, architecture smells are highly correlated with design smells. Our collocation analysis finds that the majority of design and architecture smell pairs do not exhibit collocation. Finally, our causality analysis reveals that design smells cause architecture smells.