
Showing papers in "Information & Software Technology in 2021"


Journal ArticleDOI
TL;DR: This work proposes BGNN4VD (Bidirectional Graph Neural Network for Vulnerability Detection), a vulnerability detection approach built on a Bidirectional Graph Neural Network (BGNN), which achieves higher precision and accuracy than state-of-the-art methods.
Abstract: Context: Previous studies have shown that existing deep learning-based approaches can significantly improve the performance of vulnerability detection. They represent code in various forms and mine vulnerability features with deep learning models. However, differences in code representation forms and deep learning models mean that various approaches still have limitations. In practice, their false-positive rate (FPR) and false-negative rate (FNR) are still high. Objective: To address the limitations of existing deep learning-based vulnerability detection approaches, we propose BGNN4VD (Bidirectional Graph Neural Network for Vulnerability Detection), a vulnerability detection approach built on a Bidirectional Graph Neural Network (BGNN). Method: In Phase 1, we extract the syntax and semantic information of source code through the abstract syntax tree (AST), control flow graph (CFG), and data flow graph (DFG). Then in Phase 2, we use the vectorized source code as input to the Bidirectional Graph Neural Network (BGNN). In Phase 3, we learn the distinguishing features between vulnerable and non-vulnerable code by introducing backward edges on top of a traditional Graph Neural Network (GNN). Finally, in Phase 4, a Convolutional Neural Network (CNN) is used to further extract features and detect vulnerabilities through a classifier. Results: We evaluate BGNN4VD on four popular C/C++ projects from NVD and GitHub, and compare it with four state-of-the-art vulnerability detection approaches (Flawfinder, RATS, SySeVR, and VUDDY). Experiment results show that, compared with these baselines, BGNN4VD achieves 4.9%, 11.0%, and 8.4% improvement in F1-measure, accuracy, and precision, respectively. Conclusion: The proposed BGNN4VD achieves higher precision and accuracy than the state-of-the-art methods. In addition, when applied to the latest vulnerabilities reported by CVE, BGNN4VD still achieves a precision of 45.1%, which demonstrates the feasibility of BGNN4VD in practical applications.
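The abstract's Phase 3 adds backward edges to the code graph so that information can flow both along and against the original AST/CFG/DFG edges before message passing. A minimal illustrative sketch of that graph augmentation step (not the authors' implementation; the toy control-flow edges and function names are hypothetical):

```python
# Minimal sketch: augment a directed code graph (e.g., merged AST/CFG/DFG edges)
# with backward edges, as a bidirectional GNN would before message passing.
# The toy graph below is hypothetical, not the paper's data.
from collections import defaultdict

def add_backward_edges(edges):
    """Return the forward edges plus a reversed copy, each tagged with a direction label."""
    forward = [(src, dst, "forward") for src, dst in edges]
    backward = [(dst, src, "backward") for src, dst in edges]
    return forward + backward

def to_adjacency(tagged_edges):
    """Build an adjacency map: node -> list of (neighbor, direction)."""
    adj = defaultdict(list)
    for src, dst, direction in tagged_edges:
        adj[src].append((dst, direction))
    return adj

if __name__ == "__main__":
    # Toy control-flow edges between statement nodes of a function.
    cfg_edges = [("entry", "if_cond"), ("if_cond", "call_strcpy"), ("if_cond", "return")]
    adj = to_adjacency(add_backward_edges(cfg_edges))
    for node, neighbors in adj.items():
        print(node, "->", neighbors)
```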

72 citations


Journal ArticleDOI
TL;DR: In this paper, a systematic literature review is conducted on published primary studies on the automation of systematic literature review (SLR) studies, in which 41 primary studies are analyzed.
Abstract: Context: Systematic Literature Review (SLR) studies aim to identify relevant primary papers, extract the required data, and analyze and synthesize results to gain further and broader insight into the investigated domain. Multiple SLR studies have been conducted in several domains, such as software engineering, medicine, and pharmacy. Conducting an SLR is a time-consuming, laborious, and costly effort. As such, several researchers have developed different techniques to automate the SLR process. However, a systematic overview of the current state-of-the-art in SLR automation seems to be lacking. Objective: This study aims to collect and synthesize the studies that focus on the automation of SLR to pave the way for further research. Method: A systematic literature review is conducted on published primary studies on the automation of SLR studies, in which 41 primary studies have been analyzed. Results: This SLR identifies the objectives of automation studies, application domains, automated steps of the SLR, automation techniques, and challenges and solution directions. Conclusion: According to our study, the leading automated step is the Selection of Primary Studies. Although many studies have provided automation approaches for systematic literature reviews, no study has been found to apply automation techniques in the planning and reporting phases. Further research is needed to support the automation of the other activities of the SLR process.

69 citations


Journal ArticleDOI
TL;DR: hbrPredictor combines interactive machine learning and active learning for HBR prediction; on the one hand, it can dramatically reduce the number of bug reports required for prediction model training, and on the other hand, it improves the diversity and generalization ability of training samples via uncertainty sampling.
Abstract: Context: Bug reports record issues found during software development and maintenance. A high-impact bug report (HBR) describes an issue that can cause severe damage once it occurs after deployment. Identifying HBRs from the bug repository as early as possible is crucial for guaranteeing software quality. Objective: In recent years, many machine learning-based approaches have been proposed for HBR prediction, and most of them are based on supervised machine learning. However, supervised machine learning assumes the availability of a large amount of labeled data, which is often difficult to gather in practice. Method: In this paper, we propose hbrPredictor, which combines interactive machine learning and active learning for HBR prediction. On the one hand, it can dramatically reduce the number of bug reports required for prediction model training; on the other hand, it improves the diversity and generalization ability of training samples via uncertainty sampling. Result: We take security bug report (SBR) prediction as an example of HBR prediction and perform a large-scale experimental evaluation on datasets from different open-source projects. The results show: (1) hbrPredictor substantially outperforms the two baselines and obtains the maximum values of F1-score (0.7939) and AUC (0.8789); (2) with the dynamic stop criteria, hbrPredictor can reach its best performance with only 45% and 13% of the total bug reports for small-sized and large-sized datasets, respectively. Conclusion: By reducing the number of required training samples, hbrPredictor can substantially save data labeling effort without decreasing the effectiveness of the model.
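The core mechanism described here, active learning with uncertainty sampling, queries labels only for the bug reports the current model is least sure about. A minimal, self-contained sketch of a few such rounds on synthetic data (the classifier, seed size, and batch size are illustrative, not the paper's pipeline):

```python
# Illustrative uncertainty-sampling loop on synthetic data (not the paper's pipeline).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)

rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=50, replace=False))   # small initial seed set
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

model = LogisticRegression(max_iter=1000)
for round_ in range(5):
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])[:, 1]
    uncertainty = -np.abs(proba - 0.5)                # closest to 0.5 = most uncertain
    query = [unlabeled[i] for i in np.argsort(uncertainty)[-20:]]
    labeled.extend(query)                             # labels would come from a human oracle
    unlabeled = [i for i in unlabeled if i not in set(query)]
    print(f"round {round_}: labeled={len(labeled)}")
```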

64 citations


Journal ArticleDOI
TL;DR: A computational model that assists teams in identifying and monitoring risks at different points in the project life cycle, showing the applicability of risk recommendation to new projects based on the similarity analysis of context histories.
Abstract: Context: Risk event management has become strategic in Project Management, where uncertainties are inevitable. In this sense, the use of concepts of ubiquitous computing, such as contexts, context histories, and mobile computing, can assist in proactive project management. Objective: This paper proposes a computational model for reducing the probability of project failure through the prediction of risks. The purpose of the study is to present a model that assists teams in identifying and monitoring risks at different points in the life cycle of projects. The work's scientific contribution is the use of context histories to infer risk recommendations for new projects. Method: The research conducted a case study in a software development company. The study was applied in two scenarios. The first involved two teams that assessed the use of the prototype during the implementation of 5 projects. The second scenario considered 17 completed projects to assess the recommendations made by the Atropos model, comparing the recommendations with the risks in the original projects. In this scenario, Atropos used 70% of each project's execution as learning input for the risk recommendations generated for the same projects. Thus, the scenario aimed to assess whether the recommended risks are contained in the remaining 30% of the executed projects. As context histories, we used a database of 153 software projects from a financial company. Results: A project team with 18 professionals assessed the recommendations, obtaining 73% acceptance and 83% accuracy when compared to projects already being executed. The results demonstrated a high percentage of acceptance of the risk recommendations compared to other models that do not use the characteristics and similarities of projects. Conclusion: The results show the applicability of the risk recommendation to new projects, based on the similarity analysis of context histories. This study applies inferences on context histories in the development and planning of projects, focusing on risk recommendation. Thus, with recommendations considering the characteristics of each new project, the manager starts with a larger set of information to make more accurate project planning.
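The general idea of recommending risks from similar past projects in a context history can be sketched with a simple similarity search. The features, projects, risks, and threshold below are invented for illustration and are not the Atropos model's actual data or similarity function:

```python
# Illustrative sketch: recommend risks for a new project from the most similar
# past projects in a context history. All features, projects, and risks are hypothetical.
import numpy as np

# Context features per past project (e.g., team size, duration in months, scope changes).
history = {
    "proj_A": (np.array([8, 12, 5]), {"scope creep", "key-person dependency"}),
    "proj_B": (np.array([3, 4, 1]),  {"underestimated testing effort"}),
    "proj_C": (np.array([9, 10, 6]), {"scope creep", "integration delays"}),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(new_context, top_k=2):
    ranked = sorted(history.items(),
                    key=lambda kv: cosine(new_context, kv[1][0]), reverse=True)
    risks = set()
    for _, (_, project_risks) in ranked[:top_k]:
        risks |= project_risks           # union of risks from the most similar projects
    return risks

print(recommend(np.array([7, 11, 4])))
```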

58 citations


Journal ArticleDOI
TL;DR: The proposed assessment framework, based on the analysis of a set of characteristics and metrics, could be useful for companies if they need to migrate to Microservices and do not want to run the risk of failing to consider some important information.
Abstract: Context: Re-architecting monolithic systems with a Microservices-based architecture is a common trend. Various companies are migrating to Microservices for different reasons. However, making such an important decision as re-architecting an entire system must be based on real facts and not only on gut feelings. Objective: The goal of this work is to propose an evidence-based decision support framework for companies that need to migrate to Microservices, based on the analysis of a set of characteristics and metrics they should collect before re-architecting their monolithic system. Method: We conducted a survey in the form of interviews with professionals to derive the assessment framework based on Grounded Theory. Results: We identified a set of information and metrics that companies can use to decide whether to migrate to Microservices or not. The proposed assessment framework, based on the aforementioned metrics, could be useful for companies if they need to migrate to Microservices and do not want to run the risk of failing to consider some important information.

52 citations


Journal ArticleDOI
TL;DR: Complexity-based OverSampling TEchnique (COSTE), a novel oversampling technique that can achieve a low probability of false alarm (pf) and a high probability of detection (pd) simultaneously, is introduced and is recommended as an efficient alternative to address the class imbalance problem in SDP.
Abstract: Context: Generally, there are more non-defective instances than defective instances in the datasets used for software defect prediction (SDP), which is referred to as the class imbalance problem. Oversampling techniques are frequently adopted to alleviate the problem by generating new synthetic defective instances. Existing techniques generate either near-duplicated instances, which result in overgeneralization (a high probability of false alarm, pf), or overly diverse instances, which hurt the prediction model's ability to find defects (resulting in a low probability of detection, pd). Furthermore, when existing oversampling techniques are applied in SDP, the effort needed to inspect instances with different complexity is not taken into consideration. Objective: In this study, we introduce Complexity-based OverSampling TEchnique (COSTE), a novel oversampling technique that can achieve low pf and high pd simultaneously. Meanwhile, COSTE also performs better in terms of Norm(Popt) and ACC, two effort-aware measures that consider the testing effort. Method: COSTE combines pairs of defective instances with similar complexity to generate synthetic instances, which improves the diversity within the data, maintains the ability of prediction models to find defects, and takes the different testing effort needed for different instances into consideration. We conduct experiments to compare COSTE with the Synthetic Minority Oversampling TEchnique, Borderline-SMOTE, the Majority Weighted Minority Oversampling TEchnique, and MAHAKIL. Results: The experimental results on 23 releases of 10 projects show that COSTE greatly improves the diversity of the synthetic instances without compromising the ability of prediction models to find defects. In addition, COSTE outperforms the other oversampling techniques under the same testing effort. The statistical analysis indicates that COSTE's ability to outperform the other oversampling techniques is significant under the Wilcoxon rank sum test and Cliff's effect size. Conclusion: COSTE is recommended as an efficient alternative to address the class imbalance problem in SDP.
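The key step, interpolating between pairs of defective instances with similar complexity, can be sketched roughly as below, using a single LOC-like feature as a stand-in complexity measure. The pairing rule, data, and parameters are illustrative and not COSTE's exact procedure:

```python
# Rough sketch of complexity-based oversampling: pair defective instances with
# similar complexity (approximated here by one feature) and interpolate between them.
import numpy as np

rng = np.random.default_rng(42)
defective = rng.normal(size=(30, 5))     # toy feature vectors of defective modules
complexity = defective[:, 0]             # pretend column 0 is a complexity metric (e.g., LOC)

def oversample(instances, complexity, n_new=20):
    order = np.argsort(complexity)       # sort by complexity so adjacent instances are similar
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(0, len(order) - 1)
        a, b = instances[order[i]], instances[order[i + 1]]   # a pair adjacent in complexity
        t = rng.random()
        synthetic.append(a + t * (b - a))                      # linear interpolation
    return np.array(synthetic)

print(oversample(defective, complexity).shape)   # (20, 5)
```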

48 citations


Journal ArticleDOI
TL;DR: The results show that even with small semantically preserving changes to the programs, these neural program models often fail to generalize their performance, and suggest that neural program models based on data and control dependencies in programs generalize better than neural program models based only on abstract syntax trees (ASTs).
Abstract: Context: With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks, such as predicting method names in given programs, that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they generalize to unforeseen source code is largely unknown. Objective: Since it is very challenging to test neural program models on all unforeseen programs, in this paper, we propose to evaluate the generalizability of neural program models with respect to semantic-preserving transformations: a generalizable neural program model should perform equally well on programs that are of the same semantics but of different lexical appearances and syntactical structures. Method: We compare the results of various neural program models for the method name prediction task on programs before and after automated semantic-preserving transformations. We use three Java datasets of different sizes and three state-of-the-art neural network models for code, namely code2vec, code2seq, and GGNN, to build nine such neural program models for evaluation. Results: Our results show that even with small semantically preserving changes to the programs, these neural program models often fail to generalize their performance. Our results also suggest that neural program models based on data and control dependencies in programs generalize better than neural program models based only on abstract syntax trees (ASTs). On the positive side, we observe that as the size of the training dataset grows and diversifies, the generalizability of correct predictions produced by the neural program models can be improved too. Conclusion: Our results on the generalizability of neural program models provide insights to measure their limitations and provide a stepping stone for their improvement.
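A typical semantic-preserving transformation of the kind evaluated here is renaming a local variable: behavior is unchanged while the lexical appearance differs. The paper transforms Java programs; purely to illustrate the concept, the sketch below applies an analogous rename to a Python function via the standard `ast` module:

```python
# Illustrative semantic-preserving transformation: rename a local variable.
# The paper works on Java; this Python/ast version only demonstrates the idea of
# changing lexical appearance without changing behavior.
import ast

SOURCE = """
def total_price(items):
    result = 0
    for item in items:
        result += item
    return result
"""

class RenameVar(ast.NodeTransformer):
    def __init__(self, old, new):
        self.old, self.new = old, new
    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

tree = ast.parse(SOURCE)
tree = RenameVar("result", "acc").visit(tree)
print(ast.unparse(tree))   # same semantics, different lexical appearance
```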

43 citations


Journal ArticleDOI
TL;DR: Although scalability is the commonly acknowledged benefit of MSA, it is still an indispensable concern among the identified QAs, especially when trading off with other QAs, e.g., performance.
Abstract: Context: As a rapidly adopted architectural style in software engineering, Microservices Architecture (MSA) advocates implementing small-scale and independently distributed services, rather than binding all functions into one monolith. Although many initiatives have contributed to the quality improvement of microservices-based systems, there is still a lack of a systematic understanding of the Quality Attributes (QAs) associated with MSA. Objective: This study aims to investigate the evidence-based state-of-the-art of QAs of microservices-based systems. Method: We carried out a Systematic Literature Review (SLR) to identify and synthesize the relevant studies that report evidence related to QAs of MSA. Results: Based on the data extracted from the 72 selected primary studies, we portray an overview of the six QAs of most concern in MSA: scalability, performance, availability, monitorability, security, and testability. We identify 19 tactics that architecturally address the critical QAs in MSA, including two tactics for scalability, four for performance, four for availability, four for monitorability, three for security, and two for testability. Conclusion: This SLR concludes that for MSA-based systems: 1) Although scalability is the commonly acknowledged benefit of MSA, it is still an indispensable concern among the identified QAs, especially when trading off with other QAs, e.g., performance. 2) Apart from the six identified QAs in this study, other QAs for MSA, like maintainability, need more attention for effective improvement and evaluation in the future. 3) Practitioners need to carefully make the decision of migrating to MSA based on the return on investment, since this architectural style additionally causes some pains in practice.

41 citations


Journal ArticleDOI
TL;DR: This study introduces a three-step method to develop a usability evaluation model in the context of OSS; the resulting model can be used as a reference metric for OSS usability evaluation, which will have a practical benefit for the community in public and private organisations by helping decision-makers select the best OSS package amongst the alternatives.
Abstract: Context: A plethora of models are available for open-source software (OSS) usability evaluation. However, these models lack consensus between scholars as well as standard bodies on a specific set of usability evaluation criteria. Retaining irrelevant criteria and omitting essential ones will mislead the direction of the usability evaluation. Objective: This study introduces a three-step method to develop a usability evaluation model in the context of OSS. Method: The fuzzy Delphi method has been employed to unify the usability evaluation criteria in the context of OSS. The first step in the method is the usability criteria analysis, which involves redefining and restructuring all collected usability criteria reported in the literature. The second step is the fuzzy Delphi analysis, which includes designing and validating the fuzzy Delphi instrument and utilising the fuzzy Delphi method to analyse the fuzzy consensus of experts' opinions on the usability evaluation criteria. The third step is the proposal of the OSS usability evaluation model. Results: A total of 124 usability criteria were identified, redefined, and restructured by creating groups of criteria with related meanings. The groupings generated 11 main criteria; the findings of the fuzzy Delphi narrowed the criteria down to only seven. The final set of criteria was sent back to the panellists for reconsideration of their responses. The panellists verified that these criteria are suitable for evaluating the usability of OSS. Discussion: The empirical analysis confirmed that the proposed evaluation model is acceptable in assessing the usability of OSS. Therefore, this model can be used as a reference metric for OSS usability evaluation, which will have a practical benefit for the community in public and private organisations by helping decision-makers select the best OSS package amongst the alternatives.
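In one common formulation of the fuzzy Delphi step, expert Likert ratings are mapped to triangular fuzzy numbers, aggregated, and a criterion is retained only when the panel's fuzzy consensus and defuzzified score clear agreed thresholds. The sketch below uses that common convention purely for illustration; the Likert-to-fuzzy mapping, thresholds, and ratings are assumptions, not the study's actual instrument:

```python
# Simplified fuzzy Delphi screening of one usability criterion.
# The Likert-to-triangular-fuzzy mapping, the d <= 0.2 and defuzzified >= 0.5 thresholds,
# and the panel ratings are common conventions used here for illustration only.
import math

# 5-point Likert scale mapped to triangular fuzzy numbers (l, m, u).
TFN = {1: (0.0, 0.0, 0.25), 2: (0.0, 0.25, 0.5), 3: (0.25, 0.5, 0.75),
       4: (0.5, 0.75, 1.0), 5: (0.75, 1.0, 1.0)}

ratings = [5, 4, 4, 5, 3, 4, 5]          # hypothetical panel ratings for one criterion
fuzzy = [TFN[r] for r in ratings]

avg = tuple(sum(v[i] for v in fuzzy) / len(fuzzy) for i in range(3))

# Average distance between each expert's fuzzy rating and the aggregated fuzzy rating.
d = sum(math.sqrt(sum((e[i] - avg[i]) ** 2 for i in range(3)) / 3) for e in fuzzy) / len(fuzzy)

defuzzified = sum(avg) / 3               # simple centroid defuzzification

accepted = d <= 0.2 and defuzzified >= 0.5
print(f"d={d:.3f}, defuzzified={defuzzified:.3f}, accepted={accepted}")
```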

37 citations


Journal ArticleDOI
TL;DR: In this article, a series of stable SMOTE-based oversampling techniques is proposed to improve the stability of SMOTE-based oversampling techniques; they reduce the randomness in each step by selecting defective instances in turn, using distance-based selection of K neighbor instances, and applying evenly distributed interpolation.
Abstract: Context: In practice, software datasets tend to have more non-defective instances than defective ones, which is referred to as the class imbalance problem in software defect prediction (SDP). Synthetic Minority Oversampling TEchnique (SMOTE) and its variants alleviate the class imbalance problem by generating synthetic defective instances. SMOTE-based oversampling techniques have been widely adopted as the baselines to compare with newly proposed oversampling techniques in SDP. However, randomness is introduced during the procedure of SMOTE-based oversampling techniques. If the performance of SMOTE-based oversampling techniques is highly unstable, the conclusion drawn from the comparison between SMOTE-based oversampling techniques and the newly proposed techniques may be misleading and less convincing. Objective: This paper aims to investigate the stability of SMOTE-based oversampling techniques. Moreover, a series of stable SMOTE-based oversampling techniques is proposed to improve the stability of SMOTE-based oversampling techniques. Method: Stable SMOTE-based oversampling techniques reduce the randomness in each step of SMOTE-based oversampling techniques by selecting defective instances in turn, using distance-based selection of K neighbor instances, and applying evenly distributed interpolation. In addition, we mathematically prove and empirically investigate the stability of SMOTE-based and stable SMOTE-based oversampling techniques on four common classifiers across 26 datasets in terms of AUC, balance, and MCC. Results: The analysis of SMOTE-based and stable SMOTE-based oversampling techniques shows that the performance of stable SMOTE-based oversampling techniques is more stable and better than that of SMOTE-based oversampling techniques. The difference between the worst and best performances of SMOTE-based oversampling techniques is up to 23.3%, 32.6%, and 204.2% in terms of AUC, balance, and MCC, respectively. Conclusion: Stable SMOTE-based oversampling techniques should be considered as a drop-in replacement for SMOTE-based oversampling techniques.
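The three stabilizing ideas named in the abstract (visiting defective instances in a fixed order, choosing the K neighbors by distance, and interpolating at evenly spaced points) can be sketched as a fully deterministic oversampler. This is an illustration of the mechanism, not the authors' exact algorithms; the parameters and toy data are assumptions:

```python
# Deterministic oversampling sketch: for each defective instance (in turn), take its
# K nearest defective neighbors by distance and interpolate at evenly spaced points.
import numpy as np

def stable_oversample(minority, k=3, points_per_pair=2):
    synthetic = []
    for x in minority:                                    # instances visited in a fixed order
        dists = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]            # K nearest neighbors, deterministic
        for j in neighbors:
            for t in np.linspace(0, 1, points_per_pair + 2)[1:-1]:   # evenly spaced, no randomness
                synthetic.append(x + t * (minority[j] - x))
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(stable_oversample(minority).shape)   # same output on every run
```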

34 citations


Journal ArticleDOI
TL;DR: QA has received much attention in SE in more recent years and the improvement is evident since the last study conducted in 2015, where new findings show that the aims are more concise, the instruments are more diverse and rigorous, and the criteria are more thoughtful.
Abstract: Context: Quality Assessment (QA) of reviewed literature is paramount to a Systematic Literature Review (SLR) as the quality of conclusions completely depends on the quality of selected literature. A number of researchers in Software Engineering (SE) have developed a variety of QA instruments and also reported their challenges. We previously conducted a tertiary study on SLRs with QA from 2004 to 2013, and reported the findings in 2015. Objective: With the widespread use of SLRs in SE and the increasing adoption of QA in these SLRs in recent years, it is necessary to empirically investigate whether the previous conclusions are still valid and whether there are new insights to the subject in question using a larger and more up-to-date SLR set. More importantly, we aim to depict a clear picture of QA used in SLRs in SE by aggregating and distilling good practices, including the commonly used QA instruments as well as the major roles and aspects of QA in research. Method: An extended tertiary study was conducted with the newly collected SLRs from 2014 to 2018 and the original SLRs from 2004 to 2013 to systematically review the QA used by SLRs in SE during the 15-year period from 2004 to 2018. In addition, this extended study also compared and contrasted its findings with those of the previous study conducted in 2015. Results: A total of 241 SLRs between 2004 and 2018 were included, from which we identified a number of QA instruments. These instruments are generally designed to focus on the rationality of study design, the rigor of study execution and analysis, and the credibility and contribution of study findings and conclusions, with the emphasis largely placed on rigor. The quality data is mainly used for literature selection or as evidence to support conclusions. Conclusions: QA has received much attention in SE in more recent years and the improvement is evident since the last study in 2015. New findings show that the aims are more concise, the instruments are more diverse and rigorous, and the criteria are more thoughtful.

Journal ArticleDOI
TL;DR: This paper investigates which hard and soft skills are most required by IT companies by analyzing the descriptions of 20,000 job opportunities, and highlights the importance of soft skills, as they appear in many job opportunities.
Abstract: Context: There is a growing demand for information on how IT companies look for candidates for their open positions. Objective: This paper investigates which hard and soft skills are most required by IT companies by analyzing the descriptions of 20,000 job opportunities. Method: We applied open card sorting to perform a high-level analysis of which types of hard skills are most requested. Further, we manually analyzed the most mentioned soft skills. Results: Programming languages are the most demanded hard skills. Communication, collaboration, and problem-solving are the most demanded soft skills. Conclusion: We recommend that developers organize their resumes according to the positions they are applying for. We also highlight the importance of soft skills, as they appear in many job opportunities.

Journal ArticleDOI
TL;DR: The authors empirically evaluated and compared seven state-of-the-art meta-heuristics and three alternative surrogate metrics to solve the problem of identifying duplicate bug reports with LDA, and found that the meta-heuristics are mostly comparable to one another and that the choice of the surrogate metric impacts the quality of the generated topics and the tuning overhead.
Abstract: Context: Latent Dirichlet Allocation (LDA) has been successfully used in the literature to extract topics from software documents and support developers in various software engineering tasks. While LDA has been mostly used with default settings, previous studies showed that default hyperparameter values generate sub-optimal topics from software documents. Objective: Recent studies applied meta-heuristic search (mostly evolutionary algorithms) to configure LDA in an unsupervised and automated fashion. However, previous work advocated for different meta-heuristics and surrogate metrics to optimize. The objective of this paper is to shed light on the influence of these two factors when tuning LDA for SE tasks. Method: We empirically evaluated and compared seven state-of-the-art meta-heuristics and three alternative surrogate metrics (i.e., fitness functions) to solve the problem of identifying duplicate bug reports with LDA. The benchmark consists of ten real-world and open-source projects from the Bench4BL dataset. Results: Our results indicate that (1) meta-heuristics are mostly comparable to one another (except for random search and CMA-ES), and (2) the choice of the surrogate metric impacts the quality of the generated topics and the tuning overhead. Furthermore, calibrating LDA helps identify twice as many duplicates as untuned LDA when inspecting the top five past similar reports. Conclusion: No meta-heuristic and/or fitness function outperforms all the others, as advocated in prior studies. However, we can make recommendations for some combinations of meta-heuristics and fitness functions over others for practical use. Future work should focus on improving the surrogate metrics used to calibrate/tune LDA in an unsupervised fashion.
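The setup being compared here tunes LDA hyperparameters with a search procedure guided by a surrogate fitness function instead of keeping the defaults. As a minimal stand-in for that idea, the sketch below runs a plain random search over the Dirichlet priors of scikit-learn's LDA and scores each configuration by log-likelihood; the toy corpus, search space, and surrogate are illustrative assumptions, whereas the paper compares seven meta-heuristics and three surrogate metrics:

```python
# Minimal stand-in for unsupervised LDA tuning: random search over Dirichlet priors,
# scored by log-likelihood. Toy corpus and search space are for illustration only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["null pointer exception in parser", "crash when opening settings dialog",
        "parser throws null pointer error", "settings dialog freezes on open",
        "memory leak in image cache", "image cache grows until out of memory"]
X = CountVectorizer().fit_transform(docs)

rng = np.random.default_rng(1)
best = None
for _ in range(10):                                   # random search over the hyperparameter space
    cfg = {"n_components": int(rng.integers(2, 5)),
           "doc_topic_prior": float(rng.uniform(0.01, 1.0)),
           "topic_word_prior": float(rng.uniform(0.01, 1.0))}
    lda = LatentDirichletAllocation(random_state=0, **cfg).fit(X)
    score = lda.score(X)                              # surrogate fitness: approximate log-likelihood
    if best is None or score > best[0]:
        best = (score, cfg)

print("best configuration:", best[1])
```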

Journal ArticleDOI
TL;DR: This study attempts to understand whether cultural barriers persist in distributed projects in which Indian engineers work with a more empowering Swedish management, and if so, how to overcome them, and identifies twelve cultural barriers.
Abstract: Context: Agile methods in offshored projects have become increasingly popular. Yet, many companies have found that the use of agile methods in coordination with companies located outside the regions of early agile adopters remains challenging. India has received particular attention as the leading destination of offshoring contracts due to significant cultural differences between sides of such contracts. Alarming differences are primarily rooted in the hierarchical business culture of Indian organizations and related command-and-control management behavior styles. Objective: In this study, we attempt to understand whether cultural barriers persist in distributed projects in which Indian engineers work with a more empowering Swedish management, and if so, how to overcome them. The present work is an invited extension of a conference paper. Method: We performed a multiple-case study in a mature agile company located in Sweden and a more hierarchical Indian vendor. We collected data from five group interviews with a total of 34 participants and five workshops with 96 participants in five distributed DevOps teams, including 36 Indian members, whose preferred behavior in different situations we surveyed. Results: We identified twelve cultural barriers, six of which were classified as impediments to agile software development practices, and report on the manifestation of these barriers in five DevOps teams. Finally, we put forward recommendations to overcome the identified barriers and emphasize the importance of cultural training, especially when onboarding new team members. Conclusions: Our findings confirm previously reported behaviors rooted in cultural differences that impede the adoption of agile approaches in offshore collaborations, and identify new barriers not previously reported. In contrast to the existing opinion that cultural characteristics are rigid and unchanging, we found that some barriers present at the beginning of the studied collaboration disappeared over time. Many offshore members reported behaving similarly to their onshore colleagues.

Journal ArticleDOI
TL;DR: Systematic use of agile practices was confirmed and was observed to take place in conjunction with the use of security practices, which were most common in the requirement and implementation phases.
Abstract: Context: Software security engineering provides the means to define, implement and verify security in software products. Software security engineering is performed by following a software security development life cycle model or a security capability maturity model. However, agile software development methods and processes, dominant in the software industry, are viewed to be in conflict with these security practices and the security requirements. Objective: Empirically verify the use and impact of software security engineering activities in the context of agile software development, as practiced by software developer professionals. Method: A survey (N = 61) was performed among software practitioners in Finland regarding their use of 40 common security engineering practices and their perceived security impact, in conjunction with the use of 16 agile software development items and activities. Results: The use of agile items and activities had a measurable effect on the selection of security engineering practices. The perceived impact of the security practices was lower than the rate of use would imply: this was taken to indicate a selection bias, caused by, e.g., developers' awareness of only certain security engineering practices, or by difficulties in applying the security engineering practices in an iterative software development workflow. Security practices deemed to have the most impact were proactive and took place in the early phases of software development. Conclusion: Systematic use of agile practices was confirmed and was observed to take place in conjunction with the use of security practices. Security activities were most common in the requirement and implementation phases. In general, the activities taking place early in the life cycle were also considered most impactful. A discrepancy between the level of use and the perceived security impact of many security activities was observed. This prompts research and methodological development for better integration of security engineering activities into software development processes, methods, and tools.

Journal ArticleDOI
TL;DR: In this article, the authors propose a framework that constructs temporal logic formulas for each template compositionally, from its fields, including scope, condition, timing, and response, with Real-Time Graphical Interval Logic semantics.
Abstract: The use of structured natural languages to capture requirements provides a reasonable trade-off between ambiguous natural language and unintuitive formal notations. There are two major challenges in making structured natural language amenable to formal analysis: (1) formalizing requirements as formulas that can be processed by analysis tools and (2) ensuring that the formulas conform to the semantics of the structured natural language. fretish is a structured natural language that incorporates features from existing research and from NASA applications. Even though fretish is quite expressive, its underlying semantics is determined by the types of four fields: scope, condition, timing, and response. Each combination of field types defines a template with Real-Time Graphical Interval Logic (RTGIL) semantics. We have developed a framework that constructs temporal logic formulas for each template compositionally, from its fields. The compositional nature of our algorithms facilitates maintenance and extensibility. Our goal is to be inclusive not only in terms of language expressivity, but also in terms of requirements analysis tools that we can interface with. For this reason we generate metric-temporal logic formulas with (1) exclusively future-time operators, over both finite and infinite traces, and (2) exclusively past-time operators. To establish trust in the produced formalizations for each template, our framework: (1) extensively tests the generated formulas against the template semantics and (2) proves equivalence between its past-time and future-time formulas. Our approach is available through the open-source tool fret and has been used to capture and analyze requirements for a Lockheed Martin Cyber–Physical System challenge.
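Each template combines the scope, condition, timing, and response fields, and the framework emits both future-time and past-time metric temporal logic formulas for it. As a hand-written illustration of that flavor of formalization (not formulas generated by FRET, and ignoring trace-boundary conventions), a bounded-response requirement such as "whenever the condition holds, the response shall occur within 5 time units" could be rendered as:

```latex
% Illustrative future-time and past-time MTL renderings of a bounded-response template;
% hand-written examples under one common pastification convention, not FRET's output.
\begin{align*}
  \text{future-time:} \quad & \Box\,\bigl(\mathit{cond} \rightarrow \Diamond_{[0,5]}\,\mathit{resp}\bigr) \\
  \text{past-time:}   \quad & \Box\,\bigl(\blacksquare_{[5,5]}\,\mathit{cond} \rightarrow \blacklozenge_{[0,5]}\,\mathit{resp}\bigr)
\end{align*}
```

Here the past-time variant reads: at every instant, if the condition held exactly 5 units ago, then the response must have occurred at some point in the last 5 units.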

Journal ArticleDOI
TL;DR: The Certus model describes the elements of a research collaboration process between researchers and practitioners in software engineering, grounded in the principles of research knowledge co-creation and continuous commitment to joint problem solving.
Abstract: Context: Research collaborations between the software engineering industry and academia can provide significant benefits to both sides, including improved innovation capacity for industry, and a real-world environment for motivating and validating research ideas. However, building scalable and effective research collaborations in software engineering is known to be challenging. While such challenges can be varied and many, in this paper we focus on the challenges of achieving participative knowledge creation supported by active dialog between industry and academia and continuous commitment to joint problem solving. Objective: This paper aims to understand what elements of a successful industry-academia collaboration enable the culture of participative knowledge creation. Method: We conducted participant observation, collecting qualitative data spanning 8 years of collaborative research between a software engineering research group on software V&V and the Norwegian IT sector. The collected data were analyzed and synthesized into a practical collaboration model, named the Certus Model. Results: The model is structured in seven phases, describing activities from setting up research projects to the exploitation of research results. As such, the Certus model advances other collaboration models from the literature by delineating different phases covering the complete life cycle of participative research knowledge creation. Conclusion: The Certus model describes the elements of a research collaboration process between researchers and practitioners in software engineering, grounded in the principles of research knowledge co-creation and continuous commitment to joint problem solving. The model can be applied and tested in other contexts, where it may be adapted to the local context through experimentation.

Journal ArticleDOI
TL;DR: In this article, the authors investigated the potential impact of using F1 on the validity of a large body of software defect prediction research and concluded that F1 is a problematic metric outside of an information retrieval context, since defect prediction is concerned with both classes (defect-prone and not defect-prone units).
Abstract: Context: Software engineering researchers have undertaken many experiments investigating the potential of software defect prediction algorithms. Unfortunately some widely used performance metrics are known to be problematic, most notably F1, but nevertheless F1 is widely used. Objective: To investigate the potential impact of using F1 on the validity of this large body of research. Method: We undertook a systematic review to locate relevant experiments and then extract all pairwise comparisons of defect prediction performance using F1 and the unbiased Matthews correlation coefficient (MCC). Results: We found a total of 38 primary studies. These contain 12,471 pairs of results. Of these comparisons, 21.95% changed direction when the MCC metric is used instead of the biased F1 metric. Unfortunately, we also found evidence suggesting that F1 remains widely used in software defect prediction research. Conclusion: We reiterate the concerns of statisticians that the F1 is a problematic metric outside of an information retrieval context, since we are concerned about both classes (defect-prone and not defect-prone units). This inappropriate usage has led to a substantial number (more than one fifth) of erroneous (in terms of direction) results. Therefore we urge researchers to (i) use an unbiased metric and (ii) publish detailed results including confusion matrices such that alternative analyses become possible.
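The core of the argument, that F1 ignores true negatives while MCC uses the full confusion matrix, can be seen in a small worked example. The two confusion matrices below are invented for illustration (they do not come from the reviewed studies), and they make classifier A look better under F1 but worse under MCC:

```python
# Worked example (invented numbers): F1 and MCC can rank two classifiers in
# opposite directions on the same data, because F1 ignores true negatives.
import math

def f1(tp, fp, fn, tn):
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

# Same dataset (900 positives, 100 negatives), two hypothetical classifiers.
A = dict(tp=890, fp=90, fn=10, tn=10)
B = dict(tp=700, fp=10, fn=200, tn=90)

print(f"F1:  A={f1(**A):.3f}  B={f1(**B):.3f}")   # F1 prefers A (0.947 vs 0.870)
print(f"MCC: A={mcc(**A):.3f}  B={mcc(**B):.3f}") # MCC prefers B (0.190 vs 0.448)
```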

Journal ArticleDOI
TL;DR: The results show that BDA is employed throughout the whole ASD lifecycle and that data-driven software development is focused on the following areas: code repository analytics, defects/bug fixing, testing, project management analytics, and application usage analytics.
Abstract: Context: Over the last decade, Agile methods have changed the software development process in an unparalleled way, and with the increasing popularity of Big Data, optimizing development cycles through data analytics is becoming a commodity. Objective: Although a myriad of research exists on software analytics as well as on Agile software development (ASD) practice itself, there exists no systematic overview of the research done on ASD from a data analytics perspective. Therefore, the objective of this work is to make progress by linking ASD with Big Data analytics (BDA). Method: As the primary method to find relevant literature on the topic, we performed manual search and snowballing on papers published between 2011 and 2019. Results: In total, 88 primary studies were selected and analyzed. Our results show that BDA is employed throughout the whole ASD lifecycle. The results reveal that data-driven software development is focused on the following areas: code repository analytics, defects/bug fixing, testing, project management analytics, and application usage analytics. Conclusions: As BDA and ASD are fast-developing areas, improving the productivity of software development teams is one of the most important objectives BDA is facing in the industry. This study provides scholars with information about the state of software analytics research and the current trends as well as applications in the business environment. Meanwhile, thanks to this literature review, practitioners should be able to better understand how to obtain actionable insights from their software artifacts and on which aspects of data analytics to focus when investing in such initiatives.

Journal ArticleDOI
TL;DR: In this article, the authors propose an automated framework using a Chaos-based Genetic Algorithm for Multi-fault Localization (CGAML) based on the Spectrum-Based Fault Localization (SBFL) technique.
Abstract: Context: In the field of software engineering, the most complex and time-consuming activity is fault finding. Due to the increasing size and complexity of software, there is a need for an automated fault detection tool that can detect faults with minimal human intervention. A programmer spends a lot of time and effort on software fault localization. Various Spectrum-Based Fault Localization (SBFL) techniques have already been developed to automate fault localization in single-fault software. However, there is a scarcity of fault localization techniques for multi-fault software. In our study, we have found that pure SBFL is not always sufficient for effective fault localization in multi-fault programs. Objective: To address the above challenge, we propose an automated framework using a Chaos-based Genetic Algorithm for Multi-fault Localization (CGAML) based on the SBFL technique. Methods: The traditional Genetic Algorithm (GA) sometimes gets stuck in local optima and takes more time to converge. Different chaos mapping functions have been applied to GA for better performance. We have used the logistic mapping function to achieve a chaotic sequence. The proposed technique, CGAML, first calculates the suspiciousness score for each program statement and then assigns ranks according to that score. Statements with a smaller rank have a higher probability of being faulty. Results: Five open-source benchmark programs are tested to evaluate the efficiency of the CGAML technique. The experimental results show that CGAML gives better results for both single-fault and multi-fault programs in comparison with existing spectrum-based fault localization techniques. Conclusion: The EXAM metric is used to compare the performance of our proposed technique with other existing techniques. A smaller EXAM score denotes higher accuracy of the technique. The proposed framework generates smaller EXAM scores in comparison with other existing techniques. We found that, overall, CGAML works on average 8.5% better than GA for both single-fault and multi-fault software.
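The chaotic ingredient mentioned here, the logistic mapping function, replaces a standard pseudo-random source inside the genetic algorithm. A compact sketch of generating such a sequence and using it to drive bit-flip mutation; the chromosome encoding and mutation rate are illustrative assumptions, and the paper's full algorithm also covers suspiciousness scoring and ranking:

```python
# Sketch: logistic-map chaotic sequence driving bit-flip mutation in a GA, the general
# mechanism behind chaos-based GAs such as CGAML. Encoding and rate are illustrative.
def logistic_sequence(x0=0.7, r=4.0, n=100):
    """Chaotic sequence x_{n+1} = r * x_n * (1 - x_n); r = 4 gives fully chaotic behaviour."""
    seq, x = [], x0
    for _ in range(n):
        x = r * x * (1 - x)
        seq.append(x)
    return seq

def chaotic_mutate(chromosome, chaos, rate=0.1):
    """Flip bit i when the i-th chaotic value falls below the mutation rate."""
    return [1 - gene if chaos[i] < rate else gene for i, gene in enumerate(chromosome)]

chaos = logistic_sequence(n=16)
print(chaotic_mutate([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0], chaos))
```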

Journal ArticleDOI
TL;DR: It is shown that the field of IaC development and maintenance is in its infancy and deserves further attention, and critical insights are revealed concerning the top languages as well as the best practices adopted by practitioners to address some of those challenges.
Abstract: Context: Infrastructure-as-code (IaC) is the DevOps tactic of managing and provisioning software infrastructures through machine-readable definition files, rather than manual hardware configuration or interactive configuration tools. Objective: From a maintenance and evolution perspective, the topic has piqued the interest of practitioners and academics alike, given the relative scarcity of supporting patterns and practices in the academic literature. At the same time, a considerable amount of gray literature exists on IaC. Thus, we aim to characterize IaC and compile a catalog of best and bad practices for widely used IaC languages, all using gray literature materials. Method: In this paper, we systematically analyze the industrial gray literature on IaC, such as blog posts, tutorials, and white papers, using qualitative analysis techniques. Results: We propose a definition for IaC and distill a broad catalog, summarized in a taxonomy consisting of 10 and 4 primary categories for best practices and bad practices, respectively, both language-agnostic and language-specific, for three IaC languages, namely Ansible, Puppet, and Chef. The practices reflect implementation issues, design issues, and the violation of/adherence to the essential principles of IaC. Conclusion: Our findings reveal critical insights concerning the top languages as well as the best practices adopted by practitioners to address (some of) those challenges. We show that the field of IaC development and maintenance is in its infancy and deserves further attention.

Journal ArticleDOI
TL;DR: A comparative study takes a holistic look at heterogeneous defect prediction, finding that HDP methods do not perform significantly better than some UDP methods in terms of two NPIs and four EPIs, and that there is still a long way to go for HDP.
Abstract: Context: Cross-project defect prediction applies to scenarios in which the target projects are new projects. Most previous studies tried to utilize training data from other projects (i.e., the source projects). However, the metrics used by practitioners to measure the extracted program modules from different projects may not be the same, and performing heterogeneous defect prediction (HDP) is challenging. Objective: Researchers have proposed many novel HDP methods with promising performance until now. Recently, unsupervised defect prediction (UDP) methods have received more attention and show competitive performance. However, to the best of our knowledge, whether HDP methods can perform significantly better than UDP methods has not yet been thoroughly investigated. Method: In this article, we perform a comparative study to take a holistic look at this issue. Specifically, we compare five HDP methods with four UDP methods on 34 projects in five groups under the same experimental setup from three different perspectives: non-effort-aware performance indicators (NPIs), effort-aware performance indicators (EPIs), and diversity analysis on identifying defective modules. Result: We have the following findings: (1) HDP methods do not perform significantly better than some UDP methods in terms of two NPIs and four EPIs. (2) According to two satisfactory criteria recommended by previous studies, the satisfactory ratio of existing HDP methods is pessimistic. (3) The diversity of prediction for defective modules across HDP vs. UDP methods is greater than that within HDP methods or UDP methods. Conclusion: The above findings imply that there is still a long way to go for HDP. Given this, we present some observations about the road ahead for HDP.

Journal ArticleDOI
TL;DR: The results show that programming languages are the most relevant features for predicting the investigated roles and can assist companies during their hiring process, for example by recommending developers with the expertise required by job positions.
Abstract: Context: Modern software development demands high levels of technical specialization. These conditions make IT companies focus on creating cross-functional teams, such as frontend, backend, and mobile developers. In this context, the success of software projects is highly influenced by the expertise of these teams in each field. Objective: In this paper, we investigate machine-learning based approaches to automatically identify the technical roles of open source developers. Method: For this, we first build a ground truth with 2284 developers labeled in six different roles: backend, frontend, full-stack, mobile, devops, and data science. Then, we build three different machine-learning models used to identify these roles. Results: These models presented competitive results for precision (0.88) and AUC (0.89) when identifying all six roles. Moreover, our results show that programming languages are the most relevant features to predict the investigated roles. Conclusion: The approach proposed in this paper can assist companies during their hiring process, for example by recommending developers with the expertise required by job positions.

Journal ArticleDOI
TL;DR: The proposed graph-based clustering algorithm GMA (Graph-based Modularization Algorithm) is expected to help a software designer in the software reverse engineering process to extract easy-to-manage and understandable modules from source code.
Abstract: Context: Clustering algorithms, as a modularization technique, are used to modularize a program, supporting both the understanding of large software systems and software refactoring. These algorithms partition the source code of the software system into smaller and easy-to-manage modules (clusters). The resulting decomposition is called the software system structure (or software architecture). Due to the NP-hardness of the modularization problem, evolutionary clustering approaches such as the genetic algorithm have been used to solve this problem. These methods do not make much use of the information and knowledge available in the artifact dependency graph extracted from the source code. Objective: To overcome the limitations of the existing modularization techniques, this paper presents a new modularization technique named GMA (Graph-based Modularization Algorithm). Methods: In this paper, a new graph-based clustering algorithm is presented for software modularization. To this end, the depth of relationships is used to compute the similarity between artifacts, and seven new criteria are proposed to evaluate the quality of a modularization. The similarity presented in this paper enables the algorithm to use graph-theoretic information. Results: To demonstrate the applicability of the proposed algorithm, ten folders of Mozilla Firefox with different domains and functions, along with four other applications, are selected. The experimental results demonstrate that the proposed algorithm produces modularizations closer to the human expert's decomposition (i.e., the directory structure) than the other existing algorithms. Conclusion: The proposed algorithm is expected to help software designers in the software reverse engineering process to extract easy-to-manage and understandable modules from source code.
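One way to read a "depth of relationships" similarity off a dependency graph is to use the inverse of the shortest-path distance between artifacts and then group artifacts whose similarity clears a threshold. The toy graph, similarity definition, and threshold below are illustrative assumptions, not GMA's actual similarity measure or criteria:

```python
# Toy sketch: derive artifact similarity from the dependency graph (inverse
# shortest-path distance) and group artifacts above a similarity threshold.
import networkx as nx

g = nx.Graph()  # hypothetical artifact dependency graph
g.add_edges_from([("parser.c", "lexer.c"), ("parser.c", "ast.c"),
                  ("ui.c", "widgets.c"), ("widgets.c", "theme.c")])

def similarity(a, b):
    try:
        return 1.0 / nx.shortest_path_length(g, a, b)   # closer in the graph => more similar
    except nx.NetworkXNoPath:
        return 0.0

def cluster(threshold=0.6):
    modules = []
    for node in g.nodes:
        placed = next((m for m in modules
                       if any(similarity(node, other) >= threshold for other in m)), None)
        if placed:
            placed.add(node)
        else:
            modules.append({node})
    return modules

print(cluster())   # e.g., [{'parser.c', 'lexer.c', 'ast.c'}, {'ui.c', 'widgets.c', 'theme.c'}]
```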

Journal ArticleDOI
TL;DR: A decision model based on the framework for the programming language ecosystem selection problem is presented, which can be used to more rapidly evaluate and select programming language ecosystems.
Abstract: Context: Software development is a continuous decision-making process that mainly relies on the software engineer’s experience and intuition. One of the essential decisions in the early stages of the process is selecting the best fitting programming language ecosystem based on the project requirements. A significant number of criteria, such as developer availability and consistent documentation, in addition to the number of available options in the market, lead to a challenging decision-making process. As the selection of programming language ecosystems depends on the application to be developed and its environment, a decision model is required to analyze the selection problem using systematic identification and evaluation of potential alternatives for a development project. Method: Recently, we introduced a framework to build decision models for technology selection problems in software production. Furthermore, we designed and implemented a decision support system that uses such decision models to support software engineers with their decision-making problems. This study presents a decision model based on the framework for the programming language ecosystem selection problem. Results: The decision model has been evaluated through seven real-world case studies at seven software development companies. The case study participants declared that the approach provides significantly more insight into the programming language ecosystem selection process and decreases the decision-making process’s time and cost. Conclusion: With the decision model, software engineers can more rapidly evaluate and select programming language ecosystems. Having the knowledge in the decision model readily available supports software engineers in making more efficient and effective decisions that meet their requirements and priorities. Furthermore, such reusable knowledge can be employed by other researchers to develop new concepts and solutions for future challenges.

Journal ArticleDOI
TL;DR: In the context of real faults, the convolutional neural network is the most effective for fault localization among the investigated architectures, and potential factors of deep learning for improving fault localization are suggested.
Abstract: Context: The recent progress of deep learning has shown its promising ability to make sense of data, and many fields have utilized this learning ability to learn effective models, successfully solving their problems. Fault localization has explored and used deep learning to serve as an aid in debugging, showing promising results on fault localization. However, as far as we know, there are no detailed studies evaluating the benefits of using deep learning for locating real faults present in programs. Objective: To understand the benefits of deep learning in locating real faults, this paper explores deep learning further by studying the effectiveness of fault localization using deep learning for a set of real bugs reported in widely used programs. Method: We use three representative deep learning architectures (i.e., convolutional neural network, recurrent neural network, and multi-layer perceptron) for fault localization, and conduct large-scale experiments on 8 real-world programs with real faults to evaluate their effectiveness on fault localization. Results: We observe that the localization effectiveness varies considerably among the three neural networks in the context of real faults. Specifically, the convolutional neural network performs the best in locating real faults, showing an average of 38.97% and 26.22% savings over the multi-layer perceptron and recurrent neural network, respectively; the recurrent neural network and multi-layer perceptron yield comparable effectiveness, even if the effectiveness of the recurrent neural network is marginally higher than that of the multi-layer perceptron. Conclusion: In the context of real faults, the convolutional neural network is the most effective for fault localization among the investigated architectures, and we suggest potential factors of deep learning for improving fault localization.

Journal ArticleDOI
TL;DR: This paper introduces Context-Oriented Behavioral Programming - a novel paradigm for developing context-aware systems, centered on natural and incremental specification of context-dependent behaviors - and provides design patterns and a methodology for coping with the challenges of developing such systems.
Abstract: Context: Modern systems require programmers to develop code that dynamically adapts to different contexts, leading to the evolution of new context-oriented programming languages. These languages introduce new software-engineering challenges, such as: how to maintain the separation of concerns of the codebase? how to model the changing behaviors? how to verify the system behavior? and more. Objective: This paper introduces Context-Oriented Behavioral Programming (COBP) — a novel paradigm for developing context-aware systems, centered on natural and incremental specification of context-dependent behaviors. As the name suggests, we combine behavioral-programming (BP) — a scenario-based modeling paradigm — with context idioms that explicitly specify when scenarios are relevant and what information they need. The core idea is to connect the behavioral model with a data model that represents the context, allowing an intuitive connection between the models via update and select queries. Combining behavioral-programming with context-oriented programming brings the best of the two worlds, solving issues that arise when using each of the approaches in separation. Methods: We begin with providing abstract semantics for COBP and two implementations for the semantics, laying the foundations for applying reasoning algorithms to context-aware behavioral programs. Next, we exemplify the semantics with formal specifications of systems, including a variant of Conway’s Game of Life. Then, we provide two case studies of real-life context-aware systems (one in robotics and another in IoT) that were developed using this tool. Throughout the examples and case studies, we provide design patterns and a methodology for coping with the above challenges. Results: The case studies show that the proposed approach is applicable for developing real-life systems, and presents measurable advantages over the alternatives — behavioral programming alone and context-oriented programming alone. Conclusion: We present a paradigm allowing programmers and system engineers to capture complex context-dependent requirements and align their code with such requirements.

Journal ArticleDOI
TL;DR: The goal of this work is to map the existing knowledge on Micro-Frontends by conducting a Multivocal Literature Review, understanding the motivations of companies when adopting such applications as well as possible benefits and issues.
Abstract: Context: Micro-Frontends are increasing in popularity, being adopted by several large companies, such as DAZN, Ikea, Starbucks, and many others. Micro-Frontends enable splitting monolithic frontends into independent and smaller micro applications. However, many companies are still hesitant to adopt Micro-Frontends, due to the lack of knowledge concerning their benefits. Additionally, the available online documentation is often confusing and contradictory. Objective: The goal of this work is to map the existing knowledge on Micro-Frontends, by understanding the motivations of companies when adopting such applications as well as possible benefits and issues. Method: For this purpose, we surveyed the academic and grey literature by means of the Multivocal Literature Review process, analysing 173 sources, of which 43 reported motivations, benefits, and issues. Results: The results show that existing architectural options to build web applications are cumbersome if the application and development team grows, and if multiple teams need to develop the same frontend application. In such cases, companies adopted Micro-Frontends to increase team independence and to reduce the overall complexity of the frontend. The adoption of Micro-Frontends confirmed the expected benefits, and Micro-Frontends turned out to provide the same benefits as microservices on the back end side, turning development teams into fully cross-functional teams that can scale processes when needed. However, Micro-Frontends also showed some issues, such as the increased payload size of the application, increased code duplication and coupling between teams, and monitoring complexity. Conclusions: Micro-Frontends allow companies to scale development according to business needs in the same way microservices do on the back end side. In addition, Micro-Frontends have a lot of overhead and require careful planning if an advantage is to be achieved by using them. Further research is needed to carefully investigate this new hype, helping practitioners to understand how to use Micro-Frontends as well as in which contexts they are the most beneficial.

Journal ArticleDOI
TL;DR: This systematic literature review takes a longitudinal perspective on GUI test automation challenges by identifying them and investigating why the field has been unable to mitigate them for so many years, and discusses why the key challenges are still present and why some will likely remain in the future.
Abstract: Context: Automated testing is ubiquitous in modern software development and used to verify requirement conformance on all levels of system abstraction, including the system’s graphical user interface (GUI). GUI-based test automation, like other automation, aims to reduce the cost and time for testing compared to alternative, manual approaches. Automation has been successful in reducing costs for other forms of testing (like unit- or integration testing) in industrial practice. However, we have not yet seen the same convincing results for automated GUI-based testing, which has instead been associated with multiple technical challenges. Furthermore, the software industry has struggled with some of these challenges for more than a decade with what seems like only limited progress. Objective: This systematic literature review takes a longitudinal perspective on GUI test automation challenges by identifying them and then investigating why the field has been unable to mitigate them for so many years. Method: The review is based on a final set of 49 publications, all reporting empirical evidence from practice or industrial studies. Statements from the publications are synthesized, based on a thematic coding, into 24 challenges related to GUI test automation. Results: The most reported challenges were mapped chronologically and further analyzed to determine how they and their proposed solutions have evolved over time. This chronological mapping of reported challenges shows that four of them have existed for almost two decades. Conclusion: Based on the analysis, we discuss why the key challenges with GUI-based test automation are still present and why some will likely remain in the future. For others, we discuss possible ways of how the challenges can be addressed. Further research should focus on finding solutions to the identified technical challenges with GUI-based test automation that can be resolved or mitigated. However, in parallel, we also need to acknowledge and try to overcome non-technical challenges.

Journal ArticleDOI
TL;DR: The analysis revealed that the techniques proposed for software startups in practice comprise five different activities: elicitation, prioritization, specification, analysis, and management, which represents the first description of hypotheses engineering grounded in practice data.
Abstract: Context: The higher availability of software usage data and the influence of the Lean Startup led to the rise of experimentation in software engineering, a new approach to development based on experiments to understand user needs. In the models proposed to guide this approach, the first step is generally to identify, prioritize, and specify the hypotheses that will be tested through experimentation. However, although practitioners have proposed several techniques to handle hypotheses, the scientific literature is still scarce. Objective: The goal of this study is to understand what activities, as proposed in industry, are entailed in handling hypotheses, facilitating the comparison, creation, and evaluation of relevant techniques. Methods: We performed a gray literature review (GLR) on the practices proposed by practitioners to handle hypotheses in the context of software startups. We analyzed the identified documents using thematic synthesis. Results: The analysis revealed that the techniques proposed for software startups in practice comprise five different activities: elicitation, prioritization, specification, analysis, and management. It also showed that practitioners often classify hypotheses into types and which qualities they aim for in these statements. Conclusion: Our results represent the first description of hypotheses engineering grounded in practice data. This mapping of the state-of-practice indicates how research could go forward in investigating hypotheses for experimentation in the context of software startups. For practitioners, the results represent a catalog of available practices to be used in this context.