
Showing papers in "Information & Software Technology in 2019"


Journal ArticleDOI
TL;DR: The provided MLR guidelines will support researchers in effectively and efficiently conducting new MLRs in any area of SE; researchers are encouraged to use them in their MLR studies and then share their lessons learned and experiences.
Abstract: Context A Multivocal Literature Review (MLR) is a form of a Systematic Literature Review (SLR) which includes the grey literature (e.g., blog posts, videos and white papers) in addition to the published (formal) literature (e.g., journal and conference papers). MLRs are useful for both researchers and practitioners since they provide summaries of both the state of the art and the state of the practice in a given area. MLRs are popular in other fields and have recently started to appear in software engineering (SE). As more MLR studies are conducted and reported, it is important to have a set of guidelines to ensure high quality of MLR processes and their results. Objective There are several guidelines to conduct SLR studies in SE. However, several phases of MLRs differ from those of traditional SLRs, for instance with respect to the search process and source quality assessment. Therefore, SLR guidelines are only partially useful for conducting MLR studies. Our goal in this paper is to present guidelines on how to conduct MLR studies in SE. Method To develop the MLR guidelines, we benefit from several inputs: (1) existing SLR guidelines in SE, (2) a literature survey of MLR guidelines and experience papers in other fields, and (3) our own experiences in conducting several MLRs in SE. We took the popular SLR guidelines of Kitchenham and Charters as the baseline and extended/adapted them to conduct MLR studies in SE. All derived guidelines are discussed in the context of an already-published MLR in SE as the running example. Results The resulting guidelines cover all phases of conducting and reporting MLRs in SE, from the planning phase, over conducting the review, to the final reporting of the review. In particular, we believe that incorporating and adapting a vast set of experience-based recommendations from MLR guidelines and experience papers in other fields has enabled us to propose a set of guidelines with solid foundations. Conclusion Having been developed on the basis of several types of experience and evidence, the provided MLR guidelines will support researchers in effectively and efficiently conducting new MLRs in any area of SE. We recommend that researchers use these guidelines in their MLR studies and then share their lessons learned and experiences.

358 citations


Journal ArticleDOI
TL;DR: There is still room for the improvement of machine learning techniques in the context of code smell detection and it is argued that JRip and Random Forest are the most effective classifiers in terms of performance.
Abstract: Background: Code smells indicate suboptimal design or implementation choices in the source code that often lead it to be more change- and fault-prone. Researchers defined dozens of code smell detectors, which exploit different sources of information to support developers when diagnosing design flaws. Despite their good accuracy, previous work pointed out three important limitations that might preclude the use of code smell detectors in practice: (i) subjectiveness of developers with respect to code smells detected by such tools, (ii) scarce agreement between different detectors, and (iii) difficulties in finding good thresholds to be used for detection. To overcome these limitations, the use of machine learning techniques represents an ever increasing research area. Objective: While the research community carefully studied the methodologies applied by researchers when defining heuristic-based code smell detectors, there is still a noticeable lack of knowledge on how machine learning approaches have been adopted for code smell detection and whether there are points of improvement to allow a better detection of code smells. Our goal is to provide an overview and discuss the usage of machine learning approaches in the field of code smells. Method: This paper presents a Systematic Literature Review (SLR) on Machine Learning Techniques for Code Smell Detection. Our work considers papers published between 2000 and 2017. Starting from an initial set of 2456 papers, we found that 15 of them actually adopted machine learning approaches. We studied them under four different perspectives: (i) code smells considered, (ii) setup of machine learning approaches, (iii) design of the evaluation strategies, and (iv) a meta-analysis on the performance achieved by the models proposed so far. Results: The analyses performed show that God Class, Long Method, Functional Decomposition, and Spaghetti Code have been heavily considered in the literature. Decision Trees and Support Vector Machines are the most commonly used machine learning algorithms for code smell detection. Models based on a large set of independent variables have performed well. JRip and Random Forest are the most effective classifiers in terms of performance. The analyses also reveal the existence of several open issues and challenges that the research community should focus on in the future. Conclusion: Based on our findings, we argue that there is still room for the improvement of machine learning techniques in the context of code smell detection. The open issues emerged in this study can represent the input for researchers interested in developing more powerful techniques.

148 citations


Journal ArticleDOI
TL;DR: A classification schema for reporting threats to validity and possible mitigation actions is proposed; authors of secondary studies can use it for identifying and categorizing threats to validity and corresponding mitigation actions, while readers of secondary studies can use the checklist for assessing the validity of the reported results.
Abstract: Context Secondary studies are vulnerable to threats to validity. Although mitigating these threats is crucial for the credibility of these studies, we currently lack a systematic approach to identify, categorize and mitigate threats to validity for secondary studies. Objective In this paper, we review the corpus of secondary studies, with the aim to identify: (a) the trend of reporting threats to validity, (b) the most common threats to validity and corresponding mitigation actions, and (c) possible categories in which threats to validity can be classified. Method To achieve this goal we employ the tertiary study research method that is used for synthesizing knowledge from existing secondary studies. In particular, we collected data from more than 100 studies, published until December 2016 in top-quality software engineering venues (both journals and conferences). Results Our results suggest that in recent years, secondary studies are more likely to report their threats to validity. However, the presentation of such threats is rather ad hoc, e.g., the same threat may be presented with a different name, or under a different category. To alleviate this problem, we propose a classification schema for reporting threats to validity and possible mitigation actions. Both the classification of threats and the associated mitigation actions have been validated by an empirical study, i.e., Delphi rounds with experts. Conclusion Based on the proposed schema, we provide a checklist, which authors of secondary studies can use for identifying and categorizing threats to validity and corresponding mitigation actions, while readers of secondary studies can use the checklist for assessing the validity of the reported results.

147 citations


Journal ArticleDOI
TL;DR: KPWE, a new software defect prediction framework that considers the feature extraction and class imbalance issues, is proposed; the empirical study on 44 software projects indicates that KPWE is superior to the baseline methods in most cases.
Abstract: Context Software defect prediction strives to detect defect-prone software modules by mining historical data. Effective prediction enables reasonable testing resource allocation, which eventually leads to more reliable software. Objective The complex structures and the imbalanced class distribution in software defect data make it challenging to obtain suitable data features and learn an effective defect prediction model. In this paper, we propose a method to address these two challenges. Method We propose a defect prediction framework called KPWE that combines two techniques, i.e., Kernel Principal Component Analysis (KPCA) and Weighted Extreme Learning Machine (WELM). Our framework consists of two major stages. In the first stage, KPWE aims to extract representative data features. It leverages the KPCA technique to project the original data into a latent feature space by nonlinear mapping. In the second stage, KPWE aims to alleviate the class imbalance. It exploits the WELM technique to learn an effective defect prediction model with a weighting-based scheme. Results We have conducted extensive experiments on 34 projects from the PROMISE dataset and 10 projects from the NASA dataset. The experimental results show that KPWE achieves promising performance compared with 41 baseline methods, including seven basic classifiers with KPCA, five variants of KPWE, eight representative feature selection methods with WELM, and 21 imbalanced learning methods. Conclusion In this paper, we propose KPWE, a new software defect prediction framework that considers the feature extraction and class imbalance issues. The empirical study on 44 software projects indicates that KPWE is superior to the baseline methods in most cases.

109 citations
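
The two-stage design described above can be illustrated with a small, self-contained sketch: kernel PCA for nonlinear feature extraction followed by a weighted extreme learning machine whose per-sample weights counteract class imbalance. This is a minimal approximation of KPWE, not the authors' implementation; the kernel choice, hidden-layer size, weighting scheme and regularization constant are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def train_kpwe(X, y, n_components=10, hidden=100, C=1.0, seed=0):
    """Minimal KPWE-style pipeline: KPCA features + weighted ELM (illustrative)."""
    rng = np.random.default_rng(seed)
    # Stage 1: nonlinear feature extraction with kernel PCA.
    kpca = KernelPCA(n_components=n_components, kernel="rbf")
    Z = kpca.fit_transform(X)
    # Stage 2: weighted ELM with a random hidden layer and closed-form output weights.
    W = rng.normal(size=(Z.shape[1], hidden))
    b = rng.normal(size=hidden)
    H = 1.0 / (1.0 + np.exp(-(Z @ W + b)))   # sigmoid hidden activations
    # Per-sample weights inversely proportional to class frequency (handles imbalance).
    counts = np.bincount(y)
    w = 1.0 / counts[y]
    T = np.eye(2)[y]                          # one-hot targets (binary case)
    Wd = np.diag(w)
    beta = np.linalg.solve(H.T @ Wd @ H + np.eye(hidden) / C, H.T @ Wd @ T)
    return kpca, W, b, beta

def predict_kpwe(model, X):
    kpca, W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(kpca.transform(X) @ W + b)))
    return np.argmax(H @ beta, axis=1)
```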


Journal ArticleDOI
TL;DR: This exploratory study presents detailed descriptions of how DevOps is implemented in practice, particularly in small and medium-sized companies, and contributes to the overall understanding of the DevOps concept, its practices and its perceived impacts.
Abstract: Context: DevOps is considered important in the ability to frequently and reliably update a system in operational state. DevOps presumes cross-functional collaboration and automation between software development and operations. DevOps adoption and implementation in companies is non-trivial due to required changes in technical, organisational and cultural aspects. Objectives: This exploratory study presents detailed descriptions of how DevOps is implemented in practice. The context of our empirical investigation is web application and service development in small and medium-sized companies. Method: A multiple-case study was conducted in five different development contexts with successful DevOps implementations, i.e., where benefits such as quick releases and minimal deployment errors had been achieved. Data was mainly collected through interviews with 26 practitioners and observations made at the companies. Data was analysed by first coding each case individually using a set of predefined themes and thereafter performing a cross-case synthesis. Results: Our analysis yielded the following results: (i) the software development team attaining ownership of and responsibility for deploying software changes in production is crucial in DevOps; (ii) toolchain usage and support in deployment pipeline activities accelerate the delivery of software changes, bug fixes and handling of production incidents; (iii) the delivery speed to production is affected by context factors, such as manual approvals by the product owner; (iv) a steep learning curve for new skills is experienced by both software developers and operations staff, who also have to cope with working under pressure. Conclusion: Our findings contribute to the overall understanding of the DevOps concept, practices and its perceived impacts, particularly in small and medium-sized companies. We discuss two practical implications of the results.

96 citations


Journal ArticleDOI
TL;DR: The proposed TPTL model can solve the instability problem of TCA+, showing substantial improvements over the state-of-the-art and related CPDP models.
Abstract: Context: Previous studies have shown that a transfer learning model, TCA+ proposed by Nam et al., can significantly improve the performance of cross-project defect prediction (CPDP). TCA+ achieves the improvement by reducing the data distribution difference between source (training data) and target (testing data) projects. However, TCA+ is unstable, i.e., its performance varies largely when using different source projects to build prediction models. In practice, it is hard to choose a suitable source project to build the prediction model. Objective: To address the limitation of TCA+, we propose a two-phase transfer learning model (TPTL) for CPDP. Method: In the first phase, we propose a source project estimator (SPE) to automatically choose, from the candidates, source projects with the highest distribution similarity to a target project. Next, the two source projects that are estimated to achieve the highest values of F1-score and cost-effectiveness are selected. In the second phase, we leverage TCA+ to build two prediction models based on the two selected projects and combine their prediction results to further improve the prediction performance. Results: We evaluate TPTL on 42 defect datasets from the PROMISE repository, and compare it with two versions of TCA+ (TCA+_Rnd, randomly selecting one source project; TCA+_All, using all alternative source projects), a related source project selection model TDS proposed by Herbold, a state-of-the-art CPDP model leveraging a log transformation (LT) method, and a transfer learning model, Dycom, that uses an improved form of TCA. Experimental results show that, on average across the 42 datasets, TPTL improves these baseline models by 19%, 5%, 36%, 27%, and 11%, respectively, in terms of F1-score, and by 64%, 92%, 71%, 11%, and 66% in terms of cost-effectiveness. Conclusion: The proposed TPTL model can solve the instability problem of TCA+, showing substantial improvements over the state-of-the-art and related CPDP models.

86 citations
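
A rough sketch of the first phase (the source project estimator) and the second-phase combination might look as follows. The distribution-similarity measure, the plain classifier used as a stand-in for TCA+, and the averaging of prediction scores are simplifying assumptions rather than the authors' exact procedure.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression

def distribution_similarity(source_X, target_X):
    """Higher is more similar: 1 - mean KS statistic over shared metric columns (assumption)."""
    stats = [ks_2samp(source_X[:, j], target_X[:, j]).statistic
             for j in range(target_X.shape[1])]
    return 1.0 - float(np.mean(stats))

def tptl_like_predict(candidates, target_X, top_k=2):
    """candidates: list of (X, y) source projects. Returns averaged defect probabilities."""
    # Phase 1: pick the source projects whose metric distributions best match the target.
    ranked = sorted(candidates,
                    key=lambda s: distribution_similarity(s[0], target_X),
                    reverse=True)[:top_k]
    # Phase 2: build one model per selected source (TCA+ in the paper; a plain
    # logistic regression here as a placeholder) and combine their predictions.
    probs = []
    for X_src, y_src in ranked:
        clf = LogisticRegression(max_iter=1000).fit(X_src, y_src)
        probs.append(clf.predict_proba(target_X)[:, 1])
    return np.mean(probs, axis=0)
```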


Journal ArticleDOI
TL;DR: A new deep forest model, DPDF, is proposed to build the defect prediction model; it can identify more important defect features by using a new cascade strategy that transforms random forest classifiers into a layer-by-layer structure.
Abstract: Context Software defect prediction is important to ensure the quality of software. Nowadays, many supervised learning techniques have been applied to identify defective instances (e.g., methods, classes, and modules). Objective However, the performance of these supervised learning techniques is still far from satisfactory, and it is important to design more advanced techniques to improve the performance of defect prediction models. Method We propose a new deep forest model to build the defect prediction model (DPDF). This model can identify more important defect features by using a new cascade strategy, which transforms random forest classifiers into a layer-by-layer structure. This design takes full advantage of ensemble learning and deep learning. Results We evaluate our approach on 25 open source projects from four public datasets (i.e., NASA, PROMISE, AEEEM and ReLink). Experimental results show that our approach increases the AUC value by 5% compared with the best traditional machine learning algorithms. Conclusion The deep strategy in DPDF is effective for software defect prediction.

83 citations
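
The cascade idea, i.e., stacking random forests layer by layer and feeding each layer's class-probability outputs back in as extra features, can be sketched roughly as below. Layer depth, the number of forests per layer and the final scoring rule are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_cascade_forest(X, y, n_layers=3, forests_per_layer=2, seed=0):
    """Very small gcForest-style cascade for binary defect prediction (illustrative)."""
    layers, features = [], X
    for layer in range(n_layers):
        forests = [RandomForestClassifier(n_estimators=100, random_state=seed + layer * 10 + i)
                   .fit(features, y) for i in range(forests_per_layer)]
        layers.append(forests)
        # Augment the original features with this layer's class-probability outputs.
        probas = np.hstack([f.predict_proba(features) for f in forests])
        features = np.hstack([X, probas])
    return layers

def predict_cascade_forest(layers, X):
    features = X
    for forests in layers:
        probas = [f.predict_proba(features) for f in forests]
        features = np.hstack([X] + probas)
    # Average the positive-class probability across the final layer's forests.
    return np.mean([p[:, 1] for p in probas], axis=0)
```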


Journal ArticleDOI
TL;DR: A large-scale empirical study on the influence of 9 Android-specific code smells on the energy consumption of 60 Android apps finds that refactoring these code smells reduces energy consumption in all of the studied situations.
Abstract: Context. The demand for green software design is steadily growing, especially in the context of mobile devices, where computation is often limited by battery life. Previous studies found that poor programming solutions have a strong impact on energy consumption. Objective. Despite the efforts spent so far, only little knowledge is available on the influence of code smells, i.e., symptoms of poor design or implementation choices, on the energy consumption of mobile applications. Method. To provide a wider overview of the relationship between smells and energy efficiency, in this paper we conducted a large-scale empirical study on the influence of 9 Android-specific code smells on the energy consumption of 60 Android apps. In particular, we focus our attention on the design flaws that are theoretically supposed to be related to non-functional attributes of source code, such as performance and energy consumption. Results. The results of the study highlight that methods affected by four code smell types, i.e., Internal Setter, Leaking Thread, Member Ignoring Method, and Slow Loop, consume up to 87 times more than methods affected by other code smells. Moreover, we found that refactoring these code smells reduces energy consumption in all of the studied situations. Conclusions. Based on our findings, we argue that more research aimed at designing automatic refactoring approaches and tools for mobile apps is needed.

79 citations


Journal ArticleDOI
TL;DR: DeepLoc is a novel deep learning-based model that is capable of automatically connecting bug reports to the corresponding buggy files and achieves better performance than four state-of-the-art approaches based on a deep understanding of semantics in bug reports and source code.
Abstract: Context: Automatic localization of buggy files can speed up the process of bug fixing to improve the efficiency and productivity of software quality assurance teams. Useful semantic information is available in bug reports and source code, but it is usually underutilized by existing bug localization approaches. Objective: To improve the performance of bug localization, we propose DeepLoc, a novel deep learning-based model that makes full use of semantic information. Method: DeepLoc is composed of an enhanced convolutional neural network (CNN) that considers bug-fixing recency and frequency, together with word-embedding and feature-detecting techniques. DeepLoc uses word embeddings to represent the words in bug reports and source files in a way that retains their semantic information, and different CNNs to detect features from them. DeepLoc is evaluated on over 18,500 bug reports extracted from the AspectJ, Eclipse, JDT, SWT, and Tomcat projects. Results: The experimental results show that DeepLoc achieves 10.87%–13.4% higher MAP (mean average precision) than a conventional CNN. DeepLoc outperforms four current state-of-the-art approaches (DeepLocator, HyLoc, LR+WE, and BugLocator) in terms of Accuracy@k (the percentage of bug reports for which at least one real buggy file is located within the top k ranks), MAP, and MRR (mean reciprocal rank), while using less computation time. Conclusion: DeepLoc is capable of automatically connecting bug reports to the corresponding buggy files and achieves better performance than four state-of-the-art approaches based on a deep understanding of semantics in bug reports and source code.

75 citations
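
The overall shape of such a model, i.e., one convolutional text encoder for the bug report, another for the source file, plus bug-fixing recency/frequency features feeding a relevance score, could be sketched with Keras roughly as follows. Vocabulary sizes, sequence lengths, filter settings and the way the metadata is injected are assumptions for illustration, not the published architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def text_cnn_branch(vocab_size, seq_len, emb_dim=100, filters=64, name=None):
    """A single word-embedding + 1D-convolution encoder for one text input."""
    inp = layers.Input(shape=(seq_len,), name=name)
    x = layers.Embedding(vocab_size, emb_dim)(inp)
    x = layers.Conv1D(filters, kernel_size=3, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    return inp, x

report_in, report_vec = text_cnn_branch(20000, 200, name="bug_report")
source_in, source_vec = text_cnn_branch(50000, 1000, name="source_file")
# Two scalar metadata features: bug-fixing recency and frequency of the file.
meta_in = layers.Input(shape=(2,), name="recency_frequency")

merged = layers.Concatenate()([report_vec, source_vec, meta_in])
hidden = layers.Dense(64, activation="relu")(merged)
relevance = layers.Dense(1, activation="sigmoid", name="relevance")(hidden)

model = Model(inputs=[report_in, source_in, meta_in], outputs=relevance)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Trained on (report, file) pairs labelled 1 for true buggy files and 0 otherwise;
# at query time, candidate files are ranked by the predicted relevance score.
```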


Journal ArticleDOI
TL;DR: It is suggested that researchers use the unsupervised method LOC_D as a baseline when comparing their proposed novel methods for the SDNP problem in the future.
Abstract: Context: Software defect number prediction (SDNP) can rank program modules according to the prediction results and is helpful for the optimization of testing resource allocation. Objective: In previous studies, supervised methods vs. unsupervised methods has been an active issue for just-in-time defect prediction and file-level defect prediction based on effort-aware performance measures. However, this issue has not been investigated for SDNP. To the best of our knowledge, we are the first to make a thorough comparison of these two different types of methods. Method: In our empirical studies, we consider 7 real open-source projects with 24 versions in total, use FPA and Kendall as our effort-aware performance measures, and consider three different performance evaluation scenarios (i.e., within-version, cross-version, and cross-project scenarios). Result: We first identify the two unsupervised methods with the best performance. These two methods simply rank modules according to the value of the metric LOC and the metric RFC, respectively, from large to small. Then we compare 9 state-of-the-art supervised methods incorporating SMOTEND, which is used for handling the class imbalance problem, with the unsupervised method based on the LOC metric (i.e., the LOC_D method). Final results show that the LOC_D method can perform significantly better than or the same as these supervised methods. Later, motivated by a recent study conducted by Agrawal and Menzies, we apply differential evolution (DE) to optimize the parameter values of SMOTEND used by these supervised methods and find that using DE can effectively improve the performance of these supervised methods for SDNP too. Finally, we continue to compare LOC_D with these DE-optimized supervised methods, and the LOC_D method still has performance advantages, especially in the cross-version and cross-project scenarios. Conclusion: Based on these results, we suggest that researchers use the unsupervised method LOC_D as a baseline method when comparing their proposed novel methods for the SDNP problem in the future.

73 citations
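
LOC_D itself is deliberately simple: rank modules by lines of code in descending order and use that ranking directly. A minimal sketch of the baseline and of the two effort-aware measures is below; the FPA formula follows a commonly cited definition and should be checked against the paper before reuse.

```python
import numpy as np
from scipy.stats import kendalltau

def loc_d_ranking(loc):
    """Unsupervised LOC_D baseline: rank modules by LOC, largest first."""
    return np.argsort(-np.asarray(loc))

def fpa(predicted, actual_defects):
    """Fault-percentile-average (common formulation): sort modules by the predicted
    value in ascending order; average, over all cut-offs m, the fraction of total
    defects found in the top-m predicted modules."""
    actual = np.asarray(actual_defects, dtype=float)
    order = np.argsort(predicted)            # ascending predicted values
    n = actual[order]
    K, N = len(n), n.sum()
    ranks = np.arange(1, K + 1)              # 1 = lowest predicted, K = highest
    return float((ranks * n).sum() / (K * N)) if N > 0 else 0.0

# Example: compare the LOC_D ranking with ground-truth defect counts.
loc = [120, 45, 300, 10, 80]
defects = [2, 0, 5, 0, 1]
print("ranking:", loc_d_ranking(loc))
print("FPA    :", fpa(loc, defects))
print("Kendall:", kendalltau(loc, defects).correlation)
```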


Journal ArticleDOI
TL;DR: In this paper, a systematic mapping study of infrastructure as code (IaC) related research was conducted by searching five scholar databases and collecting a set of 31,498 publications by using seven search strings.
Abstract: Context: Infrastructure as code (IaC) is the practice to automatically configure system dependencies and to provision local and remote instances. Practitioners consider IaC a fundamental pillar to implement DevOps practices, which helps them to rapidly deliver software and services to end-users. Information technology (IT) organizations, such as GitHub, Mozilla, Facebook, Google and Netflix, have adopted IaC. A systematic mapping study on existing IaC research can help researchers to identify potential research areas related to IaC, for example defects and security flaws that may occur in IaC scripts. Objective: The objective of this paper is to help researchers identify research areas related to infrastructure as code (IaC) by conducting a systematic mapping study of IaC-related research. Method: We conduct our research study by searching five scholar databases. We collect a set of 31,498 publications by using seven search strings. By systematically applying inclusion and exclusion criteria, which includes removing duplicates and removing non-English and non-peer-reviewed publications, we identify 32 publications related to IaC. We identify topics addressed in these publications by applying qualitative analysis. Results: We identify four topics studied in IaC-related publications: (i) framework/tool for infrastructure as code; (ii) adoption of infrastructure as code; (iii) empirical study related to infrastructure as code; and (iv) testing in infrastructure as code. According to our analysis, 50.0% of the studied 32 publications propose a framework or tool to implement the practice of IaC or extend the functionality of an existing IaC tool. Conclusion: Our findings suggest that frameworks or tools are a well-studied topic in IaC research. As defects and security flaws can have serious consequences for the deployment and development environments in DevOps, we observe the need for research studies that will study defects and security flaws for IaC.

Journal ArticleDOI
TL;DR: In this article, the authors use Abstract Syntax Tree (AST) n-grams to identify features of defective Java code that improve defect prediction performance and use non-parametric testing to determine relationships between AST n-grams and faults in both open source and commercial systems.
Abstract: Context: Identifying defects in code early is important. A wide range of static code metrics have been evaluated as potential defect indicators. Most of these metrics offer only high level insights and focus on particular pre-selected features of the code. None of the currently used metrics clearly performs best in defect prediction. Objective: We use Abstract Syntax Tree (AST) n-grams to identify features of defective Java code that improve defect prediction performance. Method: Our approach is bottom-up and does not rely on pre-selecting any specific features of code. We use non-parametric testing to determine relationships between AST n-grams and faults in both open source and commercial systems. We build defect prediction models using three machine learning techniques. Results: We show that AST n-grams are very significantly related to faults in some systems, with very large effect sizes. The occurrence of some frequently occurring AST n-grams in a method can mean that the method is up to three times more likely to contain a fault. AST n-grams can have a large effect on the performance of defect prediction models. Conclusions: We suggest that AST n-grams offer developers a promising approach to identifying potentially defective code.
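
The core idea, sliding a window over a traversal of a method's AST and counting node-type n-grams as features, can be illustrated in a few lines. The paper studies Java systems; the sketch below uses Python's own ast module purely to keep the example self-contained, and the pre-order traversal and the value of n are illustrative choices.

```python
import ast
from collections import Counter

def ast_ngrams(source_code, n=3):
    """Count node-type n-grams over a pre-order traversal of the AST."""
    tree = ast.parse(source_code)

    def preorder(node):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from preorder(child)

    names = list(preorder(tree))
    return Counter(tuple(names[i:i + n]) for i in range(len(names) - n + 1))

snippet = """
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s += x
    return s
"""
# Each n-gram count becomes one column of the feature vector fed to a
# defect prediction model (e.g., a random forest over labelled methods).
print(ast_ngrams(snippet, n=3).most_common(5))
```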

Journal ArticleDOI
TL;DR: The findings show that requirements engineering activities in software startups are similar to those in agile teams, but some steps vary as a consequence of the lack of an accessible customer.
Abstract: Context Over the past 20 years, software startups have created many products that have changed human life. Since these companies are creating brand-new products or services, requirements are difficult to gather and highly volatile. Although scientific interest in software development in this context has increased, the studies on requirements engineering in software startups are still scarce and mostly focused on elicitation activities. Objective This study overcomes this gap by answering how requirements engineering practices are performed in this context. Method We conducted a grounded theory study based on 17 interviews with software startups practitioners. Results We constructed a model to show that software startups do not follow a single set of practices but, instead, build a custom process, changed throughout the development of the company, combining different practices according to a set of influences (Founders, Software Development Manager, Developers, Market, Business Model and Startup Ecosystem). Conclusion Our findings show that requirements engineering activities in software startups are similar to those in agile teams, but some steps vary as a consequence of the lack of an accessible customer.

Journal ArticleDOI
TL;DR: In this article, a systematic literature review was conducted to identify variability-aware implementation metrics, which are designed for the needs of software product lines, specifically for variability models, code artifacts, and metrics taking both kinds of artifacts into account.
Abstract: Context: Software Product Line (SPL) development requires at least concepts for variability implementation and variability modeling for deriving products from a product line. These variability implementation concepts are not required for the development of single systems and, thus, are not considered in traditional software engineering. Metrics are well established in traditional software engineering, but existing metrics are typically not applicable to SPLs as they do not address variability management. Over time, various specialized product line metrics have been described in literature, but no systematic description of these metrics and their characteristics is currently available. Objective: This paper describes and analyzes variability-aware metrics, designed for the needs of software product lines. More precisely we restrict the scope of our study explicitly to metrics designed for variability models, code artifacts, and metrics taking both kinds of artifacts into account. Further, we categorize the purpose for which these metrics were developed. We also analyze to what extent these metrics were evaluated to provide a basis for researchers for selecting adequate metrics. Method: We conducted a systematic literature review to identify variability-aware implementation metrics. We discovered 42 relevant papers reporting metrics intended to measure aspects of variability models or code artifacts. Results: We identified 57 variability model metrics, 34 annotation-based code metrics, 46 code metrics specific to composition-based implementation techniques, and 10 metrics integrating information from variability model and code artifacts. For only 31 metrics, an evaluation was performed assessing their suitability to draw any qualitative conclusions. Conclusions: We observed several problematic issues regarding the definition and the use of the metrics. Researchers and practitioners benefit from the catalog of variability-aware metrics, which is the first of its kind. Also, the research community benefits from the identified observations in order to avoid those problems when defining new metrics.

Journal ArticleDOI
TL;DR: In this paper, the authors performed a qualitative exploratory multiple case study in the context of real-life large-scale distributed Agile projects, in order to understand the challenges Agile teams face regarding quality requirements.
Abstract: [Context and Motivation] Focusing single-mindedly on delivering functional requirements while neglecting quality requirements has been a point of criticism of Agile software development methods since their introduction. [Question/problem] Empirical evidence on the challenges that organizations currently face when dealing with quality requirements in Agile is, however, scant. [Principal ideas/results] We performed a qualitative exploratory multiple case study in the context of real-life large-scale distributed Agile projects, in order to understand the challenges Agile teams face regarding quality requirements. Based on 17 semi-structured, open-ended, in-depth interviews with Agile practitioners from six organizations in the Netherlands, we collected and analysed data, revealing 13 quality requirements challenges classified in five categories: (1) team coordination and communication, (2) quality assurance, (3) quality requirements elicitation, (4) conceptual definitions, and (5) software architecture. We found an incongruity in the way QRs are conceptualized by Agile practitioners and in RE textbooks. [Contribution] The main contributions of the paper are the explication of the challenges from the practitioners’ perspective and the comparison of our findings with previously published results.

Journal ArticleDOI
TL;DR: There is no consensus about SPL formalization, what assets can evolve, nor how and when these evolve, and the SPL community needs to work together to improve the state of the art, creating methods and tools that support SPL evolution in a more comparable manner.
Abstract: Context: Software Product Lines (SPL) evolve when there are changes in the requirements, product structure or the technology being used. Different approaches have been proposed for managing SPL assets and some also address how evolution affects these assets. Existing mapping studies have focused on specific aspects of SPL evolution, but there is no cohesive body of work that gives an overview of the area as a whole. Objective: The goals of this work are to review the characteristics of the approaches reported as supporting SPL evolution, and to synthesize the evidence provided by primary studies about the nature of their processes, as well as how they are reported and validated. Method: We conducted a systematic literature review, considering six research questions formulated to evaluate evolution approaches for SPL. We considered journal, conference and workshop papers published up until March 2017 in leading digital libraries for computer science. Results: After a thorough analysis of the papers retrieved from the digital libraries, we ended up with a set of 60 primary studies. Feature models are widely used to represent SPLs, so feature evolution is frequently addressed. Other assets are less frequently addressed. The area has matured over time: papers presenting more rigorous work are becoming more common. The processes used to support SPL evolution are systematic, but with a low level of automation. Conclusions: Our research shows that there is no consensus about SPL formalization, what assets can evolve, nor how and when these evolve. Case studies are quite popular, but few industrial-sized case studies are publicly available. Also, few of the proposed techniques offer tool support. We believe that the SPL community needs to work together to improve the state of the art, creating methods and tools that support SPL evolution in a more comparable manner.

Journal ArticleDOI
TL;DR: The results indicate that current knowledge on the startup ecosystem is mainly shared by non-peer-reviewed literature, thus signifying the need for more systematic and empirical literature on the topic.
Abstract: Context: Successful startup firms have the ability to create jobs and contribute to economic welfare. A suitable ecosystem developed around startups is important to form and support these firms. In this regard, it is crucial to understand the startup ecosystem, particularly from researchers’ and practitioners’ perspectives. However, a systematic literature research on the startup ecosystem is limited. Objective: In this study, our objective was to conduct a multi-vocal literature review and rigorously find existing studies on the startup ecosystem in order to organize and analyze them, know the definitions and major elements of this ecosystem, and determine the roles of such elements in startups’ product development. Method: We conducted a multi-vocal literature review to analyze relevant articles, which are published technical articles, white papers, and Internet articles that focused on the startup ecosystem. Our search generated 18,310 articles, of which 63 were considered primary candidates focusing on the startup ecosystem. Results: From our analysis of primary articles, we found four definitions of a startup ecosystem. These definitions used common terms, such as stakeholders, supporting organization, infrastructure, network, and region. Out of 63 articles, 34 belonged to the opinion type, with contributions in the form of reports, whereas over 50% had full relevance to the startup ecosystem. We identified eight major elements (finance, demography, market, education, human capital, technology, entrepreneur, and support factors) of a startup ecosystem, which directly or indirectly affected startups. Conclusions: This study aims to provide the state of the art on the startup ecosystem through a multi-vocal literature review. The results indicate that current knowledge on the startup ecosystem is mainly shared by non-peer-reviewed literature, thus signifying the need for more systematic and empirical literature on the topic. Our study also provides some recommendations for future work.

Journal ArticleDOI
TL;DR: Euphoria is a new software architecture design and implementation that enables easy prototyping, deployment, and evaluation of adaptable and flexible interactions across heterogeneous devices in smart environments, and is hoped to foster advances and developments in new software architecture initiatives for our increasingly complex smart environments.
Abstract: Context: From personal mobile and wearable devices to public ambient displays, our digital ecosystem has been growing with a large variety of smart sensors and devices that can capture and deliver insightful data to connected applications, thus creating the need for new software architectures to enable fluent and flexible interactions in such smart environments. Objective: We introduce Euphoria, a new software architecture design and implementation that enables easy prototyping, deployment, and evaluation of adaptable and flexible interactions across heterogeneous devices in smart environments. Method: We designed Euphoria by following the requirements of the ISO/IEC 25010:2011 standard on Software Quality Requirements and Evaluation applied to the specific context of smart environments. Results: To demonstrate the adaptability and flexibility of Euphoria, we describe three application scenarios for contexts of use involving multiple users, multiple input/output devices, and various types of smart environments, as follows: (1) wearable user interfaces and whole-body gesture input for interacting with public ambient displays, (2) multi-device interactions in physical-digital spaces, and (3) interactions on smartwatches for a connected car application scenario. We also perform a technical evaluation of Euphoria regarding the main factors responsible for the magnitudes of the request-response times for producing, broadcasting, and consuming messages inside the architecture. We deliver the source code of Euphoria free to download and use for research purposes. Conclusion: By introducing Euphoria and discussing its applicability, we hope to foster advances and developments in new software architecture initiatives for our increasingly complex smart environments, but also to readily support implementations of novel interactive systems and applications for smart environments of all kinds.

Journal ArticleDOI
TL;DR: The findings indicate that supporting factors, such as incubators and accelerators, can influence MVP development by providing young founders with the necessary entrepreneurship skills and education needed to create the right product-market fit.
Abstract: Context Software startups develop innovative products through which they scale their business rapidly, and thus, provide value to the economy, including job generation. However, most startups fail within two years of their launch because of a poor problem-solution fit and negligence of the learning process during minimum viable product (MVP) development. An ideal startup ecosystem can assist in MVP development by providing the necessary entrepreneurial education and technical skills to founding team members for identifying problem-solution fit for their product idea, allowing them to find the right product-market fit. However, existing knowledge on the effect of the startup ecosystem elements on the MVP development is limited. Objective The empirical study presented in this article aims to identify the effect of the six ecosystem elements (entrepreneurs, technology, market, support factors, finance, and human capital) on MVP development. Method We conducted a study with 13 software startups and five supporting organizations (accelerators, incubator, co-working space, and investment firm) in the startup ecosystem of the city of Oulu in Finland. Data were collected through semi-structured interviews, observation, and materials. Results The study results showed that internal sources are most common for identifying requirements for the product idea for MVP development. The findings indicate that supporting factors, such as incubators and accelerators, can influence MVP development by providing young founders with the necessary entrepreneurship skills and education needed to create the right product-market fit. Conclusions We conclude from this study of a regional startup ecosystem that the MVP development process is most affected by founding team members’ experiences and skill sets and by advanced technologies. Furthermore, a constructive startup ecosystem around software startups can boost up the creation of an effective MVP to test product ideas and find a product-market fit.

Journal ArticleDOI
TL;DR: This paper explores the concept of waste in agile/lean software development organizations and how it is defined, used, prioritized, reduced, or eliminated in practice.
Abstract: Context The principal focus of lean is the identification and elimination of waste from the process with respect to maximizing customer value. Similarly, the purpose of agile is to maximize customer value and minimize unnecessary work and time delays. In both cases the concept of waste is important. Through an empirical study, we explore how waste is approached in agile software development organizations. Objective This paper explores the concept of waste in agile/lean software development organizations and how it is defined, used, prioritized, reduced, or eliminated in practice. Method The data were collected using semi-structured open interviews. 23 practitioners from 14 embedded software development organizations were interviewed, representing two core roles in each organization. Results Various wastes, categorized into 10 different categories, were identified by the respondents. Not all of the mentioned wastes were necessarily waste per se but could be symptoms caused by wastes. Of the seven wastes of lean, Task-switching was ranked as the most important, and Extra-features as the least important waste, according to the respondents’ opinion. However, most companies do not have their own, or use an established, definition of waste; more importantly, very few actively identify or try to eliminate waste in their organizations beyond local initiatives at the project level. Conclusion In order to identify, recognize and eliminate waste, a common understanding and a joint, holistic view of the concept are needed. It is also important to optimize the whole organization and the whole product, as waste on one level can be important on another; thus sub-optimization should be avoided. Furthermore, to achieve sustainable and effective waste handling, both the short-term and the long-term perspectives need to be considered.

Journal ArticleDOI
TL;DR: The experiment confirms conventional wisdom in requirements engineering: identifying terminological ambiguities is time consuming, even with tool support, and it is hard to determine whether a near-synonym may challenge the correct development of a software system.
Abstract: Context. Defects such as ambiguity and incompleteness are pervasive in software requirements, often due to the limited time that practitioners devote to writing good requirements. Objective. We study whether a synergy between humans’ analytic capabilities and natural language processing is an effective approach for quickly identifying near-synonyms, a possible source of terminological ambiguity. Method. We propose a tool-supported approach that blends information visualization with two natural language processing techniques: conceptual model extraction and semantic similarity. We evaluate the precision and recall of our approach compared to a pen-and-paper manual inspection session through a controlled quasi-experiment that involves 57 participants organized into 28 groups, each group working on one real-world requirements data set. Results. The experimental results indicate that manual inspection delivers higher recall (statistically significant with p ≤ 0.01) and non-significantly higher precision. Based on qualitative observations, we analyze the quantitative results and suggest interpretations that explain the advantages and disadvantages of each approach. Conclusions. Our experiment confirms conventional wisdom in requirements engineering: identifying terminological ambiguities is time consuming, even with tool support, and it is hard to determine whether a near-synonym may challenge the correct development of a software system. The results suggest that the most effective approach may be a combination of manual inspection with an improved version of our tool.
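
To make the idea of flagging near-synonym pairs concrete, the sketch below compares candidate requirement terms with a WordNet-based similarity measure and reports pairs above a threshold for the analyst to inspect. The term extraction, the similarity function and the threshold are illustrative assumptions and much simpler than the tool evaluated in the paper, which combines conceptual model extraction, semantic similarity and visualization.

```python
from itertools import combinations
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def max_path_similarity(term_a, term_b):
    """Best WordNet path similarity over all noun senses of the two terms."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(term_a, pos=wn.NOUN)
              for s2 in wn.synsets(term_b, pos=wn.NOUN)]
    return max(scores, default=0.0)

def near_synonym_candidates(terms, threshold=0.5):
    """Return term pairs similar enough to be potential terminological ambiguities."""
    return [(a, b, round(max_path_similarity(a, b), 2))
            for a, b in combinations(sorted(set(terms)), 2)
            if max_path_similarity(a, b) >= threshold]

# Terms extracted (here, by hand) from a fictitious requirements document.
terms = ["user", "customer", "operator", "fault", "error", "defect"]
for a, b, score in near_synonym_candidates(terms):
    print(f"possible near-synonyms: {a} / {b} (similarity {score})")
```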

Journal ArticleDOI
TL;DR: The participants’ performance and perceptions when using the prototype provided evidence that the proposal could reduce AK vaporization in AGSD environments, and encourage us to evaluate the proposal in a long-term test as future work.
Abstract: Context The adoption of agile methods is a trend in global software development (GSD), but may result in many challenges. One important challenge is architectural knowledge (AK) management, since agile developers prefer sharing knowledge through face-to-face interactions, while in GSD the preferred manner is documents. Agile knowledge-sharing practices tend to predominate in GSD companies that practice agile development (AGSD), leading to a lack of documents, such as architectural designs, data models, deployment specifications, etc., resulting in the loss of AK over time, i.e., it vaporizes. Objective In a previous study, we found that there is important AK in the log files of unstructured textual electronic media (UTEM), such as instant messengers, emails, forums, etc., which are the preferred means employed in AGSD to contact remote teammates. The objective of this paper is to present and evaluate a proposal with which to recover AK from UTEM logs. We developed and evaluated a prototype that implements our proposal in order to determine its feasibility. Method The evaluation was performed by conducting a study with agile/global developers and students, who used the prototype and different UTEM to execute tasks that emulate common situations concerning AGSD teams’ lack of documentation during development phases. Results Our prototype was considered a useful, usable and unobtrusive tool when retrieving AK from UTEM logs. The participants also preferred our prototype when searching for AK and found AK faster with the prototype than with UTEM when the origin of the AK required was unknown. Conclusion The participants’ performance and perceptions when using our prototype provided evidence that our proposal could reduce AK vaporization in AGSD environments. These results encourage us to evaluate our proposal in a long-term test as future work.

Journal ArticleDOI
TL;DR: This review paper serves for both researchers and practitioners as an “index” to the vast body of knowledge in the area of testability and to benefit the readers in preparing, measuring and improving software testability.
Abstract: Context Software testability is the degree to which a software system or a unit under test supports its own testing. To predict and improve software testability, a large number of techniques and metrics have been proposed by both practitioners and researchers in the last several decades. Reviewing and getting an overview of the entire state-of-the-art and state-of-the-practice in this area is often challenging for a practitioner or a new researcher. Objective Our objective is to summarize the body of knowledge in this area and to benefit the readers (both practitioners and researchers) in preparing, measuring and improving software testability. Method To address the above need, the authors conducted a survey in the form of a systematic literature mapping (classification) to find out what we as a community know about this topic. After compiling an initial pool of 303 papers, and applying a set of inclusion/exclusion criteria, our final pool included 208 papers (published between 1982 and 2017). Results The area of software testability has been comprehensively studied by researchers and practitioners. Approaches for measurement of testability and improvement of testability are the most-frequently addressed in the papers. The two most often mentioned factors affecting testability are observability and controllability. Common ways to improve testability are testability transformation, improving observability, adding assertions, and improving controllability. Conclusion This paper serves for both researchers and practitioners as an “index” to the vast body of knowledge in the area of testability. The results could help practitioners measure and improve software testability in their projects. To assess potential benefits of this review paper, we shared its draft version with two of our industrial collaborators. They stated that they found the review useful and beneficial in their testing activities. Our results can also benefit researchers in observing the trends in this area and identify the topics that require further investigation.

Journal ArticleDOI
TL;DR: The proposed FineLocator approach can improve the performance of method-level bug localization on average by 20%, 21% and 17% as measured by the Top-N indicator, MAP and MRR respectively, in comparison with state-of-the-art techniques.
Abstract: Context Bug localization, namely, locating suspicious snippets in source code files for developers to fix a bug, is crucial for software quality assurance and software maintenance. Effective bug localization techniques are desirable for software developers to reduce the effort involved in bug resolution. State-of-the-art bug localization techniques concentrate on file-level coarse-grained localization by lexically matching bug reports and source code files. However, this places a heavy burden on developers to locate the feasible code snippets that must be changed to fix the bug. Objective This paper proposes a novel approach called FineLocator for method-level fine-grained bug localization, which uses semantic similarity, temporal proximity and call dependency for method expansion. Method Firstly, the bug reports and the methods of the source code are represented by numeric vectors using word embedding (word2vec) and the TF-IDF method. Secondly, we propose three query expansion scores, i.e., a semantic similarity score, a temporal proximity score and a call dependency score, to address the representation sparseness problem caused by the short lengths of methods in the source code. Then, the representation of a method with short length is augmented by elements of its neighboring methods with query expansion. Thirdly, when a new bug report arrives, FineLocator retrieves the methods in the source code by similarity ranking on the bug report and the augmented methods for bug localization. Results We collect bug repositories of the ArgoUML, Maven, Kylin, Ant and AspectJ projects to investigate the performance of the proposed FineLocator approach. Experimental results demonstrate that the proposed FineLocator approach can improve the performance of method-level bug localization on average by 20%, 21% and 17% as measured by the Top-N indicator, MAP and MRR respectively, in comparison with state-of-the-art techniques. Conclusion This is the first paper to demonstrate how to make use of method expansion to address the representation sparseness problem for method-level fine-grained bug localization.
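
A stripped-down view of the query-expansion step is sketched below: each method is first represented with TF-IDF, short methods are then augmented with their neighbours' vectors weighted by a combined semantic/temporal/call-dependency score, and methods are finally ranked against the bug report by cosine similarity. The weighting formula, the use of TF-IDF alone in place of the paper's word2vec component, and all constants are simplifying assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def expansion_weight(sem_sim, time_gap_days, calls, half_life=30.0):
    """Combine semantic similarity, temporal proximity and call dependency (assumed weights)."""
    temporal = np.exp(-time_gap_days / half_life)
    call_dep = 1.0 if calls else 0.0
    return (sem_sim + temporal + call_dep) / 3.0

def rank_methods(bug_report, methods, commit_days, call_graph, alpha=0.5):
    """methods: list of method texts; commit_days[i]: last-change day of method i;
    call_graph: set of (i, j) pairs meaning method i calls or is called by method j."""
    vec = TfidfVectorizer().fit(methods + [bug_report])
    M = vec.transform(methods).toarray()
    sem = cosine_similarity(M)
    expanded = M.copy()
    for i in range(len(methods)):
        for j in range(len(methods)):
            if i == j:
                continue
            w = expansion_weight(sem[i, j],
                                 abs(commit_days[i] - commit_days[j]),
                                 (i, j) in call_graph or (j, i) in call_graph)
            expanded[i] += alpha * w * M[j]          # augment sparse/short methods
    query = vec.transform([bug_report]).toarray()
    scores = cosine_similarity(query, expanded)[0]
    return np.argsort(-scores)                        # most suspicious methods first
```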

Journal ArticleDOI
TL;DR: In this paper, a qualitative analysis on defect-related commits mined from open source software repositories is performed to identify source code properties that correlate with defective infrastructure as code (IaC) scripts.
Abstract: Context In continuous deployment, software and services are rapidly deployed to end-users using an automated deployment pipeline. Defects in infrastructure as code (IaC) scripts can hinder the reliability of the automated deployment pipeline. We hypothesize that certain properties of IaC source code, such as lines of code and hard-coded strings used as configuration values, show correlation with defective IaC scripts. Objective The objective of this paper is to help practitioners increase the quality of infrastructure as code (IaC) scripts through an empirical study that identifies source code properties of defective IaC scripts. Methodology We apply qualitative analysis on defect-related commits mined from open source software repositories to identify source code properties that correlate with defective IaC scripts. Next, we survey practitioners to assess their agreement level with the identified properties. We also construct defect prediction models using the identified properties for 2439 scripts collected from four datasets. Results We identify 10 source code properties that correlate with defective IaC scripts. Of the identified 10 properties, we observe lines of code and hard-coded strings, i.e., strings used as configuration values, to show the strongest correlation with defective IaC scripts. According to our survey analysis, the majority of the practitioners show agreement for two properties: include, the property of executing external modules or scripts, and hard-coded strings. Using the identified properties, our constructed defect prediction models show a precision of 0.70∼0.78, and a recall of 0.54∼0.67. Conclusion Based on our findings, we recommend practitioners to allocate sufficient inspection and testing effort on IaC scripts that exhibit any of the identified 10 source code properties.
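
As a toy illustration of how two of the reported properties (lines of code and hard-coded configuration strings) could be turned into features and fed to a defect prediction model, consider the sketch below. The regular expression, the classifier and the tiny labelled dataset are all assumptions; the paper's full model uses ten properties and larger curated datasets.

```python
import re
from sklearn.ensemble import RandomForestClassifier

def iac_features(script_text):
    """Two of the ten reported properties: lines of code and hard-coded strings
    used as configuration values (approximated here as quoted values after =>, =, or :)."""
    lines = [l for l in script_text.splitlines() if l.strip()]
    hard_coded = re.findall(r"(?:=>|=|:)\s*['\"][^'\"]+['\"]", script_text)
    return [len(lines), len(hard_coded)]

# Hypothetical labelled scripts: 1 = a defect-related commit touched the script.
scripts = [
    "package { 'ntp':\n  ensure => '4.2.8'\n}",
    "user { 'deploy':\n  password => 'hunter2',\n  home => '/opt/app'\n}",
    "service { 'sshd':\n  ensure => running\n}",
]
labels = [0, 1, 0]

X = [iac_features(s) for s in scripts]
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
print(model.predict([iac_features("file { '/etc/app.conf':\n  content => 'token=abc123'\n}")]))
```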

Journal ArticleDOI
TL;DR: A systematic literature review was carried out to identify, evaluate, and synthesize research published concerning software developers’ emotions as well as the measures used to assess them, providing a holistic view that will benefit researchers by presenting the latest trends in this area and identifying the corresponding research gaps.
Abstract: Context Over the past 50 years of Software Engineering, numerous studies have acknowledged the importance of human factors. However, software developers’ emotions are still an area under investigation and debate that is gaining relevance in the software industry. Objective In this study, a systematic literature review (SLR) was carried out to identify, evaluate, and synthesize research published concerning software developers’ emotions as well as the measures used to assess them. Method By searching five major bibliographic databases, the authors identified 7172 articles related to emotions in Software Engineering. We selected 66 of these papers as primary studies. Then, they were analyzed in order to find empirical evidence of the intersection of emotions and software engineering. Results Studies report a total of 40 discrete emotions, but the most frequent were: anger, fear, disgust, sadness, joy, love, and happiness. There are also 2 different dimensional approaches and 10 datasets related to this topic which are publicly available on the Web. The findings also showed that self-reported mood instruments (e.g., SAM, PANAS), physiological measures (e.g., heart rate, perspiration) or behavioral measures (e.g., keyboard use) are the least reported tools, although there is a recognized intrinsic problem with the accuracy of current state-of-the-art sentiment analysis tools. Moreover, most of the studies used software practitioners and/or datasets from industrial contexts as subjects. Conclusions The study of emotions has received growing attention from the research community in recent years, but the management of emotions has always been challenging in practice. Although it can be said that this field is not mature enough yet, our results provide a holistic view that will benefit researchers by presenting the latest trends in this area and identifying the corresponding research gaps.

Journal ArticleDOI
TL;DR: This work proposes using “bad smells”, i.e., surface indications of deeper problems, a notion popular in the agile software community, considers how they may manifest in software analytics studies, and aims to encourage more debate on what constitutes a ‘valid’ study.
Abstract: Context There has been a rapid growth in the use of data analytics to underpin evidence-based software engineering. However, the combination of complex techniques, diverse reporting standards and poorly understood underlying phenomena is causing some concern as to the reliability of studies. Objective Our goal is to provide guidance for producers and consumers of software analytics studies (computational experiments and correlation studies). Method We propose using “bad smells”, i.e., surface indications of deeper problems, a notion popular in the agile software community, and consider how they may manifest in software analytics studies. Results We list 12 “bad smells” in software analytics papers (and show their impact by examples). Conclusions We believe the metaphor of bad smell is a useful device. Therefore, we encourage more debate on what contributes to the validity of software analytics studies (so we expect our list will mature over time).

Journal ArticleDOI
TL;DR: A hierarchical algorithm is proposed to detect duplicate crowdtesting reports based on four similarity scores derived from screenshots and textual descriptions, which improves duplicate report detection performance significantly and substantially.
Abstract: Context: Crowdtesting is especially effective when it comes to feedback on GUI systems, or subjective opinions about features. Despite this, we find crowdtesting reports are highly duplicated, i.e., 82% of them are duplicates of others. Most of the existing approaches mainly adopted textual information for duplicate detection, and suffered from low accuracy because of the lexical gap. Our observation on real industrial crowdtesting data found that crowdtesting reports of GUI systems are accompanied by images, i.e., screenshots of the tested app. We assume the screenshot to be valuable for duplicate crowdtesting report detection because it reflects the real context of the bug and is not affected by the variety of natural language. Objective: We aim at automatically detecting duplicate crowdtesting reports to help reduce triaging effort. Method: In this work, we propose SETU, which combines information from the ScrEenshots and the TextUal descriptions to detect duplicate crowdtesting reports. We extract four types of features to characterize the screenshots (i.e., an image structure feature and an image color feature) and the textual descriptions (i.e., a TF-IDF feature and a word embedding feature), and design a hierarchical algorithm to detect duplicates based on the four similarity scores derived from the four features respectively. Results: We investigate the effectiveness of SETU on 12 projects with 3,689 reports from one of the largest Chinese crowdtesting platforms. Results show that recall@1 achieved by SETU is 0.44 to 0.79, recall@5 is 0.66 to 0.92, and MAP is 0.21 to 0.58 across all experimental projects. Furthermore, SETU outperforms existing state-of-the-art approaches significantly and substantially. Conclusion: By combining screenshots and textual descriptions, our proposed SETU can improve duplicate crowdtesting report detection performance.
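
The hierarchical combination of screenshot and text similarity can be sketched as follows: if two reports' screenshots are clearly similar, the decision is driven by textual similarity within that visual context; otherwise the scores are blended. Colour-histogram and TF-IDF similarities below stand in for the four features used in the paper, and all thresholds and weights are assumptions.

```python
import numpy as np
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def screenshot_similarity(path_a, path_b, size=(64, 64)):
    """Cosine similarity of RGB colour histograms (stand-in for the paper's image features)."""
    hists = []
    for p in (path_a, path_b):
        img = Image.open(p).convert("RGB").resize(size)
        hists.append(np.asarray(img.histogram(), dtype=float))
    return float(cosine_similarity([hists[0]], [hists[1]])[0, 0])

def text_similarity(text_a, text_b):
    tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def duplicate_score(report_a, report_b, img_threshold=0.8):
    """report = (screenshot_path, textual_description). Hierarchical combination."""
    s_img = screenshot_similarity(report_a[0], report_b[0])
    s_txt = text_similarity(report_a[1], report_b[1])
    if s_img >= img_threshold:
        # Screenshots agree: let the textual description decide.
        return s_txt
    # Otherwise blend both sources of evidence.
    return 0.5 * s_img + 0.5 * s_txt
```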

Journal ArticleDOI
TL;DR: This paper demonstrates that using appropriate deep learning approaches can indeed achieve better performance than traditional approaches in tag recommendation tasks for software information sites.
Abstract: Context Inspired by the success of deep learning in other domains, this new technique has been gaining widespread recent interest in being applied to diverse data analysis problems in software engineering. Many deep learning models, such as CNN, DBN, RNN, LSTM and GAN, have been proposed and recently applied to software engineering tasks including effort estimation, vulnerability analysis, code clone detection, test case selection, requirements analysis and many others. However, there is a perception that applying deep learning is a “silver bullet” if it can be applied to a software engineering data analysis problem. Objective This motivated us to ask whether deep learning is better than traditional approaches in the tag recommendation task for software information sites. Method In this paper we test this question by applying both the latest deep learning approaches and some traditional approaches to the tag recommendation task for software information sites. This is a typical software engineering automation problem where intensive data processing is required to link disparate information to assist developers. Four different deep learning approaches – TagCNN, TagRNN, TagHAN and TagRCNN – are implemented and compared with three advanced traditional approaches – EnTagRec, TagMulRec, and FastTagRec. Results Our comprehensive experimental results show that the performance of these different deep learning approaches varies significantly. The performance of the TagRNN and TagHAN approaches is worse than that of traditional approaches in tag recommendation tasks. The performance of the TagCNN and TagRCNN approaches is better than that of traditional approaches in tag recommendation tasks. Conclusion Therefore, using appropriate deep learning approaches can indeed achieve better performance than traditional approaches in tag recommendation tasks for software information sites.
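
For comparison purposes, a generic "traditional" baseline can be approximated in a few lines as a TF-IDF multi-label classifier that returns the top-k most probable tags per post; this is a stand-in for illustration, not a re-implementation of EnTagRec, TagMulRec or FastTagRec, and the toy posts and tags below are fabricated.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

posts = [
    "how to parse json in python requests",
    "java spring dependency injection error",
    "plotting dataframes with pandas and matplotlib",
]
tags = [["python", "json"], ["java", "spring"], ["python", "pandas"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)
vec = TfidfVectorizer()
X = vec.fit_transform(posts)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

def recommend_tags(text, k=2):
    """Return the k tags with the highest predicted probability for a new post."""
    probs = clf.predict_proba(vec.transform([text]))[0]
    return [mlb.classes_[i] for i in np.argsort(-probs)[:k]]

print(recommend_tags("reading a csv file into a pandas dataframe in python"))
```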

Journal ArticleDOI
TL;DR: A novel approach is proposed that automatically detects duplicate bug reports using stack traces and Hidden Markov Models, and it is shown that HMMs and stack traces are a powerful combination for detecting and classifying duplicate bug reports in large bug repositories.
Abstract: Context Software projects rely on their issue tracking systems to guide the maintenance activities of software developers. Bug reports submitted to the issue tracking systems carry crucial information about the nature of the crash (such as texts from users or developers and execution information about the functions that were running before the occurrence of a crash). Typically, big software projects receive thousands of reports every day. Objective The aim is to reduce the time and effort required to fix bugs while improving software quality overall. Previous studies have shown that a large number of bug reports are duplicates of previously reported ones. For example, as many as 30% of all reports for Firefox are duplicates. Method While there exists a wide variety of approaches to automatically detect duplicate bug reports by natural language processing, only a few approaches have considered the execution information (the so-called stack traces) inside bug reports. In this paper, we propose a novel approach that automatically detects duplicate bug reports using stack traces and Hidden Markov Models. Results When applying our approach to the Firefox and GNOME datasets, we show that, for Firefox, the average recall for rank k = 1 is 59% and for rank k = 2 it is 75.55%. We start reaching 90% recall from k = 10. The Mean Average Precision (MAP) value is up to 76.5%. For GNOME, the recall at k = 1 is around 63%, while this value increases by about 10% for k = 2. The recall increases to 97% for k = 11. A MAP value of up to 73% is achieved. Conclusion We show that HMMs and stack traces are a powerful combination for detecting and classifying duplicate bug reports in large bug repositories.
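
To show the flavour of scoring a new stack trace against those of an existing bug report group, the sketch below trains a first-order Markov model over function-call sequences and scores candidate traces by average log-likelihood. The paper uses Hidden Markov Models, so this plain Markov chain is a deliberately simplified stand-in, and the smoothing constant and any decision threshold are assumptions.

```python
import math
from collections import defaultdict

def train_markov_model(traces, smoothing=1.0):
    """traces: list of stack traces, each a list of frame (function) names."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = {f for t in traces for f in t} | {"<start>"}
    for trace in traces:
        prev = "<start>"
        for frame in trace:
            counts[prev][frame] += 1.0
            prev = frame
    # Laplace-smoothed transition probabilities.
    model = {}
    for prev in vocab:
        total = sum(counts[prev].values()) + smoothing * len(vocab)
        model[prev] = {nxt: (counts[prev][nxt] + smoothing) / total for nxt in vocab}
    return model, vocab

def trace_log_likelihood(model_vocab, trace):
    """Average per-transition log-likelihood of a candidate trace under the model."""
    model, _vocab = model_vocab
    floor = 1e-9
    prev, logp, n = "<start>", 0.0, 0
    for frame in trace:
        logp += math.log(model.get(prev, {}).get(frame, floor))
        prev, n = frame, n + 1
    return logp / max(n, 1)

group = [["main", "parse", "alloc"], ["main", "parse", "read", "alloc"]]
candidate = ["main", "parse", "alloc"]
model = train_markov_model(group)
print(trace_log_likelihood(model, candidate))   # higher score => more likely a duplicate of this group
```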