
Showing papers in "Empirical Software Engineering in 2017"


Journal ArticleDOI
TL;DR: This work proposes a framework, with a reference implementation as a tool called reaper, that enables researchers to select GitHub repositories containing evidence of an engineered software project; it identifies software engineering practices (called dimensions) and proposes means of validating their presence in a GitHub repository.
Abstract: Software forges like GitHub host millions of repositories. Software engineering researchers have been able to take advantage of such large corpora of potential study subjects with the help of tools like GHTorrent and Boa. However, the simplicity in querying comes with a caveat: there are limited means of separating the signal (e.g. repositories containing engineered software projects) from the noise (e.g. repositories containing homework assignments). The proportion of noise in a random sample of repositories could skew the study and may lead to researchers reaching unrealistic, potentially inaccurate, conclusions. We argue that it is imperative to have the ability to sieve out the noise in such large repository forges. We propose a framework, and present a reference implementation of the framework as a tool called reaper, to enable researchers to select GitHub repositories that contain evidence of an engineered software project. We identify software engineering practices (called dimensions) and propose means for validating their existence in a GitHub repository. We used reaper to measure the dimensions of 1,857,423 GitHub repositories. We then used manually classified data sets of repositories to train classifiers capable of predicting if a given GitHub repository contains an engineered software project. The performance of the classifiers was evaluated using a set of 200 repositories with known ground truth classification. We also compared the performance of the classifiers to other approaches to classification (e.g. number of GitHub Stargazers) and found our classifiers to outperform existing approaches. We found the stargazers-based classifier (with 10 as the threshold for the number of stargazers) to exhibit high precision (97%) but low recall (32%). On the other hand, our best classifier exhibited both high precision (82%) and high recall (86%). The stargazer-based criterion offers precision but fails to recall a significant portion of the population.
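
As a rough illustration of the baseline the authors compare against, the sketch below scores a stargazers-threshold classifier (label a repository as an engineered project if it has at least 10 stargazers) with precision and recall. The repository records and labels are hypothetical; this is not part of the reaper tool itself.

```python
# Minimal sketch, assuming hypothetical repository records and ground-truth labels.
from sklearn.metrics import precision_score, recall_score

def stargazer_classifier(repo, threshold=10):
    """Label a repository as 'engineered' (1) if it has at least `threshold` stargazers."""
    return 1 if repo["stargazers"] >= threshold else 0

# Hypothetical validation set: (repository record, ground-truth label).
validation = [({"stargazers": 3}, 1), ({"stargazers": 250}, 1),
              ({"stargazers": 0}, 0), ({"stargazers": 42}, 0)]

y_true = [label for _, label in validation]
y_pred = [stargazer_classifier(repo) for repo, _ in validation]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
```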

301 citations


Journal ArticleDOI
TL;DR: The results of analyses of the Type 1 error efficiency and power of standard parametric and non-parametric statistical tests when applied to non-normal data sets are summarised.
Abstract: There have been many changes in statistical theory in the past 30 years, including increased evidence that non-robust methods may fail to detect important results. The statistical advice available to software engineering researchers needs to be updated to address these issues. This paper aims both to explain the new results in the area of robust analysis methods and to provide a large-scale worked example of the new methods. We summarise the results of analyses of the Type 1 error efficiency and power of standard parametric and non-parametric statistical tests when applied to non-normal data sets. We identify parametric and non-parametric methods that are robust to non-normality. We present an analysis of a large-scale software engineering experiment to illustrate their use. We illustrate the use of kernel density plots, and parametric and non-parametric methods using four different software engineering data sets. We explain why the methods are necessary and the rationale for selecting a specific analysis. We suggest using kernel density plots rather than box plots to visualise data distributions. For parametric analysis, we recommend trimmed means, which can support reliable tests of the differences between the central location of two or more samples. When the distribution of the data differs among groups, or we have ordinal scale data, we recommend non-parametric methods such as Cliff's δ or a robust rank-based ANOVA-like method.
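
A minimal sketch of two of the recommended techniques, a 20% trimmed mean and Cliff's δ (written here as `cliffs_delta`), on hypothetical samples; a full analysis would add Yuen's test or a bootstrap confidence interval on top.

```python
# Minimal sketch, assuming two hypothetical samples of e.g. task completion times.
import numpy as np
from scipy import stats

a = np.array([12.1, 13.4, 11.8, 55.0, 12.9, 14.2, 13.1])   # skewed by one outlier
b = np.array([10.2, 11.0, 10.8, 11.5,  9.9, 10.7, 11.1])

trimmed_a = stats.trim_mean(a, proportiontocut=0.2)  # drops 20% from each tail
trimmed_b = stats.trim_mean(b, proportiontocut=0.2)

def cliffs_delta(x, y):
    """Probability that x > y minus probability that x < y, over all pairs."""
    gt = sum(xi > yi for xi in x for yi in y)
    lt = sum(xi < yi for xi in x for yi in y)
    return (gt - lt) / (len(x) * len(y))

print(trimmed_a, trimmed_b, cliffs_delta(a, b))
```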

192 citations


Journal ArticleDOI
TL;DR: The Naming the Pain in Requirements Engineering (NaPiRE) initiative as discussed by the authors is a family of surveys on the status quo and problems in practical requirements engineering (RE) in 10 countries in various domains.
Abstract: Requirements Engineering (RE) has received much attention in research and practice due to its importance to software project success. Its interdisciplinary nature, the dependency on the customer, and its inherent uncertainty still render the discipline difficult to investigate. This results in a lack of empirical data. These are necessary, however, to demonstrate which practically relevant RE problems exist and to what extent they matter. Motivated by this situation, we initiated the Naming the Pain in Requirements Engineering (NaPiRE) initiative which constitutes a globally distributed, bi-yearly replicated family of surveys on the status quo and problems in practical RE. In this article, we report on the qualitative analysis of data obtained from 228 companies working in 10 countries in various domains and we reveal which contemporary problems practitioners encounter. To this end, we analyse 21 problems derived from the literature with respect to their relevance and criticality in relation to their context, and we complement this picture with a cause-effect analysis showing the causes and effects surrounding the most critical problems. Our results give us a better understanding of which problems exist and how they manifest themselves in practical environments. Thus, we provide a first step to ground contributions to RE on empirical observations which, until now, were dominated by conventional wisdom only.

170 citations


Journal ArticleDOI
TL;DR: Whether the sentiment analysis tools agree with the sentiment recognized by human evaluators (as reported in an earlier study) as well as with each other is studied.
Abstract: Recent years have seen increasing attention to social aspects of software engineering, including studies of emotions and sentiments experienced and expressed by the software developers. Most of these studies reuse existing sentiment analysis tools such as SentiStrength and NLTK. However, these tools have been trained on product reviews and movie reviews and, therefore, their results might not be applicable in the software engineering domain. In this paper we study whether the sentiment analysis tools agree with the sentiment recognized by human evaluators (as reported in an earlier study) as well as with each other. Furthermore, we evaluate the impact of the choice of a sentiment analysis tool on software engineering studies by conducting a simple study of differences in issue resolution times for positive, negative and neutral texts. We repeat the study for seven datasets (issue trackers and Stack Overflow questions) and different sentiment analysis tools and observe that the disagreement between the tools can lead to diverging conclusions. Finally, we perform two replications of previously published studies and observe that the results of those studies cannot be confirmed when a different sentiment analysis tool is used.
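
One simple way to quantify tool (dis)agreement of the kind studied here is Cohen's kappa over per-text polarity labels; the labels below are hypothetical and do not come from the actual tools (SentiStrength, NLTK) examined in the paper.

```python
# Minimal sketch, assuming hypothetical polarity labels from two tools on the same texts.
from sklearn.metrics import cohen_kappa_score

# Per-text polarity assigned by two different tools: -1 negative, 0 neutral, 1 positive.
tool_a = [1, 0, -1, 0, 1, -1, 0, 0]
tool_b = [1, 0,  0, 0, 1, -1, 1, 0]

print("kappa:", cohen_kappa_score(tool_a, tool_b))   # 1.0 = perfect agreement, 0 = chance level
```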

166 citations


Journal ArticleDOI
TL;DR: The study sheds light on why practitioners often perform some of these search tasks and why they find some of them challenging, and discusses the implications of the findings for future research in several research areas.
Abstract: Developers commonly make use of a web search engine such as Google to locate online resources to improve their productivity. A better understanding of what developers search for could help us understand their behaviors and the problems that they meet during the software development process. Unfortunately, we have a limited understanding of what developers frequently search for and of the search tasks that they often find challenging. To address this gap, we collected search queries from 60 developers and surveyed 235 software engineers from more than 21 countries across five continents. In particular, we asked our survey participants to rate the frequency and difficulty of 34 search tasks which are grouped along the following seven dimensions: general search, debugging and bug fixing, programming, third party code reuse, tools, database, and testing. We find that searching for explanations for unknown terminologies, explanations for exceptions/error messages (e.g., HTTP 404), reusable code snippets, solutions to common programming bugs, and suitable third-party libraries/services are the most frequent search tasks that developers perform, while searching for solutions to performance bugs, solutions to multi-threading bugs, public datasets to test newly developed algorithms or systems, reusable code snippets, best industrial practices, database optimization solutions, solutions to security bugs, and solutions to software configuration bugs are the most difficult search tasks that developers consider. Our study sheds light on why practitioners often perform some of these tasks and why they find some of them to be challenging. We also discuss the implications of our findings for future research in several research areas, e.g., code search engines, domain-specific search engines, and automated generation and refinement of search queries.

161 citations


Journal ArticleDOI
TL;DR: A systematic mapping study to provide an overview of the current research on reengineering of existing systems to SPLs, characterize the community activity in terms of venues and frequency of publications in this field, and point out trends and open issues that could serve as references for future research.
Abstract: Software Product Lines (SPLs) are families of systems that share common assets allowing a disciplined reuse. SPLs rarely start from scratch; instead, they usually start from a set of existing systems that undergo a reengineering process. Many approaches to conduct the reengineering process have been proposed and documented in the research literature. This scenario is a clear testament to the interest in this research area. We conducted a systematic mapping study to provide an overview of the current research on reengineering of existing systems to SPLs, identify the community activity in terms of venues and frequency of publications in this field, and point out trends and open issues that could serve as references for future research. This study identified 119 relevant publications. These primary sources were classified into six different dimensions related to reengineering phases, strategies applied, types of systems used in the evaluation, input artefacts, output artefacts, and tool support. The analysis of the results points out the existence of a consolidated community on this topic and a wide range of strategies to deal with different phases and tasks of the reengineering process, besides the availability of some tools. We identify some open issues and areas for future research, such as the implementation of automation and tool support, the use of different sources of information, the need for improvements in feature management, the definition of ways to combine different strategies and methods, the lack of sophisticated refactoring, and the need for new metrics and measures and more robust empirical evaluation. Reengineering of existing systems into SPLs is an active research topic with real benefits in practice. This mapping study motivates new research in this field as well as the adoption of systematic reuse in software companies.

118 citations


Journal ArticleDOI
TL;DR: The results show that forking is mainly used for making contributions to original repositories and is beneficial for the OSS community; they also show the value of recommendation and provide important insights for GitHub to recommend repositories.
Abstract: Forking is the creation of a new software repository by copying another repository. Though forking is controversial in the traditional open source software (OSS) community, it is encouraged and is a built-in feature in GitHub. Developers freely fork repositories, use the code as their own and make changes. A deep understanding of repository forking can provide important insights for the OSS community and GitHub. In this paper, we explore why and how developers fork what from whom in GitHub. We collect a dataset containing 236,344 developers and 1,841,324 forks. We conduct surveys, and analyze programming languages and owners of forked repositories. Our main observations are: (1) Developers fork repositories to submit pull requests, fix bugs, add new features, keep copies, etc. Developers find repositories to fork from various sources: search engines, external sites (e.g., Twitter, Reddit), social relationships, etc. More than 42 % of developers that we have surveyed agree that an automated recommendation tool is useful to help them pick repositories to fork, while more than 44.4 % of developers do not value a recommendation tool. Developers care about repository owners when they fork repositories. (2) A repository written in a developer's preferred programming language is more likely to be forked. (3) Developers mostly fork repositories from creators. In comparison with unattractive repository owners, attractive repository owners have a higher percentage of organizations, more followers and earlier registration in GitHub. Our results show that forking is mainly used for making contributions to original repositories, and it is beneficial for the OSS community. Moreover, our results show the value of recommendation and provide important insights for GitHub to recommend repositories.

109 citations


Journal ArticleDOI
TL;DR: It is found that keeping an archive of already covered goals along with the tests covering them and focusing the search on uncovered goals overcomes this small drawback on larger classes, leading to an improved overall effectiveness of whole test suite generation.
Abstract: A common application of search-based software testing is to generate test cases for all goals defined by a coverage criterion (e.g., lines, branches, mutants). Rather than generating one test case at a time for each of these goals individually, whole test suite generation optimizes entire test suites towards satisfying all goals at the same time. There is evidence that the overall coverage achieved with this approach is superior to that of targeting individual coverage goals. Nevertheless, there remains some uncertainty on (a) whether the results generalize beyond branch coverage, (b) whether the whole test suite approach might be inferior to a more focused search for some particular coverage goals, and (c) whether generating whole test suites could be optimized by only targeting coverage goals not already covered. In this paper, we perform an in-depth analysis to study these questions. An empirical study on 100 Java classes using three different coverage criteria reveals that indeed there are some testing goals that are only covered by the traditional approach, although their number is only very small in comparison with those which are exclusively covered by the whole test suite approach. We find that keeping an archive of already covered goals along with the tests covering them and focusing the search on uncovered goals overcomes this small drawback on larger classes, leading to an improved overall effectiveness of whole test suite generation.
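
A minimal sketch of the archive idea, assuming a toy random test generator and coverage relation (`random_test` and `covered_goals` are hypothetical stand-ins): already-covered goals and the tests that cover them are stored in an archive, and the search only pursues goals that are still uncovered.

```python
# Minimal sketch of archive-based test generation under toy assumptions.
import random

def random_test():
    return [random.randint(0, 9) for _ in range(3)]           # a toy "test case"

def covered_goals(test):
    return {f"branch_{v}" for v in test}                       # toy coverage relation

all_goals = {f"branch_{i}" for i in range(10)}

archive = {}                                                   # goal -> covering test
for _ in range(200):                                           # search budget
    uncovered = all_goals - archive.keys()
    if not uncovered:
        break
    test = random_test()
    for goal in covered_goals(test) & uncovered:               # only still-uncovered goals count
        archive[goal] = test

suite = {id(t): t for t in archive.values()}.values()          # de-duplicated final suite
print(f"covered {len(archive)}/{len(all_goals)} goals with {len(suite)} tests")
```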

96 citations


Journal ArticleDOI
TL;DR: A replication study of 21 different Java-based open source projects from three different categories shows that all projects contain logging code, which is actively maintained, however, contrary to the original study, bug reports containing log messages take a longer time to resolve than bug reports without log messages.
Abstract: Log messages, which are generated at runtime by the logging statements that developers insert into the code, contain rich information about the runtime behavior of software systems. Log messages are used widely for system monitoring, problem diagnosis and legal compliance. Yuan et al. performed the first empirical study on the logging practices in open source software systems. They studied the development history of four C/C++ server-side projects and derived ten interesting findings. In this paper, we have performed a replication study in order to assess whether their findings would be applicable to Java projects from the Apache Software Foundation. We examined 21 different Java-based open source projects from three different categories: server-side, client-side and supporting-component. Similar to the original study, our results show that all projects contain logging code, which is actively maintained. However, contrary to the original study, bug reports containing log messages take a longer time to resolve than bug reports without log messages. A significantly higher portion of log updates are for enhancing the quality of logs (e.g., formatting & style changes and spelling/grammar fixes) rather than co-changes with feature implementations (e.g., updating variable names).

94 citations


Journal ArticleDOI
TL;DR: In this article, the authors present an experience-based guideline to aid researchers in designing systematic literature studies with special emphasis on the data collection and selection procedures, and provide a blueprint for a practical and pragmatic path through the plethora of currently available practices and deliverables.
Abstract: Systematic literature studies have received much attention in empirical software engineering in recent years. They have become a powerful tool to collect and structure reported knowledge in a systematic and reproducible way. We distinguish systematic literature reviews, which analyze reported evidence in depth, from systematic mapping studies, which structure a field of interest in a broader, usually quantified manner. Due to the rapidly increasing body of knowledge in software engineering, researchers who want to capture the published work in a domain often face an extensive amount of publications, which need to be screened, rated for relevance, classified, and eventually analyzed. Although there are several guidelines to conduct literature studies, they do not yet help researchers cope with the specific difficulties encountered in the practical application of these guidelines. In this article, we present an experience-based guideline to aid researchers in designing systematic literature studies with special emphasis on the data collection and selection procedures. Our guideline aims at providing a blueprint for a practical and pragmatic path through the plethora of currently available practices and deliverables, capturing the dependencies among the single steps. The guideline emerges from various mapping studies and literature reviews conducted by the authors and provides recommendations for the general study design, data collection, and study selection procedures. Finally, we share our experiences and lessons learned in applying the different practices of the proposed guideline.

92 citations


Journal ArticleDOI
TL;DR: This study investigates the characteristics of patches that do not attract reviewers, are not discussed, and receive slow initial feedback and suggests that the patches with these characteristics should be given more attention in order to increase review participation, which will likely lead to a more responsive review process.
Abstract: Software code review is a well-established software quality practice. Recently, Modern Code Review (MCR) has been widely adopted in both open source and proprietary projects. Our prior work shows that review participation plays an important role in MCR practices, since the amount of review participation shares a relationship with software quality. However, little is known about which factors influence review participation in the MCR process. Hence, in this study, we set out to investigate the characteristics of patches that: (1) do not attract reviewers, (2) are not discussed, and (3) receive slow initial feedback. Through a case study of 196,712 reviews spread across the Android, Qt, and OpenStack open source projects, we find that the amount of review participation in the past is a significant indicator of patches that will suffer from poor review participation. Moreover, we find that the description length of a patch shares a relationship with the likelihood of receiving poor reviewer participation or discussion, while the purpose of introducing new features can increase the likelihood of receiving slow initial feedback. Our findings suggest that the patches with these characteristics should be given more attention in order to increase review participation, which will likely lead to a more responsive review process.

Journal ArticleDOI
TL;DR: This study conducts a case survey based on secondary data about the major pivots that happened in 49 software startups, and demonstrates that the customer need pivot is the most common among all pivot types.
Abstract: In the context of software startups, project failure is embraced actively and considered crucial to obtain validated learning that can lead to pivots. A pivot is the strategic change of a business concept, product or the different elements of a business model. A better understanding is needed of different types of pivots and different factors that lead to failures and trigger pivots, for software entrepreneurial teams to make better decisions in a chaotic and unpredictable environment. Due to the nascent nature of the topic, the existing research and knowledge on the pivots of software startups are very limited. In this study, we aimed at identifying the major types of pivots that software startups make during their startup processes, and highlighting the factors that fail software projects and trigger pivots. To achieve this, we conducted a case survey study based on the secondary data of the major pivots that happened in 49 software startups. 10 pivot types and 14 triggering factors were identified. The findings show that the customer need pivot is the most common among all pivot types. Together with the customer segment pivot, they are common market related pivots. The major product related pivots are zoom-in and technology pivots. Several new pivot types were identified, including market zoom-in, complete and side project pivots. Our study also demonstrates that negative customer reaction and flawed business model are the most common factors that trigger pivots in software startups. Our study extends the research knowledge on software startup pivot types and pivot triggering factors. Meanwhile, it provides practical knowledge to software startups, which they can utilize to guide their effective decisions on pivoting.

Journal ArticleDOI
TL;DR: In this paper, a detailed and refined innovative trend application methodology for global temperature increment calculation is presented by considering two-half and multi-duration trend possibilities in the global monthly temperature records.
Abstract: Climate change evidence has been documented by different authors based on monthly temperature measurements spanning many decades. In the literature, annual-mean and 5-year moving average time series of global mean land–ocean temperature index and meteorological station data of global annual-mean surface air temperature changes are presented with a base period including some parts as uncertainty estimates. This paper provides an innovative method for refined calculation of global warming. The innovative multi-duration trend analysis application to the global monthly temperature data for identification of monthly temperature variability leads to temperature increase identifications in an innovative manner. The purpose is to present a detailed and refined innovative trend application methodology for global temperature increment calculation. After the general revision of non-parametric and parametric trend methodology explanations, the innovative trend template (ITT) analysis application is presented by considering two-half and multi-duration trend possibilities in the global monthly temperature records. The ITT methodology also presents various features of the global temperature increments during the whole record duration on a monthly basis, leading to a set of verbal interpretations and numerical values for each month including “Low” (minimum), “High” (maximum), and “Medium” (moderate) temperature amounts. It is shown that, on average, there are 0.9 °C and 1.78 °C monthly temperature increments for “Low” and “High” temperatures, respectively, in addition to an average incremental temperature of 1.33 °C. The innovative trend template (ITT) methodology is explained briefly and applied to global monthly temperature records from 1881 to 2013. This new methodology provides information about “Low” (minimum) and “High” (maximum) temperature records in addition to the “Medium” transitional temperatures. First, two-half and then multi-period innovative trend analysis implementations are explained graphically, verbally, and numerically. Finally, the ITT application indicated that the warming at global scale is about 0.75 °C, which was determined by another approach as 0.76 °C ± 0.19 °C (IPCC, 2007).

Journal ArticleDOI
TL;DR: This paper analyzes the development history of four open source projects, and uses ordinal regression models to automatically suggest the most appropriate level for each newly-added logging statement, finding that the characteristics of the containing block of a newly-added logging statement, the existing logging statements in the containing source code file, and the content of the newly-added logging statement play important roles in determining the appropriate log level.
Abstract: Logging statements are used to record valuable runtime information about applications. Each logging statement is assigned a log level such that users can disable some verbose log messages while allowing the printing of other important ones. However, prior research finds that developers often have difficulties when determining the appropriate level for their logging statements. In this paper, we propose an approach to help developers determine the appropriate log level when they add a new logging statement. We analyze the development history of four open source projects (Hadoop, Directory Server, Hama, and Qpid), and leverage ordinal regression models to automatically suggest the most appropriate level for each newly-added logging statement. First, we find that our ordinal regression model can accurately suggest the levels of logging statements with an AUC (area under the curve; the higher the better) of 0.75 to 0.81 and a Brier score (the lower the better) of 0.44 to 0.66, which is better than randomly guessing the appropriate log level (with an AUC of 0.50 and a Brier score of 0.80 to 0.83) or naively guessing the log level based on the proportional distribution of each log level (with an AUC of 0.50 and a Brier score of 0.65 to 0.76). Second, we find that the characteristics of the containing block of a newly-added logging statement, the existing logging statements in the containing source code file, and the content of the newly-added logging statement play important roles in determining the appropriate log level for that logging statement.
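
A minimal sketch of the modeling step with an ordinal regression (proportional-odds) model from statsmodels; the feature names and the synthetic data are hypothetical stand-ins for the measures mined from the projects' development histories.

```python
# Minimal sketch, assuming hypothetical features for newly-added logging statements.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 300
feats = ["in_catch_block", "file_log_density", "message_length"]
df = pd.DataFrame({
    "in_catch_block": rng.integers(0, 2, n),     # containing-block characteristic
    "file_log_density": rng.random(n),           # existing logging in the file
    "message_length": rng.integers(5, 80, n),    # content of the new statement
})
# Synthetic ordinal target: trace < debug < info < warn < error.
score = 2 * df.in_catch_block + df.file_log_density + df.message_length / 80 + rng.normal(0, 0.5, n)
df["level"] = pd.cut(score, 5, labels=["trace", "debug", "info", "warn", "error"])

model = OrderedModel(df["level"], df[feats], distr="logit")
res = model.fit(method="bfgs", disp=False)
print(res.model.predict(res.params, exog=df[feats].iloc[:5]))   # per-level probabilities
```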

Journal ArticleDOI
TL;DR: This paper proposes a code example search approach that applies a machine learning technique to automatically train a ranking schema and uses the trained ranking schema to rank candidate code examples for new queries at run-time.
Abstract: Source code examples are used by developers to implement unfamiliar tasks by learning from existing solutions. To better support developers in finding existing solutions, code search engines are designed to locate and rank code examples relevant to user's queries. Essentially, a code search engine provides a ranking schema, which combines a set of ranking features to calculate the relevance between a query and candidate code examples. Consequently, the ranking schema places relevant code examples at the top of the result list. However, it is difficult to determine the configurations of the ranking schemas subjectively. In this paper, we propose a code example search approach that applies a machine learning technique to automatically train a ranking schema. We use the trained ranking schema to rank candidate code examples for new queries at run-time. We evaluate the ranking performance of our approach using a corpus of over 360,000 code snippets crawled from 586 open-source Android projects. The performance evaluation study shows that the learning-to-rank approach can effectively rank code examples, and outperform the existing ranking schemas by about 35.65 % and 48.42 % in terms of normalized discounted cumulative gain (NDCG) and expected reciprocal rank (ERR) measures respectively.
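
For reference, NDCG (one of the two evaluation measures) can be computed as below; the relevance judgements are hypothetical, and in a learning-to-rank setup they would come from labelled query/snippet pairs.

```python
# Minimal sketch of NDCG for one ranked result list, under hypothetical relevance grades.
import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, relevances.size + 2))    # positions 1..k -> log2(2..k+1)
    return np.sum((2 ** relevances - 1) / discounts)

def ndcg(ranked_relevances):
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Relevance of the top-5 code examples returned for one query (3 = best match, 0 = irrelevant).
print(ndcg([3, 1, 0, 2, 0]))
```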

Journal ArticleDOI
TL;DR: This paper proposes an approach that can provide developers with log change suggestions as soon as they commit a code change, which it refers to as “just-in-time” suggestions for log changes, and derives a set of measures based on manually examining the reasons for log changes and the authors' experiences to model whether a code commit requires log changes.
Abstract: Software developers typically insert logging statements in their source code to record runtime information. However, providing proper logging statements remains a challenging task. Prior approaches automatically enhance logging statements, as a post-implementation process. Such automatic approaches do not take into account developers' domain knowledge; nevertheless, developers usually need to carefully design the logging statements since logs are a rich source about the field operation of a software system. The goals of this paper include: i) understanding the reasons for log changes; and ii) proposing an approach that can provide developers with log change suggestions as soon as they commit a code change, which we refer to as "just-in-time" suggestions for log changes. In particular, we derive a set of measures based on manually examining the reasons for log changes and our experiences. We use these measures as explanatory variables in random forest classifiers to model whether a code commit requires log changes. These classifiers can provide just-in-time suggestions for log changes. We perform a case study on four open source projects: Hadoop, Directory Server, Commons HttpClient, and Qpid. We find that: (i) The reasons for log changes can be grouped along four categories: block change, log improvement, dependence-driven change, and logging issue; (ii) our random forest classifiers can effectively suggest whether a log change is needed: the classifiers that are trained from within-project data achieve a balanced accuracy of 0.76 to 0.82, and the classifiers that are trained from cross-project data achieve a balanced accuracy of 0.76 to 0.80; (iii) the characteristics of code changes in a particular commit and the current snapshot of the source code are the most influential factors for determining the likelihood of a log change in a commit.
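
A minimal sketch of the classification step, assuming hypothetical commit-level measures: a random forest predicts whether a commit needs a log change and is scored with balanced accuracy.

```python
# Minimal sketch, with synthetic stand-ins for the commit-level measures used in the paper.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([
    rng.integers(0, 50, n),    # changed lines in the commit
    rng.integers(0, 5, n),     # changed catch/if blocks
    rng.integers(0, 30, n),    # existing logging statements in the touched files
])
y = (X[:, 1] + X[:, 2] / 10 + rng.normal(0, 1, n) > 2).astype(int)   # commit needs a log change?

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("balanced accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te)))
```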

Journal ArticleDOI
TL;DR: This work introduced a multi-objective robust model, based on NSGA-II, for the software refactoring problem that tries to find the best trade-off between three objectives to maximize: quality improvements, severity and importance of refactoring opportunities to be fixed.
Abstract: Refactoring large systems involves several sources of uncertainty related to the severity levels of code smells to be corrected and the importance of the classes in which the smells are located. Both severity and importance of identified refactoring opportunities (e.g. code smells) are difficult to estimate. In fact, due to the dynamic nature of software development, these values cannot be accurately determined in practice, leading to refactoring sequences that lack robustness. In addition, some code fragments can contain severe quality issues but they are not playing an important role in the system. To address this problem, we introduced a multi-objective robust model, based on NSGA-II, for the software refactoring problem that tries to find the best trade-off between three objectives to maximize: quality improvements, severity and importance of refactoring opportunities to be fixed. We evaluated our approach using 8 open source systems and one industrial project, and demonstrated that it is significantly better than state-of-the-art refactoring approaches in terms of robustness in all the experiments based on a variety of real-world scenarios. Our suggested refactoring solutions were found to be comparable in quality to those suggested by existing approaches, to better prioritize refactoring opportunities, and to carry an acceptable robustness price.
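
A minimal sketch of the Pareto-dominance relation that a multi-objective search such as NSGA-II builds on, over the paper's three maximised objectives; the candidate solutions are hypothetical, and a full NSGA-II would add variation operators and crowding-distance selection.

```python
# Minimal sketch of Pareto dominance for three maximised objectives.
def dominates(a, b):
    """True if solution `a` is at least as good as `b` on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

# (quality_improvement, severity, importance) of hypothetical refactoring sequences.
solutions = [(0.30, 0.70, 0.50), (0.25, 0.80, 0.55), (0.10, 0.60, 0.40)]
pareto_front = [s for s in solutions
                if not any(dominates(other, s) for other in solutions if other is not s)]
print(pareto_front)   # the last candidate is dominated and drops out
```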

Journal ArticleDOI
TL;DR: The study indicates that agile development methods can be successfully employed in organizations where the higher level planning processes are not agile, and how the requirements flow from strategy to release, and related benefits and problems are described.
Abstract: In a large organization, informal communication and simple backlogs are not sufficient for the management of requirements and development work. Many large organizations are struggling to successfully adopt agile methods, but there is still little scientific knowledge on requirements management in large-scale agile development organizations. We present an in-depth study of an Ericsson telecommunications node development organization which employs a large scale agile method to develop telecommunications system software. We describe how the requirements flow from strategy to release, and related benefits and problems. Data was collected by 43 interviews, which were analyzed qualitatively. The requirements management was done in three different processes, each of which had a different process model, purpose and planning horizon. The release project management process was plan-driven, feature development process was continuous and implementation management process was agile. The perceived benefits included reduced development lead time, increased flexibility, increased planning efficiency, increased developer motivation and improved communication effectiveness. The recognized problems included difficulties in balancing planning effort, overcommitment, insufficient understanding of the development team autonomy, defining the product owner role, balancing team specialization, organizing system-level work and growing technical debt. The study indicates that agile development methods can be successfully employed in organizations where the higher level planning processes are not agile. Combining agile methods with a flexible feature development process can bring many benefits, but large-scale software development seems to require specialist roles and significant coordination effort.

Journal ArticleDOI
TL;DR: It is shown that local models make only a minor difference in comparison to global models and transfer learning for cross-project defect prediction, and this provides valuable knowledge about the limitations of local models and increases the validity of previously gained research results.
Abstract: Although researchers invested significant effort, the performance of defect prediction in a cross-project setting, i.e., with data that does not come from the same project, is still unsatisfactory. A recent proposal for the improvement of defect prediction is using local models. With local models, the available data is first clustered into homogeneous regions and afterwards separate classifiers are trained for each homogeneous region. Since the main problem of cross-project defect prediction is data heterogeneity, the idea of local models is promising. Therefore, we perform a conceptual replication of the previous studies on local models with a focus on cross-project defect prediction. In a large case study, we evaluate the performance of local models and investigate their advantages and drawbacks for cross-project predictions. To this aim, we also compare the performance with a global model and a transfer learning technique designed for cross-project defect predictions. Our findings show that local models make only a minor difference in comparison to global models and transfer learning for cross-project defect prediction. While these results are negative, they provide valuable knowledge about the limitations of local models and increase the validity of previously gained research results.
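
A minimal sketch of the local-models idea under hypothetical data: cluster the training set into homogeneous regions, fit one defect classifier per region, and route each test instance to the model of its nearest cluster.

```python
# Minimal sketch of local models, with synthetic stand-ins for code metrics and defect labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X_train = rng.random((400, 5))                               # static code metrics
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)   # defect labels
X_test = rng.random((50, 5))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
local_models = {
    c: LogisticRegression(max_iter=1000).fit(X_train[kmeans.labels_ == c],
                                             y_train[kmeans.labels_ == c])
    for c in range(3)
}
pred = np.array([local_models[c].predict(x.reshape(1, -1))[0]
                 for x, c in zip(X_test, kmeans.predict(X_test))])
print(pred[:10])
```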

Journal ArticleDOI
TL;DR: Overall programming experience gained in academia does tend to have a positive influence on programmer performance, and experience in the use of productivity tools, such as testing frameworks and IDEs, also has positive effects.
Abstract: There is a widespread belief in both SE and other branches of science that experience helps professionals to improve their performance. However, cases have been reported where experience not only does not have a positive influence but sometimes even degrades the performance of professionals. Our objective was to determine whether years of experience influence programmer performance. We have analysed 10 quasi-experiments executed both in academia with graduate and postgraduate students and in industry with professionals. The experimental task was to apply ITLD on two experimental problems and then measure external code quality and programmer productivity. Programming experience gained in industry does not appear to have any effect whatsoever on quality and productivity. Overall programming experience gained in academia does tend to have a positive influence on programmer performance. These two findings may be related to the fact that, as opposed to deliberate practice, routine practice does not appear to lead to improved performance. Experience in the use of productivity tools, such as testing frameworks and IDEs, also has positive effects. Years of experience are a poor predictor of programmer performance. Academic background and specialized knowledge of task-related aspects appear to be rather good predictors.

Journal ArticleDOI
TL;DR: It is argued that the update strategy that is chosen by a game developer affects the number of urgent updates that are released, and games that use a frequent update strategy tend to have a higher proportion of 0-day updates than games that use a traditional update strategy.
Abstract: The steadily increasing popularity of computer games has led to the rise of a multi-billion dollar industry. This increasing popularity is partly enabled by online digital distribution platforms for games, such as Steam. These platforms offer an insight into the development and test processes of game developers. In particular, we can extract the update cycle of a game and study what makes developers deviate from that cycle by releasing so-called urgent updates. An urgent update is a software update that fixes problems that are deemed critical enough to not be left unfixed until a regular-cycle update. Urgent updates are made in a state of emergency and outside the regular development and test timelines which causes unnecessary stress on the development team. Hence, avoiding the need for an urgent update is important for game developers. We define urgent updates as 0-day updates (updates that are released on the same day), updates that are released faster than the regular cycle, or self-admitted hotfixes. We conduct an empirical study of the urgent updates of the 50 most popular games from Steam, the dominant digital game delivery platform. As urgent updates are reflections of mistakes in the development and test processes, a better understanding of urgent updates can in turn stimulate the improvement of these processes, and eventually save resources for game developers. In this paper, we argue that the update strategy that is chosen by a game developer affects the number of urgent updates that are released. Although the choice of update strategy does not appear to have an impact on the percentage of updates that are released faster than the regular cycle or self-admitted hotfixes, games that use a frequent update strategy tend to have a higher proportion of 0-day updates than games that use a traditional update strategy.

Journal ArticleDOI
TL;DR: This study evaluates a number of techniques for handling imbalanced data sets using various data sampling methods and MetaCost learners on six open-source data sets and advocates the use of resample with replacement sampling method for effective imbalanced learning.
Abstract: Software change prediction is crucial in order to efficiently plan resource allocation during the testing and maintenance phases of a software system. Moreover, correct identification of change-prone classes in the early phases of the software development life cycle helps in developing cost-effective, good quality and maintainable software. An effective software change prediction model should recognize change-prone and not change-prone classes with equally high accuracy. However, this is not the case as software practitioners often have to deal with imbalanced data sets where instances of one type of class are much more numerous than those of the other type. In such a scenario, the minority classes are not predicted with much accuracy, leading to strategic losses. This study evaluates a number of techniques for handling imbalanced data sets using various data sampling methods and MetaCost learners on six open-source data sets. The results of the study advocate the use of the resample with replacement sampling method for effective imbalanced learning.
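
A minimal sketch of the recommended treatment, resampling the minority (change-prone) class with replacement until the training data are balanced; the data below are hypothetical.

```python
# Minimal sketch of upsampling the minority class with replacement, on synthetic data.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(3)
X = rng.random((200, 4))
y = (rng.random(200) < 0.15).astype(int)          # ~15% change-prone classes

X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=len(y_maj), random_state=0)

X_bal = np.vstack([X_maj, X_up])
y_bal = np.concatenate([y_maj, y_up])
print("class counts after resampling:", np.bincount(y_bal))
```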

Journal ArticleDOI
TL;DR: In this paper, the authors compared classic COCOMO against modern estimation methods on COCOMO-style data sets and concluded that how data is collected is more important than what learner is applied to that data.
Abstract: More than half the literature on software effort estimation (SEE) focuses on comparisons of new estimation methods. Surprisingly, there are no studies comparing state of the art latest methods with decades-old approaches. Accordingly, this paper takes five steps to check if new SEE methods generated better estimates than older methods. Firstly, collect effort estimation methods ranging from “classical” COCOMO (parametric estimation over a pre-determined set of attributes) to “modern” (reasoning via analogy using spectral-based clustering plus instance and feature selection, and a recent “baseline method” proposed in ACM Transactions on Software Engineering). Secondly, catalog the list of objections that lead to the development of post-COCOMO estimation methods. Thirdly, characterize each of those objections as a comparison between newer and older estimation methods. Fourthly, using four COCOMO-style data sets (from 1991, 2000, 2005, 2010), run those comparison experiments. Fifthly, compare the performance of the different estimators using a Scott-Knott procedure using (i) the A12 effect size to rule out “small” differences and (ii) a 99% confidence bootstrap procedure to check for statistically different groupings of treatments. The major negative result of this paper is that for the COCOMO data sets, nothing we studied did any better than Boehm's original procedure. Hence, we conclude that when COCOMO-style attributes are available, we strongly recommend (i) using that data and (ii) using COCOMO to generate predictions. We say this since the experiments of this paper show that, at least for effort estimation, how data is collected is more important than what learner is applied to that data.
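
For reference, the Vargha-Delaney A12 effect size used to rule out “small” differences can be computed as below; the two error samples are hypothetical.

```python
# Minimal sketch of the A12 effect size on two hypothetical estimation-error samples.
def a12(x, y):
    """Probability that a value drawn from x is larger than one drawn from y (ties count half)."""
    greater = sum(xi > yi for xi in x for yi in y)
    equal = sum(xi == yi for xi in x for yi in y)
    return (greater + 0.5 * equal) / (len(x) * len(y))

coco_errors = [0.21, 0.35, 0.18, 0.40, 0.27]      # e.g. magnitude of relative error per project
new_errors = [0.30, 0.45, 0.33, 0.50, 0.38]

print("A12:", a12(new_errors, coco_errors))       # values near 0.5 indicate a negligible difference
```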

Journal ArticleDOI
TL;DR: The results of this paper are interesting for practitioners since they highlight relevant observations from the survey participants’ experiences when deciding to implement requirements reuse practices and suggest future lines of research based on the needs pointed out.
Abstract: Requirements engineering is a discipline with numerous challenges to overcome. One of these challenges is the implementation of requirements reuse approaches. Although several theoretical proposals exist, little is known about the practices that are currently adopted in industry. Our goal is to contribute to the investigation of the state of the practice in the reuse of requirements, eliciting current practices from practitioners, and their opinions whenever appropriate. Besides reuse in general, we focus on requirement patterns as a particular strategy to reuse. We conducted an exploratory survey based on an online questionnaire. We received 71 responses from requirements engineers with industrial experience in the field, which were analyzed in order to derive observations. Although we found that a high majority of respondents declared some level of reuse in their projects (in particular, non-functional requirements were identified as the most similar and recurrent among projects), it is true that only a minority of them declared such reuse as a regular practice. Larger IT organizations and IT organizations with well-established software processes and methods present higher levels of reuse. Ignorance of reuse techniques and processes is the main reason preventing wider adoption. From the different existing reuse techniques, the simplest ones based on textual copy and subsequent tailoring of former requirements are the most adopted techniques. However, participants who apply reuse more often tend to use more elaborate techniques. Opinions of respondents about the use of requirement patterns show that they can be expected to mitigate problems related to the quality of the resulting requirements, such as lack of uniformity, inconsistency, or ambiguity. The main reasons behind the lack of adoption of requirement patterns by practitioners (in spite of the increasing research approaches proposed in the community) are related to the lack of a well-defined reuse method and involvement of requirement engineers. The results of our paper are interesting for practitioners since we highlight relevant observations from the survey participants' experiences when deciding to implement requirements reuse practices. We also suggest future lines of research based on the needs pointed out in the results.

Journal ArticleDOI
TL;DR: Eye tracking is used to measure the time and effort spent reading and understanding regular code, and shows that syntactic code complexity metrics (such as LOC and MCC) need to be made context-sensitive, e.g. by giving reduced weight to repeated segments according to their place in the sequence.
Abstract: Regular code, which includes repetitions of the same basic pattern, has been shown to have an effect on code comprehension: a regular function can be just as easy to comprehend as a non-regular one with the same functionality, despite being significantly longer and including more control constructs. It has been speculated that this effect is due to leveraging the understanding of the first instances to ease the understanding of repeated instances of the pattern. To verify and quantify this effect, we use eye tracking to measure the time and effort spent reading and understanding regular code. The experimental subjects were 18 students and 2 faculty members. The results are that time and effort invested in the initial code segments are indeed much larger than those spent on the later ones, and the decay in effort can be modeled by an exponential model. This shows that syntactic code complexity metrics (such as LOC and MCC) need to be made context-sensitive, e.g. by giving reduced weight to repeated segments according to their place in the sequence. However, it is not the case that repeated code segments are actually read more and more quickly. Rather, initial code segments receive more focus and are looked at more times, while later ones may be only skimmed. Further, a few recurring reading patterns have been identified, which together indicate that in general code reading is far from being purely linear, and exhibits significant variability across experimental subjects.

Journal ArticleDOI
TL;DR: The main discovery is that, with the appropriate (non-parametric) transformations, the validity of a metric can be accurately predicted from its correlation with size, and suggests code size is the only “unique” valid metric.
Abstract: Empirical validation of code metrics has a long history of success. Many metrics have been shown to be good predictors of external features, such as correlation to bugs. Our study provides an alternative explanation to such validation, attributing it to the confounding effect of size. In contradiction to received wisdom, we argue that the validity of a metric can be explained by its correlation to the size of the code artifact. In fact, this work came about in view of our failure in the quest of finding a metric that is both valid and free of this confounding effect. Our main discovery is that, with the appropriate (non-parametric) transformations, the validity of a metric can be accurately (with R-squared values being at times as high as 0.97) predicted from its correlation with size. The reported results are with respect to a suite of 26 metrics, that includes the famous Chidamber and Kemerer metrics. Concretely, it is shown that the more a metric is correlated with size, the more able it is to predict external features values, and vice-versa. We consider two methods for controlling for size, by linear transformations. As it turns out, metrics controlled for size, tend to eliminate their predictive capabilities. We also show that the famous Chidamber and Kemerer metrics are no better than other metrics in our suite. Overall, our results suggest code size is the only “unique” valid metric.

Journal ArticleDOI
TL;DR: This study investigates if cross-project defect prediction is affected by applying different transformations (i.e., log and rank transformations, as well as the Box-Cox transformation), and proposes an approach, namely Multiple Transformations (MT), to utilize multiple transformations for cross- project defect prediction.
Abstract: Software metrics rarely follow a normal distribution. Therefore, software metrics are usually transformed prior to building a defect prediction model. To the best of our knowledge, the impact that the transformation has on cross-project defect prediction models has not been thoroughly explored. A cross-project model is built from one project and applied on another project. In this study, we investigate if cross-project defect prediction is affected by applying different transformations (i.e., log and rank transformations, as well as the Box-Cox transformation). The Box-Cox transformation subsumes log and other power transformations (e.g., square root), but has not been studied in the defect prediction literature. We propose an approach, namely Multiple Transformations (MT), to utilize multiple transformations for cross-project defect prediction. We further propose an enhanced approach MT+ to use the parameter of the Box-Cox transformation to determine the most appropriate training project for each target project. Our experiments are conducted upon three publicly available data sets (i.e., AEEEM, ReLink, and PROMISE). Comparing to the random forest model built solely using the log transformation, our MT+ approach improves the F-measure by 7, 59 and 43% for the three data sets, respectively. As a summary, our major contributions are three-fold: 1) conduct an empirical study on the impact that data transformation has on cross-project defect prediction models; 2) propose an approach to utilize the various information retained by applying different transformation methods; and 3) propose an unsupervised approach to select the most appropriate training project for each target project.
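
A minimal sketch of the three transformations on one hypothetical, skewed metric; Box-Cox needs strictly positive values, hence the +1 shift, and the fitted lambda is the Box-Cox parameter that MT+ uses to match training and target projects.

```python
# Minimal sketch of log, rank and Box-Cox transformations on a synthetic skewed metric.
import numpy as np
from scipy import stats

metric = np.array([0, 1, 1, 2, 3, 5, 8, 13, 40, 120], dtype=float)   # e.g. lines changed

log_t = np.log1p(metric)                          # log transform: log(x + 1)
rank_t = stats.rankdata(metric)                   # rank transform
boxcox_t, lam = stats.boxcox(metric + 1)          # Box-Cox picks lambda by maximum likelihood

print("Box-Cox lambda:", lam)
print("log:    ", np.round(log_t, 2))
print("rank:   ", rank_t)
print("Box-Cox:", np.round(boxcox_t, 2))
```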

Journal ArticleDOI
TL;DR: This paper introduces an architecture recovery framework, ARCADE, for conducting large-scale replicable empirical studies of architectural change across different versions of a software system, and utilizes ARCADE to conduct an empirical study of changes found in software architectures spanning several hundred versions of 23 open-source systems.
Abstract: From its very inception, the study of software architecture has recognized architectural decay as a regularly occurring phenomenon in long-lived systems. Architectural decay is caused by repeated, sometimes careless changes to a system during its lifespan. Despite decay's prevalence, there is a relative dearth of empirical data regarding the nature of architectural changes that may lead to decay, and of developers' understanding of those changes. In this paper, we take a step toward addressing that scarcity by introducing an architecture recovery framework, ARCADE, for conducting large-scale replicable empirical studies of architectural change across different versions of a software system. ARCADE includes two novel architectural change metrics, which are the key to enabling large-scale empirical studies of architectural change. We utilize ARCADE to conduct an empirical study of changes found in software architectures spanning several hundred versions of 23 open-source systems. Our study reveals several new findings regarding the frequency of architectural changes in software systems, the common points of departure in a system's architecture during the system's maintenance and evolution, the difference between system-level and component-level architectural change, and the suitability of a system's implementation-level structure as a proxy for its architecture.

Journal ArticleDOI
Ehsan Noei, Mark D. Syer, Ying Zou, Ahmed E. Hassan, Iman Keivanloo
TL;DR: The relation of both device attributes and app attributes with the user-perceived quality of Android apps from the Google Play Store is studied, and it is found that the code size has the strongest relationship with the user-perceived quality.
Abstract: The number of mobile applications (apps) and mobile devices has increased considerably over the past few years. Online app markets, such as the Google Play Store, use a star-rating mechanism to quantify the user-perceived quality of mobile apps. Users may rate apps on a five point (star) scale where a five star-rating is the highest rating. Having considered the importance of a high star-rating to the success of an app, recent studies continue to explore the relationship between the app attributes, such as User Interface (UI) complexity, and the user-perceived quality. However, the user-perceived quality reflects the users’ experience using an app on a particular mobile device. Hence, the user-perceived quality of an app is not solely determined by app attributes. In this paper, we study the relation of both device attributes and app attributes with the user-perceived quality of Android apps from the Google Play Store. We study 20 device attributes, such as the CPU and the display size, and 13 app attributes, such as code size and UI complexity. Our study is based on data from 30 types of Android mobile devices and 280 Android apps. We use linear mixed effect models to identify the device attributes and app attributes with the strongest relationship with the user-perceived quality. We find that the code size has the strongest relationship with the user-perceived quality. However, some device attributes, such as the CPU, have stronger relationships with the user-perceived quality than some app attributes, such as the number of UI inputs and outputs of an app. Our work helps both device manufacturers and app developers. Manufacturers can focus on the attributes that have significant relationships with the user-perceived quality. Moreover, app developers should be careful about the devices for which they make their apps available because the device attributes have a strong relationship with the ratings that users give to apps.
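
A minimal sketch of a linear mixed-effects model of this kind, with a random intercept per app; the column names and synthetic data are hypothetical stand-ins for the Google Play and device data.

```python
# Minimal sketch of a mixed-effects model relating app/device attributes to star ratings.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 600
df = pd.DataFrame({
    "app": rng.integers(0, 40, n),                      # 40 hypothetical apps
    "code_size_kb": rng.integers(100, 5000, n),
    "cpu_ghz": rng.choice([1.2, 1.5, 2.0, 2.5], n),
    "display_inches": rng.choice([4.0, 4.7, 5.5], n),
})
df["rating"] = (3.5 - df.code_size_kb / 5000 + 0.2 * df.cpu_ghz
                + rng.normal(0, 0.3, n)).clip(1, 5)     # synthetic star ratings on a 1-5 scale

model = smf.mixedlm("rating ~ code_size_kb + cpu_ghz + display_inches",
                    data=df, groups=df["app"])          # random intercept per app
print(model.fit().summary())
```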

Journal ArticleDOI
TL;DR: An in-depth understanding of the knowledge diffusion process in Stack Overflow is obtained, and the implications of URL sharing behavior for Q&A site design, developers who use crowdsourced knowledge in Stack Overflow, and future research on knowledge representation and search are exposed.
Abstract: Programming-specific Q&A sites (e.g., Stack Overflow) are being used extensively by software developers for knowledge sharing and acquisition. Due to the cross-reference of questions and answers (note that users also reference URLs external to the Q&A site. In this paper, URL sharing refers to internal URLs within the Q&A site, unless otherwise stated), knowledge is diffused in the Q&A site, forming a large knowledge network. In Stack Overflow, why do developers share URLs? How is the community feedback to the knowledge being shared? What are the unique topological and semantic properties of the resulting knowledge network in Stack Overflow? Has this knowledge network become stable? If so, how does it reach to stability? Answering these questions can help the software engineering community better understand the knowledge diffusion process in programming-specific Q&A sites like Stack Overflow, thereby enabling more effective knowledge sharing, knowledge use, and knowledge representation and search in the community. Previous work has focused on analyzing user activities in Q&A sites or mining the textual content of these sites. In this article, we present a methodology to analyze URL sharing activities in Stack Overflow. We use open coding method to analyze why users share URLs in Stack Overflow, and develop a set of quantitative analysis methods to study the structural and dynamic properties of the emergent knowledge network in Stack Overflow. We also identify system designs, community norms, and social behavior theories that help explain our empirical findings. Through this study, we obtain an in-depth understanding of the knowledge diffusion process in Stack Overflow and expose the implications of URL sharing behavior for Q&A site design, developers who use crowdsourced knowledge in Stack Overflow, and future research on knowledge representation and search.