
Showing papers in "IEEE Transactions on Software Engineering in 2019"


Journal ArticleDOI
TL;DR: In this article, the authors study the impact of parameter optimization on defect prediction models and find that automated parameter optimization can substantially shift the importance ranking of variables, with as few as 28 percent of the top-ranked variables in optimized classifiers also being top-ranked in non-optimized classifiers.
Abstract: Defect prediction models—classifiers that identify defect-prone software modules—have configurable parameters that control their characteristics (e.g., the number of trees in a random forest). Recent studies show that these classifiers underperform when default settings are used. In this paper, we study the impact of automated parameter optimization on defect prediction models. Through a case study of 18 datasets, we find that automated parameter optimization: (1) improves AUC performance by up to 40 percentage points; (2) yields classifiers that are at least as stable as those trained using default settings; (3) substantially shifts the importance ranking of variables, with as few as 28 percent of the top-ranked variables in optimized classifiers also being top-ranked in non-optimized classifiers; (4) yields optimized settings for 17 of the 20 most sensitive parameters that transfer among datasets without a statistically significant drop in performance; and (5) adds less than 30 minutes of additional computation to 12 of the 26 studied classification techniques. While widely-used classification techniques like random forest and support vector machines are not optimization-sensitive, traditionally overlooked techniques like C5.0 and neural networks can actually outperform widely-used techniques after optimization is applied. This highlights the importance of exploring the parameter space when using parameter-sensitive classification techniques.

290 citations
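The automated tuning the study evaluates can be illustrated with a minimal grid search over a single parameter. This is a hedged sketch: the `auc_for` surface below is synthetic (real scores would come from cross-validating a defect predictor on a dataset), and the parameter name `n_trees` and its default are assumptions for illustration only.

```python
# Toy stand-in for "train a classifier, measure AUC" as a function of one
# hypothetical parameter (number of trees). The quadratic surface is
# illustrative only; real values come from cross-validation on project data.
def auc_for(n_trees):
    return 0.60 + 0.30 * (1 - ((n_trees - 300) / 300) ** 2)

DEFAULT = 100  # a typical library default setting (assumed for the sketch)

# Automated parameter optimization in miniature: grid-search the space and
# keep the best-scoring setting instead of trusting the default.
candidates = range(50, 501, 50)
best = max(candidates, key=auc_for)

default_auc, best_auc = auc_for(DEFAULT), auc_for(best)
print(f"default n_trees={DEFAULT}: AUC={default_auc:.3f}")
print(f"tuned   n_trees={best}: AUC={best_auc:.3f}")
```

The same loop generalizes to the multi-parameter search the paper automates; only the scoring function changes.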


Journal ArticleDOI
TL;DR: This work synthesises the literature to understand the state of the art in CPDP with respect to metrics, models, data approaches, datasets, and associated performance, and concludes that CPDP is still a challenge requiring more research before trustworthy applications can take place.
Abstract: Background: Cross project defect prediction (CPDP) recently gained considerable attention, yet there are no systematic efforts to analyse existing empirical evidence. Objective: To synthesise literature to understand the state-of-the-art in CPDP with respect to metrics, models, data approaches, datasets and associated performances. Further, we aim to assess the performance of CPDP versus within project DP models. Method: We conducted a systematic literature review. Results from primary studies are synthesised (thematic, meta-analysis) to answer research questions. Results: We identified 30 primary studies passing quality assessment. Performance measures, except precision, vary with the choice of metrics. Recall, precision, f-measure, and AUC are the most common measures. Models based on Nearest-Neighbour and Decision Tree tend to perform well in CPDP, whereas the popular naive Bayes yields average performance. Performance of ensembles varies greatly across f-measure and AUC. Data approaches address CPDP challenges using row/column processing, which improve CPDP in terms of recall at the cost of precision. This is observed in multiple occasions including the meta-analysis of CPDP versus WPDP. NASA and Jureczko datasets seem to favour CPDP over WPDP more frequently. Conclusion: CPDP is still a challenge and requires more research before trustworthy applications can take place. We provide guidelines for further research.

246 citations


Journal ArticleDOI
TL;DR: Compared with the symbolic executor Klee in terms of vulnerability detection, AFLFast is significantly more effective on the same subject programs that were discussed in the original Klee paper.
Abstract: Coverage-based Greybox Fuzzing (CGF) is a random testing approach that requires no program analysis. A new test is generated by slightly mutating a seed input. If the test exercises a new and interesting path, it is added to the set of seeds; otherwise, it is discarded. We observe that most tests exercise the same few “high-frequency” paths and develop strategies to explore significantly more paths with the same number of tests by gravitating towards low-frequency paths. We explain the challenges and opportunities of CGF using a Markov chain model which specifies the probability that fuzzing the seed that exercises path $i$ generates an input that exercises path $j$. Each state (i.e., seed) has an energy that specifies the number of inputs to be generated from that seed. We show that CGF is considerably more efficient if energy is inversely proportional to the density of the stationary distribution and increases monotonically every time that seed is chosen. Energy is controlled with a power schedule. We implemented several schedules by extending AFL. In 24 hours, AFLFast exposes 3 previously unreported CVEs that are not exposed by AFL and exposes 6 previously unreported CVEs 7x faster than AFL. AFLFast produces at least an order of magnitude more unique crashes than AFL. We compared AFLFast to the symbolic executor Klee. In terms of vulnerability detection, AFLFast is significantly more effective than Klee on the same subject programs that were discussed in the original Klee paper. In terms of code coverage, AFLFast only slightly outperforms Klee while a combination of both tools achieves the best results by mitigating the individual weaknesses.

239 citations
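The power-schedule idea (more energy for seeds on low-frequency paths, monotonically increasing each time a seed is chosen) can be sketched in a few lines. The constants `base` and `cap` and the exact formula are illustrative assumptions, simplified from the schedules the paper implements, not AFL's actual values.

```python
# AFLFast-style power schedule in miniature: a seed exercising a rare
# (low-frequency) path receives more energy, and energy grows monotonically
# each time the seed is chosen. `base` and `cap` are illustrative constants.
def energy(path_freq, times_chosen, base=8, cap=1024):
    e = base * (2 ** times_chosen) / path_freq
    return min(int(e), cap)  # cap prevents spending all time on one seed

rare = energy(path_freq=2, times_chosen=3)       # seed on a low-frequency path
common = energy(path_freq=1000, times_chosen=3)  # seed on a high-frequency path
print(rare, common)  # the rare path receives far more fuzzing energy
```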


Journal ArticleDOI
TL;DR: Imbalanced learning should only be considered for moderate or highly imbalanced SDP data sets and the appropriate combination of imbalanced method and classifier needs to be carefully chosen to ameliorate the imbalanced learning problem for SDP.
Abstract: Context: Software defect prediction (SDP) is an important challenge in the field of software engineering, hence much research work has been conducted, most notably through the use of machine learning algorithms. However, class-imbalance typified by few defective components and many non-defective ones is a common occurrence causing difficulties for these methods. Imbalanced learning aims to deal with this problem and has recently been deployed by some researchers, unfortunately with inconsistent results. Objective: We conduct a comprehensive experiment to explore (a) the basic characteristics of this problem; (b) the effect of imbalanced learning and its interactions with (i) data imbalance, (ii) type of classifier, (iii) input metrics and (iv) imbalanced learning method. Method: We systematically evaluate 27 data sets, 7 classifiers, 7 types of input metrics and 17 imbalanced learning methods (including doing nothing) using an experimental design that enables exploration of interactions between these factors and individual imbalanced learning algorithms. This yields 27 × 7 × 7 × 17 = 22491 results. The Matthews correlation coefficient (MCC) is used as an unbiased performance measure (unlike the more widely used F1 and AUC measures). Results: (a) we found a large majority (87 percent) of 106 public domain data sets exhibit a moderate or low level of imbalance (imbalance ratio < 10; median = 3.94); (b) anything other than low levels of imbalance clearly harms the performance of traditional learning for SDP; (c) imbalanced learning is more effective on data sets with moderate or higher imbalance, however negative results are always possible; (d) type of classifier has the most impact on the improvement in classification performance, followed by the imbalanced learning method itself; type of input metrics is not influential; (e) only ~52 percent of the combinations of imbalanced learner and classifier have a significant positive effect.
Conclusion: This paper offers two practical guidelines. First, imbalanced learning should only be considered for moderate or highly imbalanced SDP data sets. Second, the appropriate combination of imbalanced method and classifier needs to be carefully chosen to ameliorate the imbalanced learning problem for SDP. In contrast, the indiscriminate application of imbalanced learning can be harmful.

188 citations
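The MCC measure and imbalance ratio the study relies on are straightforward to compute. The confusion-matrix counts below are invented to show why MCC, unlike accuracy, is not flattered by imbalance: a classifier that predicts "non-defective" for everything scores zero.

```python
import math

# Matthews correlation coefficient from confusion-matrix counts -- the
# unbiased measure the study uses instead of F1/AUC.
def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Imbalance ratio: majority (non-defective) count over minority (defective).
def imbalance_ratio(n_nondefective, n_defective):
    return n_nondefective / n_defective

# A degenerate "always non-defective" classifier on 90/10 data: 90 percent
# accurate, yet MCC is 0 because it never finds a defect.
print(mcc(tp=0, tn=90, fp=0, fn=10))      # 0.0
print(imbalance_ratio(90, 10))            # 9.0, a moderate imbalance
print(mcc(tp=45, tn=45, fp=5, fn=5))      # a genuinely good classifier
```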


Journal ArticleDOI
TL;DR: In this article, the authors propose a prediction model for estimating story points based on a novel combination of two powerful deep learning architectures: long short-term memory and recurrent highway network.
Abstract: Although there has been substantial research in software analytics for effort estimation in traditional software projects, little work has been done for estimation in agile projects, especially estimating the effort required for completing user stories or issues. Story points are the most common unit of measure used for estimating the effort involved in completing a user story or resolving an issue. In this paper, we propose a prediction model for estimating story points based on a novel combination of two powerful deep learning architectures: long short-term memory and recurrent highway network. Our prediction system is end-to-end trainable from raw input data to prediction outcomes without any manual feature engineering. We offer a comprehensive dataset for story points-based estimation that contains 23,313 issues from 16 open source projects. An empirical evaluation demonstrates that our approach consistently outperforms three common baselines (Random Guessing, Mean, and Median methods) and six alternatives (e.g., using Doc2Vec and Random Forests) in Mean Absolute Error, Median Absolute Error, and the Standardized Accuracy.

141 citations
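The Mean and Median baselines the deep model is compared against are trivial to reproduce, scored here with Mean Absolute Error (MAE), one of the paper's measures. The story-point values below are invented for illustration.

```python
from statistics import mean, median

# Two of the paper's baselines: predict the training mean or the training
# median story points for every new issue.
train = [1, 2, 3, 3, 5, 8, 8, 13]  # story points of past issues (invented)
test = [2, 3, 5, 8]                # actual story points of new issues

def mae(predictions, actuals):
    # Mean Absolute Error over a set of issues
    return mean(abs(p - a) for p, a in zip(predictions, actuals))

mean_pred = [mean(train)] * len(test)      # constant mean prediction
median_pred = [median(train)] * len(test)  # constant median prediction
print(mae(mean_pred, test), mae(median_pred, test))
```

A learned estimator has to beat these constants to demonstrate it extracted any signal from the issue text.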


Journal ArticleDOI
TL;DR: This paper systematically summarizes the aerodynamic research relating to China’s high-speed railway network and proposes eight future development directions for the field of railway aerodynamics.
Abstract: High-speed railway aerodynamics is the key basic science for solving the bottleneck problem of high-speed railway development. This paper systematically summarizes the aerodynamic research relating to China’s high-speed railway network. Seven key research advances are comprehensively discussed, including train aerodynamic drag-reduction technology, train aerodynamic noise-reduction technology, train ventilation technology, train crossing aerodynamics, train/tunnel aerodynamics, train/climate environment aerodynamics, and train/human body aerodynamics. Seven types of railway aerodynamic test platform built by Central South University are introduced. Five major systems for a high-speed railway network—the aerodynamics theoretical system, the aerodynamic shape (train, tunnel, and so on) design system, the aerodynamics evaluation system, the 3D protection system for operational safety of the high-speed railway network, and the high-speed railway aerodynamic test/computation/analysis platform system—are also introduced. Finally, eight future development directions for the field of railway aerodynamics are proposed. For over 30 years, railway aerodynamics has been an important supporting element in the development of China’s high-speed railway network, which has also promoted the development of high-speed railway aerodynamics throughout the world.

126 citations


Journal ArticleDOI
TL;DR: PMT constructs a classification model based on a series of features related to mutants and tests, uses the model to predict whether a mutant would be killed or remain alive without executing it, and has high predictability when predicting the execution results of the majority of mutants.
Abstract: Test suites play a key role in ensuring software quality. A good test suite may detect more faults than a poor-quality one. Mutation testing is a powerful methodology for evaluating the fault-detection ability of test suites. In mutation testing, a large number of mutants may be generated and need to be executed against the test suite under evaluation to check how many mutants the test suite is able to detect, as well as the kind of mutants that the current test suite fails to detect. Consequently, although highly effective, mutation testing is widely recognized to be also computationally expensive, inhibiting wider uptake. To alleviate this efficiency concern, we propose Predictive Mutation Testing (PMT): the first approach to predicting mutation testing results without executing mutants. In particular, PMT constructs a classification model, based on a series of features related to mutants and tests, and uses the model to predict whether a mutant would be killed or remain alive without executing it. PMT has been evaluated on 163 real-world projects under two application scenarios (cross-version and cross-project). The experimental results demonstrate that PMT improves the efficiency of mutation testing by up to 151.4X while incurring only a small accuracy loss. It achieves above 0.80 AUC values for the majority of projects, indicating a good tradeoff between the efficiency and effectiveness of predictive mutation testing. Also, PMT is shown to perform well on different tools and tests, be robust in the presence of imbalanced data, and have high predictability (over 60 percent confidence) when predicting the execution results of the majority of mutants.

115 citations
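PMT's core idea, classifying mutants from features without executing them, can be sketched as below. The two-rule "model" stands in for the trained classifier, and the feature names are illustrative assumptions, not PMT's actual feature set.

```python
# Hedged sketch of Predictive Mutation Testing: represent each mutant by
# cheap features and predict killed vs. survived without running it. A real
# PMT model is learned from executed mutants; this hand-written rule merely
# stands in for that classifier.
def predict_killed(num_covering_tests, num_assertions_nearby):
    if num_covering_tests == 0:
        return False  # an uncovered mutant can never be killed
    # covered mutants are likelier to be killed when assertions check nearby state
    return num_assertions_nearby > 0

mutants = [
    {"num_covering_tests": 0, "num_assertions_nearby": 5},  # uncovered
    {"num_covering_tests": 3, "num_assertions_nearby": 2},  # covered + checked
    {"num_covering_tests": 7, "num_assertions_nearby": 0},  # covered, unchecked
]
predictions = [predict_killed(**m) for m in mutants]
print(predictions)  # [False, True, False]
```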


Journal ArticleDOI
TL;DR: Despite their growing complexity and increasing size, modern software applications must satisfy strict release requirements that impose short bug fixing and maintenance cycles, putting significant time and money into bug fixing as discussed by the authors.
Abstract: Despite their growing complexity and increasing size, modern software applications must satisfy strict release requirements that impose short bug fixing and maintenance cycles, putting significant ...

114 citations


Journal ArticleDOI
TL;DR: This paper captures previous findings on bug-proneness to build a specialized bug prediction model for smelly classes and demonstrates how such a model classifies bug-prone code components with an F-Measure at least 13 percent higher than the existing state-of-the-art models.
Abstract: Code smells are symptoms of poor design and implementation choices. Previous studies empirically assessed the impact of smells on code quality and clearly indicate their negative impact on maintainability, including a higher bug-proneness of components affected by code smells. In this paper, we capture previous findings on bug-proneness to build a specialized bug prediction model for smelly classes. Specifically, we evaluate the contribution of a measure of the severity of code smells (i.e., code smell intensity) by adding it to existing bug prediction models based on both product and process metrics, and comparing the results of the new model against the baseline models. Results indicate that the accuracy of a bug prediction model increases by adding the code smell intensity as predictor. We also compare the results achieved by the proposed model with those of an alternative technique which considers metrics about the history of code smells in files, finding that our model works generally better. However, we observed interesting complementarities between the sets of buggy and smelly classes correctly classified by the two models. By evaluating the actual information gain provided by the intensity index with respect to the other metrics in the model, we found that the intensity index is a relevant feature for both product and process metrics-based models. At the same time, the metric counting the average number of code smells in previous versions of a class, considered by the alternative model, is also able to reduce the entropy of the model. On the basis of this result, we devise and evaluate a smell-aware combined bug prediction model that includes product, process, and smell-related features. We demonstrate how such a model classifies bug-prone code components with an F-Measure at least 13 percent higher than the existing state-of-the-art models.

93 citations
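The information-gain check described above can be sketched for a single feature. The buggy/clean counts are invented, and reducing the intensity index to a binary "high intensity" split is a simplification for illustration.

```python
import math

# Shannon entropy of a binary buggy/clean label from class counts.
def entropy(pos, neg):
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

# Information gain of a feature: label entropy minus the count-weighted
# entropy remaining after splitting on the feature's values.
def information_gain(split):
    # split: {feature_value: (buggy_count, clean_count)}
    pos = sum(b for b, _ in split.values())
    neg = sum(c for _, c in split.values())
    total = pos + neg
    remainder = sum((b + c) / total * entropy(b, c) for b, c in split.values())
    return entropy(pos, neg) - remainder

# Invented counts: high smell intensity co-occurs with bugginess.
gain = information_gain({"high_intensity": (8, 2), "low_intensity": (2, 8)})
print(round(gain, 3))
```

A feature with gain near zero would be dropped; the paper finds the intensity index keeps a relevant share of gain alongside product and process metrics.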


Journal ArticleDOI
TL;DR: In this article, the authors present the first-ever qualitative and quantitative evaluation that compares static API-misuse detectors along the same dimensions, and with original author validation, and show that existing detectors suffer from extremely low precision and recall.
Abstract: Application Programming Interfaces (APIs) often have usage constraints, such as restrictions on call order or call conditions. API misuses , i.e., violations of these constraints, may lead to software crashes, bugs, and vulnerabilities. Though researchers developed many API-misuse detectors over the last two decades, recent studies show that API misuses are still prevalent. Therefore, we need to understand the capabilities and limitations of existing detectors in order to advance the state of the art. In this paper, we present the first-ever qualitative and quantitative evaluation that compares static API-misuse detectors along the same dimensions, and with original author validation. To accomplish this, we develop MuC , a classification of API misuses, and MuBenchPipe , an automated benchmark for detector comparison, on top of our misuse dataset, MuBench . Our results show that the capabilities of existing detectors vary greatly and that existing detectors, though capable of detecting misuses, suffer from extremely low precision and recall. A systematic root-cause analysis reveals that, most importantly, detectors need to go beyond the naive assumption that a deviation from the most-frequent usage corresponds to a misuse and need to obtain additional usage examples to train their models. We present possible directions towards more-powerful API-misuse detectors.

92 citations


Journal ArticleDOI
TL;DR: A large-scale field study with 2,443 software engineers, whose development activities were closely monitored over 2.5 years in four integrated development environments (IDEs), concludes that Test-Guided Development (TGD) is closer to the development reality of most developers than TDD.
Abstract: Software testing is one of the key activities to achieve software quality in practice. Despite its importance, however, we have a remarkable lack of knowledge on how developers test in real-world projects. In this paper, we report on a large-scale field study with 2,443 software engineers whose development activities we closely monitored over 2.5 years in four integrated development environments (IDEs). Our findings, which largely generalized across the studied IDEs and programming languages Java and C#, question several commonly shared assumptions and beliefs about developer testing: half of the developers in our study do not test; developers rarely run their tests in the IDE; most programming sessions end without any test execution; only once they start testing, do they do it extensively; a quarter of test cases is responsible for three quarters of all test failures; 12 percent of tests show flaky behavior; Test-Driven Development (TDD) is not widely practiced; and software developers only spend a quarter of their time engineering tests, whereas they think they test half of their time. We summarize these practices of loosely guiding one's development efforts with the help of testing in an initial summary on Test-Guided Development (TGD), a behavior we argue to be closer to the development reality of most developers than TDD.

Journal ArticleDOI
TL;DR: This paper attempts to clarify the nature and functions of process theories and taxonomies in software engineering research, and to synthesize methodological guidelines for their generation and evaluation; it advances the key insight that most process theories are taxonomies with additional propositions, which helps inform their evaluation.
Abstract: Software engineering is increasingly concerned with theory because the foundational knowledge comprising theories provides a crucial counterpoint to the practical knowledge expressed through methods and techniques. Fortunately, much guidance is available for generating and evaluating theories for explaining why things happen (variance theories). Unfortunately, little guidance is available concerning theories for explaining how things happen (process theories), or theories for analyzing and understanding situations (taxonomies). This paper therefore attempts to clarify the nature and functions of process theories and taxonomies in software engineering research, and to synthesize methodological guidelines for their generation and evaluation. It further advances the key insight that most process theories are taxonomies with additional propositions, which helps inform their evaluation. The proposed methodological guidance has many benefits: it provides a concise summary of existing guidance from reference disciplines, it adapts techniques from reference disciplines to the software engineering context, and it promotes approaches that better facilitate scientific consensus.

Journal ArticleDOI
TL;DR: In this article, the authors propose the use of "bellwethers" to mitigate conclusion instability in software analytics, which is the project whose data yields the best predictions on all others.
Abstract: Software analytics builds quality prediction models for software projects. Experience shows that (a) the more projects studied, the more varied are the conclusions; and (b) project managers lose faith in the results of software analytics if those results keep changing. To reduce this conclusion instability, we propose the use of “bellwethers”: given N projects from a community the bellwether is the project whose data yields the best predictions on all others. The bellwethers offer a way to mitigate conclusion instability because conclusions about a community are stable as long as this bellwether continues as the best oracle. Bellwethers are also simple to discover (just wrap a for-loop around standard data miners). When compared to other transfer learning methods (TCA+, transfer Naive Bayes, value cognitive boosting), using just the bellwether data to construct a simple transfer learner yields comparable predictions. Further, bellwethers appear in many SE tasks such as defect prediction, effort estimation, and bad smell detection. We hence recommend using bellwethers as a baseline method for transfer learning against which future work should be compared.
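The "for-loop around standard data miners" is easy to make concrete. In this sketch, `score` stands in for any learner-plus-metric pair (train on the source project, evaluate on the target), and the transfer-score table is invented.

```python
# Bellwether discovery: train on each project, score on all the others, and
# keep the project whose model transfers best on average.
def find_bellwether(projects, score):
    best_project, best_score = None, float("-inf")
    for source in projects:
        others = [p for p in projects if p != source]
        # average predictive performance of a model trained on `source`
        # when evaluated on every other project in the community
        avg = sum(score(source, target) for target in others) / len(others)
        if avg > best_score:
            best_project, best_score = source, avg
    return best_project

# Invented transfer scores: score[(source, target)] = e.g. AUC on the target.
table = {
    ("A", "B"): 0.70, ("A", "C"): 0.72,
    ("B", "A"): 0.55, ("B", "C"): 0.60,
    ("C", "A"): 0.65, ("C", "B"): 0.58,
}
bellwether = find_bellwether(["A", "B", "C"], lambda s, t: table[(s, t)])
print(bellwether)  # "A": its models predict best on the rest of the community
```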

Journal ArticleDOI
TL;DR: The experimental results show that the proposed change-level SATD Determination model outperforms four baselines in terms of AUC and cost-effectiveness, and that “Diffusion” is the most discriminative of the three feature dimensions for determining TD-introducing changes.
Abstract: Technical debt (TD) is a metaphor to describe the situation where developers introduce suboptimal solutions during software development to achieve short-term goals that may affect the long-term software quality. Prior studies proposed different techniques to identify TD, such as identifying TD through code smells or by analyzing source code comments. Technical debt identified using comments is known as Self-Admitted Technical Debt (SATD) and refers to TD that is introduced intentionally. Compared with TD identified by code metrics or code smells, SATD is more reliable since it is admitted by developers using comments. Thus far, all of the state-of-the-art approaches identify SATD at the file-level. In essence, they identify whether a file has SATD or not. However, all of the SATD is introduced through software changes. Previous studies that identify SATD at the file-level in isolation cannot describe the TD context related to multiple files. Therefore, it is beneficial to identify the SATD once a change is being made. We refer to this type of TD identification as “Change-level SATD Determination”, which determines whether or not a change introduces SATD. Identifying SATD at the change-level can help to manage and control TD by understanding the TD context through tracing the introducing changes. To build a change-level SATD Determination model, we first identify TD from source code comments in source code files of all versions. Second, we label the changes that first introduce the SATD comments as TD-introducing changes. Third, we build the determination model by extracting 25 features from software changes that are divided into three dimensions, namely diffusion, history and message, respectively. To evaluate the effectiveness of our proposed model, we perform an empirical study on 7 open source projects containing a total of 100,011 software changes. 
The experimental results show that our model outperforms four baselines in terms of AUC and cost-effectiveness (i.e., the percentage of TD-introducing changes identified when inspecting 20 percent of changed LOC). On average across the 7 experimental projects, our model achieves an AUC of 0.82 and a cost-effectiveness of 0.80, a significant improvement over the comparison baselines. In addition, we found that “Diffusion” is the most discriminative of the three feature dimensions for determining TD-introducing changes.
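The cost-effectiveness measure used here (TD-introducing changes found within a 20 percent changed-LOC inspection budget) can be sketched directly; the change tuples below are invented for illustration.

```python
# Cost-effectiveness: rank changes by predicted risk, then count what fraction
# of TD-introducing changes are caught before 20 percent of the total changed
# LOC has been inspected.
def cost_effectiveness(changes, budget=0.20):
    # each change: (predicted_risk, changed_loc, introduces_td)
    ranked = sorted(changes, key=lambda c: c[0], reverse=True)
    total_loc = sum(c[1] for c in changes)
    total_td = sum(1 for c in changes if c[2])
    inspected_loc, found = 0, 0
    for risk, loc, is_td in ranked:
        if inspected_loc + loc > budget * total_loc:
            break  # inspection budget exhausted
        inspected_loc += loc
        found += is_td
    return found / total_td

changes = [
    (0.9, 10, True), (0.8, 30, True), (0.5, 80, False),
    (0.4, 50, True), (0.1, 30, False),
]
print(cost_effectiveness(changes))  # 2 of 3 TD changes found within budget
```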

Journal ArticleDOI
TL;DR: To utilize multiple sources effectively, a multi-source selection based manifold discriminant alignment (MSMDA) approach is proposed, and results show that MSMDA can achieve better performance than a range of baseline methods.
Abstract: Heterogeneous defect prediction (HDP) refers to predicting the defect-proneness of software modules in a target project using heterogeneous metric data from other projects. Existing HDP methods mainly focus on predicting target instances with a single source. In practice, there exist plenty of external projects. Multiple sources can generally provide more information than a single project. Therefore, it is meaningful to investigate whether HDP performance can be improved by employing multiple sources. However, a precondition of conducting HDP is that the external sources are available. Due to privacy concerns, most companies are not willing to share their data. To facilitate data sharing, it is essential to study how to protect the privacy of data owners before they release their data. In this paper, we study the above two issues in HDP. Specifically, to utilize multiple sources effectively, we propose a multi-source selection based manifold discriminant alignment (MSMDA) approach. To protect the privacy of data owners, a sparse representation based double obfuscation algorithm is designed and applied to HDP. Through a case study of 28 projects, our results show that MSMDA can achieve better performance than a range of baseline methods. The improvement is 3.4-15.3 percent in g-measure and 3.0-19.1 percent in AUC.

Journal ArticleDOI
TL;DR: The paper proposes a black-box test generation approach, implemented via meta-heuristic search, that aims to maximize diversity in the test output signals generated by Simulink models, together with a test prioritization algorithm that automatically ranks the generated test cases according to their likelihood of revealing a fault.
Abstract: All engineering disciplines are founded and rely on models, although they may differ on purposes and usages of modeling. Among the different disciplines, the engineering of Cyber Physical Systems (CPSs) particularly relies on models with dynamic behaviors (i.e., models that exhibit time-varying changes). The Simulink modeling platform greatly appeals to CPS engineers since it captures dynamic behavior models. It further provides seamless support for two indispensable engineering activities: (1) automated verification of abstract system models via model simulation, and (2) automated generation of system implementation via code generation. We identify three main challenges in the verification and testing of Simulink models with dynamic behavior, namely incompatibility, oracle and scalability challenges. We propose a Simulink testing approach that attempts to address these challenges. Specifically, we propose a black-box test generation approach, implemented based on meta-heuristic search, that aims to maximize diversity in test output signals generated by Simulink models. We argue that in the CPS domain test oracles are likely to be manual and therefore the main cost driver of testing. In order to lower the cost of manual test oracles, we propose a test prioritization algorithm to automatically rank test cases generated by our test generation algorithm according to their likelihood to reveal a fault. Engineers can then select, according to their test budget, a subset of the most highly ranked test cases. To demonstrate scalability, we evaluate our testing approach using industrial Simulink models. Our evaluation shows that our test generation and test prioritization approaches outperform baseline techniques that rely on random testing and structural coverage.
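The output-diversity objective can be sketched by treating each test's Simulink output as a sampled signal and scoring a suite by its pairwise signal distances. Euclidean distance and the toy signals below are assumptions for illustration; the paper's actual signal features and search are richer.

```python
import math

# Euclidean distance between two equally sampled output signals.
def signal_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Diversity of a test suite: sum of pairwise output-signal distances.
# A meta-heuristic search would mutate test inputs to maximize this score.
def suite_diversity(suite):
    return sum(
        signal_distance(suite[i], suite[j])
        for i in range(len(suite)) for j in range(i + 1, len(suite))
    )

flat = [[0, 0, 0], [0, 0, 1]]      # two nearly identical output signals
varied = [[0, 0, 0], [5, -3, 4]]   # two very different output signals
print(suite_diversity(flat) < suite_diversity(varied))  # True
```

Diverse outputs are valuable precisely because the oracle is manual: each test shown to an engineer should reveal behavior the previous ones did not.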

Journal ArticleDOI
TL;DR: This paper introduces the research and contributions of various scholars in the field of visual inspection, summarizes the application and development of visual inspection technology in the railway industry, and forecasts future research directions for visual inspection technology.
Abstract: In order to ensure the safety of railway transportation, it is necessary to regularly check for faults and defects in the railway system. Visual inspection technology is conducive to improving the low efficiency, poor economy and inaccurate detection results of traditional detection methods. This paper introduces the research and contribution of various scholars in the field of visual inspection, summarizes the application and development of visual inspection technology in the railway industry, and finally forecasts the future research direction of visual inspection technology.

Journal ArticleDOI
TL;DR: The paper proposes FARSEC, a framework for filtering and ranking bug reports that reduces the mislabelling caused by security-related keywords appearing in non-security reports, and demonstrates that FARSEC improves the performance of text-based prediction models for security bug reports in 90 percent of cases.
Abstract: Security bug reports can describe security critical vulnerabilities in software products. Bug tracking systems may contain thousands of bug reports, where relatively few of them are security related. Therefore finding unlabelled security bugs among them can be challenging. To help security engineers identify these reports quickly and accurately, text-based prediction models have been proposed. These can often mislabel security bug reports due to a number of reasons such as class imbalance, where the ratio of non-security to security bug reports is very high. More critically, we have observed that the presence of security related keywords in both security and non-security bug reports can lead to the mislabelling of security bug reports. This paper proposes FARSEC, a framework for filtering and ranking bug reports for reducing the presence of security related keywords. Before building prediction models, our framework identifies and removes non-security bug reports with security related keywords. We demonstrate that FARSEC improves the performance of text-based prediction models for security bug reports in 90 percent of cases. Specifically, we evaluate it with 45,940 bug reports from Chromium and four Apache projects. With our framework, we mitigate the class imbalance issue and reduce the number of mislabelled security bug reports by 38 percent.
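FARSEC's filtering step can be sketched as below. The keyword set, the bag-of-words matching, and the example reports are all illustrative assumptions, not the framework's actual scoring of security-related terms.

```python
# Hedged sketch of the FARSEC idea: before training, remove non-security
# reports whose text contains security-related keywords, so they cannot
# teach the model that those keywords are benign.
SECURITY_KEYWORDS = {"overflow", "xss", "injection", "crash", "exploit"}

def filter_training_set(reports):
    kept = []
    for text, is_security in reports:
        words = set(text.lower().split())
        # drop the noisy case: a non-security report using security vocabulary
        if not is_security and words & SECURITY_KEYWORDS:
            continue
        kept.append((text, is_security))
    return kept

reports = [
    ("buffer overflow in parser", True),
    ("ui crash when resizing window", False),  # noisy non-security report
    ("typo in settings dialog", False),
]
filtered = filter_training_set(reports)
print(len(filtered))  # 2: the noisy report is removed before model training
```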

Journal ArticleDOI
TL;DR: An empirical study on API usages, with an emphasis on how different types of APIs are used, finds that single-type usages are mostly strict orders, but multi-type usages are more complicated since they include both strict orders and partial orders.
Abstract: API libraries provide thousands of APIs, and are essential in daily programming tasks. To understand their usages, it has long been a hot research topic to mine specifications that formally define legal usages for APIs. Furthermore, researchers are working on many other research topics on APIs. Although the research on APIs is intensively studied, many fundamental questions on APIs are still open. For example, the answers to open questions, such as which format can naturally define API usages and in which case, are still largely unknown. We notice that many such open questions are not concerned with concrete usages of specific APIs, but usages that describe how to use different types of APIs. To explore these questions, in this paper, we conduct an empirical study on API usages, with an emphasis on how different types of APIs are used. Our empirical results lead to nine findings on API usages. For example, we find that single-type usages are mostly strict orders, but multi-type usages are more complicated since they include both strict orders and partial orders. Based on these findings, for the research on APIs, we provide our suggestions on the four key aspects such as the challenges, the importance of different API elements, usage patterns, and pitfalls in designing evaluations. Furthermore, we interpret our findings, and present our insights on data sources, extraction techniques, mining techniques, and formats of specifications for the research of mining specifications.

Journal ArticleDOI
TL;DR: CLAP (Crowd Listener for releAse Planning), a web application able to categorize user reviews based on the information they carry, cluster together related reviews, and prioritize the clusters of reviews to be implemented when planning the subsequent app release, is developed.
Abstract: The market for mobile apps is getting bigger and bigger, and it is expected to be worth over 100 billion dollars in 2020. To have a chance to succeed in such a competitive environment, developers need to build and maintain high-quality apps, continuously astonishing their users with the coolest new features. Mobile app marketplaces allow users to post reviews. Although reviews are aimed at recommending apps among users, they also contain precious information for developers, reporting bugs and suggesting new features. To exploit such a source of information, developers would have to manually read user reviews, a task that is not feasible when hundreds of them are collected per day. To help developers deal with this task, we developed CLAP (Crowd Listener for releAse Planning), a web application able to (i) categorize user reviews based on the information they carry, (ii) cluster together related reviews, and (iii) prioritize the clusters of reviews to be implemented when planning the subsequent app release. We evaluated all the steps behind CLAP, showing its high accuracy in categorizing and clustering reviews and the meaningfulness of the recommended prioritizations. Also, given the availability of CLAP as a working tool, we assessed its applicability in industrial environments.
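The categorization step can be pictured with a toy keyword-based sketch. The category names and keyword lists below are illustrative assumptions; CLAP itself uses a trained classifier, not a hand-written keyword table:

```python
# Toy sketch of routing app-store reviews into coarse categories.
# Keyword lists are illustrative assumptions, not CLAP's classifier.
CATEGORY_KEYWORDS = {
    "bug report": {"crash", "error", "freeze", "bug"},
    "feature request": {"add", "wish", "should", "feature"},
}

def categorize(review):
    """Return the first category whose keywords overlap the review."""
    words = set(review.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return category
    return "other"
```

In a pipeline like CLAP's, the categorized reviews would then be clustered by textual similarity and the clusters ranked for the next release.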

Journal ArticleDOI
TL;DR: MSeer, an advanced fault-localization technique for locating multiple bugs in parallel, is proposed; case studies suggest that MSeer performs better in terms of effectiveness and efficiency than two other techniques for locating multiple bugs in parallel.
Abstract: In practice, a program may contain multiple bugs. The simultaneous presence of these bugs may degrade the effectiveness of existing fault-localization techniques. While it is acceptable to use all failed and successful tests to identify suspicious code for programs with exactly one bug, the same approach is not appropriate for programs with multiple bugs because the due-to relationship between failed tests and underlying bugs cannot be easily identified. One solution is to generate fault-focused clusters by grouping failed tests caused by the same bug into the same clusters. We propose MSeer, an advanced fault-localization technique for locating multiple bugs in parallel. Our major contributions include (1) a revised Kendall tau distance to measure the distance between two failed tests, (2) an innovative approach to simultaneously estimate the number of clusters and assign initial medoids to them, and (3) an improved K-medoids clustering algorithm to better identify the due-to relationship between failed tests and their corresponding bugs. Case studies on 840 multiple-bug versions of seven programs suggest that MSeer performs better in terms of effectiveness and efficiency than two other techniques for locating multiple bugs in parallel.
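The distance measure at the heart of the clustering step builds on Kendall tau distance: the number of statement pairs that two suspiciousness rankings order differently. The sketch below shows the classical (unrevised) Kendall tau distance, not MSeer's revised variant:

```python
# Classical Kendall tau distance between two rankings: count the
# pairs (i, j) that the rankings order in opposite directions.
# MSeer uses a revised variant; this is the textbook version.
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """rank_a[i] is the suspiciousness rank that one failed test
    assigns to statement i; likewise rank_b for another failed test."""
    assert len(rank_a) == len(rank_b)
    discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0:
            discordant += 1
    return discordant
```

Two failed tests triggered by the same bug tend to rank the same statements as suspicious, so their distance is small and they land in the same fault-focused cluster.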

Journal ArticleDOI
TL;DR: The design and implementation of LEILA (formaL tool for idEntifying mobIle maLicious behAviour), a tool targeted at Android malware families detection, are presented and demonstrated that the tool is effective in detecting malicious behaviour and, especially, in localizing the payload within the code.
Abstract: With the increasing diffusion of mobile technologies, nowadays mobile devices represent an irreplaceable tool to perform several operations, from posting a status on a social network to transferring money between bank accounts. As a consequence, mobile devices store a huge amount of private and sensitive information, which is why attackers are developing very sophisticated techniques to extort data and money from these devices. This paper presents the design and the implementation of LEILA (formaL tool for idEntifying mobIle maLicious behAviour), a tool targeted at Android malware family detection. LEILA is based on a novel approach that exploits model checking to analyse and verify the Java Bytecode that is produced when the source code is compiled. After a thorough description of the method used for Android malware family detection, we report the experiments we have conducted using LEILA. The experiments demonstrated that the tool is effective in detecting malicious behaviour and, especially, in localizing the payload within the code: we evaluated real-world malware belonging to several widespread families, obtaining an accuracy ranging between 0.97 and 1.

PatentDOI
TL;DR: This paper proposes a refactoring recommendation approach that dynamically adapts and interactively suggests refactorings to developers and takes their feedback into consideration, and performs significantly better than four existing search-based refactoring techniques and one fully-automated refactoring tool not based on heuristic search.
Abstract: Refactoring improves the software design while preserving overall functionality and behavior, and is an important technique in managing the growing complexity of software systems. This disclosure introduces an interactive way to refactor software systems using innovization and interactive dynamic multi-objective optimization. The interactive approach supports the adaption of refactoring solutions based on developer feedback while also taking into account other code changes that the developer may have performed in parallel with the refactoring activity.

Journal ArticleDOI
TL;DR: A theoretical model of crowd interest and actual participation in crowdsourcing competitions is proposed and evaluated using Structural Equation Modeling, finding that the level of prize and duration of competitions do not significantly increase crowd interest in competitions.
Abstract: Crowdsourcing is emerging as an alternative outsourcing strategy that is gaining increasing attention in the software engineering community. However, crowdsourcing software development involves complex tasks, which differ significantly from the micro-tasks found on crowdsourcing platforms such as Amazon Mechanical Turk: micro-tasks are much shorter in duration, are typically very simple, and do not involve any task interdependencies. To achieve the potential benefits of crowdsourcing in the software development context, companies need to understand how this strategy works and what factors might affect crowd participation. We present a multi-method qualitative and quantitative theory-building research study. First, we derive a set of key concerns from the crowdsourcing literature as an initial analytical framework for an exploratory case study in a Fortune 500 company. We complement the case study findings with an analysis of 13,602 crowdsourcing competitions over a ten-year period on the very popular Topcoder crowdsourcing platform. Drawing from our empirical findings and the crowdsourcing literature, we propose a theoretical model of crowd interest and actual participation in crowdsourcing competitions. We evaluate this model using Structural Equation Modeling. Among the findings are that the level of prize and duration of competitions do not significantly increase crowd interest in competitions.

Journal ArticleDOI
TL;DR: This paper proposes Sway, short for “the sampling way”: starting with a very large population of candidate solutions and sampling down to just the better ones for search-based software engineering problems.
Abstract: Increasingly, Software Engineering (SE) researchers use search-based optimization techniques to solve SE problems with multiple conflicting objectives. These techniques often apply CPU-intensive evolutionary algorithms to explore generations of mutations to a population of candidate solutions. An alternative approach, proposed in this paper, is to start with a very large population and sample down to just the better solutions. We call this method “Sway”, short for “the sampling way”. This paper compares Sway against state-of-the-art search-based SE tools using seven models: five software product line models and two software process control models (concerned with project management, effort estimation, and selection of requirements during incremental agile development). For these models, the experiments of this paper show that Sway is competitive with corresponding state-of-the-art evolutionary algorithms while requiring orders of magnitude fewer evaluations. Considering the simplicity and effectiveness of Sway, we therefore propose this approach as a baseline method for search-based software engineering models, especially for models that are very slow to execute.
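The sample-down idea can be sketched in a few lines. This is a deliberately simplified caricature: real Sway splits the population along a data-driven projection rather than randomly, and compares cluster representatives on multiple objectives rather than a single score:

```python
# Caricature of "the sampling way": recursively halve a large random
# population, evaluate only one representative per half, and keep the
# half whose representative scores better. Assumes a single scalar
# objective; the real algorithm handles multiple objectives and uses
# a smarter (FastMap-style) split.
import random

def sway(candidates, score, enough=8):
    """Shrink the candidate pool to at most `enough` survivors while
    calling `score` only O(log n) times."""
    if len(candidates) <= enough:
        return candidates
    random.shuffle(candidates)
    mid = len(candidates) // 2
    west, east = candidates[:mid], candidates[mid:]
    # Evaluate one representative per half instead of every candidate.
    if score(west[0]) >= score(east[0]):
        return sway(west, score, enough)
    return sway(east, score, enough)
```

The appeal is the evaluation budget: halving a population of 10,000 costs roughly 20 model evaluations, versus thousands per generation for a typical evolutionary algorithm.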

Journal ArticleDOI
TL;DR: The first extensive study empirically evaluating four static TCP techniques and comparing them with state-of-research dynamic TCP techniques across several quality metrics shows that static techniques can be surprisingly effective, particularly when measured by APFDc.
Abstract: Test Case Prioritization (TCP) is an increasingly important regression testing technique for reordering test cases according to a pre-defined goal, particularly as agile practices gain adoption. To better understand these techniques, we perform the first extensive study aimed at empirically evaluating four static TCP techniques, comparing them with state-of-research dynamic TCP techniques across several quality metrics. This study was performed on 58 real-world Java programs encompassing 714 KLoC and results in several notable observations. First, our results across two effectiveness metrics (the Average Percentage of Faults Detected (APFD) and its cost-cognizant variant (APFDc)) illustrate that at test-class granularity, these metrics tend to correlate, but this correlation does not hold at test-method granularity. Second, our analysis shows that static techniques can be surprisingly effective, particularly when measured by APFDc. Third, we found that TCP techniques tend to perform better on larger programs, but that program size does not affect comparative performance measures between techniques. Fourth, software evolution does not significantly impact comparative performance results between TCP techniques. Fifth, neither the number nor type of mutants utilized dramatically impact measures of TCP effectiveness under typical experimental settings. Finally, our similarity analysis illustrates that highly prioritized test cases tend to uncover dissimilar faults.
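The APFD metric used above has a standard closed form: for n tests and m faults, APFD = 1 - (TF_1 + ... + TF_m)/(n*m) + 1/(2n), where TF_i is the position in the ordering of the first test that reveals fault i. A minimal sketch (fault-detection data would come from mutation analysis or real fault matrices in practice):

```python
# Compute the standard APFD metric for one test-case ordering.
# Assumes every fault in `faults_detected` is exposed by at least
# one test in `order`.

def apfd(order, faults_detected):
    """order: list of test ids in prioritized order.
    faults_detected[t]: set of fault ids that test t exposes."""
    n = len(order)
    all_faults = set().union(*(faults_detected[t] for t in order))
    m = len(all_faults)
    first_pos = {}  # fault id -> 1-based position of first revealing test
    for pos, t in enumerate(order, start=1):
        for f in faults_detected[t]:
            first_pos.setdefault(f, pos)
    return 1 - sum(first_pos.values()) / (n * m) + 1 / (2 * n)
```

APFDc, the cost-cognizant variant the study favors, additionally weights each fault by severity and each test by execution cost; the structure of the computation is the same.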

Journal ArticleDOI
TL;DR: A mixed methods empirical study of software engineering management at Microsoft is conducted to investigate what manager attributes developers and engineering managers perceive important and why, and finds that technical skills are not the sign of greatness for an engineering manager.
Abstract: Having great managers is as critical to success as having a good team or organization. In general, a great manager is seen as fuelling the team they manage, enabling it to use its full potential. Though software engineering research studies factors that may affect the performance and productivity of software engineers and teams (like tools and skills), it has overlooked the software engineering manager. The software industry’s growth and change in the last decades are creating a need for a domain-specific view of management. On the one hand, experts are questioning how the abundant work in management applies to software engineering. On the other hand, practitioners are looking to researchers for evidence-based guidance on how to manage software teams. We conducted a mixed methods empirical study of software engineering management at Microsoft to investigate what manager attributes developers and engineering managers perceive important and why. We present a conceptual framework of manager attributes, and find that technical skills are not the sign of greatness for an engineering manager. Through statistical analysis we identify how engineers and managers relate in their views, and how software engineering differs from other knowledge work groups in its perceptions about what makes great managers. We present strategies for putting the attributes to use, discuss implications for research and practice, and offer avenues for further work.

Journal ArticleDOI
TL;DR: This study empirically compares the tool, BinGo-E, with the previous tool BinGo and the available state-of-the-art tools of binary code search in terms of search accuracy and performance, and proposes to incorporate features from different categories (e.g., structural features and high-level semantic features) for accuracy improvement and emulation for efficiency improvement.
Abstract: Different from source code clone detection, clone detection (similar code search) in binary executables faces significant challenges due to the large differences in the syntax and the structure of binary code that result from different configurations of compilers, architectures and OSs. Existing studies have proposed different categories of features for detecting binary code clones, including CFG structures, n-grams in CFGs, input/output values, etc. In our previous study and the tool BinGo, to mitigate the huge gaps in CFG structures due to different compilation scenarios, we proposed a selective inlining technique to capture the complete function semantics by inlining relevant library and user-defined functions. However, only features of input/output values are considered in BinGo. In this study, we propose to incorporate features from different categories (e.g., structural features and high-level semantic features) for accuracy improvement and emulation for efficiency improvement. We empirically compare our tool, BinGo-E, with the previous tool BinGo and the available state-of-the-art tools of binary code search in terms of search accuracy and performance. Results show that BinGo-E achieves significantly better accuracies than BinGo for cross-architecture matching, cross-OS matching, cross-compiler matching and intra-compiler matching. Additionally, in the new task of matching binaries of forked projects, BinGo-E also exhibits a better accuracy than the existing benchmark tool. Meanwhile, BinGo-E takes less time than BinGo during the process of matching.

Journal ArticleDOI
TL;DR: A classification scheme focused on four crucial aspects of interactive search, i.e., the problem formulation, search technique, interactive approach, and the empirical framework is formulated and its intention is that the classification scheme affords a methodological approach for interactive SBSE.
Abstract: Search-Based Software Engineering (SBSE) has been successfully applied to automate a wide range of software development activities. Nevertheless, in those software engineering problems where human evaluation and preference are crucial, such insights have proved difficult to characterize in search, and solutions might not look natural when that is the expectation. In an attempt to address this, an increasing number of researchers have reported the incorporation of the ‘human-in-the-loop’ during search, and interactive SBSE has attracted significant attention recently. However, reported results are fragmented over different development phases, and a great variety of novel interactive approaches and algorithmic techniques have emerged. To better integrate these results, we have performed a systematic literature review of interactive SBSE. From a total of 669 papers, 26 primary studies were identified. To enable their analysis, we formulated a classification scheme focused on four crucial aspects of interactive search, i.e., the problem formulation, search technique, interactive approach, and the empirical framework. Our intention is that the classification scheme affords a methodological approach for interactive SBSE. Lastly, as well as providing a detailed cross analysis, we identify and discuss some open issues and potential future trends for the research community.

Journal ArticleDOI
TL;DR: The defect resolution process is studied—from the questions developers ask to their strategies for answering them and how future tools can leverage the findings to encourage better strategies.
Abstract: While using security tools to resolve security defects, software developers must apply considerable effort. Success depends on a developer's ability to interact with tools, ask the right questions, and make strategic decisions. To build better security tools and subsequently help developers resolve defects more accurately and efficiently, we studied the defect resolution process—from the questions developers ask to their strategies for answering them. In this paper, we report on an exploratory study with novice and experienced software developers. We equipped them with Find Security Bugs, a security-oriented static analysis tool, and observed their interactions with security vulnerabilities in an open-source system that they had previously contributed to. We found that they asked questions not only about security vulnerabilities, associated attacks, and fixes, but also questions about the software itself, the social ecosystem that built the software, and related resources and tools. We describe the strategic successes and failures we observed and how future tools can leverage our findings to encourage better strategies.