TL;DR: The Qualitas Corpus, a large curated collection of open source Java systems, is described; it reduces the cost of performing large empirical studies of code and supports comparison of measurements of the same artifacts.
Abstract: In order to increase our ability to use measurement to support software development practice we need to do more analysis of code. However, empirical studies of code are expensive and their results are difficult to compare. We describe the Qualitas Corpus, a large curated collection of open source Java systems. The corpus reduces the cost of performing large empirical studies of code and supports comparison of measurements of the same artifacts. We discuss its design, organisation, and issues associated with its development.
Keywords: Empirical studies; curated code corpus; experimental infrastructure
I. INTRODUCTION
Measurement is fundamental to engineering; however, its use in engineering software has been limited.
The authors need models explaining the relationship between the measurements and the quality attributes, and they need experiments to validate those models.
While the goals of applied linguistics research are not exactly the same as ours, the similarities are close enough to warrant examining how corpora are used in that field.
A. Empirical studies of Code
By “empirical study of code” the authors mean a study in which the artifacts under investigation consist of source code, there are multiple, unrelated artifacts, and the artifacts were developed independently of the study.
They identified the systems studied, but did not identify the versions for all systems.
There are, however, several issues with these studies.
SIR provides a curated set of artifacts, including the code, test suites, and fault data.
C. The need for curation
If two studies that analyse code give conflicting reports of some phenomena, one obvious possible explanation is that the studies were applied to different samples.
Such code may not be representative of the deployed code, and so could bias the results of the study.
Different systems organise their source code in different ways.
Also, some systems provide their own implementations of some third-party libraries, further complicating what is system code and what is not.
According to Hunston, the content of a corpus primarily depends on the purpose it is used for, and there are usually questions specific to a purpose that must be addressed in the design of the corpus.
A. Organisation
Each version consists of the original distribution and two “unpacked” forms, bin and src.
The original distribution is provided exactly as downloaded from the system’s download site.
First, it means the authors can distribute the corpus without creating the bin and src forms, as they can be automatically created from the distributed forms, thus reducing the size of the corpus distribution.
The authors use a standard naming convention to identify systems and versions.
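As a concrete illustration of this layout and naming, the following Java sketch shows how a corpus entry might be located on disk. It assumes an identifier of the form system-version (for example ant-1.8.0) and the per-entry bin and src directories described above; the root and directory names used here are assumptions for illustration, not part of the corpus specification.

import java.nio.file.Path;

// Minimal sketch (not corpus tooling): resolving the directories of a corpus
// entry, assuming identifiers of the form "<system>-<version>" and a layout
// in which each entry holds the original distribution plus unpacked bin/src.
public class CorpusEntry {
    private final Path systemsRoot;   // e.g. .../QualitasCorpus/Systems (assumed name)
    private final String system;
    private final String version;

    public CorpusEntry(Path systemsRoot, String system, String version) {
        this.systemsRoot = systemsRoot;
        this.system = system;
        this.version = version;
    }

    public String id()     { return system + "-" + version; }
    public Path entryDir() { return systemsRoot.resolve(system).resolve(id()); }
    public Path binDir()   { return entryDir().resolve("bin"); }
    public Path srcDir()   { return entryDir().resolve("src"); }

    public static void main(String[] args) {
        CorpusEntry e = new CorpusEntry(Path.of("QualitasCorpus", "Systems"), "ant", "1.8.0");
        System.out.println(e.binDir()); // QualitasCorpus/Systems/ant/ant-1.8.0/bin
    }
}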
B. Contents
Figure 2 lists the systems that are currently represented in the corpus.
Figure 3 gives an idea of how big the systems are, listing the latest version of each system in the current release in order of number of top-level types (that is, classes, interfaces, enums, and annotations).
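Counts like those in Figure 3 can be approximated cheaply from the unpacked bin form. The following Java sketch, offered only as an illustration and not as the authors' measurement tooling, counts compiled top-level types by walking a bin directory and skipping nested and anonymous classes via the '$' file-name convention; as noted later, the binary form can differ slightly from the source.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Illustrative sketch: approximate the number of top-level types in a system
// by counting .class files in its unpacked bin form, skipping nested and
// anonymous types (whose file names contain '$').
public class TopLevelTypeCounter {
    public static long countTopLevelTypes(Path binDir) throws IOException {
        try (Stream<Path> files = Files.walk(binDir)) {
            return files.filter(p -> p.toString().endsWith(".class"))
                        .filter(p -> !p.getFileName().toString().contains("$"))
                        .count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical entry path; adjust to an installed corpus distribution.
        Path bin = Path.of("QualitasCorpus", "Systems", "ant", "ant-1.8.0", "bin");
        System.out.println(countTopLevelTypes(bin));
    }
}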
For the most part, the systems in the corpus are open source and so the corpus can contain their distributions, especially as what is in the corpus is exactly what was downloaded from the system download site.
Since jre is an interesting system to analyse, the authors consider it part of the corpus; however, corpus users must download what they need from the Java distribution site.
What is provided by the corpus is the metadata similar to that for other systems.
C. Criteria for inclusion
This allows people to have the latest release and yet still be able to reproduce studies based on previous releases.
One advantage with Java is that its “compiled” form is also fairly easy to analyse, easier in fact than the source code (section IV-E); however, there are slight differences between the source and binary forms.
This criterion will probably be the first to go away completely.
Some systems the authors used (and analysed) before the first external release of the corpus have suffered this fate, and so are not in the corpus.
In fact, the authors already face the situation where a version of a system held in the corpus is apparently no longer available, as the developers appear to keep (or at least make available) only the most recent versions.
D. Metadata
As part of the curation process the authors gather metadata about each system version, and one of their near-term goals is to increase this information (section IV-J).
For example, one system's sourcepackages value is “org.gudy com.aelitis”, indicating that types whose fully qualified names begin with org.gudy or com.aelitis are identified as part of the system.
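The following Java sketch illustrates how a sourcepackages value, a space-separated list of package prefixes, might be used to decide whether a fully qualified type name belongs to the system. The matching rule shown (prefix match on package boundaries) is an assumption made for illustration rather than the corpus specification.

import java.util.Arrays;
import java.util.List;

// Illustrative sketch: classifying types as system or third-party code using
// a sourcepackages value such as "org.gudy com.aelitis". The matching rule
// (prefix match on package boundaries) is assumed for illustration.
public class SourcePackagesFilter {
    private final List<String> prefixes;

    public SourcePackagesFilter(String sourcePackages) {
        this.prefixes = Arrays.asList(sourcePackages.trim().split("\\s+"));
    }

    public boolean isSystemType(String fullyQualifiedName) {
        return prefixes.stream().anyMatch(p ->
                fullyQualifiedName.equals(p) || fullyQualifiedName.startsWith(p + "."));
    }

    public static void main(String[] args) {
        SourcePackagesFilter filter = new SourcePackagesFilter("org.gudy com.aelitis");
        System.out.println(filter.isSystemType("com.aelitis.azureus.core.AzureusCore")); // true
        System.out.println(filter.isSystemType("org.apache.commons.logging.Log"));       // false
    }
}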
Other metadata the authors keep includes the release date of the version, notes regarding the system and individual versions, domain information, and where the system distribution came from.
The latter allows users of the corpus to check corpus contents for themselves.
Another issue is what to do when systems stop being supported or otherwise become unavailable.
F. Content Management
Following criterion 1, a new release contains all the versions of systems in the previous release.
There are, however, some changes between releases.
The authors have developed processes over time to support the management of the corpus.
The two main processes are for making a new entry of a version of a system into the corpus, and creating a distribution for release.
In the early days, these were all manual, but now, with each new release, scripts are being developed to automate more parts of the process.
G. Distributing the Corpus
To install a copy, one acquires a distribution for a particular release.
One distribution contains just the most recent version of each system in the corpus.
For those interested in just “breadth” studies, this distribution is simpler to deal with (and much smaller to download).
Releases are identified by their date of release (in ISO 8601 format).
The current release is 20090202 and the distribution containing only the most recent versions of systems is 20090202r.
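A small Java sketch of how such release identifiers can be interpreted is given below; it treats the identifier as an ISO 8601 basic date with an optional trailing “r” marking the most-recent-versions-only distribution, which reflects the description above rather than any official parsing rule.

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Sketch: interpreting release identifiers such as "20090202" (full release)
// and "20090202r" (distribution with only the most recent system versions).
public class ReleaseId {
    public static String describe(String id) {
        boolean recentOnly = id.endsWith("r");
        String datePart = recentOnly ? id.substring(0, id.length() - 1) : id;
        LocalDate date = LocalDate.parse(datePart, DateTimeFormatter.BASIC_ISO_DATE);
        return date + (recentOnly ? " (most recent versions only)" : " (full release)");
    }

    public static void main(String[] args) {
        System.out.println(describe("20090202"));   // 2009-02-02 (full release)
        System.out.println(describe("20090202r"));  // 2009-02-02 (most recent versions only)
    }
}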
H. Using the corpus
A properly-installed distribution has the structure described in section IV-A.
If every study is performed on the complete contents of a given release, using the metadata provided in the corpus to identify the contents of a system (in particular sourcepackages, section IV-D), then the results of those studies can be compared with good confidence that comparison is meaningful.
Furthermore, what is actually studied can be described succinctly just by indicating the release (and, if necessary, the particular distribution) used.
There is, however, no restriction on how the corpus can be used.
In such cases, in addition to identifying the release, the authors recommend either that what has been included be identified by listing the system versions used, or that what has been left out be similarly identified.
I. History
The Qualitas Corpus was initially conceived and developed by one of us for Ph.D. research during 2005.
The original corpus was used and added to by members of the University of Auckland group over the next three years, growing from 21 systems initially.
It was made available for external release in January of 2008, containing 88 systems, 21 of them with multiple versions, for a total of 214 entries.
The main changes have been in terms of the metadata that is maintained, however there has also been a change in terminology.
J. Future Plans
As noted earlier, the next release is scheduled for July 2010.
As well as about 90 new versions of existing systems (but at this point, no new systems), the main change will be the addition of significantly more metadata.
Column 4 indicates whether the entry corresponds to a type identified as being in the system (that is, matches the sourcepackages value), with 0 indicating it does.
TL;DR: The largest experiment, to the best of the authors' knowledge, of applying machine learning algorithms to code smells concludes that the application of machine learning to the detection of these code smells can provide high accuracy (>96%), and only a hundred training examples are needed to reach at least 95% accuracy.
Abstract: Several code smell detection tools have been developed providing different results, because smells can be subjectively interpreted, and hence detected, in different ways. In this paper, we perform the largest experiment of applying machine learning algorithms to code smells to the best of our knowledge. We experiment with 16 different machine-learning algorithms on four code smells (Data Class, Large Class, Feature Envy, Long Method) and 74 software systems, with 1986 manually validated code smell samples. We found that all algorithms achieved high performance in the cross-validation data set, yet the highest performance was obtained by J48 and Random Forest, while the worst performance was achieved by support vector machines. However, the lower prevalence of code smells, i.e., imbalanced data, in the entire data set caused varying performances that need to be addressed in future studies. We conclude that the application of machine learning to the detection of these code smells can provide high accuracy (>96%), and only a hundred training examples are needed to reach at least 95% accuracy.
TL;DR: The study confirms that EVOSUITE can achieve good levels of branch coverage in practice, and exemplifies how the choice of software systems for an empirical study can influence the results of the experiments, which can serve to inform researchers to make more conscious choices in the selection of software system subjects.
Abstract: Research on software testing produces many innovative automated techniques, but because software testing is by necessity incomplete and approximate, any new technique faces the challenge of an empirical assessment. In the past, we have demonstrated scientific advance in automated unit test generation with the EVOSUITE tool by evaluating it on manually selected open-source projects or examples that represent a particular problem addressed by the underlying technique. However, demonstrating scientific advance is not necessarily the same as demonstrating practical value; even if EVOSUITE worked well on the software projects we selected for evaluation, it might not scale up to the complexity of real systems. Ideally, one would use large “real-world” software systems to minimize the threats to external validity when evaluating research tools. However, neither choosing such software systems nor applying research prototypes to them are trivial tasks. In this article we present the results of a large experiment in unit test generation using the EVOSUITE tool on 100 randomly chosen open-source projects, the 10 most popular open-source projects according to the SourceForge Web site, seven industrial projects, and 11 automatically generated software projects. The study confirms that EVOSUITE can achieve good levels of branch coverage (on average, 71% per class) in practice. However, the study also exemplifies how the choice of software systems for an empirical study can influence the results of the experiments, which can serve to inform researchers to make more conscious choices in the selection of software system subjects. Furthermore, our experiments demonstrate how practical limitations interfere with scientific advances: branch coverage on an unbiased sample is affected by predominant environmental dependencies. The surprisingly large effect of such practical engineering problems in unit testing will hopefully lead to a larger appreciation of work in this area, thus supporting transfer of knowledge from software testing research to practice.
176 citations
Cites methods from "The Qualitas Corpus: A Curated Coll..."
...The software artifacts were taken from the Qualitas Corpus [Tempero et al. 2010], from which 100 classes were chosen at random from each of the projects in that corpus....
[...]
...The Qualitas Corpus [Tempero et al. 2010] is a set of open-source Java programs originally collected to help empirical studies on static analysis....
[...]
...The Qualitas Corpus [Tempero et al. 2010] is a set of open source Java programs that were originally collected to help empirical studies on static analysis....
TL;DR: The effects of code duplication on machine learning models are explored, showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora which more accurately represent how machineLearning models of code are used by software engineers.
Abstract: The field of big code relies on mining large corpora of code to perform some learning task. A significant threat to this approach has been recently identified by Lopes et al. (2017) who found a large amount of near-duplicate code on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this work, we explore the effects of code duplication on machine learning models showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora which more accurately represent how machine learning models of code are used by software engineers. We present a duplication index for widely used datasets, list best practices for collecting code corpora and evaluating machine learning models on them. Finally, we release tools to help the community avoid this problem in future research.
166 citations
Cites background from "The Qualitas Corpus: A Curated Coll..."
...The existence of duplicates was noticed much earlier [27] but their negative effect became significantly more noticeable due to recent advancements that allowed the collection of large code corpora [19]....
TL;DR: As a starting point for impacting game development, researchers could create testing tools that enable game developers to create tests that assert flexible behavior with little up-front investment.
Abstract: Video games make up an important part of the software industry, yet the software engineering community rarely studies video games. This imbalance is a problem if video game development differs from general software development, as some game experts suggest. In this paper we describe a study with 14 interviewees and 364 survey respondents. The study elicited substantial differences between video game development and other software development. For example, in game development, “cowboy coders” are necessary to cope with the continuous interplay between creative desires and technical constraints. Consequently, game developers are hesitant to use automated testing because of these tests’ rapid obsolescence in the face of shifting creative desires of game designers. These differences between game and non-game development have implications for research, industry, and practice. For instance, as a starting point for impacting game development, researchers could create testing tools that enable game developers to create tests that assert flexible behavior with little up-front investment.
158 citations
Cites background from "The Qualitas Corpus: A Curated Coll..."
...Of the projects in two major software engineering corpora, SIR [5] and Qualitus [6], 0% and 3% are games, respectively....
TL;DR: The results reveal that with this configuration the machine learning techniques exhibit critical limitations in the state of the art which deserve further research.
Abstract: Code smells are symptoms of poor design and implementation choices weighing heavily on the quality of produced source code. During the last decades several code smell detection tools have been proposed. However, the literature shows that the results of these tools can be subjective and are intrinsically tied to the nature and approach of the detection. In a recent work the use of Machine-Learning (ML) techniques for code smell detection has been proposed, possibly solving the issue of tool subjectivity by giving a learner the ability to discern between smelly and non-smelly source code elements. While this work opened a new perspective for code smell detection, it only considered the case where instances affected by a single type of smell are contained in each dataset used to train and test the machine learners. In this work we replicate the study with a different dataset configuration containing instances of more than one type of smell. The results reveal that with this configuration the machine learning techniques exhibit critical limitations in the state of the art which deserve further research.
155 citations
Cites background or methods from "The Qualitas Corpus: A Curated Coll..."
...Of course, we cannot exclude possible imprecisions contained in the Qualitas Corpus dataset [60], e.g., imprecisions in the computations of the metrics for the source code elements exploited in this study....
[...]
...The authors have analyzed systems from the Qualitas Corpus [60], release 20120401r, one of the largest curated benchmark datasets to date, specially designed for empirical software engineering research....
[...]
...Finally, they empirically benchmarked a set of 16 machine-learning techniques for the detection of four code smell types [1]: they performed their analyses over 74 software systems belonging to the Qualitas Corpus dataset [60]....
TL;DR: This research addresses the need for software measures in object-oriented (OO) design through the development and implementation of a new suite of metrics for OO design, and suggests ways in which managers may use these metrics for process improvement.
Abstract: Given the central role that software development plays in the delivery and application of information technology, managers are increasingly focusing on process improvement in the software development area. This demand has spurred the provision of a number of new and/or improved approaches to software development, with perhaps the most prominent being object-orientation (OO). In addition, the focus on process improvement has increased the demand for software measures, or metrics with which to manage the process. The need for such metrics is particularly acute when an organization is adopting a new technology for which established practices have yet to be developed. This research addresses these needs through the development and implementation of a new suite of metrics for OO design. Metrics developed in previous research, while contributing to the field's understanding of software development processes, have generally been subject to serious criticisms, including the lack of a theoretical base. Following Wand and Weber (1989), the theoretical base chosen for the metrics was the ontology of Bunge (1977). Six design metrics are developed, and then analytically evaluated against Weyuker's (1988) proposed set of measurement principles. An automated data collection tool was then developed and implemented to collect an empirical sample of these metrics at two field sites in order to demonstrate their feasibility and suggest ways in which managers may use these metrics for process improvement.
TL;DR: This paper recommends benchmarking selection and evaluation methodologies, and introduces the DaCapo benchmarks, a set of open source, client-side Java benchmarks that improve over SPEC Java in a variety of ways, including more complex code, richer object behaviors, and more demanding memory system requirements.
Abstract: Since benchmarks drive computer science research and industry product development, which ones we use and how we evaluate them are key questions for the community. Despite complex runtime tradeoffs due to dynamic compilation and garbage collection required for Java programs, many evaluations still use methodologies developed for C, C++, and Fortran. SPEC, the dominant purveyor of benchmarks, compounded this problem by institutionalizing these methodologies for their Java benchmark suite. This paper recommends benchmarking selection and evaluation methodologies, and introduces the DaCapo benchmarks, a set of open source, client-side Java benchmarks. We demonstrate that the complex interactions of (1) architecture, (2) compiler, (3) virtual machine, (4) memory management, and (5) application require more extensive evaluation than C, C++, and Fortran which stress (4) much less, and do not require (3). We use and introduce new value, time-series, and statistical metrics for static and dynamic properties such as code complexity, code size, heap composition, and pointer mutations. No benchmark suite is definitive, but these metrics show that DaCapo improves over SPEC Java in a variety of ways, including more complex code, richer object behaviors, and more demanding memory system requirements. This paper takes a step towards improving methodologies for choosing and evaluating benchmarks to foster innovation in system design and implementation for Java and other managed languages.
1,561 citations
"The Qualitas Corpus: A Curated Coll..." refers background in this paper
...[18], which consists of a set of open source, real world Java applications with non-trivial memory loads....
TL;DR: The infrastructure that is being designed and constructed to support controlled experimentation with testing and regression testing techniques is described and the impact that this infrastructure has had and can be expected to have.
Abstract: Where the creation, understanding, and assessment of software testing and regression testing techniques are concerned, controlled experimentation is an indispensable research methodology. Obtaining the infrastructure necessary to support such experimentation, however, is difficult and expensive. As a result, progress in experimentation with testing techniques has been slow, and empirical data on the costs and effectiveness of techniques remains relatively scarce. To help address this problem, we have been designing and constructing infrastructure to support controlled experimentation with testing and regression testing techniques. This paper reports on the challenges faced by researchers experimenting with testing techniques, including those that inform the design of our infrastructure. The paper then describes the infrastructure that we are creating in response to these challenges, and that we are now making available to other researchers, and discusses the impact that this infrastructure has had and can be expected to have.
TL;DR: The following section describes the tools built to test the utilities, including the fuzz (random character) generator, ptyjig (to test interactive utilities), and scripts to automate the testing process.
Abstract: The following section describes the tools we built to test the utilities. These tools include the fuzz (random character) generator, ptyjig (to test interactive utilities), and scripts to automate the testing process. Next, we will describe the tests we performed, giving the types of input we presented to the utilities. Results from the tests will follow along with an analysis of the results, including identification and classification of the program bugs that caused the crashes. The final section presents concluding remarks, including suggestions for avoiding the types of problems detected by our study and some commentary on the bugs we found. We include an Appendix with the user manual pages for fuzz and ptyjig.
1,110 citations
"The Qualitas Corpus: A Curated Coll..." refers background in this paper
...[8] studied about 90 Unix applications (including emacs, TeX, LaTeX, yacc) to determine how they responded to input....
TL;DR: This book discusses methods in corpus linguistics: interpreting concordance lines, applications of corpora in applied linguistics, and more.
Abstract: Corpus linguistics is leading to the development of theories about language which challenge existing orthodoxies in applied linguistics. However, there are also many questions which should be examined and debated: how big should a corpus be? Is the data from a corpus reliable? What are its applications for language teaching? Corpora in Applied Linguistics examines these and other questions related to this emerging field. It discusses these important issues and explores the techniques of investigating a corpus, as well as demonstrating the application of corpora in a wide variety of fields. It also outlines the impact corpus linguistics is having on how languages are taught in the classroom and how it is informing language teaching materials and dictionaries. It makes a superb and accessible introduction to corpus linguistics and is a must read for anyone interested in corpus linguistics and its impact on applied linguistics.
1,002 citations
"The Qualitas Corpus: A Curated Coll..." refers background in this paper
...Hunston observes that there are limitations on the use of corpora [2]....
[...]
...Hunston [2] opens her book with “It is no exagger-...
Q1. What have the authors contributed in "The qualitas corpus: a curated collection of java code for empirical studies" ?
The authors describe the Qualitas Corpus, a large curated collection of open source Java systems. The authors discuss its design, organisation, and issues associated with its development.
Q2. What are the future works in "The qualitas corpus: a curated collection of java code for empirical studies" ?
This will also allow for adding other kinds of metadata in the future. Their plans for the future of the corpus include growing it in size and representativeness (section V), making it easier to use for studies, and providing more “value add” in terms of metadata. The authors would like to include some of these measurements as part of the metadata in the future. One consequence of those outside the University of Auckland group using the corpus has been suggestions for systems to add.