The Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies
Summary
Introduction
- Keywords: empirical studies; curated code corpus; experimental infrastructure.
- Measurement is fundamental to engineering; however, its use in engineering software has been limited.
- The authors need models explaining the relationship between the measurements and the quality attributes, and they need experiments to validate those models.
- While the goals of applied linguistics research are not exactly the same as ours, the similarities are close enough to warrant examining how corpora are used in that field.
A. Empirical studies of Code
- By “empirical study of code” the authors mean a study in which the artifacts under investigation consist of source code, there are multiple, unrelated artifacts, and the artifacts were developed independently of the study.
- They identified the systems studied, but did not identify the versions for all systems.
- There are several issues with these studies, however.
- SIR provides a curated set of artifacts, including the code, test suites, and fault data.
C. The need for curation
- If two studies that analyse code give conflicting reports of some phenomena, one obvious possible explanation is that the studies were applied to different samples.
- Such code may not be representative of the deployed code, and so could bias the results of the study.
- Different systems organise their source code in different ways.
- Also, some systems provide their own implementations of some third-party libraries, further complicating what is system code and what is not.
- According to Hunston, the content of a corpus primarily depends on the purpose it is used for, and there are usually questions specific to a purpose that must be addressed in the design of the corpus.
A. Organisation
- Each version consists of the original distribution and two “unpacked” forms, bin and src.
- The original distribution is provided exactly as downloaded from the system’s download site.
- First, it means the authors can distribute the corpus without creating the bin and src forms, as they can be automatically created from the distributed forms, thus reducing the size of the corpus distribution.
- The authors use a standard naming convention to identify systems and versions.
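Because the bin and src forms can be regenerated from the original distribution, a tool working with an installed copy may first want to know which unpacked forms are already present. The following is a minimal sketch of such a check. The function name `missing_unpacked_forms` is hypothetical, and the assumption that bin and src sit as subdirectories directly under a version's directory is ours, not something this summary confirms.

```python
from pathlib import Path

# The two unpacked forms named in the text. Their placement directly under
# the version directory is an assumption made for this sketch.
UNPACKED_FORMS = ("bin", "src")


def missing_unpacked_forms(version_dir):
    """Return the unpacked forms ("bin", "src") not present under version_dir."""
    root = Path(version_dir)
    return [form for form in UNPACKED_FORMS if not (root / form).is_dir()]
```

A caller could use the returned list to decide which forms still need to be created from the original distribution before an analysis is run.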
B. Contents
- Figure 2 lists the systems that are currently represented in the corpus.
- Figure 3 gives an idea of how big the systems are, when listing the latest version of each system in the current release in order of number of top-level types (that is, classes, interfaces, enums, and annotations).
- For the most part, the systems in the corpus are open source and so the corpus can contain their distributions, especially as what is in the corpus is exactly what was downloaded from the system download site.
- Since jre is an interesting system to analyse, the authors consider it part of the corpus; however, corpus users must download what they need from the Java distribution site.
- What is provided by the corpus is the metadata similar to that for other systems.
C. Criteria for inclusion
- This allows people to have the latest release and yet still be able to reproduce studies based on previous releases.
- One advantage with Java is that its “compiled” form is also fairly easy to analyse, easier than the source code in fact (section IV-E); however, there are slight differences between the source and binary forms.
- This criterion will probably be the first to go away completely.
- Some systems the authors used (and analysed) before the first external release of the corpus have suffered this fate, and so are not in the corpus.
- In fact the authors already have the situation where the version of a system they have in the corpus is now apparently no longer available, as the developers only appear to keep (or make available at least) the most recent versions.
D. Metadata
- As part of the curation process the authors gather metadata about each system version, and one of their near-term goals is to increase this information (section IV-J).
- 4, its sourcepackages value is “org.gudy com.aelitis”, indicating that types such as com.aelitis.
- Other metadata the authors keep includes the release date of the version, notes regarding the system and individual versions, domain information, and where the system distribution came from.
- The latter allows users of the corpus to check corpus contents for themselves.
- Another issue is what to do when systems stop being supported or otherwise become unavailable.
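The sourcepackages metadata mentioned above can be read as a list of package prefixes that identify which types belong to a system. The sketch below shows one way a study script might apply it. The function names are hypothetical, and the matching rule (a whitespace-separated list of prefixes, matched on package boundaries) is an assumption inferred from the “org.gudy com.aelitis” example rather than a documented specification.

```python
def parse_sourcepackages(value):
    """Split a sourcepackages value such as "org.gudy com.aelitis" into prefixes."""
    return value.split()


def is_system_type(qualified_name, prefixes):
    """Return True if the fully qualified type name falls under one of the prefixes.

    Matching is done on package boundaries (an assumption), so the prefix
    "com.aelitis" does not match a name starting with "com.aelitisx".
    """
    return any(
        qualified_name == prefix or qualified_name.startswith(prefix + ".")
        for prefix in prefixes
    )
```

For instance, with `prefixes = parse_sourcepackages("org.gudy com.aelitis")`, a made-up type name such as `com.aelitis.Example` would be counted as system code, while `org.apache.Example` would be treated as third-party code.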
F. Content Management
- Following criterion 1, a new release contains all the versions of systems in the previous release.
- There are, however, some changes between releases.
- The authors have developed processes over time to support the management of the corpus.
- The two main processes are for making a new entry of a version of a system into the corpus, and creating a distribution for release.
- In the early days, these were all manual, but now, with each new release, scripts are being developed to automate more parts of the process.
G. Distributing the Corpus
- To install a copy, one acquires a distribution for a particular release.
- One distribution contains just the most recent version of each system in the corpus.
- For those interested in just “breadth” studies, this distribution is simpler to deal with (and much smaller to download).
- Releases are identified by their date of release (in ISO 8601 format).
- The current release is 20090202 and the distribution containing only the most recent versions of systems is 20090202r.
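Release identifiers of the kind shown above (an ISO 8601 basic-format date, with a trailing "r" for the distribution holding only the most recent versions) are easy to validate mechanically. The following is a minimal sketch. The names `ReleaseId` and `parse_release_id` are hypothetical, and the rule that "r" is the only possible suffix is an assumption based on the two identifiers quoted here.

```python
import re
from datetime import date, datetime
from typing import NamedTuple


class ReleaseId(NamedTuple):
    released: date      # release date encoded in the identifier
    recent_only: bool   # True for the "r" distribution (most recent versions only)


# Eight digits (YYYYMMDD) optionally followed by "r"; the suffix rule is assumed.
_RELEASE_PATTERN = re.compile(r"([0-9]{8})(r?)")


def parse_release_id(text):
    """Parse an identifier such as "20090202" or "20090202r"."""
    match = _RELEASE_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a release identifier: {text!r}")
    # strptime raises ValueError for impossible dates such as 20091341.
    released = datetime.strptime(match.group(1), "%Y%m%d").date()
    return ReleaseId(released, match.group(2) == "r")
```

Recording the parsed identifier alongside study results gives a compact way to state which release and distribution were analysed.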
H. Using the corpus
- A properly-installed distribution has the structure described in section IV-A.
- If every study is performed on the complete contents of a given release, using the metadata provided in the corpus to identify the contents of a system (in particular sourcepackages, section IV-D), then the results of those studies can be compared with good confidence that the comparison is meaningful.
- Furthermore, what is actually studied can be described succinctly just by indicating the release (and, if necessary, the particular distribution) used.
- There is, however, no restriction on how the corpus can be used.
- In such cases, in addition to identifying the release, the authors recommend that either what has been included be identified by listing the system versions used, or what has been left out be similarly identified.
I. History
- The Qualitas Corpus was initially conceived and developed by one of us for Ph.D. research during 2005.
- The original corpus was used and added to by members of the University of Auckland group over the next three years, growing from 21 systems initially.
- It was made available for external release in January 2008, containing 88 systems, 21 of them with multiple versions, for a total of 214 entries.
- The main changes have been in terms of the metadata that is maintained; however, there has also been a change in terminology.
J. Future Plans
- As noted earlier, the next release is scheduled for July 2010.
- As well as about 90 new versions of existing systems (but at this point, no new systems), the main change will be the addition of significantly more metadata.
- Column 4 indicates whether the entry corresponds to a type identified as being in the system (that is, matches the sourcepackages value), with 0 indicating it does.
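The column convention described above (a 0 in column 4 marks a type that is part of the system) lends itself to a simple filter. The sketch below assumes the metadata is available as delimited text lines with tab as the separator. That file format, the function name `system_entries`, and the contents of the other columns are all assumptions, since this summary only describes column 4.

```python
def system_entries(lines, delimiter="\t"):
    """Yield the fields of each entry whose fourth column is "0".

    A "0" in column 4 means the type matches the sourcepackages value,
    that is, it is considered part of the system.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # Skip malformed rows with fewer than four columns.
        if len(fields) >= 4 and fields[3].strip() == "0":
            yield fields
```

Passing an open file object as `lines` would stream through the entries without loading the whole metadata file into memory.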
Frequently Asked Questions (2)
Q2. What future work is planned in "The Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies"?
This will also allow for adding other kinds of metadata in the future. Their plans for the future of the corpus include growing it in size and representativeness (section V), making it easier to use for studies, and providing more “value add” in terms of metadata. The authors would like to include some of these measurements as part of the metadata in the future. One consequence of those outside the University of Auckland group using the corpus has been suggestions for systems to add.