Proceedings Article

The Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies

Ewan Tempero¹, Craig Anslow², Jens Dietrich³, Ted Han¹, Jing Li¹, Markus Lumpe⁴, Hayden Melton¹, James Noble²

University of Auckland¹, Victoria University of Wellington², Massey University³, Swinburne University of Technology⁴

30 Nov 2010, pp. 336-345

TL;DR: The Qualitas Corpus, a large curated collection of open source Java systems, is described, which reduces the cost of performing large empirical studies of code and supports comparison of measurements of the same artifacts.


Abstract: In order to increase our ability to use measurement to support software development practice we need to do more analysis of code. However, empirical studies of code are expensive and their results are difficult to compare. We describe the Qualitas Corpus, a large curated collection of open source Java systems. The corpus reduces the cost of performing large empirical studies of code and supports comparison of measurements of the same artifacts. We discuss its design, organisation, and issues associated with its development.


Summary


Introduction

Keywords: Empirical studies; curated code corpus; experimental infrastructure
Measurement is fundamental to engineering, however its use in engineering software has been limited.
The authors need models explaining the relationship between the measurements and the quality attributes, and they need experiments to validate those models.
While the goals of applied linguistics research are not exactly the same as ours, the similarities are close enough to warrant examining how corpora are used in that field.

A. Empirical studies of Code

By “empirical study of code” the authors mean a study in which the artifacts under investigation consist of source code, there are multiple, unrelated, artifacts, and the artifacts were developed independently of the study.
They identified the systems studied, but did not identify the versions for all systems.
There are several issues with these studies however.
SIR provides a curated set of artifacts, including the code, test suites, and fault data.

C. The need for curation

If two studies that analyse code give conflicting reports of some phenomena, one obvious possible explanation is that the studies were applied to different samples.
Such code may not be representative of the deployed code, and so could bias the results of the study.
Different systems organise their source code in different ways.
Also, some systems provide their own implementations of some third-party libraries, further complicating what is system code and what is not.
According to Hunston, the content of a corpus primarily depends on the purpose it is used for, and there are usually questions specific to a purpose that must be addressed in the design of the corpus.

A. Organisation

Each version consists of the original distribution and two “unpacked” forms, bin and src.
The original distribution is provided exactly as downloaded from the system’s download site.
First, it means the authors can distribute the corpus without creating the bin and src forms, as they can be automatically created from the distributed forms, thus reducing the size of the corpus distribution.
The authors use a standard naming convention to identify systems and versions.
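The layout described above can be sketched in a few lines. This is only an illustration under stated assumptions: the entry format `<system>-<version>` is inferred from names such as ant-1.7.1, the directory names `Systems`, `compressed`, `bin` and `src` follow Figure 1, and the root name `QualitasCorpus` and the helper names `split_entry` and `version_dirs` are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, Tuple

# Assumed form names, taken from the organisation described in the text.
FORMS = ("compressed", "bin", "src")


def split_entry(entry: str) -> Tuple[str, str]:
    """Split an entry name at its first hyphen into (system, version).

    Assumes system names contain no hyphen; versions may (e.g. "R1-0.0").
    """
    system, sep, version = entry.partition("-")
    if not sep or not system or not version:
        raise ValueError(f"not a <system>-<version> name: {entry!r}")
    return system, version


def version_dirs(root: str, entry: str) -> Dict[str, PurePosixPath]:
    """Return the assumed directory of each form for one system version."""
    system, _version = split_entry(entry)
    base = PurePosixPath(root) / "Systems" / system / entry
    return {form: base / form for form in FORMS}
```

For example, `version_dirs("QualitasCorpus", "ant-1.7.1")["src"]` yields the path `QualitasCorpus/Systems/ant/ant-1.7.1/src`. Splitting at the first hyphen keeps version strings that themselves contain hyphens intact.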

B. Contents

Figure 2 lists the systems that are currently represented in the corpus.
Figure 3 gives an idea of how big the systems are, listing the latest version of each system in the current release in order of number of top-level types (that is, classes, interfaces, enums, and annotations).
For the most part, the systems in the corpus are open source and so the corpus can contain their distributions, especially as what is in the corpus is exactly what was downloaded from the system download site.
Since jre is an interesting system to analyse, the authors consider it part of the corpus; however, corpus users must download what they need from the Java distribution site.
What is provided by the corpus is the metadata similar to that for other systems.

C. Criteria for inclusion

This allows people to have the latest release and yet still be able to reproduce studies based on previous releases.
One advantage with Java is that its “compiled” form is also fairly easy to analyse, easier than for the source code in fact (section IV-E); however, there are slight differences between the source and binary forms.
This criterion will probably be the first to completely go away.
Some systems the authors used (and analysed) before the first external release of the corpus have suffered this fate, and so are not in the corpus.
In fact the authors already have the situation where the version of a system they have in the corpus is now apparently no longer available, as the developers only appear to keep (or make available at least) the most recent versions.

D. Metadata

As part of the curation process the authors gather metadata about each system version, and one of their near-term goals is to increase this information (section IV-J).
4, its sourcepackages value is “org.gudy com.aelitis”, indicating that types such as com.aelitis.
Other metadata the authors keep includes the release date of the version, notes regarding the system and individual versions, domain information, and where the system distribution came from.
The latter allows users of the corpus to check corpus contents for themselves.
Another issue is what to do when systems stop being supported or otherwise become unavailable.
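To make the role of the sourcepackages metadata concrete, the following minimal sketch tests whether a fully qualified type name belongs to a system. It assumes that the value is a space-separated list of package prefixes, which is how the example value “org.gudy com.aelitis” reads; the function name `in_system` and the type names in the usage example are hypothetical.

```python
def in_system(type_name: str, sourcepackages: str) -> bool:
    """Return True if type_name lies in one of the listed package prefixes.

    Assumption: sourcepackages is a space-separated list of package
    prefixes, e.g. "org.gudy com.aelitis".
    """
    prefixes = sourcepackages.split()
    # Require a dot after the prefix so "org.gudyx.Foo" does not match "org.gudy".
    return any(type_name.startswith(prefix + ".") for prefix in prefixes)
```

With the value above, `in_system("com.aelitis.Example", "org.gudy com.aelitis")` is true, while a library type such as `java.util.List` is rejected. Matching on whole package segments rather than raw string prefixes avoids accidentally including unrelated packages.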

F. Content Management

Following criterion 1, a new release contains all the versions of systems in the previous release.
There are however some changes between releases.
The authors have developed processes over time to support the management of the corpus.
The two main processes are for making a new entry of a version of a system into the corpus, and creating a distribution for release.
In the early days, these were all manual, but now, with each new release, scripts are being developed to automate more parts of the process.

G. Distributing the Corpus

To install the copy one acquires a distribution for a particular release.
One distribution contains just the most recent version of each system in the corpus.
For those interested in just “breadth” studies, this distribution is simpler to deal with (and much smaller to download).
Releases are identified by their date of release (in ISO 8601 format).
The current release is 20090202 and the distribution containing only the most recent versions of systems is 20090202r.
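Because release identifiers are dates in ISO 8601 basic format, they can be parsed and ordered mechanically. The sketch below is an illustration only: it assumes an identifier is eight digits optionally followed by a single lowercase letter naming a particular distribution (as in 20090202r), and the function name `parse_release` is hypothetical.

```python
import re
from datetime import date, datetime
from typing import Tuple

# Assumed shape: YYYYMMDD plus an optional one-letter distribution suffix.
_RELEASE = re.compile(r"([0-9]{8})([a-z]?)")


def parse_release(identifier: str) -> Tuple[date, str]:
    """Split a release identifier into (release date, distribution suffix)."""
    match = _RELEASE.fullmatch(identifier)
    if match is None:
        raise ValueError(f"unrecognised release identifier: {identifier!r}")
    # strptime also rejects impossible dates such as 20091340.
    day = datetime.strptime(match.group(1), "%Y%m%d").date()
    return day, match.group(2)
```

Under these assumptions `parse_release("20090202r")` returns the date 2 February 2009 together with the suffix `"r"`, and the plain identifier returns the same date with an empty suffix.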

H. Using the corpus

A properly-installed distribution has the structure described in section IV-A.
If every study is performed on the complete contents of a given release, using the metadata provided in the corpus to identify the contents of a system (in particular sourcepackages, section IV-D), then the results of those studies can be compared with good confidence that comparison is meaningful.
Furthermore, what is actually studied can be described succinctly just by indicating the release (and if necessary, particular distribution) used.
There is, however, no restriction on how the corpus can be used.
In such cases, in addition to identifying the release, the authors recommend that either what has been included be identified by listing the system versions used, or what has been left out similarly identified.

I. History

The Qualitas Corpus was initially conceived and developed by one of us for Ph.D. research during 2005.
The original corpus was used and added to by members of the University of Auckland group over the next three years, growing from 21 systems initially.
It was made available for external release in January of 2008, containing 88 systems, 21 systems with multiple versions, a total of 214 entries.
The main changes have been in terms of the metadata that is maintained, however there has also been a change in terminology.

J. Future Plans

As noted earlier, the next release is scheduled for July 2010.
As well as about 90 new versions of existing systems (but at this point, no new systems), the main change will be the addition of significantly more metadata.
Column 4 indicates whether the entry corresponds to a type identified as being in the system (that is, matches the sourcepackages value), with 0 indicating it does.


Figures (4)

Figure 4. Metadata for system version content details for ant-1.7.1. Some names have been elided for space.

Table I DOMAINS REPRESENTED IN THE CORPUS.

Figure 3. Distribution of sizes of systems.

Figure 1. Organisation of Qualitas Corpus.


The Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies

Ewan Tempero∗, Craig Anslow§, Jens Dietrich†, Ted Han∗, Jing Li∗, Markus Lumpe‡, Hayden Melton∗, James Noble§

∗ Department of Computer Science, The University of Auckland, Auckland, New Zealand. e.tempero@cs.auckland.ac.nz

† Massey University, School of Engineering and Advanced Technology, Palmerston North, New Zealand. j.b.dietrich@massey.ac.nz

‡ Faculty of Information & Communication Technologies, Swinburne University of Technology, Hawthorn, Australia. mlumpe@ict.swin.edu.au

§ School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand. kjx@ecs.vuw.ac.nz

Abstract: In order to increase our ability to use measurement to support software development practice we need to do more analysis of code. However, empirical studies of code are expensive and their results are difficult to compare. We describe the Qualitas Corpus, a large curated collection of open source Java systems. The corpus reduces the cost of performing large empirical studies of code and supports comparison of measurements of the same artifacts. We discuss its design, organisation, and issues associated with its development.

Keywords: Empirical studies; curated code corpus; experimental infrastructure

I. INTRODUCTION

Measurement is fundamental to engineering, however its use in engineering software has been limited. While many software metrics have been proposed (e.g. [1]), few are regularly used in industry to support decision making. A key reason for this is that our understanding of the relationship between measurements we know how to make and quality attributes, such as modifiability, understandability, extensibility, reusability, and testability, that we care about is poor. This is particularly true with respect to theories regarding characteristics of software structure such as encapsulation, inheritance, coupling and cohesion. Traditional engineering disciplines have had hundreds or thousands of years of experience of comparing measurements with quality outcomes, but central to this experience is the taking and sharing of measurements and outcomes. In contrast there have been few useful measurements of code. In this paper we describe the Qualitas Corpus, infrastructure that supports taking and sharing measurements of code artifacts.

Barriers to measuring code and understanding what the measurements mean include access to code to measure and the tools to do the measurement. The advent of open source software (OSS) has meant significantly more code is now accessible for measurement than in the past. This has led to an increase in interest in empirical studies of code. However, there is still a non-trivial cost to gathering the artifacts from enough OSS projects to make a study useful. One of the main goals of the Qualitas Corpus is to substantially reduce the cost of performing large empirical studies of code.

However, just measuring code is not enough. We need models explaining the relationship between the measurements and the quality attributes, and we need experiments to validate those models. Validation does not come through a single experiment; experiments must be replicated. Replication requires at least understanding of the relationship between the artifacts used in the different experiments. In some forms of experiments, we want to use the same artifacts so as to be able to compare results in a meaningful way. This means we need to know in detail what artifacts are used in any experiment, meaning an ad hoc collection of code whose contents are unknown is not sufficient. What is needed is a curated collection of code artifacts. A second goal of the Qualitas Corpus is to support comparison of measurements of the same artifacts, that is, to provide a reference corpus for empirical studies of code.

The contributions of this paper are:

• We present arguments for the provision of a reference corpus of code for empirical studies of code.

• We identify the issues regarding performing replication of studies that analyse Java code.

• We describe the Qualitas Corpus, a curated collection of Java code that reduces the cost and increases the replicability of empirical studies.

The rest of the paper is organised as follows. In the next section we present the motivation for our work, which includes inspiration from the use of corpora in applied linguistics and the limited empirical studies of code that have been performed. We also discuss the use of reference collections in other areas of software engineering and in computer science, and discuss the need for a curated collection of code. In section III we discuss the challenges faced when doing empirical studies of code, and from that, determine the requirements of a curated corpus. Section IV presents the details of the Qualitas Corpus, its current organisation, immediate future plans, and rationale of the decisions we have taken. Section V evaluates the Qualitas Corpus. Finally we present our conclusions in section VI.

II. MOTIVATION AND RELATED WORK

The use of a standard collection of artifacts to support study in an area is not new, neither in general nor in software engineering. One area is that of applied linguistics, where standard corpora are the basis for much of the research being done. Hunston [2] opens her book with “It is no exaggeration to say that corpora, and the study of corpora, have revolutionised the study of language, and of the applications of language, over the last few decades.” Ironically, it is the availability of software systems support for language corpora that has enabled this form of research, whereas researchers examining code artifacts have been slow to adopt this idea. While the goals of applied linguistics research are not exactly the same as ours, the similarities are close enough to warrant examining how corpora are used in that field. Their use of corpora is a major motivation for the Qualitas Corpus. We will discuss language corpora in more detail in section III.

A. Empirical studies of Code

To answer the question of whether a code corpus is necessary, we sample past empirical studies of code. By “empirical study of code” we mean a study in which the artifacts under investigation consist of source code, there are multiple, unrelated, artifacts, and the artifacts were developed independently of the study. This rules out, for example, studies that included the creation of the code artifacts, such as those by Briand et al. [3] or Lewis et al. [4], and studies of one system, such as that by Barry [5].

Empirical studies of code have been performed for at least four decades. As with many other things, Knuth was one of the first to carry out empirical studies to understand what code that is actually written looks like [6]. He presented a static analysis of over 400 FORTRAN programs, totalling about 250,000 cards, and dynamic analysis of about 25 programs. He chose programs that could “run to completion” from job submissions to Stanford’s Computation Center, various subroutine libraries and scientific packages, contributions from IBM, and personal programs. His main motivation was compiler design, with the concern that compilers may not optimise for the typical case as no-one knew what the typical case was. The programs used were not identified.

In another early example, Chevance and Heidet studied 50 COBOL programs also looking at how language features are used [7]. The programs were also not identified and no details were given of size.

Open source software has existed for several decades, with systems such as Unix, emacs, and TeX. Their use in empirical studies is relatively recent. For example, Miller et al. [8] studied about 90 Unix applications (including emacs, TeX, LaTeX, yacc) to determine how they responded to input. Frakes and Pole [9] used Unix tools as the basis for a study on methods for searching for reusable components.

During the 1990s the number of accessible systems increased, particularly those written in C++, and consequently the number of studies increased. Chidamber and Kemerer applied their metrics to two systems, one had 634 C++ classes, the other had 1459 Smalltalk classes [1]. No further information on the systems was given.

Bieman and Zhao studied inheritance in 19 C++ systems, ranging from 7 classes to 922 classes in size, with 2744 classes in total [10]. They identified the systems studied, but did not identify the versions for all systems.

Harrison et al. applied two coupling metrics to five collections of C++ code, consisting of 96, 197, 113, 61, and 12 classes respectively [11]. They identified the systems involved but not the versions studied.

Chidamber et al. studied three systems, one with 45 C++ classes, one with 27 Objective C classes, and one identifying 25 classes in design documents [12]. They were required to restrict information about the systems studied for commercial reasons.

By the end of the millennium, repositories supporting open source development such as sourceforge, as well as the increase in effectiveness of Internet search systems, meant a large number of systems were accessible. This affected both the number of studies done, and often their size. A representative set of examples includes one with 3 fairly large Java systems [13], a study of 14 Java systems [14], and a study of 35 systems, from several languages including Java, C++, Self, and Smalltalk [15].

Two particularly large studies were by Succi et al. [16] and Collberg et al. [17]. Succi et al. studied 100 Java and 100 C++ applications. The Java applications ranged from 28 to 936 classes in size (median 83.5) and the C++ applications ranged from 30 to 2520 classes (median 59). The actual applications were not identified. Collberg et al. analysed 1132 Java jar files collected from the Internet. According to their statistics they analysed a total of 102,688 classes and 12,188 interfaces. No information was given as to what applications were analysed.

The studies described above suggest that there is interest in doing studies that involve analysing code and the ability to do such studies has significantly advanced our knowledge about the characteristics of code structure. There are several issues with these studies however. The first is that none of these studies use the same set of systems, making it difficult to compare or combine results. Another is that because full details of the systems analysed are not provided, we are limited in our ability to replicate them. A third issue is that it is not clear that even the authors are fully aware of what they have studied, which we discuss further below. Finally, while the authors have gone to some effort to gather the artifacts needed for their study, few others are able to benefit from that effort, meaning each new study requires duplicated effort. The Qualitas Corpus addresses these issues.

B. Infrastructure for empirical studies

Of course the use of standard collections of artifacts to support research in computer science and software engineering is not new. The use of benchmarks for various forms of performance testing and comparison is very mature. One recent example is the DaCapo benchmark suite by Blackburn et al. [18], which consists of a set of open source, real world Java applications with non-trivial memory loads. Another example of research infrastructure is the New Zealand Digital Library project, whose aim is to develop the technology for the creation of digital libraries and make it available publicly so that others can use it [19].

There are also some examples in Software Engineering. One is the Software-artifact Infrastructure Repository (SIR) [20]. The explicit goal of SIR is to support controlled experimentation in software testing techniques. SIR provides a curated set of artifacts, including the code, test suites, and fault data. SIR represents the kind of support the Qualitas Corpus is intended to provide. We discuss SIR’s motivation in section III.

Bajracharya et al. describe Sourcerer, which provides infrastructure to support code search [21]. At the time of publication, the Sourcerer database held 1500 real-world open source projects, a total of 254,049 Java classes, gathered from Sourceforge. Their goals are different to ours, but it does give an indication as to what is available.

Finally, we must mention the Purdue Benchmark Suite. This was described by Grothoff et al. in support of their work on confined types [22]. It consisted of 33 Java systems, 5 with more than 200 classes, and a total of 46,165 classes. At the time it was probably the largest organised collection of Java code, and was the starting point for our work.

C. The need for curation

If two studies that analyse code give conflicting reports of some phenomena, one obvious possible explanation is that the studies were applied to different samples. If the two studies claimed to be analysing the same set of systems, we might suspect error somewhere, although it could just be that the specific versions analysed were different. In fact, even if we limit our sample to be from open source Java systems, there is still room for variation even within specific versions, as we will now discuss.

In an ideal world, it would be sufficient for a researcher to just analyse what was provided on the system’s download website. However, it is not that simple. Open source Java systems come in both deployable (“binary”) and source versions of the code. While we are interested in analysing the source code, in some cases it is easier to analyse the binary version. However, it is frequently the case that what is distributed in the source version is not the same as what is in the binary version. The source often includes “infrastructure” code, such as that used for testing, code demonstrating aspects of the system, and code that supports the installation, building, or other management tasks of the code. Such code may not be representative of the deployed code, and so could bias the results of the study.

In some cases, this extra code can be a significant proportion of what is available. For example, jFin_DateMath version R1-0.0 has 109 top-level non-test classes and 38 JUnit test classes. If the goal of a study is to characterise how inheritance is used, then the JUnit classes (which extend TestCase) could bias the result. Another example is fitjava version 1.1, which has 37 top level classes, and, in addition, 22 example classes. If there are many example classes, which are typically quite simple, then they would bias the results in a study to characterise some aspect of the complexity of the system design.
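The size of this effect follows directly from the class counts quoted above. As a small worked example (the function name `extra_share` is hypothetical), the share of extra classes among all classes found in a distribution is:

```python
def extra_share(system_classes: int, extra_classes: int) -> float:
    """Fraction of all classes that are extra (test or example) classes."""
    total = system_classes + extra_classes
    if total == 0:
        raise ValueError("no classes to compare")
    return extra_classes / total


# Counts quoted in the text above.
jfin = extra_share(109, 38)    # jFin_DateMath R1-0.0: JUnit test classes
fitjava = extra_share(37, 22)  # fitjava 1.1: example classes
print(f"jFin_DateMath: {jfin:.1%}, fitjava: {fitjava:.1%}")
```

On these figures, roughly 26% of the classes in jFin_DateMath and roughly 37% of those in fitjava are not part of the system proper, so a study that analysed everything in the distribution would draw a substantial part of its sample from infrastructure code.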

Another issue is identifying the infrastructure code. Different systems organise their source code in different ways. In many cases, the source code is organised as different source directories, one for the system source, one for the test infrastructure, one for examples, and so on. However, there are many other organisations. For example, gt2 version 2.2-rc3 has nearly 90 different source directories, of which only about 40 contain source code that is distributed in binary form.

The presence of infrastructure code means that a decision has to be made as to what exactly to analyse. Without careful investigation, researchers may not even be aware that the infrastructure code exists and that a decision needs to be made. If this decision is not reported, then it impacts other researchers’ ability to replicate the study. It may be possible to avoid this problem by just analysing the binary form of the system, as this can be expected to represent how the system was built. Unfortunately, some systems do include infrastructure code in the deployed form.

Another complication is third-party libraries. Since such software is usually not under the control of the developers of the system, including it in the analysis would be misleading in terms of understanding what decisions have been made by developers. Some systems include these libraries in their distribution and some do not. Also, different systems can use the same libraries. This means that third-party library use must be identified, and where appropriate, excluded from the analysis, to avoid bias due to double counting.

Identifying third-party libraries is not easy. Some systems are deployed as many archive (jar) files, meaning it is quite time-consuming to determine which are third-party libraries and which are not. For example, compiere version 250d has 114 archive files in its distribution. Complicating the identification of third-party libraries is the fact that some systems have such libraries packaged along with the system code, that is, the library binary code has been unpacked and then repacked with the binary system code. This means excluding library code is not just a matter of leaving out the relevant archive file.

Some systems are careful to identify what third-party systems are included in the distribution (eclipse for example). However, usually this is in a simple text document that must be processed by a human, and so some judgement is needed.

Another means to determine what to analyse might be to look at the code that appears in both source and binary form. Since there is no need for third-party source to be distributed, we might reasonably expect it would only appear in binary form. However, this is not the case. Some systems do in fact distribute what appears to be original source of third-party libraries (for example compiere version 250d has a copy of the Apache Element Construction Set¹ that differs only in one class and that only by a few lines). Also, some systems provide their own implementations of some third-party libraries, further complicating what is system code and what is not.

¹ http://jakarta.apache.org/ecs

In conclusion, to study the code from a collection of systems it is not sufficient to just analyse the downloaded code, whether it is binary or the original source. Decisions need to be made regarding exactly what is going to be analysed. If these decisions are not reported, then the results may be difficult to analyse (or even fully evaluate). If the decisions are reported, then anyone wanting to replicate the study has, as well as having to recreate the collection, the additional burden of accurately recreating the decisions.

If the collection is curated, that is, the contents are organised and clearly identified, then the issues described above can be more easily managed. This is the purpose of the Qualitas Corpus.

III. DESIGNING A CORPUS

In discussing the need for the Software-artifact Infrastructure Repository (SIR), Do et al. identified five challenges that need to be addressed to support controlled experimentation: supporting replicability across experiments; supporting aggregation of findings; reducing the cost of controlled experiments; obtaining sample representativeness; and isolating the effects of individual factors [20]. Their conclusion was that these challenges could be addressed to one degree or other by creating a collection of relevant artifacts.

When collecting artifacts, the target of those artifacts must be kept in mind. Researchers use the artifacts in SIR to determine the effectiveness of techniques and tools for testing software, that is, the artifacts themselves are not the objects of study. Similarly, benchmarks are also a collection of artifacts where they are not the object of study, but provide input to systems whose performance is the object of study. While any collection of code may be used for a variety of purposes, our interest is in the code itself, and so we refer to our collection as a corpus.

Corpora are now commonly used in linguistics and there are many used in that area, such as the International Corpus of English [23]. The development of standard corpora for various kinds of linguistics work is an area of research in itself. Hunston says the main argument for using a corpus is that it provides a reliable guide to what language is like, more reliable than the intuition of native speakers [2, p20]. This applies to programming languages as well. While both research and trade literature contain many claims about use of programming language features, code corpora could be used to provide evidence for such claims.

Hunston lists four aspects that should be considered when designing a corpus: size, content, representativeness, and permanence. Regarding size, she makes the point that it is possible to have too much information, making it difficult to process it in any useful way, but that generally linguistics researchers will take as much data as is available. For the Qualitas Corpus, our intent is to make it as big as is practical, given our goal of supporting replication.

According to Hunston, the content of a corpus primarily depends on the purpose it is used for, and there are usually questions specific to a purpose that must be addressed in the design of the corpus. However, the design of a corpus is also impacted by what is available, and pragmatic issues such as whether the corpus creators have permission from the authors and publishers to make the contents available. The primary purpose that has guided the design of the Qualitas Corpus has been to support studies involving static analysis of code. The choice of contents is due to the large number of open source Java systems that are available.

The representativeness of a corpus is important for making statements about the population it is a sample of, that is, the generalisability of any conclusions based on its study. Hunston describes a number of issues that impact the design of the corpus, but notes that the real question is how the representativeness of the corpus should be taken into account when interpreting results. The Qualitas Corpus supports this assessment by providing full details of where its entries came from, as well as metadata on such things as the domain of an entry.

Finally, Hunston notes that a corpus needs to be regularly updated in order to remain representative of the current usage, and so its design must support that.

IV. THE QUALITAS CORPUS

The current release is 20090202. It has 100 systems, 23 systems with multiple versions, with 400 versions total. Its distributed form is 5.4GB, and once installed is 18.8GB. It contains the source and binary forms of each system version as distributed by the developers (section IV-B). The 100 systems had to meet certain criteria (section IV-C). These criteria were developed for the first external release, one consequence of which is that some systems that were considered part of the corpus previously now are not as they do not meet the criteria (section IV-I). There are questions regarding what things are in the corpus (section IV-E). The next release is scheduled for the middle of July 2010 (section IV-J).

[Figure 1. Organisation of Qualitas Corpus. Directory tree: Systems at the top; a directory per system (ant shown); a directory per version (ant-1.1 and ant-1.7.1 shown, other versions omitted); bin, compressed, and src within a version, with contents omitted; the entries .properties, apache-ant-1.7.1-bin.zip, and apache-ant-1.7.1-src.zip also appear.]

As discussed previously, the main goals for the corpus are

that it reduces the costs of studies and supports replication of

studies. These goals have impacted the criteria for inclusion

and the corpus organisation.

A. Organisation

The corpus consists of a collection of systems, each of

which consists of a set of versions. Each version consists of

the original distribution (compressed) and two “unpacked”

forms, bin and src. The unpacked forms are provided in

order to reduce the costs of performing studies. The bin form

contains the binary system as it was intended to be used,

that is, Java bytecode. The src form contains everything in

the source distribution. If the binary and source forms are

distributed as a single archive file, then it is unpacked in src

and the relevant files are copied into bin.

ant antlr aoi argouml aspectJ axion azureus c_jdbc checkstyle
cobertura colt columba compiere derby displaytag drawswf drjava
eclipse_SDK emma exoportal findbugs fitjava fitlibraryforfitnesse
freecol freecs galleon ganttproject gt2 heritrix hibernate hsqldb
htmlunit informa ireport itext ivatagroupware jFin_DateMath jag james
jasml jasperreports javacc jchempaint jedit jena jext jfreechart jgraph
jgraphpad jgrapht jgroupsn jhotdraw jmeter jmoney joggplayer jparse
jpf jrat jre jrefactory jruby jsXe jspwiki jtopen jung junit log4j lucene
marauroa megamek mvnforum myfaces_core nakedobjects nekohtml
openjms oscache picocontainer pmd poi pooka proguard quartz
quickserver quilt roller rssowl sablecc sandmark springframework
squirrel_sql struts sunflow tomcat trove velocity webmail weka xalan
xerces xmojo

Figure 2. Systems in the Qualitas Corpus.

The original distribution is provided exactly as downloaded

from the system’s download site. This serves several

purposes. First, it means we can distribute the corpus without

creating the bin and src forms, as they can be automatically

created from the distributed forms, thus reducing the size

of the corpus distribution. Second, it allows any user of the

corpus to verify that the bin and src forms match what was

distributed, or even create their own form of the corpus.

Third, many distributions contain artifacts other than the

code in the system, such as test and build infrastructure,

and we want to keep these in case someone wishes to

analyse them as well. We also provide metadata in the file

.properties (section IV-D).
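The claim that the bin and src forms can be regenerated automatically from the original distribution can be illustrated with a minimal sketch. It assumes the distribution is a .zip archive, as in the ant example; the class name `DistributionUnpacker` and the method `unpack` are hypothetical and are not part of the corpus tooling.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper; an illustration only, not the corpus's own scripts. */
final class DistributionUnpacker {

    /** Unpacks a .zip archive into targetDir and returns the number of files written. */
    static int unpack(Path archive, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        int files = 0;
        try (InputStream in = Files.newInputStream(archive);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Refuse entries that would be written outside the target directory.
                if (!out.startsWith(root)) {
                    throw new IOException("Unsafe entry: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // The stream is positioned at the current entry, so this copies one file.
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                    files++;
                }
            }
        }
        return files;
    }
}
```

Under these assumptions, calling `unpack` on a source archive with the version's src directory as the target would recreate the src form. Distributions in other archive formats would need a different reader.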

We use a standard naming convention to identify systems

and versions. A system is identified by a string that cannot

contain any occurrence of “-”. A version is identified

by <system>-<versionid>, where <system> is the

system name, and <versionid> is some system-specific

version identifier. Where possible, we use the names used

by the original distribution. So far, the only time we have

not been able to do this is when the system name contains

“-”, which we typically replace with “_”.
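Because a system name never contains “-”, a version name can be split unambiguously at its first “-”, even when the version identifier itself contains one. The following is a minimal sketch of that rule. The class `VersionName` and its methods are hypothetical names introduced for illustration, and the replacement of “-” by “_” reflects only the typical case described above.

```java
/** Hypothetical value type for a corpus version name such as "ant-1.7.1". */
final class VersionName {
    final String system;
    final String versionId;

    private VersionName(String system, String versionId) {
        this.system = system;
        this.versionId = versionId;
    }

    /** Splits at the first "-": system names contain none, version ids may. */
    static VersionName parse(String name) {
        int dash = name.indexOf('-');
        if (dash <= 0 || dash == name.length() - 1) {
            throw new IllegalArgumentException("Not of the form <system>-<versionid>: " + name);
        }
        return new VersionName(name.substring(0, dash), name.substring(dash + 1));
    }

    /** Typical mapping from an upstream name to a corpus system name. */
    static String toSystemName(String upstreamName) {
        return upstreamName.replace('-', '_');
    }
}
```

For example, `VersionName.parse("ant-1.7.1")` yields the system `ant` and the version identifier `1.7.1`.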

Figure 1 shows an example of the distribution for ant. There are 18 versions of ant, from ant-1.1 to ant-1.7.1. The original distribution of ant-1.7.1 consists of apache-ant-1.7.1-bin.zip, containing the deployable form of ant, which is unpacked in bin, and apache-ant-1.7.1-src.zip, containing the source code, which is unpacked in src.
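The regular layout is what keeps study tooling cheap to write. As a minimal sketch, the code below enumerates systems and versions from an installed corpus. It assumes a root directory (called Systems in Figure 1) with one directory per system and one subdirectory per version. The class `CorpusLayout` and its methods are hypothetical and are not part of the corpus distribution.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical reader for the directory layout described in this section. */
final class CorpusLayout {

    /** Maps each system directory name to the sorted names of its version directories. */
    static Map<String, List<String>> versionsBySystem(Path systemsDir) throws IOException {
        Map<String, List<String>> result = new TreeMap<>();
        for (Path system : subdirectories(systemsDir)) {
            List<String> versions = new ArrayList<>();
            for (Path version : subdirectories(system)) {
                versions.add(version.getFileName().toString());
            }
            result.put(system.getFileName().toString(), versions);
        }
        return result;
    }

    /** Counts the systems that have more than one version. */
    static long countMultiVersion(Map<String, List<String>> versions) {
        return versions.values().stream().filter(v -> v.size() > 1).count();
    }

    /** True if the version directory holds the compressed, bin and src forms. */
    static boolean isComplete(Path versionDir) {
        return Files.isDirectory(versionDir.resolve("compressed"))
            && Files.isDirectory(versionDir.resolve("bin"))
            && Files.isDirectory(versionDir.resolve("src"));
    }

    private static List<Path> subdirectories(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isDirectory).sorted().collect(Collectors.toList());
        }
    }
}
```

Run against an installed release, the size of the returned map and the result of `countMultiVersion` could be compared with the system and version counts reported for that release.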

B. Contents

Figure 2 lists the systems that are currently represented in

the corpus. Figure 3 gives an idea of how big the systems

are, when listing the latest version of each system in the
