Proceedings Article

The Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies

30 Nov 2010-pp 336-345

TL;DR: The Qualitas Corpus, a large curated collection of open source Java systems, is described, which reduces the cost of performing large empirical studies of code and supports comparison of measurements of the same artifacts.

Abstract: In order to increase our ability to use measurement to support software development practice we need to do more analysis of code. However, empirical studies of code are expensive and their results are difficult to compare. We describe the Qualitas Corpus, a large curated collection of open source Java systems. The corpus reduces the cost of performing large empirical studies of code and supports comparison of measurements of the same artifacts. We discuss its design, organisation, and issues associated with its development.

Topics: KPI-driven code analysis (57%), Java annotation (56%), Java (52%), Software development (50%)

Summary (3 min read)

Introduction

  • Measurement is fundamental to engineering; however, its use in engineering software has been limited.
  • The authors need models explaining the relationship between the measurements and the quality attributes, and they need experiments to validate those models.
  • While the goals of applied linguistics research are not exactly the same as ours, the similarities are close enough to warrant examining how corpora are used in that field.

A. Empirical studies of Code

  • By “empirical study of code” the authors mean a study in which the artifacts under investigation consist of source code, there are multiple, unrelated, artifacts, and the artifacts were developed independently of the study.
  • They identified the systems studied, but did not identify the versions for all systems.
  • There are several issues with these studies however.
  • SIR provides a curated set of artifacts, including the code, test suites, and fault data.

C. The need for curation

  • If two studies that analyse code give conflicting reports of some phenomena, one obvious possible explanation is that the studies were applied to different samples.
  • Such code may not be representative of the deployed code, and so could bias the results of the study.
  • Different systems organise their source code in different ways.
  • Also, some systems provide their own implementations of some thirdparty libraries, further complicating what is system code and what is not.
  • According to Hunston, the content of a corpus primarily depends on the purpose it is used for, and there are usually questions specific to a purpose that must be addressed in the design of the corpus.

A. Organisation

  • Each version consists of the original distribution and two “unpacked” forms, bin and src.
  • The original distribution is provided exactly as downloaded from the system’s download site.
  • First, it means the authors can distribute the corpus without creating the bin and src forms, as they can be automatically created from the distributed forms, thus reducing the size of the corpus distribution.
  • The authors use a standard naming convention to identify systems and versions.

B. Contents

  • Figure 2 lists the systems that are currently represented in the corpus.
  • Figure 3 gives an idea of how big the systems are, by listing the latest version of each system in the current release in order of number of top-level types (that is, classes, interfaces, enums, and annotations).
  • For the most part, the systems in the corpus are open source and so the corpus can contain their distributions, especially as what is in the corpus is exactly what was downloaded from the system download site.
  • Since jre is an interesting system to analyse, the authors consider it part of the corpus; however, corpus users must download what they need from the Java distribution site.
  • What is provided by the corpus is the metadata similar to that for other systems.

C. Criteria for inclusion

  • This allows people to have the latest release and yet still be able to reproduce studies based on previous releases.
  • One advantage with Java is that its “compiled” form is also fairly easy to analyse, easier than for the source code in fact (section IV-E), however there are slight differences between the source and binary forms.
  • This criterion will probably be the first to completely go away.
  • Some systems the authors used (and analysed) before the first external release of the corpus have suffered this fate, and so are not in the corpus.
  • In fact the authors already have the situation where the version of a system they have in the corpus is now apparently no longer available, as the developers only appear to keep (or make available at least) the most recent versions.

D. Metadata

  • As part of the curation process the authors gather metadata about each system version, and one of their near-term goals is to increase this information (section IV-J).
  • 4, its sourcepackages value is “org.gudy com.aelitis”, indicating that types such as com.aelitis.
  • Other metadata the authors keep includes the release date of the version, notes regarding the system and individual versions, domain information, and where the system distribution came from.
  • The latter allows users of the corpus to check corpus contents for themselves.
  • Another issue is what to do when systems stop being supported or otherwise become unavailable.

F. Content Management

  • Following criterion 1, a new release contains all the versions of systems in the previous release.
  • There are however some changes between releases.
  • The authors have developed processes over time to support the management of the corpus.
  • The two main processes are for making a new entry of a version of a system into the corpus, and creating a distribution for release.
  • In the early days, these were all manual, but now, with each new release, scripts are being developed to automate more parts of the process.

G. Distributing the Corpus

  • To install the copy, one acquires a distribution for a particular release.
  • One distribution contains just the most recent version of each system in the corpus.
  • For those interested in just “breadth” studies, this distribution is simpler to deal with (and much smaller to download).
  • Releases are identified by their date of release (in ISO 8601 format).
  • The current release is 20090202 and the distribution containing only the most recent versions of systems is 20090202r.
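
The release naming described above can be illustrated with a short parser. This is a minimal sketch, assuming an identifier is an eight-digit ISO 8601 basic-format date (YYYYMMDD) optionally followed by "r" for the distribution holding only the most recent versions; the names `Distribution` and `parse_distribution_id` are hypothetical and not part of the corpus tooling.

```python
from datetime import date, datetime
from typing import NamedTuple


class Distribution(NamedTuple):
    release: date      # date on which the release was made
    recent_only: bool  # True for the "r" distribution (latest versions only)


def parse_distribution_id(dist_id: str) -> Distribution:
    """Parse an identifier such as '20090202' or '20090202r'."""
    recent_only = dist_id.endswith("r")
    digits = dist_id[:-1] if recent_only else dist_id
    if len(digits) != 8 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a release identifier: {dist_id!r}")
    # ISO 8601 basic format: YYYYMMDD, no separators.
    release = datetime.strptime(digits, "%Y%m%d").date()
    return Distribution(release, recent_only)


if __name__ == "__main__":
    print(parse_distribution_id("20090202"))
    print(parse_distribution_id("20090202r"))
```

Recording the parsed release alongside any measurements makes it straightforward to state which release, and which distribution of it, a study used.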

H. Using the corpus

  • A properly-installed distribution has the structure described in section IV-A.
  • If every study is performed on the complete contents of a given release, using the metadata provided in the corpus to identify the contents of a system (in particular sourcepackages, section IV-D), then the results of those studies can be compared with good confidence that comparison is meaningful.
  • Furthermore, what is actually studied can be described succinctly just by indicating the release (and if necessary, particular distribution) used.
  • There is, however, no restriction on how the corpus can be used.
  • In such cases, in addition to identifying the release, the authors recommend that either what has been included be identified by listing the system versions used, or what has been left out similarly identified.

I. History

  • The Qualitas Corpus was initially conceived and developed by one of us for Ph.D. research during 2005.
  • The original corpus was used and added to by members of the University of Auckland group over the next three years, growing from 21 systems initially.
  • It was made available for external release in January of 2008, containing 88 systems, 21 systems with multiple versions, a total of 214 entries.
  • The main changes have been in terms of the metadata that is maintained, however there has also been a change in terminology.

J. Future Plans

  • As noted earlier, the next release is scheduled for July 2010.
  • As well as about 90 new versions of existing systems (but at this point, no new systems), the main change will be the addition of significantly more metadata.
  • Column 4 indicates whether the entry corresponds to a type identified as being in the system (that is, matches the sourcepackages value), with 0 indicating it does.


The Qualitas Corpus: A Curated Collection of Java Code
for Empirical Studies
Ewan Tempero, Craig Anslow§, Jens Dietrich, Ted Han, Jing Li, Markus Lumpe, Hayden Melton, James Noble§
Department of Computer Science, The University of Auckland
Auckland, New Zealand. e.tempero@cs.auckland.ac.nz
Massey University, School of Engineering and Advanced Technology
Palmerston North, New Zealand. j.b.dietrich@massey.ac.nz
Faculty of Information & Communication Technologies, Swinburne University of Technology
Hawthorn, Australia. mlumpe@ict.swin.edu.au
§ School of Engineering and Computer Science, Victoria University of Wellington
Wellington, New Zealand. kjx@ecs.vuw.ac.nz
Abstract—In order to increase our ability to use measurement
to support software development practice we need to
do more analysis of code. However, empirical studies of code
are expensive and their results are difficult to compare. We
describe the Qualitas Corpus, a large curated collection of open
source Java systems. The corpus reduces the cost of performing
large empirical studies of code and supports comparison of
measurements of the same artifacts. We discuss its design,
organisation, and issues associated with its development.
Keywords-Empirical studies; curated code corpus; experi-
mental infrastructure
I. INTRODUCTION
Measurement is fundamental to engineering; however, its
use in engineering software has been limited. While many
software metrics have been proposed (e.g. [1]), few are
regularly used in industry to support decision making. A key
reason for this is that our understanding of the relationship
between measurements we know how to make and quality
attributes, such as modifiability, understandability, extensi-
bility, reusability, and testability, that we care about is poor.
This is particularly true with respect to theories regarding
characteristics of software structure such as encapsulation,
inheritance, coupling and cohesion. Traditional engineering
disciplines have had hundreds or thousands of years of expe-
rience of comparing measurements with quality outcomes,
but central to this experience is the taking and sharing of
measurements and outcomes. In contrast there have been
few useful measurements of code. In this paper we describe
the Qualitas Corpus, infrastructure that supports taking and
sharing measurements of code artifacts.
Barriers to measuring code and understanding what the
measurements mean include access to code to measure and
the tools to do the measurement. The advent of open source
software (OSS) has meant significantly more code is now
accessible for measurement than in the past. This has led to
an increase in interest in empirical studies of code. However,
there is still a non-trivial cost to gathering the artifacts from
enough OSS projects to make a study useful. One of the
main goals of the Qualitas Corpus is to substantially reduce
the cost of performing large empirical studies of code.
However, just measuring code is not enough. We need
models explaining the relationship between the measure-
ments and the quality attributes, and we need experiments
to validate those models. Validation does not come through
a single experiment; experiments must be replicated.
Replication requires at least understanding of the relation-
ship between the artifacts used in the different experiments.
In some forms of experiments, we want to use the same
artifacts so as to be able to compare results in a meaningful
way. This means we need to know in detail what artifacts
are used in any experiment, meaning an ad hoc collection
of code whose contents are unknown is not sufficient. What
is needed is a curated collection of code artifacts. A second
goal of the Qualitas Corpus is to support comparison of
measurements of the same artifacts, that is, to provide a
reference corpus for empirical studies of code.
The contributions of this paper are:
  • We present arguments for the provision of a reference
    corpus of code for empirical studies of code.
  • We identify the issues regarding performing replication
    of studies that analyse Java code.
  • We describe the Qualitas Corpus, a curated collection
    of Java code that reduces the cost and increases the
    replicability of empirical studies.
The rest of the paper is organised as follows. In the
next section we present the motivation for our work, which
includes inspiration from the use of corpora in applied
linguistics and the limited empirical studies of code that have
been performed. We also discuss the use of reference collec-
tions in other areas of software engineering and in computer
science, and discuss the need for a curated collection of
code. In section III we discuss the challenges faced when
doing empirical studies of code, and from that, determine
the requirements of a curated corpus. Section IV presents
the details of the Qualitas Corpus, its current organisation,
immediate future plans, and rationale of the decisions we
have taken. Section V evaluates the Qualitas Corpus. Finally
we present our conclusions in section VI.
II. MOTIVATION AND RELATED WORK
The use of a standard collection of artifacts to support
study in an area is not new, neither in general nor in software
engineering. One area is that of applied linguistics, where
standard corpora are the basis for much of the research being
done. Hunston [2] opens her book with “It is no exaggeration
to say that corpora, and the study of corpora, have
revolutionised the study of language, and of the applications
of language, over the last few decades.” Ironically, it is the
availability of software systems support for language corpora
that has enabled this form of research, whereas researchers
examining code artifacts have been slow to adopt this idea.
While the goals of applied linguistics research are not exactly
the same as ours, the similarities are close enough to warrant
examining how corpora are used in that field. Their use of
corpora is a major motivation for the Qualitas Corpus. We
will discuss language corpora in more detail in section III.
A. Empirical studies of Code
To answer the question of whether a code corpus is
necessary, we sample past empirical studies of code. By
“empirical study of code” we mean a study in which the
artifacts under investigation consist of source code, there
are multiple, unrelated, artifacts, and the artifacts were
developed independently of the study. This rules out, for
example, studies that included the creation of the code
artifacts, such as those by Briand et al. [3] or Lewis et al.
[4], and studies of one system, such as that by Barry [5].
Empirical studies of code have been performed for at least
four decades. As with many other things, Knuth was one of
the first to carry out empirical studies to understand what
code that is actually written looks like [6]. He presented a
static analysis of over 400 FORTRAN programmes, totalling
about 250,000 cards, and dynamic analysis of about 25
programs. He chose programs that could “run to completion”
from job submissions to Stanford’s Computation Center,
various subroutine libraries and scientific packages, contri-
butions from IBM, and personal programs. His main moti-
vation was compiler design, with the concern that compilers
may not optimise for the typical case as no-one knew what
the typical case was. The programs used were not identified.
In another early example, Chevance and Heidet studied
50 COBOL programs also looking at how language features
are used [7]. The programs were also not identified and no
details were given of size.
Open source software has existed for several decades,
with systems such as Unix, emacs, and TeX. Their use in
empirical studies is relatively recent. For example, Miller et
al. [8] studied about 90 Unix applications (including emacs,
TeX, LaTeX, yacc) to determine how they responded to input.
Frakes and Pole [9] used Unix tools as the basis for a study
on methods for searching for reusable components.
During the 1990s the number of accessible systems in-
creased, particularly those written in C++, and consequently
the number of studies increased. Chidamber and Kemerer
applied their metrics to two systems, one had 634 C++
classes, the other had 1459 Smalltalk classes [1]. No further
information on the systems was given.
Bieman and Zhao studied inheritance in 19 C++ systems,
ranging from 7 classes to 922 classes in size, with 2744
classes in total [10]. They identified the systems studied,
but did not identify the versions for all systems.
Harrison et al. applied two coupling metrics to five
collections of C++ code, consisting of 96, 197, 113, 61,
and 12 classes respectively [11]. They identified the systems
involved but not the versions studied.
Chidamber et al. studied three systems, one with 45
C++ classes, one with 27 Objective C classes, and one
identifying 25 classes in design documents [12]. They were
required to restrict information about the systems studied for
commercial reasons.
By the end of the millennium, repositories supporting
open source development such as sourceforge, as well
as the increase in effectiveness of Internet search systems,
meant a large number of systems were accessible. This
affected both the number of studies done, and often their
size. A representative set of examples includes one with 3
fairly large Java systems [13], a study of 14 Java systems
[14], and a study of 35 systems, from several languages
including Java, C++, Self, and Smalltalk [15].
Two particularly large studies were by Succi et al. [16]
and Collberg et al. [17]. Succi et al. studied 100 Java and 100
C++ applications. The Java applications ranged from 28 to
936 classes in size (median 83.5) and the C++ applications
ranged from 30 to 2520 classes (median 59). The actual
applications were not identified. Collberg et al. analysed
1132 Java jar files collected from the Internet. According
to their statistics they analysed a total of 102,688 classes
and 12,188 interfaces. No information was given as to what
applications were analysed.
The studies described above suggest that there is interest
in doing studies that involve analysing code and the ability
to do such studies has significantly advanced our knowledge
about the characteristics of code structure. There are several
issues with these studies however. The first is that none of
these studies use the same set of systems, making it difficult
to compare or combine results. Another is that because full
details of the systems analysed are not provided, we are
limited in our ability to replicate them. A third issue is that
it is not clear that even the authors are fully aware of what
they have studied, which we discuss further below. Finally,
while the authors have gone to some effort to gather the
artifacts needed for their study, few others are able to benefit
from that effort, meaning each new study requires duplicated
effort. The Qualitas Corpus addresses these issues.
B. Infrastructure for empirical studies
Of course the use of standard collections of artifacts to
support research in computer science and software engi-
neering is not new. The use of benchmarks for various
forms of performance testing and comparison is very mature.
One recent example is the DaCapo benchmark suite by
Blackburn et al. [18], which consists of a set of open
source, real world Java applications with non-trivial memory
loads. Another example of research infrastructure is the New
Zealand Digital Library project, whose aim is to develop the
technology for the creation of digital libraries and make it
available publicly so that others can use it [19].
There are also some examples in Software Engineering.
One is the Software-artifact Infrastructure Repository (SIR)
[20]. The explicit goal of SIR is to support controlled
experimentation in software testing techniques. SIR provides
a curated set of artifacts, including the code, test suites, and
fault data. SIR represents the kind of support the Qualitas
Corpus is intended to provide. We discuss SIR’s motivation
in section III.
Bajracharya et al. describe Sourcerer, which provides
infrastructure to support code search [21]. At the time of
publication, the Sourcerer database held 1500 real-world
open source projects, a total of 254,049 Java classes, gath-
ered from Sourceforge. Their goals are different to ours, but
it does give an indication as to what is available.
Finally, we must mention the Purdue Benchmark Suite.
This was described by Grothoff et al. in support of their
work on confined types [22]. It consisted of 33 Java systems,
5 with more than 200 classes, and a total of 46,165 classes.
At the time it was probably the largest organised collection
of Java code, and was the starting point for our work.
C. The need for curation
If two studies that analyse code give conflicting reports
of some phenomena, one obvious possible explanation is
that the studies were applied to different samples. If the two
studies claimed to be analysing the same set of systems, we
might suspect error somewhere, although it could just be that
the specific versions analysed were different. In fact, even if
we limit our sample to be from open source Java systems,
there is still room for variation even within specific versions,
as we will now discuss.
In an ideal world, it would be sufficient for a researcher to
just analyse what was provided on the system’s download
website. However, it is not that simple. Open source Java
systems come in both deployable (“binary”) and source
versions of the code. While we are interested in analysing
the source code, in some cases it is easier to analyse the
binary version. However, it is frequently the case that what
is distributed in the source version is not the same as
what is in the binary version. The source often includes
“infrastructure” code, such as that used for testing, code
demonstrating aspects of the system, and code that supports
the installation, building, or other management tasks of the
code. Such code may not be representative of the deployed
code, and so could bias the results of the study.
In some cases, this extra code can be a significant proportion
of what is available. For example, jFin_DateMath
version R1-0.0 has 109 top-level non-test classes and 38
JUnit test classes. If the goal of a study is to characterise
how inheritance is used, then the JUnit classes (which
extend TestCase) could bias the result. Another example
is fitjava version 1.1, which has 37 top-level classes,
and, in addition, 22 example classes. If there are many
example classes, which are typically quite simple, then they
would bias the results in a study to characterise some aspect
of the complexity of the system design.
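
To make the size of this effect concrete, the share of test or example classes can be computed from the counts quoted above. This is a small sketch only; the function name `infrastructure_share` is hypothetical, and the counts are the ones given in the text. For these two systems the extra code comes to roughly a quarter and over a third of all top-level classes respectively.

```python
def infrastructure_share(system_classes: int, extra_classes: int) -> float:
    """Fraction of all top-level classes that are test or example code."""
    total = system_classes + extra_classes
    if total == 0:
        raise ValueError("no classes counted")
    return extra_classes / total


# (system classes, extra classes) as quoted in the text above.
examples = {
    "jFin_DateMath R1-0.0 (JUnit tests)": (109, 38),
    "fitjava 1.1 (example classes)": (37, 22),
}

for name, (system, extra) in examples.items():
    print(f"{name}: {infrastructure_share(system, extra):.1%}")
```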
Another issue is identifying the infrastructure code. Dif-
ferent systems organise their source code in different ways.
In many cases, the source code is organised as different
source directories, one for the system source, one for the test
infrastructure, one for examples, and so on. However there
are many other organisations. For example, gt2 version
2.2-rc3 has nearly 90 different source directories, of
which only about 40 contain source code that is distributed
in binary form.
The presence of infrastructure code means that a decision
has to be made as to what exactly to analyse. Without careful
investigation, researchers may not even be aware that the
infrastructure code exists and that a decision needs to be
made. If this decision is not reported, then it impacts other
researchers’ ability to replicate the study. It may be possible
to avoid this problem by just analysing the binary form of
the system, as this can be expected to represent how the
system was built. Unfortunately, some systems do include
infrastructure code in the deployed form.
Another complication is third-party libraries. Since such
software is usually not under the control of the developers of
the system, including it in the analysis would be misleading
in terms of understanding what decisions have been made
by developers. Some systems include these libraries in their
distribution and some do not. Also, different systems can use
the same libraries. This means that third-party library use
must be identified, and where appropriate, excluded from
the analysis, to avoid bias due to double counting.
Identifying third-party libraries is not easy. Some systems
are deployed as many archive (jar) files, meaning it is quite
time-consuming to determine which are third-party libraries
and which are not. For example, compiere version 250d
has 114 archive files in its distribution. Complicating the
identification of third-party libraries is the fact that some
systems have such libraries packaged along with the system
code, that is, the library binary code has been unpacked
and then repacked with the binary system code. This means
excluding library code is not just a matter of leaving out the
relevant archive file.
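
Because repacked library code cannot be excluded by archive, one alternative is to select types by package name instead. The following is a minimal sketch of that idea, assuming the analyst has a list of package prefixes that identify system code (similar in spirit to the sourcepackages metadata the corpus records); the function names and the type names in the usage example are hypothetical.

```python
from typing import Iterable, List


def in_system(type_name: str, source_packages: Iterable[str]) -> bool:
    """True if the fully qualified type name lies in one of the given
    packages or in a sub-package of one of them."""
    # Appending "." avoids matching a sibling such as "org.gudyx".
    return any(type_name.startswith(prefix + ".") for prefix in source_packages)


def system_types(type_names: Iterable[str],
                 source_packages: Iterable[str]) -> List[str]:
    """Keep only the types that belong to the system itself."""
    prefixes = list(source_packages)
    return [t for t in type_names if in_system(t, prefixes)]


if __name__ == "__main__":
    prefixes = "org.gudy com.aelitis".split()
    found = [
        "org.gudy.example.Foo",        # hypothetical system type
        "com.aelitis.example.Bar",     # hypothetical system type
        "org.apache.commons.Library",  # hypothetical third-party type
        "org.gudyx.Other",             # hypothetical look-alike package
    ]
    print(system_types(found, prefixes))
```

Filtering by prefix only works when the system and its bundled libraries use distinct packages, so a system that ships its own implementation of a third-party library still needs a human decision.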
Some systems are careful to identify what third-party
systems are included in the distribution (eclipse for
example). However, usually this is in a simple text document
that must be processed by a human, and so some judgement
is needed.
Another means to determine what to analyse might be to
look at the code that appears in both source and binary form.
Since there is no need for third-party source to be distributed,
we might reasonably expect it would only appear in binary
form. However, this is not the case. Some systems do in
fact distribute what appears to be original source of third-party
libraries (for example compiere version 250d has
a copy of the Apache Element Construction Set¹ that differs
only in one class and that only by a few lines). Also, some
systems provide their own implementations of some third-
party libraries, further complicating what is system code and
what is not.
In conclusion, to study the code from a collection of
systems it is not sufficient to just analyse the downloaded
code, whether it is binary or the original source. Decisions
need to be made regarding exactly what is going to be
analysed. If these decisions are not reported, then the results
may be difficult to analyse (or even fully evaluate). If the
decisions are reported, then anyone wanting to replicate the
study has, as well as having to recreate the collection, the
additional burden of accurately recreating the decisions.
If the collection is curated, that is, the contents are
organised and clearly identified, then the issues described
above can be more easily managed. This is the purpose of
the Qualitas Corpus.
III. DESIGNING A CORPUS
In discussing the need for the Software-artifact Infrastructure
Repository (SIR), Do et al. identified five challenges that
need to be addressed to support controlled experimentation:
supporting replicability across experiments; supporting ag-
gregation of findings; reducing the cost of controlled exper-
iments; obtaining sample representativeness; and isolating
the effects of individual factors [20]. Their conclusion was
that these challenges could be addressed to one degree or
other by creating a collection of relevant artifacts.
When collecting artifacts, the target of those artifacts
must be kept in mind. Researchers use the artifacts in SIR
to determine the effectiveness of techniques and tools for
testing software, that is, the artifacts themselves are not the
objects of study. Similarly, benchmarks are also a collection
of artifacts where they are not the object of study, but provide
input to systems whose performance is the object of study.
While any collection of code may be used for a variety of
purposes, our interest is in the code itself, and so we refer
to our collection as a corpus.
¹ http://jakarta.apache.org/ecs
Corpora are now commonly used in linguistics and there
are many used in that area, such as the International Corpus
of English [23]. The development of standard corpora for
various kinds of linguistics work is an area of research in
itself. Hunston says the main argument for using a corpus
is that it provides a reliable guide to what language is like,
more reliable than the intuition of native speakers [2, p20].
This applies to programming languages as well. While both
research and trade literature contain many claims about use
of programming language features, code corpora could be
used to provide evidence for such claims.
Hunston lists four aspects that should be considered when
designing a corpus: size, content, representativeness, and
permanence. Regarding size, she makes the point that it is
possible to have too much information, making it difficult
to process it in any useful way, but that generally linguistics
researchers will take as much data as is available. For the
Qualitas Corpus, our intent is to make it as big as is practical,
given our goal of supporting replication.
According to Hunston, the content of a corpus primarily
depends on the purpose it is used for, and there are usually
questions specific to a purpose that must be addressed in the
design of the corpus. However, the design of a corpus is also
impacted by what is available, and pragmatic issues such
as whether the corpus creators have permission from the
authors and publishers to make the contents available. The
primary purpose that has guided the design of the Qualitas
Corpus has been to support studies involving static analysis
of code. The choice of contents is due to the large number
of open source Java systems that are available.
The representativeness of a corpus is important for making
statements about the population it is a sample of, that is,
the generalisability of any conclusions based on its study.
Hunston describes a number of issues that impact the design
of the corpus, but notes that the real question is how the
representativeness of the corpus should be taken into account
when interpreting results. The Qualitas Corpus supports this
assessment by providing full details of where its entries came
from, as well as metadata on such things as the domain of
an entry.
Finally, Hunston notes that a corpus needs to be regularly
updated in order to remain representative of the current
usage, and so its design must support that.
IV. THE QUALITAS CORPUS
The current release is 20090202. It has 100 systems, 23
systems with multiple versions, with 400 versions total. Its
distributed form is 5.4GB, and once installed is 18.8GB.
It contains the source and binary forms of each system
version as distributed by the developers (section IV-B). The
100 systems had to meet certain criteria (section IV-C).
These criteria were developed for the first external release,
one consequence of which is that some systems that were
considered part of the corpus previously now are not as they
do not meet the criteria (section IV-I). There are questions
regarding what things are in the corpus (section IV-E). The
next release is scheduled for the middle of July 2010 (section
IV-J).
Figure 1. Organisation of Qualitas Corpus. (Directory tree; labels shown: Systems, ant, ant-1.1, ant-1.7.1, .properties, bin, compressed, src, apache-ant-1.7.1-bin.zip, apache-ant-1.7.1-src.zip; other versions and contents omitted.)
As discussed previously, the main goals for the corpus are
that it reduces the costs of studies and supports replication of
studies. These goals have impacted the criteria for inclusion
and the corpus organisation.
A. Organisation
The corpus consists of a collection of systems, each of
which consists of a set of versions. Each version consists of
the original distribution (compressed) and two “unpacked”
forms, bin and src. The unpacked forms are provided in
order to reduce the costs of performing studies. The bin form
contains the binary system as it was intended to be used,
that is, Java bytecode. The src form contains everything in
the source distribution. If the binary and source forms are
distributed as a single archive file, then it is unpacked in src
and the relevant files are copied into bin.

ant antlr aoi argouml aspectJ axion azureus c_jdbc checkstyle
cobertura colt columba compiere derby displaytag drawswf drjava
eclipse_SDK emma exoportal findbugs fitjava fitlibraryforfitnesse
freecol freecs galleon ganttproject gt2 heritrix hibernate hsqldb
htmlunit informa ireport itext ivatagroupware jFin_DateMath jag james
jasml jasperreports javacc jchempaint jedit jena jext jfreechart jgraph
jgraphpad jgrapht jgroups jhotdraw jmeter jmoney joggplayer jparse
jpf jrat jre jrefactory jruby jsXe jspwiki jtopen jung junit log4j lucene
marauroa megamek mvnforum myfaces_core nakedobjects nekohtml
openjms oscache picocontainer pmd poi pooka proguard quartz
quickserver quilt roller rssowl sablecc sandmark springframework
squirrel_sql struts sunflow tomcat trove velocity webmail weka xalan
xerces xmojo
Figure 2. Systems in the Qualitas Corpus.
The original distribution is provided exactly as down-
loaded from the system’s download site. This serves several
purposes. First, it means we can distribute the corpus without
creating the bin and src forms, as they can be automatically
created from the distributed forms, thus reducing the size
of the corpus distribution. Second, it allows any user of the
corpus to verify that the bin and src forms match what was
distributed, or even create their own form of the corpus.
Third, many distributions contain artifacts other than the
code in the system, such as test and build infrastructure,
and we want to keep these in case someone wishes to
analyse them as well. We also provide metadata in the file
.properties (section IV-D).
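
The claim that bin and src can be recreated automatically from the original distribution can be sketched as follows. This is a minimal illustration, not the corpus's actual install procedure: it assumes zip archives whose names end in -bin.zip or -src.zip (as in the ant example), and the function names unpack_version and demo_unpack are hypothetical. The single-archive case, where the archive is unpacked in src and the relevant files are copied into bin, is not handled.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def unpack_version(version_dir: Path) -> None:
    """Recreate bin/ and src/ from the archives found in compressed/.

    Assumption: archives are zip files named *-bin.zip and *-src.zip.
    """
    compressed = version_dir / "compressed"
    for suffix, target in (("-bin.zip", "bin"), ("-src.zip", "src")):
        for archive in sorted(compressed.glob(f"*{suffix}")):
            out_dir = version_dir / target
            out_dir.mkdir(exist_ok=True)
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(out_dir)


def demo_unpack() -> List[str]:
    """Build a throwaway version directory, unpack it, list the result."""
    with tempfile.TemporaryDirectory() as tmp:
        version_dir = Path(tmp) / "ant-1.7.1"
        compressed = version_dir / "compressed"
        compressed.mkdir(parents=True)
        # Stand-in archives with placeholder members (not real ant contents).
        members = (("bin", "lib/ant.jar"), ("src", "src/main/Main.java"))
        for kind, member in members:
            archive = compressed / f"apache-ant-1.7.1-{kind}.zip"
            with zipfile.ZipFile(archive, "w") as zf:
                zf.writestr(member, "placeholder")
        unpack_version(version_dir)
        # Report unpacked files only, relative to the version directory.
        return sorted(
            p.relative_to(version_dir).as_posix()
            for p in version_dir.rglob("*")
            if p.is_file() and p.relative_to(version_dir).parts[0] != "compressed"
        )
```

Because the archives in compressed are left untouched, a user could rerun such a step at any time to check that the unpacked forms still match what was distributed.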
We use a standard naming convention to identify systems
and versions. A system is identified by a string that cannot
contain any occurrence of “-”. A version is identified
by <system>-<versionid>, where <system> is the
system name, and <versionid> is some system-specific
version identifier. Where possible, we use the names used
by the original distribution. So far, the only time we have
not been able to do this is when the system name contains
“-”, which we typically replace with “_”.
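
The naming convention above can be sketched in a few lines. Because a system name never contains “-”, a version name can be split unambiguously at its first “-”, even when the version identifier itself contains one. The function names are hypothetical, and the version identifier 1.0-rc1 in the usage note is made up for illustration.

```python
from typing import Tuple


def system_id(original_name: str) -> str:
    """Corpus identifier for a system: '-' is not allowed, so use '_'."""
    return original_name.replace("-", "_")


def version_name(system: str, version_id: str) -> str:
    """Build '<system>-<versionid>' from its two parts."""
    if "-" in system:
        raise ValueError(f"system name must not contain '-': {system!r}")
    return f"{system}-{version_id}"


def split_version_name(name: str) -> Tuple[str, str]:
    """Split '<system>-<versionid>' at the first '-'."""
    system, sep, version_id = name.partition("-")
    if not sep or not system or not version_id:
        raise ValueError(f"not of the form <system>-<versionid>: {name!r}")
    return system, version_id
```

For example, split_version_name("ant-1.7.1") gives ("ant", "1.7.1"), and a name such as "c_jdbc-1.0-rc1" splits into ("c_jdbc", "1.0-rc1") because only the first “-” is treated as the separator.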
Figure 1 shows an example of the distribution for ant.
There are 18 versions of ant, from ant-1.1 to ant-1.7.1.
The original distribution of ant-1.7.1 consists
of apache-ant-1.7.1-bin.zip, containing the
deployable form of ant, which is unpacked in bin, and
apache-ant-1.7.1-src.zip, containing the source
code, unpacked in src.
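
A study script might navigate this layout roughly as follows. This is a sketch under assumptions: the top-level directory name Systems is read off Figure 1, and list_versions, version_paths, and demo_list_versions are hypothetical helpers, not part of the corpus.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def list_versions(corpus_root: Path, system: str) -> List[str]:
    """Names of the version directories of one system, sorted as strings."""
    system_dir = corpus_root / "Systems" / system
    return sorted(
        entry.name
        for entry in system_dir.iterdir()
        if entry.is_dir() and entry.name.startswith(f"{system}-")
    )


def version_paths(corpus_root: Path, system: str, version_id: str) -> Dict[str, Path]:
    """Locations of the three forms of one version."""
    version_dir = corpus_root / "Systems" / system / f"{system}-{version_id}"
    return {part: version_dir / part for part in ("bin", "compressed", "src")}


def demo_list_versions() -> List[str]:
    """Create an empty two-version layout for ant and list it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for version_id in ("1.1", "1.7.1"):
            for part in version_paths(root, "ant", version_id).values():
                part.mkdir(parents=True)
        return list_versions(root, "ant")
```

The sort here is plain string order, which is not release order in general, so a study that depends on release order would need a system-specific comparison of version identifiers.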
B. Contents
Figure 2 lists the systems that are currently represented in
the corpus. Figure 3 gives an idea of how big the systems
are, when listing the latest version of each system in the

Citations

Journal ArticleDOI
TL;DR: The largest experiment of applying machine learning algorithms to code smells to the best of the authors' knowledge concludes that the application of machine learning to the detection of these code smells can provide high accuracy (>96 %), and only a hundred training examples are needed to reach at least 95 % accuracy.
Abstract: Several code smell detection tools have been developed providing different results, because smells can be subjectively interpreted, and hence detected, in different ways. In this paper, we perform the largest experiment of applying machine learning algorithms to code smells to the best of our knowledge. We experiment 16 different machine-learning algorithms on four code smells (Data Class, Large Class, Feature Envy, Long Method) and 74 software systems, with 1986 manually validated code smell samples. We found that all algorithms achieved high performances in the cross-validation data set, yet the highest performances were obtained by J48 and Random Forest, while the worst performance were achieved by support vector machines. However, the lower prevalence of code smells, i.e., imbalanced data, in the entire data set caused varying performances that need to be addressed in the future studies. We conclude that the application of machine learning to the detection of these code smells can provide high accuracy (>96 %), and only a hundred training examples are needed to reach at least 95 % accuracy.

196 citations


Journal ArticleDOI
TL;DR: The study confirms that EVOSUITE can achieve good levels of branch coverage in practice, and exemplifies how the choice of software systems for an empirical study can influence the results of the experiments, which can serve to inform researchers to make more conscious choices in the selection of software system subjects.
Abstract: Research on software testing produces many innovative automated techniques, but because software testing is by necessity incomplete and approximate, any new technique faces the challenge of an empirical assessment. In the past, we have demonstrated scientific advance in automated unit test generation with the EVOSUITE tool by evaluating it on manually selected open-source projects or examples that represent a particular problem addressed by the underlying technique. However, demonstrating scientific advance is not necessarily the same as demonstrating practical value; even if EVOSUITE worked well on the software projects we selected for evaluation, it might not scale up to the complexity of real systems. Ideally, one would use large “real-world” software systems to minimize the threats to external validity when evaluating research tools. However, neither choosing such software systems nor applying research prototypes to them are trivial tasks. In this article we present the results of a large experiment in unit test generation using the EVOSUITE tool on 100 randomly chosen open-source projects, the 10 most popular open-source projects according to the SourceForge Web site, seven industrial projects, and 11 automatically generated software projects. The study confirms that EVOSUITE can achieve good levels of branch coverage (on average, 71% per class) in practice. However, the study also exemplifies how the choice of software systems for an empirical study can influence the results of the experiments, which can serve to inform researchers to make more conscious choices in the selection of software system subjects. Furthermore, our experiments demonstrate how practical limitations interfere with scientific advances: branch coverage on an unbiased sample is affected by predominant environmental dependencies.
The surprisingly large effect of such practical engineering problems in unit testing will hopefully lead to a larger appreciation of work in this area, thus supporting transfer of knowledge from software testing research to practice.

143 citations


Cites methods from "The Qualitas Corpus: A Curated Coll..."

  • ...The software artifacts were taken from the Qualitas Corpus [Tempero et al. 2010], from which 100 classes were chosen at random from each of the projects in that corpus....

    [...]

  • ...The Qualitas Corpus [Tempero et al. 2010] is a set of open-source Java programs originally collected to help empirical studies on static analysis....

    [...]

  • ...The Qualitas Corpus [Tempero et al. 2010] is a set of open source Java programs that were originally collected to help empirical studies on static analysis....

    [...]


Proceedings ArticleDOI
31 May 2014
TL;DR: As a starting point for impacting game development, researchers could create testing tools that enable game developers to create tests that assert flexible behavior with little up-front investment.
Abstract: Video games make up an important part of the software industry, yet the software engineering community rarely studies video games. This imbalance is a problem if video game development differs from general software development, as some game experts suggest. In this paper we describe a study with 14 interviewees and 364 survey respondents. The study elicited substantial differences between video game development and other software development. For example, in game development, “cowboy coders” are necessary to cope with the continuous interplay between creative desires and technical constraints. Consequently, game developers are hesitant to use automated testing because of these tests’ rapid obsolescence in the face of shifting creative desires of game designers. These differences between game and non-game development have implications for research, industry, and practice. For instance, as a starting point for impacting game development, researchers could create testing tools that enable game developers to create tests that assert flexible behavior with little up-front investment.

129 citations


Cites background from "The Qualitas Corpus: A Curated Coll..."

  • ...Of the projects in two major software engineering corpora, SIR [5] and Qualitus [6], 0% and 3% are games, respectively....

    [...]


Proceedings ArticleDOI
02 Apr 2018
TL;DR: The results reveal that with this configuration the machine learning techniques reveal critical limitations in the state of the art which deserve further research.
Abstract: Code smells are symptoms of poor design and implementation choices weighing heavily on the quality of produced source code. During the last decades several code smell detection tools have been proposed. However, the literature shows that the results of these tools can be subjective and are intrinsically tied to the nature and approach of the detection. In a recent work the use of Machine-Learning (ML) techniques for code smell detection has been proposed, possibly solving the issue of tool subjectivity giving to a learner the ability to discern between smelly and non-smelly source code elements. While this work opened a new perspective for code smell detection, it only considered the case where instances affected by a single type smell are contained in each dataset used to train and test the machine learners. In this work we replicate the study with a different dataset configuration containing instances of more than one type of smell. The results reveal that with this configuration the machine learning techniques reveal critical limitations in the state of the art which deserve further research.

96 citations


Cites background or methods from "The Qualitas Corpus: A Curated Coll..."

  • ...Of course, we cannot exclude possible imprecisions contained in the Qualitas Corpus dataset [60], e.g., imprecisions in the computations of the metrics for the source code elements exploited in this study....

    [...]

  • ...The authors have analyzed systems from the Qualitas Corpus [60], release 20120401r, one of the largest curated benchmark datasets to date, specially designed for empirical software engineering research....

    [...]

  • ...Finally, they empirically benchmarked a set of 16 machine-learning techniques for the detection of four code smell types [1]: they performed their analyses over 74 software systems belonging to the Qualitas Corpus dataset [60]....

    [...]


Proceedings ArticleDOI
27 Feb 2014
TL;DR: An empirical method for extracting relative thresholds from real systems is described and it is argued that the proposed thresholds express a balance between real and idealized design practices.
Abstract: Establishing credible thresholds is a central challenge for promoting source code metrics as an effective instrument to control the internal quality of software systems. To address this challenge, we propose the concept of relative thresholds for evaluating metrics data following heavy-tailed distributions. The proposed thresholds are relative because they assume that metric thresholds should be followed by most source code entities, but that it is also natural to have a number of entities in the “long-tail” that do not follow the defined limits. In the paper, we describe an empirical method for extracting relative thresholds from real systems. We also report a study on applying this method in a corpus with 106 systems. Based on the results of this study, we argue that the proposed thresholds express a balance between real and idealized design practices.

89 citations


Cites background or methods from "The Qualitas Corpus: A Curated Coll..."

  • ...Particularly, we use VerveineJ—a Moose application—to parse the source code of each system and to generate MSE files, which is the format supported by Moose to persist source code models....

    [...]

  • ...…seven metrics related to distinct factors affecting the internal quality of object-oriented systems: Number of methods (NOM), Number of Lines of Code (LOC), FAN-OUT, Response For a Class (RFC), Weighted Method Count (WMC), Lack of Cohesion in Methods (LCOM), and the ratio between Number of…...

    [...]


References

Book
02 Sep 2011
TL;DR: This research addresses the needs for software measures in object-orientation design through the development and implementation of a new suite of metrics for OO design, and suggests ways in which managers may use these metrics for process improvement.
Abstract: Given the central role that software development plays in the delivery and application of information technology, managers are increasingly focusing on process improvement in the software development area. This demand has spurred the provision of a number of new and/or improved approaches to software development, with perhaps the most prominent being object-orientation (OO). In addition, the focus on process improvement has increased the demand for software measures, or metrics with which to manage the process. The need for such metrics is particularly acute when an organization is adopting a new technology for which established practices have yet to be developed. This research addresses these needs through the development and implementation of a new suite of metrics for OO design. Metrics developed in previous research, while contributing to the field's understanding of software development processes, have generally been subject to serious criticisms, including the lack of a theoretical base. Following Wand and Weber (1989), the theoretical base chosen for the metrics was the ontology of Bunge (1977). Six design metrics are developed, and then analytically evaluated against Weyuker's (1988) proposed set of measurement principles. An automated data collection tool was then developed and implemented to collect an empirical sample of these metrics at two field sites in order to demonstrate their feasibility and suggest ways in which managers may use these metrics for process improvement.

5,185 citations


Proceedings ArticleDOI
16 Oct 2006
TL;DR: This paper recommends benchmarking selection and evaluation methodologies, and introduces the DaCapo benchmarks, a set of open source, client-side Java benchmarks that improve over SPEC Java in a variety of ways, including more complex code, richer object behaviors, and more demanding memory system requirements.
Abstract: Since benchmarks drive computer science research and industry product development, which ones we use and how we evaluate them are key questions for the community. Despite complex runtime tradeoffs due to dynamic compilation and garbage collection required for Java programs, many evaluations still use methodologies developed for C, C++, and Fortran. SPEC, the dominant purveyor of benchmarks, compounded this problem by institutionalizing these methodologies for their Java benchmark suite. This paper recommends benchmarking selection and evaluation methodologies, and introduces the DaCapo benchmarks, a set of open source, client-side Java benchmarks. We demonstrate that the complex interactions of (1) architecture, (2) compiler, (3) virtual machine, (4) memory management, and (5) application require more extensive evaluation than C, C++, and Fortran which stress (4) much less, and do not require (3). We use and introduce new value, time-series, and statistical metrics for static and dynamic properties such as code complexity, code size, heap composition, and pointer mutations. No benchmark suite is definitive, but these metrics show that DaCapo improves over SPEC Java in a variety of ways, including more complex code, richer object behaviors, and more demanding memory system requirements. This paper takes a step towards improving methodologies for choosing and evaluating benchmarks to foster innovation in system design and implementation for Java and other managed languages.

1,469 citations


"The Qualitas Corpus: A Curated Coll..." refers background in this paper

  • ...[18], which consists of a set of open source, real world Java applications with non-trivial memory loads....

    [...]


Journal ArticleDOI
TL;DR: The infrastructure that is being designed and constructed to support controlled experimentation with testing and regression testing techniques is described and the impact that this infrastructure has had and can be expected to have.
Abstract: Where the creation, understanding, and assessment of software testing and regression testing techniques are concerned, controlled experimentation is an indispensable research methodology. Obtaining the infrastructure necessary to support such experimentation, however, is difficult and expensive. As a result, progress in experimentation with testing techniques has been slow, and empirical data on the costs and effectiveness of techniques remains relatively scarce. To help address this problem, we have been designing and constructing infrastructure to support controlled experimentation with testing and regression testing techniques. This paper reports on the challenges faced by researchers experimenting with testing techniques, including those that inform the design of our infrastructure. The paper then describes the infrastructure that we are creating in response to these challenges, and that we are now making available to other researchers, and discusses the impact that this infrastructure has had and can be expected to have.

1,041 citations


Book
01 Jan 2002
TL;DR: This book discusses methods in corpus linguistics: interpreting concordance lines, applications of corpora in applied linguistics, and more.
Abstract: Corpus linguistics is leading to the development of theories about language which challenge existing orthodoxies in applied linguistics. However, there are also many questions which should be examined and debated: how big should a corpus be? Is the data from a corpus reliable? What are its applications for language teaching? Corpora in Applied Linguistics examines these and other questions related to this emerging field. It discusses these important issues and explores the techniques of investigating a corpus, as well as demonstrating the application of corpora in a wide variety of fields. It also outlines the impact corpus linguistics is having on how languages are taught in the classroom and how it is informing language teaching materials and dictionaries. It makes a superb and accessible introduction to corpus linguistics and is a must read for anyone interested in corpus linguistics and its impact on applied linguistics.

997 citations


"The Qualitas Corpus: A Curated Coll..." refers background in this paper

  • ...Hunston observes that there are limitations on the use of corpora [2]....

    [...]

  • ...Hunston [2] opens her book with “It is no exagger-...

    [...]


Journal ArticleDOI
TL;DR: The following section describes the tools built to test the utilities, including the fuzz (random character) generator, ptyjig (to test interactive utilities), and scripts to automate the testing process.
Abstract: The following section describes the tools we built to test the utilities. These tools include the fuzz (random character) generator, ptyjig (to test interactive utilities), and scripts to automate the testing process. Next, we will describe the tests we performed, giving the types of input we presented to the utilities. Results from the tests will follow along with an analysis of the results, including identification and classification of the program bugs that caused the crashes. The final section presents concluding remarks, including suggestions for avoiding the types of problems detected by our study and some commentary on the bugs we found. We include an Appendix with the user manual pages for fuzz and ptyjig.

943 citations


"The Qualitas Corpus: A Curated Coll..." refers background in this paper

  • ...[8] studied about 90 Unix applications (including emacs, TeX, LaTeX, yacc) to determine how they responded to input....

    [...]


Frequently Asked Questions (2)
Q1. What have the authors contributed in "The Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies"?

The authors describe the Qualitas Corpus, a large curated collection of open source Java systems. The authors discuss its design, organisation, and issues associated with its development. 

This will also allow for adding other kinds of metadata in the future. Their plans for the future of the corpus include growing it in size and representativeness (section V), making it easier to use for studies, and providing more “value add” in terms of metadata. The authors would like to include some of these measurements as part of the metadata in the future. One consequence of those outside the University of Auckland group using the corpus has been suggestions for systems to add.