The Road Towards Reproducibility in Science:
The Case of Data Citation
Nicola Ferro and Gianmaria Silvello
Dept. of Information Engineering, University of Padua, Italy
{ferro,silvello}@dei.unipd.it
Abstract. Data citation has a profound impact on the reproducibility of science, a hot topic in many disciplines such as astronomy, biology, physics, computer science and more. Lately, several authoritative journals have been requesting the sharing of data and the provision of validation methodologies for experiments (e.g., Nature Scientific Data and Nature Physics); these publications, and the publishing industry in general, see data citation as the way to provide new, reliable and usable means for sharing and referring to scientific data. In this paper, we present the state of the art of data citation and we discuss open issues and research directions with a specific focus on reproducibility. Furthermore, we investigate reproducibility issues by using experimental evaluation in Information Retrieval (IR) as a test case. (This paper is a revised and extended version of [33, 35, 57].)
1 Motivations
Data citation plays a central role in providing better transparency and reproducibility in science [16], a challenge taken up by several fields such as Biomedical Research [2], Public Health Research [27] and Biology [18]. Computer Science is also particularly active in reproducibility, as witnessed by the recent Association for Computing Machinery (ACM) policy on result and artifact review and badging (https://www.acm.org/publications/policies/artifact-review-badging)
. For instance, the Database community started an effort called “SIGMOD reproducibility” [38] “to assist in building a culture where sharing results, code, and scripts of database research is the norm” (http://db-reproducibility.seas.harvard.edu/)
. Since 2015, the European Conference on IR (ECIR) [34, 41] has allocated a whole paper track to reproducibility, and in 2015 the RIGOR workshop at SIGIR was dedicated to this topic [12]. Moreover, in 2016 the “Reproducibility of Data-Oriented Experiments in e-Science” seminar was held in Dagstuhl (Germany) [3], bringing together researchers from different fields of computer science with the goal “to come to a common ground across disciplines, leverage best-of-breed approaches, and provide a unifying vision on reproducibility” [33, 35].
In recent years, the nature of research and scientific publishing has been
rapidly evolving and progressively relying on data to sustain claims and provide
experimental evidence for scientific breakthroughs [44]. The preservation, management, access, discovery and retrieval of research data are topics of utmost importance, as witnessed by the great deal of attention they are receiving from the scientific and publishing communities [21]. Along with the pervasiveness and availability of research data, we are witnessing the growing importance of citing these data. Indeed, data citation is required to make the results of research fully available to others, provide suitable means to connect publications with the data they rely upon [59], give credit to data creators, curators and publishers [20], and enable others to better build on previous results and to ask new questions about data [19].
In the traditional context of printed material, the practice of citation has been evolving and adapting across the centuries [21], reaching a stable and reliable state; nevertheless, traditional citation methods and practices cannot be easily applied to citing data. Indeed, citing data poses significant new challenges, such as:
1. the use of heterogeneous data models and formats (e.g., flat data, relational databases, Comma Separated Value (CSV), eXtensible Markup Language (XML), Resource Description Framework (RDF)), requiring different methods to manage, retrieve and access the data;
2. the transience of data, calling for versioning and archiving methods and systems;
3. the necessity to cite data at different levels of coarseness (e.g., if we consider a relational database, then we may need to cite a specific attribute, a tuple, a set of tuples, a table or the database as a whole), requiring methods to individuate, select and reference specific subsets of data (a minimal sketch of such citable units follows this list);
4. the necessity to automatically generate citations to data, because a citation snippet is required to allow the data to be understood and correctly interpreted, and it must be composed of the essential information for identifying the cited data as well as contextual information. Such contextual information must be extracted from the given dataset and/or from external sources automatically, because we cannot assume one knows how to access and select additional relevant data and how to structure them appropriately.
As a consequence, traditional practices need to evolve and adapt in order to
provide effective and usable methods for citing data.
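
To make the granularity challenge (item 3 above) concrete, here is a minimal Python sketch, with purely hypothetical identifiers and field names, of citable units at four levels of coarseness: the whole database, a table, a tuple and a single attribute value.

```python
# Purely illustrative sketch (hypothetical identifiers): citable units
# at different levels of coarseness in a relational dataset.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CitableUnit:
    dataset: str                     # persistent identifier of the whole dataset
    table: Optional[str] = None
    tuple_key: Optional[str] = None
    attribute: Optional[str] = None

    def reference(self) -> str:
        """Serialize the unit as a dataset-relative reference path."""
        parts = [self.dataset] + [p for p in (self.table, self.tuple_key, self.attribute) if p]
        return "/".join(parts)

# Four levels of granularity over the same hypothetical database:
print(CitableUnit("doi:10.1234/demo-db").reference())
print(CitableUnit("doi:10.1234/demo-db", table="runs").reference())
print(CitableUnit("doi:10.1234/demo-db", table="runs", tuple_key="run-42").reference())
print(CitableUnit("doi:10.1234/demo-db", table="runs", tuple_key="run-42",
                  attribute="map_score").reference())
```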
IR represents a challenging field for data citation as well as for reproducibility. In particular, experimental evaluation in IR represents an effective testbed for new ideas and methods for reproducing experiments and citing data. Indeed, reproducing IR experiments is extremely challenging, and there are three main areas of major concern for reproducibility: experiments (or system runs), experimental collections, and meta-evaluation studies. Experiments can be seen as the output of a retrieval system, e.g., a ranked list of documents given a corpus of documents and an information need; to reproduce an experiment we need access to the corpus or sub-corpus and to the information needs used in the experiments, as well as to the software and the methods employed. Meta-evaluation studies are even more complex, since they often involve manipulation of the data used in the actual analysis; this, among other things, requires keeping track of the provenance of the data and including provenance information in the citations to data as well.
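
As a concrete illustration of what a system run looks like, the sketch below parses a run in the de-facto TREC format, where each line carries a topic identifier, the literal "Q0", a document identifier, a rank, a score and a run tag; the file name and its contents are hypothetical.

```python
# Minimal sketch: parsing a system run in the de-facto TREC format,
# "topic_id Q0 doc_id rank score run_tag" per line.

from collections import defaultdict

def load_run(path):
    """Return {topic_id: [(doc_id, rank, score), ...]} sorted by rank."""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            topic, _q0, doc, rank, score, _tag = line.split()
            run[topic].append((doc, int(rank), float(score)))
    for ranking in run.values():
        ranking.sort(key=lambda t: t[1])  # ensure rank order
    return dict(run)

# Hypothetical usage:
# run = load_run("bm25_baseline.run")
# print(run["401"][:3])  # top 3 retrieved documents for topic 401
```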
This paper is organized as follows: Section 2 briefly presents the state of the art of research in data citation and some open issues and research lines, focusing also on provenance, which is particularly important for reproducibility in IR. Section 3 describes the main issues concerning reproducibility in IR evaluation, with a specific focus on the role of data citation in this context. Finally, Section 4 draws some final remarks.
2 Data Citation: Open Issues and Research Directions
Data citation is a complex problem that can be tackled from many perspectives and involves different areas of information and computer science. Overall, data citation has been studied from two main angles: the scholarly publishing viewpoint and the infrastructural and computational one.
The former has been investigating the core principles for data citation and the conditions that any data citation solution should meet [1, 37]; the need to connect scientific publications and the underlying data [17]; the role of data journals [26]; the definition of metrics based on data citations [45]; and the measurement of the impact of datasets [11, 53].
The latter has been focusing on the infrastructures and systems required to handle the evolution of data, such as archiving systems for XML [23], RDF [49] and databases [51]; the use of persistent identifiers [47, 58]; the definition of frameworks and ontologies to publish data [40]; and the creation of repositories to store and provide access to data [4, 25].
As described in [22], from the computational perspective the problem of data citation can be formulated as follows: “Given a dataset D and a query Q, generate an appropriate citation C”. Several of the existing approaches to this problem allow us to reference a dataset as a single unit, with textual data serving as the metadata source; but, as pointed out by [51], most data citations “can often not be generated automatically and they are often not machine interpretable”. Furthermore, most data citation approaches do not provide ways to cite datasets with variable granularity.
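
A minimal sketch of this formulation follows, under the assumption that a citation C should at least pin down the dataset D, the query Q, the execution time and a fingerprint of the result; the field names are illustrative, not a standard.

```python
# Illustrative only: given a dataset D (a persistent identifier) and a
# query Q (its text plus its result), produce a machine-readable
# citation C. Field names are assumptions.

import hashlib, json
from datetime import datetime, timezone

def generate_citation(dataset_pid: str, query: str, result_rows) -> dict:
    """Build a citation C for query Q over dataset D."""
    fingerprint = hashlib.sha256(
        json.dumps(sorted(map(str, result_rows))).encode()
    ).hexdigest()
    return {
        "dataset": dataset_pid,                                 # which D
        "query": query,                                         # which subset of D
        "executed_at": datetime.now(timezone.utc).isoformat(),  # when
        "result_sha256": fingerprint,                           # fixity check
    }

citation = generate_citation(
    "doi:10.1234/demo-db",
    "SELECT doc_id FROM judgments WHERE topic = 401",
    ["doc-17", "doc-245", "doc-301"],
)
print(json.dumps(citation, indent=2))
```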
Until now, the problem of how to cite a dataset at different levels of coarseness, to automatically generate citations and to create human- and machine-readable citations has been tackled by only a few working systems. In [51] an approach relying on persistent and timestamped queries to cite relational databases has been proposed; this method has also been implemented to work with CSV files [52]. On the other hand, this system does not provide a suitable means to automatically generate human- and machine-readable citations. In [24] a rule-based citation system that creates machine- and human-readable citations by using only the information present in the data has been proposed for citing XML. This system has been extended into a methodology that works with database views, provided that the data to be cited can be represented as a hierarchy [22]; this work has been further extended to general queries over relational databases in [28–30]. [55] proposed a methodology for citing XML data based on machine learning techniques, which allows us to create citations with variable granularity by learning from examples, reducing the human effort to a minimum. In [54] a methodology based on named meta-graphs to cite RDF sub-graphs has been proposed; this solution for RDF graphs targets the variable granularity problem and proposes an approach to create human-readable and machine-actionable data citations, even though the actual elements composing a citation are not automatically selected. In the context of RDF citation, [40] proposed the nanopublication model, where a single RDF statement (triple) is made citable in its own right; the idea is to enrich a statement via annotations adding context information such as time, authority and provenance. The statement thus becomes a publication itself, carrying all the information needed to be understood, validated and re-used. This solution is centered around a single statement and the possibility of enriching it.
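
The following schematic, in plain Python rather than an RDF serialization, mirrors the three-part structure of the nanopublication model described in [40] (assertion, provenance, publication information); the URIs and the asserted statement are illustrative only.

```python
# Schematic nanopublication: one asserted triple enriched with
# annotation graphs, loosely following the model in [40].
# All identifiers below are hypothetical.

assertion = [
    # subject, predicate, object: the single citable statement
    ("ex:gene-TP53", "ex:associatedWith", "ex:disease-X"),
]

provenance = {          # how the assertion came to be
    "derived_from": "ex:experiment-123",
    "method": "ex:microarray-analysis",
}

publication_info = {    # who published it, when, under what terms
    "author": "ex:jane-doe",
    "created": "2017-06-13",
    "rights": "ex:cc-by-4.0",
}

nanopublication = {
    "assertion": assertion,
    "provenance": provenance,
    "publication_info": publication_info,
}
print(nanopublication["assertion"])
```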
A great deal of attention has been dedicated to the use of persistent identifiers [9, 47, 58] such as the Digital Object Identifier (DOI), the Persistent Uniform Resource Locator (PURL) and the Archival Resource Key (ARK). Normally, these solutions propose to associate a persistent identifier with a citable dataset and to create a related set of metadata (e.g., author, version, URL) to be used to cite the dataset. Persistent identifiers are foundational for data citation, but they represent just one part of the solution, since they do not allow us to create citations with variable granularity unless we create a unique identifier for each single datum in a dataset, which in most cases is unfeasible. As a consequence, the use of persistent identifiers, as well as their study and evaluation, is mainly related to the publication of research data in order to provide a handle for subsequent citation purposes, rather than being a data citation solution in itself.
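
As a sketch of this identifier-plus-metadata approach, the snippet below renders a human-readable citation from a hypothetical metadata record attached to a PID; the record layout and the rendering style are assumptions, loosely modelled on common data-citation formats.

```python
# Illustrative only: a dataset-level PID plus a metadata record,
# rendered as a human-readable citation snippet. Note that the PID
# identifies the whole dataset; finer granularity would need one
# identifier per subset, which is usually unfeasible.

record = {
    "authors": ["Doe, J.", "Roe, R."],
    "title": "Demo IR Test Collection",
    "version": "2.1",
    "year": 2017,
    "pid": "doi:10.1234/demo-collection",
}

def render_citation(r: dict) -> str:
    """Render a human-readable snippet from the metadata record."""
    authors = "; ".join(r["authors"])
    return (f"{authors} ({r['year']}). {r['title']} "
            f"(Version {r['version']}) [Data set]. {r['pid']}")

print(render_citation(record))
```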
Data citation is a compound and complex problem, and a “one size fits all” system to address it does not yet exist. Indeed, as we have discussed above, flat data, relational databases, XML and RDF datasets are intrinsically different from one another, present heterogeneous structures and functions and, as a consequence, require specific solutions for addressing data citation problems. Furthermore, different communities present specific peculiarities, practices and policies that must be considered when a citation to data has to be provided. As a consequence, within the context of data citation, there are several open issues and research directions we can take into account:
Automatic generation of citations. Most of the solutions addressing this problem work for XML data, because they exploit its hierarchical structure to gather the relevant (meta)data to be used in a citation. On the other hand, there is no ready-to-use solution for non-hierarchical datasets, such as a relational database or an RDF dataset. A further problem is to automatically create citations for data with no structure at all.
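
To illustrate how hierarchical structure can be exploited, the sketch below walks from a cited XML node up to the root, collecting attributes along the way as candidate citation (meta)data; the sample document and its element names are hypothetical.

```python
# Sketch: hierarchical structure driving citation generation. Ancestors
# of the cited node supply its context. The document is hypothetical.

import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<archive title="Demo Research Archive">
  <collection title="IR Experiments">
    <record id="run-42" creator="J. Doe">
      <measure name="MAP">0.312</measure>
    </record>
  </collection>
</archive>
""")

# ElementTree has no parent pointers, so build a child-to-parent map.
parents = {child: parent for parent in doc.iter() for child in parent}

def gather_citation_metadata(node):
    """Collect (tag, attributes) from the node and all its ancestors."""
    metadata = []
    while node is not None:
        metadata.append((node.tag, dict(node.attrib)))
        node = parents.get(node)
    return list(reversed(metadata))  # root first

cited = doc.find(".//measure")
print(gather_citation_metadata(cited))
```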
