The Road Towards Reproducibility in Science:
The Case of Data Citation
Nicola Ferro and Gianmaria Silvello
Dept. of Information Engineering, University of Padua, Italy
{ferro,silvello}@dei.unipd.it
Abstract. Data citation has a profound impact on the reproducibility of science, a hot topic in many disciplines such as astronomy, biology, physics, computer science and more. Lately, several authoritative journals have been requesting the sharing of data and the provision of validation methodologies for experiments (e.g., Nature Scientific Data and Nature Physics); these publications, and the publishing industry in general, see data citation as the way to provide new, reliable and usable means for sharing and referring to scientific data. In this paper, we present the state of the art of data citation and we discuss open issues and research directions with a specific focus on reproducibility. Furthermore, we investigate reproducibility issues by using experimental evaluation in Information Retrieval (IR) as a test case. (This paper is a revised and extended version of [33, 35, 57].)
1 Motivations
Data citation plays a central role in providing better transparency and reproducibility in science [16], a challenge taken up by several fields such as Biomedical Research [2], Public Health Research [27] and Biology [18]. Computer Science is also particularly active in reproducibility, as witnessed by the recent Association for Computing Machinery (ACM) policy on result and artifact review and badging (https://www.acm.org/publications/policies/artifact-review-badging)
. For instance, the Database community started an effort called “SIGMOD reproducibility” [38] “to assist in building a culture where sharing results, code, and scripts of database research is the norm” (http://db-reproducibility.seas.harvard.edu/)
. Since 2015, the European Conference on IR (ECIR) [34, 41] has allocated a whole paper track to reproducibility, and in 2015 the RIGOR workshop at SIGIR was dedicated to this topic [12]. Moreover, in 2016 the “Reproducibility of Data-Oriented Experiments in e-Science” seminar was held in Dagstuhl (Germany) [3], bringing together researchers from different fields of computer science with the goal “to come to a common ground across disciplines, leverage best-of-breed approaches, and provide a unifying vision on reproducibility” [33, 35].
In recent years, the nature of research and scientific publishing has been
rapidly evolving and progressively relying on data to sustain claims and provide
experimental evidence for scientific breakthroughs [44]. The preservation, management, access, discovery and retrieval of research data are topics of utmost importance, as witnessed by the great deal of attention they are receiving from the scientific and publishing communities [21]. Along with the pervasiveness and availability of research data, we are witnessing the growing importance of citing these data. Indeed, data citation is required to make the results of research fully available to others, provide suitable means to connect publications with the data they rely upon [59], give credit to data creators, curators and publishers [20], and enable others to better build on previous results and to ask new questions about data [19].
In the traditional context of printed material, the practice of citation has been evolving and adapting across the centuries [21], reaching a stable and reliable state; nevertheless, traditional citation methods and practices cannot be easily applied to citing data. Indeed, citing data poses significant new challenges, such as:
1. the use of heterogeneous data models and formats (e.g., flat data, relational databases, Comma Separated Value (CSV), eXtensible Markup Language (XML), Resource Description Framework (RDF)), requiring different methods to manage, retrieve and access the data;
2. the transience of data, calling for versioning and archiving methods and systems;
3. the necessity to cite data at different levels of coarseness (e.g., if we consider a relational database, then we may need to cite a specific attribute, a tuple, a set of tuples, a table or the database as a whole), requiring methods to individuate, select and reference specific subsets of data (a minimal sketch of such citable units follows this list);
4. the necessity to automatically generate citations to data, because a citation snippet is required to allow the data to be understood and correctly interpreted, and it must be composed of the essential information for identifying the cited data as well as contextual information. Such contextual information must be extracted from the given dataset and/or from external sources automatically, because we cannot assume one knows how to access and select additional relevant data and how to structure them appropriately.
As a consequence, traditional practices need to evolve and adapt in order to
provide effective and usable methods for citing data.
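
To make the granularity challenge (item 3 above) concrete, here is a minimal Python sketch, with purely hypothetical identifiers and field names, of citable units at four levels of coarseness: the whole database, a table, a tuple and a single attribute value.

```python
# Purely illustrative sketch (hypothetical identifiers): citable units
# at different levels of coarseness in a relational dataset.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CitableUnit:
    dataset: str                     # persistent identifier of the whole dataset
    table: Optional[str] = None
    tuple_key: Optional[str] = None
    attribute: Optional[str] = None

    def reference(self) -> str:
        """Serialize the unit as a dataset-relative reference path."""
        parts = [self.dataset] + [p for p in (self.table, self.tuple_key, self.attribute) if p]
        return "/".join(parts)

# Four levels of granularity over the same hypothetical database:
print(CitableUnit("doi:10.1234/demo-db").reference())
print(CitableUnit("doi:10.1234/demo-db", table="runs").reference())
print(CitableUnit("doi:10.1234/demo-db", table="runs", tuple_key="run-42").reference())
print(CitableUnit("doi:10.1234/demo-db", table="runs", tuple_key="run-42",
                  attribute="map_score").reference())
```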
IR represents a challenging field for data citation as well as for reproducibility. In particular, experimental evaluation in IR represents an effective testbed for new ideas and methods for reproducing experiments and citing data. Indeed, reproducing IR experiments is extremely challenging, and there are three main areas of major concern for reproducibility: experiments (or system runs), experimental collections, and meta-evaluation studies. Experiments can be seen as the output of a retrieval system, e.g., a ranked list of documents given a corpus of documents and an information need; to reproduce an experiment we need access to the corpus or sub-corpus and to the information needs used in the experiments, as well as to the software and the methods employed. Meta-evaluation studies are even more complex, since they often involve manipulation of the data used in the actual analysis; this, among other things, requires keeping track of the provenance of the data and including provenance information in the citations to data as well.
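
As a concrete illustration of what a system run looks like, the sketch below parses a run in the de-facto TREC format, where each line carries a topic identifier, the literal "Q0", a document identifier, a rank, a score and a run tag; the file name and its contents are hypothetical.

```python
# Minimal sketch: parsing a system run in the de-facto TREC format,
# "topic_id Q0 doc_id rank score run_tag" per line.

from collections import defaultdict

def load_run(path):
    """Return {topic_id: [(doc_id, rank, score), ...]} sorted by rank."""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            topic, _q0, doc, rank, score, _tag = line.split()
            run[topic].append((doc, int(rank), float(score)))
    for ranking in run.values():
        ranking.sort(key=lambda t: t[1])  # ensure rank order
    return dict(run)

# Hypothetical usage:
# run = load_run("bm25_baseline.run")
# print(run["401"][:3])  # top 3 retrieved documents for topic 401
```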
This paper is organized as follows: Section 2 briefly presents the state of the art of research in data citation and some open issues and research lines, focusing also on provenance, which is particularly important for reproducibility in IR. Section 3 describes the main issues concerning reproducibility in IR evaluation, with a specific focus on the role of data citation in this context. Finally, Section 4 draws some final remarks.
2 Data Citation: Open Issues and Research Directions
Data citation is a complex problem that can be tackled from many perspectives and involves different areas of information and computer science. Overall, data citation has been studied from two main angles: the scholarly publishing viewpoint and the infrastructural and computational one.
The former has been investigating the core principles for data citation and the conditions that any data citation solution should meet [1, 37]; the need to connect scientific publications and the underlying data [17]; the role of data journals [26]; the definition of metrics based on data citations [45]; and the measurement of the impact of datasets [11, 53].
The latter has been focusing on the infrastructures and systems required to handle the evolution of data, such as archiving systems for XML [23], RDF [49] and databases [51]; the use of persistent identifiers [47, 58]; the definition of frameworks and ontologies to publish data [40]; and the creation of repositories to store and provide access to data [4, 25].
As described in [22], from the computational perspective the problem of data citation can be formulated as follows: “Given a dataset D and a query Q, generate an appropriate citation C”. Several of the existing approaches to this problem allow us to reference a dataset as a single unit, with textual data serving as the metadata source; but, as pointed out by [51], most data citations “can often not be generated automatically and they are often not machine interpretable”. Furthermore, most data citation approaches do not provide ways to cite datasets with variable granularity.
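
A minimal sketch of this formulation follows, under the assumption that a citation C should at least pin down the dataset D, the query Q, the execution time and a fingerprint of the result; the field names are illustrative, not a standard.

```python
# Illustrative only: given a dataset D (a persistent identifier) and a
# query Q (its text plus its result), produce a machine-readable
# citation C. Field names are assumptions.

import hashlib, json
from datetime import datetime, timezone

def generate_citation(dataset_pid: str, query: str, result_rows) -> dict:
    """Build a citation C for query Q over dataset D."""
    fingerprint = hashlib.sha256(
        json.dumps(sorted(map(str, result_rows))).encode()
    ).hexdigest()
    return {
        "dataset": dataset_pid,                                 # which D
        "query": query,                                         # which subset of D
        "executed_at": datetime.now(timezone.utc).isoformat(),  # when
        "result_sha256": fingerprint,                           # fixity check
    }

citation = generate_citation(
    "doi:10.1234/demo-db",
    "SELECT doc_id FROM judgments WHERE topic = 401",
    ["doc-17", "doc-245", "doc-301"],
)
print(json.dumps(citation, indent=2))
```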
Until now, the problem of how to cite a dataset at different levels of coarseness, to automatically generate citations and to create human- and machine-readable citations has been tackled by only a few working systems. In [51] an approach relying on persistent and timestamped queries to cite relational databases has been proposed; this method has also been implemented to work with CSV files [52]. On the other hand, this system does not provide a suitable means to automatically generate human- and machine-readable citations. In [24] a rule-based citation system that creates machine- and human-readable citations by using only the information present in the data has been proposed for citing XML. This system has been extended into a methodology that works with database views, provided that the data to be cited can be represented as a hierarchy [22]; this work has been further extended to general queries over relational databases in [28–30]. [55] proposed a methodology for citing XML data based on machine learning techniques, which allows us to create citations with variable granularity by learning from examples, reducing the human effort to a minimum. In [54] a methodology based on named meta-graphs to cite RDF sub-graphs has been proposed; this solution for RDF graphs targets the variable granularity problem and proposes an approach to create human-readable and machine-actionable data citations, even though the actual elements composing a citation are not automatically selected. In the context of RDF citation, [40] proposed the nanopublication model, where a single RDF statement (triple) is made citable in its own right; the idea is to enrich a statement via annotations adding context information such as time, authority and provenance. The statement thus becomes a publication itself, carrying all the information needed to be understood, validated and re-used. This solution is centered around a single statement and the possibility of enriching it.
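
The following schematic, in plain Python rather than an RDF serialization, mirrors the three-part structure of the nanopublication model described in [40] (assertion, provenance, publication information); the URIs and the asserted statement are illustrative only.

```python
# Schematic nanopublication: one asserted triple enriched with
# annotation graphs, loosely following the model in [40].
# All identifiers below are hypothetical.

assertion = [
    # subject, predicate, object: the single citable statement
    ("ex:gene-TP53", "ex:associatedWith", "ex:disease-X"),
]

provenance = {          # how the assertion came to be
    "derived_from": "ex:experiment-123",
    "method": "ex:microarray-analysis",
}

publication_info = {    # who published it, when, under what terms
    "author": "ex:jane-doe",
    "created": "2017-06-13",
    "rights": "ex:cc-by-4.0",
}

nanopublication = {
    "assertion": assertion,
    "provenance": provenance,
    "publication_info": publication_info,
}
print(nanopublication["assertion"])
```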
A great deal of attention has been dedicated to the use of persistent identifiers [9, 47, 58] such as the Digital Object Identifier (DOI), the Persistent Uniform Resource Locator (PURL) and the Archival Resource Key (ARK). Normally, these solutions propose to associate a persistent identifier with a citable dataset and to create a related set of metadata (e.g., author, version, URL) to be used to cite the dataset. Persistent identifiers are foundational for data citation, but they represent just one part of the solution, since they do not allow us to create citations with variable granularity unless we create a unique identifier for each single datum in a dataset, which in most cases is unfeasible. As a consequence, the use of persistent identifiers, as well as their study and evaluation, is mainly related to the publication of research data in order to provide a handle for subsequent citation purposes, rather than being a data citation solution in itself.
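
As a sketch of this identifier-plus-metadata approach, the snippet below renders a human-readable citation from a hypothetical metadata record attached to a PID; the record layout and the rendering style are assumptions, loosely modelled on common data-citation formats.

```python
# Illustrative only: a dataset-level PID plus a metadata record,
# rendered as a human-readable citation snippet. Note that the PID
# identifies the whole dataset; finer granularity would need one
# identifier per subset, which is usually unfeasible.

record = {
    "authors": ["Doe, J.", "Roe, R."],
    "title": "Demo IR Test Collection",
    "version": "2.1",
    "year": 2017,
    "pid": "doi:10.1234/demo-collection",
}

def render_citation(r: dict) -> str:
    """Render a human-readable snippet from the metadata record."""
    authors = "; ".join(r["authors"])
    return (f"{authors} ({r['year']}). {r['title']} "
            f"(Version {r['version']}) [Data set]. {r['pid']}")

print(render_citation(record))
```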
Data citation is a compound and complex problem, and a “one size fits all” system to address it does not yet exist. Indeed, as we have discussed above, flat data, relational databases, XML and RDF datasets are intrinsically different from one another, present heterogeneous structures and functions and, as a consequence, require specific solutions for addressing data citation problems. Furthermore, different communities present specific peculiarities, practices and policies that must be considered when a citation to data has to be provided. As a consequence, within the context of data citation, there are several open issues and research directions we can take into account:
Automatic generation of citations. Most of the solutions addressing this problem work for XML data, because they exploit its hierarchical structure to gather the relevant (meta)data to be used in a citation. On the other hand, there is no ready-to-use solution for non-hierarchical datasets, such as a relational database or an RDF dataset. A further problem is to automatically create citations for data with no structure at all.
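
To illustrate how hierarchical structure can be exploited, the sketch below walks from a cited XML node up to the root, collecting attributes along the way as candidate citation (meta)data; the sample document and its element names are hypothetical.

```python
# Sketch: hierarchical structure driving citation generation. Ancestors
# of the cited node supply its context. The document is hypothetical.

import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<archive title="Demo Research Archive">
  <collection title="IR Experiments">
    <record id="run-42" creator="J. Doe">
      <measure name="MAP">0.312</measure>
    </record>
  </collection>
</archive>
""")

# ElementTree has no parent pointers, so build a child-to-parent map.
parents = {child: parent for parent in doc.iter() for child in parent}

def gather_citation_metadata(node):
    """Collect (tag, attributes) from the node and all its ancestors."""
    metadata = []
    while node is not None:
        metadata.append((node.tag, dict(node.attrib)))
        node = parents.get(node)
    return list(reversed(metadata))  # root first

cited = doc.find(".//measure")
print(gather_citation_metadata(cited))
```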
