Why Linked Data is Not Enough for Scientists
Sean Bechhofer (a), Iain Buchan (b), David De Roure (d, c), Paolo Missier (a), John Ainsworth (b), Jiten Bhagat (a), Philip Couch (b), Don Cruickshank (c), Mark Delderfield (b), Ian Dunlop (a), Matthew Gamble (a), Danius Michaelides (c), Stuart Owen (a), David Newman (c), Shoaib Sufi (a), Carole Goble (a)

(a) School of Computer Science, University of Manchester, UK
(b) School of Community Based Medicine, University of Manchester, UK
(c) School of Electronics and Computer Science, University of Southampton, UK
(d) Oxford e-Research Centre, University of Oxford, UK

Corresponding author: Sean Bechhofer (sean.bechhofer@manchester.ac.uk)
Abstract
Scientific data represents a significant portion of the linked open data cloud and scientists stand to benefit from the data fusion capability this will afford. Publishing linked data into the cloud, however, does not ensure the required reusability. Publishing has requirements of provenance, quality, credit, attribution and methods to provide the reproducibility that enables validation of results. In this paper we make the case for a scientific data publication model on top of linked data and introduce the notion of Research Objects as first class citizens for sharing and publishing.
1. Introduction
Changes are occurring in the ways in which research is conducted. Within wholly digital environments, methods such as scientific workflows, research protocols, standard operating procedures and algorithms for analysis or simulation are used to manipulate and produce data. Experimental or observational data and scientific models are typically "born digital", with no physical counterpart. This move to digital content is driving a sea change in scientific publication, and challenging traditional scholarly publication. Shifts in dissemination mechanisms are thus leading towards increasing use of electronic publication methods. Traditional paper publications are, in the main, linear and human (rather than machine) readable. A simple move from paper-based to electronic publication, however, does not necessarily make a scientific output decomposable. Nor does it guarantee that outputs, results or methods are reusable.
Current scientific knowledge management serves society poorly: for example, the time taken for new knowledge to reach practice can be more than a decade. In medicine, the information used to support clinical decisions is not dynamically linked to the cumulative knowledge of best practice from research and audit. More than half of the effects of medications cannot be predicted from the scientific literature, because trials usually exclude women of childbearing age, people with other diseases, or those on other medications. Many clinicians audit the outcomes of their treatments using research methods. This work could help bridge the knowledge gap between clinical trials and real-world outcomes if it is made reusable in wider research [1].
As a further example from the medical field, there are multiple studies relating sleep patterns to work performance. Each study has a slightly different design, and there is disagreement in reviews as to whether or not the overall message separates out cause from effect. Ideally, the study data, context information and modelling methods would be extracted from each paper and put together in a larger model, not just a review of summary data. To do this well is intellectually harder than running a primary study, one that measures things directly. This need for broad-ranging "meta-science", and not just deep "mega-science", is shared by many domains of research, not just medicine.
Studies continue to show that research in all fields is increasingly collaborative [2]. Most scientific and engineering domains would benefit from being able to "borrow strength" from the outputs of other research, not only in information to reason over but also in data to incorporate in the modelling task at hand. We thus see a need for a framework that facilitates the reuse and exchange of digital knowledge. Linked Data [3] provides a compelling approach to dissemination of scientific data for reuse. However, simply publishing data out of context would fail to: 1) reflect the research methodology; and 2) respect the rights and reputation of the researcher. Scientific practice is based on publication of results being associated with provenance to aid interpretation and trust, and description of methods to support reproducibility.
In this paper, we discuss the notion of Research Objects (ROs), semantically rich aggregations of (potentially distributed) resources that provide a layer of structure on top of information delivered as Linked Data. An RO provides a container for a principled aggregation of resources, produced and consumed by common services and shareable within and across organisational boundaries. An RO bundles together essential information relating to experiments and investigations. This includes not only the data used and the methods employed to produce and analyse that data, but also the people involved in the investigation. In the following sections, we look at the motivation for linking up science, consider scientific practice and look to three examples to inform our discussion. Based on this, we identify principles of ROs and map these to a set of features. We discuss the implementation of ROs in the emerging Object Reuse and Exchange (ORE) representation, and conclude with a discussion of the insights from this exercise and critical reflection on Linked Data and ORE.
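To make the aggregation structure concrete, the following sketch (ours, purely illustrative, and not the representation developed later in the paper) expresses a Research Object as an ORE Aggregation using Python's rdflib; the resource URIs and creator name are hypothetical placeholders.

# Illustrative sketch only: a Research Object modelled as an ORE Aggregation.
# The URIs and names below are hypothetical placeholders, not real resources.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")
EX = Namespace("http://example.org/ro/")

g = Graph()
ro = EX["investigation-42"]

# The RO itself is an aggregation of (potentially distributed) resources.
g.add((ro, RDF.type, ORE.Aggregation))
g.add((ro, DCTERMS.creator, Literal("A. Researcher")))  # people involved

# Aggregated resources: the data used, the method that produced and analysed
# the data, and the results of the investigation.
g.add((ro, ORE.aggregates, URIRef("http://example.org/data/study-data.csv")))
g.add((ro, ORE.aggregates, URIRef("http://example.org/methods/analysis-workflow")))
g.add((ro, ORE.aggregates, URIRef("http://example.org/results/summary")))

print(g.serialize(format="turtle"))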
2. Reproducible research, linking data and the publication process
Our work here is situated in the context of e-Laboratories, environments that provide distributed and collaborative spaces for e-Science, enabling the planning and execution of in silico and hybrid studies: processes that combine data with computational activities to yield research results. This includes the notion of an e-Laboratory as a traditional laboratory with on-line equipment or a Laboratory Information Management System, but goes well beyond this notion to scholars in any setting reasoning through distributed digital resources as their laboratory.
2.1. Reproducible Research
Mesirov [4] describes the notion of Accessible Reproducible Research, where
scientific publications should provide clear enough descriptions of the protocols
to enable successful repetition and extension. Mesirov describes a Reproducible
Results System that facilitates the enactment and publication of reproducible
research. Such a system should provide the ability to track the provenance of
data, analyses and results, and to package them for redistribution/publication.
A key role of the publication is argumentation: convincing the reader that the
conclusions presented do indeed follow from the evidence presented.
De Roure and Goble [5] observe that results are "reinforced by reproducibility", with traditional scholarly lifecycles focused on the need for reproducibility. They also argue for the primacy of method, ensuring that users can then reuse those methods in pursuing reproducibility. While traditional "paper" publication can present intellectual arguments, fostering reinforcement requires the inclusion of data, methods and results in our publications, thus supporting reproducibility. A problem with traditional paper publication, as identified by Mons [6], is that of "Knowledge Burying". The results of an experiment are written up in a paper which is then published. Rather than the information being explicitly included in structured forms, however, techniques such as text mining must then be used to extract the knowledge from that paper, resulting in a loss of that knowledge.
In a paper from the Yale Law School Roundtable on Data and Code Sharing in Computational Science, Stodden et al. [7] also discuss the notion of Reproducible Research. Here they identify verifiability as a key factor, with the generation of verifiable knowledge being scientific discovery's central goal. They outline a number of guidelines or recommendations to facilitate the generation of reproducible results. These guidelines largely concern openness in the data publication process, for example the use of open licences and non-proprietary standards. Long term goals identified here include the development of version control systems for data; tools for effective download tracking of code and data in order to support citation and attribution; and the development of standardised terminologies and vocabularies for data description. Mechanisms for citation and attribution (including data citation, e.g. DataCite¹) are key in providing incentives for scientists to publish data.

¹ http://datacite.org/
The Scientific Knowledge Objects [8] of the LiquidPub project describe aggregation structures intended to describe scientific papers, books and journals. The approach explicitly considers the lifecycle of publications in terms of three "states": Gas, Liquid and Solid, which represent early, tentative and finalised work respectively.
Groth et al. [9] describe the notion of a "Nano-publication": an explicit representation of a statement that is made in scientific literature. Such statements may be made in multiple locations, for example in different papers, and validation of a statement can only be done given its context. An example given is the statement that malaria is transmitted by mosquitos, which will appear in many places in the published literature, each occurrence potentially backed by differing evidence. Each nano-publication is associated with a set of annotations that refer to the statement and provide a minimum set of (community) agreed annotations that identify authorship, provenance, and so on. These annotations can then be used as the basis for review, citation and indeed further annotation. The nano-publication model described in [9] considers a statement to be a triple, a tuple of three concepts (subject, predicate and object), which fits closely with the Resource Description Framework (RDF) data model [10], used widely for (meta)data publication (see the discussion on Linked Data below). The proposed implementation uses RDF and Named Graphs². Aggregation of nano-publications will be facilitated by the use of common identifiers (following Linked Data principles as discussed in Section 7), and to support this, the Concept Web Alliance³ are developing a ConceptWiki⁴, providing URIs for biomedical concepts. The nano-publication approach is rather fine-grained, focusing on single statements along with their provenance.
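As an illustration of the named-graph idea (our sketch, not the implementation proposed in [9]), the following Python fragment uses rdflib to place the malaria statement in one named graph and attach authorship and evidence annotations to that graph in another; all URIs, property names and the author literal are hypothetical.

# Illustrative sketch of a nano-publication-style structure in RDF named graphs.
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/nanopub/")
ds = Dataset()

# Assertion graph: the single statement "malaria is transmitted by mosquitos".
assertion_uri = EX["assertion-1"]
assertion = ds.graph(assertion_uri)
assertion.add((EX["Malaria"], EX["isTransmittedBy"], EX["Mosquito"]))

# Annotation graph: authorship and supporting evidence attached to the
# assertion graph as a whole, enabling review, citation and further annotation.
annotations = ds.graph(EX["provenance-1"])
annotations.add((assertion_uri, DCTERMS.creator, Literal("A. Researcher")))
annotations.add((assertion_uri, EX["supportedBy"], URIRef("http://example.org/papers/123")))

print(ds.serialize(format="trig"))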
The Executable Paper Grand Challenge⁵ was a contest for proposals that will "improve the way scientific information is communicated and used". For executable papers, this will be through adaptations to existing publication models to include data and analyses, and thus facilitate the validation, citation and tracking of that information. The three winning entries in 2011 highlight different aspects of the notion of executable papers. Collage [11] provides infrastructure which allows for the embedding of executable codes in papers. SHARE [12] focuses on the issue of reproducibility, using virtual machines to provide execution. Finally, Gavish and Donoho [13] focus on verifiability, through a system consisting of a Repository holding Verifiable Computational Results (VCRs) that are identified using Verifiable Result Identifiers (VRIs). We note, however, that none of these proposals provides an explicit notion of "Research Object" as introduced here. In addition, provenance information is only considered in the third proposal, where Gavish and Donoho suggest that the ability to re-execute processes may be unnecessary. Rather, understanding of the process can be supported through providing access to the computation tree along with inputs, outputs, parameters and code descriptions.
² See Section 7 for an explanation of Named Graphs.
³ http://www.nbic.nl/about-nbic/affiliated-organisations/cwa/introduction/
⁴ http://conceptwiki.org/
⁵ http://www.executablepapers.com/

2.2. Linked Data
The benefits of explicit representation are clear. An association with a dataset (or service, result collection, or instrument) should be more than just a citation or reference to that resource. The association should rather be a link to that resource which can be followed or dereferenced explicitly. Such linking provides access to the actual resource and thus enactment of the service, query or retrieval of data, and so on, fostering reproducibility.
The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web [3]. Linked Data explicitly encourages the use of dereferenceable links as discussed above, and the Linked Data "principles" (the use of HTTP URIs for naming, the provision of useful information when dereferencing URIs, and the inclusion of links to other URIs) are intended to foster reuse, linkage and consumption of that data. Further discussion of Linked Data is given in Section 7.
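As a simple illustration of dereferenceability (our example, not drawn from the paper), an HTTP URI published as Linked Data can be resolved with content negotiation to obtain a machine-readable description; the DBpedia URI below is used only because it is a well-known, publicly dereferenceable resource.

import requests  # third-party HTTP library, assumed to be available

# Ask for a machine-readable (Turtle) representation of the resource.
uri = "http://dbpedia.org/resource/Malaria"
response = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=30)

print(response.status_code)   # served via redirect to an RDF document
print(response.text[:500])    # RDF description, including links to other URIs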
2.3. Preservation and Archiving
The Open Archival Information System (OAIS) reference model [14] describes "open archival information systems", which are concerned with preserving information for the benefit of a community. The OAIS Functional Model describes a core set of mechanisms, including Ingest, Storage and Access, along with Planning, Data Management and Administration. There is also a separation between the Submission Information Package, the mechanism by which content is submitted for ingest by a Producer; the Archival Information Package, the version stored by the system; and the Dissemination Information Package, the version delivered to a Consumer.
OAIS considers three external entities or actors that interact with the system: Producers, Management and Consumers. These characterise, respectively, those who transfer information to the system for preservation; those who formulate and enforce high-level policies (planning, defining scope, providing "guarantees"); and those who are expected to use the information. OAIS also considers the notion of a Designated Community, a subset of consumers who are expected to understand the archived information.
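The separation between the three package types can be sketched as follows (our illustration, not part of the OAIS specification text); the class and field names are placeholders chosen for readability.

# Minimal sketch of the OAIS package separation; names are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class InformationPackage:
    content: List[str]                                   # identifiers of archived content
    metadata: Dict[str, str] = field(default_factory=dict)

@dataclass
class SubmissionInformationPackage(InformationPackage):
    producer: str = ""       # submitted for ingest by a Producer

@dataclass
class ArchivalInformationPackage(InformationPackage):
    ingest_date: str = ""    # the version stored and managed by the archive

@dataclass
class DisseminationInformationPackage(InformationPackage):
    consumer: str = ""       # the version delivered to a Consumer on access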
2.4. Scientific Publication Packages
One notable precursor to the notion of Research Object presented in this
paper is the idea of Scientific Publication Packages (SPP), proposed in 2006 by
Hunter to describe “the selective encapsulation of raw data, derived products,
algorithms, software and textual publications” [15].
SPPs are motivated primarily by the need to create archives for the variety
of artifacts, such as those listed above, that are produced during the course
of a scientific investigation. In this “digital libraries” view of experimental
science, SPPs ideally contain not only data, software, and documents, but their
provenance as well. As we note here, the latter is a key enabler both for scientific
reproducibility, and to let third parties verify scientific accuracy. Thus, SPPs are
