Why Linked Data is Not Enough for Scientists
Sean Bechhofer (a), Iain Buchan (b), David De Roure (d, c), Paolo Missier (a), John Ainsworth (b), Jiten Bhagat (a), Philip Couch (b), Don Cruickshank (c), Mark Delderfield (b), Ian Dunlop (a), Matthew Gamble (a), Danius Michaelides (c), Stuart Owen (a), David Newman (c), Shoaib Sufi (a), Carole Goble (a)

(a) School of Computer Science, University of Manchester, UK
(b) School of Community Based Medicine, University of Manchester, UK
(c) School of Electronics and Computer Science, University of Southampton, UK
(d) Oxford e-Research Centre, University of Oxford, UK

Corresponding author: Sean Bechhofer (sean.bechhofer@manchester.ac.uk)
Abstract
Scientific data represents a significant portion of the linked open data cloud and scientists stand to benefit from the data fusion capability this will afford. Publishing linked data into the cloud, however, does not ensure the required reusability. Publishing has requirements of provenance, quality, credit, attribution and methods to provide the reproducibility that enables validation of results. In this paper we make the case for a scientific data publication model on top of linked data and introduce the notion of Research Objects as first class citizens for sharing and publishing.
1. Introduction
Changes are occurring in the ways in which research is conducted. Within wholly digital environments, methods such as scientific workflows, research protocols, standard operating procedures and algorithms for analysis or simulation are used to manipulate and produce data. Experimental or observational data and scientific models are typically "born digital", with no physical counterpart. This move to digital content is driving a sea change in scientific publication, and challenging traditional scholarly publication. Shifts in dissemination mechanisms are thus leading towards increasing use of electronic publication methods. Traditional paper publications are, in the main, linear and human (rather than machine) readable. A simple move from paper-based to electronic publication, however, does not necessarily make a scientific output decomposable. Nor does it guarantee that outputs, results or methods are reusable.
Current scientific knowledge management serves society poorly: for example, the time taken for new knowledge to reach practice can be more than a decade. In medicine, the information used to support clinical decisions is not dynamically linked to the cumulative knowledge of best practice from research and audit. More than half of the effects of medications cannot be predicted from the scientific literature, because trials usually exclude women of childbearing age, people with other diseases, or those on other medications. Many clinicians audit the outcomes of their treatments using research methods. This work could help bridge the knowledge gap between clinical trials and real-world outcomes if it is made reusable in wider research [1].
As a further example from the medical field, there are multiple studies relating sleep patterns to work performance. Each study has a slightly different design, and there is disagreement in reviews as to whether or not the overall message separates out cause from effect. Ideally, the study data, context information and modelling methods would be extracted from each paper and put together in a larger model, not just a review of summary data. To do this well is intellectually harder than running a primary study, one that measures things directly. This need for broad-ranging "meta-science", and not just deep "mega-science", is shared by many domains of research, not just medicine.
Studies continue to show that research in all fields is increasingly collaborative [2]. Most scientific and engineering domains would benefit from being able to "borrow strength" from the outputs of other research, not only in information to reason over but also in data to incorporate in the modelling task at hand. We thus see a need for a framework that facilitates the reuse and exchange of digital knowledge. Linked Data [3] provides a compelling approach to dissemination of scientific data for reuse. However, simply publishing data out of context would fail to: 1) reflect the research methodology; and 2) respect the rights and reputation of the researcher. Scientific practice is based on publication of results being associated with provenance to aid interpretation and trust, and description of methods to support reproducibility.
In this paper, we discuss the notion of Research Objects (ROs), semantically rich aggregations of (potentially distributed) resources that provide a layer of structure on top of information delivered as Linked Data. An RO provides a container for a principled aggregation of resources, produced and consumed by common services and shareable within and across organisational boundaries. An RO bundles together essential information relating to experiments and investigations. This includes not only the data used and the methods employed to produce and analyse that data, but also the people involved in the investigation. In the following sections, we look at the motivation for linking up science, consider scientific practice and look to three examples to inform our discussion. Based on this, we identify principles of ROs and map these to a set of features. We discuss the implementation of ROs in the emerging Object Reuse and Exchange (ORE) representation, and conclude with a discussion of the insights from this exercise and critical reflection on Linked Data and ORE.
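To make the aggregation structure concrete, the following sketch (ours, purely illustrative, and not the representation developed later in the paper) expresses a Research Object as an ORE Aggregation using Python's rdflib; the resource URIs and creator name are hypothetical placeholders.

# Illustrative sketch only: a Research Object modelled as an ORE Aggregation.
# The URIs and names below are hypothetical placeholders, not real resources.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")
EX = Namespace("http://example.org/ro/")

g = Graph()
ro = EX["investigation-42"]

# The RO itself is an aggregation of (potentially distributed) resources.
g.add((ro, RDF.type, ORE.Aggregation))
g.add((ro, DCTERMS.creator, Literal("A. Researcher")))  # people involved

# Aggregated resources: the data used, the method that produced and analysed
# the data, and the results of the investigation.
g.add((ro, ORE.aggregates, URIRef("http://example.org/data/study-data.csv")))
g.add((ro, ORE.aggregates, URIRef("http://example.org/methods/analysis-workflow")))
g.add((ro, ORE.aggregates, URIRef("http://example.org/results/summary")))

print(g.serialize(format="turtle"))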
2. Reproducible research, linking data and the publication process
Our work here is situated in the context of e-Laboratories, environments that provide distributed and collaborative spaces for e-Science, enabling the planning and execution of in silico and hybrid studies: processes that combine data with computational activities to yield research results. This includes the notion of an e-Laboratory as a traditional laboratory with on-line equipment or a Laboratory Information Management System, but goes well beyond this notion to scholars in any setting reasoning through distributed digital resources as their laboratory.
2.1. Reproducible Research
Mesirov [4] describes the notion of Accessible Reproducible Research, where
scientific publications should provide clear enough descriptions of the protocols
to enable successful repetition and extension. Mesirov describes a Reproducible
Results System that facilitates the enactment and publication of reproducible
research. Such a system should provide the ability to track the provenance of
data, analyses and results, and to package them for redistribution/publication.
A key role of the publication is argumentation: convincing the reader that the
conclusions presented do indeed follow from the evidence presented.
De Roure and Goble [5] observe that results are "reinforced by reproducibility", with traditional scholarly lifecycles focused on the need for reproducibility. They also argue for the primacy of method, ensuring that users can then reuse those methods in pursuing reproducibility. While traditional "paper" publication can present intellectual arguments, fostering reinforcement requires the inclusion of data, methods and results in our publications, thus supporting reproducibility. A problem with traditional paper publication, as identified by Mons [6], is that of "Knowledge Burying". The results of an experiment are written up in a paper which is then published. Rather than the information being explicitly included in structured forms, however, techniques such as text mining must then be used to extract the knowledge from that paper, resulting in a loss of that knowledge.
In a paper from the Yale Law School Roundtable on Data and Code Sharing in Computational Science, Stodden et al. [7] also discuss the notion of Reproducible Research. Here they identify verifiability as a key factor, with the generation of verifiable knowledge being scientific discovery's central goal. They outline a number of guidelines or recommendations to facilitate the generation of reproducible results. These guidelines largely concern openness in the data publication process, for example the use of open licences and non-proprietary standards. Long term goals identified here include the development of version control systems for data; tools for effective download tracking of code and data in order to support citation and attribution; and the development of standardised terminologies and vocabularies for data description. Mechanisms for citation and attribution (including data citation, e.g. DataCite¹) are key in providing incentives for scientists to publish data.

¹ http://datacite.org/
The Scientific Knowledge Objects [8] of the LiquidPub project describe aggregation structures intended to describe scientific papers, books and journals. The approach explicitly considers the lifecycle of publications in terms of three "states": Gas, Liquid and Solid, which represent early, tentative and finalised work respectively.
Groth et al. [9] describe the notion of a "Nano-publication": an explicit representation of a statement that is made in scientific literature. Such statements may be made in multiple locations, for example in different papers, and validation of a statement can only be done given its context. An example given is the statement that malaria is transmitted by mosquitos, which will appear in many places in the published literature, each occurrence potentially backed by differing evidence. Each nano-publication is associated with a set of annotations that refer to the statement and provide a minimum set of (community) agreed annotations that identify authorship, provenance, and so on. These annotations can then be used as the basis for review, citation and indeed further annotation. The nano-publication model described in [9] considers a statement to be a triple, a tuple of three concepts (subject, predicate and object), which fits closely with the Resource Description Framework (RDF) data model [10], used widely for (meta)data publication (see the discussion on Linked Data below). The proposed implementation uses RDF and Named Graphs². Aggregation of nano-publications will be facilitated by the use of common identifiers (following Linked Data principles as discussed in Section 7), and to support this, the Concept Web Alliance³ are developing a ConceptWiki⁴, providing URIs for biomedical concepts. The nano-publication approach is rather fine-grained, focusing on single statements along with their provenance.
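As an illustration of the named-graph idea (our sketch, not the implementation proposed in [9]), the following Python fragment uses rdflib to place the malaria statement in one named graph and attach authorship and evidence annotations to that graph in another; all URIs, property names and the author literal are hypothetical.

# Illustrative sketch of a nano-publication-style structure in RDF named graphs.
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/nanopub/")
ds = Dataset()

# Assertion graph: the single statement "malaria is transmitted by mosquitos".
assertion_uri = EX["assertion-1"]
assertion = ds.graph(assertion_uri)
assertion.add((EX["Malaria"], EX["isTransmittedBy"], EX["Mosquito"]))

# Annotation graph: authorship and supporting evidence attached to the
# assertion graph as a whole, enabling review, citation and further annotation.
annotations = ds.graph(EX["provenance-1"])
annotations.add((assertion_uri, DCTERMS.creator, Literal("A. Researcher")))
annotations.add((assertion_uri, EX["supportedBy"], URIRef("http://example.org/papers/123")))

print(ds.serialize(format="trig"))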
The Executable Paper Grand Challenge⁵ was a contest for proposals that will "improve the way scientific information is communicated and used". For executable papers, this will be through adaptations to existing publication models to include data and analyses, and thus facilitate the validation, citation and tracking of that information. The three winning entries in 2011 highlight different aspects of the notion of executable papers. Collage [11] provides infrastructure which allows for the embedding of executable codes in papers. SHARE [12] focuses on the issue of reproducibility, using virtual machines to provide execution. Finally, Gavish and Donoho [13] focus on verifiability, through a system consisting of a Repository holding Verifiable Computational Results (VCRs) that are identified using Verifiable Result Identifiers (VRIs). We note, however, that none of these proposals provides an explicit notion of "Research Object" as introduced here. In addition, provenance information is only considered in the third proposal, where Gavish and Donoho suggest that the ability to re-execute processes may be unnecessary. Rather, understanding of the process can be supported through providing access to the computation tree along with inputs, outputs, parameters and code descriptions.
² See Section 7 for an explanation of Named Graphs.
³ http://www.nbic.nl/about-nbic/affiliated-organisations/cwa/introduction/
⁴ http://conceptwiki.org/
⁵ http://www.executablepapers.com/

2.2. Linked Data
The benefits of explicit representation are clear. An association with a dataset (or service, result collection, or instrument) should be more than just a citation or reference to that resource. The association should rather be a link to that resource which can be followed or dereferenced explicitly. Such linking provides access to the actual resource and thus enactment of the service, query or retrieval of data, and so on, fostering reproducibility.
The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web [3]. Linked Data explicitly encourages the use of dereferenceable links as discussed above, and the Linked Data "principles" (the use of HTTP URIs for naming, the provision of useful information when dereferencing URIs, and the inclusion of links to other URIs) are intended to foster reuse, linkage and consumption of that data. Further discussion of Linked Data is given in Section 7.
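As a simple illustration of dereferenceability (our example, not drawn from the paper), an HTTP URI published as Linked Data can be resolved with content negotiation to obtain a machine-readable description; the DBpedia URI below is used only because it is a well-known, publicly dereferenceable resource.

import requests  # third-party HTTP library, assumed to be available

# Ask for a machine-readable (Turtle) representation of the resource.
uri = "http://dbpedia.org/resource/Malaria"
response = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=30)

print(response.status_code)   # served via redirect to an RDF document
print(response.text[:500])    # RDF description, including links to other URIs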
2.3. Preservation and Archiving
The Open Archival Information System (OAIS) reference model [14] describes "open archival information systems", which are concerned with preserving information for the benefit of a community. The OAIS Functional Model describes a core set of mechanisms, including Ingest, Storage and Access, along with Planning, Data Management and Administration. There is also a separation between the Submission Information Package, the mechanism by which content is submitted for ingest by a Producer; the Archival Information Package, the version stored by the system; and the Dissemination Information Package, the version delivered to a Consumer.
OAIS considers three external entities or actors that interact with the system: Producers, Management and Consumers. These characterise, respectively, those who transfer information to the system for preservation; those who formulate and enforce high-level policies (planning, defining scope, providing "guarantees"); and those who are expected to use the information. OAIS also considers the notion of a Designated Community, a subset of consumers who are expected to understand the archived information.
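The separation between the three package types can be sketched as follows (our illustration, not part of the OAIS specification text); the class and field names are placeholders chosen for readability.

# Minimal sketch of the OAIS package separation; names are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class InformationPackage:
    content: List[str]                                   # identifiers of archived content
    metadata: Dict[str, str] = field(default_factory=dict)

@dataclass
class SubmissionInformationPackage(InformationPackage):
    producer: str = ""       # submitted for ingest by a Producer

@dataclass
class ArchivalInformationPackage(InformationPackage):
    ingest_date: str = ""    # the version stored and managed by the archive

@dataclass
class DisseminationInformationPackage(InformationPackage):
    consumer: str = ""       # the version delivered to a Consumer on access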
2.4. Scientific Publication Packages
One notable precursor to the notion of Research Object presented in this
paper is the idea of Scientific Publication Packages (SPP), proposed in 2006 by
Hunter to describe “the selective encapsulation of raw data, derived products,
algorithms, software and textual publications” [15].
SPPs are motivated primarily by the need to create archives for the variety
of artifacts, such as those listed above, that are produced during the course
of a scientific investigation. In this “digital libraries” view of experimental
science, SPPs ideally contain not only data, software, and documents, but their
provenance as well. As we note here, the latter is a key enabler both for scientific
reproducibility, and to let third parties verify scientific accuracy. Thus, SPPs are
