
Showing papers by "Carole Goble published in 2014"


Journal ArticleDOI
TL;DR: The micropublications semantic model of scientific argument and evidence, as presented in this paper, can express a broad spectrum of representational complexity from minimal to maximal forms and is available as an OWL 2 vocabulary at http://purl.org/mp.
Abstract: Scientific publications are documentary representations of defeasible arguments, supported by data and repeatable methods. They are the essential mediating artifacts in the ecosystem of scientific communications. The institutional “goal” of science is publishing results. The linear document publication format, dating from 1665, has survived transition to the Web. Intractable publication volumes; the difficulty of verifying evidence; and observed problems in evidence and citation chains suggest a need for a web-friendly and machine-tractable model of scientific publications. This model should support: digital summarization, evidence examination, challenge, verification and remix, and incremental adoption. Such a model must be capable of expressing a broad spectrum of representational complexity, ranging from minimal to maximal forms. The micropublications semantic model of scientific argument and evidence provides these features. Micropublications support natural language statements; data; methods and materials specifications; discussion and commentary; challenge and disagreement; as well as allowing many kinds of statement formalization. The minimal form of a micropublication is a statement with its attribution. The maximal form is a statement with its complete supporting argument, consisting of all relevant evidence, interpretations, discussion and challenges brought forward in support of or opposition to it. Micropublications may be formalized and serialized in multiple ways, including in RDF. They may be added to publications as stand-off metadata. An OWL 2 vocabulary for micropublications is available at http://purl.org/mp. A discussion of this vocabulary, along with RDF examples from the case studies, appears as OWL Vocabulary and RDF Examples in Additional file 1. Micropublications, because they model evidence and allow qualified, nuanced assertions, can play essential roles in the scientific communications ecosystem in places where simpler, formalized and purely statement-based models, such as the nanopublications model, will not be sufficient. At the same time they will add significant value to, and are intentionally compatible with, statement-based formalizations. We suggest that micropublications, generated by useful software tools supporting such activities as writing, editing, reviewing, and discussion, will be of great value in improving the quality and tractability of biomedical communications.
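To make the minimal form concrete (a statement plus its attribution), here is a sketch built with rdflib. The mp: class and property names (Micropublication, Statement, argues) follow our reading of the model and should be checked against the published vocabulary at http://purl.org/mp; the attribution uses W3C PROV, and all identifiers are hypothetical.

```python
# Sketch: a minimal-form micropublication as RDF (statement + attribution).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, PROV

MP = Namespace("http://purl.org/mp/")  # assumed namespace form for the mp terms

g = Graph()
g.bind("mp", MP)
g.bind("prov", PROV)

micropub = URIRef("http://example.org/mp/1")      # hypothetical identifiers
claim = URIRef("http://example.org/statement/1")

g.add((micropub, RDF.type, MP.Micropublication))
g.add((claim, RDF.type, MP.Statement))
g.add((micropub, MP.argues, claim))               # the micropublication argues its claim
g.add((claim, RDF.value, Literal("Drug X inhibits kinase Y at nanomolar concentrations.")))
g.add((claim, PROV.wasAttributedTo, URIRef("http://example.org/person/a-researcher")))

print(g.serialize(format="turtle"))
```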

109 citations


Journal ArticleDOI
TL;DR: A manual analysis performed over a set of real-world scientific workflows from Taverna, Wings, Galaxy and Vistrails has resulted in a set of scientific workflow motifs that are helpful to identify the functionality of the steps in a given workflow, to develop best practices for workflow design, and to develop approaches for automated generation of workflow abstractions.

79 citations


Journal ArticleDOI
TL;DR: This article raises a number of points concerning quality, code review, and openness; development practices and training in scientific computing; career recognition of research software engineers; and sustainability models for funding scientific software.
Abstract: Modern scientific research isn't possible without software. However, its vital role is often overlooked by funders, universities, assessment committees, and even the research community itself. This is a serious issue that needs urgent attention. This article raises a number of points concerning quality, code review, and openness; development practices and training in scientific computing; career recognition of research software engineers; and sustainability models for funding scientific software. We must get software recognized to be the first-class experimental scientific instrument that it is and get "better software for better research."

69 citations


Journal ArticleDOI
TL;DR: The OPS platform, as discussed by the authors, is a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications; its functionality was drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project.
Abstract: The discovery of new medicines requires pharmacologists to interact with a number of information sources ranging from tabular data to scientific papers, and other specialized formats. In this application report, we describe a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications. The functionality offered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project, a collaboration of research institutions and major pharmaceutical companies. We describe the architecture of the platform, focusing on seven design decisions that drove its development with the aim of informing others developing similar software in this or other domains. The utility of the platform is demonstrated by the variety of drug discovery applications being built to access the integrated data. An alpha version of the OPS platform is currently available to the Open PHACTS consortium and a first public release will be made in late 2012; see http://www.openphacts.org/ for details.

63 citations


Journal ArticleDOI
TL;DR: The Open PHACTS Discovery Platform, as discussed by the authors, leverages Linked Data to provide integrated access to pharmacology databases; it has been accessed over 13.5 million times and has multiple applications that integrate with it.

51 citations


Journal ArticleDOI
TL;DR: This work designs and implements an approach to improving workflow structure by way of semantics-preserving rewriting, and introduces a distilling algorithm that takes a workflow as input and produces a distilled, semantically-equivalent workflow.
Abstract: Scientific workflow management systems are increasingly used to specify and manage bioinformatics experiments. Their programming model appeals to bioinformaticians, who can use them to easily specify complex data processing pipelines. Such a model is underpinned by a graph structure, where nodes represent bioinformatics tasks and links represent the dataflow. The complexity of such graph structures is increasing over time, with possible impacts on scientific workflow reuse. In this work, we propose effective methods for workflow design, with a focus on the Taverna model. We argue that one of the contributing factors for the difficulties in reuse is the presence of "anti-patterns", a term broadly used in program design to indicate the use of idiomatic forms that lead to over-complicated design. The main contribution of this work is a method for automatically detecting such anti-patterns and replacing them with different patterns which result in a reduction of the workflow's overall structural complexity. Rewriting workflows in this way will be beneficial both in terms of user experience (easier design and maintenance) and in terms of operational efficiency (easier to manage, and sometimes easier to exploit the latent parallelism amongst the tasks). We have conducted a thorough study of the workflow structures available in Taverna, with the aim of identifying workflow fragments whose structure could be made simpler without altering the workflow semantics. We provide four contributions. Firstly, we identify a set of anti-patterns that contribute to the structural workflow complexity. Secondly, we design a series of refactoring transformations to replace each anti-pattern by a new semantically-equivalent pattern with less redundancy and a simplified structure. Thirdly, we introduce a distilling algorithm that takes a workflow as input and produces a distilled, semantically-equivalent workflow. Lastly, we provide an implementation of our refactoring approach that we evaluate on both the public Taverna workflows and on a private collection of workflows from the BioVeL project. We have designed and implemented an approach to improving workflow structure by way of rewriting that preserves workflow semantics. Future work includes considering our refactoring approach during the phase of workflow design and proposing guidelines for designing distilled workflows.
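A minimal sketch of the distilling idea, under simplifying assumptions: the workflow is modelled as a plain dataflow DAG, and the only anti-pattern handled is the duplication of the same task over the same inputs, which can be merged without changing what the workflow computes. The paper's refactoring catalogue and Taverna's data structures are considerably richer than this.

```python
# Sketch: merge redundant nodes (same task, same inputs) in a dataflow DAG.
# This is one plausible semantics-preserving rewrite, not the paper's method.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    task: str            # identifier of the task/service invoked
    inputs: tuple = ()   # names of upstream nodes feeding this node

def distill(nodes: list[Node]) -> list[Node]:
    """Repeatedly merge redundant nodes until a fixed point is reached."""
    renamed: dict[str, str] = {}
    changed = True
    while changed:
        changed = False
        seen: dict[tuple, str] = {}   # (task, inputs) -> surviving node name
        kept = []
        for n in nodes:
            ins = tuple(renamed.get(i, i) for i in n.inputs)
            key = (n.task, ins)
            if key in seen:           # redundant duplicate: drop and rewire
                renamed[n.name] = seen[key]
                changed = True
            else:
                seen[key] = n.name
                kept.append(Node(n.name, n.task, ins))
        nodes = kept
    return nodes

# B and C both run "blast" on A's output, so one is removed and D is rewired.
wf = [Node("A", "fetch"), Node("B", "blast", ("A",)),
      Node("C", "blast", ("A",)), Node("D", "align", ("B", "C"))]
for node in distill(wf):
    print(node)
```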

46 citations


Journal ArticleDOI
TL;DR: A Taverna-based Data Refinement Workflow is designed and implemented which integrates taxonomic data retrieval, data cleaning, and data selection into a consistent, standards-based, and effective system hiding the complexity of underlying service infrastructures.
Abstract: The compilation and cleaning of data needed for analyses and prediction of species distributions is a time-consuming process requiring a solid understanding of data formats and service APIs provided by biodiversity informatics infrastructures. We designed and implemented a Taverna-based Data Refinement Workflow which integrates taxonomic data retrieval, data cleaning, and data selection into a consistent, standards-based, and effective system that hides the complexity of the underlying service infrastructures. The workflow can be freely used both locally and through a web-portal which does not require additional software installations by users.

39 citations


Journal ArticleDOI
TL;DR: In this paper, a workflow-centric Research Object (RO) model is proposed to aggregate and annotate the resources used in a bioinformatics experiment, allowing the conclusions of the experiment to be retrieved in the context of the driving hypothesis, the executed workflows and their input data.
Abstract: Background: One of the main challenges for biomedical research lies in the computer-assisted integrative study of large and increasingly complex combinations of data in order to understand molecular mechanisms. The preservation of the materials and methods of such computational experiments with clear annotations is essential for understanding an experiment, and this is increasingly recognized in the bioinformatics community. Our assumption is that offering means of digital, structured aggregation and annotation of the objects of an experiment will provide the necessary meta-data for a scientist to understand and recreate the results of an experiment. To support this, we explored a model for the semantic description of a workflow-centric Research Object (RO), where an RO is defined as a resource that aggregates other resources, e.g., datasets, software, spreadsheets, text, etc. We applied this model to a case study where we analysed human metabolite variation by workflows. Results: We present the application of the workflow-centric RO model for our bioinformatics case study. Three workflows were produced following recently defined Best Practices for workflow design. By modelling the experiment as an RO, we were able to automatically query the experiment and answer questions such as “which particular data was input to a particular workflow to test a particular hypothesis?”, and “which particular conclusions were drawn from a particular workflow?”. Conclusions: Applying a workflow-centric RO model to aggregate and annotate the resources used in a bioinformatics experiment allowed us to retrieve the conclusions of the experiment in the context of the driving hypothesis, the executed workflows and their input data. The RO model is an extendable reference model that can be used by other systems as well. Availability: The Research Object is available at http://www.myexperiment.org/packs/428 and the Wf4Ever Research Object Model is available at http://wf4ever.github.io/ro
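As a sketch of the kind of question-answering the RO model enables, the query below assumes the RO's annotations have been loaded into an RDF graph and that runs are described with the Wf4Ever wfprov vocabulary the model builds on; the exact graph shape (and the manifest file name) is our assumption, not an excerpt from the case-study RO.

```python
# Sketch: ask "which data was input to which workflow run?" over an RO graph.
from rdflib import Graph

g = Graph()
g.parse("manifest.rdf")  # hypothetical path to the RO's annotation graph

QUERY = """
PREFIX wfprov: <http://purl.org/wf4ever/wfprov#>
SELECT ?run ?input WHERE {
  ?run a wfprov:WorkflowRun ;
       wfprov:usedInput ?input .
}
"""
for run, data in g.query(QUERY):
    print(f"workflow run {run} used input {data}")
```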

29 citations


DatasetDOI
04 Dec 2014
TL;DR: This spreadsheet contains the anonymised data collected as part of a survey of UK researchers on their use of research software; the survey received 417 responses, a statistically significant number that can be used to represent the views of people in research-intensive universities in the UK.
Abstract: This spreadsheet contains the anonymised data collected as part of a survey of UK researchers on their use of research software. We asked people specifically about “research software”, which we defined as: “Software that is used to generate, process or analyse results that you intend to appear in a publication (either in a journal, conference paper, monograph, book or thesis). Research software can be anything from a few lines of code written by yourself, to a professionally developed software package. Software that does not generate, process or analyse results - such as word processing software, or the use of a web search - does not count as ‘research software’ for the purposes of this survey.” We contacted 1,000 randomly selected researchers at each of 15 Russell Group universities. From the 15,000 invitations to complete the survey, we received 417 responses – a rate of 3%, which is fairly normal for a blind survey. We used Google Forms to collect responses. The responses have good representation from across the disciplines, seniorities and genders. This is a statistically significant number of responses that can be used to represent the views of people in research-intensive universities in the UK. An overview of the data is available on the worksheet "Summary data". Responses to questions are ordered by unique respondent ID. Please read the "README" worksheet for additional information about the collection and processing of this data. This survey data is licensed under a Creative Commons Attribution licence. Copyright resides with The University of Edinburgh on behalf of the Software Sustainability Institute. Please cite as:
APA: Hettrick, S. J., et al. (2014). UK Research Software Survey 2014 [Data set]. doi:10.5281/zenodo.14809
Chicago: S. J. Hettrick et al., UK Research Software Survey 2014 (accessed December 4, 2014), 10.5281/zenodo.14809.
MLA: Hettrick, S. J., et al. “UK Research Software Survey 2014.” ZENODO, 2014. Web. 4 December 2014.

28 citations


Journal Article
TL;DR: Research Objects as discussed by the authors are portable units that enable the sharing, preservation, interpretation and reuse of research investigation results and provide a single entry point to access information about the hypothesis investigated, the datasets used, the experiments carried out, the results of the experiments, the people involved in the research, etc.

25 citations


Book ChapterDOI
09 Jun 2014
TL;DR: It is shown that by basic mark-up of the data processing within activities and the use of a set of domain-specific label generation functions, standard workflow provenance can be utilised as a platform for the labelling of data artefacts.
Abstract: Provenance traces captured by scientific workflows can be useful for designing, debugging and maintenance. However, our experience suggests that they are of limited use for reporting results, partly because traces do not comprise the domain-specific annotations needed for explaining results, and partly because of the black-box nature of some workflow activities. We show that by basic mark-up of the data processing within activities and the use of a set of domain-specific label generation functions, standard workflow provenance can be utilised as a platform for the labelling of data artefacts. These labels can in turn aid the selection of data subsets and serve as proxies for data descriptors for shared datasets.
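A minimal sketch of this labelling scheme, with invented activity names and a simplified trace format (triples of activity, input artefacts, output artefacts) rather than any real system's provenance model: each activity type is marked up with a label-generation function that derives labels for its outputs from the labels of its inputs.

```python
# Trace entries: (activity_name, input_ids, output_ids), in execution order.
trace = [
    ("fetch_records", [], ["d1"]),
    ("filter_records", ["d1"], ["d2"]),
]

# Hypothetical domain-specific label generators, one per activity type.
def label_fetch(input_labels):
    return {"source": "occurrence download"}

def label_filter(input_labels):
    merged = {k: v for lab in input_labels for k, v in lab.items()}
    merged["step"] = "filtered to records with coordinates"  # propagate + refine
    return merged

generators = {"fetch_records": label_fetch, "filter_records": label_filter}

labels = {}  # data artefact id -> label dict
for activity, ins, outs in trace:
    out_label = generators[activity]([labels.get(i, {}) for i in ins])
    for o in outs:
        labels[o] = out_label

print(labels["d2"])  # labels propagated through the provenance trace
```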

Book ChapterDOI
19 Oct 2014
TL;DR: This paper shows that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation, supporting multiple dynamic views over the Linked Data.
Abstract: When are two entries about a small molecule in different datasets the same? Is it when they have the same drug name, the same chemical structure, or when they meet some other criterion? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data, with no way of varying the notion of equivalence to be applied. In this paper, we present an approach that enables applications to choose the equivalence criteria to apply between datasets, thus supporting multiple dynamic views over the Linked Data. For chemical data, we show that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation. This approach has been applied within a large scale public-private data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
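A minimal sketch of the lens idea: whether two records link depends on which equivalence criterion the application selects. The record fields and lens names are illustrative assumptions, not the platform's schema; Gleevec is a brand name for imatinib, so a structure-based lens links the two records while a name-based lens does not.

```python
# Two dataset entries for (what is chemically) the same small molecule.
a = {"name": "Gleevec",  "inchikey": "KTUFNOKKBVMGRW-UHFFFAOYSA-N"}
b = {"name": "imatinib", "inchikey": "KTUFNOKKBVMGRW-UHFFFAOYSA-N"}

lenses = {
    # Strict lens: entries are the same only if the drug names match.
    "by_name": lambda x, y: x["name"].lower() == y["name"].lower(),
    # Structural lens: entries are the same if the InChIKeys match.
    "by_structure": lambda x, y: x["inchikey"] == y["inchikey"],
}

for name, same in lenses.items():
    print(name, "->", "link" if same(a, b) else "no link")
# by_name -> no link; by_structure -> link
```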

Posted Content
TL;DR: Research Objects as discussed by the authors are portable units that enable the sharing, preservation, interpretation and reuse of research investigation results and provide a single entry point to access information about the hypothesis investigated, the datasets used, the experiments carried out, the results of the experiments, the people involved in the research, etc.
Abstract: Research in life sciences is increasingly being conducted in a digital and online environment. In particular, life scientists have been pioneers in embracing new computational tools to conduct their investigations. To support the sharing of digital objects produced during such research investigations, we have witnessed in the last few years the emergence of specialized repositories, e.g., DataVerse and FigShare. Such repositories provide users with the means to share and publish datasets that were used or generated in research investigations. While these repositories have proven their usefulness, interpreting and reusing evidence for most research results is a challenging task. Additional contextual descriptions are needed to understand how those results were generated and/or the circumstances under which they were concluded. Because of this, scientists are calling for models that go beyond the publication of datasets to systematically capture the life cycle of scientific investigations and provide a single entry point to access the information about the hypothesis investigated, the datasets used, the experiments carried out, the results of the experiments, the people involved in the research, etc. In this paper we present the Research Object (RO) suite of ontologies, which provide a structured container to encapsulate research data and methods along with essential metadata descriptions. Research Objects are portable units that enable the sharing, preservation, interpretation and reuse of research investigation results. The ontologies we present have been designed in the light of requirements that we gathered from life scientists. They have been built upon existing popular vocabularies to facilitate interoperability. Furthermore, we have developed tools to support the creation and sharing of Research Objects, thereby promoting and facilitating their adoption.

BookDOI
01 Jan 2014
TL;DR: The Dutch Ships and Sailors Linked Data Cloud is a potential hub dataset for digital history research and a prime example of the benefits of Linked Data for this field.
Abstract: We present the Dutch Ships and Sailors Linked Data Cloud. This heterogeneous dataset brings together four curated datasets on Dutch maritime history as five-star linked data. The individual datasets use separate data models, designed in close collaboration with maritime historical researchers. The individual models are mapped to a common interoperability layer, allowing for analysis of the data on the general level. We present the datasets, modeling decisions, internal links and links to external data sources. We show ways of accessing the data and present a number of examples of how the dataset can be used for historical research. The Dutch Ships and Sailors Linked Data Cloud is a potential hub dataset for digital history research and a prime example of the benefits of Linked Data for this field.

Journal ArticleDOI
TL;DR: The Taverna Workbench is open source software providing the ability to combine various services within a workflow, and BIFI offers user interfaces that allow users to interactively construct workflow views and share them with the community, significantly increasing the usability of heterogeneous, distributed service consumption.
Abstract: Heterogeneity in the features, input-output behaviour and user interfaces of available bioinformatics tools and services is still a bottleneck for both expert and non-expert users. Advances in providing common interfaces over such tools and services are gaining interest among researchers. However, the lack of (meta-)information about input-output data and parameters prevents the provision of automated and standardized solutions that can assist users in setting the appropriate parameters. These limitations must be resolved, especially in workflow-based solutions, in order to ease the integration of software. We report a Taverna Workbench plugin, XworX BIFI (Beautiful Interfaces for Inputs), implemented as a solution to the aforementioned issues. BIFI provides a Graphical User Interface (GUI) definition language used to lay out the user interface and to define parameter options for Taverna workflows. BIFI is also able to submit GUI Definition Files (GDF) directly or discover appropriate instances from a configured repository. In the absence of a GDF, BIFI generates a default interface. The Taverna Workbench is open source software providing the ability to combine various services within a workflow. By default, however, users can supply input data to a workflow only via a simple user interface providing a text area to enter the input in text form. The workflow may contain meta-information in human-readable form, such as description text for a port or an example value, but not all workflow ports are documented so well or carry all the required information. BIFI uses custom user interface components for ports, which give users feedback on the parameter data type or structure to be used for service execution and enable client-side data validation. Moreover, BIFI offers user interfaces that allow users to interactively construct workflow views and share them with the community, significantly increasing the usability of heterogeneous, distributed service consumption.
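A sketch of the fallback behaviour described above, under assumed port metadata: in the absence of a GUI Definition File, a default widget is derived for each port from whatever description, type, and example values it carries. The port fields and widget vocabulary are invented for illustration; they are not BIFI's actual GDF language.

```python
# Hypothetical workflow input ports with partial metadata.
ports = [
    {"name": "sequence", "description": "FASTA protein sequence",
     "example": ">sp|P12345|..."},
    {"name": "evalue", "description": "BLAST e-value cutoff", "type": "double"},
]

def default_widget(port):
    """Pick a sensible default input component from the port's metadata."""
    if port.get("type") in {"int", "double"}:
        return {"port": port["name"], "widget": "number_field",
                "help": port.get("description", "")}
    return {"port": port["name"], "widget": "text_area",
            "help": port.get("description", ""),
            "placeholder": port.get("example", "")}

for p in ports:
    print(default_widget(p))
```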

Proceedings ArticleDOI
09 Jun 2014
TL;DR: This paper describes three distinct areas of concern which emerged from the Collaborations Workshop 2014: collaboration readiness, capability enhancement and advocacy.
Abstract: The Collaborations Workshop 2014 (CW14) brought together representatives from across the research community to discuss the issues around software's role in reproducible research. In this paper we summarise the themes, practices and ideas raised at the workshop. We also consider how the "unconference" format of the CW14 helps in eliciting information and forming future collaborations around aspects of reproducible research. In particular, we describe three distinct areas of concern which emerged from the event: collaboration readiness, capability enhancement and advocacy.

Journal ArticleDOI
TL;DR: The Open PHACTS Discovery Platform, as discussed by the authors, leverages Linked Data to provide integrated access to pharmacology databases; between its launch in April 2013 and March 2014 it was accessed over 13.5 million times, and multiple applications integrate with it.
Abstract: Data integration is a key challenge faced in pharmacology, where there are numerous heterogeneous databases spanning multiple domains (e.g. chemistry and biology). To address this challenge, the Open PHACTS consortium has developed the Open PHACTS Discovery Platform, which leverages Linked Data to provide integrated access to pharmacology databases. Between its launch in April 2013 and March 2014, the platform has been accessed over 13.5 million times and has multiple applications that integrate with it. In this work, we discuss how Application Programming Interfaces can extend the classical Linked Data Application Architecture to facilitate data integration. Additionally, we show how the Open PHACTS Discovery Platform implements this extended architecture.
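A sketch of the consumption pattern this extended architecture enables: an application calls a REST method and receives integrated JSON, without issuing SPARQL itself. The endpoint path, version, and parameter names below are assumptions for illustration; consult the Open PHACTS API documentation for the documented methods and credentials.

```python
# Sketch: fetch integrated data about a compound from an assumed REST endpoint.
import requests

BASE = "https://beta.openphacts.org/2.1"  # assumed base URL, not verified
params = {
    "uri": "http://example.org/compound/123",  # hypothetical compound URI
    "app_id": "YOUR_APP_ID",                   # platform API credentials
    "app_key": "YOUR_APP_KEY",
    "_format": "json",
}
resp = requests.get(f"{BASE}/compound", params=params, timeout=30)
resp.raise_for_status()
print(resp.json())  # one integrated record drawn from multiple datasets
```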

01 Jan 2014
TL;DR: This application report describes a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications, drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project.
Abstract: The discovery of new medicines requires pharmacologists to interact with a number of information sources ranging from tabular data to scientific papers, and other specialized formats. In this application report, we describe a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications. The functionality offered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project, a collaboration of research institutions and major pharmaceutical companies. We describe the architecture of the platform, focusing on seven design decisions that drove its development with the aim of informing others developing similar software in this or other domains. The utility of the platform is demonstrated by the variety of drug discovery applications being built to access the integrated data. An alpha version of the OPS platform is currently available to the Open PHACTS consortium and a first public release will be made in late 2012; see http://www.openphacts.org/ for details.

Reference EntryDOI
07 Apr 2014

BookDOI
01 Jan 2014
TL;DR: Topics include data integration; search and query answering; ontology-based data access, query rewriting and reasoning; and natural language processing and information extraction.
Abstract: Linked data, its quality, link discovery and application in the life sciences.- Data integration.- Search and query answering.- SPARQL.- Ontology based data access and query rewriting and reasoning.- Natural language processing and information extraction.- User interaction and personalization, and social media.- Ontology alignment and modularization.- Sensors and streams.- Biomedicine and drug discovery.- Smart cities.- Sensor streams.- Multimedia.- Visualization.- Link generation.- Ontology development.- Linked stream data.- Federated query processing.- Tag recommendation.- Entity summarization.- Mobile semantic web.

DOI
14 Jul 2014
TL;DR: This work was enabled by BioVeL and ViBRANT, which received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration.
Abstract: This work was enabled by BioVeL (grant no. 283359) and ViBRANT (grant no. 261532), which received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration. www.biovel.eu | www.vbrant.eu
Taverna Player: www.taverna.org.uk | Source code: github.com/myGrid/taverna-player | Licence: BSD
Scratchpads: scratchpads.eu | Source code: git.scratchpads.eu/git/scratchpads-2.0.git | Licence: GPL2

Proceedings ArticleDOI
30 Jun 2014
TL;DR: DistillFlow is able to detect "anti-patterns" in the structure of workflows (idiomatic forms that lead to over-complicated design) and replace them with different patterns to reduce the workflow's overall structural complexity.
Abstract: Scientific workflow management systems are increasingly used by scientists to specify complex data processing pipelines. Workflows are represented using a graph structure, where nodes represent tasks and links represent the dataflow. However, the complexity of workflow structures is increasing over time, reducing the rate of scientific workflow reuse. Here, we introduce DistillFlow, a tool based on effective methods for workflow design, with a focus on the Taverna model. DistillFlow is able to detect "anti-patterns" in the structure of workflows (idiomatic forms that lead to over-complicated design) and replace them with different patterns to reduce the workflow's overall structural complexity. Rewriting workflows in this way is beneficial both in terms of user experience and workflow maintenance.

Proceedings Article
18 Jun 2014
TL;DR: This article proposes using W3C Open Annotation to produce multipolar argumentation networks as an overlay on Web documents, to assist in improving scientific reproducibility by integrating discussion and analysis of doubtful results.
Abstract: Too many scientific arguments in peer-reviewed articles are backed by flawed or inadequate data. Articles can now be annotated in pre- or post-publication review using models such as W3C Open Annotation, to produce multipolar argumentation networks as an overlay on Web documents. This would assist in improving scientific reproducibility by integrating discussion and analysis of doubtful results.

Journal ArticleDOI
TL;DR: This special issue encouraged researchers to submit and present original work related to the latest trends in parallel and distributed high-performance systems applied to life science problems.
Abstract: Computing systems are rapidly changing, with multicore, graphics processing units (GPUs), clusters, volunteer systems, clouds, and grids offering a confusing, dazzling array of opportunities. New programming paradigms such as Google MapReduce and many-task computing have joined the traditional repertoire of workflow and parallel computing for the highest-performance systems. Meanwhile, the life sciences continue to expand the data they generate, with continuing improvement in the instruments for high-throughput analysis. This ‘fourth paradigm’ (data-driven science) is joined by complex systems, or biocomplexity, which can build phenomenological models of biological systems and processes. This special issue for the Emerging Computational Methods for the Life Sciences Workshop (ECMLS2012) [1] juxtaposes these trends, seeking those computational methods that will enhance scientific discovery. Within this overall scope, this special issue encouraged researchers to submit and present original work related to the latest trends in parallel and distributed high-performance systems applied to life science problems. Important contributions have been provided by Weber et al. [2], Yang et al. [3], Ellingson et al. [4], Hamacher et al. [5], Cushing et al. [6], and Stanberry et al. [7].

Proceedings Article
21 Oct 2014
TL;DR: The Open PHACTS VoID Editor helps non-Semantic Web experts create machine-interpretable descriptions for their datasets, enabling their discovery and reuse.
Abstract: The Open PHACTS VoID Editor helps non-Semantic Web experts to create machine interpretable descriptions for their datasets. The web app guides the user, an expert in the domain of the data, through a series of questions to capture details of their dataset and then generates a VoID dataset description. The generated dataset description conforms to the Open PHACTS dataset description guidelines that ensure suitable provenance information is available about the dataset to enable its discovery and reuse. The VoID Editor is available at http://voideditor.cs.man.ac.uk. The source code can be found at https://github.com/openphacts/Void-Editor2.
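For context, a hand-built example of the sort of VoID dataset description the editor generates, using rdflib. Only core VoID and Dublin Core terms appear here; the Open PHACTS dataset description guidelines require richer provenance than this, so treat the snippet as an illustrative subset rather than the editor's actual output.

```python
# Sketch: a minimal VoID description for a hypothetical dataset.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, XSD

VOID = Namespace("http://rdfs.org/ns/void#")

g = Graph()
g.bind("void", VOID)
g.bind("dcterms", DCTERMS)

ds = URIRef("http://example.org/dataset/my-assay-data")  # hypothetical URI
g.add((ds, RDF.type, VOID.Dataset))
g.add((ds, DCTERMS.title, Literal("My assay data", lang="en")))
g.add((ds, DCTERMS.description, Literal("Bioassay results for project X", lang="en")))
g.add((ds, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/4.0/")))
g.add((ds, VOID.dataDump, URIRef("http://example.org/dumps/my-assay-data.ttl")))
g.add((ds, VOID.triples, Literal(123456, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```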