Author

Darren Marvin

Bio: Darren Marvin is an academic researcher from the University of Southampton. The author has contributed to research in topics: e-Science & Workflow. The author has an h-index of 10 and has co-authored 15 publications receiving 2,905 citations.

Papers
Journal ArticleDOI
TL;DR: The Taverna project has developed a tool for the composition and enactment of bioinformatics workflows for the life sciences community; workflows are written in a new language called Scufl, whereby each step within a workflow represents one atomic task.
Abstract: Motivation:In silico experiments in bioinformatics involve the co-ordinated use of computational tools and information repositories. A growing number of these resources are being made available with programmatic access in the form of Web services. Bioinformatics scientists will need to orchestrate these Web services in workflows as part of their analyses. Results: The Taverna project has developed a tool for the composition and enactment of bioinformatics workflows for the life sciences community. The tool includes a workbench application which provides a graphical user interface for the composition of workflows. These workflows are written in a new language called the simple conceptual unified flow language (Scufl), where by each step within a workflow represents one atomic task. Two examples are used to illustrate the ease by which in silico experiments can be represented as Scufl workflows using the workbench application. Availability: The Taverna workflow system is available as open source and can be downloaded with example Scufl workflows from http://taverna.sourceforge.net
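The Scufl idea described in the abstract — a workflow as a chain of atomic tasks wired output-to-input — can be sketched in plain Python. This is an illustrative stand-in, not Scufl syntax; the task names, the stubbed sequence data, and the accession identifier are all made up:

```python
from typing import Any, Callable

# Hypothetical atomic tasks standing in for remote bioinformatics services.
def fetch_sequence(accession: str) -> str:
    # A real Taverna step would invoke a Web service; here it is stubbed.
    return f">{accession}\nATGGCCATTGTAATGGGCCGC"

def reverse_complement(fasta: str) -> str:
    header, seq = fasta.split("\n", 1)
    table = str.maketrans("ACGT", "TGCA")
    return header + "\n" + seq.translate(table)[::-1]

# A workflow is an ordered list of atomic steps; enactment threads the
# output of each step into the input of the next.
def enact(workflow: list[Callable[[Any], Any]], data: Any) -> Any:
    for step in workflow:
        data = step(data)
    return data

result = enact([fetch_sequence, reverse_complement], "EXAMPLE_1")
```

Each function plays the role of one Scufl processor; the real system adds a graphical workbench, remote service invocation, and data links rather than a simple linear list.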

1,709 citations

Journal ArticleDOI
TL;DR: The Taverna Workbench, developed by the myGrid project, provides a Grid environment for the composition and execution of workflows for the life sciences community; this experience paper describes lessons learnt during its development.
Abstract: Life sciences research is based on individuals, often with diverse skills, assembled into research groups. These groups use their specialist expertise to address scientific problems. The in silico experiments undertaken by these research groups can be represented as workflows involving the co-ordinated use of analysis programs and information repositories that may be globally distributed. With regards to Grid computing, the requirements relate to the sharing of analysis and information resources rather than sharing computational power. The myGrid project has developed the Taverna Workbench for the composition and execution of workflows for the life sciences community. This experience paper describes lessons learnt during the development of Taverna. A common theme is the importance of understanding how workflows fit into the scientists' experimental context. The lessons reflect an evolving understanding of life scientists' requirements on a workflow environment, which is relevant to other areas of data intensive and exploratory science.

729 citations

01 Jan 2003
TL;DR: An overview of initial work on the provenance of bioinformatics e-Science experiments within myGrid uses two kinds of provenance: the derivation path of information and annotation and explores how the resulting Webs of experimental data holdings can be mined for useful information and presentations for the e-Scientist.
Abstract: Like experiments performed at a laboratory bench, the data associated with an e-Science experiment are of reduced value if other scientists are not able to identify the origin, or provenance, of those data. Provenance information is essential if experiments are to be validated and verified by others, or even by those who originally performed them. In this article, we give an overview of our initial work on the provenance of bioinformatics e-Science experiments within myGrid. We use two kinds of provenance: the derivation path of information and annotation. We show how this kind of provenance can be delivered within the myGrid demonstrator WorkBench and we explore how the resulting Webs of experimental data holdings can be mined for useful information and presentations for the e-Scientist.

126 citations

Journal IssueDOI
TL;DR: The Taverna Workbench, developed by the myGrid project, provides a Grid environment for the composition and execution of workflows for the life sciences community; this experience paper describes lessons learnt during its development.
Abstract: Life sciences research is based on individuals, often with diverse skills, assembled into research groups. These groups use their specialist expertise to address scientific problems. The in silico experiments undertaken by these research groups can be represented as workflows involving the co-ordinated use of analysis programs and information repositories that may be globally distributed. With regards to Grid computing, the requirements relate to the sharing of analysis and information resources rather than sharing computational power. The myGrid project has developed the Taverna Workbench for the composition and execution of workflows for the life sciences community. This experience paper describes lessons learnt during the development of Taverna. A common theme is the importance of understanding how workflows fit into the scientists' experimental context. The lessons reflect an evolving understanding of life scientists' requirements on a workflow environment, which is relevant to other areas of data intensive and exploratory science. Copyright © 2005 John Wiley & Sons, Ltd.

115 citations

01 Jan 2003
TL;DR: The EPSRC-funded Grid project has developed a graphical toolset and workflow enactor which uses its own high-level representation of a process flow, including specification of processing units, data transfers and execution constraints.
Abstract: Workflow techniques form an important part of in-silico experimentation within the bioinformatics domain and potentially allow the eScientist to describe and enact their experimental processes in a structured, repeatable and verifiable way. Bioinformaticians routinely use Web-based resources within their in-silico experiments. However, the use of current web service orchestration techniques is problematic, and represents a significant barrier to take-up by the bioinformatics community, due to the rapidly evolving and competing standards, a lack of freely available tools, limited support for interaction with stateful services, and inappropriate levels of abstraction for the bioinformatics domain. As a result, the EPSRC-funded Grid [11] project has, in collaboration with the European Bioinformatics Institute and the Human Genome Mapping Project, developed a graphical toolset and workflow enactor which uses its own high-level representation of a process flow, including specification of processing units, data transfers and execution constraints.

58 citations


Cited by
Journal ArticleDOI
TL;DR: Galaxy Pages are interactive, web-based documents that provide users with a medium to communicate a complete computational analysis and provide support for capturing the context and intent of computational methods.
Abstract: Increased reliance on computational approaches in the life sciences has revealed grave concerns about how accessible and reproducible computation-reliant results truly are. Galaxy http://usegalaxy.org, an open web-based platform for genomic research, addresses these problems. Galaxy automatically tracks and manages data provenance and provides support for capturing the context and intent of computational methods. Galaxy Pages are interactive, web-based documents that provide users with a medium to communicate a complete computational analysis.

3,576 citations

Journal ArticleDOI
TL;DR: The most important new developments in STRING 8 over previous releases include a URL-based programming interface, improved interaction prediction via genomic neighborhood in prokaryotes, and the inclusion of protein structures.
Abstract: Functional partnerships between proteins are at the core of complex cellular phenotypes, and the networks formed by interacting proteins provide researchers with crucial scaffolds for modeling, data reduction and annotation. STRING is a database and web resource dedicated to protein–protein interactions, including both physical and functional interactions. It weights and integrates information from numerous sources, including experimental repositories, computational prediction methods and public text collections, thus acting as a metadatabase that maps all interaction evidence onto a common set of genomes and proteins. The most important new developments in STRING 8 over previous releases include a URL-based programming interface, which can be used to query STRING from other resources, improved interaction prediction via genomic neighborhood in prokaryotes, and the inclusion of protein structures. Version 8.0 of STRING covers about 2.5 million proteins from 630 organisms, providing the most comprehensive view on protein–protein interactions currently available. STRING can be reached at http://string-db.org/.
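The URL-based programming interface mentioned above can be exercised with a short query-construction sketch. The `api/<format>/<method>` path layout and the `identifiers`/`species` parameter names follow STRING's documented interface, but treat them as illustrative and consult string-db.org for the current API version:

```python
from urllib.parse import urlencode

# Base of STRING's URL-based programming interface (assumed current host).
BASE = "https://string-db.org/api"

def network_url(identifiers: list[str], species: int, fmt: str = "tsv") -> str:
    """Build a URL querying the interaction network for a set of proteins."""
    params = urlencode({
        # Protein names are carriage-return separated per STRING's convention;
        # urlencode turns "\r" into %0D.
        "identifiers": "\r".join(identifiers),
        # NCBI taxon id, e.g. 9606 = human.
        "species": species,
    })
    return f"{BASE}/{fmt}/network?{params}"

url = network_url(["TP53", "MDM2"], species=9606)
```

Fetching the resulting URL (e.g. with `urllib.request`) would return the network as TSV; only URL construction is shown here to keep the sketch offline.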

2,394 citations

Journal ArticleDOI
TL;DR: Snakemake is a workflow engine that provides a readable Python-based workflow definition language and a powerful execution environment that scales from single-core workstations to compute clusters without modifying the workflow.
Abstract: Snakemake is a workflow engine that provides a readable Python-based workflow definition language and a powerful execution environment that scales from single-core workstations to compute clusters without modifying the workflow. It is the first system to support the use of automatically inferred multiple named wildcards (or variables) in input and output filenames.
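The named-wildcard mechanism the abstract highlights can be illustrated with a small Python sketch. This is not Snakemake's implementation, just a toy showing how a wildcard inferred from an output filename can be substituted back into an input pattern:

```python
import re

def compile_pattern(pattern: str) -> re.Pattern:
    # Escape regex metacharacters, then turn each "{name}" placeholder
    # into a named capture group.
    escaped = re.sub(r"\\\{(\w+)\\\}", r"(?P<\1>[^/]+)", re.escape(pattern))
    return re.compile(escaped + "$")

def infer_wildcards(pattern: str, target: str) -> "dict[str, str] | None":
    """Match a requested output file against a rule's output pattern."""
    m = compile_pattern(pattern).match(target)
    return m.groupdict() if m else None

def expand(pattern: str, wildcards: dict) -> str:
    """Substitute inferred wildcard values into an input pattern."""
    return pattern.format(**wildcards)

# Requesting "liver.sorted.bam" binds {sample} = "liver" ...
wc = infer_wildcards("{sample}.sorted.bam", "liver.sorted.bam")
# ... which determines the rule's input filename.
inputs = expand("{sample}.bam", wc)
```

In Snakemake proper, this inference drives the dependency resolution between rules; the sketch only shows the filename-matching core.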

1,932 citations

Journal ArticleDOI
TL;DR: Kepler is a scientific workflow system under development across a number of scientific data management projects; it is a community-driven, open-source project that welcomes related projects and new contributors.
Abstract: Many scientific disciplines are now data and information driven, and new scientific knowledge is often gained by scientists putting together data analysis and knowledge discovery “pipelines”. A related trend is that more and more scientific communities realize the benefits of sharing their data and computational services, and are thus contributing to a distributed data and computational community infrastructure (a.k.a. “the Grid”). However, this infrastructure is only a means to an end and scientists ideally should be bothered little with its existence. The goal is for scientists to focus on development and use of what we call scientific workflows. These are networks of analytical steps that may involve, e.g., database access and querying steps, data analysis and mining steps, and many other steps including computationally intensive jobs on high performance cluster computers. In this paper we describe characteristics of and requirements for scientific workflows as identified in a number of our application projects. We then elaborate on Kepler, a particular scientific workflow system, currently under development across a number of scientific data management projects. We describe some key features of Kepler and its underlying Ptolemy II system, planned extensions, and areas of future research. Kepler is a community-driven, open source project, and we always welcome related projects and new contributors to join.

1,926 citations

Journal ArticleDOI
TL;DR: Galaxy seeks to make data-intensive research more accessible, transparent and reproducible by providing a Web-based environment in which users can perform computational analyses and have all of the details automatically tracked for later inspection, publication, or reuse.
Abstract: High-throughput data production technologies, particularly 'next-generation' DNA sequencing, have ushered in widespread and disruptive changes to biomedical research. Making sense of the large datasets produced by these technologies requires sophisticated statistical and computational methods, as well as substantial computational power. This has led to an acute crisis in life sciences, as researchers without informatics training attempt to perform computation-dependent analyses. Since 2005, the Galaxy project has worked to address this problem by providing a framework that makes advanced computational tools usable by non-experts. Galaxy seeks to make data-intensive research more accessible, transparent and reproducible by providing a Web-based environment in which users can perform computational analyses and have all of the details automatically tracked for later inspection, publication, or reuse. In this report we highlight recently added features enabling biomedical analyses on a large scale.

1,774 citations