
Leveraging Semantics to Improve Reproducibility in Scientific Workflows

TL;DR: Reproducibility of published results is a cornerstone of scientific publishing and progress; community efforts have been defining standards and patterns to assess whether an experimental result is reproducible.
Abstract: Reproducibility of published results is a cornerstone in scientific publishing and progress. Therefore, the scientific community has been encouraging authors and editors to publish their contributions in a verifiable and understandable way. Efforts such as the Reproducibility Initiative [1], or the Reproducibility Projects on Biology [2] and Psychology [3] domains, have been defining standards and patterns to assess whether an experimental result is reproducible.


Citations
01 Jan 2014
TL;DR: The Reproducibility@xsede workshop as discussed by the authors focused on reproducibility in large-scale computational research and highlighted four areas of particular interest to XSEDE: documentation and training that promotes reproducible research; system-level tools that provide build and run-time information at the level of the individual job; the need to model best practices in research collaborations involving Extreme Science and Engineering Discovery Environment staff; and continued work on gateways and related technologies.
Abstract: This is the final report on reproducibility@xsede, a one-day workshop held in conjunction with XSEDE14, the annual conference of the Extreme Science and Engineering Discovery Environment (XSEDE). The workshop's discussion-oriented agenda focused on reproducibility in large-scale computational research. Two important themes capture the spirit of the workshop submissions and discussions: (1) organizational stakeholders, especially supercomputer centers, are in a unique position to promote, enable, and support reproducible research; and (2) individual researchers should conduct each experiment as though someone will replicate that experiment. Participants documented numerous issues, questions, technologies, practices, and potentially promising initiatives emerging from the discussion, but also highlighted four areas of particular interest to XSEDE: (1) documentation and training that promotes reproducible research; (2) system-level tools that provide build- and run-time information at the level of the individual job; (3) the need to model best practices in research collaborations involving XSEDE staff; and (4) continued work on gateways and related technologies. In addition, an intriguing question emerged from the day's interactions: would there be value in establishing an annual award for excellence in reproducible research?

15 citations

Proceedings ArticleDOI
01 Jan 2017
TL;DR: This paper presents a meta-modelling framework that automates the labor-intensive, and therefore expensive, process of manually cataloging and exchanging data in scientific workflows.
Abstract: Scientific workflows provide a means to model, execute, and exchange the increasingly complex analysis pipelines necessary for today’s data-driven science. However, existing scientific workflow management ...

11 citations


Cites background from "Leveraging Semantics to Improve Rep..."

  • ..., [8, 33, 50]), to date only isolated, often domain-specific solutions addressing only subsets of these problems have been proposed (e....

    [...]

  • ...Furthermore, despite reproducibility being advocated as a major strength of scientific workflows, most systems focus only on sharing workflows, disregarding the provisioning of input data and setup of the execution environment [15, 33]....

    [...]

Journal ArticleDOI
TL;DR: This paper presents ReCAP, a framework to Reproduce scientific workflow executions using Cloud-Aware Provenance, along with mapping approaches that capture Cloud-aware provenance information and help re-provision execution resources on the Cloud with similar configurations.

10 citations


Cites background or methods from "Leveraging Semantics to Improve Rep..."

  • ...The physical approach, where actual computational hardware is made available to scientists for long periods, often conserves the computational environment, including supercomputers, clusters, or Grids (Santana-Perez et al., 2014b)....

    [...]

  • ...The same concern has been shared by Santana-Perez et al. (2014b) that most of the approaches in the conservation of computational science, in particular for scientific workflow executions, have been focused on data, code, and the workflow description....

    [...]

  • ...There have been a few recent projects (e.g. Chirigati et al. (2013); Janin et al. (2014)) and research studies, e.g. Santana-Perez et al. (2014a), on collecting provenance and using it to reproduce an experiment....

    [...]

  • ...A semantic-based approach (Santana-Perez et al., 2014a) has been proposed to improve reproducibility of workflows in the Cloud....

    [...]

  • ...Code must be available to be distributed, and data must be accessible in a readable format (Santana-Perez et al., 2014a)....

    [...]

Proceedings ArticleDOI
01 Jan 2015
TL;DR: A model that is used in the proposed design to achieve workflow reproducibility in the Cloud environment is presented, which can collect Cloud infrastructure information from an outside Cloud client along with workflow provenance and can establish a mapping between them.
Abstract: Provenance has been thought of as a mechanism to verify a workflow and to provide workflow reproducibility. Provenance capture for scientific workflows has been effectively carried out in Grid-based scientific workflow systems. However, the recent adoption of Cloud-based scientific workflows presents an opportunity to investigate the suitability of existing approaches, or to propose new approaches, to collect provenance information from the Cloud and to utilize it for workflow repeatability in the Cloud infrastructure. This paper presents a novel approach that can assist in mitigating this challenge. This approach can collect Cloud infrastructure information from an outside Cloud client along with workflow provenance and can establish a mapping between them. This mapping is later used to re-provision resources on the Cloud for workflow execution. The reproducibility of the workflow execution is achieved by: (a) capturing the Cloud infrastructure information (virtual machine configuration) along with the workflow provenance, (b) re-provisioning similar resources on the Cloud and re-executing the workflow on them, and (c) comparing the outputs of the workflows. The evaluation of the prototype suggests that the proposed approach is feasible and can be investigated further. Moreover, no reference reproducibility model exists in the literature that can provide guidelines to achieve this goal in the Cloud. This paper also presents a model that is used in the proposed design to achieve workflow reproducibility in the Cloud environment.
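
A minimal sketch of the reproducibility loop this abstract (and the ReCAP papers above) describe: record each job's provenance together with the virtual machine configuration it ran on, re-provision similar resources, re-execute, and compare output hashes. All names here (`VMConfig`, `ProvenanceMapping`, `provision`, `execute`) are illustrative assumptions, not the paper's API.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class VMConfig:
    """Cloud infrastructure information captured alongside provenance (assumed fields)."""
    flavor: str      # e.g. an instance type name
    vcpus: int
    ram_mb: int
    image_id: str    # OS image the job ran on

@dataclass
class JobProvenance:
    job_id: str
    inputs: list
    outputs: dict = field(default_factory=dict)  # output name -> sha256 hex digest

@dataclass
class ProvenanceMapping:
    """The mapping the paper establishes between workflow jobs and VM configurations."""
    entries: dict = field(default_factory=dict)  # job_id -> (JobProvenance, VMConfig)

    def record(self, prov: JobProvenance, vm: VMConfig) -> None:
        self.entries[prov.job_id] = (prov, vm)

def reproduce(mapping: ProvenanceMapping, provision, execute) -> dict:
    """Re-provision similar resources, re-run each job, and compare output hashes.

    `provision(vm)` and `execute(vm, inputs)` stand in for Cloud-client and
    workflow-engine calls; they are assumptions, not a real API.
    """
    report = {}
    for job_id, (prov, vm) in mapping.entries.items():
        new_vm = provision(vm)                       # request a VM with a similar configuration
        new_outputs = execute(new_vm, prov.inputs)   # output name -> bytes
        report[job_id] = all(
            hashlib.sha256(data).hexdigest() == prov.outputs.get(name)
            for name, data in new_outputs.items()
        )
    return report
```

Step (c) of the abstract, output comparison, is reduced here to hash equality; any domain-specific notion of "same result" could be substituted.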

4 citations

Proceedings ArticleDOI
08 Dec 2014
TL;DR: In this paper, the authors present an approach to collect provenance information from the Cloud and to utilize it for workflow repeatability in the Cloud infrastructure, mitigating the challenge posed by the Cloud's dynamic, on-demand resource provisioning.
Abstract: The transformations, analyses, and interpretations of data in scientific workflows are vital for the repeatability and reliability of scientific workflows. Provenance capture for scientific workflows has been effectively carried out in Grid-based scientific workflow systems. However, the recent adoption of Cloud-based scientific workflows presents an opportunity to investigate the suitability of existing approaches, or to propose new approaches, to collect provenance information from the Cloud and to utilize it for workflow repeatability in the Cloud infrastructure. The dynamic nature of the Cloud in comparison to the Grid makes this difficult, because resources are provisioned on demand, unlike in the Grid. This paper presents a novel approach that can assist in mitigating this challenge. This approach can collect Cloud infrastructure information along with workflow provenance and can establish a mapping between them. This mapping is later used to re-provision resources on the Cloud. The repeatability of the workflow execution is achieved by: (a) capturing the Cloud infrastructure information (virtual machine configuration) along with the workflow provenance, and (b) re-provisioning similar resources on the Cloud and re-executing the workflow on them. The evaluation of an initial prototype suggests that the proposed approach is feasible and can be investigated further.

3 citations

References
Journal ArticleDOI
TL;DR: The results of improving application performance through workflow restructuring which clusters multiple tasks in a workflow into single entities are presented.
Abstract: This paper describes the Pegasus framework that can be used to map complex scientific workflows onto distributed resources. Pegasus enables users to represent the workflows at an abstract level without needing to worry about the particulars of the target execution systems. The paper describes general issues in mapping applications and the functionality of Pegasus. We present the results of improving application performance through workflow restructuring which clusters multiple tasks in a workflow into single entities. A real-life astronomy application is used as the basis for the study.
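
The workflow-restructuring result mentioned above lends itself to a short sketch: tasks at the same depth of the workflow DAG are merged into single schedulable entities, so the execution system handles fewer, larger jobs. This is a simplified illustration of horizontal clustering under our own assumptions, not Pegasus's actual implementation.

```python
from collections import defaultdict

def level_of(task, parents, memo):
    """Depth of a task in the DAG: 0 for roots, else 1 + the deepest parent."""
    if task not in memo:
        ps = parents.get(task, [])
        memo[task] = 0 if not ps else 1 + max(level_of(p, parents, memo) for p in ps)
    return memo[task]

def cluster_horizontally(tasks, parents, max_cluster_size=3):
    """Group tasks at the same DAG level into single clustered jobs."""
    by_level, memo = defaultdict(list), {}
    for t in tasks:
        by_level[level_of(t, parents, memo)].append(t)
    clusters = []
    for level in sorted(by_level):
        group = by_level[level]
        for i in range(0, len(group), max_cluster_size):
            clusters.append(group[i:i + max_cluster_size])
    return clusters

# A fan-out/fan-in workflow: b, c, d share a level and are merged into
# one cluster, cutting per-job scheduling overhead.
parents = {"b": ["a"], "c": ["a"], "d": ["a"], "e": ["b", "c", "d"]}
print(cluster_horizontally(["a", "b", "c", "d", "e"], parents))
# [['a'], ['b', 'c', 'd'], ['e']]
```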

1,324 citations

BookDOI
01 Jan 2007
TL;DR: In this article, the authors present an overview of the current state-of-the-art within established projects, presenting many different aspects of workflow from users to tool builders, from a number of different perspectives.
Abstract: This is a timely book presenting an overview of the current state of the art within established projects, covering many different aspects of workflow from users to tool builders. It surveys active research from a number of different perspectives, includes theoretical aspects of workflow, and deals with workflow for e-Science as opposed to e-Commerce. The topics covered will be of interest to a wide range of practitioners.

810 citations

Proceedings ArticleDOI
TL;DR: This paper describes a grid-enabled version of Montage, an astronomical image mosaic service, suitable for large-scale processing of the sky, in which re-projection jobs can be added to a pool of tasks and performed by as many processors as are available, exploiting the parallelization inherent in the Montage architecture.
Abstract: This paper describes the design of a grid-enabled version of Montage, an astronomical image mosaic service, suitable for large scale processing of the sky. All the re-projection jobs can be added to a pool of tasks and performed by as many processors as are available, exploiting the parallelization inherent in the Montage architecture. We show how we can describe the Montage application in terms of an abstract workflow so that a planning tool such as Pegasus can derive an executable workflow that can be run in the Grid environment. The execution of the workflow is performed by the workflow manager DAGMan and the associated Condor-G. The grid processing will support tiling of images to a manageable size when the input images can no longer be held in memory. Montage will ultimately run operationally on the Teragrid. We describe science applications of Montage, including its application to science product generation by Spitzer Legacy Program teams and large-scale, all-sky image processing projects.
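
Since the abstract notes that all re-projection jobs can be added to a pool of tasks and performed by as many processors as are available, here is a minimal sketch of that pattern with a process pool; `reproject` is a placeholder for the real Montage re-projection step, not its actual interface.

```python
from concurrent.futures import ProcessPoolExecutor

def reproject(image_path: str) -> str:
    # Placeholder: the real job would invoke Montage's re-projection
    # on one input image and write a re-projected tile.
    return image_path + ".reprojected"

def run_reprojection_pool(image_paths, workers=4):
    # Each re-projection is independent, so the whole set can be farmed
    # out to however many processors are available.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(reproject, image_paths))

if __name__ == "__main__":
    print(run_reprojection_pool([f"tile_{i}.fits" for i in range(8)]))
```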

168 citations

Journal ArticleDOI
TL;DR: SAR, an SSD (solid-state drive)-Assisted Read scheme, is proposed, that effectively exploits the high random-read performance properties of SSDs and the unique data-sharing characteristic of deduplication-based storage systems by storing in SSDs theunique data chunks with high reference count, small size, and nonsequential characteristics.
Abstract: Data deduplication has been demonstrated to be an effective technique in reducing the total data transferred over the network and the storage space in cloud backup, archiving, and primary storage systems, such as VM (virtual machine) platforms. However, the performance of restore operations from a deduplicated backup can be significantly lower than that without deduplication. The main reason lies in the fact that a file or block is split into multiple small data chunks that are often located in different disks after deduplication, which can cause a subsequent read operation to invoke many disk IOs involving multiple disks and thus degrade the read performance significantly. While this problem has been by and large ignored in the literature thus far, we argue that the time is ripe for us to pay significant attention to it in light of the emerging cloud storage applications and the increasing popularity of the VM platform in the cloud. This is because, in a cloud storage or VM environment, a simple read request on the client side may translate into a restore operation if the data to be read or a VM suspended by the user was previously deduplicated when written to the cloud or the VM storage server, a likely scenario considering the network bandwidth and storage capacity concerns in such an environment. To address this problem, in this article, we propose SAR, an SSD (solid-state drive)-Assisted Read scheme that effectively exploits the high random-read performance properties of SSDs and the unique data-sharing characteristic of deduplication-based storage systems by storing in SSDs the unique data chunks with high reference count, small size, and nonsequential characteristics. In this way, many read requests to HDDs are replaced by read requests to SSDs, thus significantly improving the read performance of the deduplication-based storage systems in the cloud. The extensive trace-driven and VM restore evaluations on the prototype implementation of SAR show that SAR outperforms the traditional deduplication-based and flash-based cache schemes significantly in terms of average response times.
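
The selection policy this abstract names, caching in SSD the unique chunks with high reference count, small size, and nonsequential placement, can be sketched as a budgeted greedy filter. The scoring rule and all names below are our assumptions, not the paper's algorithm.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    size_bytes: int
    ref_count: int    # how many files/VM images share this chunk after deduplication
    sequential: bool  # whether the chunk lies in a long contiguous run on HDD

def select_for_ssd(chunks, ssd_budget_bytes):
    """Greedily pick nonsequential, highly shared, small chunks for the SSD cache."""
    candidates = [c for c in chunks if not c.sequential]
    # Favor chunks shared by many files relative to the SSD space they occupy.
    candidates.sort(key=lambda c: c.ref_count / c.size_bytes, reverse=True)
    chosen, used = [], 0
    for c in candidates:
        if used + c.size_bytes <= ssd_budget_bytes:
            chosen.append(c)
            used += c.size_bytes
    return chosen
```

Reads that hit these chunks are then served from the SSD instead of triggering scattered HDD seeks, which is the effect the abstract attributes to SAR.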

97 citations

Journal ArticleDOI
TL;DR: The self-correcting nature of science has been questioned repeatedly in the decade since Ioannidis (2005), “Why Most Published Research Findings Are False”, which called into question whether scientific research was truly contributing to human knowledge and whether public faith in science was misplaced.
Abstract: The self-correcting nature of science has been questioned repeatedly in the decade since Ioannidis (2005), “Why Most Published Research Findings Are False”. The title of the paper, if not the entirety of its content, was widely publicized and called into question whether scientific research was truly contributing to human knowledge and whether public faith in science was misplaced. The existential crisis that ensued within the scientific community identified a lack of replication and reproducibility as a major problem that needed to be addressed.

84 citations


Additional excerpts

  • ...Efforts such as the Reproducibility Initiative [1], or the Reproducibility Projects on Biology [2] and Psychology [3] domains, have been defining standards and patterns to assess whether an experimental result is reproducible....

    [...]