
Leveraging Semantics to Improve Reproducibility in Scientific Workflows

TL;DR: Reproducibility of published results is a cornerstone of scientific publishing and progress; community efforts have been defining standards and patterns to assess whether an experimental result is reproducible.
Abstract: Reproducibility of published results is a cornerstone in scientific publishing and progress. Therefore, the scientific community has been encouraging authors and editors to publish their contributions in a verifiable and understandable way. Efforts such as the Reproducibility Initiative [1], or the Reproducibility Projects on Biology [2] and Psychology [3] domains, have been defining standards and patterns to assess whether an experimental result is reproducible.


Citations
01 Jan 2014
TL;DR: The Reproducibility@xsede workshop as discussed by the authors focused on reproducibility in large-scale computational research and highlighted four areas of particular interest to XSEDE: documentation and training that promotes reproducible research; system-level tools that provide build and run-time information at the level of the individual job; the need to model best practices in research collaborations involving Extreme Science and Engineering Discovery Environment staff; and continued work on gateways and related technologies.
Abstract: This is the final report on reproducibility@xsede, a one-day workshop held in conjunction with XSEDE14, the annual conference of the Extreme Science and Engineering Discovery Environment (XSEDE). The workshop's discussion-oriented agenda focused on reproducibility in large-scale computational research. Two important themes capture the spirit of the workshop submissions and discussions: (1) organizational stakeholders, especially supercomputer centers, are in a unique position to promote, enable, and support reproducible research; and (2) individual researchers should conduct each experiment as though someone will replicate that experiment. Participants documented numerous issues, questions, technologies, practices, and potentially promising initiatives emerging from the discussion, but also highlighted four areas of particular interest to XSEDE: (1) documentation and training that promotes reproducible research; (2) system-level tools that provide build- and run-time information at the level of the individual job; (3) the need to model best practices in research collaborations involving XSEDE staff; and (4) continued work on gateways and related technologies. In addition, an intriguing question emerged from the day's interactions: would there be value in establishing an annual award for excellence in reproducible research?

15 citations

Proceedings ArticleDOI
01 Jan 2017
TL;DR: This paper presents a meta-modelling framework that automates the labor-intensive, and therefore expensive, process of manually cataloging and exchanging data in scientific workflows.
Abstract: Scientific workflows provide a means to model, execute, and exchange the increasingly complex analysis pipelines necessary for today’s data-driven science. However, existing scientific workflow management ...

11 citations


Cites background from "Leveraging Semantics to Improve Rep..."

  • ..., [8, 33, 50]), to date only isolated, often domain-specific solutions addressing only subsets of these problems have been proposed (e....

    [...]

  • ...Furthermore, despite reproducibility being advocated as a major strength of scientific workflows, most systems focus only on sharing workflows, disregarding the provisioning of input data and setup of the execution environment [15, 33]....

    [...]

Journal ArticleDOI
TL;DR: This paper presents ReCAP, a framework to Reproduce scientific workflow executions using Cloud-Aware Provenance, along with mapping approaches that capture Cloud-aware provenance information and help re-provision execution resources on the Cloud with similar configurations.

10 citations


Cites background or methods from "Leveraging Semantics to Improve Rep..."

  • ...The physical approach, where actual computational hardware is made available to scientists for long periods, often conserves the computational environment, including supercomputers, clusters, or Grids (Santana-Perez et al., 2014b)....

    [...]

  • ...The same concern has been shared by Santana-Perez et al. (2014b) that most of the approaches in the conservation of computational science, in particular for scientific workflow executions, have been focused on data, code, and the workflow description....

    [...]

  • ...There have been a few recent projects (e.g. Chirigati et al. (2013); Janin et al. (2014)) and research studies, e.g. Santana-Perez et al. (2014a), on collecting provenance and using it to reproduce an experiment....

    [...]

  • ...A semantic-based approach (Santana-Perez et al., 2014a) has been proposed to improve reproducibility of workflows in the Cloud....

    [...]

  • ...Code must be available to be distributed, and data must be accessible in a readable format (Santana-Perez et al., 2014a)....

    [...]

Proceedings ArticleDOI
01 Jan 2015
TL;DR: A model that is used in the proposed design to achieve workflow reproducibility in the Cloud environment is presented, which can collect Cloud infrastructure information from an outside Cloud client along with workflow provenance and can establish a mapping between them.
Abstract: Provenance has been thought of as a mechanism to verify a workflow and to provide workflow reproducibility. Provenance capture for scientific workflows has been effectively carried out in Grid-based scientific workflow systems. However, the recent adoption of Cloud-based scientific workflows presents an opportunity to investigate the suitability of existing approaches, or to propose new approaches, to collect provenance information from the Cloud and to utilize it for workflow repeatability in the Cloud infrastructure. This paper presents a novel approach that can assist in mitigating this challenge. This approach can collect Cloud infrastructure information from an outside Cloud client along with workflow provenance and can establish a mapping between them. This mapping is later used to re-provision resources on the Cloud for workflow execution. The reproducibility of the workflow execution is achieved by: (a) capturing the Cloud infrastructure information (virtual machine configuration) along with the workflow provenance, (b) re-provisioning similar resources on the Cloud and re-executing the workflow on them, and (c) comparing the outputs of the workflows. The evaluation of the prototype suggests that the proposed approach is feasible and can be investigated further. Moreover, no reference reproducibility model exists in the literature that can provide guidelines to achieve this goal in the Cloud. This paper also presents a model that is used in the proposed design to achieve workflow reproducibility in the Cloud environment.
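
A minimal sketch of the reproducibility loop this abstract (and the ReCAP papers above) describe: record each job's provenance together with the virtual machine configuration it ran on, re-provision similar resources, re-execute, and compare output hashes. All names here (`VMConfig`, `ProvenanceMapping`, `provision`, `execute`) are illustrative assumptions, not the paper's API.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class VMConfig:
    """Cloud infrastructure information captured alongside provenance (assumed fields)."""
    flavor: str      # e.g. an instance type name
    vcpus: int
    ram_mb: int
    image_id: str    # OS image the job ran on

@dataclass
class JobProvenance:
    job_id: str
    inputs: list
    outputs: dict = field(default_factory=dict)  # output name -> sha256 hex digest

@dataclass
class ProvenanceMapping:
    """The mapping the paper establishes between workflow jobs and VM configurations."""
    entries: dict = field(default_factory=dict)  # job_id -> (JobProvenance, VMConfig)

    def record(self, prov: JobProvenance, vm: VMConfig) -> None:
        self.entries[prov.job_id] = (prov, vm)

def reproduce(mapping: ProvenanceMapping, provision, execute) -> dict:
    """Re-provision similar resources, re-run each job, and compare output hashes.

    `provision(vm)` and `execute(vm, inputs)` stand in for Cloud-client and
    workflow-engine calls; they are assumptions, not a real API.
    """
    report = {}
    for job_id, (prov, vm) in mapping.entries.items():
        new_vm = provision(vm)                       # request a VM with a similar configuration
        new_outputs = execute(new_vm, prov.inputs)   # output name -> bytes
        report[job_id] = all(
            hashlib.sha256(data).hexdigest() == prov.outputs.get(name)
            for name, data in new_outputs.items()
        )
    return report
```

Step (c) of the abstract, output comparison, is reduced here to hash equality; any domain-specific notion of "same result" could be substituted.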

4 citations

Proceedings ArticleDOI
08 Dec 2014
TL;DR: In this paper, the authors present an approach to collect provenance information from the Cloud and to utilize it for workflow repeatability in the Cloud infrastructure, mitigating the challenge posed by the Cloud's dynamic, on-demand resource provisioning.
Abstract: The transformations, analyses, and interpretations of data in scientific workflows are vital for the repeatability and reliability of scientific workflows. Provenance capture for scientific workflows has been effectively carried out in Grid-based scientific workflow systems. However, the recent adoption of Cloud-based scientific workflows presents an opportunity to investigate the suitability of existing approaches, or to propose new approaches, to collect provenance information from the Cloud and to utilize it for workflow repeatability in the Cloud infrastructure. The dynamic nature of the Cloud in comparison to the Grid makes this difficult, because resources are provisioned on demand, unlike in the Grid. This paper presents a novel approach that can assist in mitigating this challenge. This approach can collect Cloud infrastructure information along with workflow provenance and can establish a mapping between them. This mapping is later used to re-provision resources on the Cloud. The repeatability of the workflow execution is achieved by: (a) capturing the Cloud infrastructure information (virtual machine configuration) along with the workflow provenance, and (b) re-provisioning similar resources on the Cloud and re-executing the workflow on them. The evaluation of an initial prototype suggests that the proposed approach is feasible and can be investigated further.

3 citations

References
Journal ArticleDOI
TL;DR: The results of improving application performance through workflow restructuring which clusters multiple tasks in a workflow into single entities are presented.
Abstract: This paper describes the Pegasus framework that can be used to map complex scientific workflows onto distributed resources. Pegasus enables users to represent the workflows at an abstract level without needing to worry about the particulars of the target execution systems. The paper describes general issues in mapping applications and the functionality of Pegasus. We present the results of improving application performance through workflow restructuring which clusters multiple tasks in a workflow into single entities. A real-life astronomy application is used as the basis for the study.
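
The workflow-restructuring result mentioned above lends itself to a short sketch: tasks at the same depth of the workflow DAG are merged into single schedulable entities, so the execution system handles fewer, larger jobs. This is a simplified illustration of horizontal clustering under our own assumptions, not Pegasus's actual implementation.

```python
from collections import defaultdict

def level_of(task, parents, memo):
    """Depth of a task in the DAG: 0 for roots, else 1 + the deepest parent."""
    if task not in memo:
        ps = parents.get(task, [])
        memo[task] = 0 if not ps else 1 + max(level_of(p, parents, memo) for p in ps)
    return memo[task]

def cluster_horizontally(tasks, parents, max_cluster_size=3):
    """Group tasks at the same DAG level into single clustered jobs."""
    by_level, memo = defaultdict(list), {}
    for t in tasks:
        by_level[level_of(t, parents, memo)].append(t)
    clusters = []
    for level in sorted(by_level):
        group = by_level[level]
        for i in range(0, len(group), max_cluster_size):
            clusters.append(group[i:i + max_cluster_size])
    return clusters

# A fan-out/fan-in workflow: b, c, d share a level and are merged into
# one cluster, cutting per-job scheduling overhead.
parents = {"b": ["a"], "c": ["a"], "d": ["a"], "e": ["b", "c", "d"]}
print(cluster_horizontally(["a", "b", "c", "d", "e"], parents))
# [['a'], ['b', 'c', 'd'], ['e']]
```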

1,324 citations

BookDOI
01 Jan 2007
TL;DR: In this article, the authors present an overview of the current state-of-the-art within established projects, presenting many different aspects of workflow from users to tool builders, from a number of different perspectives.
Abstract: This is a timely book presenting an overview of the current state of the art within established projects, covering many different aspects of workflow from users to tool builders. It surveys active research from a number of different perspectives, includes theoretical aspects of workflow, and deals with workflow for e-Science as opposed to e-Commerce. The topics covered will be of interest to a wide range of practitioners.

810 citations

Proceedings ArticleDOI
TL;DR: This paper describes a grid-enabled version of Montage, an astronomical image mosaic service, suitable for large-scale processing of the sky, in which re-projection jobs can be added to a pool of tasks and performed by as many processors as are available, exploiting the parallelization inherent in the Montage architecture.
Abstract: This paper describes the design of a grid-enabled version of Montage, an astronomical image mosaic service, suitable for large scale processing of the sky. All the re-projection jobs can be added to a pool of tasks and performed by as many processors as are available, exploiting the parallelization inherent in the Montage architecture. We show how we can describe the Montage application in terms of an abstract workflow so that a planning tool such as Pegasus can derive an executable workflow that can be run in the Grid environment. The execution of the workflow is performed by the workflow manager DAGMan and the associated Condor-G. The grid processing will support tiling of images to a manageable size when the input images can no longer be held in memory. Montage will ultimately run operationally on the Teragrid. We describe science applications of Montage, including its application to science product generation by Spitzer Legacy Program teams and large-scale, all-sky image processing projects.
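
Since the abstract notes that all re-projection jobs can be added to a pool of tasks and performed by as many processors as are available, here is a minimal sketch of that pattern with a process pool; `reproject` is a placeholder for the real Montage re-projection step, not its actual interface.

```python
from concurrent.futures import ProcessPoolExecutor

def reproject(image_path: str) -> str:
    # Placeholder: the real job would invoke Montage's re-projection
    # on one input image and write a re-projected tile.
    return image_path + ".reprojected"

def run_reprojection_pool(image_paths, workers=4):
    # Each re-projection is independent, so the whole set can be farmed
    # out to however many processors are available.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(reproject, image_paths))

if __name__ == "__main__":
    print(run_reprojection_pool([f"tile_{i}.fits" for i in range(8)]))
```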

168 citations

Journal ArticleDOI
TL;DR: SAR, an SSD (solid-state drive)-Assisted Read scheme, is proposed, that effectively exploits the high random-read performance properties of SSDs and the unique data-sharing characteristic of deduplication-based storage systems by storing in SSDs theunique data chunks with high reference count, small size, and nonsequential characteristics.
Abstract: Data deduplication has been demonstrated to be an effective technique in reducing the total data transferred over the network and the storage space in cloud backup, archiving, and primary storage systems, such as VM (virtual machine) platforms. However, the performance of restore operations from a deduplicated backup can be significantly lower than that without deduplication. The main reason lies in the fact that a file or block is split into multiple small data chunks that are often located in different disks after deduplication, which can cause a subsequent read operation to invoke many disk IOs involving multiple disks and thus degrade the read performance significantly. While this problem has been by and large ignored in the literature thus far, we argue that the time is ripe for us to pay significant attention to it in light of the emerging cloud storage applications and the increasing popularity of the VM platform in the cloud. This is because, in a cloud storage or VM environment, a simple read request on the client side may translate into a restore operation if the data to be read or a VM suspended by the user was previously deduplicated when written to the cloud or the VM storage server, a likely scenario considering the network bandwidth and storage capacity concerns in such an environment. To address this problem, in this article, we propose SAR, an SSD (solid-state drive)-Assisted Read scheme that effectively exploits the high random-read performance properties of SSDs and the unique data-sharing characteristic of deduplication-based storage systems by storing in SSDs the unique data chunks with high reference count, small size, and nonsequential characteristics. In this way, many read requests to HDDs are replaced by read requests to SSDs, thus significantly improving the read performance of the deduplication-based storage systems in the cloud. The extensive trace-driven and VM restore evaluations on the prototype implementation of SAR show that SAR outperforms the traditional deduplication-based and flash-based cache schemes significantly in terms of average response times.
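
The selection policy this abstract names, caching in SSD the unique chunks with high reference count, small size, and nonsequential placement, can be sketched as a budgeted greedy filter. The scoring rule and all names below are our assumptions, not the paper's algorithm.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    size_bytes: int
    ref_count: int    # how many files/VM images share this chunk after deduplication
    sequential: bool  # whether the chunk lies in a long contiguous run on HDD

def select_for_ssd(chunks, ssd_budget_bytes):
    """Greedily pick nonsequential, highly shared, small chunks for the SSD cache."""
    candidates = [c for c in chunks if not c.sequential]
    # Favor chunks shared by many files relative to the SSD space they occupy.
    candidates.sort(key=lambda c: c.ref_count / c.size_bytes, reverse=True)
    chosen, used = [], 0
    for c in candidates:
        if used + c.size_bytes <= ssd_budget_bytes:
            chosen.append(c)
            used += c.size_bytes
    return chosen
```

Reads that hit these chunks are then served from the SSD instead of triggering scattered HDD seeks, which is the effect the abstract attributes to SAR.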

97 citations

Journal ArticleDOI
TL;DR: The self-correcting nature of science has been questioned repeatedly in the decade since Ioannidis (2005), “Why Most Published Research Findings Are False”, which called into question whether scientific research was truly contributing to human knowledge and whether public faith in science was misplaced.
Abstract: The self-correcting nature of science has been questioned repeatedly in the decade since Ioannidis (2005), “Why Most Published Research Findings Are False”. The title of the paper, if not the entirety of its content, was widely publicized and called into question whether scientific research was truly contributing to human knowledge and whether public faith in science was misplaced. The existential crisis that ensued within the scientific community identified a lack of replication and reproducibility as a major problem that needed to be addressed.

84 citations


Additional excerpts

  • ...Efforts such as the Reproducibility Initiative [1], or the Reproducibility Projects on Biology [2] and Psychology [3] domains, have been defining standards and patterns to assess whether an experimental result is reproducible....

    [...]