
Showing papers on "Scientific workflow system" published in 2018


Journal ArticleDOI
TL;DR: The Computational Testing for Automated Preprocessing (CTAP) toolbox facilitates batch processing that is easy for experts and novices alike, as well as testing and comparison of preprocessing methods; the application of CTAP to high-resolution EEG data is demonstrated in three modes of use.
Abstract: Existing tools for the preprocessing of EEG data provide a large choice of methods to suitably prepare and analyse a given dataset. Yet it remains a challenge for the average user to integrate methods for batch processing of the increasingly large datasets of modern research, and to compare methods to choose an optimal approach across the many possible parameter configurations. Additionally, many tools still require a high degree of manual decision making for, e.g., the classification of artefacts in channels, epochs or segments. This introduces extra subjectivity, is slow, and is not reproducible. Batching and well-designed automation can help to regularise EEG preprocessing, and thus reduce human effort, subjectivity, and consequent error. The Computational Testing for Automated Preprocessing (CTAP) toolbox facilitates: i) batch processing that is easy for experts and novices alike; ii) testing and comparison of preprocessing methods. Here we demonstrate the application of CTAP to high-resolution EEG data in three modes of use. First, a linear processing pipeline with mostly default parameters illustrates ease-of-use for naive users. Second, a branching pipeline illustrates CTAP's support for comparison of competing methods. Third, a pipeline with built-in parameter-sweeping illustrates CTAP's capability to support data-driven method parametrisation. CTAP extends the existing data structure and functions from the well-known EEGLAB toolbox, based on Matlab, and produces extensive quality control outputs. CTAP is available under the MIT open-source licence from https://github.com/bwrc/ctap.
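
To make the branching idea concrete, here is a minimal, hypothetical Python sketch of a pipeline that runs one shared filtering step and then compares two competing artefact-rejection branches by a simple quality metric. CTAP itself is built on Matlab/EEGLAB, so none of the function or step names below come from its actual API; they are assumptions for illustration only.

```python
# Hypothetical sketch of a branching preprocessing pipeline in the spirit of
# CTAP's "compare competing methods" mode. Not CTAP's (Matlab/EEGLAB) API.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Recording:
    data: List[float]
    history: List[str] = field(default_factory=list)

def step(name: str, fn: Callable[[Recording], Recording]) -> Callable[[Recording], Recording]:
    """Wrap a processing function so each application is logged in the history."""
    def wrapped(rec: Recording) -> Recording:
        out = fn(rec)
        out.history = rec.history + [name]
        return out
    return wrapped

# Shared linear stem: a toy "filter" that just removes the mean.
filter_step = step("detrend", lambda r: Recording([x - sum(r.data) / len(r.data) for x in r.data]))

# Two competing artefact-rejection branches.
branch_a = step("reject_fixed_threshold", lambda r: Recording([x for x in r.data if abs(x) < 100.0]))
branch_b = step("reject_3sigma", lambda r: Recording(
    [x for x in r.data if abs(x) < 3.0 * (sum(y * y for y in r.data) / len(r.data)) ** 0.5]))

def quality(rec: Recording) -> float:
    # Toy quality metric: number of samples retained after rejection.
    return float(len(rec.data))

raw = Recording(data=[0.5, 120.0, -0.3, 2.1, -95.0, 0.7])
stem = filter_step(raw)
for label, branch in {"fixed threshold": branch_a, "3-sigma": branch_b}.items():
    cleaned = branch(stem)
    print(label, cleaned.history, "quality:", quality(cleaned))
```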

8 citations


Journal ArticleDOI
TL;DR: A set of design dimensions and conventions for provenance collection mechanisms in scientific workflow systems is identified and defined, and then used to evaluate a number of existing provenance collection mechanisms.
Abstract: Scientific workflow management systems (SWfMSs) run scientific experiments. They manage sequences of complex process transformations and collect provenance information at various levels of abstraction. Provenance information collected from scientific experiments documents how experimental results are derived from input values, experimental parameters, and workflow configurations. Provenance greatly enhances the usability and acceptance of workflow systems among scientists, because it allows workflow systems to capture process configuration and behaviour at different levels of detail. On this basis, a sufficient level of collected provenance information enables scientists to validate their hypotheses and make a workflow reproducible. Currently, SWfMSs do not use a standard or portable provenance model for capturing, storing, querying, or representing provenance. There are a variety of design issues in provenance models and mechanisms in workflow systems, owing to the variation of design dimensions in workflow architectures. Given this variety, it seems desirable to classify provenance mechanisms in workflow systems. We aim to survey provenance collection mechanisms that are either part of a scientific workflow system or of a software infrastructure that supports provenance collection in a scientific workflow system. In this paper, we first identify and define a set of design dimensions and conventions for provenance collection mechanisms in the context of scientific workflow systems. We then survey a set of scientific workflow projects based on our design dimensions, with an emphasis on provenance collection mechanisms. Finally, those conventions are used to evaluate a number of existing provenance collection mechanisms, presented at the end of this paper. This survey provides an understanding of the primary design issues for provenance collection mechanisms, along with a set of desirable design dimensions.
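
As a rough illustration of the kind of retrospective provenance record such mechanisms collect, the following Python sketch logs a task's inputs, parameters, outputs, and timing. The field names and the `run_with_provenance` helper are assumptions for illustration, not a standard provenance schema or any surveyed system's API.

```python
# Minimal sketch of a retrospective provenance record for one workflow task.
# Field names are illustrative assumptions, not a standard schema (e.g. W3C PROV).
import json
import time
from dataclasses import dataclass, asdict, field
from typing import Dict, List

@dataclass
class TaskProvenance:
    task_name: str
    inputs: List[str]                 # input data identifiers
    parameters: Dict[str, str]        # configuration used for this run
    outputs: List[str] = field(default_factory=list)
    started_at: float = 0.0
    finished_at: float = 0.0

def run_with_provenance(task_name, fn, inputs, parameters):
    """Run fn and record what was consumed, produced, and configured."""
    record = TaskProvenance(task_name, inputs, parameters, started_at=time.time())
    record.outputs = fn(inputs, parameters)
    record.finished_at = time.time()
    return record

# Example: a trivial "alignment" task whose provenance could later be queried
# to reproduce the run or validate a hypothesis.
prov = run_with_provenance(
    "align_sequences",
    lambda ins, params: [f"{i}.aligned" for i in ins],
    inputs=["sample1.fastq", "sample2.fastq"],
    parameters={"algorithm": "toy", "threads": "4"},
)
print(json.dumps(asdict(prov), indent=2))
```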

6 citations


Proceedings ArticleDOI
29 Oct 2018
TL;DR: This paper introduces the Gfarm file system, which uses the storage of the worker nodes for parallel I/O performance, and the Pwrake workflow system, which replaces the parallel processing framework developed for the HSC pipeline in order to improve core utilization.
Abstract: In this paper, we describe a use case applying a scientific workflow system and a distributed file system to improve the performance of telescope data processing. The application is pipeline processing of data generated by Hyper Suprime-Cam (HSC), a focal-plane camera mounted on the Subaru telescope. We focus on the scalability of parallel I/O and on core utilization. The IBM Spectrum Scale (GPFS) system used in actual operation has limited scalability due to its configuration based on storage servers. Therefore, we introduce the Gfarm file system, which uses the storage of the worker nodes for parallel I/O performance. To improve core utilization, we introduce the Pwrake workflow system in place of the parallel processing framework developed for the HSC pipeline. Descriptions of task dependencies are necessary to further improve core utilization by overlapping different types of tasks. We discuss the usefulness of a workflow description language with the capabilities of a scripting language for defining complex task dependencies. In the experiment, the performance of the pipeline is evaluated using a quarter of the observation data per night (input files: 80 GB, output files: 1.2 TB). Measurements of strong scaling from 48 to 576 cores show that processing with the Gfarm file system is more scalable than with GPFS. Measurement using 576 cores shows that our method improves the processing speed of the pipeline by 2.2 times compared with the method used in actual operation.
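
Pwrake workflows are defined in Ruby Rakefiles; the sketch below is only a hypothetical Python rendering of the underlying idea, namely describing task dependencies as a graph so that independent tasks can run in parallel and different task types can be overlapped. The task names are invented for illustration and are not the HSC pipeline's.

```python
# Toy dependency graph and wave-by-wave scheduler illustrating why explicit
# task dependencies help core utilization. Not Pwrake's actual syntax.

# task -> list of tasks it depends on (names are illustrative).
deps = {
    "bias_subtract_ccd1": [],
    "bias_subtract_ccd2": [],
    "flat_field_ccd1": ["bias_subtract_ccd1"],
    "flat_field_ccd2": ["bias_subtract_ccd2"],
    "mosaic": ["flat_field_ccd1", "flat_field_ccd2"],
}

def schedule(deps):
    """Yield waves of tasks whose dependencies are already satisfied."""
    done = set()
    remaining = dict(deps)
    while remaining:
        ready = [t for t, ds in remaining.items() if all(d in done for d in ds)]
        if not ready:
            raise ValueError("cycle in task graph")
        yield ready                      # these tasks can run in parallel
        done.update(ready)
        for t in ready:
            remaining.pop(t)

for wave in schedule(deps):
    print("run in parallel:", wave)
```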

4 citations


Book ChapterDOI
04 Oct 2018
TL;DR: A scalable patent analytics framework built on top of a big-data architecture and a scientific workflow system allows essential services for patent analysis to be seamlessly integrated, employing natural language processing as well as machine learning algorithms for deeply structuring and semantically annotating patent texts in order to realize complex scientific workflows.
Abstract: The analysis of large volumes of complex scientific information such as patents requires new methods and a flexible, highly interactive and easy-to-use platform in order to enable a variety of applications, ranging from information search and semantic analysis to specific text- and data-mining tasks for information professionals in industry and research. In this paper, we present a scalable patent analytics framework built on top of a big-data architecture and a scientific workflow system. The framework allows essential services for patent analysis to be seamlessly integrated, employing natural language processing as well as machine learning algorithms for deeply structuring and semantically annotating patent texts in order to realize complex scientific workflows. In two case studies we show how the framework can be utilized for querying, annotating and analyzing large amounts of patent data.
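
As a hedged sketch of the general pattern, the Python snippet below chains text-annotation steps into a small pipeline. The annotators, field names, and rules are invented for illustration and do not reflect the paper's actual services, architecture, or APIs.

```python
# Hypothetical chaining of patent-text annotation steps as workflow stages.
import re
from typing import Callable, Dict, List

Document = Dict[str, object]          # {"text": ..., "sentences": [...], "annotations": [...]}
Annotator = Callable[[Document], Document]

def sentence_splitter(doc: Document) -> Document:
    doc["sentences"] = [s.strip() for s in re.split(r"(?<=[.!?])\s+", str(doc["text"])) if s.strip()]
    return doc

def claim_tagger(doc: Document) -> Document:
    # Naive rule: mark sentences that look like patent claims.
    doc["annotations"] = [
        {"sentence": s, "label": "claim"}
        for s in doc.get("sentences", []) if s.lower().startswith("claim")
    ]
    return doc

def pipeline(steps: List[Annotator]) -> Annotator:
    def run(doc: Document) -> Document:
        for step in steps:
            doc = step(doc)
        return doc
    return run

annotate = pipeline([sentence_splitter, claim_tagger])
result = annotate({"text": "Claim 1: a device comprising a sensor. The sensor measures pressure."})
print(result["annotations"])
```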

3 citations


Posted Content
TL;DR: It is argued that providing a compute-node-side storage system is not sufficient to fully exploit data locality in the HPC storage hierarchy, and that a cross-layer solution combining the storage system, compiler, and runtime is necessary.
Abstract: Scientific applications in HPC environments are more complex and more data-intensive nowadays. Scientists usually rely on workflow systems to manage the complexity: they simply define multiple processing steps in a single script and let the workflow system compile it and schedule all tasks accordingly. Numerous workflow systems have been proposed and are widely used, like Galaxy, Pegasus, Taverna, Kepler, Swift, and AWE, to name a few examples. Traditionally, scientific workflow systems work with parallel file systems, like Lustre, PVFS, Ceph, or other forms of remote shared storage systems. As such, the data (including the intermediate data generated during workflow execution) need to be transferred back and forth between compute nodes and storage systems, which introduces a significant performance bottleneck on I/O operations. Along with the widening performance gap between CPUs and storage devices, this bottleneck is expected to get worse. Recently, we introduced the concept of Compute-on-Data-Path to make task and data binding more efficient and reduce the data movement cost. For workflow systems, the key is to exploit the data locality in the HPC storage hierarchy: if the datasets are stored in compute nodes, near the workflow tasks, then the tasks can access them directly with better performance and less network usage. Several recent studies have addressed building such a shared storage system, utilizing compute node resources, to serve HPC workflows with locality, such as Hercules [1] and WOSS [2]. In this research, we further argue that providing a compute-node-side storage system is not sufficient to fully exploit data locality. A cross-layer solution combining the storage system, compiler, and runtime is necessary. We take Swift/T [3], a workflow system for data-intensive applications, as a prototype platform to demonstrate such a cross-layer solution.
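
The locality argument can be illustrated with a toy placement heuristic: schedule each task on the compute node that already holds most of its inputs. The sketch below is an assumption-laden illustration with invented node and file names, not Swift/T's scheduler or the paper's cross-layer design.

```python
# Toy locality-aware task placement: prefer the node holding most of a task's
# input data, falling back to remote reads otherwise.
from collections import Counter
from typing import Dict, List

# Which node stores each dataset (node names are hypothetical).
data_location: Dict[str, str] = {
    "raw_0.dat": "node1",
    "raw_1.dat": "node1",
    "calib.dat": "node2",
}

def place_task(inputs: List[str]) -> str:
    """Pick the node holding the largest share of the task's inputs."""
    votes = Counter(data_location.get(path, "remote_storage") for path in inputs)
    node, local_hits = votes.most_common(1)[0]
    to_transfer = len(inputs) - local_hits
    print(f"placing on {node}: {local_hits} local inputs, {to_transfer} to transfer")
    return node

place_task(["raw_0.dat", "raw_1.dat", "calib.dat"])   # -> node1, one transfer
```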

1 citation


01 Jun 2018
TL;DR: The development of this research instrument, based on the Scientific Workflow System concept, has commenced, with the aim of making both the instrument and the resulting data freely available, true to the spirit of the open source and open data tradition.
Abstract: In recent years, the quantity of open data has multiplied dramatically. This newfound wealth of data has raised a number of issues for any research exercise within the field of spatial planning: underpowered traditional tools and methods, difficulties in tracking data, transparency issues, sharing difficulties and, above all, an erratic workflow. Some new research tools to counter this irksome tendency do exist at the moment, but, unfortunately, we feel that there is still ample room for improvement. We have therefore commenced the development of our own research instrument, based on the Scientific Workflow System concept. This paper lays the foundation for its architecture. Once completed, both the instrument and the resulting data shall be freely available, true to the spirit of the open source and open data tradition. We strongly believe that spatial planning professionals and researchers might find it interesting and worthwhile for increasing the quality and speed of their work.

Book ChapterDOI
Qi Sun, Yue Liu, Wenjie Tian, Yike Guo, Bocheng Li
10 Dec 2018
TL;DR: This paper proposes a universal multi-domain intelligent scientific data processing workflow framework (UMDISW), which builds a general model usable across multiple domains by defining directed graphs and descriptors, and makes the underlying layer transparent so that scientists can focus on high-level experimental design.
Abstract: Existing scientific data management systems rarely manage scientific data from a whole-life-cycle perspective, and the value-creating steps defined throughout the cycle essentially constitute a scientific workflow. The scientific workflow systems developed by many organizations can meet their own domain-oriented needs well, but from the perspective of scientific data as a whole, there is a lack of a common framework spanning multiple domains. At the same time, some systems require scientists to understand the underlying internals of the system, which increases the workload and research costs of scientists. In this context, this paper proposes a universal multi-domain intelligent scientific data processing workflow framework (UMDISW), which builds a general model that can be used in multiple domains by defining directed graphs and descriptors, and makes the underlying layer transparent so that scientists can focus on high-level experimental design. On this basis, the paper also uses scientific data as a driving force, incorporating a mechanism for intelligently recommending algorithms into the workflow to reduce the workload of scientific experiments and provide decision support for exploring new scientific discoveries.
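
To illustrate the directed-graph-plus-descriptor idea, here is a minimal Python sketch in which each node carries a descriptor (domain, operation, parameters) and edges define the processing order. The class and field names are assumptions for illustration, not UMDISW's actual model or schema.

```python
# Hypothetical directed graph whose nodes carry descriptors, keeping the
# underlying execution layer out of the scientist's view.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class NodeDescriptor:
    name: str
    domain: str          # e.g. "astronomy", "genomics"
    operation: str       # human-readable description of the step
    params: Dict[str, str]

class WorkflowGraph:
    def __init__(self):
        self.nodes: Dict[str, NodeDescriptor] = {}
        self.edges: List[Tuple[str, str]] = []    # (upstream, downstream)

    def add_node(self, desc: NodeDescriptor):
        self.nodes[desc.name] = desc

    def add_edge(self, src: str, dst: str):
        self.edges.append((src, dst))

    def describe(self):
        for name, desc in self.nodes.items():
            downstream = [d for s, d in self.edges if s == name]
            print(f"{name} [{desc.domain}] {desc.operation} -> {downstream}")

g = WorkflowGraph()
g.add_node(NodeDescriptor("ingest", "generic", "load raw observations", {"format": "csv"}))
g.add_node(NodeDescriptor("clean", "generic", "remove invalid records", {"strategy": "drop"}))
g.add_node(NodeDescriptor("model", "astronomy", "fit light-curve model", {"iterations": "100"}))
g.add_edge("ingest", "clean")
g.add_edge("clean", "model")
g.describe()
```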