
Showing papers on "Workflow published in 2015"


Book
01 Jun 2015
TL;DR: A practical primer on how to calculate and report effect sizes for t-tests and ANOVAs, such that effect sizes can be used in a priori power analyses and meta-analyses, is provided, along with a detailed overview of the similarities and differences between within- and between-subjects designs.
Abstract: Effect sizes are the most important outcome of empirical studies. Most articles on effect sizes highlight their importance for communicating the practical significance of results. For scientists themselves, effect sizes are most useful because they facilitate cumulative science. Effect sizes can be used to determine the sample size for follow-up studies, or for examining effects across studies. This article aims to provide a practical primer on how to calculate and report effect sizes for t-tests and ANOVAs such that effect sizes can be used in a priori power analyses and meta-analyses. Whereas many articles about effect sizes focus on between-subjects designs and address within-subjects designs only briefly, I provide a detailed overview of the similarities and differences between within- and between-subjects designs. I suggest that some research questions in experimental psychology examine inherently intra-individual effects, which makes effect sizes that incorporate the correlation between measures the best summary of the results. Finally, a supplementary spreadsheet is provided to make it as easy as possible for researchers to incorporate effect size calculations into their workflow.

5,374 citations
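For the two designs contrasted in this entry, the commonly used textbook definitions are roughly as follows; this is a sketch of the standard formulas (bias-corrected variants such as Hedges' g differ slightly), not a transcription of the paper's supplementary spreadsheet. For two independent groups, Cohen's d based on the pooled standard deviation is

d_s = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)\,SD_1^2 + (n_2 - 1)\,SD_2^2}{n_1 + n_2 - 2}}}

while for a within-subjects (paired) comparison, Cohen's d_z is based on the difference scores, and therefore incorporates the correlation between the paired measures:

d_z = \frac{\bar{X}_{\mathrm{diff}}}{SD_{\mathrm{diff}}} = \frac{t}{\sqrt{n}}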


Journal ArticleDOI
TL;DR: How the popular emerging technology Docker combines several areas from systems research - such as operating system virtualization, cross-platform portability, modular re-usable elements, versioning, and a 'DevOps' philosophy - to address these challenges is examined.
Abstract: As computational work becomes more and more integral to many aspects of scientific research, computational reproducibility has become an issue of increasing importance to computer systems researchers and domain scientists alike. Though computational reproducibility seems more straightforward than replicating physical experiments, the complex and rapidly changing nature of computer environments makes being able to reproduce and extend such work a serious challenge. In this paper, I explore common reasons that code developed for one research project cannot be successfully executed or extended by subsequent researchers. I review current approaches to these issues, including virtual machines and workflow systems, and their limitations. I then examine how the popular emerging technology Docker combines several areas from systems research - such as operating system virtualization, cross-platform portability, modular re-usable elements, versioning, and a 'DevOps' philosophy - to address these challenges. I illustrate this with several examples of Docker use with a focus on the R statistical environment.

729 citations
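As a rough illustration of the approach described in this entry (not code from the paper), the sketch below drives an R computation inside a version-pinned Docker container from Python. The rocker/r-base image name and the 4.3.1 tag are assumptions chosen for the example; a real project would pin whatever image and tag reproduce its environment.

# Sketch: execute an R expression inside a pinned Docker image so that the operating
# system, R version, and installed packages are fixed by the image tag.
# Assumes Docker is installed and the chosen image/tag is available.
import os
import subprocess

IMAGE = "rocker/r-base:4.3.1"   # hypothetical pinned tag; use whatever your project requires

def run_in_container(r_expression: str) -> str:
    """Run a single R expression in a throwaway container and return its stdout."""
    cmd = [
        "docker", "run", "--rm",                 # remove the container after it exits
        "-v", f"{os.getcwd()}:/work",            # mount the project directory
        "-w", "/work",
        IMAGE,
        "Rscript", "-e", r_expression,
    ]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

if __name__ == "__main__":
    print(run_in_container('cat(mean(rnorm(1000)), "\\n")'))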


Journal ArticleDOI
TL;DR: An integrated view of the Pegasus system is provided, showing its capabilities that have been developed over time in response to application needs and to the evolution of the scientific computing platforms.

701 citations


Journal ArticleDOI
TL;DR: FireWorks has been used to complete over 50 million CPU‐hours worth of computational chemistry and materials science calculations at the National Energy Research Supercomputing Center, and its implementation strategy that rests on Python and NoSQL databases (MongoDB) is discussed.
Abstract: This paper introduces FireWorks, a workflow software for running high-throughput calculation workflows at supercomputing centers. FireWorks has been used to complete over 50 million CPU-hours worth of computational chemistry and materials science calculations at the National Energy Research Supercomputing Center. It has been designed to serve the demanding high-throughput computing needs of these applications, with extensive support for (i) concurrent execution through job packing, (ii) failure detection and correction, (iii) provenance and reporting for long-running projects, (iv) automated duplicate detection, and (v) dynamic workflows (i.e., modifying the workflow graph during runtime). We have found that these features are highly relevant to enabling modern data-driven and high-throughput science applications, and we discuss our implementation strategy that rests on Python and NoSQL databases (MongoDB). Finally, we present performance data and limitations of our approach along with planned future work. Copyright © 2015 John Wiley & Sons, Ltd.

405 citations
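FireWorks is distributed as a Python package, and basic usage looks roughly like the sketch below, written in the style of its hello-world examples. It assumes the fireworks package is installed and a MongoDB instance is reachable at the default localhost address used by LaunchPad; the task names and echo commands are placeholders.

# Sketch: a two-step FireWorks workflow stored in MongoDB and executed locally.
from fireworks import Firework, LaunchPad, ScriptTask, Workflow
from fireworks.core.rocket_launcher import rapidfire

launchpad = LaunchPad()  # default connection: localhost:27017, database "fireworks"

fw_relax = Firework(ScriptTask.from_str('echo "relax structure"'), name="relax")
fw_static = Firework(ScriptTask.from_str('echo "static calculation"'), name="static")

# The links dict encodes the workflow graph: fw_static runs only after fw_relax completes.
wf = Workflow([fw_relax, fw_static], {fw_relax: [fw_static]}, name="toy_workflow")
launchpad.add_wf(wf)

rapidfire(launchpad)  # pull and execute ready jobs until none remain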


Journal ArticleDOI
TL;DR: Two key elements in collaborative workflows, the consistency of data sharing and the reproducibility of calculation results, are embedded in the IBEX workflow: image data, feature algorithms, and model validation, including newly developed ones from different users, can be easily and consistently shared so that results can be more easily reproduced between institutions.
Abstract: Purpose: Radiomics, which is the high-throughput extraction and analysis of quantitative image features, has been shown to have considerable potential to quantify the tumor phenotype. However, at present, a lack of software infrastructure has impeded the development of radiomics and its applications. Therefore, the authors developed the imaging biomarker explorer (IBEX), an open infrastructure software platform that flexibly supports common radiomics workflow tasks such as multimodality image data import and review, development of feature extraction algorithms, model validation, and consistent data sharing among multiple institutions. Methods: The IBEX software package was developed using the MATLAB and C/C++ programming languages. The software architecture deploys the modern model-view-controller, unit testing, and function handle programming concepts to isolate each quantitative imaging analysis task, to validate whether the relevant data and algorithms are fit for use, and to plug in new modules. On one hand, IBEX is self-contained and ready to use: it has implemented common data importers, common image filters, and common feature extraction algorithms. On the other hand, IBEX provides an integrated development environment on top of MATLAB and C/C++, so users are not limited to its built-in functions. In the IBEX developer studio, users can plug in, debug, and test new algorithms, extending IBEX's functionality. IBEX also supports quality assurance for data and feature algorithms: image data, regions of interest, and feature algorithm-related data can be reviewed, validated, and/or modified. More importantly, two key elements in collaborative workflows, the consistency of data sharing and the reproducibility of calculation results, are embedded in the IBEX workflow: image data, feature algorithms, and model validation, including newly developed ones from different users, can be easily and consistently shared so that results can be more easily reproduced between institutions. Results: Researchers with a variety of technical skill levels, including radiation oncologists, physicists, and computer scientists, have found the IBEX software to be intuitive, powerful, and easy to use. IBEX can be run on any computer with the Windows operating system and 1 GB of RAM. The authors fully validated the implementation of all importers, preprocessing algorithms, and feature extraction algorithms. Windows version 1.0 beta of stand-alone IBEX and IBEX's source code can be downloaded. Conclusions: The authors successfully implemented IBEX, an open infrastructure software platform that streamlines common radiomics workflow tasks. Its transparency, flexibility, and portability can greatly accelerate the pace of radiomics research and pave the way toward successful clinical translation.

264 citations
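IBEX itself is a MATLAB/C++ platform, so the following is only a language-agnostic illustration in Python, with hypothetical array inputs, of the kind of first-order feature extraction step such a radiomics workflow automates; IBEX's own feature set (including texture features) is much richer than this.

# Sketch: first-order radiomics features from an image volume restricted to a region of interest.
# `image` and `roi_mask` are hypothetical numpy arrays of identical shape.
import numpy as np

def first_order_features(image: np.ndarray, roi_mask: np.ndarray, n_bins: int = 64) -> dict:
    voxels = image[roi_mask.astype(bool)]
    hist, _ = np.histogram(voxels, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]                                  # drop empty bins before the entropy sum
    return {
        "mean": float(voxels.mean()),
        "std": float(voxels.std(ddof=1)),
        "min": float(voxels.min()),
        "max": float(voxels.max()),
        "entropy": float(-(p * np.log2(p)).sum()),
    }

# Example with synthetic data standing in for a CT volume and a tumor mask:
img = np.random.default_rng(0).normal(size=(32, 32, 16))
mask = np.zeros_like(img)
mask[8:24, 8:24, 4:12] = 1
print(first_order_features(img, mask))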


Journal ArticleDOI
01 Dec 2015
TL;DR: A survey of data-intensive scientific workflow management in SWfMSs and their parallelization techniques is provided based on a SWfMS functional architecture, and a comparative analysis of the existing solutions is given.
Abstract: Nowadays, more and more computer-based scientific experiments need to handle massive amounts of data. Their data processing consists of multiple computational steps with dependencies among them. A data-intensive scientific workflow is useful for modeling such a process. Since the sequential execution of data-intensive scientific workflows may take much time, Scientific Workflow Management Systems (SWfMSs) should enable the parallel execution of data-intensive scientific workflows and exploit the resources distributed in different infrastructures such as grid and cloud. This paper provides a survey of data-intensive scientific workflow management in SWfMSs and their parallelization techniques. Based on a SWfMS functional architecture, we give a comparative analysis of the existing solutions. Finally, we identify research issues for improving the execution of data-intensive scientific workflows in a multisite cloud.

244 citations
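To make the notion of parallelizing a workflow concrete, here is a minimal, generic sketch (not tied to any particular SWfMS) that executes a small task graph level by level, running tasks whose dependencies are satisfied concurrently. The four task names and the thread pool are placeholders; a real SWfMS would dispatch programs to grid or cloud resources instead.

# Sketch: level-by-level parallel execution of a toy workflow DAG.
from concurrent.futures import ThreadPoolExecutor

# dependencies: task -> set of tasks that must finish first (hypothetical 4-task workflow)
deps = {"split": set(), "map_a": {"split"}, "map_b": {"split"}, "reduce": {"map_a", "map_b"}}

def run_task(name: str) -> str:
    print(f"running {name}")            # placeholder for invoking a real program on remote resources
    return name

done: set[str] = set()
with ThreadPoolExecutor(max_workers=4) as pool:
    while len(done) < len(deps):
        ready = [t for t, d in deps.items() if t not in done and d <= done]
        for finished in pool.map(run_task, ready):   # independent ready tasks execute in parallel
            done.add(finished)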


Journal ArticleDOI
TL;DR: This paper makes a comprehensive survey of workflow scheduling in cloud environment in a problem–solution manner and conducts taxonomy and comparative review on workflow scheduling algorithms.
Abstract: To program in distributed computing environments such as grids and clouds, workflow is adopted as an attractive paradigm for its powerful ability to express a wide range of applications, including scientific computing, multi-tier Web, and big data processing applications. With the development of cloud technology and the extensive deployment of cloud platforms, the problem of workflow scheduling in the cloud has become an important research topic. The challenges of the problem lie in: the NP-hard nature of task-resource mapping; diverse QoS requirements; on-demand resource provisioning; performance fluctuation and failure handling; hybrid resource scheduling; and data storage and transmission optimization. Consequently, a number of studies, focusing on different aspects, have emerged in the literature. In this paper, we first conduct a taxonomy and comparative review of workflow scheduling algorithms. Then, we make a comprehensive survey of workflow scheduling in the cloud environment in a problem-solution manner. Based on the analysis, we also highlight some research directions for future investigation.

206 citations
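As a tiny illustration of the task-resource mapping problem this survey covers, the sketch below applies a generic greedy earliest-finish-time heuristic (not an algorithm from the paper): rank the ready tasks and assign each to the VM type that completes it soonest. Task runtimes and VM speeds are invented; real schedulers also account for data transfer, pricing, deadlines, and other QoS constraints.

# Sketch: greedy earliest-finish-time assignment of independent ready tasks to VM types.
tasks = {"t1": 100.0, "t2": 40.0, "t3": 70.0}          # task -> reference runtime (seconds)
vms = {"small": 1.0, "medium": 2.0, "large": 4.0}       # VM type -> relative speed
vm_free_at = {v: 0.0 for v in vms}                      # when each VM becomes available

schedule = []
for task, work in sorted(tasks.items(), key=lambda kv: -kv[1]):   # longest task first
    finish = {v: vm_free_at[v] + work / speed for v, speed in vms.items()}
    best = min(finish, key=finish.get)                  # earliest finish time wins
    vm_free_at[best] = finish[best]
    schedule.append((task, best, finish[best]))

print(schedule)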


Journal ArticleDOI
TL;DR: It is found that the key factor determining the performance of an algorithm is its ability to decide which workflows in an ensemble to admit or reject for execution, and an admission procedure based on workflow structure and estimates of task runtimes can significantly improve the quality of solutions.

204 citations
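The TL;DR above concerns admitting or rejecting whole workflows from an ensemble. A crude sketch of such an admission test, purely illustrative and not the paper's algorithm, is to compare a workflow's estimated cost, derived from its task runtime estimates, against the remaining budget before accepting it for execution.

# Sketch: admit a workflow only if its estimated cost fits within the remaining budget.
# The runtime estimates, hourly price, and budget are hypothetical inputs.
def admit(estimated_task_runtimes_s, price_per_hour, remaining_budget):
    estimated_cost = sum(estimated_task_runtimes_s) / 3600.0 * price_per_hour
    return estimated_cost <= remaining_budget

print(admit([1200, 3600, 900], price_per_hour=0.50, remaining_budget=1.00))  # True: ~0.79 <= 1.00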


Proceedings ArticleDOI
Caitlin Sadowski, Jeffrey van Gogh, Ciera Jaspan, Emma Söderberg, Collin Winter
16 May 2015
TL;DR: This work presents Tricorder, a program analysis platform aimed at building a data-driven ecosystem around program analysis, and presents a set of guiding principles for program analysis tools and a scalable architecture for an analysis platform implementing these principles.
Abstract: Static analysis tools help developers find bugs, improve code readability, and ensure consistent style across a project. However, these tools can be difficult to smoothly integrate with each other and into the developer workflow, particularly when scaling to large codebases. We present Tricorder, a program analysis platform aimed at building a data-driven ecosystem around program analysis. We present a set of guiding principles for our program analysis tools and a scalable architecture for an analysis platform implementing these principles. We include an empirical, in-situ evaluation of the tool as it is used by developers across Google that shows the usefulness and impact of the platform.

192 citations
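Tricorder's internal interfaces are not reproduced here; the sketch below merely illustrates, with hypothetical names, the general shape of a platform in which independent analyzers consume changed files and emit findings, and in which one misbehaving analyzer cannot block the others.

# Sketch: a minimal analyzer-platform shape with pluggable, isolated analyzers.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Finding:
    path: str
    line: int
    message: str
    category: str          # e.g. the analyzer's name, useful for per-analyzer feedback

Analyzer = Callable[[str, List[str]], List[Finding]]

def todo_analyzer(path: str, lines: List[str]) -> List[Finding]:
    # Flags TODO comments that lack an owner in parentheses (an invented example check).
    return [Finding(path, i + 1, "TODO without owner", "todo-check")
            for i, line in enumerate(lines) if "TODO" in line and "(" not in line]

def run_analyzers(changed_files: dict, analyzers: List[Analyzer]) -> List[Finding]:
    findings = []
    for path, lines in changed_files.items():
        for analyze in analyzers:          # analyzers run independently of each other
            try:
                findings.extend(analyze(path, lines))
            except Exception:
                continue                   # a failing analyzer does not block the rest
    return findings

print(run_analyzers({"main.py": ["x = 1  # TODO fix this"]}, [todo_analyzer]))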


Journal ArticleDOI
TL;DR: An automated workflow for the generation of BIM data from 3D point clouds is presented and quality indicators for reconstructed geometry elements and a framework in which to assess the quality of the reconstructed geometry against a reference are presented.
Abstract: The need for better 3D documentation of the built environment has come to the fore in recent years, led primarily by city modelling at the large scale and Building Information Modelling (BIM) at the smaller scale. Automation is seen as desirable as it removes the time-consuming and therefore costly amount of human intervention in the process of model generation. BIM is the focus of this paper: not only is there a commercial need, as shown by the number of commercial solutions, but also wide research interest in automated 3D models from both the Geomatics and Computer Science communities. The aim is to go beyond the current labour-intensive tracing of the point cloud to an automated process that produces geometry that is both open and more verifiable. This work investigates what can be achieved today with automation through both a literature review and the proposal of a novel point cloud processing process. We present an automated workflow for the generation of BIM data from 3D point clouds. We also present quality indicators for reconstructed geometry elements and a framework in which to assess the quality of the reconstructed geometry against a reference.

175 citations
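One concrete ingredient of such a pipeline, and of the quality indicators mentioned in this entry, is measuring how well measured points fit a reconstructed planar element. The following generic sketch (not the authors' method) fits a plane to a point patch by SVD and reports the RMS point-to-plane residual as a simple quality score; the synthetic wall patch and noise level are invented for illustration.

# Sketch: fit a plane to a 3D point patch and compute the RMS point-to-plane residual,
# a simple quality indicator for a reconstructed planar element (e.g. a wall surface).
import numpy as np

def plane_fit_quality(points: np.ndarray):
    """points: (N, 3) array. Returns (unit normal, centroid, RMS residual in the points' units)."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                                   # direction of least variance
    residuals = (points - centroid) @ normal          # signed point-to-plane distances
    return normal, centroid, float(np.sqrt((residuals ** 2).mean()))

# Synthetic wall patch at y = 2 m with 2 mm of noise (coordinates in metres):
rng = np.random.default_rng(1)
pts = np.column_stack([rng.uniform(0, 5, 500),
                       np.full(500, 2.0) + rng.normal(0, 0.002, 500),
                       rng.uniform(0, 3, 500)])
print(plane_fit_quality(pts)[2])   # RMS residual, expected to be around 0.002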


Journal ArticleDOI
TL;DR: This study provides the largest high-confidence plasma proteome dataset available to date and provides clear demonstration of the value of using isobaric mass tag reagents in plasma-based biomarker discovery experiments.

Journal ArticleDOI
TL;DR: The overall workflow architecture of CERMINE is outlined and details about the implementation of individual steps are provided; the evaluation of the extraction workflow, carried out with the use of a large dataset, showed good performance for most metadata types.
Abstract: CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in a born-digital form. The system is based on a modular workflow, whose loosely coupled architecture allows for individual component evaluation and adjustment, enables effortless improvements and replacements of independent parts of the algorithm, and facilitates future architecture expansion. The implementations of most steps are based on supervised and unsupervised machine learning techniques, which simplifies the procedure of adapting the system to new document layouts and styles. The evaluation of the extraction workflow, carried out with the use of a large dataset, showed good performance for most metadata types, with an average F score of 77.5%. The CERMINE system is available under an open-source licence and can be accessed at http://cermine.ceon.pl. In this paper, we outline the overall workflow architecture and provide details about the implementation of individual steps. We also thoroughly compare CERMINE to similar solutions, describe the evaluation methodology and finally report its results.
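The "modular workflow with loosely coupled components" idea can be pictured as a chain of independently replaceable steps. The sketch below uses hypothetical step names and data shapes; CERMINE itself is a Java system with its own component interfaces.

# Sketch: a loosely coupled extraction pipeline in which each step can be evaluated
# or swapped independently of the others.
from typing import Callable, List

Step = Callable[[dict], dict]

def character_extraction(doc: dict) -> dict:
    doc["chars"] = list(doc["raw_text"]); return doc

def page_segmentation(doc: dict) -> dict:
    doc["zones"] = doc["raw_text"].split("\n\n"); return doc

def metadata_extraction(doc: dict) -> dict:
    doc["metadata"] = {"title": doc["zones"][0] if doc["zones"] else ""}; return doc

def run_pipeline(doc: dict, steps: List[Step]) -> dict:
    for step in steps:               # replacing one step does not require touching the others
        doc = step(doc)
    return doc

doc = {"raw_text": "A Title Line\n\nAbstract text follows here."}
print(run_pipeline(doc, [character_extraction, page_segmentation, metadata_extraction])["metadata"])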

Journal ArticleDOI
TL;DR: A survey of the requirements, solutions, and challenges in the area of information abstraction is provided, along with an efficient workflow to extract meaningful information from raw sensor data based on the current state of the art in this area.
Abstract: The term Internet of Things (IoT) refers to the interaction and communication between billions of devices that produce and exchange data related to real-world objects (i.e. things). Extracting higher-level information from the raw sensory data captured by the devices and representing this data as machine-interpretable or human-understandable information has several interesting applications. Turning raw data into higher-level information representations demands mechanisms to find, extract, and characterize meaningful abstractions from the raw data. These meaningful abstractions then have to be presented in a human- and/or machine-understandable representation. However, the heterogeneity of the data originating from different sensor devices and application scenarios, such as e-health, environmental monitoring, and smart home applications, and the dynamic nature of sensor data make it difficult to apply only one particular information processing technique to the underlying data. A considerable number of methods from machine learning, the semantic web, and pattern and data mining have been used to abstract from sensor observations to information representations. This paper provides a survey of the requirements, solutions, and challenges in the area of information abstraction and presents an efficient workflow to extract meaningful information from raw sensor data based on the current state of the art in this area. It also identifies research directions at the edge of information abstraction for sensor data. To ease the understanding of the abstraction workflow process, we introduce a software toolkit that implements the introduced techniques and encourages applying them to various data sets.
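A minimal version of the raw-data-to-abstraction workflow this survey describes might look like the following sketch: window a stream of sensor readings, extract simple features, and map them to a human-readable label. The window size, thresholds, and labels are invented for illustration and do not come from the paper.

# Sketch: raw sensor readings -> sliding-window features -> symbolic abstraction.
from statistics import mean, pstdev

def windows(readings, size=5):
    for i in range(0, len(readings) - size + 1, size):
        yield readings[i:i + size]

def abstract(readings):
    labels = []
    for w in windows(readings):
        feats = {"mean": mean(w), "std": pstdev(w)}
        if feats["mean"] > 30.0:
            labels.append(("hot", feats))
        elif feats["std"] > 2.0:
            labels.append(("unstable", feats))
        else:
            labels.append(("normal", feats))
    return labels

temperature_stream = [21.0, 21.2, 20.9, 21.1, 21.0, 33.5, 34.0, 33.8, 34.2, 33.9]
print([label for label, _ in abstract(temperature_stream)])   # ['normal', 'hot']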

Proceedings ArticleDOI
01 Jul 2015
TL;DR: An approach for generating deep comprehension questions from novel text that bypasses the myriad challenges of creating a full semantic representation by decomposing the task into an ontology-crowd-relevance workflow, consisting of first representing the original text in a low-dimensional ontology, then crowdsourcing candidate question templates aligned with that space, and finally ranking potentially relevant templates for a novel region of text.
Abstract: We develop an approach for generating deep (i.e., high-level) comprehension questions from novel text that bypasses the myriad challenges of creating a full semantic representation. We do this by decomposing the task into an ontology-crowd-relevance workflow, consisting of first representing the original text in a low-dimensional ontology, then crowdsourcing candidate question templates aligned with that space, and finally ranking potentially relevant templates for a novel region of text. If ontological labels are not available, we infer them from the text. We demonstrate the effectiveness of this method on a corpus of articles from Wikipedia alongside human judgments, and find that we can generate relevant deep questions with a precision of over 85% while maintaining a recall of 70%.
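The last stage of the ontology-crowd-relevance workflow, ranking candidate templates for a new region of text, can be sketched generically as scoring each crowd-sourced template by how well its ontological slot matches the labels inferred for that text. The categories, templates, and scoring rule below are invented placeholders, not the paper's model.

# Sketch: rank crowd-sourced question templates by match to a text's inferred ontology labels.
templates = [
    {"text": "What were the long-term consequences of {event}?", "category": "history"},
    {"text": "How does {concept} differ from related approaches?", "category": "science"},
    {"text": "Why was {person} influential in their field?", "category": "biography"},
]

def rank_templates(inferred_labels: dict, templates: list) -> list:
    """inferred_labels: category -> confidence for the novel text region."""
    scored = [(inferred_labels.get(t["category"], 0.0), t["text"]) for t in templates]
    return [text for score, text in sorted(scored, reverse=True) if score > 0]

print(rank_templates({"science": 0.8, "history": 0.3}, templates))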

Journal ArticleDOI
TL;DR: A novel approach to the preservation of scientific workflows through the application of research objects - aggregations of data and metadata that enrich the workflow specifications and support the creation of workflow-centric research objects.

Journal ArticleDOI
TL;DR: The effectiveness of cloud-based BIM for real-time delivery of information to support progress monitoring and management of the construction of a reinforced concrete (RC) structure is examined using action based research.

Journal ArticleDOI
TL;DR: The use of EHR technology has a major impact on ICU physician work and workflow (e.g., increased time spent on clinical review and documentation), which raises the question of how these changes in physician work affect the quality of care provided.

Proceedings ArticleDOI
15 Jun 2015
TL;DR: This paper considers how to best integrate container technology into an existing workflow system, using Makeflow, Work Queue, and Docker as examples of current technology.
Abstract: Workflows are a widely used abstraction for representing large scientific applications and executing them on distributed systems such as clusters, clouds, and grids. However, workflow systems have been largely silent on the question of precisely what environment each task in the workflow is expected to run in. As a result, a workflow may run correctly in the environment in which it was designed, but when moved to another machine, is highly likely to fail due to differences in the operating system, installed applications, available data, and so forth. Lightweight container technology has recently arisen as a potential solution to this problem by providing well-defined execution environments at the operating system level. In this paper, we consider how to best integrate container technology into an existing workflow system, using Makeflow, Work Queue, and Docker as examples of current technology. A brief performance study of Docker shows very little overhead in CPU and I/O performance, but significant costs in creating and deleting containers. Taking this into account, we describe four different methods of connecting containers to different points of the infrastructure, and explain several methods of managing the container images that must be distributed to executing tasks. We explore the performance of a large bioinformatics workload on a Docker-enabled cluster, and observe the best configuration to be locally-managed containers that are shared between multiple tasks.
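One of the integration methods the paper compares is wrapping each task's command so that it executes inside a container. The generic, hypothetical wrapper below (not Makeflow's actual syntax, and with an invented image name) shows the idea, and also makes visible why per-task container creation adds the start-up and tear-down cost the study measures, motivating the shared-container configuration the authors found fastest.

# Sketch: wrap a workflow task command so it runs inside a Docker container.
import os
import shlex
import subprocess

IMAGE = "biotools:latest"   # hypothetical image containing the task's applications

def containerized(command: str, workdir: str = ".") -> list:
    return ["docker", "run", "--rm",
            "-v", f"{os.path.abspath(workdir)}:/data", "-w", "/data",
            IMAGE, "sh", "-c", command]

def run_task(command: str) -> int:
    # Each call starts (and then removes) a fresh container, which is where the overhead lives.
    return subprocess.call(containerized(command))

# Show the wrapped command for an example bioinformatics task without executing it:
print(" ".join(shlex.quote(c) for c in containerized("bwa mem ref.fa reads.fq > out.sam")))
# run_task("bwa mem ref.fa reads.fq > out.sam")   # would execute the task inside the container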

Posted ContentDOI
06 Nov 2015-bioRxiv
TL;DR: What’s in my Pot?
Abstract: Whole genome sequencing on next-generation instruments provides an unbiased way to identify the organisms present in complex metagenomic samples. However, the time-to-result can be protracted because of fixed-time sequencing runs and cumbersome bioinformatics workflows. This limits the utility of the approach in settings where rapid species identification is crucial, such as in the quality control of food-chain components or during an outbreak of an infectious disease. Here we present What's in my Pot? (WIMP), a laboratory and analysis workflow in which, starting with an unprocessed sample, sequence data is generated and the bacteria, viruses and fungi present in the sample are classified to subspecies and strain level in a quantitative manner, without prior knowledge of the sample composition, in approximately 3.5 hours. This workflow relies on the combination of Oxford Nanopore Technologies' MinION™ sensing device with a real-time species identification bioinformatics application.

Journal ArticleDOI
TL;DR: The implications of this emerging monoculture: its advantages and disadvantages for physicians and hospitals and its role in innovation, professional autonomy, implementation difficulties, workflow, flexibility, cost, data standards, interoperability, and interactions with other information technology systems are examined.

Journal ArticleDOI
TL;DR: In this paper, a holistic approach that strategically plans for BIM execution in green building projects is proposed, where the authors find that green BIM can improve project outcomes, and facilitate the accomplishment of established sustainability goals.
Abstract: Companies that embrace both building information modeling (BIM) and green building are making conscientious efforts to pursue the synergies between the two, namely the green BIM practice. When done right, project teams find that green BIM can improve project outcomes and facilitate the accomplishment of established sustainability goals. Nevertheless, green BIM remains an emerging trend for the majority of the industry, and its full potential is yet to be explored according to recent market research reports. The hypothesis of this research is that for companies to succeed in practicing green BIM, a holistic approach that strategically plans for BIM execution in green building projects is needed. The BIM project execution planning guidelines (PEPG) are widely adopted today to offer general guidance and a standardized workflow for strategic BIM implementation. The PEPG is not meant to address green building projects in particular and lacks the specificity to do so. However, the popularity of PEPG is conceived as a...


Journal ArticleDOI
26 Oct 2015-PLOS ONE
TL;DR: The Genomics Virtual Laboratory is designed and implemented as a middleware layer of machine images, cloud management tools, and online services that enable researchers to build arbitrarily sized compute clusters on demand, pre-populated with fully configured bioinformatics tools, reference datasets and workflow and visualisation options.
Abstract: Background: Analyzing high throughput genomics data is a complex and compute intensive task, generally requiring numerous software tools and large reference data sets, tied together in successive stages of data transformation and visualisation. A computational platform enabling best practice genomics analysis ideally meets a number of requirements, including: a wide range of analysis and visualisation tools, closely linked to large user and reference data sets; workflow platform(s) enabling accessible, reproducible, portable analyses, through a flexible set of interfaces; highly available, scalable computational resources; and flexibility and versatility in the use of these resources to meet demands and expertise of a variety of users. Access to an appropriate computational platform can be a significant barrier to researchers, as establishing such a platform requires a large upfront investment in hardware, experience, and expertise. Results: We designed and implemented the Genomics Virtual Laboratory (GVL) as a middleware layer of machine images, cloud management tools, and online services that enable researchers to build arbitrarily sized compute clusters on demand, pre-populated with fully configured bioinformatics tools, reference datasets and workflow and visualisation options. The platform is flexible in that users can conduct analyses through web-based (Galaxy, RStudio, IPython Notebook) or command-line interfaces, and add/remove compute nodes and data resources as required. Best-practice tutorials and protocols provide a path from introductory training to practice. The GVL is available on the OpenStack-based Australian Research Cloud (http://nectar.org.au) and the Amazon Web Services cloud. The principles, implementation and build process are designed to be cloud-agnostic. Conclusions: This paper provides a blueprint for the design and implementation of a cloud-based Genomics Virtual Laboratory. We discuss scope, design considerations and technical and logistical constraints, and explore the value added to the research community through the suite of services and resources provided by our implementation.

Journal ArticleDOI
TL;DR: The fully automated collection and merging of partial data sets from a series of cryocooled crystals of biological macromolecules contained on the same support is presented, as are the results of test experiments carried out on various systems.
Abstract: Here, an automated procedure is described to identify the positions of many cryocooled crystals mounted on the same sample holder, to rapidly predict and rank their relative diffraction strengths and to collect partial X-ray diffraction data sets from as many of the crystals as desired. Subsequent hierarchical cluster analysis then allows the best combination of partial data sets, optimizing the quality of the final data set obtained. The results of applying the method developed to various systems and scenarios, including the compilation of a complete data set from tiny crystals of the membrane protein bacteriorhodopsin and the collection of data sets for successful structure determination using the single-wavelength anomalous dispersion technique, are also presented.
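The hierarchical cluster analysis step mentioned above can be sketched generically with SciPy: given a matrix of pairwise dissimilarities between partial data sets, clusters of mutually compatible data sets are identified, and the best-populated cluster would then be merged. The dissimilarity values below are random placeholders; in practice they would come from pairwise comparisons of the partial data sets (for example, correlation of intensities or unit-cell differences), and the 0.6 cut is an arbitrary choice for the example.

# Sketch: hierarchical clustering of partial data sets from many crystals.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n = 8                                           # number of partial data sets
d = rng.uniform(0.1, 1.0, size=(n, n))
dis = (d + d.T) / 2                             # symmetric dissimilarity matrix
np.fill_diagonal(dis, 0.0)

Z = linkage(squareform(dis), method="average")  # agglomerative clustering on condensed distances
labels = fcluster(Z, t=0.6, criterion="distance")
print(labels)                                   # data sets sharing a label would be merged together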

Patent
22 Sep 2015
TL;DR: A workflow management system for managing the storage, retrieval, and transport of items in a warehouse includes a voice-directed mobile terminal and a server computer in communication with the mobile terminal.
Abstract: A workflow management system for managing the storage, retrieval, and transport of items in a warehouse includes a voice-directed mobile terminal. The system also includes a server computer in communication with the voice-directed mobile terminal. The server computer includes a tasking module for transmitting task data to the voice-directed mobile terminal. The server computer also includes a workflow-analysis module for generating, based at least in part upon an analysis of workflow dialog between the voice-directed mobile terminal and the user, performance data relating to the performance of tasks associated with the storage, retrieval, and/or transport of the items.

Journal ArticleDOI
TL;DR: This article uses a custom assembly workflow to optimize consensus genome map assembly, resulting in an assembly equal to the estimated length of the Tribolium castaneum genome and with an N50 of more than 1 Mb.
Abstract: Genome assembly remains an unsolved problem. Assembly projects face a range of hurdles that confound assembly. Thus a variety of tools and approaches are needed to improve draft genomes. We used a custom assembly workflow to optimize consensus genome map assembly, resulting in an assembly equal to the estimated length of the Tribolium castaneum genome and with an N50 of more than 1 Mb. We used this map for super scaffolding the T. castaneum sequence assembly, more than tripling its N50 with the program Stitch. In this article we present software that leverages consensus genome maps assembled from extremely long single molecule maps to increase the contiguity of sequence assemblies. We report the results of applying these tools to validate and improve a 7x Sanger draft of the T. castaneum genome.
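Since this entry reports contiguity as N50, here is the standard calculation as a small generic helper (not code from the paper): N50 is the length such that contigs of that length or longer cover at least half of the total assembly. The contig lengths in the example are invented.

# Sketch: compute the N50 of an assembly from its contig or scaffold lengths.
def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):   # longest first
        running += length
        if running * 2 >= total:                   # reached half of the assembly length
            return length

print(n50([1_000_000, 600_000, 400_000, 150_000, 50_000]))  # 600000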

Proceedings ArticleDOI
17 Apr 2015
TL;DR: Musketeer is built, a workflow manager which can dynamically map front-end workflow descriptions to a broad range of back-end execution engines and speeds up realistic workflows by up to 9x by targeting different execution engines, without requiring any manual effort.
Abstract: Many systems for the parallel processing of big data are available today. Yet, few users can tell by intuition which system, or combination of systems, is "best" for a given workflow. Porting workflows between systems is tedious. Hence, users become "locked in", despite faster or more efficient systems being available. This is a direct consequence of the tight coupling between user-facing front-ends that express workflows (e.g., Hive, SparkSQL, Lindi, GraphLINQ) and the back-end execution engines that run them (e.g., MapReduce, Spark, PowerGraph, Naiad). We argue that the ways that workflows are defined should be decoupled from the manner in which they are executed. To explore this idea, we have built Musketeer, a workflow manager which can dynamically map front-end workflow descriptions to a broad range of back-end execution engines. Our prototype maps workflows expressed in four high-level query languages to seven different popular data processing systems. Musketeer speeds up realistic workflows by up to 9x by targeting different execution engines, without requiring any manual effort. Its automatically generated back-end code comes within 5%--30% of the performance of hand-optimized implementations.
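The decoupling idea in this entry, one front-end workflow description translated by several interchangeable back-end code generators, can be illustrated with a toy sketch. The operator names, back-end labels, and generated strings are invented for illustration; Musketeer's real intermediate representation and generators are far more elaborate.

# Sketch: a toy front-end/back-end split. A workflow is described once as a list of
# relational-style operators, then translated by whichever back-end generator is chosen.
workflow = [("load", "clicks.csv"), ("filter", "country == 'DE'"), ("count", "user_id")]

def to_sql(ops):
    table = ops[0][1].split(".")[0]
    where = next(arg for op, arg in ops if op == "filter")
    key = next(arg for op, arg in ops if op == "count")
    return f"SELECT {key}, COUNT(*) FROM {table} WHERE {where} GROUP BY {key};"

def to_spark(ops):
    table = ops[0][1]
    where = next(arg for op, arg in ops if op == "filter")
    key = next(arg for op, arg in ops if op == "count")
    return f'spark.read.csv("{table}").filter("{where}").groupBy("{key}").count()'

BACKENDS = {"sql_engine": to_sql, "spark": to_spark}

for name, generate in BACKENDS.items():      # same workflow, different execution engines
    print(name, "=>", generate(workflow))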

Journal ArticleDOI
TL;DR: This paper introduces the immovable dataset concept, which constrains the movement of certain datasets due to security and cost considerations, and proposes a new scheduling model in the context of Cloud systems that yields an economical distribution of tasks among the available CSPs (Cloud Service Providers) in the market.

Patent
22 Dec 2015
TL;DR: A method is described for obtaining a primary workflow having a list of activities being performed by a worker, obtaining a surprise activity, comparing a context of the worker to contextual needs of the surprise activity, and interleaving the surprise activity in the list of activities based on a best-fit context.
Abstract: A method includes obtaining primary workflow having a list of activities being performed by a worker; obtaining a surprise activity; comparing a context of the worker to contextual needs of the surprise activity; and interleaving the surprise activity in the list of activities based on a best fit context of the worker.

Journal ArticleDOI
TL;DR: Taxonomies of the cloud workflow scheduling problem and its techniques are proposed based on an analytical review, identifying the aspects and classifications unique to workflow scheduling in the cloud environment in three categories, namely scheduling process, task, and resource.