
Showing papers on "Workflow" published in 2014


Journal ArticleDOI
02 Apr 2014
TL;DR: An algorithm based on the meta-heuristic optimization technique, particle swarm optimization (PSO), which aims to minimize the overall workflow execution cost while meeting deadline constraints is presented.
Abstract: Cloud computing is the latest distributed computing paradigm and it offers tremendous opportunities to solve large-scale scientific problems. However, it presents various challenges that need to be addressed in order to be efficiently utilized for workflow applications. Although the workflow scheduling problem has been widely studied, there are very few initiatives tailored for cloud environments. Furthermore, the existing works fail to either meet the user's quality of service (QoS) requirements or to incorporate some basic principles of cloud computing such as the elasticity and heterogeneity of the computing resources. This paper proposes a resource provisioning and scheduling strategy for scientific workflows on Infrastructure as a Service (IaaS) clouds. We present an algorithm based on the meta-heuristic optimization technique, particle swarm optimization (PSO), which aims to minimize the overall workflow execution cost while meeting deadline constraints. Our heuristic is evaluated using CloudSim and various well-known scientific workflows of different sizes. The results show that our approach performs better than the current state-of-the-art algorithms.
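To make the scheduling idea concrete, the sketch below shows a toy particle swarm optimizer that maps workflow tasks to VM types and scores each mapping by cost plus a penalty when a (crudely estimated) makespan exceeds the deadline. The task lengths, VM speeds, prices, and the serial makespan model are all invented for illustration and do not reflect the paper's actual encoding or evaluation.

```python
# Illustrative sketch only: a toy PSO that maps workflow tasks to VM types,
# minimizing cost with a penalty when the estimated makespan exceeds the
# deadline. All numbers and the serial makespan model are assumptions.
import random

TASK_LENGTH = [10, 20, 15, 30, 25]   # abstract work units per task (assumed)
VM_SPEED    = [1.0, 2.0, 4.0]        # relative speed of each VM type (assumed)
VM_PRICE    = [0.1, 0.25, 0.6]       # cost per time unit (assumed)
DEADLINE    = 60.0

def fitness(position):
    # Decode continuous particle positions into VM-type indices.
    mapping = [int(abs(p)) % len(VM_SPEED) for p in position]
    runtime = [TASK_LENGTH[i] / VM_SPEED[m] for i, m in enumerate(mapping)]
    makespan = sum(runtime)          # naive serial model, for illustration only
    cost = sum(r * VM_PRICE[m] for r, m in zip(runtime, mapping))
    penalty = 1000.0 if makespan > DEADLINE else 0.0
    return cost + penalty

def pso(n_particles=20, iters=100, dim=len(TASK_LENGTH)):
    pos = [[random.uniform(0, 3) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=fitness)
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if fitness(pos[i]) < fitness(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = min(pbest + [gbest], key=fitness)
    return gbest, fitness(gbest)

if __name__ == "__main__":
    best, value = pso()
    print("best cost (with deadline penalty):", round(value, 3))
```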

601 citations


Journal ArticleDOI
TL;DR: In this paper, the authors explore common reasons that code developed for one research project cannot be successfully executed or extended by subsequent researchers, and examine how the popular emerging technology Docker combines several areas from systems research - such as operating system virtualization, cross-platform portability, modular re-usable elements, versioning, and a ''DevOps'' philosophy, to address these challenges.
Abstract: As computational work becomes more and more integral to many aspects of scientific research, computational reproducibility has become an issue of increasing importance to computer systems researchers and domain scientists alike. Though computational reproducibility seems more straightforward than replicating physical experiments, the complex and rapidly changing nature of computer environments makes being able to reproduce and extend such work a serious challenge. In this paper, I explore common reasons that code developed for one research project cannot be successfully executed or extended by subsequent researchers. I review current approaches to these issues, including virtual machines and workflow systems, and their limitations. I then examine how the popular emerging technology Docker combines several areas from systems research, such as operating system virtualization, cross-platform portability, modular re-usable elements, versioning, and a 'DevOps' philosophy, to address these challenges. I illustrate this with several examples of Docker use with a focus on the R statistical environment.

458 citations


Journal ArticleDOI
TL;DR: The BIT model provides a step towards formalizing the translation of developer aims into intervention components, larger treatments, and methods of delivery in a manner that supports research and communication between investigators on how to design, develop, and deploy BITs.
Abstract: A growing number of investigators have commented on the lack of models to inform the design of behavioral intervention technologies (BITs). BITs, which include a subset of mHealth and eHealth interventions, employ a broad range of technologies, such as mobile phones, the Web, and sensors, to support users in changing behaviors and cognitions related to health, mental health, and wellness. We propose a model that conceptually defines BITs, from the clinical aim to the technological delivery framework. The BIT model defines both the conceptual and technological architecture of a BIT. Conceptually, a BIT model should answer the questions why, what, how (conceptual and technical), and when. While BITs generally have a larger treatment goal, such goals generally consist of smaller intervention aims (the "why") such as promotion or reduction of specific behaviors, and behavior change strategies (the conceptual "how"), such as education, goal setting, and monitoring. Behavior change strategies are instantiated with specific intervention components or “elements” (the "what"). The characteristics of intervention elements may be further defined or modified (the technical "how") to meet the needs, capabilities, and preferences of a user. Finally, many BITs require specification of a workflow that defines when an intervention component will be delivered. The BIT model includes a technological framework (BIT-Tech) that can integrate and implement the intervention elements, characteristics, and workflow to deliver the entire BIT to users over time. This implementation may be either predefined or include adaptive systems that can tailor the intervention based on data from the user and the user’s environment. The BIT model provides a step towards formalizing the translation of developer aims into intervention components, larger treatments, and methods of delivery in a manner that supports research and communication between investigators on how to design, develop, and deploy BITs.
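As a rough illustration of the model's why/what/how/when layers, the hypothetical Python sketch below encodes a BIT as nested data structures; the class and field names are my own paraphrase of the abstract, not an artifact of the paper.

```python
# Hypothetical sketch of the BIT model's layers as a data structure;
# names and example values are assumptions drawn from the abstract.
from dataclasses import dataclass, field
from typing import List

@dataclass
class InterventionElement:
    name: str              # the "what": a concrete component, e.g. a reminder
    characteristics: dict  # the technical "how": tailoring of the element

@dataclass
class BehaviorChangeStrategy:
    name: str              # the conceptual "how": e.g. education, goal setting
    elements: List[InterventionElement] = field(default_factory=list)

@dataclass
class BehavioralInterventionTechnology:
    clinical_aim: str                         # the "why"
    strategies: List[BehaviorChangeStrategy]  # conceptual architecture
    workflow: List[str]                       # the "when": delivery rules

bit = BehavioralInterventionTechnology(
    clinical_aim="reduce sedentary behavior",
    strategies=[BehaviorChangeStrategy(
        name="monitoring",
        elements=[InterventionElement("step-count prompt", {"channel": "mobile"})])],
    workflow=["send prompt daily at 09:00", "escalate if no activity for 3 days"],
)
print(bit.clinical_aim, "->", [s.name for s in bit.strategies])
```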

346 citations


Journal ArticleDOI
TL;DR: A detailed survey of the state of the art of information systems designed to support or automate individual tasks in the systematic review, and in particular systematic reviews of randomized controlled clinical trials, reveals trends that see the convergence of several parallel research projects.
Abstract: Systematic reviews, a cornerstone of evidence-based medicine, are not produced quickly enough to support clinical practice. The cost of production, availability of the requisite expertise and timeliness are often quoted as major contributors to the delay. This detailed survey of the state of the art of information systems designed to support or automate individual tasks in the systematic review, and in particular systematic reviews of randomized controlled clinical trials, reveals trends that see the convergence of several parallel research projects. We surveyed literature describing informatics systems that support or automate the processes of systematic review or each of the tasks of the systematic review. Several projects focus on automating, simplifying and/or streamlining specific tasks of the systematic review. Some tasks are already fully automated while others are still largely manual. In this review, we describe each task and the effect that its automation would have on the entire systematic review process, summarize the existing information system support for each task, and highlight where further research is needed to realize automation of the task. Integration of the systems that automate systematic review tasks may lead to a revised systematic review workflow. We envisage the optimized workflow will lead to a system in which each systematic review is described as a computer program that automatically retrieves relevant trials, appraises them, extracts and synthesizes data, evaluates the risk of bias, performs meta-analysis calculations, and produces a report in real time.
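The envisaged "systematic review as a computer program" can be pictured as a chain of automated stages. The sketch below is a hypothetical, heavily simplified pipeline whose final stage performs a standard fixed-effect inverse-variance meta-analysis; the study effect sizes and variances are invented.

```python
# Hedged sketch of the envisaged review-as-a-program pipeline. Stage names
# mirror the tasks listed in the abstract; data and stubs are illustrative only.
import math

def retrieve_trials():
    # stand-in for automated retrieval; returns (effect estimate, variance) pairs
    return [(0.42, 0.04), (0.35, 0.09), (0.50, 0.02)]

def appraise(trials):
    # stand-in for appraisal / risk-of-bias filtering (keeps everything here)
    return [t for t in trials if t[1] > 0]

def meta_analysis(trials):
    # fixed-effect inverse-variance pooling: w_i = 1 / var_i
    weights = [1.0 / var for _, var in trials]
    pooled = sum(w * y for w, (y, _) in zip(weights, trials)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)

def report(pooled, ci):
    print(f"pooled effect {pooled:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")

if __name__ == "__main__":
    trials = appraise(retrieve_trials())
    report(*meta_analysis(trials))
```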

298 citations



Journal ArticleDOI
TL;DR: The developed MAP-RSeq workflow is a comprehensive computational workflow that can be used for obtaining genomic features from transcriptomic sequencing data, for any genome, and has thus far enabled clinicians and researchers to understand the transcriptomic landscape of diseases for better diagnosis and treatment of patients.
Abstract: Although the costs of next generation sequencing technology have decreased over the past years, there is still a lack of simple-to-use applications for a comprehensive analysis of RNA sequencing data; there is no one-stop shop for transcriptomic genomics. We have developed MAP-RSeq, a comprehensive computational workflow that can be used for obtaining genomic features from transcriptomic sequencing data, for any genome. For optimization of tools and parameters, MAP-RSeq was validated using both simulated and real datasets. The MAP-RSeq workflow consists of six major modules: alignment of reads, quality assessment of reads, gene expression assessment and exon read counting, identification of expressed single nucleotide variants (SNVs), detection of fusion transcripts, and summarization of transcriptomic data into a final report. This workflow is available for human transcriptome analysis and can be easily adapted and used for other genomes. Several clinical and research projects at the Mayo Clinic have applied the MAP-RSeq workflow for RNA-Seq studies. The results from MAP-RSeq have thus far enabled clinicians and researchers to understand the transcriptomic landscape of diseases for better diagnosis and treatment of patients. Our software provides gene counts, exon counts, fusion candidates, expressed single nucleotide variants, mapping statistics, visualizations, and a detailed research data report for RNA-Seq. The workflow can be executed on a standalone virtual machine or on a parallel Sun Grid Engine cluster. The software can be downloaded from http://bioinformaticstools.mayo.edu/research/maprseq/ .
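The module layout described above can be pictured as a simple staged pipeline. The following sketch is purely illustrative: the stage functions are placeholders with made-up outputs and do not call MAP-RSeq or the aligners and counters it actually wraps.

```python
# Hypothetical orchestration sketch of the module layout described in the
# abstract (align -> QC -> counts -> SNVs -> fusions -> report).
def align_reads(fastq):        return {"bam": fastq + ".bam"}
def quality_check(aln):        return {"qc_ok": True, **aln}
def count_expression(aln):     return {"gene_counts": {"GENE_A": 120}, **aln}
def call_snvs(aln):            return {"snvs": ["placeholder_variant"], **aln}
def detect_fusions(aln):       return {"fusions": [], **aln}

def map_rseq_like_pipeline(fastq):
    result = align_reads(fastq)
    for stage in (quality_check, count_expression, call_snvs, detect_fusions):
        result = stage(result)
    return result   # a summarization/report stage would render this dict

print(map_rseq_like_pipeline("sample_R1.fastq"))
```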

277 citations


Proceedings ArticleDOI
18 Jun 2014
TL;DR: Corleone is described, a HOC solution for EM, which uses the crowd in all major steps of the EM process, and the implications of this work to executing crowdsourced RDBMS joins, cleaning learning models, and soliciting complex information types from crowd workers.
Abstract: Recent approaches to crowdsourcing entity matching (EM) are limited in that they crowdsource only parts of the EM workflow, requiring a developer to execute the remaining parts. Consequently, these approaches do not scale to the growing EM need at enterprises and crowdsourcing startups, and cannot handle scenarios where ordinary users (i.e., the masses) want to leverage crowdsourcing to match entities. In response, we propose the notion of hands-off crowdsourcing (HOC), which crowdsources the entire workflow of a task, thus requiring no developers. We show how HOC can represent a next logical direction for crowdsourcing research, scale up EM at enterprises and crowdsourcing startups, and open up crowdsourcing for the masses. We describe Corleone, a HOC solution for EM, which uses the crowd in all major steps of the EM process. Finally, we discuss the implications of our work for executing crowdsourced RDBMS joins, cleaning learning models, and soliciting complex information types from crowd workers.

251 citations


Journal ArticleDOI
TL;DR: Simulation experiments show that the proposed algorithm increases the likelihood of deadlines being met and reduces the total execution time of applications as the budget available for replication increases.
Abstract: The elasticity of Cloud infrastructures makes them a suitable platform for execution of deadline-constrained workflow applications, because resources available to the application can be dynamically increased to enable application speedup. Existing research in the execution of scientific workflows in Clouds either tries to minimize the workflow execution time, ignoring deadlines and budgets, or focuses on minimizing cost while trying to meet the application deadline. However, these works implement limited contingency strategies to correct delays caused by underestimation of task execution times or fluctuations in the delivered performance of leased public Cloud resources. To mitigate the effects of performance variation of resources on the soft deadlines of workflow applications, we propose an algorithm that uses idle time of provisioned resources and budget surplus to replicate tasks. Simulation experiments with four well-known scientific workflows show that the proposed algorithm increases the likelihood of deadlines being met and reduces the total execution time of applications as the budget available for replication increases.
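A toy version of the replication idea is sketched below: spend idle slots on already-provisioned resources and any budget surplus on duplicating the tasks whose runtime estimates are most uncertain. The greedy rule, cost model, and data are assumptions for illustration, not the paper's algorithm.

```python
# Toy sketch (not the paper's algorithm): use idle slots and budget surplus
# to replicate the most uncertain tasks and protect a soft deadline.
def plan_replicas(tasks, idle_slots, budget_surplus, replica_cost):
    """tasks: list of (task_id, runtime_uncertainty); most uncertain replicated first."""
    replicas = []
    for task_id, uncertainty in sorted(tasks, key=lambda t: -t[1]):
        if not idle_slots or budget_surplus < replica_cost:
            break
        slot = idle_slots.pop(0)           # reuse an already-provisioned idle slot
        budget_surplus -= replica_cost     # marginal cost of running the copy
        replicas.append((task_id, slot))
    return replicas, budget_surplus

tasks = [("t1", 0.9), ("t2", 0.2), ("t3", 0.6)]
print(plan_replicas(tasks, idle_slots=["vm1-gap", "vm2-gap"],
                    budget_surplus=1.0, replica_cost=0.4))
```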

221 citations


Journal ArticleDOI
Matt McCormick1, Xiaoxiao Liu1, Julien Jomier1, Charles Marion1, Luis Ibanez1 
TL;DR: The multiple tools, methodologies, and practices that the ITK community has adopted, refined, and followed during the past decade, in order to become one of the research communities with the most modern reproducibility verification infrastructure are described.
Abstract: Reproducibility verification is essential to the practice of the scientific method. Researchers report their findings, which are strengthened as other independent groups in the scientific community share similar outcomes. In the many scientific fields where software has become a fundamental tool for capturing and analyzing data, this requirement of reproducibility implies that reliable and comprehensive software platforms and tools should be made available to the scientific community. The tools will empower them and the public to verify, through practice, the reproducibility of observations that are reported in the scientific literature. Medical image analysis is one of the fields in which the use of computational resources, both software and hardware, is an essential platform for performing experimental work. In this arena, the introduction of the Insight Toolkit (ITK) in 1999 has transformed the field and facilitated its progress by accelerating the rate at which algorithmic implementations are developed, tested, disseminated and improved. By building on the efficiency and quality of open source methodologies, ITK has provided the medical image community with an effective platform on which to build a daily workflow that incorporates the true scientific practices of reproducibility verification. This article describes the multiple tools, methodologies, and practices that the ITK community has adopted, refined, and followed during the past decade, in order to become one of the research communities with the most modern reproducibility verification infrastructure. For example, 207 contributors have created over 2400 unit tests that provide over 84% code line test coverage. The Insight Journal, an open publication journal associated with the toolkit, has seen over 360,000 publication downloads. The median normalized closeness centrality, a measure of knowledge flow, resulting from the distributed peer code review system was high (0.46).

214 citations


Journal ArticleDOI
TL;DR: This work draws from social network theory, specifically advice networks, to understand a key post-implementation job outcome (i.e., job performance) of enterprise systems success and finds support for hypotheses that workflow advice and software advice are associated with job performance.
Abstract: The implementation of enterprise systems, such as enterprise resource planning (ERP) systems, alters business processes and associated workflows, and introduces new software applications that employees must use. Employees frequently find such technology-enabled organizational change to be a major challenge. Although many challenges related to such changes have been discussed in prior work, little research has focused on post-implementation job outcomes of employees affected by such change. We draw from social network theory, specifically advice networks, to understand a key post-implementation job outcome (i.e., job performance). We conducted a study among 87 employees, with data gathered before and after the implementation of an ERP system module in a business unit of a large organization. We found support for our hypotheses that workflow advice and software advice are associated with job performance. Further, as predicted, we found that the interactions of workflow and software get-advice, workflow and software give-advice, and software get- and give-advice were associated with job performance. This nuanced treatment of advice networks advances our understanding of post-implementation success of enterprise systems.

195 citations


Journal ArticleDOI
TL;DR: The results show that PathoScope 2.0 is a complete, highly sensitive, and efficient approach for metagenomic analysis that outperforms alternative approaches in scope, speed, and accuracy.
Abstract: Background: Recent innovations in sequencing technologies have provided researchers with the ability to rapidly characterize the microbial content of an environmental or clinical sample with unprecedented resolution. These approaches are producing a wealth of information that is providing novel insights into the microbial ecology of the environment and human health. However, these sequencing-based approaches produce large and complex datasets that require efficient and sensitive computational analysis workflows. Many recent tools for analyzing metagenomic-sequencing data have emerged; however, these approaches often suffer from issues of specificity and efficiency, and typically do not include a complete metagenomic analysis framework. Results: We present PathoScope 2.0, a complete bioinformatics framework for rapidly and accurately quantifying the proportions of reads from individual microbial strains present in metagenomic sequencing data from environmental or clinical samples. The pipeline performs all necessary computational analysis steps, including reference genome library extraction and indexing, read quality control and alignment, strain identification, and summarization and annotation of results. We rigorously evaluated PathoScope 2.0 using simulated data and data from the 2011 outbreak of Shiga-toxigenic Escherichia coli O104:H4. Conclusions: The results show that PathoScope 2.0 is a complete, highly sensitive, and efficient approach for metagenomic analysis that outperforms alternative approaches in scope, speed, and accuracy. The PathoScope 2.0 pipeline software is freely available for download at: http://sourceforge.net/projects/pathoscope/.
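To illustrate the general idea behind strain-proportion estimation from ambiguous alignments, the sketch below runs a simple EM-style loop over invented read-to-genome alignment scores. It is not PathoScope's actual statistical model, which is considerably richer.

```python
# Hedged illustration of read reassignment: an EM-style loop that turns
# ambiguous read-to-genome alignment scores into strain proportion estimates.
# Genome names and scores are invented; this is not PathoScope's model.
def estimate_proportions(read_scores, n_iter=50):
    """read_scores: list of dicts {genome: alignment_score}, one per read."""
    genomes = sorted({g for r in read_scores for g in r})
    pi = {g: 1.0 / len(genomes) for g in genomes}           # uniform start
    for _ in range(n_iter):
        counts = {g: 0.0 for g in genomes}
        for scores in read_scores:                           # E-step
            weights = {g: pi[g] * s for g, s in scores.items()}
            total = sum(weights.values())
            for g, w in weights.items():
                counts[g] += w / total
        n = len(read_scores)                                  # M-step
        pi = {g: counts[g] / n for g in genomes}
    return pi

reads = [{"strain_A": 0.9, "strain_B": 0.8},
         {"strain_A": 0.95},
         {"strain_B": 0.7, "strain_A": 0.6}]
print(estimate_proportions(reads))
```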

Journal ArticleDOI
TL;DR: OmicCircos is an R software package used to generate high-quality circular plots for visualizing genomic variations, including mutation patterns, copy number variations, expression patterns, and methylation patterns, using cluster, boxplot, histogram, and heatmap formats.
Abstract: Summary: OmicCircos is an R software package used to generate high-quality circular plots for visualizing genomic variations, including mutation patterns, copy number variations (CNVs), expression patterns, and methylation patterns. Such variations can be displayed as scatterplot, line, or text-label figures. Relationships among genomic features in different chromosome positions can be represented in the forms of polygons or curves. Utilizing the statistical and graphic functions in an R/Bioconductor environment, OmicCircos performs statistical analyses and displays results using cluster, boxplot, histogram, and heatmap formats. In addition, OmicCircos offers a number of unique capabilities, including independent track drawing for easy modification and integration, zoom functions, link-polygons, and position-independent heatmaps supporting detailed visualization. Availability and Implementation: OmicCircos is available through Bioconductor at http://www.bioconductor.org/packages/devel/bioc/html/OmicCircos.html. An extensive vignette in the package describes installation, data formatting, and workflow procedures. The software is open source under the Artistic-2.0 license.

Proceedings ArticleDOI
31 May 2014
TL;DR: An empirical study of 26.6 million builds produced during a period of nine months by thousands of developers describes the workflow through which those builds are generated, and analyzes failure frequency, compiler error types, and resolution efforts to fix those compiler errors.
Abstract: Building is an integral part of the software development process. However, little is known about the compiler errors that occur in this process. In this paper, we present an empirical study of 26.6 million builds produced during a period of nine months by thousands of developers. We describe the workflow through which those builds are generated, and we analyze failure frequency, compiler error types, and resolution efforts to fix those compiler errors. The results provide insights into how a large organization's build process works and pinpoint errors for which further developer support would be most effective.

Journal ArticleDOI
TL;DR: MOHEFT, a Pareto-based list scheduling heuristic that provides the user with a set of tradeoff optimal solutions from which the one that better suits the user requirements can be manually selected, is analysed.
Abstract: Nowadays, scientists and companies are confronted with multiple competing goals, such as makespan in high-performance computing and economic cost in Clouds, that have to be simultaneously optimised. Multi-objective scheduling of scientific applications in these systems is therefore receiving increasing research attention. Most existing approaches typically aggregate all objectives into a single function, defined a priori without any knowledge about the problem being solved, which negatively impacts the quality of the solutions. In contrast, Pareto-based approaches, whose outcome is a set of (nearly) optimal solutions that represent a tradeoff among the different objectives, have been scarcely studied. In this paper, we analyse MOHEFT, a Pareto-based list scheduling heuristic that provides the user with a set of tradeoff optimal solutions from which the one that best suits the user requirements can be manually selected. We demonstrate the potential of our method for multi-objective workflow scheduling on the commercial Amazon EC2 Cloud. We compare the quality of the MOHEFT tradeoff solutions with two state-of-the-art approaches using different synthetic and real-world workflows: the classical HEFT algorithm for single-objective scheduling and the SPEA2* genetic algorithm used in multi-objective optimisation problems. The results demonstrate that our approach is able to compute solutions of higher quality than SPEA2*. In addition, we show that MOHEFT is more suitable than SPEA2* for workflow scheduling in the context of commercial Clouds, since the genetic-based approach is unable to deal with some of the constraints imposed by these systems.
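The Pareto notion at the heart of MOHEFT's output can be shown in a few lines: given candidate schedules scored on makespan and cost, keep only the non-dominated ones and leave the final choice to the user. The candidate schedules below are invented.

```python
# Minimal sketch of a Pareto (non-dominated) front over (makespan, cost),
# both minimized. Candidate schedules and their scores are made up.
def pareto_front(solutions):
    """solutions: list of (label, makespan, cost)."""
    front = []
    for label, m, c in solutions:
        dominated = any(m2 <= m and c2 <= c and (m2 < m or c2 < c)
                        for _, m2, c2 in solutions)
        if not dominated:
            front.append((label, m, c))
    return front

candidates = [("all-small-VMs", 120, 4.0), ("all-large-VMs", 45, 14.0),
              ("mixed", 60, 7.5), ("wasteful", 130, 9.0)]
print(pareto_front(candidates))   # the dominated "wasteful" schedule is dropped
```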

Journal ArticleDOI
TL;DR: A Building Information Modeling (BIM) enabled information infrastructure for FDD is proposed, which streamlines the information exchange process and therefore has the potential to improve the efficiency of similar works in practice.

Proceedings ArticleDOI
13 May 2014
TL;DR: The results show that the proposed resource allocation policies provide a robust and fault-tolerant schedule while minimizing makespan, and that the robustness of the schedule increases with the available budget.
Abstract: Dynamic resource provisioning and the notion of seemingly unlimited resources are attracting scientific workflows rapidly into Cloud computing. Existing works on workflow scheduling in the context of Clouds focus either on deadline or on cost optimization, ignoring the necessity for robustness. Robust scheduling that handles performance variations of Cloud resources and failures in the environment is essential in the context of Clouds. In this paper, we present a robust scheduling algorithm with resource allocation policies that schedule workflow tasks on heterogeneous Cloud resources while trying to minimize the total elapsed time (makespan) and the cost. Our results show that the proposed resource allocation policies provide a robust and fault-tolerant schedule while minimizing makespan. The results also show that, as the budget increases, our policies increase the robustness of the schedule.

Book ChapterDOI
01 Jan 2014
TL;DR: This paper describes some of the most important challenges, including the need to develop and apply novel methods, algorithms and tools for the integration, fusion, pre-processing, mapping, analysis and interpretation of complex biomedical data with the aim to identify testable hypotheses, and build realistic models.
Abstract: Biomedical research is drowning in data, yet starving for knowledge. Current challenges in biomedical research and clinical practice include information overload – the need to combine vast amounts of structured, semi-structured, weakly structured data and vast amounts of unstructured information – and the need to optimize workflows, processes and guidelines, to increase capacity while reducing costs and improving efficiencies. In this paper we provide a very short overview on interactive and integrative solutions for knowledge discovery and data mining. In particular, we emphasize the benefits of including the end user in the “interactive” knowledge discovery process. We describe some of the most important challenges, including the need to develop and apply novel methods, algorithms and tools for the integration, fusion, pre-processing, mapping, analysis and interpretation of complex biomedical data, with the aim to identify testable hypotheses and build realistic models. The HCI-KDD approach, which is a synergistic combination of methodologies and approaches from two areas, Human–Computer Interaction (HCI) and Knowledge Discovery & Data Mining (KDD), offers ideal conditions for solving these challenges, with the goal of supporting human intelligence with machine intelligence. There is an urgent need for integrative and interactive machine learning solutions, because no medical doctor or biomedical researcher can keep pace today with the increasingly large and complex data sets – often called “Big Data”.

Journal ArticleDOI
TL;DR: Orione, a Galaxy-based framework consisting of publicly available research software and specifically designed pipelines to build complex, reproducible workflows for next-generation sequencing microbiology data analysis provides new opportunities for data-intensive computational analyses in microbiology and metagenomics.
Abstract: Summary: End-to-end next-generation sequencing microbiology data analysis requires a diversity of tools covering bacterial resequencing, de novo assembly, scaffolding, bacterial RNA-Seq, gene annotation and metagenomics. However, the construction of computational pipelines that use different software packages is difficult owing to a lack of interoperability, reproducibility and transparency. To overcome these limitations we present Orione, a Galaxy-based framework consisting of publicly available research software and specifically designed pipelines to build complex, reproducible workflows for next-generation sequencing microbiology data analysis. Enabling microbiology researchers to conduct their own custom analysis and data manipulation without software installation or programming, Orione provides new opportunities for data-intensive computational analyses in microbiology and metagenomics. Availability and implementation: Orione is available online at http://orione.crs4.it. Contact: gianmauro.cuccuru@crs4.it Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: This work proposes a new method to generate suboptimal or sufficiently good schedules for smooth multitask workflows on cloud platforms and proves the suboptimality through mathematical analysis.

Patent
14 Feb 2014
TL;DR: In this paper, a system and method for managing workflow including the ability to manage system-level workflow strategy, manage individual workflow activity, and provide suggestions to optimize individual workflow activities is provided.
Abstract: A system and method for managing workflow including the ability to manage system-level workflow strategy, manage individual workflow activity, and provide suggestions to optimize individual workflow activity, is provided. The system may provide suggestions to workers based on their schedules and activities that the workers have to perform. Users of the system may configure the criteria used in determining which suggestions to provide to workers. The system may refine the suggestions provided to workers based on feedback received regarding the suggestions previously provided.

Journal ArticleDOI
TL;DR: An efficient method for calculating the distance between process fragments and select candidate node sets for recommendation purpose with the help of the minimum depth-first search (DFS) codes of business process graphs is proposed.
Abstract: In modern commerce, both frequent changes in customer demands and the specialization of the business process require enterprises to be able to model business processes effectively and efficiently. Traditional methods for improving business process modeling, such as workflow mining and process retrieval, still require much manual work. To address this, a workflow recommendation technique based on the structure of a business process is proposed in this paper to support process designers in automatically constructing the new business process under consideration. With the help of the minimum depth-first search (DFS) codes of business process graphs, we propose an efficient method for calculating the distance between process fragments and for selecting candidate node sets for recommendation. In addition, a recommendation system for improving modeling efficiency and accuracy was implemented, and its implementation details are discussed. Finally, we conducted experiments on both synthetic and real-world datasets to compare the proposed method with other methods, and the results demonstrate its effectiveness for practical applications.
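A heavily simplified illustration of comparing process fragments through their DFS codes is sketched below: each fragment is a sequence of DFS edge tuples and the distance is a plain edit distance over those sequences. Deriving the canonical minimum DFS code itself (as in gSpan-style graph mining) is more involved and is not shown, and the fragments are toy data.

```python
# Simplified, hedged illustration: compare process fragments by an edit
# distance over sequences of DFS edge tuples (not the paper's exact method).
def edit_distance(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[len(a)][len(b)]

# DFS code entries as (from, to, from_label, edge_label, to_label) tuples (toy data)
fragment_a = [(0, 1, "start", "seq", "check order"), (1, 2, "check order", "seq", "ship")]
fragment_b = [(0, 1, "start", "seq", "check order"), (1, 2, "check order", "seq", "bill")]
print("fragment distance:", edit_distance(fragment_a, fragment_b))
```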

Journal ArticleDOI
TL;DR: This paper presents an innovative collaborative workflow focused on image-based plant identification as a means to enlist new contributors and facilitate access to botanical data.

Journal ArticleDOI
TL;DR: In this paper, the authors argue that human resource slack generally decreases a firm's performance but that holding excess numbers of employees who possess important tacit knowledge that is specific to firms may benefit the firm, and obtain initial empirical evidence for their predictions by testing them on a novel dataset comprising six years of data for 4,070 manufacturing plants in Mexico.
Abstract: Whether holding resources in excess of what is needed to sustain routine operations (i.e., having slack) increases or decreases firm performance is a question of ongoing interest to management scholars. We contribute to existing theory by arguing that human resource slack generally decreases a firm's performance but that holding excess numbers of employees who possess important tacit knowledge that is specific to firms may benefit the firm. We find that the value of these excess resources increases as firms face competitive pressures and decreases when firms' operational choices facilitate the standardization of workflows. We obtain initial empirical evidence for our predictions by testing them on a novel dataset comprising six years of data for 4,070 manufacturing plants in Mexico.

Book ChapterDOI
09 Jun 2014
TL;DR: The proposed noWorkflow is a tool that transparently captures provenance of scripts and enables reproducibility, and leverages Software Engineering techniques, such as abstract syntax tree analysis, reflection, and profiling, to collect different types of provenance.
Abstract: We propose noWorkflow, a tool that transparently captures provenance of scripts and enables reproducibility. Unlike existing approaches, noWorkflow is non-intrusive and does not require users to change the way they work: users need not wrap their experiments in scientific workflow systems, install version control systems, or instrument their scripts. The tool leverages Software Engineering techniques, such as abstract syntax tree analysis, reflection, and profiling, to collect different types of provenance, including detailed information about the underlying libraries. We describe how noWorkflow captures multiple kinds of provenance and the different classes of analyses it supports: graph-based visualization; differencing over provenance trails; and inference queries.
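One of the ingredients named above, abstract syntax tree analysis, can be demonstrated in a few lines of Python: the sketch below walks a script's AST and lists the function calls it makes, a minimal form of the provenance a tool like noWorkflow records (its real collection is far more detailed and also uses profiling and reflection).

```python
# Minimal sketch of AST-based call collection; the embedded SCRIPT and its
# function names are invented, and this is not noWorkflow's implementation.
import ast

SCRIPT = """
import csv
data = load_table("input.csv")
result = summarize(data)
save(result, "out.json")
"""

class CallCollector(ast.NodeVisitor):
    def __init__(self):
        self.calls = []
    def visit_Call(self, node):
        # Record plain calls (f(...)) and attribute calls (obj.f(...)).
        if isinstance(node.func, ast.Name):
            self.calls.append(node.func.id)
        elif isinstance(node.func, ast.Attribute):
            self.calls.append(node.func.attr)
        self.generic_visit(node)

collector = CallCollector()
collector.visit(ast.parse(SCRIPT))
print("calls observed in script:", collector.calls)
```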

Journal ArticleDOI
TL;DR: It is found that dedicated promoter roles strongly contribute to a successful implementation of crowdsourcing, turning pilot projects into an organizational routine, and suggestions for organizational interventions to overcome barriers and sources of resistance are derived.
Abstract: Crowdsourcing has been demonstrated to be an effective strategy to enhance the efficiency of a firm’s innovation process. In this paper, we focus on tournament-based crowdsourcing (also referred to as “broadcast search”), a method to solve technical problems in the form of an open call for solutions to a large network of experts. Based on a longitudinal study of six companies piloting this application of crowdsourcing, we identify barriers and sources of resistance that hinder its implementation in firms. Our paper contributes to the state of research by analyzing crowdsourcing at the level of pilot projects, hence providing a workflow perspective that considers the creation of dedicated processes and operations of crowdsourcing. This project-level analysis enables the identification of specific challenges managers face when implementing crowdsourcing within an established R&D organization. Following a design science approach, we derive suggestions for organizational interventions to overcome these barriers. We find that dedicated promoter roles strongly contribute to a successful implementation of crowdsourcing, turning pilot projects into an organizational routine.

Journal ArticleDOI
01 Jan 2014
TL;DR: A scheduling algorithm that schedules tasks on Cloud resources using two different pricing models (spot and on-demand instances) to reduce the cost of execution whilst meeting the workflow deadline is proposed.
Abstract: Scientific workflows are used to model applications of high throughput computation and complex large scale data analysis. In recent years, Cloud computing has fast evolved as the target platform for such applications among researchers. Furthermore, new pricing models have been pioneered by Cloud providers that allow users to provision resources and to use them in an efficient manner with significant cost reductions. In this paper, we propose a scheduling algorithm that schedules tasks on Cloud resources using two different pricing models (spot and on-demand instances) to reduce the cost of execution whilst meeting the workflow deadline. The proposed algorithm is fault tolerant against the premature termination of spot instances and is also robust against performance variations of Cloud resources. Experimental results demonstrate that our heuristic reduces execution cost by up to 70% compared with using only on-demand instances.
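A toy decision rule capturing the spot-versus-on-demand trade-off the paper describes is sketched below: use a cheap spot instance only when there is enough slack to re-run the task on demand if the spot instance is reclaimed. The threshold, prices, and times are assumptions, not the paper's heuristic.

```python
# Toy decision rule (not the paper's algorithm): prefer spot when the deadline
# slack can absorb a possible premature termination and re-run. Prices invented.
def choose_instance(task_runtime, slack_to_deadline,
                    spot_price=0.03, ondemand_price=0.10):
    # Require slack for one full re-execution before trusting a spot instance.
    if slack_to_deadline >= 2 * task_runtime:
        return "spot", spot_price * task_runtime
    return "on-demand", ondemand_price * task_runtime

for runtime, slack in [(2.0, 10.0), (2.0, 3.0)]:
    kind, cost = choose_instance(runtime, slack)
    print(f"runtime={runtime}h slack={slack}h -> {kind}, est. cost ${cost:.2f}")
```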

Journal ArticleDOI
TL;DR: This work describes how UCSF Chimera is enhanced, a program for the interactive visualization and analysis of molecular structures and related data, through the addition of several web services, and illustrates its use of web services with an example workflow that interleaves use of these services with interactive manipulation of molecular sequences and structures.
Abstract: Integrating access to web services with desktop applications allows for an expanded set of application features, including performing computationally intensive tasks and convenient searches of databases. We describe how we have enhanced UCSF Chimera (http://www.rbvi.ucsf.edu/chimera/), a program for the interactive visualization and analysis of molecular structures and related data, through the addition of several web services (http://www.rbvi.ucsf.edu/chimera/docs/webservices.html). By streamlining access to web services, including the entire job submission, monitoring and retrieval process, Chimera makes it simpler for users to focus on their science projects rather than data manipulation. Chimera uses Opal, a toolkit for wrapping scientific applications as web services, to provide scalable and transparent access to several popular software packages. We illustrate Chimera's use of web services with an example workflow that interleaves use of these services with interactive manipulation of molecular sequences and structures, and we provide an example Python program to demonstrate how easily Opal-based web services can be accessed from within an application. Web server availability: http://webservices.rbvi.ucsf.edu/opal2/dashboard?command=serviceList.

Journal ArticleDOI
TL;DR: A new statistical analysis workflow is created, MetaboLyzer, which aims to both simplify analysis for investigators new to metabolomics, as well as provide experienced investigators the flexibility to conduct sophisticated analysis.
Abstract: Metabolomics, the global study of small molecules in a particular system, has in the past few years risen to become a primary -omics platform for the study of metabolic processes. With the ever-increasing pool of quantitative data yielded from metabolomic research, specialized methods and tools with which to analyze and extract meaningful conclusions from these data are becoming more and more crucial. Furthermore, the depth of knowledge and expertise required to undertake a metabolomics oriented study is a daunting obstacle to investigators new to the field. As such, we have created a new statistical analysis workflow, MetaboLyzer, which aims to both simplify analysis for investigators new to metabolomics, as well as provide experienced investigators the flexibility to conduct sophisticated analysis. MetaboLyzer's workflow is specifically tailored to the unique characteristics and idiosyncrasies of postprocessed liquid chromatography-mass spectrometry (LC-MS)-based metabolomic data sets. It utilizes a wide gamut of statistical tests, procedures, and methodologies that belong to classical biostatistics, as well as several novel statistical techniques that we have developed specifically for metabolomics data. Furthermore, MetaboLyzer conducts rapid putative ion identification and putative biologically relevant analysis via incorporation of four major small molecule databases: KEGG, HMDB, Lipid Maps, and BioCyc. MetaboLyzer incorporates these aspects into a comprehensive workflow that outputs easy to understand statistically significant and potentially biologically relevant information in the form of heatmaps, volcano plots, 3D visualization plots, correlation maps, and metabolic pathway hit histograms. For demonstration purposes, a urine metabolomics data set from a previously reported radiobiology study in which samples were collected from mice exposed to γ radiation was analyzed. MetaboLyzer was able to identify 243 statistically significant ions out of a total of 1942. Numerous putative metabolites and pathways were found to be biologically significant from the putative ion identification workflow.
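The per-ion statistics such a workflow reports can be illustrated with a short sketch: a Welch t-test plus a log2 fold change per ion, i.e., the ingredients of a volcano plot. The ion names and intensities below are made up, and the real tool covers many more tests, visualizations, and database lookups.

```python
# Hedged sketch of per-ion volcano-plot statistics (Welch t-test + log2 fold
# change) on invented LC-MS-like intensities; not MetaboLyzer's implementation.
import math
from scipy import stats

control = {"ion_231.09": [1.0, 1.2, 0.9, 1.1], "ion_377.14": [2.0, 2.1, 1.9, 2.2]}
exposed = {"ion_231.09": [2.4, 2.6, 2.2, 2.5], "ion_377.14": [2.0, 2.2, 1.8, 2.1]}

for ion in control:
    t, p = stats.ttest_ind(exposed[ion], control[ion], equal_var=False)
    fold = math.log2(sum(exposed[ion]) / sum(control[ion]))
    flag = "significant" if p < 0.05 else "n.s."
    print(f"{ion}: log2FC={fold:+.2f}, p={p:.4f} ({flag})")
```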

Patent
23 Jan 2014
TL;DR: In this article, a novel approach for shifting or distributing various information (e.g., protocols, analysis methods, sample preparation data, sequencing data, etc.) to a cloud-based network is presented.
Abstract: The present disclosure provides a novel approach for shifting or distributing various information (e.g., protocols, analysis methods, sample preparation data, sequencing data, etc.) to a cloud-based network. For example, the techniques relate to a cloud computing environment (12) configured to receive this information from one or more individual sample preparation devices (38), sequencing devices (18), and/or computing systems. In turn, the cloud computing environment (12) may generate information for use in the cloud computing environment (12) and/or to provide the generated information to the devices to guide a genomic analysis workflow. Further, the cloud computing environment (12) may be used to facilitate the sharing of sample preparation protocols for use with generic sample preparation cartridges and/or monitoring the popularity of the sample preparation protocols.

Journal ArticleDOI
TL;DR: The NHI consists of various physical models at appropriate temporal and spatial scales for all parts of the water system and a workflow and version management system guarantees consistency in the data, software, computations and results.
Abstract: Water management in the Netherlands applies to a dense network of surface waters for discharge, storage and distribution, serving highly valuable land-use. National and regional water authorities develop long-term plans for sustainable water use and safety under changing climate conditions. Decisions about investments in adaptive measures are based on analyses supported by the Netherlands Hydrological Instrument (NHI), which builds on the best available data and state-of-the-art technology and was developed through collaboration between national research institutes. The NHI consists of various physical models at appropriate temporal and spatial scales for all parts of the water system. Intelligent connectors provide transfer between different scales and fast computation, by coupling model codes at a deep level in software. A workflow and version management system guarantees consistency in the data, software, computations and results. The NHI is freely available to hydrologists via an open web interface that enables exchange of all data and tools.