
Showing papers on "Workflow published in 2015"


Book
01 Jun 2015
TL;DR: A practical primer on how to calculate and report effect sizes for t-tests and ANOVAs, such that effect sizes can be used in a priori power analyses and meta-analyses, is provided, along with a detailed overview of the similarities and differences between within- and between-subjects designs.
Abstract: Effect sizes are the most important outcome of empirical studies. Most articles on effect sizes highlight their importance for communicating the practical significance of results. For scientists themselves, effect sizes are most useful because they facilitate cumulative science. Effect sizes can be used to determine the sample size for follow-up studies, or for examining effects across studies. This article aims to provide a practical primer on how to calculate and report effect sizes for t-tests and ANOVAs such that effect sizes can be used in a priori power analyses and meta-analyses. Whereas many articles about effect sizes focus on between-subjects designs and address within-subjects designs only briefly, I provide a detailed overview of the similarities and differences between within- and between-subjects designs. I suggest that some research questions in experimental psychology examine inherently intra-individual effects, which makes effect sizes that incorporate the correlation between measures the best summary of the results. Finally, a supplementary spreadsheet is provided to make it as easy as possible for researchers to incorporate effect size calculations into their workflow.

5,374 citations
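For the two designs contrasted in this entry, the commonly used textbook definitions are roughly as follows; this is a sketch of the standard formulas (bias-corrected variants such as Hedges' g differ slightly), not a transcription of the paper's supplementary spreadsheet. For two independent groups, Cohen's d based on the pooled standard deviation is

d_s = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)\,SD_1^2 + (n_2 - 1)\,SD_2^2}{n_1 + n_2 - 2}}}

while for a within-subjects (paired) comparison, Cohen's d_z is based on the difference scores, and therefore incorporates the correlation between the paired measures:

d_z = \frac{\bar{X}_{\mathrm{diff}}}{SD_{\mathrm{diff}}} = \frac{t}{\sqrt{n}}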


Journal ArticleDOI
TL;DR: How the popular emerging technology Docker combines several areas from systems research - such as operating system virtualization, cross-platform portability, modular re-usable elements, versioning, and a 'DevOps' philosophy - to address these challenges is examined.
Abstract: As computational work becomes more and more integral to many aspects of scientific research, computational reproducibility has become an issue of increasing importance to computer systems researchers and domain scientists alike. Though computational reproducibility seems more straightforward than replicating physical experiments, the complex and rapidly changing nature of computer environments makes being able to reproduce and extend such work a serious challenge. In this paper, I explore common reasons that code developed for one research project cannot be successfully executed or extended by subsequent researchers. I review current approaches to these issues, including virtual machines and workflow systems, and their limitations. I then examine how the popular emerging technology Docker combines several areas from systems research - such as operating system virtualization, cross-platform portability, modular re-usable elements, versioning, and a 'DevOps' philosophy - to address these challenges. I illustrate this with several examples of Docker use with a focus on the R statistical environment.

729 citations
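As a rough illustration of the approach described in this entry (not code from the paper), the sketch below drives an R computation inside a version-pinned Docker container from Python. The rocker/r-base image name and the 4.3.1 tag are assumptions chosen for the example; a real project would pin whatever image and tag reproduce its environment.

# Sketch: execute an R expression inside a pinned Docker image so that the operating
# system, R version, and installed packages are fixed by the image tag.
# Assumes Docker is installed and the chosen image/tag is available.
import os
import subprocess

IMAGE = "rocker/r-base:4.3.1"   # hypothetical pinned tag; use whatever your project requires

def run_in_container(r_expression: str) -> str:
    """Run a single R expression in a throwaway container and return its stdout."""
    cmd = [
        "docker", "run", "--rm",                 # remove the container after it exits
        "-v", f"{os.getcwd()}:/work",            # mount the project directory
        "-w", "/work",
        IMAGE,
        "Rscript", "-e", r_expression,
    ]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

if __name__ == "__main__":
    print(run_in_container('cat(mean(rnorm(1000)), "\\n")'))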


Journal ArticleDOI
TL;DR: An integrated view of the Pegasus system is provided, showing its capabilities that have been developed over time in response to application needs and to the evolution of the scientific computing platforms.

701 citations


Journal ArticleDOI
TL;DR: FireWorks has been used to complete over 50 million CPU‐hours worth of computational chemistry and materials science calculations at the National Energy Research Supercomputing Center, and its implementation strategy that rests on Python and NoSQL databases (MongoDB) is discussed.
Abstract: This paper introduces FireWorks, a workflow software for running high-throughput calculation workflows at supercomputing centers. FireWorks has been used to complete over 50 million CPU-hours worth of computational chemistry and materials science calculations at the National Energy Research Supercomputing Center. It has been designed to serve the demanding high-throughput computing needs of these applications, with extensive support for (i) concurrent execution through job packing, (ii) failure detection and correction, (iii) provenance and reporting for long-running projects, (iv) automated duplicate detection, and (v) dynamic workflows (i.e., modifying the workflow graph during runtime). We have found that these features are highly relevant to enabling modern data-driven and high-throughput science applications, and we discuss our implementation strategy that rests on Python and NoSQL databases (MongoDB). Finally, we present performance data and limitations of our approach along with planned future work. Copyright © 2015 John Wiley & Sons, Ltd.

405 citations
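FireWorks is distributed as a Python package, and basic usage looks roughly like the sketch below, written in the style of its hello-world examples. It assumes the fireworks package is installed and a MongoDB instance is reachable at the default localhost address used by LaunchPad; the task names and echo commands are placeholders.

# Sketch: a two-step FireWorks workflow stored in MongoDB and executed locally.
from fireworks import Firework, LaunchPad, ScriptTask, Workflow
from fireworks.core.rocket_launcher import rapidfire

launchpad = LaunchPad()  # default connection: localhost:27017, database "fireworks"

fw_relax = Firework(ScriptTask.from_str('echo "relax structure"'), name="relax")
fw_static = Firework(ScriptTask.from_str('echo "static calculation"'), name="static")

# The links dict encodes the workflow graph: fw_static runs only after fw_relax completes.
wf = Workflow([fw_relax, fw_static], {fw_relax: [fw_static]}, name="toy_workflow")
launchpad.add_wf(wf)

rapidfire(launchpad)  # pull and execute ready jobs until none remain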


Journal ArticleDOI
TL;DR: Two key elements in collaborative workflows, the consistency of data sharing and the reproducibility of calculation results, are embedded in the IBEX workflow: image data, feature algorithms, and model validation, including newly developed ones from different users, can be easily and consistently shared so that results can be more easily reproduced between institutions.
Abstract: Purpose: Radiomics, which is the high-throughput extraction and analysis of quantitative image features, has been shown to have considerable potential to quantify the tumor phenotype. However, at present, a lack of software infrastructure has impeded the development of radiomics and its applications. Therefore, the authors developed the imaging biomarker explorer (IBEX), an open infrastructure software platform that flexibly supports common radiomics workflow tasks such as multimodality image data import and review, development of feature extraction algorithms, model validation, and consistent data sharing among multiple institutions. Methods: The IBEX software package was developed using the MATLAB and C/C++ programming languages. The software architecture deploys the modern model-view-controller, unit testing, and function handle programming concepts to isolate each quantitative imaging analysis task, to validate whether the relevant data and algorithms are fit for use, and to plug in new modules. On one hand, IBEX is self-contained and ready to use: it has implemented common data importers, common image filters, and common feature extraction algorithms. On the other hand, IBEX provides an integrated development environment on top of MATLAB and C/C++, so users are not limited to its built-in functions. In the IBEX developer studio, users can plug in, debug, and test new algorithms, extending IBEX's functionality. IBEX also supports quality assurance for data and feature algorithms: image data, regions of interest, and feature algorithm-related data can be reviewed, validated, and/or modified. More importantly, two key elements in collaborative workflows, the consistency of data sharing and the reproducibility of calculation results, are embedded in the IBEX workflow: image data, feature algorithms, and model validation, including newly developed ones from different users, can be easily and consistently shared so that results can be more easily reproduced between institutions. Results: Researchers with a variety of technical skill levels, including radiation oncologists, physicists, and computer scientists, have found the IBEX software to be intuitive, powerful, and easy to use. IBEX can be run on any computer with the Windows operating system and 1 GB of RAM. The authors fully validated the implementation of all importers, preprocessing algorithms, and feature extraction algorithms. Windows version 1.0 beta of stand-alone IBEX and IBEX's source code can be downloaded. Conclusions: The authors successfully implemented IBEX, an open infrastructure software platform that streamlines common radiomics workflow tasks. Its transparency, flexibility, and portability can greatly accelerate the pace of radiomics research and pave the way toward successful clinical translation.

264 citations
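IBEX itself is a MATLAB/C++ platform, so the following is only a language-agnostic illustration in Python, with hypothetical array inputs, of the kind of first-order feature extraction step such a radiomics workflow automates; IBEX's own feature set (including texture features) is much richer than this.

# Sketch: first-order radiomics features from an image volume restricted to a region of interest.
# `image` and `roi_mask` are hypothetical numpy arrays of identical shape.
import numpy as np

def first_order_features(image: np.ndarray, roi_mask: np.ndarray, n_bins: int = 64) -> dict:
    voxels = image[roi_mask.astype(bool)]
    hist, _ = np.histogram(voxels, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]                                  # drop empty bins before the entropy sum
    return {
        "mean": float(voxels.mean()),
        "std": float(voxels.std(ddof=1)),
        "min": float(voxels.min()),
        "max": float(voxels.max()),
        "entropy": float(-(p * np.log2(p)).sum()),
    }

# Example with synthetic data standing in for a CT volume and a tumor mask:
img = np.random.default_rng(0).normal(size=(32, 32, 16))
mask = np.zeros_like(img)
mask[8:24, 8:24, 4:12] = 1
print(first_order_features(img, mask))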


Journal ArticleDOI
01 Dec 2015
TL;DR: A survey of data-intensive scientific workflow management in SWfMSs and their parallelization techniques is provided based on a SWfMS functional architecture, and a comparative analysis of the existing solutions is given.
Abstract: Nowadays, more and more computer-based scientific experiments need to handle massive amounts of data. Their data processing consists of multiple computational steps with dependencies among them. A data-intensive scientific workflow is useful for modeling such a process. Since the sequential execution of data-intensive scientific workflows may take much time, Scientific Workflow Management Systems (SWfMSs) should enable the parallel execution of data-intensive scientific workflows and exploit the resources distributed in different infrastructures such as grid and cloud. This paper provides a survey of data-intensive scientific workflow management in SWfMSs and their parallelization techniques. Based on a SWfMS functional architecture, we give a comparative analysis of the existing solutions. Finally, we identify research issues for improving the execution of data-intensive scientific workflows in a multisite cloud.

244 citations
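To make the notion of parallelizing a workflow concrete, here is a minimal, generic sketch (not tied to any particular SWfMS) that executes a small task graph level by level, running tasks whose dependencies are satisfied concurrently. The four task names and the thread pool are placeholders; a real SWfMS would dispatch programs to grid or cloud resources instead.

# Sketch: level-by-level parallel execution of a toy workflow DAG.
from concurrent.futures import ThreadPoolExecutor

# dependencies: task -> set of tasks that must finish first (hypothetical 4-task workflow)
deps = {"split": set(), "map_a": {"split"}, "map_b": {"split"}, "reduce": {"map_a", "map_b"}}

def run_task(name: str) -> str:
    print(f"running {name}")            # placeholder for invoking a real program on remote resources
    return name

done: set[str] = set()
with ThreadPoolExecutor(max_workers=4) as pool:
    while len(done) < len(deps):
        ready = [t for t, d in deps.items() if t not in done and d <= done]
        for finished in pool.map(run_task, ready):   # independent ready tasks execute in parallel
            done.add(finished)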


Journal ArticleDOI
TL;DR: This paper makes a comprehensive survey of workflow scheduling in cloud environment in a problem–solution manner and conducts taxonomy and comparative review on workflow scheduling algorithms.
Abstract: To program in distributed computing environments such as grids and clouds, workflow is adopted as an attractive paradigm for its powerful ability to express a wide range of applications, including scientific computing, multi-tier Web, and big data processing applications. With the development of cloud technology and the extensive deployment of cloud platforms, the problem of workflow scheduling in the cloud has become an important research topic. The challenges of the problem lie in: the NP-hard nature of task-resource mapping; diverse QoS requirements; on-demand resource provisioning; performance fluctuation and failure handling; hybrid resource scheduling; and data storage and transmission optimization. Consequently, a number of studies, focusing on different aspects, have emerged in the literature. In this paper, we first conduct a taxonomy and comparative review of workflow scheduling algorithms. Then, we make a comprehensive survey of workflow scheduling in the cloud environment in a problem-solution manner. Based on the analysis, we also highlight some research directions for future investigation.

206 citations
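As a tiny illustration of the task-resource mapping problem this survey covers, the sketch below applies a generic greedy earliest-finish-time heuristic (not an algorithm from the paper): rank the ready tasks and assign each to the VM type that completes it soonest. Task runtimes and VM speeds are invented; real schedulers also account for data transfer, pricing, deadlines, and other QoS constraints.

# Sketch: greedy earliest-finish-time assignment of independent ready tasks to VM types.
tasks = {"t1": 100.0, "t2": 40.0, "t3": 70.0}          # task -> reference runtime (seconds)
vms = {"small": 1.0, "medium": 2.0, "large": 4.0}       # VM type -> relative speed
vm_free_at = {v: 0.0 for v in vms}                      # when each VM becomes available

schedule = []
for task, work in sorted(tasks.items(), key=lambda kv: -kv[1]):   # longest task first
    finish = {v: vm_free_at[v] + work / speed for v, speed in vms.items()}
    best = min(finish, key=finish.get)                  # earliest finish time wins
    vm_free_at[best] = finish[best]
    schedule.append((task, best, finish[best]))

print(schedule)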


Journal ArticleDOI
TL;DR: It is found that the key factor determining the performance of an algorithm is its ability to decide which workflows in an ensemble to admit or reject for execution, and an admission procedure based on workflow structure and estimates of task runtimes can significantly improve the quality of solutions.

204 citations
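The TL;DR above concerns admitting or rejecting whole workflows from an ensemble. A crude sketch of such an admission test, purely illustrative and not the paper's algorithm, is to compare a workflow's estimated cost, derived from its task runtime estimates, against the remaining budget before accepting it for execution.

# Sketch: admit a workflow only if its estimated cost fits within the remaining budget.
# The runtime estimates, hourly price, and budget are hypothetical inputs.
def admit(estimated_task_runtimes_s, price_per_hour, remaining_budget):
    estimated_cost = sum(estimated_task_runtimes_s) / 3600.0 * price_per_hour
    return estimated_cost <= remaining_budget

print(admit([1200, 3600, 900], price_per_hour=0.50, remaining_budget=1.00))  # True: ~0.79 <= 1.00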


Proceedings ArticleDOI
Caitlin Sadowski, Jeffrey van Gogh, Ciera Jaspan, Emma Söderberg, Collin Winter
16 May 2015
TL;DR: This work presents Tricorder, a program analysis platform aimed at building a data-driven ecosystem around program analysis, and presents a set of guiding principles for program analysis tools and a scalable architecture for an analysis platform implementing these principles.
Abstract: Static analysis tools help developers find bugs, improve code readability, and ensure consistent style across a project. However, these tools can be difficult to smoothly integrate with each other and into the developer workflow, particularly when scaling to large codebases. We present Tricorder, a program analysis platform aimed at building a data-driven ecosystem around program analysis. We present a set of guiding principles for our program analysis tools and a scalable architecture for an analysis platform implementing these principles. We include an empirical, in-situ evaluation of the tool as it is used by developers across Google that shows the usefulness and impact of the platform.

192 citations
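Tricorder's internal interfaces are not reproduced here; the sketch below merely illustrates, with hypothetical names, the general shape of a platform in which independent analyzers consume changed files and emit findings, and in which one misbehaving analyzer cannot block the others.

# Sketch: a minimal analyzer-platform shape with pluggable, isolated analyzers.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Finding:
    path: str
    line: int
    message: str
    category: str          # e.g. the analyzer's name, useful for per-analyzer feedback

Analyzer = Callable[[str, List[str]], List[Finding]]

def todo_analyzer(path: str, lines: List[str]) -> List[Finding]:
    # Flags TODO comments that lack an owner in parentheses (an invented example check).
    return [Finding(path, i + 1, "TODO without owner", "todo-check")
            for i, line in enumerate(lines) if "TODO" in line and "(" not in line]

def run_analyzers(changed_files: dict, analyzers: List[Analyzer]) -> List[Finding]:
    findings = []
    for path, lines in changed_files.items():
        for analyze in analyzers:          # analyzers run independently of each other
            try:
                findings.extend(analyze(path, lines))
            except Exception:
                continue                   # a failing analyzer does not block the rest
    return findings

print(run_analyzers({"main.py": ["x = 1  # TODO fix this"]}, [todo_analyzer]))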


Journal ArticleDOI
TL;DR: An automated workflow for the generation of BIM data from 3D point clouds is presented and quality indicators for reconstructed geometry elements and a framework in which to assess the quality of the reconstructed geometry against a reference are presented.
Abstract: The need for better 3D documentation of the built environment has come to the fore in recent years, led primarily by city modelling at the large scale and Building Information Modelling (BIM) at the smaller scale. Automation is seen as desirable as it removes the time-consuming and therefore costly amount of human intervention in the process of model generation. BIM is the focus of this paper: not only is there a commercial need, as shown by the number of commercial solutions, but also wide research interest in automated 3D models from both the Geomatics and Computer Science communities. The aim is to go beyond the current labour-intensive tracing of the point cloud to an automated process that produces geometry that is both open and more verifiable. This work investigates what can be achieved today with automation through both a literature review and the proposal of a novel point cloud processing process. We present an automated workflow for the generation of BIM data from 3D point clouds. We also present quality indicators for reconstructed geometry elements and a framework in which to assess the quality of the reconstructed geometry against a reference.

175 citations
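One concrete ingredient of such a pipeline, and of the quality indicators mentioned in this entry, is measuring how well measured points fit a reconstructed planar element. The following generic sketch (not the authors' method) fits a plane to a point patch by SVD and reports the RMS point-to-plane residual as a simple quality score; the synthetic wall patch and noise level are invented for illustration.

# Sketch: fit a plane to a 3D point patch and compute the RMS point-to-plane residual,
# a simple quality indicator for a reconstructed planar element (e.g. a wall surface).
import numpy as np

def plane_fit_quality(points: np.ndarray):
    """points: (N, 3) array. Returns (unit normal, centroid, RMS residual in the points' units)."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                                   # direction of least variance
    residuals = (points - centroid) @ normal          # signed point-to-plane distances
    return normal, centroid, float(np.sqrt((residuals ** 2).mean()))

# Synthetic wall patch at y = 2 m with 2 mm of noise (coordinates in metres):
rng = np.random.default_rng(1)
pts = np.column_stack([rng.uniform(0, 5, 500),
                       np.full(500, 2.0) + rng.normal(0, 0.002, 500),
                       rng.uniform(0, 3, 500)])
print(plane_fit_quality(pts)[2])   # RMS residual, expected to be around 0.002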


Journal ArticleDOI
TL;DR: This study provides the largest high-confidence plasma proteome dataset available to date and provides clear demonstration of the value of using isobaric mass tag reagents in plasma-based biomarker discovery experiments.

Journal ArticleDOI
TL;DR: The overall workflow architecture of CERMINE is outlined and details about the implementation of individual steps are provided; the evaluation of the extraction workflow, carried out with the use of a large dataset, showed good performance for most metadata types.
Abstract: CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in a born-digital form. The system is based on a modular workflow, whose loosely coupled architecture allows for individual component evaluation and adjustment, enables effortless improvements and replacements of independent parts of the algorithm, and facilitates future architecture expansion. The implementations of most steps are based on supervised and unsupervised machine learning techniques, which simplifies the procedure of adapting the system to new document layouts and styles. The evaluation of the extraction workflow, carried out with the use of a large dataset, showed good performance for most metadata types, with an average F score of 77.5%. The CERMINE system is available under an open-source licence and can be accessed at http://cermine.ceon.pl. In this paper, we outline the overall workflow architecture and provide details about the implementation of individual steps. We also thoroughly compare CERMINE to similar solutions, describe the evaluation methodology and finally report its results.
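The "modular workflow with loosely coupled components" idea can be pictured as a chain of independently replaceable steps. The sketch below uses hypothetical step names and data shapes; CERMINE itself is a Java system with its own component interfaces.

# Sketch: a loosely coupled extraction pipeline in which each step can be evaluated
# or swapped independently of the others.
from typing import Callable, List

Step = Callable[[dict], dict]

def character_extraction(doc: dict) -> dict:
    doc["chars"] = list(doc["raw_text"]); return doc

def page_segmentation(doc: dict) -> dict:
    doc["zones"] = doc["raw_text"].split("\n\n"); return doc

def metadata_extraction(doc: dict) -> dict:
    doc["metadata"] = {"title": doc["zones"][0] if doc["zones"] else ""}; return doc

def run_pipeline(doc: dict, steps: List[Step]) -> dict:
    for step in steps:               # replacing one step does not require touching the others
        doc = step(doc)
    return doc

doc = {"raw_text": "A Title Line\n\nAbstract text follows here."}
print(run_pipeline(doc, [character_extraction, page_segmentation, metadata_extraction])["metadata"])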

Journal ArticleDOI
TL;DR: A survey of the requirements, solutions, and challenges in the area of information abstraction is provided, along with an efficient workflow to extract meaningful information from raw sensor data based on the current state of the art in this area.
Abstract: The term Internet of Things (IoT) refers to the interaction and communication between billions of devices that produce and exchange data related to real-world objects (i.e. things). Extracting higher-level information from the raw sensory data captured by the devices and representing this data as machine-interpretable or human-understandable information has several interesting applications. Turning raw data into higher-level information representations demands mechanisms to find, extract, and characterize meaningful abstractions from the raw data. These meaningful abstractions then have to be presented in a human- and/or machine-understandable representation. However, the heterogeneity of the data originating from different sensor devices and application scenarios, such as e-health, environmental monitoring, and smart home applications, and the dynamic nature of sensor data make it difficult to apply only one particular information processing technique to the underlying data. A considerable number of methods from machine learning, the semantic web, and pattern and data mining have been used to abstract from sensor observations to information representations. This paper provides a survey of the requirements, solutions, and challenges in the area of information abstraction and presents an efficient workflow to extract meaningful information from raw sensor data based on the current state of the art in this area. It also identifies research directions at the edge of information abstraction for sensor data. To ease the understanding of the abstraction workflow process, we introduce a software toolkit that implements the introduced techniques and encourages applying them to various data sets.
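A minimal version of the raw-data-to-abstraction workflow this survey describes might look like the following sketch: window a stream of sensor readings, extract simple features, and map them to a human-readable label. The window size, thresholds, and labels are invented for illustration and do not come from the paper.

# Sketch: raw sensor readings -> sliding-window features -> symbolic abstraction.
from statistics import mean, pstdev

def windows(readings, size=5):
    for i in range(0, len(readings) - size + 1, size):
        yield readings[i:i + size]

def abstract(readings):
    labels = []
    for w in windows(readings):
        feats = {"mean": mean(w), "std": pstdev(w)}
        if feats["mean"] > 30.0:
            labels.append(("hot", feats))
        elif feats["std"] > 2.0:
            labels.append(("unstable", feats))
        else:
            labels.append(("normal", feats))
    return labels

temperature_stream = [21.0, 21.2, 20.9, 21.1, 21.0, 33.5, 34.0, 33.8, 34.2, 33.9]
print([label for label, _ in abstract(temperature_stream)])   # ['normal', 'hot']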

Proceedings ArticleDOI
01 Jul 2015
TL;DR: An approach for generating deep comprehension questions from novel text that bypasses the myriad challenges of creating a full semantic representation by decomposing the task into an ontology-crowd-relevance workflow, consisting of first representing the original text in a low-dimensional ontology, then crowdsourcing candidate question templates aligned with that space, and finally ranking potentially relevant templates for a novel region of text.
Abstract: We develop an approach for generating deep (i.e., high-level) comprehension questions from novel text that bypasses the myriad challenges of creating a full semantic representation. We do this by decomposing the task into an ontology-crowd-relevance workflow, consisting of first representing the original text in a low-dimensional ontology, then crowdsourcing candidate question templates aligned with that space, and finally ranking potentially relevant templates for a novel region of text. If ontological labels are not available, we infer them from the text. We demonstrate the effectiveness of this method on a corpus of articles from Wikipedia alongside human judgments, and find that we can generate relevant deep questions with a precision of over 85% while maintaining a recall of 70%.
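The last stage of the ontology-crowd-relevance workflow, ranking candidate templates for a new region of text, can be sketched generically as scoring each crowd-sourced template by how well its ontological slot matches the labels inferred for that text. The categories, templates, and scoring rule below are invented placeholders, not the paper's model.

# Sketch: rank crowd-sourced question templates by match to a text's inferred ontology labels.
templates = [
    {"text": "What were the long-term consequences of {event}?", "category": "history"},
    {"text": "How does {concept} differ from related approaches?", "category": "science"},
    {"text": "Why was {person} influential in their field?", "category": "biography"},
]

def rank_templates(inferred_labels: dict, templates: list) -> list:
    """inferred_labels: category -> confidence for the novel text region."""
    scored = [(inferred_labels.get(t["category"], 0.0), t["text"]) for t in templates]
    return [text for score, text in sorted(scored, reverse=True) if score > 0]

print(rank_templates({"science": 0.8, "history": 0.3}, templates))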

Journal ArticleDOI
TL;DR: A novel approach to the preservation of scientific workflows through the application of research objects - aggregations of data and metadata that enrich the workflow specifications and support the creation of workflow-centric research objects.

Journal ArticleDOI
TL;DR: The effectiveness of cloud-based BIM for real-time delivery of information to support progress monitoring and management of the construction of a reinforced concrete (RC) structure is examined using action based research.

Journal ArticleDOI
TL;DR: The use of EHR technology has a major impact on ICU physician work and workflow (e.g., increased time spent on clinical review and documentation), which raises the question of how these changes in physician work affect the quality of care provided.

Proceedings ArticleDOI
15 Jun 2015
TL;DR: This paper considers how to best integrate container technology into an existing workflow system, using Makeflow, Work Queue, and Docker as examples of current technology.
Abstract: Workflows are a widely used abstraction for representing large scientific applications and executing them on distributed systems such as clusters, clouds, and grids. However, workflow systems have been largely silent on the question of precisely what environment each task in the workflow is expected to run in. As a result, a workflow may run correctly in the environment in which it was designed, but when moved to another machine, is highly likely to fail due to differences in the operating system, installed applications, available data, and so forth. Lightweight container technology has recently arisen as a potential solution to this problem by providing well-defined execution environments at the operating system level. In this paper, we consider how to best integrate container technology into an existing workflow system, using Makeflow, Work Queue, and Docker as examples of current technology. A brief performance study of Docker shows very little overhead in CPU and I/O performance, but significant costs in creating and deleting containers. Taking this into account, we describe four different methods of connecting containers to different points of the infrastructure, and explain several methods of managing the container images that must be distributed to executing tasks. We explore the performance of a large bioinformatics workload on a Docker-enabled cluster, and observe the best configuration to be locally-managed containers that are shared between multiple tasks.
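One of the integration methods the paper compares is wrapping each task's command so that it executes inside a container. The generic, hypothetical wrapper below (not Makeflow's actual syntax, and with an invented image name) shows the idea, and also makes visible why per-task container creation adds the start-up and tear-down cost the study measures, motivating the shared-container configuration the authors found fastest.

# Sketch: wrap a workflow task command so it runs inside a Docker container.
import os
import shlex
import subprocess

IMAGE = "biotools:latest"   # hypothetical image containing the task's applications

def containerized(command: str, workdir: str = ".") -> list:
    return ["docker", "run", "--rm",
            "-v", f"{os.path.abspath(workdir)}:/data", "-w", "/data",
            IMAGE, "sh", "-c", command]

def run_task(command: str) -> int:
    # Each call starts (and then removes) a fresh container, which is where the overhead lives.
    return subprocess.call(containerized(command))

# Show the wrapped command for an example bioinformatics task without executing it:
print(" ".join(shlex.quote(c) for c in containerized("bwa mem ref.fa reads.fq > out.sam")))
# run_task("bwa mem ref.fa reads.fq > out.sam")   # would execute the task inside the container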

Posted ContentDOI
06 Nov 2015-bioRxiv
TL;DR: What’s in my Pot?
Abstract: Whole genome sequencing on next-generation instruments provides an unbiased way to identify the organisms present in complex metagenomic samples. However, the time-to-result can be protracted because of fixed-time sequencing runs and cumbersome bioinformatics workflows. This limits the utility of the approach in settings where rapid species identification is crucial, such as in the quality control of food-chain components or during an outbreak of an infectious disease. Here we present What's in my Pot? (WIMP), a laboratory and analysis workflow in which, starting with an unprocessed sample, sequence data is generated and the bacteria, viruses and fungi present in the sample are classified to subspecies and strain level in a quantitative manner, without prior knowledge of the sample composition, in approximately 3.5 hours. This workflow relies on the combination of Oxford Nanopore Technologies' MinION™ sensing device with a real-time species identification bioinformatics application.

Journal ArticleDOI
TL;DR: The implications of this emerging monoculture: its advantages and disadvantages for physicians and hospitals and its role in innovation, professional autonomy, implementation difficulties, workflow, flexibility, cost, data standards, interoperability, and interactions with other information technology systems are examined.

Journal ArticleDOI
TL;DR: In this paper, a holistic approach that strategically plans for BIM execution in green building projects is proposed, where the authors find that green BIM can improve project outcomes, and facilitate the accomplishment of established sustainability goals.
Abstract: Companies that embrace both building information modeling (BIM) and green building are making conscientious efforts to pursue the synergies between the two, namely the green BIM practice. When done right, project teams find that green BIM can improve project outcomes and facilitate the accomplishment of established sustainability goals. Nevertheless, green BIM remains an emerging trend for the majority of the industry, and its full potential is yet to be explored according to recent market research reports. The hypothesis of this research is that for companies to succeed in practicing green BIM, a holistic approach that strategically plans for BIM execution in green building projects is needed. The BIM project execution planning guidelines (PEPG) are widely adopted today to offer general guidance and a standardized workflow for strategic BIM implementation. The PEPG is not meant to address green building projects in particular and lacks the specificity to do so. However, the popularity of PEPG is conceived as a...


Journal ArticleDOI
26 Oct 2015-PLOS ONE
TL;DR: The Genomics Virtual Laboratory is designed and implemented as a middleware layer of machine images, cloud management tools, and online services that enable researchers to build arbitrarily sized compute clusters on demand, pre-populated with fully configured bioinformatics tools, reference datasets and workflow and visualisation options.
Abstract: Background: Analyzing high throughput genomics data is a complex and compute intensive task, generally requiring numerous software tools and large reference data sets, tied together in successive stages of data transformation and visualisation. A computational platform enabling best practice genomics analysis ideally meets a number of requirements, including: a wide range of analysis and visualisation tools, closely linked to large user and reference data sets; workflow platform(s) enabling accessible, reproducible, portable analyses, through a flexible set of interfaces; highly available, scalable computational resources; and flexibility and versatility in the use of these resources to meet demands and expertise of a variety of users. Access to an appropriate computational platform can be a significant barrier to researchers, as establishing such a platform requires a large upfront investment in hardware, experience, and expertise. Results: We designed and implemented the Genomics Virtual Laboratory (GVL) as a middleware layer of machine images, cloud management tools, and online services that enable researchers to build arbitrarily sized compute clusters on demand, pre-populated with fully configured bioinformatics tools, reference datasets and workflow and visualisation options. The platform is flexible in that users can conduct analyses through web-based (Galaxy, RStudio, IPython Notebook) or command-line interfaces, and add/remove compute nodes and data resources as required. Best-practice tutorials and protocols provide a path from introductory training to practice. The GVL is available on the OpenStack-based Australian Research Cloud (http://nectar.org.au) and the Amazon Web Services cloud. The principles, implementation and build process are designed to be cloud-agnostic. Conclusions: This paper provides a blueprint for the design and implementation of a cloud-based Genomics Virtual Laboratory. We discuss scope, design considerations and technical and logistical constraints, and explore the value added to the research community through the suite of services and resources provided by our implementation.

Journal ArticleDOI
TL;DR: The fully automated collection and merging of partial data sets from a series of cryocooled crystals of biological macromolecules contained on the same support is presented, as are the results of test experiments carried out on various systems.
Abstract: Here, an automated procedure is described to identify the positions of many cryocooled crystals mounted on the same sample holder, to rapidly predict and rank their relative diffraction strengths and to collect partial X-ray diffraction data sets from as many of the crystals as desired. Subsequent hierarchical cluster analysis then allows the best combination of partial data sets, optimizing the quality of the final data set obtained. The results of applying the method developed to various systems and scenarios, including the compilation of a complete data set from tiny crystals of the membrane protein bacteriorhodopsin and the collection of data sets for successful structure determination using the single-wavelength anomalous dispersion technique, are also presented.
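The hierarchical cluster analysis step mentioned above can be sketched generically with SciPy: given a matrix of pairwise dissimilarities between partial data sets, clusters of mutually compatible data sets are identified, and the best-populated cluster would then be merged. The dissimilarity values below are random placeholders; in practice they would come from pairwise comparisons of the partial data sets (for example, correlation of intensities or unit-cell differences), and the 0.6 cut is an arbitrary choice for the example.

# Sketch: hierarchical clustering of partial data sets from many crystals.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n = 8                                           # number of partial data sets
d = rng.uniform(0.1, 1.0, size=(n, n))
dis = (d + d.T) / 2                             # symmetric dissimilarity matrix
np.fill_diagonal(dis, 0.0)

Z = linkage(squareform(dis), method="average")  # agglomerative clustering on condensed distances
labels = fcluster(Z, t=0.6, criterion="distance")
print(labels)                                   # data sets sharing a label would be merged together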

Patent
22 Sep 2015
TL;DR: A workflow management system for managing the storage, retrieval, and transport of items in a warehouse includes a voice-directed mobile terminal and a server computer in communication with the mobile terminal.
Abstract: A workflow management system for managing the storage, retrieval, and transport of items in a warehouse includes a voice-directed mobile terminal. The system also includes a server computer in communication with the voice-directed mobile terminal. The server computer includes a tasking module for transmitting task data to the voice-directed mobile terminal. The server computer also includes a workflow-analysis module for generating, based at least in part upon an analysis of workflow dialog between the voice-directed mobile terminal and the user, performance data relating to the performance of tasks associated with the storage, retrieval, and/or transport of the items.

Journal ArticleDOI
TL;DR: This article uses a custom assembly workflow to optimize consensus genome map assembly, resulting in an assembly equal to the estimated length of the Tribolium castaneum genome and with an N50 of more than 1 Mb.
Abstract: Genome assembly remains an unsolved problem. Assembly projects face a range of hurdles that confound assembly. Thus a variety of tools and approaches are needed to improve draft genomes. We used a custom assembly workflow to optimize consensus genome map assembly, resulting in an assembly equal to the estimated length of the Tribolium castaneum genome and with an N50 of more than 1 Mb. We used this map for super scaffolding the T. castaneum sequence assembly, more than tripling its N50 with the program Stitch. In this article we present software that leverages consensus genome maps assembled from extremely long single molecule maps to increase the contiguity of sequence assemblies. We report the results of applying these tools to validate and improve a 7x Sanger draft of the T. castaneum genome.
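Since this entry reports contiguity as N50, here is the standard calculation as a small generic helper (not code from the paper): N50 is the length such that contigs of that length or longer cover at least half of the total assembly. The contig lengths in the example are invented.

# Sketch: compute the N50 of an assembly from its contig or scaffold lengths.
def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):   # longest first
        running += length
        if running * 2 >= total:                   # reached half of the assembly length
            return length

print(n50([1_000_000, 600_000, 400_000, 150_000, 50_000]))  # 600000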

Proceedings ArticleDOI
17 Apr 2015
TL;DR: Musketeer is built, a workflow manager which can dynamically map front-end workflow descriptions to a broad range of back-end execution engines and speeds up realistic workflows by up to 9x by targeting different execution engines, without requiring any manual effort.
Abstract: Many systems for the parallel processing of big data are available today. Yet, few users can tell by intuition which system, or combination of systems, is "best" for a given workflow. Porting workflows between systems is tedious. Hence, users become "locked in", despite faster or more efficient systems being available. This is a direct consequence of the tight coupling between user-facing front-ends that express workflows (e.g., Hive, SparkSQL, Lindi, GraphLINQ) and the back-end execution engines that run them (e.g., MapReduce, Spark, PowerGraph, Naiad). We argue that the ways that workflows are defined should be decoupled from the manner in which they are executed. To explore this idea, we have built Musketeer, a workflow manager which can dynamically map front-end workflow descriptions to a broad range of back-end execution engines. Our prototype maps workflows expressed in four high-level query languages to seven different popular data processing systems. Musketeer speeds up realistic workflows by up to 9x by targeting different execution engines, without requiring any manual effort. Its automatically generated back-end code comes within 5%--30% of the performance of hand-optimized implementations.
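The decoupling idea in this entry, one front-end workflow description translated by several interchangeable back-end code generators, can be illustrated with a toy sketch. The operator names, back-end labels, and generated strings are invented for illustration; Musketeer's real intermediate representation and generators are far more elaborate.

# Sketch: a toy front-end/back-end split. A workflow is described once as a list of
# relational-style operators, then translated by whichever back-end generator is chosen.
workflow = [("load", "clicks.csv"), ("filter", "country == 'DE'"), ("count", "user_id")]

def to_sql(ops):
    table = ops[0][1].split(".")[0]
    where = next(arg for op, arg in ops if op == "filter")
    key = next(arg for op, arg in ops if op == "count")
    return f"SELECT {key}, COUNT(*) FROM {table} WHERE {where} GROUP BY {key};"

def to_spark(ops):
    table = ops[0][1]
    where = next(arg for op, arg in ops if op == "filter")
    key = next(arg for op, arg in ops if op == "count")
    return f'spark.read.csv("{table}").filter("{where}").groupBy("{key}").count()'

BACKENDS = {"sql_engine": to_sql, "spark": to_spark}

for name, generate in BACKENDS.items():      # same workflow, different execution engines
    print(name, "=>", generate(workflow))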

Journal ArticleDOI
TL;DR: This paper introduces the immovable dataset concept, which constrains the movement of certain datasets due to security and cost considerations, and proposes a new scheduling model in the context of Cloud systems that yields an economical distribution of tasks among the available CSPs (Cloud Service Providers) in the market.

Patent
22 Dec 2015
TL;DR: A method is described for obtaining a primary workflow having a list of activities being performed by a worker, obtaining a surprise activity, comparing a context of the worker to contextual needs of the surprise activity, and interleaving the surprise activity in the list of activities based on a best-fit context.
Abstract: A method includes obtaining primary workflow having a list of activities being performed by a worker; obtaining a surprise activity; comparing a context of the worker to contextual needs of the surprise activity; and interleaving the surprise activity in the list of activities based on a best fit context of the worker.

Journal ArticleDOI
TL;DR: Taxonomies of the cloud workflow scheduling problem and its techniques are proposed based on an analytical review, identifying the aspects and classifications unique to workflow scheduling in the cloud environment in three categories, namely scheduling process, task, and resource.