Author

Paul Dave

Other affiliations: Argonne National Laboratory
Bio: Paul Dave is an academic researcher from the University of Chicago. The author has contributed to research in topics: Workflow & Cloud computing. The author has an h-index of 3 and has co-authored 4 publications receiving 31 citations. Previous affiliations of Paul Dave include Argonne National Laboratory.

Papers
Proceedings ArticleDOI
22 Jul 2013
TL;DR: The Globus Genomics system allows biomedical researchers to perform rapid analysis of large NGS datasets using just a web browser in a fully automated manner, without software installation.
Abstract: We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing (NGS) genomic data. This system is notable for its high degree of end-to-end automation, which encompasses every stage of the data analysis pipeline: initial data access (from a remote sequencing center or database, via the Globus Online file transfer system); on-demand resource acquisition (on Amazon EC2, via the Globus Provision cloud manager); specification, configuration, and reuse of multi-step processing pipelines (via the Galaxy workflow system); and efficient scheduling of these pipelines over many processors (via the Condor scheduler). The system allows biomedical researchers to perform rapid analysis of large NGS datasets using just a web browser, in a fully automated manner and without software installation.
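The automation described above is exposed through Galaxy's REST API; the following is a minimal sketch of driving such a hosted pipeline with the bioblend Python client. The endpoint URL, API key, workflow name, and input file are placeholders, and bioblend itself is an assumption for illustration, since the paper describes browser-based use.

```python
# Hypothetical sketch of driving a hosted Galaxy endpoint with bioblend.
# URL, API key, workflow name, and input file are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

history = gi.histories.create_history(name="NGS run 42")
upload = gi.tools.upload_file("sample_reads.fastq", history["id"])
dataset_id = upload["outputs"][0]["id"]

# Look up a previously defined multi-step pipeline by name and invoke it.
workflow = gi.workflows.get_workflows(name="exome-alignment-pipeline")[0]
invocation = gi.workflows.invoke_workflow(
    workflow["id"],
    inputs={"0": {"src": "hda", "id": dataset_id}},
    history_id=history["id"],
)
print("invocation:", invocation["id"])
```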

20 citations

Journal ArticleDOI
15 Dec 2014-PLOS ONE
TL;DR: The study demonstrates that the seamless integration of bioinformatics resources enables fast and efficient prioritization and characterization of genomic factors and molecular networks contributing to the phenotypes of interest.
Abstract: An essential step in the discovery of molecular mechanisms contributing to disease phenotypes, and in efficient experimental planning, is the development of weighted hypotheses that estimate the functional effects of sequence variants discovered by high-throughput genomics. With the increasing specialization of bioinformatics resources, creating analytical workflows that seamlessly integrate data and bioinformatics tools developed by multiple groups becomes inevitable. Here we present a case study of the use of a distributed analytical environment integrating four complementary specialized resources, namely the Lynx platform, VISTA RViewer, the Developmental Brain Disorders Database (DBDB), and the RaptorX server, for the identification of high-confidence candidate genes contributing to the pathogenesis of spina bifida. The analysis resulted in the prediction and validation of deleterious mutations in the SLC19A placental transporter in mothers of the affected children; these mutations cause narrowing of the outlet channel and therefore lead to a reduced folate permeation rate. The described approach also enabled correct identification of several genes previously shown to contribute to the pathogenesis of spina bifida, and the suggestion of additional genes for experimental validation. The study demonstrates that the seamless integration of bioinformatics resources enables fast and efficient prioritization and characterization of genomic factors and molecular networks contributing to the phenotypes of interest.
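The integration pattern here is evidence aggregation: each resource contributes an independent score for a candidate gene, and a combined score ranks them. A purely illustrative sketch of that shape follows; the weights, scores, and scoring interfaces are invented, not the study's actual Lynx/DBDB/RaptorX scoring.

```python
# Purely illustrative: invented weights and scores showing how independent
# evidence channels can be combined into one gene ranking.
from dataclasses import dataclass

@dataclass
class GeneEvidence:
    gene: str
    deleteriousness: float      # e.g. a normalized variant-effect score, 0..1
    phenotype_relevance: float  # e.g. a DBDB/Lynx-style association score, 0..1
    structural_impact: float    # e.g. a RaptorX-style model impact score, 0..1

def priority(e: GeneEvidence, weights=(0.4, 0.4, 0.2)) -> float:
    """Weighted linear combination of the three evidence channels."""
    w_del, w_phe, w_str = weights
    return (w_del * e.deleteriousness
            + w_phe * e.phenotype_relevance
            + w_str * e.structural_impact)

# Hypothetical inputs; the real study derived these from the four resources.
candidates = [
    GeneEvidence("SLC19A1", 0.91, 0.85, 0.78),
    GeneEvidence("GENE_X", 0.40, 0.30, 0.55),
]
for e in sorted(candidates, key=priority, reverse=True):
    print(f"{e.gene}\t{priority(e):.2f}")
```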

7 citations

Journal ArticleDOI
01 Jan 2012-BMJ Open
TL;DR: A retrospective investigation into tumour response is performed for malignant pleural mesothelioma patients who were treated at the University of Chicago Medical Center with either of two analogous chemotherapy regimens and consented to at least one of two UCMC IRB protocols.
Abstract: Objective An area of need in cancer informatics is the ability to store images in a comprehensive database as part of translational cancer research. To meet this need, we have implemented a novel tandem database infrastructure that facilitates image storage and utilisation. Background We had previously implemented the Thoracic Oncology Program Database Project (TOPDP) database for our translational cancer research needs. While useful for many research endeavours, it is unable to store images, hence our need for an imaging database that could communicate easily with the TOPDP database. Methods The Thoracic Oncology Research Program (TORP) imaging database was designed using the Research Electronic Data Capture (REDCap) platform, which was developed by Vanderbilt University. To demonstrate proof of principle and evaluate utility, we performed a retrospective investigation into tumour response for malignant pleural mesothelioma (MPM) patients who were treated at the University of Chicago Medical Center with either of two analogous chemotherapy regimens and consented to at least one of two UCMC IRB protocols, 9571 and 13473A. Results A cohort of 22 MPM patients was identified using clinical data in the TOPDP database. After measurements were acquired, two representative CT images and 0–35 histological images per patient were successfully stored in the TORP database, along with clinical and demographic data. Discussion We implemented the TORP imaging database to be used in conjunction with our comprehensive TOPDP database. While using two databases requires additional effort, our database infrastructure facilitates more comprehensive translational research. Conclusions The investigation described herein demonstrates the successful implementation of this novel tandem imaging database infrastructure, as well as the potential utility of investigations enabled by it. The data model presented here can be utilised as the basis for further development of other larger, more streamlined databases in the future.
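REDCap exposes a simple HTTP API for record and file import, which is the mechanism a tandem design like this typically builds on. A minimal sketch using standard REDCap API conventions follows; the URL, token, record values, and field name are placeholders, and the TORP instrument layout is assumed.

```python
# Minimal sketch of the REDCap HTTP API pattern the TORP database builds on.
# The URL, token, record values, and field name are placeholders; the actual
# TORP instrument layout is not described in enough detail to reproduce.
import json
import requests

REDCAP_URL = "https://redcap.example.edu/api/"  # placeholder
TOKEN = "PROJECT_API_TOKEN"                     # placeholder

# 1) Import clinical/demographic data as a flat JSON record.
record = [{"record_id": "MPM-001", "regimen": "A", "ct_response": "partial"}]
resp = requests.post(REDCAP_URL, data={
    "token": TOKEN, "content": "record", "action": "import",
    "format": "json", "type": "flat", "data": json.dumps(record),
})
resp.raise_for_status()

# 2) Attach a representative CT image to a file-upload field on that record.
with open("ct_baseline.dcm", "rb") as fh:
    resp = requests.post(
        REDCAP_URL,
        data={"token": TOKEN, "content": "file", "action": "import",
              "record": "MPM-001", "field": "ct_image_1"},  # hypothetical field
        files={"file": fh},
    )
resp.raise_for_status()
```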

4 citations

Proceedings ArticleDOI
17 Nov 2013
TL;DR: This work presents the challenges associated with managing multiple Galaxy instances on the cloud for various research groups using Globus Genomics, a cloud-based platform-as-a-service (PaaS) that provides the Galaxy workflow system as a hosted service along with data management capabilities using Globus Online.
Abstract: Workflow systems play an important role in the analysis of the fast-growing genomics data produced by low-cost next-generation sequencing (NGS) technologies. Many biomedical research groups lack the expertise to assemble and run the sophisticated computational pipelines required for high-throughput analysis of such data. There is an urgent need for services that allow researchers to run their analytical workflows and define their own research methodologies by selecting the tools of their interest. We present the challenges associated with managing multiple Galaxy instances on the cloud for various research groups using Globus Genomics, a cloud-based platform-as-a-service (PaaS) that provides the Galaxy workflow system as a hosted service along with data management capabilities using Globus Online. We describe these challenges, our strategy for addressing them, and a tool for automatically deploying and managing hundreds of analytical tools across the multiple Galaxy instances hosted with Globus Genomics: tools from the public Galaxy Tool Shed, new tools wrapped by our group, and tools wrapped by end users.
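One way to automate tool deployment across many Galaxy instances is bioblend's Tool Shed client, which installs a repository revision through Galaxy's admin API. A sketch under that assumption follows; the instance URLs, admin keys, and changeset revision are placeholders, and the authors' own deployment tool is not public here.

```python
# Sketch of cross-instance tool deployment with bioblend's Tool Shed client.
# Instance URLs, admin keys, and the changeset revision are placeholders.
from bioblend.galaxy import GalaxyInstance

INSTANCES = [  # one (url, admin_key) pair per research group's hosted Galaxy
    ("https://groupA.globusgenomics.example.org", "ADMIN_KEY_A"),
    ("https://groupB.globusgenomics.example.org", "ADMIN_KEY_B"),
]
REPOS = [  # (name, owner, changeset_revision) in the public Galaxy Tool Shed
    ("bwa", "devteam", "0123456789ab"),  # hypothetical changeset id
]

for url, key in INSTANCES:
    gi = GalaxyInstance(url=url, key=key)
    for name, owner, changeset in REPOS:
        gi.toolshed.install_repository_revision(
            tool_shed_url="https://toolshed.g2.bx.psu.edu/",
            name=name,
            owner=owner,
            changeset_revision=changeset,
            install_tool_dependencies=True,
            new_tool_panel_section_label="NGS Tools",
        )
```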

1 citation


Cited by
Journal ArticleDOI
TL;DR: This research work will help researchers find the important characteristics of autonomic resource management, select the most suitable technique for autonomic resource management in a specific application, and identify significant future research directions.
Abstract: As computing infrastructure expands, resource management in a large, heterogeneous, and distributed environment becomes a challenging task. In a cloud environment, with uncertainty and dispersion of resources, one encounters problems of resource allocation caused by factors such as heterogeneity, dynamism, and failures. Unfortunately, existing resource management techniques, frameworks, and mechanisms are insufficient to handle these environments, applications, and resource behaviors. To provide efficient performance of workloads and applications, these characteristics must be addressed effectively. This research presents a broad systematic literature analysis of autonomic resource management in the cloud in general, and of QoS (Quality of Service)-aware autonomic resource management specifically. The current status of autonomic resource management in cloud computing is organized into various categories, and the techniques developed by various industry and academic groups are analyzed. Further, a taxonomy of autonomic resource management in the cloud is presented. This research work will help researchers find the important characteristics of autonomic resource management, select the most suitable technique for a specific application, and identify significant future research directions.
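Most of the autonomic techniques surveyed follow IBM's MAPE-K pattern (Monitor, Analyze, Plan, Execute over a shared Knowledge base). A skeletal sketch of that control loop for QoS-aware scaling follows; the `cloud` interface, thresholds, and the 200 ms latency SLO are hypothetical.

```python
# Skeletal MAPE-K control loop for QoS-aware autoscaling. The `cloud` object,
# thresholds, and the latency SLO are hypothetical.
import time

def monitor(cloud):
    """Monitor: collect current utilization and QoS metrics."""
    return {"cpu": cloud.avg_cpu(), "latency_ms": cloud.p95_latency()}

def analyze(metrics, slo_latency_ms=200):
    """Analyze: compare observations against the QoS target (knowledge)."""
    if metrics["latency_ms"] > slo_latency_ms or metrics["cpu"] > 0.85:
        return "scale_out"
    if metrics["latency_ms"] < slo_latency_ms / 2 and metrics["cpu"] < 0.30:
        return "scale_in"
    return "steady"

def plan(decision):
    """Plan: turn the analysis decision into a concrete resource delta."""
    return {"scale_out": +1, "scale_in": -1, "steady": 0}[decision]

def execute(cloud, delta):
    """Execute: apply the plan through the (hypothetical) cloud API."""
    if delta:
        cloud.resize(cloud.instance_count() + delta)

def autonomic_loop(cloud, period_s=60):
    while True:
        execute(cloud, plan(analyze(monitor(cloud))))
        time.sleep(period_s)
```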

177 citations

Proceedings ArticleDOI
16 Nov 2014
TL;DR: Skyport, an extension to the AWE/Shock data analysis platform that provides scalable workflow execution environments for scientific data in the cloud, greatly reduces the complexity associated with providing the environment necessary to execute complex workflows.
Abstract: Recently, Linux container technology has been gaining attention as it promises to transform the way software is developed and deployed. The portability and ease of deployment makes Linux containers an ideal technology to be used in scientific workflow platforms. Skyport utilizes Docker Linux containers to solve software deployment problems and resource utilization inefficiencies inherent to all existing scientific workflow platforms. As an extension to AWE/Shock, our data analysis platform that provides scalable workflow execution environments for scientific data in the cloud, Skyport greatly reduces the complexity associated with providing the environment necessary to execute complex workflows.
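The core mechanism is one container per workflow step, so each tool's software stack ships with the task instead of being pre-installed on workers. A minimal sketch of that pattern with the docker-py client follows; the image tag, command, and paths are placeholders, and Skyport itself drives containers through AWE/Shock rather than this client.

```python
# Minimal sketch of the one-container-per-step pattern with docker-py.
# Image tag, command, and paths are placeholders.
import docker

client = docker.from_env()

def run_step(image, command, workdir):
    """Execute one pipeline step in an isolated container, mounting a shared
    data directory so inputs and outputs persist on the host."""
    return client.containers.run(
        image,
        command,
        volumes={workdir: {"bind": "/data", "mode": "rw"}},
        working_dir="/data",
        remove=True,  # discard the container; only /data outputs remain
    )

# e.g. index a reference with a containerized aligner (hypothetical image tag):
logs = run_step("biocontainers/bwa:v0.7.17", "bwa index ref.fa", "/scratch/job42")
```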

97 citations

Journal ArticleDOI
TL;DR: A framework that can prioritize disease genes by quantitatively unifying a new deleteriousness measure called BayesDel, an improved assessment of the biological relevance of genes to the disease, a modified linkage analysis, a novel rare‐variant association test, and a converted variant call quality score is described.
Abstract: To interpret genetic variants discovered from next-generation sequencing, integration of heterogeneous information is vital for success. This article describes a framework named PERCH (Polymorphism Evaluation, Ranking, and Classification for a Heritable trait), available at http://BJFengLab.org/. It can prioritize disease genes by quantitatively unifying a new deleteriousness measure called BayesDel, an improved assessment of the biological relevance of genes to the disease, a modified linkage analysis, a novel rare-variant association test, and a converted variant call quality score. It supports data that contain various combinations of extended pedigrees, trios, and case-controls, and allows for a reduced penetrance, an elevated phenocopy rate, liability classes, and covariates. BayesDel is more accurate than PolyPhen2, SIFT, FATHMM, LRT, Mutation Taster, Mutation Assessor, PhyloP, GERP++, SiPhy, CADD, MetaLR, and MetaSVM. The overall approach is faster and more powerful than the existing quantitative method pVAAST, as shown by the simulations of challenging situations in finding the missing heritability of a complex disease. This framework can also classify variants of unknown significance (variants of uncertain significance) by quantitatively integrating allele frequencies, deleteriousness, association, and co-segregation. PERCH is a versatile tool for gene prioritization in gene discovery research and variant classification in clinical genetic testing.
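The unifying idea is combining independent evidence channels on a common quantitative scale. A purely illustrative sketch of that shape, summing per-channel log Bayes factors, follows; all numbers are invented, and PERCH's actual statistics are defined in the paper.

```python
# Purely illustrative numbers: combining independent evidence channels on a
# log scale, the general shape of quantitative unification in PERCH-like tools.
def combined_log_bf(evidence):
    """Sum per-channel log10 Bayes factors (channels assumed independent)."""
    return sum(evidence.values())

variant_evidence = {           # hypothetical per-channel log10 Bayes factors
    "deleteriousness": 1.2,    # e.g. from a BayesDel-like score
    "association": 0.8,        # rare-variant association test
    "linkage": 0.5,            # modified linkage analysis
    "cosegregation": 0.9,      # family co-segregation
    "call_quality": 0.1,       # converted variant call quality
}
score = combined_log_bf(variant_evidence)
print(f"log10 BF = {score:.1f}, posterior odds = {10 ** score:.0f}:1 (prior odds 1:1)")
```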

93 citations

Journal ArticleDOI
TL;DR: The Globus Genomics system allows biomedical researchers to perform rapid analysis of large next‐generation sequencing datasets in a fully automated manner, without software installation or a need for any local computing infrastructure.
Abstract: We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing genomic data. This system achieves a high degree of end-to-end automation that encompasses every stage of data analysis including initial data retrieval from remote sequencing centers or storage via the Globus file transfer system; specification, configuration, and reuse of multistep processing pipelines via the Galaxy workflow system; creation of custom Amazon Machine Images and on-demand resource acquisition via a specialized elastic provisioner on Amazon EC2; and efficient scheduling of these pipelines over many processors via the HTCondor scheduler. The system allows biomedical researchers to perform rapid analysis of large next-generation sequencing datasets in a fully automated manner, without software installation or a need for any local computing infrastructure. We report performance and cost results for some representative workloads.
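The ingest stage described here maps naturally onto Globus's transfer API. A sketch using the modern globus-sdk Python client follows; the SDK postdates the paper, and the access token and endpoint UUIDs are placeholders.

```python
# Sketch of the ingest step with the globus-sdk Python client, which postdates
# the paper. The access token and endpoint UUIDs are placeholders.
import globus_sdk

TOKEN = "..."                         # Globus transfer access token (placeholder)
SEQ_CENTER_EP = "aaaaaaaa-0000-...."  # source endpoint UUID (placeholder)
ANALYSIS_EP = "bbbbbbbb-0000-...."    # destination endpoint UUID (placeholder)

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))

tdata = globus_sdk.TransferData(tc, SEQ_CENTER_EP, ANALYSIS_EP,
                                label="NGS run ingest",
                                sync_level="checksum")  # skip already-copied files
tdata.add_item("/runs/flowcell_A/", "/galaxy/inputs/flowcell_A/", recursive=True)

task = tc.submit_transfer(tdata)
print("transfer task:", task["task_id"])
```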

65 citations

Journal ArticleDOI
TL;DR: The Globus Galaxies platform, a domain-independent, cloud-based science gateway platform, overcomes the gap between the specialized needs of science applications and the capabilities of cloud infrastructures by providing a set of hosted services that directly address the needs of science gateway developers.
Abstract: The use of public cloud computers to host sophisticated scientific data and software is transforming scientific practice by enabling broad access to capabilities previously available only to the few. The primary obstacle to more widespread use of public clouds to host scientific software (‘cloud-based science gateways’) has thus far been the considerable gap between the specialized needs of science applications and the capabilities provided by cloud infrastructures. We describe here a domain-independent, cloud-based science gateway platform, the Globus Galaxies platform, which overcomes this gap by providing a set of hosted services that directly address the needs of science gateway developers. The design and implementation of this platform leverages our several years of experience with Globus Genomics, a cloud-based science gateway that has served more than 200 genomics researchers across 30 institutions. Building on that foundation, we have implemented a platform that leverages the popular Galaxy system for application hosting and workflow execution; Globus services for data transfer, user and group management, and authentication; and a cost-aware elastic provisioning model specialized for public cloud resources. We describe here the capabilities and architecture of this platform, present six scientific domains in which we have successfully applied it, report on user experiences, and analyze the economics of our deployments.
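A cost-aware elastic provisioner of the kind described must weigh current prices against job requirements. An illustrative sketch with boto3 follows, picking the cheapest EC2 spot instance type that satisfies a core count; the candidate types, core table, and min-price policy are assumptions, not the platform's actual logic.

```python
# Illustrative cost-aware choice of an EC2 spot instance type with boto3.
# The candidate table, core counts, and min-price policy are assumptions.
from datetime import datetime, timezone
import boto3

CANDIDATES = {"c4.2xlarge": 8, "c4.4xlarge": 16, "r3.4xlarge": 16}  # type -> vCPUs

def cheapest_instance(min_cores, region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_spot_price_history(
        InstanceTypes=list(CANDIDATES),
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc),
    )
    # Keep only the newest quote per instance type.
    latest = {}
    for q in sorted(resp["SpotPriceHistory"],
                    key=lambda q: q["Timestamp"], reverse=True):
        latest.setdefault(q["InstanceType"], float(q["SpotPrice"]))
    eligible = {t: p for t, p in latest.items() if CANDIDATES[t] >= min_cores}
    return min(eligible.items(), key=lambda tp: tp[1])  # (type, $/hour)

print(cheapest_instance(min_cores=16))
```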

45 citations