Proceedings ArticleDOI

Distributed tools deployment and management for multiple galaxy instances in globus genomics

TL;DR: This work presents the challenges associated with managing multiple Galaxy instances on the cloud for various research groups using Globus Genomics, a cloud-based platform-as-a-service (PaaS) that provides the Galaxy workflow system as a hosted service along with data management capabilities using Globus Online.
Abstract: Workflow systems play an important role in the analysis of the fast-growing genomics data produced by low-cost next-generation sequencing (NGS) technologies. Many biomedical research groups lack the expertise to assemble and run the sophisticated computational pipelines required for high-throughput analysis of such data. There is an urgent need for services that let researchers run their analytical workflows and define their own research methodologies by selecting the tools of interest to them. We present the challenges associated with managing multiple Galaxy instances on the cloud for various research groups using Globus Genomics, a cloud-based platform-as-a-service (PaaS) that provides the Galaxy workflow system as a hosted service along with data management capabilities using Globus Online. We describe the unique challenges, our strategy, and a tool for automatically deploying and managing hundreds of analytical tools coming from the public Galaxy Tool Shed, new tools wrapped by our group, and tools wrapped by end users across multiple Galaxy instances hosted with Globus Genomics.
Citations
Journal ArticleDOI
TL;DR: A case study of a practical solution that simplifies terabyte scale data handling and provides advanced tools for NGS data analysis using the Globus Genomics system, which is an enhanced Galaxy workflow system made available as a service.
Abstract: Next-generation sequencing (NGS) technologies produce massive amounts of data requiring a powerful computational infrastructure, high-quality bioinformatics software, and skilled personnel to operate the tools. We present a case study of a practical solution to this data management and analysis challenge that simplifies terabyte-scale data handling and provides advanced tools for NGS data analysis. These capabilities are implemented using the “Globus Genomics” system, which is an enhanced Galaxy workflow system made available as a service that offers users the capability to process and transfer data easily, reliably and quickly to address end-to-end NGS analysis requirements. The Globus Genomics system is built on Amazon's cloud computing infrastructure. The system takes advantage of elastic scaling of compute resources to run multiple workflows in parallel, and it also helps meet the scale-out analysis needs of modern translational genomics research.

31 citations


Cites background from "Distributed tools deployment and ma..."

  • ...NGS analyses are well-suited for the cloud since data upload (of input files) to an Amazon cloud instance does not incur any extra charge and data download (of output files) becomes relatively inexpensive as only a small percentage of output is needed for downstream analysis [17,18]....

    [...]

References
Book ChapterDOI
TL;DR: This chapter will first outline the principle of this single-molecule, real-time (SMRT) DNA sequencing method, followed by descriptions of its underlying components and typical sequencing run conditions.
Abstract: Pacific Biosciences has developed a method for real-time sequencing of single DNA molecules (Eid et al., 2009), with intrinsic sequencing rates of several bases per second and read lengths into the kilobase range. Conceptually, this sequencing approach is based on eavesdropping on the activity of DNA polymerase carrying out template-directed DNA polymerization. Performed in a highly parallel operational mode, sequential base additions catalyzed by each polymerase are detected with terminal phosphate-linked, fluorescence-labeled nucleotides. This chapter will first outline the principle of this single-molecule, real-time (SMRT) DNA sequencing method, followed by descriptions of its underlying components and typical sequencing run conditions. Two examples are provided which illustrate that, in addition to the DNA sequence, the dynamics of DNA polymerization from each enzyme molecule is directly accessible: the determination of base-specific kinetic parameters from single-molecule sequencing reads, and the characterization of DNA synthesis rate heterogeneities.

1,199 citations


"Distributed tools deployment and ma..." refers background in this paper

  • ...The availability of next-generation sequencing (NGS) and third-generation sequencing methodologies [1] has drastically reduced...

    [...]

Journal ArticleDOI
TL;DR: How to master the different types of computational environments that exist — such as cloud and heterogeneous computing — to successfully tackle the authors' big data problems is discussed.
Abstract: Today we can generate hundreds of gigabases of DNA and RNA sequencing data in a week for less than US$5,000. The astonishing rate of data generation by these low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry. Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that are generated by these technologies, which in turn requires us to adopt advances in informatics. Here we discuss how we can master the different types of computational environments that exist — such as cloud and heterogeneous computing — to successfully tackle our big data problems.

612 citations


"Distributed tools deployment and ma..." refers background in this paper

  • ...This has led to research groups generating and handling enormous amounts of data [3]....

    [...]

Journal ArticleDOI
22 Jan 2010-Science
TL;DR: Papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension, and mathematics papers are expected to contain a proof complete enough to allow knowledgeable readers to fill in any details.
Abstract: As use of computation in research grows, new tools are needed to expand recording, reporting, and reproduction of methods and data.

276 citations

Journal ArticleDOI
TL;DR: The author argues that similar methods can be used to overcome the complexities inherent in increasingly data-intensive, computational, and collaborative scientific research.
Abstract: Many businesses today save time and money, and increase their agility, by outsourcing mundane IT tasks to cloud providers. The author argues that similar methods can be used to overcome the complexities inherent in increasingly data-intensive, computational, and collaborative scientific research. He describes Globus Online, a system that he and his colleagues are developing to realize this vision.

269 citations


"Distributed tools deployment and ma..." refers background in this paper

  • ...Globus Genomics brings together several powerful components such as Galaxy, Globus Online and elastic computational infrastructure provided by Amazon Web Services (AWS) to build a cloud-based platform as a service (PaaS) for researchers to easily execute their computation research tasks and manage their research data....

    [...]

  • ...Keywords Globus Genomics, Globus Online, Galaxy, Galaxy Tool Shed, data transfer, data management, grid, cloud, next-generation sequencing, translational medicine....

    [...]

  • ...Globus Genomics uses commodity cloud compute resources such as Amazon Web services (AWS) [11] and provides a hosted service integrating Galaxy with Globus Online [12]....

    [...]

  • ...[11] http://aws.amazon.com [12] Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services....

    [...]

  • ...Experiences in building a next-generation sequencing analysis service using Galaxy, Globus Online and Amazon web service....

    [...]

Proceedings ArticleDOI
22 Jul 2013
TL;DR: The Globus Genomics system allows biomedical researchers to perform rapid analysis of large NGS datasets using just a web browser in a fully automated manner, without software installation.
Abstract: We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing (NGS) genomic data. This system is notable for its high degree of end-to-end automation, which encompasses every stage of the data analysis pipeline from initial data access (from remote sequencing center or database, by the Globus Online file transfer system) to on-demand resource acquisition (on Amazon EC2, via the Globus Provision cloud manager); specification, configuration, and reuse of multi-step processing pipelines (via the Galaxy workflow system); and efficient scheduling of these pipelines over many processors (via the Condor scheduler). The system allows biomedical researchers to perform rapid analysis of large NGS datasets using just a web browser in a fully automated manner, without software installation.

20 citations