Author

Alex Rodriguez

Other affiliations: Argonne National Laboratory
Bio: Alex Rodriguez is an academic researcher from the University of Chicago. The author has contributed to research in the topics of Workflow and Cloud computing, has an h-index of 12, and has co-authored 22 publications receiving 597 citations. Previous affiliations of Alex Rodriguez include Argonne National Laboratory.

Papers
Journal ArticleDOI
TL;DR: This work presents FIGfams, a new collection of over 100 000 protein families that are the product of manual curation and close strain comparison; associated with each FIGfam is a two-tiered, rapid, accurate decision procedure that determines family membership for new proteins.
Abstract: We present FIGfams, a new collection of over 100 000 protein families that are the product of manual curation and close strain comparison. The manual curation is carried out using the Subsystem approach, ensuring a previously unattained degree of throughput and consistency. FIGfams are based on over 950 000 manually annotated proteins from many hundreds of Bacteria and Archaea. Associated with each FIGfam is a two-tiered, rapid, accurate decision procedure to determine family membership for new proteins. FIGfams are freely available under an open source license and can be downloaded at ftp://ftp.theseed.org/FIGfams/. The web site for FIGfams is http://www.theseed.org/wiki/FIGfams/.

148 citations
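The two-tiered decision procedure is described above only at a high level. The sketch below shows one plausible shape for such a test: a cheap k-mer prefilter (tier 1) that screens out obvious non-members, followed by a costlier similarity check (tier 2). Every function name, threshold, and sequence here is an illustrative assumption, not the published FIGfams procedure.

```python
# Hypothetical two-tier family-membership test (NOT the FIGfams algorithm).
# Tier 1 is a fast k-mer prefilter; tier 2 is a more careful similarity check.

def kmers(seq: str, k: int = 8) -> set[str]:
    """Set of length-k substrings of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def tier1_prefilter(query: str, family_kmers: set[str], min_hits: int = 5) -> bool:
    """Tier 1: pass to the expensive check only if enough k-mers match."""
    return len(kmers(query) & family_kmers) >= min_hits

def tier2_identity(query: str, representative: str) -> float:
    """Tier 2: crude positional identity (a stand-in for a real aligner)."""
    n = min(len(query), len(representative))
    return sum(a == b for a, b in zip(query, representative)) / n if n else 0.0

def is_family_member(query: str, family_kmers: set[str], representative: str,
                     min_identity: float = 0.7) -> bool:
    return (tier1_prefilter(query, family_kmers)
            and tier2_identity(query, representative) >= min_identity)

rep = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"               # invented sequence
print(is_family_member(rep[:-1] + "A", kmers(rep), rep))  # True: near-identical
```

The point of the tiering is that the k-mer set test is cheap enough to run against every family, so the expensive comparison only runs for a handful of candidates.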

Proceedings ArticleDOI
04 Jun 2004
TL;DR: The Grid2003 Project has deployed a multi-virtual-organization, application-driven grid laboratory that has sustained for several months the production-level services required by physics experiments of the Large Hadron Collider at CERN, the Sloan Digital Sky Survey project, the gravitational-wave search experiment LIGO, and the BTeV experiment at Fermilab, as well as by applications in molecular structure analysis and genome analysis and by computer science research projects in areas such as job and data scheduling.
Abstract: The Grid2003 Project has deployed a multi-virtual-organization, application-driven grid laboratory ("Grid3") that has sustained for several months the production-level services required by physics experiments of the Large Hadron Collider at CERN (ATLAS and CMS), the Sloan Digital Sky Survey project, the gravitational-wave search experiment LIGO, and the BTeV experiment at Fermilab, as well as applications in molecular structure analysis and genome analysis, and computer science research projects in such areas as job and data scheduling. The deployed infrastructure has been operating since November 2003 with 27 sites, a peak of 2800 processors, workloads from 10 different applications exceeding 1300 simultaneous jobs, and data transfers among sites of greater than 2 TB/day. We describe the principles that have guided the development of this unique infrastructure and the practical experiences that have resulted from its creation and use. We discuss application requirements for grid services deployment and configuration, monitoring infrastructure, application performance, metrics, and operational experiences. We also summarize lessons learned.

138 citations
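Grid3-era job submission typically flowed through Condor-G to Globus GRAM gatekeepers. As a rough illustration of that pattern only, here is a grid-universe submission via the HTCondor Python bindings; the gatekeeper host is a placeholder, the "gt2 ..." resource string reflects the GRAM syntax of that era, the exact binding API varies across HTCondor versions, and none of this is taken from the Grid2003 codebase.

```python
# Sketch of a Condor-G-style grid job submission using the HTCondor Python
# bindings. Host, script, and counts are hypothetical.
import htcondor

job = htcondor.Submit({
    "universe": "grid",
    "grid_resource": "gt2 gatekeeper.example.edu/jobmanager-condor",  # placeholder
    "executable": "simulate.sh",          # hypothetical application script
    "arguments": "--events 1000",
    "output": "sim.$(Cluster).$(Process).out",
    "error":  "sim.$(Cluster).$(Process).err",
    "log":    "sim.log",
})

schedd = htcondor.Schedd()              # local submit point
result = schedd.submit(job, count=100)  # fan out 100 identical jobs
print("submitted cluster", result.cluster())
```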

Journal ArticleDOI
TL;DR: The Globus Genomics system allows biomedical researchers to perform rapid analysis of large next‐generation sequencing datasets in a fully automated manner, without software installation or a need for any local computing infrastructure.
Abstract: We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing genomic data. This system achieves a high degree of end-to-end automation that encompasses every stage of data analysis including initial data retrieval from remote sequencing centers or storage via the Globus file transfer system; specification, configuration, and reuse of multistep processing pipelines via the Galaxy workflow system; creation of custom Amazon Machine Images and on-demand resource acquisition via a specialized elastic provisioner on Amazon EC2; and efficient scheduling of these pipelines over many processors via the HTCondor scheduler. The system allows biomedical researchers to perform rapid analysis of large next-generation sequencing datasets in a fully automated manner, without software installation or a need for any local computing infrastructure. We report performance and cost results for some representative workloads.

65 citations
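The end-to-end automation described above chains several real services. A minimal sketch of the two outer stages, assuming placeholder endpoint IDs, tokens, and workflow/dataset IDs: a Globus transfer of raw reads via the globus-sdk, then invocation of a pre-built Galaxy workflow via bioblend. This is illustrative glue code, not the Globus Genomics implementation itself.

```python
# Stage 1: pull raw reads from a remote endpoint; stage 2: run a Galaxy
# workflow over them. All IDs, URLs, and credentials are placeholders.
import globus_sdk
from bioblend.galaxy import GalaxyInstance

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN"))
tdata = globus_sdk.TransferData(tc, "SRC-ENDPOINT-UUID", "DST-ENDPOINT-UUID")
tdata.add_item("/reads/sample1.fastq", "/galaxy/inputs/sample1.fastq")
task_id = tc.submit_transfer(tdata)["task_id"]
tc.task_wait(task_id, timeout=3600)   # block until the transfer completes

gi = GalaxyInstance(url="https://galaxy.example.org", key="GALAXY_API_KEY")
history = gi.histories.create_history(name="sample1-analysis")
gi.workflows.invoke_workflow(
    "WORKFLOW_ID",                                     # placeholder workflow
    inputs={"0": {"src": "hda", "id": "DATASET_ID"}},  # placeholder dataset
    history_id=history["id"],
)
```

In the system described above, the provisioning and HTCondor scheduling layers sit between these two stages, acquiring EC2 capacity on demand before the workflow's tools execute.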

Journal ArticleDOI
TL;DR: This work presents a domain-independent, cloud-based science gateway platform, the Globus Galaxies platform, which bridges the gap between the specialized needs of science applications and the capabilities of cloud infrastructures by providing a set of hosted services aimed directly at science gateway developers.
Abstract: The use of public cloud computers to host sophisticated scientific data and software is transforming scientific practice by enabling broad access to capabilities previously available only to the few. The primary obstacle to more widespread use of public clouds to host scientific software ("cloud-based science gateways") has thus far been the considerable gap between the specialized needs of science applications and the capabilities provided by cloud infrastructures. We describe here a domain-independent, cloud-based science gateway platform, the Globus Galaxies platform, which overcomes this gap by providing a set of hosted services that directly address the needs of science gateway developers. The design and implementation of this platform leverages our several years of experience with Globus Genomics, a cloud-based science gateway that has served more than 200 genomics researchers across 30 institutions. Building on that foundation, we have implemented a platform that leverages the popular Galaxy system for application hosting and workflow execution; Globus services for data transfer, user and group management, and authentication; and a cost-aware elastic provisioning model specialized for public cloud resources. We describe here the capabilities and architecture of this platform, present six scientific domains in which we have successfully applied it, report on user experiences, and analyze the economics of our deployments.

45 citations
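The "cost-aware elastic provisioning model" is described above only at the architecture level. The toy sketch below captures the basic decision such a provisioner must make, scaling workers to queue depth while capping hourly spend; all rates, thresholds, and the pool abstraction are invented for illustration.

```python
# Toy cost-aware scaling decision; numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class PoolState:
    queued_jobs: int
    running_workers: int

HOURLY_RATE = 0.10      # assumed per-worker-hour price (e.g., a spot bid)
BUDGET_PER_HOUR = 2.00  # assumed spending cap
JOBS_PER_WORKER = 4     # assumed healthy queue depth per worker

def provisioning_decision(state: PoolState) -> int:
    """Worker delta to apply (positive = launch, negative = terminate),
    clamped so the pool never costs more than the hourly budget."""
    wanted = -(-state.queued_jobs // JOBS_PER_WORKER)   # ceiling division
    affordable = int(BUDGET_PER_HOUR / HOURLY_RATE)
    target = min(wanted, affordable)
    return target - state.running_workers

# 30 queued jobs want ceil(30/4) = 8 workers; 8 is under the 20 we can
# afford, and 3 are already running, so launch 5 more.
print(provisioning_decision(PoolState(queued_jobs=30, running_workers=3)))  # 5
```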

Journal ArticleDOI
TL;DR: An analytical workflow is developed to identify L1 polymorphic insertions with next-generation sequencing (NGS) using data from a family in which schizophrenia (SZ) segregates, showing the utility of NGS to uncover a neglected type of genetic variant with the potential to influence the risk of schizophrenia, as SNVs and CNVs do.
Abstract: Recent studies show that human-specific LINE1s (L1HS) play a key role in the development of the central nervous system (CNS) and its disorders, and that their transpositions within the human genome are more common than previously thought. Many polymorphic L1HS, that is, those present or absent across individuals, are not annotated in the current release of the genome and are customarily termed "non-reference L1s." We developed an analytical workflow to identify L1 polymorphic insertions with next-generation sequencing (NGS), using data from a family in which schizophrenia (SZ) segregates. Our workflow exploits two independent algorithms to detect non-reference L1 insertions, performs local de novo alignment of the regions harboring predicted L1 insertions, and resolves the L1 subfamily designation from the de novo assembled sequence. We found 110 non-reference L1 polymorphic loci exhibiting Mendelian inheritance, the vast majority of which are already reported in dbRIP and/or euL1db, thus confirming their status as non-reference L1 polymorphic insertions. Four previously undetected L1 polymorphic loci were confirmed by PCR amplification and direct sequencing of the insert. A large fraction of our non-reference L1s is located within the open reading frames of protein-coding genes that belong to pathways already implicated in the pathogenesis of schizophrenia. The finding of these polymorphic variants among SZ offspring is intriguing and suggestive of a putative pathogenic role. Our data show the utility of NGS to uncover L1 polymorphic insertions, a neglected type of genetic variant with the potential to influence the risk of developing schizophrenia, as SNVs and CNVs do.

33 citations
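One concrete step in the workflow above is checking that candidate insertions segregate in a Mendelian fashion. A minimal sketch of such a filter on presence/absence calls, assuming the simplest rule (a carrier child needs at least one carrier parent) and ignoring de novo insertions and genotyping error, both of which a real pipeline must handle; sample names and the call format are hypothetical.

```python
# Keep a candidate non-reference L1 locus only if its presence/absence
# pattern is consistent with Mendelian transmission in every trio.

def mendelian_consistent(calls: dict[str, bool],
                         trios: list[tuple[str, str, str]]) -> bool:
    """calls maps sample -> insertion detected; each trio is (child, mother, father)."""
    for child, mother, father in trios:
        if calls[child] and not (calls[mother] or calls[father]):
            return False   # carrier child with two non-carrier parents
    return True

# Hypothetical calls at one candidate L1 locus.
locus_calls = {"mother": True, "father": False, "child1": True, "child2": False}
trios = [("child1", "mother", "father"), ("child2", "mother", "father")]
print(mendelian_consistent(locus_calls, trios))  # True: child1's copy traces to mother
```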


Cited by
Journal ArticleDOI
TL;DR: This work describes the interconnectedness of the SEED database and RAST, the RAST annotation pipeline, and updates to both resources.
Abstract: In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources.

3,415 citations
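The annotation cycle described above (call genes, annotate by comparison to the FIGfam collection, feed public genomes back into the families) can be summarized schematically. Everything below, the toy gene caller and the k-mer classifier included, is a stand-in for the real RAST components, not their actual behavior.

```python
# Schematic of the annotation cycle with toy stand-ins for real components.
# Sequences, family contents, and thresholds are invented.

def call_genes(genome: str) -> list[str]:
    """Toy gene caller: '*'-separated segments stand in for called proteins."""
    return [orf for orf in genome.split("*") if len(orf) >= 8]

def best_figfam(protein: str, figfams: dict[str, set[str]], k: int = 8) -> str | None:
    """Toy classifier: pick the family sharing the most k-mers with the protein."""
    query = {protein[i:i + k] for i in range(len(protein) - k + 1)}
    hits = {fam: len(query & kms) for fam, kms in figfams.items()}
    best = max(hits, key=hits.get, default=None)
    return best if best and hits[best] > 0 else None

def annotate(genome: str, figfams: dict[str, set[str]], is_public: bool) -> dict[str, str]:
    annotations = {}
    for protein in call_genes(genome):
        family = best_figfam(protein, figfams)
        annotations[protein] = family or "hypothetical protein"
        if is_public and family:   # public genomes grow the family collection
            figfams[family] |= {protein[i:i + 8] for i in range(len(protein) - 7)}
    return annotations

figfams = {"FIG00001": {"MKTAYIAK", "KTAYIAKQ", "TAYIAKQR"}}
print(annotate("MKTAYIAKQR*SHORT*MKTAYIAKZZ", figfams, is_public=True))
```

The feedback in the last step of `annotate` is the schematic analogue of the cycle the abstract describes: each public genome enlarges the families, which in turn improves annotation of the next submission.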

Journal ArticleDOI
TL;DR: The open-source metagenomics RAST service provides a new paradigm for the annotation and analysis of metagenomes that is stable, extensible, and freely available to all researchers.
Abstract: Random community genomes (metagenomes) are now commonly used to study microbes in different environments. Over the past few years, the major challenge associated with metagenomics shifted from generating to analyzing sequences. High-throughput, low-cost next-generation sequencing has provided access to metagenomics to a wide range of researchers. A high-throughput pipeline has been constructed to provide high-performance computing to all researchers interested in using metagenomics. The pipeline produces automated functional assignments of sequences in the metagenome by comparing both protein and nucleotide databases. Phylogenetic and functional summaries of the metagenomes are generated, and tools for comparative metagenomics are incorporated into the standard views. User access is controlled to ensure data privacy, but the collaborative environment underpinning the service provides a framework for sharing datasets between multiple users. In the metagenomics RAST, all users retain full control of their data, and everything is available for download in a variety of formats. The open-source metagenomics RAST service provides a new paradigm for the annotation and analysis of metagenomes. With built-in support for multiple data sources and a back end that houses abstract data types, the metagenomics RAST is stable, extensible, and freely available to all researchers. This service has removed one of the primary bottlenecks in metagenome sequence analysis – the availability of high-performance computing for annotating the data. http://metagenomics.nmpdr.org

3,322 citations
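The pipeline above assigns functions by comparing reads against both protein and nucleotide databases. A toy sketch of that dual-database, best-hit step, with databases, sequences, and scores invented purely for illustration; the real service uses full similarity searches, not dictionary lookups.

```python
# Toy functional assignment: compare each read against two databases
# (stubbed as dicts) and keep the stronger hit, then summarize.
from collections import Counter

PROTEIN_DB = {"ATGCCG": ("DNA polymerase", 1e-30)}   # seq -> (function, e-value)
NUCLEOTIDE_DB = {"ATGCCG": ("16S rRNA", 1e-10)}

def assign_function(read: str) -> str | None:
    hits = [db[read] for db in (PROTEIN_DB, NUCLEOTIDE_DB) if read in db]
    if not hits:
        return None
    return min(hits, key=lambda h: h[1])[0]   # lower e-value = stronger hit

reads = ["ATGCCG", "ATGCCG", "TTTTTT"]
summary = Counter(f for f in map(assign_function, reads) if f)
print(summary)   # Counter({'DNA polymerase': 2})
```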

Journal ArticleDOI
TL;DR: This work presents the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines, offering a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job.
Abstract: The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

1,666 citations
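The modular, batch-oriented design described above lends itself to a pipeline-as-a-list-of-stages view. The sketch below illustrates that idea in the abstract; the stage names and genome record are hypothetical, not actual RASTtk commands or data structures.

```python
# A pipeline is an ordered list of stages, each transforming a genome
# record; batches map the same pipeline over many genomes.
from typing import Callable

Genome = dict                       # minimal record: {"id": ..., "features": [...]}
Stage = Callable[[Genome], Genome]

def call_rnas(g: Genome) -> Genome:
    g["features"].append("rna-genes")      # placeholder for a real RNA caller
    return g

def call_cds(g: Genome) -> Genome:
    g["features"].append("protein-genes")  # placeholder for a CDS caller
    return g

def run_pipeline(genome: Genome, stages: list[Stage]) -> Genome:
    for stage in stages:
        genome = stage(genome)
    return genome

pipeline: list[Stage] = [call_rnas, call_cds]          # customize per project
batch = [{"id": "g1", "features": []}, {"id": "g2", "features": []}]
results = [run_pipeline(g, pipeline) for g in batch]   # batch submission
print(results[0]["features"])                          # ['rna-genes', 'protein-genes']
```

Customization then amounts to reordering, dropping, or inserting stages in the list, which is the spirit of the modularity the abstract describes.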

Book ChapterDOI
30 Nov 2005
TL;DR: This chapter summarizes the principal characteristics of the latest release, the Web services-based GT4, which provides significant improvements over previous releases in terms of robustness, performance, usability, documentation, standards compliance, and functionality.
Abstract: The Globus Toolkit (GT) has been developed since the late 1990s to support the development of service-oriented distributed computing applications and infrastructures. Core GT components address, within a common framework, basic issues relating to security, resource access, resource management, data movement, resource discovery, and so forth. These components enable a broader “Globus ecosystem” of tools and components that build on, or interoperate with, core GT functionality to provide a wide range of useful application-level functions. These tools have in turn been used to develop a wide range of both “Grid” infrastructures and distributed applications. I summarize here the principal characteristics of the latest release, the Web services-based GT4, which provides significant improvements over previous releases in terms of robustness, performance, usability, documentation, standards compliance, and functionality.

1,509 citations