The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.
Shibu Yooseph, Granger G. Sutton, Douglas B. Rusch, Aaron L. Halpern, Shannon J. Williamson, Karin A. Remington, Jonathan A. Eisen, Karla B. Heidelberg, Gerard Manning, Weizhong Li, Lukasz Jaroszewski, Piotr Cieplak, Christopher S. Miller, Huiying Li, Susan T. Mashiyama, Marcin P. Joachimiak, Christopher van Belle, John-Marc Chandonia, David A W Soergel, Yufeng Zhai, Kannan Natarajan, Shaun W. Lee, Benjamin J. Raphael, Vineet Bafna, Robert Friedman, Steven E. Brenner, Adam Godzik, David Eisenberg, Jack E. Dixon, Susan S. Taylor, Robert L. Strausberg, Marvin Frazier, J. Craig Venter
TLDR
This work used sequence similarity clustering to explore proteins in a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling sequences. The predicted proteins add a great deal of diversity to known protein families and shed light on their evolution.
Abstract:
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
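To make the two ideas in the abstract concrete (similarity clustering, and counting how many new families appear as sequences are added), here is a minimal Python sketch. It is not the authors' pipeline: it swaps in a toy k-mer Jaccard similarity for real alignment-based comparison, and the function names, the threshold of 0.5, the k-mer length of 3, and the example sequences are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_incrementally(
    seqs: List[str], threshold: float = 0.5, k: int = 3
) -> Tuple[List[List[int]], List[int]]:
    """Greedy incremental clustering in input order.

    Each sequence joins the first cluster whose representative (the
    cluster's founding sequence) is similar enough; otherwise it founds
    a new cluster. Also returns the running cluster count after each
    sequence, a toy analogue of a family-discovery curve.
    """
    reps: List[Set[str]] = []       # k-mer set of each cluster representative
    clusters: List[List[int]] = []  # indices of member sequences per cluster
    curve: List[int] = []           # number of clusters after each addition
    for i, seq in enumerate(seqs):
        ks = kmers(seq, k)
        for c, rep in enumerate(reps):
            if jaccard(ks, rep) >= threshold:
                clusters[c].append(i)
                break
        else:
            # No representative was similar enough: start a new cluster.
            reps.append(ks)
            clusters.append([i])
        curve.append(len(clusters))
    return clusters, curve


# Hypothetical sequences, invented for this example only.
example = ["MKTAYIAKQR", "MKTAYIAKQK", "GGSSLLPPWW", "MKTAYIAKQR"]
clusters, curve = cluster_incrementally(example)
print(clusters)
print(curve)
```

A curve that keeps rising as more sequences are added would correspond, in this toy setting, to the abstract's observation that new families continue to be discovered; a curve that flattens would suggest saturation.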