Journal ArticleDOI

Ecological Patterns of nifH Genes in Four Terrestrial Climatic Zones Explored with Targeted Metagenomics Using FrameBot, a New Informatics Tool

01 Nov 2013-Mbio (American Society for Microbiology)-Vol. 4, Iss: 5
TL;DR: FrameBot, a tool for frameshift correction and nearest-neighbor classification, was developed to accurately detect and correct frameshifts caused by indel sequencing errors, and its accuracy was compared to that of two other rapid frameshift correction tools.
Abstract: Biological nitrogen fixation is an important component of sustainable soil fertility and a key component of the nitrogen cycle. We used targeted metagenomics to study the nitrogen fixation-capable terrestrial bacterial community by targeting the gene for nitrogenase reductase (nifH). We obtained 1.1 million nifH 454 amplicon sequences from 222 soil samples collected from 4 National Ecological Observatory Network (NEON) sites in Alaska, Hawaii, Utah, and Florida. To accurately detect and correct frameshifts caused by indel sequencing errors, we developed FrameBot, a tool for frameshift correction and nearest-neighbor classification, and compared its accuracy to that of two other rapid frameshift correction tools. We found FrameBot was, in general, more accurate as long as a reference protein sequence with 80% or greater identity to a query was available, as was the case for virtually all nifH reads for the 4 NEON sites. Frameshifts were present in 12.7% of the reads. Those nifH sequences related to the Proteobacteria phylum were most abundant, followed by those for Cyanobacteria in the Alaska and Utah sites. Predominant genera with nifH sequences similar to reads included Azospirillum, Bradyrhizobium, and Rhizobium, the latter two without obvious plant hosts at the sites. Surprisingly, 80% of the sequences had greater than 95% amino acid identity to known nifH gene sequences. These samples were grouped by site and correlated with soil environmental factors, especially drainage, light intensity, mean annual temperature, and mean annual precipitation. FrameBot was tested successfully on three ecofunctional genes but should be applicable to any.
IMPORTANCE High-throughput phylogenetic analysis of microbial communities using rRNA-targeted sequencing is now commonplace; however, such data often allow little inference with respect to either the presence or the diversity of genes involved in most important ecological processes. To study the gene pool for these processes, it is more straightforward to assess the genes directly responsible for the ecological function (ecofunctional genes). However, analyzing these genes involves technical challenges beyond those seen for rRNA. In particular, frameshift errors cause garbled downstream protein translations. Our FrameBot tool described here both corrects frameshift errors in query reads and determines their closest matching protein sequences in a set of reference sequences. We validated this new tool with sequences from defined communities and demonstrated the tool’s utility on nifH gene fragments sequenced from soils in well-characterized and major terrestrial ecosystem types.
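The garbling effect of a frameshift, and one naive way to undo it against a reference protein, can be illustrated with a minimal Python sketch. This is not FrameBot's algorithm, whose internals the abstract does not describe; it is a brute-force stand-in that assumes at most one indel per read. The toy reference sequence, the function names `translate`, `identity`, and `repair_single_indel`, and the `N`/`X` placeholders are all hypothetical choices made for illustration.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate in frame 0; codons containing unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def identity(a, b):
    """Fraction of position-wise matches, relative to the longer sequence."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b), 1)


def repair_single_indel(read, ref_protein):
    """Brute force (assumption: at most one indel in the read).

    Tries every single-base deletion and every single 'N' insertion, and
    keeps the candidate whose translation best matches ref_protein.
    Returns (best_identity, best_dna).
    """
    best = (identity(translate(read), ref_protein), read)
    for i in range(len(read) + 1):
        candidates = [read[:i] + "N" + read[i:]]  # compensates a lost base
        if i < len(read):
            candidates.append(read[:i] + read[i + 1:])  # drops an extra base
        for cand in candidates:
            score = identity(translate(cand), ref_protein)
            if score > best[0]:
                best = (score, cand)
    return best


# Toy reference (hypothetical, not a real nifH fragment).
REF_DNA = "ATGGCTCGTCAAATCGCGTTCTACGGTAAAGGT"
# Simulate a sequencing error: the base at index 7 is lost.
READ = REF_DNA[:7] + REF_DNA[8:]

print(translate(REF_DNA))  # MARQIAFYGKG
print(translate(READ))     # MALKSRSTVK  (garbled after the deletion)
score, fixed = repair_single_indel(READ, translate(REF_DNA))
print(translate(fixed))    # MAXQIAFYGKG (frame restored, one unknown residue)
```

A single missing base changes every downstream codon, which is why identity to the reference collapses until the frame is restored. The exhaustive search here is only practical for short reads and a single known reference; a real tool must also pick the reference and handle multiple indels.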
Citations
Journal ArticleDOI
TL;DR: The results illustrate the importance of parameter tuning for optimizing classifier performance, and recommendations are made regarding parameter choices for these classifiers under a range of standard operating conditions.
Abstract: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. We present q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a QIIME 2 plugin containing several novel machine-learning and alignment-based methods for taxonomy classification. We evaluated and optimized several commonly used classification methods implemented in QIIME 1 (RDP, BLAST, UCLUST, and SortMeRNA) and several new methods implemented in QIIME 2 (a scikit-learn naive Bayes machine-learning classifier and alignment-based taxonomy consensus methods based on VSEARCH and BLAST+) for classification of bacterial 16S rRNA and fungal ITS marker-gene amplicon sequence data. The naive-Bayes, BLAST+-based, and VSEARCH-based classifiers implemented in QIIME 2 meet or exceed the species-level accuracy of other commonly used methods designed for classification of marker gene sequences that were evaluated in this work. These evaluations, based on 19 mock communities and error-free sequence simulations, including classification of simulated “novel” marker-gene sequences, are available in our extensible benchmarking framework, tax-credit (https://github.com/caporaso-lab/tax-credit-data). Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for these classifiers under a range of standard operating conditions. q2-feature-classifier and tax-credit are both free, open-source, BSD-licensed packages available on GitHub.

2,475 citations

Journal ArticleDOI
TL;DR: The Functional Gene Pipeline and Repository offers databases of many common ecofunctional genes and proteins, as well as integrated tools that allow researchers to browse these collections and choose subsets for further analysis, build phylogenetic trees, test primers and probes for coverage, and download aligned sequences.
Abstract: Ribosomal RNA genes have become the standard molecular markers for microbial community analysis for good reasons, including universal occurrence in cellular organisms, availability of large databases, and ease of rRNA gene region amplification and analysis. As markers, however, rRNA genes have some significant limitations. The rRNA genes are often present in multiple copies, unlike most protein-coding genes. The slow rate of change in rRNA genes means that multiple species sometimes share identical 16S rRNA gene sequences, while many more species share identical sequences in the short 16S rRNA regions commonly analyzed. In addition, the genes involved in many important processes are not distributed in a phylogenetically coherent manner, potentially due to gene loss or horizontal gene transfer. While rRNA genes remain the most commonly used markers, key genes in ecologically important pathways, e.g., those involved in carbon and nitrogen cycling, can provide important insights into community composition and function not obtainable through rRNA analysis. However, working with ecofunctional gene data requires some tools beyond those required for rRNA analysis. To address this, our Functional Gene Pipeline and Repository (FunGene; http://fungene.cme.msu.edu/) offers databases of many common ecofunctional genes and proteins, as well as integrated tools that allow researchers to browse these collections and choose subsets for further analysis, build phylogenetic trees, test primers and probes for coverage, and download aligned sequences. Additional FunGene tools are specialized to process coding gene amplicon data. For example, FrameBot produces frameshift-corrected protein and DNA sequences from raw reads while finding the most closely related protein reference sequence. These tools can help provide better insight into microbial communities by directly studying key genes involved in important ecological processes.

510 citations


Cites methods from "Ecological Patterns of nifH Genes i..."

  • ...FunFrame (Weisman et al., 2013) is an R-based analysis pipeline for functional gene data, built on analysis tools including HMMFrame (Zhang and Sun, 2011) for frameshift correction and gene translation....

Journal ArticleDOI
TL;DR: The analysis of the abundance and community structure of functional genes involved in the biogeochemical cycling of N in forest soils offers an approach to directly link microbial groups to soil characteristics and ecosystem processes.
Abstract: The understanding of nitrogen (N) cycling in forest ecosystems has undergone a major shift in the past decade as molecular methods are being used to link microorganisms to key processes in soil. The analysis of the abundance and community structure of functional genes involved in the biogeochemical cycling of N in forest soils offers an approach to directly link microbial groups to soil characteristics and ecosystem processes. The majority of N entering ecosystems is biologically derived from fixation of atmospheric N2. Molecular studies of N-fixation use the nitrogenase reductase (nifH) marker gene, and can be used to link N-fixation to other N- and C-cycling processes. Inorganic N entering soil via N-fixation, fertilization and deposition can have several fates, depending on the soil environment and the microbial community. The loss of N from forest stands subject to fertilization and atmospheric deposition is of increasing interest as the outputs of nitrate (NO3−) and nitrous oxide (N2O) are implicated in ground water pollution and climate change, respectively. Ammonia-oxidizing bacteria (AOB) and archaea (AOA) oxidize ammonia (NH3) to NO3− as the first step of nitrification and are studied using the ammonium monooxygenase (amoA) marker. The abundance and community structure of ammonia-oxidizers are largely dependent on pH and availability of reactive N forms, and can change rapidly following N addition or after fire. These organisms can also release N2O during nitrifier denitrification or through linked nitrification–denitrification. In some forest soils, N2O emissions are correlated with genes in the denitrification pathway (napA, narG, nirK, nirS, nosZ) making these genes useful indicators of greenhouse gas (GHG) flux potential. 
A review of this topic is timely as there is currently much concern regarding the effect of N fertilization and deposition on North American and European forests due to the potential alteration of dissimilative N-cycling processes and the potential for increased N2O emissions in forest stands.

441 citations

Journal ArticleDOI
26 Dec 2017
TL;DR: It is demonstrated that butyrate producers establish themselves within the first year of life and display high abundances in adults regardless of origin, and results from longitudinal analyses suggest that diversity supports functional stability during ordinary life disturbances and during interventions such as antibiotic treatment.
Abstract: Given the key role of butyrate for host health, understanding the ecology of intestinal butyrate-producing communities is a top priority for gut microbiota research. To this end, we performed a pooled analysis on 2,387 metagenomic/transcriptomic samples from 15 publicly available data sets that originated from three continents and encompassed eight diseases as well as specific interventions. For analyses, a gene catalogue was constructed from gene-targeted assemblies of all genes from butyrate synthesis pathways of all samples and from an updated reference database derived from genome screenings. We demonstrate that butyrate producers establish themselves within the first year of life and display high abundances (>20% of total bacterial community) in adults regardless of origin. Various bacteria form this functional group, exhibiting a biochemical diversity including different pathways and terminal enzymes, where one carbohydrate-fueled pathway was dominant with butyryl coenzyme A (CoA):acetate CoA transferase as the main terminal enzyme. Subjects displayed a high richness of butyrate producers, and 17 taxa, primarily members of the Lachnospiraceae and Ruminococcaceae along with some Bacteroidetes, were detected in >70% of individuals, encompassing ~85% of the total butyrate-producing potential. Most of these key taxa were also found to express genes for butyrate formation, indicating that butyrate producers occupy various niches in the gut ecosystem, concurrently synthesizing that compound. Furthermore, results from longitudinal analyses propose that diversity supports functional stability during ordinary life disturbances and during interventions such as antibiotic treatment. A reduction of the butyrate-producing potential along with community alterations was detected in various diseases, where patients suffering from cardiometabolic disorders were particularly affected. 
IMPORTANCE Studies focusing on taxonomic compositions of the gut microbiota are plentiful, whereas its functional capabilities are still poorly understood. Specific key functions deserve detailed investigations, as they regulate microbiota-host interactions and promote host health and disease. The production of butyrate is among the top targets since depletion of this microbe-derived metabolite is linked to several emerging noncommunicable diseases and was shown to facilitate establishment of enteric pathogens by disrupting colonization resistance. In this study, we established a workflow to investigate in detail the composition of the polyphyletic butyrate-producing community from omics data extracting its biochemical and taxonomic diversity. By combining information from various publicly available data sets, we identified universal ecological key features of this functional group and shed light on its role in health and disease. Our results will assist the development of precision medicine to combat functional dysbiosis.

290 citations

Journal ArticleDOI
TL;DR: A diagnostic framework was developed that enabled the quantification and comprehensive characterization of the TMA-producing potential in human fecal samples and provides crucial information for the development of specific treatment strategies to restrain TMA producers and limit their proliferation.
Abstract: Background: Trimethylamine (TMA), produced by the gut microbiota from dietary quaternary amines (mainly choline and carnitine), is associated with atherosclerosis and severe cardiovascular disease. Currently, little information on the composition of TMA producers in the gut is available due to their low abundance and the requirement of specific functional-based detection methods as many taxa show disparate abilities to produce that compound.

288 citations

References
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations


"Ecological Patterns of nifH Genes i..." refers methods in this paper

  • ...General homology search tools, such as BLASTx (17), cannot correct frameshift artifacts....

Journal ArticleDOI
TL;DR: The definition and use of family-specific, manually curated gathering thresholds are explained and some of the features of domains of unknown function (also known as DUFs) are discussed, which constitute a rapidly growing class of families within Pfam.
Abstract: Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is approximately 100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11,912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).

14,075 citations

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
TL;DR: Pfam is a widely used database of protein families containing 14 831 manually curated entries in the current release, version 27.0; this article describes the changes to Pfam content since the 2012 article.
Abstract: Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.

9,415 citations


"Ecological Patterns of nifH Genes i..." refers methods in this paper

  • ...As with HMMER (6) and other protein profile HMM annotation tools, HMMFrame uses a set of protein family models from Pfam (7) or other sources to scan metagenomic data....

Journal ArticleDOI
TL;DR: A new and simple method to find indicator species and species assemblages characterizing groups of sites, and a new way to present species-site tables, accounting for the hierarchical relationships among species, is proposed.
Abstract: This paper presents a new and simple method to find indicator species and species assemblages characterizing groups of sites. The novelty of our approach lies in the way we combine a species relative abundance with its relative frequency of occurrence in the various groups of sites. This index is maximum when all individuals of a species are found in a single group of sites and when the species occurs in all sites of that group; it is a symmetric indicator. The statistical significance of the species indicator values is evaluated using a randomization procedure. Contrary to TWINSPAN, our indicator index for a given species is independent of the other species relative abundances, and there is no need to use pseudospecies. The new method identifies indicator species for typologies of species relevés obtained by any hierarchical or nonhierarchical classification procedure; its use is independent of the classification method. Because indicator species give ecological meaning to groups of sites, this method provides criteria to compare typologies, to identify where to stop dividing clusters into subsets, and to point out the main levels in a hierarchical classification of sites. Species can be grouped on the basis of their indicator values for each clustering level, the heterogeneous nature of species assemblages observed in any one site being well preserved. Such assemblages are usually a mixture of eurytopic (higher level) and stenotopic species (characteristic of lower level clusters). The species assemblage approach demonstrates the importance of the "sampled patch size," i.e., the diversity of sampled ecological combinations, when we compare the frequencies of core and satellite species. A new way to present species-site tables, accounting for the hierarchical relationships among species, is proposed. A large data set of carabid beetle distributions in open habitats of Belgium is used as a case study to illustrate the new method.
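The index described in this abstract combines specificity (a species' mean abundance in one group relative to the sum of its mean abundances over all groups) with fidelity (the fraction of that group's sites where the species occurs). A minimal Python sketch of that product is given here, assuming the commonly cited form IndVal = 100 × specificity × fidelity. The function name `indicator_values` and the example abundances are hypothetical, and the randomization test for significance mentioned in the abstract is omitted.

```python
def indicator_values(abundance, groups):
    """Indicator value (percent) of one species for each group of sites.

    abundance: per-site abundances of a single species
    groups:    per-site group labels, same length as abundance
    Assumed form: IndVal = 100 * specificity * fidelity.
    """
    labels = sorted(set(groups))
    mean_abund = {}
    fidelity = {}
    for g in labels:
        vals = [a for a, lab in zip(abundance, groups) if lab == g]
        # Mean abundance of the species across the sites of group g.
        mean_abund[g] = sum(vals) / len(vals)
        # Fraction of the sites of group g where the species is present.
        fidelity[g] = sum(v > 0 for v in vals) / len(vals)
    total = sum(mean_abund.values())
    result = {}
    for g in labels:
        specificity = mean_abund[g] / total if total else 0.0
        result[g] = 100.0 * specificity * fidelity[g]
    return result


# Hypothetical data: four sites in two groups.
sites = ["A", "A", "B", "B"]
# Present only in group A, and in every site of A: the maximum value.
print(indicator_values([5, 3, 0, 0], sites))
# Abundant in A but found in only half of its sites: a lower value for A.
print(indicator_values([6, 0, 2, 2], sites))
```

Using group means rather than raw totals keeps the specificity term from being biased by groups with unequal numbers of sites. Because each species is scored on its own, the result does not depend on the abundances of other species, which matches the independence property the abstract highlights.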

7,449 citations


"Ecological Patterns of nifH Genes i..." refers methods in this paper

  • ...Using the nearest-neighbor sequence assignments as “species” assignments, we calculated the four-way Dufrene-Legendre “indicator value” (16), treating each site’s samples as a class (see Dataset S3 in the supplemental material)....
