Journal Article

SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data.

01 Dec 2012-Bioinformatics (Oxford University Press)-Vol. 28, Iss: 24, pp 3211-3217
TL;DR: SortMeRNA, new software designed to rapidly filter rRNA fragments from metatranscriptomic data, is presented; it can handle large sets of reads and sort out all fragments matching the rRNA database with high sensitivity and low running time.
Abstract: MOTIVATION: The application of Next-Generation Sequencing (NGS) technologies to RNAs directly extracted from a community of organisms yields a mixture of fragments characterizing both coding and non-coding types of RNAs. Distinguishing among these, and further categorizing the families of messenger RNAs and ribosomal RNAs, is an important step for examining gene expression patterns of an interactive environment and the phylogenetic classification of the constituent species. RESULTS: We present SortMeRNA, new software designed to rapidly filter ribosomal RNA fragments from metatranscriptomic data. It is capable of handling large sets of reads and sorting out all fragments matching the rRNA database with high sensitivity and low running time. AVAILABILITY: http://bioinfo.lifl.fr/RNA/sortmerna CONTACT: evguenia.kopylova@lifl.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Citations
Journal Article
TL;DR: The results illustrate the importance of parameter tuning for optimizing classifier performance, and recommendations are made regarding parameter choices for these classifiers under a range of standard operating conditions.
Abstract: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. We present q2-feature-classifier ( https://github.com/qiime2/q2-feature-classifier ), a QIIME 2 plugin containing several novel machine-learning and alignment-based methods for taxonomy classification. We evaluated and optimized several commonly used classification methods implemented in QIIME 1 (RDP, BLAST, UCLUST, and SortMeRNA) and several new methods implemented in QIIME 2 (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods based on VSEARCH, and BLAST+) for classification of bacterial 16S rRNA and fungal ITS marker-gene amplicon sequence data. The naive-Bayes, BLAST+-based, and VSEARCH-based classifiers implemented in QIIME 2 meet or exceed the species-level accuracy of other commonly used methods designed for classification of marker gene sequences that were evaluated in this work. These evaluations, based on 19 mock communities and error-free sequence simulations, including classification of simulated “novel” marker-gene sequences, are available in our extensible benchmarking framework, tax-credit ( https://github.com/caporaso-lab/tax-credit-data ). Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for these classifiers under a range of standard operating conditions. q2-feature-classifier and tax-credit are both free, open-source, BSD-licensed packages available on GitHub.

2,475 citations


Cites background or methods from "SortMeRNA: Fast and accurate filter..."

  • ...0 29/11/2014) [13]), two alignment-based consensus taxonomy classifiers newly released in q2-...


  • ...First, we compare the q2-feature-classifier methods to the classifiers that have been most commonly used for classification of 16S rRNA and ITS marker-gene amplicon sequences accessed through QIIME 1 (RDP, BLAST, uclust, SortMeRNA)....


  • ...Kopylova E, Noé L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data....


  • ...Species-level classifications of 16S rRNA gene simulated sequences were best with optimized UCLUST and SortMeRNA configurations for V4 domain, and naive Bayes and RDP for V1–3 domain and full-length 16S rRNA gene sequences (Fig....


  • ...We evaluated and optimized several commonly used classification methods implemented in QIIME 1 (RDP, BLAST, UCLUST, and SortMeRNA) and several new methods implemented in QIIME 2 (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods based on VSEARCH, and BLAST+) for classification of bacterial 16S rRNA and fungal ITS marker-gene amplicon sequence data....


Journal Article
01 Dec 2016-Cell
TL;DR: It is reported herein that gut microbiota are required for motor deficits, microglia activation, and αSyn pathology, and suggested that alterations in the human microbiome represent a risk factor for PD.

2,142 citations


Cites background or methods from "SortMeRNA: Fast and accurate filter..."

  • ...…and Algorithms: ImageJ, National Institutes of Health, https://imagej.nih.gov/ij/ ; Imaris, Bitplane, http://www.bitplane.com/imaris/imaris ; SortMeRNA 2.0, Kopylova et al. (2012), http://bioinfo.lifl.fr/RNA/sortmerna/ ; Greengenes, Lawrence Berkeley National Labs…...


  • ...Operational Taxonomic Units (OTUs) were picked closed reference using SortMeRNA 2.0 (Kopylova et al., 2012) against the August 2013 release of Greengenes (McDonald et al., 2012) in QIIME 1.9 (Caporaso et al., 2010)....


Journal Article
01 Nov 2017-Nature
TL;DR: A meta-analysis of microbial community samples collected by hundreds of researchers for the Earth Microbiome Project is presented, creating both a reference database giving global context to DNA sequence data and a framework for incorporating data from future studies, fostering increasingly complete characterization of Earth’s microbial diversity.
Abstract: Our growing awareness of the microbial world’s importance and diversity contrasts starkly with our limited understanding of its fundamental structure. Despite recent advances in DNA sequencing, a lack of standardized protocols and common analytical frameworks impedes comparisons among studies, hindering the development of global inferences about microbial life on Earth. Here we present a meta-analysis of microbial community samples collected by hundreds of researchers for the Earth Microbiome Project. Coordinated protocols and new analytical methods, particularly the use of exact sequences instead of clustered operational taxonomic units, enable bacterial and archaeal ribosomal RNA gene sequences to be followed across multiple studies and allow us to explore patterns of diversity at an unprecedented scale. The result is both a reference database giving global context to DNA sequence data and a framework for incorporating data from future studies, fostering increasingly complete characterization of Earth’s microbial diversity.

1,676 citations

Journal Article
09 Aug 2018-Nature
TL;DR: It is shown that bacterial, but not fungal, genetic diversity is highest in temperate habitats, that microbial gene composition varies more strongly with environmental variables than with geographic distance, and that the relative contributions of these microorganisms to global nutrient cycling vary spatially.
Abstract: Soils harbour some of the most diverse microbiomes on Earth and are essential for both nutrient cycling and carbon storage. To understand soil functioning, it is necessary to model the global distribution patterns and functional gene repertoires of soil microorganisms, as well as the biotic and environmental associations between the diversity and structure of both bacterial and fungal soil communities [1–4]. Here we show, by leveraging metagenomics and metabarcoding of global topsoil samples (189 sites, 7,560 subsamples), that bacterial, but not fungal, genetic diversity is highest in temperate habitats and that microbial gene composition varies more strongly with environmental variables than with geographic distance. We demonstrate that fungi and bacteria show global niche differentiation that is associated with contrasting diversity responses to precipitation and soil pH. Furthermore, we provide evidence for strong bacterial–fungal antagonism, inferred from antibiotic-resistance genes, in topsoil and ocean habitats, indicating the substantial role of biotic interactions in shaping microbial communities. Our results suggest that both competition and environmental filtering affect the abundance, composition and encoded gene functions of bacterial and fungal communities, indicating that the relative contributions of these microorganisms to global nutrient cycling vary spatially.

1,108 citations

Journal Article
TL;DR: A novel bat-borne CoV was identified that is associated with severe and fatal respiratory disease in humans and the amino acid sequence of the tentative receptor-binding domain resembles that of SARS-CoV, indicating that these viruses might use the same receptor.
Abstract: Background: Human infections with zoonotic coronaviruses (CoVs), including severe acute respiratory syndrome (SARS)-CoV and Middle East respiratory syndrome (MERS)-CoV, have raised great public health concern globally. Here, we report a novel bat-origin CoV causing severe and fatal pneumonia in humans. Methods: We collected clinical data and bronchoalveolar lavage (BAL) specimens from five patients with severe pneumonia from Jin Yin-tan Hospital, Wuhan, Hubei province, China. Nucleic acids of the BAL were extracted and subjected to next-generation sequencing. Virus isolation was carried out, and maximum-likelihood phylogenetic trees were constructed. Results: Five patients hospitalized from December 18 to December 29, 2019 presented with fever, cough, and dyspnea accompanied by complications of acute respiratory distress syndrome. Chest radiography revealed diffuse opacities and consolidation. One of these patients died. Sequence results revealed the presence of a previously unknown β-CoV strain in all five patients, with 99.8–99.9% nucleotide identities among the isolates. These isolates showed 79.0% nucleotide identity with the sequence of SARS-CoV (GenBank NC_004718) and 51.8% identity with the sequence of MERS-CoV (GenBank NC_019843). The virus is phylogenetically closest to a bat SARS-like CoV (SL-ZC45, GenBank MG772933) with 87.6–87.7% nucleotide identity, but is in a separate clade. Moreover, these viruses have a single intact open reading frame gene 8, as a further indicator of bat-origin CoVs. However, the amino acid sequence of the tentative receptor-binding domain resembles that of SARS-CoV, indicating that these viruses might use the same receptor. Conclusion: A novel bat-borne CoV was identified that is associated with severe and fatal respiratory disease in humans. Key words: Bat-origin; Coronavirus; Zoonotic transmission; Pneumonia; Etiology; Next-generation sequencing

999 citations


Cites methods from "SortMeRNA: Fast and accurate filter..."

  • ...1b).([14]) Taxonomic assignment of the clean reads was performed with Kraken 2 against the reference databases, including archaea, bacteria, fungi, human, plasmid, protozoa, univec, and virus sequences (software 2....


References
Journal Article
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations


"SortMeRNA: Fast and accurate filter..." refers background or methods in this paper

  • ...Received on May 16, 2012; revised on September 17, 2012; accepted on October 9, 2012...


  • ...The 16S rRNA database was used by SortMeRNA, riboPicker, BLASTN and SSU-ALIGN, and the 23S rRNA database was used by SortMeRNA, riboPicker and BLASTN....


Journal Article
TL;DR: The Burrows-Wheeler Alignment tool (BWA), a new read alignment package based on backward search with the Burrows–Wheeler Transform (BWT), is implemented to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]

43,862 citations


"SortMeRNA: Fast and accurate filter..." refers background or methods in this paper



  • ...An alternative algorithm outside the domain of probabilistic models is riboPicker (Schmieder et al., 2012), which uses a modified version of the Burrows-Wheeler Aligner (Li and Durbin, 2009)....


Journal Article
TL;DR: UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters and offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.
Abstract: Motivation: Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. Results: UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets. Availability: Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

17,301 citations

Journal Article
TL;DR: The ARB program package comprises a variety of directly interacting software tools for sequence database maintenance and analysis which are controlled by a common graphical user interface.
Abstract: The ARB (from Latin arbor, tree) project was initiated almost 10 years ago. The ARB program package comprises a variety of directly interacting software tools for sequence database maintenance and analysis which are controlled by a common graphical user interface. Although it was initially designed for ribosomal RNA data, it can be used for any nucleic and amino acid sequence data as well. A central database contains processed (aligned) primary structure data. Any additional descriptive data can be stored in database fields assigned to the individual sequences or linked via local or worldwide networks. A phylogenetic tree visualized in the main window can be used for data access and visualization. The package comprises additional tools for data import and export, sequence alignment, primary and secondary structure editing, profile and filter calculation, phylogenetic analyses, specific hybridization probe design and evaluation and other components for data analysis. Currently, the package is used by numerous working groups worldwide.

6,757 citations


"SortMeRNA: Fast and accurate filter..." refers background in this paper

  • ...Additionally, the user can work with his or her own RNA databases....


Journal Article
TL;DR: SILVA (from Latin silva, forest) was implemented to provide a central comprehensive web resource for up-to-date, quality-controlled databases of aligned rRNA sequences from the Bacteria, Archaea and Eukarya domains.
Abstract: Sequencing ribosomal RNA (rRNA) genes is currently the method of choice for phylogenetic reconstruction, nucleic acid based detection and quantification of microbial diversity. The ARB software suite with its corresponding rRNA datasets has been accepted by researchers worldwide as a standard tool for large scale rRNA analysis. However, the rapid increase of publicly available rRNA sequence data has recently hampered the maintenance of comprehensive and curated rRNA knowledge databases. A new system, SILVA (from Latin silva, forest), was implemented to provide a central comprehensive web resource for up to date, quality controlled databases of aligned rRNA sequences from the Bacteria, Archaea and Eukarya domains. All sequences are checked for anomalies, carry a rich set of sequence associated contextual information, have multiple taxonomic classifications, and the latest validly described nomenclature. Furthermore, two precompiled sequence datasets compatible with ARB are offered for download on the SILVA website: (i) the reference (Ref) datasets, comprising only high quality, nearly full length sequences suitable for in-depth phylogenetic analysis and probe design and (ii) the comprehensive Parc datasets with all publicly available rRNA sequences longer than 300 nucleotides suitable for biodiversity analyses. The latest publicly available database release 91 (August 2007) hosts 547 521 sequences split into 461 823 small subunit and 85 689 large subunit rRNAs.

5,733 citations