Showing papers on "Sequence analysis" published in 2018


Journal Article
TL;DR: SAVER (single-cell analysis via expression recovery), an expression recovery method for unique molecule index (UMI)-based scRNA-seq data that borrows information across genes and cells to provide accurate expression estimates for all genes.
Abstract: In single-cell RNA sequencing (scRNA-seq) studies, only a small fraction of the transcripts present in each cell are sequenced. This leads to unreliable quantification of genes with low or moderate expression, which hinders downstream analysis. To address this challenge, we developed SAVER (single-cell analysis via expression recovery), an expression recovery method for unique molecule index (UMI)-based scRNA-seq data that borrows information across genes and cells to provide accurate expression estimates for all genes.
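
The abstract does not describe the statistical model, so the following is only a generic illustration of how "borrowing information" can stabilize a low count: a gamma-Poisson shrinkage estimate, in which a prior mean (assumed here to come from other genes and cells) is combined with the observed UMI count. The function name, parameters, and parameterization are assumptions for illustration and are not taken from SAVER itself.

```python
def shrunk_expression(count, size_factor, prior_mean, prior_dispersion):
    """Posterior mean of expression under a gamma-Poisson model.

    Assumed model (illustrative, not SAVER's actual implementation):
      rate  ~ Gamma(shape=1/phi, rate=1/(phi*mu))   # mean mu, variance phi*mu^2
      count ~ Poisson(size_factor * rate)
    The posterior is Gamma(shape + count, rate + size_factor).
    """
    shape = 1.0 / prior_dispersion
    rate = shape / prior_mean
    return (count + shape) / (size_factor + rate)


# A gene with zero observed UMIs still gets a positive estimate,
# pulled toward the prior mean of 2.0.
print(shrunk_expression(count=0, size_factor=1.0, prior_mean=2.0, prior_dispersion=1.0))

# A confident prior (small dispersion) dominates a single noisy count.
print(shrunk_expression(count=10, size_factor=1.0, prior_mean=2.0, prior_dispersion=0.001))
```

The estimate always lies between the prior mean and the raw normalized count, and the dispersion controls how much weight the observed count receives.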

547 citations


Journal Article
26 Dec 2018-PLOS ONE
TL;DR: It is demonstrated that closely related neuronal cell types can be similarly discriminated with both methods if intronic sequences are included in snRNA-seq analysis, and the high information content of nuclear RNA for characterization of cellular diversity in brain tissues is illustrated.
Abstract: Transcriptomic profiling of complex tissues by single-nucleus RNA-sequencing (snRNA-seq) affords some advantages over single-cell RNA-sequencing (scRNA-seq). snRNA-seq provides less biased cellular coverage, does not appear to suffer cell isolation-based transcriptional artifacts, and can be applied to archived frozen specimens. We used well-matched snRNA-seq and scRNA-seq datasets from mouse visual cortex to compare cell type detection. Although more transcripts are detected in individual whole cells (~11,000 genes) than nuclei (~7,000 genes), we demonstrate that closely related neuronal cell types can be similarly discriminated with both methods if intronic sequences are included in snRNA-seq analysis. We estimate that the nuclear proportion of total cellular mRNA varies from 20% to over 50% for large and small pyramidal neurons, respectively. Together, these results illustrate the high information content of nuclear RNA for characterization of cellular diversity in brain tissues.

368 citations


Journal Article
TL;DR: It is confirmed that transcript structure generally limits translation initiation and demonstrated its physiological cost using an epigenetic assay, and a set of design principles is proposed to improve translation efficiency that would benefit from more accurate prediction of secondary structures in vivo.
Abstract: Comparative analyses of natural and mutated sequences have been used to probe mechanisms of gene expression, but small sample sizes may produce biased outcomes. We applied an unbiased design-of-experiments approach to disentangle factors suspected to affect translation efficiency in E. coli. We precisely designed 244,000 DNA sequences implementing 56 replicates of a full factorial design to evaluate nucleotide, secondary structure, codon and amino acid properties in combination. For each sequence, we measured reporter transcript abundance and decay, polysome profiles, protein production and growth rates. Associations between designed sequences properties and these consequent phenotypes were dominated by secondary structures and their interactions within transcripts. We confirmed that transcript structure generally limits translation initiation and demonstrated its physiological cost using an epigenetic assay. Codon composition has a sizable impact on translatability, but only in comparatively rare elongation-limited transcripts. We propose a set of design principles to improve translation efficiency that would benefit from more accurate prediction of secondary structures in vivo.

170 citations


Journal Article
27 Aug 2018-Nature
TL;DR: Findings show the need to go beyond genomic analyses in cancer diagnostics, as mRNA events that are silent at the DNA level are widespread contributors to cancer pathogenesis through the inactivation of tumour-suppressor genes.
Abstract: DNA mutations are known cancer drivers. Here we investigated whether mRNA events that are upregulated in cancer can functionally mimic the outcome of genetic alterations. RNA sequencing or 3′-end sequencing techniques were applied to normal and malignant B cells from 59 patients with chronic lymphocytic leukaemia (CLL) [1–3]. We discovered widespread upregulation of truncated mRNAs and proteins in primary CLL cells that were not generated by genetic alterations but instead occurred by intronic polyadenylation. Truncated mRNAs caused by intronic polyadenylation were recurrent (n = 330) and predominantly affected genes with tumour-suppressive functions. The truncated proteins generated by intronic polyadenylation often lack the tumour-suppressive functions of the corresponding full-length proteins (such as DICER and FOXN3), and several even acted in an oncogenic manner (such as CARD11, MGA and CHST11). In CLL, the inactivation of tumour-suppressor genes by aberrant mRNA processing is substantially more prevalent than the functional loss of such genes through genetic events. We further identified new candidate tumour-suppressor genes that are inactivated by intronic polyadenylation in leukaemia and by truncating DNA mutations in solid tumours [4,5]. These genes are understudied in cancer, as their overall mutation rates are lower than those of well-known tumour-suppressor genes. Our findings show the need to go beyond genomic analyses in cancer diagnostics, as mRNA events that are silent at the DNA level are widespread contributors to cancer pathogenesis through the inactivation of tumour-suppressor genes. The inactivation of tumour suppressor genes at the level of mRNA occurs by the generation of truncated proteins in leukaemia.

162 citations


Journal Article
TL;DR: A sequence comparison method that deconstructs linear sequence relationships in lncRNAs and evaluates similarity based on the abundance of short motifs called k-mers; lncRNAs of related function often had similar k-mer profiles despite lacking linear homology.
Abstract: The functions of most long non-coding RNAs (lncRNAs) are unknown. In contrast to proteins, lncRNAs with similar functions often lack linear sequence homology; thus, the identification of function in one lncRNA rarely informs the identification of function in others. We developed a sequence comparison method to deconstruct linear sequence relationships in lncRNAs and evaluate similarity based on the abundance of short motifs called k-mers. We found that lncRNAs of related function often had similar k-mer profiles despite lacking linear homology, and that k-mer profiles correlated with protein binding to lncRNAs and with their subcellular localization. Using a novel assay to quantify Xist-like regulatory potential, we directly demonstrated that evolutionarily unrelated lncRNAs can encode similar function through different spatial arrangements of related sequence motifs. K-mer-based classification is a powerful approach to detect recurrent relationships between sequence and function in lncRNAs.
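
To make the idea of comparing k-mer abundance rather than linear alignment concrete, here is a minimal sketch. The abstract does not specify the normalization or the similarity measure, so the per-sequence frequency normalization and the Pearson correlation used below are assumptions, as are the function names and the toy sequences.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=3):
    """Return relative k-mer frequencies in a fixed (lexicographic) k-mer order."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA input alike
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(max(len(seq) - k + 1, 0)):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[w] / total for w in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


# Two toy sequences built from the same motifs in a different order.
a = "ACGTACGTACGTTTGGCCAA"
b = "TTGGCCAAACGTACGTACGT"
print(pearson(kmer_profile(a, k=2), kmer_profile(b, k=2)))
```

Because the profile ignores where each k-mer occurs, sequences that share motifs in a different spatial arrangement can still score as similar, which is the property the abstract highlights.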

150 citations


Journal Article
TL;DR: Improvements in the recovery of bound RNAs and the efficiency of converting isolated RNAs into a library for sequencing have enhanced the ability to perform the experiment at scale, from less starting material than has previously been possible, and resulting in high quality datasets for the confident identification of protein binding sites.
Abstract: RNA binding proteins (RBPs) play key roles in determining cellular behavior by manipulating the processing of target RNAs. Robust methods are required to detect the numerous binding sites of RBPs across the transcriptome. RNA-immunoprecipitation followed by sequencing (RIP-seq) and crosslinking followed by immunoprecipitation and sequencing (CLIP-seq) are state-of-the-art methods used to identify the RNA targets and specific binding sites of RBPs. Historically, CLIP methods have been confounded with challenges such as the requirement for tens of millions of cells per experiment, low RNA yields resulting in libraries that contain a high number of polymerase chain reaction duplicated reads, and technical inconveniences such as radioactive labeling of RNAs. However, recent improvements in the recovery of bound RNAs and the efficiency of converting isolated RNAs into a library for sequencing have enhanced our ability to perform the experiment at scale, from less starting material than has previously been possible, and resulting in high quality datasets for the confident identification of protein binding sites. These, along with additional improvements to protein capture, removal of nonspecific signals, and methods to isolate noncanonical RBP targets have revolutionized the study of RNA processing regulation, and reveal a promising future for mapping the human protein-RNA regulatory network. WIREs RNA 2018, 9:e1436. doi: 10.1002/wrna.1436 This article is categorized under: RNA Interactions with Proteins and Other Molecules > Protein-RNA Recognition RNA Interactions with Proteins and Other Molecules > Protein-RNA Interactions: Functional Implications RNA Methods > RNA Analyses in Cells.

146 citations


Journal Article
TL;DR: Analysis of the potato Hsp20 gene family demonstrated that the genes responded to multiple abiotic stresses, such as heat, salt or drought stress, and provided valuable information for clarifying the evolutionary relationship of the StHsp20 family and in aiding functional characterization of StHsp20 genes in further research.
Abstract: Heat shock proteins (Hsps) are essential components of plant tolerance mechanisms under various abiotic stresses. Hsp20 is the major family of heat shock proteins, but little is known about the Hsp20 family in potato (Solanum tuberosum), an important vegetable crop that is thermosensitive. To reveal the mechanisms of potato Hsp20s coping with abiotic stresses, analyses of the potato Hsp20 gene family were conducted using bioinformatics-based methods. In total, 48 putative potato Hsp20 genes (StHsp20s) were identified and named according to their chromosomal locations. A sequence analysis revealed that most StHsp20 genes (89.6%) possessed no, or only one, intron. A phylogenetic analysis indicated that all of the StHsp20 genes, except 10, were grouped into 12 subfamilies. The 48 StHsp20 genes were randomly distributed on 12 chromosomes. Nineteen tandem duplicated StHsp20s and one pair of segmental duplicated genes (StHsp20-15 and StHsp20-48) were identified. A cis-element analysis inferred that StHsp20s, except for StHsp20-41, possessed at least one stress response cis-element. A heatmap of the StHsp20 gene family showed that the genes, except for StHsp20-2 and StHsp20-45, were expressed in various tissues and organs. Real-time quantitative PCR was used to detect the expression level of StHsp20 genes and demonstrated that the genes responded to multiple abiotic stresses, such as heat, salt or drought stress. The relative expression levels of 14 StHsp20 genes (StHsp20-4, 6, 7, 9, 20, 21, 33, 34, 35, 37, 41, 43, 44 and 46) were significantly up-regulated (more than 100-fold) under heat stress. These results provide valuable information for clarifying the evolutionary relationship of the StHsp20 family and in aiding functional characterization of StHsp20 genes in further research.

140 citations


Journal Article
Shanrong Zhao, Ying Zhang, Ramya Gamini, Baohong Zhang, David von Schack
TL;DR: It is found that a small number of lncRNAs and small RNAs made up a large fraction of the reads in the rRNA depletion RNA sequencing data, and it is recommended that these RNAs are specifically depleted to improve the sequencing depth of the remaining RNAs.
Abstract: To allow efficient transcript/gene detection, highly abundant ribosomal RNAs (rRNA) are generally removed from total RNA either by positive polyA+ selection or by rRNA depletion (negative selection) before sequencing. Comparisons between the two methods have been carried out by various groups, but the assessments have relied largely on non-clinical samples. In this study, we evaluated these two RNA sequencing approaches using human blood and colon tissue samples. Our analyses showed that rRNA depletion captured more unique transcriptome features, whereas polyA+ selection outperformed rRNA depletion with higher exonic coverage and better accuracy of gene quantification. For blood- and colon-derived RNAs, we found that 220% and 50% more reads, respectively, would have to be sequenced to achieve the same level of exonic coverage in the rRNA depletion method compared with the polyA+ selection method. Therefore, in most cases we strongly recommend polyA+ selection over rRNA depletion for gene quantification in clinical RNA sequencing. Our evaluation revealed that a small number of lncRNAs and small RNAs made up a large fraction of the reads in the rRNA depletion RNA sequencing data. Thus, we recommend that these RNAs are specifically depleted to improve the sequencing depth of the remaining RNAs.

131 citations


Journal Article
TL;DR: A deep 1D‐convolution neural network (DeepSF) is developed to directly classify any protein sequence into one of 1195 known folds, which is useful for both fold recognition and the study of sequence‐structure relationship.
Abstract: Motivation: Protein fold recognition is an important problem in structural bioinformatics. Almost all traditional fold recognition methods use sequence (homology) comparison to indirectly predict the fold of a target protein based on the fold of a template protein with known structure, which cannot explain the relationship between sequence and fold. Only a few methods had been developed to classify protein sequences into a small number of folds due to methodological limitations, which are not generally useful in practice. Results: We develop a deep 1D-convolution neural network (DeepSF) to directly classify any protein sequence into one of 1195 known folds, which is useful for both fold recognition and the study of sequence-structure relationship. Different from traditional sequence alignment (comparison) based methods, our method automatically extracts fold-related features from a protein sequence of any length and maps it to the fold space. We train and test our method on the datasets curated from SCOP1.75, yielding an average classification accuracy of 75.3%. On the independent testing dataset curated from SCOP2.06, the classification accuracy is 73.0%. We compare our method with a top profile-profile alignment method, HHSearch, on hard template-based and template-free modeling targets of CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is 12.63-26.32% higher than HHSearch on template-free modeling targets and 3.39-17.09% higher on hard template-based modeling targets for top 1, 5 and 10 predicted folds. The hidden features extracted from sequence by our method are robust against sequence mutation, insertion, deletion and truncation, and can be used for other protein pattern recognition problems such as protein clustering, comparison and ranking. Availability and implementation: The DeepSF server is publicly available at: http://iris.rnet.missouri.edu/DeepSF/. Contact: chengji@missouri.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

125 citations


Journal Article
TL;DR: This work demonstrates a strategy to obtain complete genome sequences and transcriptional landscapes that can be applied to other eukaryal organisms.
Abstract: Completion of eukaryal genomes can be a difficult task because of the highly repetitive sequences along the chromosomes and the short read lengths of second-generation sequencing. Saccharomyces cerevisiae strain CEN.PK113-7D, widely used as a model organism and a cell factory, was selected for this study to demonstrate the superior capability of very long sequence reads for de novo genome assembly. We generated long reads using two common third-generation sequencing technologies (Oxford Nanopore Technology (ONT) and Pacific Biosciences (PacBio)) and used short reads obtained using Illumina sequencing for error correction. Assembly of the reads derived from all three technologies resulted in complete sequences for all 16 yeast chromosomes, as well as the mitochondrial chromosome, in one step. Further, we identified three types of DNA methylation (5mC, 4mC and 6mA). Comparison between the reference strain S288C and strain CEN.PK113-7D identified chromosomal rearrangements against a background of similar gene content between the two strains. We identified full-length transcripts through ONT direct RNA sequencing technology. This allows for the identification of transcriptional landscapes, including untranslated regions (UTRs) (5' UTR and 3' UTR) as well as differential gene expression quantification. About 91% of the predicted transcripts could be consistently detected across biological replicates grown either on glucose or ethanol. Direct RNA sequencing identified many polyadenylated non-coding RNAs, rRNAs, telomere-RNA, long non-coding RNA and antisense RNA. This work demonstrates a strategy to obtain complete genome sequences and transcriptional landscapes that can be applied to other eukaryal organisms.

111 citations


Journal Article
TL;DR: bpRNA-1m is a web-accessible meta-database of over 100 000 known RNA secondary structures, annotated by the bpRNA tool with all loops, stems, and pseudoknots, along with the positions, sequence, and flanking base pairs of each such structural feature.
Abstract: While RNA secondary structure prediction from sequence data has made remarkable progress, there is a need for improved strategies for annotating the features of RNA secondary structures. Here, we present bpRNA, a novel annotation tool capable of parsing RNA structures, including complex pseudoknot-containing RNAs, to yield an objective, precise, compact, unambiguous, easily-interpretable description of all loops, stems, and pseudoknots, along with the positions, sequence, and flanking base pairs of each such structural feature. We also introduce several new informative representations of RNA structure types to improve structure visualization and interpretation. We have further used bpRNA to generate a web-accessible meta-database, 'bpRNA-1m', of over 100 000 single-molecule, known secondary structures; this is both more fully and accurately annotated and over 20-times larger than existing databases. We use a subset of the database with highly similar (≥90% identical) sequences filtered out to report on statistical trends in sequence, flanking base pairs, and length. Both the bpRNA method and the bpRNA-1m database will be valuable resources both for specific analysis of individual RNA molecules and large-scale analyses such as are useful for updating RNA energy parameters for computational thermodynamic predictions, improving machine learning models for structure prediction, and for benchmarking structure-prediction algorithms.
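
As a rough illustration of what parsing a structure into stems and pseudoknot pairs involves, the sketch below reads the commonly used extended dot-bracket notation, where `()` marks nested pairs and additional bracket types such as `[]` mark pairs that cross them. This input format and the function names are assumptions for illustration; it is not the bpRNA implementation and does not reproduce its annotations.

```python
CLOSERS = {")": "(", "]": "[", "}": "{", ">": "<"}
OPENERS = set(CLOSERS.values())


def base_pairs(structure):
    """Return sorted (i, j) index pairs from an extended dot-bracket string."""
    stacks = {opener: [] for opener in OPENERS}  # one stack per bracket type
    pairs = []
    for i, ch in enumerate(structure):
        if ch in OPENERS:
            stacks[ch].append(i)
        elif ch in CLOSERS:
            stack = stacks[CLOSERS[ch]]
            if not stack:
                raise ValueError(f"unmatched {ch!r} at position {i}")
            pairs.append((stack.pop(), i))
        # any other character (e.g. '.') is treated as unpaired
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return sorted(pairs)


def stems(pairs):
    """Group pairs into stems: maximal runs (i, j), (i+1, j-1), ..."""
    remaining = set(pairs)
    result = []
    for i, j in sorted(pairs):
        if (i, j) not in remaining:
            continue  # already absorbed into an earlier stem
        stem = []
        while (i, j) in remaining:
            remaining.remove((i, j))
            stem.append((i, j))
            i, j = i + 1, j - 1
        result.append(stem)
    return result


# A toy pseudoknot: the '[' ... ']' helix crosses the '(' ... ')' helix.
pairs = base_pairs("((..[[..))..]]")
print(stems(pairs))  # [[(0, 9), (1, 8)], [(4, 13), (5, 12)]]
```

Keeping one stack per bracket type is what lets crossing (pseudoknotted) helices be read without ambiguity; a single stack would only handle fully nested structures.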

Journal Article
TL;DR: A detailed protocol for performing ‘phage immunoprecipitation sequencing’ (PhIP-Seq), which is a powerful method for analyzing antibody-repertoire binding specificities with high throughput and at low cost.
Abstract: The binding specificities of an individual’s antibody repertoire contain a wealth of biological information. They harbor evidence of environmental exposures, allergies, ongoing or emerging autoimmune disease processes, and responses to immunomodulatory therapies, for example. Highly multiplexed methods to comprehensively interrogate antibody-binding specificities have therefore emerged in recent years as important molecular tools. Here, we provide a detailed protocol for performing ‘phage immunoprecipitation sequencing’ (PhIP-Seq), which is a powerful method for analyzing antibody-repertoire binding specificities with high throughput and at low cost. The methodology uses oligonucleotide library synthesis (OLS) to encode proteomic-scale peptide libraries for display on bacteriophage. These libraries are then immunoprecipitated, using an individual’s antibodies, for subsequent analysis by high-throughput DNA sequencing. We have used PhIP-Seq to identify novel self-antigens associated with autoimmune disease, to characterize the self-reactivity of broadly neutralizing HIV antibodies, and in a large international cross-sectional study of exposure to hundreds of human viruses. Compared with alternative array-based techniques, PhIP-Seq is far more scalable in terms of sample throughput and cost per analysis. Cloning and expression of recombinant proteins are not required (versus protein microarrays), and peptide lengths are limited only by DNA synthesis chemistry (up to 90-aa (amino acid) peptides versus the typical 8- to 12-aa length limit of synthetic peptide arrays). Compared with protein microarrays, however, PhIP-Seq libraries lack discontinuous epitopes and post-translational modifications. 
To increase the accessibility of PhIP-Seq, we provide detailed instructions for the design of phage-displayed peptidome libraries, their immunoprecipitation using serum antibodies, deep sequencing–based measurement of peptide abundances, and statistical determination of peptide enrichments that reflect antibody–peptide interactions. Once a library has been constructed, PhIP-Seq data can be obtained for analysis within a week. Phage immunoprecipitation sequencing (PhIP-Seq) is a method for analyzing antibody-repertoire binding specificities. Phage-displayed oligonucleotide libraries encoding peptidomes are immunoprecipitated and analyzed by high-throughput DNA-Seq.

Journal Article
TL;DR: Short sequence patterns encoding enhancer activity have been maintained across more than 180 million years of mammalian evolution, suggesting that cross-species enhancer prediction is often possible.
Abstract: Genomic regions with gene regulatory enhancer activity turn over rapidly across mammals. In contrast, gene expression patterns and transcription factor binding preferences are largely conserved between mammalian species. Based on this conservation, we hypothesized that enhancers active in different mammals would exhibit conserved sequence patterns in spite of their different genomic locations. To investigate this hypothesis, we evaluated the extent to which sequence patterns that are predictive of enhancers in one species are predictive of enhancers in other mammalian species by training and testing two types of machine learning models. We trained support vector machine (SVM) and convolutional neural network (CNN) classifiers to distinguish enhancers defined by histone marks from the genomic background based on DNA sequence patterns in human, macaque, mouse, dog, cow, and opossum. The classifiers accurately identified many adult liver, developing limb, and developing brain enhancers, and the CNNs outperformed the SVMs. Furthermore, classifiers trained in one species and tested in another performed nearly as well as classifiers trained and tested on the same species. We observed similar cross-species conservation when applying the models to human and mouse enhancers validated in transgenic assays. This indicates that many short sequence patterns predictive of enhancers are largely conserved. The sequence patterns most predictive of enhancers in each species matched the binding motifs for a common set of TFs enriched for expression in relevant tissues, supporting the biological relevance of the learned features. Thus, despite the rapid change of active enhancer locations between mammals, cross-species enhancer prediction is often possible. Our results suggest that short sequence patterns encoding enhancer activity have been maintained across more than 180 million years of mammalian evolution.

Journal Article
TL;DR: A compendium of conserved cleavage and polyadenylation sites (PASs) in mammalian genes, based on approximately 1.2 billion 3' end sequencing reads from more than 360 human, mouse, and rat samples, shows that ∼80% of mammalian mRNA genes contain at least one conserved PAS, and ∼50% have conserved APA events.
Abstract: Cleavage and polyadenylation is essential for 3' end processing of almost all eukaryotic mRNAs. Recent studies have shown widespread alternative cleavage and polyadenylation (APA) events leading to mRNA isoforms with different 3' UTRs and/or coding sequences. Here, we present a compendium of conserved cleavage and polyadenylation sites (PASs) in mammalian genes, based on approximately 1.2 billion 3' end sequencing reads from more than 360 human, mouse, and rat samples. We show that ∼80% of mammalian mRNA genes contain at least one conserved PAS, and ∼50% have conserved APA events. PAS conservation generally reduces promiscuous 3' end processing, stabilizing gene expression levels across species. Conservation of APA correlates with gene age, gene expression features, and gene functions. Genes with certain functions, such as cell morphology, cell proliferation, and mRNA metabolism, are particularly enriched with conserved APA events. Whereas tissue-specific genes typically have a low APA rate, brain-specific genes tend to evolve APA. In addition, we show enrichment of mRNA destabilizing motifs in alternative 3' UTR sequences, leading to substantial differences in mRNA stability between 3' UTR isoforms. Using conserved PASs, we reveal sequence motifs surrounding APA sites and a preference of adenosine at the cleavage site. Furthermore, we show that mutations of U-rich motifs around the PAS often accompany APA profile differences between species. Analysis of lncRNA PASs indicates a mechanism of PAS fixation through evolution of A-rich motifs. Taken together, our results present a comprehensive view of PAS evolution in mammals, and a phylogenic perspective on APA functions.

Journal Article
TL;DR: The draft genome sequence of a wild rose was determined using Illumina MiSeq and HiSeq platforms and revealed a large number of genes for a diploid plant, which may reflect heterogeneity of the genome originating from self-incompatibility in R. multiflora.
Abstract: The draft genome sequence of a wild rose (Rosa multiflora Thunb.) was determined using Illumina MiSeq and HiSeq platforms. The total length of the scaffolds was 739,637,845 bp, consisting of 83,189 scaffolds, which was close to the 711 Mbp length estimated by k-mer analysis. N50 length of the scaffolds was 90,830 bp, and extent of the longest was 1,133,259 bp. The average GC content of the scaffolds was 38.9%. After gene prediction, 67,380 candidates exhibiting sequence homology to known genes and domains were extracted, which included complete and partial gene structures. This large number of genes for a diploid plant may reflect heterogeneity of the genome originating from self-incompatibility in R. multiflora. According to CEGMA analysis, 91.9% and 98.0% of the core eukaryotic genes were completely and partially conserved in the scaffolds, respectively. Genes presumably involved in flower color, scent and flowering are assigned. The results of this study will serve as a valuable resource for fundamental and applied research in the rose, including breeding and phylogenetic study of cultivated roses.
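
The assembly statistics quoted above (N50 and GC content) follow standard definitions, sketched below. The function names and the toy inputs are illustrative assumptions; the real values require the actual scaffold sequences.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L contain at least half of all bases."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (N and other symbols are ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    gc = seq.count("G") + seq.count("C")
    return gc / acgt if acgt else 0.0


print(n50([2, 3, 4, 5, 6]))       # 5: scaffolds 6 + 5 = 11 cover at least half of 20
print(gc_content("GGCCAATTNN"))   # 0.5
```

N50 is weighted by length, so many short scaffolds lower it far less than they lower the mean scaffold length.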

Journal Article
TL;DR: The findings reported here greatly expand the known host range of (putative) viruses of bacteria and archaea that encode a DJR MCP and demonstrate the extreme diversity of genome architectures in these viruses that encode no universal proteins other than the capsid protein that was used as the marker for their identification.
Abstract: Analysis of metagenomic sequences has become the principal approach for the study of the diversity of viruses. Many recent, extensive metagenomic studies on several classes of viruses have dramatically expanded the visible part of the virosphere, showing that previously undetected viruses, or those that have been considered rare, actually are important components of the global virome. We investigated the provenance of viruses related to tail-less bacteriophages of the family Tectiviridae by searching genomic and metagenomics sequence databases for distant homologs of the tectivirus-like Double Jelly-Roll major capsid proteins (DJR MCP). These searches resulted in the identification of numerous genomes of virus-like elements that are similar in size to tectiviruses (10–15 kilobases) and have diverse gene compositions. By comparison of the gene repertoires, the DJR MCP-encoding genomes were classified into 6 distinct groups that can be predicted to differ in reproduction strategies and host ranges. Only the DJR MCP gene that is present by design is shared by all these genomes, and most also encode a predicted DNA-packaging ATPase; the rest of the genes are present only in subgroups of this unexpectedly diverse collection of DJR MCP-encoding genomes. Only a minority encode a DNA polymerase which is a hallmark of the family Tectiviridae and the putative family "Autolykiviridae". Notably, one of the identified putative DJR MCP viruses encodes a homolog of Cas1 endonuclease, the integrase involved in CRISPR-Cas adaptation and integration of transposon-like elements called casposons. This is the first detected occurrence of Cas1 in a virus. Many of the identified elements are individual contigs flanked by inverted or direct repeats and appear to represent complete, extrachromosomal viral genomes, whereas others are flanked by bacterial genes and thus can be considered as proviruses. 
These contigs come from metagenomes of widely different environments, some dominated by archaea and others by bacteria, suggesting that collectively, the DJR MCP-encoding elements have a broad host range among prokaryotes. The findings reported here greatly expand the known host range of (putative) viruses of bacteria and archaea that encode a DJR MCP. They also demonstrate the extreme diversity of genome architectures in these viruses that encode no universal proteins other than the capsid protein that was used as the marker for their identification. From a supposedly minor group of bacterial and archaeal viruses, these viruses are emerging as a substantial component of the prokaryotic virome.

Journal Article
TL;DR: The genome of the field bean (Vicia faba, 2n = 12), a long-established model for cytogenetic studies in plants, contains a diverse set of satellite repeats, most of which remained concealed until their present investigation.
Abstract: Satellite DNA, a class of repetitive sequences forming long arrays of tandemly repeated units, represents substantial portions of many plant genomes yet remains poorly characterized due to various methodological obstacles. Here we show that the genome of the field bean (Vicia faba, 2n = 12), a long-established model for cytogenetic studies in plants, contains a diverse set of satellite repeats, most of which remained concealed until their present investigation. Using next-generation sequencing combined with novel bioinformatics tools, we reconstructed consensus sequences of 23 novel satellite repeats representing 0.008–2.700% of the genome and mapped their distribution on chromosomes. We found that in addition to typical satellites with monomers hundreds of nucleotides long, V. faba contains a large number of satellite repeats with unusually long monomers (687–2033 bp), which are predominantly localized in pericentromeric regions. Using chromatin immunoprecipitation with CenH3 antibody, we revealed an extraordinary diversity of centromeric satellites, consisting of seven repeats with chromosome-specific distribution. We also found that in spite of their different nucleotide sequences, all centromeric repeats are replicated during mid-S phase, while most other satellites are replicated in the first part of late S phase, followed by a single family of FokI repeats representing the latest replicating chromatin.

Journal ArticleDOI
TL;DR: IsoCon is a tool for detecting and reconstructing isoforms from multigene families by analyzing long PacBio Iso-Seq reads; it has allowed the authors to detect an unprecedented number of novel isoforms and has opened the door for unraveling the structure of many multigene families and gaining a deeper understanding of genome evolution and human diseases.
Abstract: A significant portion of genes in vertebrate genomes belongs to multigene families, with each family containing several gene copies whose presence/absence, as well as isoform structure, can be highly variable across individuals. Existing de novo techniques for assaying the sequences of such highly-similar gene families fall short of reconstructing end-to-end transcripts with nucleotide-level precision or assigning alternatively spliced transcripts to their respective gene copies. We present IsoCon, a high-precision method using long PacBio Iso-Seq reads to tackle this challenge. We apply IsoCon to nine Y chromosome ampliconic gene families and show that it outperforms existing methods on both experimental and simulated data. IsoCon has allowed us to detect an unprecedented number of novel isoforms and has opened the door for unraveling the structure of many multigene families and gaining a deeper understanding of genome evolution and human diseases.

Journal ArticleDOI
TL;DR: The large number of variants observed reveal heterogeneity in human rDNA, opening up the possibility of corresponding variations in ribosome dynamics.
Abstract: Despite the key role of the human ribosome in protein biosynthesis, little is known about the extent of sequence variation in ribosomal DNA (rDNA) or its pre-rRNA and rRNA products. We recovered ribosomal DNA segments from a single human chromosome 21 using transformation-associated recombination (TAR) cloning in yeast. Accurate long-read sequencing of 13 isolates covering ∼0.82 Mb of the chromosome 21 rDNA complement revealed substantial variation among tandem repeat rDNA copies, several palindromic structures and potential errors in the previous reference sequence. These clones revealed 101 variant positions in the 45S transcription unit and 235 in the intergenic spacer sequence. Approximately 60% of the 45S variants were confirmed in independent whole-genome or RNA-seq data, with 47 of these further observed in mature 18S/28S rRNA sequences. TAR cloning and long-read sequencing enabled the accurate reconstruction of multiple rDNA units and a new, high-quality 44 838 bp rDNA reference sequence, which we have annotated with variants detected from chromosome 21 of a single individual. The large number of variants observed reveal heterogeneity in human rDNA, opening up the possibility of corresponding variations in ribosome dynamics.

Journal ArticleDOI
TL;DR: It is demonstrated that favorable alleles of the Gn1a and DEP1 genes, which are considered key factors in rice yield increases, could be developed by artificial mutagenesis using genome editing technology.
Abstract: Rice yield is an important and complex agronomic trait controlled by multiple genes. In recent decades, dozens of yield-associated genes in rice have been cloned, many of which can increase production through loss or degeneration of function. However, mutations occurring randomly under natural conditions have provided very limited genetic resources for yield increases. In this study, potentially yield-increasing alleles of two genes closely associated with yield were edited artificially. The recently developed CRISPR/Cas9 system was used to edit two yield genes: Grain number 1a (Gn1a) and DENSE AND ERECT PANICLE1 (DEP1). Several mutants were identified by a target sequence analysis. Phenotypic analysis confirmed one mutant allele of Gn1a and three of DEP1 conferring yield superior to that conferred by other natural high-yield alleles. Our results demonstrate that favorable alleles of the Gn1a and DEP1 genes, which are considered key factors in rice yield increases, could be developed by artificial mutagenesis using genome editing technology.

Journal ArticleDOI
TL;DR: Establishing sequence-function and sequence-structure relationships in polyspecific CAZyme families are promising approaches for streamlining enzyme discovery and providing an in silico tool that can be tailored for enzyme bioprospecting in datasets of increasing complexity and for diverse applications in glycobiotechnology.
Abstract: Deposition of new genetic sequences in online databases is expanding at an unprecedented rate. As a result, sequence identification continues to outpace functional characterization of carbohydrate active enzymes (CAZymes). In this paradigm, the discovery of enzymes with novel functions is often hindered by high volumes of uncharacterized sequences, particularly when the enzyme sequence belongs to a family that exhibits diverse functional specificities (i.e., polyspecificity). Therefore, to direct sequence-based discovery and characterization of new enzyme activities we have developed an automated in silico pipeline entitled Sequence Analysis and Clustering of CarboHydrate Active enzymes for Rapid Informed prediction of Specificity (SACCHARIS). This pipeline streamlines the selection of uncharacterized sequences for discovery of new CAZyme or CBM specificity from families currently maintained on the CAZy website or within user-defined datasets. SACCHARIS was used to generate a phylogenetic tree of GH43, a CAZyme family with defined subfamily designations. This analysis confirmed that large datasets can be organized into sequence clusters of manageable sizes that possess related functions. Seeding this tree with a GH43 sequence from Bacteroides dorei DSM 17855 (BdGH43b) revealed that it partitioned as a single sequence within the tree. This pattern was consistent with it possessing a unique enzyme activity for GH43, as BdGH43b is the first α-glucanase described for this family. The capacity of SACCHARIS to extract and cluster characterized carbohydrate binding module sequences was demonstrated using family 6 CBMs (i.e., CBM6s). This CBM family displays a polyspecific ligand binding profile and contains many structurally determined members. Using SACCHARIS to identify a cluster of divergent sequences, a CBM6 sequence from a unique clade was demonstrated to bind yeast mannan, which represents the first description of an α-mannan binding CBM.
Additionally, we have performed a CAZome analysis of an in-house sequenced bacterial genome and a comparative analysis of B. thetaiotaomicron VPI-5482 and B. thetaiotaomicron 7330, to demonstrate that SACCHARIS can generate “CAZome fingerprints”, which differentiate between the saccharolytic potential of two related strains in silico. Establishing sequence-function and sequence-structure relationships in polyspecific CAZyme families are promising approaches for streamlining enzyme discovery. SACCHARIS facilitates this process by embedding CAZyme and CBM family trees generated from biochemically to structurally characterized sequences, with protein sequences that have unknown functions. In addition, these trees can be integrated with user-defined datasets (e.g., genomics, metagenomics, and transcriptomics) to inform experimental characterization of new CAZymes or CBMs not currently curated, and for researchers to compare differential sequence patterns between entire CAZomes. In this light, SACCHARIS provides an in silico tool that can be tailored for enzyme bioprospecting in datasets of increasing complexity and for diverse applications in glycobiotechnology.

Journal ArticleDOI
TL;DR: The complete chloroplast genome sequence of F. dibotrys reported in this study will provide useful plastid genomic resources for population genetics and pave the way for resolving phylogenetic relationships of order Caryophyllales.
Abstract: Fagopyrum dibotrys, which belongs to the family Polygonaceae, is one of the national key conserved wild plants of China, with important medicinal and economic value. Here, the complete chloroplast (cp) genome sequence of F. dibotrys is reported. The cp genome is 159,919 bp in size with a typical quadripartite structure, consisting of a pair of inverted repeat regions (30,738 bp each) separated by a large single copy region (85,134 bp) and a small single copy region (13,309 bp). Sequencing analyses indicated that the cp genome encodes 131 genes, including 80 protein-coding genes, 28 tRNA genes and 4 rRNA genes. The genome structure, gene order and codon usage are typical of angiosperm cp genomes. We also identified 48 simple sequence repeat (SSR) loci, fewer of which are located in protein-coding sequences than in noncoding regions. Comparison of the F. dibotrys cp genome with other Polygonaceae cp genomes indicated that the inverted repeats (IRs) and coding regions were more conserved than single copy and noncoding regions, and several variation hotspots were detected. Coding gene sequence divergence analyses indicated that five genes (ndhK, petL, rpoC2, ycf1, ycf2) were subject to positive selection. Phylogenetic analysis among 42 species based on cp genomes and 50 protein-coding genes indicated a close relationship between F. dibotrys and F. tataricum. In summary, the complete cp genome sequence of F. dibotrys reported in this study will provide useful plastid genomic resources for population genetics and pave the way for resolving phylogenetic relationships of order Caryophyllales.
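
The reported region lengths can be checked against the reported genome size. The following is a minimal sketch, assuming the usual quadripartite layout (one large single copy region, one small single copy region, and two copies of the inverted repeat); the function and variable names are illustrative and do not come from the paper.

```python
# Region lengths in bp, as reported in the abstract above.
LSC = 85_134   # large single copy region
SSC = 13_309   # small single copy region
IR = 30_738    # one inverted repeat copy


def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Total plastome length: LSC + SSC + two inverted repeat copies."""
    return lsc + ssc + 2 * ir


total = quadripartite_length(LSC, SSC, IR)
print(total)  # 159919, matching the reported genome size
```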

Journal ArticleDOI
22 Jun 2018-PLOS ONE
TL;DR: It is estimated with 95% probability that the chances of finding incorrectly described metazoan sequences in GenBank depend on the systematic group, and vary from less than 1% (Mollusca and Arthropoda) up to 6.9% (Gastrotricha).
Abstract: The cytochrome c oxidase subunit I (cox1) gene is the main mitochondrial molecular marker playing a pivotal role in phylogenetic research and is a crucial barcode sequence. Folmer's "universal" primers designed to amplify this gene in metazoan invertebrates allowed quick and easy barcode and phylogenetic analysis. On the other hand, the increase in the number of studies on barcoding leads to more frequent publishing of incorrect sequences, due to amplification of non-target taxa and insufficient analysis of the obtained sequences. Consequently, some sequences deposited in genetic databases are incorrectly described as obtained from invertebrates, while being in fact bacterial sequences. In our study, in which we used Folmer's primers to amplify COI sequences of the crustacean fairy shrimp Branchipus schaefferi (Fischer 1834), we also obtained COI sequences of microbial contaminants from Aeromonas sp. However, when we searched the GenBank database for sequences closely matching these contaminations we found entries described as representatives of Gastrotricha and Mollusca. When these entries were compared with other sequences bearing the same names in the database, the genetic distance between the incorrect and correct sequences amplified from the same species was ca. 65%. Although the responsibility for the correct molecular identification of species rests on researchers, the errors found in already published sequence data have not been re-evaluated so far. On the basis of the standard sampling technique we have estimated with 95% probability that the chances of finding incorrectly described metazoan sequences in GenBank depend on the systematic group, and vary from less than 1% (Mollusca and Arthropoda) up to 6.9% (Gastrotricha). Consequently, the increasing popularity of DNA barcoding and metabarcoding analysis may lead to overestimation of species diversity. Finally, the study also discusses the sources of the problems with amplification of non-target sequences.

Journal ArticleDOI
TL;DR: Application of RIL-seq to bacteria grown under different conditions provides distinctive snapshots of the sRNA interactome and sheds light on the dynamics and rewiring of the post-transcriptional regulatory network.
Abstract: Small RNAs (sRNAs) are major post-transcriptional regulators of gene expression in bacteria. To enable transcriptome-wide mapping of bacterial sRNA-target pairs, we developed RIL-seq (RNA interaction by ligation and sequencing). RIL-seq is an experimental-computational methodology for capturing sRNA-target interactions in vivo that takes advantage of the mutual binding of the sRNA and target RNA molecules to the RNA chaperone protein Hfq. The experimental part of the protocol involves co-immunoprecipitation of Hfq and bound RNAs, ligation of RNAs, library preparation and sequencing. The computational pipeline maps the sequenced fragments to the genome, reveals chimeric fragments (fragments comprising two ligated independent fragments) and determines statistically significant overrepresented chimeric fragments as interacting RNAs. The statistical filter is aimed at reducing the number of spurious interactions resulting from ligation of random neighboring RNA fragments, thus increasing the reliability of the determined sRNA-target pairs. A major advantage of RIL-seq is that it does not require overexpression of sRNAs; instead, it simultaneously captures the in vivo targets of all sRNAs in the native state of the cell. Application of RIL-seq to bacteria grown under different conditions provides distinctive snapshots of the sRNA interactome and sheds light on the dynamics and rewiring of the post-transcriptional regulatory network. As RIL-seq needs no prior information about the sRNA and target sequences, it can identify novel sRNAs, along with their targets. It can be adapted to detect protein-mediated RNA-RNA interactions in any bacterium with a sequenced genome. The experimental part of the RIL-seq protocol takes 7-9 d and the computational analysis takes ∼2 d.
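
The statistical filtering step described above can be illustrated with a small sketch. The abstract does not name the test used, so this example assumes a one-sided hypergeometric (Fisher-type) test for over-representation of a given RNA pair among all chimeric fragments; the function names, the significance threshold, and the toy data are hypothetical, and no multiple-testing correction is applied.

```python
from collections import Counter
from math import comb


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def significant_pairs(chimeras, alpha=0.05):
    """Return {(rna1, rna2): p} for pairs over-represented among chimeras.

    chimeras: list of (rna1, rna2) tuples, one per chimeric fragment.
    Assumption: a pair is tested against the number of chimeras that have
    rna1 in the first position and rna2 in the second position.
    """
    pair_counts = Counter(chimeras)
    first = Counter(a for a, _ in chimeras)
    second = Counter(b for _, b in chimeras)
    total = len(chimeras)
    hits = {}
    for (a, b), k in pair_counts.items():
        p = hypergeom_sf(k, total, first[a], second[b])
        if p <= alpha:
            hits[(a, b)] = p
    return hits


# Toy data (hypothetical RNA names): two recurrent pairs plus two
# single chimeras of the kind random ligation might produce.
toy = (
    [("sRNA1", "mRNA_x")] * 8
    + [("sRNA2", "mRNA_y")] * 8
    + [("sRNA1", "mRNA_y"), ("sRNA2", "mRNA_x")]
)
hits = significant_pairs(toy)
print(sorted(hits))
```

In this toy input the two recurrent pairs pass the filter while the two singleton chimeras do not, which is the behaviour the abstract attributes to the filter: suppressing spurious ligations between unrelated fragments.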

Journal ArticleDOI
TL;DR: The optimization and application of scRNA-seq is highlighted to understand the development of intercellular heterogeneity, the genealogy and evolution of cells, and key driven transcriptome networks in response to drug efficacy and toxicity.
Abstract: There is a rapid increase of evidence addressing the importance of the interaction between single cells and drugs, and of the response of single cells to therapies. Single-cell measurements were used to evaluate the DNA-damaging ability of the herbicide in freshly isolated human leukocytes (Villarini et al. 2000) or the ethoxyresorufin-O-deethylase activity of cytochrome P450 1A1 in single living cells with the microspectrofluorometric technique (Taira et al. 2007). The measurements of single-cell biology and sequencing are recently considered as an important approach to investigate molecular mechanisms of drug efficacy and resistance, discovery and development of therapeutic targets, and genealogic phenotypes of cells during disease progression (Chu et al. 2017; Wang 2016; Wang et al. 2017). Single-cell sequencing is an important measure to define intercellular heterogeneity, rare cell types, cell genealogies, somatic mosaicism, microbes, and disease evolution, including single-cell DNA genome sequencing, DNA methylome sequencing, and RNA sequencing. Of those, single-cell RNA sequencing (scRNA-seq) demonstrates transcriptomic cell-to-cell variation, new cell types, developmental processes, transcriptional stochasticity, transcriptome plasticity, and genome evolution (Wang 2015) (Fig. 1). The present article aims to highlight the optimization and application of scRNA-seq to understand the development of intercellular heterogeneity, the genealogy and evolution of cells, and key driven transcriptome networks in response to drug efficacy and toxicity. The experimental design and technical challenges are critical in the application of scRNA-seq (Kukurba and Montgomery 2015). A number of practical protocols have been developed and validated with a great variation of RNA sequencing sensitivity and accuracy. Ziegenhain et al. (2017) made a comprehensive comparison of six prominent scRNA-seq methods (CEL-seq2, Drop-seq, MARS-seq, SCRB-seq, Smart-seq, and Smart-seq2), based on scRNA-seq data from mouse embryonic stem cells, to support an informed choice among them. Svensson et al. (2017) evaluated the protocol sensitivity and accuracy of the published data sets as well as the study designs by comparing them with 15 other protocols computationally and 4 protocols experimentally for batch-matched cell populations. Using the spike-in standards and uniform data processing, they developed a flexible tool for counting the number of unique molecular identifiers (https://github.com/vals/umis/). Such a protocol makes it possible to perform scRNA-seq and to compare gene expression, novel transcripts, alternatively spliced genes, and allele-specific expression among numerous studies and performers. Among scRNA-seq preparation procedures, cryopreserved cells prepared with 3′-end and full-length RNA methods were found to generate the same transcriptional profiles as fresh cells do (Guillaumet-Adkins et al. 2017). Intercellular heterogeneity is a dominant element of intratumor heterogeneity, responsible for the development
Cell Biol Toxicol, DOI 10.1007/s10565-017-9404-y
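
The idea of counting unique molecular identifiers (UMIs) mentioned above can be sketched in a few lines. This is an illustrative toy, not the implementation of the linked tool: it assumes reads have already been assigned to a cell barcode and a gene, it performs no UMI error correction, and the function name and input format are hypothetical.

```python
from collections import defaultdict


def count_umis(reads):
    """Count distinct UMIs per (cell, gene).

    reads: iterable of (cell_barcode, gene, umi) tuples.
    Reads sharing cell, gene and UMI are treated as PCR duplicates
    of one original molecule and counted once.
    """
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Toy input: the first two reads are duplicates of the same molecule.
reads = [
    ("cellA", "gene1", "AACG"),
    ("cellA", "gene1", "AACG"),
    ("cellA", "gene1", "TTGC"),
    ("cellB", "gene1", "AACG"),
]
counts = count_umis(reads)
print(counts)  # {('cellA', 'gene1'): 2, ('cellB', 'gene1'): 1}
```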

Journal ArticleDOI
TL;DR: SeqOutBias efficiently corrects enzymatic sequence bias and facilitates identification of true molecular signatures resulting from transcription factors and RNA polymerase interacting with DNA.
Abstract: Coupling molecular biology to high-throughput sequencing has revolutionized the study of biology. Molecular genomics techniques are continually refined to provide higher resolution mapping of nucleic acid interactions and structure. Sequence preferences of enzymes can interfere with the accurate interpretation of these data. We developed seqOutBias to characterize enzymatic sequence bias from experimental data and scale individual sequence reads to correct intrinsic enzymatic sequence biases. SeqOutBias efficiently corrects DNase-seq, TACh-seq, ATAC-seq, MNase-seq and PRO-seq data. We show that seqOutBias correction facilitates identification of true molecular signatures resulting from transcription factors and RNA polymerase interacting with DNA.

Journal ArticleDOI
TL;DR: The results of this study suggest that SHIV should be considered a member of the proposed new genus “Xiairidovirus”.
Abstract: Infection with shrimp hemocyte iridescent virus (SHIV), a new virus of the family Iridoviridae isolated in China, results in a high mortality rate in white leg shrimp (Litopenaeus vannamei). The complete genome sequence of SHIV was determined and analyzed in this study. The genomic DNA was 165,809 bp long with 34.6% G+C content and 170 open reading frames (ORFs). Dotplot analysis showed that the longest repetitive region was 320 bp in length, including 11 repetitions of an 18-bp sequence and 3.1 repetitions of a 39-bp sequence. Two phylogenetic trees were constructed based on 27 or 16 concatenated sequences of proteins encoded by genes that are conserved between SHIV homologous and other iridescent viruses. The results of this study suggest that SHIV should be considered a member of the proposed new genus “Xiairidovirus”.

Journal ArticleDOI
TL;DR: This comparative analysis of two high-quality chromosome assemblies enabled a comprehensive assessment of large structural variations and gene content of the 700-megabase chromosome 2D between two bread wheat genotypes.
Abstract: Recent improvements in DNA sequencing and genome scaffolding have paved the way to generate high-quality de novo assemblies of pseudomolecules representing complete chromosomes of wheat and its wild relatives. These assemblies form the basis to compare the dynamics of wheat genomes on a megabase scale. Here, we provide a comparative sequence analysis of the 700-megabase chromosome 2D between two bread wheat genotypes—the old landrace Chinese Spring and the elite Swiss spring wheat line ‘CH Campala Lr22a’. Both chromosomes were assembled into megabase-sized scaffolds. There is a high degree of sequence conservation between the two chromosomes. Analysis of large structural variations reveals four large indels of more than 100 kb. Based on the molecular signatures at the breakpoints, unequal crossing over and double-strand break repair were identified as the molecular mechanisms that caused these indels. Three of the large indels affect copy number of NLRs, a gene family involved in plant immunity. Analysis of SNP density reveals four haploblocks of 4, 8, 9 and 48 Mb with a 35-fold increased SNP density compared to the rest of the chromosome. Gene content across the two chromosomes was highly conserved. Ninety-nine percent of the genic sequences were present in both genotypes and the fraction of unique genes ranged from 0.4 to 0.7%. This comparative analysis of two high-quality chromosome assemblies enabled a comprehensive assessment of large structural variations and gene content. The insight obtained from this analysis will form the basis of future wheat pan-genome studies.

Journal ArticleDOI
TL;DR: During routine laboratory work, all colistin-resistant isolates were subjected to mcr-1 qPCR screening as previously described, and it was revealed that pEC1066 and pEC2380 are ColE plasmids with a close relationship to pSE13-SA01718 of Salmonella Paratyphi B (dTa+).
Abstract: Sir, Colistin is considered a last-resort antibiotic used to treat severe human infections caused by MDR Gram-negative bacteria. Thus, spread of colistin resistance among humans would be associated with major public health concerns. In 2015, the first mobile colistin resistance gene, mcr-1, encoding a phosphoethanolamine transferase enzyme, was identified on a transmissible plasmid. Soon after, mcr-2 and mcr-3 were detected on other conjugative plasmids in Enterobacteriaceae. Recently, Carattoli et al. and Borowiak et al. reported two novel genes, mcr-4 and mcr-5, in Salmonella enterica serovar Typhimurium and d-tartrate-fermenting S. enterica serovar Paratyphi B [Salmonella Paratyphi B (dTa+)], respectively. Both genes were located on non-conjugative ColE plasmids, either transmissible by a helper plasmid (mcr-4) or mobilizable by a Tn3-type transposon (mcr-5, registered as Tn6452 on the LSTM website, http://transposon.lstmed.ac.uk/) found on the Salmonella Paratyphi B (dTa+) chromosome and plasmids. Identification of different mcr genes in Enterobacteriaceae raised concerns about their distribution and genetic diversity. Thus, a molecular survey on the detection of mcr-5-harbouring Escherichia coli was initiated. In the German national monitoring programme for antimicrobial resistance in zoonotic agents the National Reference Laboratory for Antimicrobial Resistance has received 19216 E. coli isolates from food and food-producing animals between 2010 and 2017 for antimicrobial resistance testing. Of these, 737 isolates exhibited an MIC of colistin 4 mg/L. During routine laboratory work, all colistin-resistant isolates were subjected to mcr-1 qPCR screening as previously described. Using the PCR assay of Borowiak et al. on mcr-1-negative E. coli (n = 135), three mcr-5-positive isolates were detected in samples recovered from the caecal contents of pigs at slaughter (10E01066) and faecal samples from pigs at farms (11E02380, 15-AB00674). These isolates exhibit a non-WT phenotype (resistant) for colistin and other antimicrobials using the microdilution method according to CLSI guidelines (M07-A9) following EUCAST epidemiological cut-off values (Table S1, available as Supplementary data at JAC Online). S1-PFGE profiling and DNA hybridization showed a plasmid location of mcr-5 in the isolates. Short-read paired-end MiSeq sequencing of genomic DNA and de novo assembly were performed as previously described. Relevant genome features of the isolates are summarized in Table S1. The complete genomes of the mcr-5-harbouring plasmids were derived from WGS data by raw-read mapping against Salmonella Paratyphi B (dTa+) plasmid pSE13-SA01718 (KY807921). In 10E01066 and 11E02380 mcr-5 was detected on 12201 bp (pEC1066, MG587003) and 11708 bp (pEC2380, MG587004) plasmids, respectively. Comparative analyses using BRIG revealed that pEC1066 and pEC2380 are ColE plasmids with a close relationship to pSE13-SA01718 of Salmonella Paratyphi B (dTa+) (Figure 1a). Isolate 15-AB00674 carries mcr-5 on a 6268 bp plasmid (pEC0674, MF684783) (Figure 1b) exhibiting a stronger similarity to Klebsiella pneumoniae plasmid Kp13 (CP003996.1), potentially belonging to an as yet unknown incompatibility type. Interestingly, pEC0674 lacks tnpA and tnpR of the mcr-5 transposon. Further DNA alignments showed that pEC1066 and pEC2380 carry identical mcr-5 transposon sequences as described for pSE13-SA01718. Interestingly, mcr-5 on pEC2380 exhibits a deletion of three nucleotides encoding an amino acid in the central part of the protein (Figure 1c). As this isolate exhibits the highest MIC (8 mg/L) of colistin for the isolates of this study (Table S1), this deletion may not affect the domain structure of the enzyme or functional amino acids involved in resistance development (Figure 1c). Since this mcr-5 allele is the first variant, we suggest designating it mcr-5.2 (MG384740). The increased colistin resistance (MIC = 8 mg/L) of isolate 11E02380 may be caused by a mutation in pmrB (V161G), which is known to contribute to colistin resistance in E. coli. In contrast to isolate 11E02380, no colistin resistance-associated pmrA or pmrB gene mutations were detected in E. coli 10E01066 and 15-AB00674. The mcr-5-carrying plasmids of this study do not carry transfer genes involved in plasmid conjugation. Further investigations were performed with pEC1066, which was therefore introduced by transformation into chemically competent E. coli cells (DH5α, Thermo Fisher) under colistin (2 mg/L) selection. The presence of

Journal ArticleDOI
TL;DR: A database and interpretive criteria for identifying individual species of nontuberculous mycobacteria are developed and clinical usefulness of 16S rRNA, hsp65, and rpoB as target genes for this method is evaluated.
Abstract: Background: The isolation of nontuberculous mycobacteria (NTM) from clinical specimens has increased, and they are now considered significant opportunistic pathogens. The aims of this study were to develop a database and interpretive criteria for identifying individual species. In addition, using clinical isolates, we evaluated the clinical usefulness of 16S rRNA, hsp65, and rpoB as target genes for this method. Methods: The sequences of NTM for 16S rRNA, hsp65, and rpoB were collected from GenBank and checked by manual inspection. Clinical isolates collected between 2005 and 2010 were used for DNA extraction, polymerase chain reaction, and sequencing of these three genes. We constructed a database for the genes and evaluated the clinical utility of multilocus sequence analysis (MLSA) using 109 clinical isolates. Results: A total of 131, 130, and 122 sequences were collected from GenBank for 16S rRNA, hsp65, and rpoB, respectively. The percent similarities of the three genes ranged from 96.57% to 100% for the 16S rRNA gene, 89.27% to 100% for hsp65, and 92.71% to 100% for rpoB. When we compared the sequences of 109 clinical strains with those of the database, the rates of species-level identification were 71.3%, 86.79%, and 81.55% with 16S rRNA, hsp65, and rpoB, respectively. We could identify 97.25% of the isolates to the species level when we used MLSA. Conclusion: There were significant differences among the utilities of the three genes for species identification. The MLSA technique would be helpful for identification of NTM.
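
Percent similarity against a reference database, as used above for species identification, can be illustrated as follows. This is a minimal sketch under strong assumptions: sequences are already aligned to equal length, similarity is the plain fraction of identical positions, and the species labels and sequences are toy placeholders rather than real NTM data. The abstract does not describe the study's actual interpretive criteria, so none are encoded here.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def best_match(query: str, database: dict) -> tuple:
    """Return (name, percent identity) of the most similar database entry."""
    scored = ((name, percent_identity(query, seq)) for name, seq in database.items())
    return max(scored, key=lambda item: item[1])


# Toy database with placeholder labels and sequences.
database = {
    "species_A": "ACGTACGA",
    "species_B": "ACGTTTTT",
}
result = best_match("ACGTACGT", database)
print(result)  # ('species_A', 87.5)
```

A multilocus variant would repeat this comparison for each gene (here 16S rRNA, hsp65, and rpoB) and combine the per-gene results; how they are combined is a design choice this sketch leaves open.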