
Showing papers on "Sequence analysis" published in 2019


Journal Article
TL;DR: A high-throughput amplicon sequencing methodology based on PacBio circular consensus sequencing and the DADA2 sample inference method that measures the full-length 16S rRNA gene with single-nucleotide resolution and a near-zero error rate is presented.
Abstract: Targeted PCR amplification and high-throughput sequencing (amplicon sequencing) of 16S rRNA gene fragments is widely used to profile microbial communities. New long-read sequencing technologies can sequence the entire 16S rRNA gene, but higher error rates have limited their attractiveness when accuracy is important. Here we present a high-throughput amplicon sequencing methodology based on PacBio circular consensus sequencing and the DADA2 sample inference method that measures the full-length 16S rRNA gene with single-nucleotide resolution and a near-zero error rate. In two artificial communities of known composition, our method recovered the full complement of full-length 16S sequence variants from expected community members without residual errors. The measured abundances of intra-genomic sequence variants were in the integral ratios expected from the genuine allelic variants within a genome. The full-length 16S gene sequences recovered by our approach allowed Escherichia coli strains to be correctly classified to the O157:H7 and K12 sub-species clades. In human fecal samples, our method showed strong technical replication and was able to recover the full complement of 16S rRNA alleles in several E. coli strains. There are likely many applications beyond microbial profiling for which high-throughput amplicon sequencing of complete genes with single-nucleotide resolution will be of use.

263 citations


Journal Article
TL;DR: Generation of a library of 62,389 mapped insertion mutants for the unicellular alga Chlamydomonas reinhardtii enables screening for genes required for photosynthesis and the identification of 303 candidate genes.
Abstract: Photosynthetic organisms provide food and energy for nearly all life on Earth, yet half of their protein-coding genes remain uncharacterized [1,2]. Characterization of these genes could be greatly accelerated by new genetic resources for unicellular organisms. Here we generated a genome-wide, indexed library of mapped insertion mutants for the unicellular alga Chlamydomonas reinhardtii. The 62,389 mutants in the library, covering 83% of nuclear protein-coding genes, are available to the community. Each mutant contains unique DNA barcodes, allowing the collection to be screened as a pool. We performed a genome-wide survey of genes required for photosynthesis, which identified 303 candidate genes. Characterization of one of these genes, the conserved predicted phosphatase-encoding gene CPL3, showed that it is important for accumulation of multiple photosynthetic protein complexes. Notably, 21 of the 43 higher-confidence genes are novel, opening new opportunities for advances in understanding of this biogeochemically fundamental process. This library will accelerate the characterization of thousands of genes in algae, plants, and animals.

187 citations


Journal Article
TL;DR: This work used a full-length, direct RNA sequencing (DRS) approach based on nanopores to characterize viral RNAs produced in cells infected with a human coronavirus, and paves the way for haplotype-based analyses of viral quasispecies by showing the feasibility of intra-sample haplotype separation.
Abstract: Sequence analyses of RNA virus genomes remain challenging owing to the exceptional genetic plasticity of these viruses. Because of high mutation and recombination rates, genome replication by viral RNA-dependent RNA polymerases leads to populations of closely related viruses, so-called "quasispecies." Standard (short-read) sequencing technologies are ill-suited to reconstruct large numbers of full-length haplotypes of (1) RNA virus genomes and (2) subgenome-length (sg) RNAs composed of noncontiguous genome regions. Here, we used a full-length, direct RNA sequencing (DRS) approach based on nanopores to characterize viral RNAs produced in cells infected with a human coronavirus. By using DRS, we were able to map the longest (∼26-kb) contiguous read to the viral reference genome. By combining Illumina and Oxford Nanopore sequencing, we reconstructed a highly accurate consensus sequence of the human coronavirus (HCoV)-229E genome (27.3 kb). Furthermore, by using long reads that did not require an assembly step, we were able to identify, in infected cells, diverse and novel HCoV-229E sg RNAs that remain to be characterized. Also, the DRS approach, which circumvents reverse transcription and amplification of RNA, allowed us to detect methylation sites in viral RNAs. Our work paves the way for haplotype-based analyses of viral quasispecies by showing the feasibility of intra-sample haplotype separation. Even though several technical challenges remain to be addressed to exploit the potential of the nanopore technology fully, our work illustrates that DRS may significantly advance genomic studies of complex virus populations, including predictions on long-range interactions in individual full-length viral RNA haplotypes.

177 citations


Journal Article
TL;DR: A computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies and shows that the corresponding sequence is highly accurate and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome.
Abstract: We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33-79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.

128 citations


Journal Article
21 Jan 2019-Nature
TL;DR: Single-cell, thiol-(SH)-linked alkylation of RNA for metabolic labelling sequencing (scSLAM-seq) is introduced, which integrates metabolic RNA labelling [2], biochemical nucleoside conversion [3] and scRNA-seq to record transcriptional activity directly by differentiating between new and old RNA for thousands of genes per single cell.
Abstract: Single-cell RNA sequencing (scRNA-seq) has highlighted the important role of intercellular heterogeneity in phenotype variability in both health and disease [1]. However, current scRNA-seq approaches provide only a snapshot of gene expression and convey little information on the true temporal dynamics and stochastic nature of transcription. A further key limitation of scRNA-seq analysis is that the RNA profile of each individual cell can be analysed only once. Here we introduce single-cell, thiol-(SH)-linked alkylation of RNA for metabolic labelling sequencing (scSLAM-seq), which integrates metabolic RNA labelling [2], biochemical nucleoside conversion [3] and scRNA-seq to record transcriptional activity directly by differentiating between new and old RNA for thousands of genes per single cell. We use scSLAM-seq to study the onset of infection with lytic cytomegalovirus in single mouse fibroblasts. The cell-cycle state and dose of infection deduced from old RNA enable dose-response analysis based on new RNA. scSLAM-seq thereby both visualizes and explains differences in transcriptional activity at the single-cell level. Furthermore, it depicts 'on-off' switches and transcriptional burst kinetics in host gene expression with extensive gene-specific differences that correlate with promoter-intrinsic features (TBP-TATA-box interactions and DNA methylation). Thus, gene-specific, and not cell-specific, features explain the heterogeneity in transcriptomes between individual cells and the transcriptional response to perturbations.

122 citations


Journal Article
TL;DR: It is shown that m7G-MaP-seq efficiently detects known m7G modifications in rRNA with mutational rates up to 25%, and a previously uncharacterised evolutionarily conserved rRNA modification at position 1581 in Arabidopsis thaliana SSU rRNA is mapped.
Abstract: Methylation of guanosine at position N7 (m7G) at internal RNA positions has been found in all domains of life and has been implicated in human disease. Here, we present m7G Mutational Profiling sequencing (m7G-MaP-seq), which allows high-throughput detection of m7G modifications at nucleotide resolution. In our method, m7G-modified positions are converted to abasic sites by reduction with sodium borohydride, directly recorded as cDNA mutations through reverse transcription, and sequenced. We detect positions with increased mutation rates in the reduced and control samples, taking the possibility of sequencing/alignment error into account, and use replicates to calculate statistical significance based on log likelihood ratio tests. We show that m7G-MaP-seq efficiently detects known m7G modifications in rRNA with mutational rates up to 25%, and we map a previously uncharacterised evolutionarily conserved rRNA modification at position 1581 in Arabidopsis thaliana SSU rRNA. Furthermore, we identify m7G modifications in budding yeast, human and Arabidopsis tRNAs and demonstrate that m7G modification occurs before tRNA splicing. We do not find any evidence for internal m7G modifications being present in other small RNAs, such as miRNA, snoRNA and sRNA, including human Let-7e. Likewise, high sequencing depth m7G-MaP-seq analysis of mRNA from E. coli or yeast cells did not identify any internal m7G modifications.
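
The per-position comparison described in this abstract can be illustrated with a minimal sketch. It assumes a plain two-sample binomial log likelihood ratio test on mutation counts at a single position; the function names and the example counts are hypothetical, and the published method additionally models sequencing/alignment error and combines replicates, which this sketch does not attempt.

```python
import math


def binom_loglik(k, n, p):
    """Binomial log-likelihood without the binomial coefficient,
    which cancels when two models are compared on the same data."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll


def mutation_llr(mut_treated, depth_treated, mut_control, depth_control):
    """Log likelihood ratio statistic (G) for one position.

    Null model: treated and control share one mutation rate.
    Alternative: each sample has its own mutation rate.
    """
    p_t = mut_treated / depth_treated
    p_c = mut_control / depth_control
    p_pooled = (mut_treated + mut_control) / (depth_treated + depth_control)
    ll_alt = (binom_loglik(mut_treated, depth_treated, p_t)
              + binom_loglik(mut_control, depth_control, p_c))
    ll_null = (binom_loglik(mut_treated, depth_treated, p_pooled)
               + binom_loglik(mut_control, depth_control, p_pooled))
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (ll_alt - ll_null))


def llr_pvalue(g_stat):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(g_stat / 2.0))


if __name__ == "__main__":
    # Hypothetical counts: 25% mutation rate after reduction, 0.5% in control.
    g = mutation_llr(250, 1000, 5, 1000)
    print(g, llr_pvalue(g))
```

A position would be flagged only when the treated rate exceeds the control rate and the p-value survives multiple-testing correction across positions; both steps are left out here for brevity.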

111 citations


Journal Article
TL;DR: It is found that 3′ untranslated region length is correlated with poly(A) tail length, that alternative polyadenylation sites and alternative promoters for the same gene are linked to different tail lengths, and that tails contain a substantial number of cytosines.
Abstract: Although messenger RNAs are key molecules for understanding life, until now, no method has existed to determine the full-length sequence of endogenous mRNAs including their poly(A) tails. Moreover, although non-A nucleotides can be incorporated in poly(A) tails, there also exists no method to accurately sequence them. Here, we present full-length poly(A) and mRNA sequencing (FLAM-seq), a rapid and simple method for high-quality sequencing of entire mRNAs. We report a complementary DNA library preparation method coupled to single-molecule sequencing to perform FLAM-seq. Using human cell lines, brain organoids and Caenorhabditis elegans we show that FLAM-seq delivers high-quality full-length mRNA sequences for thousands of different genes per sample. We find that 3′ untranslated region length is correlated with poly(A) tail length, that alternative polyadenylation sites and alternative promoters for the same gene are linked to different tail lengths, and that tails contain a substantial number of cytosines. FLAM-seq implements a cDNA library preparation followed by single-molecule sequencing, for determining full-length mRNA molecules, including poly(A) tails.

101 citations


Journal Article
TL;DR: The evolution of species HAdV-C was studied by generating 51 complete genome sequences from circulating strains and clustering of the E4 region indicated recombination events in 26 out of the 51 sequenced specimens.
Abstract: Currently, 88 different Human Adenovirus (HAdV) types are grouped into seven HAdV species A to G. Most types (57) belong to species HAdV-D. Recombination between capsid genes (hexon, penton and fiber) is the main factor contributing to the diversity in species HAdV-D. Notably, species HAdV-C so far contains only five types, although it is highly prevalent and clinically significant in immunosuppressed patients. Therefore, the evolution of species HAdV-C was studied by generating 51 complete genome sequences from circulating strains. Clustering of the whole-genome HAdV-C sequences confirmed classical typing results (fifteen HAdV-C1, thirty HAdV-C2, four HAdV-C5, two HAdV-C6). However, two HAdV-C2 strains had a novel penton base sequence and thus were re-labeled as the novel type HAdV-C89. Fiber and early gene region 3 (E3) sequences always clustered with the corresponding prototype sequence, but clustering of the E4 region indicated recombination events in 26 out of the 51 sequenced specimens. Recombination of the E1 gene region was detected in 16 circulating strains. As early gene region sequences are not considered in the type definition of HAdVs, evolution of HAdV-C remains on the subtype level. Nonetheless, recombination of the E1 and E4 gene regions may influence the virulence of HAdV-C strains.

101 citations


Journal Article
TL;DR: New transcriptome alkylation-dependent single-cell RNA sequencing (NASC-seq) is developed, to monitor newly synthesised and pre-existing RNA simultaneously in single cells, and enables precise temporal monitoring of RNA synthesis at single-cell resolution during homoeostasis, perturbation responses and cellular differentiation.
Abstract: Sequencing of newly synthesised RNA can monitor transcriptional dynamics with great sensitivity and high temporal resolution, but is currently restricted to populations of cells. Here, we develop new transcriptome alkylation-dependent single-cell RNA sequencing (NASC-seq), to monitor newly synthesised and pre-existing RNA simultaneously in single cells. We validate the method on pre-labelled RNA, and by demonstrating that more newly synthesised RNA was detected for genes with known high mRNA turnover. Monitoring RNA synthesis during Jurkat T-cell activation with NASC-seq reveals both rapidly up- and down-regulated genes, and that induced genes are almost exclusively detected as newly transcribed. Moreover, the newly synthesised and pre-existing transcriptomes after T-cell activation are distinct, confirming that NASC-seq simultaneously measures gene expression corresponding to two time points in single cells. Altogether, NASC-seq enables precise temporal monitoring of RNA synthesis at single-cell resolution during homoeostasis, perturbation responses and cellular differentiation.

73 citations


Journal Article
17 Jun 2019-PLOS ONE
TL;DR: The results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
Abstract: The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.

63 citations


Journal Article
TL;DR: NtEdit is developed, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes and performs well at low sequence depths, fixing the majority of base substitutions and indels, and its performance is largely constant with increased coverage.
Abstract: MOTIVATION: In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. RESULTS: We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths, fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 m, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30-40 m on those sequences. We show how ntEdit ran in <2 h 20 m to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. AVAILABILITY AND IMPLEMENTATION: https://github.com/bcgsc/ntedit. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
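
The general idea of checking a draft assembly against a Bloom filter of read k-mers can be sketched as follows. This is an illustrative toy, not ntEdit's implementation: the class, the hashing scheme (salted SHA-256), the filter size and the helper names are all assumptions made for the example, and the sketch only flags unsupported draft k-mers rather than proposing edits.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_read_filter(reads, k):
    """Load every k-mer from the reads into a Bloom filter."""
    bloom = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return bloom


def unsupported_positions(draft, bloom, k):
    """Start positions of draft k-mers that are absent from the read filter."""
    return [i for i, kmer in enumerate(kmers(draft, k)) if kmer not in bloom]


if __name__ == "__main__":
    bloom = build_read_filter(["ACGTACGTGA"], k=4)
    # A single substitution (A -> T at index 4) breaks every k-mer covering it.
    print(unsupported_positions("ACGTTCGTGA", bloom, k=4))
```

A run of consecutive unsupported k-mers points at the base they all share, which is where a polishing tool would try alternative bases or indels and keep the change that restores k-mer support.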

Journal Article
TL;DR: The utility of ViromeQC is demonstrated by applying it to 2,050 human, animal and environmental samples from 35 metagenomic virome sequencing studies that used one of the available VLP enrichment techniques, revealing these viromes to be rife with bacterial, archaeal and fungal contamination.
Abstract: To the Editor: Eukaryotic viruses and bacteriophages have important roles in microbiomes, but characterization of viruses in metagenomics data is difficult. Viral-like particle (VLP) purification enables enrichment for viruses from microbiome samples before sequencing, but contamination can result in misleading conclusions. We present a software tool named ViromeQC for analyzing virome data. Here, we demonstrate the utility of ViromeQC by applying it to 2,050 human, animal and environmental samples from 35 metagenomic virome sequencing studies that used one of the available VLP enrichment techniques. The resulting analysis reveals these viromes to be rife with bacterial, archaeal and fungal contamination. Most samples show only modest virus enrichment, and such enrichment is very variable between viromes in the same study. To address these issues, we present a validated contamination quality-control pipeline to enable more robust virome metagenomic analyses. Viruses affect the ecology and composition of microbial communities [1,2]. Bacteriophages (viruses of bacteria and archaea) are extremely abundant and diverse, and they affect microbiomes in several ways, including transduction, which is an important mechanism of lateral gene transfer [3]. Metagenomics can be used to characterize phage populations, but phage are so diverse, and evolve so rapidly, that they are poorly represented in sequence databases. Also, there are no universal viral genetic markers, and the overall biomass of viruses, compared with that of other microorganisms in a sample, is low. For these reasons, phage sequences are difficult to identify in metagenomes, although specific methods that are partly based on sequence characteristics of known phages have been reported [4,5]. VLP purification can be used to enrich microbiome samples for viral nucleic acids [6], thereby improving virus detection.
VLP protocols have various goals, ranging from untargeted analyses of highly purified phage populations to targeted identification of rare sequences of viral pathogens in diagnostic samples. These methods typically include filtration through small-pore-size filters that retain bacteria, cesium chloride gradient purification, treatment with chloroform to disrupt membranes, and exposure to nucleases to reduce free DNA and RNA concentration. If the aim is to use metagenomics to detect known viral pathogens, a low-purity sample may suffice because identification will be by alignment of sequence reads to viral databases. However, if the aim is to detect unknown viruses or report all viruses in a sample, a high-purity sample is required. When coupled with untargeted shotgun sequencing [7], VLP enrichment has underpinned many studies in human [8,9], environmental [10,11] and built-environment settings [12], but there is no single VLP enrichment protocol that is optimal for all sample types. Regardless of the VLP protocol, non-viral genetic material remains after enrichment [13]. These unwanted nucleic acids are contaminants, and their presence particularly confounds the de novo discovery of phages in untargeted virome sequencing. If the VLP virome is pure, it is possible to assemble reads into possibly fragmented viral genomes without using computational prediction approaches, which are unavoidably affected by low-confidence calls and false negatives [4,5]. The fraction of next-generation sequencing reads belonging to viruses in the VLP sample correlates with the performance of de novo recovery of new viruses, but methods for evaluating VLP purity in samples have not been systematically explored. Studies have assessed contamination of VLP preparations by PCR amplification of prokaryotic 16S rRNA gene sequences before virome sequencing [11,14–19]. Others have mapped next-generation virome sequencing output against the 16S rRNA gene, or a different marker [9,20–24].
However, these studies have not provided a validated pipeline to quantify viral enrichment in viromes or unenriched samples. Although efforts toward VLP-protocol optimization have been reported [24], the largest meta-analysis of post-sequencing non-viral quantification to date considered just 67 viromes [13]. As the use of VLP enrichment for virome sequencing is increasing, we set out to evaluate non-viral contamination in >2,000 virome samples. To assess the enrichment rates of publicly available viromes, we applied our method (Supplementary Methods) to a collection of 2,050 VLP samples (Supplementary Table 1). As controls, we included 2,189 metagenomes that were not enriched for viruses from the curatedMetagenomicData [25] and the National Center for Biotechnology Information Sequence Read Archive (NCBI SRA) [26] repositories, as well as 108 publicly accessible synthetic metagenomes [27,28] and one mock community (Supplementary Table 2). After uniform preprocessing to remove low-quality reads (Supplementary Methods), we computed the percentage of raw reads in each sample that align to the small subunit ribosomal RNA gene (SSU rRNA), which has never been found in a viral genome. This provided a proxy for non-viral microbial sequence abundance [13]. We estimated the abundance of bacterial and archaeal 16S and microeukaryotic 18S ribosomal genes in all of the viromes and metagenomes. Unenriched metagenomes provided a baseline estimation of the environment-specific rRNA gene abundance, from which we calculated the relative enrichment of viromes with respect to the metagenomes. Environmental and human/animal unenriched metagenomes had a median rRNA gene abundance of 0.08% (n = 320, interquartile range = 0.07%) and 0.25% (n = 1,551, interquartile range = 0.1%), respectively (Fig. 1).
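
The rRNA-based proxy described in the preceding paragraph can be sketched in a few lines. The ratio used here (median baseline rRNA percentage divided by the virome's rRNA percentage) is an assumed reading of "relative enrichment with respect to the metagenomes", not a formula quoted from the text, and the function names and example numbers are hypothetical.

```python
from statistics import median


def rrna_percent(rrna_aligned_reads, total_reads):
    """Percentage of reads that align to SSU rRNA genes."""
    return 100.0 * rrna_aligned_reads / total_reads


def enrichment_fold(virome_rrna_pct, baseline_rrna_pcts):
    """Assumed enrichment score: baseline median rRNA % over virome rRNA %.

    Values above 1 mean the virome carries less rRNA signal than typical
    unenriched metagenomes from the same kind of environment.
    """
    baseline = median(baseline_rrna_pcts)
    if virome_rrna_pct == 0:
        # No rRNA reads detected at all: treat as unbounded enrichment.
        return float("inf")
    return baseline / virome_rrna_pct


if __name__ == "__main__":
    # Hypothetical human/animal baseline percentages around the reported 0.25% median.
    baseline = [0.20, 0.25, 0.30]
    virome_pct = rrna_percent(25, 100_000)  # 0.025% of reads hit rRNA
    print(enrichment_fold(virome_pct, baseline))
```

Under this reading, a virome whose rRNA percentage matches or exceeds the baseline median scores at or below onefold, which corresponds to the "no virus enrichment" category in the results.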
Prokaryotic and micro-eukaryotic contamination of viromes estimated by the quantification of the SSU rRNA revealed a wide range of enrichment efficiencies, with a large fraction of samples (n = 567, 28.7%) having no virus enrichment at all and >50% (n = 990) having less than threefold enrichment. A substantially smaller fraction of samples (n = 339, 17.15%) showed high enrichment (>100-fold). Differences in enrichment rates were not clearly associated with any one VLP purification method, although the heterogeneity of protocols makes it difficult to provide statistical support to this observation. According to taxonomic annotations of the rRNA gene sequences retrieved in viromes, the largest source of contamination was bacterial DNA (1,466 samples), with 88 samples having higher abundances of eukaryotic-associated SSU rRNAs (Supplementary Table 3). The rRNA gene abundance variability was higher in viromes than in metagenomes (Mann–Whitney U test P = 7.5 × 10⁻⁸, Supplementary Fig. 1), revealing not only that many viromes are poorly enriched for

Journal Article
TL;DR: Overall, ST1193 appears to be a recently emerged clone in which both stepwise and mosaic evolution have contributed to epidemiologic success.
Abstract: The fluoroquinolone-resistant sequence type 1193 (ST1193) of Escherichia coli, from the ST14 clonal complex (STc14) within phylogenetic group B2, has appeared recently as an important cause of extraintestinal disease in humans. Although this emerging lineage has been characterized to some extent using conventional methods, it has not been studied extensively at the genomic level. Here, we used whole-genome sequence analysis to compare 355 ST1193 isolates with 72 isolates from other STs within STc14. Using core genome phylogeny, the ST1193 isolates formed a tightly clustered clade with many genotypic similarities, unlike ST14 isolates. All ST1193 isolates possessed the same set of three chromosomal mutations conferring fluoroquinolone resistance, carried the fimH64 allele, and were lactose non-fermenting. Analysis revealed an evolutionary progression from K1 to K5 capsular types and acquisition of an F-type virulence plasmid, followed by changes in plasmid structure congruent with genome phylogeny. In contrast, the numerous identified antimicrobial resistance genes were distributed incongruently with the underlying phylogeny, suggesting frequent gain or loss of the corresponding resistance gene cassettes despite retention of the presumed carrier plasmids. Pangenome analysis revealed gains and losses of genetic loci occurring during the transition from ST14 to ST1193 and from the K1 to K5 capsular types. Using time-scaled phylogenetic analysis, we estimated that current ST1193 clades first emerged approximately 25 years ago. Overall, ST1193 appears to be a recently emerged clone in which both stepwise and mosaic evolution have contributed to epidemiologic success.

Journal Article
TL;DR: tRNA reduction and cleavage sequencing (TRAC-Seq) is a chemically based approach for the unbiased global mapping of 7-methylguanosine (m7G) modification of tRNAs at single-nucleotide resolution throughout the tRNA transcriptome.
Abstract: Precise identification of sites of RNA modification is key to studying the functional role of such modifications in the regulation of gene expression and for elucidating relevance to diverse physiological processes. tRNA reduction and cleavage sequencing (TRAC-Seq) is a chemically based approach for the unbiased global mapping of 7-methylguanosine (m7G) modification of tRNAs at single-nucleotide resolution throughout the tRNA transcriptome. m7G TRAC-Seq involves the treatment of size-selected (<200 nt) RNAs with the demethylase AlkB to remove major tRNA modifications, followed by sodium borohydride (NaBH4) reduction of m7G sites and subsequent aniline-mediated cleavage of the RNA chain at the resulting abasic sites. The cleaved sites are subsequently ligated with adaptors for the construction of libraries for high-throughput sequencing. The m7G modification sites are identified using a bioinformatic pipeline that calculates the cleavage scores at individual sites on all tRNAs. Unlike antibody-based methods, such as methylated RNA immunoprecipitation and sequencing (meRIP-Seq) for enrichment of methylated RNA sequences, chemically based approaches, including TRAC-Seq, can provide nucleotide-level resolution of modification sites. Compared to the related method AlkAniline-Seq (alkaline hydrolysis and aniline cleavage sequencing), TRAC-Seq incorporates small RNA selection, AlkB demethylation, and sodium borohydride reduction steps to achieve specific and efficient single-nucleotide resolution profiling of m7G sites in tRNAs. The m7G TRAC-Seq protocol could be adapted to chemical cleavage-mediated detection of other RNA modifications. The protocol can be completed within ~9 d for four biological replicates of input and treated samples.

Journal Article
TL;DR: Phylogenetic analysis illustrated that the metal clusters responsible for intermolecular electron transfer were drastically altered during evolution, and network analysis among the structural groups of Ni-CODHs, their associated proteins and taxonomies revealed previously unrecognized gene clusters for Ni-CODH, including uncharacterized structural groups with putative metal transporters, oxidoreductases, or transcription factors.
Abstract: Anaerobic Ni-containing carbon-monoxide dehydrogenases (Ni-CODHs) catalyze the reversible conversion between carbon monoxide and carbon dioxide as multi-enzyme complexes responsible for carbon fixation and energy conservation in anaerobic microbes. However, few biochemically characterized model enzymes exist, with most Ni-CODHs remaining functionally unknown. Here, we performed phylogenetic and structure-based Ni-CODH classification using an expanded dataset comprised of 1942 non-redundant Ni-CODHs from 1375 Ni-CODH-encoding genomes across 36 phyla. Ni-CODHs were divided into seven clades, including a novel clade. Further classification into 24 structural groups based on sequence analysis combined with structural prediction revealed diverse structural motifs for metal cluster formation and catalysis, including novel structural motifs potentially capable of forming metal clusters or binding metal ions, indicating Ni-CODH diversity and plasticity. Phylogenetic analysis illustrated that the metal clusters responsible for intermolecular electron transfer were drastically altered during evolution. Additionally, we identified novel putative Ni-CODH-associated proteins from genomic contexts other than the Wood–Ljungdahl pathway and energy converting hydrogenase system proteins. Network analysis among the structural groups of Ni-CODHs, their associated proteins and taxonomies revealed previously unrecognized gene clusters for Ni-CODHs, including uncharacterized structural groups with putative metal transporters, oxidoreductases, or transcription factors. These results suggested diversification of Ni-CODH structures adapting to their associated proteins across microbial genomes.

Journal ArticleDOI
24 Jun 2019-PLOS ONE
TL;DR: A simplified workflow for amplification of IgG antibody variable regions from hybridoma RNA by a specialized RT-PCR followed by Sanger sequencing is described, which successfully sequenced the variable regions of five mouse monoclonal IgG antibodies and enabled the design of chimeric mouse/human antibody expression plasmids for recombinant antibody production in mammalian cell culture expression systems.
Abstract: The diversity of antibody variable regions makes cDNA sequencing challenging, and conventional monoclonal antibody cDNA amplification requires the use of degenerate primers. Here, we describe a simplified workflow for amplification of IgG antibody variable regions from hybridoma RNA by a specialized RT-PCR followed by Sanger sequencing. We perform three separate reactions for each hybridoma: one each for kappa, lambda, and heavy chain transcripts. We prime reverse transcription with a primer specific to the respective constant region and use a template-switch oligonucleotide, which creates a custom sequence at the 5' end of the antibody cDNA. This template-switching circumvents the issue of low sequence homology and the need for degenerate primers. Instead, subsequent PCR amplification of the antibody cDNA molecules requires only two primers: one primer specific for the template-switch oligonucleotide sequence and a nested primer to the respective constant region. We successfully sequenced the variable regions of five mouse monoclonal IgG antibodies using this method, which enabled us to design chimeric mouse/human antibody expression plasmids for recombinant antibody production in mammalian cell culture expression systems. All five recombinant antibodies bind their respective antigens with high affinity, confirming that the amino acid sequences determined by our method are correct and demonstrating the high success rate of our method. Furthermore, we also designed RT-PCR primers and amplified the variable regions from RNA of cells transfected with chimeric mouse/human antibody expression plasmids, showing that our approach is also applicable to IgG antibodies of human origin. Our monoclonal antibody sequencing method is highly accurate, user-friendly, and very cost-effective.

Journal ArticleDOI
TL;DR: The use of biotinylated oligos and streptavidin-coated paramagnetic beads for the efficient and specific depletion of trypanosomal rRNA is described, providing a useful alternative for rRNA removal where enrichment of polyadenylated transcripts is not an option and commercial kits for rRNA are not available.
Abstract: In most organisms, ribosomal RNA (rRNA) contributes to >85% of total RNA. Thus, to obtain useful information from RNA-sequencing (RNA-seq) analyses at reasonable sequencing depth, typically, mature polyadenylated transcripts are enriched or rRNA molecules are depleted. Targeted depletion of rRNA is particularly useful when studying transcripts lacking a poly(A) tail, such as some non-coding RNAs (ncRNAs), most bacterial RNAs and partially degraded or immature transcripts. While several commercially available kits allow effective rRNA depletion, their efficiency relies on a high degree of sequence homology between oligonucleotide probes and the target RNA. This restricts the use of such kits to a limited number of organisms with conserved rRNA sequences. In this study we describe the use of biotinylated oligos and streptavidin-coated paramagnetic beads for the efficient and specific depletion of trypanosomal rRNA. Our approach reduces the levels of the most abundant rRNA transcripts to less than 5% with minimal off-target effects. By adjusting the sequence of the oligonucleotide probes, our approach can be used to deplete rRNAs or other abundant transcripts independent of species. Thus, our protocol provides a useful alternative for rRNA removal where enrichment of polyadenylated transcripts is not an option and commercial kits for rRNA are not available.

Journal ArticleDOI
TL;DR: The molecular basis for Rph1-mediated resistance to leaf rust in cultivated barley enabling varietal improvement through diagnostic marker design, gene editing, and gene stacking technologies is determined.
Abstract: Unraveling and exploiting mechanisms of disease resistance in cereal crops is currently limited by their large repeat-rich genomes and the lack of genetic recombination or cultivar (cv)-specific sequence information. We cloned the first leaf rust resistance gene Rph1 (Rph1.a) from cultivated barley (Hordeum vulgare) using “MutChromSeq,” a recently developed molecular genomics tool for the rapid cloning of genes in plants. Marker-trait association in the CI 9214/Stirling doubled haploid population mapped Rph1 to the short arm of chromosome 2H in a physical region of 1.3 megabases relative to the barley cv Morex reference assembly. A sodium azide mutant population in cv Sudan was generated and 10 mutants were confirmed by progeny-testing. Flow-sorted 2H chromosomes from Sudan (wild type) and six of the mutants were sequenced and compared to identify candidate genes for the Rph1 locus. MutChromSeq identified a single gene candidate encoding a coiled-coil nucleotide binding site Leucine-rich repeat (NLR) receptor protein that was altered in three different mutants. Further Sanger sequencing confirmed all three mutations and identified an additional two independent mutations within the same candidate gene. Phylogenetic analysis determined that Rph1 clustered separately from all previously cloned NLRs from the Triticeae and displayed highest sequence similarity (89%) with a homolog of the Arabidopsis (Arabidopsis thaliana) disease resistance protein 1 protein in Triticum urartu. In this study we determined the molecular basis for Rph1-mediated resistance in cultivated barley enabling varietal improvement through diagnostic marker design, gene editing, and gene stacking technologies.

Journal ArticleDOI
TL;DR: This work aimed to develop a cost‐effective sequencing method for ABCA4 exons and regions carrying known causal deep‐intronic variants that can be applied to Stargardt disease.
Abstract: Purpose: Stargardt disease (STGD1) is caused by biallelic mutations in ABCA4, but many patients are genetically unsolved due to insensitive mutation-scanning methods. We aimed to develop a cost-effective sequencing method for ABCA4 exons and regions carrying known causal deep-intronic variants. Methods: Fifty exons and 12 regions containing 14 deep-intronic variants of ABCA4 were sequenced using double-tiled single molecule Molecular Inversion Probe (smMIP)-based next-generation sequencing. DNAs of 16 STGD1 cases carrying 29 ABCA4 alleles and of four healthy persons were sequenced using 483 smMIPs. Thereafter, DNAs of 411 STGD1 cases with one or no ABCA4 variant were sequenced. The effect of novel noncoding variants on splicing was analyzed using in vitro splice assays. Results: Thirty-four ABCA4 variants previously identified in 16 STGD1 cases were reliably identified. In 155/411 probands (38%), two causal variants were identified. We identified 11 deep-intronic variants present in 62 alleles. Two known and two new noncanonical splice site variants showed splice defects, and one novel deep-intronic variant (c.4539+2065C>G) resulted in a 170-nt mRNA pseudoexon insertion (p.[Arg1514Lysfs*35,=]). Conclusions: smMIPs-based sequence analysis of coding and selected noncoding regions of ABCA4 enabled cost-effective mutation detection in previously unsolved STGD1 cases.

Journal ArticleDOI
TL;DR: The function of a newly identified RBP is shown, insights into alternative splicing regulation during maize kernel development are provided, and the results suggest that DEK42 participates in the regulation of pre-messenger RNA splicing through its interaction with other spliceosome components.
Abstract: RNA-binding proteins (RBPs) play an important role in post-transcriptional gene regulation. However, the functions of RBPs in plants remain poorly understood. Maize kernel mutant dek42 has small defective kernels and lethal seedlings. Dek42 was cloned by Mutator tag isolation and further confirmed by an independent mutant allele and clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR-associated protein 9 materials. Dek42 encodes an RRM_RBM48 type RNA-binding protein that localizes to the nucleus. Dek42 is constitutively expressed in various maize tissues. The dek42 mutation caused a significant reduction in the accumulation of DEK42 protein in mutant kernels. RNA-seq analysis showed that the dek42 mutation significantly disturbed the expression of thousands of genes during maize kernel development. Sequence analysis also showed that the dek42 mutation significantly changed alternative splicing in expressed genes, which were especially enriched for the U12-type intron-retained type. Yeast two-hybrid screening identified SF3a1 as a DEK42-interacting protein. DEK42 also interacts with the spliceosome component U1-70K. These results suggested that DEK42 participates in the regulation of pre-messenger RNA splicing through its interaction with other spliceosome components. This study showed the function of a newly identified RBP and provided insights into alternative splicing regulation during maize kernel development.

Journal ArticleDOI
TL;DR: This work developed a novel single cell strand-specific total RNA library preparation method addressing all the shortcomings of existing methods and demonstrating that the method detects an equal or higher number of genes compared to classic polyA[+] RNA-seq, including novel and non-polyadenylated genes.
Abstract: Single cell RNA sequencing methods have been increasingly used to understand cellular heterogeneity. Nevertheless, most of these methods suffer from one or more limitations, such as focusing only on polyadenylated RNA, sequencing of only the 3' end of the transcript, an exuberant fraction of reads mapping to ribosomal RNA, and the unstranded nature of the sequencing data. Here, we developed a novel single cell strand-specific total RNA library preparation method addressing all the aforementioned shortcomings. Our method was validated on a microfluidics system using three different cancer cell lines undergoing a chemical or genetic perturbation and on two other cancer cell lines sorted in microplates. We demonstrate that our total RNA-seq method detects an equal or higher number of genes compared to classic polyA[+] RNA-seq, including novel and non-polyadenylated genes. The obtained RNA expression patterns also recapitulate the expected biological signal. Inherent to total RNA-seq, our method is also able to detect circular RNAs. Taken together, SMARTer single cell total RNA sequencing is very well suited for any single cell sequencing experiment in which transcript level information is needed beyond polyadenylated genes.

Journal ArticleDOI
TL;DR: The analysis of core genes revealed the extent, source, and mechanisms of recombination events that shaped the current population and genomic structure of X. perforans in Florida.
Abstract: Prior to the identification of Xanthomonas perforans associated with bacterial spot of tomato in 1991, X. euvesicatoria was the only known species in Florida. Currently, X. perforans is the Xanthomonas sp. associated with tomato in Florida. Changes in pathogenic race and sequence alleles over time signify shifts in the dominant X. perforans genotype in Florida. We previously reported recombination of X. perforans strains with closely related Xanthomonas species as a potential driving factor for X. perforans evolution. However, the extent of recombination across the X. perforans genomes was unknown. We used a core genome multilocus sequence analysis approach to identify conserved genes and evaluated recombination-associated evolution of these genes in X. perforans. A total of 1,356 genes were determined to be "core" genes conserved among the 58 X. perforans genomes used in the study. Our approach identified three genetic groups of X. perforans in Florida based on the principal component analysis (PCA) using core genes. Nucleotide variation in 241 genes defined these groups; these genes are referred to as Phylogenetic-group Defining (PgD) genes. Furthermore, alleles of many of these PgD genes showed 100% sequence identity with X. euvesicatoria, suggesting that variation likely has been introduced by recombination at multiple locations throughout the bacterial chromosome. Site-specific recombinase genes along with plasmid mobilization and phage associated genes were observed at different frequencies in the three phylogenetic groups and were associated with clusters of recombinant genes. Our analysis of core genes revealed the extent, source, and mechanisms of recombination events that shaped the current population and genomic structure of X. perforans in Florida.

Journal ArticleDOI
15 Feb 2019-PLOS ONE
TL;DR: Comparing 16S rRNA gene sequences obtained from a completely assembled whole genome with those from Sanger electrophoresis sequencing of cloned PCR products from Serratia fonticola GS2, it was observed that the greater the number of copies with similar sequences, the higher their chance of amplification, and that sequence variation among the copies did not affect species identification.
Abstract: Variable region analysis of 16S rRNA gene sequences is the most common tool in bacterial taxonomic studies. Although used for distinguishing bacterial species, its use remains limited due to the presence of variable copy numbers with sequence variation in the genomes. In this study, 16S rRNA gene sequences, obtained from completely assembled whole genome and Sanger electrophoresis sequencing of cloned PCR products from Serratia fonticola GS2, were compared. Sanger sequencing produced a combination of sequences from multiple copies of 16S rRNA genes. To determine whether the variant copies of 16S rRNA genes affected Sanger sequencing, two ratios (5:5 and 8:2) with different concentrations of cloned 16S rRNA genes were used; it was observed that the greater the number of copies with similar sequences the higher its chance of amplification. Effect of multiple copies for taxonomic classification of 16S rRNA gene sequences was investigated using the strain GS2 as a model. 16S rRNA copies with the maximum variation had 99.42% minimum pairwise similarity and this did not have an effect on species identification. Thus, PCR products from genomes containing variable 16S rRNA gene copies can provide sufficient information for species identification except from species which have high similarity of sequences in their 16S rRNA gene copies like the case of Bacillus thuringiensis and Bacillus cereus. In silico analysis of 1,616 bacterial genomes from long-read sequencing was also done. The average minimum pairwise similarity for each phylum was reported with their average genome size and average “unique copies” of 16S rRNA genes and we found that the phyla Proteobacteria and Firmicutes showed the highest amount of variation in their copies of their 16S rRNA genes. Overall, our results shed light on how the variations in the multiple copies of the 16S rRNA genes of bacteria can aid in appropriate species identification.
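
The "minimum pairwise similarity" among 16S rRNA gene copies reported above can be illustrated with a minimal sketch. It assumes the copies have already been aligned to equal length and defines similarity as the percentage of matching columns; the function names and toy sequences are hypothetical and are not taken from the study, which may have computed similarity differently.

```python
from itertools import combinations


def pairwise_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore alignment columns where both sequences carry a gap.
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        raise ValueError("no comparable columns")
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def minimum_pairwise_identity(copies: dict[str, str]) -> float:
    """Lowest percent identity over all pairs of gene copies."""
    if len(copies) < 2:
        raise ValueError("need at least two copies")
    return min(
        pairwise_identity(a, b) for a, b in combinations(copies.values(), 2)
    )


# Hypothetical toy copies (not real Serratia fonticola sequences).
toy_copies = {
    "copy1": "ACGTACGTAC",
    "copy2": "ACGTACGTAC",
    "copy3": "ACGTACGAAC",
}
result = minimum_pairwise_identity(toy_copies)
```

In this toy input, copy3 differs from the other two copies at one of ten positions, so the minimum pairwise identity is 90%.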

Journal ArticleDOI
TL;DR: A novel bottom-up oligonucleotide sequence mapping workflow combining multiple endonucleases that cleave mRNA at different frequencies is developed, providing high-throughput sequence identification and sensitive impurity detection.
Abstract: Characterization of mRNA sequences is a critical aspect of mRNA drug development and regulatory filing. Herein, we developed a novel bottom-up oligonucleotide sequence mapping workflow combining multiple endonucleases that cleave mRNA at different frequencies. RNase T1, colicin E5, and mazF were applied in parallel to provide complementary sequence coverage for large mRNAs. Combined use of multiple endonucleases resulted in significantly improved sequence coverage: greater than 70% sequence coverage was achieved on mRNAs near 3000 nucleotides long. Oligonucleotide mapping simulations with large human RNA databases demonstrate that the proposed workflow can positively identify a single correct sequence from hundreds of similarly sized sequences. In addition, the workflow is sensitive and specific enough to detect minor sequence impurities such as single nucleotide polymorphisms (SNPs) with a sensitivity of less than 1%. LC-MS/MS-based oligonucleotide sequence mapping can serve as an orthogonal sequence characterization method to techniques such as Sanger sequencing or next-generation sequencing (NGS), providing high-throughput sequence identification and sensitive impurity detection.
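
As a rough illustration of in silico oligonucleotide mapping, the sketch below digests an RNA sequence with RNase T1, which cleaves on the 3' side of guanosine, and estimates the fraction of the sequence covered by fragments that occur only once. The uniqueness criterion, the `min_len` cutoff, and the function names are assumptions made for illustration; they are not the workflow or parameters used in the study, and the other two endonucleases are not modeled.

```python
from collections import Counter


def rnase_t1_digest(rna: str) -> list[tuple[int, str]]:
    """Split an RNA sequence after every G; return (start, fragment) pairs."""
    fragments = []
    start = 0
    for i, base in enumerate(rna):
        if base == "G":
            fragments.append((start, rna[start:i + 1]))
            start = i + 1
    # Keep the 3'-terminal fragment if the sequence does not end in G.
    if start < len(rna):
        fragments.append((start, rna[start:]))
    return fragments


def unique_fragment_coverage(rna: str, min_len: int = 4) -> float:
    """Fraction of bases in fragments that are unique and at least min_len long.

    Assumption: only fragments occurring once in the digest can be placed
    unambiguously, and very short fragments are treated as uninformative.
    """
    if not rna:
        raise ValueError("empty sequence")
    fragments = rnase_t1_digest(rna)
    counts = Counter(fragment for _, fragment in fragments)
    covered = sum(
        len(fragment)
        for _, fragment in fragments
        if len(fragment) >= min_len and counts[fragment] == 1
    )
    return covered / len(rna)


# Hypothetical toy sequence, far shorter than a real mRNA.
toy_rna = "AUGCCGAAUGG"
toy_fragments = rnase_t1_digest(toy_rna)
toy_coverage = unique_fragment_coverage(toy_rna)
```

Running the same digest with enzymes of different cleavage specificity and taking the union of the covered positions is one way to picture why combining endonucleases improves coverage.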

Journal ArticleDOI
TL;DR: Insight is provided into the architecture associated with C4 photosynthesis gene expression in particular and characteristics of transcription factor binding in cereal crops in general.
Abstract: The majority of plants use C3 photosynthesis, but over 60 independent lineages of angiosperms have evolved the C4 pathway. In most C4 species, photosynthesis gene expression is compartmented between mesophyll and bundle-sheath cells. We performed DNaseI sequencing to identify genome-wide profiles of transcription factor binding in leaves of the C4 grasses Zea mays, Sorghum bicolor, and Setaria italica as well as C3 Brachypodium distachyon. In C4 species, while bundle-sheath strands and whole leaves shared similarity in the broad regions of DNA accessible to transcription factors, the short sequences bound varied. Transcription factor binding was prevalent in gene bodies as well as promoters, and many of these sites could represent duons that influence gene regulation in addition to amino acid sequence. Although globally there was little correlation between any individual DNaseI footprint and cell-specific gene expression, within individual species transcription factor binding to the same motifs in multiple genes provided evidence for shared mechanisms governing C4 photosynthesis gene expression. Furthermore, interspecific comparisons identified a small number of highly conserved transcription factor binding sites associated with leaves from species that diverged around 60 million years ago. These data therefore provide insight into the architecture associated with C4 photosynthesis gene expression in particular and characteristics of transcription factor binding in cereal crops in general.

Journal ArticleDOI
TL;DR: Analysis of the evolution, structure, tissue specificity and expression of the euAP2 genes in Brassica napus provides insights for further functional characterization of the miR172/euAP2 module in B. napus.
Abstract: APETALA2-like genes encode plant-specific transcription factors, some of which possess one microRNA172 (miR172) binding site. The miR172 and its target euAP2 genes are involved in the process of phase transformation and flower organ development in many plants. However, the roles of miR172 and its target AP2 genes remain largely unknown in Brassica napus (B. napus). In this study, 19 euAP2 and four miR172 genes were identified in the B. napus genome. A sequence analysis suggested that 17 euAP2 genes were targeted by Bna-miR172 in the 3′ coding region. EuAP2s were classified into five major groups in B. napus. This classification was consistent with the exon-intron structure and motif organization. An analysis of the nonsynonymous and synonymous substitution rates revealed that the euAP2 genes had gone through purifying selection. Whole genome duplication (WGD) or segmental duplication events played a major role in the expansion of the euAP2 gene family. A cis-regulatory element (CRE) analysis suggested that the euAP2s were involved in the response to light, hormones, stress, and developmental processes including circadian control, endosperm and meristem expression. Expression analysis of the miR172-targeted euAP2s in nine different tissues showed diverse spatiotemporal expression patterns. Most euAP2 genes were highly expressed in the floral organs, suggesting their specific functions in flower development. BnaAP2–1, BnaAP2–5 and BnaTOE1–2 had higher expression levels in late-flowering material than early-flowering material based on RNA-seq and qRT-PCR, indicating that they may act as floral suppressors. Overall, analyses of the evolution, structure, tissue specificity and expression of the euAP2 genes were performed in B. napus. Based on the RNA-seq and experimental data, euAP2 may be involved in flower development. Three euAP2 genes (BnaAP2–1, BnaAP2–5 and BnaTOE1–2) might be regarded as floral suppressors. The results of this study provide insights for further functional characterization of the miR172/euAP2 module in B. napus.

Journal ArticleDOI
TL;DR: It is demonstrated that sequences encoded in the viral genome determine the location of cbDVG formation and, therefore, the generation of cbDVGs is not a stochastic process.
Abstract: Defective viral genomes of the copy-back type (cbDVGs) are the primary initiators of the antiviral immune response during infection with respiratory syncytial virus (RSV) both in vitro and in vivo. However, the mechanism governing cbDVG generation remains unknown, thereby limiting our ability to manipulate cbDVG content in order to modulate the host response to infection. Here we report a specific genomic signal that mediates the generation of a subset of RSV cbDVG species. Using a customized bioinformatics tool, we identified regions in the RSV genome frequently used to generate cbDVGs during infection. We then created a minigenome system to validate the function of one of these sequences and to determine if specific nucleotides were essential for cbDVG generation at that position. Further, we created a recombinant virus unable to produce a subset of cbDVGs due to mutations introduced in this sequence. The identified sequence was also found as a site for cbDVG generation during natural RSV infections, and common cbDVGs originated at this sequence were found among samples from various infected patients. These data demonstrate that sequences encoded in the viral genome determine the location of cbDVG formation and, therefore, the generation of cbDVGs is not a stochastic process. These findings open the possibility of genetically manipulating cbDVG formation to modulate infection outcome.

Journal ArticleDOI
TL;DR: A bisulfite-free and base-resolution sequencing method based on peroxotungstate oxidation is presented for the identification of hm5C sites in the transcriptome, and m5C can also be detected in a procedure termed TET-Assisted WO-Seq (TAWO-Seq).

Journal ArticleDOI
07 Jun 2019-Science
TL;DR: Yizhak et al. (1) add to the evidence that normal tissues are not so normal after all, finding a large number of acquired (somatic) DNA mutations—some of which are typically associated with cancer—and mutational clones of macroscopic size, in the absence of cancer pathology.
Abstract: Measuring and understanding the dynamics of clonal cell populations is key for cancer prevention. Cancers are formed by the expansion of harmful “mutational clones,” which are cell populations carrying the same DNA mutations. How harmful they are depends on which mutated genes they contain. On page 970 of this issue, Yizhak et al. (1) add to the evidence that normal tissues are not so normal after all. Examining a substantial number of healthy tissues from almost 500 individuals, they found a large number of acquired (somatic) DNA mutations, some of which are typically associated with cancer, and mutational clones of macroscopic size, in the absence of cancer pathology. The presence in normal tissues of clonal cell populations, with mutations in cancer-associated genes, is informative about how a tumor evolves from the first mutation to a benign growth and, finally, to cancer.

Journal ArticleDOI
TL;DR: This study is a first step toward standardizing methods for 16S rRNA gene sequencing and bioinformatics analysis of vaginal microbiome data; the clustering-based methods were also compared with the use of DADA2 for denoising and clustering of sequence reads.
Abstract: Background: Identification of bacteria in human vaginal specimens is commonly performed using 16S ribosomal RNA (rRNA) gene sequences. However, studies utilize different 16S primer sets, sequence databases, and parameters for sample and database clustering. Our goal was to assess the ability of these methods to detect common species of vaginal bacteria. Methods: We performed an in silico analysis of 16S rRNA gene primer sets, targeting different hypervariable regions. Using vaginal samples from women with bacterial vaginosis, we sequenced 16S genes using the V1-V3, V3-V4, and V4 primer sets. For analysis, we used an extended Greengenes database including 16S gene sequences from vaginal bacteria not already present. We compared results with those obtained using the SILVA 16S database. Using multiple database and sample clustering parameters, each primer set's ability to detect common vaginal bacteria at the species level was determined. We also compared these methods to the use of DADA2 for denoising and clustering of sequence reads. Results: V4 sequence reads clustered at 99% identity and using the 99% clustered, extended Greengenes database provided optimal species-level identification of vaginal bacteria. Conclusions: This study is a first step toward standardizing methods for 16S rRNA gene sequencing and bioinformatics analysis of vaginal microbiome data.