
Showing papers on "Sequence analysis" published in 2016


Journal ArticleDOI
TL;DR: A three-step method (MutRenSeq) that combines chemical mutagenesis with exome capture and sequencing for rapid R gene cloning is described and applied to clone stem rust resistance genes Sr22 and Sr45 from hexaploid bread wheat.
Abstract: Wild relatives of domesticated crop species harbor multiple, diverse, disease resistance (R) genes that could be used to engineer sustainable disease control. However, breeding R genes into crop lines often requires long breeding timelines of 5-15 years to break linkage between R genes and deleterious alleles (linkage drag). Further, when R genes are bred one at a time into crop lines, the protection that they confer is often overcome within a few seasons by pathogen evolution. If several cloned R genes were available, it would be possible to pyramid R genes in a crop, which might provide more durable resistance. We describe a three-step method (MutRenSeq) that combines chemical mutagenesis with exome capture and sequencing for rapid R gene cloning. We applied MutRenSeq to clone stem rust resistance genes Sr22 and Sr45 from hexaploid bread wheat. MutRenSeq can be applied to other commercially relevant crops and their relatives, including, for example, pea, bean, barley, oat, rye, rice and maize.

313 citations


Journal ArticleDOI
TL;DR: LIGR-seq data reveal unexpected interactions between small nucleolar (sno)RNAs and mRNAs, including those involving the orphan C/D box snoRNA, SNORD83B, that control steady-state levels of its target mRNAs.

287 citations


Journal ArticleDOI
TL;DR: It is found that the plasma DNA read depth patterns from healthy donors reflected the expression signature of hematopoietic cells, and that in patients with metastatic cancer, expressed cancer driver genes in regions with somatic copy number gains could be classified with high accuracy.
Abstract: The analysis of cell-free DNA (cfDNA) in plasma represents a rapidly advancing field in medicine. cfDNA consists predominantly of nucleosome-protected DNA shed into the bloodstream by cells undergoing apoptosis. We performed whole-genome sequencing of plasma DNA and identified two discrete regions at transcription start sites (TSSs) where nucleosome occupancy results in different read depth coverage patterns for expressed and silent genes. By employing machine learning for gene classification, we found that the plasma DNA read depth patterns from healthy donors reflected the expression signature of hematopoietic cells. In patients with cancer having metastatic disease, we were able to classify expressed cancer driver genes in regions with somatic copy number gains with high accuracy. We were able to determine the expressed isoform of genes with several TSSs, as confirmed by RNA-seq analysis of the matching primary tumor. Our analyses provide functional information about cells releasing their DNA into the circulation.

279 citations


Journal ArticleDOI
15 Jan 2016-Science
TL;DR: Thousands of human and viral sequences with cap-independent translation activity are uncovered, which provide a 50-fold increase in the number of sequences known to date and reveal the wide existence of cap-independent translation sequences in both humans and viruses.
Abstract: INTRODUCTION The recruitment of the ribosome to a specific mRNA is a critical step in the production of proteins in cells. In addition to a general recognition of the “cap” structure at the beginning of eukaryotic mRNAs, ribosomes can also initiate translation from a regulatory RNA element termed internal ribosome entry site (IRES) in a cap-independent manner. IRESs are essential for the synthesis of many human and viral proteins and take part in a variety of biological functions, such as viral infections, the response of cells to stress, and organismal development. Despite their importance, we lack systematic methods for discovering and characterizing IRESs, and thus, little is known about their position in the human and viral genomes and the mechanisms by which they recruit the ribosome. RATIONALE Our method enables accurate measurement of thousands of fully designed sequences for cap-independent translation activity. By using a synthetic oligonucleotide library, we can determine the exact composition of the sequences tested and can profile sequences from hundreds of different viruses, as well as the human genome, in a single experiment. In addition, synthetic design enables the construction of oligos in which we carefully and systematically mutate native IRESs and measure the effect of these mutations on expression. This reverse-genetics approach enables the characterization of the regulatory elements that recruit the ribosome and provide specificity in translation. RESULTS We uncover thousands of human and viral sequences with cap-independent translation activity, which provide a 50-fold increase in the number of sequences known to date. Unbiased screening of cap-independent activity across human transcripts demonstrates enrichment of regulatory elements in the untranslated region in the beginning of transcripts (5′UTR). 
However, we also find enrichment in the untranslated region located downstream of the coding sequence (3′UTR), which suggests a mechanism by which ribosomes are recruited to the 3′UTR to enhance the translation of an upstream sequence. A genome-wide profiling of positive-strand RNA viruses ([+]ssRNA) reveals the existence of translational elements along their coding regions. This finding suggests that [+]ssRNA viruses can translate only part of their genome, in addition to the synthesis and cleavage of a premature polyprotein. Our analysis reveals two classes of functional elements that drive cap-independent translation: (i) highly structured elements and (ii) unstructured elements that act through a short sequence motif. We show that many 5′UTRs can attract the ribosome by Watson-Crick base pairing with the 18S ribosomal RNA, a structural RNA component of the small ribosomal subunit (40S). In addition, we systematically investigate the functional regions of the 18S rRNA involved in these interactions that enhance cap-independent translation. CONCLUSIONS These results reveal the wide existence of cap-independent translation sequences in both humans and viruses. They provide insights on the landscape of translational regulation and uncover the regulatory elements underlying cap-independent translation activity.

261 citations


Journal ArticleDOI
TL;DR: This work proposes a weighted Bonferroni adjustment that controls for the family-wise error rate (FWER), using as weights the enrichment of sequence annotations among association signals, and shows that this weighted adjustment increases the power to detect association over the standard Bonferroni correction.
Abstract: The consensus approach to genome-wide association studies (GWAS) has been to assign equal prior probability of association to all sequence variants tested. However, some sequence variants, such as loss-of-function and missense variants, are more likely than others to affect protein function and are therefore more likely to be causative. Using data from whole-genome sequencing of 2,636 Icelanders and the association results for 96 quantitative and 123 binary phenotypes, we estimated the enrichment of association signals by sequence annotation. We propose a weighted Bonferroni adjustment that controls for the family-wise error rate (FWER), using as weights the enrichment of sequence annotations among association signals. We show that this weighted adjustment increases the power to detect association over the standard Bonferroni correction. We use the enrichment of associations by sequence annotation we have estimated in Iceland to derive significance thresholds for other populations with different numbers and combinations of sequence variants.
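The weighted adjustment described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the annotation classes, and the enrichment values are all hypothetical. The sketch uses the standard weighted Bonferroni rule, in which variant i is tested at threshold alpha * w_i / sum(w), so the thresholds sum to alpha and the FWER stays controlled while variants in enriched annotation classes get a more lenient cutoff.

```python
def weighted_bonferroni_thresholds(weights, alpha=0.05):
    """Return per-variant significance thresholds alpha * w_i / sum(w)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [alpha * w / total for w in weights]


# Hypothetical enrichment weights per annotation class (illustrative values only).
enrichment = {"loss_of_function": 100.0, "missense": 30.0, "other": 1.0}

# Hypothetical annotation of four tested variants.
variants = ["loss_of_function", "missense", "other", "other"]
weights = [enrichment[annotation] for annotation in variants]

thresholds = weighted_bonferroni_thresholds(weights, alpha=0.05)
for annotation, threshold in zip(variants, thresholds):
    print(f"{annotation}: reject if p <= {threshold:.3e}")
```

With equal weights the function reduces to the standard Bonferroni threshold alpha / m for every variant.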

184 citations


Journal ArticleDOI
TL;DR: Functional analyses of a few widely expressed fusions showed that silencing them resulted in a dramatic reduction in normal cell growth and/or motility; the implications of these non-pathological fusions in cancer and in evolution are also explored.
Abstract: Gene fusions and their products (RNA and protein) were once thought to be unique features to cancer. However, chimeric RNAs can also be found in normal cells. Here, we performed, curated and analyzed nearly 300 RNA-Seq libraries covering 30 different non-neoplastic human tissues and cells as well as 15 mouse tissues. A large number of fusion transcripts were found. Most fusions were detected only once, while 291 were seen in more than one sample. We focused on the recurrent fusions and performed RNA and protein level validations on a subset. We characterized these fusions based on various features of the fusions, and their parental genes. They tend to be expressed at higher levels relative to their parental genes than the non-recurrent ones. Over half of the recurrent fusions involve neighboring genes transcribing in the same direction. A few sequence motifs were found enriched close to the fusion junction sites. We performed functional analyses on a few widely expressed fusions, and found that silencing them resulted in dramatic reduction in normal cell growth and/or motility. Most chimeras use canonical splicing sites, thus are likely products of 'intergenic splicing'. We also explored the implications of these non-pathological fusions in cancer and in evolution.

147 citations


Journal ArticleDOI
TL;DR: A detailed protocol for G&T-seq, a method for separation and parallel sequencing of genomic DNA and full-length polyA(+) mRNA from single cells, which allows the detection of thousands of transcripts in parallel with the genetic variants captured by the DNA-seq data from the same single cell.
Abstract: Parallel sequencing of a single cell's genome and transcriptome provides a powerful tool for dissecting genetic variation and its relationship with gene expression. Here we present a detailed protocol for G&T-seq, a method for separation and parallel sequencing of genomic DNA and full-length polyA(+) mRNA from single cells. The protocol covers the physical separation of polyA(+) mRNA from genomic DNA using a modified oligo-dT bead capture and the respective whole-transcriptome and whole-genome amplifications; and library preparation and sequence analyses of these amplification products. The method allows the detection of thousands of transcripts in parallel with the genetic variants captured by the DNA-seq data from the same single cell. G&T-seq differs from other currently available methods for parallel DNA and RNA sequencing from single cells, as it involves physical separation of the DNA and RNA and does not require bespoke microfluidics platforms. The process can be implemented manually or through automation. When performed manually, paired genome and transcriptome sequencing libraries from eight single cells can be produced in ∼3 d by researchers experienced in molecular laboratory work. For users with experience in the programming and operation of liquid-handling robots, paired DNA and RNA libraries from 96 single cells can be produced in the same time frame. Sequence analysis and integration of single-cell G&T-seq DNA and RNA data requires a high level of bioinformatics expertise and familiarity with a wide range of informatics tools.

141 citations


Journal ArticleDOI
TL;DR: A revision of an earlier phylogeny of Bacteroidetes has been performed using the 16S rRNA gene as a backbone in combination with the 23S rRNA gene, as well as multilocus sequence analysis (MLSA) of 29 orthologous protein sequences, and indels in the sequences of the beta subunit of the F-type ATPase and the alanyl-tRNA synthetase.

118 citations


Journal ArticleDOI
TL;DR: A set of spike-in RNA standards, termed 'sequins' (sequencing spike-ins), is developed; they represent full-length spliced mRNA isoforms and provide a qualitative and quantitative reference with which to navigate the complexity of the human transcriptome.
Abstract: RNA sequencing (RNA-seq) can be used to assemble spliced isoforms, quantify expressed genes and provide a global profile of the transcriptome. However, the size and diversity of the transcriptome, the wide dynamic range in gene expression and inherent technical biases confound RNA-seq analysis. We have developed a set of spike-in RNA standards, termed 'sequins' (sequencing spike-ins), that represent full-length spliced mRNA isoforms. Sequins have an entirely artificial sequence with no homology to natural reference genomes, but they align to gene loci encoded on an artificial in silico chromosome. The combination of multiple sequins across a range of concentrations emulates alternative splicing and differential gene expression, and it provides scaling factors for normalization between samples. We demonstrate the use of sequins in RNA-seq experiments to measure sample-specific biases and determine the limits of reliable transcript assembly and quantification in accompanying human RNA samples. In addition, we have designed a complementary set of sequins that represent fusion genes arising from rearrangements of the in silico chromosome to aid in cancer diagnosis. RNA sequins provide a qualitative and quantitative reference with which to navigate the complexity of the human transcriptome.

113 citations


01 Jan 2016
TL;DR: Analysis of the secondary structure of the wild-type, mutant, and pseudorevertant LamB signal sequences suggests that the secondary mutations restore export by allowing the formation of a stable alpha-helical conformation in the central, hydrophobic region of the signal sequence.
Abstract: Mutant Escherichia coli strains in which export of the LamB protein (coded for by the lamB gene) to the outer membrane of the cell is prevented have been described previously. One of these mutant strains contains a small (12-base pair) deletion mutation within the region of the lamB gene that codes for the NH2-terminal signal sequence. In this mutant strain, export but not synthesis of the LamB protein is blocked. We have isolated pseudorevertants that restore export of functional LamB protein to the outer membrane. DNA sequence analysis showed that two of the revertants contain a point mutation in addition to the original deletion. These point mutations lead to amino acid substitutions within the signal sequence. Our results indicate that these secondary mutations efficiently suppress the export defect caused by the deletion mutation. Analysis of the secondary structure of the wild-type, mutant, and pseudorevertant LamB signal sequences suggests that the secondary mutations restore export by allowing the formation of a stable α-helical conformation in the central, hydrophobic region of the signal sequence. The mechanism of protein secretion in both prokaryotic and eukaryotic cells appears to require the presence of an extra se-

105 citations



Journal ArticleDOI
TL;DR: Partial genome sequences of 122 RNA bacteriophage phylotypes are identified that were present in samples collected from a range of ecological niches worldwide, including invertebrates and extreme microbial sediment, demonstrating that they are more widely distributed than previously recognized.
Abstract: Bacteriophage modulation of microbial populations impacts critical processes in ocean, soil, and animal ecosystems. However, the role of bacteriophages with RNA genomes (RNA bacteriophages) in these processes is poorly understood, in part because of the limited number of known RNA bacteriophage species. Here, we identify partial genome sequences of 122 RNA bacteriophage phylotypes that are highly divergent from each other and from previously described RNA bacteriophages. These novel RNA bacteriophage sequences were present in samples collected from a range of ecological niches worldwide, including invertebrates and extreme microbial sediment, demonstrating that they are more widely distributed than previously recognized. Genomic analyses of these novel bacteriophages yielded multiple novel genome organizations. Furthermore, one RNA bacteriophage was detected in the transcriptome of a pure culture of Streptomyces avermitilis, suggesting for the first time that the known tropism of RNA bacteriophages may include gram-positive bacteria. Finally, reverse transcription PCR (RT-PCR)-based screening for two specific RNA bacteriophages in stool samples from a longitudinal cohort of macaques suggested that they are generally acutely present rather than persistent.

Journal ArticleDOI
TL;DR: This study describes a set of conserved but functionally diverse structural RNA motifs that occur in multiple coding regions of the HCV genome, and it is demonstrated that conformational changes in these motifs influence specific stages in the virus' life cycle.

Journal ArticleDOI
TL;DR: A much faster and more affordable strategy for sequencing and assembling mammalian Y Chromosomes of sufficient quality for most comparative genomics analyses and for conservation genetics applications is presented; it can be used to reconstruct sex chromosomes in a heterogametic sex of any species.
Abstract: The mammalian Y Chromosome sequence, critical for studying male fertility and dispersal, is enriched in repeats and palindromes, and thus, is the most difficult component of the genome to assemble. Previously, expensive and labor-intensive BAC-based techniques were used to sequence the Y for a handful of mammalian species. Here, we present a much faster and more affordable strategy for sequencing and assembling mammalian Y Chromosomes of sufficient quality for most comparative genomics analyses and for conservation genetics applications. The strategy combines flow sorting, short- and long-read genome and transcriptome sequencing, and droplet digital PCR with novel and existing computational methods. It can be used to reconstruct sex chromosomes in a heterogametic sex of any species. We applied our strategy to produce a draft of the gorilla Y sequence. The resulting assembly allowed us to refine gene content, evaluate copy number of ampliconic gene families, locate species-specific palindromes, examine the repetitive element content, and produce sequence alignments with human and chimpanzee Y Chromosomes. Our results inform the evolution of the hominine (human, chimpanzee, and gorilla) Y Chromosomes. Surprisingly, we found the gorilla Y Chromosome to be similar to the human Y Chromosome, but not to the chimpanzee Y Chromosome. Moreover, we have utilized the assembled gorilla Y Chromosome sequence to design genetic markers for studying the male-specific dispersal of this endangered species.

01 Jan 2016
TL;DR: Sequence analysis revealed that the chicken vimentin gene contains two sets of tandem polyadenylylation sites, 249 and 532 nucleotides downstream from the stop codon for protein synthesis.
Abstract: Genomic clones and cDNA plasmids were isolated for the intermediate filament protein vimentin from chicken. The identity of the various clones was determined both by mRNA selection (Paterson, B. M. & Roberts, B. E. (1981) in Gene Amplification and Analysis, Structural Analysis of Nucleic Acids, eds. Chirikjian, J. G. & Papas, T. S. (Elsevier, North Holland), Vol. 2, pp. 418-435) and nucleotide sequence analysis. Restriction analysis, hybridization data, and heteroduplex studies confirmed that all of the genomic isolates contained overlapping fragments of an identical vimentin gene. No evidence for the existence of a second vimentin gene could be found by a Southern analysis either by using coding fragments from the purified vimentin gene or by using cDNA plasmids as probe. Likewise, copy-number experiments verified that the vimentin gene was present only once in the haploid chicken genome. However, in a RNA blot analysis, at least two equally abundant vimentin mRNA species of approximately 2,200 and 2,500 nucleotides in length were detected in all RNAs tested. Sequence analysis revealed that the vimentin gene contained two sets of tandem polyadenylylation sites, 249 and 532 nucleotides downstream from the stop codon for protein synthesis. It is proposed that the larger mRNA species arise because of complete transcription of the 3'-end of the vimentin gene (560 nucleotides of 3' nontranslated sequence), whereas the smaller

Journal ArticleDOI
TL;DR: An integrated system, ALPS, which for the first time can automatically assemble full-length monoclonal antibody sequences by integrating de novo sequencing peptides, their quality scores and error-correction information from databases into a weighted de Bruijn graph.
Abstract: De novo protein sequencing is one of the key problems in mass spectrometry-based proteomics, especially for novel proteins such as monoclonal antibodies for which genome information is often limited or not available. However, due to limitations in peptides fragmentation and coverage, as well as ambiguities in spectra interpretation, complete de novo assembly of unknown protein sequences still remains challenging. To address this problem, we propose an integrated system, ALPS, which for the first time can automatically assemble full-length monoclonal antibody sequences. Our system integrates de novo sequencing peptides, their quality scores and error-correction information from databases into a weighted de Bruijn graph to assemble protein sequences. We evaluated ALPS performance on two antibody data sets, each including a heavy chain and a light chain. The results show that ALPS was able to assemble three complete monoclonal antibody sequences of length 216-441 AA, at 100% coverage, and 96.64-100% accuracy.
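The idea of assembling peptides through a weighted de Bruijn graph can be illustrated with a toy sketch. Everything here is an assumption made for illustration and should not be read as the ALPS algorithm: the function names, the peptide strings and scores, the choice of k, and the greedy heaviest-edge walk are all hypothetical simplifications. Each peptide contributes its score to every k-mer edge it covers, so overlapping high-confidence peptides reinforce the same path and a low-scoring erroneous peptide is outvoted.

```python
from collections import defaultdict


def build_weighted_graph(peptides, k=4):
    """Edges join overlapping (k-1)-mers; edge weight = summed peptide scores."""
    graph = defaultdict(lambda: defaultdict(float))
    for seq, score in peptides:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += score
    return graph


def greedy_assemble(graph, start, max_len=1000):
    """Follow the heaviest unused outgoing edge from `start` until none is left."""
    contig, node, used = start, start, set()
    while len(contig) < max_len:
        candidates = [(weight, nxt) for nxt, weight in graph.get(node, {}).items()
                      if (node, nxt) not in used]
        if not candidates:
            break
        _, nxt = max(candidates)
        used.add((node, nxt))
        contig += nxt[-1]
        node = nxt
    return contig


# Hypothetical de novo peptides with confidence scores; the last one carries
# a sequencing error (A instead of G) and a low score.
peptides = [
    ("EVQLVES", 0.9),
    ("LVESGGG", 0.8),
    ("SGGGLVQ", 0.95),
    ("VESGAGG", 0.2),
]
graph = build_weighted_graph(peptides, k=4)
assembled = greedy_assemble(graph, start="EVQ")
print(assembled)
```

A greedy walk is the simplest possible traversal; a real assembler would need to handle repeats, branching, and isobaric residue ambiguities, none of which this sketch attempts.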

Journal ArticleDOI
TL;DR: This first report of the Kharchia Local root transcriptome offers insight into the mechanisms and genes involved in salt tolerance, which can be used to improve salt tolerance in elite wheat cultivars and to develop tolerant germplasm for other cereal crops.
Abstract: Kharchia Local wheat variety is an Indian salt tolerant land race known for its tolerance to salinity. However, there is a lack of detailed information regarding molecular mechanism imparting tolerance to high salinity in this bread wheat. In the present study, differential root transcriptome analysis identifying salt stress responsive gene networks and functional annotation under salt stress in Kharchia Local was performed. A total of 453,882 reads were obtained after quality filtering, using Roche 454-GS FLX Titanium sequencing technology. From these reads 22,241 ESTs were generated out of which, 17,911 unigenes were obtained. A total of 14,898 unigenes were annotated against nr protein database. Seventy seven transcription factors families in 826 unigenes and 11,002 SSRs in 6,939 unigenes were identified. Kyoto Encyclopedia of Genes and Genomes database identified 310 metabolic pathways. The expression pattern of few selected genes was compared during the time course of salt stress treatment between salt-tolerant (Kharchia Local) and susceptible (HD2687). The transcriptome data is the first report, which offers an insight into the mechanisms and genes involved in salt tolerance. This information can be used to improve salt tolerance in elite wheat cultivars and to develop tolerant germplasm for other cereal crops.

Journal ArticleDOI
TL;DR: This work identified 413 and 709 multi-exon noncoding transcripts from 353 and 595 loci of the cultivar tomato Heinz1706 and its wild relative LA1589, respectively, and shows that they are poorly conserved in Solanaceae, providing novel insights into the evolution of lncRNAs in plants.
Abstract: Long noncoding RNAs (lncRNAs) regulate gene expression and biological processes. With the development of high-throughput RNA sequencing technology, lncRNAs have been extensively studied in recent years. Nevertheless, the expression and evolution of lncRNAs in plants remain poorly understood. Here, we identified 413 and 709 multi-exon noncoding transcripts from 353 and 595 loci of the cultivar tomato Heinz1706 and its wild relative LA1589, respectively. Systematic comparison of the sequence and expression of lncRNAs showed that they are poorly conserved in Solanaceae, with only < 0.4% lncRNAs present in all sequenced genomes of tomato and potato. Sequence analysis of Lycopersicon-specific lncRNA loci in Solanum lycopersicum and S. pennellii showed that the origins of these molecules are associated with transposable elements (TEs). LncRNA-314, a fruit-specific lncRNA expressed in S. lycopersicum and S. pimpinellifolium, but not in S. pennellii, originated through two evolutionary events: speciation of S. pennellii resulted in insertion of a long terminal repeat (LTR) retrotransposon into chromosome 10 and contributed to most of the transcribed region of lncRNA-314; and a large deletion in Lycopersicon generated the promoter region and part of the transcribed region of lncRNA-314. These results provide novel insights into the evolution of lncRNAs in plants.

Journal ArticleDOI
TL;DR: This study provides genomic and physiological insight into Intestinimonas butyriciproducens, a prevalent butyrate-producing species, differentiating strains that originate from the mouse and human gut.
Abstract: Intestinimonas is a newly described bacterial genus with representative strains present in the intestinal tract of human and other animals. Despite unique metabolic features including the production of butyrate from both sugars and amino acids, there is to date no data on their diversity, ecology, and physiology. Using a comprehensive phylogenetic approach, Intestinimonas was found to include at least three species that colonize primarily the human and mouse intestine. We focused on the most common and cultivable species of the genus, Intestinimonas butyriciproducens, and performed detailed genomic and physiological comparison of strains SRB521T and AF211, isolated from the mouse and human gut respectively. The complete 3.3-Mb genomic sequences of both strains were highly similar with 98.8% average nucleotide identity, testifying to their assignment to one single species. However, thorough analysis revealed significant genomic rearrangements, variations in phage-derived sequences, and the presence of new CRISPR sequences in both strains. Moreover, strain AF211 appeared to be more efficient than strain SRB521T in the conversion of the sugars arabinose and galactose. In conclusion, this study provides genomic and physiological insight into Intestinimonas butyriciproducens, a prevalent butyrate-producing species, differentiating strains that originate from the mouse and human gut.

Journal ArticleDOI
TL;DR: A convenient system for experimental evidence-based annotation of candidate ASD-associated genes is developed; a substantial portion of the genes with de novo single-nucleotide variations might have roles in neuronal systems, although further detailed analysis might eliminate false positives among the identified candidates.
Abstract: Autism spectrum disorder (ASD) is a complex group of clinically heterogeneous neurodevelopmental disorders with unclear etiology and pathogenesis. Genetic studies have identified numerous candidate genetic variants, including de novo mutated ASD-associated genes; however, the function of these de novo mutated genes remains unclear despite extensive bioinformatics resources. Accordingly, it is not easy to assign priorities to numerous candidate ASD-associated genes for further biological analysis. Here we developed a convenient system for identifying an experimental evidence-based annotation of candidate ASD-associated genes. We performed trio-based whole-exome sequencing in 30 sporadic cases of ASD and identified 37 genes with de novo single-nucleotide variations (SNVs). Among them, 5 of those 37 genes, POGZ, PLEKHA4, PCNX, PRKD2 and HERC1, have been previously reported as genes with de novo SNVs in ASD; and consultation with in silico databases showed that only HERC1 might be involved in neural function. To examine whether the identified gene products are involved in neural functions, we performed small hairpin RNA-based assays using neuroblastoma cell lines to assess neurite development. Knockdown of 8 out of the 14 examined genes significantly decreased neurite development (P<0.05, one-way analysis of variance), which was significantly higher than the number expected from gene ontology databases (P=0.010, Fisher's exact test). Our screening system may be valuable for identifying the neural functions of candidate ASD-associated genes for further analysis and a substantial portion of these genes with de novo SNVs might have roles in neuronal systems, although further detailed analysis might eliminate false positive genes from identified candidate ASD genes.

Journal ArticleDOI
TL;DR: An RNA-Seq analysis provided comprehensive transcriptome information on E. gracilis for the first time and indicated that paramylon and wax ester metabolic pathways are regulated at the post-transcriptional rather than the transcriptional level in response to anaerobic conditions.
Abstract: The phytoflagellated protozoan, Euglena gracilis, has been proposed as an attractive feedstock for the accumulation of valuable compounds such as β-1,3-glucan, also known as paramylon, and wax esters. The production of wax esters proceeds under anaerobic conditions, designated as wax ester fermentation. In spite of the importance and usefulness of Euglena, the genome and transcriptome data are currently unavailable, although another research group published an E. gracilis transcriptome study while this work was under submission. We herein performed an RNA-Seq analysis to provide a comprehensive sequence resource and some insights into the regulation of genes including wax ester metabolism by comparative transcriptome analysis of E. gracilis under aerobic and anaerobic conditions. The E. gracilis transcriptome analysis was performed using the Illumina platform and yielded 90.3 million reads after the filtering steps. A total of 49,826 components were assembled and identified as a reference sequence of E. gracilis, of which 26,479 sequences were considered to be potentially expressed (having FPKM value of greater than 1). Approximately half of all components were estimated to be regulated in a trans-splicing manner, with the addition of protruding spliced leader sequences. Nearly 40% of the 26,479 sequences were annotated by similarity to the Swiss-Prot database using the BLASTX program. A total of 2080 transcripts were identified as differentially expressed genes (DEGs) in response to anaerobic treatment for 24 h. A comprehensive pathway enrichment analysis using the KEGG pathway revealed that the majority of DEGs were involved in photosynthesis, nucleotide metabolism, oxidative phosphorylation, and fatty acid metabolism. We successfully identified a candidate gene set of paramylon and wax esters, including novel β-1,3-glucan and wax ester synthases. A comparative expression analysis of aerobic- and anaerobic-treated E. gracilis cells indicated that gene expression changes in these components were not extensive or dynamic during the anaerobic treatment. The RNA-Seq analysis provided comprehensive transcriptome information on E. gracilis for the first time, and this information will advance our understanding of this unique organism. The comprehensive analysis indicated that paramylon and wax ester metabolic pathways are regulated at the post-transcriptional rather than the transcriptional level in response to anaerobic conditions.
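The "FPKM value of greater than 1" expression filter mentioned in this abstract can be made concrete with a short sketch. It uses the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments); the component names, counts, and lengths below are hypothetical and are not taken from the study.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


# Hypothetical assembled components: name -> (mapped fragments, length in bp).
components = {
    "comp_a": (500, 2000),
    "comp_b": (10, 2500),
}
total_mapped = 10_000_000  # hypothetical library size

values = {name: fpkm(count, length, total_mapped)
          for name, (count, length) in components.items()}

# Keep only components that pass the FPKM > 1 cutoff.
expressed = [name for name, value in values.items() if value > 1]
print(values, expressed)
```

Normalizing by both transcript length and library size is what makes a single cutoff comparable across components of different lengths.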

Journal ArticleDOI
07 Sep 2016-PLOS ONE
TL;DR: A de novo annotated assembly of the chromosomal genome of an industrially-relevant strain, W29/CLIB89, determined by hybrid next-generation sequencing underscores the utility of an additional independent genome assembly for this economically important organism.
Abstract: Yarrowia lipolytica, an oleaginous yeast, is capable of accumulating significant cellular mass in lipid making it an important source of biosustainable hydrocarbon-based chemicals. In spite of a similar number of protein-coding genes to that in other Hemiascomycetes, the Y. lipolytica genome is almost double that of model yeasts. Despite its economic importance and several distinct strains in common use, an independent genome assembly exists for only one strain. We report here a de novo annotated assembly of the chromosomal genome of an industrially-relevant strain, W29/CLIB89, determined by hybrid next-generation sequencing. For the first time, each Y. lipolytica chromosome is represented by a single contig. The telomeric rDNA repeats were localized by Irys long-range genome mapping and one complete copy of the rDNA sequence is reported. Two large structural variants and retroelement differences with reference strain CLIB122 including a full-length, novel Ty3/Gypsy long terminal repeat (LTR) retrotransposon and multiple LTR-like sequences are described. Strikingly, several of these are adjacent to RNA polymerase III-transcribed genes, which are almost double in number in Y. lipolytica compared to other Hemiascomycetes. In addition to previously-reported dimeric RNA polymerase III-transcribed genes, tRNA pseudogenes were identified. Multiple full-length and truncated LINE elements are also present. Therefore, although identified transposons do not constitute a significant fraction of the Y. lipolytica genome, they could have played an active role in its evolution. Differences between the sequence of this strain and of the existing reference strain underscore the utility of an additional independent genome assembly for this economically important organism.

Journal ArticleDOI
TL;DR: This novel dsRNA targeting metagenomic method is characterized by an extremely high recovery rate of viral RNA sequences, the retrieval of terminal sequences, and uniform read coverage, which has not previously been reported in other metagenomic methods targeting RNA viruses.
Abstract: Knowledge of the distribution and diversity of RNA viruses is still limited in spite of their possible environmental and epidemiological impacts because RNA virus-specific metagenomic methods have not yet been developed. We herein constructed an effective metagenomic method for RNA viruses by targeting long double-stranded (ds)RNA in cellular organisms, which is a hallmark of infection, or the replication of dsRNA and single-stranded (ss)RNA viruses, except for retroviruses. This novel dsRNA targeting metagenomic method is characterized by an extremely high recovery rate of viral RNA sequences, the retrieval of terminal sequences, and uniform read coverage, which has not previously been reported in other metagenomic methods targeting RNA viruses. This method revealed a previously unidentified viral RNA diversity of more than 20 complete RNA viral genomes including dsRNA and ssRNA viruses associated with an environmental diatom colony. Our approach will be a powerful tool for cataloging RNA viruses associated with organisms of interest.

Journal ArticleDOI
TL;DR: This work provides a solid candidate list for SSCP-derived effectors that may play roles in mediating F. graminearum-wheat interactions and the in vitro secretome-based method presented here also may be applicable for identifying candidate effectors in other ascomycete pathogens of crop plants.
Abstract: Pathogen-derived, small secreted cysteine-rich proteins (SSCPs) are known to be a common source of fungal effectors that trigger resistance or susceptibility in specific host plants. This group of proteins has not been well studied in Fusarium graminearum, the primary cause of Fusarium head blight (FHB), a devastating disease of wheat. We report here a comprehensive analysis of SSCPs encoded in the genome of this fungus and selection of candidate effector proteins through proteomics and sequence/transcriptional analyses. A total of 190 SSCPs were identified in the genome of F. graminearum (isolate PH-1) based on the presence of N-terminal signal peptide sequences, size (≤200 amino acids), and cysteine content (≥2%) of the mature proteins. Twenty-five (approximately 13%) SSCPs were confirmed to be true extracellular proteins by nanoscale liquid chromatography-tandem mass spectrometry (nanoLC-MS/MS) analysis of a minimal medium-based in vitro secretome. Sequence analysis suggested that 17 SSCPs harbor conserved functional domains, including two homologous to Ecp2, a known effector produced by the tomato pathogen Cladosporium fulvum. Transcriptional analysis revealed that at least 34 SSCPs (including 23 detected in the in vitro secretome) are expressed in infected wheat heads; about half are up-regulated with expression patterns correlating with the development of FHB. This work provides a solid candidate list for SSCP-derived effectors that may play roles in mediating F. graminearum-wheat interactions. The in vitro secretome-based method presented here also may be applicable for identifying candidate effectors in other ascomycete pathogens of crop plants.
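The selection criteria stated in this abstract (signal peptide present, size ≤200 amino acids, cysteine content ≥2%) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names are hypothetical, signal peptide prediction is assumed to come from an external tool and is passed in as a boolean, and both thresholds are assumed to apply to the mature sequence.

```python
def cysteine_fraction(seq):
    """Fraction of residues in seq that are cysteine (C)."""
    return seq.upper().count("C") / len(seq) if seq else 0.0


def is_sscp_candidate(mature_seq, has_signal_peptide, max_len=200, min_cys=0.02):
    """Apply the three criteria quoted in the abstract.

    has_signal_peptide is assumed to be supplied by a separate
    prediction step; this sketch does not predict it.
    """
    return bool(
        has_signal_peptide
        and 0 < len(mature_seq) <= max_len
        and cysteine_fraction(mature_seq) >= min_cys
    )
```

For example, a 50-residue mature sequence with a single cysteine sits exactly at the 2% threshold and passes, while a sequence with no predicted signal peptide is rejected regardless of its composition.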

Journal ArticleDOI
TL;DR: Distinct primary structures defined by homologous domains shed light on how barnacles use low complexity in nanofibers to enable adhesion, and serve as a starting point for unraveling the molecular architecture of a robust and unique class of adhesive nanostructures.
Abstract: Barnacles adhere by producing a mixture of cement proteins (CPs) that organize into a permanently bonded layer displayed as nanoscale fibers. These cement proteins share no homology with any other marine adhesives, and a common sequence basis that defines how nanostructures function as adhesives remains undiscovered. Here we demonstrate that a significant unidentified portion of acorn barnacle cement is composed of low complexity proteins; they are organized into repetitive sequence blocks and found to maintain homology to silk motifs. Proteomic analysis of aggregate bands from PAGE gels reveals an abundance of Gly/Ala/Ser/Thr repeats exemplified by a prominent, previously unidentified, 43 kDa protein in the solubilized adhesive. Low complexity regions found throughout the cement proteome, as well as multiple lysyl oxidases and peroxidases, establish homology with silk-associated materials such as fibroin, silk gum sericin, and pyriform spidroins from spider silk. Distinct primary structures defined by homologous domains shed light on how barnacles use low complexity in nanofibers to enable adhesion, and serve as a starting point for unraveling the molecular architecture of a robust and unique class of adhesive nanostructures.

Journal ArticleDOI
TL;DR: The transcript levels of SlHsp20 genes could be induced profusely by abiotic and biotic stresses such as heat, drought, salt, Botrytis cinerea, and Tomato Spotted Wilt Virus, indicating their potential roles in mediating the response of tomato plants to environmental stresses.
Abstract: The Hsp20 genes are involved in the response of plants to environmental stresses including heat shock and also play a vital role in plant growth and development. They represent the most abundant small heat shock proteins (sHsps) in plants, but little is known about this family in tomato (Solanum lycopersicum), an important vegetable crop in the world. Here, we characterized the heat shock protein 20 (SlHsp20) gene family in tomato through integration of gene structure, chromosome location, phylogenetic relationship and expression profile. Using bioinformatics-based methods, we identified at least 42 putative SlHsp20 genes in tomato. Sequence analysis revealed that most SlHsp20 genes possessed no intron or a relatively short intron. Chromosome mapping indicated that inter-arm and intra-chromosome duplication events contributed remarkably to the expansion of SlHsp20 genes. A phylogenetic tree of Hsp20 genes from tomato and other plant species revealed that SlHsp20 genes were grouped into 13 subfamilies, indicating that these genes may have a common ancestor that generated diverse subfamilies prior to the mono-dicot split. In addition, expression analysis using RNA-seq in various tissues and developmental stages of cultivated tomato and the wild relative Solanum pimpinellifolium revealed that most of these genes (83%) were expressed in at least one stage from at least one genotype. Out of 42 genes, 4 genes were expressed constitutively in almost all the tissues analyzed, implying that these genes might have specific housekeeping functions in tomato cells under normal growth conditions. Two SlHsp20 genes displayed differential expression levels between cultivated tomato and S. pimpinellifolium in vegetative (leaf and root) and reproductive organs (floral bud and flower), suggesting inter-species diversification for functional specialization during the process of domestication. Based on genome-wide microarray analysis, we showed that the transcript levels of SlHsp20 genes could be induced profusely by abiotic and biotic stresses such as heat, drought, salt, Botrytis cinerea and Tomato Spotted Wilt Virus, indicating their potential roles in mediating the response of tomato plants to environmental stresses. In conclusion, these results provide valuable information for elucidating the evolutionary relationship of the Hsp20 gene family and for functional characterization of the SlHsp20 gene family in the future.

Journal ArticleDOI
TL;DR: In this article, the authors performed whole-transcriptome sequencing using formalin-fixed, paraffin-embedded (FFPE) tissues in malignant or biologically indeterminate spitzoid tumors from 7 patients (age 2-14 years).

Journal ArticleDOI
01 Oct 2016-Genomics
TL;DR: This research proposes encoding DNA sequences by treating 2D chaos game representation (CGR) coordinates as complex numbers and applying digital signal processing methods to analyze their evolutionary relationships; the approach gives results comparable to those of the state-of-the-art multiple sequence alignment method, Clustal Omega.
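As a rough illustration of the encoding idea, the sketch below maps a DNA sequence to chaos game representation (CGR) points stored as complex numbers x + iy and computes a discrete Fourier transform power spectrum from them. The corner assignment, the starting point at the centre of the unit square, the choice of a power spectrum as the signal processing step, and all function names are assumptions made for this example; the summary above does not specify them.

```python
import cmath

# Assumed corner assignment on the unit square (one common convention).
CORNERS = {"A": 0 + 0j, "C": 0 + 1j, "G": 1 + 1j, "T": 1 + 0j}


def cgr_complex(seq):
    """Return the CGR trajectory of seq as a list of complex numbers."""
    z = 0.5 + 0.5j  # assumed start: centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        z = (z + CORNERS[base]) / 2  # midpoint between current point and corner
        points.append(z)
    return points


def power_spectrum(points):
    """Naive O(n^2) DFT power spectrum of a complex-valued signal."""
    n = len(points)
    spectrum = []
    for k in range(n):
        coeff = sum(
            p * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, p in enumerate(points)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum
```

Spectra from two sequences could then be compared with a distance measure of the reader's choice; for long sequences an FFT routine would replace the quadratic loop.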

Posted Content
TL;DR: A new method is presented, called seq2vec, to represent a complete biological sequence in a Euclidean space, which has the potential to capture the contextual information of the original sequence necessary for sequence comparison tasks.
Abstract: Biological sequence comparison is a key step in inferring the relatedness of various organisms and the functional similarity of their components. Thanks to Next Generation Sequencing efforts, an abundance of sequence data is now available to be processed for a range of bioinformatics applications. Embedding a biological sequence over a nucleotide or amino acid alphabet in a lower dimensional vector space makes the data more amenable for use by current machine learning tools, provided the quality of the embedding is high and it captures the most meaningful information of the original sequences. Motivated by recent advances in the text document embedding literature, we present a new method, called seq2vec, to represent a complete biological sequence in a Euclidean space. The new representation has the potential to capture the contextual information of the original sequence necessary for sequence comparison tasks. We test our embeddings with protein sequence classification and retrieval tasks and demonstrate encouraging outcomes.

Journal ArticleDOI
TL;DR: This paper reports the first use of whole-genome sequencing (WGS) to diagnose ADPKD, which overcomes pseudogene homology, provides uniform coverage, detects all variant types in a single test and is less labour-intensive than current techniques.
Abstract: Autosomal dominant polycystic kidney disease (ADPKD) is the most common monogenic kidney disorder and is due to disease-causing variants in PKD1 or PKD2. Strong genotype-phenotype correlation exists although diagnostic sequencing is not part of routine clinical practice. This is because PKD1 bears 97.7% sequence similarity with six pseudogenes, requiring laborious and error-prone long-range PCR and Sanger sequencing to overcome. We hypothesised that whole-genome sequencing (WGS) would be able to overcome the problem of this sequence homology, because of 150 bp, paired-end reads and avoidance of capture bias that arises from targeted sequencing. We prospectively recruited a cohort of 28 unique pedigrees with ADPKD phenotype. Standard DNA extraction, library preparation and WGS were performed using Illumina HiSeq X and variants were classified following standard guidelines. Molecular diagnosis was made in 24 patients (86%), with 100% variant confirmation by current gold standard of long-range PCR and Sanger sequencing. We demonstrated unique alignment of sequencing reads over the pseudogene-homologous region. In addition to identifying function-affecting single-nucleotide variants and indels, we identified single- and multi-exon deletions affecting PKD1 and PKD2, which would have been challenging to identify using exome sequencing. We report the first use of WGS to diagnose ADPKD. This method overcomes pseudogene homology, provides uniform coverage, detects all variant types in a single test and is less labour-intensive than current techniques. This technique is translatable to a diagnostic setting, allows clinicians to make better-informed management decisions and has implications for other disease groups that are challenged by regions of confounding sequence homology.