Showing papers on "Sequence analysis" published in 2016

Open Access

Journal Article

Rapid cloning of disease-resistance genes in plants using mutagenesis and sequence capture

Burkhard Steuernagel¹,², Sambasivam Periyannan³, Inmaculada Hernández-Pinzón¹, Kamil Witek¹, Matthew N. Rouse⁴, Guotai Yu², Asyraf Hatta²,⁵, Mick Ayliffe³, Harbans Bariana⁶, Jonathan D. G. Jones¹, Evans Lagudah³, Brande B. H. Wulff¹,²

Sainsbury Laboratory¹, John Innes Centre², Commonwealth Scientific and Industrial Research Organisation³, University of Minnesota⁴, Universiti Putra Malaysia⁵, University of Sydney⁶

01 Jun 2016-Nature Biotechnology

TL;DR: A three-step method (MutRenSeq) that combines chemical mutagenesis with exome capture and sequencing for rapid R gene cloning is described and applied to clone stem rust resistance genes Sr22 and Sr45 from hexaploid bread wheat.

Abstract: Wild relatives of domesticated crop species harbor multiple, diverse, disease resistance (R) genes that could be used to engineer sustainable disease control. However, breeding R genes into crop lines often requires long breeding timelines of 5-15 years to break linkage between R genes and deleterious alleles (linkage drag). Further, when R genes are bred one at a time into crop lines, the protection that they confer is often overcome within a few seasons by pathogen evolution. If several cloned R genes were available, it would be possible to pyramid R genes in a crop, which might provide more durable resistance. We describe a three-step method (MutRenSeq) that combines chemical mutagenesis with exome capture and sequencing for rapid R gene cloning. We applied MutRenSeq to clone stem rust resistance genes Sr22 and Sr45 from hexaploid bread wheat. MutRenSeq can be applied to other commercially relevant crops and their relatives, including, for example, pea, bean, barley, oat, rye, rice and maize.

313 citations

Journal Article

Global Mapping of Human RNA-RNA Interactions

Eesha Sharma¹, Timothy Sterne-Weiler¹, Dave O'Hanlon¹, Benjamin J. Blencowe¹

University of Toronto¹

19 May 2016-Molecular Cell

TL;DR: LIGR-seq data reveal unexpected interactions between small nucleolar (sno)RNAs and mRNAs, including those involving the orphan C/D box snoRNA, SNORD83B, that control steady-state levels of its target mRNAs.

287 citations

Journal Article

Inferring expressed genes by whole-genome sequencing of plasma DNA

Peter Ulz¹, Gerhard G. Thallinger², Martina Auer¹, Ricarda Graf¹, Karl Kashofer¹, Stephan W Jahn¹, Luca Abete¹, Gunda Pristauz¹, Edgar Petru¹, Jochen B. Geigl¹, Ellen Heitzer¹, Michael R. Speicher¹

Medical University of Graz¹, Graz University of Technology²

01 Oct 2016-Nature Genetics

TL;DR: It is found that plasma DNA read depth patterns from healthy donors reflect the expression signature of hematopoietic cells, and that in patients with metastatic cancer, expressed cancer driver genes in regions with somatic copy number gains can be classified with high accuracy.

Abstract: The analysis of cell-free DNA (cfDNA) in plasma represents a rapidly advancing field in medicine. cfDNA consists predominantly of nucleosome-protected DNA shed into the bloodstream by cells undergoing apoptosis. We performed whole-genome sequencing of plasma DNA and identified two discrete regions at transcription start sites (TSSs) where nucleosome occupancy results in different read depth coverage patterns for expressed and silent genes. By employing machine learning for gene classification, we found that the plasma DNA read depth patterns from healthy donors reflected the expression signature of hematopoietic cells. In patients with cancer having metastatic disease, we were able to classify expressed cancer driver genes in regions with somatic copy number gains with high accuracy. We were able to determine the expressed isoform of genes with several TSSs, as confirmed by RNA-seq analysis of the matching primary tumor. Our analyses provide functional information about cells releasing their DNA into the circulation.

279 citations

Journal Article

Comparative genetics. Systematic discovery of cap-independent translation sequences in human and viral genomes.

Shira Weingarten-Gabbay¹, Shani Elias-Kirma¹, Ronit Nir¹, Alexey A. Gritsenko², Noam Stern-Ginossar¹, Zohar Yakhini³,⁴, Adina Weinberger¹, Eran Segal¹

Weizmann Institute of Science¹, Delft University of Technology², Technion – Israel Institute of Technology³, Agilent Technologies⁴

15 Jan 2016-Science

TL;DR: Thousands of human and viral sequences with cap-independent translation activity are uncovered, a 50-fold increase in the number of sequences known to date, revealing the wide existence of cap-independent translation sequences in both humans and viruses.

Abstract: INTRODUCTION The recruitment of the ribosome to a specific mRNA is a critical step in the production of proteins in cells. In addition to a general recognition of the “cap” structure at the beginning of eukaryotic mRNAs, ribosomes can also initiate translation from a regulatory RNA element termed internal ribosome entry site (IRES) in a cap-independent manner. IRESs are essential for the synthesis of many human and viral proteins and take part in a variety of biological functions, such as viral infections, the response of cells to stress, and organismal development. Despite their importance, we lack systematic methods for discovering and characterizing IRESs, and thus, little is known about their position in the human and viral genomes and the mechanisms by which they recruit the ribosome. RATIONALE Our method enables accurate measurement of thousands of fully designed sequences for cap-independent translation activity. By using a synthetic oligonucleotide library, we can determine the exact composition of the sequences tested and can profile sequences from hundreds of different viruses, as well as the human genome, in a single experiment. In addition, synthetic design enables the construction of oligos in which we carefully and systematically mutate native IRESs and measure the effect of these mutations on expression. This reverse-genetics approach enables the characterization of the regulatory elements that recruit the ribosome and provide specificity in translation. RESULTS We uncover thousands of human and viral sequences with cap-independent translation activity, which provide a 50-fold increase in the number of sequences known to date. Unbiased screening of cap-independent activity across human transcripts demonstrates enrichment of regulatory elements in the untranslated region in the beginning of transcripts (5′UTR). 
However, we also find enrichment in the untranslated region located downstream of the coding sequence (3′UTR), which suggests a mechanism by which ribosomes are recruited to the 3′UTR to enhance the translation of an upstream sequence. A genome-wide profiling of positive-strand RNA viruses ([+]ssRNA) reveals the existence of translational elements along their coding regions. This finding suggests that [+]ssRNA viruses can translate only part of their genome, in addition to the synthesis and cleavage of a premature polyprotein. Our analysis reveals two classes of functional elements that drive cap-independent translation: (i) highly structured elements and (ii) unstructured elements that act through a short sequence motif. We show that many 5′UTRs can attract the ribosome by Watson-Crick base pairing with the 18S ribosomal RNA, a structural RNA component of the small ribosomal subunit (40S). In addition, we systematically investigate the functional regions of the 18S rRNA involved in these interactions that enhance cap-independent translation. CONCLUSIONS These results reveal the wide existence of cap-independent translation sequences in both humans and viruses. They provide insights on the landscape of translational regulation and uncover the regulatory elements underlying cap-independent translation activity.

261 citations

Journal Article

Weighting sequence variants based on their annotation increases power of whole-genome association studies

Gardar Sveinbjornsson¹,², Anders Albrechtsen³, Florian Zink¹, Sigurjon A. Gudjonsson¹, Asmundur Oddson¹, Gisli Masson¹, Hilma Holm¹,², Augustine Kong¹,², Unnur Thorsteinsdottir¹,², Patrick Sulem¹, Daniel F. Gudbjartsson¹,², Kari Stefansson¹,²

Amgen¹, University of Iceland², University of Copenhagen³

01 Mar 2016-Nature Genetics

TL;DR: This work proposes a weighted Bonferroni adjustment that controls for the family-wise error rate (FWER), using as weights the enrichment of sequence annotations among association signals, and shows that this weighted adjustment increases the power to detect association over the standard Bonferroni correction.

Abstract: The consensus approach to genome-wide association studies (GWAS) has been to assign equal prior probability of association to all sequence variants tested. However, some sequence variants, such as loss-of-function and missense variants, are more likely than others to affect protein function and are therefore more likely to be causative. Using data from whole-genome sequencing of 2,636 Icelanders and the association results for 96 quantitative and 123 binary phenotypes, we estimated the enrichment of association signals by sequence annotation. We propose a weighted Bonferroni adjustment that controls for the family-wise error rate (FWER), using as weights the enrichment of sequence annotations among association signals. We show that this weighted adjustment increases the power to detect association over the standard Bonferroni correction. We use the enrichment of associations by sequence annotation we have estimated in Iceland to derive significance thresholds for other populations with different numbers and combinations of sequence variants.
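
The weighted Bonferroni idea in this abstract can be sketched in a few lines of Python. This is only an illustration of the general technique, not the authors' implementation: the annotation classes, variant counts, and enrichment values below are hypothetical placeholders, and the function name is made up for this example. Each class gets a p-value threshold proportional to its enrichment, scaled so that the thresholds summed over all tested variants equal the chosen FWER level.

```python
from typing import Dict


def weighted_bonferroni_thresholds(
    counts: Dict[str, int],
    enrichment: Dict[str, float],
    alpha: float = 0.05,
) -> Dict[str, float]:
    """Return one p-value threshold per annotation class.

    The threshold for a class is alpha * weight / sum(count * weight),
    so the thresholds summed over every tested variant equal alpha.
    """
    total_weight = sum(counts[c] * enrichment[c] for c in counts)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return {c: alpha * enrichment[c] / total_weight for c in counts}


# Hypothetical variant counts and enrichment weights, for illustration only.
counts = {"loss_of_function": 1_000, "missense": 50_000, "other": 949_000}
enrichment = {"loss_of_function": 100.0, "missense": 20.0, "other": 1.0}

thresholds = weighted_bonferroni_thresholds(counts, enrichment)

# Sanity check: the per-variant thresholds add up to the FWER budget.
family_wise_budget = sum(counts[c] * thresholds[c] for c in counts)

for name, value in thresholds.items():
    print(f"{name}: reject if p <= {value:.3g}")
```

With equal weights for every class this reduces to the standard Bonferroni threshold of alpha divided by the number of tests; unequal weights move part of the error budget toward the classes assumed to be more likely causative.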

184 citations

Journal Article

Recurrent chimeric fusion RNAs in non-cancer tissues and cells

Mihaela Babiceanu¹, Fujun Qin¹, Zhongqiu Xie¹, Yuemeng Jia¹, Kevin Lopez¹, Nick Janus¹, Loryn Facemire¹, Shailesh Kumar¹, Yuwei Pang¹, Yanjun Qi¹, Iulia M. Lazar², Hui Li¹

University of Virginia¹, Virginia Tech²

07 Apr 2016-Nucleic Acids Research

TL;DR: Functional analyses of a few widely expressed fusions found that silencing them resulted in a dramatic reduction in normal cell growth and/or motility; the implications of these non-pathological fusions in cancer and in evolution are also explored.

Abstract: Gene fusions and their products (RNA and protein) were once thought to be unique features to cancer. However, chimeric RNAs can also be found in normal cells. Here, we performed, curated and analyzed nearly 300 RNA-Seq libraries covering 30 different non-neoplastic human tissues and cells as well as 15 mouse tissues. A large number of fusion transcripts were found. Most fusions were detected only once, while 291 were seen in more than one sample. We focused on the recurrent fusions and performed RNA and protein level validations on a subset. We characterized these fusions based on various features of the fusions, and their parental genes. They tend to be expressed at higher levels relative to their parental genes than the non-recurrent ones. Over half of the recurrent fusions involve neighboring genes transcribing in the same direction. A few sequence motifs were found enriched close to the fusion junction sites. We performed functional analyses on a few widely expressed fusions, and found that silencing them resulted in dramatic reduction in normal cell growth and/or motility. Most chimeras use canonical splicing sites, thus are likely products of 'intergenic splicing'. We also explored the implications of these non-pathological fusions in cancer and in evolution.

147 citations

Journal Article

Separation and parallel sequencing of the genomes and transcriptomes of single cells using G&T-seq

Iain C. Macaulay¹, Mabel J Teng², Wilfried Haerty¹, Parveen Kumar²,³, Chris P. Ponting²,⁴, Thierry Voet²,³

Norwich Research Park¹, Wellcome Trust Sanger Institute², Katholieke Universiteit Leuven³, University of Edinburgh⁴

01 Nov 2016-Nature Protocols

TL;DR: A detailed protocol for G&T-seq, a method for separation and parallel sequencing of genomic DNA and full-length polyA(+) mRNA from single cells, which allows the detection of thousands of transcripts in parallel with the genetic variants captured by the DNA-seq data from the same single cell.

Abstract: Parallel sequencing of a single cell's genome and transcriptome provides a powerful tool for dissecting genetic variation and its relationship with gene expression. Here we present a detailed protocol for G&T-seq, a method for separation and parallel sequencing of genomic DNA and full-length polyA(+) mRNA from single cells. The protocol covers the physical separation of polyA(+) mRNA from genomic DNA using a modified oligo-dT bead capture and the respective whole-transcriptome and whole-genome amplifications, as well as library preparation and sequence analyses of these amplification products. The method allows the detection of thousands of transcripts in parallel with the genetic variants captured by the DNA-seq data from the same single cell. G&T-seq differs from other currently available methods for parallel DNA and RNA sequencing from single cells, as it involves physical separation of the DNA and RNA and does not require bespoke microfluidics platforms. The process can be implemented manually or through automation. When performed manually, paired genome and transcriptome sequencing libraries from eight single cells can be produced in ∼3 d by researchers experienced in molecular laboratory work. For users with experience in the programming and operation of liquid-handling robots, paired DNA and RNA libraries from 96 single cells can be produced in the same time frame. Sequence analysis and integration of single-cell G&T-seq DNA and RNA data requires a high level of bioinformatics expertise and familiarity with a wide range of informatics tools.

141 citations

Journal Article

Revised phylogeny of Bacteroidetes and proposal of sixteen new taxa and two new combinations including Rhodothermaeota phyl. nov.

Raul Munoz¹,², Ramon Rosselló-Móra¹, Rudolf Amann²

Spanish National Research Council¹, Max Planck Society²

01 Jul 2016-Systematic and Applied Microbiology

TL;DR: A revision of an earlier phylogeny of Bacteroidetes has been performed using the 16S rRNA gene as a backbone in combination with the 23S rRNA gene, as well as multilocus sequence analysis (MLSA) of 29 orthologous protein sequences, and indels in the sequences of the beta subunit of the F-type ATPase and the alanyl-tRNA synthetase.

118 citations

Journal Article

Spliced synthetic genes as internal controls in RNA sequencing experiments

Simon A. Hardwick¹,², Wendy Y. Chen¹,², Ted Wong¹, Ira W. Deveson¹,², James Blackburn¹,², Stacey B. Andersen³, Lars K. Nielsen³, John S. Mattick¹,², Tim R. Mercer¹,²

Garvan Institute of Medical Research¹, University of New South Wales², University of Queensland³

01 Sep 2016-Nature Methods

TL;DR: A set of spike-in RNA standards, termed 'sequins' (sequencing spike-ins), is developed; they represent full-length spliced mRNA isoforms and provide a qualitative and quantitative reference with which to navigate the complexity of the human transcriptome.

Abstract: RNA sequencing (RNA-seq) can be used to assemble spliced isoforms, quantify expressed genes and provide a global profile of the transcriptome. However, the size and diversity of the transcriptome, the wide dynamic range in gene expression and inherent technical biases confound RNA-seq analysis. We have developed a set of spike-in RNA standards, termed 'sequins' (sequencing spike-ins), that represent full-length spliced mRNA isoforms. Sequins have an entirely artificial sequence with no homology to natural reference genomes, but they align to gene loci encoded on an artificial in silico chromosome. The combination of multiple sequins across a range of concentrations emulates alternative splicing and differential gene expression, and it provides scaling factors for normalization between samples. We demonstrate the use of sequins in RNA-seq experiments to measure sample-specific biases and determine the limits of reliable transcript assembly and quantification in accompanying human RNA samples. In addition, we have designed a complementary set of sequins that represent fusion genes arising from rearrangements of the in silico chromosome to aid in cancer diagnosis. RNA sequins provide a qualitative and quantitative reference with which to navigate the complexity of the human transcriptome.

113 citations

Importance of secondary structure in the signal sequence for protein secretion (LamB protein/export-defective mutants/pseudorevertants/DNA sequence analysis)

Scott D. Emr, Thomas J. Silhavy

01 Jan 2016

TL;DR: Analysis of the secondary structure of the wild-type, mutant, and pseudorevertant LamB signal sequences suggests that the secondary mutations restore export by allowing the formation of a stable alpha-helical conformation in the central, hydrophobic region of the signal sequence.

Abstract: Mutant Escherichia coli strains in which export of the LamB protein (coded for by the lamB gene) to the outer membrane of the cell is prevented have been described previously. One of these mutant strains contains a small (12-base pair) deletion mutation within the region of the lamB gene that codes for the NH2-terminal signal sequence. In this mutant strain, export but not synthesis of the LamB protein is blocked. We have isolated pseudorevertants that restore export of functional LamB protein to the outer membrane. DNA sequence analysis showed that two of the revertants contain a point mutation in addition to the original deletion. These point mutations lead to amino acid substitutions within the signal sequence. Our results indicate that these secondary mutations efficiently suppress the export defect caused by the deletion mutation. Analysis of the secondary structure of the wild-type, mutant, and pseudorevertant LamB signal sequences suggests that the secondary mutations restore export by allowing the formation of a stable α-helical conformation in the central, hydrophobic region of the signal sequence. The mechanism of protein secretion in both prokaryotic and eukaryotic cells appears to require the presence of an extra se-

105 citations

Journal Article

An analysis of the sensitivity of proteogenomic mapping of somatic mutations and novel splicing events in cancer

Kelly V. Ruggles¹, Zuojian Tang¹, Xuya Wang¹, Himanshu Grover¹, Manor Askenazi, Jennifer Teubl¹, Song Cao², Michael D. McLellan², Karl R. Clauser³, David L. Tabb⁴, Philipp Mertins³, Robbert J.C. Slebos⁴, Petra Erdmann-Gilmore², Shunqiang Li², Harsha P. Gunawardena, Ling Xie, Tao Liu⁵, Jian-Ying Zhou⁶, Shisheng Sun⁶, Katherine A. Hoadley, Charles M. Perou, Xian Chen, Sherri R. Davies², Christopher G. Maher², Christopher R. Kinsinger⁷, Karen D. Rodland⁵, Hui Zhang⁶, Zhen Zhang⁶, Li Ding², Raymond R. Townsend², Henry Rodriguez⁷, Daniel W. Chan⁶, Richard D. Smith⁵, Daniel C. Liebler⁴, Steven A. Carr³, Samuel H. Payne⁵, Matthew J. Ellis², David Fenyő¹

New York University¹, Washington University in St. Louis², Broad Institute³, Vanderbilt University⁴, Pacific Northwest National Laboratory⁵, Johns Hopkins University⁶, National Institutes of Health⁷

01 Mar 2016-Molecular & Cellular Proteomics

TL;DR: This large-scale proteogenomic integration allowed us to determine the degree to which mutations are translated and identify gaps in sequence coverage, thereby benchmarking current technology and progress toward whole cancer proteome and transcriptome analysis.

Journal Article

Hyperexpansion of RNA Bacteriophage Diversity

Siddharth R. Krishnamurthy¹, Andrew B. Janowski¹, Guoyan Zhao¹, Dan H. Barouch², David Wang¹

Washington University in St. Louis¹, Ragon Institute of MGH, MIT and Harvard²

24 Mar 2016-PLOS Biology

TL;DR: Partial genome sequences of 122 RNA bacteriophage phylotypes are identified that were present in samples collected from a range of ecological niches worldwide, including invertebrates and extreme microbial sediment, demonstrating that they are more widely distributed than previously recognized.

Abstract: Bacteriophage modulation of microbial populations impacts critical processes in ocean, soil, and animal ecosystems. However, the role of bacteriophages with RNA genomes (RNA bacteriophages) in these processes is poorly understood, in part because of the limited number of known RNA bacteriophage species. Here, we identify partial genome sequences of 122 RNA bacteriophage phylotypes that are highly divergent from each other and from previously described RNA bacteriophages. These novel RNA bacteriophage sequences were present in samples collected from a range of ecological niches worldwide, including invertebrates and extreme microbial sediment, demonstrating that they are more widely distributed than previously recognized. Genomic analyses of these novel bacteriophages yielded multiple novel genome organizations. Furthermore, one RNA bacteriophage was detected in the transcriptome of a pure culture of Streptomyces avermitilis, suggesting for the first time that the known tropism of RNA bacteriophages may include gram-positive bacteria. Finally, reverse transcription PCR (RT-PCR)-based screening for two specific RNA bacteriophages in stool samples from a longitudinal cohort of macaques suggested that they are generally acutely present rather than persistent.

Journal Article

The Coding Region of the HCV Genome Contains a Network of Regulatory RNA Structures.

Nathan Pirakitikulr¹, Andrew Kohlway¹, Brett D. Lindenbach¹, Anna Marie Pyle¹,²

Yale University¹, Howard Hughes Medical Institute²

07 Apr 2016-Molecular Cell

TL;DR: This study describes a set of conserved but functionally diverse structural RNA motifs that occur in multiple coding regions of the HCV genome, and it is demonstrated that conformational changes in these motifs influence specific stages in the virus' life cycle.

Journal Article

A time- and cost-effective strategy to sequence mammalian Y Chromosomes: an application to the de novo assembly of gorilla Y

Marta Tomaszkiewicz¹, Samarth Rangavittal¹, Monika Cechova¹, Rebeca Campos Sanchez¹, Howard W. Fescemyer¹, Robert S. Harris¹, Danling Ye¹, Patricia C. M. O’Brien², Rayan Chikhi¹, Oliver A. Ryder, Malcolm A. Ferguson-Smith², Paul Medvedev¹, Kateryna D. Makova¹

Pennsylvania State University¹, University of Cambridge²

01 Apr 2016-Genome Research

TL;DR: A much faster and more affordable strategy for sequencing and assembling mammalian Y Chromosomes of sufficient quality for most comparative genomics analyses and for conservation genetics applications is presented; it can be used to reconstruct sex chromosomes in a heterogametic sex of any species.

Abstract: The mammalian Y Chromosome sequence, critical for studying male fertility and dispersal, is enriched in repeats and palindromes, and thus, is the most difficult component of the genome to assemble. Previously, expensive and labor-intensive BAC-based techniques were used to sequence the Y for a handful of mammalian species. Here, we present a much faster and more affordable strategy for sequencing and assembling mammalian Y Chromosomes of sufficient quality for most comparative genomics analyses and for conservation genetics applications. The strategy combines flow sorting, short- and long-read genome and transcriptome sequencing, and droplet digital PCR with novel and existing computational methods. It can be used to reconstruct sex chromosomes in a heterogametic sex of any species. We applied our strategy to produce a draft of the gorilla Y sequence. The resulting assembly allowed us to refine gene content, evaluate copy number of ampliconic gene families, locate species-specific palindromes, examine the repetitive element content, and produce sequence alignments with human and chimpanzee Y Chromosomes. Our results inform the evolution of the hominine (human, chimpanzee, and gorilla) Y Chromosomes. Surprisingly, we found the gorilla Y Chromosome to be similar to the human Y Chromosome, but not to the chimpanzee Y Chromosome. Moreover, we have utilized the assembled gorilla Y Chromosome sequence to design genetic markers for studying the male-specific dispersal of this endangered species.

Characterization of the chicken vimentin gene: Single copy gene producing multiple mRNAs (cytoskeleton/intermediate filament proteins/muscle/termination/polyadenylylation)

Zendra E. Zehner, Bruce M. Paterson

01 Jan 2016

TL;DR: Sequence analysis revealed that the vimentin gene contained two sets of tandem polyadenylylation sites, 249 and 532 nucleotides downstream from the stop codon for protein synthesis.

Abstract: Genomic clones and cDNA plasmids were isolated for the intermediate filament protein vimentin from chicken. The identity of the various clones was determined both by mRNA selection (Paterson, B. M. & Roberts, B. E. (1981) in Gene Amplification and Analysis, Structural Analysis of Nucleic Acids, eds. Chirikjian, J. G. & Papas, T. S. (Elsevier, North Holland), Vol. 2, pp. 418-435) and nucleotide sequence analysis. Restriction analysis, hybridization data, and heteroduplex studies confirmed that all of the genomic isolates contained overlapping fragments of an identical vimentin gene. No evidence for the existence of a second vimentin gene could be found by a Southern analysis either by using coding fragments from the purified vimentin gene or by using cDNA plasmids as probe. Likewise, copy-number experiments verified that the vimentin gene was present only once in the haploid chicken genome. However, in an RNA blot analysis, at least two equally abundant vimentin mRNA species of approximately 2,200 and 2,500 nucleotides in length were detected in all RNAs tested. Sequence analysis revealed that the vimentin gene contained two sets of tandem polyadenylylation sites, 249 and 532 nucleotides downstream from the stop codon for protein synthesis. It is proposed that the larger mRNA species arise because of complete transcription of the 3'-end of the vimentin gene (560 nucleotides of 3' nontranslated sequence), whereas the smaller

Journal Article

Complete De Novo Assembly of Monoclonal Antibody Sequences.

Ngoc Hieu Tran¹, M. Ziaur Rahman¹, Lin He, Lei Xin, Baozhen Shan, Ming Li¹

University of Waterloo¹

26 Aug 2016-Scientific Reports

TL;DR: An integrated system, ALPS, which for the first time can automatically assemble full-length monoclonal antibody sequences by integrating de novo sequencing peptides, their quality scores and error-correction information from databases into a weighted de Bruijn graph.

Abstract: De novo protein sequencing is one of the key problems in mass spectrometry-based proteomics, especially for novel proteins such as monoclonal antibodies for which genome information is often limited or not available. However, due to limitations in peptides fragmentation and coverage, as well as ambiguities in spectra interpretation, complete de novo assembly of unknown protein sequences still remains challenging. To address this problem, we propose an integrated system, ALPS, which for the first time can automatically assemble full-length monoclonal antibody sequences. Our system integrates de novo sequencing peptides, their quality scores and error-correction information from databases into a weighted de Bruijn graph to assemble protein sequences. We evaluated ALPS performance on two antibody data sets, each including a heavy chain and a light chain. The results show that ALPS was able to assemble three complete monoclonal antibody sequences of length 216-441 AA, at 100% coverage, and 96.64-100% accuracy.
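
As a rough illustration of what assembling sequences from a weighted de Bruijn graph can look like, the Python sketch below builds a graph from scored peptide fragments and extends a contig greedily along the heaviest edges. It is a toy under stated assumptions and not the ALPS algorithm: the function names, the per-peptide scores (a simplification of per-residue quality scores), the greedy traversal, and the example fragments are all hypothetical, and no database error correction is modelled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]


def build_weighted_graph(peptides: List[Tuple[str, float]], k: int) -> Graph:
    """Nodes are (k-1)-mers; each k-mer adds its peptide score to one edge."""
    graph: Graph = defaultdict(lambda: defaultdict(float))
    for sequence, score in peptides:
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += score
    return graph


def greedy_extend(graph: Graph, seed: str, max_length: int = 1000) -> str:
    """Grow a contig from a seed node by following the heaviest unused edge."""
    contig, node = seed, seed
    used = set()
    while len(contig) < max_length:
        candidates = {
            nxt: weight
            for nxt, weight in graph.get(node, {}).items()
            if (node, nxt) not in used
        }
        if not candidates:
            break
        nxt = max(candidates, key=candidates.get)
        used.add((node, nxt))  # never reuse an edge, so the walk terminates
        contig += nxt[-1]
        node = nxt
    return contig


# Hypothetical overlapping fragments with made-up confidence scores.
peptides = [("EVQLVESGG", 0.9), ("VESGGGLVQ", 0.8), ("GGLVQPGG", 0.7)]
graph = build_weighted_graph(peptides, k=4)
contig = greedy_extend(graph, seed="EVQ")
print(contig)
```

Edges supported by several fragments accumulate weight, so a low-confidence fragment that disagrees with the others is less likely to be followed. A real assembler would need a more careful path search than this greedy walk.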

Journal Article

Transcriptome profiling of the salt-stress response in Triticum aestivum cv. Kharchia Local

Etika Goyal¹, Singh K. Amit², Ravi S. Singh², Ajay Kumar Mahato², Suresh Chand¹, Kumar Kanika²

Banasthali Vidyapith¹, Indian Council of Agricultural Research²

13 Jun 2016-Scientific Reports

TL;DR: The transcriptome data are the first report offering insight into the mechanisms and genes involved in salt tolerance; this information can be used to improve salt tolerance in elite wheat cultivars and to develop tolerant germplasm for other cereal crops.

Abstract: Kharchia Local is an Indian wheat landrace known for its tolerance to salinity. However, detailed information on the molecular mechanisms imparting tolerance to high salinity in this bread wheat is lacking. In the present study, differential root transcriptome analysis was performed to identify salt-stress-responsive gene networks and their functional annotation in Kharchia Local. A total of 453,882 reads were obtained after quality filtering, using Roche 454-GS FLX Titanium sequencing technology. From these reads, 22,241 ESTs were generated, from which 17,911 unigenes were obtained. A total of 14,898 unigenes were annotated against the nr protein database. Seventy-seven transcription factor families in 826 unigenes and 11,002 SSRs in 6,939 unigenes were identified. The Kyoto Encyclopedia of Genes and Genomes database identified 310 metabolic pathways. The expression pattern of a few selected genes was compared during the time course of salt stress treatment between the salt-tolerant (Kharchia Local) and susceptible (HD2687) varieties. These transcriptome data are the first report offering insight into the mechanisms and genes involved in salt tolerance. This information can be used to improve salt tolerance in elite wheat cultivars and to develop tolerant germplasm for other cereal crops.

Journal Article

Expression and diversification analysis reveals transposable elements play important roles in the origin of Lycopersicon-specific lncRNAs in tomato.

Xin Wang¹, Guo Ai¹, Chunli Zhang¹, Long Cui¹, Jiafa Wang¹, Hanxia Li¹, Junhong Zhang¹, Zhibiao Ye¹

Huazhong Agricultural University¹

01 Mar 2016-New Phytologist

TL;DR: This work identified 413 and 709 multi-exon noncoding transcripts from 353 and 595 loci of the cultivar tomato Heinz1706 and its wild relative LA1589, respectively, and shows that lncRNAs are poorly conserved in Solanaceae, providing novel insights into the evolution of lncRNAs in plants.

Abstract: Long noncoding RNAs (lncRNAs) regulate gene expression and biological processes. With the development of high-throughput RNA sequencing technology, lncRNAs have been extensively studied in recent years. Nevertheless, the expression and evolution of lncRNAs in plants remain poorly understood. Here, we identified 413 and 709 multi-exon noncoding transcripts from 353 and 595 loci of the tomato cultivar Heinz1706 and its wild relative LA1589, respectively. Systematic comparison of the sequence and expression of lncRNAs showed that they are poorly conserved in Solanaceae, with < 0.4% of lncRNAs present in all sequenced genomes of tomato and potato. Sequence analysis of Lycopersicon-specific lncRNA loci in Solanum lycopersicum and S. pennellii showed that the origins of these molecules are associated with transposable elements (TEs). LncRNA-314, a fruit-specific lncRNA expressed in S. lycopersicum and S. pimpinellifolium, but not in S. pennellii, originated through two evolutionary events: speciation of S. pennellii resulted in insertion of a long terminal repeat (LTR) retrotransposon into chromosome 10 and contributed to most of the transcribed region of lncRNA-314; and a large deletion in Lycopersicon generated the promoter region and part of the transcribed region of lncRNA-314. These results provide novel insights into the evolution of lncRNAs in plants.

Journal Article•DOI•

Comparative genomics and physiology of the butyrate-producing bacterium Intestinimonas butyriciproducens.

Thi Phuong Nam Bui, Sudarshan A. Shetty, Ilias Lagkouvardos¹, Jarmo Ritari², Bhawani Chamlagain, François P. Douillard², Lars Paulin², Vieno Piironen, Thomas Clavel¹, Caroline M. Plugge, Willem M. de Vos²

Technische Universität München¹, University of Helsinki²

07 Oct 2016-Environmental Microbiology Reports

TL;DR: This study provides genomic and physiological insight into Intestinimonas butyriciproducens, a prevalent butyrate-producing species, differentiating strains that originate from the mouse and human gut.

Abstract: Intestinimonas is a newly described bacterial genus with representative strains present in the intestinal tract of humans and other animals. Despite unique metabolic features including the production of butyrate from both sugars and amino acids, there are to date no data on their diversity, ecology, and physiology. Using a comprehensive phylogenetic approach, Intestinimonas was found to include at least three species that colonize primarily the human and mouse intestine. We focused on the most common and cultivable species of the genus, Intestinimonas butyriciproducens, and performed detailed genomic and physiological comparison of strains SRB521T and AF211, isolated from the mouse and human gut, respectively. The complete 3.3-Mb genomic sequences of both strains were highly similar with 98.8% average nucleotide identity, testifying to their assignment to one single species. However, thorough analysis revealed significant genomic rearrangements, variations in phage-derived sequences, and the presence of new CRISPR sequences in both strains. Moreover, strain AF211 appeared to be more efficient than strain SRB521T in the conversion of the sugars arabinose and galactose. In conclusion, this study provides genomic and physiological insight into Intestinimonas butyriciproducens, a prevalent butyrate-producing species, differentiating strains that originate from the mouse and human gut.

Journal Article•DOI•

Whole-exome sequencing and neurite outgrowth analysis in autism spectrum disorder

Ryota Hashimoto¹, Takanobu Nakazawa¹, Yoshinori Tsurusaki², Yuka Yasuda¹, Kazuki Nagayasu¹, Kensuke Matsumura¹, Hitoshi Kawashima, Hidenaga Yamamori¹, Michiko Fujimoto¹, Kazutaka Ohi¹, Satomi Umeda-Yano¹, Masaki Fukunaga, Haruo Fujino¹, Atsushi Kasai¹, Atsuko Hayata-Takano¹, Norihito Shintani¹, Masatoshi Takeda¹, Naomichi Matsumoto², Hitoshi Hashimoto¹

Osaka University¹, Yokohama City University²

01 Mar 2016-Journal of Human Genetics

TL;DR: A convenient system is developed for experimental evidence-based annotation of candidate ASD-associated genes; a substantial portion of the genes with de novo single-nucleotide variations might have roles in neuronal systems, although further detailed analysis might eliminate false positives from the identified candidate ASD genes.

Abstract: Autism spectrum disorder (ASD) is a complex group of clinically heterogeneous neurodevelopmental disorders with unclear etiology and pathogenesis. Genetic studies have identified numerous candidate genetic variants, including de novo mutated ASD-associated genes; however, the function of these de novo mutated genes remains unclear despite extensive bioinformatics resources. Accordingly, it is not easy to assign priorities to numerous candidate ASD-associated genes for further biological analysis. Here we developed a convenient system for experimental evidence-based annotation of candidate ASD-associated genes. We performed trio-based whole-exome sequencing in 30 sporadic cases of ASD and identified 37 genes with de novo single-nucleotide variations (SNVs). Five of those 37 genes, POGZ, PLEKHA4, PCNX, PRKD2 and HERC1, have been previously reported as genes with de novo SNVs in ASD, and consultation with in silico databases showed that only HERC1 might be involved in neural function. To examine whether the identified gene products are involved in neural functions, we performed small hairpin RNA-based assays using neuroblastoma cell lines to assess neurite development. Knockdown of 8 out of the 14 examined genes significantly decreased neurite development (P<0.05, one-way analysis of variance), which was significantly higher than the number expected from gene ontology databases (P=0.010, Fisher's exact test). Our screening system may be valuable for identifying the neural functions of candidate ASD-associated genes for further analysis. A substantial portion of these genes with de novo SNVs might have roles in neuronal systems, although further detailed analysis might eliminate false positive genes from the identified candidate ASD genes.

Journal Article•DOI•

De novo assembly and comparative transcriptome analysis of Euglena gracilis in response to anaerobic conditions

Yuta Yoshida¹, Takuya Tomiyama², Takanori Maruta², Masaru Tomita¹, Takahiro Ishikawa², Kazuharu Arakawa¹

Keio University¹, Shimane University²

03 Mar 2016-BMC Genomics

TL;DR: An RNA-Seq analysis provided comprehensive transcriptome information on E. gracilis for the first time and indicated that paramylon and wax ester metabolic pathways are regulated at the post-transcriptional rather than the transcriptional level in response to anaerobic conditions.

Abstract: The phytoflagellated protozoan, Euglena gracilis, has been proposed as an attractive feedstock for the accumulation of valuable compounds such as β-1,3-glucan, also known as paramylon, and wax esters. The production of wax esters proceeds under anaerobic conditions, designated as wax ester fermentation. In spite of the importance and usefulness of Euglena, the genome and transcriptome data are currently unavailable, though another research group published an E. gracilis transcriptome study during our submission. We herein performed an RNA-Seq analysis to provide a comprehensive sequence resource and some insights into the regulation of genes, including those of wax ester metabolism, by comparative transcriptome analysis of E. gracilis under aerobic and anaerobic conditions. The E. gracilis transcriptome analysis was performed using the Illumina platform and yielded 90.3 million reads after the filtering steps. A total of 49,826 components were assembled and identified as a reference sequence of E. gracilis, of which 26,479 sequences were considered to be potentially expressed (having an FPKM value greater than 1). Approximately half of all components were estimated to be regulated in a trans-splicing manner, with the addition of protruding spliced leader sequences. Nearly 40% of the 26,479 sequences were annotated by similarity to the Swiss-Prot database using the BLASTX program. A total of 2080 transcripts were identified as differentially expressed genes (DEGs) in response to anaerobic treatment for 24 h. A comprehensive pathway enrichment analysis using the KEGG pathway revealed that the majority of DEGs were involved in photosynthesis, nucleotide metabolism, oxidative phosphorylation and fatty acid metabolism. We successfully identified a candidate gene set of paramylon and wax esters, including novel β-1,3-glucan and wax ester synthases.
A comparative expression analysis of aerobic- and anaerobic-treated E. gracilis cells indicated that gene expression changes in these components were not extensive or dynamic during the anaerobic treatment. The RNA-Seq analysis provided comprehensive transcriptome information on E. gracilis for the first time, and this information will advance our understanding of this unique organism. The comprehensive analysis indicated that paramylon and wax ester metabolic pathways are regulated at the post-transcriptional rather than the transcriptional level in response to anaerobic conditions.
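
The "FPKM greater than 1" cutoff mentioned in the abstract can be illustrated with a minimal sketch. It assumes the conventional FPKM definition (fragments per kilobase of transcript per million mapped fragments); the abstract does not say how the authors computed it, and the function names and input dictionaries below are hypothetical.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Conventional FPKM: fragments per kilobase per million mapped fragments."""
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def expressed_components(counts, lengths, threshold=1.0):
    """Return names of components whose FPKM exceeds the threshold.

    counts:  hypothetical mapping of component name -> mapped fragment count
    lengths: hypothetical mapping of component name -> length in bp
    """
    total = sum(counts.values())
    return {
        name
        for name, n in counts.items()
        if fpkm(n, lengths[name], total) > threshold
    }
```

With a library of one million fragments, a 1 kb component supported by 100 fragments has an FPKM of 100 under this definition and would pass the cutoff.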

Journal Article•DOI•

Sequence Assembly of Yarrowia lipolytica Strain W29/CLIB89 Shows Transposable Element Diversity

Christophe Magnan¹, James Yu¹, Ivan Chang¹, Ethan Jahn¹, Yuzo Kanomata¹, Jenny Wu¹, Michael Zeller¹, Melanie Oakes¹, Pierre Baldi¹, Suzanne Sandmeyer¹

University of California, Irvine¹

07 Sep 2016-PLOS ONE

TL;DR: A de novo annotated assembly of the chromosomal genome of an industrially-relevant strain, W29/CLIB89, determined by hybrid next-generation sequencing underscores the utility of an additional independent genome assembly for this economically important organism.

Abstract: Yarrowia lipolytica, an oleaginous yeast, is capable of accumulating significant cellular mass in lipid, making it an important source of biosustainable hydrocarbon-based chemicals. In spite of a similar number of protein-coding genes to that in other Hemiascomycetes, the Y. lipolytica genome is almost double the size of that of model yeasts. Despite its economic importance and several distinct strains in common use, an independent genome assembly exists for only one strain. We report here a de novo annotated assembly of the chromosomal genome of an industrially-relevant strain, W29/CLIB89, determined by hybrid next-generation sequencing. For the first time, each Y. lipolytica chromosome is represented by a single contig. The telomeric rDNA repeats were localized by Irys long-range genome mapping and one complete copy of the rDNA sequence is reported. Two large structural variants and retroelement differences with reference strain CLIB122, including a full-length, novel Ty3/Gypsy long terminal repeat (LTR) retrotransposon and multiple LTR-like sequences, are described. Strikingly, several of these are adjacent to RNA polymerase III-transcribed genes, which are almost double in number in Y. lipolytica compared to other Hemiascomycetes. In addition to previously-reported dimeric RNA polymerase III-transcribed genes, tRNA pseudogenes were identified. Multiple full-length and truncated LINE elements are also present. Therefore, although identified transposons do not constitute a significant fraction of the Y. lipolytica genome, they could have played an active role in its evolution. Differences between the sequence of this strain and of the existing reference strain underscore the utility of an additional independent genome assembly for this economically important organism.

Journal Article•DOI•

FLDS: A Comprehensive dsRNA Sequencing Method for Intracellular RNA Virus Surveillance

Syun-ichi Urayama¹, Yoshihiro Takaki¹, Takuro Nunoura¹

Japan Agency for Marine-Earth Science and Technology¹

13 Feb 2016-Microbes and Environments

TL;DR: This novel dsRNA targeting metagenomic method is characterized by an extremely high recovery rate of viral RNA sequences, the retrieval of terminal sequences, and uniform read coverage, which has not previously been reported in other metagenomic methods targeting RNA viruses.

Abstract: Knowledge of the distribution and diversity of RNA viruses is still limited in spite of their possible environmental and epidemiological impacts because RNA virus-specific metagenomic methods have not yet been developed. We herein constructed an effective metagenomic method for RNA viruses by targeting long double-stranded (ds)RNA in cellular organisms, which is a hallmark of infection, or the replication of dsRNA and single-stranded (ss)RNA viruses, except for retroviruses. This novel dsRNA targeting metagenomic method is characterized by an extremely high recovery rate of viral RNA sequences, the retrieval of terminal sequences, and uniform read coverage, which has not previously been reported in other metagenomic methods targeting RNA viruses. This method revealed a previously unidentified viral RNA diversity of more than 20 complete RNA viral genomes including dsRNA and ssRNA viruses associated with an environmental diatom colony. Our approach will be a powerful tool for cataloging RNA viruses associated with organisms of interest.

Journal Article•DOI•

Genome-Wide Analysis of Small Secreted Cysteine-Rich Proteins Identifies Candidate Effector Proteins Potentially Involved in Fusarium graminearum-Wheat Interactions.

Shunwen Lu¹, Michael C. Edwards¹

United States Department of Agriculture¹

12 Jan 2016-Phytopathology

TL;DR: This work provides a solid candidate list for SSCP-derived effectors that may play roles in mediating F. graminearum-wheat interactions and the in vitro secretome-based method presented here also may be applicable for identifying candidate effectors in other ascomycete pathogens of crop plants.

Abstract: Pathogen-derived, small secreted cysteine-rich proteins (SSCPs) are known to be a common source of fungal effectors that trigger resistance or susceptibility in specific host plants. This group of proteins has not been well studied in Fusarium graminearum, the primary cause of Fusarium head blight (FHB), a devastating disease of wheat. We report here a comprehensive analysis of SSCPs encoded in the genome of this fungus and selection of candidate effector proteins through proteomics and sequence/transcriptional analyses. A total of 190 SSCPs were identified in the genome of F. graminearum (isolate PH-1) based on the presence of N-terminal signal peptide sequences, size (≤200 amino acids), and cysteine content (≥2%) of the mature proteins. Twenty-five (approximately 13%) SSCPs were confirmed to be true extracellular proteins by nanoscale liquid chromatography-tandem mass spectrometry (nanoLC-MS/MS) analysis of a minimal medium-based in vitro secretome. Sequence analysis suggested that 17 SSCPs harbor conserved functional domains, including two homologous to Ecp2, a known effector produced by the tomato pathogen Cladosporium fulvum. Transcriptional analysis revealed that at least 34 SSCPs (including 23 detected in the in vitro secretome) are expressed in infected wheat heads; about half are up-regulated with expression patterns correlating with the development of FHB. This work provides a solid candidate list for SSCP-derived effectors that may play roles in mediating F. graminearum-wheat interactions. The in vitro secretome-based method presented here also may be applicable for identifying candidate effectors in other ascomycete pathogens of crop plants.
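
The size and cysteine-content criteria quoted in the abstract (mature protein of at most 200 amino acids with at least 2% cysteine) can be sketched as a simple filter. This is only an illustration under stated assumptions: signal peptide detection is treated as a flag supplied by the caller (the abstract does not name the prediction tool), and the function names are hypothetical.

```python
def cysteine_fraction(mature_seq):
    """Fraction of residues in the mature protein that are cysteine (C)."""
    seq = mature_seq.upper()
    return seq.count("C") / len(seq) if seq else 0.0


def is_sscp_candidate(mature_seq, has_signal_peptide,
                      max_len=200, min_cys=0.02):
    """Apply the abstract's three criteria to one protein.

    mature_seq:         amino acid sequence after signal peptide removal
    has_signal_peptide: result of an external prediction step (assumed input)
    """
    return (
        has_signal_peptide
        and 0 < len(mature_seq) <= max_len
        and cysteine_fraction(mature_seq) >= min_cys
    )
```

For example, a 50-residue mature sequence containing two cysteines (4%) passes the filter when it carries a predicted signal peptide, while the same sequence without one does not.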

Journal Article•DOI•

Sequence basis of Barnacle Cement Nanostructure is Defined by Proteins with Silk Homology.

Christopher R. So¹, Kenan P. Fears¹, Dagmar H. Leary¹, Jenifer M. Scancella¹, Zheng Wang¹, Jinny L. Liu¹, Beatriz Orihuela², Dan Rittschof², Christopher M. Spillmann¹, Kathryn J. Wahl¹

United States Naval Research Laboratory¹, Duke University²

08 Nov 2016-Scientific Reports

TL;DR: Distinct primary structures defined by homologous domains shed light on how barnacles use low complexity in nanofibers to enable adhesion, and serve as a starting point for unraveling the molecular architecture of a robust and unique class of adhesive nanostructures.

Abstract: Barnacles adhere by producing a mixture of cement proteins (CPs) that organize into a permanently bonded layer displayed as nanoscale fibers. These cement proteins share no homology with any other marine adhesives, and a common sequence basis that defines how nanostructures function as adhesives remains undiscovered. Here we demonstrate that a significant unidentified portion of acorn barnacle cement is composed of low complexity proteins; they are organized into repetitive sequence blocks and found to maintain homology to silk motifs. Proteomic analysis of aggregate bands from PAGE gels reveals an abundance of Gly/Ala/Ser/Thr repeats exemplified by a prominent, previously unidentified, 43 kDa protein in the solubilized adhesive. Low complexity regions found throughout the cement proteome, as well as multiple lysyl oxidases and peroxidases, establish homology with silk-associated materials such as fibroin, silk gum sericin, and pyriform spidroins from spider silk. Distinct primary structures defined by homologous domains shed light on how barnacles use low complexity in nanofibers to enable adhesion, and serve as a starting point for unraveling the molecular architecture of a robust and unique class of adhesive nanostructures.

Journal Article•DOI•

Genome-Wide Identification and Expression Profiling of Tomato Hsp20 Gene Family in Response to Biotic and Abiotic Stresses

Jiahong Yu, Cheng Yuan, Kun Feng, Meiying Ruan, Qingjing Ye, Rongqing Wang, Zhimiao Li, Guozhi Zhou, Zhuping Yao, Yuejian Yang, Hongjian Wan

17 Aug 2016-Frontiers in Plant Science

TL;DR: The transcript levels of SlHsp20 genes could be induced profusely by abiotic and biotic stresses such as heat, drought, salt, Botrytis cinerea, and Tomato Spotted Wilt Virus, indicating their potential roles in mediating the response of tomato plants to environmental stresses.

Abstract: The Hsp20 genes are involved in the response of plants to environmental stresses including heat shock and also play a vital role in plant growth and development. They represent the most abundant small heat shock proteins (sHsps) in plants, but little is known about this family in tomato (Solanum lycopersicum), an important vegetable crop in the world. Here, we characterized the heat shock protein 20 (SlHsp20) gene family in tomato through integration of gene structure, chromosome location, phylogenetic relationship and expression profile. Using bioinformatics-based methods, we identified at least 42 putative SlHsp20 genes in tomato. Sequence analysis revealed that most SlHsp20 genes possessed no intron or a relatively short intron. Chromosome mapping indicated that inter-arm and intra-chromosome duplication events contributed remarkably to the expansion of SlHsp20 genes. A phylogenetic tree of Hsp20 genes from tomato and other plant species revealed that SlHsp20 genes were grouped into 13 subfamilies, indicating that these genes may have a common ancestor that generated diverse subfamilies prior to the monocot-dicot split. In addition, expression analysis using RNA-seq in various tissues and developmental stages of cultivated tomato and the wild relative Solanum pimpinellifolium revealed that most of these genes (83%) were expressed in at least one stage from at least one genotype. Out of 42 genes, 4 genes were expressed constitutively in almost all the tissues analyzed, implying that these genes might have specific housekeeping functions in tomato cells under normal growth conditions. Two SlHsp20 genes displayed differential expression levels between cultivated tomato and S. pimpinellifolium in vegetative (leaf and root) and reproductive organs (floral bud and flower), suggesting inter-species diversification for functional specialization during the process of domestication.
Based on genome-wide microarray analysis, we showed that the transcript levels of SlHsp20 genes could be induced profusely by abiotic and biotic stresses such as heat, drought, salt, Botrytis cinerea and Tomato Spotted Wilt Virus, indicating their potential roles in mediating the response of tomato plants to environmental stresses. In conclusion, these results provide valuable information for elucidating the evolutionary relationship of the Hsp20 gene family and for functional characterization of the SlHsp20 gene family in the future.

Journal Article•DOI•

The landscape of fusion transcripts in spitzoid melanoma and biologically indeterminate spitzoid tumors by RNA sequencing

Gang Wu¹, Raymond L. Barnhill², Seungjae Lee¹, Yongjin Li¹, Ying Shao¹, John Easton¹, James Dalton¹, Jinghui Zhang¹, Alberto S. Pappo¹, Armita Bahrami¹

St. Jude Children's Research Hospital¹, University of Paris²

19 Feb 2016-Modern Pathology

TL;DR: In this article, the authors performed whole-transcriptome sequencing using formalin-fixed, paraffin-embedded (FFPE) tissues in malignant or biologically indeterminate spitzoid tumors from 7 patients (age 2-14 years).

Journal Article•DOI•

Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison.

Tung Hoang, Changchuan Yin, Stephen S.-T. Yau¹

Tsinghua University¹

01 Oct 2016-Genomics

TL;DR: This research proposes to encode DNA sequences by treating 2D chaos game representation (CGR) coordinates as complex numbers and to apply digital signal processing methods to analyze their evolutionary relationships; the approach gives results comparable to the state-of-the-art multiple sequence alignment method, Clustal Omega.
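
The encoding idea in this summary can be sketched roughly as follows: each nucleotide moves the current point halfway toward its corner of a square (the usual chaos game step), and each resulting 2D point is stored as a complex number x + iy. The corner assignment, the starting point at the center of the square, and the function name `cgr_complex` are assumptions made for illustration, not details taken from the paper.

```python
# Hypothetical corner layout: one vertex of the square per nucleotide.
CORNERS = {
    "A": complex(-1, -1),
    "C": complex(-1, 1),
    "G": complex(1, 1),
    "T": complex(1, -1),
}


def cgr_complex(seq):
    """Return the CGR trajectory of a DNA sequence as complex numbers."""
    point = complex(0, 0)  # assumed start: center of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            raise ValueError(f"unsupported base: {base!r}")
        # Chaos game step: move halfway toward the corner of this base.
        point = (point + CORNERS[base]) / 2
        points.append(point)
    return points
```

The resulting complex-valued series has one point per nucleotide, so it could then be passed to a signal processing routine such as a discrete Fourier transform; which transform and distance measure the authors actually use is not stated in the summary.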

Posted Content•

Distributed Representations for Biological Sequence Analysis.

Dhananjay Kimothi, Akshay Soni, Pravesh Biyani, James M. Hogan

21 Aug 2016-arXiv: Learning

TL;DR: A new method is presented, called seq2vec, to represent a complete biological sequence in a Euclidean space, which has the potential to capture the contextual information of the original sequence necessary for sequence comparison tasks.

Abstract: Biological sequence comparison is a key step in inferring the relatedness of various organisms and the functional similarity of their components. Thanks to Next Generation Sequencing efforts, an abundance of sequence data is now available to be processed for a range of bioinformatics applications. Embedding a biological sequence over a nucleotide or amino acid alphabet in a lower dimensional vector space makes the data more amenable for use by current machine learning tools, provided the quality of the embedding is high and it captures the most meaningful information of the original sequences. Motivated by recent advances in the text document embedding literature, we present a new method, called seq2vec, to represent a complete biological sequence in a Euclidean space. The new representation has the potential to capture the contextual information of the original sequence necessary for sequence comparison tasks. We test our embeddings with protein sequence classification and retrieval tasks and demonstrate encouraging outcomes.

Journal Article•DOI•

Whole-genome sequencing overcomes pseudogene homology to diagnose autosomal dominant polycystic kidney disease.

Amali Mallawaarachchi¹, Yvonne J. Hort¹, Mark J. Cowley¹, Mark J. McCabe¹,², André E. Minoche¹, Marcel E. Dinger¹,², John Shine¹, Timothy J. Furlong¹,³

Garvan Institute of Medical Research¹, University of New South Wales², St. Vincent's Health System³

11 May 2016-European Journal of Human Genetics

TL;DR: This paper reports the first use of whole-genome sequencing (WGS) to diagnose ADPKD, which overcomes pseudogene homology, provides uniform coverage, detects all variant types in a single test and is less labour-intensive than current techniques.

Abstract: Autosomal dominant polycystic kidney disease (ADPKD) is the most common monogenic kidney disorder and is due to disease-causing variants in PKD1 or PKD2. Strong genotype-phenotype correlation exists, although diagnostic sequencing is not part of routine clinical practice. This is because PKD1 bears 97.7% sequence similarity with six pseudogenes, requiring laborious and error-prone long-range PCR and Sanger sequencing to overcome. We hypothesised that whole-genome sequencing (WGS) would be able to overcome the problem of this sequence homology, because of its 150 bp paired-end reads and its avoidance of the capture bias that arises from targeted sequencing. We prospectively recruited a cohort of 28 unique pedigrees with ADPKD phenotype. Standard DNA extraction, library preparation and WGS were performed using Illumina HiSeq X and variants were classified following standard guidelines. Molecular diagnosis was made in 24 patients (86%), with 100% variant confirmation by the current gold standard of long-range PCR and Sanger sequencing. We demonstrated unique alignment of sequencing reads over the pseudogene-homologous region. In addition to identifying function-affecting single-nucleotide variants and indels, we identified single- and multi-exon deletions affecting PKD1 and PKD2, which would have been challenging to identify using exome sequencing. We report the first use of WGS to diagnose ADPKD. This method overcomes pseudogene homology, provides uniform coverage, detects all variant types in a single test and is less labour-intensive than current techniques. This technique is translatable to a diagnostic setting, allows clinicians to make better-informed management decisions and has implications for other disease groups that are challenged by regions of confounding sequence homology.
