Showing papers on "Sequence assembly" published in 2015

Journal Article•DOI•

StringTie enables improved reconstruction of a transcriptome from RNA-seq reads

Mihaela Pertea¹, Geo Pertea¹, Corina Antonescu¹, Tsung Cheng Chang², Joshua T. Mendell², Steven L. Salzberg¹•Institutions (2)

Johns Hopkins University¹, University of Texas Southwestern Medical Center²

01 Mar 2015-Nature Biotechnology

TL;DR: StringTie is a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble RNA-seq reads into transcripts; it produces more complete and accurate reconstructions of genes and better estimates of expression levels than other leading transcript assemblers.

Abstract: Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.
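The relative improvements quoted in this abstract can be checked with a few lines of arithmetic. The sketch below only recomputes the percentages from the transcript counts given above; the function name `percent_increase` is a hypothetical helper, not part of StringTie.

```python
def percent_increase(new: int, baseline: int) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Counts of correctly assembled transcripts quoted in the abstract.
blood = percent_increase(10_990, 7_187)     # StringTie vs. Cufflinks, human blood
simulated = percent_increase(7_559, 6_310)  # StringTie vs. Cufflinks, simulated data

print(f"human blood: {blood:.1f}%")
print(f"simulated:   {simulated:.1f}%")
```

Rounded to whole percentages, both values agree with the 53% and 20% figures stated in the abstract.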

6,594 citations

Journal Article•DOI•

PacBio Sequencing and Its Applications.

Anthony Rhoads¹, Kin Fai Au¹•Institutions (1)

University of Iowa¹

01 Oct 2015-Genomics, Proteomics & Bioinformatics

TL;DR: Single-molecule, real-time sequencing developed by Pacific BioSciences offers longer read lengths than the second-generation sequencing technologies, making it well-suited for unsolved problems in genome, transcriptome, and epigenetics research.

1,542 citations

Journal Article•DOI•

De novo assembly of bacterial transcriptomes from RNA-seq data

Brian Tjaden¹•Institutions (1)

Wellesley College¹

13 Jan 2015-Genome Biology

TL;DR: This work presents novel algorithms, specific to bacterial gene structures and transcriptomes, for analysis of bacterial RNA-seq data and de novo transcriptome assembly, implemented in an open source software system called Rockhopper 2.

Abstract: Transcriptome assays are increasingly being performed by high-throughput RNA sequencing (RNA-seq). For organisms whose genomes have not been sequenced and annotated, transcriptomes must be assembled de novo from the RNA-seq data. Here, we present novel algorithms, specific to bacterial gene structures and transcriptomes, for analysis of bacterial RNA-seq data and de novo transcriptome assembly. The algorithms are implemented in an open source software system called Rockhopper 2. We find that Rockhopper 2 outperforms other de novo transcriptome assemblers and offers accurate and efficient analysis of bacterial RNA-seq data. Rockhopper 2 is available at http://cs.wellesley.edu/~btjaden/Rockhopper.

1,437 citations

Journal Article•DOI•

Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

Konstantin Berlin¹, Sergey Koren, Chen-Shan Chin², James P Drake², Jane M. Landolin², Adam M. Phillippy•Institutions (2)

University of Maryland, College Park¹, Pacific Biosciences²

01 Jun 2015-Nature Biotechnology

TL;DR: The MinHash Alignment Process (MHAP) is introduced for overlapping noisy, long reads using probabilistic, locality-sensitive hashing and can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.

Abstract: Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.
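The idea behind MinHash-based overlap detection can be illustrated with a small sketch: two reads that share many k-mers will, with high probability, share many minimum hash values. This is a generic MinHash estimate of Jaccard similarity over k-mer sets, not MHAP's actual implementation; the hash construction (seeded BLAKE2b), the function names and the parameter values are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib


def kmers(seq: str, k: int) -> set[str]:
    """Set of all length-k substrings of seq."""
    if len(seq) < k:
        raise ValueError("sequence shorter than k")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer: str, seed: int) -> int:
    """Seeded 64-bit hash; each seed acts as an independent hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq: str, k: int, num_hashes: int) -> list[int]:
    """For each hash function, keep the minimum hash over all k-mers."""
    ks = kmers(seq, k)
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a: list[int], sketch_b: list[int]) -> float:
    """Fraction of sketch positions that agree estimates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(a: str, b: str, k: int) -> float:
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)


if __name__ == "__main__":
    read_a = "ACGTACGTTAGCCGATAGGCTTAACG"
    read_b = "CGTTAGCCGATAGGCTTAACGGATCC"  # overlaps the tail of read_a
    sa = minhash_sketch(read_a, k=5, num_hashes=64)
    sb = minhash_sketch(read_b, k=5, num_hashes=64)
    print("estimated:", estimate_jaccard(sa, sb))
    print("exact:    ", exact_jaccard(read_a, read_b, k=5))
```

The sketch size is fixed regardless of read length, which is what makes comparing many long reads against each other tractable; a larger `num_hashes` gives a tighter estimate at higher cost.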

886 citations

Journal Article•DOI•

A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data

David A. Coil¹, Guillaume Jospin¹, Aaron E. Darling¹•Institutions (1)

University of Technology, Sydney¹

15 Feb 2015-Bioinformatics

TL;DR: Unlike the original A5 pipeline, A5-miseq can use long reads from the Illumina MiSeq, use read pairing information during contig generation and includes several improvements to read trimming, resulting in substantially improved assemblies that recover a more complete set of reference genes than previous methods.

Abstract: MOTIVATION: Open-source bacterial genome assembly remains inaccessible to many biologists because of its complexity. Few software solutions exist that are capable of automating all steps in the process of de novo genome assembly from Illumina data. RESULTS: A5-miseq can produce high-quality microbial genome assemblies on a laptop computer without any parameter tuning. A5-miseq does this by automating the process of adapter trimming, quality filtering, error correction, contig and scaffold generation and detection of misassemblies. Unlike the original A5 pipeline, A5-miseq can use long reads from the Illumina MiSeq, use read pairing information during contig generation and includes several improvements to read trimming. Together, these changes result in substantially improved assemblies that recover a more complete set of reference genes than previous methods. AVAILABILITY: A5-miseq is licensed under the GPL open-source license. Source code and precompiled binaries for Mac OS X 10.6+ and Linux 2.6.15+ are available from http://sourceforge.net/projects/ngopt. CONTACT: aaron.darling@uts.edu.au. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

882 citations

Journal Article•DOI•

Inexpensive Multiplexed Library Preparation for Megabase-Sized Genomes

Michael H. Baym¹, Sergey Kryazhimskiy¹, Tami D. Lieberman¹, Hattie Chung¹, Michael M. Desai¹, Roy Kishony²•Institutions (2)

Harvard University¹, Technion – Israel Institute of Technology²

22 May 2015-PLOS ONE

TL;DR: A protocol for rapid and inexpensive preparation of hundreds of multiplexed genomic libraries for Illumina sequencing by carrying out the Nextera tagmentation reaction in small volumes, replacing costly reagents with cheaper equivalents, and omitting unnecessary steps is presented.

Abstract: Whole-genome sequencing has become an indispensable tool of modern biology. However, the cost of sample preparation relative to the cost of sequencing remains high, especially for small genomes where the former is dominant. Here we present a protocol for rapid and inexpensive preparation of hundreds of multiplexed genomic libraries for Illumina sequencing. By carrying out the Nextera tagmentation reaction in small volumes, replacing costly reagents with cheaper equivalents, and omitting unnecessary steps, we achieve a cost of library preparation of $8 per sample, approximately 6 times cheaper than the standard Nextera XT protocol. Furthermore, our procedure takes less than 5 hours for 96 samples. Several hundred samples can then be pooled on the same HiSeq lane via custom barcodes. Our method will be useful for re-sequencing of microbial or viral genomes, including those from evolution experiments, genetic screens, and environmental samples, as well as for other sequencing applications including large amplicon, open chromosome, artificial chromosomes, and RNA sequencing.

609 citations

Journal Article•DOI•

Assessing the performance of the Oxford Nanopore Technologies MinION.

Thomas W Laver¹, James W. Harrison¹, Paul O'Neill¹, Karen Moore¹, Audrey Farbos¹, Konrad Paszkiewicz¹, David J. Studholme¹•Institutions (1)

University of Exeter¹

01 Mar 2015-Biomolecular Detection and Quantification

TL;DR: The MinION, the first nanopore-based single-molecule sequencer available to researchers, is assessed, and its sequence reads are shown to enhance the contiguity of de novo assembly when used in conjunction with Illumina MiSeq data.

456 citations

Journal Article•DOI•

Chromosome-scale shotgun assembly using an in vitro method for long-range linkage

Nicholas H. Putnam, Brendan O'Connell, Jonathan C. Stites, Brandon J. Rice, Andrew Fields, Paul D. Hartley, Charles W. Sugnet, David Haussler, Daniel S. Rokhsar, Richard E. Green

18 Feb 2015-arXiv: Genomics

TL;DR: In this article, a simpler approach based on in vitro reconstituted chromatin was proposed to increase the scaffold contiguity of assembly and provide haplotype phasing information.

Abstract: Long-range and highly accurate de novo assembly from short-read data is one of the most pressing challenges in genomics. Recently, it has been shown that read pairs generated by proximity ligation of DNA in chromatin of living tissue can address this problem. These data dramatically increase the scaffold contiguity of assemblies and provide haplotype phasing information. Here, we describe a simpler approach ("Chicago") based on in vitro reconstituted chromatin. We generated two Chicago datasets with human DNA and used a new software pipeline ("HiRise") to construct a highly accurate de novo assembly and scaffolding of a human genome with scaffold N50 of 30 Mb. We also demonstrated the utility of Chicago for improving existing assemblies by re-assembling and scaffolding the genome of the American alligator. With a single library and one lane of Illumina HiSeq sequencing, we increased the scaffold N50 of the American alligator from 508 kb to 10 Mb. Our method uses established molecular biology procedures and can be used to analyze any genome, as it requires only about 5 micrograms of DNA as the starting material.
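Several abstracts on this page report contiguity as N50. As a reminder of what the metric measures, here is a minimal sketch of the usual definition: the length L such that sequences of length at least L together contain at least half of the total assembled bases. The function name and the toy lengths are illustrative only and are not taken from any of the listed papers.

```python
def n50(lengths: list[int]) -> int:
    """Length of the sequence at which the running total, taken over
    sequences sorted from longest to shortest, first reaches half of
    the total assembly length."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Toy scaffold lengths (arbitrary units), total = 54.
    scaffolds = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(n50(scaffolds))  # 10 + 9 + 8 = 27, half of 54, so N50 is 8
```

Because N50 depends only on the length distribution, it says nothing about correctness: joining scaffolds incorrectly also raises it, which is why the abstracts pair it with accuracy measures.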

414 citations

Journal Article•DOI•

FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations

Gilberto dos Santos¹, Andrew J. Schroeder¹, Joshua L. Goodman², Victor B. Strelets², Madeline A. Crosby¹, Jim Thurmond², David B. Emmert¹, William M. Gelbart¹•Institutions (2)

Harvard University¹, Indiana University²

28 Jan 2015-Nucleic Acids Research

TL;DR: This report describes the attributes of the new Release 6 reference genome assembly, the migration of FlyBase genome annotations to this new assembly, how genome features on this new assembly can be viewed in FlyBase, and how users can convert coordinates for their own data to the corresponding Release 6 coordinates.

Abstract: Release 6, the latest reference genome assembly of the fruit fly Drosophila melanogaster, was released by the Berkeley Drosophila Genome Project in 2014; it replaces their previous Release 5 genome assembly, which had been the reference genome assembly for over 7 years. With the enormous amount of information now attached to the D. melanogaster genome in public repositories and individual laboratories, the replacement of the previous assembly by the new one is a major event requiring careful migration of annotations and genome-anchored data to the new, improved assembly. In this report, we describe the attributes of the new Release 6 reference genome assembly, the migration of FlyBase genome annotations to this new assembly, how genome features on this new assembly can be viewed in FlyBase (http://flybase.org) and how users can convert coordinates for their own data to the corresponding Release 6 coordinates.

409 citations

Journal Article•DOI•

Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads

Li Song¹, Liliana Florea²•Institutions (2)

Johns Hopkins University¹, Johns Hopkins University School of Medicine²

19 Oct 2015-GigaScience

TL;DR: A k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads, which has an accuracy higher than or comparable to existing methods, including the only other method (SEECER), and is more time and memory efficient.

Abstract: Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing. We developed a k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read. Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/ .
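To make the notion of "trusted k-mers" concrete, the sketch below counts k-mers across a set of reads and flags read positions whose k-mer falls below a single global count threshold. This is the simpler WGS-style scheme that the abstract contrasts with Rcorrector; it is not Rcorrector's local-threshold method, and the function names, the toy reads and the threshold value are assumptions for illustration.

```python
from collections import Counter


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count every k-mer occurrence across all reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def untrusted_positions(read: str, counts: Counter, k: int, threshold: int) -> list[int]:
    """Start positions in `read` whose k-mer is seen fewer than `threshold` times."""
    return [
        i
        for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < threshold
    ]


if __name__ == "__main__":
    # Five error-free copies of a read plus one copy with a substitution (A -> T).
    reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
    counts = count_kmers(reads, k=4)
    print(untrusted_positions("ACGTTCGT", counts, k=4, threshold=2))
    print(untrusted_positions("ACGTACGT", counts, k=4, threshold=2))
```

A single global threshold works when coverage is roughly uniform. In RNA-seq, coverage varies with expression level, so a fixed cutoff would either discard genuine k-mers from weakly expressed transcripts or retain errors from highly expressed ones, which is the motivation the abstract gives for a position-specific threshold.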

359 citations

Journal Article•DOI•

Restriction site-associated DNA sequencing, genotyping error estimation and de novo assembly optimization for population genetic inference

Alicia Mastretta-Yanes¹, Nils Arrigo², Nadir Alvarez², Tove H. Jorgensen³, Daniel Piñero⁴, Brent C. Emerson⁵, Brent C. Emerson¹•Institutions (5)

University of East Anglia¹, University of Lausanne², Aarhus University³, National Autonomous University of Mexico⁴, Spanish National Research Council⁵

01 Jan 2015-Molecular Ecology Resources

TL;DR: Individual sample replicates are used, under the expectation of identical genotypes, to quantify genotyping error in the absence of a reference genome and optimize de novo assembly parameters within the program Stacks, by minimizing error and maximizing the retrieval of informative loci.

Abstract: Restriction site-associated DNA sequencing (RADseq) provides researchers with the ability to record genetic polymorphism across thousands of loci for nonmodel organisms, potentially revolutionizing the field of molecular ecology. However, as with other genotyping methods, RADseq is prone to a number of sources of error that may have consequential effects for population genetic inferences, and these have received only limited attention in terms of the estimation and reporting of genotyping error rates. Here we use individual sample replicates, under the expectation of identical genotypes, to quantify genotyping error in the absence of a reference genome. We then use sample replicates to (i) optimize de novo assembly parameters within the program Stacks, by minimizing error and maximizing the retrieval of informative loci; and (ii) quantify error rates for loci, alleles and single-nucleotide polymorphisms. As an empirical example, we use a double-digest RAD data set of a nonmodel plant species, Berberis alpina, collected from high-altitude mountains in Mexico.

Journal Article•DOI•

Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome

Sara Goodwin¹, James Gurtowski¹, Scott Ethe-Sayers¹, Panchajanya Deshpande¹, Michael C. Schatz¹, W. Richard McCombie¹•Institutions (1)

Cold Spring Harbor Laboratory¹

07 Oct 2015-Genome Research

TL;DR: The assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.

Abstract: Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, and we used this for sequencing the Saccharomyces cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm Nanocorr specifically for Oxford Nanopore reads, because existing packages were incapable of assembling the long read lengths (5-50 kbp) at such high error rates (between ∼5% and 40% error). With this new method, we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: The contig N50 length is more than ten times greater than an Illumina-only assembly (678 kb versus 59.9 kbp) and has >99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.

Journal Article•DOI•

Genetic variation and the de novo assembly of human genomes.

Mark Chaisson¹, Richard K. Wilson², Evan E. Eichler¹•Institutions (2)

University of Washington¹, Washington University in St. Louis²

01 Nov 2015-Nature Reviews Genetics

TL;DR: Recent technological advances that improve both contiguity and accuracy are summarized and the importance of complete de novo assembly as opposed to read mapping is emphasized as the primary means to understanding the full range of human genetic variation.

Abstract: The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete de novo assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete de novo genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.

Journal Article•DOI•

The Release 6 reference sequence of the Drosophila melanogaster genome

Roger A. Hoskins¹, Joseph W. Carlson¹, Kenneth H. Wan¹, Soo Park¹, Ivonne Mendez¹, Samuel E. Galle¹, Benjamin W. Booth¹, Barret D. Pfeiffer², Reed A. George², Robert Svirskas², Martin Krzywinski³, Jacqueline E. Schein³, Maria Carmela Accardo⁴, Elisabetta Damia⁴, Giovanni Messina⁴, Maria Mendez-Lago⁵, Beatriz de Pablos⁵, Olga V. Demakova⁶, Evgeniya N. Andreyeva⁶, Lidiya V. Boldyreva⁶, Marco A. Marra³, A. Bernardo Carvalho⁷, Patrizio Dimitri⁴, Alfredo Villasante⁵, Igor F. Zhimulev⁶, Igor F. Zhimulev⁸, Gerald M. Rubin², Gary H. Karpen¹, Gary H. Karpen⁹, Susan E. Celniker¹•Institutions (9)

Lawrence Berkeley National Laboratory¹, Howard Hughes Medical Institute², BC Cancer Agency³, Sapienza University of Rome⁴, Spanish National Research Council⁵, Russian Academy of Sciences⁶, Federal University of Rio de Janeiro⁷, Novosibirsk State University⁸, University of California, Berkeley⁹

14 Jan 2015-Genome Research

TL;DR: An improved reference sequence of the single-copy and middle-repetitive regions of the genome is reported, produced using cytogenetic mapping to mitotic and polytene chromosomes, clone-based finishing and BAC fingerprint verification, ordering of scaffolds by alignment to cDNA sequences, incorporation of other map and sequence data, and validation by whole-genome optical restriction mapping.

Abstract: Drosophila melanogaster plays an important role in molecular, genetic, and genomic studies of heredity, development, metabolism, behavior, and human disease. The initial reference genome sequence reported more than a decade ago had a profound impact on progress in Drosophila research, and improving the accuracy and completeness of this sequence continues to be important to further progress. We previously described improvement of the 117-Mb sequence in the euchromatic portion of the genome and 21 Mb in the heterochromatic portion, using a whole-genome shotgun assembly, BAC physical mapping, and clone-based finishing. Here, we report an improved reference sequence of the single-copy and middle-repetitive regions of the genome, produced using cytogenetic mapping to mitotic and polytene chromosomes, clone-based finishing and BAC fingerprint verification, ordering of scaffolds by alignment to cDNA sequences, incorporation of other map and sequence data, and validation by whole-genome optical restriction mapping. These data substantially improve the accuracy and completeness of the reference sequence and the order and orientation of sequence scaffolds into chromosome arm assemblies. Representation of the Y chromosome and other heterochromatic regions is particularly improved. The new 143.9-Mb reference sequence, designated Release 6, effectively exhausts clone-based technologies for mapping and sequencing. Highly repeat-rich regions, including large satellite blocks and functional elements such as the ribosomal RNA genes and the centromeres, are largely inaccessible to current sequencing and assembly methods and remain poorly represented. Further significant improvements will require sequencing technologies that do not depend on molecular cloning and that produce very long reads.

Journal Article•DOI•

The Drosophila Genome Nexus: A Population Genomic Resource of 623 Drosophila melanogaster Genomes, Including 197 from a Single Ancestral Range Population

Justin B. Lack¹, Charis Cardeno², Marc W. Crepeau², William Taylor¹, Russell Corbett-Detig³, Kristian Stevens², Charles H. Langley², John E. Pool¹•Institutions (3)

University of Wisconsin-Madison¹, University of California, Davis², University of California, Berkeley³

01 Apr 2015-Genetics

TL;DR: An assembly strategy is settled on that utilizes two alignment programs and incorporates both substitutions and short indels to construct an updated reference for a second round of mapping prior to final variant detection, which will greatly facilitate population genomic analysis in this model species by reducing the methodological differences between data sets.

Abstract: Hundreds of wild-derived Drosophila melanogaster genomes have been published, but rigorous comparisons across data sets are precluded by differences in alignment methodology. The most common approach to reference-based genome assembly is a single round of alignment followed by quality filtering and variant detection. We evaluated variations and extensions of this approach and settled on an assembly strategy that utilizes two alignment programs and incorporates both substitutions and short indels to construct an updated reference for a second round of mapping prior to final variant detection. Utilizing this approach, we reassembled published D. melanogaster population genomic data sets and added unpublished genomes from several sub-Saharan populations. Most notably, we present aligned data from phase 3 of the Drosophila Population Genomics Project (DPGP3), which provides 197 genomes from a single ancestral range population of D. melanogaster (from Zambia). The large sample size, high genetic diversity, and potentially simpler demographic history of the DPGP3 sample will make this a highly valuable resource for fundamental population genetic research. The complete set of assemblies described here, termed the Drosophila Genome Nexus, presently comprises 623 consistently aligned genomes and is publicly available in multiple formats with supporting documentation and bioinformatic tools. This resource will greatly facilitate population genomic analysis in this model species by reducing the methodological differences between data sets.

Journal Article•DOI•

Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum.

Robert VanBuren¹, Doug Bryant¹, Patrick P. Edger², Patrick P. Edger³, Haibao Tang⁴, Haibao Tang⁵, Diane Burgess³, Dinakar Challabathula⁶, Kristi E. Spittle⁷, Richard Hall⁷, Jenny Gu⁷, Eric Lyons⁴, Michael Freeling³, Dorothea Bartels⁶, Boudewijn F.H. Ten Hallers, Alex Hastie, Todd P. Michael, Todd C. Mockler¹•Institutions (7)

Donald Danforth Plant Science Center¹, Michigan State University², University of California, Berkeley³, University of Arizona⁴, Fujian Agriculture and Forestry University⁵, University of Bonn⁶, Pacific Biosciences⁷

26 Nov 2015-Nature

TL;DR: The Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.

Abstract: Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which complicates genome assembly. The short read lengths of early Sanger and current next-generation sequencing platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented, lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the Encyclopedia of DNA Elements (ENCODE). Here we report the whole-genome sequencing and assembly of the desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetium genome into 625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a 'near-complete' draft genome which includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres, transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466 protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest known grass genome. The Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.

Journal Article•DOI•

A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome

Jarrod Chapman¹, Martin Mascher², Aydin Buluc³, Kerrie Barry¹, Evangelos Georganas⁴, Evangelos Georganas³, Adam M. Session⁴, Veronika Strnadova⁵, Jerry Jenkins¹, Sunish K. Sehgal⁶, Sunish K. Sehgal⁷, Leonid Oliker³, Jeremy Schmutz¹, Katherine Yelick³, Katherine Yelick⁴, Uwe Scholz², Robbie Waugh⁸, Jesse Poland⁶, Gary J. Muehlbauer⁹, Nils Stein², Daniel S. Rokhsar⁴, Daniel S. Rokhsar¹•Institutions (9)

Joint Genome Institute¹, Leibniz Association², Lawrence Berkeley National Laboratory³, University of California, Berkeley⁴, University of California, Santa Barbara⁵, Kansas State University⁶, South Dakota State University⁷, James Hutton Institute⁸, University of Minnesota⁹

31 Jan 2015-Genome Biology

TL;DR: A de novo sequence assembly representing 9.1 Gbp of the highly repetitive 16 Gbp genome of hexaploid wheat, Triticum aestivum, is presented, and 7.1 Gb of this assembly is assigned to chromosomal locations.

Abstract: Polyploid species have long been thought to be recalcitrant to whole-genome assembly. By combining high-throughput sequencing, recent developments in parallel computing, and genetic mapping, we derive, de novo, a sequence assembly representing 9.1 Gbp of the highly repetitive 16 Gbp genome of hexaploid wheat, Triticum aestivum, and assign 7.1 Gb of this assembly to chromosomal locations. The genome representation and accuracy of our assembly is comparable or even exceeds that of a chromosome-by-chromosome shotgun assembly. Our assembly and mapping strategy uses only short read sequencing technology and is applicable to any species where it is possible to construct a mapping population.

Journal Article•DOI•

An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data

Xutao Deng¹, Samia N. Naccache¹, Terry Ng¹, Scot Federman¹, Linlin Li¹, Charles Y. Chiu¹, Eric Delwart¹•Institutions (1)

University of California, San Francisco¹

20 Apr 2015-Nucleic Acids Research

TL;DR: This work describes an ensemble strategy that integrates the sequential use of various de Bruijn graph and overlap-layout-consensus assemblers with a novel partitioned sub-assembly approach, and proposes new quality metrics that are suitable for evaluating metagenome de novo assembly.

Abstract: Next-generation sequencing (NGS) approaches rapidly produce millions to billions of short reads, which allow pathogen detection and discovery in human clinical, animal and environmental samples. A major limitation of sequence homology-based identification for highly divergent microorganisms is the short length of reads generated by most highly parallel sequencing technologies. Short reads require a high level of sequence similarities to annotated genes to confidently predict gene function or homology. Such recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We describe an ensemble strategy that integrates the sequential use of various de Bruijn graph and overlap-layout-consensus assemblers with a novel partitioned sub-assembly approach. We also proposed new quality metrics that are suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy tested using in silico spike-in, clinical and environmental NGS datasets achieved significantly better contigs than current approaches.

Journal Article•DOI•

LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads

René L. Warren¹, Chen Yang¹, Benjamin P. Vandervalk¹, Bahar Behsaz¹, Albert Lagman¹, Steven J. M. Jones¹, Inanc Birol¹

BC Cancer Agency¹

04 Aug 2015-GigaScience

TL;DR: LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that makes use of the sequence properties of nanopore sequence data and other error-containing sequence data, to scaffold high-quality genome assemblies, without the need for read alignment or base correction is presented.

Abstract: Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. In this regard, established and emerging long read technologies show great promise, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they can be of value. We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that makes use of the sequence properties of nanopore sequence data and other error-containing sequence data, to scaffold high-quality genome assemblies, without the need for read alignment or base correction. Here, we show how the contiguity of an ABySS Escherichia coli K-12 genome assembly can be increased greater than five-fold by the use of beta-released Oxford Nanopore Technologies Ltd. long reads and how LINKS leverages long-range information in Saccharomyces cerevisiae W303 nanopore reads to yield assemblies whose resulting contiguity and correctness are on par with or better than that of competing applications. We also present the re-scaffolding of the colossal white spruce (Picea glauca) draft assembly (PG29, 20 Gbp) and demonstrate how LINKS scales to larger genomes. This study highlights the present utility of nanopore reads for genome scaffolding in spite of their current limitations, which are expected to diminish as the nanopore sequencing technology advances. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts.

Journal Article•DOI•

Genome assembly using Nanopore-guided long and error-free DNA reads

Mohammed-Amin Madoui¹, Stefan Engelen¹, Corinne Cruaud¹, Caroline Belser¹, Laurie Bertrand¹, Adriana Alberti¹, Arnaud Lemainque¹, Patrick Wincker¹,²,³, Jean-Marc Aury¹

French Alternative Energies and Atomic Energy Commission¹, University of Évry Val d'Essonne², Centre national de la recherche scientifique³

20 Apr 2015-BMC Genomics

TL;DR: The hybrid strategy was able to generate NaS (Nanopore Synthetic-long) reads up to 60 kb that aligned entirely and with no error to the reference genome and that spanned highly conserved repetitive regions, in contrast to an Illumina-only assembly.

Abstract: Long-read sequencing technologies were launched a few years ago, and in contrast with short-read sequencing technologies, they offered a promise of solving assembly problems for large and complex genomes. Moreover by providing long-range information, it could also solve haplotype phasing. However, existing long-read technologies still have several limitations that complicate their use for most research laboratories, as well as in large and/or complex genome projects. In 2014, Oxford Nanopore released the MinION® device, a small and low-cost single-molecule nanopore sequencer, which offers the possibility of sequencing long DNA fragments. The assembly of long reads generated using the Oxford Nanopore MinION® instrument is challenging as existing assemblers were not implemented to deal with long reads exhibiting close to 30% of errors. Here, we presented a hybrid approach developed to take advantage of data generated using MinION® device. We sequenced a well-known bacterium, Acinetobacter baylyi ADP1 and applied our method to obtain a highly contiguous (one single contig) and accurate genome assembly even in repetitive regions, in contrast to an Illumina-only assembly. Our hybrid strategy was able to generate NaS (Nanopore Synthetic-long) reads up to 60 kb that aligned entirely and with no error to the reference genome and that spanned highly conserved repetitive regions. The average accuracy of NaS reads reached 99.99% without losing the initial size of the input MinION® reads. We described NaS tool, a hybrid approach allowing the sequencing of microbial genomes using the MinION® device. Our method, based ideally on 20x and 50x of NaS and Illumina reads respectively, provides an efficient and cost-effective way of sequencing microbial or small eukaryotic genomes in a very short time even in small facilities. Moreover, we demonstrated that although the Oxford Nanopore technology is a relatively new sequencing technology, currently with a high error rate, it is already useful in the generation of high-quality genome assemblies.

Journal Article•DOI•

IVA: accurate de novo assembly of RNA virus genomes.

Martin Hunt¹, Astrid Gall¹, Swee Hoe Ong¹, Jacqui Brener², Bridget Ferns³, Philip J. R. Goulder², Eleni Nastouli⁴, Jacqueline A. Keane¹, Paul Kellam¹, Thomas D. Otto¹

Wellcome Trust Sanger Institute¹, University of Oxford², University College London³, University College Hospital⁴

15 Jul 2015-Bioinformatics

TL;DR: IVA (Iterative Virus Assembler), a new de novo assembler designed specifically for read pairs sequenced at highly variable depth from RNA virus samples, is developed, and it is demonstrated that IVA outperforms all other virus de novo assemblers.

Abstract: Motivation: An accurate genome assembly from short read sequencing data is critical for downstream analysis, for example allowing investigation of variants within a sequenced population. However, assembling sequencing data from virus samples, especially RNA viruses, into a genome sequence is challenging due to the combination of viral population diversity and extremely uneven read depth caused by amplification bias in the inevitable reverse transcription and polymerase chain reaction amplification process of current methods. Results: We developed a new de novo assembler called IVA (Iterative Virus Assembler) designed specifically for read pairs sequenced at highly variable depth from RNA virus samples. We tested IVA on datasets from 140 sequenced samples from human immunodeficiency virus-1 or influenza-virus-infected people and demonstrated that IVA outperforms all other virus de novo assemblers. Availability and implementation: The software runs under Linux, has the GPLv3 licence and is freely available from http://sanger-pathogens.github.io/iva Contact: iva@sanger.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article•DOI•

Genome‐guided investigation of plant natural product biosynthesis

Franziska Kellner¹, Jeongwoon Kim², Bernardo J. Clavijo³, John P. Hamilton², Kevin L. Childs², Brieanne Vaillancourt², Jason Cepela², Marc Habermann², Burkhard Steuernagel⁴, Leah Clissold³, Kirsten McLay³, Carol Robin Buell², Sarah E. O'Connor¹

John Innes Centre¹, Michigan State University², Norwich University³, Sainsbury Laboratory⁴

01 May 2015-Plant Journal

TL;DR: A genome assembly for C. roseus is generated that provides a near-comprehensive representation of the genic space that revealed the genomic context of key points within the MIA biosynthetic pathway including physically clustered genes, tandem gene duplication, expression sub-functionalization, and putative neo-functionalization.

Abstract: Summary The medicinal plant Madagascar periwinkle, Catharanthus roseus (L.) G. Don, produces hundreds of biologically active monoterpene-derived indole alkaloid (MIA) metabolites and is the sole source of the potent, expensive anti-cancer compounds vinblastine and vincristine. Access to a genome sequence would enable insights into the biochemistry, control, and evolution of genes responsible for MIA biosynthesis. However, generation of a near-complete, scaffolded genome is prohibitive to small research communities due to the expense, time, and expertise required. In this study, we generated a genome assembly for C. roseus that provides a near-comprehensive representation of the genic space that revealed the genomic context of key points within the MIA biosynthetic pathway including physically clustered genes, tandem gene duplication, expression sub-functionalization, and putative neo-functionalization. The genome sequence also facilitated high resolution co-expression analyses that revealed three distinct clusters of co-expression within the components of the MIA pathway. Coordinated biosynthesis of precursors and intermediates throughout the pathway appear to be a feature of vinblastine/vincristine biosynthesis. The C. roseus genome also revealed localization of enzyme-rich genic regions and transporters near known biosynthetic enzymes, highlighting how even a draft genome sequence can empower the study of high-value specialized metabolites.

Journal Article•DOI•

Soup to Tree: The Phylogeny of Beetles Inferred by Mitochondrial Metagenomics of a Bornean Rainforest Sample

Alex Crampton-Platt¹, Martijn J. T. N. Timmermans¹, Matthew L. Gimmel, Sujatha Narayanan Kutty¹, Timothy D. Cockerill¹, Chey Vun Khen, Alfried P. Vogler¹

Natural History Museum¹

01 Sep 2015-Molecular Biology and Evolution

TL;DR: A shotgun sequencing approach is tested, whereby mitochondrial genomes are assembled from complex ecological mixtures through mitochondrial metagenomics, and it is demonstrated how the approach overcomes many of the taxonomic impediments to the study of biodiversity.

Abstract: In spite of the growth of molecular ecology, systematics and next-generation sequencing, the discovery and analysis of diversity is not currently integrated with building the tree-of-life. Tropical arthropod ecologists are well placed to accelerate this process if all specimens obtained through mass-trapping, many of which will be new species, could be incorporated routinely into phylogeny reconstruction. Here we test a shotgun sequencing approach, whereby mitochondrial genomes are assembled from complex ecological mixtures through mitochondrial metagenomics, and demonstrate how the approach overcomes many of the taxonomic impediments to the study of biodiversity. DNA from approximately 500 beetle specimens, originating from a single rainforest canopy fogging sample from Borneo, was pooled and shotgun sequenced, followed by de novo assembly of complete and partial mitogenomes for 175 species. The phylogenetic tree obtained from this local sample was highly similar to that from existing mitogenomes selected for global coverage of major lineages of Coleoptera. When all sequences were combined only minor topological changes were induced against this reference set, indicating an increasingly stable estimate of coleopteran phylogeny, while the ecological sample expanded the tip-level representation of several lineages. Robust trees generated from ecological samples now enable an evolutionary framework for ecology. Meanwhile, the inclusion of uncharacterized samples in the tree-of-life rapidly expands taxon and biogeographic representation of lineages without morphological identification. Mitogenomes from shotgun sequencing of unsorted environmental samples and their associated metadata, placed robustly into the phylogenetic tree, constitute novel DNA "superbarcodes" for testing hypotheses regarding global patterns of diversity.

Journal Article•DOI•

Single-Molecule Real-Time Sequencing Combined with Optical Mapping Yields Completely Finished Fungal Genome

Luigi Faino¹, Michael F. Seidl¹, Erwin Datema, Grardy C. M. van den Berg¹, Antoine Janssen, Alexander H. J. Wittenberg, Bart P. H. J. Thomma¹

Wageningen University and Research Centre¹

01 Sep 2015-Mbio

TL;DR: Assessment of state-of-the-art sequencing and assembly strategies in order to produce a contiguous and complete eukaryotic genome assembly on the filamentous fungus Verticillium dahliae shows that a combination of PacBio-generated long reads and optical mapping yields a gapless telomere-to-telomere genome assembly, allowing in-depth genome analyses to facilitate functional studies into an organism's biology.

Abstract: Next-generation sequencing (NGS) technologies have increased the scalability, speed, and resolution of genomic sequencing and, thus, have revolutionized genomic studies. However, eukaryotic genome sequencing initiatives typically yield considerably fragmented genome assemblies. Here, we assessed various state-of-the-art sequencing and assembly strategies in order to produce a contiguous and complete eukaryotic genome assembly, focusing on the filamentous fungus Verticillium dahliae. Compared with Illumina-based assemblies of the V. dahliae genome, hybrid assemblies that also include PacBio-generated long reads establish superior contiguity. Intriguingly, provided that sufficient sequence depth is reached, assemblies solely based on PacBio reads outperform hybrid assemblies and even result in fully assembled chromosomes. Furthermore, the addition of optical map data allowed us to produce a gapless and complete V. dahliae genome assembly of the expected eight chromosomes from telomere to telomere. Consequently, we can now study genomic regions that were previously not assembled or poorly assembled, including regions that are populated by repetitive sequences, such as transposons, allowing us to fully appreciate an organism's biological complexity. Our data show that a combination of PacBio-generated long reads and optical mapping can be used to generate complete and gapless assemblies of fungal genomes. IMPORTANCE Studying whole-genome sequences has become an important aspect of biological research. The advent of next-generation sequencing (NGS) technologies has nowadays brought genomic science within reach of most research laboratories, including those that study nonmodel organisms. However, most genome sequencing initiatives typically yield (highly) fragmented genome assemblies. Nevertheless, considerable relevant information related to genome structure and evolution is likely hidden in those nonassembled regions. Here, we investigated a diverse set of strategies to obtain gapless genome assemblies, using the genome of a typical ascomycete fungus as the template. Eventually, we were able to show that a combination of PacBio-generated long reads and optical mapping yields a gapless telomere-to-telomere genome assembly, allowing in-depth genome analyses to facilitate functional studies into an organism's biology.

Journal Article•DOI•

The completed genome sequence of the pathogenic ascomycete fungus Fusarium graminearum

Robert King¹, Martin Urban¹, Michael C. U. Hammond-Kosack¹, Keywan Hassani-Pak¹, Kim E. Hammond-Kosack¹

Rothamsted Research¹

22 Jul 2015-BMC Genomics

TL;DR: This fully completed F. graminearum PH-1 genome and manually curated annotation, available at Ensembl Fungi, provides the optimum resource to perform interspecies comparative analyses and gene function studies.

Abstract: Accurate genome assembly and gene model annotation are critical for comparative species and gene functional analyses. Here we present the completed genome sequence and annotation of the reference strain PH-1 of Fusarium graminearum, the causal agent of head scab disease of small grain cereals which threatens global food security. Completion was achieved by combining (a) the BROAD Sanger sequenced draft, with (b) the gene predictions from Munich Information Services for Protein Sequences (MIPS) v3.2, with (c) de novo whole-genome shotgun re-sequencing, (d) re-annotation of the gene models using RNA-seq evidence and Fgenesh, Snap, GeneMark and Augustus prediction algorithms, followed by (e) manual curation. We have comprehensively completed the genomic 36,563,796 bp sequence by replacing unknown bases, placing supercontigs within their correct loci, correcting assembly errors, and inserting new sequences which include for the first time complete AT rich sequences such as centromere sequences, subtelomeric regions and the telomeres. Each of the four F. graminearum chromosomes was found to be submetacentric with respect to centromere positioning. The position of a potential neocentromere was also defined. A preferentially higher frequency of genetic recombination was observed at the end of the longer arm of each chromosome. Within the genome 1529 gene models have been modified and 412 new gene models predicted, with a total gene call of 14,164. The re-annotation impacts upon 69 entries held within the Pathogen-Host Interactions database (PHI-base) which stores information on genes for which mutant phenotypes in pathogen-host interactions have been experimentally tested, of which 59 are putative transcription factors, 8 kinases, 1 ATP citrate lyase (ACL1), and 1 syntaxin-like SNARE gene (GzSYN1). Although the completed F. graminearum contains very few transposon sequences, a previously unrecognised and potentially active gypsy-type long-terminal-repeat (LTR) retrotransposon was identified. In addition, each of the sub-telomeres and centromeres contained either a LTR or MarCry-1_FO element. The full content of the proposed ancient chromosome fusion sites has also been revealed and investigated. Regions with high recombination previously noted to be rich in secretome encoding genes were also found to be rich in tRNA sequences. This study has identified 741 F. graminearum species specific genes and provides the first complete genome assembly for a Sordariomycetes species. This fully completed F. graminearum PH-1 genome and manually curated annotation, available at Ensembl Fungi, provides the optimum resource to perform interspecies comparative analyses and gene function studies.

Journal Article•DOI•

De novo assembly and annotation of the Asian tiger mosquito (Aedes albopictus) repeatome with dnaPipeTE from raw genomic reads and comparative analysis with the yellow fever mosquito (Aedes aegypti).

Clément Goubert¹, Laurent Modolo¹,², Cristina Vieira¹,², Claire Valiente Moro¹, Patrick Mavingui³, Matthieu Boulesteix¹,²

University of Lyon¹, French Institute for Research in Computer Science and Automation², University of La Réunion³

01 Apr 2015-Genome Biology and Evolution

TL;DR: The dnaPipeTE pipeline’s ability to manage the repeatome annotation problem will make it helpful for new or ongoing assembly projects, and the results will benefit future genomic studies of A. albopictus.

Abstract: Repetitive DNA, including transposable elements (TEs), is found throughout eukaryotic genomes. Annotating and assembling the “repeatome” during genome-wide analysis often poses a challenge. To address this problem, we present dnaPipeTE, a new bioinformatics pipeline that uses a sample of raw genomic reads. It produces precise estimates of repeated DNA content and TE consensus sequences, as well as the relative ages of TE families. We show that dnaPipeTE performs well using very low coverage sequencing in different genomes, losing accuracy only with old TE families. We applied this pipeline to the genome of the Asian tiger mosquito Aedes albopictus, an invasive species of human health interest, for which the genome size is estimated to be over 1 Gbp. Using dnaPipeTE, we showed that this species harbors a large (50% of the genome) and potentially active repeatome with an overall TE class and order composition similar to that of Aedes aegypti, the yellow fever mosquito. However, intraorder dynamics show clear distinctions between the two species, with differences at the TE family level. Our pipeline’s ability to manage the repeatome annotation problem will make it helpful for new or ongoing assembly projects, and our results will benefit future genomic studies of A. albopictus.

Journal Article•DOI•

Full-length de novo assembly of RNA-seq data in pea (Pisum sativum L.) provides a gene expression atlas and gives insights into root nodulation in this species.

Susete Alves-Carvalho¹, Grégoire Aubert¹, Sébastien Carrère¹, Corinne Cruaud, Anne-Lise Brochot¹, Françoise Jacquin¹, Anthony Klein¹, Chantal Martin¹, Karen Boucherot¹, Jonathan Kreplak¹, Corinne Da Silva, Sandra Moreau¹, Pascal Gamas¹, Patrick Wincker, Jérôme Gouzy¹, Judith Burstin¹

Institut national de la recherche agronomique¹

01 Oct 2015-Plant Journal

TL;DR: This resource has allowed identification of the pea orthologs of major nodulation genes characterized in recent years in model species, as a major step towards deciphering unresolved pea nodulation phenotypes.

Abstract: Next-generation sequencing technologies allow an almost exhaustive survey of the transcriptome, even in species with no available genome sequence. To produce a Unigene set representing most of the expressed genes of pea, 20 cDNA libraries produced from various plant tissues harvested at various developmental stages from plants grown under contrasting nitrogen conditions were sequenced. Around one billion reads and 100 Gb of sequence were de novo assembled. Following several steps of redundancy reduction, 46 099 contigs with N50 length of 1667 nt were identified. These constitute the 'Cameor' Unigene set. The high depth of sequencing allowed identification of rare transcripts and detected expression for approximately 80% of contigs in each library. The Unigene set is now available online (http://bios.dijon.inra.fr/FATAL/cgi/pscam.cgi), allowing (i) searches for pea orthologs of candidate genes based on gene sequences from other species, or based on annotation, (ii) determination of transcript expression patterns using various metrics, (iii) identification of uncharacterized genes with interesting patterns of expression, and (iv) comparison of gene ontology pathways between tissues. This resource has allowed identification of the pea orthologs of major nodulation genes characterized in recent years in model species, as a major step towards deciphering unresolved pea nodulation phenotypes. In addition to a remarkable conservation of the early transcriptome nodulation apparatus between pea and Medicago truncatula, some specific features were highlighted. The resource provides a reference for the pea exome, and will facilitate transcriptome and proteome approaches as well as SNP discovery in pea.

Journal Article•DOI•

An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data

Huan Fan¹,²,³, Anthony R. Ives², Yann Surget-Groba⁴, Charles H. Cannon³,⁵

Chinese Academy of Sciences¹, University of Wisconsin-Madison², Xishuangbanna Tropical Botanical Garden³, Université du Québec en Outaouais⁴, Texas Tech University⁵

14 Jul 2015-BMC Genomics

TL;DR: An Assembly and Alignment-Free (AAF) method is presented that constructs phylogenies directly from unassembled genome sequence data, bypassing both genome assembly and alignment, and rapidly creates a phylogenetic framework for further analysis of genome structure and diversity among non-model organisms.

Abstract: Next-generation sequencing technologies are rapidly generating whole-genome datasets for an increasing number of organisms. However, phylogenetic reconstruction of genomic data remains difficult because de novo assembly for non-model genomes and multi-genome alignment are challenging. To greatly simplify the analysis, we present an Assembly and Alignment-Free (AAF) method ( https://sourceforge.net/projects/aaf-phylogeny ) that constructs phylogenies directly from unassembled genome sequence data, bypassing both genome assembly and alignment. Using mathematical calculations, models of sequence evolution, and simulated sequencing of published genomes, we address both evolutionary and sampling issues caused by direct reconstruction, including homoplasy, sequencing errors, and incomplete sequencing coverage. From these results, we calculate the statistical properties of the pairwise distances between genomes, allowing us to optimize parameter selection and perform bootstrapping. As a test case with real data, we successfully reconstructed the phylogeny of 12 mammals using raw sequencing reads. We also applied AAF to 21 tropical tree genome datasets with low coverage to demonstrate its effectiveness on non-model organisms. Our AAF method opens up phylogenomics for species without an appropriate reference genome or high sequence coverage, and rapidly creates a phylogenetic framework for further analysis of genome structure and diversity among non-model organisms.
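
Illustrative sketch (not the AAF implementation): the abstract above describes building phylogenies from pairwise distances computed without assembly or alignment. One simple way to get such a distance is to compare the sets of k-mers found in two samples. The function names and the log-transformed shared-fraction distance below are assumptions chosen for illustration. The sketch also ignores reverse complements, sequencing errors and k-mer frequency filtering, all of which a real tool would need to handle.

```python
import math


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_distance(seq_a, seq_b, k):
    """Distance from the fraction of k-mers shared by two sequences.

    Hypothetical measure for illustration: the shared count is divided by
    the size of the smaller k-mer set, then log-transformed and scaled
    by 1/k so that identical k-mer sets give a distance of zero.
    """
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    if not set_a or not set_b:
        raise ValueError("both sequences must be at least k long")
    shared = len(set_a & set_b)
    if shared == 0:
        return math.inf  # no k-mers in common: distance is unbounded
    smaller = min(len(set_a), len(set_b))
    return -math.log(shared / smaller) / k


if __name__ == "__main__":
    a = "ACGTACGTTGCA"
    b = "ACGTACGATGCA"
    print(shared_kmer_distance(a, b, 4))
```

A matrix of such pairwise distances could then be passed to any distance-based tree-building method. The choice of k trades off sensitivity against spurious matches.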

Journal Article•DOI•

Sequencing small genomic targets with high efficiency and extreme accuracy

Michael W. Schmitt¹, Edward J. Fox², Marc J. Prindle², Kate S. Reid-Bayliss², Lawrence D. True², Jerald P. Radich¹, Lawrence A. Loeb²

Fred Hutchinson Cancer Research Center¹, University of Washington²

01 May 2015-Nature Methods

TL;DR: This work describes an efficient approach based on sequential rounds of hybridization with biotinylated oligonucleotides that enables more than 1-million-fold enrichment of genomic regions of interest and enables the quantification of mutations in individual DNA molecules.

Abstract: The detection of minority variants in mixed samples requires methods for enrichment and accurate sequencing of small genomic intervals. We describe an efficient approach based on sequential rounds of hybridization with biotinylated oligonucleotides that enables more than 1-million-fold enrichment of genomic regions of interest. In conjunction with error-correcting double-stranded molecular tags, our approach enables the quantification of mutations in individual DNA molecules.

Posted Content•DOI•

Oxford Nanopore Sequencing and de novo Assembly of a Eukaryotic Genome

Sara Goodwin¹, James Gurtowski¹, Scott Ethe-Sayers¹, Panchajanya Deshpande¹, Michael C. Schatz¹, W. Richard McCombie¹

Cold Spring Harbor Laboratory¹

06 Jan 2015-bioRxiv

TL;DR: In this paper, the authors describe software developed to make use of these data, as existing packages were incapable of assembling long reads at such a high error rate (~35% error). With these methods they were able to error-correct and assemble the nanopore reads de novo, producing an assembly that is contiguous and accurate, with a contig N50 length of 479 kb and greater than 99% consensus identity when compared to the reference.

Abstract: Monitoring the progress of DNA through a pore has been postulated as a method for sequencing DNA for several decades [1,2]. Recently, a nanopore instrument, the Oxford Nanopore MinION, has become available [3]. Here we describe our sequencing of the S. cerevisiae genome. We describe software developed to make use of these data as existing packages were incapable of assembling long reads at such high error rate (~35% error). With these methods we were able to error correct and assemble the nanopore reads de novo, producing an assembly that is contiguous and accurate: with a contig N50 length of 479 kb, and has greater than 99% consensus identity when compared to the reference. The assembly with the long nanopore reads was able to correctly assemble gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in an assembly using Illumina sequencing alone (with a contig N50 of only 59,927 bp).
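
Illustrative sketch: the contig N50 quoted above is a standard contiguity statistic. Under its usual definition, it is the length L such that contigs of length L or longer together contain at least half of the total assembled bases. The function name and example lengths below are hypothetical and do not come from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length of all contigs.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Made-up contig lengths in base pairs, for illustration only.
    print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends only on the length distribution, it says nothing about correctness. That is why the abstract reports consensus identity against the reference alongside it.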
