Showing papers on "Sequence assembly" published in 2019

Journal Article

Assembly of long, error-prone reads using repeat graphs

Mikhail Kolmogorov¹, Jeffrey Yuan¹, Yu Lin², Pavel A. Pevzner¹•Institutions (2)

University of California, San Diego¹, Australian National University²

01 Apr 2019-Nature Biotechnology

TL;DR: Flye is a long-read assembly algorithm that generates arbitrary paths in an unknown repeat graph, called disjointigs, and constructs an accurate repeat graph from these error-riddled disjointigs, which can then be used for genome assembly.

Abstract: Accurate genome assembly is hampered by repetitive regions. Although long single molecule sequencing reads are better able to resolve genomic repeats than short-read data, most long-read assembly algorithms do not provide the repeat characterization necessary for producing optimal assemblies. Here, we present Flye, a long-read assembly algorithm that generates arbitrary paths in an unknown repeat graph, called disjointigs, and constructs an accurate repeat graph from these error-riddled disjointigs. We benchmark Flye against five state-of-the-art assemblers and show that it generates better or comparable assemblies, while being an order of magnitude faster. Flye nearly doubled the contiguity of the human genome assembly (as measured by the NGA50 assembly quality metric) compared with existing assemblers.
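
The NGA50 figure cited in this abstract belongs to the N50 family of contiguity metrics. As a minimal sketch (the function name and the toy contig lengths below are illustrative assumptions, not taken from the paper), N50 is the length L such that contigs of length at least L cover half of the assembly. NG50 uses the reference genome size as the denominator instead, and NGA50 applies the same calculation to aligned blocks (contigs broken at misassemblies) rather than raw contigs.

```python
def n50(lengths, total=None):
    """Return the length L such that sequences of length >= L cover
    at least half of `total`.

    With total=None the sum of `lengths` is used (N50); passing a
    reference genome size instead gives NG50.
    """
    if not lengths:
        raise ValueError("no lengths given")
    target = (total if total is not None else sum(lengths)) / 2
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the sequences cover less than half of `total`


# Hypothetical contig lengths, for illustration only.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))             # N50 over the assembly itself
print(n50(contigs, total=400))  # NG50 against an assumed genome size of 400
```

Feeding the function aligned-block lengths together with a reference genome size would give an NGA50-style value; producing those blocks requires aligning the assembly to a reference, which this sketch does not attempt.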

1,927 citations

Journal Article

Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome.

Aaron M. Wenger¹, Paul Peluso¹, William J. Rowell¹, Pi-Chuan Chang², Richard Hall¹, Gregory T. Concepcion¹, Jana Ebler³, Arkarachai Fungtammasan, Alexander Kolesnikov², Nathan D. Olson⁴, Armin Töpfer¹, Michael Alonge⁵, Medhat Mahmoud⁶, Yufeng Qian¹, Chen-Shan Chin, Adam M. Phillippy⁷, Michael C. Schatz⁵, Gene Myers⁸, Mark A. DePristo², Jue Ruan, Tobias Marschall³,⁸, Fritz J. Sedlazeck⁶, Justin M. Zook⁴, Heng Li⁹, Sergey Koren⁷, Andrew Carroll², David R. Rank¹, Michael W. Hunkapiller¹•Institutions (9)

Pacific Biosciences¹, Google², Saarland University³, National Institute of Standards and Technology⁴, Johns Hopkins University⁵, Baylor College of Medicine⁶, National Institutes of Health⁷, Max Planck Society⁸, Harvard University⁹

12 Aug 2019-Nature Biotechnology

TL;DR: The optimization of circular consensus sequencing (CCS) is reported to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb).

Abstract: The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions … 15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads. High-fidelity reads improve variant detection and genome assembly on the PacBio platform.
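
Precision and recall, as reported for variant calls in this abstract, follow from counts of true positives, false positives and false negatives against a benchmark call set. The sketch below shows the arithmetic only; the function name and the counts are hypothetical and are not data from the paper.

```python
def precision_recall(tp, fp, fn):
    """Compute (precision, recall) from variant-call comparison counts.

    tp: calls that match the benchmark
    fp: calls absent from the benchmark
    fn: benchmark variants that were not called
    """
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision/recall undefined without calls or truth variants")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Hypothetical counts chosen only to illustrate the calculation.
precision, recall = precision_recall(tp=9991, fp=9, fn=9)
print(f"precision={precision:.2%} recall={recall:.2%}")
```

Real benchmarking also has to decide when two differently represented variants count as the same call, which is outside the scope of this sketch.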

876 citations

Journal Article

Integrating Hi-C links with assembly graphs for chromosome-scale assembly.

Jay Ghurye¹,², Arang Rhie², Brian P. Walenz², Anthony D. Schmitt, Siddarth Selvaraj, Mihai Pop¹, Adam M. Phillippy², Sergey Koren²•Institutions (2)

University of Maryland, College Park¹, National Institutes of Health²

21 Aug 2019-PLOS Computational Biology

TL;DR: This work presents a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph.

Abstract: Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, there are limited open-source tools available. Errors, particularly inversions and fusions across chromosomes, remain higher than alternate scaffolding technologies. We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.

391 citations

Journal Article

Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods

Brian J. Haas¹, Alexander Dobin², Bo Li¹,³, Nicolas Stransky, Nathalie Pochet¹,⁴, Aviv Regev¹,⁵•Institutions (5)

Broad Institute¹, Cold Spring Harbor Laboratory², Harvard University³, Brigham and Women's Hospital⁴, Massachusetts Institute of Technology⁵

21 Oct 2019-Genome Biology

TL;DR: The lower accuracy of de novo assembly-based methods notwithstanding, they are useful for reconstructing fusion isoforms and tumor viruses, both of which are important in cancer research.

Abstract: Accurate fusion transcript detection is essential for comprehensive characterization of cancer transcriptomes. Over the last decade, multiple bioinformatic tools have been developed to predict fusions from RNA-seq, based on either read mapping or de novo fusion transcript assembly. We benchmark 23 different methods including applications we develop, STAR-Fusion and TrinityFusion, leveraging both simulated and real RNA-seq. Overall, STAR-Fusion, Arriba, and STAR-SEQR are the most accurate and fastest for fusion detection on cancer transcriptomes. The lower accuracy of de novo assembly-based methods notwithstanding, they are useful for reconstructing fusion isoforms and tumor viruses, both of which are important in cancer research.

327 citations

Journal Article

rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data.

Elena Bushmanova¹, Dmitry Antipov¹, Alla Lapidus¹, Andrey D. Prjibelski¹•Institutions (1)

Saint Petersburg State University¹

01 Sep 2019-GigaScience

TL;DR: The novel transcriptome assembler rnaSPAdes, developed on top of the SPAdes genome assembler, typically outperforms other assemblers in the number of assembled genes and isoforms while achieving higher average accuracy statistics than its closest competitors.

Abstract: Background The possibility of generating large RNA-sequencing datasets has led to development of various reference-based and de novo transcriptome assemblers with their own strengths and limitations. While reference-based tools are widely used in various transcriptomic studies, their application is limited to the organisms with finished and well-annotated genomes. De novo transcriptome reconstruction from short reads remains an open challenging problem, which is complicated by the varying expression levels across different genes, alternative splicing, and paralogous genes. Results Herein we describe the novel transcriptome assembler rnaSPAdes, which has been developed on top of the SPAdes genome assembler and explores computational parallels between assembly of transcriptomes and single-cell genomes. We also present quality assessment reports for rnaSPAdes assemblies, compare it with modern transcriptome assembly tools using several evaluation approaches on various RNA-sequencing datasets, and briefly highlight strong and weak points of different assemblers. Conclusions Based on the performed comparison between different assembly methods, we infer that it is not possible to detect the absolute leader according to all quality metrics and all used datasets. However, rnaSPAdes typically outperforms other assemblers by such important property as the number of assembled genes and isoforms, and at the same time has higher accuracy statistics on average comparing to the closest competitors.

297 citations

Journal Article

Long-Read Sequencing Emerging in Medical Genetics

Tuomo Mantere¹, Simone Kersten¹, Alexander Hoischen¹•Institutions (1)

Radboud University Nijmegen¹

07 May 2019-Frontiers in Genetics

TL;DR: The current LRS-based research on human genetic disorders is summarized and the potential of these technologies to facilitate the next major advancements in medical genetics is discussed.

Abstract: The wide implementation of next-generation sequencing (NGS) technologies has revolutionized the field of medical genetics. However, the short read lengths of currently used sequencing approaches pose a limitation for the identification of structural variants, sequencing repetitive regions, phasing of alleles and distinguishing highly homologous genomic regions. These limitations may significantly contribute to the diagnostic gap in patients with genetic disorders who have undergone standard NGS, like whole exome or even genome sequencing. Now, the emerging long-read sequencing (LRS) technologies may offer improvements in the characterization of genetic variation and regions that are difficult to assess with the prevailing NGS approaches. LRS has so far mainly been used to investigate genetic disorders with previously known or strongly suspected disease loci. While these targeted approaches already show the potential of LRS, it remains to be seen whether LRS technologies can soon enable true whole genome sequencing routinely. Ultimately, this could allow the de novo assembly of individual whole genomes used as a generic test for genetic disorders. In this article, we summarize the current LRS-based research on human genetic disorders and discuss the potential of these technologies to facilitate the next major advancements in medical genetics.

263 citations

Journal Article

A high-quality apple genome assembly reveals the association of a retrotransposon and red fruit colour.

Liyi Zhang¹, Jiang Hu, Xiaolei Han¹, Jingjing Li, Yuan Gao¹, Christopher M. Richards², Caixia Zhang¹, Yi Tian¹, Guiming Liu, Hera Gul¹, Dajiang Wang¹, Yu Tian, Chuanxin Yang, Minghui Meng, Gaopeng Yuan¹, Guodong Kang¹, Yonglong Wu¹, Kun Wang¹, Hengtao Zhang, Depeng Wang, Peihua Cong¹•Institutions (2)

Crops Research Institute¹, Agricultural Research Service²

02 Apr 2019-Nature Communications

TL;DR: A long terminal repeat (LTR) retrotransposon insertion upstream of MdMYB1, a core transcriptional activator of anthocyanin biosynthesis, is associated with the red-skinned phenotype, providing insights into the molecular mechanisms underlying red fruit coloration.

Abstract: A complete and accurate genome sequence provides a fundamental tool for functional genomics and DNA-informed breeding. Here, we assemble a high-quality genome (contig N50 of 6.99 Mb) of the apple anther-derived homozygous line HFTH1, including 22 telomere sequences, using a combination of PacBio single-molecule real-time (SMRT) sequencing, chromosome conformation capture (Hi-C) sequencing, and optical mapping. In comparison to the Golden Delicious reference genome, we identify 18,047 deletions, 12,101 insertions and 14 large inversions. We reveal that these extensive genomic variations are largely attributable to activity of transposable elements. Interestingly, we find that a long terminal repeat (LTR) retrotransposon insertion upstream of MdMYB1, a core transcriptional activator of anthocyanin biosynthesis, is associated with red-skinned phenotype. This finding provides insights into the molecular mechanisms underlying red fruit coloration, and highlights the utility of this high-quality genome assembly in deciphering agriculturally important trait in apple. Existing apple genome assemblies all derive from Golden Delicious. Here, the authors combine different sequencing technologies to assemble a high quality genome of an anther-derived homozygous genotype HFTH1 and find the association of a retrotransposon and red fruit colour.

213 citations

Journal Article

Ultra-deep, long-read nanopore sequencing of mock microbial community standards.

Samuel M. Nicholls¹, Joshua Quick¹, Shuiquan Tang², Nicholas J. Loman¹•Institutions (2)

University of Birmingham¹, Zymo Research²

01 May 2019-GigaScience

TL;DR: These datasets will be useful for those developing bioinformatics methods for long-read metagenomics and for the validation and comparison of current laboratory and software pipelines.

Abstract: Background Long sequencing reads are information-rich: aiding de novo assembly and reference mapping, and consequently have great potential for the study of microbial communities. However, the best approaches for analysis of long-read metagenomic data are unknown. Additionally, rigorous evaluation of bioinformatics tools is hindered by a lack of long-read data from validated samples with known composition. Findings We sequenced 2 commercially available mock communities containing 10 microbial species (ZymoBIOMICS Microbial Community Standards) with Oxford Nanopore GridION and PromethION. Both communities and the 10 individual species isolates were also sequenced with Illumina technology. We generated 14 and 16 gigabase pairs from 2 GridION flowcells and 150 and 153 gigabase pairs from 2 PromethION flowcells for the evenly distributed and log-distributed communities, respectively. Read length N50 ranged between 5.3 and 5.4 kilobase pairs over the 4 sequencing runs. Basecalls and corresponding signal data are made available (4.2 TB in total). Alignment to Illumina-sequenced isolates demonstrated the expected microbial species at anticipated abundances, with the limit of detection for the lowest abundance species below 50 cells (GridION). De novo assembly of metagenomes recovered long contiguous sequences without the need for pre-processing techniques such as binning. Conclusions We present ultra-deep, long-read nanopore datasets from a well-defined mock community. These datasets will be useful for those developing bioinformatics methods for long-read metagenomics and for the validation and comparison of current laboratory and software pipelines.

192 citations

Journal Article

Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases

Ole K. Tørresen¹, Bastiaan Star¹, Pablo Mier², Miguel A. Andrade-Navarro², Alex Bateman³, Patryk Jarnot⁴, Aleksandra Gruca⁴, Marcin Grynberg, Andrey V. Kajava⁵, Vasilis J. Promponas⁶, Maria Anisimova⁷, Kjetill S. Jakobsen¹, Dirk Linke¹•Institutions (7)

University of Oslo¹, University of Mainz², European Bioinformatics Institute³, Silesian University of Technology⁴, University of Montpellier⁵, University of Cyprus⁶, Zurich University of Applied Sciences/ZHAW⁷

02 Dec 2019-Nucleic Acids Research

TL;DR: A review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses, to raise the awareness level within the community of database users and alert scientists working in the underlying workflow of database creation.

Abstract: The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with 'ready-to-use' deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.

178 citations

Journal Article

High-quality genome assembly of the silkworm, Bombyx mori

Munetaka Kawamoto¹, Akiya Jouraku², Atsushi Toyoda³, Kakeru Yokoi², Yohei Minakuchi³, Susumu Katsuma¹, Asao Fujiyama³, Takashi Kiuchi¹, Kimiko Yamamoto², Toru Shimada¹•Institutions (3)

University of Tokyo¹, National Agriculture and Food Research Organization², National Institute of Genetics³

01 Apr 2019-Insect Biochemistry and Molecular Biology

TL;DR: In this paper, a hybrid assembly and gene models for the domestic silkworm, Bombyx mori, were published by a Japanese and Chinese collaboration group, where the remaining gaps in the initial genome assembly were closed using BAC and Fosmid sequences, giving a new total length of 460.3

178 citations

Posted Content

GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes

Jian-Jun Jin¹, Wen-Bin Yu¹,², Jun-Bo Yang¹, Yu Song¹,², Claude W. dePamphilis³, Ting-Shuang Yi¹, De-Zhu Li¹•Institutions (3)

Chinese Academy of Sciences¹, Xishuangbanna Tropical Botanical Garden², Pennsylvania State University³

08 Oct 2019-bioRxiv

TL;DR: This toolkit recruits organelle-associated reads using a modified “baiting and iterative mapping” approach, conducts de novo assembly, filters and disentangles the assembly graph, and produces all possible configurations of circular organelle genomes.

Abstract: GetOrganelle is a state-of-the-art toolkit to assemble accurate organelle genomes from NGS data. This toolkit recruits organelle-associated reads using a modified “baiting and iterative mapping” approach, conducts de novo assembly, filters and disentangles the assembly graph, and produces all possible configurations of circular organelle genomes. For 50 published samples, we reassembled the circular plastome in 47 samples using GetOrganelle, but only in 12 samples using NOVOPlasty. In comparison with published/NOVOPlasty plastomes, we demonstrated that GetOrganelle assemblies are more accurate. Moreover, we assembled complete mitogenomes of fungi and animals using GetOrganelle. GetOrganelle is freely released under a GPL-3 license (https://github.com/Kinggerm/GetOrganelle).
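
The general idea behind "baiting and iterative" read recruitment can be illustrated with a toy example. The sketch below is not GetOrganelle's implementation: it assumes exact k-mer sharing as the matching criterion, ignores reverse complements and sequencing errors, and uses made-up sequences and hypothetical function names. Reads that share a k-mer with the current bait set are recruited, their k-mers extend the bait, and the loop repeats until no new read is added.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_reads(seed, reads, k=5):
    """Iteratively recruit reads that share a k-mer with the growing bait set.

    Returns the sorted indices of the recruited reads.
    """
    bait = kmers(seed, k)
    recruited = set()
    changed = True
    while changed:  # keep passing over the reads until a pass adds nothing
        changed = False
        for idx, read in enumerate(reads):
            if idx in recruited:
                continue
            read_kmers = kmers(read, k)
            if bait & read_kmers:   # read overlaps the current bait
                recruited.add(idx)
                bait |= read_kmers  # extend the bait with the new read
                changed = True
    return sorted(recruited)


# Made-up sequences: read 1 overlaps the seed, read 0 only overlaps read 1,
# and read 2 is unrelated.
seed = "ACGTACGGTC"
reads = ["AATTGCCGAT", "CGGTCAATTG", "TTTTTGGGGG"]
print(recruit_reads(seed, reads))
```

Read 0 is only picked up on the second pass, after read 1 has extended the bait, which is the point of iterating rather than matching against the seed once.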

Journal Article

TRITEX: chromosome-scale sequence assembly of Triticeae genomes with open-source tools

Cécile Monat¹, Sudharsan Padmarasu¹, Thomas Lux, Thomas Wicker², Heidrun Gundlach, Axel Himmelbach¹, Jennifer Ens³, Chengdao Li⁴,⁵, Gary J. Muehlbauer⁶, Alan H. Schulman⁷, Robbie Waugh⁸,⁹, Ilka Braumann, Curtis J. Pozniak³, Uwe Scholz¹, Klaus F. X. Mayer¹⁰, Manuel Spannagl, Nils Stein¹,¹¹, Martin Mascher¹•Institutions (11)

Leibniz Association¹, University of Zurich², University of Saskatchewan³, Yangtze University⁴, Murdoch University⁵, University of Minnesota⁶, University of Helsinki⁷, James Hutton Institute⁸, University of Dundee⁹, Technische Universität München¹⁰, University of Göttingen¹¹

18 Dec 2019-Genome Biology

TL;DR: TRITEX, an open-source computational workflow that combines paired-end, mate-pair, 10X Genomics linked-read with chromosome conformation capture sequencing data to construct sequence scaffolds with megabase-scale contiguity ordered into chromosomal pseudomolecules is presented.

Abstract: Chromosome-scale genome sequence assemblies underpin pan-genomic studies. Recent genome assembly efforts in the large-genome Triticeae crops wheat and barley have relied on the commercial closed-source assembly algorithm DeNovoMagic. We present TRITEX, an open-source computational workflow that combines paired-end, mate-pair, 10X Genomics linked-read with chromosome conformation capture sequencing data to construct sequence scaffolds with megabase-scale contiguity ordered into chromosomal pseudomolecules. We evaluate the performance of TRITEX on publicly available sequence data of tetraploid wild emmer and hexaploid bread wheat, and construct an improved annotated reference genome sequence assembly of the barley cultivar Morex as a community resource.

Journal Article

Efficient and unique cobarcoding of second-generation sequencing reads from long DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo assembly.

Ou Wang¹, Robert Chin, Xiaofang Cheng, M. Wu, Qing Mao, Jingbo Tang, Yuhui Sun, Ellis Anderson, Han K. Lam, Dan Chen, Yujun Zhou, Linying Wang, Fei Fan, Yan Zou, Yinlong Xie, Rebecca Yu Zhang, Snezana Drmanac, Darlene Nguyen, Chongjun Xu, Christian Villarosa, Scott Gablenz, Nina Barua, Staci Nguyen, Wenlan Tian, Jia Sophie Liu, Jingwan Wang, Xiao Liu, Xiaojuan Qi, Ao Chen, He Wang, Dong Yuliang, Wenwei Zhang, Andrei Alexeev, Huanming Yang, Jing Wang, Karsten Kristiansen¹, Xun Xu, Radoje Drmanac, Brock A. Peters•Institutions (1)

University of Copenhagen¹

02 Apr 2019-Genome Research

TL;DR: StLFR represents an easily automatable solution that enables high-quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.

Abstract: Here, we describe single-tube long fragment read (stLFR), a technology that enables sequencing of data from long DNA molecules using economical second-generation sequencing technology. It is based on adding the same barcode sequence to subfragments of the original long DNA molecule (DNA cobarcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically nonredundant cobarcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique cobarcoding of more than 8 million 20- to 300-kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high-quality variant calling and phase block lengths up to N50 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole-genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high-quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.

Journal Article

Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes

Nicola De Maio¹, Liam P. Shaw¹, Alasdair T. M. Hubbard², Sophie George³, Nicholas D. Sanderson¹, Jeremy Swann¹, Ryan R. Wick⁴, Manal AbuOun⁵, Emma Stubberfield⁵, Sarah Hoosdally¹, Derrick W. Crook¹,³, Tim E. A. Peto¹,³, Anna E. Sheppard¹,³, Mark J. Bailey, Daniel S. Read, Muna F. Anjum⁵, A. Sarah Walker¹,³, Nicole Stoesser¹•Institutions (5)

University of Oxford¹, Liverpool School of Tropical Medicine², Public Health England³, University of Melbourne⁴, Animal and Plant Health Agency⁵

30 Aug 2019

TL;DR: In this article, the authors compared hybrid assembly for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either Oxford Nanopore Technologies (ONT) or from SMRT Pacific Biosciences (PacBio).

Abstract: Illumina sequencing allows rapid, cheap and accurate whole genome bacterial analyses, but short reads (<300 bp) do not usually enable complete genome assembly. Long-read sequencing greatly assists with resolving complex bacterial genomes, particularly when combined with short-read Illumina data (hybrid assembly). However, it is not clear how different long-read sequencing methods impact on assembly accuracy. Relative automation of the assembly process is also crucial to facilitating high-throughput complete bacterial genome reconstruction, avoiding multiple bespoke filtering and data manipulation steps. In this study, we compared hybrid assemblies for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either Oxford Nanopore Technologies (ONT) or from SMRT Pacific Biosciences (PacBio) sequencing platforms. We chose isolates from the Enterobacteriaceae family, as these frequently have highly plastic, repetitive genetic structures and complete genome reconstruction for these species is relevant for a precise understanding of the epidemiology of antimicrobial resistance. We de novo assembled genomes using the hybrid assembler Unicycler and compared different read processing strategies, as well as comparing to long-read only assembly with Flye followed by short-read polishing with Pilon. Hybrid assembly with either PacBio or ONT reads facilitated high-quality genome reconstruction, and was superior to the long-read assembly and polishing approach evaluated with respect to accuracy and completeness. Combining ONT and Illumina reads fully resolved most genomes without additional manual steps, and at a lower consumables cost per isolate in our setting. Automated hybrid assembly is a powerful tool for complete and accurate bacterial genome assembly.

Journal Article

LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly.

Gui-Cai Xu¹,², Tian-Jun Xu¹, Rui Zhu²,³, Yan Zhang², Shang-Qi Li², Hong-Wei Wang², Jiong-Tang Li²•Institutions (3)

Zhejiang Ocean University¹, Chinese Academy of Fishery Sciences², Shanghai Ocean University³

01 Jan 2019-GigaScience

TL;DR: LR_Gapcloser is a fast and efficient tool that can be used to close gaps and improve the contiguity of genome assemblies and a proposed hybrid assembly including this tool promises reference-grade assemblies.

Abstract: Background Completing a genome is an important goal of genome assembly. However, many assemblies, including reference assemblies, are unfinished and have a number of gaps. Long reads obtained from third-generation sequencing (TGS) platforms can help close these gaps and improve assembly contiguity. However, current gap-closure approaches using long reads require extensive runtime and high memory usage. Thus, a fast and memory-efficient approach using long reads is needed to obtain complete genomes. Findings We developed LR_Gapcloser to rapidly and efficiently close the gaps in genome assembly. This tool utilizes long reads generated from TGS sequencing platforms. Tested on de novo assembled gaps, repeat-derived gaps, and real gaps, LR_Gapcloser closed a higher number of gaps faster and with a lower error rate and a much lower memory usage than two existing, state-of-the-art tools. This tool utilized raw reads to fill more gaps than when using error-corrected reads. It is applicable to gaps in the assemblies by different approaches and from large and complex genomes. After performing gap-closure using this tool, the contig N50 size of the human CHM1 genome was improved from 143 kb to 19 Mb, a 132-fold increase. We also closed the gaps in the Triticum urartu genome, a large genome rich in repeats; the contig N50 size was increased by 40%. Further, we evaluated the contiguity and correctness of six hybrid assembly strategies by combining the optimal TGS-based and next-generation sequencing-based assemblers with LR_Gapcloser. A proposed and optimal hybrid strategy generated a new human CHM1 genome assembly with marked contiguity. The contig N50 value was greater than 28 Mb, which is larger than previous non-reference assemblies of the diploid human genome. Conclusions LR_Gapcloser is a fast and efficient tool that can be used to close gaps and improve the contiguity of genome assemblies. A proposed hybrid assembly including this tool promises reference-grade assemblies. The software is available at http://www.fishbrowser.org/software/LR_Gapcloser/.
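
The fold increase quoted for the CHM1 contig N50 can be checked from the two rounded values given in the abstract. This is a small arithmetic sketch; the helper name is hypothetical, and the unit constants assume 1 kb = 1,000 bp and 1 Mb = 1,000,000 bp.

```python
KB = 1_000      # base pairs per kilobase
MB = 1_000_000  # base pairs per megabase


def fold_increase(before_bp, after_bp):
    """Ratio of the new contig N50 to the old one."""
    return after_bp / before_bp


# Rounded contig N50 values as stated in the abstract.
fold = fold_increase(143 * KB, 19 * MB)
print(f"{fold:.1f}-fold")
```

The rounded inputs give a ratio between 132 and 133, in line with the 132-fold increase the abstract reports; the small difference presumably comes from rounding of the published N50 values.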

Journal Article

Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Damla Senol Cali¹, Jeremie S. Kim¹,², Saugata Ghose¹, Can Alkan³, Onur Mutlu¹,²•Institutions (3)

Carnegie Mellon University¹, ETH Zurich², Bilkent University³

19 Jul 2019-Briefings in Bioinformatics

TL;DR: The goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks, and provide guidelines for determining the appropriate tools for each step of the genome assembly pipeline using nanopore sequence data.

Abstract: Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) The choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.

Journal Article•DOI•

De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers

Martin Hölzer¹, Manja Marz¹,²•Institutions (2)

Schiller International University¹, Leibniz Association²

01 May 2019-GigaScience

TL;DR: A large-scale comparative study in which 10 de novo assembly tools are applied to 9 RNA-Seq data sets spanning different kingdoms of life, finding that Trinity, SPAdes, and Trans-ABySS, followed by Bridger and SOAPdenovo-Trans, generally outperformed the other tools compared.

Abstract: Background: In recent years, massively parallel complementary DNA sequencing (RNA sequencing [RNA-Seq]) has emerged as a fast, cost-effective, and robust technology to study entire transcriptomes in various manners. In particular, for non-model organisms and in the absence of an appropriate reference genome, RNA-Seq is used to reconstruct the transcriptome de novo. Although the de novo transcriptome assembly of non-model organisms has been on the rise recently and new tools are frequently being developed, there is still a knowledge gap about which assembly software should be used to build a comprehensive de novo assembly. Results: Here, we present a large-scale comparative study in which 10 de novo assembly tools are applied to 9 RNA-Seq data sets spanning different kingdoms of life. Overall, we built >200 single assemblies and evaluated their performance on a combination of 20 biological-based and reference-free metrics. Our study is accompanied by a comprehensive and extensible Electronic Supplement that summarizes all data sets, assembly execution instructions, and evaluation results. Trinity, SPAdes, and Trans-ABySS, followed by Bridger and SOAPdenovo-Trans, generally outperformed the other tools compared. Moreover, we observed species-specific differences in the performance of each assembler. No tool delivered the best results for all data sets. Conclusions: We recommend a careful choice and normalization of evaluation metrics to select the best assembling results as a critical step in the reconstruction of a comprehensive de novo transcriptome assembly.

Journal Article•DOI•

A hybrid de novo genome assembly of the honeybee, Apis mellifera, with chromosome-length scaffolds

Andreas Wallberg¹, Ignas Bunikis¹, Olga Vinnere Pettersson¹, Mai Britt Mosbech¹, Anna K. Childers², Jay D. Evans², Alexander S. Mikheyev³, Hugh M. Robertson⁴, Gene E. Robinson⁴, Matthew T. Webster¹•Institutions (4)

Science for Life Laboratory¹, Agricultural Research Service², Okinawa Institute of Science and Technology³, University of Illinois at Urbana–Champaign⁴

08 Apr 2019-BMC Genomics

TL;DR: This assembly is highly contiguous across centromeres and telomeres and includes hundreds of AvaI and AluI repeats associated with these features and will be of utility for refining gene models, studying genome function, mapping functional genetic variation, identification of structural variants, and comparative genomics.

Abstract: The ability to generate long sequencing reads and access long-range linkage information is revolutionizing the quality and completeness of genome assemblies. Here we use a hybrid approach that combines data from four genome sequencing and mapping technologies to generate a new genome assembly of the honeybee Apis mellifera. We first generated contigs based on PacBio sequencing libraries, which were then merged with linked-read 10x Chromium data followed by scaffolding using a BioNano optical genome map and a Hi-C chromatin interaction map, complemented by a genetic linkage map. Each of the assembly steps reduced the number of gaps and incorporated a substantial amount of additional sequence into scaffolds. The new assembly (Amel_HAv3) is significantly more contiguous and complete than the previous one (Amel_4.5), based mainly on Sanger sequencing reads. N50 of contigs is 120-fold higher (5.381 Mbp compared to 0.053 Mbp) and we anchor > 98% of the sequence to chromosomes. All of the 16 chromosomes are represented as single scaffolds with an average of three sequence gaps per chromosome. The improvements are largely due to the inclusion of repetitive sequence that was unplaced in previous assemblies. In particular, our assembly is highly contiguous across centromeres and telomeres and includes hundreds of AvaI and AluI repeats associated with these features. The improved assembly will be of utility for refining gene models, studying genome function, mapping functional genetic variation, identification of structural variants, and comparative genomics.
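For readers unfamiliar with the contig N50 statistic quoted in the abstract above, the following is a minimal Python sketch, not code from the paper. It assumes the common definition of N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name `n50` and the toy contig lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Assumed definition: sort contigs from longest to shortest and
    return the length of the contig at which the running total first
    reaches at least half of the summed length of all contigs.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, arbitrary units): total is 300,
# and 80 + 70 = 150 reaches half of it, so the N50 is 70.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70
```

N50 depends only on the distribution of contig lengths, so it measures contiguity rather than correctness; this is why the abstracts on this page pair it with gap counts or gene-completeness figures.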

Journal Article•DOI•

A chromosome-scale genome assembly of cucumber (Cucumis sativus L.).

Qing Li, Hongbo Li, Wu Huang, Yuanchao Xu, Qian Zhou, Shenhao Wang¹, Jue Ruan, Sanwen Huang, Zhonghua Zhang•Institutions (1)

Northwest A&F University¹

01 Jun 2019-GigaScience

TL;DR: This high-quality genome presents novel features of the cucumber genome and will serve as a valuable resource for genetic research in cucumber and plant comparative genomics.

Abstract: Background: Accurate and complete reference genome assemblies are fundamental for biological research. Cucumber is an important vegetable crop and model system for sex determination and vascular biology. Low-coverage Sanger sequences and high-coverage short Illumina sequences have been used to assemble draft cucumber genomes, but the incompleteness and low quality of these genomes limit their use in comparative genomics and genetic research. A high-quality and complete cucumber genome assembly is therefore essential. Findings: We assembled single-molecule real-time (SMRT) long reads to generate an improved cucumber reference genome. This version contains 174 contigs with a total length of 226.2 Mb and an N50 of 8.9 Mb, and provides 29.0 Mb more sequence data than previous versions. Using 10X Genomics and high-throughput chromosome conformation capture (Hi-C) data, 89 contigs (∼211.0 Mb) were directly linked into 7 pseudo-chromosome sequences. The newly assembled regions show much higher guanine-cytosine or adenine-thymine content than found previously and are likely to have been inaccessible to Illumina sequencing. The new assembly contains 1,374 full-length long terminal retrotransposons and 1,078 novel genes including 239 tandemly duplicated genes. For example, we found 4 tandemly duplicated tyrosylprotein sulfotransferases, in contrast to the single copy of the gene found previously and in most other plants. Conclusion: This high-quality genome presents novel features of the cucumber genome and will serve as a valuable resource for genetic research in cucumber and plant comparative genomics.

Journal Article•DOI•

Chromosome-level assembly of the water buffalo genome surpasses human and goat genomes in sequence contiguity

Wai Yee Low¹, Rick Tearle¹, Derek M. Bickhart², Benjamin D. Rosen², Sarah B. Kingan³, Thomas Swale, Françoise Thibaud-Nissen⁴, Terence Murphy⁴, Rachel Young⁵, Lucas Lefevre⁵, David A. Hume⁶, Andrew Collins⁷, Paolo Ajmone-Marsan⁸, Timothy P. L. Smith², John L. Williams¹•Institutions (8)

University of Adelaide¹, United States Department of Agriculture², Pacific Biosciences³, National Institutes of Health⁴, University of Edinburgh⁵, University of Queensland⁶, University of Southampton⁷, Catholic University of the Sacred Heart⁸

16 Jan 2019-Nature Communications

TL;DR: This new reference genome improves the contig N50 of the previous short-read based buffalo assembly more than a thousand-fold and contains only 383 gaps, which surpasses the human and goat references in sequence contiguity and facilitates the annotation of hard to assemble gene clusters such as the major histocompatibility complex (MHC).

Abstract: Rapid innovation in sequencing technologies and improvement in assembly algorithms have enabled the creation of highly contiguous mammalian genomes. Here we report a chromosome-level assembly of the water buffalo (Bubalus bubalis) genome using single-molecule sequencing and chromatin conformation capture data. PacBio Sequel reads, with a mean length of 11.5 kb, helped to resolve repetitive elements and generate sequence contiguity. All five B. bubalis sub-metacentric chromosomes were correctly scaffolded with centromeres spanned. Although the index animal was partly inbred, 58% of the genome was haplotype-phased by FALCON-Unzip. This new reference genome improves the contig N50 of the previous short-read based buffalo assembly more than a thousand-fold and contains only 383 gaps. It surpasses the human and goat references in sequence contiguity and facilitates the annotation of hard to assemble gene clusters such as the major histocompatibility complex (MHC). Despite technological advances, chromosome-level assemblies of mammalian genomes are still rare. Here, the authors use PacBio, Chicago and Hi-C approaches to generate a highly contiguous and partially-phased genome assembly for the water buffalo, Bubalus bubalis.

Posted Content•DOI•

An improved de novo assembly and annotation of the tomato reference genome using single-molecule sequencing, Hi-C proximity ligation and optical maps

Prashant S. Hosmani¹, Mirella Flores-Gonzalez¹, Henri van de Geest², Florian Maumus³, Linda V. Bakker², Elio Schijlen², Jan C. van Haarst², Jan H.G. Cordewener², Gabino F. Sanchez-Perez², Sander Peters², Zhangjun Fei¹, James J. Giovannoni¹, Lukas A Mueller¹, Surya Saha¹•Institutions (3)

Boyce Thompson Institute for Plant Research¹, Wageningen University and Research Centre², Université Paris-Saclay³

14 Sep 2019-bioRxiv

TL;DR: The latest tomato reference genome (SL4.0), assembled de novo from PacBio long reads, scaffolded using Hi-C contact maps, and validated using Bionano optical maps and 10X linked-read sequences, is presented.

Abstract: The original Heinz 1706 reference genome was produced by a large team of scientists from across the globe from a variety of input sources that included 454 sequences in addition to full-length BACs, BAC and fosmid ends sequenced with Sanger technology. We present here the latest tomato reference genome (SL4.0) assembled de novo from PacBio long reads and scaffolded using Hi-C contact maps. The assembly was validated using Bionano optical maps and 10X linked-read sequences. This assembly is highly contiguous with fewer gaps compared to previous genome builds and almost all scaffolds have been anchored and oriented to the 12 tomato chromosomes. We have found more repeats compared to the previous versions, and the LTR retrotransposons are one of the largest repeat classes identified. We also describe updates to the reference genome and annotation since the last publication. The corresponding ITAG4.0 annotation has 4,794 novel genes along with 29,281 genes preserved from ITAG2.4. Most of the updated genes have extensions in the 5’ and 3’ UTRs, resulting in a doubling of annotated UTRs per gene. The genome and annotation can be accessed using SGN through BLAST database, Pathway database (SolCyc), Apollo, JBrowse genome browser and FTP available at https://solgenomics.net.

Journal Article•DOI•

CAMISIM: simulating metagenomes and microbial communities

Adrian Fritz, Peter Hofmann¹, Stephan Majda¹, Eik Dahms¹, Johannes Dröge¹, Jessika Fiedler¹, Till Robin Lesker, Peter Belmann², Matthew Z. DeMaere³, Aaron E. Darling³, Alexander Sczyrba², Andreas Bremges, Alice C. McHardy¹•Institutions (3)

University of Düsseldorf¹, Bielefeld University², University of Technology, Sydney³

08 Feb 2019-Microbiome

TL;DR: CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with standards of truth for method evaluation, and generated the benchmark data sets of the first CAMI challenge.

Abstract: Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. We describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series, and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with standards of truth for method evaluation. All data sets and the software are freely available at https://github.com/CAMI-challenge/CAMISIM

Journal Article•DOI•

A High-Quality De novo Genome Assembly from a Single Mosquito Using PacBio Sequencing

Sarah B. Kingan¹, Haynes Heaton², Juliana Cudini², Christine C. Lambert¹, Primo Baybayan¹, Brendan Galvin¹, Richard Durbin³, Jonas Korlach¹, Mara K. N. Lawniczak²•Institutions (3)

Pacific Biosciences¹, Wellcome Trust Sanger Institute², University of Cambridge³

18 Jan 2019-Genes

TL;DR: A high-quality de novo genome assembly from a single Anopheles coluzzii mosquito, using a modified SMRTbell library construction protocol without DNA shearing and size selection, which puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.

Abstract: A high-quality reference genome is a fundamental resource for functional genetics, comparative genomics, and population genomics, and is increasingly important for conservation biology. PacBio Single Molecule, Real-Time (SMRT) sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. Improvements in throughput and concomitant reductions in cost have made PacBio an attractive core technology for many large genome initiatives; however, relatively high DNA input requirements (~5 µg for standard library protocol) have placed PacBio out of reach for many projects on small organisms that have lower DNA content, or on projects with limited input DNA for other reasons. Here we present a high-quality de novo genome assembly from a single Anopheles coluzzii mosquito. A modified SMRTbell library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 100 ng of starting genomic DNA. The sample was run on the Sequel System with chemistry 3.0 and software v6.0, generating, on average, 25 Gb of sequence per SMRT Cell with 20 h movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting curated assembly had high contiguity (contig N50 3.5 Mb) and completeness (more than 98% of conserved genes were present and full-length). In addition, this single-insect assembly now places 667 (>90%) of formerly unplaced genes into their appropriate chromosomal contexts in the AgamP4 PEST reference. We were also able to resolve maternal and paternal haplotypes for over 1/3 of the genome. By sequencing and assembling material from a single diploid individual, only two haplotypes were present, simplifying the assembly process compared to samples from multiple pooled individuals. The method presented here can be applied to samples with starting DNA amounts as low as 100 ng per 1 Gb genome size.
This new low-input approach puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.

Journal Article•DOI•

Long-read based de novo assembly of low-complexity metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system

Vincent Somerville¹, Stefanie Lutz¹, Michael Schmid¹, Daniel Frei, Aline Moser, Stefan Irmler, Jürg E. Frey, Christian H. Ahrens¹•Institutions (1)

Swiss Institute of Bioinformatics¹

25 Jun 2019-BMC Microbiology

TL;DR: The feasibility of complete de novo genome assembly of all dominant strains from low-complexity NWCs based on whole metagenomics shotgun sequencing data is demonstrated.

Abstract: Complete and contiguous genome assemblies greatly improve the quality of subsequent systems-wide functional profiling studies and the ability to gain novel biological insights. While a de novo genome assembly of an isolated bacterial strain is in most cases straightforward, more informative data about co-existing bacteria as well as synergistic and antagonistic effects can be obtained from a direct analysis of microbial communities. However, the complexity of metagenomic samples represents a major challenge. While third generation sequencing technologies have been suggested to enable finished metagenome-assembled genomes, to our knowledge, the complete genome assembly of all dominant strains in a microbiome sample has not been demonstrated. Natural whey starter cultures (NWCs) are used in cheese production and represent low-complexity microbiomes. Previous studies of Swiss Gruyere and selected Italian hard cheeses, mostly based on amplicon metagenomics, concurred that three species generally predominate: Streptococcus thermophilus, Lactobacillus helveticus and Lactobacillus delbrueckii. Two NWCs from Swiss Gruyere producers were subjected to whole metagenome shotgun sequencing using the Pacific Biosciences Sequel and Illumina MiSeq platforms. In addition, longer Oxford Nanopore Technologies MinION reads had to be generated for one to resolve repeat regions. Thereby, we achieved the complete assembly of all dominant bacterial genomes from these low-complexity NWCs, which was corroborated by a 16S rRNA amplicon survey. Moreover, two distinct L. helveticus strains were successfully co-assembled from the same sample. Besides bacterial chromosomes, we could also assemble several bacterial plasmids and phages and a corresponding prophage.
Biologically relevant insights were uncovered by linking the plasmids and phages to their respective host genomes using DNA methylation motifs on the plasmids and by matching prokaryotic CRISPR spacers with the corresponding protospacers on the phages. These results could only be achieved by employing long-read sequencing data able to span intragenomic as well as intergenomic repeats. Here, we demonstrate the feasibility of complete de novo genome assembly of all dominant strains from low-complexity NWCs based on whole metagenomics shotgun sequencing data. This allowed us to gain novel biological insights and is a fundamental basis for subsequent systems-wide omics analyses, functional profiling and phenotype to genotype analysis of specific microbial communities.

Journal Article•DOI•

Evaluation of strategies for the assembly of diverse bacterial genomes using MinION long-read sequencing

Sarah Goldstein¹, Lidia Beka¹, Joerg Graf¹, Jonathan L. Klassen¹•Institutions (1)

University of Connecticut¹

09 Jan 2019-BMC Genomics

TL;DR: Genome assembly using short reads is challenged by repetitive sequences and extreme GC contents; the results indicate that these difficulties can be largely overcome by using single-molecule, long-read sequencing technologies such as the Oxford Nanopore MinION.

Abstract: Short-read sequencing technologies have made microbial genome sequencing cheap and accessible. However, closing genomes is often costly and assembling short reads from genomes that are repetitive and/or have extreme %GC content remains challenging. Long-read, single-molecule sequencing technologies such as the Oxford Nanopore MinION have the potential to overcome these difficulties, although the best approach for harnessing their potential remains poorly evaluated. We sequenced nine bacterial genomes spanning a wide range of GC contents using Illumina MiSeq and Oxford Nanopore MinION sequencing technologies to determine the advantages of each approach, both individually and combined. Assemblies using only MiSeq reads were highly accurate but lacked contiguity, a deficiency that was partially overcome by adding MinION reads to these assemblies. Even more contiguous genome assemblies were generated by using MinION reads for initial assembly, but these assemblies were more error-prone and required further polishing. This was especially pronounced when Illumina libraries were biased, as was the case for our strains with both high and low GC content. Increased genome contiguity dramatically improved the annotation of insertion sequences and secondary metabolite biosynthetic gene clusters, likely because long-reads can disambiguate these highly repetitive but biologically important genomic regions. Genome assembly using short-reads is challenged by repetitive sequences and extreme GC contents. Our results indicate that these difficulties can be largely overcome by using single-molecule, long-read sequencing technologies such as the Oxford Nanopore MinION. Using MinION reads for assembly followed by polishing with Illumina reads generated the most contiguous genomes with sufficient accuracy to enable the accurate annotation of important but difficult to sequence genomic features such as insertion sequences and secondary metabolite biosynthetic gene clusters. 
The combination of Oxford Nanopore and Illumina sequencing can therefore cost-effectively advance studies of microbial evolution and genome-driven drug discovery.

Posted Content•DOI•

Human Genome Assembly in 100 Minutes

Chen-Shan Chin, Asif Khalak

17 Jul 2019-bioRxiv

TL;DR: The continued advance of sequencing technologies coupled with the Peregrine assembler enables routine generation of human de novo assemblies, which will allow for population scale measurements of more comprehensive genomic variations -- beyond SNPs and small indels -- as well as novel applications requiring rapid access to de novo assemblies.

Abstract: De novo genome assembly provides comprehensive, unbiased genomic information and makes it possible to gain insight into new DNA sequences not present in reference genomes. Many de novo human genomes have been published in the last few years, leveraging a combination of inexpensive short-read and single-molecule long-read technologies. As long-read DNA sequencers become more prevalent, the computational burden of generating assemblies persists as a critical factor. The most common approach to long-read assembly, using an overlap-layout-consensus (OLC) paradigm, requires all-to-all read comparisons, which scale quadratically in computational complexity with the number of reads. We assert that recent achievements in sequencing technology (i.e. with accuracy ~99% and read length ~10-15k) enable a fundamentally better strategy for OLC that is effectively linear rather than quadratic. Our genome assembly implementation, Peregrine, uses sparse hierarchical minimizers (SHIMMER) to index reads, thereby avoiding the need for an all-to-all read comparison step. Peregrine can assemble 30x human PacBio CCS read datasets in less than 30 CPU hours and around 100 wall-clock minutes to a high contiguity assembly (N50 > 20Mb). The continued advance of sequencing technologies coupled with the Peregrine assembler enables routine generation of human de novo assemblies. This will allow for population scale measurements of more comprehensive genomic variations -- beyond SNPs and small indels -- as well as novel applications requiring rapid access to de novo assemblies.
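To illustrate the general idea of indexing reads by minimizers instead of comparing every read with every other read, here is a minimal Python sketch. It is not Peregrine's code and does not implement the sparse hierarchical scheme (SHIMMER) named in the abstract; it uses plain lexicographic (w, k)-minimizers as a simpler stand-in. The function names, the parameters `k=5` and `w=4`, and the toy reads are all assumptions made for illustration.

```python
from collections import defaultdict


def minimizers(seq, k=5, w=4):
    """Return the set of (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, keep the
    lexicographically smallest k-mer (leftmost on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return picked


def candidate_pairs(reads, k=5, w=4):
    """Return pairs of read ids that share at least one minimizer k-mer."""
    # One pass over the reads builds the index: k-mer -> ids of reads.
    index = defaultdict(set)
    for read_id, seq in reads.items():
        for _, kmer in minimizers(seq, k, w):
            index[kmer].add(read_id)
    # Only reads that fall into the same bucket are ever paired.
    pairs = set()
    for ids in index.values():
        ordered = sorted(ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


# Hypothetical toy reads: "a" and "b" share the 9-base stretch
# GGACCTAGT, while "c" shares no 5-mer with either of them.
reads = {
    "a": "ACGTTGCATGGACCTAGT",
    "b": "GGACCTAGTCCGATAAC",
    "c": "TTTTTTTTTTTT",
}
print(candidate_pairs(reads))  # {('a', 'b')}
```

Two reads that share an exact stretch of at least w + k - 1 bases are guaranteed to share a minimizer, so truly overlapping reads land in a common bucket. Pairing work is then limited to reads within a bucket rather than all pairs of reads; in practice the saving depends on how repetitive the sampled k-mers are.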

Journal Article•DOI•

A near-chromosome-scale genome assembly of the gemsbok (Oryx gazella): an iconic antelope of the Kalahari desert.

Marta Farré¹, Qiye Li², Yang Zhou, Joana Damas¹, Leona G. Chemnick, Jaebum Kim³, Oliver A. Ryder, Jian Ma⁴, Guojie Zhang²,⁵, Denis M. Larkin¹, Harris A. Lewin⁶•Institutions (6)

Royal Veterinary College¹, Kunming Institute of Zoology², Konkuk University³, Carnegie Mellon University⁴, University of Copenhagen⁵, University of Minnesota⁶

01 Feb 2019-GigaScience

TL;DR: The results provide the first high-quality, chromosome-scale genome sequence assembly for gemsbok, which will be a valuable resource for studying adaptive evolution of this species and other ruminants.

Abstract: Background: The gemsbok (Oryx gazella) is one of the largest antelopes in Africa. Gemsbok are heterothermic and thus highly adapted to live in the desert, changing their feeding behavior when faced with extreme drought and heat. A high-quality genome sequence of this species will assist efforts to elucidate these and other important traits of gemsbok and facilitate research on conservation efforts. Findings: Using 180 Gbp of Illumina paired-end and mate-pair reads, a 2.9 Gbp assembly with scaffold N50 of 1.48 Mbp was generated using SOAPdenovo. Scaffolds were extended using Chicago library sequencing, which yielded an additional 114.7 Gbp of DNA sequence. The HiRise assembly using SOAPdenovo + Chicago library sequencing produced a scaffold N50 of 47 Mbp and a final genome size of 2.9 Gbp, representing 90.6% of the estimated genome size and including 93.2% of expected genes according to Benchmarking Universal Single-Copy Orthologs analysis. The Reference-Assisted Chromosome Assembly tool was used to generate a final set of 47 predicted chromosome fragments with N50 of 86.25 Mbp and containing 93.8% of expected genes. A total of 23,125 protein-coding genes and 1.14 Gbp of repetitive sequences were annotated using de novo and homology-based predictions. Conclusions: Our results provide the first high-quality, chromosome-scale genome sequence assembly for gemsbok, which will be a valuable resource for studying adaptive evolution of this species and other ruminants.

Journal Article•DOI•

New Approaches for Genome Assembly and Scaffolding

Edward S. Rice¹, Richard E. Green¹•Institutions (1)

University of California, Santa Cruz¹

14 Feb 2019-Annual Review of Animal Biosciences

TL;DR: An overview of the problem of chromosome-scale assembly and traditional methods for tackling this problem is given, and new technologies for chromosome-scale assembly and recent genome projects that used these technologies to create highly contiguous genome assemblies at low cost are reviewed.

Abstract: Affordable, high-throughput DNA sequencing has accelerated the pace of genome assembly over the past decade. Genome assemblies from high-throughput, short-read sequencing, however, are often not as contiguous as the first generation of genome assemblies. Whereas early genome assembly projects were often aided by clone maps or other mapping data, many current assembly projects forego these scaffolding data and only assemble genomes into smaller segments. Recently, new technologies have been invented that allow chromosome-scale assembly at a lower cost and faster speed than traditional methods. Here, we give an overview of the problem of chromosome-scale assembly and traditional methods for tackling this problem. We then review new technologies for chromosome-scale assembly and recent genome projects that used these technologies to create highly contiguous genome assemblies at low cost.

Journal Article•DOI•

Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data.

Vasanthan Jayakumar¹, Yasubumi Sakakibara¹•Institutions (1)

Keio University¹

21 May 2019-Briefings in Bioinformatics

TL;DR: This study evaluated 10 long-read assemblers using a variety of metrics on Pacific Biosciences data sets from different taxonomic categories with considerable differences in genome size to narrow down the list to a few assemblers that can be effectively applied to eukaryotic assembly projects.

Abstract: Long reads obtained from third-generation sequencing platforms can help overcome the long-standing challenge of the de novo assembly of sequences for the genomic analysis of non-model eukaryotic organisms. Numerous long-read-aided de novo assemblies have been published recently, which exhibited superior quality of the assembled genomes in comparison with those achieved using earlier second-generation sequencing technologies. Evaluating assemblies is important in guiding the appropriate choice for specific research needs. In this study, we evaluated 10 long-read assemblers using a variety of metrics on Pacific Biosciences (PacBio) data sets from different taxonomic categories with considerable differences in genome size. The results allowed us to narrow down the list to a few assemblers that can be effectively applied to eukaryotic assembly projects. Moreover, we highlight how best to use limited genomic resources for effectively evaluating the genome assemblies of non-model organisms.

Journal Article•DOI•

Annotated bacterial chromosomes from frame-shift-corrected long-read metagenomic data.

Krithika Arumugam¹, Caner Bağcı², Irina Bessarab³, Sina Beier², Benjamin Buchfink⁴, Anna Górska², Guanglei Qiu¹, Daniel H. Huson²,³, Rohan B. H. Williams³•Institutions (4)

Nanyang Technological University¹, University of Tübingen², National University of Singapore³, Max Planck Society⁴

16 Apr 2019-Microbiome

TL;DR: It is demonstrated that whole bacterial chromosomes can be obtained from an enriched community, by application of MinION sequencing to a sample from an EBPR bioreactor, producing 6 Gb of sequence that assembles into multiple closed bacterial chromosomes.

Abstract: Short-read sequencing technologies have long been the work-horse of microbiome analysis. Continuing technological advances are making the application of long-read sequencing to metagenomic samples increasingly feasible. We demonstrate that whole bacterial chromosomes can be obtained from an enriched community, by application of MinION sequencing to a sample from an EBPR bioreactor, producing 6 Gb of sequence that assembles into multiple closed bacterial chromosomes. We provide a simple pipeline for processing such data, which includes a new approach to correcting erroneous frame-shifts. Advances in long-read sequencing technology and corresponding algorithms will allow the routine extraction of whole chromosomes from environmental samples, providing a more detailed picture of individual members of a microbiome.
