Showing papers on "Hybrid genome assembly" published in 2018

Open Access

Journal Article

Nanopore sequencing and assembly of a human genome with ultra-long reads

Miten Jain¹, Sergey Koren², Karen H. Miga¹, Josh Quick³, Arthur C Rand¹, Thomas A Sasani⁴, John R. Tyson⁵, Andrew D Beggs³, Alexander T. Dilthey², Ian T. Fiddes¹, Sunir Malla⁶, Hannah Marriott⁶, Tom Nieto³, Justin O'Grady⁷, Hugh E. Olsen¹, Brent S. Pedersen⁴, Arang Rhie², Hollian Richardson⁷, Aaron R. Quinlan⁴, Terrance P. Snutch⁵, Louise Tee³, Benedict Paten¹, Adam M. Phillippy², Jared T. Simpson⁸,⁹, Nicholas J. Loman³, Matthew Loose⁶

University of California, Santa Cruz¹, National Institutes of Health², University of Birmingham³, University of Utah⁴, University of British Columbia⁵, University of Nottingham⁶, University of East Anglia⁷, Ontario Institute for Cancer Research⁸, University of Toronto⁹

29 Jan 2018-Nature Biotechnology

TL;DR: Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.

Abstract: We report the sequencing and assembly of a reference genome for the human GM12878 Utah/Ceph cell line using the MinION (Oxford Nanopore Technologies) nanopore sequencer. 91.2 Gb of sequence data, representing ∼30× theoretical coverage, were produced. Reference-based alignment enabled detection of large structural variants and epigenetic modifications. De novo assembly of nanopore reads alone yielded a contiguous assembly (NG50 ∼3 Mb). We developed a protocol to generate ultra-long reads (N50 > 100 kb, read lengths up to 882 kb). Incorporating an additional 5× coverage of these ultra-long reads more than doubled the assembly contiguity (NG50 ∼6.4 Mb). The final assembled genome was 2,867 million bases in size, covering 85.8% of the reference. Assembly accuracy, after incorporating complementary short-read sequencing data, exceeded 99.8%. Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.
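
The coverage and contiguity figures in this abstract can be checked with a little arithmetic. The sketch below is illustrative only: the 3.1 Gb genome size is an assumed round figure that the abstract does not state, the contig lengths are toy values, and the function names are hypothetical. NG50 is computed in the usual way, as the contig length at which the cumulative sum of contigs (longest first) reaches half of the genome size rather than half of the assembly size.

```python
def theoretical_coverage(total_bases, genome_size):
    """Average depth if the sequenced bases were spread evenly over the genome."""
    return total_bases / genome_size


def ng50(contig_lengths, genome_size):
    """Length of the contig at which the running total of contig lengths,
    taken from longest to shortest, first reaches half of genome_size."""
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= genome_size / 2:
            return length
    return 0  # the contigs cover less than half of the genome


# Assumption: a haploid human genome of roughly 3.1 Gb (not given in the abstract).
ASSUMED_GENOME_SIZE = 3.1e9

# 91.2 Gb of sequence data, as reported in the abstract.
print(round(theoretical_coverage(91.2e9, ASSUMED_GENOME_SIZE), 1))

# Toy contig lengths, only to show how NG50 behaves.
print(ng50([40, 30, 20, 10], genome_size=160))
```

Under the assumed genome size, the 91.2 Gb yield works out to a depth a little under 30, in line with the "∼30×" quoted above.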

1,425 citations

Journal Article

Fast-SG: an alignment-free algorithm for hybrid assembly

Alex Di Genova, Gonzalo A. Ruz¹,², Marie-France Sagot³,⁴, Alejandro Maass⁵

Adolfo Ibáñez University¹, Coordenadoria de Aperfeiçoamento de Pessoal de Nível Superior², French Institute for Research in Computer Science and Automation³, Claude Bernard University Lyon 1⁴, University of Chile⁵

01 May 2018-GigaScience

TL;DR: This work proposes a new method, called Fast-SG, that uses a new ultrafast alignment-free algorithm specifically designed for constructing a scaffolding graph using light-weight data structures, and opens a door to achieve accurate hybrid long-range reconstructions of large genomes with low effort, high portability, and low cost.

Abstract: Background: Long-read sequencing technologies are the ultimate solution for genome repeats, allowing near reference-level reconstructions of large genomes. However, long-read de novo assembly pipelines are computationally intense and require a considerable amount of coverage, thereby hindering their broad application to the assembly of large genomes. Alternatively, hybrid assembly methods that combine short- and long-read sequencing technologies can reduce the time and cost required to produce de novo assemblies of large genomes. Results: Here, we propose a new method, called Fast-SG, that uses a new ultrafast alignment-free algorithm specifically designed for constructing a scaffolding graph using lightweight data structures. Fast-SG can construct the graph from either short or long reads. This allows the reuse of efficient algorithms designed for short-read data and permits the definition of novel modular hybrid assembly pipelines. Using comprehensive standard datasets and benchmarks, we show how Fast-SG outperforms the state-of-the-art short-read aligners when building the scaffolding graph and can be used to extract linking information from either raw or error-corrected long reads. We also show how a hybrid assembly approach using Fast-SG with shallow long-read coverage (5X) and moderate computational resources can produce long-range and accurate reconstructions of the genomes of Arabidopsis thaliana (Ler-0) and human (NA12878). Conclusions: Fast-SG opens a door to achieve accurate hybrid long-range reconstructions of large genomes with low effort, high portability, and low cost.
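
To make the idea of alignment-free linking concrete, here is a toy sketch of one generic way to collect scaffolding evidence without aligning reads: index the k-mers that occur exactly once across all contigs, look up each read's k-mers in that index, and count a link whenever one read hits two contigs. This is not the Fast-SG algorithm or its data structures. The function names, the tiny `k`, the `min_hits` threshold, and the sequences are all assumptions made for illustration, and the sketch ignores reverse complements, orientation, and distance estimates that a real scaffolder needs.

```python
from collections import Counter
from itertools import combinations


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_unique_kmer_index(contigs, k):
    """Map each k-mer that occurs exactly once across all contigs to its contig."""
    counts = Counter(km for seq in contigs.values() for km in kmers(seq, k))
    index = {}
    for name, seq in contigs.items():
        for km in kmers(seq, k):
            if counts[km] == 1:
                index[km] = name
    return index


def scaffold_links(contigs, reads, k=5, min_hits=2):
    """Count, for each pair of contigs, the reads sharing at least
    min_hits unique k-mers with both of them (exact lookup, no alignment)."""
    index = build_unique_kmer_index(contigs, k)
    links = Counter()
    for read in reads:
        hits = Counter(index[km] for km in kmers(read, k) if km in index)
        supported = sorted(name for name, n in hits.items() if n >= min_hits)
        for a, b in combinations(supported, 2):
            links[(a, b)] += 1
    return links


# Toy data: the first read spans the end of c1 and the start of c2.
contigs = {"c1": "ACGTTGCA", "c2": "GGATCCTA"}
reads = ["GTTGCANNNGGATCC", "ACGTTGCA"]
print(scaffold_links(contigs, reads))
```

Each counted pair would become a weighted edge in a scaffolding graph; the second read touches only `c1`, so it contributes no link.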

13 citations

Journal Article

Hybrid genome assembly and annotation of Paenibacillus pasadenensis strain R16 reveals insights on endophytic life style and antifungal activity.

Alessandro Passera¹, Luca Marcolungo², Paola Casati¹, Milena Brasca³, Fabio Quaglino¹, Chiara Cantaloni², Massimo Delledonne²

University of Milan¹, University of Verona², National Research Council³

19 Jan 2018-PLOS ONE

TL;DR: The functional analysis on the genome confirmed several plant-associated, plant-growth promotion, and biocontrol traits of strain R16, thus adding insights in the genetic bases of these complex features, and of the Paenibacillus genus in general.

Abstract: Bacteria of the Paenibacillus genus are becoming important in many fields of science, including agriculture, for their positive effects on the health of plants. However, there is little information available on this genus compared to other bacteria (such as Bacillus or Pseudomonas), especially when considering genomic information. Sequencing the genomes of plant-beneficial bacteria is a crucial step to identify the genetic elements underlying the adaptation to life inside a plant host and, in particular, which of these features determine the differences between a helpful microorganism and a pathogenic one. In this study, we have characterized the genome of Paenibacillus pasadenensis, strain R16, recently investigated for its antifungal activities and plant-associated features. A hybrid assembly approach was used, integrating the very precise reads obtained by Illumina technology and long fragments acquired with Oxford Nanopore Technology (ONT) sequencing. De novo genome assembly based solely on Illumina reads generated a relatively fragmented assembly of 5.72 Mbp in 99 ungapped sequences with an N50 length of 544 Kbp; hybrid assembly, integrating Illumina and ONT reads, improved the assembly quality, generating a genome of 5.75 Mbp, organized in 6 contigs with an N50 length of 3.4 Mbp. Annotation of the latter genome identified 4987 coding sequences, of which 1610 are hypothetical proteins. Enrichment analysis identified pathways of particular interest for the endophyte biology, including the chitin-utilization pathway and the incomplete siderophore pathway, which hints at siderophore parasitism. In addition, the analysis led to the identification of genes for the production of terpenes, such as farnesol, which was hypothesized to be the main antifungal molecule produced by the strain.
The functional analysis on the genome confirmed several plant-associated, plant-growth promotion, and biocontrol traits of strain R16, thus adding insights in the genetic bases of these complex features, and of the Paenibacillus genus in general.

11 citations

Journal Article

Base-Calling Using a Random Effects Mixture Model on Next-Generation Sequencing Data

Ashley Cacho¹, Weixin Yao¹, Xinping Cui¹

University of California, Riverside¹

01 Apr 2018-Statistics in Biosciences

TL;DR: A random effects mixture model is introduced that captures the sequencing process, and its performance is compared to a model with fixed effects; the accuracy and quality of the called bases affect downstream analyses such as genome assembly and variant detection.

Abstract: The emergence of next-generation sequencing technology has greatly influenced research in biology and clinical applications. This new technology allows millions of DNA fragments to be sequenced in parallel, reducing costs and increasing throughput. One of the most widely used DNA sequencing machines is the Illumina platform, which uses a novel sequencing-by-synthesis method involving a series of chemical reactions and image processing. However, it suffers from biases inherent in the complex nature of the chemical processes involved. The process of converting the fluorescence intensity output of the sequencing-by-synthesis technology to the nucleotide bases is known as base-calling. The resulting DNA sequences are used in further downstream analyses, such as genome assembly or variant detection, in which the accuracy and quality of bases impact the results. In this paper, we introduce a random effects mixture model that captures the sequencing process and compare its performance to a model with fixed effects.
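
As a rough illustration of what base-calling from intensities involves, the sketch below calls one base from four channel intensities using a deliberately simplified equal-weight Gaussian mixture with fixed parameters. It is not the random effects model introduced in the paper. The channel order (A, C, G, T), the `signal`, `background`, and `sigma` values, and the function name are all assumptions chosen for the example. The quality value uses the standard Phred definition, Q = -10 log10(P(error)).

```python
import math

BASES = "ACGT"  # assumed channel order


def call_base(intensities, signal=1.0, background=0.0, sigma=0.25):
    """Call a base from four channel intensities and return (base, Phred quality).

    Simplified model: if the true base is b, channel b is Gaussian around
    `signal` and the other channels are Gaussian around `background`,
    all with standard deviation `sigma`, and the four bases are equally likely.
    """
    log_liks = []
    for b in range(4):
        ll = 0.0
        for ch, x in enumerate(intensities):
            mean = signal if ch == b else background
            ll += -((x - mean) ** 2) / (2 * sigma ** 2)
        log_liks.append(ll)

    # Normalise in a numerically stable way to get posterior probabilities.
    top = max(log_liks)
    weights = [math.exp(ll - top) for ll in log_liks]
    total = sum(weights)
    posteriors = [w / total for w in weights]

    best = max(range(4), key=posteriors.__getitem__)
    p_error = max(1.0 - posteriors[best], 1e-10)  # floor avoids log10(0)
    quality = -10 * math.log10(p_error)
    return BASES[best], quality


# A clean signal in the first channel versus an ambiguous one.
print(call_base([0.9, 0.1, 0.05, 0.0]))
print(call_base([0.55, 0.45, 0.0, 0.0]))
```

The ambiguous cycle still calls the first base but with a much lower quality score, which is the kind of per-base uncertainty that downstream assembly and variant-calling tools rely on.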

1 citation