Sequence assembly demystified

doi:10.1038/NRG3367

Journal Article•DOI•

Niranjan Nagarajan¹, Mihai Pop²

Genome Institute of Singapore¹, University of Maryland, College Park²

01 Mar 2013-Nature Reviews Genetics (Nature Publishing Group)-Vol. 14, Iss: 3, pp 157-167

TL;DR: The theoretical foundations that underlie modern assembly are surveyed and the options and practical trade-offs that need to be considered are highlighted, focusing on how individual features address the needs of specific applications.

Abstract: As the use of next-generation sequencing has proliferated, so has the range of sequencing applications and software tools that are available for assembling sequences. To help readers to make informed choices about assembly techniques, this Review discusses the available options and practical trade-offs.

Citations

SPAdes, a new genome assembly algorithm and its applications to single-cell sequencing (7th Annual SFAF Meeting, 2012)

Glenn Tesler

01 Jun 2012

TL;DR: SPAdes is a new assembler for both single-cell and standard (multicell) assembly that improves on the recently released E+V-SC assembler and on the popular assemblers Velvet and SoapDeNovo (for multicell data).

Abstract: The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online (http://bioinf.spbau.ru/spades). It is distributed as open source software.

10,124 citations

Journal Article•DOI•

MetaSPAdes: A new versatile metagenomic assembler

Sergey Nurk¹, Dmitry Meleshko¹, Anton Korobeynikov¹, Pavel A. Pevzner¹,²

Saint Petersburg State University¹, University of California, San Diego²

01 May 2017-Genome Research

TL;DR: metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes.

Abstract: While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging, thus stifling biological discoveries. Moreover, recent studies revealed that complex bacterial populations may be composed from dozens of related strains, thus further amplifying the challenge of metagenomic assembly. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes. We benchmark metaSPAdes against other state-of-the-art metagenome assemblers and demonstrate that it results in high-quality assemblies across diverse data sets.

2,295 citations

Journal Article•DOI•

Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads.

Ryan R. Wick¹, Louise M. Judd¹, Claire L. Gorrie¹, Kathryn E. Holt¹

University of Melbourne¹

08 Jun 2017-PLOS Computational Biology

TL;DR: Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long-read depth and accuracy are low.

Abstract: The Illumina DNA sequencing platform generates accurate but short reads, which can be used to produce accurate but fragmented genome assemblies. Pacific Biosciences and Oxford Nanopore Technologies DNA sequencing platforms generate long reads that can produce complete genome assemblies, but the sequencing is more expensive and error-prone. There is significant interest in combining data from these complementary sequencing technologies to generate more accurate "hybrid" assemblies. However, few tools exist that truly leverage the benefits of both types of data, namely the accuracy of short reads and the structural resolving power of long reads. Here we present Unicycler, a new tool for assembling bacterial genomes from a combination of short and long reads, which produces assemblies that are accurate, complete and cost-effective. Unicycler builds an initial assembly graph from short reads using the de novo assembler SPAdes and then simplifies the graph using information from short and long reads. Unicycler uses a novel semi-global aligner to align long reads to the assembly graph. Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long-read depth and accuracy are low. Unicycler is open source (GPLv3) and available at github.com/rrwick/Unicycler.

2,245 citations

Journal Article•DOI•

Sequencing depth and coverage: key considerations in genomic analyses

David Sims¹, Ian Sudbery¹, Nicholas E. Ilott¹, Andreas Heger¹, Chris P. Ponting¹

University of Oxford¹

01 Feb 2014-Nature Reviews Genetics

TL;DR: The issue of sequencing depth in the design of next-generation sequencing experiments is discussed and current guidelines and precedents on the issue of coverage are reviewed for four major study designs, including de novo genome sequencing, genome resequencing, transcriptome sequencing and genomic location analyses.

Abstract: Sequencing technologies have placed a wide range of genomic analyses within the capabilities of many laboratories. However, sequencing costs often set limits to the amount of sequences that can be generated and, consequently, the biological outcomes that can be achieved from an experimental design. In this Review, we discuss the issue of sequencing depth in the design of next-generation sequencing experiments. We review current guidelines and precedents on the issue of coverage, as well as their underlying considerations, for four major study designs, which include de novo genome sequencing, genome resequencing, transcriptome sequencing and genomic location analyses (for example, chromatin immunoprecipitation followed by sequencing (ChIP-seq) and chromosome conformation capture (3C)).

1,156 citations

Journal Article•DOI•

MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices

Dinghua Li¹, Ruibang Luo¹, Chi-Man Liu, Chi-Ming Leung¹, Hing-Fung Ting¹, Kunihiko Sadakane², Hiroshi Yamashita², Tak-Wah Lam¹

University of Hong Kong¹, University of Tokyo²

01 Jun 2016-Methods

TL;DR: The core algorithms of MEGAHIT v0.1 are described in detail, and the new modules that upgrade MEGAHIT to version v1.0 are presented; the new version gives better assembly quality, runs faster and uses less memory.

935 citations

Cites background or methods from "Sequence assembly demystified"

...genomes with uneven depth coverage and cross-genome repeats [6], specifically designed algorithms [8, 12-14] are required to achieve good assembly quality....
[...]
...The de novo assembly of metagenomic data, especially those sequenced from ultra-diverse environmental microbiomes, is constrained by the heavy requirement of computational resources [3, 6]....
[...]

References

Journal Article•DOI•

SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing

Anton Bankevich¹, Sergey Nurk, Dmitry Antipov, Alexey Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski, Alexey V. Pyshkin, Alexander Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, Pavel A. Pevzner

Saint Petersburg Academic University¹

07 May 2012-Journal of Computational Biology

TL;DR: SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies.

Abstract: The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V−SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online (http://bioinf.spbau.ru/spades). It is distributed as open source software.

16,859 citations

Journal Article•DOI•

Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Manfred Grabherr¹, Brian J. Haas¹, Moran Yassour¹,², Joshua Z. Levin¹, Dawn Thompson¹, Ido Amit¹, Xian Adiconis¹, Lin Fan¹, Raktima Raychowdhury¹, Qiandong Zeng¹, Zehua Chen¹, Evan Mauceli¹, Nir Hacohen¹, Andreas Gnirke¹, Nicholas Rhind³, Federica Di Palma¹, Bruce W. Birren¹, Chad Nusbaum¹, Kerstin Lindblad-Toh¹,⁴, Nir Friedman², Aviv Regev¹

Massachusetts Institute of Technology¹, Hebrew University of Jerusalem², University of Massachusetts Medical School³, Science for Life Laboratory⁴

01 Jul 2011-Nature Biotechnology

TL;DR: The Trinity method for de novo assembly of full-length transcripts is presented and evaluated on samples from fission yeast, mouse and whitefly (whose reference genome is not yet available), providing a unified solution for transcriptome reconstruction in any sample.

Abstract: Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.

15,665 citations

SPAdes, a new genome assembly algorithm and its applications to single-cell sequencing (7th Annual SFAF Meeting, 2012)

Glenn Tesler

01 Jun 2012

10,124 citations

Journal Article•DOI•

Velvet: Algorithms for de novo short read assembly using de Bruijn graphs

Daniel R. Zerbino¹, Ewan Birney¹

European Bioinformatics Institute¹

01 May 2008-Genome Research

TL;DR: Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies; contigs from real data sets without read pairs are in close agreement with simulated results without read-pair information.

Abstract: We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.

9,389 citations

Journal Article•DOI•

A human gut microbial gene catalogue established by metagenomic sequencing

Junjie Qin¹, Ruiqiang Li¹, Jeroen Raes², Manimozhiyan Arumugam, Kristoffer Sølvsten Burgdorf, Chaysavanh Manichanh, Trine Nielsen, Nicolas Pons³, Florence Levenez³, Takuji Yamada, Daniel R. Mende, Junhua Li¹, Junming Xu¹, Shaochuan Li¹, Dongfang Li¹, Jianjun Cao¹, Bo Wang¹, Huiqing Liang¹, Huisong Zheng¹, Yinlong Xie¹, Julien Tap³, Patricia Lepage³, Marcelo Bertalan, Jean-Michel Batto³, Torben Hansen, Denis Le Paslier, Allan Linneberg, H. Bjørn Nielsen, Eric Pelletier, Pierre Renault³, Thomas Sicheritz-Pontén, Keith Turner⁴, Hongmei Zhu¹, Chang Yu¹, Shengting Li¹, Min Jian¹, Yan Zhou¹, Yingrui Li¹, Xiuqing Zhang¹, Songgang Li¹, Nan Qin¹, Huanming Yang¹, Jian Wang¹, Søren Brunak, Joël Doré³, Francisco Guarner⁵, Karsten Kristiansen, Oluf Pedersen, Julian Parkhill, Jean Weissenbach, Peer Bork, S. Dusko Ehrlich³, Jun Wang¹

Beijing Genomics Institute¹, Vrije Universiteit Brussel², Institut national de la recherche agronomique³, Wellcome Trust Sanger Institute⁴, Vall d'Hebron University Hospital⁵

04 Mar 2010-Nature

TL;DR: The Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals are described, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species.

Abstract: To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals. The gene set, ~150 times larger than the human gene complement, contains an overwhelming majority of the prevalent (more frequent) microbial genes of the cohort and probably includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions present in all individuals and most bacteria, respectively.

9,268 citations