ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads

doi:10.1186/GB-2009-10-10-R103


Iain MacCallum¹, Dariusz Przybylski¹, Sante Gnerre¹, Joshua N. Burton¹, Ilya Shlyakhter¹, Andreas Gnirke¹, Joel A. Malek²,³, Kevin McKernan³, Swati Ranade³,⁴, Terrance Shea¹, Louise Williams¹, Sarah Young¹, Chad Nusbaum¹, David B. Jaffe¹

Broad Institute¹, Cornell University², Life Technologies³, Pacific Biosciences⁴

01 Oct 2009-Genome Biology (BioMed Central)-Vol. 10, Iss: 10, pp 1-10

TL;DR: Using 36 base (fragment) and 26 base (jumping) reads from five microbial genomes of varied GC composition and sizes up to 40 Mb, ALLPATHS2 generated assemblies with long, accurate contigs and scaffolds.


Abstract: We demonstrate that genome sequences approaching finished quality can be generated from short paired reads. Using 36 base (fragment) and 26 base (jumping) reads from five microbial genomes of varied GC composition and sizes up to 40 Mb, ALLPATHS2 generated assemblies with long, accurate contigs and scaffolds. Velvet and EULER-SR were less accurate. For example, for Escherichia coli, the fraction of 10-kb stretches that were perfect was 99.8% (ALLPATHS2), 68.7% (Velvet), and 42.1% (EULER-SR).


Citations


FLASH: Fast Length Adjustment of Short Reads to Improve Genome Assemblies

Tanja Magoc¹, Steven L. Salzberg¹

Johns Hopkins University School of Medicine¹

01 Nov 2011-Bioinformatics

TL;DR: FLASH is a fast computational tool that extends short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds.


Abstract: Motivation: Next-generation sequencing technologies generate very large numbers of short reads. Even with very deep genome coverage, short read lengths cause problems in de novo assemblies. The use of paired-end libraries with a fragment size shorter than twice the read length provides an opportunity to generate much longer reads by overlapping and merging read pairs before assembling a genome. Results: We present FLASH, a fast computational tool to extend the length of short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. We tested the correctness of the tool on one million simulated read pairs, and we then applied it as a pre-processor for genome assemblies of Illumina reads from the bacterium Staphylococcus aureus and human chromosome 14. FLASH correctly extended and merged reads >99% of the time on simulated reads with an error rate of <1%. With adequately set parameters, FLASH correctly merged reads over 90% of the time even when the reads contained up to 5% errors. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds. Availability and Implementation: The FLASH system is implemented in C and is freely available as open-source code at http://www.cbcb.umd.edu/software/flash. Contact: t.magoc@gmail.com


9,827 citations
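
The FLASH abstract above describes merging read pairs whose fragment is shorter than twice the read length. The following is a minimal Python sketch of that general idea, not FLASH's actual algorithm: it slides the reverse complement of the second read against the end of the first and keeps the overlap with the lowest mismatch ratio. The function names and the parameters `min_overlap` and `max_mismatch_ratio` are hypothetical, chosen only for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(read1: str, read2: str, min_overlap: int = 10,
               max_mismatch_ratio: float = 0.25):
    """Merge a read pair into one longer sequence, or return None.

    Illustrative assumption: read2 comes from the opposite strand, so it is
    reverse-complemented before looking for a suffix/prefix overlap.
    """
    r2 = reverse_complement(read2)
    best = None  # (mismatch_ratio, overlap_length)
    longest = min(len(read1), len(r2))
    # Try longer overlaps first; ties in mismatch ratio keep the longer one.
    for ov in range(longest, min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-ov:], r2[:ov]))
        ratio = mismatches / ov
        if ratio <= max_mismatch_ratio and (best is None or ratio < best[0]):
            best = (ratio, ov)
    if best is None:
        return None
    # Keep read1 whole and append the part of r2 beyond the overlap.
    return read1 + r2[best[1]:]


# Usage: a 32-base fragment sequenced with two 20-base reads (8-base overlap).
fragment = "ACGTACGGATTCAGGCTTAACCGGTTAGCATG"
r1 = fragment[:20]
r2 = reverse_complement(fragment[-20:])
merged = merge_pair(r1, r2, min_overlap=5)
```

A real merger would also weigh base qualities when resolving mismatches inside the overlap; this sketch simply keeps the bases of the first read.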

VSEARCH: a versatile open source tool for metagenomics

Torbjørn Rognes¹,², Tomas Flouri³,⁴, Ben Nichols⁵, Christopher Quince⁵,⁶, Frédéric Mahé⁷

Oslo University Hospital¹, University of Oslo², Karlsruhe Institute of Technology³, Heidelberg Institute for Theoretical Studies⁴, University of Glasgow⁵, University of Warwick⁶, Kaiserslautern University of Technology⁷

18 Oct 2016-PeerJ

TL;DR: VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-ends read merging.


Abstract: Background: VSEARCH is an open source and free of charge multithreaded 64-bit tool for processing and preparing metagenomics, genomics and population genomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar, 2010) for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use. Methods: When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences in order to quickly identify similar sequences, a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads. Results: VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or de novo), dereplication (full length or prefix), pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH also includes commands for FASTQ file processing, i.e., format detection, filtering, read quality statistics, and merging of paired reads. Furthermore, VSEARCH extends functionality with several new commands and improvements, including shuffling, rereplication, masking of low-complexity sequences with the well-known DUST algorithm, a choice among different similarity definitions, and FASTQ file format conversion. 
VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-ends read merging. VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end reads merging and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under either the BSD 2-clause license or the GNU General Public License version 3.0. Discussion: VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.


5,850 citations

Cites methods from "ALLPATHS 2: small genomes assembled..."

...We used whole genome sequencing data from Staphylococcus aureus subspecies aureus strain USA 300 TCH 1516 sequenced by MacCallum et al. (2009) and retrieved from the GAGE-B repository (http://ccb.jhu.edu/gage_b/)....
[...]

PEAR: a fast and accurate Illumina Paired-End reAd mergeR

Jiajie Zhang¹, Kassian Kobert¹, Tomáš Flouri¹, Alexandros Stamatakis¹

Heidelberg Institute for Theoretical Studies¹

01 Mar 2014-Bioinformatics

TL;DR: The PEAR software for merging raw Illumina paired-end reads from target fragments of varying length evaluates all possible paired-end read overlaps, does not require the target fragment size as input, and implements a statistical test for minimizing false-positive results.


Abstract: Motivation: The Illumina paired-end sequencing technology can generate reads from both ends of target DNA fragments, which can subsequently be merged to increase the overall read length. There already exist tools for merging these paired-end reads when the target fragments are equally long. However, when fragment lengths vary and, in particular, when either the fragment size is shorter than a single-end read, or longer than twice the size of a single-end read, most state-of-the-art mergers fail to generate reliable results. Therefore, a robust tool is needed to merge paired-end reads that exhibit varying overlap lengths because of varying target fragment lengths. Results: We present the PEAR software for merging raw Illumina paired-end reads from target fragments of varying length. The program evaluates all possible paired-end read overlaps and does not require the target fragment size as input. It also implements a statistical test for minimizing false-positive results. Tests on simulated and empirical data show that PEAR consistently generates highly accurate merged paired-end reads. A highly optimized implementation allows for merging millions of paired-end reads within a few minutes on a standard desktop computer. On multi-core architectures, the parallel version of PEAR shows linear speedups compared with the sequential version of PEAR. Availability and implementation: PEAR is implemented in C and uses POSIX threads. It is freely available at http://www.exelixis-lab.org/web/software/pear.


3,270 citations

Cites methods from "ALLPATHS 2: small genomes assembled..."

...…overlap and DNA fragment sizes as well as the following two empirical datasets: (1) deep sequencing data of the Staphylococcus aureus genome by MacCallum et al. (2009), (2) reads generated from paired-end sequencing of a known single sequence (template) used by Masella et al. (2012) to test…...
[...]

High-quality draft assemblies of mammalian genomes from massively parallel sequence data

Sante Gnerre¹, Iain MacCallum, Dariusz Przybylski, Filipe J. Ribeiro, Joshua N. Burton, Bruce J. Walker, Ted Sharpe, Giles Hall, Terrance Shea, Sean M. Sykes, Aaron M. Berlin, Daniel Aird, Maura Costello, Riza M. Daza, Louise Williams, Robert Nicol, Andreas Gnirke, Chad Nusbaum, Eric S. Lander, David B. Jaffe

Broad Institute¹

25 Jan 2011-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: ALLPATHS-LG, an algorithm for genome assembly, is applied to massively parallel DNA sequence data from the human and mouse genomes generated on the Illumina platform; the resulting draft assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome.


Abstract: Massively parallel DNA sequencing technologies are revolutionizing genomics by making it possible to generate billions of relatively short (~100-base) sequence reads at very low cost. Whereas such data can be readily used for a wide range of biomedical applications, it has proven difficult to use them to generate high-quality de novo genome assemblies of large, repeat-rich vertebrate genomes. To date, the genome assemblies generated from such data have fallen far short of those obtained with the older (but much more expensive) capillary-based sequencing approach. Here, we report the development of an algorithm for genome assembly, ALLPATHS-LG, and its application to massively parallel DNA sequence data from the human and mouse genomes, generated on the Illumina platform. The resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome. In particular, the base accuracy is high (≥99.95%) and the scaffold sizes (N50 size = 11.5 Mb for human and 7.2 Mb for mouse) approach those obtained with capillary-based sequencing. The combination of improved sequencing technology and improved computational methods should now make it possible to increase dramatically the de novo sequencing of large genomes. The ALLPATHS-LG program is available at http://www.broadinstitute.org/science/programs/genome-biology/crd.


1,616 citations

Cites background or methods from "ALLPATHS 2: small genomes assembled..."

...We developed several laboratory techniques for making the libraries (see SI Materials and Methods for details): (i) For fragments, we adapted existing protocols with the goal of improving the representation of high GC-content DNA; (ii) for short jumps (∼3 kb), we used the Illumina protocol (6); (iii) for long jumps (∼6 kb), we used a protocol that we had previously developed, on the basis of a protocol for the SOLiD sequencing platform that involves circularization and EcoP15I digestion (7, 9); and (iv) for Fosmid jumps (∼40 kb), we developed two methodologies, “ShARC” and “Fosill” (described in SI Materials and Methods)....
[...]
...For this purpose, we made extensive improvements to our previous program ALLPATHS (9, 16), which can routinely assemble small genomes....
[...]
...Scaffold accuracy: Validity at 100 kb (9): We report the probability that two 100-base sequences in the assembly, separated by 100 kb, and also present in the reference, have the same orientation and are separated by 100 kb ± 10%....
[...]
...In practice, however, recalcitrant sequence contexts (including those with low and high GC content) do cause low coverage (9, 18), sometimes even to zero....
[...]

Assembly algorithms for next-generation sequencing data.

Jason R. Miller¹, Sergey Koren¹, Granger G. Sutton¹

J. Craig Venter Institute¹

01 Jun 2010-Genomics

TL;DR: This review summarizes and compares the published descriptions of the packages SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo, contrasting the two standard assembly methods: the de Bruijn graph approach and the overlap/layout/consensus approach.


1,176 citations

Cites background from "ALLPATHS 2: small genomes assembled..."

...It was published with results on simulated data [57] and revised for real data [58]....
[...]


References


Velvet: Algorithms for de novo short read assembly using de Bruijn graphs

Daniel R. Zerbino¹, Ewan Birney¹

European Bioinformatics Institute¹

01 May 2008-Genome Research

TL;DR: Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies. On real Solexa data sets without read pairs, its contig lengths were in close agreement with simulated results.


Abstract: We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.


9,389 citations

"ALLPATHS 2: small genomes assembled..." refers background or methods in this paper

...To understand how the ALLPATHS assemblies would compare to assemblies produced by existing software, we also assembled the identical datasets with Velvet [12] and EULER-SR [9,14], using standardized arguments for each assembler applied to all five genomes....
[...]
...Recent work has begun to explore the possibilities of short read assembly [6-14], but high-quality assembly from experimentally generated paired reads has not been demonstrated, even for small genomes....
[...]
...We also ran the assembly programs Velvet [12] and EULER-SR [9,14] on the same data sets and provide a side-by-side comparison....
[...]
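
The Velvet abstract above characterizes a de Bruijn graph as a compact representation based on short words (k-mers). As a rough illustration only, and not Velvet's data structures or algorithms, the Python sketch below builds such a graph from reads (nodes are (k-1)-mers, each k-mer contributes one edge) and reads off a contig by walking an unbranched path. The function names `build_de_bruijn` and `walk_unbranched` are hypothetical, and the sketch ignores sequencing errors, reverse strands, and read pairs.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from `start` while the path has no branch or cycle."""
    # Count incoming edges so the walk stops where two paths converge.
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1

    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if indegree[nxt] != 1 or nxt in seen:
            break
        contig += nxt[-1]  # each step adds exactly one new base
        seen.add(nxt)
        node = nxt
    return contig


# Usage: two overlapping error-free reads from the sequence ACGTTGCATC.
reads = ["ACGTTGCA", "GTTGCATC"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ACG")
```

Because identical k-mers from different reads collapse into the same edge, high coverage does not grow the graph, which is the property the abstract highlights for very short reads.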

Genome sequencing in microfabricated high-density picolitre reactors

Marcel Margulies, Michael Egholm, William E. Altman, Said Attiya, Joel S. Bader, Lisa A. Bemben, Jan Berka, Michael S. Braverman, Yi-Ju Chen, Zhoutao Chen, Scott Dewell, Lei Du, J. M. Fierro, Xavier V. Gomes, Brian C. Godwin, Wen He, Scott Edward Helgesen, Chun Heen Ho, Gerard P. Irzyk, Szilveszter C. Jando, Maria L. I. Alenquer, Thomas P. Jarvie, Kshama B. Jirage, Jong-Bum Kim, James R. Knight, Janna R. Lanza, John H. Leamon, Steven Lefkowitz, Ming Lei, Jing Li, Kenton Lohman, Hong Lu, Vinod Makhijani, Keith Mcdade, Michael P. McKenna, Eugene W. Myers¹, Elizabeth Nickerson, John Nobile, Ramona Plant, Bernard P. Puc, Michael T. Ronan, George T. Roth, Gary J. Sarkis, Jan Fredrik Simons, John Simpson, Maithreyan Srinivasan, Karrie R. Tartaro, Alexander Tomasz², Kari A. Vogt, Greg A. Volkmer, Shally H. Wang, Yong Wang, Michael P. Weiner³, Pengguang Yu, Richard F. Begley, Jonathan M. Rothberg

University of California, Berkeley¹, Rockefeller University², Rothberg Institute For Childhood Diseases³

15 Sep 2005-Nature

TL;DR: Describes a scalable, highly parallel sequencing system with raw throughput significantly greater than that of state-of-the-art capillary electrophoresis instruments, demonstrated by shotgun sequencing and de novo assembly of the Mycoplasma genitalium genome with 96% coverage at 99.96% accuracy in one run of the machine.


Abstract: The proliferation of large-scale DNA-sequencing projects in recent years has driven a search for alternative methods to reduce time and cost. Here we describe a scalable, highly parallel sequencing system with raw throughput significantly greater than that of state-of-the-art capillary electrophoresis instruments. The apparatus uses a novel fibre-optic slide of individual wells and is able to sequence 25 million bases, at 99% or better accuracy, in one four-hour run. To achieve an approximately 100-fold increase in throughput over current Sanger sequencing technology, we have developed an emulsion method for DNA amplification and an instrument for sequencing by synthesis using a pyrosequencing protocol optimized for solid support and picolitre-scale volumes. Here we show the utility, throughput, accuracy and robustness of this system by shotgun sequencing and de novo assembly of the Mycoplasma genitalium genome with 96% coverage at 99.96% accuracy in one run of the machine.


8,434 citations

Base-calling of automated sequencer traces using Phred. I. Accuracy assessment

Brent Ewing¹, LaDeana W. Hillier², Michael C. Wendl², Philip Green¹

University of Washington¹, Washington University in St. Louis²

01 Mar 1998-Genome Research

TL;DR: Describes phred, a base-calling program for automated sequencer traces with improved accuracy. It achieved a lower error rate than the ABI software, averaging 40%-50% fewer errors in the data sets examined, independent of position in read, machine running conditions, or sequencing chemistry.


Abstract: The availability of massive amounts of DNA sequence information has begun to revolutionize the practice of biology. As a result, current large-scale sequencing output, while impressive, is not adequate to keep pace with growing demand and, in particular, is far short of what will be required to obtain the 3-billion-base human genome sequence by the target date of 2005. To reach this goal, improved automation will be essential, and it is particularly important that human involvement in sequence data processing be significantly reduced or eliminated. Progress in this respect will require both improved accuracy of the data processing software and reliable accuracy measures to reduce the need for human involvement in error correction and make human review more efficient. Here, we describe one step toward that goal: a base-calling program for automated sequencer traces, phred, with improved accuracy. phred appears to be the first base-calling program to achieve a lower error rate than the ABI software, averaging 40%-50% fewer errors in the data sets examined independent of position in read, machine running conditions, or sequencing chemistry.


7,627 citations

Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities

Brent Ewing¹, Philip Green¹

University of Washington¹

01 Mar 1998-Genome Research

TL;DR: The base-calling program phred is extended to estimate a probability of error for each base-call, as a function of certain parameters computed from the trace data.


Abstract: Elimination of the data processing bottleneck in high-throughput sequencing will require both improved accuracy of data processing software and reliable measures of that accuracy. We have developed and implemented in our base-calling program phred the ability to estimate a probability of error for each base-call, as a function of certain parameters computed from the trace data. These error probabilities are shown here to be valid (correspond to actual error rates) and to have high power to discriminate correct base-calls from incorrect ones, for read data collected under several different chemistries and electrophoretic conditions. They play a critical role in our assembly program phrap and our finishing program consed.


5,334 citations
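
The per-base error probabilities described in the phred abstract above are conventionally reported as quality scores on a logarithmic scale, Q = -10 log10(p). The abstract itself does not state the formula, so the following Python sketch is an illustration of that standard convention; the function names are hypothetical.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Convert a base-call error probability to a quality score Q = -10*log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Convert a quality score back to an error probability p = 10**(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def expected_errors(qualities) -> float:
    """Expected number of miscalled bases in a read, given per-base scores."""
    return sum(phred_to_error_prob(q) for q in qualities)


# Usage: a 1-in-1000 error probability corresponds to a score of about 30.
q30 = error_prob_to_phred(0.001)
```

On this scale every 10 points means a tenfold drop in error probability, which makes it easy to set thresholds such as "keep bases with Q of 20 or more" (at most a 1% chance of error).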

Accurate whole human genome sequencing using reversible terminator chemistry

David R. Bentley, Shankar Balasubramanian, Harold Swerdlow, and 198 more authors

06 Nov 2008-Nature

TL;DR: An approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost is reported, effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.


Abstract: DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.


3,802 citations

"ALLPATHS 2: small genomes assembled..." refers background or methods in this paper

...For example, a recent method [5] based on blunt end ligation rather than restriction generates jumping construct libraries of sufficient complexity for large genomes and not having a hard size limit on end reads....
[...]
...Background Recent advances in sequencing technology [1-5] have rapidly driven down the cost of DNA sequence data....
[...]
...The data for the assemblies were of three types: paired 36-base reads [5] derived from approximately 200-bp fragments, paired 26-base reads derived via a 'jumping' construction from approximately 4,000-bp fragments, and for one genome, additional unpaired 36-base reads....
[...]