The MaSuRCA genome assembler

doi:10.1093/BIOINFORMATICS/BTT476

Aleksey V. Zimin¹, Guillaume Marçais¹, Daniela Puiu¹, Michael Roberts¹, Steven L. Salzberg¹, James A. Yorke¹

Johns Hopkins University School of Medicine¹

01 Nov 2013, Bioinformatics (Oxford University Press), Vol. 29, Iss. 21, pp. 2669-2677

TL;DR: The paper describes a new hybrid assembly approach that combines the computational efficiency of de Bruijn graph methods with the flexibility of overlap-based assembly strategies, allowing variable read lengths while tolerating a significant level of sequencing error.

Abstract:

Motivation. Second-generation sequencing technologies produce high coverage of the genome by short reads at a very low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this paper we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms very large numbers of paired-end reads into a much smaller number of longer “super-reads.” The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced “mazurka”).

Results. We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two data sets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par with or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads.

Availability. MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year.

Contact. Aleksey Zimin, alekseyz@ipst.umd.edu
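The super-read idea in the abstract can be illustrated with a toy sketch. It assumes a simplified rule: a read is extended one base at a time, in both directions, for as long as exactly one k-mer seen in the data continues it, and reads that extend to the same sequence collapse into one super-read. The function names (`kmer_set`, `extend_right`, `extend_left`, `super_reads`) are hypothetical and are not MaSuRCA's code or API; error correction, reverse complements, k-mer count thresholds and mate-pair handling are all omitted.

```python
def kmer_set(reads, k):
    """Collect every k-mer that occurs in the reads (assumed error-free here)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def extend_right(seq, kmers, k):
    """Append bases while exactly one known k-mer continues the sequence."""
    visited = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    while True:
        suffix = seq[-(k - 1):]
        candidates = [suffix + b for b in "ACGT" if suffix + b in kmers]
        # Stop at a dead end, at a branch, or when a cycle would be re-entered.
        if len(candidates) != 1 or candidates[0] in visited:
            return seq
        visited.add(candidates[0])
        seq += candidates[0][-1]


def extend_left(seq, kmers, k):
    """Prepend bases while exactly one known k-mer precedes the sequence."""
    visited = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    while True:
        prefix = seq[:k - 1]
        candidates = [b + prefix for b in "ACGT" if b + prefix in kmers]
        if len(candidates) != 1 or candidates[0] in visited:
            return seq
        visited.add(candidates[0])
        seq = candidates[0][0] + seq


def super_reads(reads, k):
    """Extend every read both ways; identical extensions collapse into one."""
    kmers = kmer_set(reads, k)
    return {
        extend_left(extend_right(r, kmers, k), kmers, k)
        for r in reads
        if len(r) >= k
    }


if __name__ == "__main__":
    genome = "ATGGCGTACCTGAAT"  # made-up sequence, not real data
    reads = [genome[i:i + 8] for i in (0, 2, 4, 6, 7)]
    print(super_reads(reads, k=4))
```

In this toy input the five overlapping reads all extend to the same sequence, which is the sense in which many short reads reduce to far fewer, longer super-reads. Extension stops at any branch, so ambiguous regions are left for a later assembly stage to resolve.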
