Journal ArticleDOI

Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data.

01 Feb 2013-Bioinformatics (Oxford University Press)-Vol. 29, Iss: 3, pp 308-315
TL;DR: Musket is a multistage k-mer spectrum-based corrector for Illumina short-read data that combines three techniques: two-sided conservative correction, one-sided aggressive correction and voting-based refinement. In the authors' evaluation, Musket is consistently one of the top-performing correctors.
Abstract: Motivation: The imperfect sequence data produced by next-generation sequencing technologies have motivated the development of a number of short-read error correctors in recent years. The majority of methods focus on the correction of substitution errors, which are the dominant error source in data produced by Illumina sequencing technology. Existing tools score high in terms of either recall or precision, but not consistently high in terms of both measures. Results: In this paper, we present Musket, an efficient multistage k-mer-based corrector for Illumina short-read data. We employ the k-mer spectrum approach and introduce three correction techniques in a multistage workflow: two-sided conservative correction, one-sided aggressive correction and voting-based refinement. Our performance evaluation results, in terms of correction quality and de novo genome assembly measures, reveal that Musket is consistently one of the top performing correctors. In addition, Musket is multithreaded using a master-slave model and demonstrates superior parallel scalability compared to all other evaluated correctors as well as a highly competitive overall execution time. Availability: Musket is available at http://musket.sourceforge.net. Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de Supplementary information: available at Bioinformatics online.
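
The k-mer spectrum idea in the abstract can be illustrated with a small Python sketch. It is not Musket's implementation: the abstract only names the techniques, so the rule below is one plausible reading of "two-sided conservative correction" (change a base only when exactly one substitution makes both the k-mer ending at that base and the k-mer starting at it trusted). The function names and the `min_count` threshold are hypothetical, and bases within k-1 of either read end are left untouched for simplicity.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer that occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def two_sided_correct(read, counts, k, min_count):
    """Illustrative conservative correction (assumed rule, not Musket's code).

    A base is replaced only if exactly one alternative base makes both the
    k-mer ending at that position and the k-mer starting there trusted.
    """
    def trusted(kmer):
        return counts.get(kmer, 0) >= min_count

    bases = list(read)
    # Only positions that have a full k-mer on both sides are examined.
    for pos in range(k - 1, len(bases) - k + 1):
        left = "".join(bases[pos - k + 1:pos + 1])   # k-mer ending at pos
        right = "".join(bases[pos:pos + k])          # k-mer starting at pos
        if trusted(left) and trusted(right):
            continue
        candidates = [
            b for b in "ACGT"
            if b != bases[pos]
            and trusted(left[:-1] + b)
            and trusted(b + right[1:])
        ]
        if len(candidates) == 1:  # ambiguous or missing fixes are skipped
            bases[pos] = candidates[0]
    return "".join(bases)
```

Requiring agreement from both flanking k-mers and a unique candidate is what makes such a pass conservative: it trades recall for precision, leaving harder cases to later, more aggressive stages.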

Citations
Journal ArticleDOI
08 Sep 2016-Cell
TL;DR: The genomes and phenomes of 157 industrial Saccharomyces cerevisiae yeasts are presented to shed light on the origins, evolutionary history, and phenotypic diversity of industrial yeasts and provide a resource for further selection of superior strains.

510 citations


Cites methods from "Musket: a multistage k-mer spectrum..."

  • ...After k-mer based read correction with musket (Liu et al., 2013), reads were assembled using idba_ud (Peng et al., 2010)....

  • ...187) | Katoh and Standley, 2013 | http://mafft.cbrc.jp/alignment/software/; RRID: SCR_011811
    Musket | Liu et al., 2013 | http://musket.sourceforge.net/homepage.htm#latest
    PAL2NAL (v14) | Suyama et al., 2006 | http://www.bork.embl.de/pal2nal/
    PartitionFinder (v1....

Journal ArticleDOI
TL;DR: Rcorrector is a k-mer based method for correcting random sequencing errors in Illumina RNA-seq reads. Its accuracy is higher than or comparable to that of existing methods, including SEECER, the only other method designed for RNA-seq reads, and it is more time and memory efficient.
Abstract: Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing. We developed a k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read. Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/ .

359 citations


Cites methods from "Musket: a multistage k-mer spectrum..."

  • ...Methods in this category include Quake [5], Hammer [6], Musket [7], Bless [1], BFC [2], and Lighter [3]....

  • ...Liu Y, Schröder J, Schmidt B. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data....

  • ...These include Musket (v1....

Journal ArticleDOI
TL;DR: The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
Abstract: Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.

278 citations


Additional excerpts

  • ...Citations and software URLs of error correction tools

    Tool | Author, year | Citation | Software URL
    Acacia | (Bragg et al., 2012) | [88] | http://sourceforge.net/projects/acaciaerrorcorr/
    AHA | (Bashir et al., 2012) | [96] | https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/AHA
    ALLPATHS | (Butler et al., 2008) | [69] | http://www.broadinstitute.org/science/programs/genome-biology/computational-rd/computational-research-and-development
    ALLPATHS-LG | (Gnerre et al., 2011) | [77] | http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=12
    AmpliconNoise | (Quince et al., 2011) | [91] | https://code.google.com/p/ampliconnoise/downloads/list
    ARACHNE | (Batzoglou, 2002) | [40] | http://www.broadinstitute.org/science/programs/genome-biology/computational-rd/computational-research-and-development
    AutoEdit | (Gajer, 2004) | [80] | not available (any more)
    BayesHammer | (Nikolenko et al., 2013) | [60] | http://bioinf.spbau.ru/en/spades
    BFC | (Li, 2015) | [71] | https://github.com/lh3/bfc
    BLESS | (Heo et al., 2014) | [104] | http://sourceforge.net/projects/bless-ec/
    Bloocoo | (Drezen et al., 2014) | [105] | https://gatb.inria.fr/gatb/binaries/
    Blue | (Greenfield et al., 2014) | [64] | http://www.bioinformatics.csiro.au/blue
    Coral | (Salmela and Schroder, 2011) | [61] | http://www.cs.helsinki.fi/u/lmsalmel/coral/
    CUDA-EC | (Shi et al., 2010a, 2010b) | [106, 107] | http://sourceforge.net/projects/cuda-ec/
    DecGPU | (Liu et al., 2011) | [108] | http://decgpu.sourceforge.net/homepage.htm#latest
    DeNoiser | (Reeder and Knight, 2010) | [90] | http://www.microbio.me/denoiser/
    ECHO | (Kao et al., 2011) | [62] | http://uc-echo.sourceforge.net/
    ECTools | (Lee et al., 2014) | [99] | https://github.com/jgurtowski/ectools
    EDAR | (Zhao et al., 2010) | [75] | not available
    EULER | (Pevzner et al., 2001) | [45] | http://cseweb.ucsd.edu/~ppevzner/software.html
    EULER | (Chaisson et al., 2004) | [65] | http://cseweb.ucsd.edu/~ppevzner/software.html
    EULER-SR | (Chaisson and Pevzner, 2008) | [68] | http://cseweb.ucsd.edu/~ppevzner/software.html
    EULER-USR | (Chaisson et al., 2009) | [83] | http://cseweb.ucsd.edu/~ppevzner/software.html
    fermi | (Li, 2012) | [55] | https://github.com/lh3/fermi
    Fiona | (Schulz et al., 2014) | [52] | http://www.seqan.de/projects/fiona/
    FreClu | (Qu et al., 2009) | [21] | http://mlab.cb.k.u-tokyo.ac.jp/~quwei/DeNovoShortReadclust/
    Hammer | (Medvedev et al., 2011) | [59] | http://bix.ucsd.edu/projects/hammer/
    HECTOR | (Wirawan et al., 2014) | [92] | http://hector454.sourceforge.net/
    HiTEC | (Ilie et al., 2011) | [49] | http://www.csd.uwo.ca/~ilie/HiTEC/
    Hybrid-SHREC | (Salmela, 2010) | [47] | http://www.cs.helsinki.fi/u/lmsalmel/hybrid-shrec/
    KEC | (Skums et al., 2012) | [93] | http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correctionalgorithm
    Lighter | (Song et al., 2014) | [109] | https://github.com/mourisl/Lighter
    LoRDEC | (Salmela and Rivals, 2014) | [85] | http://atgc.lirmm.fr/lordec/
    LSC | (Au et al., 2012) | [97] | http://www.healthcare.uiowa.edu/labs/au/LSC/
    MisEd | (Tammi, 2003) | [86] | not available
    Musket | (Liu et al., 2013) | [110] | http://musket.sourceforge.net/homepage.htm#latest
    MyHybrid | (Zhao et al., 2011a) | [43] | not available
    N-corr | (Shin and Park, 2014) | [111] | http://nar.oxfordjournals.org/content/suppl/2014/01/27/gku070.DC1/nar-00508met-k-2013-File010.docx
    Nanocorr | (Goodwin et al., 2015) | [101] | https://github.com/jgurtowski/nanocorr
    pacbio_qc | (Jiao, 2013) | [28] | http://david.abcc.ncifcrf.gov/manuscripts/pacbio_qc/
    PBcR | (Koren et al., 2012, 2013) | [26, 95] | http://cbcb.umd.edu/software/PBcR/
    Potts model | (Aita et al., 2013) | [63] | not available
    PREMIER | (Yin et al., 2013) | [79] | not available
    proovread | (Hackl et al., 2014) | [98] | http://proovread.bioapps.biozentrum.uni-wuerzburg.de/
    PSAEC | (Zhao et al., 2011b) | [51] | not available
    PyroNoise | (Quince et al., 2009) | [89] | http://userweb.eng.gla.ac.uk/christopher.quince/Software/PyroNoise.html
    Quake | (Kelley et al., 2010) | [70] | http://www.cbcb.umd.edu/software/quake/
    QuorUM | (Marçais et al., 2013) | [76] | http://www.genome.umd.edu/quorum.html
    RACER | (Ilie and Molnar, 2013) | [112] | http://www.csd.uwo.ca/~ilie/RACER/
    RECOUNT | (Wijaya et al., 2009) | [81] | not available (any more)
    REDEEM | (Yang et al., 2011) | [113] | http://aluru-sun.ece.iastate.edu/doku.php?id=redeem
    Reptile | (Yang et al., 2010) | [58] | http://aluru-sun.ece.iastate.edu/doku.php?id=reptile
    SEECER | (Le et al., 2013) | [78] | http://sb.cs.cmu.edu/seecer/
    SGA | (Simpson and Durbin, 2012) | [44] | https://github.com/jts/sga
    ShoRAH | (Zagordi et al., 2010a, 2011) | [41, 42] | http://www.bsse.ethz.ch/cbg/software/shorah
    SHREC | (Schroder et al., 2009) | [46] | http://sourceforge.net/projects/shrec-ec/
    SleepEC | (Sleep et al., 2013) | [73] | https://ep.unisa.edu.au/view/view.php?id=46870
    SOAPdenovo | (Li et al., 2010) | [66] | http://soap.genomics.org.cn/soapdenovo.html
    SOAPdenovo2 | (Luo et al., 2012) | [67] | http://soap.genomics.org.cn/soapdenovo.html
    SysCall | (Meacham et al., 2011) | [82] | http://bio.math.berkeley.edu/SysCall/
    Trowel | (Lim et al., 2014) | [74] | http://sourceforge.net/projects/trowel-ec/

    grouping strategies can be used to generate a multiple sequence alignment (MSA) of reads, thus generating the positional pileup....

Journal ArticleDOI
TL;DR: Lighter is a fast, memory-efficient tool for correcting sequencing errors that uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct; it is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
Abstract: Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.

222 citations


Cites methods or result from "Musket: a multistage k-mer spectrum..."

  • ...More recent tools, such as Musket [14] and BLESS [15], use a combination of Bloom filters and hash tables to count k-mers or to represent the set of solid k-mers....

  • ...Lighter and Musket perform best overall....

  • ...3 [11], Musket v1....

  • ...By counting the multiplicity of the k-mers overlapping heterozygous positions, we conclude that Musket would classify 214,458 (99.949...

  • ...Musket and BLESS both infer a threshold for the multiplicity of solid k-mers....

Journal ArticleDOI
Claudia Knief
TL;DR: Different applications of NGS technologies are exemplified by selected research articles that address the biology of the plant-associated microbiota to demonstrate the worth of the new methods.
Abstract: Next generation sequencing (NGS) technologies have impressively accelerated research in biological science in recent years by enabling the production of large volumes of sequence data at a drastically lower price per base, compared to traditional sequencing methods. The recent and ongoing developments in the field allow addressing research questions in plant-microbe biology that were not conceivable just a few years ago. The present review provides an overview of NGS technologies and their usefulness for the analysis of microorganisms that live in association with plants. Possible limitations of the different sequencing systems, in particular sources of errors and bias, are critically discussed, and methods that help to overcome these shortcomings are described. A focus will be on the application of NGS methods in metagenomic studies, including the analysis of microbial communities by amplicon sequencing, which can be considered as a targeted metagenomic approach. Different applications of NGS technologies are exemplified by selected research articles that address the biology of the plant-associated microbiota to demonstrate the worth of the new methods.

210 citations


Additional excerpts

  • ...Several tools (e.g., Coral, HiTEC, Musket, Quake, RACER, Reptile, or SHREC) have been developed for this purpose, in particular for the correction of substitution errors in Illumina data (Ilie and Molnar, 2013; Liu et al., 2013; Yang et al., 2013)....

References
Journal ArticleDOI
TL;DR: Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies; on real Solexa data sets without read pairs, its contig lengths are in close agreement with simulated results.
Abstract: We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.
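
The de Bruijn graph mentioned in this abstract can be sketched in a few lines of Python. This is the generic textbook construction (nodes are (k-1)-mers, each k-mer contributes one edge), not Velvet's actual data structure; the function name is hypothetical, and reverse complements and coverage counts are ignored.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer's prefix and suffix overlap by k-2 bases.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)
```

Because identical k-mers from different reads collapse onto the same edge, the graph size depends on the number of distinct k-mers rather than on read depth, which is why the representation suits high-coverage short-read data.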

9,389 citations


"Musket: a multistage k-mer spectrum..." refers methods in this paper

  • ...Corresponding assemblers include Velvet (Zerbino and Birney 2008), ALLPATHS (Butler et al. 2008), ABySS (Simpson et al. 2009), ALLPATHS-LG (Gnerre et al. 2010), SOAPdenovo (Li et al. 2010b) and PASHA (Liu et al. 2011b)....

Journal ArticleDOI
TL;DR: This article describes phred, a base-calling program for automated sequencer traces with improved accuracy. phred appears to be the first base-calling program to achieve a lower error rate than the ABI software, averaging 40%-50% fewer errors in the data sets examined, independent of position in read, machine running conditions, or sequencing chemistry.
Abstract: The availability of massive amounts of DNA sequence information has begun to revolutionize the practice of biology. As a result, current large-scale sequencing output, while impressive, is not adequate to keep pace with growing demand and, in particular, is far short of what will be required to obtain the 3-billion-base human genome sequence by the target date of 2005. To reach this goal, improved automation will be essential, and it is particularly important that human involvement in sequence data processing be significantly reduced or eliminated. Progress in this respect will require both improved accuracy of the data processing software and reliable accuracy measures to reduce the need for human involvement in error correction and make human review more efficient. Here, we describe one step toward that goal: a base-calling program for automated sequencer traces, phred, with improved accuracy. phred appears to be the first base-calling program to achieve a lower error rate than the ABI software, averaging 40%-50% fewer errors in the data sets examined independent of position in read, machine running conditions, or sequencing chemistry.

7,627 citations

Journal ArticleDOI
TL;DR: Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
Abstract: In this paper trade-offs among certain computational factors in hash coding are analyzed. The paradigm problem considered is that of testing a series of messages one-by-one for membership in a given set of messages. Two new hash-coding methods are examined and compared with a particular conventional hash-coding method. The computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember of the given set (reject time), and an allowable error frequency. The new methods are intended to reduce the amount of space required to contain the hash-coded information from that associated with conventional methods. The reduction in space is accomplished by exploiting the possibility that a small fraction of errors of commission may be tolerable in some applications, in particular, applications in which a large amount of data is involved and a core resident hash area is consequently not feasible using conventional methods. In such applications, it is envisaged that overall performance could be improved by using a smaller core resident hash area in conjunction with the new methods and, when necessary, by using some secondary and perhaps time-consuming test to “catch” the small fraction of errors associated with the new methods. An example is discussed which illustrates possible areas of application for the new methods. Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
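
The space/error trade-off described here is often summarized with the standard textbook approximation for the bit-array form of these methods, now known as a Bloom filter. The symbols are assumptions that do not appear in the abstract: m bits in the hash area, n stored messages, and k hash functions assumed to be independent and uniform.

```latex
p_{\text{false positive}} \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\text{opt}} = \frac{m}{n}\,\ln 2
```

For instance, with m/n = 10 bits per stored message and k = 7, the approximation gives a false-positive rate of roughly 0.8%.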

7,390 citations


"Musket: a multistage k-mer spectrum..." refers methods in this paper

  • ...For k-mer spectrum construction, Musket counts the number of occurrences of all non-unique k-mers using a combination of a Bloom filter (Bloom 1970) and a hash table....
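
The excerpt above does not give implementation details, so the following Python sketch shows only the general idea of pairing a Bloom filter with a hash table: the first sighting of a k-mer is recorded in the filter alone, and a k-mer is promoted to the exact table when it is seen again. The class, function names and default sizes are hypothetical, reverse complements are ignored, and a Bloom false positive can let a truly unique k-mer into the table with a count of 2.

```python
import hashlib


class BloomFilter:
    """Minimal bit-array Bloom filter (illustrative, not Musket's code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def count_nonunique_kmers(reads, k, num_bits=1 << 20, num_hashes=3):
    """Count k-mers seen at least twice; singletons never enter the table."""
    bloom = BloomFilter(num_bits, num_hashes)
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
            elif bloom.add(kmer):
                # Second sighting: the first one lives only in the filter.
                counts[kmer] = 2
    return counts
```

The appeal of the design is memory: most erroneous k-mers occur once, so keeping them out of the hash table leaves exact counters only for k-mers that could plausibly be trusted.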

Journal ArticleDOI
TL;DR: The ability to estimate a probability of error for each base-call, as a function of certain parameters computed from the trace data, is developed and implemented in the base-calling program.
Abstract: Elimination of the data processing bottleneck in high-throughput sequencing will require both improved accuracy of data processing software and reliable measures of that accuracy. We have developed and implemented in our base-calling program phred the ability to estimate a probability of error for each base-call, as a function of certain parameters computed from the trace data. These error probabilities are shown here to be valid (correspond to actual error rates) and to have high power to discriminate correct base-calls from incorrect ones, for read data collected under several different chemistries and electrophoretic conditions. They play a critical role in our assembly program phrap and our finishing program consed.
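
The abstract does not state a formula, but per-base error probabilities of this kind are conventionally reported as Phred quality scores. Under that widely used convention, with P the estimated probability that a base-call is wrong:

```latex
Q = -10 \log_{10} P
\qquad\Longleftrightarrow\qquad
P = 10^{-Q/10}
```

For example, P = 0.001 (one expected error in 1000 calls) corresponds to Q = 30.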

5,334 citations

Journal ArticleDOI
TL;DR: ABySS (Assembly By Short Sequences), a parallelized sequence assembler, was developed and used to assemble 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc.; the resulting contigs represent 68% of the reference human genome.
Abstract: Widespread adoption of massively parallel deoxyribonucleic acid (DNA) sequencing instruments has prompted the recent development of de novo short read assembly algorithms. A common shortcoming of the available tools is their inability to efficiently assemble vast amounts of data generated from large-scale sequencing projects, such as the sequencing of individual human genomes to catalog natural genetic variation. To address this limitation, we developed ABySS (Assembly By Short Sequences), a parallelized sequence assembler. As a demonstration of the capability of our software, we assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc. Approximately 2.76 million contigs ≥100 base pairs (bp) in length were created with an N50 size of 1499 bp, representing 68% of the reference human genome. Analysis of these contigs identified polymorphic and novel sequences not present in the human reference assembly, which were validated by alignment to alternate human assemblies and to other primate genomes.
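
For readers unfamiliar with the N50 statistic quoted above, here is a short Python sketch of the common definition: the contig length L such that contigs of length at least L together cover at least half of the total assembled bases. The function name is hypothetical, and the abstract does not say whether ABySS applied any additional filtering beyond the stated 100 bp minimum.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one contig")
```

Unlike the mean or median contig length, N50 is weighted by bases, so a handful of long contigs can dominate it even when most contigs are short.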

3,483 citations


"Musket: a multistage k-mer spectrum..." refers methods in this paper

  • ...Corresponding assemblers include Velvet (Zerbino and Birney 2008), ALLPATHS (Butler et al. 2008), ABySS (Simpson et al. 2009), ALLPATHS-LG (Gnerre et al. 2010), SOAPdenovo (Li et al. 2010b) and PASHA (Liu et al. 2011b)....
