Journal Article

DSK: K-Mer Counting With Very Low Memory Usage

01 Mar 2013-Bioinformatics (Oxford University Press)-Vol. 29, Iss: 5, pp 652-653
TL;DR: This work presents a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which requires only a fixed, user-defined amount of memory and disk space. It is the first approach able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB).
Abstract: Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state-of-the-art k-mer counting methods require that a large data structure resides in memory. Such a structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed, user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 hours.
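The partition-then-count idea described in the abstract can be illustrated with a minimal Python sketch. It is not the DSK implementation: the function names, the CRC32-based partitioning hash, and the plain-text partition files are all assumptions made for illustration, and the sketch ignores reverse complements and does not bound disk usage. It shows only why memory is limited to one partition's hash table at a time.

```python
import os
import tempfile
import zlib
from collections import Counter


def iter_kmers(read, k):
    """Yield every substring of length k from a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers_partitioned(reads, k, n_partitions, min_abundance=1):
    """Count k-mers by spilling them to disk partitions (illustrative only).

    Hypothetical helper, not part of DSK. Every occurrence of a given k-mer
    hashes to the same partition, so each partition can be counted alone.
    """
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"part_{p}.txt") for p in range(n_partitions)]
        files = [open(path, "w") for path in paths]
        try:
            # Step 1: stream the reads and write each k-mer to its partition.
            for read in reads:
                for kmer in iter_kmers(read, k):
                    # zlib.crc32 is deterministic, unlike Python's str hash().
                    part = zlib.crc32(kmer.encode()) % n_partitions
                    files[part].write(kmer + "\n")
        finally:
            for f in files:
                f.close()

        # Step 2: load one partition at a time into a temporary hash table.
        for path in paths:
            table = Counter()
            with open(path) as f:
                for line in f:
                    table[line.rstrip("\n")] += 1
            # Step 3: traverse the table, optionally dropping rare k-mers.
            for kmer, count in table.items():
                if count >= min_abundance:
                    yield kmer, count


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTACG"]
    for kmer, count in sorted(count_kmers_partitioned(reads, k=3, n_partitions=4)):
        print(kmer, count)
```

Increasing `n_partitions` in this sketch shrinks the largest hash table held in memory at the cost of more files on disk, which is one simple way to picture the memory and disk trade-off the abstract mentions.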
Citations
Journal Article
TL;DR: KmerGenie constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods, then applies a fast heuristic to the histograms generated for putative k values to estimate the best possible value of k.
Abstract: Motivation: Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision. Results: We develop a fast and accurate sampling method that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies. Availability: Our tool KMERGENIE is freely available at: http://kmergenie.

544 citations

Posted Content
TL;DR: A fast and accurate sampling method is developed that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods and a fast heuristic is presented that uses the generated abundance histogram for putative k values to estimate the best possible value of k.
Abstract: Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision. We develop a fast and accurate sampling method that constructs approximate abundance histograms with a several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies. Our tool KmerGenie is freely available at: this http URL

499 citations

Journal Article
TL;DR: ABySS 2.0 is benchmarked using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual and implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements.
Abstract: The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species, between or within individuals critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics' Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.

458 citations


Cites background or methods from "DSK: K-Mer Counting With Very Low M..."

  • ...As expected given the succinct representation of the de Bruijn graph using Bloom filters, both Minia and ABySS 2.0.0 had memory footprints that were an order of magnitude smaller than other assemblers, with the exception of BCALM2, which both achieved the smallest memory footprint, by virtue of its novel partitioning strategy to constructing the de Bruijn graph, and completed the assembly in 9 hours, 8 hours of which was spent counting k-mers with DSK (Rizk et al. 2013)....


Journal Article
TL;DR: A new encoding of the de Bruijn graph is proposed that occupies an order of magnitude less space than current representations; Minia, an assembler implementing it, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.
Abstract: The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥30 GB). We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. An assembly software implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.

345 citations

Journal Article
TL;DR: This work introduces KMC3, a significant improvement of the former KMC2 algorithm, together with KMC tools for manipulating k-mer databases; the usefulness of the tools is shown on a few real problems.
Abstract: Counting all k-mers in a given dataset is a standard procedure in many bioinformatics applications. We introduce KMC3, a significant improvement of the former KMC2 algorithm together with KMC tools for manipulating k-mer databases. Usefulness of the tools is shown on a few real problems. Program is freely available at http://sun.aei.polsl.pl/REFRESH/kmc. Contact: sebastian.deorowicz@polsl.pl. Supplementary data are available at Bioinformatics online.

336 citations

References
Journal Article
TL;DR: This work proposes a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient, based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length.
Abstract: Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm. Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution. Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

2,779 citations


"DSK: K-Mer Counting With Very Low M..." refers methods in this paper

  • ...State of the art methods for k-mer counting rely on hash tables (Jellyfish; Marçais and Kingsford, 2011) and/or Bloom filters (BFCounter; Melsted and Pritchard, 2011)....


Journal Article
TL;DR: A new method is presented that identifies all the k-mers that occur more than once in a DNA sequence data set using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements.
Abstract: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction (often more than 50%) of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html
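The two-sweep scheme in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not BFCounter's code: the `BloomFilter` class, the use of salted `hashlib.blake2b` digests as hash functions, and the default sizes are all hypothetical choices. The first sweep uses the Bloom filter to flag k-mers seen at least twice (possibly with false positives); the second sweep counts the flagged k-mers exactly, so false positives end up with a count of 1 and can be dropped.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter over strings (hypothetical helper for illustration)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions from salted BLAKE2b digests.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        """Insert item; return True if it was (probably) already present."""
        already_present = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                already_present = False
                self.bits[byte] |= 1 << bit
        return already_present


def count_non_unique(reads, k, n_bits=1 << 20, n_hashes=3):
    """Return exact counts of k-mers occurring more than once.

    `reads` must be re-iterable (for example a list), because the data
    is swept twice.
    """
    bloom = BloomFilter(n_bits, n_hashes)
    candidates = set()

    # Sweep 1: a k-mer the filter has "seen before" becomes a candidate.
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if bloom.add(kmer):
                candidates.add(kmer)

    # Sweep 2: exact counts, restricted to the candidates.
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in candidates:
                counts[kmer] += 1

    # Bloom-filter false positives show up as singletons here; drop them.
    return {kmer: c for kmer, c in counts.items() if c > 1}
```

In this sketch only candidate k-mers are ever stored explicitly, so singleton k-mers cost just a few bits each in the filter, which is the source of the memory savings the abstract describes.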

279 citations


"DSK: K-Mer Counting With Very Low M..." refers methods in this paper

  • ...State of the art methods for k-mer counting rely on hash tables (Jellyfish; Marçais and Kingsford, 2011) and/or Bloom filters (BFCounter; Melsted and Pritchard, 2011)....
