DSK: K-Mer Counting With Very Low Memory Usage

doi:10.1093/BIOINFORMATICS/BTT020

Guillaume Rizk, Dominique Lavenier, Rayan Chikhi

01 Mar 2013-Bioinformatics (Oxford University Press)-Vol. 29, Iss: 5, pp 652-653

TL;DR: This work presents a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which requires only a fixed, user-defined amount of memory and disk space, and is the first approach able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB).

Abstract: Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed, user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 hours.
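
The partition-then-count idea described in the abstract can be illustrated with a minimal Python sketch. It is not the DSK implementation: the function names, the one-k-mer-per-line text files, and the CRC32-based partition choice are all assumptions made for illustration, and the sketch neither bounds disk usage nor treats a k-mer and its reverse complement as the same k-mer.

```python
import os
import tempfile
import zlib
from collections import Counter


def iter_kmers(read, k):
    """Yield every substring of length k of one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers_partitioned(reads, k, n_partitions=4, min_abundance=1):
    """Yield (kmer, count) pairs while holding one partition in memory at a time.

    Hypothetical helper: names, file layout and hash choice are illustrative.
    """
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"partition_{p}.txt") for p in range(n_partitions)]
        files = [open(path, "w") for path in paths]
        try:
            # Step 1: stream the reads and write each k-mer to the partition
            # chosen by a deterministic hash, so equal k-mers share a file.
            for read in reads:
                for kmer in iter_kmers(read, k):
                    p = zlib.crc32(kmer.encode()) % n_partitions
                    files[p].write(kmer + "\n")
        finally:
            for f in files:
                f.close()
        # Step 2: load each partition separately into a temporary hash table,
        # traverse the table to report counts, then discard it.
        for path in paths:
            table = Counter()
            with open(path) as f:
                for line in f:
                    table[line.rstrip("\n")] += 1
            for kmer, count in table.items():
                if count >= min_abundance:  # optional low-abundance filter
                    yield kmer, count
```

For example, `dict(count_kmers_partitioned(["ACGTACGT"], 3))` counts the six 3-mers of a single read. Raising `n_partitions` shrinks the largest hash table that must fit in memory, which is the memory side of the trade-off the abstract mentions.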

Citations

Journal Article•DOI•

Informed and automated k-mer size selection for genome assembly

Rayan Chikhi¹, Paul Medvedev¹•Institutions (1)

Pennsylvania State University¹

01 Jan 2014-Bioinformatics

TL;DR: KmerGenie constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods, then applies a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k.

Abstract: Motivation: Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision. Results: We develop a fast and accurate sampling method that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies. Availability: Our tool KMERGENIE is freely available at: http://kmergenie.
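
The abstract does not say how the sampling is done. As a hedged illustration only, the sketch below assumes hash-based sampling: a k-mer is kept when a deterministic hash of it is divisible by `sample_rate`, kept k-mers are counted exactly, and the histogram is scaled back up by `sample_rate`. The function name, the parameter, and the use of CRC32 are hypothetical and are not taken from KmerGenie.

```python
import zlib
from collections import Counter


def sampled_abundance_histogram(reads, k, sample_rate=1):
    """Approximate {abundance: number of distinct k-mers} from a k-mer sample.

    Assumption: roughly one distinct k-mer in `sample_rate` is kept, chosen by
    hash value, so every occurrence of a kept k-mer is still counted.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if zlib.crc32(kmer.encode()) % sample_rate == 0:
                counts[kmer] += 1
    # Histogram of the sample: how many sampled k-mers have each abundance.
    histogram = Counter(counts.values())
    # Scale up to estimate the histogram of the full k-mer set.
    return {abundance: n * sample_rate for abundance, n in sorted(histogram.items())}
```

With `sample_rate=1` nothing is discarded and the histogram is exact; larger values trade accuracy for less counting work.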

544 citations

Posted Content•

Informed and Automated k-Mer Size Selection for Genome Assembly

Rayan Chikhi¹, Paul Medvedev¹•Institutions (1)

Pennsylvania State University¹

20 Apr 2013-arXiv: Genomics

TL;DR: A fast and accurate sampling method is developed that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods and a fast heuristic is presented that uses the generated abundance histogram for putative k values to estimate the best possible value of k.

Abstract: Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision. We develop a fast and accurate sampling method that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies. Our tool KmerGenie is freely available at: this http URL

499 citations

Journal Article•DOI•

ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter.

Shaun D. Jackman¹, Benjamin P. Vandervalk¹, Hamid Mohamadi¹, Justin Chu¹, Sarah Yeo¹, S. Austin Hammond¹, Golnaz Jahesh¹, Hamza Khan¹, Lauren Coombe¹, René L. Warren¹, Inanc Birol¹•Institutions (1)

BC Cancer Agency¹

23 Feb 2017-Genome Research

TL;DR: ABySS 2.0 is benchmarked using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual and implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements.

Abstract: The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species, between or within individuals critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics' Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.

458 citations

Cites background or methods from "DSK: K-Mer Counting With Very Low M..."

...As expected given the succinct representation of the de Bruijn graph using Bloom filters, both Minia and ABySS 2.0.0 had memory footprints that were an order of magnitude smaller than other assemblers, with the exception of BCALM2, which both achieved the smallest memory footprint, by virtue of its novel partitioning strategy to constructing the de Bruijn graph, and completed the assembly in 9 hours, 8 hours of which was spent counting k-mers with DSK (Rizk et al. 2013)....

Journal Article•DOI•

Space-efficient and exact de Bruijn graph representation based on a Bloom filter

Rayan Chikhi¹, Guillaume Rizk•Institutions (1)

École normale supérieure de Cachan¹

16 Sep 2013-Algorithms for Molecular Biology

TL;DR: A new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations, is proposed; an assembler implementing it, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.

Abstract: The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥30 GB). We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. An assembly software implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.
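
One plausible reading of "a Bloom filter, with an additional structure to remove critical false positives" is sketched below. It assumes that a false positive is critical when it is a graph neighbour of a true k-mer, so that a traversal starting from true k-mers could otherwise step onto it. All names are hypothetical, and unlike a memory-efficient assembler the sketch keeps the true k-mer set in memory while building the structure.

```python
import hashlib


def bloom_positions(kmer, n_bits, n_hashes=2):
    """Bit positions for one k-mer, from salted BLAKE2b digests."""
    for seed in range(n_hashes):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=bytes([seed])).digest()
        yield int.from_bytes(digest, "big") % n_bits


def build_bloom(kmers, n_bits):
    """Bloom filter as a plain list of booleans (illustrative, not compact)."""
    bits = [False] * n_bits
    for kmer in kmers:
        for pos in bloom_positions(kmer, n_bits):
            bits[pos] = True
    return bits


def in_bloom(bits, kmer):
    return all(bits[pos] for pos in bloom_positions(kmer, len(bits)))


def neighbours(kmer):
    """The eight k-mers that overlap `kmer` by k-1 bases."""
    for base in "ACGT":
        yield kmer[1:] + base   # successor
        yield base + kmer[:-1]  # predecessor


def critical_false_positives(true_kmers, bits):
    """Neighbours of true k-mers that the Bloom filter wrongly accepts."""
    return {n for kmer in true_kmers for n in neighbours(kmer)
            if n not in true_kmers and in_bloom(bits, n)}


def is_node(kmer, bits, cfp):
    """Exact membership for any k-mer reached from a true k-mer."""
    return in_bloom(bits, kmer) and kmer not in cfp
```

Under this assumption, `is_node` answers exactly for every neighbour of a true k-mer, which is all a traversal that starts inside the graph ever asks about.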

345 citations

Journal Article•DOI•

KMC 3: counting and manipulating k-mer statistics.

Marek Kokot¹, Maciej Dlugosz¹, Sebastian Deorowicz¹•Institutions (1)

Silesian University of Technology¹

01 Sep 2017-Bioinformatics

TL;DR: This work introduces KMC3, a significant improvement of the former KMC2 algorithm, together with KMC tools for manipulating k-mer databases; the usefulness of the tools is shown on a few real problems.

Abstract: Counting all k-mers in a given dataset is a standard procedure in many bioinformatics applications. We introduce KMC3, a significant improvement of the former KMC2 algorithm together with KMC tools for manipulating k-mer databases. Usefulness of the tools is shown on a few real problems. Program is freely available at http://sun.aei.polsl.pl/REFRESH/kmc. Contact: sebastian.deorowicz@polsl.pl. Supplementary data are available at Bioinformatics online.

336 citations

References

Journal Article•DOI•

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

Guillaume Marçais¹, Carl Kingsford¹•Institutions (1)

University of Maryland, College Park¹

01 Mar 2011-Bioinformatics

TL;DR: This work proposes a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient, based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length.

Abstract: Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm. Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution. Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
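
A length bound such as "up to 31 bases" is natural when each base is packed into 2 bits, because 31 bases then occupy 62 bits and fit in one 64-bit machine word. Whether this is exactly how Jellyfish stores its keys is an assumption here; the sketch below only illustrates the packing, with hypothetical function names, and does not attempt the lock-free table itself.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack_kmer(kmer):
    """Pack a k-mer (k <= 31) into an integer, 2 bits per base."""
    if len(kmer) > 31:
        raise ValueError("this sketch only handles k <= 31")
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def unpack_kmer(value, k):
    """Inverse of pack_kmer for a known k."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[value & 3])
        value >>= 2
    return "".join(reversed(bases))
```

For instance, `pack_kmer("ACGT")` gives the bit pattern `00 01 10 11`, and `unpack_kmer(pack_kmer("ACGT"), 4)` recovers the original string. An integer key of this kind is what makes a single-word atomic update on a hash-table slot conceivable.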

2,779 citations

"DSK: K-Mer Counting With Very Low M..." refers methods in this paper

...State of the art methods for k-mer counting rely on hash tables (Jellyfish; Marçais and Kingsford, 2011) and/or Bloom filters (BFCounter; Melsted and Pritchard, 2011)....

Journal Article•DOI•

Efficient counting of k-mers in DNA sequences using a bloom filter

Páll Melsted¹, Jonathan K. Pritchard¹²•Institutions (2)

University of Chicago¹, Howard Hughes Medical Institute²

10 Aug 2011-BMC Bioinformatics

TL;DR: A new method is presented that identifies all the k-mers that occur more than once in a DNA sequence data set using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements.

Abstract: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html
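
The two-sweep scheme in this abstract can be sketched as follows. This is an illustration under stated assumptions rather than BFCounter's code: the `BloomFilter` class, its default sizes, the salted BLAKE2b hashing, and the function name are all hypothetical, and `reads` must be something that can be iterated twice (for example a list).

```python
import hashlib


class BloomFilter:
    """Small Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def count_nonunique_kmers(reads, k):
    """Exact counts of the k-mers that occur more than once."""
    seen = BloomFilter()
    candidates = {}
    # Sweep 1: a k-mer the filter has (apparently) seen before becomes a
    # candidate; singletons normally never enter the hash table.
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                candidates[kmer] = 0
            else:
                seen.add(kmer)
    # Sweep 2: count the candidates exactly.
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in candidates:
                candidates[kmer] += 1
    # Bloom false positives show up as candidates with count 1; drop them.
    return {kmer: c for kmer, c in candidates.items() if c > 1}
```

Because a Bloom filter has no false negatives, every k-mer that truly repeats becomes a candidate, and the final filter removes the few singletons that slipped in, so the returned counts are exact.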

279 citations

"DSK: K-Mer Counting With Very Low M..." refers methods in this paper

...State of the art methods for k-mer counting rely on hash tables (Jellyfish; Marçais and Kingsford, 2011) and/or Bloom filters (BFCounter; Melsted and Pritchard, 2011)....