Space-efficient and exact de Bruijn graph representation based on a Bloom filter

doi:10.1186/1748-7188-8-22

Open Access Journal Article

Rayan Chikhi, +1 more

- 16 Sep 2013 -

Algorithms for Molecular Biology

- Vol. 8, Iss: 1, pp 22-22

TLDR

A new encoding of the de Bruijn graph is proposed that occupies an order of magnitude less space than current representations. Minia, an assembler implementing this encoding, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.

Abstract:

The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥30 GB). We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. An assembly software implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.
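The idea in the abstract can be illustrated with a minimal Python sketch. It is not Minia's implementation. It assumes that a "critical false positive" is a k-mer that the Bloom filter wrongly reports as present and that is a neighbor of a true node, so it could be reached during traversal. All names (`BloomFilter`, `build_graph`, `graph_neighbors`) are hypothetical, reverse complements are ignored, and the true k-mer set is kept in memory for simplicity.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over strings (illustrative, not tuned for size)."""

    def __init__(self, num_bits, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several hash positions by salting BLAKE2b with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_neighbors(kmer):
    """The 8 possible successors and predecessors of a k-mer."""
    for base in "ACGT":
        yield kmer[1:] + base   # successor
        yield base + kmer[:-1]  # predecessor


def build_graph(reads, k, num_bits, num_hashes=3):
    """Return (bloom, critical_fp, true_nodes) for the reads' k-mers."""
    true_nodes = set()
    for read in reads:
        true_nodes |= kmers(read, k)
    bloom = BloomFilter(num_bits, num_hashes)
    for node in true_nodes:
        bloom.add(node)
    # Assumed definition: false positives adjacent to at least one true node.
    critical_fp = {
        nb
        for node in true_nodes
        for nb in candidate_neighbors(node)
        if nb in bloom and nb not in true_nodes
    }
    return bloom, critical_fp, true_nodes


def graph_neighbors(kmer, bloom, critical_fp):
    """Neighbors of a true node, answered without the true k-mer set."""
    return {
        nb for nb in candidate_neighbors(kmer)
        if nb in bloom and nb not in critical_fp
    }
```

Under these assumptions, neighbor queries that start from a true node are exact: a true neighbor always passes the Bloom filter, and any false positive reachable in one step is listed in `critical_fp`. After construction, only the filter and the (typically much smaller) critical set are needed to traverse the graph.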

Citations

Posted Content

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Dinghua Li, +4 more

- 25 Sep 2014 -

arXiv: Genomics

TL;DR: MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It avoids preprocessing steps such as partitioning and normalization, which might compromise result integrity.

Journal Article

MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices

Dinghua Li, +7 more

- 01 Jun 2016 -

Methods

TL;DR: The details of the core algorithms in MEGAHIT v0.1 are described, and the new modules that upgrade MEGAHIT to version v1.0 are shown; the new version gives better assembly quality, runs faster and uses less memory.

Journal Article

Critical Assessment of Metagenome Interpretation - A benchmark of metagenomics software

Alexander Sczyrba, +75 more

- 02 Oct 2017 -

Nature Methods

TL;DR: The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups.

Journal Article

LoRDEC: accurate and efficient long read error correction.

Leena Salmela, +1 more

- 15 Dec 2014 -

Bioinformatics

TL;DR: LoRDEC is a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph.

Posted Content

Informed and Automated k-Mer Size Selection for Genome Assembly

Rayan Chikhi, +1 more

- 20 Apr 2013 -

arXiv: Genomics

TL;DR: A fast and accurate sampling method is developed that constructs approximate abundance histograms with a performance improvement of several orders of magnitude over traditional methods. A fast heuristic is also presented that uses the abundance histograms generated for putative k values to estimate the best possible value of k.

References

Journal Article

Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Manfred Grabherr, +22 more

- 01 Jul 2011 -

Nature Biotechnology

TL;DR: The Trinity method for de novo assembly of full-length transcripts is presented and evaluated on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available, providing a unified solution for transcriptome reconstruction in any sample.

Journal Article

ABySS: A parallel assembler for short read sequence data

Jared T. Simpson, +5 more

- 01 Jun 2009 -

Genome Research

TL;DR: ABySS (Assembly By Short Sequences), a parallelized sequence assembler, was developed and used to assemble 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc.; the assembly represents 68% of the reference human genome.

Journal Article

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

Guillaume Marçais, +1 more

- 01 Mar 2011 -

Bioinformatics

TL;DR: This work proposes Jellyfish, a new k-mer counting algorithm and associated implementation that is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length.

Journal Article

De novo assembly of human genomes with massively parallel short read sequencing

Ruiqiang Li, +13 more

- 01 Feb 2010 -

Genome Research

TL;DR: The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.

Journal Article

Assembly algorithms for next-generation sequencing data.

Jason R. Miller, +2 more

- 01 Jun 2010 -

Genomics

TL;DR: This review summarizes the published descriptions of the packages SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo, and uses them to compare the two standard approaches to assembly: the de Bruijn graph approach and the overlap/layout/consensus approach.
