Open Access, Journal Article

Space-efficient and exact de Bruijn graph representation based on a Bloom filter

TLDR
A new encoding of the de Bruijn graph is proposed that occupies an order of magnitude less space than current representations. Minia, an assembler implementing this encoding, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.
Abstract
The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥30 GB). We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. An assembly software implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.
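
The encoding described in the abstract can be illustrated with a small sketch. It is not Minia's implementation: the class and function names are hypothetical, reverse complements are ignored, a plain Python set stands in for the "additional structure", and critical false positives are taken here to mean Bloom filter false positives that are neighbors of a true k-mer. The exact k-mer set is used only during construction and is not kept afterwards.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting BLAKE2b with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def kmers(reads, k):
    """Set of all k-mers occurring in the reads (the true graph nodes)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def neighbors(kmer):
    """All possible successors and predecessors of a k-mer."""
    for base in "ACGT":
        yield kmer[1:] + base
        yield base + kmer[:-1]


class BloomDeBruijn:
    """Bloom filter plus a set of critical false positives (assumed definition)."""

    def __init__(self, reads, k, n_bits=1024, n_hashes=3):
        true_nodes = kmers(reads, k)
        self.bloom = BloomFilter(n_bits, n_hashes)
        for km in true_nodes:
            self.bloom.add(km)
        # False positives reachable in one step from a true node.
        self.cfp = {
            nb
            for km in true_nodes
            for nb in neighbors(km)
            if nb in self.bloom and nb not in true_nodes
        }
        # true_nodes goes out of scope here; only bloom and cfp are retained.

    def contains(self, kmer):
        return kmer in self.bloom and kmer not in self.cfp

    def successors(self, kmer):
        candidates = (kmer[1:] + base for base in "ACGT")
        return [c for c in candidates if self.contains(c)]


# Usage: a deliberately tiny filter, so false positives are likely.
reads = ["ACGTACGGTCA", "GGTCATTGAC"]
graph = BloomDeBruijn(reads, k=4, n_bits=64, n_hashes=2)
print(graph.successors("ACGT"))
```

Under these assumptions, a traversal that starts from a true k-mer and only follows `successors` never steps onto a false positive, because every false positive adjacent to a true node is listed in `cfp`. Queries for arbitrary k-mers far from the graph can still return false positives, which is why the sketch is exact only for traversal.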


Citations
Posted Content

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

TL;DR: MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It avoids preprocessing steps such as partitioning and normalization, which might compromise result integrity.
Journal Article

MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices

TL;DR: The core algorithms of MEGAHIT v0.1 are described in detail, along with the new modules that upgrade MEGAHIT to v1.0, which gives better assembly quality, runs faster and uses less memory.
Journal Article

Critical Assessment of Metagenome Interpretation - A benchmark of metagenomics software

Alexander Sczyrba, +75 more - 02 Oct 2017
TL;DR: The Critical Assessment of Metagenome Interpretation (CAMI) challenge engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups.
Journal Article

LoRDEC: accurate and efficient long read error correction.

TL;DR: LoRDEC is a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph.
Posted Content

Informed and Automated k-Mer Size Selection for Genome Assembly

TL;DR: A fast and accurate sampling method is developed that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods and a fast heuristic is presented that uses the generated abundance histogram for putative k values to estimate the best possible value of k.
References
Journal Article

ABySS: A parallel assembler for short read sequence data

TL;DR: ABySS (Assembly By Short Sequences), a parallelized sequence assembler, was developed and assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc, representing 68% of the reference human genome.
Journal Article

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

TL;DR: This work proposes a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient, based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length.
Journal Article

De novo assembly of human genomes with massively parallel short read sequencing

TL;DR: The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.
Journal Article

Assembly algorithms for next-generation sequencing data.

TL;DR: This review summarizes and compares the published descriptions of the packages SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo, contrasting the two standard approaches to assembly: the de Bruijn graph approach and the overlap/layout/consensus approach.