Showing papers on "De Bruijn graph" published in 2011


Journal ArticleDOI
TL;DR: A mathematical concept known as a de Bruijn graph turns the formidable challenge of assembling a contiguous genome from billions of short sequencing reads into a tractable computational problem.
Abstract: A mathematical concept known as a de Bruijn graph turns the formidable challenge of assembling a contiguous genome from billions of short sequencing reads into a tractable computational problem.

623 citations
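
To make the idea in this abstract concrete, here is a minimal sketch of de Bruijn graph assembly on error-free toy data. It is not taken from the paper: the function names (`de_bruijn_edges`, `eulerian_path`, `assemble`) are hypothetical, the toy genome has no repeated (k-1)-mers, and the graph is assumed to contain an Eulerian path. Nodes are (k-1)-mers, each distinct k-mer is one edge, and the genome is read off an Eulerian path found with Hierholzer's algorithm.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to its (k-1)-mer suffixes, one edge per distinct k-mer."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph actually has an Eulerian path."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for v in successors:
            in_deg[v] += 1
    # Start at a node with one more outgoing than incoming edge, if there is one.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1), next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def assemble(reads, k):
    """Spell the sequence along the Eulerian path: first node plus last base of each later node."""
    path = eulerian_path(de_bruijn_edges(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])

genome = "ATGGCGTGCA"                                      # toy genome (assumption)
reads = [genome[i:i + 5] for i in range(len(genome) - 4)]  # error-free 5-base reads
assembled = assemble(reads, k=4)
```

Real assemblers add error correction, repeat handling, and reverse-complement awareness on top of this skeleton.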


Proceedings ArticleDOI
01 Aug 2011
TL;DR: On simulated datasets, MetaVelvet generated higher N50 scores and smaller chimeric scaffolds than any of the compared single-genome assemblers, produced scaffolds as high in quality as separate Velvet assemblies of reads from isolated species, and reconstructed even relatively low-coverage genome sequences as scaffolds.
Abstract: Motivation: An important step of "metagenomics" analysis is the assembly of multiple genomes from mixed sequence reads of multiple species in a microbial community. Most conventional pipelines employ a single-genome assembler with carefully optimized parameters and post-process the resulting scaffolds to correct assembly errors. Limitations of the use of a single-genome assembler for de novo metagenome assembly are that highly conserved sequences shared between different species often cause chimeric contigs, and sequences of highly abundant species are likely to be misidentified as repeats in a single genome, resulting in a number of small fragmented scaffolds. The metagenome assembly problem becomes harder when assembling from very short sequence reads. Method: We modified and extended a single-genome, de Bruijn-graph based assembler for short reads, known as "Velvet" [27], into a metagenome assembler called "MetaVelvet" for mixed short reads of multiple species. Our fundamental ideas are first to decompose the de Bruijn graph constructed from mixed short reads into individual sub-graphs, and second to build scaffolds from every decomposed de Bruijn sub-graph as an isolated species genome. We make use of two features, graph connectivity and coverage (abundance) difference, for the decomposition of the de Bruijn graph. Results: On simulated datasets, MetaVelvet generated higher N50 scores and smaller chimeric scaffolds than any of the compared single-genome assemblers, produced scaffolds as high in quality as separate Velvet assemblies of reads from isolated species, and reconstructed even relatively low-coverage genome sequences as scaffolds. On a real dataset of human gut microbial reads, MetaVelvet produced longer scaffolds, increased the number of predicted genes, and improved phylum-level taxonomic assignment in the sense that the rate of predicted genes that cannot be assigned to any taxonomy is reduced. Availability: The source code of MetaVelvet is freely available at http://metavelvet.dna.bio.keio.ac.jp under the GNU General Public License.

218 citations


Journal ArticleDOI
TL;DR: The paired de Bruijn graph is introduced, a generalization of the de Bruijn graph that incorporates mate pair information into the graph structure itself instead of analyzing mate pairs at a post-processing step, with the aim of improving contig sizes in assembly.
Abstract: The recent proliferation of next generation sequencing with short reads has enabled many new experimental opportunities but, at the same time, has raised formidable computational challenges in genome assembly. One of the key advances that has led to an improvement in contig lengths has been mate pairs, which facilitate the assembly of repeating regions. Mate pairs have been algorithmically incorporated into most next generation assemblers as various heuristic post-processing steps to correct the assembly graph or to link contigs into scaffolds. Such methods have allowed the identification of longer contigs than would be possible with single reads; however, they can still fail to resolve complex repeats. Thus, improved methods for incorporating mate pairs will have a strong effect on contig length in the future. Here, we introduce the paired de Bruijn graph, a generalization of the de Bruijn graph that incorporates mate pair information into the graph structure itself instead of analyzing mate pairs at a post-processing step. This graph has the potential to be used in place of the de Bruijn graph in any de Bruijn graph based assembler, maintaining all other assembly steps such as error-correction and repeat resolution. Through assembly results on simulated perfect data, we argue that this can effectively improve the contig sizes in assembly.

81 citations
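
A small sketch can show why pairing helps with repeats. The convention below is an assumption rather than the paper's exact definition: a paired k-mer is taken to be a k-mer `a` together with the k-mer `b` that starts `d` bases after `a` ends, and `paired_kmers` is a hypothetical helper name. In the toy sequence the k-mer `ACG` occurs twice, so an ordinary de Bruijn graph would collapse both occurrences into one node, while the paired labels stay distinct.

```python
from collections import defaultdict

def paired_kmers(seq, k, d):
    """Return (a, b) pairs: k-mer a at position i, k-mer b starting d bases after a ends."""
    span = 2 * k + d
    return [(seq[i:i + k], seq[i + k + d:i + span]) for i in range(len(seq) - span + 1)]

seq = "ACGTTACGGATC"   # toy sequence (assumption); "ACG" occurs at positions 0 and 5
k, d = 3, 1

partners = defaultdict(set)
for a, b in paired_kmers(seq, k, d):
    partners[a].add(b)

# The repeated k-mer "ACG" is split into two distinct paired labels.
print(sorted(partners["ACG"]))
```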


Journal ArticleDOI
TL;DR: This paper presents a method that eschews the traditional graph-based approach in favor of a simple 3′ extension approach that has the potential to be massively parallelized and that obtains assemblies that are more contiguous, more complete, and less error prone than those of existing methods.
Abstract: Motivation: Many de novo genome assemblers have been proposed recently. Most existing methods are based on the de Bruijn graph: a complex graph structure that attempts to encompass the entire genome. Such graphs can be prohibitively large, may fail to capture subtle information, and are difficult to parallelize. Result: We present a method that eschews the traditional graph-based approach in favor of a simple 3′ extension approach that has the potential to be massively parallelized. Our results show that it is able to obtain assemblies that are more contiguous, more complete, and less error prone than those of existing methods. Availability: The software package can be found at http://www.comp.nus.edu.sg/~bioinfo/peasm/. Alternatively, it is available from the authors upon request. Contact: [email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

47 citations


Book ChapterDOI
10 Oct 2011
TL;DR: This paper presents a dynamic overlay network based on the De Bruijn graph, the Linearized De Bruijn (LDB) network, and shows that a simple local-control algorithm can recover the LDB network from any weakly connected network topology.
Abstract: This paper presents a dynamic overlay network based on the De Bruijn graph which we call Linearized De Bruijn (LDB) network. The LDB network has the advantage that it has a guaranteed constant node degree and that the routing between any two nodes takes at most O(log n) hops with high probability. Also, we show that there is a simple local-control algorithm that can recover the LDB network from any network topology that is weakly connected.

39 citations
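
The logarithmic hop count comes from the shift structure of de Bruijn graphs. The sketch below shows the classical routing rule on a static binary de Bruijn graph, not the LDB protocol itself (which is dynamic and whose bound holds with high probability). The function name `de_bruijn_route` is hypothetical. Each hop shifts in one bit of the destination label, so with 2^n nodes a route needs at most n hops.

```python
def de_bruijn_route(src, dst):
    """Route between equal-length bit strings by shifting in the bits of dst."""
    n = len(src)
    # Reuse the longest suffix of src that is already a prefix of dst.
    overlap = next(j for j in range(n, -1, -1) if src[n - j:] == dst[:j])
    path = [src]
    for bit in dst[overlap:]:
        # One hop along a de Bruijn edge: drop the first bit, append the next bit of dst.
        path.append(path[-1][1:] + bit)
    return path

print(de_bruijn_route("0110", "1011"))
```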


Book ChapterDOI
28 Mar 2011
TL;DR: The paired de Bruijn graph is introduced, a generalization of the de Bruijn graph that incorporates mate pair information into the graph structure itself instead of analyzing mate pairs at a post-processing step, and it is argued that this can effectively improve the contig sizes in assembly.
Abstract: The recent proliferation of next generation sequencing with short reads has enabled many new experimental opportunities but, at the same time, has raised formidable computational challenges in genome assembly. One of the key advances that has led to an improvement in contig lengths has been mate pairs, which facilitate the assembly of repeating regions. Mate pairs have been algorithmically incorporated into most next generation assemblers as various heuristic post-processing steps to correct the assembly graph or to link contigs into scaffolds. Such methods have allowed the identification of longer contigs than would be possible with single reads; however, they can still fail to resolve complex repeats. Thus, improved methods for incorporating mate pairs will have a strong effect on contig length in the future. Here, we introduce the paired de Bruijn graph, a generalization of the de Bruijn graph that incorporates mate pair information into the graph structure itself instead of analyzing mate pairs at a post-processing step. This graph has the potential to be used in place of the de Bruijn graph in any de Bruijn graph based assembler, maintaining all other assembly steps such as error-correction and repeat resolution. Through assembly results on simulated error-free data, we argue that this can effectively improve the contig sizes in assembly.

37 citations


Journal ArticleDOI
TL;DR: A complete proof of the following theorem is given: Every de Bruijn sequence of order n in at least three symbols can be extended to a de Bruijn sequence of order n+1.

32 citations


Journal ArticleDOI
TL;DR: This method generalizes the Lempel construction of binary de Bruijn sequences as well as its efficient implementation by Annexstein and obtains an exponentially large class of distinct de Bruijn cycles.
Abstract: This paper presents a method to find new de Bruijn sequences based on ones of lesser order. This is done by mapping a de Bruijn cycle to several vertex-disjoint cycles in a de Bruijn digraph of higher order and then connecting these cycles into one full cycle. We present precise formulae for the locations where those cycles can be rejoined into one full cycle. We obtain an exponentially large class of distinct de Bruijn cycles. This method generalizes the Lempel construction of binary de Bruijn sequences as well as its efficient implementation by Annexstein.

21 citations


Journal ArticleDOI
TL;DR: The generalized binary de Bruijn (GBDB) graph is proposed as a reliable and efficient network topology for a large NoC and a reliable routing algorithm to detour a faulty channel between two adjacent switches is proposed.
Abstract: Employing thousands of cores in a single chip is the natural trend to handle the ever increasing performance requirements of complex applications such as those used in graphics and multimedia processing. System-on-chips (SoCs) platforms based on network-on-chips (NoCs) could be a viable option for the deployment of large multicore designs with thousands of cores. This paper proposes the generalized binary de Bruijn (GBDB) graph as a reliable and efficient network topology for a large NoC. We propose a reliable routing algorithm to detour a faulty channel between two adjacent switches. In addition, using integer linear programming, we propose an optimal tile-based implementation for a GBDB-based NoC in which the number of channels is less than that of Torus which has the same number of links. Our experimental results show that the latency and energy consumption of the generalized de Bruijn graph are much less than those of Mesh and Torus. The low energy consumption of a de Bruijn graph-based NoC makes it suitable for portable devices which have to operate on limited batteries. Also, the gate level implementation of the proposed reliable routing shows small area, power, and timing overheads due to the proposed reliable routing algorithm.

17 citations


Book ChapterDOI
18 Feb 2011
TL;DR: This paper efficiently generates maximum-density de Bruijn sequences for all values of n and m; when n = 2m+1 the result is a "complement-free de Bruijn sequence", a circular binary string that contains each binary string of length n or its complement exactly once as a substring.
Abstract: A de Bruijn sequence is a circular binary string of length $2^n$ that contains each binary string of length $n$ exactly once as a substring. A maximum-density de Bruijn sequence is a circular binary string of length $\binom{n}{0}+\binom{n}{1}+\binom{n}{2}+\cdots+\binom{n}{m}$ that contains each binary string of length $n$ with density (number of 1s) between 0 and $m$, inclusive. In this paper we efficiently generate maximum-density de Bruijn sequences for all values of $n$ and $m$. An interesting special case occurs when $n = 2m+1$. In this case our result is a "complement-free de Bruijn sequence" since it is a circular binary string of length $2^{n-1}$ that contains each binary string of length $n$ or its complement exactly once as a substring.

14 citations
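
The length claim for the special case can be checked with a few lines of arithmetic. This sketch only verifies the counting identity (it does not generate the sequences, and `max_density_length` is a hypothetical helper name): when n = 2m+1, the binary strings with at most m ones are exactly half of all n-bit strings, so the sum of binomial coefficients equals 2^(n-1).

```python
from math import comb

def max_density_length(n, m):
    """Number of binary strings of length n with at most m ones."""
    return sum(comb(n, i) for i in range(m + 1))

# Special case n = 2m + 1: the length collapses to 2^(n-1).
for m in range(1, 6):
    n = 2 * m + 1
    assert max_density_length(n, m) == 2 ** (n - 1)

print(max_density_length(5, 2))  # 1 + 5 + 10 = 16
```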


Book ChapterDOI
28 Mar 2011
TL;DR: This work proposes the T-IDBA algorithm, a de novo transcriptome assembler that outperforms ABySS substantially in terms of sensitivity and precision for both simulated and real data.
Abstract: RNA-seq data produced by next-generation sequencing technology is a useful tool for analyzing transcriptomes. However, existing de novo transcriptome assemblers do not fully utilize the properties of transcriptomes and may produce short contigs because of the splicing nature (shared exons) of the genes. We propose the T-IDBA algorithm to reconstruct expressed isoforms without a reference genome. By using paired-end information to solve the problem of long repeats in different genes and branching in the same gene due to alternative splicing, the graph can be decomposed into small components, each corresponding to a gene. The most probable isoforms with sufficient support from the paired-end reads are found heuristically. In practice, our de novo transcriptome assembler, T-IDBA, outperforms ABySS substantially in terms of sensitivity and precision for both simulated and real data. T-IDBA is available at http://www.cs.hku.hk/~alse/tidba/

Patent
29 Jun 2011
TL;DR: In this patent, sliding cutting is carried out on each base of the received order-checking sequence to obtain short strings with a fixed base length together with their left and right connecting relations; the sequence value of each short string, its left and right connecting relations, and a connection number are stored as a node of a de Bruijn graph; and a genome is assembled based on the constructed de Bruijn graph.
Abstract: The invention is applicable to the technical field of gene engineering, and provides a method for assembling a genome. The method comprises the following steps: receiving an order-checking sequence; carrying out sliding cutting on each base of the received order-checking sequence to obtain a short string with a fixed base length and a left and right connecting relation of the short string; storing a sequence value of the obtained short string, the left and right connecting relation, and a connection number as a node of a de Bruijn graph; and assembling a genome based on the constructed de Bruijn graph. In the invention, genome assembly is realized by slidingly cutting the bases of the received order-checking sequence one by one to obtain the short string with the fixed base length and the left and right connecting relation of the short string, and storing the sequence value of the obtained short string, the left and right connecting relation, and the connection number as the node of the de Bruijn graph. The method can assemble a large genome with a small memory footprint and at high speed.
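
The node layout described above can be illustrated with a short sketch. The patent text does not give a concrete storage format, so the field names (`count`, `left`, `right`) and the function `build_nodes` are hypothetical. A window of k bases slides over each input sequence one base at a time, and each k-mer node records how often it was seen and which bases were observed immediately to its left and right.

```python
from collections import Counter, defaultdict

def build_nodes(reads, k):
    """Slide a k-base window over each read and record neighbours of every k-mer."""
    nodes = defaultdict(lambda: {"count": 0, "left": Counter(), "right": Counter()})
    for read in reads:
        for i in range(len(read) - k + 1):
            node = nodes[read[i:i + k]]
            node["count"] += 1
            if i > 0:
                node["left"][read[i - 1]] += 1    # base preceding this k-mer
            if i + k < len(read):
                node["right"][read[i + k]] += 1   # base following this k-mer
    return nodes

nodes = build_nodes(["ACGTAC", "CGTACG"], k=3)    # toy input (assumption)
print(nodes["CGT"])
```

Storing only single-base neighbour counts per node, rather than explicit edge objects, is one common way to keep memory use low.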

16 Dec 2011
TL;DR: This expository paper explores the properties of Hamiltonian and Eulerian cycles on De Bruijn graphs, and the type of redundancy that results, with a view to fault-tolerant networks.
Abstract: The goal of this expository paper is to introduce De Bruijn graphs and discuss their applications to fault tolerant networks. We will begin by examining N.G. de Bruijn’s original paper and the proof of his claim that there are exactly $2^{2^{n-1}-n}$ De Bruijn cycles in the binary De Bruijn graph B(2, n). In order to study fault tolerance we explore the properties of Hamiltonian and Eulerian cycles that occur on De Bruijn graphs and the type of redundancy that occurs as a result. Lastly, in this paper we seek to provide some guidance into further research on De Bruijn graphs and their potential applications to other areas.
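
De Bruijn's count of binary cycles, 2^(2^(n-1) - n), can be confirmed by brute force for small orders. This is an illustrative check rather than anything from the paper; `count_de_bruijn_cycles` is a hypothetical name, and the enumeration is only practical up to about n = 4. Since the all-zero pattern occurs exactly once in a de Bruijn cycle, every cycle has exactly one rotation starting with n zeros, which lets the search fix that prefix.

```python
from itertools import product

def count_de_bruijn_cycles(n):
    """Count binary de Bruijn cycles of order n by brute force (practical for n <= 4)."""
    length = 2 ** n
    count = 0
    # Fix the prefix 0^n so that each cycle is counted once, not once per rotation.
    for tail in product("01", repeat=length - n):
        s = "0" * n + "".join(tail)
        doubled = s + s[:n - 1]                     # unroll the cyclic string
        windows = {doubled[i:i + n] for i in range(length)}
        if len(windows) == length:                  # every n-bit pattern appears exactly once
            count += 1
    return count

for n in range(1, 5):
    assert count_de_bruijn_cycles(n) == 2 ** (2 ** (n - 1) - n)
```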

Journal ArticleDOI
TL;DR: This paper gives an upper bound on the feedback number $f(d,n)$ of the de Bruijn graph $UB(d,n)$ for $d\ge 3$, where the feedback number of a graph is the minimum number of vertices whose removal results in an acyclic subgraph.
Abstract: The feedback number of a graph $G$ is the minimum number of vertices whose removal from $G$ results in an acyclic subgraph. We use $f(d,n)$ to denote the feedback number of the de Bruijn graph $UB(d,n)$. R. Královič and P. Ružička [Minimum feedback vertex sets in shuffle-based interconnection networks. Information Processing Letters, 86 (4) (2003), 191-196] proved that $f(2,n)=\lceil \frac{2^{n}-2}{3}\rceil$. This paper gives the upper bound on $f(d,n)$ for $d\ge 3$, that is, $f(d,n)\leq d^n\left(1-\left(\frac{d}{1+d}\right)^{d-1}\right)+\binom{n+d-2}{d-2}$.
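
For readers who want to plug numbers into the two formulas, here is a small evaluation sketch. It only computes the expressions quoted in the abstract and does not search for feedback vertex sets; the function names are hypothetical. Exact rational arithmetic is used so that the bound is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import comb

def feedback_number_binary(n):
    """f(2, n) = ceil((2^n - 2) / 3), the value quoted for the binary case."""
    return -((2 ** n - 2) // -3)   # ceiling division on integers

def feedback_upper_bound(d, n):
    """Upper bound on f(d, n) for d >= 3, as stated in the abstract."""
    return d ** n * (1 - Fraction(d, 1 + d) ** (d - 1)) + comb(n + d - 2, d - 2)

print(feedback_number_binary(4))           # ceil(14 / 3) = 5
print(float(feedback_upper_bound(3, 2)))   # 9 * (1 - 9/16) + 3 = 6.9375
```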

Posted Content
TL;DR: A sparse de Bruijn graph-based denoising algorithm that can remove more than 99% of substitution errors from datasets with a ≤ 2% error rate is developed and a novel Dijkstra-like breadth-first search algorithm is introduced to circumvent residual errors and resolve polymorphisms.
Abstract: de Bruijn graph-based algorithms are one of the two most widely used approaches for de novo genome assembly. A major limitation of this approach is the large computational memory space requirement to construct the de Bruijn graph, which scales with k-mer length and total diversity (N) of unique k-mers in the genome expressed in base pairs, or roughly (2k+8)N bits. This limitation is particularly important with large-scale genome analysis and for sequencing centers that simultaneously process multiple genomes. We present a sparse de Bruijn graph structure, based on which we developed SparseAssembler, which greatly reduces memory space requirements. The structure also allows us to introduce a novel method for the removal of substitution errors introduced during sequencing. The sparse de Bruijn graph structure skips g intermediate k-mers, therefore reducing the theoretical memory space requirement to ~(2k/g+8)N. We have found that a practical value of g=16 consumes approximately 10% of the memory required by standard de Bruijn graph-based algorithms but yields comparable results. A high error rate could potentially derail the SparseAssembler. Therefore, we developed a sparse de Bruijn graph-based denoising algorithm that can remove more than 99% of substitution errors from datasets with a ≤ 2% error rate. Given that substitution error rates for the current generation of sequencers are lower than 1%, our denoising procedure is sufficiently effective to safeguard the performance of our algorithm. Finally, we also introduce a novel Dijkstra-like breadth-first search algorithm for the sparse de Bruijn graph structure to circumvent residual errors and resolve polymorphisms.
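
The two memory estimates in the abstract are easy to compare numerically. The sketch below simply evaluates (2k+8)N and (2k/g+8)N; the values of k and N are illustrative assumptions, not figures from the paper, and the resulting ratio is a theoretical one that need not match the roughly 10% measured in practice.

```python
def dense_bits(k, n_kmers):
    """Approximate memory for a standard de Bruijn graph: (2k + 8) * N bits."""
    return (2 * k + 8) * n_kmers

def sparse_bits(k, n_kmers, g):
    """Approximate memory when g intermediate k-mers are skipped: (2k/g + 8) * N bits."""
    return (2 * k / g + 8) * n_kmers

k, g, n_kmers = 31, 16, 3_000_000_000   # hypothetical k-mer size and k-mer diversity
ratio = sparse_bits(k, n_kmers, g) / dense_bits(k, n_kmers)
print(f"dense:  {dense_bits(k, n_kmers) / 8e9:.1f} GB")
print(f"sparse: {sparse_bits(k, n_kmers, g) / 8e9:.1f} GB ({ratio:.0%} of dense)")
```

Because the constant 8-bit term is not divided by g, the theoretical saving grows with k.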

Proceedings ArticleDOI
21 Oct 2011
TL;DR: This paper proposes a new randomized method, based on genetic algorithms, for constructing binary de Bruijn sequences of order n, that is, cyclic sequences of period $2^n$ in which each n-bit pattern appears exactly once.
Abstract: A binary de Bruijn sequence of order n is a cyclic sequence of period $2^n$, in which each n-bit pattern appears exactly once. These sequences are commonly used in random number generation and symmetric key cryptography, particularly in stream cipher design, mainly due to their good statistical properties. Constructing de Bruijn sequences is of interest and well studied in the literature. In this study, we propose a new randomized construction method based on genetic algorithms. The method models de Bruijn sequences as a special type of traveling salesman tours and tries to find optimal solutions to this special type of the traveling salesman problem (TSP). We present some experimental results for n ≤ 14.
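
As a point of reference for the definition, here is a sketch of a classical deterministic construction, the greedy "prefer-one" rule. It is not the genetic algorithm proposed in the paper, and `prefer_one_de_bruijn` is a hypothetical name. Starting from n zeros, the rule appends a 1 whenever that creates an n-bit window not seen before, otherwise a 0, and stops when neither bit works.

```python
def prefer_one_de_bruijn(n):
    """Binary de Bruijn sequence of order n via the greedy prefer-one rule."""
    seq = "0" * n
    seen = {seq}
    while True:
        for bit in "10":                               # try 1 first, then 0
            window = seq[len(seq) - n + 1:] + bit      # last n-1 bits plus candidate
            if window not in seen:
                seen.add(window)
                seq += bit
                break
        else:
            break                                      # neither bit yields a new window
    return seq[:2 ** n]                                # cyclic form: drop the wrap-around tail

print(prefer_one_de_bruijn(3))  # 00011101
```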

Book ChapterDOI
15 Aug 2011
TL;DR: This paper proves that there are at least ⌊σ/2⌋ mutually-orthogonal order-k de Bruijn sequences on alphabets of size σ for all k, and presents a heuristic which proves capable of efficiently constructing optimal collections of mutually-orthogonal sequences for small values of σ and k.
Abstract: A (σ, k)-de Bruijn sequence is a minimum length string on an alphabet set of size σ which contains all $\sigma^k$ k-mers exactly once. Motivated by an application in synthetic biology, we say a given collection of de Bruijn sequences are orthogonal if no two of them contain the same (k + 1)-mer; that is, the length of their longest common substring is k. In this paper, we show how to construct large collections of orthogonal de Bruijn sequences. In particular, we prove that there are at least ⌊σ/2⌋ mutually-orthogonal order-k de Bruijn sequences on alphabets of size σ for all k. Based on this approach, we present a heuristic which proves capable of efficiently constructing optimal collections of mutually-orthogonal sequences for small values of σ and k, which supports our conjecture that σ - 1 mutually-orthogonal de Bruijn sequences exist for all σ and k.
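
The orthogonality condition can be checked directly from the definition. The sketch below assumes the linear reading of "minimum length string" (length σ^k + k - 1, no wrap-around); the helper names are hypothetical and the example strings are hand-picked for σ = 4, k = 1, not taken from the paper.

```python
def kmers(seq, k):
    """Set of k-mers in a linear string."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def is_de_bruijn(seq, sigma, k):
    """True if seq has minimum length and contains sigma^k distinct k-mers."""
    return len(seq) == sigma ** k + k - 1 and len(kmers(seq, k)) == sigma ** k

def are_orthogonal(a, b, k):
    """Two order-k sequences are orthogonal if they share no (k+1)-mer."""
    return kmers(a, k + 1).isdisjoint(kmers(b, k + 1))

family = ["0123", "0321", "1302"]          # toy example with sigma = 4, k = 1
assert all(is_de_bruijn(s, 4, 1) for s in family)
assert all(are_orthogonal(a, b, 1) for a in family for b in family if a != b)
print(are_orthogonal("0123", "2301", 1))   # False: both contain "23" and "01"
```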

Posted Content
TL;DR: This paper considers the mathematical problem of uniformly tiling a de Bruijn or Kautz graph by a set of identical subgraphs, derives a simple lower bound on the number of edges which must leave each tile, and constructs a class of tilings whose number of edges leaving each tile agrees asymptotically in form with the lower bound to within a constant factor.
Abstract: Kautz and de Bruijn graphs have a high degree of connectivity which makes them ideal candidates for massively parallel computer network topologies. In order to realize a practical computer architecture based on these graphs, it is useful to have a means of constructing a large-scale system from smaller, simpler modules. In this paper we consider the mathematical problem of uniformly tiling a de Bruijn or Kautz graph. This can be viewed as a generalization of the graph bisection problem. We focus on the problem of graph tilings by a set of identical subgraphs. Tiles should contain a maximal number of internal edges so as to minimize the number of edges connecting distinct tiles. We find necessary and sufficient conditions for the construction of tilings. We derive a simple lower bound on the number of edges which must leave each tile, and construct a class of tilings whose number of edges leaving each tile agrees asymptotically in form with the lower bound to within a constant factor. These tilings make possible the construction of large-scale computing systems based on de Bruijn and Kautz graph topologies.

Proceedings ArticleDOI
13 Apr 2011
TL;DR: This paper investigates and presents different node ID assignment algorithms for group-theoretic graphs such as Borel Cayley and de Bruijn graphs and finds that simulated annealing has the best performance, and that all three methods outperform random ID assignment in the authors' simulations.
Abstract: In this paper, we investigate and present different node ID assignment algorithms for group-theoretic graphs such as Borel Cayley and de Bruijn graphs. These graphs have been shown to be effective logical topologies in wireless sensor networks when all the nodes are within communication range of each other. However, in practice a sensor node's communication range is limited and some nodes can be out of range of each other. Under this more realistic scenario, the original theoretic graph cannot be imposed on the network in its entirety. Rather, only partial connections of the original graphs can be imposed on the physical network. Thus, node ID assignment becomes an important issue. An effective assignment allows most connections to be imposed, resulting in a shorter diameter and average path length. We investigate three algorithms: (a) ID swapping assignment, (b) simulated annealing based assignment, and (c) distributed ID swapping assignment. While the first two are centralized algorithms that are appropriate for wireless sensor networks with fixed infrastructure, the latter is efficient for ad hoc WSNs. As expected, being the most computationally intensive, simulated annealing has the best performance, and all three methods outperform random ID assignment in our simulations.

Posted Content
TL;DR: This paper introduces a new general model for genome assembly that uses only sparse k-mers; it replaces the idea of the de Bruijn graph from the beginning and achieves similar memory efficiency and much better robustness compared with the authors' previous SparseAssembler1.
Abstract: The formal version of our work has been published in BMC Bioinformatics. Motivation: To tackle the problem of huge memory usage associated with de Bruijn graph-based algorithms, upon which some of the most widely used de novo genome assemblers have been built, we released SparseAssembler1. SparseAssembler1 can save as much as 90% memory consumption in comparison with the state-of-the-art assemblers, but it requires rounds of denoising to accurately assemble genomes. In this paper, we introduce a new general model for genome assembly that uses only sparse k-mers. The new model replaces the idea of the de Bruijn graph from the beginning, and achieves similar memory efficiency and much better robustness compared with our previous SparseAssembler1. Results: We demonstrate that the decomposition of reads into all overlapping k-mers, which is used in existing de Bruijn graph genome assemblers, is overly cautious. We introduce a sparse k-mer graph structure for saving sparse k-mers, which greatly reduces memory space requirements necessary for de novo genome assembly. In contrast with the de Bruijn graph approach, we devise a simple but powerful strategy, i.e., finding links between the k-mers in the genome and traversing following the links, which can be done by saving only a few k-mers. To implement the strategy, we need only select some k-mers that may not even be overlapping ones, and build the links between these k-mers indicated by the reads. We can traverse through this sparse k-mer graph to build the contigs, and ultimately complete the genome assembly. Since the new sparse k-mer graph shares almost all advantages of the de Bruijn graph, we are able to adapt a Dijkstra-like breadth-first search algorithm to circumvent sequencing errors and resolve polymorphisms.

Proceedings ArticleDOI
01 Dec 2011
TL;DR: A new static polynomial time RWA heuristic LBGD-RWA (Load Balancing with Graph Decomposition based RWA) for static Wavelength Assignment in a special class of WDM networks which are based on de Bruijn graph is proposed.
Abstract: An important parameter for performance analysis of Routing and Wavelength Assignment (RWA) strategies in WDM networks is blocking probability. Past research has shown that the process in which RWA is carried out significantly affects the wavelength conversion requirements, which in turn affects blocking probability. In this paper we propose a new strategy GDWA (Graph Decomposition based Wavelength Assignment) for static Wavelength Assignment (WA) in a special class of WDM networks which are based on de Bruijn graph. We combine our own request routing strategy LBR (Load Balanced Routing) with the new WA strategy effectively to propose a new static polynomial time RWA heuristic LBGD-RWA (Load Balancing with Graph Decomposition based RWA). We compare the performance of our heuristic with three alternate RWA strategies. Performance comparison reveals that the proposed heuristic gives the best blocking performance.

Journal ArticleDOI
TL;DR: This work shows that a de Bruijn graph can be expressed as a union of edge-disjoint rings; wavelengths are assigned for the individual rings in the graph, thus resulting in the assignment for the graph itself.
Abstract: This paper proposes an offline wavelength assignment technique for de Bruijn (d, k) optical wavelength division multiplexing (WDM) networks having nodal degree d and diameter k that can support lightpaths between pairs of nodes. Each lightpath uses a channel (wavelength) along each link in its route. An efficient algorithm is proposed that can be used to assign wavelengths to lightpath requests in O(k|V|) time, where |V| represents the number of nodes of the de Bruijn network and V represents the set of nodes. The proposed algorithm can be efficiently used for wavelength assignment in de Bruijn WDM networks having limited wavelength conversion capabilities. This work shows that a de Bruijn graph can be expressed as a union of edge-disjoint rings. Wavelengths are assigned for the individual rings in the graph, thus resulting in the assignment for the graph itself. Results are shown for a given de Bruijn graph and for an arbitrary request. The proposed algorithm is compared with two other algorithms, which use tw...

Journal ArticleDOI
TL;DR: An inductive characterization of the maximum independent sets of the de Bruijn graphs and a recurrence relation and an exponential generating function for their number are derived.
Abstract: The nodes of the de Bruijn graph $B(d,3)$ consist of all strings of length $3$, taken from an alphabet of size $d$, with edges between words which are distinct substrings of a word of length $4$. We give an inductive characterization of the maximum independent sets of the de Bruijn graphs $B(d,3)$ and for the de Bruijn graph of diameter three with loops removed, for arbitrary alphabet size. We derive a recurrence relation and an exponential generating function for their number. This recurrence allows us to construct exponentially many comma-free codes of length 3 with maximal cardinality.


Proceedings Article
Gaurav Thareja, Vivek Kumar, Michael Zyskowski, Simon Mercer, Bob Davidson
15 Jul 2011
TL;DR: PadeNA (Parallel de Novo Assembler), a parallelized DNA sequence assembler with a graphical user interface, designed using interface-driven architecture to facilitate code reusability and extensibility, and is provided as part of the open source Microsoft Biology Foundation.
Abstract: Recent technological advances in DNA sequencing technology are resulting in ever-larger quantities of sequence information being made available to an increasingly broad segment of the scientific and clinical community. This is in turn driving the need for standard, rapid and easy to use tools for genomic reconstruction and analysis. As a step towards addressing this challenge, we present PadeNA (Parallel de Novo Assembler), a parallelized DNA sequence assembler with a graphical user interface. PadeNA is designed using interface-driven architecture to facilitate code reusability and extensibility, and is provided as part of the open source Microsoft Biology Foundation. Installers and documentation are available at http://research.microsoft.com/bio/.

01 Jan 2011
TL;DR: This essay attempts to create a generalized periodic shift register function that produces a De Bruijn sequence, and discusses the minimal Sum-of-Products and Exclusive-OR-Sum-of-Products boolean functions.
Abstract: This essay is an attempt to create a generalized periodic shift register function that produces a De Bruijn sequence. To this end we first devise an algorithm to create all De Bruijn sequences. In this algorithm all spanning trees of a De Bruijn graph are created, these trees are converted into Euler paths and finally the De Bruijn sequences are extracted from the Euler paths. Then the focus shifts onto creating the boolean functions that produce these sequences. The minimal Sum-of-Products boolean functions and the Exclusive-OR-Sum-of-Products are discussed. Finally some general properties of the functions are derived, but no general function is found.

01 Jan 2011
TL;DR: This paper studies homomorphisms between de Bruijn digraphs of different orders, characterizing those for which the inverse of a factor in the lower order digraph is also a factor in the higher order one, where a factor is a collection of cycles that partition the digraph.
Abstract: We study homomorphisms between de Bruijn digraphs of different orders. A main theme of this paper is to characterize de Bruijn graph homomorphisms such that the inverse of a factor in the lower order digraph is also a factor in the higher order one, where a factor is a collection of cycles that partition the digraph. We generalize Lempel's homomorphism by describing and characterizing a class of homomorphisms between two de Bruijn digraphs of arbitrarily different orders but with the same alphabet, the direction of these functions being of course from the higher order digraph to the lower order one. Finally, we single out the binary case, which due to its simplicity admits a more concise characterization.
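
For orientation, Lempel's homomorphism in the binary case maps an n-bit word to the (n-1)-bit word of XORs of adjacent bits. The sketch below checks, for one small order chosen as an example, that this map sends every edge of the order-n de Bruijn digraph to an edge of the order-(n-1) digraph and that each image has exactly two preimages (a word and its complement). The function names are hypothetical, and the generalized homomorphisms studied in the paper are not reproduced here.

```python
from collections import Counter
from itertools import product

def lempel_d(word):
    """Lempel's D-morphism: XOR of adjacent bits, from n-bit to (n-1)-bit words."""
    return tuple(a ^ b for a, b in zip(word, word[1:]))

def is_edge(u, v):
    """(u, v) is a de Bruijn edge if v is u shifted left by one symbol."""
    return u[1:] == v[:-1]

n = 4
words = list(product((0, 1), repeat=n))

# Homomorphism property: edges of B(2, n) map to edges of B(2, n-1).
for word in words:
    for bit in (0, 1):
        successor = word[1:] + (bit,)
        assert is_edge(lempel_d(word), lempel_d(successor))

# The map is two-to-one.
preimages = Counter(lempel_d(w) for w in words)
assert set(preimages.values()) == {2}
```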

Proceedings ArticleDOI
03 Feb 2011
TL;DR: The authors present computationally efficient, sorting-based algorithms for the fundamental bi-directed de Bruijn graph operations, which work in sequential, out-of-core, and parallel settings.
Abstract: Next Generation Sequence (NGS) assemblers are challenged with the problem of handling a massive number of reads. The bi-directed de Bruijn graph is the most fundamental data structure on which numerous NGS assemblers have been built (e.g. Velvet, ABySS). Most of these assemblers differ only in the heuristics which they employ to operate on this de Bruijn graph. These heuristics are composed of several fundamental operations such as construction, compaction and pruning of the underlying bi-directed de Bruijn graph. Unfortunately, the current algorithms for these fundamental operations on the de Bruijn graph are computationally inefficient and have become a bottleneck in scaling NGS assemblers. In this talk we discuss some of our recent results, which provide computationally efficient algorithms for these fundamental bi-directed de Bruijn graph operations. Our algorithms [1] are based on sorting and are efficient in sequential, out-of-core, and parallel settings.
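As a rough illustration of the construction step, the sketch below builds the edges of a bi-directed de Bruijn graph by sorting canonical (k+1)-mers and counting runs of equal entries. It is not the authors' algorithm, only an assumed minimal example of why sorting suits this task (a sort can be run out of core or in parallel, unlike an in-memory hash table). All names are hypothetical, and reads are assumed to contain only A, C, G and T.

```python
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    """Pick one representative for a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))


def build_bidirected_edges(reads, k):
    """Return sorted (node, node, count) edges between canonical k-mers.

    Each edge is a canonical (k+1)-mer, so a (k+1)-mer and its reverse
    complement collapse into a single bi-directed edge.
    """
    edge_mers = sorted(
        canonical(read[i:i + k + 1])
        for read in reads
        for i in range(len(read) - k)
    )
    edges = []
    # After sorting, duplicates are adjacent and can be counted in one pass.
    for mer, group in groupby(edge_mers):
        count = sum(1 for _ in group)
        edges.append((canonical(mer[:k]), canonical(mer[1:]), count))
    return edges
```

For example, `build_bidirected_edges(["ACGTT", "AACGT"], 3)` merges the two reads because `CGTT` and `AACG` are reverse complements of each other.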

Journal ArticleDOI
TL;DR: A new general model for genome assembly that uses only sparse k-mers is introduced, which greatly reduces memory space requirements necessary for de novo genome assembly and adapts a Dijkstra-like breadth-first search algorithm to circumvent sequencing errors and resolve polymorphisms.
Abstract: Motivation: To tackle the problem of huge memory usage associated with de Bruijn graph-based algorithms, upon which some of the most widely used de novo genome assemblers have been built, we released SparseAssembler1. SparseAssembler1 can reduce memory consumption by as much as 90% in comparison with state-of-the-art assemblers, but it requires rounds of denoising to accurately assemble genomes. Algorithmically, we developed an extension of the de Bruijn graph structure, 'sparse de Bruijn graphs', which skips a certain number of intermediate k-mers. In this paper, we introduce a new general model for genome assembly that uses only sparse k-mers. The new model replaces the idea of the de Bruijn graph from the beginning, and achieves similar memory efficiency and much better robustness compared with our previous SparseAssembler1. Results: Based on the sparse k-mer graph model, we develop SparseAssembler2. We demonstrate that the decomposition of reads into all overlapping k-mers, which is used in existing de Bruijn graph genome assemblers, is overly cautious. We introduce a sparse k-mer graph structure for saving sparse k-mers, which greatly reduces the memory required for de novo genome assembly. In contrast with the de Bruijn graph approach, we devise a simple but powerful strategy, i.e., finding links between the k-mers in the genome and traversing along the links, which can be done by saving only a few k-mers. To implement the strategy, we need only select some k-mers, which may not even overlap, and build the links between these k-mers indicated by the reads. We can traverse this sparse k-mer graph to build the contigs, and ultimately complete the genome assembly. 
Since the new sparse k-mer graph shares almost all advantages of the de Bruijn graph, we are able to adapt a Dijkstra-like breadth-first search algorithm to the new sparse k-mer graph in order to circumvent sequencing errors and resolve polymorphisms. Availability: Programs for both Windows and Linux are available at: https://sites.google.com/site/sparseassembler/. Contact: ma@vandals.uidaho.edu or mpop@umiacs.umd.edu. SparseAssembler2: Sparse k-mer Graph for Memory Efficient Genome Assembly
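The idea of storing only some k-mers and linking them as the reads indicate can be sketched as follows. This is a simplified assumption, not SparseAssembler2's actual selection rule: it keeps every `gap`-th k-mer of each read at a fixed offset and links consecutive kept k-mers. The function name and the `gap` parameter are hypothetical.

```python
from collections import defaultdict


def sparse_kmer_graph(reads, k, gap):
    """Keep every `gap`-th k-mer of each read and link consecutive kept k-mers.

    Illustrative only: a real assembler must also reconcile sampling offsets
    across reads and record the bases lying between linked k-mers.
    """
    links = defaultdict(set)
    for read in reads:
        kept = [read[i:i + k] for i in range(0, len(read) - k + 1, gap)]
        for left, right in zip(kept, kept[1:]):
            links[left].add(right)
    return dict(links)
```

With `gap` set to 1 this stores every overlapping k-mer, as a de Bruijn graph does; larger values store roughly a `1/gap` fraction of the k-mers per read.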

Journal Article
TL;DR: JGI is comparing the performance of Convey's graph constructor and Velvet on both synthetic and real data, and preliminary results on memory usage and run time metrics for various data sets with different sizes are presented.
Abstract: Advanced architectures can deliver dramatically increased throughput for genomics and proteomics applications, reducing time-to-completion in some cases from days to minutes. One such architecture, hybrid-core computing, marries a traditional x86 environment with a reconfigurable coprocessor based on field programmable gate array (FPGA) technology. In addition to higher throughput, increased performance can fundamentally improve research quality by allowing more accurate, previously impractical approaches. We will discuss the approach used by Convey's de Bruijn graph constructor for short-read, de novo assembly. Bioinformatics applications that have random access patterns to large memory spaces, such as graph-based algorithms, experience memory performance limitations on cache-based x86 servers. Convey's highly parallel memory subsystem allows application-specific logic to simultaneously access 8192 individual words in memory, significantly increasing effective memory bandwidth over cache-based memory systems. Many algorithms, such as Velvet and other de Bruijn graph-based, short-read, de novo assemblers, can greatly benefit from this type of memory architecture. Furthermore, small data type operations (each of the four nucleotides can be represented in two bits) make more efficient use of logic gates than the data types dictated by conventional programming models. JGI is comparing the performance of Convey's graph constructor and Velvet on both synthetic and real data. We will present preliminary results on memory usage and run time metrics for various data sets of different sizes, from small microbial and fungal genomes to a very large cow rumen metagenome. For genomes with references we will also present assembly quality comparisons between the two assemblers.
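The remark that each nucleotide fits in two bits can be made concrete with a short sketch that packs a k-mer into an integer. The particular code assignment (A=0, C=1, G=2, T=3) and the function names are assumptions for illustration and are not taken from Convey's implementation.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(kmer):
    """Pack a k-mer into an integer, two bits per nucleotide."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def unpack(value, k):
    """Recover the k-mer of length k from its packed integer form."""
    return "".join(
        BASES[(value >> (2 * (k - 1 - i))) & 3] for i in range(k)
    )
```

Under this encoding `pack("ACGT")` is `0b00011011`, so a 32-mer fits in a single 64-bit word.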