Global Optimization for Scaffolding and Completing Genome Assemblies
Sebastien François, Rumen Andonov, Dominique Lavenier, Hristo N. Djidjev +3 more
- Vol. 64, pp 185-194
TLDR
This paper proposes an optimization-based approach that finds the genome sequence as a longest sequence consistent with the given contig and linkage information. In contrast to the commonly used strategy of constructing a set of disjoint paths, which requires additional gap-filling and scaffold-extension steps, the approach handles these subtasks together.
Abstract:
This work addresses the scaffolding and gap-filling phases of de novo genome assembly simultaneously. Given a set of contigs and their relationships (overlaps and/or remoteness in terms of distances between them), we propose an optimization-based approach for finding the genome sequence as a longest sequence that is consistent with the given contig and linkage information. Specifically, we define a graph, which we call a contig graph, that encodes information about contigs, the overlaps between them, and the mate-pair distances between them. We then reduce the scaffolding problem to the problem of finding a longest simple path in that graph such that as many mate-pair distances as possible are satisfied. Since both conditions cannot generally be satisfied simultaneously, our objective function is a linear combination of the path length and penalties for distance mismatches.
Unlike the shortest path problem with non-negative weights, for which efficient polynomial-time algorithms exist, the longest weighted path problem is NP-hard. We solve this problem by reformulating it as a mixed integer linear program (MILP) and develop a method that solves the resulting program exactly on instances with up to 165 contigs and up to 6682 binary variables.
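The trade-off between path length and distance penalties can be illustrated with a minimal sketch on a toy contig graph. This is not the MILP formulation of the paper: it uses exhaustive search over simple paths, which is only feasible for a handful of contigs, and it ignores contig orientation. The function names, the weights `alpha` and `beta`, the absolute-deviation penalty, and all toy data are assumptions made for illustration.

```python
def score(path, lengths, edges, mate_pairs, alpha, beta):
    """Objective (assumed form): alpha * sequence length - beta * distance mismatch."""
    # Start position of each contig along the path; a negative gap means an overlap.
    pos = {path[0]: 0}
    for u, v in zip(path, path[1:]):
        pos[v] = pos[u] + lengths[u] + edges[(u, v)]
    total_len = pos[path[-1]] + lengths[path[-1]]
    # Only mate pairs whose two contigs both lie on the path are penalized here.
    penalty = sum(
        abs(pos[v] - pos[u] - d)
        for (u, v), d in mate_pairs.items()
        if u in pos and v in pos
    )
    return alpha * total_len - beta * penalty


def best_path(lengths, edges, mate_pairs, alpha=1.0, beta=1.0):
    """Enumerate every simple path and keep the one with the highest score."""
    best = (float("-inf"), [])

    def extend(path, visited):
        nonlocal best
        s = score(path, lengths, edges, mate_pairs, alpha, beta)
        if s > best[0]:
            best = (s, list(path))
        for (u, v) in edges:
            if u == path[-1] and v not in visited:
                path.append(v)
                visited.add(v)
                extend(path, visited)
                path.pop()
                visited.remove(v)

    for start in lengths:
        extend([start], {start})
    return best


# Hypothetical toy instance: contig lengths, gaps between consecutive contigs
# (negative = overlap), and expected start-to-start mate-pair distances.
lengths = {"A": 100, "B": 80, "C": 120, "D": 50}
edges = {("A", "B"): -10, ("B", "C"): 5, ("A", "C"): 0,
         ("C", "D"): -20, ("B", "D"): 0}
mate_pairs = {("A", "C"): 175, ("A", "D"): 275}

print(best_path(lengths, edges, mate_pairs))
```

On this instance the shortcut A, C, D is shorter and violates both mate-pair distances, so the search prefers the path through all four contigs. The number of simple paths grows exponentially with the number of contigs, which is why an exact approach on realistic instances needs something like the MILP reformulation described above.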
An advantage of our approach is that modeling scaffolding as a longest path problem allows several subtasks specific to this problem to be solved simultaneously, such as contig orientation and ordering, repeat resolution, gap filling, and scaffold extension, which other approaches treat as separate problems. We are not aware of previous approaches to scaffolding based on a reduction to the longest path problem. A drawback of the typically used strategy of constructing a set of disjoint paths, rather than a single path, is that it requires additional gap-filling and scaffold-extension steps. Moreover, it makes it impossible to find a provably optimal final solution, since, even if each separate problem is solved optimally, their combination may not be optimal.
We tested this model on a set of chloroplast and bacterial genome data and showed that it allows the complete genome to be assembled as a single scaffold. Compared to the publicly available scaffolding tools that we tested, our solution produces assemblies of significantly higher quality.