Open Access · Journal Article · DOI

Global Optimization for Scaffolding and Completing Genome Assemblies

TLDR
This paper proposes an optimization-based approach for finding the genome sequence as a longest sequence that is consistent with the given contig and linkage information. Modeling scaffolding as a longest path problem handles contig orientation and ordering, repeats, gap filling, and scaffold extension simultaneously, whereas the commonly used strategy of constructing a set of disjoint paths requires separate gap-filling and scaffold-extension steps.
Abstract
This work addresses the scaffolding and gap-filling phases of de novo genome assembly simultaneously. Given a set of contigs and their relationships (overlaps and/or remoteness in terms of distances between them), we propose an optimization-based approach for finding the genome sequence as a longest sequence that is consistent with the given contig and linkage information. Specifically, we define a graph, which we call a contig graph, that encodes information about contigs and about the overlaps and mate-pair distances between them, and we reduce the scaffolding problem to the problem of finding a longest simple path in that graph such that as many mate-pair distances as possible are satisfied. Since both conditions cannot generally be satisfied simultaneously, our objective function is a linear combination of the path length and penalties for distance mismatches. Unlike the shortest path problem with non-negative weights, for which efficient polynomial-time algorithms exist, the longest weighted path problem is NP-hard. We solve this problem by reformulating it as a mixed integer linear program (MILP) and develop a method that solves the resulting program exactly on genomes of up to 165 contigs and up to 6682 binary variables. An advantage of our approach is that modeling scaffolding as a longest path problem allows one to solve simultaneously several subtasks specific to this problem, such as contig orientation and ordering, repeats, gap filling, and scaffold extension, which other approaches target as separate problems. We are not aware of previous approaches to scaffolding based on a reduction to the longest path problem. A drawback of the typically used strategy of constructing a set of disjoint paths, rather than a single path, is that it requires additional steps of gap filling and scaffold extension. Moreover, it makes it impossible to find a provably optimal final solution, since, even if each separate problem is solved optimally, their combination may not be optimal. We tested this model on a set of chloroplast and bacterial genome data and showed that it can assemble the complete genome as a single scaffold. Compared to the publicly available scaffolding tools that we have tested, our solution produces assemblies of significantly higher quality.
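The objective described in the abstract (path length combined linearly with penalties for mate-pair distance mismatches) can be illustrated on a toy contig graph. The sketch below is not the authors' MILP: it enumerates simple paths exhaustively, which is only feasible for a handful of contigs. All names (`score_path`, `best_scaffold`, `weight`, `missing_penalty`), the toy data, and the choice to charge a fixed penalty when a mate pair is not fully on the path are assumptions made for illustration.

```python
def score_path(path, contig_len, edge_gap, mate_pairs, weight, missing_penalty):
    """Score = path length - weight * total mate-pair penalty (illustrative objective)."""
    start = {}  # start coordinate of each contig along the path
    pos = 0
    for i, contig in enumerate(path):
        if i > 0:
            prev = path[i - 1]
            # A negative gap models an overlap between consecutive contigs.
            pos += contig_len[prev] + edge_gap[(prev, contig)]
        start[contig] = pos
    length = pos + contig_len[path[-1]]

    penalty = 0.0
    for u, v, expected in mate_pairs:
        if u in start and v in start:
            penalty += abs((start[v] - start[u]) - expected)
        else:
            penalty += missing_penalty  # assumption: fixed cost for an unplaced pair
    return length - weight * penalty


def best_scaffold(contig_len, edge_gap, mate_pairs, weight=1.0, missing_penalty=100.0):
    """Exhaustively search all simple paths and return (best score, best path)."""
    successors = {}
    for u, v in edge_gap:
        successors.setdefault(u, []).append(v)

    best = (float("-inf"), [])

    def extend(path, visited):
        nonlocal best
        score = score_path(path, contig_len, edge_gap, mate_pairs, weight, missing_penalty)
        if score > best[0]:
            best = (score, list(path))
        for nxt in successors.get(path[-1], []):
            if nxt not in visited:  # keep the path simple
                path.append(nxt)
                visited.add(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for contig in contig_len:
        extend([contig], {contig})
    return best


# Hypothetical toy instance: contig lengths, directed links with gaps, mate-pair distances.
contig_len = {"A": 100, "B": 50, "C": 80, "D": 30}
edge_gap = {("A", "B"): -10, ("B", "C"): 20, ("A", "C"): 5, ("C", "D"): 0, ("B", "D"): 0}
mate_pairs = [("A", "C", 160), ("A", "D", 240)]

score, path = best_scaffold(contig_len, edge_gap, mate_pairs)
print(score, path)
```

On this toy instance the path through all four contigs is both the longest and the one that matches both mate-pair distances, so the two terms of the objective agree. When they conflict, `weight` decides the trade-off, which is the role the linear combination plays in the abstract.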


Citations
Posted Content · DOI

Global optimization approach for circular and chloroplast genome assembly

TL;DR: In this article, a global optimization approach for genome assembly where the steps of scaffolding, gap-filling, and scaffold extension are simultaneously solved in the framework of a common objective function is described.
Journal Article · DOI

Complete assembly of circular and chloroplast genomes based on global optimization.

TL;DR: This paper focuses on the last two stages of genome assembly, namely, scaffolding and gap-filling, and shows that they can be solved as part of a single optimization problem, formulated as a mixed-integer linear programming (MILP) problem and applied to a benchmark of chloroplasts.
Other · DOI

Genome Assembly

Trude Nilsen
TL;DR: This paper explains the two main steps found in almost all of the strategies implemented, the construction and the ordering of scaffolds, and notes that error-correction-based techniques can also be applied to increase the overall quality of the reads.