Book ChapterDOI

A fast and accurate heuristic for the single individual snp haplotyping problem with many gaps, high reading error rate and low coverage

TL;DR: A new heuristic method is described that is able to tackle the case of many gapped fragments and retains its effectiveness even when the input fragments have a high rate of reading errors and low coverage.
Abstract: Single nucleotide polymorphism (SNP) is the most frequent form of DNA variation. The set of SNPs present in a chromosome (called the haplotype) is of interest in a wide area of applications in molecular biology and biomedicine, including diagnostics and medical therapy. In this paper we propose a new heuristic method for the problem of haplotype reconstruction for (portions of) a pair of homologous human chromosomes from a single individual (SIH). The problem is well known in the literature, and exact algorithms have been proposed for the case when no (or few) gaps are allowed in the input fragments. These algorithms, though exact and of polynomial complexity, are slow in practice. Therefore fast heuristics have been proposed. In this paper we describe a new heuristic method that is able to tackle the case of many gapped fragments and retains its effectiveness even when the input fragments have a high rate of reading errors (up to 20%) and low coverage (as low as 3). We test our method on real data from the HapMap Project.
Citations
Journal ArticleDOI
TL;DR: Comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that the proposed SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes.
Abstract: Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics.

125 citations


Cites methods from "A fast and accurate heuristic for t..."

  • ...Two more algorithms are mentioned in (23), a randomized one called SHRThree (31), and SpeedHap (32) which tries to build first a core solution with variants and fragments with full agreement and evidence of presence of the two alleles for each variant, and then includes the remaining fragments and variants by relaxing constraints....

    [...]

  • ...These blocks were used as the input for eight SIH algorithms (namely ReFHap, HapCUT, FastHare, DGS, MLF, 2d-MEC, SHRThree and SpeedHap)....

    [...]

Journal ArticleDOI
TL;DR: This article surveys seven recent approaches to the SIH problem, evaluates them extensively using real human haplotype data from the HapMap project, and implements a data generator tailored to the current shotgun sequencing technology.
Abstract: Motivation: Single nucleotide polymorphisms are the most common form of variation in human DNA, and are involved in many research fields, from molecular biology to medical therapy. The technological opportunity to deal with long DNA sequences using shotgun sequencing has raised the problem of fragment recombination. In this regard, the Single Individual Haplotyping (SIH) problem has received considerable attention over the past few years. Results: In this article, we survey seven recent approaches to the SIH problem and evaluate them extensively using real human haplotype data from the HapMap project. We also implemented a data generator tailored to the current shotgun sequencing technology that uses haplotypes from the HapMap project. Availability: The data we used to compare the algorithms are available on demand, since we think they represent an important benchmark that can be used to easily compare novel algorithmic ideas with the state of the art. Moreover, we had to re-implement six of the algorithms surveyed because the original code was not available to us. Five of these algorithms and the data generator used in this article, endowed with a Web interface, are available at http://bioalgo.iit.cnr.it/rehap Contact: filippo.geraci@iit.cnr.it

73 citations

Proceedings ArticleDOI
02 Aug 2010
TL;DR: A novel problem formulation for single individual haplotyping whose algorithm, ReFHap, first finds the best cut using a heuristic algorithm for max-cut and then builds haplotypes consistent with that cut; ReFHap is found to perform significantly faster than previous methods without loss of accuracy.
Abstract: Full human genomic sequences have been published in the last two years for a growing number of individuals. Most of them are a mixed consensus of the two real haplotypes because it is still very expensive to separate information coming from the two copies of a chromosome. However, the latest improvements and new experimental approaches promise to solve these issues and provide enough information to reconstruct the sequences for the two copies of each chromosome through bioinformatics methods such as single individual haplotyping. Full haploid sequences provide a complete understanding of the structure of the human genome, allowing accurate predictions of translation in protein coding regions and increasing the power of association studies. In this paper we present a novel problem formulation for single individual haplotyping. We start by assigning a score to each pair of fragments based on their common allele calls, and then we use these scores to formulate the problem as the cut of fragments that maximizes an objective function, similar to the well-known max-cut problem. Our algorithm initially finds the best cut based on a heuristic algorithm for max-cut and then builds haplotypes consistent with that cut. We have compared both accuracy and running time of ReFHap with other heuristic methods on both simulated and real data and found that ReFHap performs significantly faster than previous methods without loss of accuracy.
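The pipeline outlined in this abstract (score fragment pairs, find a heavy cut, build haplotypes from the cut) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not ReFHap's actual method: the score (mismatches minus matches on shared SNPs), the simple local-search cut heuristic, the majority-vote consensus, and the names `pair_score`, `greedy_cut`, and `consensus`. Fragments are modelled as dictionaries from SNP index to a 0/1 allele call, with missing keys standing for gaps, and every SNP is assumed heterozygous.

```python
from typing import Dict, List

Fragment = Dict[int, int]  # SNP index -> allele call (0 or 1); missing key = gap


def pair_score(f: Fragment, g: Fragment) -> int:
    """Illustrative score: mismatches minus matches over the SNPs both fragments cover."""
    common = f.keys() & g.keys()
    return sum(1 if f[s] != g[s] else -1 for s in common)


def greedy_cut(fragments: List[Fragment]) -> List[int]:
    """Local search for a heavy cut: flip a fragment whenever that raises the cut weight."""
    n = len(fragments)
    w = [[pair_score(fragments[i], fragments[j]) for j in range(n)] for i in range(n)]
    side = [0] * n
    improved = True
    while improved:
        improved = False
        for i in range(n):
            # Flipping i cuts its same-side edges and uncuts its cross edges.
            gain = sum(w[i][j] if side[j] == side[i] else -w[i][j]
                       for j in range(n) if j != i)
            if gain > 0:
                side[i] ^= 1
                improved = True
    return side


def consensus(fragments: List[Fragment], side: List[int], n_snps: int) -> List[int]:
    """Majority vote for the side-0 haplotype; side-1 calls are complemented first."""
    hap = []
    for s in range(n_snps):
        votes = 0
        for f, sd in zip(fragments, side):
            if s in f:
                allele = f[s] if sd == 0 else 1 - f[s]
                votes += 1 if allele == 1 else -1
        hap.append(1 if votes > 0 else 0)
    return hap
```

Under the all-heterozygous assumption the second haplotype is the bitwise complement of the one returned by `consensus`. Each flip strictly increases the cut weight, so the local search terminates, but it may stop at a local optimum.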

67 citations


Cites methods from "A fast and accurate heuristic for t..."

  • ...Computational properties of these problems have been analyzed by [16, 15] and several algorithms have been proposed for MEC [1, 6, 23, 26]....

    [...]

  • ...A practical exact algorithm for the individual haplotyping problem MEC/GI....

    [...]

  • ...The input for this test case is a matrix of 32347 SNPs covered by 13905 fragments....

    Table 2: MEC percentage and running time of ReFHap and HapCUT for a real instance with 32347 SNPs and 13905 fragments in chromosome 22

                ReFHap       HapCUT (1 It)   HapCUT (50 It)
    %MEC        6.32%        6.26%           6.24%
    Time        73.04 Sec    0.99 Hours      50.4 Hours

    [...]

  • ...The first one is the Minimum Error Correction (MEC), which is the minimum number of changes within the matrix to make it consistent with the answer haplotypes....

    [...]

  • ...ReFHap consistently produces lower MEC and switch errors....

    [...]
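The MEC measure quoted in the excerpts above can be made concrete with a short Python sketch. It counts the allele calls that would have to change for every fragment to agree with one of two given haplotypes. The dictionary representation of a fragment (SNP index to 0/1 call, missing keys as gaps) and the function names `mismatches` and `correction_count` are assumptions for illustration, not taken from the cited paper.

```python
from typing import Dict, List

Fragment = Dict[int, int]  # SNP index -> allele call (0 or 1); missing key = gap


def mismatches(fragment: Fragment, hap: List[int]) -> int:
    """Number of allele calls in the fragment that disagree with the haplotype."""
    return sum(1 for snp, allele in fragment.items() if hap[snp] != allele)


def correction_count(fragments: List[Fragment], h1: List[int], h2: List[int]) -> int:
    """Corrections needed when each fragment is assigned to its closer haplotype."""
    return sum(min(mismatches(f, h1), mismatches(f, h2)) for f in fragments)


fragments = [{0: 0, 1: 1, 2: 0}, {1: 1, 2: 1, 3: 1}, {0: 1, 1: 0}]
print(correction_count(fragments, [0, 1, 0, 1], [1, 0, 1, 0]))
```

This evaluates a fixed pair of haplotypes; the MEC optimization problem additionally searches for the pair that minimizes this count.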

Journal ArticleDOI
TL;DR: A probabilistic model is developed to approach two realistic scenarios regarding the singular haplotype reconstruction problem--the incompleteness and inconsistency that occurred in the DNA sequencing process to generate the input haplotype fragments, and the common practice used to generate synthetic data in experimental algorithm studies.
Abstract: In this paper, we develop a probabilistic model to approach two realistic scenarios regarding the singular haplotype reconstruction problem--the incompleteness and inconsistency that occurred in the DNA sequencing process to generate the input haplotype fragments, and the common practice used to generate synthetic data in experimental algorithm studies. We design three algorithms in the model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in linear time in the size of the input matrix. We also present experimental results that conform with the theoretical efficient performance of those algorithms. The software of our algorithms is available for public access and for real-time on-line demonstration.

45 citations


Cites methods from "A fast and accurate heuristic for t..."

  • ...Other methods have proposed heuristics (Genovese et al., 2007; Alessandro Panconesi, 2004), but do not...

    [...]

Journal ArticleDOI
TL;DR: A Genetic Algorithm (GA) based method, named GAHap, is introduced to reconstruct SIHs with lowest MEC times, equipped with a well-designed fitness function to obtain better reconstruction rates and is compared with existing methods to show its ability in generating highly reliable solutions.

18 citations

References
Journal ArticleDOI
John W. Belmont, Andrew Boudreau, Suzanne M. Leal, Paul Hardenbol, +229 more
27 Oct 2005
TL;DR: A public database of common variation in the human genome: more than one million single nucleotide polymorphisms for which accurate and complete genotypes have been obtained in 269 DNA samples from four populations, including ten 500-kilobase regions in which essentially all information about common DNA variation has been extracted.
Abstract: Inherited genetic variation has a critical but as yet largely uncharacterized role in human disease. Here we report a public database of common variation in the human genome: more than one million single nucleotide polymorphisms (SNPs) for which accurate and complete genotypes have been obtained in 269 DNA samples from four populations, including ten 500-kilobase regions in which essentially all information about common DNA variation has been extracted. These data document the generality of recombination hotspots, a block-like structure of linkage disequilibrium and low haplotype diversity, leading to substantial correlations of SNPs with many of their neighbours. We show how the HapMap resource can guide the design and analysis of genetic association studies, shed light on structural variation and recombination, and identify loci that may have been subject to natural selection during human evolution.

5,479 citations

Journal ArticleDOI
15 Feb 2001-Nature
TL;DR: This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.
Abstract: We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exon (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.

2,908 citations

Book ChapterDOI
28 Aug 2001
TL;DR: It is shown that the general SNPs Haplotyping Problem is NP-hard for mate-pairs assembly data, and polynomial time algorithms for fragment assembly data are designed, and the Minimum SNPs Removal problem amounts to finding the largest independent set in a weakly triangulated graph.
Abstract: Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variation. They are of fundamental importance for a variety of applications including medical diagnostic and drug design. They also provide the highest-resolution genomic fingerprint for tracking disease genes. This paper is devoted to algorithmic problems related to computational SNPs validation based on genome assembly of diploid organisms. In diploid genomes, there are two copies of each chromosome. A description of the SNPs sequence information from one of the two chromosomes is called SNPs haplotype. The basic problem addressed here is the Haplotyping, i.e., given a set of SNPs prospects inferred from the assembly alignment of a genomic region of a chromosome, find the maximally consistent pair of SNPs haplotypes by removing data "errors" related to DNA sequencing errors, repeats, and paralogous recruitment. In this paper, we introduce several versions of the problem from a computational point of view. We show that the general SNPs Haplotyping Problem is NP-hard for mate-pairs assembly data, and design polynomial time algorithms for fragment assembly data. We give a network-flow based polynomial algorithm for the Minimum Fragment Removal Problem, and we show that the Minimum SNPs Removal problem amounts to finding the largest independent set in a weakly triangulated graph.

213 citations


"A fast and accurate heuristic for t..." refers background or methods in this paper

  • ...MFR is NP-hard for fragments with at most 1 gap, and MSR is NP-hard for fragments with at most 2 gaps [7]....

    [...]

  • ...This problem has been tackled both from a theoretical point of view [1, 3, 4, 7, 13] and from a more practical one [8, 11, 14]....

    [...]

  • ...In previous papers [7, 11] experiments were based on SNP matrices obtained from the fragmentation of artificially generated haplotype data....

    [...]

  • ...At this point the strings are split in fragments by selecting iteratively the next cut point at an integer distance from the previous one chosen uniformly at random in the range [3, 7], starting from the first base....

    [...]

  • ...Each fragment covers a number of SNPs roughly in the range [3, 7], thus we chose the length of each fragment in this range....

    [...]
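The fragmentation step quoted in the excerpts above (cut points placed at a uniformly random integer distance in [3, 7] from the previous one, starting from the first base) can be sketched in Python as follows. The function name `split_into_fragments`, the `(start, calls)` output format, and the handling of a shorter final fragment are assumptions for illustration; the excerpts do not say how the tail is treated, and the gap and read-error injection steps are not shown.

```python
import random
from typing import List, Optional, Tuple


def split_into_fragments(hap: List[int], lo: int = 3, hi: int = 7,
                         rng: Optional[random.Random] = None) -> List[Tuple[int, List[int]]]:
    """Split a haplotype into consecutive fragments with lengths drawn uniformly from [lo, hi]."""
    rng = rng or random.Random()
    fragments = []
    start = 0
    while start < len(hap):
        # Next cut point: uniform integer distance in [lo, hi] from the previous cut.
        end = min(start + rng.randint(lo, hi), len(hap))
        fragments.append((start, hap[start:end]))  # assumption: the tail may be shorter than lo
        start = end
    return fragments


haplotype = [0, 1] * 10
for start, calls in split_into_fragments(haplotype, rng=random.Random(0)):
    print(start, calls)
```

Passing a seeded `random.Random` makes a run reproducible, which is convenient when comparing algorithms on the same synthetic instance.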

Journal ArticleDOI
TL;DR: To improve the MEC model for haplotype reconstruction, a new computational model is proposed, which simultaneously employs genotype information of an individual in the process of SNP correction, and is called MEC with genotypes information (shortly, MEC/GI).
Abstract: Motivation: Haplotype reconstruction based on aligned single nucleotide polymorphism (SNP) fragments is to infer a pair of haplotypes from localized polymorphism data gathered through short genome fragment assembly. An important computational model of this problem is the minimum error correction (MEC) model, which has been mentioned in several literatures. The model retrieves a pair of haplotypes by correcting minimum number of SNPs in given genome fragments coming from an individual's DNA. Results: In the first part of this paper, an exact algorithm for the MEC model is presented. Owing to the NP-hardness of the MEC model, we also design a genetic algorithm (GA). The designed GA is intended to solve large size problems and has very good performance. The strength and weakness of the MEC model are shown using experimental results on real data and simulation data. In the second part of this paper, to improve the MEC model for haplotype reconstruction, a new computational model is proposed, which simultaneously employs genotype information of an individual in the process of SNP correction, and is called MEC with genotype information (shortly, MEC/GI). Computational results on extensive datasets show that the new model has much higher accuracy in haplotype reconstruction than the pure MEC model. Contact: wangrsh@amss.ac.cn

145 citations


"A fast and accurate heuristic for t..." refers background or methods in this paper

  • ...As future work we plan a comparison of our method with the one in [14]....

    [...]

  • ...We are not aware of any publicly available implementation of the methods described in [8, 11, 14, 16], therefore we chose as baseline the method in [11] that is comparable to ours in terms of speed, and does not rely on any statistical model....

    [...]

  • ...This problem has been tackled both from a theoretical point of view [1, 3, 4, 7, 13] and from a more practical one [8, 11, 14]....

    [...]

  • ...[14] describe a Genetic Algorithm for this problem that in some reported experiments gives good performance for short haplotypes (about 100 SNPs)....

    [...]

Book ChapterDOI
17 Sep 2004
TL;DR: A simple heuristic is introduced and shown experimentally to be very fast and accurate; compared with the dynamic programming approach of [8], it is much faster and also more accurate.
Abstract: We study the single individual SNP haplotype reconstruction problem. We introduce a simple heuristic and show experimentally that it is very fast and accurate. In particular, when compared with the dynamic programming approach of [8], it is much faster and also more accurate. We expect Fast Hare to be very useful in practical applications. We also introduce a combinatorial problem related to the SNP haplotype reconstruction problem that we call Min Element Removal. We prove its NP-hardness in the gapless case and its O(log n)-approximability in the general case.

136 citations