A fast and accurate heuristic for the single individual SNP haplotyping problem with many gaps, high reading error rate and low coverage

doi:10.1007/978-3-540-74126-8_6

Book Chapter•DOI•

Loredana M. Genovese, Filippo Geraci, Marco Pellegrini

08 Aug 2007-pp 49-60

TL;DR: A new heuristic method is described that can handle many gapped fragments and remains effective even when the input fragments have a high rate of reading errors and low coverage.

Abstract: Single nucleotide polymorphism (SNP) is the most frequent form of DNA variation. The set of SNPs present in a chromosome (called the haplotype) is of interest in a wide area of applications in molecular biology and biomedicine, including diagnostics and medical therapy. In this paper we propose a new heuristic method for the problem of haplotype reconstruction for (portions of) a pair of homologous human chromosomes from a single individual (SIH). The problem is well known in the literature and exact algorithms have been proposed for the case when no (or few) gaps are allowed in the input fragments. These algorithms, though exact and of polynomial complexity, are slow in practice. Therefore fast heuristics have been proposed. In this paper we describe a new heuristic method that is able to tackle the case of many gapped fragments and retains its effectiveness even when the input fragments have a high rate of reading errors (up to 20%) and low coverage (as low as 3). We test our method on real data from the HapMap Project.

Citations

Journal Article•DOI•

Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques.

Jorge Duitama¹, Gayle K. McEwen¹, Thomas Huebsch¹, Stefanie Palczewski¹, Sabrina Schulz¹, Kevin J. Verstrepen¹, Eun-Kyung Suk¹, Margret R. Hoehe¹•Institutions (1)

Max Planck Society¹

01 Mar 2012-Nucleic Acids Research

TL;DR: Comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that the proposed SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes.

Abstract: Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics.

125 citations

Cites methods from "A fast and accurate heuristic for t..."

...Two more algorithms are mentioned in (23), a randomized one called SHRThree (31), and SpeedHap (32) which tries to build first a core solution with variants and fragments with full agreement and evidence of presence of the two alleles for each variant, and then includes the remaining fragments and variants by relaxing constraints....
[...]
...These blocks were used as the input for eight SIH algorithms (namely ReFHap, HapCUT, FastHare, DGS, MLF, 2d-MEC, SHRThree and SpeedHap)....
[...]

Journal Article•DOI•

A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem

Filippo Geraci

01 Sep 2010-Bioinformatics

TL;DR: This article surveys seven recent approaches to the SIH problem, evaluates them extensively using real human haplotype data from the HapMap project, and implements a data generator tailored to the current shotgun sequencing technology.

Abstract: Motivation: Single nucleotide polymorphisms are the most common form of variation in human DNA, and are involved in many research fields, from molecular biology to medical therapy. The technological opportunity to deal with long DNA sequences using shotgun sequencing has raised the problem of fragment recombination. In this regard, the Single Individual Haplotyping (SIH) problem has received considerable attention over the past few years. Results: In this article, we survey seven recent approaches to the SIH problem and evaluate them extensively using real human haplotype data from the HapMap project. We also implemented a data generator tailored to the current shotgun sequencing technology that uses haplotypes from the HapMap project. Availability: The data we used to compare the algorithms are available on demand, since we think they represent an important benchmark that can be used to easily compare novel algorithmic ideas with the state of the art. Moreover, we had to re-implement six of the algorithms surveyed because the original code was not available to us. Five of these algorithms and the data generator used in this article, endowed with a Web interface, are available at http://bioalgo.iit.cnr.it/rehap Contact: filippo.geraci@iit.cnr.it

73 citations

Proceedings Article•DOI•

ReFHap: a reliable and fast algorithm for single individual haplotyping

Jorge Duitama¹, Thomas Huebsch², Gayle K. McEwen², Eun-Kyung Suk², Margret R. Hoehe²•Institutions (2)

University of Connecticut¹, Max Planck Society²

02 Aug 2010

TL;DR: A novel problem formulation for single individual haplotyping is presented: the algorithm first finds the best cut using a heuristic for max-cut and then builds haplotypes consistent with that cut. ReFHap is found to perform significantly faster than previous methods without loss of accuracy.

Abstract: Full human genomic sequences have been published in the last two years for a growing number of individuals. Most of them are a mixed consensus of the two real haplotypes because it is still very expensive to separate information coming from the two copies of a chromosome. However, latest improvements and new experimental approaches promise to solve these issues and provide enough information to reconstruct the sequences for the two copies of each chromosome through bioinformatics methods such as single individual haplotyping. Full haploid sequences provide a complete understanding of the structure of the human genome, allowing accurate predictions of translation in protein coding regions and increasing power of association studies. In this paper we present a novel problem formulation for single individual haplotyping. We start by assigning a score to each pair of fragments based on their common allele calls and then we use these scores to formulate the problem as the cut of fragments that maximizes an objective function, similar to the well known max-cut problem. Our algorithm initially finds the best cut based on a heuristic algorithm for max-cut and then builds haplotypes consistent with that cut. We have compared both accuracy and running time of ReFHap with other heuristic methods on both simulated and real data and found that ReFHap performs significantly faster than previous methods without loss of accuracy.
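
The cut-based formulation in this abstract can be illustrated with a small sketch. It is not ReFHap's actual scoring function or max-cut heuristic, neither of which is given here: the score (disagreements minus agreements on shared allele calls), the plain local search, the fragment encoding over '0', '1' and '-' (gap), and the names `pair_score` and `local_search_cut` are all assumptions made for illustration.

```python
from itertools import combinations


def pair_score(f, g):
    """Assumed score: disagreements minus agreements over the positions
    where both fragments carry an allele call ('-' marks a gap)."""
    score = 0
    for a, b in zip(f, g):
        if a != '-' and b != '-':
            score += 1 if a != b else -1
    return score


def local_search_cut(fragments):
    """Simple local search for a heavy cut (a stand-in heuristic):
    move one fragment across the cut whenever that strictly increases
    the total score of the pairs separated by the cut."""
    n = len(fragments)
    w = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = pair_score(fragments[i], fragments[j])
    side = [0] * n  # start with every fragment on side 0
    improved = True
    while improved:
        improved = False
        for v in range(n):
            # Flipping v turns same-side pairs into crossing pairs and
            # crossing pairs into same-side pairs.
            gain = sum(w[v][u] if side[u] == side[v] else -w[v][u]
                       for u in range(n) if u != v)
            if gain > 0:
                side[v] ^= 1
                improved = True
    return side


fragments = ["0110", "011-", "1001", "-001"]
print(local_search_cut(fragments))
```

Fragments that mostly disagree end up on opposite sides of the cut, so each side collects the fragments that plausibly come from the same chromosome copy. Building the two haplotypes from the sides (for example by a per-column vote) is a separate step not shown here.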

67 citations

Cites methods from "A fast and accurate heuristic for t..."

...Computational properties of these problems have been analyzed by [16, 15] and several algorithms have been proposed for MEC [1, 6, 23, 26]....
[...]
...A practical exact algorithm for the individual haplotyping problem MEC/GI....
[...]
...The input for this test case is a matrix of 32347 SNPs covered by 13905 fragments....
Table 2: MEC percentage and running time of ReFHap and HapCUT for a real instance with 32347 SNPs and 13905 fragments in chromosome 22
         ReFHap       HapCUT (1 It)   HapCUT (50 It)
  %MEC   6.32%        6.26%           6.24%
  Time   73.04 Sec    0.99 Hours      50.4 Hours
[...]
...The first one is the Minimum Error Correction (MEC), which is the minimum number of changes within the matrix to make it consistent with the answer haplotypes....
[...]
...ReFHap consistently produces lower MEC and switch errors....
[...]
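
The MEC measure quoted in the snippets above can be sketched in a few lines. This is a minimal illustration, assuming fragments are strings over '0', '1' and '-' (gap) aligned to the SNP columns; the function name `mec_score` and the encoding are hypothetical and not taken from the cited papers. It computes the correction cost for a given pair of haplotypes; the MEC problem itself asks for the pair that minimizes this cost.

```python
def mec_score(fragments, hap_a, hap_b):
    """Count the allele calls that must change so that every fragment
    agrees with one of the two candidate haplotypes (gaps '-' are ignored)."""
    def mismatches(fragment, haplotype):
        return sum(1 for f, h in zip(fragment, haplotype)
                   if f != '-' and f != h)

    # Each fragment is charged to whichever haplotype it is closer to.
    return sum(min(mismatches(f, hap_a), mismatches(f, hap_b))
               for f in fragments)


fragments = ["01-0", "-101", "0110", "1-01"]
print(mec_score(fragments, "0110", "1001"))
```

Dividing the score by the total number of allele calls in the matrix gives a percentage comparable in spirit to the %MEC row of the table, although the exact normalization used there is not stated in the snippet.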

Journal Article•DOI•

Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments

Zhixiang Chen¹, Bin Fu¹, Robert T. Schweller¹, Boting Yang², Zhiyu Zhao³, Zhiyu Zhao⁴, Binhai Zhu⁵•Institutions (5)

University of Texas–Pan American¹, University of Regina², University of New Orleans³, University of Texas Southwestern Medical Center⁴, Montana State University⁵

12 Jun 2008-Journal of Computational Biology

TL;DR: A probabilistic model is developed to approach two realistic scenarios regarding the singular haplotype reconstruction problem--the incompleteness and inconsistency that occurred in the DNA sequencing process to generate the input haplotype fragments, and the common practice used to generate synthetic data in experimental algorithm studies.

Abstract: In this paper, we develop a probabilistic model to approach two realistic scenarios regarding the singular haplotype reconstruction problem--the incompleteness and inconsistency that occurred in the DNA sequencing process to generate the input haplotype fragments, and the common practice used to generate synthetic data in experimental algorithm studies. We design three algorithms in the model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in linear time in the size of the input matrix. We also present experimental results that conform with the theoretical efficient performance of those algorithms. The software of our algorithms is available for public access and for real-time on-line demonstration.

45 citations

Cites methods from "A fast and accurate heuristic for t..."

...Other methods have proposed heuristics (Genovese et al., 2007; Alessandro Panconesi, 2004), but do not...
[...]

Journal Article•DOI•

Using genetic algorithm in reconstructing single individual haplotype with minimum error correction

Tai-Chun Wang¹, Javid Taheri¹, Albert Y. Zomaya¹•Institutions (1)

University of Sydney¹

01 Oct 2012-Journal of Biomedical Informatics

TL;DR: A Genetic Algorithm (GA) based method, named GAHap, is introduced to reconstruct SIHs with lowest MEC times, equipped with a well-designed fitness function to obtain better reconstruction rates and is compared with existing methods to show its ability in generating highly reliable solutions.

18 citations

References

Journal Article•DOI•

A haplotype map of the human genome

John W. Belmont¹, Andrew Boudreau, Suzanne M. Leal¹, Paul Hardenbol +229 more•Institutions (40)

27 Oct 2005

TL;DR: A public database of common variation in the human genome: more than one million single nucleotide polymorphisms for which accurate and complete genotypes have been obtained in 269 DNA samples from four populations, including ten 500-kilobase regions in which essentially all information about common DNA variation has been extracted.

Abstract: Inherited genetic variation has a critical but as yet largely uncharacterized role in human disease. Here we report a public database of common variation in the human genome: more than one million single nucleotide polymorphisms (SNPs) for which accurate and complete genotypes have been obtained in 269 DNA samples from four populations, including ten 500-kilobase regions in which essentially all information about common DNA variation has been extracted. These data document the generality of recombination hotspots, a block-like structure of linkage disequilibrium and low haplotype diversity, leading to substantial correlations of SNPs with many of their neighbours. We show how the HapMap resource can guide the design and analysis of genetic association studies, shed light on structural variation and recombination, and identify loci that may have been subject to natural selection during human evolution.

5,479 citations

Journal Article•DOI•

A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms

Ravi Sachidanandam, David Weissman, Steven Schmidt, Jerzy M. Kakol, Lincoln Stein, Gabor T. Marth, Steve Sherry, James C. Mullikin, Beverley J. Mortimore, David Willey, Sarah E. Hunt, Charlotte G. Cole, Penny Coggill, Catherine M. Rice, Zemin Ning, Jane Rogers, David R. Bentley, Pui-Yan Kwok, Elaine R. Mardis, Raymond T. Yeh, Brian Schultz, Lisa Cook, Ruth Davenport, Michael Dante, Lucinda Fulton, LaDeana W. Hillier, Robert H. Waterston, John Douglas Mcpherson, Brian Gilman, Stephen F. Schaffner, William J. Van Etten, David Reich, John M. Higgins, Mark J. Daly, Brendan Blumenstiel, Jennifer Baldwin, Nicole Stange-Thomann, Michael C. Zody, Lauren Linton, Eric S. Lander, David Altshuler

15 Feb 2001-Nature

TL;DR: This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.

Abstract: We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exon (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.
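
As a quick consistency check on the two figures stated in this abstract, 1.42 million SNPs at an average of one SNP every 1.9 kilobases implies roughly 2.7 Gb of available sequence. This is plain arithmetic on the quoted numbers, not a figure taken from the paper.

```latex
\[
1.42 \times 10^{6}\ \text{SNPs} \times 1.9\ \text{kb/SNP}
  = 2.698 \times 10^{6}\ \text{kb} \approx 2.7\ \text{Gb}
\]
```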

2,908 citations

Book Chapter•DOI•

SNPs Problems, Complexity, and Algorithms

Giuseppe Lancia¹, Giuseppe Lancia², Vineet Bafna¹, Sorin Istrail¹, Ross Lippert¹, Russell Schwartz¹•Institutions (2)

Celera Corporation¹, University of Padua²

28 Aug 2001

TL;DR: It is shown that the general SNPs Haplotyping Problem is NP-hard for mate-pairs assembly data, and polynomial time algorithms for fragment assembly data are designed, and the Minimum SNPs Removal problem amounts to finding the largest independent set in a weakly triangulated graph.

Abstract: Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variation. They are of fundamental importance for a variety of applications including medical diagnostic and drug design. They also provide the highest-resolution genomic fingerprint for tracking disease genes. This paper is devoted to algorithmic problems related to computational SNPs validation based on genome assembly of diploid organisms. In diploid genomes, there are two copies of each chromosome. A description of the SNPs sequence information from one of the two chromosomes is called SNPs haplotype. The basic problem addressed here is the Haplotyping, i.e., given a set of SNPs prospects inferred from the assembly alignment of a genomic region of a chromosome, find the maximally consistent pair of SNPs haplotypes by removing data "errors" related to DNA sequencing errors, repeats, and paralogous recruitment. In this paper, we introduce several versions of the problem from a computational point of view. We show that the general SNPs Haplotyping Problem is NP-hard for mate-pairs assembly data, and design polynomial time algorithms for fragment assembly data. We give a network-flow based polynomial algorithm for the Minimum Fragment Removal Problem, and we show that the Minimum SNPs Removal problem amounts to finding the largest independent set in a weakly triangulated graph.

213 citations

"A fast and accurate heuristic for t..." refers background or methods in this paper

...MFR is NP-hard for fragments with at most 1 gap, and MSR is NP-hard for fragments with at most 2 gaps [7]....
[...]
...This problem has been tackled both from a theoretical point of view [1, 3, 4, 7, 13] and from a more practical one [8, 11, 14]....
[...]
...In previous papers [7, 11] experiments were based on SNP matrices obtained from the fragmentation of artificially generated haplotype data....
[...]
...At this point the strings are split in fragments by selecting iteratively the next cut point at an integer distance from the previous one chosen uniformly at random in the range [3, 7], starting from the first base....
[...]
...Each fragment covers a number of SNP's in the range roughly [3, 7], thus we chose the length of each fragment in this range....
[...]
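
The fragment-splitting step described in the snippets above can be sketched as follows. This is a minimal illustration under stated assumptions: the haplotype is a plain string of allele calls, the function name `split_into_fragments` and its parameters are hypothetical, and the handling of a short final fragment is a choice made here, since the snippet does not say what happens at the end of the string.

```python
import random


def split_into_fragments(haplotype, min_len=3, max_len=7, rng=None):
    """Cut a haplotype string into consecutive fragments. Each cut point
    lies at a distance drawn uniformly at random from [min_len, max_len]
    after the previous one, starting from the first base. The last
    fragment may be shorter if the string runs out (an assumption)."""
    rng = rng or random.Random()
    fragments = []
    start = 0
    while start < len(haplotype):
        step = rng.randint(min_len, max_len)  # inclusive on both ends
        fragments.append((start, haplotype[start:start + step]))
        start += step
    return fragments


hap = "0110100110010110"
for offset, piece in split_into_fragments(hap, rng=random.Random(0)):
    print(offset, piece)
```

Passing a seeded `random.Random` keeps a generated benchmark reproducible. Reading errors and gaps would be applied to the fragments in later steps that are not shown here.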

Journal Article•DOI•

Haplotype reconstruction from SNP fragments by minimum error correction

Rui-Sheng Wang¹, Ling-Yun Wu¹, Zhenping Li, Xiang-Sun Zhang¹•Institutions (1)

Chinese Academy of Sciences¹

15 May 2005-Bioinformatics

TL;DR: To improve the MEC model for haplotype reconstruction, a new computational model is proposed, which simultaneously employs genotype information of an individual in the process of SNP correction, and is called MEC with genotype information (MEC/GI for short).

Abstract: Motivation: Haplotype reconstruction based on aligned single nucleotide polymorphism (SNP) fragments is to infer a pair of haplotypes from localized polymorphism data gathered through short genome fragment assembly. An important computational model of this problem is the minimum error correction (MEC) model, which has been mentioned in several literatures. The model retrieves a pair of haplotypes by correcting minimum number of SNPs in given genome fragments coming from an individual's DNA. Results: In the first part of this paper, an exact algorithm for the MEC model is presented. Owing to the NP-hardness of the MEC model, we also design a genetic algorithm (GA). The designed GA is intended to solve large size problems and has very good performance. The strength and weakness of the MEC model are shown using experimental results on real data and simulation data. In the second part of this paper, to improve the MEC model for haplotype reconstruction, a new computational model is proposed, which simultaneously employs genotype information of an individual in the process of SNP correction, and is called MEC with genotype information (shortly, MEC/GI). Computational results on extensive datasets show that the new model has much higher accuracy in haplotype reconstruction than the pure MEC model. Contact: wangrsh@amss.ac.cn

145 citations

"A fast and accurate heuristic for t..." refers background or methods in this paper

...As future work we plan a comparison of our method with the one in [14]....
[...]
...We are not aware of any publicly available implementation of the methods described in [8, 11, 14, 16], therefore we chose as baseline the method in [11] that is comparable to ours in terms of speed, and does not rely on any statistical model....
[...]
...This problem has been tackled both from a theoretical point of view [1, 3, 4, 7, 13] and from a more practical one [8, 11, 14]....
[...]
...[14] describe a Genetic Algorithm for this problem that in some reported experiments gives good performance for short haplotypes (about 100 SNPs)....
[...]

Book Chapter•DOI•

Fast hare: A fast heuristic for single individual SNP haplotype reconstruction

Alessandro Panconesi¹, Mauro Sozio¹•Institutions (1)

Sapienza University of Rome¹

17 Sep 2004

TL;DR: A simple heuristic is introduced and shown experimentally to be very fast and accurate; compared with the dynamic programming algorithm of [8], it is much faster and also more accurate.

Abstract: We study the single individual SNP haplotype reconstruction problem. We introduce a simple heuristic and show experimentally that it is very fast and accurate. In particular, when compared with the dynamic programming algorithm of [8], it is much faster and also more accurate. We expect Fast Hare to be very useful in practical applications. We also introduce a combinatorial problem related to the SNP haplotype reconstruction problem that we call Min Element Removal. We prove its NP-hardness in the gapless case and its O(log n)-approximability in the general case.

136 citations