Journal ArticleDOI

Self-organizing map approaches for the haplotype assembly problem

01 Jun 2009 - Mathematics and Computers in Simulation (North-Holland) - Vol. 79, Iss. 10, pp. 3026-3037
TL;DR: This paper studies the minimum error correction (MEC) model for the haplotype assembly problem, explores self-organizing map (SOM) methods for it, and proposes a novel SOM approach that can efficiently reconstruct haplotype pairs under realistic parameter settings.
About: This article was published in Mathematics and Computers in Simulation on 2009-06-01. It has received 9 citations to date.
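The TL;DR above only names the approach; the sketch below is a minimal illustration (not the authors' implementation) of how a two-node self-organizing map can cluster SNP fragments into two haplotype groups and read off a haplotype pair. The gap encoding, distance measure, learning-rate schedule and thresholding step are all assumptions.

```python
# Minimal sketch (not the paper's algorithm): cluster SNP fragments with a
# two-node SOM, then threshold the prototypes to obtain a haplotype pair.
# Gaps are encoded as -1 and ignored in distance and update steps.
import numpy as np

def som_haplotype_assembly(fragments, epochs=50, lr0=0.5):
    """fragments: (m, n) int array with entries 0, 1, or -1 for uncovered sites."""
    m, n = fragments.shape
    rng = np.random.default_rng(0)
    weights = rng.random((2, n))        # one prototype vector per haplotype

    for epoch in range(epochs):
        lr = lr0 * (1.0 - epoch / epochs)          # linearly decaying learning rate
        for frag in fragments:
            covered = frag >= 0
            # Distance to each prototype, counted only over covered SNP sites.
            dists = [np.sum((frag[covered] - w[covered]) ** 2) for w in weights]
            winner = int(np.argmin(dists))
            # Move the winning prototype toward the fragment on covered sites.
            weights[winner, covered] += lr * (frag[covered] - weights[winner, covered])

    return (weights > 0.5).astype(int)             # threshold prototypes to 0/1 alleles

# Toy usage: 4 fragments over 5 SNP sites, -1 marks positions a read does not cover.
frags = np.array([[0, 1, 1, -1, 0],
                  [0, 1, -1, 0, 0],
                  [1, 0, 0, 1, -1],
                  [-1, 0, 0, 1, 1]])
print(som_haplotype_assembly(frags))
```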
Citations
Journal ArticleDOI
TL;DR: Further development of the SOM is discussed regarding network architecture, spatio-temporal patterning, and the presentation of model results in ecological sciences.

173 citations

Journal ArticleDOI
TL;DR: An approach to finding optimal solutions for the haplotype assembly problem under the minimum-error-correction (MEC) model with or without the all-heterozygous assumption is developed.
Abstract: Motivation: Haplotypes play a crucial role in genetic analysis and have many applications such as gene disease diagnoses, association studies, ancestry inference and so forth. The development of DNA sequencing technologies makes it possible to obtain haplotypes from a set of aligned reads originating from both copies of a chromosome of a single individual. This approach is often known as haplotype assembly. Exact algorithms that can give optimal solutions to the haplotype assembly problem are in high demand. Unfortunately, previous algorithms for this problem either fail to output optimal solutions or take too long, even when executed on a PC cluster. Results: We develop an approach to finding optimal solutions for the haplotype assembly problem under the minimum-error-correction (MEC) model. Most of the previous approaches assume that the columns in the input matrix correspond to (putative) heterozygous sites. This all-heterozygous assumption is correct for most columns, but it may be incorrect for a small number of columns. In this article, we consider the MEC model with or without the all-heterozygous assumption. In our approach, we first use new methods to decompose the input read matrix into small independent blocks and then model the problem for each block as an integer linear programming problem, which is then solved by an integer linear programming solver. We have tested our program on a single PC [a Linux (x64) desktop PC with an i7-3960X CPU], using the filtered HuRef and NA12878 datasets (after applying some variant calling methods). With the all-heterozygous assumption, our approach can optimally solve the whole HuRef dataset within a total time of 31 h (26 h for the most difficult block of the 15th chromosome and only 5 h for the other blocks). To our knowledge, this is the first time that MEC optimal solutions have been completely obtained for the filtered HuRef dataset. Moreover, in the general case (without the all-heterozygous assumption), for the HuRef dataset our approach can optimally solve all the chromosomes except the most difficult block in chromosome 15 within a total time of 12 days. For both the HuRef and NA12878 datasets, the optimal costs in the general case are sometimes much smaller than those in the all-heterozygous case. This implies that some columns in the input matrix (after applying certain variant calling methods) still correspond to false-heterozygous sites. Availability: Our program and the optimal solutions found for the HuRef dataset are available at http://rnc.r.dendai.ac.jp/hapAssembly.html.

64 citations
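The abstract above does not give its integer linear program explicitly. The sketch below shows one common way to write the MEC objective for a single block as an ILP under the all-heterozygous assumption (haplotype 2 is the complement of haplotype 1); the variable names, the product linearization and the use of the PuLP package are illustrative assumptions, not the authors' formulation.

```python
# Hedged MEC-as-ILP sketch for one block of reads (assumes the PuLP package).
import pulp

def mec_ilp(block):
    """block: list of reads, each a dict {site_index: allele in {0, 1}}."""
    sites = sorted({j for read in block for j in read})
    prob = pulp.LpProblem("MEC_block", pulp.LpMinimize)

    # y[j] = allele of haplotype 1 at site j; x[i] = 1 if read i is assigned to haplotype 2.
    y = {j: pulp.LpVariable(f"y_{j}", cat=pulp.LpBinary) for j in sites}
    x = {i: pulp.LpVariable(f"x_{i}", cat=pulp.LpBinary) for i in range(len(block))}
    errors = []

    for i, read in enumerate(block):
        for j, a in read.items():
            e = pulp.LpVariable(f"e_{i}_{j}", cat=pulp.LpBinary)      # 1 if entry (i, j) is corrected
            w = pulp.LpVariable(f"w_{i}_{j}", lowBound=0, upBound=1)  # linearizes x[i] * y[j]
            prob += w <= x[i]
            prob += w <= y[j]
            prob += w >= x[i] + y[j] - 1
            v = y[j] + x[i] - 2 * w            # allele of the haplotype read i is assigned to
            prob += (e >= 1 - v) if a == 1 else (e >= v)
            errors.append(e)

    prob += pulp.lpSum(errors)                  # minimize total corrections (the MEC score)
    prob.solve()
    h1 = [int(y[j].value()) for j in sites]
    h2 = [1 - b for b in h1]
    return pulp.value(prob.objective), h1, h2

# Toy block: three reads over four heterozygous sites; optimal MEC cost is 0.
reads = [{0: 0, 1: 1, 2: 1}, {1: 1, 2: 1, 3: 0}, {0: 1, 1: 0, 3: 1}]
print(mec_ilp(reads))
```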

Journal ArticleDOI
TL;DR: The simulation pipeline haplosim is developed to evaluate the performance of three haplotype estimation algorithms for polyploids (HapCompass, HapTree and SDhaP) in settings varying in sequencing approach, ploidy levels and genomic diversity, using tetraploid potato as the model; the results show that sequencing depth is the major determinant of haplotype quality.
Abstract: Haplotypes are the units of inheritance in an organism, and many genetic analyses depend on their precise determination. Methods for haplotyping single individuals use the phasing information available in next-generation sequencing reads, by matching overlapping single-nucleotide polymorphisms while penalizing post hoc nucleotide corrections made. Haplotyping diploids is relatively easy, but the complexity of the problem increases drastically for polyploid genomes, which are found in both model organisms and in economically relevant plant and animal species. Although a number of tools are available for haplotyping polyploids, the effects of the genomic makeup and the sequencing strategy followed on the accuracy of these methods have hitherto not been thoroughly evaluated. We developed the simulation pipeline haplosim to evaluate the performance of three haplotype estimation algorithms for polyploids: HapCompass, HapTree and SDhaP, in settings varying in sequencing approach, ploidy levels and genomic diversity, using tetraploid potato as the model. Our results show that sequencing depth is the major determinant of haplotype estimation quality, that 1 kb PacBio circular consensus sequencing reads and Illumina reads with large insert-sizes are competitive and that all methods fail to produce good haplotypes when ploidy levels increase. Comparing the three methods, HapTree produces the most accurate estimates, but also consumes the most resources. There is clearly room for improvement in polyploid haplotyping algorithms.

54 citations

Journal ArticleDOI
TL;DR: This review investigates how computational haplotype determination methods have been developed and presents the remaining problems affecting the determination of the haplotype of a single individual using next-generation sequencing methods.
Abstract: Genome-wide association studies have expanded our understanding of the relationship between the human genome and disease. However, because of current technical limitations, it is still challenging to clearly resolve diploid sequences, that is, two copies for each chromosome. One copy of each chromosome is inherited from each parent, and the genomic function is determined by the interplay between the alleles represented as genotypes in the diploid sequences. Thus, to understand the nature of genetic variation in biological processes, including disease, it is necessary to determine the complete genomic sequence of each haplotype. Although there are experimental approaches for haplotype sequencing that physically separate the chromosomes, these methods are expensive and laborious and require special equipment. Here, we review the computational approaches that can be used to determine the haplotype phase. Since 1990, many researchers have tried to reconstruct the haplotype phase using a variety of computational methods, and some of these efforts have successfully helped to determine the haplotype phase. In this review, we investigate how computational haplotype determination methods have been developed, and we present the remaining problems affecting the determination of the haplotype of a single individual using next-generation sequencing methods.

35 citations


Cites methods from "Self-organizing map approaches for ..."

  • ...$D_{\mathrm{perFragToFrag}}(f_j, h_k) = \sum_{i=0}^{n} D_{\mathrm{perSNP}}(f_{ji}, f_{ki})$ (7). Wu et al. (2009) clustered the m fragments to two groups representing haplotypes by self-organizing map (SOM) and Xu and Li (2012) used a semi-supervised k-means clustering method....

    [...]
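The excerpt does not define the per-SNP distance in equation (7); a hedged reading is sketched below, using a simple mismatch count that ignores uncovered sites (an assumption, not necessarily the citing review's definition), followed by the assignment of each fragment to the nearer of two candidate haplotypes.

```python
# Hedged reading of equation (7): fragment-to-haplotype distance as a sum of
# per-SNP distances. The per-SNP distance used here (0 for a match or an
# uncovered site, 1 for a mismatch) is an illustrative assumption.
def per_snp_distance(frag_allele, hap_allele, gap="-"):
    if frag_allele == gap or hap_allele == gap:
        return 0                     # uncovered sites contribute nothing
    return 0 if frag_allele == hap_allele else 1

def frag_to_hap_distance(fragment, haplotype):
    return sum(per_snp_distance(f, h) for f, h in zip(fragment, haplotype))

# Assign each fragment to the nearer of two candidate haplotypes (the clustering
# step the excerpt attributes to the SOM of Wu et al., 2009).
fragments = ["01-10", "10-01", "011-0"]
h1, h2 = "01110", "10001"
for frag in fragments:
    d1, d2 = frag_to_hap_distance(frag, h1), frag_to_hap_distance(frag, h2)
    print(frag, "-> h1" if d1 <= d2 else "-> h2", (d1, d2))
```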

Journal ArticleDOI
Rui Qin, Yan-Kui Liu
TL;DR: A new class of chance model (C-model for short) for data envelopment analysis (DEA) in fuzzy random environments is developed, in which the inputs and outputs are assumed to be characterized by fuzzy random variables with known possibility and probability distributions.

33 citations


Cites methods from "Self-organizing map approaches for ..."

  • ...[45] applied SOMA to solve the minimum error correction model for the haplotype assembly problem....

    [...]

References
Journal ArticleDOI
J. Craig Venter, Mark Raymond Adams, Eugene W. Myers, Peter W. Li, +269 more (12 institutions)
16 Feb 2001 - Science
TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems are indicated.
Abstract: A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies-a whole-genome assembly and a regional chromosome assembly-were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

12,098 citations

Journal ArticleDOI
TL;DR: A modified version of the Celera assembler is developed to facilitate the identification and comparison of alternate alleles within this individual diploid genome, and a novel haplotype assembly strategy is used, able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploids nature of the genome.
Abstract: Presented here is a genome sequence of an individual human. It was produced from ∼32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels) (1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.

1,843 citations

Journal ArticleDOI
TL;DR: A high-resolution analysis of the haplotype structure across 500 kilobases on chromosome 5q31 using 103 single-nucleotide polymorphisms (SNPs) in a European-derived population offers a coherent framework for creating a haplotype map of the human genome.
Abstract: Linkage disequilibrium (LD) analysis is traditionally based on individual genetic markers and often yields an erratic, non-monotonic picture, because the power to detect allelic associations depends on specific properties of each marker, such as frequency and population history. Ideally, LD analysis should be based directly on the underlying haplotype structure of the human genome, but this structure has remained poorly understood. Here we report a high-resolution analysis of the haplotype structure across 500 kilobases on chromosome 5q31 using 103 single-nucleotide polymorphisms (SNPs) in a European-derived population. The results show a picture of discrete haplotype blocks (of tens to hundreds of kilobases), each with limited diversity punctuated by apparent sites of recombination. In addition, we develop an analytical model for LD mapping based on such haplotype blocks. If our observed structure is general (and published data suggest that it may be), it offers a coherent framework for creating a haplotype map of the human genome.

1,778 citations


"Self-organizing map approaches for ..." refers methods in this paper

  • ...Next, we conduct experiments using the data from public Daly set [2]....

    [...]

Journal ArticleDOI
TL;DR: Details of the algorithm for extracting allelic sequences from population samples, along with some population-genetic considerations that influence the likelihood for success of the method, are presented here.
Abstract: Direct sequencing of genomic DNA from diploid individuals leads to ambiguities on sequencing gels whenever there is more than one mismatching site in the sequences of the two orthologous copies of a gene. While these ambiguities cannot be resolved from a single sample without resorting to other experimental methods (such as cloning in the traditional way), population samples may be useful for inferring haplotypes. For each individual in the sample that is homozygous for the amplified sequence, there are no ambiguities in the identification of the allele’s sequence. The sequences of other alleles can be inferred by taking the remaining sequence after “subtracting off” the sequencing ladder of each known site. Details of the algorithm for extracting allelic sequences from such data are presented here, along with some population-genetic considerations that influence the likelihood for success of the method. The algorithm also applies to the problem of inferring haplotype frequencies of closely linked restriction-site polymorphisms.

796 citations


"Self-organizing map approaches for ..." refers methods in this paper

  • ...Therefore, computational methods for determining haplotypes based on either genotypes [1,6,8,13] or SNP fragments [5,11,14,23] have attracted much attention....

    [...]
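The reference above describes a parsimony-style "subtraction" procedure: unambiguous individuals yield known haplotypes, and ambiguous genotypes are then resolved by subtracting a known haplotype and keeping its complement. The sketch below is a schematic illustration of that idea; it uses a 0/1/2 genotype coding rather than sequencing-gel data, and the helper names and encodings are assumptions, not the original algorithm's notation.

```python
# Schematic sketch of the subtraction-based phasing idea (Clark, 1990), assuming
# genotypes coded as per-site counts of the '1' allele (0, 1 or 2).
def genotype_is_unambiguous(genotype):
    """Unambiguous if the genotype has at most one heterozygous site."""
    return sum(1 for g in genotype if g == 1) <= 1

def haplotypes_from_unambiguous(genotype):
    """Split a genotype with at most one heterozygous site into its two haplotypes."""
    h1 = [1 if g == 2 else 0 for g in genotype]
    h2 = [0 if g == 0 else 1 for g in genotype]
    return h1, h2

def subtract(genotype, hap):
    """Return the complementary haplotype if hap is consistent with genotype, else None."""
    other = []
    for g, a in zip(genotype, hap):
        b = g - a
        if b not in (0, 1):
            return None              # hap cannot be one of the two copies
        other.append(b)
    return other

def clark_phase(genotypes):
    known, unresolved = set(), []
    for g in genotypes:
        if genotype_is_unambiguous(g):
            h1, h2 = haplotypes_from_unambiguous(g)
            known.update({tuple(h1), tuple(h2)})
        else:
            unresolved.append(g)
    progress = True
    while unresolved and progress:   # keep subtracting until nothing new is resolved
        progress = False
        for g in list(unresolved):
            for h in list(known):
                other = subtract(g, h)
                if other is not None:
                    known.add(tuple(other))
                    unresolved.remove(g)
                    progress = True
                    break
    return known, unresolved          # any remaining genotypes stay unphased

# Toy usage: three genotypes over three sites.
print(clark_phase([[2, 0, 1], [1, 1, 2], [0, 2, 2]]))
```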

Journal ArticleDOI
TL;DR: This study provides an example of approaches that have been successfully applied to the establishment of complex genotype-phenotype relationships in the presence of abundant DNA sequence variation and test a potential role of OPRM1 in substance (heroin/cocaine) dependence.
Abstract: To analyze candidate genes and establish complex genotype-phenotype relationships against a background of high natural genome sequence variability, we have developed approaches to (i) compare candidate gene sequence information in multiple individuals; (ii) predict haplotypes from numerous variants; and (iii) classify haplotypes and identify specific sequence variants, or combinations of variants (pattern), associated with the phenotype. Using the human mu opioid receptor gene (OPRM1) as a model system, we have combined these approaches to test a potential role of OPRM1 in substance (heroin/cocaine) dependence. All known functionally relevant regions of this prime candidate gene were analyzed by multiplex sequence comparison in 250 cases and controls; 43 variants were identified and 52 different haplotypes predicted in the subgroup of 172 African-Americans. These haplotypes were classified by similarity clustering into two functionally related categories, one of which was significantly more frequent in substance-dependent individuals. Common to this category was a characteristic pattern of sequence variants [-1793T-->A, -1699Tins, -1320A-->G, -111C-->T, +17C-->T (A6V)], which was associated with substance dependence. This study provides an example of approaches that have been successfully applied to the establishment of complex genotype-phenotype relationships in the presence of abundant DNA sequence variation.

299 citations


"Self-organizing map approaches for ..." refers background in this paper

  • ...The investigation of genetic differences among humans has given evidence that mutations in DNA sequences are responsible for many genetic diseases [10]....

    [...]