Journal ArticleDOI

Self-organizing map approaches for the haplotype assembly problem

01 Jun 2009 - Mathematics and Computers in Simulation (North-Holland) - Vol. 79, Iss. 10, pp. 3026-3037
TL;DR: This paper studies the minimum error correction (MEC) model for the haplotype assembly problem, explores self-organizing map (SOM) methods for it, and proposes a novel SOM approach that can efficiently reconstruct haplotype pairs under realistic parameter settings.
About: This article was published in Mathematics and Computers in Simulation on 2009-06-01. It has received 9 citations to date.
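The TL;DR above only names the approach; the sketch below is a minimal illustration (not the authors' implementation) of how a two-node self-organizing map can cluster SNP fragments into two haplotype groups and read off a haplotype pair. The gap encoding, distance measure, learning-rate schedule and thresholding step are all assumptions.

```python
# Minimal sketch (not the paper's algorithm): cluster SNP fragments with a
# two-node SOM, then threshold the prototypes to obtain a haplotype pair.
# Gaps are encoded as -1 and ignored in distance and update steps.
import numpy as np

def som_haplotype_assembly(fragments, epochs=50, lr0=0.5):
    """fragments: (m, n) int array with entries 0, 1, or -1 for uncovered sites."""
    m, n = fragments.shape
    rng = np.random.default_rng(0)
    weights = rng.random((2, n))        # one prototype vector per haplotype

    for epoch in range(epochs):
        lr = lr0 * (1.0 - epoch / epochs)          # linearly decaying learning rate
        for frag in fragments:
            covered = frag >= 0
            # Distance to each prototype, counted only over covered SNP sites.
            dists = [np.sum((frag[covered] - w[covered]) ** 2) for w in weights]
            winner = int(np.argmin(dists))
            # Move the winning prototype toward the fragment on covered sites.
            weights[winner, covered] += lr * (frag[covered] - weights[winner, covered])

    return (weights > 0.5).astype(int)             # threshold prototypes to 0/1 alleles

# Toy usage: 4 fragments over 5 SNP sites, -1 marks positions a read does not cover.
frags = np.array([[0, 1, 1, -1, 0],
                  [0, 1, -1, 0, 0],
                  [1, 0, 0, 1, -1],
                  [-1, 0, 0, 1, 1]])
print(som_haplotype_assembly(frags))
```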
Citations
Journal ArticleDOI
TL;DR: Further development of the SOM is discussed regarding network architecture, spatio-temporal patterning, and the presentation of model results in ecological sciences.

173 citations

Journal ArticleDOI
TL;DR: An approach to finding optimal solutions for the haplotype assembly problem under the minimum-error-correction (MEC) model with or without the all-heterozygous assumption is developed.
Abstract: Motivation: Haplotypes play a crucial role in genetic analysis and have many applications such as gene disease diagnoses, association studies, ancestry inference and so forth. The development of DNA sequencing technologies makes it possible to obtain haplotypes from a set of aligned reads originating from both copies of a chromosome of a single individual. This approach is often known as haplotype assembly. Exact algorithms that can give optimal solutions to the haplotype assembly problem are in high demand. Unfortunately, previous algorithms for this problem either fail to output optimal solutions or take too long, even when executed on a PC cluster. Results: We develop an approach to finding optimal solutions for the haplotype assembly problem under the minimum-error-correction (MEC) model. Most of the previous approaches assume that the columns in the input matrix correspond to (putative) heterozygous sites. This all-heterozygous assumption is correct for most columns, but it may be incorrect for a small number of columns. In this article, we consider the MEC model with or without the all-heterozygous assumption. In our approach, we first use new methods to decompose the input read matrix into small independent blocks and then model the problem for each block as an integer linear programming problem, which is then solved by an integer linear programming solver. We have tested our program on a single PC [a Linux (x64) desktop PC with an i7-3960X CPU], using the filtered HuRef and NA12878 datasets (after applying some variant calling methods). With the all-heterozygous assumption, our approach can optimally solve the whole HuRef dataset within a total time of 31 h (26 h for the most difficult block of the 15th chromosome and only 5 h for the other blocks). To our knowledge, this is the first time that MEC optimal solutions have been completely obtained for the filtered HuRef dataset. Moreover, in the general case (without the all-heterozygous assumption), for the HuRef dataset our approach can optimally solve all the chromosomes except the most difficult block in chromosome 15 within a total time of 12 days. For both the HuRef and NA12878 datasets, the optimal costs in the general case are sometimes much smaller than those in the all-heterozygous case. This implies that some columns in the input matrix (after applying certain variant calling methods) still correspond to false-heterozygous sites. Availability: Our program and the optimal solutions found for the HuRef dataset are available at http://rnc.r.dendai.ac.jp/hapAssembly.html.

64 citations
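The abstract above does not give its integer linear program explicitly. The sketch below shows one common way to write the MEC objective for a single block as an ILP under the all-heterozygous assumption (haplotype 2 is the complement of haplotype 1); the variable names, the product linearization and the use of the PuLP package are illustrative assumptions, not the authors' formulation.

```python
# Hedged MEC-as-ILP sketch for one block of reads (assumes the PuLP package).
import pulp

def mec_ilp(block):
    """block: list of reads, each a dict {site_index: allele in {0, 1}}."""
    sites = sorted({j for read in block for j in read})
    prob = pulp.LpProblem("MEC_block", pulp.LpMinimize)

    # y[j] = allele of haplotype 1 at site j; x[i] = 1 if read i is assigned to haplotype 2.
    y = {j: pulp.LpVariable(f"y_{j}", cat=pulp.LpBinary) for j in sites}
    x = {i: pulp.LpVariable(f"x_{i}", cat=pulp.LpBinary) for i in range(len(block))}
    errors = []

    for i, read in enumerate(block):
        for j, a in read.items():
            e = pulp.LpVariable(f"e_{i}_{j}", cat=pulp.LpBinary)      # 1 if entry (i, j) is corrected
            w = pulp.LpVariable(f"w_{i}_{j}", lowBound=0, upBound=1)  # linearizes x[i] * y[j]
            prob += w <= x[i]
            prob += w <= y[j]
            prob += w >= x[i] + y[j] - 1
            v = y[j] + x[i] - 2 * w            # allele of the haplotype read i is assigned to
            prob += (e >= 1 - v) if a == 1 else (e >= v)
            errors.append(e)

    prob += pulp.lpSum(errors)                  # minimize total corrections (the MEC score)
    prob.solve()
    h1 = [int(y[j].value()) for j in sites]
    h2 = [1 - b for b in h1]
    return pulp.value(prob.objective), h1, h2

# Toy block: three reads over four heterozygous sites; optimal MEC cost is 0.
reads = [{0: 0, 1: 1, 2: 1}, {1: 1, 2: 1, 3: 0}, {0: 1, 1: 0, 3: 1}]
print(mec_ilp(reads))
```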

Journal ArticleDOI
TL;DR: The simulation pipeline haplosim is developed to evaluate the performance of three haplotype estimation algorithms for polyploids (HapCompass, HapTree and SDhaP) in settings varying in sequencing approach, ploidy levels and genomic diversity, using tetraploid potato as the model; the results show that sequencing depth is the major determinant of haplotype quality.
Abstract: Haplotypes are the units of inheritance in an organism, and many genetic analyses depend on their precise determination. Methods for haplotyping single individuals use the phasing information available in next-generation sequencing reads, by matching overlapping single-nucleotide polymorphisms while penalizing post hoc nucleotide corrections made. Haplotyping diploids is relatively easy, but the complexity of the problem increases drastically for polyploid genomes, which are found in both model organisms and in economically relevant plant and animal species. Although a number of tools are available for haplotyping polyploids, the effects of the genomic makeup and the sequencing strategy followed on the accuracy of these methods have hitherto not been thoroughly evaluated. We developed the simulation pipeline haplosim to evaluate the performance of three haplotype estimation algorithms for polyploids: HapCompass, HapTree and SDhaP, in settings varying in sequencing approach, ploidy levels and genomic diversity, using tetraploid potato as the model. Our results show that sequencing depth is the major determinant of haplotype estimation quality, that 1 kb PacBio circular consensus sequencing reads and Illumina reads with large insert-sizes are competitive and that all methods fail to produce good haplotypes when ploidy levels increase. Comparing the three methods, HapTree produces the most accurate estimates, but also consumes the most resources. There is clearly room for improvement in polyploid haplotyping algorithms.

54 citations

Journal ArticleDOI
TL;DR: This review investigates how computational haplotype determination methods have been developed and presents the remaining problems affecting the determination of the haplotype of a single individual using next-generation sequencing methods.
Abstract: Genome-wide association studies have expanded our understanding of the relationship between the human genome and disease. However, because of current technical limitations, it is still challenging to clearly resolve diploid sequences, that is, two copies for each chromosome. One copy of each chromosome is inherited from each parent, and the genomic function is determined by the interplay between the alleles represented as genotypes in the diploid sequences. Thus, to understand the nature of genetic variation in biological processes, including disease, it is necessary to determine the complete genomic sequence of each haplotype. Although there are experimental approaches for haplotype sequencing that physically separate the chromosomes, these methods are expensive and laborious and require special equipment. Here, we review the computational approaches that can be used to determine the haplotype phase. Since 1990, many researchers have tried to reconstruct the haplotype phase using a variety of computational methods, and some of these efforts have successfully helped to determine the haplotype phase. In this review, we investigate how computational haplotype determination methods have been developed, and we present the remaining problems affecting the determination of the haplotype of a single individual using next-generation sequencing methods.

35 citations


Cites methods from "Self-organizing map approaches for ..."

  • ...$D_{\mathrm{perFragToFrag}}(f_j, h_k) = \sum_{i=0}^{n} D_{\mathrm{perSNP}}(f_{ji}, f_{ki})$ (7). Wu et al. (2009) clustered the m fragments to two groups representing haplotypes by self-organizing map (SOM) and Xu and Li (2012) used a semi-supervised k-means clustering method....

    [...]
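The excerpt does not define the per-SNP distance in equation (7); a hedged reading is sketched below, using a simple mismatch count that ignores uncovered sites (an assumption, not necessarily the citing review's definition), followed by the assignment of each fragment to the nearer of two candidate haplotypes.

```python
# Hedged reading of equation (7): fragment-to-haplotype distance as a sum of
# per-SNP distances. The per-SNP distance used here (0 for a match or an
# uncovered site, 1 for a mismatch) is an illustrative assumption.
def per_snp_distance(frag_allele, hap_allele, gap="-"):
    if frag_allele == gap or hap_allele == gap:
        return 0                     # uncovered sites contribute nothing
    return 0 if frag_allele == hap_allele else 1

def frag_to_hap_distance(fragment, haplotype):
    return sum(per_snp_distance(f, h) for f, h in zip(fragment, haplotype))

# Assign each fragment to the nearer of two candidate haplotypes (the clustering
# step the excerpt attributes to the SOM of Wu et al., 2009).
fragments = ["01-10", "10-01", "011-0"]
h1, h2 = "01110", "10001"
for frag in fragments:
    d1, d2 = frag_to_hap_distance(frag, h1), frag_to_hap_distance(frag, h2)
    print(frag, "-> h1" if d1 <= d2 else "-> h2", (d1, d2))
```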

Journal ArticleDOI
Rui Qin, Yan-Kui Liu
TL;DR: A new class of chance model (C-model for short) for data envelopment analysis (DEA) in fuzzy random environments is developed, in which the inputs and outputs are assumed to be characterized by fuzzy random variables with known possibility and probability distributions.

33 citations


Cites methods from "Self-organizing map approaches for ..."

  • ...[45] applied SOMA to solve the minimum error correction model for the haplotype assembly problem....

    [...]

References
Journal ArticleDOI
J. Craig Venter, Mark Raymond Adams, Eugene W. Myers, Peter W. Li, +269 more (12 institutions)
16 Feb 2001 - Science
TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems are indicated.
Abstract: A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies-a whole-genome assembly and a regional chromosome assembly-were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

12,098 citations

Journal ArticleDOI
TL;DR: A modified version of the Celera assembler is developed to facilitate the identification and comparison of alternate alleles within this individual diploid genome, and a novel haplotype assembly strategy is used, able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploids nature of the genome.
Abstract: Presented here is a genome sequence of an individual human. It was produced from ∼32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels) (1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.

1,843 citations

Journal ArticleDOI
TL;DR: A high-resolution analysis of the haplotype structure across 500 kilobases on chromosome 5q31 using 103 single-nucleotide polymorphisms (SNPs) in a European-derived population offers a coherent framework for creating a haplotype map of the human genome.
Abstract: Linkage disequilibrium (LD) analysis is traditionally based on individual genetic markers and often yields an erratic, non-monotonic picture, because the power to detect allelic associations depends on specific properties of each marker, such as frequency and population history. Ideally, LD analysis should be based directly on the underlying haplotype structure of the human genome, but this structure has remained poorly understood. Here we report a high-resolution analysis of the haplotype structure across 500 kilobases on chromosome 5q31 using 103 single-nucleotide polymorphisms (SNPs) in a European-derived population. The results show a picture of discrete haplotype blocks (of tens to hundreds of kilobases), each with limited diversity punctuated by apparent sites of recombination. In addition, we develop an analytical model for LD mapping based on such haplotype blocks. If our observed structure is general (and published data suggest that it may be), it offers a coherent framework for creating a haplotype map of the human genome.

1,778 citations


"Self-organizing map approaches for ..." refers methods in this paper

  • ...Next, we conduct experiments using the data from public Daly set [2]....

    [...]

Journal ArticleDOI
TL;DR: Details of the algorithm for extracting allelic sequences from population samples, along with some population-genetic considerations that influence the likelihood for success of the method, are presented here.
Abstract: Direct sequencing of genomic DNA from diploid individuals leads to ambiguities on sequencing gels whenever there is more than one mismatching site in the sequences of the two orthologous copies of a gene. While these ambiguities cannot be resolved from a single sample without resorting to other experimental methods (such as cloning in the traditional way), population samples may be useful for inferring haplotypes. For each individual in the sample that is homozygous for the amplified sequence, there are no ambiguities in the identification of the allele’s sequence. The sequences of other alleles can be inferred by taking the remaining sequence after “subtracting off” the sequencing ladder of each known site. Details of the algorithm for extracting allelic sequences from such data are presented here, along with some population-genetic considerations that influence the likelihood for success of the method. The algorithm also applies to the problem of inferring haplotype frequencies of closely linked restriction-site polymorphisms.

796 citations


"Self-organizing map approaches for ..." refers methods in this paper

  • ...Therefore, computational methods for determining haplotypes based on either genotypes [1,6,8,13] or SNP fragments [5,11,14,23] have attracted much attention....

    [...]
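The reference above describes a parsimony-style "subtraction" procedure: unambiguous individuals yield known haplotypes, and ambiguous genotypes are then resolved by subtracting a known haplotype and keeping its complement. The sketch below is a schematic illustration of that idea; it uses a 0/1/2 genotype coding rather than sequencing-gel data, and the helper names and encodings are assumptions, not the original algorithm's notation.

```python
# Schematic sketch of the subtraction-based phasing idea (Clark, 1990), assuming
# genotypes coded as per-site counts of the '1' allele (0, 1 or 2).
def genotype_is_unambiguous(genotype):
    """Unambiguous if the genotype has at most one heterozygous site."""
    return sum(1 for g in genotype if g == 1) <= 1

def haplotypes_from_unambiguous(genotype):
    """Split a genotype with at most one heterozygous site into its two haplotypes."""
    h1 = [1 if g == 2 else 0 for g in genotype]
    h2 = [0 if g == 0 else 1 for g in genotype]
    return h1, h2

def subtract(genotype, hap):
    """Return the complementary haplotype if hap is consistent with genotype, else None."""
    other = []
    for g, a in zip(genotype, hap):
        b = g - a
        if b not in (0, 1):
            return None              # hap cannot be one of the two copies
        other.append(b)
    return other

def clark_phase(genotypes):
    known, unresolved = set(), []
    for g in genotypes:
        if genotype_is_unambiguous(g):
            h1, h2 = haplotypes_from_unambiguous(g)
            known.update({tuple(h1), tuple(h2)})
        else:
            unresolved.append(g)
    progress = True
    while unresolved and progress:   # keep subtracting until nothing new is resolved
        progress = False
        for g in list(unresolved):
            for h in list(known):
                other = subtract(g, h)
                if other is not None:
                    known.add(tuple(other))
                    unresolved.remove(g)
                    progress = True
                    break
    return known, unresolved          # any remaining genotypes stay unphased

# Toy usage: three genotypes over three sites.
print(clark_phase([[2, 0, 1], [1, 1, 2], [0, 2, 2]]))
```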

Journal ArticleDOI
TL;DR: This study provides an example of approaches that have been successfully applied to the establishment of complex genotype-phenotype relationships in the presence of abundant DNA sequence variation and test a potential role of OPRM1 in substance (heroin/cocaine) dependence.
Abstract: To analyze candidate genes and establish complex genotype-phenotype relationships against a background of high natural genome sequence variability, we have developed approaches to (i) compare candidate gene sequence information in multiple individuals; (ii) predict haplotypes from numerous variants; and (iii) classify haplotypes and identify specific sequence variants, or combinations of variants (pattern), associated with the phenotype. Using the human mu opioid receptor gene (OPRM1) as a model system, we have combined these approaches to test a potential role of OPRM1 in substance (heroin/cocaine) dependence. All known functionally relevant regions of this prime candidate gene were analyzed by multiplex sequence comparison in 250 cases and controls; 43 variants were identified and 52 different haplotypes predicted in the subgroup of 172 African-Americans. These haplotypes were classified by similarity clustering into two functionally related categories, one of which was significantly more frequent in substance-dependent individuals. Common to this category was a characteristic pattern of sequence variants [-1793T-->A, -1699Tins, -1320A-->G, -111C-->T, +17C-->T (A6V)], which was associated with substance dependence. This study provides an example of approaches that have been successfully applied to the establishment of complex genotype-phenotype relationships in the presence of abundant DNA sequence variation.

299 citations


"Self-organizing map approaches for ..." refers background in this paper

  • ...The investigation of genetic differences among humans has given evidence that mutations in DNA sequences are responsible for many genetic diseases [10]....

    [...]