NCMHap: a novel method for haplotype reconstruction based on Neutrosophic c-means clustering
TL;DR: NCMHap, a haplotype-reconstruction method built on the Neutrosophic c-means (NCM) clustering algorithm, which can effectively detect noise and outliers in the input data and reduce their effects during clustering.
Abstract: The single individual haplotype problem refers to reconstructing the haplotypes of an individual from a set of fragments sequenced from a specified chromosome. Solving this problem is an important task in computational biology, with applications in the pharmaceutical industry, clinical decision-making, and the study of genetic diseases. The problem is known to be NP-hard. Although several methods have been proposed to solve it, most of them perform poorly on noisy input fragments. Proposing a method that is both accurate and scalable therefore remains a challenging task.
In this paper, we introduce a method, named NCMHap, which utilizes the Neutrosophic c-means (NCM) clustering algorithm. The NCM algorithm can effectively detect noise and outliers in the input data and reduce their effects during clustering. The proposed method has been evaluated on several benchmark datasets. Comparison with existing methods indicates that, when NCM is tuned with suitable parameters, the results are encouraging; in particular, as the amount of noise increases, NCMHap outperforms the competing methods. The method is validated on both simulated and real datasets. The results recommend applying NCMHap to datasets whose fragments contain many gaps and a high level of noise.
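The fragment-based reconstruction that NCMHap and related methods perform is typically scored with the minimum error correction (MEC) criterion: each sequenced fragment is charged the number of mismatches against whichever of the two haplotype copies it fits better, with uncovered (gap) positions ignored. A minimal pure-Python sketch of that scoring (function names are illustrative, not from the paper):

```python
def fragment_distance(fragment, haplotype):
    """Mismatches between a fragment and a candidate haplotype,
    skipping gap positions ('-') that the fragment does not cover."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != '-' and f != h)

def mec_score(fragments, hap1, hap2):
    """Minimum error correction: each fragment is charged the distance
    to whichever haplotype copy it matches better."""
    return sum(min(fragment_distance(f, hap1), fragment_distance(f, hap2))
               for f in fragments)

# Two complementary haplotypes; one fragment is clean, one has a single error:
print(mec_score(["01-0", "1110"], "0110", "1001"))  # → 1
```

Noise in the input fragments shows up directly as a higher MEC score, which is why noise-tolerant clustering of fragments is the crux of the problem.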
Citations
21 Dec 2022
TL;DR: QuickHap, a heuristic algorithm for fast haplotyping with acceptable accuracy. It has two phases: first, a partial haplotype is built and expanded over several iterations; then a second phase refines the reconstructed haplotypes to improve accuracy.
Abstract: Single individual haplotype reconstruction refers to the computational problem of inferring the two distinct copies of each chromosome. Determining haplotypes offers many advantages for genomic studies in various fields of human genetics. Although many methods have been proposed to obtain haplotypes with high accuracy, rapid and accurate haplotype assembly is still a challenging problem. The size of high-throughput sequencing data and the length of the human genome underline the importance of algorithmic speed. In this paper, we propose QuickHap, a heuristic algorithm that achieves fast haplotyping with acceptable accuracy. Our algorithm has two phases: in the first, a partial haplotype is built and expanded over several iterations, using a new metric to measure the quality of the reconstructed haplotype at each iteration and steer toward the optimum solution; the second phase refines the reconstructed haplotypes to improve accuracy. The results demonstrate that the proposed method reconstructs haplotypes with promising accuracy and outperforms competing methods in speed, particularly on high-coverage sequencing data.
TL;DR: Haplotype assembly (HA) is the process of obtaining haplotypes from DNA sequencing data; a haplotype is a set of DNA variants inherited together from one parent or chromosome.
Abstract: Background: A haplotype is a set of DNA variants inherited together from one parent or chromosome. Haplotype information is useful for studying genetic variation and disease association. Haplotype assembly (HA) is the process of obtaining haplotypes from DNA sequencing data. Many HA methods exist, each with its own strengths and weaknesses. This study compared six HA methods: HapCUT2, MixSIH, PEATH, WhatsHap, SDhaP, and MAtCHap, using two NA12878 datasets named hg19 and hg38. The six algorithms were run on chromosome 10 of both datasets, each with three filtering levels based on sequencing depth (DP1, DP15, and DP30), and their outputs were compared. Results: Run time (CPU time) was compared to assess the efficiency of the six HA methods. HapCUT2 was the fastest across the six datasets, with run time consistently under 2 minutes. WhatsHap was also relatively fast, with run time of 21 minutes or less on all six datasets. The run times of the other four algorithms varied across datasets and coverage levels. To assess accuracy, pairwise comparisons were conducted for each pair of packages by computing their disagreement rates for both haplotype blocks and single nucleotide variants (SNVs). The authors also compared them using switch distance (switch error), i.e., the number of positions at which the two chromosomes of a phasing must be switched to match the known haplotype. HapCUT2, PEATH, MixSIH, and MAtCHap generated output files with similar numbers of blocks and SNVs, and they performed similarly. WhatsHap generated a much larger number of SNVs in the hg19 DP1 output, which gave it high disagreement percentages with the other methods; for the hg38 data, however, WhatsHap performed similarly to the other algorithms, except SDhaP. SDhaP had a much larger disagreement rate than the other algorithms in all six datasets.
Conclusion: The comparative analysis matters because each algorithm is different. The findings provide a deeper understanding of the performance of currently available HA algorithms and useful guidance for users.
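The switch distance used above has a compact formulation: compare the inferred and known haplotypes site by site, record whether each heterozygous site agrees or is flipped, and count how often that relative phase changes. A pure-Python sketch under the simplifying assumption of a single phased block with no missing sites (function name is illustrative):

```python
def switch_errors(phased, truth):
    """Count switch errors between an inferred haplotype and the known
    one, both given as strings over {'0', '1'} at the same het sites."""
    # Relative phase per site: 0 if alleles agree, 1 if flipped.
    rel = [0 if p == t else 1 for p, t in zip(phased, truth)]
    # A switch error occurs wherever the relative phase changes.
    return sum(1 for a, b in zip(rel, rel[1:]) if a != b)

print(switch_errors("00111", "00000"))  # → 1 (one flip after site 2)
```

Note that a fully complementary phasing scores zero, since it is simply the other haplotype copy; only changes in relative phase are penalized.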
References
TL;DR: A unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs is presented.
Abstract: Recent advances in sequencing technology make it possible to comprehensively catalogue genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (1) initial read mapping; (2) local realignment around indels; (3) base quality score recalibration; (4) SNP discovery and genotyping to find all potential variants; and (5) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We discuss the application of these tools, instantiated in the Genome Analysis Toolkit (GATK), to deep whole-genome, whole-exome capture, and multi-sample low-pass (~4×) 1000 Genomes Project datasets.
10,056 citations
"NCMHap: a novel method for haplotyp..." refers methods in this paper
...Moreover, the trio-phased variant calls from the GATK resource bundle [33] were used as the true haplotypes....
TL;DR: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype; presents the results of the project's pilot phase, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms.
Abstract: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10(-8) per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
7,538 citations
Affiliations: Baylor College of Medicine, Chinese Academy of Sciences, Chinese National Human Genome Center, University of Hong Kong, The Chinese University of Hong Kong, Hong Kong University of Science and Technology, Illumina, McGill University, Washington University in St. Louis, University of California, San Francisco, Wellcome Trust Sanger Institute, Beijing Normal University, Health Sciences University of Hokkaido, Shinshu University, University of Tsukuba, Howard University, University of Ibadan, Case Western Reserve University, University of Utah, Cold Spring Harbor Laboratory, Johns Hopkins University, University of Oxford, North Carolina State University, National Institutes of Health, Massachusetts Institute of Technology, Chinese Academy of Social Sciences, Kyoto University, Nagasaki University, Wellcome Trust, Genome Canada, Foundation for the National Institutes of Health, University of Maryland, Baltimore, Vanderbilt University, Stanford University, New York University, University of California, Berkeley, University of Oklahoma, University of New Mexico, Université de Montréal, University of California, Los Angeles, University of Michigan, University of Wisconsin-Madison, London School of Economics and Political Science, Genetic Alliance, GlaxoSmithKline, University of Washington, Harvard University, University of Chicago, Fred Hutchinson Cancer Research Center, University of Tokyo
TL;DR: The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance the ability to choose targets for therapeutic intervention.
Abstract: The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention.
5,926 citations
Affiliations: Max Planck Society, Broad Institute, University of California, Berkeley, European Bioinformatics Institute, National Institutes of Health, University of Massachusetts Medical School, University of Washington, Spanish National Research Council, University of Montana, Croatian Academy of Sciences and Arts, University of Oviedo, University of Bonn, Emory University, University College Cork, Harvard University
TL;DR: The genomic data suggest that Neandertals mixed with modern human ancestors some 120,000 years ago, leaving traces of Neandertal DNA in contemporary humans, and that gene flow from Neandertals into the ancestors of non-Africans occurred before the divergence of Eurasian groups from each other.
Abstract: Neandertals, the closest evolutionary relatives of present-day humans, lived in large parts of Europe and western Asia before disappearing 30,000 years ago. We present a draft sequence of the Neandertal genome composed of more than 4 billion nucleotides from three individuals. Comparisons of the Neandertal genome to the genomes of five present-day humans from different parts of the world identify a number of genomic regions that may have been affected by positive selection in ancestral modern humans, including genes involved in metabolism and in cognitive and skeletal development. We show that Neandertals shared more genetic variants with present-day humans in Eurasia than with present-day humans in sub-Saharan Africa, suggesting that gene flow from Neandertals into the ancestors of non-Africans occurred before the divergence of Eurasian groups from each other.
3,575 citations
"NCMHap: a novel method for haplotyp..." refers background in this paper
...Haplotypes provide more attainable information than individual SNPs which can be remarkable for investigating the relationship between genetic variations and complex diseases [6], studying human history [7], providing personalized medicine [8] and studying biological mechanisms [9]....