Journal ArticleDOI

NCMHap: a novel method for haplotype reconstruction based on Neutrosophic c-means clustering

08 Jun 2020-BMC Bioinformatics (BioMed Central)-Vol. 21, Iss: 1, pp 1-15
TL;DR: A method named NCMHap is presented, which utilizes the Neutrosophic c-means (NCM) clustering algorithm; NCM can effectively detect noise and outliers in the input data and reduce their effects during the clustering process.
Abstract: The single individual haplotyping problem refers to reconstructing the haplotypes of an individual from a set of input fragments sequenced from a specified chromosome. Solving this problem is an important task in computational biology, with applications in the pharmaceutical industry, clinical decision-making, and the study of genetic diseases. The problem is known to be NP-hard. Although several methods have been proposed to solve it, most of them perform poorly on noisy input fragments, so proposing a method that is both accurate and scalable remains a challenging task. In this paper, we introduce a method, named NCMHap, which utilizes the Neutrosophic c-means (NCM) clustering algorithm. The NCM algorithm can effectively detect noise and outliers in the input data and reduce their effects in the clustering process. The proposed method has been evaluated on several benchmark datasets. Comparison with existing methods indicates that, when NCM is tuned with suitable parameters, the results are encouraging; in particular, NCMHap outperforms the competing methods as the amount of noise increases. The proposed method is validated using simulated and real datasets. The achieved results recommend applying NCMHap to datasets whose fragments contain a large number of gaps and a high level of noise.
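The abstract does not give implementation details, but the general idea of clustering-based haplotype assembly can be illustrated with a minimal sketch: fragments covering heterozygous SNP sites are softly assigned to two clusters (one per chromosome copy), and each cluster's consensus yields a haplotype. The sketch below uses a plain fuzzy two-cluster update as a stand-in for the Neutrosophic c-means step described in the paper; the fragment matrix, the gap encoding, and the update rule are illustrative assumptions, not the authors' implementation (NCM additionally models indeterminate and outlier fragments, which is omitted here).

```python
import numpy as np

def cluster_fragments(fragments, n_iter=50, m=2.0):
    """Toy fuzzy two-cluster haplotype assembly (illustrative only).

    fragments: 2-D array of shape (n_fragments, n_snps) with values
               0/1 for observed alleles and -1 for uncovered sites (gaps).
    Returns two reconstructed haplotypes as 0/1 arrays.
    """
    frags = np.asarray(fragments, dtype=float)
    mask = frags >= 0                              # True where a site is observed
    rng = np.random.default_rng(0)
    centers = rng.random((2, frags.shape[1]))      # two haplotype "centers"

    for _ in range(n_iter):
        # Distance of each fragment to each center, ignoring gap positions.
        dist = np.stack([
            np.where(mask, (frags - c) ** 2, 0.0).sum(axis=1) for c in centers
        ]).T + 1e-9
        # Standard fuzzy c-means membership update.
        u = (1.0 / dist) ** (1.0 / (m - 1))
        u = u / u.sum(axis=1, keepdims=True)
        # Update each center as a membership-weighted average over observed sites.
        for k in range(2):
            w = (u[:, k] ** m)[:, None] * mask
            num = (w * np.where(mask, frags, 0.0)).sum(axis=0)
            den = w.sum(axis=0)
            centers[k] = np.where(den > 0, num / np.maximum(den, 1e-9), centers[k])

    return (centers > 0.5).astype(int)

# Example: four fragments over five SNP sites, -1 marks a gap.
frags = [[0, 1, 1, -1, 0],
         [0, 1, -1, 0, 0],
         [1, 0, 0, 1, -1],
         [1, -1, 0, 1, 1]]
print(cluster_fragments(frags))
```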


Citations
Proceedings ArticleDOI
21 Dec 2022
TL;DR: QuickHap, as discussed by the authors, is a heuristic algorithm that achieves fast haplotyping with acceptable accuracy; it works in two phases: a partial haplotype is first built and expanded over several iterations, and a second phase then refines the reconstructed haplotypes to improve accuracy.
Abstract: Single individual haplotype reconstruction refers to the computational problem of inferring the two distinct copies of each chromosome. Determining haplotypes offers many advantages for genomic studies in various fields of human genetics. Although many methods have been proposed to obtain haplotypes with high accuracy, solving haplotype assembly both rapidly and accurately remains a challenging problem. The volume of high-throughput sequencing data and the length of the human genome emphasize the importance of algorithmic speed. In this paper, we propose QuickHap, a heuristic algorithm that achieves fast haplotyping with acceptable accuracy. Our algorithm contains two phases: first, a partial haplotype is built and expanded over several iterations, using a new metric to measure the quality of the reconstructed haplotype in each iteration and steer the search toward the optimum solution; the second phase then refines the reconstructed haplotypes to improve accuracy. The results demonstrate that the proposed method reconstructs haplotypes with promising accuracy and outperforms the competing methods in speed, particularly on high-coverage sequencing data.
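QuickHap's own quality metric is not specified in the abstract, so the sketch below instead shows the minimum error correction (MEC) score, a standard measure used in haplotype assembly to evaluate a candidate haplotype pair against the input fragments; it is offered only as a concrete example of such a metric, and the fragment encoding is an assumption for illustration.

```python
def mec_score(fragments, h1, h2):
    """Minimum error correction (MEC) score of a haplotype pair.

    Each fragment is a dict {snp_index: allele} with alleles 0/1.
    For every fragment, count mismatches against each haplotype and
    charge the smaller count; the total is the MEC score (lower is better).
    """
    total = 0
    for frag in fragments:
        err1 = sum(1 for i, a in frag.items() if h1[i] != a)
        err2 = sum(1 for i, a in frag.items() if h2[i] != a)
        total += min(err1, err2)
    return total

# Example: three fragments over four SNP sites, consistent with (h1, h2).
fragments = [{0: 0, 1: 1}, {1: 1, 2: 1, 3: 0}, {0: 1, 2: 0}]
h1, h2 = [0, 1, 1, 0], [1, 0, 0, 1]
print(mec_score(fragments, h1, h2))  # -> 0
```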
Journal ArticleDOI
TL;DR: Haplotype assembly (HA), as discussed in this paper, is the process of obtaining haplotypes from DNA sequencing data; a haplotype is a set of DNA variants inherited together from one parent or chromosome.
Abstract: Background: A haplotype is a set of DNA variants inherited together from one parent or chromosome. Haplotype information is useful for studying genetic variation and disease association. Haplotype assembly (HA) is the process of obtaining haplotypes from DNA sequencing data. Many HA methods currently exist, each with its own strengths and weaknesses. This study compared six HA methods or algorithms, HapCUT2, MixSIH, PEATH, WhatsHap, SDhaP, and MAtCHap, using two NA12878 datasets named hg19 and hg38. The 6 HA algorithms were run on chromosome 10 of these two datasets, each with 3 filtering levels based on sequencing depth (DP1, DP15, and DP30), and their outputs were then compared. Results: Run time (CPU time) was compared to assess the efficiency of the 6 HA methods. HapCUT2 was the fastest on all 6 datasets, with run times consistently under 2 min. WhatsHap was also relatively fast, with run times of 21 min or less on all 6 datasets. The run times of the other 4 HA algorithms varied across datasets and coverage levels. To assess accuracy, pairwise comparisons were conducted for each pair of the six packages by computing their disagreement rates for both haplotype blocks and single nucleotide variants (SNVs). The authors also compared them using switch distance (error), i.e., the number of positions at which the two chromosomes of a given phasing must be switched to match the known haplotype. HapCUT2, PEATH, MixSIH, and MAtCHap generated output files with similar numbers of blocks and SNVs and performed similarly. WhatsHap generated a much larger number of SNVs in the hg19 DP1 output, which caused high disagreement percentages with the other methods; for the hg38 data, however, WhatsHap performed similarly to the other 4 algorithms, except SDhaP. The comparison showed that SDhaP had a much larger disagreement rate than the other algorithms on all 6 datasets. Conclusion: The comparative analysis is important because each algorithm is different. The findings of this study provide a deeper understanding of the performance of currently available HA algorithms and useful input for other users.
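The switch distance mentioned above can be made concrete with a small sketch: given an inferred phase and a known haplotype over the same heterozygous sites, count the positions at which the assignment of alleles to the two chromosome copies flips. The 0/1-per-site encoding below (indicating which parental copy carries the alternate allele) is an illustrative assumption, not the exact representation used by the compared tools.

```python
def switch_errors(inferred, truth):
    """Count switch errors between an inferred and a true phasing.

    Both inputs are sequences of 0/1 phase indicators over the same
    heterozygous sites. A switch error is counted whenever the
    agreement/disagreement state flips between consecutive sites.
    """
    assert len(inferred) == len(truth) and len(inferred) > 0
    switches = 0
    prev_same = inferred[0] == truth[0]
    for a, b in zip(inferred[1:], truth[1:]):
        same = a == b
        if same != prev_same:
            switches += 1
        prev_same = same
    return switches

# One switch: the phase flips once, between the 2nd and 3rd sites.
print(switch_errors([0, 1, 1, 0, 1], [0, 1, 0, 1, 0]))  # -> 1
```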
References
Journal ArticleDOI
TL;DR: A unified analytic framework is presented to discover and genotype variation among multiple samples simultaneously, achieving sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs.
Abstract: Recent advances in sequencing technology make it possible to comprehensively catalogue genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (1) initial read mapping; (2) local realignment around indels; (3) base quality score recalibration; (4) SNP discovery and genotyping to find all potential variants; and (5) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We discuss the application of these tools, instantiated in the Genome Analysis Toolkit (GATK), to deep whole-genome, whole-exome capture, and multi-sample low-pass (~4×) 1000 Genomes Project datasets.

10,056 citations


"NCMHap: a novel method for haplotyp..." refers methods in this paper

  • ...Moreover, the trio-phased variant calls from the GATK resource bundle [33] was used as the true haplotypes....

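The variant-calling workflow summarised in the abstract of this reference (read mapping, base quality score recalibration, then SNP discovery and genotyping) can be sketched as a sequence of commands driven from Python. The file names, reference, and known-sites resources are placeholders, the exact flags depend on the GATK version in use, and indel realignment is handled internally by HaplotypeCaller in current GATK releases; treat this as an assumption-laden outline, not the pipeline used in the cited studies.

```python
import subprocess

def run(cmd):
    """Run one pipeline step and fail loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Placeholder inputs: adjust paths to your own data. Assumes the reference is
# already indexed (bwa index / samtools faidx / sequence dictionary) and that
# the reads carry the read-group tags GATK requires.
ref, r1, r2, known = "ref.fasta", "reads_1.fq", "reads_2.fq", "known_sites.vcf"

# 1) Map reads, then sort and index the alignment.
with open("aln.sam", "w") as sam:
    subprocess.run(["bwa", "mem", ref, r1, r2], stdout=sam, check=True)
run(["samtools", "sort", "-o", "sorted.bam", "aln.sam"])
run(["samtools", "index", "sorted.bam"])

# 2) Base quality score recalibration (GATK4-style invocation).
run(["gatk", "BaseRecalibrator", "-I", "sorted.bam", "-R", ref,
     "--known-sites", known, "-O", "recal.table"])
run(["gatk", "ApplyBQSR", "-I", "sorted.bam", "-R", ref,
     "--bqsr-recal-file", "recal.table", "-O", "recal.bam"])

# 3) SNP/indel discovery and genotyping.
run(["gatk", "HaplotypeCaller", "-R", ref, "-I", "recal.bam",
     "-O", "variants.vcf.gz"])
```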

Journal ArticleDOI
28 Oct 2010-Nature
TL;DR: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype; this paper presents results of the project's pilot phase, which was designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms.
Abstract: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

7,538 citations

Journal ArticleDOI
John W. Belmont1, Paul Hardenbol, Thomas D. Willis, Fuli Yu1, Huanming Yang2, Lan Yang Ch'Ang, Wei Huang3, Bin Liu2, Yan Shen3, Paul K.H. Tam4, Lap-Chee Tsui4, Mary M.Y. Waye5, Jeffrey Tze Fei Wong6, Changqing Zeng2, Qingrun Zhang2, Mark S. Chee7, Luana Galver7, Semyon Kruglyak7, Sarah S. Murray7, Arnold Oliphant7, Alexandre Montpetit8, Fanny Chagnon8, Vincent Ferretti8, Martin Leboeuf8, Michael S. Phillips8, Andrei Verner8, Shenghui Duan9, Denise L. Lind10, Raymond D. Miller9, John P. Rice9, Nancy L. Saccone9, Patricia Taillon-Miller9, Ming Xiao10, Akihiro Sekine, Koki Sorimachi, Yoichi Tanaka, Tatsuhiko Tsunoda, Eiji Yoshino, David R. Bentley11, Sarah E. Hunt11, Don Powell11, Houcan Zhang12, Ichiro Matsuda13, Yoshimitsu Fukushima14, Darryl Macer15, Eiko Suda15, Charles N. Rotimi16, Clement Adebamowo17, Toyin Aniagwu17, Patricia A. Marshall18, Olayemi Matthew17, Chibuzor Nkwodimmah17, Charmaine D.M. Royal16, Mark Leppert19, Missy Dixon19, Fiona Cunningham20, Ardavan Kanani20, Gudmundur A. Thorisson20, Peter E. Chen21, David J. Cutler21, Carl S. Kashuk21, Peter Donnelly22, Jonathan Marchini22, Gilean McVean22, Simon Myers22, Lon R. Cardon22, Andrew P. Morris22, Bruce S. Weir23, James C. Mullikin24, Michael Feolo24, Mark J. Daly25, Renzong Qiu26, Alastair Kent, Georgia M. Dunston16, Kazuto Kato27, Norio Niikawa28, Jessica Watkin29, Richard A. Gibbs1, Erica Sodergren1, George M. Weinstock1, Richard K. Wilson9, Lucinda Fulton9, Jane Rogers11, Bruce W. Birren25, Hua Han2, Hongguang Wang, Martin Godbout30, John C. Wallenburg8, Paul L'Archevêque, Guy Bellemare, Kazuo Todani, Takashi Fujita, Satoshi Tanaka, Arthur L. Holden, Francis S. Collins24, Lisa D. Brooks24, Jean E. McEwen24, Mark S. Guyer24, Elke Jordan31, Jane Peterson24, Jack Spiegel24, Lawrence M. Sung32, Lynn F. Zacharia24, Karen Kennedy29, Michael Dunn29, Richard Seabrook29, Mark Shillito, Barbara Skene29, John Stewart29, David Valle21, Ellen Wright Clayton33, Lynn B. Jorde19, Aravinda Chakravarti21, Mildred K. Cho34, Troy Duster35, Troy Duster36, Morris W. Foster37, Maria Jasperse38, Bartha Maria Knoppers39, Pui-Yan Kwok10, Julio Licinio40, Jeffrey C. Long41, Pilar N. Ossorio42, Vivian Ota Wang33, Charles N. Rotimi16, Patricia Spallone29, Patricia Spallone43, Sharon F. Terry44, Eric S. Lander25, Eric H. Lai45, Deborah A. Nickerson46, Gonçalo R. Abecasis41, David Altshuler47, Michael Boehnke41, Panos Deloukas11, Julie A. Douglas41, Stacey Gabriel25, Richard R. Hudson48, Thomas J. Hudson8, Leonid Kruglyak49, Yusuke Nakamura50, Robert L. Nussbaum24, Stephen F. Schaffner25, Stephen T. Sherry24, Lincoln Stein20, Toshihiro Tanaka 
18 Dec 2003-Nature
TL;DR: The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance the ability to choose targets for therapeutic intervention.
Abstract: The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention.

5,926 citations

Journal ArticleDOI
07 May 2010-Science
TL;DR: The genomic data suggest that Neandertals mixed with modern human ancestors some 120,000 years ago, leaving traces of Neandertal DNA in contemporary humans, and that gene flow from Neandertals into the ancestors of non-Africans occurred before the divergence of Eurasian groups from each other.
Abstract: Neandertals, the closest evolutionary relatives of present-day humans, lived in large parts of Europe and western Asia before disappearing 30,000 years ago. We present a draft sequence of the Neandertal genome composed of more than 4 billion nucleotides from three individuals. Comparisons of the Neandertal genome to the genomes of five present-day humans from different parts of the world identify a number of genomic regions that may have been affected by positive selection in ancestral modern humans, including genes involved in metabolism and in cognitive and skeletal development. We show that Neandertals shared more genetic variants with present-day humans in Eurasia than with present-day humans in sub-Saharan Africa, suggesting that gene flow from Neandertals into the ancestors of non-Africans occurred before the divergence of Eurasian groups from each other.

3,575 citations


"NCMHap: a novel method for haplotyp..." refers background in this paper

  • ...Haplotypes provide more attainable information than individual SNPs which can be remarkable for investigating the relationship between genetic variations and complex diseases [6], studying human history [7], providing personalized medicine [8] and studying biological mechanisms [9]....
