01 Oct 2020 - bioRxiv (Cold Spring Harbor Laboratory)
TL;DR: An iterative method is proposed that employs a hypergraph to reconstruct haplotypes; it outperforms most other approaches and is a promising technique for haplotype assembly.
Abstract: The decreasing cost of high-throughput DNA sequencing technologies provides huge amounts of data that enable researchers to determine haplotypes for diploid and polyploid organisms. Although various methods have been developed to reconstruct haplotypes in the diploid form, achieving high accuracy remains a challenge, and most current methods cannot be applied to the polyploid form. In this paper, an iterative method is proposed that employs a hypergraph to reconstruct haplotypes. The proposed method further enhances the obtained haplotypes by adopting a chaotic viewpoint. For this purpose, a haplotype set is randomly generated as an initial estimate, and its consistency with the input fragments is described by constructing a weighted hypergraph. Partitioning the hypergraph specifies those positions in the haplotype set that need to be corrected. This procedure is repeated until no further improvement can be achieved. Each element of the finalized haplotype set is then mapped to a line by chaos game representation, and a coordinate series is defined based on the positions of the mapped points. Some positions with low quality can then be assessed by applying a local projection. Experimental results on both simulated and real datasets demonstrate that this method outperforms most other approaches and is promising for haplotype assembly.
Improvements in high-throughput DNA sequencing technologies have dramatically decreased the cost of genome sequencing.
Each SNP contains valuable information about genomic alterations.
H-PoP [34] is a heuristic method that divides the input fragments into P clusters.
Also, a local projection (LP) method is applied to refine the remaining ambiguous measures and to increase the quality of the reconstructed haplotypes.
Preliminaries and assumptions
In polyploid organisms, the SIH problem involves reconstructing the whole set H = {h1, h2, ..., hP} containing P haplotype sequences.
In the error-free case, the fragments can be clustered into P clusters such that the members of each cluster are compatible with each other.
In the diploid case, several models have been proposed to solve the SIH problem based on the input fragments.
Recently, several MEC-based approaches have been developed to solve this problem.
When dealing with a noisy SNP matrix, some fragments are expected to conflict with their corresponding haplotypes.
The proposed method
This section presents a Haplotype Reconstruction approach based on the Chaotic viewpoint and Hypergraph model (HRCH).
The proposed method is briefly described below:
(i) a set of haplotype sequences is randomly generated;
(ii) the input fragments are assigned to the haplotype sequences based on their similarities;
(iii) a weighted SNP hypergraph is built using the similarity measure between the haplotype sequences and the assigned input fragments;
(iv) the constructed hypergraph is used to find a set called CutSet, containing the SNPs that should be modified.
This procedure is repeated for a predefined number of iterations to minimize the MEC score.
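To make the flow of steps (i)-(iv) concrete, the following Python sketch implements the outer iteration under simplifying assumptions of our own: alleles are coded 0/1, gaps are coded -1, and the hypergraph-based CutSet of steps (iii)-(iv) is replaced by a crude stand-in that flips each haplotype's most conflicting SNP. It illustrates the iteration scheme only; it is not the authors' implementation (which was written in MATLAB).

```python
import numpy as np

def mismatches(fragment, haplotype):
    """Count covered positions where the fragment conflicts with the haplotype."""
    covered = fragment != -1
    return int(np.sum(fragment[covered] != haplotype[covered]))

def mec_score(fragments, haplotypes):
    """MEC: each fragment is charged its distance to the closest haplotype."""
    return sum(min(mismatches(f, h) for h in haplotypes) for f in fragments)

def reconstruct(fragments, ploidy, n_snps, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    H = rng.integers(0, 2, size=(ploidy, n_snps))          # step (i)
    best_H, best = H.copy(), mec_score(fragments, H)
    for _ in range(n_iters):
        # Step (ii): assign each fragment to its most similar haplotype.
        labels = [int(np.argmin([mismatches(f, h) for h in H]))
                  for f in fragments]
        # Stand-in for steps (iii)-(iv): flip, per haplotype, the SNP its
        # assigned fragments disagree with most (a one-element "CutSet").
        for p in range(ploidy):
            conflicts = np.zeros(n_snps)
            for f, c in zip(fragments, labels):
                if c == p:
                    covered = f != -1
                    conflicts[covered] += f[covered] != H[p, covered]
            if conflicts.any():
                H[p, int(np.argmax(conflicts))] ^= 1
        score = mec_score(fragments, H)
        if score < best:                 # keep only genuine improvements
            best_H, best = H.copy(), score
        else:
            break                        # stop when no further improvement
    return best_H, best

frags = [np.array([1, 0, -1, 1]), np.array([0, 1, 1, -1]),
         np.array([1, -1, 0, 1]), np.array([-1, 1, 1, 0])]
H, mec = reconstruct(frags, ploidy=2, n_snps=4)
print(H, mec)
```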
Next, by exploiting the chaotic properties of haplotype sequences, the results are further improved.
Pair-SNP consistency
Let ⋈ be a binary operator that concatenates two variables.
$$\omega_{ij} = \frac{1}{T_{ij}} \sum_{f_k \in \mathrm{cov}(s_i, s_j)} \left[\, f_k(i) \bowtie f_k(j) \,\right] \oplus \left[\, h_{c(f_k)}(i) \bowtie h_{c(f_k)}(j) \,\right] \qquad (5)$$
where Tij is the number of fragments covering both SNPs si and sj.
Hypergraph construction
To construct the weighted hypergraph based on the obtained ω matrix, for each SNP si, its K nearest neighbors are found using the following equation:

$$\mathrm{KNN}(s_i) = \left\{\, s_j \;\middle|\; i \neq j,\ \omega_{ij} \le \omega_{il} \,\right\} \qquad (6)$$
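As a concrete reading of Eqs (5) and (6), the sketch below computes a normalized disagreement rate ω between SNP pairs and then selects the K nearest neighbors with the smallest ω. The 0/1 allele coding, the gap value -1, the interpretation of ⊕ as an inconsistency indicator, and the "smaller ω is nearer" convention are our assumptions, not details fixed by the excerpt.

```python
import numpy as np

def pair_weights(fragments, haplotypes, labels):
    """omega[i, j]: fraction of the T_ij fragments covering SNPs si and sj
    whose allele pair is inconsistent with the pair on the fragment's
    assigned haplotype (our reading of Eq (5))."""
    n = haplotypes.shape[1]
    omega = np.full((n, n), np.inf)
    for i in range(n):
        for j in range(i + 1, n):
            total, t = 0, 0
            for f, c in zip(fragments, labels):
                if f[i] == -1 or f[j] == -1:
                    continue                 # fragment must cover both SNPs
                t += 1                       # contributes to T_ij
                # XOR of [f(i) joined f(j)] against [h(i) joined h(j)]
                total += int((f[i] != haplotypes[c, i]) ^
                             (f[j] != haplotypes[c, j]))
            if t > 0:
                omega[i, j] = omega[j, i] = total / t
    return omega

def knn(omega, i, k):
    """Eq (6): the K SNPs sj (j != i) with the smallest omega_ij."""
    return [int(j) for j in np.argsort(omega[i]) if j != i][:k]

frags = [np.array([1, 0, -1]), np.array([1, -1, 1]), np.array([0, 1, 0])]
H = np.array([[1, 0, 1], [0, 1, 0]])
w = pair_weights(frags, H, labels=[0, 0, 1])
print(knn(w, 0, k=1))
```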
Each hyperedge can connect more than two vertices.
Therefore, the connectivity of vertices is defined by finding frequent itemsets.
FP-growth is a tree-based method which uses a depth-first strategy to mine frequent itemsets.
The runtime of this algorithm increases linearly with the number of SNPs [40].
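For illustration, frequent itemsets over such SNP "transactions" can be mined with an off-the-shelf FP-growth implementation. The mlxtend library, the encoding of each SNP's KNN list as one transaction, and the min_support value below are our choices; the excerpt does not name a tool or its parameters.

```python
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth

# One row per transaction (e.g., the KNN list of one SNP), one boolean
# column per SNP; True means the SNP appears in that transaction.
transactions = pd.DataFrame(
    [[True,  True,  False, True],
     [True,  True,  False, False],
     [False, True,  True,  True]],
    columns=["s1", "s2", "s3", "s4"],
)

# Itemsets present in at least 60% of transactions become candidate hyperedges.
itemsets = fpgrowth(transactions, min_support=0.6, use_colnames=True)
print(itemsets)   # DataFrame with 'support' and 'itemsets' columns
```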
Improving Ht by partitioning the hypergraph
As can be seen in Fig 3, in the constructed hypergraph, the SNPs correspond to vertices, and each hyperedge corresponds to an obtained frequent itemset.
The vertices can be divided into two clusters via partitioning the hypergraph.
Moreover, in order to evaluate more allelic combinations of SNPs, two arbitrary SNPs are nominated at a time from a predefined percentage of the SNPs belonging to the CutSet.
Since Ht is randomly generated, its MEC score is poor in the early iterations.
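The excerpt does not say which partitioner is used, so the snippet below is a minimal sketch of one standard substitute: clique-expand each hyperedge into pairwise edges, then split the vertices on the sign of the Fiedler vector of the graph Laplacian. The CutSet would then be derived from SNPs on the boundary between the two clusters.

```python
import numpy as np

def bipartition(n_vertices, hyperedges):
    """Spectral bisection of the clique expansion of a hypergraph."""
    A = np.zeros((n_vertices, n_vertices))
    for edge in hyperedges:            # e.g., frequent itemsets of SNP indices
        for u in edge:
            for v in edge:
                if u != v:
                    A[u, v] = 1.0      # clique expansion of the hyperedge
    L = np.diag(A.sum(axis=1)) - A     # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = vecs[:, 1]               # eigenvector of 2nd-smallest eigenvalue
    return fiedler >= 0                # boolean cluster membership

parts = bipartition(5, [(0, 1, 2), (1, 2, 3), (3, 4)])
print(parts)   # e.g. [ True  True  True False False ]
```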
Refinement of Ht
CGR was initially introduced by Barnsley [42] to evaluate random sequences.
Each letter of the given sequence is iteratively mapped as a point inside the square.
Then, the measure of ambiguous positions can be determined by applying a local projection (LP) method.
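As an illustration of this CGR idea for binary haplotype sequences, the sketch below uses a one-dimensional chaos game on the unit interval: each allele pulls the current point halfway toward 0 or 1, and the visited coordinates form the series to which a local projection could be applied. The choice of interval, starting point, and contraction ratio are our assumptions; the authors' exact mapping and LP procedure may differ.

```python
def cgr_line(haplotype, x0=0.5):
    """Iteratively move halfway toward 0 or 1 according to each allele and
    record the visited coordinates (the 'coordinate series')."""
    series, x = [], x0
    for allele in haplotype:           # allele is 0 or 1
        x = (x + allele) / 2.0         # contraction toward the allele's corner
        series.append(x)
    return series

print(cgr_line([0, 1, 1, 0, 1]))  # [0.25, 0.625, 0.8125, 0.40625, 0.703125]
```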
Results
In this section, the performance of the proposed method is compared with several state-of-the-art approaches in diploid and polyploid forms.
The method was implemented in MATLAB, and all results were obtained on a Windows 10 PC with a 3.6 GHz CPU and 16 GB of RAM.
Reconstruction rate (RR) [4] as a conventional metric was used to evaluate the quality of the obtained haplotypes.
Here, HD denotes the Hamming distance between hi and ĥj, the target and the reconstructed haplotype, respectively, with i, j = 1, 2.
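For reference, a small sketch of the diploid RR computation, assuming its usual definition: the better of the two possible pairings between target and reconstructed haplotypes, with the Hamming distances normalized by the total haplotype length.

```python
def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def reconstruction_rate(h1, h2, r1, r2):
    """Diploid RR: 1 - (best-pairing Hamming distance) / (2 * length)."""
    L = len(h1)
    d = min(hamming(h1, r1) + hamming(h2, r2),
            hamming(h1, r2) + hamming(h2, r1))   # try both pairings
    return 1.0 - d / (2.0 * L)

print(reconstruction_rate([0, 1, 1, 0], [1, 0, 0, 1],
                          [0, 1, 1, 1], [1, 0, 0, 0]))  # 0.75
```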
Diploid case
The experiments were carried out on two widely used and well-known datasets: Geraci's dataset [49] and a dataset from the 1000 Genomes Project, which serve as representative simulated and experimental datasets, respectively.
It should be noted that the first column shows the quality of the obtained haplotypes after the first phase terminates.
The next two columns give the reconstruction rate for em equal to 1 and 2, respectively.
The results show that in most cases the proposed method achieved higher reconstruction rates compared to the others.
Polyploid case
Here, the proposed method is compared with three recent approaches developed to solve haplotype assembly in the polyploid form: ALTHap [23], H-PoP [34], and SCGD [36].
The source code of all compared methods is available.
To investigate the quality of the reconstructed haplotypes, the reconstruction rate (RR) and MEC measure of the methods were compared.
Each sample contains an SNP matrix with a large number of gaps.
As can be seen in Tables 6-8, the proposed method is compared with the other algorithms in terms of RR and MEC.
Conclusion
High levels of noise, as well as gaps in the input fragments, are the main challenges in solving the SIH problem.
TL;DR: A review of the alignment-based methods of polyploid phasing is presented, in which the authors discuss the advantages and limitations of these methods and the metrics used to assess their performance, proposing that accuracy and contiguity are the most meaningful metrics.
Abstract: Phasing, and in particular polyploid phasing, have been challenging problems, held back by the limited read length of high-throughput short-read sequencing methods, which cannot span the distance between heterozygous sites, and by the high labor cost of alternative methods such as the physical separation of chromosomes. Recently developed single-molecule long-read sequencing methods provide much longer reads, which overcome this previous limitation. Here we review the alignment-based methods of polyploid phasing, which rely on four main strategies: population inference methods, which leverage the genetic information of several individuals to phase a sample; objective function minimization methods, which minimize a function such as the Minimum Error Correction (MEC); graph partitioning methods, which represent the read data as a graph and split it into k haplotype subgraphs; and cluster building methods, which iteratively grow clusters of similar reads into a final set of clusters that represent the haplotypes. We discuss the advantages and limitations of these methods and the metrics used to assess their performance, proposing that accuracy and contiguity are the most meaningful metrics. Finally, we propose that the field of alignment-based polyploid phasing would greatly benefit from a well-designed benchmarking dataset with appropriate evaluation metrics. We consider that significant improvements can still be achieved to obtain more accurate and contiguous polyploid phasing results that reflect the complexity of polyploid genome architectures.
TL;DR: A method named NCMHap is proposed, which utilizes the Neutrosophic c-means (NCM) clustering algorithm to effectively detect noise and outliers in the input data and reduce their effects in the clustering process.
Abstract: The single individual haplotype problem refers to reconstructing the haplotypes of an individual from several input fragments sequenced from a specified chromosome. Solving this problem is an important task in computational biology and has many applications in the pharmaceutical industry, clinical decision-making, and the study of genetic diseases. It is known that the problem is NP-hard. Although several methods have been proposed to solve it, most of them perform poorly on noisy input fragments. Therefore, proposing a method that is both accurate and scalable is a challenging task.
In this paper, we introduce a method named NCMHap, which utilizes the Neutrosophic c-means (NCM) clustering algorithm. The NCM algorithm can effectively detect noise and outliers in the input data and reduce their effects in the clustering process. The proposed method has been evaluated on several benchmark datasets. Comparison with existing methods indicates that when NCM is tuned with suitable parameters, the results are encouraging; in particular, it outperforms the compared methods as the amount of noise increases. The proposed method is validated using simulated and real datasets. The achieved results recommend applying NCMHap to datasets that involve fragments with large numbers of gaps and high noise.
3 citations
Cites background or methods from "A chaotic viewpoint-based approach ..."
...The reconstruction rate for the proposed method, H-pop, SCGD, FastHap, HGHap, AROHap, FCMHap, ALTHap, and HRCH applied to the experimental dataset NA12878 dataset provided by 1000 genome project....
[...]
...8 HRCH [29] utilizes a chaotic viewpoint to reconstruct haplotypes....
[...]
...5, illustrates the reconstruction rate of the proposed method as well as H-PoP [26], SCGD [28], FastHap [25], HGHap [22], AROHap [19], ALTHap [27], and HRCH [29]....
[...]
...The average of running time for the proposed method, H-pop, SCGD, FastHap, HGHap, AROHap, FCMHap, ALTHap, and HRCH applied to the experimental dataset NA12878 dataset provided by 1000 genome project (In seconds)....
TL;DR: Haplotype assembly (HA) is the process of obtaining haplotypes using DNA sequencing data, as described in this paper; a haplotype is a set of DNA variants inherited together from one parent or chromosome.
Abstract: Background: A haplotype is a set of DNA variants inherited together from one parent or chromosome. Haplotype information is useful for studying genetic variation and disease association. Haplotype assembly (HA) is a process of obtaining haplotypes using DNA sequencing data. Currently, there are many HA methods with their own strengths and weaknesses. This study focused on comparing six HA methods or algorithms: HapCUT2, MixSIH, PEATH, WhatsHap, SDhaP, and MAtCHap using two NA12878 datasets named hg19 and hg38. The 6 HA algorithms were run on chromosome 10 of these two datasets, each with 3 filtering levels based on sequencing depth (DP1, DP15, and DP30). Their outputs were then compared. Result: Run time (CPU time) was compared to assess the efficiency of 6 HA methods. HapCUT2 was the fastest HA for 6 datasets, with run time consistently under 2 min. In addition, WhatsHap was relatively fast, and its run time was 21 min or less for all 6 datasets. The other 4 HA algorithms' run time varied across different datasets and coverage levels. To assess their accuracy, pairwise comparisons were conducted for each pair of the six packages by generating their disagreement rates for both haplotype blocks and Single Nucleotide Variants (SNVs). The authors also compared them using switch distance (error), i.e., the number of positions where two chromosomes of a certain phase must be switched to match with the known haplotype. HapCUT2, PEATH, MixSIH, and MAtCHap generated output files with similar numbers of blocks and SNVs, and they had relatively similar performance. WhatsHap generated a much larger number of SNVs in the hg19 DP1 output, which caused it to have high disagreement percentages with other methods. However, for the hg38 data, WhatsHap had similar performance as the other 4 algorithms, except SDhaP. The comparison analysis showed that SDhaP had a much larger disagreement rate when it was compared with the other algorithms in all 6 datasets. Conclusion: The comparative analysis is important because each algorithm is different. The findings of this study provide a deeper understanding of the performance of currently available HA algorithms and useful input for other users.
TL;DR: A unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs is presented.
Abstract: Recent advances in sequencing technology make it possible to comprehensively catalogue genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (1) initial read mapping; (2) local realignment around indels; (3) base quality score recalibration; (4) SNP discovery and genotyping to find all potential variants; and (5) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We discuss the application of these tools, instantiated in the Genome Analysis Toolkit (GATK), to deep whole-genome, whole-exome capture, and multi-sample low-pass (~4×) 1000 Genomes Project datasets.
10,056 citations
"A chaotic viewpoint-based approach ..." refers methods in this paper
...Furthermore, the trio-phased variant calls from the GATK resource bundle[55] was used as the target haplotypes....
TL;DR: This study proposes a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develops an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth.
Abstract: Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist prolific patterns and/or long patterns. In this study, we propose a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent pattern mining methods.
6,118 citations
"A chaotic viewpoint-based approach ..." refers background in this paper
...The runtime of this algorithm increases linearly, and it depends on the number of SNPs[40]....
TL;DR: Focusing on how fractal geometry can be used to model real objects in the physical world, this up-to-date edition features two 16-page full-color inserts, problems and tools emphasizing fractal applications, and an answers section.
Abstract: Focusing on how fractal geometry can be used to model real objects in the physical world, this up-to-date edition features two 16-page full-color inserts, problems and tools emphasizing fractal applications, and an answers section. A bonus CD of an IFS Generator provides an excellent software tool for designing iterated function systems codes and fractal images.
TL;DR: The genomic data suggest that Neandertals mixed with modern human ancestors some 120,000 years ago, leaving traces of Neandertal DNA in contemporary humans, suggesting that gene flow from Neandertals into the ancestors of non-Africans occurred before the divergence of Eurasian groups from each other.
Abstract: Neandertals, the closest evolutionary relatives of present-day humans, lived in large parts of Europe and western Asia before disappearing 30,000 years ago. We present a draft sequence of the Neandertal genome composed of more than 4 billion nucleotides from three individuals. Comparisons of the Neandertal genome to the genomes of five present-day humans from different parts of the world identify a number of genomic regions that may have been affected by positive selection in ancestral modern humans, including genes involved in metabolism and in cognitive and skeletal development. We show that Neandertals shared more genetic variants with present-day humans in Eurasia than with present-day humans in sub-Saharan Africa, suggesting that gene flow from Neandertals into the ancestors of non-Africans occurred before the divergence of Eurasian groups from each other.
3,575 citations
"A chaotic viewpoint-based approach ..." refers background in this paper
...Haplotypes can also be used to investigate the pattern of inheritance over evolution, human migration, and the genetically aspects of populations [12-14]....
TL;DR: This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.
Abstract: We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exon (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.
Q1. What are the contributions mentioned in the paper "A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model" ?
The decreasing cost of high-throughput DNA sequencing technologies provides huge amounts of data that enable researchers to determine haplotypes for diploid and polyploid organisms. In this paper, an iterative method is proposed that employs a hypergraph to reconstruct haplotypes. This procedure is repeated until no further improvement can be achieved. Experimental results on both simulated and real datasets demonstrate that this method outperforms most other approaches and is promising for haplotype assembly.