A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model
Summary (2 min read)
Introduction
- Improvements in high-throughput DNA sequencing technologies have dramatically decreased the cost of genome sequencing.
- Each SNP contains valuable information about genomic alterations.
- H-PoP [34] is a heuristic method that divides the input fragments into P clusters.
- Also, a local projection (LP) method is applied to refine the remaining ambiguous measures and increase the quality of the reconstructed haplotypes.
Preliminaries and assumptions
- The challenge of the SIH problem in polyploid organisms is the reconstruction of the whole set H = {h1, h2, ..., hP} containing P haplotype sequences.
- In the error-free case, the fragments can be clustered in P clusters, such that the members of each cluster are compatible with each other.
- In the diploid case, several models have been proposed to solve the SIH problem based on the input fragments.
- Recently, several MEC-based approaches have been developed to solve this problem.
- When dealing with a noisy SNP matrix, some fragments are expected to conflict with their corresponding haplotypes.
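The MEC (minimum error correction) score mentioned above can be illustrated with a minimal sketch. It assumes fragments and haplotypes are equal-length strings over {0, 1} and that "-" marks a gap; these encodings and the function names are illustrative assumptions, not taken from the paper.

```python
GAP = "-"  # assumed gap symbol for SNP positions a fragment does not cover


def mismatches(fragment: str, haplotype: str) -> int:
    """Count covered positions where the fragment disagrees with the haplotype."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != GAP and f != h)


def mec_score(fragments, haplotypes) -> int:
    """Sum, over all fragments, of the mismatches to the closest haplotype."""
    return sum(min(mismatches(f, h) for h in haplotypes) for f in fragments)


fragments = ["01-0", "0110", "10-1", "1-01", "0111"]
haplotypes = ["0110", "1001"]
score = mec_score(fragments, haplotypes)
```

In this toy matrix only the last fragment conflicts with its closest haplotype (at one position), so every other fragment contributes zero to the score.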
The proposed method
- This section presents a Haplotype Reconstruction approach based on the Chaotic viewpoint and Hypergraph model (HRCH).
- The proposed method is briefly described below.
- (i) A set of haplotype sequences is randomly generated; (ii) the input fragments are assigned to the haplotype sequences based on their similarities; (iii) a weighted SNP hypergraph is built using the similarity measure between the haplotype sequences and their assigned input fragments; (iv) the constructed hypergraph is used to find a set called CutSet, containing the SNPs that should be modified.
- This procedure is repeated for a predefined number of iterations to minimize the MEC score.
- Next, the results are improved by exploiting the chaotic properties of haplotype sequences.
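Steps (i) and (ii) of the procedure above can be sketched as follows. The sketch assumes binary alleles, "-" for gaps, and "fewest mismatches on covered positions" as the similarity measure; the paper may use a different measure, and all names here are hypothetical.

```python
import random

GAP = "-"  # assumed gap symbol


def random_haplotypes(num_haplotypes: int, num_snps: int, seed: int = 0):
    """Step (i): draw P random binary haplotype sequences."""
    rng = random.Random(seed)
    return ["".join(rng.choice("01") for _ in range(num_snps))
            for _ in range(num_haplotypes)]


def disagreement(fragment: str, haplotype: str) -> int:
    """Number of covered positions where fragment and haplotype differ."""
    return sum(f != GAP and f != h for f, h in zip(fragment, haplotype))


def assign_fragments(fragments, haplotypes):
    """Step (ii): index of the most similar haplotype for each fragment.

    Ties are broken in favour of the lowest haplotype index.
    """
    return [min(range(len(haplotypes)),
                key=lambda c: disagreement(f, haplotypes[c]))
            for f in fragments]


initial = random_haplotypes(num_haplotypes=2, num_snps=4)
assignment = assign_fragments(["01-0", "10-1", "0111"], ["0110", "1001"])
```

The resulting assignment plays the role of c(f_k), the cluster of fragment f_k, in the later steps.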
Pair-SNP consistency
- Let ⋈ be a binary operator which provides the concatenation of two variables.
- $\omega_{ij} = \frac{1}{T_{ij}} \sum_{f_k \in cov(s_i, s_j)} \left[ f_k(i) \bowtie f_k(j) \right] \oplus \left[ h_{c(f_k)}(i) \bowtie h_{c(f_k)}(j) \right]$ (5)
- Where Tij is the number of fragments covering both SNPs si and sj.
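A rough sketch of the pair-SNP consistency in Eq. (5) follows. It makes two assumptions that the summary does not confirm: the ⊕ term is scored as 1 when the fragment's allele pair equals the pair of its assigned haplotype (and 0 otherwise), and the sum is normalised by Tij. Gaps are encoded as "-" and all function names are hypothetical.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol


def pair_consistency(fragments, haplotypes, assignment):
    """Return a symmetric matrix omega with omega[i][j] in [0, 1].

    assignment[k] is the index of the haplotype that fragment k belongs to.
    """
    n = len(haplotypes[0])
    omega = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Fragments covering both SNPs s_i and s_j; their count is T_ij.
        covering = [k for k, f in enumerate(fragments)
                    if f[i] != GAP and f[j] != GAP]
        if not covering:
            continue  # no evidence for this SNP pair
        agree = 0
        for k in covering:
            frag_pair = fragments[k][i] + fragments[k][j]  # concatenation
            hap = haplotypes[assignment[k]]
            agree += frag_pair == hap[i] + hap[j]          # assumed scoring
        omega[i][j] = omega[j][i] = agree / len(covering)
    return omega


omega = pair_consistency(["01-0", "0110", "0111"], ["0110"], [0, 0, 0])
```

Here SNPs 0 and 1 agree with the haplotype in every covering fragment, while SNPs 2 and 3 are covered by only two fragments, one of which disagrees.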
Hypergraph construction
- To construct the weighted hypergraph based on the obtained ω matrix, the K nearest neighbors of each SNP si are found using the following equation: $KNN(s_i) = \{ s_j \mid i \neq j,\ \omega_{ij} \geq \omega_{il} \}$ (6)
- Each hyperedge can connect more than two vertices.
- Therefore, the connectivity of vertices is defined by finding frequent itemsets.
- FP-growth is a tree-based method which uses a depth-first strategy to mine frequent itemsets.
- The runtime of this algorithm increases linearly with the number of SNPs [40].
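The path from the ω matrix to hyperedges can be sketched as below. Note that this sketch replaces FP-growth with a brute-force enumeration of itemsets, which is only practical for tiny inputs. It also assumes that each SNP together with its K highest-ω neighbours forms one transaction; that choice, the support threshold, and the function names are illustrative assumptions.

```python
from itertools import combinations


def knn_transactions(omega, k):
    """One transaction per SNP: the SNP itself plus its k neighbours with largest omega."""
    n = len(omega)
    transactions = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: omega[i][j], reverse=True)
        transactions.append(frozenset([i] + others[:k]))
    return transactions


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force stand-in for FP-growth: itemsets of size 2..max_size with their support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(2, max_size + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
    return result


omega = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.2],
    [0.8, 0.7, 0.0, 0.3],
    [0.1, 0.2, 0.3, 0.0],
]
transactions = knn_transactions(omega, k=2)
hyperedges = frequent_itemsets(transactions, min_support=3)
```

Each returned itemset can be read as a hyperedge over SNP indices, with its support as a candidate weight.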
Improving Ht by partitioning the hypergraph
- As can be seen in Fig 3, in the constructed hypergraph the SNPs correspond to vertices, and each hyperedge corresponds to one of the obtained frequent itemsets.
- The vertices can be divided into two clusters via partitioning the hypergraph.
- Moreover, in order to evaluate more allelic combinations of SNPs, two arbitrary SNPs are nominated at a time for a predefined percentage of the SNPs belonging to the CutSet.
- Since Ht is randomly generated, its MEC score is poor in the early iterations.
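To make the partitioning step concrete, the sketch below takes a given two-way partition of the vertices and lists the hyperedges that cross it. Treating the vertices of those crossing hyperedges as CutSet candidates is an assumption made for illustration; the summary does not state how the paper derives the CutSet, and the partitioner itself is not shown. All names are hypothetical.

```python
def cut_hyperedges(hyperedges, side):
    """Hyperedges with vertices on both sides of the partition.

    hyperedges maps a tuple of vertex ids to a weight;
    side maps each vertex id to 0 or 1.
    """
    return {edge: weight for edge, weight in hyperedges.items()
            if len({side[v] for v in edge}) > 1}


def cut_vertices(hyperedges, side):
    """Vertices touched by at least one crossing hyperedge (assumed CutSet candidates)."""
    return sorted({v for edge in cut_hyperedges(hyperedges, side) for v in edge})


hyperedges = {(0, 1): 3, (0, 2): 3, (1, 2): 4, (0, 1, 2): 3, (2, 3): 1}
side = {0: 0, 1: 0, 2: 0, 3: 1}
crossing = cut_hyperedges(hyperedges, side)
cut_weight = sum(crossing.values())
candidates = cut_vertices(hyperedges, side)
```

A partitioner would normally search for the split that minimises the total weight of crossing hyperedges; here the split is supplied by hand.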
Refinement of Ht
- CGR (chaos game representation) was initially introduced by Barnsley [42] to evaluate random sequences.
- Each letter of the given sequence is iteratively mapped as a point inside the square.
- Then, the measure of ambiguous positions can be determined by applying a local projection (LP) method.
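The iterative mapping described above can be sketched with the usual chaos game construction: each letter owns one corner of the unit square, and every new point is the midpoint between the previous point and the corner of the current letter. The corner layout and the four-letter alphabet are assumptions; the paper's exact encoding of haplotypes is not given in this summary, and the LP step is not shown.

```python
# Assumed corner assignment for a four-letter nucleotide alphabet.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Map each letter to a point inside the unit square."""
    x, y = start
    points = []
    for letter in sequence:
        cx, cy = CORNERS[letter]
        # Move halfway from the current point towards the letter's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


points = cgr_points("ACG")
```

Because each point depends on all preceding letters, sequences that share a suffix end up in the same sub-square, which is what makes the representation useful for comparing local sequence context.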
Results
- In the following section, the performance of the proposed method is compared with several state-of-the-art approaches in diploid and polyploid forms.
- The method was implemented in MATLAB, and all results were obtained on a Windows 10 PC with a 3.6 GHz CPU and 16 GB of RAM.
- Reconstruction rate (RR) [4], a conventional metric, was used to evaluate the quality of the obtained haplotypes.
- Here, HD denotes the Hamming distance between hi and ĥj, which are the target and the reconstructed haplotypes, respectively, with i, j = 1, 2.
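The RR formula itself is not reproduced in this summary. The sketch below uses the form commonly used for the diploid case, RR = 1 - min(HD(h1, ĥ1) + HD(h2, ĥ2), HD(h1, ĥ2) + HD(h2, ĥ1)) / (2n), where n is the haplotype length; treat this as an assumption about the metric, not a quotation from the paper.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def reconstruction_rate(target, reconstructed) -> float:
    """Diploid RR: best of the two possible pairings of target and reconstructed haplotypes."""
    (h1, h2), (r1, r2) = target, reconstructed
    n = len(h1)
    straight = hamming_distance(h1, r1) + hamming_distance(h2, r2)
    swapped = hamming_distance(h1, r2) + hamming_distance(h2, r1)
    return 1 - min(straight, swapped) / (2 * n)


rr = reconstruction_rate(("0110", "1001"), ("1001", "0111"))
```

Taking the minimum over both pairings matters because the order in which an assembler reports the two haplotypes is arbitrary.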
Diploid case
- The experiments were carried out on two widely used datasets: Geraci’s dataset [49] and a dataset from the 1000 Genomes Project, which are prime examples of simulated and experimental datasets, respectively.
- It should be noted that the first column demonstrates the quality of the obtained haplotypes after terminating the first phase.
- The next two columns give the reconstruction rate for em equal to 1 and 2, respectively.
- The results show that in most cases the proposed method achieved higher reconstruction rates compared to the others.
Polyploid case
- Here, the proposed method is compared with three recent approaches developed to solve haplotype assembly in the polyploid form: AltHap [23], H-PoP [34], and SCGD [36].
- The source code of all compared methods is available.
- To investigate the quality of the reconstructed haplotypes, the reconstruction rate (RR) and MEC measure of the methods are compared.
- Each sample contains an SNP matrix with a large number of gaps.
- As can be seen in Tables 6–8, the proposed method is compared with RR and MEC-based algorithms.
Conclusion
- The high amounts of noise, as well as existing gaps in the input fragments, are the main challenges in solving the SIH problem.
- The proposed method involves two main steps.
Citations
3 citations
3 citations
Cites background or methods from "A chaotic viewpoint-based approach ..."
...The reconstruction rate for the proposed method, H-pop, SCGD, FastHap, HGHap, AROHap, FCMHap, ALTHap, and HRCH applied to the experimental dataset NA12878 dataset provided by 1000 genome project....
[...]
...8 HRCH [29] utilizes a chaotic viewpoint to reconstruct haplotypes....
[...]
...5, illustrates the reconstruction rate of the proposed method as well as H-PoP [26], SCGD [28], FastHap [25], HGHap [22], AROHap [19], ALTHap [27], and HRCH [29]....
[...]
...The average of running time for the proposed method, H-pop, SCGD, FastHap, HGHap, AROHap, FCMHap, ALTHap, and HRCH applied to the experimental dataset NA12878 dataset provided by 1000 genome project (In seconds)....
[...]
References
18 citations
Additional excerpts
...The MEC is one of the most popular and successful algorithms compared with the models as mentioned above [4,21-28]....
[...]
15 citations
"A chaotic viewpoint-based approach ..." refers methods in this paper
...Using this procedure, many attempts have been made with the purpose of extracting novel features from biological sequences by exploiting CGR[44-48]....
[...]
14 citations
"A chaotic viewpoint-based approach ..." refers background in this paper
...In[30], the authors proposed a parallel version of WhatsHap which is able to process higher...
[...]
13 citations
12 citations
"A chaotic viewpoint-based approach ..." refers methods or result in this paper
...In [33]....
[...]
...Removing the homozygote positions was performed as described by [33] such that the most frequent measure for each column could be found....
[...]
...The output of the proposed method was compared with a set of state-of-the-art and well-known methods including SCGD [36], H-pop [34], ARO [24], HG [33], FCM [25], FastHap [26], DGS [50], SHR [51], MLF [52], HapCut [27], 2d [22], Fast [53], and SPH [54]....
[...]
...The obtained reconstruction rates of the proposed method are compared to those of H-pop[34], SCGD[36], HG[33], ARO[24], and FCM[25] approaches....
[...]