# A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model

## Summary (2 min read)

### Introduction

- Improvements in high-throughput DNA sequencing technologies have dramatically decreased the cost of genome sequencing.
- Each SNP contains valuable information about genomic alterations.
- H-PoP [34] is a heuristic method that divides the input fragments into P clusters.
- Also, a local projection (LP) method is applied to refine the remaining ambiguous measures and increase the quality of the reconstructed haplotypes.

### Preliminaries and assumptions

- The challenge of the SIH problem in polyploid organisms is the reconstruction of the whole set H = {h1, h2, ..., hP} containing P haplotype sequences.
- In the error-free case, the fragments can be clustered into P clusters, such that the members of each cluster are compatible with each other.
- In the diploid case, several models have been proposed to solve the SIH problem based on the input fragments.
- Recently, several MEC-based approaches have been developed to solve this problem.
- When dealing with a noisy SNP matrix, some fragments are expected to conflict with their corresponding haplotypes.
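
The idea of fragments conflicting with their haplotypes can be made concrete with the MEC (minimum error correction) score. The sketch below is a minimal illustration, assuming fragments and haplotypes are strings over `0`/`1` and that `-` marks a gap; the function names and the toy data are hypothetical and are not taken from the paper.

```python
GAP = "-"  # assumed symbol for SNP positions a fragment does not cover


def hamming_with_gaps(fragment, haplotype):
    """Count covered positions where the fragment disagrees with the haplotype."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != GAP and f != h)


def mec_score(fragments, haplotypes):
    """Sum, over all fragments, of the mismatches to the closest haplotype."""
    return sum(
        min(hamming_with_gaps(f, h) for h in haplotypes) for f in fragments
    )


# Toy diploid example (hypothetical data): only the last fragment is noisy.
fragments = ["01-1", "0-01", "10-0", "-010", "1100"]
haplotypes = ["0101", "1010"]
print(mec_score(fragments, haplotypes))
```

In the error-free case every fragment matches one haplotype exactly and the score is zero; each conflicting allele adds one to the score.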

### The proposed method

- This section presents a Haplotype Reconstruction approach based on the Chaotic viewpoint and Hypergraph model (HRCH).
- The proposed method is briefly described below.
- (i) a set of haplotype sequences is randomly generated; (ii) the input fragments are assigned to the haplotype sequences based on their similarities; (iii) a weighted SNP hypergraph is built, using the similarity measure between haplotype sequences and the assigned input fragments; (iv) the constructed hypergraph is used to find a set called CutSet, containing the SNPs which should be modified.
- This procedure is repeated for a predefined number of iterations to minimize the MEC score.
- Next, by considering the existence of chaotic properties of haplotype sequences, the results are improved.
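
Steps (i) and (ii) of the procedure can be sketched as follows. This is only an illustration under stated assumptions: similarity is taken to be the number of mismatches on covered positions, ties go to the lowest-numbered haplotype, and the names `random_haplotypes` and `assign_fragments` are hypothetical. The hypergraph steps (iii) and (iv) are not covered here.

```python
import random

GAP = "-"  # assumed gap symbol


def mismatches(fragment, haplotype):
    """Mismatches between a fragment and a haplotype on covered positions."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != GAP and f != h)


def random_haplotypes(num_haplotypes, num_snps, rng):
    """Step (i): generate random binary haplotype sequences."""
    return [
        "".join(rng.choice("01") for _ in range(num_snps))
        for _ in range(num_haplotypes)
    ]


def assign_fragments(fragments, haplotypes):
    """Step (ii): assign each fragment to the haplotype with fewest mismatches."""
    return [
        min(range(len(haplotypes)), key=lambda p: mismatches(f, haplotypes[p]))
        for f in fragments
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
initial = random_haplotypes(2, 4, rng)
print(initial, assign_fragments(["01-1", "10-0", "1100"], initial))
```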

### Pair-SNP consistency

- Let ⋈ be a binary operator that concatenates two variables.
- $\omega_{ij} = \frac{1}{T_{ij}} \sum_{f_k \in cov(s_i, s_j)} \left[ f_k(i) \bowtie f_k(j) \right] \oplus \left[ h_{c(f_k)}(i) \bowtie h_{c(f_k)}(j) \right]$ (5)
- where $T_{ij}$ is the number of fragments covering both SNPs $s_i$ and $s_j$.
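
One possible reading of this pair-SNP measure is sketched below. It rests on several assumptions that the summary does not confirm: the XOR of two concatenated allele pairs is counted as 1 when the pairs differ and 0 otherwise, the sum is divided by the number of fragments covering both SNPs, and `-` marks a gap. The function name `pair_snp_weight` and the toy data are hypothetical.

```python
GAP = "-"  # assumed gap symbol


def pair_snp_weight(fragments, assignment, haplotypes, i, j):
    """Fraction of fragments covering SNPs i and j whose allele pair
    differs from the pair carried by their assigned haplotype.

    Returns None when no fragment covers both SNPs.
    """
    covering = [
        (f, haplotypes[c])
        for f, c in zip(fragments, assignment)
        if f[i] != GAP and f[j] != GAP
    ]
    if not covering:
        return None
    # Concatenate the two alleles and compare the pairs (assumed XOR reading).
    differing = sum(1 for f, h in covering if f[i] + f[j] != h[i] + h[j])
    return differing / len(covering)


fragments = ["0101", "0111", "01-1", "1010"]
assignment = [0, 0, 0, 1]  # index of the haplotype each fragment belongs to
haplotypes = ["0101", "1010"]
print(pair_snp_weight(fragments, assignment, haplotypes, 1, 2))
```

Under this reading, a value of 0 means every covering fragment agrees with its haplotype on the SNP pair, and larger values indicate more disagreement.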

### Hypergraph construction

- To construct the weighted hypergraph from the obtained ω matrix, the K nearest neighbors of each SNP $s_i$ are found using Eq. (6): $\mathrm{KNN}(s_i) = \{ s_j \mid i \neq j,\ \omega_{ij}\ \text{�}\ \omega_{il} \}$.
- Each hyperedge can connect more than two vertices.
- Therefore, the connectivity of vertices is defined by finding frequent itemsets.
- FP-growth is a tree-based method which uses a depth-first strategy to mine frequent itemsets.
- The runtime of this algorithm increases linearly with the number of SNPs [40].
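
To illustrate how frequent itemsets can define hyperedges, the sketch below mines itemsets by brute-force enumeration. It is a simple stand-in and not FP-growth, which reaches the same itemsets far more efficiently. It assumes that each transaction is the neighborhood of one SNP (the SNP together with its nearest neighbors); the data, the function name, and the `min_support` and `max_size` parameters are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for itemsets of size 2..max_size
    that occur in at least min_support transactions."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(2, max_size + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
    return result


# Hypothetical SNP neighborhoods, one per SNP.
transactions = [{0, 1, 2}, {0, 1, 2}, {1, 2, 3}, {2, 3, 4}, {0, 1, 3}]
for itemset, support in frequent_itemsets(transactions, min_support=3).items():
    print(itemset, support)
```

Each returned itemset can be treated as one hyperedge over the SNP vertices, which is how a hyperedge comes to connect more than two vertices.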

### Improving Ht by partitioning the hypergraph

- As can be seen in Fig 3, in the constructed hypergraph the SNPs correspond to vertices, and each hyperedge corresponds to an obtained frequent itemset.
- The vertices can be divided into two clusters by partitioning the hypergraph.
- Moreover, in order to evaluate more allelic combinations of SNPs, two arbitrary SNPs are nominated at a time for a predefined percentage of the SNPs belonging to the CutSet.
- Since Ht is randomly generated, its MEC score is poor in the early iterations.

### Refinement of Ht

- Chaos game representation (CGR) was initially introduced by Barnsley [42] to evaluate random sequences.
- Each letter of the given sequence is iteratively mapped as a point inside the square.
- Then, the measure of ambiguous positions can be determined by applying a local projection (LP) method.
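
The iterative mapping of letters to points inside the square can be sketched as below. This follows the common CGR convention for nucleotide sequences, in which each new point lies halfway between the previous point and the corner assigned to the current letter. The corner assignment, the starting point at the center, and the function name are assumptions for illustration; the summary does not state which convention the paper uses.

```python
# Assumed corner assignment on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def chaos_game_points(sequence, start=(0.5, 0.5)):
    """Map each letter to a point halfway between the previous point
    and the corner associated with that letter."""
    x, y = start
    points = []
    for letter in sequence:
        cx, cy = CORNERS[letter]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


print(chaos_game_points("ACG"))
```

Because every step halves the distance to a corner, sequences sharing a suffix end up in the same sub-square, which is what makes the representation sensitive to local sequence patterns.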

### Results

- In the following section, the performance of the proposed method is compared with several state-of-the-art approaches in diploid and polyploid forms.
- The method was implemented in MATLAB, and all results were obtained on a Windows 10 PC with a 3.6 GHz CPU and 16 GB of RAM.
- Reconstruction rate (RR) [4], a conventional metric, was used to evaluate the quality of the obtained haplotypes.
- Here, HD denotes the Hamming distance between $h_i$ and $\hat{h}_j$, which are the target and the reconstructed haplotype, respectively, and i, j = 1, 2.
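
The summary does not spell out the RR formula, so the sketch below assumes the usual diploid definition: one minus the smaller total Hamming distance over the two possible pairings of target and reconstructed haplotypes, divided by twice the haplotype length. The function names and example haplotypes are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(1 for x, y in zip(a, b) if x != y)


def reconstruction_rate(target, reconstructed):
    """Diploid reconstruction rate under the assumed standard definition."""
    (h1, h2), (r1, r2) = target, reconstructed
    m = len(h1)
    straight = hamming(h1, r1) + hamming(h2, r2)
    swapped = hamming(h1, r2) + hamming(h2, r1)
    return 1.0 - min(straight, swapped) / (2.0 * m)


target = ("0101", "1010")
reconstructed = ("1010", "0111")  # listed in swapped order, one allele wrong
print(reconstruction_rate(target, reconstructed))
```

Taking the minimum over both pairings makes the score independent of the order in which the two reconstructed haplotypes are reported.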

### Diploid case

- The experiments were carried out on two widely used datasets: Geraci’s dataset [49] and a dataset from the 1000 Genomes Project, which are prime examples of simulated and experimental datasets, respectively.
- It should be noted that the first column shows the quality of the haplotypes obtained after the first phase terminates.
- The next two columns give the reconstruction rate for em equal to 1 and 2, respectively.
- The results show that in most cases the proposed method achieved higher reconstruction rates compared to the others.

### Polyploid case

- Here, the proposed method is compared with three recent approaches that have been developed to solve haplotype assembly in polyploid form, including Althap [23], H-PoP [34], and SCGD [36].
- The source code of all compared methods is available.
- To investigate the quality of the reconstructed haplotypes, the reconstruction rate (RR) and the MEC measure of the methods were compared.
- Each sample contains an SNP matrix with a large number of gaps.
- As can be seen in Tables 6–8, the proposed method is compared with RR and MEC-based algorithms.

### Conclusion

- The high amounts of noise, as well as existing gaps in the input fragments, are the main challenges in solving the SIH problem.
- The proposed method involves two main steps.

##### Citations

3 citations

### Cites background or methods from "A chaotic viewpoint-based approach ..."

...The reconstruction rate for the proposed method, H-pop, SCGD, FastHap, HGHap, AROHap, FCMHap, ALTHap, and HRCH applied to the experimental dataset NA12878 dataset provided by 1000 genome project....

[...]

...8 HRCH [29] utilizes a chaotic viewpoint to reconstruct haplotypes....

[...]

...5, illustrates the reconstruction rate of the proposed method as well as H-PoP [26], SCGD [28], FastHap [25], HGHap [22], AROHap [19], ALTHap [27], and HRCH [29]....

[...]

...The average of running time for the proposed method, H-pop, SCGD, FastHap, HGHap, AROHap, FCMHap, ALTHap, and HRCH applied to the experimental dataset NA12878 dataset provided by 1000 genome project (In seconds)....

[...]

##### References

10,056 citations

### "A chaotic viewpoint-based approach ..." refers methods in this paper

...Furthermore, the trio-phased variant calls from the GATK resource bundle[55] was used as the target haplotypes....

[...]

6,118 citations

### "A chaotic viewpoint-based approach ..." refers background in this paper

...The runtime of this algorithm increases linearly, and it depends on the number of SNPs[40]....

[...]

4,361 citations

3,575 citations

### "A chaotic viewpoint-based approach ..." refers background in this paper

...Haplotypes can also be used to investigate the pattern of inheritance over evolution, human migration, and the genetically aspects of populations [12-14]....

[...]

2,908 citations