# A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model

Abstract: The decreasing cost of high-throughput DNA sequencing provides a huge amount of data that enables researchers to determine haplotypes for diploid and polyploid organisms. Although various methods have been developed to reconstruct diploid haplotypes, achieving high accuracy remains challenging, and most current methods cannot be applied to polyploid organisms. This paper proposes an iterative method that employs a hypergraph to reconstruct haplotypes and then uses a chaotic viewpoint to enhance them. A haplotype set is first generated at random as an initial estimate, and its consistency with the input fragments is described by constructing a weighted hypergraph. Partitioning the hypergraph identifies the positions in the haplotype set that need to be corrected. This procedure is repeated until no further improvement can be achieved. Each element of the finalized haplotype set is then mapped to a line by chaos game representation, and a coordinate series is defined from the positions of the mapped points. Positions of low quality can then be reassessed by applying a local projection. Experimental results on both simulated and real datasets demonstrate that the method outperforms most other approaches and is promising for haplotype assembly.

### Introduction

- Improvements in high-throughput DNA sequencing technologies have dramatically decreased the cost of genome sequencing.
- Each SNP contains valuable information about genomic alterations.
- H-PoP [34] is a heuristic method that divides the input fragments into P clusters.
- In addition, a local projection (LP) method is applied to refine the remaining ambiguous measures and increase the quality of the reconstructed haplotypes.

### Preliminaries and assumptions

- The challenge of the SIH problem in polyploid organisms is to reconstruct the whole set H = {h1, h2, ..., hP} of P haplotype sequences.
- In the error-free case, the fragments can be clustered in P clusters, such that the members of each cluster are compatible with each other.
- In the diploid case, several models have been proposed to solve the SIH problem based on the input fragments.
- Recently, several MEC-based approaches have been developed to solve this problem.
- With a noisy SNP matrix, some fragments are expected to conflict with their corresponding haplotypes.
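
The MEC objective behind these approaches can be illustrated with a minimal Python sketch. It assumes the standard minimum error correction definition, in which each fragment is charged its mismatches against the closest haplotype and uncovered positions are ignored. The `-` gap symbol and the function names are illustrative choices, not taken from the paper.

```python
GAP = "-"  # assumed symbol for SNP positions that a fragment does not cover


def mismatches(fragment, haplotype):
    """Count covered positions where the fragment disagrees with the haplotype."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != GAP and f != h)


def mec_score(fragments, haplotypes):
    """Sum, over all fragments, of the mismatches to the closest haplotype."""
    return sum(min(mismatches(f, h) for h in haplotypes) for f in fragments)


fragments = ["01-1", "-010", "1-00"]
haplotypes = ["0101", "1010"]
score = mec_score(fragments, haplotypes)
```

In the error-free case every fragment is compatible with one haplotype and the score is zero; here the third fragment conflicts with its closest haplotype at one position.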

### The proposed method

- This section presents a Haplotype Reconstruction approach based on the Chaotic viewpoint and Hypergraph model (HRCH).
- The proposed method is briefly described below.
- (i) a set of haplotype sequences is randomly generated; (ii) the input fragments are assigned to the haplotype sequences based on their similarities; (iii) a weighted SNP hypergraph is built using the similarity measure between the haplotype sequences and their assigned input fragments; (iv) the constructed hypergraph is used to find a set called CutSet, containing the SNPs that should be modified.
- This procedure is repeated for a predefined number of iterations to minimize the MEC score.
- Next, the results are improved by exploiting the chaotic properties of haplotype sequences.
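
Steps (i) and (ii) can be sketched in Python as follows. This is only an illustration under stated assumptions: similarity is taken to mean the fewest mismatches on covered positions, ties go to the lowest haplotype index, and the `-` gap symbol, the biallelic `"01"` alphabet, and all function names are hypothetical rather than the authors' implementation.

```python
import random

GAP = "-"  # assumed symbol for positions a fragment does not cover


def random_haplotypes(num_haplotypes, num_snps, alleles="01", seed=None):
    """Step (i): draw every allele of every haplotype uniformly at random."""
    rng = random.Random(seed)
    return ["".join(rng.choice(alleles) for _ in range(num_snps))
            for _ in range(num_haplotypes)]


def mismatches(fragment, haplotype):
    """Count covered positions where the fragment disagrees with the haplotype."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != GAP and f != h)


def assign_fragments(fragments, haplotypes):
    """Step (ii): index of the closest haplotype for each fragment."""
    return [min(range(len(haplotypes)),
                key=lambda p: mismatches(frag, haplotypes[p]))
            for frag in fragments]


initial = random_haplotypes(num_haplotypes=2, num_snps=4, seed=0)
assignment = assign_fragments(["01-1", "-010"], ["0101", "1010"])
```

The assignment list is what later steps would use to look up the haplotype that each fragment is compared against.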

### Pair-SNP consistency

- Let ⋈ be a binary operator that concatenates two variables.
- $\omega_{ij} = \frac{1}{T_{ij}} \sum_{f_k \in \mathrm{cov}(s_i, s_j)} \left[ f_k(i) \bowtie f_k(j) \right] \oplus \left[ h_{c(f_k)}(i) \bowtie h_{c(f_k)}(j) \right] \quad (5)$
- Here, $T_{ij}$ is the number of fragments covering both SNPs $s_i$ and $s_j$.
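
A pair-SNP score of this kind can be sketched in Python. The sketch makes two assumptions that the text here does not confirm: the ⊕ comparison is read as an XOR-style indicator that scores 1 when the fragment's concatenated allele pair differs from that of its assigned haplotype, and the sum is normalized by the number of covering fragments. The `-` gap symbol and the function name are also illustrative.

```python
GAP = "-"  # assumed symbol for positions a fragment does not cover


def pair_snp_score(fragments, haplotypes, assignment, i, j):
    """Fraction of fragments covering SNPs i and j whose allele pair
    differs from the pair in their assigned haplotype (assumed reading)."""
    covering = [k for k, f in enumerate(fragments)
                if f[i] != GAP and f[j] != GAP]
    if not covering:
        return 0.0  # no fragment links the two SNPs
    disagreements = 0
    for k in covering:
        fragment_pair = fragments[k][i] + fragments[k][j]   # concatenation
        haplotype = haplotypes[assignment[k]]
        haplotype_pair = haplotype[i] + haplotype[j]
        if fragment_pair != haplotype_pair:
            disagreements += 1
    return disagreements / len(covering)


fragments = ["01-1", "0010", "-10-"]
haplotypes = ["0101", "1010"]
assignment = [0, 0, 1]  # index of the haplotype each fragment is assigned to
score_01 = pair_snp_score(fragments, haplotypes, assignment, 0, 1)
```

For SNPs 0 and 1, two fragments cover both positions and one of them disagrees with its haplotype, so the score is one half.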

### Hypergraph construction

- To construct the weighted hypergraph from the resulting ω matrix, the K nearest neighbors of each SNP $s_i$ are found using Eq. (6): $\mathrm{KNN}(s_i) = \{ s_j \mid i \neq j,\ \omega_{ij} \geq \omega_{il} \}$.
- Each hyperedge can connect more than two vertices.
- Therefore, the connectivity of vertices is defined by finding frequent itemsets.
- FP-growth is a tree-based method which uses a depth-first strategy to mine frequent itemsets.
- The runtime of this algorithm grows linearly with the number of SNPs [40].
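
The two construction steps can be sketched in Python. Several parts are assumptions: a larger ω is taken to mean a nearer neighbor, each neighbor set is treated as one transaction, and frequent itemsets are found by brute-force enumeration with `itertools.combinations` instead of FP-growth, which keeps the example short but does not scale the same way. Function names and the `max_size` cap are hypothetical.

```python
from collections import Counter
from itertools import combinations


def knn_sets(omega, k):
    """For each SNP, the k other SNPs with the largest weight (ties by index)."""
    n = len(omega)
    neighbors = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: omega[i][j], reverse=True)  # stable sort
        neighbors.append(frozenset(others[:k]))
    return neighbors


def frequent_itemsets(transactions, min_support, max_size=3):
    """Itemsets of size 2..max_size contained in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)
        for size in range(2, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}


omega = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.2],
    [0.8, 0.7, 0.0, 0.3],
    [0.1, 0.2, 0.3, 0.0],
]
transactions = knn_sets(omega, k=2)
hyperedges = frequent_itemsets(transactions, min_support=2)
```

Each surviving itemset would become one hyperedge, with its support available as a candidate weight.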

### Improving Ht by partitioning the hypergraph

- As can be seen in Fig 3, the vertices of the constructed hypergraph correspond to SNPs, and each hyperedge corresponds to a mined frequent itemset.
- The vertices can be divided into two clusters via partitioning the hypergraph.
- Moreover, to evaluate more allelic combinations of SNPs, two arbitrary SNPs are nominated at a time for a predefined percentage of the SNPs belonging to the CutSet.
- Since Ht is randomly generated, its MEC score is poor in the early iterations.
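
To make the role of the partition concrete, the following Python sketch identifies cut hyperedges for a given two-way partition. A hyperedge is cut when its vertices fall into more than one part, which is the standard definition. How the paper derives the CutSet from the partition is not shown here, so collecting the vertices of cut hyperedges is an assumption, and the partition itself is supplied by hand rather than computed by a partitioner.

```python
def cut_hyperedges(hyperedges, part_of):
    """Hyperedges whose vertices are spread over more than one part."""
    return [edge for edge in hyperedges
            if len({part_of[v] for v in edge}) > 1]


def candidate_cut_set(hyperedges, part_of):
    """Vertices lying on at least one cut hyperedge (assumed CutSet proxy)."""
    return sorted({v for edge in cut_hyperedges(hyperedges, part_of)
                   for v in edge})


# Hypothetical hypergraph on five SNPs and a hand-made bipartition.
hyperedges = [(0, 1), (1, 2, 3), (3, 4)]
part_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}
cut = cut_hyperedges(hyperedges, part_of)
cut_set = candidate_cut_set(hyperedges, part_of)
```

Only the hyperedge `(1, 2, 3)` spans both parts in this example, so its three SNPs are the candidates for modification.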

### Refinement of Ht

- Chaos game representation (CGR) was initially introduced by Barnsley [42] to evaluate random sequences.
- Each letter of the given sequence is iteratively mapped as a point inside the square.
- Then, the measure of ambiguous positions can be determined by applying a local projection (LP) method.
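
The iterative mapping can be illustrated with a minimal Python sketch of the classic chaos game on the unit square: starting from the center, each letter moves the current point halfway toward that letter's corner. The corner assignment below is one common convention and is an assumption, as are the function name and the starting point; the line-based mapping and the local projection step used for haplotypes are not reproduced here.

```python
# Assumed corner layout for the four nucleotides on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Map each letter to a point by moving halfway toward its corner."""
    x, y = start
    points = []
    for letter in sequence:
        corner_x, corner_y = CORNERS[letter]
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


points = cgr_points("AG")
```

Because every step halves the distance to a corner, the final point encodes the whole sequence, with recent letters determining the coarse location.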

### Results

- In the following section, the performance of the proposed method is compared with several state-of-the-art approaches in diploid and polyploid forms.
- The method was implemented in MATLAB, and all results were obtained on a Windows 10 PC with a 3.6 GHz CPU and 16 GB of RAM.
- Reconstruction rate (RR) [4], a conventional metric, was used to evaluate the quality of the obtained haplotypes.
- Here, HD denotes the Hamming distance between $h_i$ and $\hat{h}_j$, the target and reconstructed haplotypes respectively, with i, j = 1, 2.
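
As an illustration of the metric, the following Python sketch computes RR for a diploid pair. It assumes the commonly used definition, in which the total Hamming distance is taken under the better of the two possible pairings between target and reconstructed haplotypes and normalized by twice the haplotype length; the exact formula is not reproduced in the text above, and the function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def reconstruction_rate(target, reconstructed):
    """Diploid RR: 1 - (best-pairing Hamming distance) / (2 * length)."""
    (h1, h2), (r1, r2) = target, reconstructed
    direct = hamming(h1, r1) + hamming(h2, r2)
    swapped = hamming(h1, r2) + hamming(h2, r1)
    return 1 - min(direct, swapped) / (2 * len(h1))


rr = reconstruction_rate(("0101", "1010"), ("1010", "0111"))
```

Taking the minimum over both pairings makes the score independent of the order in which the two reconstructed haplotypes are reported.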

### Diploid case

- The experiments were carried out on two widely used datasets: Geraci’s dataset [49], which is simulated, and a dataset from the 1000 Genomes Project, which is experimental.
- It should be noted that the first column demonstrates the quality of the obtained haplotypes after terminating the first phase.
- The next two columns give the reconstruction rate for em equal to 1 and 2, respectively.
- The results show that in most cases the proposed method achieved higher reconstruction rates compared to the others.

### Polyploid case

- Here, the proposed method is compared with three recent approaches developed for haplotype assembly in polyploid form: Althap [23], H-PoP [34], and SCGD [36].
- The source code of all compared methods is available.
- To assess the quality of the reconstructed haplotypes, the reconstruction rate (RR) and MEC score of the methods are compared.
- Each sample contains an SNP matrix with a large number of gaps.
- As can be seen in Tables 6–8, the proposed method is compared with RR and MEC-based algorithms.

### Conclusion

- The high amounts of noise, as well as existing gaps in the input fragments, are the main challenges in solving the SIH problem.
- The proposed method involves two main steps.

