RESEARCH ARTICLE

A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model

Mohammad Hossein Olyaee 1, Alireza Khanteymoori 2*, Khosrow Khalifeh 3,4

1 Faculty of Engineering, Department of Computer Engineering, University of Gonabad, Gonabad, Iran
2 Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany
3 Department of Biology, Faculty of Sciences, University of Zanjan, Zanjan, Iran
4 Department of Biotechnology, Research Institute of Modern Biological Techniques, University of Zanjan, Zanjan, Iran

* khanteymoori@gmail.com

Abstract

The decreasing cost of high-throughput DNA sequencing technologies provides a huge amount of data that enables researchers to determine haplotypes for diploid and polyploid organisms. Although various methods have been developed to reconstruct haplotypes in diploid form, achieving high accuracy remains a challenge. Also, most of the current methods cannot be applied to polyploid form. In this paper, an iterative method is proposed, which employs a hypergraph to reconstruct haplotypes. By utilizing a chaotic viewpoint, the proposed method can enhance the obtained haplotypes. For this purpose, a haplotype set is randomly generated as an initial estimate, and its consistency with the input fragments is described by constructing a weighted hypergraph. Partitioning the hypergraph specifies those positions in the haplotype set that need to be corrected. This procedure is repeated until no further improvement can be achieved. Each element of the finalized haplotype set is mapped to a line by chaos game representation, and a coordinate series is defined based on the position of the mapped points. Then, some low-quality positions can be assessed by applying a local projection. Experimental results on both simulated and real datasets demonstrate that this method outperforms most other approaches and is promising for haplotype assembly.

OPEN ACCESS

PLOS ONE | https://doi.org/10.1371/journal.pone.0241291 | October 29, 2020

Citation: Olyaee MH, Khanteymoori A, Khalifeh K (2020) A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model. PLoS ONE 15(10): e0241291. https://doi.org/10.1371/journal.pone.0241291

Editor: Zechen Chong, University of Alabama at Birmingham, UNITED STATES

Received: May 3, 2020

Accepted: October 12, 2020

Published: October 29, 2020

License: Open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: The Geraci’s dataset is available via email (contact via filippo.geraci@iit.cnr.it). The real dataset is available for download (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/). The source code is available from GitHub (https://github.com/mholyaee/HRCH).

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Improvements in high-throughput DNA sequencing technologies have dramatically decreased the cost of genome sequencing. This achievement helps researchers to understand the variation in an individual’s genomic data and paves the way toward individualized strategies for diagnostic or therapeutic decision-making [1]. The most frequent type of genetic variation is the single nucleotide polymorphism (SNP). Each SNP is a mutation at the same position on the DNA sequences of a homologous pair of chromosomes in an individual, and among the corresponding DNA sequences of the whole population. Similarly, the term “allele” refers to the different forms of a gene at one locus. Accordingly, four different alleles are possible for a given SNP site. Nonetheless, most SNPs are bi-allelic, containing only two kinds of alleles, which can be simply denoted by ‘0’ and ‘1’ [2]. Each SNP contains valuable information about genomic alterations. Experimental studies have revealed that SNPs are clustered across the human genome and are not randomly distributed [3]. In line with this observation, linkage disequilibrium (LD) demonstrates that there are correlations and spatial dependencies among neighboring SNPs. The sequence of SNPs on a single strand of DNA is known as a haplotype. In other words, a haplotype can be considered as a combination of marker alleles that are positioned closely together on the same strand of DNA and tend to be inherited together from parents to offspring [4]. It has been shown that some diseases, such as sickle-cell anemia [5], cystic fibrosis [6] and hemochromatosis [7], are more common in specific ethnic populations due to unique genetic mutations in their genomes, but are rarely found in others. There are also reports indicating that different populations may respond differently to drugs [8–10]. These findings demonstrate that haplotypes in human genomic data can be a useful and informative tool in mapping genes that are involved in representative diseases, as well as in personalized medicine [11]. Haplotypes can also be used to investigate the pattern of inheritance over evolution, human migration, and the genetic aspects of populations [12–14]. Genetic association analysis for gene mapping can also be improved by haplotype analysis [15]. Also, it is possible to detect errors and missing data in experimental sequencing of DNA using the information of haplotypes [16].

It is worth mentioning that the experimental analysis of haplotypes is labor-intensive and expensive. Moreover, it can be used only for constructing local haplotypes. In other words, human haplotypes are provided as sequencing reads or fragments. Obtaining haplotype information from the numerous fragments is a vital task due to its profound impact on different aspects of medicine and molecular biology [15, 17–19]. However, the detection of genetic variations has critical limitations compared with the molecular approaches. According to the type of input data, the existing methods of haplotype reconstruction are divided into two main categories: single individual haplotyping (SIH) and haplotype inference. SIH methods receive several fragments that have been sequenced from a given chromosome. Most of the fragments contain gaps and are usually disrupted by noise. To cope with these problems, the input fragments are clustered based on their similarities. Then, the haplotypes can be reconstructed using the center of each cluster [4]. The haplotype inference methods receive genotype information of several individuals as input data and infer their related haplotype sequences [20]. It is worth noting that each genotype represents a combination of haplotypes on the homologous chromosomes.

With the increasing size of data, a growing number of researchers have tried to solve the haplotype assembly problem. Several computational models, including minimum fragment removal (MFR), minimum error correction (MEC), minimum SNP removal (MSR), and the longest haplotype reconstruction (LHR), have been developed to cope with the SIH problem. The MEC is one of the most popular and successful of the models mentioned above [4, 21–28]. This model attempts to cluster the input fragments such that all the fragments belonging to a specified cluster are compatible. Otherwise, they are made compatible by applying the minimum number of alterations. The current approaches can be divided into exact and heuristic methods. Since finding the optimal minimum error correction is NP-hard, the exact approaches have exponential complexity [21]. Among exact solutions, WhatsHap [29] is regarded as a pioneering method, which is dynamic programming-based and utilizes a weighted variant of the MEC. The experimental results demonstrate that it can process long reads at coverage up to 20×. In [30], the authors proposed a parallel version of WhatsHap which is able to process higher coverages up to 25×. AROHap [24] is a recently published evolutionary-based method that exploits the asexual reproduction optimization algorithm to solve the SIH problem. In this method, the fitness function is designed based on the MEC model. In [26], a heuristic method named Fasthap was developed, which builds a weighted fuzzy conflict graph based on the MEC model. The constructed graph is then used to cluster the input fragments. The fuzzy C-means (FCM) approach has been applied in [25] to enhance the performance of clustering the fragments. However, this method performs poorly in dealing with noisy fragments. Some popular methods, including MCMC [31], HapCUT [27], and HapCUT2 [32], construct the graph differently. These methods start with a set of arbitrary sequences as initial haplotypes and improve it step by step with respect to the input fragments. They build a similar weighted graph in their respective models. However, instead of fragments, SNPs are used as the vertices of the graph. Each pair of SNPs is connected if they are covered by at least one input fragment. The weight of each edge determines the amount of consistency with their corresponding positions in the current haplotypes. Although this model efficiently determines the consistency of the current haplotype with the input fragments, the existing gaps and noise lead to a loss of accuracy in determining the weight of edges. In [33], it has been shown that the hypergraph can precisely describe the distance of input fragments.

Although various methods have been developed to solve the SIH problem, most of them can only be applied to diploid organisms and fail to consider polyploid organisms. It should be noted that haplotype reconstruction in the polyploid case is more complicated than in the diploid one. Suppose that P is the number of ploids, and m is the length of the haplotype sequences. In this case, there are at least $2^{m-1}(P-1)$ different solutions for phasing the haplotypes [23]. Recently, several studies, such as [23, 34–36], have been conducted on polyploid organisms. Althap [23] and SCGD [36] are two recently developed methods based on matrix factorization to solve the SIH problem. H-PoP [34] is a heuristic method that divides the input fragments into P clusters, such that the members of each cluster have the minimum distance from each other and are far from the fragments of other clusters. Belief propagation (BP) [35] is another method addressing the SIH problem by mapping the MEC model to a decoding mechanism that involves message transmission in a noisy channel. In this context, it has been reported that haplotype blocks of suitable length can exhibit chaotic behavior. This feature has been recently used to improve the reconstruction rate in the single individual haplotyping problem [37].

Considering the chaotic nature of haplotype sequences, in this paper, an iterative algorithm is proposed to reconstruct the haplotypes using the hypergraph model. The method includes two main steps. Firstly, an iterative mechanism is applied to the SNP matrix to construct the haplotype set, and the consistency between SNPs is modeled based on the hypergraph. Then, the parts of the haplotypes to be corrected are determined by partitioning the hypergraph. This step is followed by transforming the obtained haplotypes into a line using the chaos game representation, where a coordinate series is defined based on the position of the mapped points. Also, a local projection (LP) method is applied to refine the remaining ambiguous values and increase the quality of the reconstructed haplotypes.
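To make the chaos game step more concrete, the following is a minimal sketch of one common one-dimensional chaos game representation for a binary sequence. It is an illustration only: the update rule (move halfway toward the endpoint 0 or 1 named by the current allele), the starting point 0.5, and the function name `cgr_line` are assumptions and are not taken from the paper, whose exact mapping is not given in this section.

```python
from typing import List


def cgr_line(haplotype: str, start: float = 0.5) -> List[float]:
    """Map a binary haplotype string to a coordinate series on [0, 1].

    Assumed rule (not from the paper): at each SNP, move halfway from the
    current point toward the endpoint (0 or 1) given by the allele.
    """
    x = start
    coords = []
    for allele in haplotype:
        if allele not in ("0", "1"):
            raise ValueError(f"unexpected allele {allele!r}; expected '0' or '1'")
        x = (x + int(allele)) / 2.0
        coords.append(x)
    return coords


if __name__ == "__main__":
    print(cgr_line("0110"))
```

Under this assumed rule, each coordinate encodes the recent history of alleles, so neighboring dependencies between SNPs show up as structure in the coordinate series.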

The significant contributions of the proposed method are as follows:

• The similarity measurement between the input fragments can be described more accurately by utilizing the hypergraph model. Moreover, it helps to overcome challenges originating from the huge amount of gaps and sequencing errors.

• The quality score for each position of the reconstructed haplotypes can be calculated to predict the remaining errors.

• The chaotic nature hypothesis is used to refine the reconstructed haplotypes. To this end, we only concentrate on the neighboring dependencies between SNPs.

• The proposed method can be applied effectively to both diploid and polyploid organisms.


The rest of the paper is organized as follows. Section 2 provides a brief review of the problem statement. In section 3, the proposed method is described in detail. Experimental results are presented in section 4. Finally, section 5 concludes the paper.

Preliminaries and assumptions

The challenge of the SIH problem in polyploid organisms is the reconstruction of the whole set $H = \{h_1, h_2, \ldots, h_P\}$ containing P haplotype sequences, based on the available aligned input fragments. Similar to the diploid case, the input fragments can be represented in a standard form. Let X be the SNP matrix in which each row corresponds to an input fragment, and each column indicates a specified SNP. In binary allelic haplotypes, it is assumed that $x_{ij} \in \{0, 1, -\}$ indicates the obtained allele in a specified fragment $f_i$ at SNP $s_j$. Also, each haplotype $h_i$ ($i = 1, 2, \ldots, P$) belongs to $\{1, 0\}^N$, where N is the number of SNP columns. In the diploid case, there are some positions called homozygote sites at which $h_1$ equals $h_2$. On the other hand, the sites with different values are called heterozygote positions. Homozygote sites are usually removed from the input matrix, as they do not provide useful information for the haplotype assembly problem. It is worth noting that the − sign indicates missing information during the sequencing process. For two fragments which originate from different haplotypes, it is expected that there are some dissimilarities between them. Several relations have been developed to describe the differences between two fragments. Hamming distance (HD) is the most practical approach, which can be used to calculate the differences between two input fragments $f_i$ and $f_j$ as follows:

$$\mathrm{HD}(f_i, f_j) = \sum_{l=1}^{N} d(f_i[l], f_j[l]) \tag{1}$$

where d is defined as follows:

$$d(x, y) = \begin{cases} 1, & x \neq y \text{ and } x \neq - \text{ and } y \neq - \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
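The gap-aware distance of Eqs (1) and (2) can be sketched in a few lines. This is a minimal illustration, assuming fragments are stored as equal-length strings over '0', '1' and '-' (the gap symbol); the names `GAP`, `d` and `hamming_distance` are chosen here for illustration and are not taken from the authors' code.

```python
GAP = "-"  # assumed encoding of a missing allele


def d(x: str, y: str) -> int:
    """Eq (2): 1 only when both alleles are observed and they differ."""
    return int(x != y and x != GAP and y != GAP)


def hamming_distance(f_i: str, f_j: str) -> int:
    """Eq (1): sum of d over all SNP positions of two aligned fragments."""
    if len(f_i) != len(f_j):
        raise ValueError("fragments must be aligned to the same SNP columns")
    return sum(d(a, b) for a, b in zip(f_i, f_j))


if __name__ == "__main__":
    # Only position 2 counts: every other mismatch involves a gap.
    print(hamming_distance("01-10", "0011-"))
```

Because gaps never contribute to the sum, two fragments that share no covered position always have distance zero, which is one reason heavily gapped data makes pairwise distances less informative.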

In the case where the SNP matrix is error-free, two fragments that were sequenced from the same haplotype are compatible, as their distance equals zero. On the other hand, in dealing with a noisy SNP matrix, it is not possible to simply interpret the dissimilarity between two arbitrary fragments, as it may originate from the existing noise or from the fragments having been sequenced from different haplotypes. In the error-free case, the fragments can be grouped into P clusters, such that the members of each cluster are compatible with each other. Fig 1 represents an example of the SIH problem at the polyploid level. The rows of matrix X indicate sequenced fragments, and the rows of matrix H contain the obtained haplotypes. In the diploid case, several models have been proposed to solve the SIH problem based on the input fragments.

Extending the models to solve the SIH problem in polyploid form is a difficult task [38]. Recently, several MEC-based approaches have been developed to solve this problem. In this regard, the input fragments are organized in P clusters, and the haplotypes are considered as the centers of the constructed clusters. In fact, each cluster contains the fragments which have the same provenance. The optimized result of the clustering algorithm can be obtained by minimizing the following equation:

$$\mathrm{MEC}(X, H) = \sum_{i=1}^{P} \sum_{f \in C_i} \mathrm{HD}(f, H_i) \tag{3}$$


In the optimal case, if the SNP matrix is error-free, then the MEC measurement equals zero, and each fragment f belonging to $C_i$ is compatible with $H_i$. However, in dealing with a noisy SNP matrix, it is expected that some fragments will be in conflict with their corresponding haplotypes. It should be noted that finding the optimal MEC measure is an NP-hard problem. On the other hand, the huge amount of gaps in the input fragments negatively affects the distance measurement between pairs of input fragments. Therefore, the current work aims to address these challenges by a better description of the similarity measurement between the input fragments. This is done by a heuristic method with a favorable runtime based on the hypergraph model.
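As a concrete reading of Eq (3), the sketch below evaluates the MEC score for a fixed haplotype set. It assumes that each fragment is placed in the cluster of its closest haplotype, which is the assignment that minimizes Eq (3) when H is held fixed; this assignment rule and the names `hd` and `mec_score` are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence

GAP = "-"  # assumed encoding of a missing allele


def hd(f: str, h: str) -> int:
    """Gap-aware Hamming distance between a fragment and a haplotype."""
    return sum(1 for a, b in zip(f, h) if a != b and a != GAP and b != GAP)


def mec_score(fragments: Sequence[str], haplotypes: Sequence[str]) -> int:
    """Eq (3), with each fragment assigned to its nearest haplotype."""
    return sum(min(hd(f, h) for h in haplotypes) for f in fragments)


if __name__ == "__main__":
    haplotypes = ["0101", "1010"]
    clean = ["01--", "-010", "1-10"]   # error-free fragments
    noisy = clean + ["0111"]           # one fragment with one flipped allele
    print(mec_score(clean, haplotypes), mec_score(noisy, haplotypes))
```

With the error-free fragments the score is zero, matching the optimal case described above, while the single flipped allele in the noisy fragment raises the score by one.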

The proposed method

This section presents a Haplotype Reconstruction approach based on the Chaotic viewpoint and Hypergraph model (HRCH). The proposed method is briefly described below.

(i) A set of haplotype sequences is randomly generated.

(ii) The input fragments are assigned to the haplotype sequences based on their similarities.

(iii) A weighted SNP hypergraph is built, using the similarity measure between the haplotype sequences and the assigned input fragments.

(iv) The constructed hypergraph is used to find a set called CutSet, containing the SNPs which should be modified.

This procedure is repeated for a predefined number of iterations to minimize the MEC score. Next, the results are improved by exploiting the chaotic properties of haplotype sequences. A high-level overview of the method is shown in Fig 2.
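The overall shape of the iteration in steps (i) to (iv) can be sketched as follows. This is a simplified illustration and not the HRCH algorithm: the hypergraph construction and CutSet partitioning of steps (iii) and (iv) are replaced here by a plain per-column majority vote within each cluster, and all names (`reconstruct`, `assign`, `majority_update`, the `seed` parameter) are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

GAP = "-"  # assumed encoding of a missing allele


def hd(f: str, h: str) -> int:
    return sum(1 for a, b in zip(f, h) if a != b and a != GAP and b != GAP)


def mec(fragments: Sequence[str], haps: Sequence[str]) -> int:
    return sum(min(hd(f, h) for h in haps) for f in fragments)


def assign(fragments: Sequence[str], haps: Sequence[str]) -> List[List[str]]:
    """Step (ii): put each fragment in the cluster of its closest haplotype."""
    clusters: List[List[str]] = [[] for _ in haps]
    for f in fragments:
        best = min(range(len(haps)), key=lambda i: hd(f, haps[i]))
        clusters[best].append(f)
    return clusters


def majority_update(cluster: Sequence[str], current: str) -> str:
    """Stand-in for steps (iii)-(iv): majority allele per column, ties keep the old value."""
    out = []
    for j, old in enumerate(current):
        ones = sum(f[j] == "1" for f in cluster)
        zeros = sum(f[j] == "0" for f in cluster)
        out.append("1" if ones > zeros else "0" if zeros > ones else old)
    return "".join(out)


def reconstruct(fragments: Sequence[str], ploidy: int,
                iterations: int = 20, seed: int = 0) -> Tuple[List[str], int]:
    rng = random.Random(seed)
    n = len(fragments[0])
    # Step (i): random initial haplotype set.
    haps = ["".join(rng.choice("01") for _ in range(n)) for _ in range(ploidy)]
    best, best_score = haps, mec(fragments, haps)
    for _ in range(iterations):  # predefined number of iterations
        clusters = assign(fragments, haps)
        haps = [majority_update(c, h) for c, h in zip(clusters, haps)]
        score = mec(fragments, haps)
        if score < best_score:
            best, best_score = haps, score
    return best, best_score


if __name__ == "__main__":
    frags = ["0101", "01--", "-101", "1010", "10--", "--10"]
    print(reconstruct(frags, ploidy=2))
```

Like any local search started from a random estimate, this loop can stop in a local minimum of the MEC score, so the returned score depends on the seed.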

Data preprocessing

As described in the preliminaries section, $X_{M \times N}$ is a matrix containing M reads with length N. It is essential to note that homozygote columns can be ignored in diploid cases. Removing the homozygote positions was performed as described by [33] such that the most frequent

Fig 1. An example of SNP matrices X and H relevant to the resulting haplotypes. The red entries in X indicate sequencing errors. Each row of H represents a specified haplotype sequence. https://doi.org/10.1371/journal.pone.0241291.g001
