A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model

01 Oct 2020-bioRxiv (Cold Spring Harbor Laboratory)-
TL;DR: An iterative method is proposed, which employs hypergraph to reconstruct haplotype, and outperforms most other approaches, and is promising to perform the haplotype assembly.
Abstract: The decreasing cost of high-throughput DNA sequencing technologies provides a huge amount of data that enables researchers to determine haplotypes for diploid and polyploid organisms. Although various methods have been developed to reconstruct haplotypes in diploid form, achieving high accuracy remains challenging. Also, most current methods cannot be applied to polyploid form. In this paper, an iterative method is proposed that employs a hypergraph to reconstruct haplotypes. By utilizing a chaotic viewpoint, the proposed method can further enhance the obtained haplotypes. For this purpose, a haplotype set is randomly generated as an initial estimate, and its consistency with the input fragments is described by constructing a weighted hypergraph. Partitioning the hypergraph specifies those positions in the haplotype set that need to be corrected. This procedure is repeated until no further improvement can be achieved. Each element of the finalized haplotype set is mapped to a line by chaos game representation, and a coordinate series is defined based on the positions of the mapped points. Then, positions with low quality can be assessed by applying a local projection. Experimental results on both simulated and real datasets demonstrate that this method outperforms most other approaches and is promising for haplotype assembly.

Summary (2 min read)

Introduction

  • Improving the high-throughput DNA sequencing technologies dramatically decreased the costs of genome sequencing methods.
  • Each SNP contains valuable information about genomic alterations.
  • H-PoP [34] is a heuristic method that divides the input fragments into P clusters.
  • Also, a local projection (LP) method is applied to refine the remaining ambiguous measures and increase the quality of the reconstructed haplotypes.

Preliminaries and assumptions

  • The challenge of the SIH problem in polyploid organisms is the reconstruction of the whole set $H = \{h_1, h_2, \ldots, h_P\}$ containing P haplotype sequences.
  • In the error-free case, the fragments can be clustered in P clusters, such that the members of each cluster are compatible with each other.
  • In diploid case, several models have been proposed to solve the SIH problem based on the input fragments.
  • Recently, several MEC-based approaches have been developed to solve this problem.
  • With a noisy SNP matrix, some fragments are expected to conflict with their corresponding haplotypes.

The proposed method

  • This section presents a Haplotype Reconstruction approach based on the Chaotic viewpoint and Hypergraph model (HRCH).
  • The proposed method is briefly described below.
  • (i) a set of haplotype sequences is randomly generated;(ii) the input fragments are assigned to the haplotype sequences based on their similarities;(iii) a weighted SNP hypergraph is built, using the similarity measure between haplotype sequences and the assigned input fragments; (iv) the constructed hypergraph is used to find a set called CutSet, containing the SNPs which should be modified.
  • This procedure is repeated for a predefined number of iterations to minimize the MEC score.
  • Next, by considering the existence of chaotic properties of haplotype sequences, the results are improved.

Pair-SNP consistency

  • Let ⋈ be a binary operator which provides the concatenation of two variables.
  • $\omega_{ij} = \frac{1}{T_{ij}} \sum_{f_k \in cov(s_i, s_j)} \left[ f_k(i) \bowtie f_k(j) \right] \oplus \left[ h_{c(f_k)}(i) \bowtie h_{c(f_k)}(j) \right]$ (5)
  • Here, $T_{ij}$ is the number of fragments covering both SNPs $s_i$ and $s_j$.

Hypergraph construction

  • To construct the weighted hypergraph based on the obtained ω matrix, the K nearest neighbors of each SNP $s_i$ are found using the following Eq.: $KNN(s_i) = \{ s_j \mid i \neq j,\ \omega_{ij}$ � $\omega_{il} \}$ (6).
  • Each hyperedge can connect more than two vertices.
  • Therefore, the connectivity of vertices is defined by finding frequent itemsets.
  • FP-growth is a tree-based method which uses a depth-first strategy to mine frequent itemsets.
  • The runtime of this algorithm increases linearly, and it depends on the number of SNPs [40].

Improving Ht by partitioning the hypergraph

  • As can be seen in Fig 3, in the constructed hypergraph, the SNPs correspond to vertices, and each hyperedge corresponds to an obtained frequent itemset.
  • The vertices can be divided into two clusters via partitioning the hypergraph.
  • Moreover, in order to evaluate more allelic combinations of SNPs, for a predefined percentage of the SNPs belonging to the CutSet, two arbitrary SNPs are nominated each time.
  • Since $H_t$ is randomly generated, its MEC score is poor in the early iterations.

Refinement of Ht

  • CGR was initially introduced by Barnsley [42] to evaluate random sequences.
  • Each letter of the given sequence is iteratively mapped as a point inside the square.
  • Then, the measure of ambiguous positions can be determined by applying a local projection (LP) method.
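
The mapping in this step can be illustrated with a minimal sketch. The summary does not spell out how HRCH maps a haplotype onto a line, so the code below assumes the common chaos-game midpoint rule restricted to two symbols: start at 0.5 and move halfway toward 0 or 1 for each allele. The function name `cgr_line` and the starting point are illustrative assumptions, not the authors' implementation.

```python
def cgr_line(haplotype, start=0.5):
    """Map a binary haplotype string to a coordinate series on [0, 1].

    Assumed rule (not taken from the paper): each allele pulls the current
    point halfway toward its corner, 0.0 for '0' and 1.0 for '1'.
    """
    corners = {"0": 0.0, "1": 1.0}
    x = start
    series = []
    for allele in haplotype:
        x = (x + corners[allele]) / 2.0  # midpoint between point and corner
        series.append(x)
    return series


print(cgr_line("101"))  # coordinate series for a 3-SNP haplotype
```

Under this rule, two positions whose coordinates are close share the same most recent alleles, which is the kind of neighborhood structure a local projection step could exploit.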

Results

  • In the following section, the performance of the proposed method is compared with several state-of-the-art approaches in diploid and polyploid forms.
  • The method was implemented in MATLAB, and all the results were obtained on a Windows 10 PC with 3.6 GHz CPU and 16 G Ram.
  • Reconstruction rate (RR) [4] as a conventional metric was used to evaluate the quality of the obtained haplotypes.
  • Here, HD denotes the Hamming distance between $h_i$ and $\hat{h}_j$, which are the target and the reconstructed haplotype, respectively, and i, j = 1, 2.
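
The reconstruction rate itself is not written out in this summary. A minimal sketch for the diploid case is given below, assuming the conventional definition from the literature cited as [4]: one minus the smaller total Hamming distance over the two possible pairings of target and reconstructed haplotypes, divided by 2N. The function names are illustrative.

```python
def hamming(a, b):
    """Plain Hamming distance between two equal-length haplotype strings."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def reconstruction_rate(h1, h2, r1, r2):
    """RR for a diploid pair (assumed standard definition).

    h1, h2: target haplotypes; r1, r2: reconstructed haplotypes.
    The reconstructed pair is unordered, so both pairings are tried.
    """
    n = len(h1)
    direct = hamming(h1, r1) + hamming(h2, r2)
    swapped = hamming(h1, r2) + hamming(h2, r1)
    return 1.0 - min(direct, swapped) / (2.0 * n)


# The swapped pairing matches better here: one wrong allele out of eight.
print(reconstruction_rate("0101", "1010", "1010", "0111"))
```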

Diploid case

  • The experiments have been carried out on two widely used and well-known datasets including Geraci’s dataset [49] and a dataset from the 1000 genome project that are prime examples of the simulated and experimental datasets, respectively.
  • It should be noted that the first column demonstrates the quality of the obtained haplotypes after terminating the first phase.
  • The next two columns involve the rate of reconstruction for em equals to 1 and 2, respectively.
  • The results show that in most cases the proposed method achieved higher reconstruction rates compared to the others.

Polyploid case

  • Here, the proposed method is compared with three recent approaches that have been developed to solve haplotype assembly in polyploid form including Althap [23], H-POP [34] and SCGD [36].
  • The source codes of all comparing methods are available.
  • To investigate the quality of the reconstructed haplotypes, the reconstruction rate (RR) and the MEC measure of the methods are compared.
  • Each sample contains an SNP matrix with a huge amount of gaps.
  • As can be seen in Tables 6–8 the proposed method is compared with RR and MEC-based algorithms.

Conclusion

  • The high amounts of noise, as well as existing gaps in the input fragments, are the main challenges in solving the SIH problem.
  • The proposed method involves two main steps.

RESEARCH ARTICLE
A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model
Mohammad Hossein Olyaee¹, Alireza Khanteymoori²*, Khosrow Khalifeh³,⁴
1 Faculty of Engineering, Department of Computer Engineering, University of Gonabad, Gonabad, Iran
2 Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany
3 Department of Biology, Faculty of Sciences, University of Zanjan, Zanjan, Iran
4 Department of Biotechnology, Research Institute of Modern Biological Techniques, University of Zanjan, Zanjan, Iran
* khanteymoori@gmail.com
PLOS ONE | https://doi.org/10.1371/journal.pone.0241291 October 29, 2020
OPEN ACCESS
Citation: Olyaee MH, Khanteymoori A, Khalifeh K (2020) A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model. PLoS ONE 15(10): e0241291. https://doi.org/10.1371/journal.pone.0241291
Editor: Zechen Chong, University of Alabama at Birmingham, UNITED STATES
Received: May 3, 2020
Accepted: October 12, 2020
Published: October 29, 2020
Copyright: © 2020 Olyaee et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: Geraci's dataset is available via email (contact via filippo.geraci@iit.cnr.it). The real dataset is available for download (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/). The source code is available from GitHub (https://github.com/mholyaee/HRCH).
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Improvements in high-throughput DNA sequencing technologies have dramatically decreased the cost of genome sequencing. This achievement helps researchers to understand the variation in an individual's genomic data and paves the way toward individualized strategies for diagnostic or therapeutic decision-making [1]. The most frequent type of genetic variation is the single nucleotide polymorphism (SNP). Each SNP is a variation at a single position, observed between the DNA sequences of a homologous pair of chromosomes in an individual and among the corresponding DNA sequences of the whole population. Similarly, the term "allele" refers to the different forms of a gene at one locus. Accordingly, four different alleles are possible for a given SNP site. Nonetheless, most SNPs are bi-allelic, containing only two kinds of alleles, which can be simply denoted by '0' and '1' [2]. Each SNP contains valuable information about genomic alterations. Experimental studies have revealed that SNPs are clustered across the human genome and are not randomly distributed [3]. In line with this observation, linkage disequilibrium (LD) indicates that there are correlations and spatial dependencies among neighboring SNPs. A set of SNPs on the same strand of DNA is known as a haplotype. In other words, a haplotype can be considered a combination of marker alleles that are positioned closely together on the same strand of DNA and tend to be inherited together from parents to offspring [4]. It has been shown that some diseases, such as sickle-cell anemia [5], cystic fibrosis [6] and hemochromatosis [7], are more common in specific ethnic populations due to unique genetic mutations in their genomes, but are rarely found in others. There are also reports indicating that different populations may respond differently to drugs [8–10]. These findings demonstrate that haplotypes in human genomic data can be a useful and informative tool for mapping genes that are involved in representative diseases, as well as for personalized medicine [11]. Haplotypes can also be used to investigate patterns of inheritance over evolution, human migration, and the genetic aspects of populations [12–14]. Genetic association analysis for gene mapping can also be improved by haplotype analysis [15]. It is also possible to detect errors and missing data in experimental DNA sequencing using haplotype information [16].
It is worth mentioning that the experimental determination of haplotypes is labor-intensive and expensive. Moreover, it can be used only for constructing local haplotypes. In other words, human haplotypes are provided as sequencing reads or fragments. Obtaining haplotype information from the numerous fragments is a vital task due to its profound impact on different aspects of medicine and molecular biology [15, 17–19]. However, the detection of genetic variations has critical limitations compared with the molecular approaches. According to the type of input data, the existing methods of haplotype reconstruction are divided into two main categories: single individual haplotyping (SIH) and haplotype inference. SIH methods receive several fragments that have been sequenced from a given chromosome. Most of the fragments contain gaps and are usually corrupted by noise. To cope with these problems, the input fragments are clustered based on their similarities. Then, the haplotypes can be reconstructed using the center of each cluster [4]. Haplotype inference methods receive the genotype information of several individuals as input data and infer their related haplotype sequences [20]. It is worth noting that each genotype represents a combination of haplotypes on the homologous chromosomes.
As data sizes have increased, a growing number of researchers have tried to solve the haplotype assembly problem. Several computational models, including minimum fragment removal (MFR), minimum error correction (MEC), minimum SNP removal (MSR), and the longest haplotype reconstruction (LHR), have been developed to cope with the SIH problem. MEC is one of the most popular and successful of these models [4, 21–28]. This model attempts to cluster the input fragments such that all the fragments belonging to a given cluster are compatible; where they are not, they are made compatible by applying the minimum number of alterations. The current approaches can be divided into exact and heuristic methods. Since finding the optimal minimum error correction is NP-hard, the exact approaches have exponential complexity [21]. Among exact solutions, WhatsHap [29] is regarded as a pioneering method, which is based on dynamic programming and utilizes a weighted variant of the MEC. The experimental results demonstrate that it can process long reads at coverage up to 20×. In [30], the authors proposed a parallel version of WhatsHap which is able to process higher coverages, up to 25×. AROHap [24] is a recently published evolutionary method that exploits the asexual reproduction optimization algorithm to solve the SIH problem. In this method, the fitness function is designed based on the MEC model. In [26], a heuristic method named FastHap was developed, which builds a weighted fuzzy conflict graph based on the MEC model. The constructed graph is then used to cluster the input fragments. The fuzzy C-means (FCM) approach has been applied in [25] to enhance the clustering of the fragments. However, this method performs poorly on noisy fragments. Some popular methods, including MCMC [31], HapCUT [27], and HapCUT2 [32], construct the graph differently. These methods start with a set of arbitrary sequences as initial haplotypes and improve it step by step with respect to the input fragments. They build a similar weighted graph in their respective models; however, instead of fragments, SNPs are used as the vertices of the graph. Each pair of SNPs is connected if they are covered by at least one input fragment. The weight of each edge determines the degree of consistency with the corresponding positions in the current haplotypes. Although this model efficiently determines the consistency of the current haplotype with the input fragments, the existing gaps and noise lead to a loss of accuracy in determining the edge weights. In [33], it has been shown that the hypergraph can precisely describe the distance between input fragments.
Although various methods have been developed to solve the SIH problem, most of them can only be applied to diploid organisms and fail to consider polyploid organisms. Haplotype reconstruction in the polyploid case is more complicated than in the diploid one. Suppose that P is the ploidy and m is the length of the haplotype sequences. In this case, there are at least $2^{m-1}(P-1)^{m}$ different solutions for phasing the haplotypes [23]. Recently, several studies, such as [23, 34–36], have been conducted on polyploid organisms. Althap [23] and SCGD [36] are two recently developed methods based on matrix factorization to solve the SIH problem. H-PoP [34] is a heuristic method that divides the input fragments into P clusters such that the members of each cluster have the minimum distance from each other and are far from the fragments of other clusters. Belief propagation (BP) [35] is another method, which addresses the SIH problem by mapping the MEC model to a decoding mechanism that involves message transmission over a noisy channel. In this context, it has been reported that haplotype blocks of proper length can exhibit chaotic behavior. This feature has recently been used to improve the reconstruction rate in the single individual haplotyping problem [37].
Considering the chaotic nature of haplotype sequences, this paper proposes an iterative algorithm to reconstruct the haplotypes using the hypergraph model. The method includes two main steps. First, an iterative mechanism is applied to the SNP matrix to construct the haplotype set: the consistency between SNPs is modeled by a hypergraph, and the parts of the haplotypes to be corrected are determined by partitioning the hypergraph. Second, the obtained haplotypes are transformed into a line using the chaos game representation, where a coordinate series is defined based on the positions of the mapped points, and a local projection (LP) method is applied to refine the remaining ambiguous measures and increase the quality of the reconstructed haplotypes.
The significant contributions of the proposed method are as follows:
  • The similarity measurement between the input fragments can be described more accurately by utilizing the hypergraph model. Moreover, it helps to overcome challenges originating from the huge amount of gaps and sequencing errors.
  • The quality score for each position of the reconstructed haplotypes can be calculated to predict the remaining error measures.
  • The chaotic nature hypothesis is used to refine the reconstructed haplotypes. To this end, we concentrate only on the neighboring dependencies between SNPs.
  • The proposed method can be applied effectively to both diploid and polyploid organisms.

The rest of the paper is organized as follows. Section 2 provides a brief review of the problem statement. In section 3, the proposed method is described in detail. Experimental results are presented in section 4. Finally, section 5 concludes the paper.
Preliminaries and assumptions
The challenge of the SIH problem in polyploid organisms is the reconstruction of the whole set $H = \{h_1, h_2, \ldots, h_P\}$ containing P haplotype sequences, based on the available aligned input fragments. As in the diploid case, the input fragments can be represented in a standard form. Let X be the SNP matrix in which each row corresponds to an input fragment and each column indicates a specific SNP. For bi-allelic haplotypes, it is assumed that $x_{ij} \in \{0, 1, \text{'-'}\}$ indicates the allele observed in fragment $f_i$ at SNP $s_j$. Also, each haplotype $h_i$ ($i = 1, 2, \ldots, P$) belongs to $\{1, 0\}^N$. In the diploid case, there are some positions, called homozygote sites, in which $h_{1k}$ equals $h_{2k}$. On the other hand, the sites with different values are called heterozygote positions. Homozygote sites are usually removed from the input matrix, as they do not provide useful information for the haplotype assembly problem. The '-' sign indicates information missing from the sequencing process. Two fragments that originate from different haplotypes are expected to show some dissimilarities. Several relations have been developed to describe the differences between two fragments. The Hamming distance (HD) is the most practical approach; it can be used to calculate the difference between two input fragments $f_i$ and $f_j$ as follows:
$$HD(f_i, f_j) = \sum_{l=1}^{N} d(f_i[l], f_j[l]) \qquad (1)$$
where d is defined as follows:
$$d(x, y) = \begin{cases} 1 & x \neq y \text{ and } x \neq \text{'-'} \text{ and } y \neq \text{'-'} \\ 0 & \text{else} \end{cases} \qquad (2)$$
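
The gap-aware distance of Eqs. (1) and (2) can be sketched in a few lines. This is an illustration only: fragments are represented as strings over '0', '1' and '-', and the names `GAP`, `d` and `hamming_distance` are choices made for this example rather than identifiers from the authors' code.

```python
GAP = "-"  # assumed symbol for a missing allele


def d(x, y):
    """Eq. (2): 1 if both alleles are observed and differ, otherwise 0."""
    return 1 if x != y and x != GAP and y != GAP else 0


def hamming_distance(f_i, f_j):
    """Eq. (1): sum of per-position differences over all N SNP positions."""
    if len(f_i) != len(f_j):
        raise ValueError("fragments must cover the same N positions")
    return sum(d(x, y) for x, y in zip(f_i, f_j))


# Only position 1 has two observed, different alleles; gaps contribute 0.
print(hamming_distance("01-10", "0011-"))
```

Because gapped positions contribute nothing, two fragments with little overlap can have a small distance even when they come from different haplotypes, which is the weakness the hypergraph model is meant to address.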
In the case where the SNP matrix is error-free, two fragments that were sequenced from the same haplotype are compatible, as their distance equals zero. On the other hand, with a noisy SNP matrix, the dissimilarity between two arbitrary fragments $f_i$, $f_j$ cannot be interpreted directly, as it may originate from noise or from the fragments having been sequenced from different haplotypes. In the error-free case, the fragments can be grouped into P clusters such that the members of each cluster are compatible with each other.
Fig 1 represents an example of the SIH problem in the polyploid case. The rows of matrix X indicate sequenced fragments, and the rows of matrix H contain the obtained haplotypes.
In the diploid case, several models have been proposed to solve the SIH problem based on the input fragments. Extending these models to solve the SIH problem in polyploid form is a difficult task [38].
Recently, several MEC-based approaches have been developed to solve this problem. In this regard, the input fragments are organized in P clusters, and the haplotypes are considered as the centers of the constructed clusters. In fact, each cluster involves the fragments which have the same provenance. The optimized result of the clustering algorithm can be obtained by minimizing the following Eq.:
$$MEC(X, H) = \sum_{i=1}^{P} \sum_{f \in C_i} HD(f, H_i) \qquad (3)$$
In the optimal case, if the SNP matrix is error-free, then the MEC score equals zero, and each fragment f belonging to $C_i$ is compatible with $H_i$. However, with a noisy SNP matrix, some fragments are expected to conflict with their corresponding haplotypes. It should be noted that finding the optimal MEC score is an NP-hard problem. In addition, the large number of gaps in the input fragments negatively affects the distance measurement between pairs of input fragments. Therefore, the current work aims to address these challenges through a better description of the similarity between the input fragments. This is done by a heuristic method with a favorable runtime based on the hypergraph model.
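
As an illustration of Eq. (3), the sketch below scores a candidate haplotype set against a set of fragments. It assumes that each cluster $C_i$ is formed by assigning every fragment to its closest haplotype, which is the choice that minimizes the sum for a fixed H; the function names and the string encoding with '-' for gaps are assumptions of this example.

```python
GAP = "-"  # assumed symbol for a missing allele


def hd(fragment, haplotype):
    """Gap-aware Hamming distance between a fragment and a haplotype."""
    return sum(1 for x, y in zip(fragment, haplotype)
               if x != y and x != GAP and y != GAP)


def assign_fragments(fragments, haplotypes):
    """Cluster index for each fragment: the closest haplotype (ties -> lowest index)."""
    return [min(range(len(haplotypes)), key=lambda i: hd(f, haplotypes[i]))
            for f in fragments]


def mec_score(fragments, haplotypes):
    """Eq. (3) with clusters induced by nearest-haplotype assignment."""
    labels = assign_fragments(fragments, haplotypes)
    return sum(hd(f, haplotypes[c]) for f, c in zip(fragments, labels))


fragments = ["01-", "0-1", "10-", "-00", "1-1"]
haplotypes = ["011", "100"]
print(assign_fragments(fragments, haplotypes))
print(mec_score(fragments, haplotypes))
```

In this toy input the last fragment, "1-1", conflicts with both haplotypes in exactly one position, so it contributes one correction to the score regardless of its cluster.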
The proposed method
This section presents a Haplotype Reconstruction approach based on the Chaotic viewpoint and Hypergraph model (HRCH). The proposed method is briefly described below.
(i) A set of haplotype sequences is randomly generated; (ii) the input fragments are assigned to the haplotype sequences based on their similarities; (iii) a weighted SNP hypergraph is built using the similarity measure between the haplotype sequences and the assigned input fragments; (iv) the constructed hypergraph is used to find a set called CutSet, containing the SNPs which should be modified. This procedure is repeated for a predefined number of iterations to minimize the MEC score. Next, the results are improved by exploiting the chaotic properties of haplotype sequences. A high-level overview of the method is shown in Fig 2.
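
The overall shape of the iterative phase can be sketched as follows. Steps (i) and (ii) follow the description above, but steps (iii) and (iv) are deliberately replaced by a much simpler stand-in, a column-wise majority vote within each cluster, because the hypergraph construction and partitioning are not reproduced here. The sketch therefore shows only the loop structure, not HRCH itself; all function names, the seed and the iteration count are assumptions of this example.

```python
import random

GAP = "-"  # assumed symbol for a missing allele


def hd(fragment, haplotype):
    """Gap-aware Hamming distance, as in Eqs. (1) and (2)."""
    return sum(1 for x, y in zip(fragment, haplotype)
               if x != y and x != GAP and y != GAP)


def assign(fragments, haps):
    """Step (ii): index of the closest haplotype for every fragment."""
    return [min(range(len(haps)), key=lambda i: hd(f, haps[i]))
            for f in fragments]


def mec(fragments, haps, labels):
    """MEC score of the current haplotypes under the given assignment."""
    return sum(hd(f, haps[c]) for f, c in zip(fragments, labels))


def majority_update(cluster, old):
    """Stand-in for steps (iii)-(iv): column-wise majority vote.

    Positions with no coverage or a tie keep their previous allele.
    """
    new = []
    for j, previous in enumerate(old):
        ones = sum(1 for f in cluster if f[j] == "1")
        zeros = sum(1 for f in cluster if f[j] == "0")
        new.append("1" if ones > zeros else "0" if zeros > ones else previous)
    return "".join(new)


def reconstruct(fragments, ploidy, iterations=10, seed=0):
    rng = random.Random(seed)
    n = len(fragments[0])
    # Step (i): random initial haplotype set.
    haps = ["".join(rng.choice("01") for _ in range(n)) for _ in range(ploidy)]
    history = []  # MEC score after each assignment step
    for _ in range(iterations):
        labels = assign(fragments, haps)
        history.append(mec(fragments, haps, labels))
        haps = [majority_update([f for f, c in zip(fragments, labels) if c == i],
                                haps[i])
                for i in range(ploidy)]
    history.append(mec(fragments, haps, assign(fragments, haps)))
    return haps, history


frags = ["0110--", "--1001", "011---", "100---", "---110", "1--0-1"]
haps, history = reconstruct(frags, ploidy=2)
print(haps, history)
```

With this stand-in, the recorded MEC score never increases from one iteration to the next, since both the reassignment and the majority vote can only keep or lower it. The loop can still stall in a poor local optimum, which is one motivation for a more informed correction step such as the CutSet.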
Data preprocessing
As described in the preliminaries section, $X_{M \times N}$ is a matrix containing M reads with length N. It is essential to note that homozygote columns can be ignored in diploid cases. Removing the homozygote positions was performed as described by [33] such that the most frequent
Fig 1. An example of SNP matrices X and H relevant to the resulting haplotypes. The red measures in X indicate sequencing errors. Each row of H demonstrates a specified haplotype sequence.
https://doi.org/10.1371/journal.pone.0241291.g001

Citations
Journal ArticleDOI
25 Apr 2022-Genomics
TL;DR: A review of the alignment-based methods of polyploid phasing can be found in this paper , where the authors discuss the advantages and limitations of these methods and the metrics used to assess their performance, proposing that accuracy and contiguity are the most meaningful metrics.

3 citations

Journal ArticleDOI
TL;DR: A method, named NCMHap, which utilizes the Neutrosophic c-means (NCM) clustering algorithm, which can effectively detect the noise and outliers in the input data and reduce their effects in the clustering process.
Abstract: Single individual haplotype problem refers to reconstructing haplotypes of an individual based on several input fragments sequenced from a specified chromosome. Solving this problem is an important task in computational biology and has many applications in the pharmaceutical industry, clinical decision-making, and genetic diseases. It is known that solving the problem is NP-hard. Although several methods have been proposed to solve the problem, it is found that most of them have low performances in dealing with noisy input fragments. Therefore, proposing a method which is accurate and scalable, is a challenging task. In this paper, we introduced a method, named NCMHap, which utilizes the Neutrosophic c-means (NCM) clustering algorithm. The NCM algorithm can effectively detect the noise and outliers in the input data. In addition, it can reduce their effects in the clustering process. The proposed method has been evaluated by several benchmark datasets. Comparing with existing methods indicates when NCM is tuned by suitable parameters, the results are encouraging. In particular, when the amount of noise increases, it outperforms the comparing methods. The proposed method is validated using simulated and real datasets. The achieved results recommend the application of NCMHap on the datasets which involve the fragments with a huge amount of gaps and noise.

3 citations


Cites background or methods from "A chaotic viewpoint-based approach ..."

  • ...The reconstruction rate for the proposed method, H-pop, SCGD, FastHap, HGHap, AROHap, FCMHap, ALTHap, and HRCH applied to the experimental dataset NA12878 dataset provided by 1000 genome project....

    [...]

  • ...8 HRCH [29] utilizes a chaotic viewpoint to reconstruct haplotypes....

    [...]

  • ...5, illustrates the reconstruction rate of the proposed method as well as H-PoP [26], SCGD [28], FastHap [25], HGHap [22], AROHap [19], ALTHap [27], and HRCH [29]....

    [...]

  • ...The average of running time for the proposed method, H-pop, SCGD, FastHap, HGHap, AROHap, FCMHap, ALTHap, and HRCH applied to the experimental dataset NA12878 dataset provided by 1000 genome project (In seconds)....

    [...]

Journal ArticleDOI
TL;DR: Haplotype assembly (HA) is a process of obtaining haplotypes using DNA sequencing data as mentioned in this paper , a haplotype is a set of DNA variants inherited together from one parent or chromosome.
Abstract: Abstract Background A haplotype is a set of DNA variants inherited together from one parent or chromosome. Haplotype information is useful for studying genetic variation and disease association. Haplotype assembly (HA) is a process of obtaining haplotypes using DNA sequencing data. Currently, there are many HA methods with their own strengths and weaknesses. This study focused on comparing six HA methods or algorithms: HapCUT2, MixSIH, PEATH, WhatsHap, SDhaP, and MAtCHap using two NA12878 datasets named hg19 and hg38. The 6 HA algorithms were run on chromosome 10 of these two datasets, each with 3 filtering levels based on sequencing depth (DP1, DP15, and DP30). Their outputs were then compared. Result Run time (CPU time) was compared to assess the efficiency of 6 HA methods. HapCUT2 was the fastest HA for 6 datasets, with run time consistently under 2 min. In addition, WhatsHap was relatively fast, and its run time was 21 min or less for all 6 datasets. The other 4 HA algorithms’ run time varied across different datasets and coverage levels. To assess their accuracy, pairwise comparisons were conducted for each pair of the six packages by generating their disagreement rates for both haplotype blocks and Single Nucleotide Variants (SNVs). The authors also compared them using switch distance (error), i.e., the number of positions where two chromosomes of a certain phase must be switched to match with the known haplotype. HapCUT2, PEATH, MixSIH, and MAtCHap generated output files with similar numbers of blocks and SNVs, and they had relatively similar performance. WhatsHap generated a much larger number of SNVs in the hg19 DP1 output, which caused it to have high disagreement percentages with other methods. However, for the hg38 data, WhatsHap had similar performance as the other 4 algorithms, except SDhaP. The comparison analysis showed that SDhaP had a much larger disagreement rate when it was compared with the other algorithms in all 6 datasets. 
Conclusion The comparative analysis is important because each algorithm is different. The findings of this study provide a deeper understanding of the performance of currently available HA algorithms and useful input for other users.
References
Book ChapterDOI
17 Sep 2004
TL;DR: A simple heuristic is introduced and it is proved experimentally that is very fast and accurate and when compared with a dynamic programming of [8] it is much faster and also more accurate.
Abstract: We study the single individual SNP haplotype reconstruction problem. We introduce a simple heuristic and prove experimentally that is very fast and accurate. In particular, when compared with a dynamic programming of [8] it is much faster and also more accurate. We expect Fast Hare to be very useful in practical applications. We also introduce a combinatorial problem related to the SNP haplotype reconstruction problem that we call Min Element Removal. We prove its NP-hardness in the gapless case and its O(log n)-approximability in the general case.

136 citations


"A chaotic viewpoint-based approach ..." refers methods in this paper

  • ...In [26], a heuristic method named FastHap was developed, which builds a weighted fuzzy conflict graph based on the MEC model....

    [...]

  • ...The output of the proposed method was compared with a set of state-of-the-art and well-known methods including; SCGD[36], H-pop[34], ARO[24], HG[33], FCM[25], FastHap[26], DGS[50], SHR[51], MLF[52], HapCut[27], 2d[22], Fast[53], and SPH[54]....

    [...]

Book ChapterDOI
TL;DR: This chapter provides a detailed review of methods for haplotype inference using unrelated individuals as well as related individuals from pedigrees and covers a number of statistical methods that employ haplotype information in association analysis.
Abstract: Association methods based on linkage disequilibrium (LD) offer a promising approach for detecting genetic variations that are responsible for complex human diseases. Although methods based on individual single nucleotide polymorphisms (SNPs) may lead to significant findings, methods based on haplotypes comprising multiple SNPs on the same inherited chromosome may provide additional power for mapping disease genes and also provide insight on factors influencing the dependency among genetic markers. Such insights may provide information essential for understanding human evolution and also for identifying cis-interactions between two or more causal variants. Because obtaining haplotype information directly from experiments can be cost prohibitive in most studies, especially in large scale studies, haplotype analysis presents many unique challenges. In this chapter, we focus on two main issues: haplotype inference and haplotype-association analysis. We first provide a detailed review of methods for haplotype inference using unrelated individuals as well as related individuals from pedigrees. We then cover a number of statistical methods that employ haplotype information in association analysis. In addition, we discuss the advantages and limitations of different methods.

123 citations

Journal ArticleDOI

80 citations


"A chaotic viewpoint-based approach ..." refers background in this paper

  • ...It is a vital task to obtain haplotype information from the numerous fragments due to its profound impacts on different aspects of medicine and molecular biology[15,17-19]....

    [...]

Journal ArticleDOI
TL;DR: Carrier testing using a broad mutation panel detects differences in the distribution of mutations among ethnic groups in the US, including African American, Hispanic, and Asian individuals.
Abstract: BACKGROUND: The incidence of cystic fibrosis (CF) and the frequency of specific disease-causing mutations vary among populations. Affected individuals experience a range of serious clinical consequences, notably lung and pancreatic disease, which are only partially dependent on genotype. METHODS: An allele-specific primer-extension reaction, liquid-phase hybridization to a bead array, and subsequent fluorescence detection were used in testing for carriers of 98 CFTR [cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7)] mutations among 364 890 referred individuals with no family history of CF. RESULTS: One in 38 individuals carried one of the 98 CFTR mutations included in this panel. Of the 87 different mutations detected, 18 were limited to a single ethnic group. African American, Hispanic, and Asian individuals accounted for 33% of the individuals tested. The mutation frequency distribution of Caucasians was significantly different from that of each of these ethnic groups (P < 1 × 10⁻¹⁰). CONCLUSIONS: Carrier testing using a broad mutation panel detects differences in the distribution of mutations among ethnic groups in the US.

77 citations


"A chaotic viewpoint-based approach ..." refers background in this paper

  • ...It has been shown that some diseases such as sickle-cell anemia [5], cystic fibrosis [6] and hemochromatosis [7] are more common in specific ethnic populations due to unique genetic mutations in their genomes; but they are rarely found in others....

    [...]

Journal ArticleDOI
TL;DR: Advances in whole-genome haplotyping approaches are reviewed and the importance of haplotypes for genomic medicine is discussed; haplotypes are more specific than less complex variants such as single nucleotide variants.
Abstract: Genomic information reported as haplotypes rather than genotypes will be increasingly important for personalized medicine. Current technologies generate diploid sequence data that is rarely resolved into its constituent haplotypes. Furthermore, paradigms for thinking about genomic information are based on interpreting genotypes rather than haplotypes. Nevertheless, haplotypes have historically been useful in contexts ranging from population genetics to disease-gene mapping efforts. The main approaches for phasing genomic sequence data are molecular haplotyping, genetic haplotyping, and population-based inference. Long-read sequencing technologies are enabling longer molecular haplotypes, and decreases in the cost of whole-genome sequencing are enabling the sequencing of whole-chromosome genetic haplotypes. Hybrid approaches combining high-throughput short-read assembly with strategic approaches that enable physical or virtual binning of reads into haplotypes are enabling multi-gene haplotypes to be generated from single individuals. These techniques can be further combined with genetic and population approaches. Here, we review advances in whole-genome haplotyping approaches and discuss the importance of haplotypes for genomic medicine. Clinical applications include diagnosis by recognition of compound heterozygosity and by phasing regulatory variation to coding variation. Haplotypes, which are more specific than less complex variants such as single nucleotide variants, also have applications in prognostics and diagnostics, in the analysis of tumors, and in typing tissue for transplantation. Future advances will include technological innovations, the application of standard metrics for evaluating haplotype quality, and the development of databases that link haplotypes to disease.

75 citations


"A chaotic viewpoint-based approach ..." refers background in this paper

  • ...These findings demonstrate that haplotypes in human genomics data could be a useful and informative tool in mapping genes that are involved in representative diseases, as well as in personalized medicine [11]....

    [...]

Frequently Asked Questions (1)
Q1. What are the contributions mentioned in the paper "A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model" ?

The decreasing cost of high-throughput DNA sequencing technologies provides a huge amount of data that enables researchers to determine haplotypes for diploid and polyploid organisms. In this paper, an iterative method is proposed that employs a hypergraph to reconstruct haplotypes. This procedure is repeated until no further improvement can be achieved. Experimental results on both simulated and real datasets demonstrate that this method outperforms most other approaches and is promising for haplotype assembly.