Showing papers by "Wing-Kin Sung" published in 2018


Journal Article
TL;DR: This paper proposes new techniques to improve database-free non-reference transposition calling, including a realignment strategy called one-end remapping that corrects the alignments of reads in interspersed repeats and an SNV-aware filter that removes some incorrectly aligned reads.
Abstract: Transpositions transfer DNA segments between different loci within a genome; in particular, when a transposition is found in a sample but not in a reference genome, it is called a non-reference transposition. They are important structural variations that have clinical impact. Transpositions can be called by analyzing second-generation high-throughput sequencing datasets. Current methods follow either a database-based or a database-free approach. Database-based methods require a database of transposable elements. Some of them have good specificity; however, this approach cannot detect novel transpositions, and it requires a good database of transposable elements, which is not yet available for many species. Database-free methods perform de novo calling of transpositions, but their accuracy is low. We observe that this is due to the misalignment of the reads; since reads are short and the human genome has many repeats, false alignments create false positive predictions while missing alignments reduce the true positive rate. This paper proposes new techniques to improve database-free non-reference transposition calling: first, we propose a realignment strategy called one-end remapping that corrects the alignments of reads in interspersed repeats; second, we propose an SNV-aware filter that removes some incorrectly aligned reads. By combining these two techniques with other techniques such as clustering and a positive-to-negative ratio filter, our proposed transposition caller TranSurVeyor shows at least a 3.1-fold improvement in terms of F1-score over existing database-free methods. More importantly, even though TranSurVeyor does not use databases of prior information, its performance is at least as good as existing database-based methods such as MELT, Mobster and Retroseq. We also illustrate that TranSurVeyor can discover transpositions that are not known in the current database.

17 citations


Journal Article
TL;DR: This article presents two new deterministic algorithms for constructing consensus trees: one constructs the majority rule (+) consensus tree and the other constructs the frequency difference consensus tree.
Abstract: This article presents two new deterministic algorithms for constructing consensus trees. Given an input of $k$ phylogenetic trees with identical leaf label sets and $n$ leaves each, the first algorithm constructs the majority rule (+) consensus tree in $O(k n)$ time, which is optimal since the input size is $\Omega(k n)$, and the second one constructs the frequency difference consensus tree in $\min \lbrace O(k n^{2}), O(k n (k + \log^{2} n)) \rbrace$ time.

7 citations
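
To make the object being computed concrete, the following is a minimal, naive sketch of the frequency difference consensus tree as it is commonly defined: a cluster is kept if it occurs in strictly more input trees than every input cluster incompatible with it. This is not the algorithm from the article, and it does not achieve the stated running time; it compares all pairs of distinct clusters. The nested-tuple tree encoding and the function names are assumptions made for illustration.

```python
from collections import Counter


def clusters(tree):
    """Return the non-trivial clusters (leaf sets of internal nodes) of a tree.

    Assumed encoding: a leaf is any non-tuple label, an internal node is a
    tuple of children. The root cluster and singletons are omitted.
    """
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        leaf_set = frozenset().union(*(leaves(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    root = leaves(tree)
    found.discard(root)
    return found


def compatible(a, b):
    """Two clusters are compatible if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def frequency_difference_clusters(trees):
    """Naive selection of the clusters of the frequency difference consensus tree."""
    freq = Counter()
    for tree in trees:
        freq.update(clusters(tree))
    return {
        c
        for c, fc in freq.items()
        if all(fc > fd for d, fd in freq.items() if not compatible(c, d))
    }


# Hypothetical input: three trees over the leaves a, b, c, d.
trees = [
    (("a", "b"), "c", "d"),
    (("a", "b"), ("c", "d")),
    (("a", "c"), "b", "d"),
]
kept = frequency_difference_clusters(trees)
```

Here {a, b} occurs twice and its only conflicting cluster {a, c} occurs once, so {a, b} is kept; {c, d} and {a, c} each occur once and conflict with a cluster that is at least as frequent, so both are dropped.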


Proceedings Article
01 Jan 2018
TL;DR: In this article, the authors improve the running times for computing the greedy consensus tree and the frequency difference consensus tree to O~(k n^{1.5}) and O~(k n), respectively.
Abstract: A consensus tree is a phylogenetic tree that captures the similarity between a set of conflicting phylogenetic trees. The problem of computing a consensus tree is a major step in phylogenetic tree reconstruction. It is also central for predicting a species tree from a set of gene trees, as indicated recently in [Nature 2013]. This paper focuses on two of the most well-known and widely used consensus tree methods: the greedy consensus tree and the frequency difference consensus tree. Given k conflicting trees each with n leaves, the previous fastest algorithms for these problems were O(k n^2) for the greedy consensus tree [J. ACM 2016] and O~(min{k n^2, k^2n}) for the frequency difference consensus tree [ACM TCBB 2016]. We improve these running times to O~(k n^{1.5}) and O~(k n) respectively.

5 citations


Journal Article
TL;DR: This article presents a detailed characterization of how the computational complexity of the Consistency problem changes under various restrictions, and presents an efficient algorithm for dense inputs satisfying [Formula: see text] whose running time is linear in the size of the input and therefore optimal.
Abstract: The [Formula: see text] Consistency problem takes as input two sets [Formula: see text] and [Formula: see text] of resolved triplets and two sets [Formula: see text] and [Formula: see text] of fan triplets, and asks for a distinctly leaf-labeled tree that contains all elements in [Formula: see text] and no elements in [Formula: see text] as embedded subtrees, if such a tree exists. This article presents a detailed characterization of how the computational complexity of the problem changes under various restrictions. Our main result is an efficient algorithm for dense inputs satisfying [Formula: see text] whose running time is linear in the size of the input and therefore optimal.

1 citation


Journal Article
02 Mar 2018
TL;DR: In this article, the authors prove that constructing a phylogenetic tree consistent with R that contains the minimum number of additional rooted triplets is also NP-hard, and develop exact, exponential-time algorithms for both problems.
Abstract: The problem of constructing a minimally resolved phylogenetic supertree (i.e., a rooted tree having the smallest possible number of internal nodes) that contains all of the rooted triplets from a consistent set R is known to be NP-hard. In this article, we prove that constructing a phylogenetic tree consistent with R that contains the minimum number of additional rooted triplets is also NP-hard, and develop exact, exponential-time algorithms for both problems. The new algorithms are applied to construct two variants of the local consensus tree; for any set S of phylogenetic trees over some leaf label set L, this gives a minimal phylogenetic tree over L that contains every rooted triplet present in all trees in S, where “minimal” means either having the smallest possible number of internal nodes or the smallest possible number of rooted triplets. (The second variant generalizes the RV-II tree, introduced by Kannan et al. in 1998.) We also measure the running times and memory usage in practice of the new algorithms for various inputs. Finally, we use our implementations to experimentally investigate the non-optimality of Aho et al.’s well-known BUILD algorithm from 1981 when applied to the local consensus tree problems considered here.
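
For readers unfamiliar with BUILD, the following is a minimal sketch of the algorithm as it is usually described: connect the two "close" leaves of every rooted triplet, split the leaf set into connected components, and recurse on each component. It is not the authors' implementation. The triplet encoding `(a, b, c)` for ab|c, the nested-tuple output, and the function name `build` are assumptions made for illustration.

```python
def build(leaves, triplets):
    """Return a tree consistent with all rooted triplets, or None if none exists.

    Assumed encoding: a triplet (a, b, c) means ab|c, i.e. a and b are more
    closely related to each other than either is to c. The result is a nested
    tuple; a single leaf is returned as its bare label.
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))

    # Only triplets whose three leaves all lie in the current leaf set matter.
    relevant = [t for t in triplets if leaves.issuperset(t)]

    # Union-find over the leaves: ab|c forces a and b into the same child subtree.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _ in relevant:
        parent[find(a)] = find(b)

    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)

    # A single component on two or more leaves means the triplets conflict.
    if len(components) == 1:
        return None

    children = []
    for component in components.values():
        subtree = build(component, relevant)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)


# Hypothetical input: ab|c and cd|a over the leaves a, b, c, d.
tree = build("abcd", [("a", "b", "c"), ("c", "d", "a")])
```

On this input the root has two children, one containing a and b and the other containing c and d (child order is not fixed). The tree BUILD returns is consistent with the input but need not have the fewest internal nodes or the fewest rooted triplets, which is the non-optimality the abstract refers to.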