Posted Content

An Even Faster and More Unifying Algorithm for Comparing Trees via Unbalanced Bipartite Matchings

TL;DR: The authors present an algorithm for comparing trees that are labeled in an arbitrary manner, which is faster than previous algorithms; it builds on an efficient algorithm for a new hierarchical bipartite matching problem, which lies at the core of their maximum agreement subtree algorithm.
Abstract: A widely used method for determining the similarity of two labeled trees is to compute a maximum agreement subtree of the two trees. Previous work on this similarity measure has been concerned only with the comparison of labeled trees of two special kinds, namely, uniformly labeled trees (i.e., trees with all their nodes labeled by the same symbol) and evolutionary trees (i.e., leaf-labeled trees with distinct symbols for distinct leaves). This paper presents an algorithm for comparing trees that are labeled in an arbitrary manner. In addition to this generality, the algorithm is faster than the previous algorithms. Another contribution of this paper concerns maximum weight bipartite matchings. We show how to speed up the best known matching algorithms when the input graphs are node-unbalanced or weight-unbalanced. Based on these enhancements, we obtain an efficient algorithm for a new matching problem called the hierarchical bipartite matching problem, which is at the core of our maximum agreement subtree algorithm.
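To make the two layers of the algorithm concrete, the sketch below computes the size of a maximum agreement subtree of two rooted, leaf-labeled trees with the classic dynamic program, solving a maximum weight bipartite matching at each pair of internal nodes; these matchings are typically rectangular, i.e. node-unbalanced, which is the case the paper's enhancements target. The tree encoding, function names, and the use of SciPy's Hungarian-style solver are illustrative assumptions, not the paper's construction.

```python
# A minimal sketch, not the paper's algorithm: the classic MAST dynamic
# program for rooted leaf-labeled trees, with a maximum weight bipartite
# matching solved at each pair of internal nodes. SciPy's Hungarian-style
# solver stands in for the paper's faster unbalanced-matching routines.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mast(t1, t2):
    """Size of a maximum agreement subtree of two rooted trees, where a
    leaf is a label string and an internal node is a tuple of children."""
    memo = {}

    def go(u, v):
        key = (id(u), id(v))
        if key not in memo:
            if isinstance(u, str) and isinstance(v, str):    # leaf vs leaf
                memo[key] = 1 if u == v else 0
            elif isinstance(u, str):                         # leaf vs internal
                memo[key] = max(go(u, w) for w in v)
            elif isinstance(v, str):
                memo[key] = max(go(w, v) for w in u)
            else:
                # Either map one root down into a single subtree of the other...
                best = max([go(u, w) for w in v] + [go(w, v) for w in u])
                # ...or pair the children up via a maximum weight bipartite
                # matching; the weight matrix is rectangular when the nodes
                # have different degrees (the "node-unbalanced" case).
                W = np.array([[go(cu, cv) for cv in v] for cu in u])
                rows, cols = linear_sum_assignment(W, maximize=True)
                memo[key] = max(best, int(W[rows, cols].sum()))
        return memo[key]

    return go(t1, t2)

T1 = ((("a", "b"), "c"), ("d", "e"))
T2 = (("a", ("b", "c")), ("d", "e"))
print(mast(T1, T2))   # 4: both trees agree on the restriction to {a, b, d, e}
```
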
Citations
01 Jan 2013
TL;DR: This work proposes a novel technique to extract conflict-free information from multi-labeled trees as a much smaller single labeled tree, shows that the inherent problem in identifying rogue taxa is NP-hard, and gives fixed-parameter tractable and integer linear programming solutions.
Abstract: The ever growing availability of phylogenomic data makes it increasingly possible to study and analyze phylogenetic relationships across a wide range of species. Indeed, current phylogenetic analyses are now producing enormous collections of trees that vary greatly in size. Our proposed research addresses the challenges posed by storing, querying, and analyzing such phylogenetic databases. Our first contribution is the further development of STBase, a phylogenetic tree database consisting of a billion trees whose leaf sets range from four to 20000. STBase applies techniques from different areas of computer science for efficient tree storage and retrieval. It also introduces new ideas that are specific to tree databases. STBase provides a unique opportunity to explore innovative ways to analyze the results from queries on large sets of phylogenetic trees. We propose new ways of extracting consensus information from a collection of phylogenetic trees. Specifically, this involves extending the maximum agreement subtree problem. We greatly improve upon an existing approach based on frequent subtrees and propose two new approaches based on agreement subtrees and frequent subtrees, respectively. The final part of our proposed work deals with the problem of simplifying multi-labeled trees and handling "rogue" taxa. We propose a novel technique to extract conflict-free information from multi-labeled trees as a much smaller single labeled tree. We show that the inherent problem in identifying rogue taxa is NP-hard and give fixed-parameter tractable and integer linear programming solutions.

1 citation

Journal Article
TL;DR: The mapping kernel that is introduced in this paper is a natural generalization of Haussler's convolution kernel, in that the input to the primitive kernel moves over a predetermined subset rather than the entire cross product.
Abstract: Haussler's convolution kernel provides a successful framework for engineering new positive semidefinite kernels, and has been applied to a wide range of data types and applications. In the framework, each data object represents a finite set of finer grained components. Haussler's convolution kernel then takes a pair of data objects as input, and returns the sum of the return values of the predetermined primitive positive semidefinite kernel calculated for all the possible pairs of the components of the input data objects. The mapping kernel that we introduce in this paper, on the other hand, is a natural generalization of Haussler's convolution kernel, in that the input to the primitive kernel moves over a predetermined subset rather than the entire cross product. Although several instances of the mapping kernel appear in the literature, their positive semidefiniteness was investigated on a case-by-case basis and, worse yet, was sometimes incorrectly concluded. In fact, there exists a simple and easily checkable necessary and sufficient condition, which is generic in the sense that it enables us to investigate the positive semidefiniteness of an arbitrary instance of the mapping kernel. This paper presents and proves the validity of this condition. In addition, we introduce two important instances of the mapping kernel, which we refer to as the size-of-index-structure-distribution kernel and the edit-cost-distribution kernel. Both are naturally derived from well-known (dis)similarity measures in the literature (the maximum agreement subtree, the edit distance), and can reasonably be expected to improve the performance of the existing measures by evaluating their distributional features rather than their peak (maximum/minimum) features.
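As a concrete contrast between the two frameworks, the toy sketch below instantiates both kernels on strings, with the characters as components and a simple indicator as the primitive kernel. The restriction to positionwise pairs is one arbitrary choice of mapping system for illustration, not one taken from the paper; per the abstract, a chosen subset must satisfy the paper's necessary and sufficient condition for positive semidefiniteness to be guaranteed.

```python
# A minimal sketch contrasting Haussler's convolution kernel with the
# mapping kernel on strings. The component sets, the indicator primitive
# kernel, and the positionwise mapping system are illustrative choices.
def k_prim(a, b):
    """Primitive positive semidefinite kernel on characters."""
    return 1.0 if a == b else 0.0

def convolution_kernel(x, y):
    # Sum of the primitive kernel over the ENTIRE cross product of components.
    return sum(k_prim(a, b) for a in x for b in y)

def mapping_kernel(x, y):
    # Sum over a predetermined SUBSET of the cross product -- here, only
    # pairs of characters sitting at the same position in both strings.
    return sum(k_prim(a, b) for a, b in zip(x, y))

print(convolution_kernel("abca", "acb"))  # 4.0: all equal-character pairs
print(mapping_kernel("abca", "acb"))      # 1.0: positionwise matches only
```
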

1 citation

Journal Article
03 Apr 2020
TL;DR: A generic and theoretical framework is proposed to investigate similarity of structured data through structure-preserving one-to-one partial mappings, which the authors call morphisms; within it, the center star algorithm can be abstracted so that it not only applies to data structures other than strings but can also be used to solve problems of pattern extraction.
Abstract: In mathematics, morphism is a term that indicates structure-preserving mappings between mathematical structures of the same type. Linear transformations for linear spaces, homomorphisms for algebraic structures and continuous functions for topological spaces are examples. Much of the data studied in machine learning, on the other hand, carries mathematical structure. Strings are totally ordered sets, and trees can be understood not only as graphs but also as partially ordered sets with respect to an ancestor-to-descendant order and as semigroups with respect to the binary operation that returns the nearest common ancestor. In this paper, we propose a generic and theoretical framework to investigate similarity of structured data through structure-preserving one-to-one partial mappings, which we call morphisms. Through morphisms, useful and important methods studied in the literature can be abstracted into common concepts, although they have been studied separately. When we study new structures of data, we will be able to extend the legacy methods to the new structure, provided we can define morphisms properly. This view also reveals hidden relations between methods known in the literature and lets us understand them more clearly. For example, the center star algorithm, which was originally developed to compute sequential multiple alignments, can be abstracted so that it not only applies to data structures other than strings but can also be used to solve problems of pattern extraction. The methods that we study in this paper include edit distance, multiple alignment, pattern extraction and kernels, but many more methods can surely be abstracted within our framework.
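To ground the abstraction, here is a small sketch for the simplest structure mentioned above, strings as totally ordered sets: a morphism is then a one-to-one partial mapping between equal characters that preserves the order of positions, and a largest such morphism is exactly a longest common subsequence. The function names and index-pair encoding are illustrative, not from the paper.

```python
# A minimal sketch of the morphism view for strings: a structure-preserving
# one-to-one partial mapping between two totally ordered sets must preserve
# the order of positions, so a largest such mapping between equal characters
# is exactly a longest common subsequence (LCS).
from itertools import combinations

def is_morphism(x, y, pairs):
    """Check that `pairs` (index pairs (i, j) with x[i] == y[j]) is a
    one-to-one partial mapping that preserves the order of positions."""
    if any(x[i] != y[j] for i, j in pairs):
        return False
    for (i1, j1), (i2, j2) in combinations(pairs, 2):
        if i1 == i2 or j1 == j2:           # not one-to-one
            return False
        if (i1 < i2) != (j1 < j2):         # order not preserved
            return False
    return True

def largest_morphism_size(x, y):
    """Size of a largest morphism = LCS length, by the usual DP."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x):
        for j, b in enumerate(y):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a == b
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(x)][len(y)]

print(is_morphism("abcd", "acbd", [(0, 0), (1, 2), (3, 3)]))  # True: a, b, d
print(largest_morphism_size("abcd", "acbd"))                  # 3
```
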

1 citation

Journal Article
TL;DR: It is shown that the MCAST problem can be reduced to the MAST problem in linear time, yielding algorithms for MCAST with running times that match the fastest known algorithms for MAST.
Abstract: We propose and study the Maximum Constrained Agreement Subtree (MCAST) problem, which is a variant of the classical Maximum Agreement Subtree (MAST) problem. Our problem allows users to apply their domain knowledge to control the construction of the agreement subtrees in order to get better results. We show that the MCAST problem can be reduced to the MAST problem in linear time and thus we have algorithms for MCAST with running times matching the fastest known algorithms for MAST.

1 citation

References
Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition, this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition, Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity, and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition, this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further, the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm, a design technique, an application area, or a related topic. The chapters are not dependent on one another, so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally, the new edition offers a 25% increase over the first edition in the number of problems, giving the book 155 problems and over 900 exercises that reinforce the concepts the students are learning.

21,651 citations

Journal Article
TL;DR: This paper presents algorithms for the assignment problem, the transportation problem, and the minimum-cost flow problem of operations research that find a minimum-cost solution, yet run in time close to the best-known bounds for the corresponding problems without costs.
Abstract: This paper presents algorithms for the assignment problem, the transportation problem, and the minimum-cost flow problem of operations research. The algorithms find a minimum-cost solution, yet run in time close to the best-known bounds for the corresponding problems without costs. For example, the assignment problem (equivalently, minimum-cost matching in a bipartite graph) can be solved in $O(\sqrt{n}\,m \log (nN))$ time, where $n$, $m$, and $N$ denote the number of vertices, number of edges, and largest magnitude of a cost; costs are assumed to be integral. The algorithms work by scaling. As in the work of Goldberg and Tarjan, in each scaled problem an approximate optimum solution is found, rather than an exact optimum.
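For concreteness, the snippet below solves a small instance of the assignment problem, i.e. minimum-cost matching in a bipartite graph with integral costs. SciPy's linear_sum_assignment is a Hungarian-style solver used here only to illustrate the problem being solved; it is not the paper's scaling algorithm.

```python
# A minimal sketch of the assignment problem the scaling algorithms solve:
# minimum-cost perfect matching in a bipartite graph with integral costs.
# SciPy's Hungarian-style solver stands in for the scaling algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j] = integral cost of assigning left vertex i to right vertex j
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])
rows, cols = linear_sum_assignment(cost)   # minimizes total cost by default
print(list(zip(rows, cols)))               # an optimal matching
print(cost[rows, cols].sum())              # 5: its total cost
```
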

457 citations

Journal Article
TL;DR: This paper presents another approach to the problem of comparing many secondary structures by utilizing a very efficient tree-matching algorithm that compares two trees in $O(|T_1| \times |T_2| \times L_1 \times L_2)$ time in the worst case and very close to $O(|T_1| \times |T_2|)$ for average trees representing secondary structures.
Abstract: In a previous paper, an algorithm was presented for analyzing multiple RNA secondary structures utilizing a multiple string alignment algorithm. In this paper we present another approach to the problem of comparing many secondary structures by utilizing a very efficient tree-matching algorithm that will compare two trees in $O(|T_1| \times |T_2| \times L_1 \times L_2)$ in the worst case and very close to $O(|T_1| \times |T_2|)$ for average trees representing secondary structures. The result of the pairwise comparison algorithm is then used with a cluster algorithm to produce a multiple structure clustering which can be displayed in a taxonomy tree to show related structures.
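The tree representation underlying such comparisons can be sketched directly: in the usual encoding, each matched base pair of an RNA secondary structure becomes an internal node of an ordered tree and each unpaired base a leaf. The dot-bracket input format and the nested-list encoding below are illustrative assumptions, not the paper's exact representation.

```python
# A minimal sketch of the tree view of RNA secondary structure: parse
# dot-bracket notation into an ordered tree where each matched base pair
# is an internal node ("pair") and each unpaired base is a leaf.
def dotbracket_to_tree(s):
    """Parse '(', ')', '.' into a tree: ['pair', children...] / 'unpaired'."""
    root = ["root"]
    stack = [root]
    for ch in s:
        if ch == "(":                # open a base pair: start a new subtree
            node = ["pair"]
            stack[-1].append(node)
            stack.append(node)
        elif ch == ")":              # close the current base pair
            stack.pop()
        else:                        # '.' -- an unpaired base
            stack[-1].append("unpaired")
    return root

# A hairpin loop inside two stacked pairs, with unpaired bases around it:
print(dotbracket_to_tree("((..(...)..))"))
# ['root', ['pair', ['pair', 'unpaired', 'unpaired',
#                    ['pair', 'unpaired', 'unpaired', 'unpaired'],
#                    'unpaired', 'unpaired']]]
```
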

346 citations

Journal Article
TL;DR: The tree obtained by regrafting branches on to a largest common pruned tree is shown to contain all the classes present in the strict consensus tree.
Abstract: Given two or more dendrograms (rooted tree diagrams) based on the same set of objects, ways are presented of defining and obtaining common pruned trees. Bounds on the size of a largest common pruned tree are introduced, as is a categorization of objects according to whether they belong to all, some, or no largest common pruned trees. Also described is a procedure for regrafting pruned branches, yielding trees for which one can assess the reliability of the depicted relationships. The tree obtained by regrafting branches on to a largest common pruned tree is shown to contain all the classes present in the strict consensus tree. The theory is illustrated by application to two classifications of a set of forty-nine stratigraphical pollen spectra.

221 citations