
Showing papers by "Martin Farach" published in 1995


Journal Article
TL;DR: This paper presents several natural and realistic ways of modeling the inaccuracies in the distance data, and considers various ways of “fitting” a given distance matrix to a tree in order to minimize various criteria of error in the fit.
Abstract: Constructing evolutionary trees for species sets is a fundamental problem in computational biology. One of the standard models assumes the ability to compute distances between every pair of species, and seeks to find an edge-weighted tree T in which the distance \(d^T_{ij}\) in the tree between the leaves of T corresponding to the species i and j exactly equals the observed distance, \(d_{ij}\). When such a tree exists, this is expressed in the biological literature by saying that the distance function or matrix is additive, and trees can be constructed from additive distance matrices in \(O(n^2)\) time. Real distance data is hardly ever additive, and we therefore need ways of modeling the problem of finding the best-fit tree as an optimization problem. In this paper we present several natural and realistic ways of modeling the inaccuracies in the distance data. In one model we assume that we have upper and lower bounds for the distances between pairs of species and try to find an additive distance matrix between these bounds. In a second model we are given a partial matrix and asked to find if we can fill in the unspecified entries in order to make the entire matrix additive. For both of these models we also consider a more restrictive problem of finding a matrix that fits a tree which is not only additive but also ultrametric. Ultrametric matrices correspond to trees which can be rooted so that the distance from the root to any leaf is the same. Ultrametric matrices are desirable in biology since the edge weights then indicate evolutionary time. We give polynomial-time algorithms for some of the problems while showing others to be NP-complete. We also consider various ways of “fitting” a given distance matrix (or a pair of upper- and lower-bound matrices) to a tree in order to minimize various criteria of error in the fit. For most criteria this optimization problem turns out to be NP-hard, while we do get polynomial-time algorithms for some.
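
The two matrix properties named in this abstract can be checked directly from their textbook characterizations: the three-point condition for ultrametric matrices and the four-point condition for additive ones. The following is a minimal brute-force sketch under those standard definitions, not the paper's method and not the quadratic-time tree construction the abstract mentions. The function names and the tolerance parameter `tol` are assumptions made for illustration.

```python
from itertools import combinations, combinations_with_replacement

def is_ultrametric(d, tol=1e-9):
    """Three-point condition: in every triple of species, the two
    largest of the three pairwise distances are equal."""
    for i, j, k in combinations(range(len(d)), 3):
        a, b, c = sorted((d[i][j], d[i][k], d[j][k]))
        if c - b > tol:
            return False
    return True

def is_additive(d, tol=1e-9):
    """Four-point condition: for every four (not necessarily distinct)
    species, the two largest of the three pairwise sums are equal.
    Allowing repeated indices also enforces the triangle inequality."""
    for i, j, k, l in combinations_with_replacement(range(len(d)), 4):
        sums = sorted((d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]))
        if sums[2] - sums[1] > tol:
            return False
    return True

# Leaf-to-leaf distances in a star tree with edge weights 1, 2, 3, 4:
# additive, but not ultrametric.
star = [[0, 3, 4, 5],
        [3, 0, 5, 6],
        [4, 5, 0, 7],
        [5, 6, 7, 0]]
# Distances in the rooted tree ((A,B),C) with all leaves at depth 2.
clock = [[0, 2, 4],
         [2, 0, 4],
         [4, 4, 0]]
```

These checks take O(n^3) and O(n^4) time respectively, so they are only useful for validating small inputs. Both functions assume a symmetric matrix with a zero diagonal.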

152 citations


Journal Article
TL;DR: An algorithm which computes the MAST of k trees on n leaves where some tree has maximum outdegree d in time \(O(kn^3 + n^d)\) is given.

124 citations


Journal Article
TL;DR: A faster algorithm for dynamic string dictionary matching with bounded alphabets, and a novel method to efficiently manipulate failure links for two-dimensional patterns.
Abstract: In the dynamic dictionary matching problem, a dictionary D contains a set of patterns that can change over time by insertion and deletion of individual patterns. The user also presents text strings and asks for all occurrences of any patterns in the text. The two main contributions of this paper are: (1) a faster algorithm for dynamic string dictionary matching with bounded alphabets, and (2) a dynamic dictionary matching algorithm for two-dimensional texts and patterns. The first contribution is based on an algorithm that solves the general problem of maintaining a sequence of well-balanced parentheses under the operations insert, delete, and find nearest enclosing parenthesis pair. The main new idea behind the second contribution is a novel method to efficiently manipulate failure links for two-dimensional patterns.
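
To make the "find nearest enclosing parenthesis pair" operation concrete, here is a minimal static sketch. It recomputes everything in linear time per query and supports no insertions or deletions, so it illustrates only the query semantics and not the dynamic data structure the abstract describes. The function names are hypothetical.

```python
def match_parentheses(s):
    """Map the index of each '(' to the index of its matching ')'."""
    stack, match = [], {}
    for i, ch in enumerate(s):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced sequence")
            match[stack.pop()] = i
    if stack:
        raise ValueError("unbalanced sequence")
    return match

def nearest_enclosing(s, open_idx):
    """Return (open, close) indices of the innermost pair that strictly
    encloses the pair opened at open_idx, or None if there is none."""
    match = match_parentheses(s)
    depth = 0
    # Scan left, skipping over complete sibling pairs.
    for i in range(open_idx - 1, -1, -1):
        if s[i] == ")":
            depth += 1
        elif s[i] == "(":
            if depth == 0:
                return (i, match[i])
            depth -= 1
    return None
```

For example, in `"(()(()))"` the pair opened at index 4 is enclosed by the pair `(3, 6)`, and the outermost pair has no enclosing pair.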

114 citations


Proceedings Article
29 May 1995
TL;DR: The theory of string matching has a long association with compression algorithms, and data structures from string matching can be used to derive fast implementations of many important compression schemes, most notably the Lempel-Ziv (LZ77) algorithm.
Abstract: String matching and compression are two widely studied areas of computer science. The theory of string matching has a long association with compression algorithms. Data structures from string matching can be used to derive fast implementations of many important compression schemes, most notably the Lempel-Ziv (LZ77) algorithm. Intuitively, once a string has been compressed, and therefore its repetitive nature has been elucidated, one might be tempted to exploit this knowledge to speed up string matching. The Compressed Matching Problem is that of performing string matching in a compressed text, without uncompressing it. More formally, let T be a text, let Z be the compressed string representing T, and let P be a pattern. The Compressed Matching Problem is that of deciding if P occurs in T, given only P and Z. Compressed matching algorithms have been given for several compression schemes such as LZW.
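
As a point of reference for the problem statement, the sketch below builds a compressed representation Z of a text T with a naive greedy LZ77 coder and then answers "does P occur in T?" by fully decompressing Z. Full decompression is exactly the baseline that compressed matching tries to avoid; this is not the paper's algorithm. The triple format `(offset, length, next_char)`, the `window` parameter and all function names are assumptions chosen for illustration.

```python
def lz77_compress(text, window=255):
    """Greedy LZ77: emit (offset, length, next_char) triples."""
    out, i = [], 0
    while i < len(text):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Keep one character in reserve for next_char; a match is
            # allowed to run past position i (self-overlapping copy).
            while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    """Rebuild the text by copying from the already decoded prefix."""
    buf = []
    for off, length, ch in triples:
        for _ in range(length):
            buf.append(buf[-off])  # one character at a time handles overlap
        buf.append(ch)
    return "".join(buf)

def occurs_by_decompression(triples, pattern):
    """Baseline answer to the Compressed Matching Problem: decompress Z,
    then search. Costs time proportional to |T| rather than |Z|."""
    return pattern in lz77_decompress(triples)
```

On highly repetitive input the gap between |Z| and |T| is large: `"aaaaaaa"` compresses to just two triples, which is why an algorithm that works on Z directly can be attractive.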

113 citations


Proceedings Article
22 Jan 1995
TL;DR: It is proved that the match length entropy estimator has a relatively fast convergence rate, and it is demonstrated experimentally that by using this entropy estimator, one can indeed extract a meaningful signal from segments of DNA.
Abstract: We have applied the information-theoretic notion of entropy to characterize DNA sequences. We consider a genetic sequence signal that is too small for asymptotic entropy estimates to be accurate, and for which similar approaches have previously failed. We prove that the match length entropy estimator has a relatively fast convergence rate and demonstrate experimentally that by using this entropy estimator, we can indeed extract a meaningful signal from segments of DNA. Further, we derive a method for detecting certain signals within DNA known as splice junctions with significantly better performance than previously known methods.
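
The abstract does not define the match length entropy estimator, so the sketch below uses one common sliding-window form from the literature as a stand-in: for each position, take the length of the longest match into the preceding window, add one, and divide the logarithm of the window size by the mean of these values. The exact estimator analyzed in the paper may differ. The names `match_lengths`, `entropy_estimate` and `window` are assumptions.

```python
import math

def match_lengths(seq, window):
    """For each position i >= window, the length of the longest prefix of
    seq[i:] that also starts at some position in seq[i-window:i]."""
    n = len(seq)
    lengths = []
    for i in range(window, n):
        best = 0
        for j in range(i - window, i):
            k = 0
            while i + k < n and seq[j + k] == seq[i + k]:
                k += 1
            best = max(best, k)
        lengths.append(best)
    return lengths

def entropy_estimate(seq, window):
    """Assumed estimator form: log2(window) divided by the mean of
    (longest match + 1). Returns bits per symbol."""
    if len(seq) <= window:
        raise ValueError("sequence must be longer than the window")
    lengths = match_lengths(seq, window)
    mean_len = sum(length + 1 for length in lengths) / len(lengths)
    return math.log2(window) / mean_len
```

A strongly periodic sequence such as `"ACGT" * 50` yields very long matches and therefore a low estimate, whereas an irregular sequence over the same alphabet yields short matches and a higher one. This quadratic-time implementation is only suitable for short segments.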

92 citations


Journal Article
TL;DR: This work derives an \(O(n^{2+o(1)})\) time algorithm for the Unrooted Maximum Agreement Subtree Problem (UMAST) and an \(O(n^2)\) time algorithm for its rooted variant (RMAST).
Abstract: Constructing evolutionary trees for species sets is a fundamental problem in biology. Unfortunately, there is no single agreed upon method for this task, and many methods are in use. Current practice dictates that trees be constructed using different methods and that the resulting trees then be compared for consensus. It has become necessary to automate this process as the number of species under consideration has grown. We study the Unrooted Maximum Agreement Subtree Problem (UMAST) and its rooted variant (RMAST). The UMAST problem is as follows: given a set A and two trees \(T_0\) and \(T_1\) leaf-labeled by the elements of A, find a maximum cardinality subset B of A such that the restrictions of \(T_0\) and \(T_1\) to B are topologically isomorphic. Our main result is an \(O(n^{2+o(1)})\) time algorithm for the UMAST problem. We also derive an \(O(n^2)\) time algorithm for the RMAST problem. The previous best algorithm for both these problems has running time \(O(n^{4.5+o(1)})\).
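
For the rooted variant, the size of a maximum agreement subtree of two binary trees can be computed with a well-known dynamic program over pairs of subtrees. The sketch below is that textbook recurrence with memoization, offered only to illustrate the problem definition. It is not the algorithm of this paper, and it assumes both trees are rooted, strictly binary, and encoded as nested 2-tuples with string leaf labels (an encoding chosen here for convenience).

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def leaves(t):
    """Set of leaf labels below node t (a leaf is a plain string)."""
    if isinstance(t, str):
        return frozenset([t])
    return leaves(t[0]) | leaves(t[1])

@lru_cache(maxsize=None)
def rmast_size(u, v):
    """Size of a maximum agreement subtree of rooted binary trees u, v."""
    if isinstance(u, str):
        return 1 if u in leaves(v) else 0
    if isinstance(v, str):
        return 1 if v in leaves(u) else 0
    (ul, ur), (vl, vr) = u, v
    return max(
        # Discard one whole side of either root.
        rmast_size(ul, v), rmast_size(ur, v),
        rmast_size(u, vl), rmast_size(u, vr),
        # Match the roots and pair up the children in either order.
        rmast_size(ul, vl) + rmast_size(ur, vr),
        rmast_size(ul, vr) + rmast_size(ur, vl),
    )

t0 = ((("a", "b"), "c"), "d")
t1 = ((("a", "c"), "b"), "d")
```

Here `t0` and `t1` disagree on the arrangement of a, b and c, so the largest subset on which they agree has three leaves, for example {a, b, d}.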

73 citations


Journal Article
TL;DR: An \(O(kn^2\sqrt{m}\,\log m\,\sqrt{k \log k} + k^2n^2)\) algorithm which combines convolutions with dynamic programming is given for approximate two-dimensional matching of half-rectangular patterns; at its heart are the Smaller Matching Problem and the k-Aligned Ones with Location Problem.
Abstract: Efficient algorithms exist for the approximate two dimensional matching problem for rectangles. This is the problem of finding all occurrences of an m × m pattern in an n × n text with no more than k mismatch, insertion, and deletion errors. In computer vision it is important to generalize this problem to non-rectangular figures. We make progress towards this goal by defining half-rectangular figures of height m and area a. The approximate two dimensional matching problem for half-rectangular patterns can be solved using a dynamic programming approach in time \(O(an^2)\). We show an \(O(kn^2\sqrt{m}\,\log m\,\sqrt{k \log k} + k^2n^2)\) algorithm which combines convolutions with dynamic programming. Note that our algorithm is superior to previous known solutions for \(k \le m^{1/3}\). At the heart of the algorithm are the Smaller Matching Problem and the k-Aligned Ones with Location Problem. These are interesting problems in their own right. Efficient algorithms to solve both these problems are presented.

58 citations


Book Chapter
25 Sep 1995
TL;DR: An algorithm is given which computes the MAST of k trees on n species where some tree has maximum degree d in time \(O(kn^3 + n^d)\).
Abstract: The Maximum Agreement Subtree (MAST) is a well-studied measure of similarity of leaf-labelled trees. There are several variants, depending on the number of trees, their degrees, and whether or not they are rooted. It turns out that the different variants display very different computational behavior. We address the common situation in biology, where the involved trees are rooted and of bounded degree, most typically simply being binary. We give an algorithm which computes the MAST of k trees on n species where some tree has maximum degree d in time \(O(kn^3 + n^d)\). This improves the Amir and Keselman FOCS '94 \(O(kn^{d+1} + n^{2d})\) bound. We give an algorithm which computes the MAST of 2 trees with degree bound d in time \(O(n\sqrt{d}\,\log^3 n)\). This should be contrasted with the Farach and Thorup FOCS '94 \(O(nc^{\sqrt{\log n}} + n\sqrt{d}\,\log n)\) bound. Thus, for d a constant, we get an \(O(n \log^3 n)\) bound, replacing the previous \(O(nc^{\sqrt{\log n}})\) bound.

24 citations


Proceedings Article
20 Jul 1995
TL;DR: This paper addresses parallel algorithms for dictionary matching and compression.
Abstract: Parallel Dictionary Matching and Compression

17 citations