Author

Martin Farach

Bio: Martin Farach is an academic researcher from Rutgers University. The author has contributed to research in topics: Approximate string matching & Pattern matching. The author has an h-index of 35 and has co-authored 55 publications receiving 3591 citations. Previous affiliations of Martin Farach include University of Latvia & University of Copenhagen.

Papers
Proceedings ArticleDOI
Martin Farach
19 Oct 1997
TL;DR: This work builds suffix trees in linear time for integer alphabets, closing the gap between the known upper and lower bounds for suffix tree construction.
Abstract: The suffix tree of a string is the fundamental data structure of combinatorial pattern matching. Weiner (1973), who introduced the data structure, gave an O(n)-time algorithm for building the suffix tree of an n-character string drawn from a constant size alphabet. In the comparison model, there is a trivial Ω(n log n)-time lower bound based on sorting, and Weiner's algorithm matches this bound trivially. For integer alphabets, a substantial gap remains between the known upper and lower bounds, and closing this gap is the main open question in the construction of suffix trees. There is no super-linear lower bound, and the fastest known algorithm was the O(n log n) time comparison based algorithm. We settle this open problem by closing the gap: we build suffix trees in linear time for integer alphabets.

426 citations

Journal ArticleDOI
TL;DR: In this article, the authors consider pattern matching without decompression in the UNIX Z-compression scheme and show how to modify their algorithms to achieve a trade-off between the amount of extra space used and the algorithm's time complexity.

223 citations

Journal ArticleDOI
TL;DR: This paper gives the first nontrivial compressed matching algorithm for the classic adaptive compression scheme, the LZ77 algorithm, which is known to compress more than other dictionary compression schemes, such as LZ78 and LZW, though for strings with constant per bit entropy, all these schemes compress optimally in the limit.
Abstract: String matching and compression are two widely studied areas of computer science. The theory of string matching has a long association with compression algorithms. Data structures from string matching can be used to derive fast implementations of many important compression schemes, most notably the Lempel-Ziv (LZ77) algorithm. Intuitively, once a string has been compressed, and therefore its repetitive nature has been elucidated, one might be tempted to exploit this knowledge to speed up string matching. The Compressed Matching Problem is that of performing string matching in a compressed text, without uncompressing it. More formally, let T be a text, let Z be the compressed string representing T, and let P be a pattern. The Compressed Matching Problem is that of deciding if P occurs in T, given only P and Z. Compressed matching algorithms have been given for several compression schemes such as LZW. In this paper we give the first nontrivial compressed matching algorithm for the classic adaptive compression scheme, the LZ77 algorithm. In practice, the LZ77 algorithm is known to compress more than other dictionary compression schemes, such as LZ78 and LZW, though for strings with constant per bit entropy, all these schemes compress optimally in the limit. However, for strings with o(1) per bit entropy, while it was recently shown that the LZ77 gives compression to within a constant factor of optimal, schemes such as LZ78 and LZW may deviate from optimality by an exponential factor. Asymptotically, compressed matching is only relevant if |Z| = o(|T|), i.e., if the compression ratio |T|/|Z| is more than a constant. These results show that LZ77 is the appropriate compression method in such settings. We present an LZ77 compressed matching algorithm which runs in time O(n log²(u/n) + p), where n = |Z|, u = |T|, and p = |P|. Compare with the naive "decompression" algorithm, which takes time Θ(u + p) to decide if P occurs in T. Writing u + p as n·(u/n) + p, we see that we have improved the complexity, replacing the compression factor u/n by a factor log²(u/n). Our algorithm is competitive in the sense that O(n log²(u/n) + p) = O(u + p), and opportunistic in the sense that O(n log²(u/n) + p) = o(u + p) if n = o(u) and p = o(u).

179 citations
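
For orientation, the naive "decompression" baseline that the abstract above compares against can be sketched as follows. This is only an illustration of that baseline, not the paper's compressed matching algorithm. The token format (offset, length, next character) and the names `lz77_decompress` and `naive_compressed_match` are assumptions made for this sketch, and Python's built-in substring search stands in for a linear-time matcher.

```python
def lz77_decompress(tokens):
    """Expand a list of (offset, length, next_char) tokens into the text T.

    The token format is an assumption for this sketch: copy `length`
    characters starting `offset` positions back, then append `next_char`
    (an empty string means "no literal").
    """
    out = []
    for offset, length, next_char in tokens:
        start = len(out) - offset
        # Copy one character at a time so overlapping copies work.
        for i in range(length):
            out.append(out[start + i])
        if next_char:
            out.append(next_char)
    return "".join(out)


def naive_compressed_match(tokens, pattern):
    """Decide whether `pattern` occurs in T by fully decompressing first."""
    text = lz77_decompress(tokens)  # costs time proportional to u = |T|
    return pattern in text          # stand-in for a linear-time matcher


# Three tokens (n = 3) that expand to a text of length u = 9.
tokens = [(0, 0, "a"), (0, 0, "b"), (2, 6, "c")]
print(lz77_decompress(tokens))
print(naive_compressed_match(tokens, "babc"))
```

The baseline always pays for the full length u of the text, which is the cost the compressed matching algorithm is designed to avoid when n is much smaller than u.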

Proceedings ArticleDOI
28 Jan 1996
TL;DR: In this paper, the problem of fitting an n × n distance matrix D by a tree metric T was considered and an O(n²) algorithm was proposed for this problem with a performance guarantee.
Abstract: We consider the problem of fitting an n × n distance matrix D by a tree metric T. Let ε be the distance to the closest tree metric under the L∞ norm, that is, ε = min_T {‖T − D‖∞}. First we present an O(n²) algorithm for finding a tree metric T such that ‖T − D‖∞ ≤ 3ε. Second we show that it is NP-hard to find a tree metric T such that ‖T − D‖∞ < 9ε/8. This paper presents the first algorithm for this problem with a performance guarantee.

159 citations
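
The objective in the abstract above is the largest entrywise deviation between the candidate tree metric and the input matrix. A minimal sketch of that error measure, assuming both matrices are given as square lists of lists (the name `linf_error` is hypothetical and not taken from the paper):

```python
def linf_error(T, D):
    """Return the L-infinity distance max |T[i][j] - D[i][j]| between two
    square matrices of the same size."""
    n = len(D)
    return max(abs(T[i][j] - D[i][j]) for i in range(n) for j in range(n))


# Observed distances D and a candidate tree metric T on three taxa.
D = [[0, 3, 7],
     [3, 0, 6],
     [7, 6, 0]]
T = [[0, 4, 6],
     [4, 0, 6],
     [6, 6, 0]]
print(linf_error(T, D))
```

This only evaluates a given candidate; finding a candidate whose error is within a constant factor of the optimum is the hard part that the paper addresses.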

Journal ArticleDOI
TL;DR: This paper presents several natural and realistic ways of modeling the inaccuracies in the distance data, and considers various ways of “fitting” a given distance matrix to a tree in order to minimize various criteria of error in the fit.
Abstract: Constructing evolutionary trees for species sets is a fundamental problem in computational biology. One of the standard models assumes the ability to compute distances between every pair of species, and seeks to find an edge-weighted tree T in which the distance in the tree between the leaves of T corresponding to the species i and j exactly equals the observed distance, d_ij. When such a tree exists, this is expressed in the biological literature by saying that the distance function or matrix is additive, and trees can be constructed from additive distance matrices in O(n²) time. Real distance data is hardly ever additive, and we therefore need ways of modeling the problem of finding the best-fit tree as an optimization problem. In this paper we present several natural and realistic ways of modeling the inaccuracies in the distance data. In one model we assume that we have upper and lower bounds for the distances between pairs of species and try to find an additive distance matrix between these bounds. In a second model we are given a partial matrix and asked to find if we can fill in the unspecified entries in order to make the entire matrix additive. For both of these models we also consider a more restrictive problem of finding a matrix that fits a tree which is not only additive but also ultrametric. Ultrametric matrices correspond to trees which can be rooted so that the distance from the root to any leaf is the same. Ultrametric matrices are desirable in biology since the edge weights then indicate evolutionary time. We give polynomial-time algorithms for some of the problems while showing others to be NP-complete. We also consider various ways of “fitting” a given distance matrix (or a pair of upper- and lower-bound matrices) to a tree in order to minimize various criteria of error in the fit. For most criteria this optimization problem turns out to be NP-hard, while we do get polynomial-time algorithms for some.

152 citations
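
The two matrix properties named in the abstract above can be tested directly with the standard four-point condition (additive) and three-point condition (ultrametric). The sketch below is a brute-force check, not the O(n²) construction the abstract mentions. It assumes the input is already a symmetric metric with a zero diagonal, and the function names and the tolerance parameter `tol` are choices made for this illustration.

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Four-point condition: for every four taxa, the two largest of the
    three pairwise sums must be equal. Brute force over all quadruples."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


def is_ultrametric(d, tol=1e-9):
    """Three-point condition: for every three taxa, the two largest of the
    three pairwise distances must be equal."""
    for i, j, k in combinations(range(len(d)), 3):
        dists = sorted([d[i][j], d[i][k], d[j][k]])
        if abs(dists[2] - dists[1]) > tol:
            return False
    return True


# Leaf-to-leaf distances in a star tree with edge weights 1, 2, 3, 4:
# additive, but not ultrametric because the leaves sit at different depths.
star = [[0, 3, 4, 5],
        [3, 0, 5, 6],
        [4, 5, 0, 7],
        [5, 6, 7, 0]]
print(is_additive(star), is_ultrametric(star))
```

Every ultrametric matrix is also additive, which is why the abstract describes the ultrametric variants as the more restrictive problems.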


Cited by
Journal ArticleDOI

3,734 citations

Journal ArticleDOI

2,415 citations

Journal ArticleDOI
17 Sep 2002
TL;DR: Neighbor-Net is presented, a distance based method for constructing phylogenetic networks that is based on the Neighbor-Joining (NJ) algorithm of Saitou and Nei and can quickly produce detailed and informative networks for several hundred taxa.
Abstract: We introduce NeighborNet, a network construction and data representation method that combines aspects of neighbor joining (NJ) and SplitsTree. Like NJ, NeighborNet uses agglomeration: taxa are combined into progressively larger and larger overlapping clusters. Like SplitsTree, NeighborNet constructs networks rather than trees, and so can be used to represent multiple phylogenetic hypotheses simultaneously, or to detect complex evolutionary processes like recombination, lateral transfer and hybridization. NeighborNet tends to produce networks that are substantially more resolved than those made with SplitsTree. The method is efficient (O(n³) time) and is well suited for the preliminary analyses of complex phylogenetic data. We report results of three case studies: one based on mitochondrial gene order data from early branching eukaryotes, another based on nuclear sequence data from New Zealand alpine buttercups (Ranunculi), and a third on poorly corrected synthetic data.

1,846 citations

Journal ArticleDOI
TL;DR: An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string, developed as a linear-time version of a very simple algorithm for (quadratic size) suffix tries.
Abstract: An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string. The new algorithm has the desirable property of processing the string symbol by symbol from left to right. It always has the suffix tree for the scanned part of the string ready. The method is developed as a linear-time version of a very simple algorithm for (quadratic size) suffix tries. Regardless of its quadratic worst case, this latter algorithm can be a good practical method when the string is not too long. Another variation of this method is shown to give, in a natural way, the well-known algorithms for constructing suffix automata (DAWGs).

1,528 citations
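
To make the "quadratic size suffix trie" in the abstract above concrete, here is a minimal sketch that inserts every suffix of a string into a trie of nested dictionaries. It is a plain offline construction for illustration, not the on-line linear-time algorithm the abstract describes, and the function names are hypothetical.

```python
def build_suffix_trie(s):
    """Insert every suffix of `s` into a trie of nested dicts.
    Time and space can grow quadratically with len(s)."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Return True if `pattern` is a substring of the indexed string."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_nodes(trie):
    """Count trie nodes, including the root."""
    return 1 + sum(count_nodes(child) for child in trie.values())


trie = build_suffix_trie("banana")
print(contains(trie, "nan"), contains(trie, "nab"))
print(count_nodes(build_suffix_trie("abc")))
```

Because every substring of the text labels a path from the root, a substring query takes time proportional to the pattern length, which is the property suffix trees keep while using only linear space.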