Showing papers on "Approximate string matching" published in 1992


Journal ArticleDOI
06 Jan 1992
TL;DR: Two string distance functions that are computable in linear time give a lower bound for the edit distance (in the unit cost model), which leads to fast hybrid algorithms for edit-distance-based string matching.
Abstract: We study approximate string matching in connection with two string distance functions that are computable in linear time. The first function is based on the so-called $q$-grams. An algorithm is given for the associated string matching problem that finds the locally best approximate occurrences of pattern $P$, $|P|=m$, in text $T$, $|T|=n$, in time $O(n\log (m-q))$. The occurrences with distance $\leq k$ can be found in time $O(n\log k)$. The other distance function is based on finding maximal common substrings and allows a form of approximate string matching in time $O(n)$. Both distances give a lower bound for the edit distance (in the unit cost model), which leads to fast hybrid algorithms for edit-distance-based string matching.
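As a rough illustration of the first distance function, the sketch below computes a q-gram distance as the L1 distance between q-gram count profiles, and derives an edit distance lower bound from it. The function names are hypothetical, and the bound relies on the standard observation that one edit operation changes the profile by at most 2q; this is a plain quadratic-free profile comparison, not the paper's matching algorithm.

```python
from collections import Counter
from typing import Counter as CounterType


def qgram_profile(s: str, q: int) -> CounterType[str]:
    """Count every length-q substring of s (empty if len(s) < q)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x: str, y: str, q: int) -> int:
    """L1 distance between the q-gram profiles of x and y."""
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


def edit_distance_lower_bound(x: str, y: str, q: int) -> int:
    """Lower bound on unit-cost edit distance.

    Assumption: a single edit changes the profile by at most 2q,
    so ed(x, y) >= ceil(qgram_distance / (2q)).
    """
    d = qgram_distance(x, y, q)
    return -(-d // (2 * q))  # ceiling division
```

A filter built on this bound can discard a candidate whenever the bound already exceeds k, and only run a full edit distance computation on the survivors.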

665 citations


Book ChapterDOI
29 Apr 1992
TL;DR: An optimal sequential solution of the color set size problem and string matching applications including a linear time algorithm for the problem of finding the longest substring common to at least k out of m input strings for all k between 1 and m is given.
Abstract: The Color Set Size problem is: Given a rooted tree of size n with l leaves colored from 1 to m, m ≤ l, for each vertex u find the number of different leaf colors in the subtree rooted at u. This problem formulation, together with the Generalized Suffix Tree data structure has applications to string matching. This paper gives an optimal sequential solution of the color set size problem and string matching applications including a linear time algorithm for the problem of finding the longest substring common to at least k out of m input strings for all k between 1 and m. In addition, parallel solutions to the above problems are given. These solutions may shed light on problems in computational biology, such as the multiple string alignment problem.
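To make the problem statement concrete, here is a deliberately naive sketch that answers the Color Set Size question by merging color sets bottom-up. It is not the optimal solution the paper describes; the tree encoding (`children` as a dict of child lists, `leaf_color` as a dict) is an assumption made for this example, and the recursion suits only small trees.

```python
from typing import Dict, Hashable, List, Set


def color_set_sizes(children: Dict[Hashable, List[Hashable]],
                    leaf_color: Dict[Hashable, int],
                    root: Hashable) -> Dict[Hashable, int]:
    """For every vertex u, count the distinct leaf colors below u.

    Naive set-union version: simple, but not linear time in general.
    """
    sizes: Dict[Hashable, int] = {}

    def visit(u: Hashable) -> Set[int]:
        kids = children.get(u, [])
        if not kids:                      # u is a leaf
            colors = {leaf_color[u]}
        else:                             # union of the children's color sets
            colors = set()
            for v in kids:
                colors |= visit(v)
        sizes[u] = len(colors)
        return colors

    visit(root)
    return sizes
```

In the string matching application, each leaf color would identify which input string a suffix came from, so a vertex with at least k colors corresponds to a substring shared by at least k strings.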

200 citations


Book ChapterDOI
29 Apr 1992
TL;DR: This work presents an algorithm for string matching with mismatches, based on arithmetic operations, that runs in linear worst-case time for most practical cases, and an algorithm for string matching with errors based on partitioning the pattern.
Abstract: We present new algorithms for approximate string matching based on simple but efficient ideas. First, we present an algorithm for string matching with mismatches, based on arithmetic operations, that runs in linear worst-case time for most practical cases. This is a new approach to string searching. Second, we present an algorithm for string matching with errors, based on partitioning the pattern, that requires linear expected time for typical inputs.
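The pattern-partitioning idea can be sketched as follows, under the usual pigeonhole assumption: if the pattern is cut into k+1 pieces, any occurrence with at most k errors must contain at least one piece unchanged. The sketch finds exact piece occurrences with `str.find` and verifies a window around each with a small dynamic-programming check. All function names are hypothetical, and this is an illustration of the filtering principle rather than the authors' implementation.

```python
from typing import List, Tuple


def split_pattern(pattern: str, k: int) -> List[Tuple[int, str]]:
    """Cut pattern into k+1 contiguous pieces; return (offset, piece) pairs."""
    m, pieces = len(pattern), k + 1
    bounds = [i * m // pieces for i in range(pieces + 1)]
    return [(bounds[i], pattern[bounds[i]:bounds[i + 1]]) for i in range(pieces)]


def approx_ends(pattern: str, window: str, k: int) -> List[int]:
    """End offsets j in window where some substring ending at j is within k edits."""
    m = len(pattern)
    col = list(range(m + 1))              # distances against the empty text prefix
    ends = []
    for j, c in enumerate(window, start=1):
        prev_diag, col[0] = col[0], 0     # a match may start at any text position
        for i in range(1, m + 1):
            cur = min(col[i] + 1,         # extra text character
                      col[i - 1] + 1,     # missing pattern character
                      prev_diag + (pattern[i - 1] != c))
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends


def search_with_errors(pattern: str, text: str, k: int) -> List[int]:
    """End positions in text of matches with at most k edit operations."""
    m = len(pattern)
    if m <= k:
        raise ValueError("pattern must be longer than k so every piece is non-empty")
    hits = set()
    for off, piece in split_pattern(pattern, k):
        start = text.find(piece)
        while start != -1:
            # An occurrence containing this piece lies inside this window.
            lo = max(0, start - off - k)
            hi = min(len(text), start - off + m + k)
            hits.update(lo + j for j in approx_ends(pattern, text[lo:hi], k))
            start = text.find(piece, start + 1)
    return sorted(hits)
```

The expected linear behavior claimed in the abstract comes from pieces rarely matching by chance in typical text, so the costly verification step runs only occasionally.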

118 citations


Book ChapterDOI
29 Apr 1992
TL;DR: A probabilistic analysis of the DP table is given in order to prove that the expected running time of the algorithm (as well as an earlier “cut-off” algorithm due to Ukkonen) is O(kn) for random text.
Abstract: We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various O(kn) algorithms based on dynamic programming (DP), paying particular attention to dependence on b, the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on b. A new algorithm is presented that computes far fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier “cut-off” algorithm due to Ukkonen) is O(kn) for random text. Furthermore, we give a heuristic argument that our algorithm is O(kn/(√b-1)) on the average, when alphabet size is taken into consideration.
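For orientation, the following is a minimal sketch of the earlier "cut-off" idea that the abstract attributes to Ukkonen, reconstructed from its usual textbook description rather than from this paper: in each DP column, only the rows up to one past the deepest entry with value at most k are computed. The function name and the capping of values at k+1 are choices made for this example. It is not the new algorithm the paper presents.

```python
from typing import List


def k_differences_cutoff(pattern: str, text: str, k: int) -> List[int]:
    """End positions in text of substrings within k edits of pattern.

    Column-wise DP that skips rows which cannot hold a value <= k.
    """
    m = len(pattern)
    inf = k + 1                                   # stands for "more than k"
    col = [min(i, inf) for i in range(m + 1)]     # column for the empty text prefix
    last = min(m, k)                              # deepest row with value <= k
    ends = []
    for j, c in enumerate(text, start=1):
        top = min(m, last + 1)                    # rows below top must exceed k
        prev_diag = col[0]                        # row 0 is always 0
        for i in range(1, top + 1):
            cur = min(col[i] + 1,                 # extra text character
                      col[i - 1] + 1,             # missing pattern character
                      prev_diag + (pattern[i - 1] != c))
            prev_diag, col[i] = col[i], min(cur, inf)
        last = top
        while col[last] > k:                      # terminates because col[0] == 0
            last -= 1
        if last == m:
            ends.append(j)                        # whole pattern matched within k
        else:
            col[last + 1] = inf                   # clear a possibly stale entry
    return ends
```

On random text most columns keep `last` close to k, which is the intuition behind the O(kn) expected running time analyzed in the paper.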

89 citations


Proceedings Article
01 Sep 1992
TL;DR: This paper presents a new algorithmic technique for two-dimensional matching, that of periodicity analysis, and introduces a new pattern matching paradigm - Compressed Matching
Abstract: String matching is rich with a variety of algorithmic tools. In contrast, multidimensional matching has a rather sparse set of techniques. This paper presents a new algorithmic technique for two-dimensional matching, that of periodicity analysis. Periodicity in strings has been used to solve string matching problems. The success of these algorithms suggests that periodicity can be as important a tool in multidimensional matching. However, multidimensional periodicity is not as simple as it is in strings and was not formally studied or used in pattern matching. This paper's main contribution is defining and analysing two-dimensional periodicity in rectangular arrays. In addition, we introduce a new pattern matching paradigm: Compressed Matching. A text array T and a pattern array P are given in compressed forms c(T) and c(P). We seek all appearances of P in T, without decompressing T. By using periodicity analysis, we show that for the two-dimensional run-length compression there is an O(|c(T)| log|P| + |P|), or almost optimal, algorithm that can achieve a search time that is sublinear in the size of the text |T|.

87 citations


Journal ArticleDOI
TL;DR: An algorithm is presented that solves the problem of finding the suffix-prefix match for each of the k(k - 1) ordered pairs of strings in O(m + k^2) time, for any fixed alphabet.

83 citations


Proceedings ArticleDOI
01 Jul 1992
TL;DR: An algorithm for two dimensional matching with an O(n^2) text scanning phase that runs on the same model as standard linear time string matching algorithms and requires no special assumptions about the alphabet.
Abstract: Alphabet Independent Two Dimensional Matching. Amihood Amir (Georgia Tech), Gary Benson (Univ. of Maryland), Martin Farach (DIMACS). There are many solutions to the string matching problem which are strictly linear in the input size and independent of alphabet size. Furthermore, the model of computation for these algorithms is very weak: they allow only simple arithmetic and comparisons of equality between characters of the input. In contrast, algorithms for two dimensional matching have needed stronger models of computation, most notably assuming a totally ordered alphabet. The fastest algorithms for two dimensional matching have therefore had a logarithmic dependence on the alphabet size. In the worst case, this gives an algorithm that runs in O(n^2 log m) with O(m^2 log m) preprocessing. We show an algorithm for two dimensional matching with an O(n^2) text scanning phase. Furthermore, the text scan requires no special assumptions about the alphabet, i.e. it runs on the same model as standard linear time string matching algorithms. Presented at the 24th Annual ACM STOC, May 1992, Victoria, B.C., Canada.

58 citations


Patent
13 Nov 1992
TL;DR: A variable length string matcher finds the longest string in a stored sequence of data elements (e.g., in a history buffer) that matches a string in a given sequence of input data elements.
Abstract: A variable length string matcher finds the longest string in a stored sequence of data elements (e.g., in a history buffer) that matches a string in a given sequence of data elements. The matcher includes circuitry that operates iteratively to compare data elements of the strings and determine the longest matching string based on when an iteration does not result in issuance of a match signal. In another aspect, the history buffer is an associative content addressable memory (CAM), and the string matcher uses absolute addressing of the CAM to determine the longest matching string.

48 citations


Journal ArticleDOI
TL;DR: This paper presents an algorithm for the Two-Dimensional Dictionary Problem, that of finding each occurrence of a set of two-dimensional patterns in a text.

47 citations


Book ChapterDOI
29 Apr 1992
TL;DR: The standard string matching problem involves finding all occurrences of a single pattern in a single text, while there are some domains in which it is more appropriate to deal with dictionaries of patterns.
Abstract: The standard string matching problem involves finding all occurrences of a single pattern in a single text. While this approach works well in many application areas, there are some domains in which it is more appropriate to deal with dictionaries of patterns. A dictionary is a set of patterns; the goal of dictionary matching is to find all dictionary patterns in a given text, simultaneously.

34 citations


Journal ArticleDOI
06 Jan 1992
TL;DR: An approximate string-matching algorithm is described based on earlier attribute-matching algorithms, which involves building a trie from the text string which takes time O(N log2 N), for a text string of length N.
Abstract: An approximate string-matching algorithm is described based on earlier attribute-matching algorithms. The algorithm involves building a trie from the text string which takes time O(N log2 N), for a text string of length N. Once this data structure has been built any number of approximate searches can be made for pattern strings of length m. The expected complexity analysis is given for the look-up phase of the algorithm based on certain regularity assumptions about the background language. The expected look-up time for each pattern is O(m log2 N). The ideas employed in the algorithm have been shown effective in practice before, but have not previously received any theoretical analysis.

01 Jan 1992
TL;DR: This work considers several problems from a theoretical perspective and provides efficient algorithms and lower bounds for these problems in sequential and parallel models of computation for the string matching problem.
Abstract: Problems involving strings arise in many areas of computer science and have numerous practical applications. We consider several problems from a theoretical perspective and provide efficient algorithms and lower bounds for these problems in sequential and parallel models of computation. In the sequential setting, we present new algorithms for the string matching problem improving the previous bounds on the number of comparisons performed by such algorithms. In parallel computation, we present tight algorithms and lower bounds for the string matching problem, for finding the periods of a string, for detecting squares and for finding initial palindromes.

Proceedings Article
01 Jan 1992
TL;DR: The authors show an upper bound of n + 8/(3(m+1)) · (n-m) character comparisons, achieved by an online algorithm which performs O(n) work in total and requires O(m) space and O(m^2) time for preprocessing.

01 Jan 1992
TL;DR: In this article, the exact number of symbol comparisons required to solve the string matching problem is studied and a family of efficient algorithms is presented. Unlike previous string matching algorithms, the algorithms in this family do not "forget" results of comparisons, which makes their analysis much simpler.
Abstract: We study the exact number of symbol comparisons that are required to solve the string matching problem and present a family of efficient algorithms. Unlike previous string matching algorithms, the algorithms in this family do not "forget" results of comparisons, which makes their analysis much simpler. In particular, we give a linear-time algorithm that finds all occurrences of a pattern of length m in a text of length n in [formula] comparisons. The pattern preprocessing takes linear time and makes at most 2m comparisons. This algorithm establishes that, in general, searching for a long pattern is easier than searching for a short one. We also show that any algorithm in the family of the algorithms presented must make at least [formula] symbol comparisons, for m = 2 k − 1 and any integer k ≥ 1.

Proceedings ArticleDOI
30 Aug 1992
TL;DR: The authors propose a generalized version of the string matching algorithm by Wagner and Fischer (1974) based on a parametrization of the edit cost, which computes the edit distance of A and B in terms of the parameter r.
Abstract: String matching is a useful concept in pattern recognition that is constantly receiving attention from both theoretical and practical points of view. The authors propose a generalized version of the string matching algorithm by Wagner and Fischer (1974). It is based on a parametrization of the edit cost. The authors assume constant cost for any delete and insert operation, but the cost for replacing a symbol is given as a parameter r. For any two given strings A and B, the algorithm computes the edit distance of A and B in terms of the parameter r. The authors give the new algorithm and study some of its properties. Its time complexity is O(n^2 · m), where n and m are the lengths of the two strings to be compared and n >
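As a point of reference, the sketch below is the classic Wagner-Fischer dynamic program with a constant insert/delete cost and a replacement cost r passed as an argument. It evaluates the distance for one fixed value of r only; the paper's algorithm, which expresses the distance as a function of r, is not reproduced here. The function and parameter names are illustrative.

```python
def edit_distance(a: str, b: str, r: float = 1.0, indel: float = 1.0) -> float:
    """Wagner-Fischer edit distance with replacement cost r and indel cost indel."""
    prev = [j * indel for j in range(len(b) + 1)]   # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i * indel]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + indel,                        # delete ca
                           cur[j - 1] + indel,                     # insert cb
                           prev[j - 1] + (0.0 if ca == cb else r)))  # replace
        prev = cur
    return prev[-1]
```

With r at least twice the indel cost, a replacement is never cheaper than a delete followed by an insert, so the result depends only on the longest common subsequence; smaller values of r make replacements progressively more attractive.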

Book ChapterDOI
Gene Myers
06 Apr 1992
TL;DR: A threshold-sensitive algorithm for approximately matching both network and regular expressions and a backtracking procedure whose order of evaluation is optimal in the sense that its expected time is minimal over all such procedures are presented.
Abstract: We present two algorithmic results pertinent to the matching of patterns of interest in macromolecular sequences. The first result is an output sensitive algorithm for approximately matching network expressions, i.e., regular expressions without Kleene closure. This result generalizes the O(kn) expected-time algorithm of Ukkonen for approximately matching keywords [Ukk85]. The second result concerns the problem of matching a pattern that is a network expression whose elements are approximate matches to network expressions interspersed with specifiable distance ranges. For this class of patterns, it is shown how to determine a backtracking procedure whose order of evaluation is optimal in the sense that its expected time is minimal over all such procedures.

Proceedings Article
01 Sep 1992
TL;DR: A novel technique called the string ruler approach is used to provide a characterization of several basic parameters of suffix trees (dependency among symbols is allowed) and to provide new insights and generalizations of string matching algorithms, particularly the one by Chang and Lawler.
Abstract: The suffix tree is a data structure widely used in algorithms on words and data compression. Despite this, very little is known about its typical behavior. Recently, Chang and Lawler have designed a sublinear expected time algorithm for approximate string matching using simple estimates of some parameters of suffix trees. It seems that any further advances in such an endeavor are subject to a better understanding of suffix tree behavior. In this paper, we use a novel technique called the string ruler approach to provide a characterization of several basic parameters of suffix trees (dependency among symbols is allowed). These findings are used to: (i) settle in the negative the conjecture of Wyner and Ziv regarding the typical behavior of the universal data compression scheme of Lempel and Ziv; (ii) resolve an open problem regarding the length of a block in the Lempel-Ziv parsing algorithm; (iii) provide new insights and generalizations of string matching algorithms, particularly the one by Chang and Lawler.

Journal ArticleDOI
TL;DR: An O(1) time algorithm for string matching is designed on a two-dimensional (n-m+1) × n processor array with a reconfigurable bus system, where n and m are the lengths of the text and pattern respectively.
Abstract: An O(1) time algorithm for string matching is designed on a two-dimensional (n-m+1) × n processor array with a reconfigurable bus system, where n and m are the lengths of the text and pattern respectively.

Journal ArticleDOI
01 Oct 1992
TL;DR: A parallelisation scheme for this algorithm is proposed, which applies to a very general set of errors and allows ASMP to be solved in time T with N processors, with NT in O(mn), thereby achieving optimal speedup.
Abstract: The approximate string matching problem (ASMP) consists of finding all the occurrences of a string of characters X of length m in another string Y of length n, m<

Book ChapterDOI
18 Dec 1992
TL;DR: This work considers a general string matching problem in which an arbitrary many-to-many matching relation is specified and those text positions are sought at which the pattern matches under this relation.
Abstract: In standard string matching, each symbol matches only itself. In other string matching problems, e.g., the string matching with “don't-cares” problem, a symbol may match several symbols. In general, an arbitrary many-to-many matching relation might hold between symbols. We consider a general string matching problem in which such a matching relation is specified and those text positions are sought at which the pattern matches under this relation.
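The problem statement can be made concrete with a naive O(nm) matcher that takes the matching relation as a predicate. This only illustrates what "matching under a relation" means, using a don't-care symbol as the example relation; it is not the algorithm studied in the paper, and the names are hypothetical.

```python
from typing import Callable, List


def match_positions(pattern: str, text: str,
                    matches: Callable[[str, str], bool]) -> List[int]:
    """Start positions where every pattern symbol matches the aligned text symbol."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if all(matches(p, t) for p, t in zip(pattern, text[i:i + m]))]


def dont_care(p: str, t: str) -> bool:
    """Example relation: '?' in the pattern matches any text symbol."""
    return p == "?" or p == t
```

Standard exact matching is the special case where the predicate is plain equality; an arbitrary many-to-many relation can be supplied as a lookup in a set of allowed symbol pairs.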

Journal ArticleDOI
TL;DR: It is shown by theoretical and empirical observations that the pattern matching machine by the presented structure is about 33% smaller and about 1.3 times faster than that by the triple array.

Journal ArticleDOI
TL;DR: A simple hardware algorithm is proposed for the approximate string matching problem, where a string is searched in a large, flat text with a bounded number of insertions, deletions and substitutions.

01 Jan 1992
TL;DR: A new algorithm for approximate regular expression matching is presented, which is the first to achieve a subquadratic asymptotic time for this problem, and a new software tool called 'agrep' is developed, which is the first general purpose approximate pattern matching tool in the UNIX system.
Abstract: In this thesis, we study approximate pattern matching problems. Our study is based on the Levenshtein distance model, where errors considered are 'insertions', 'deletions', and 'substitutions'. In general, given a text string, a pattern, and an integer k, we want to find substrings in the text that match the pattern with no more than k errors. The pattern can be a fixed string, a limited expression, or a regular expression. The problem has different variations with different levels of difficulty depending on the type of pattern as well as the constraints imposed on the matching. We present new results both of theoretical interest and practical value. We present a new algorithm for approximate regular expression matching, which is the first to achieve a subquadratic asymptotic time for this problem. On the practical side, we present new algorithms for approximate pattern matching that are very efficient and flexible. Based on these algorithms, we developed a new software tool called 'agrep', which is the first general purpose approximate pattern matching tool in the UNIX system. 'agrep' is not only usually faster than the UNIX 'grep/egrep/fgrep' family, it also provides many new features such as searching with errors allowed, record-oriented search, AND/OR combined patterns, and mixed exact/approximate matching. 'agrep' has been made publicly available through anonymous ftp from cs.arizona.edu since June 1991.

Book
01 Jan 1992
TL;DR: This paper presents a probabilistic analysis of generalized suffix trees and two algorithms for the longest common subsequence of three (or more) strings.
Abstract: Probabilistic analysis of generalized suffix trees.- A language approach to string searching evaluation.- Pattern matching with mismatches: A probabilistic analysis and a randomized algorithm.- Fast multiple keyword searching.- Heaviest increasing/common subsequence problems.- Approximate regular expression pattern matching with concave gap penalties.- Matrix longest common subsequence problem, duality and Hilbert bases.- From regular expressions to DFA's using compressed NFA's.- Identifying periodic occurrences of a template with applications to protein structure.- Edit distance for genome comparison based on non-local operations.- 3-D substructure matching in protein molecules.- Fast serial and parallel algorithms for approximate tree matching with VLDC's (Extended Abstract).- Grammatical tree matching.- Theoretical and empirical comparisons of approximate string matching algorithms.- Fast and practical approximate string matching.- DZ A text compression algorithm for natural languages.- Multiple alignment with guaranteed error bounds and communication cost.- Two algorithms for the longest common subsequence of three (or more) strings.- Color Set Size problem with applications to string matching.- Computing display conflicts in string and circular string visualization.- Efficient randomized dictionary matching algorithms.- Dynamic dictionary matching with failure functions.

Dissertation
01 Jan 1992
TL;DR: An optimal parallel algorithm to find the edit distance, a metric frequently used to measure distance, between two sequences, and introduces a new problem, the string to string rearrangement problem, that allows movement and inversion of substrings.
Abstract: As the volume of genetic sequence data increases due to improved sequencing techniques and increased interest, the computational tools available to analyze the data are becoming inadequate. This thesis seeks to improve a few of the computational methods available to access and analyze data in the genetic sequence databases. The first two results are parallel algorithms based on previously known sequential algorithms. The third result is a new approach, based on assumptions that we believe make sense in the biological context of the problem, to approximating an ${\cal NP}$-complete problem. The final result is a fundamentally new approach to approximate string matching using the divide and conquer paradigm instead of the dynamic programming approach that has been used almost exclusively in the past. Dynamic programming algorithms to measure the distance between sequences have been known since at least 1972. Recently there has been interest in developing parallel algorithms to measure the distance between two sequences. We have developed an optimal parallel algorithm to find the edit distance, a metric frequently used to measure distance, between two sequences. It is often interesting to find the substrings of length k that appear most frequently in a given string. We give a simple sequential algorithm to solve this problem and an efficient parallel version of the algorithm. The parallel algorithm uses an efficient novel parallel bucket sort. When sequencing a large segment of DNA, the original DNA sequence is reconstructed using the results of sequencing fragments, that may or may not contain errors, of many copies of the original DNA. New algorithms are given to solve the problem of reconstructing the original DNA sequence with and without errors introduced into the fragments. A program based on this algorithm is used to reconstruct the human beta globin region (HUMHBB) when given a set of 300 to 500 mers drawn randomly from the HUMHBB region. 
Approximate string matching is used in a biological context to model the steps of evolution. While such evolution may proceed base by base using the change, insert, or delete operators, there is also evidence that whole genes may be moved or inverted. We introduce a new problem, the string to string rearrangement problem, that allows movement and inversion of substrings. We give a divide and conquer algorithm for finding a rearrangement of one string within another.


Proceedings ArticleDOI
01 Apr 1992
TL;DR: An optimized version of the edit distance algorithm is described which has proven more accurate for a particular commercial application than the existing (benchmark) algorithm.
Abstract: We present an approximate string matching case study. An optimized version of the edit distance algorithm is described which has proven more accurate for a particular commercial application than the existing (benchmark) algorithm. The evolution and nature of the optimization are detailed and test results are presented.

Yi Jiang
01 Jan 1992
TL;DR: A real-time parallel algorithm, which could be implemented on a systolic array using m (the length of the pattern string) very simple processing elements, is proposed; it is well-suited for real-time searching of text databases or biological nucleic acid sequence databases.
Abstract: Given a text string, a much shorter pattern string, and an integer k, parallel algorithms for finding all occurrences of the pattern string in the text string with at most k differences (as defined by edit distance) are discussed. First, a real-time parallel algorithm, which could be implemented on a systolic array using m (the length of the pattern string) very simple processing elements, is proposed. After the algorithm gets started, it outputs the minimum edit distance from the pattern string to a substring of the text string at each time step. Thus, the algorithm is well-suited for real-time searching of text databases or biological nucleic acid sequence databases. Second, several different ways for solving the same problem with different CRCW-PRAM assumptions (priority model, combination model, and common-value model) are developed. This class of algorithms uses O(m × n) or O(m × m × n) processors and achieves a time complexity of O(k). Key words: approximate string matching, edit distance, systolic computation, CRCW-PRAM models.

01 Jan 1992
TL;DR: An optimized version of the edit distance algorithm is described which has proven more accurate for a particular commercial application than the existing (benchmark) algorithm.
Abstract: We present an approximate string matching case study. An optimized version of the edit distance algorithm is described which has proven more accurate for a particular commercial application than the existing (benchmark) algorithm. The evolution and nature of the optimization are detailed and test results are presented.

Proceedings ArticleDOI
30 Aug 1992
TL;DR: Presents some new results on using the theory of error control codes in pattern recognition, using the polynomial cyclic code classification to obtain invariance to the position of the starting point when creating the string representation.
Abstract: Presents some new results on using the theory of error control codes in pattern recognition. For shapes described by means of primitive strings, a recognition algorithm based on string matching is proposed. By using the polynomial cyclic code classification, invariance to the position of the starting point is obtained when creating the string representation.