
Showing papers on "Approximate string matching published in 1994"


Book
15 Jan 1994
TL;DR: String Matching; String Distance and Common Sequences; Suffix Trees; Approximate String Matching; Repeated Substrings.
Abstract: String Matching; String Distance and Common Sequences; Suffix Trees; Approximate String Matching; Repeated Substrings.

346 citations


Proceedings ArticleDOI
24 May 1994
TL;DR: This paper presents an example of combinatorial pattern discovery: the discovery of patterns in protein databases, which give information that is complementary to the best protein classifier available today.
Abstract: Suppose you are given a set of natural entities (e.g., proteins, organisms, weather patterns, etc.) that possess some important common externally observable properties. You also have a structural description of the entities (e.g., sequence, topological, or geometrical data) and a distance metric. Combinatorial pattern discovery is the activity of finding patterns in the structural data that might explain these common properties based on the metric. This paper presents an example of combinatorial pattern discovery: the discovery of patterns in protein databases. The structural representations we consider are strings, and the distance metric is string edit distance permitting variable-length don't cares. Our techniques incorporate string matching algorithms and novel heuristics for discovery and optimization, most of which generalize to other combinatorial structures. Experimental results of applying the techniques to both generated data and functionally related protein families obtained from the Cold Spring Harbor Laboratory show the effectiveness of the proposed techniques. When we apply the discovered patterns to perform protein classification, they give information that is complementary to the best protein classifier available today.
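For readers unfamiliar with edit distance under variable-length don't cares, the sketch below is a minimal illustration, not the authors' discovery algorithm: a unit-cost dynamic program in which an assumed wildcard symbol '*' in the pattern may absorb any run of text characters at zero cost. Function and parameter names are invented for the example.

    def vldc_edit_distance(pattern, text, wildcard="*"):
        # Unit-cost edit distance where `wildcard` in the pattern matches any
        # (possibly empty) run of text characters for free.
        m, n = len(pattern), len(text)
        INF = float("inf")
        d = [[INF] * (n + 1) for _ in range(m + 1)]
        d[0][0] = 0
        for j in range(1, n + 1):
            d[0][j] = j                      # unmatched text characters cost 1 each
        for i in range(1, m + 1):
            for j in range(n + 1):
                if pattern[i - 1] == wildcard:
                    # absorb nothing (cell above) or one more text character for free
                    d[i][j] = min(d[i - 1][j], d[i][j - 1] if j else INF)
                else:
                    best = d[i - 1][j] + 1                  # delete the pattern character
                    if j:
                        best = min(best,
                                   d[i][j - 1] + 1,         # insert the text character
                                   d[i - 1][j - 1] + (pattern[i - 1] != text[j - 1]))
                    d[i][j] = best
        return d[m][n]

    print(vldc_edit_distance("AB*KL", "ABCDEFGKL"))   # -> 0
    print(vldc_edit_distance("AB*KL", "AXCDEFGKL"))   # -> 1 (one substitution)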

193 citations


Journal ArticleDOI
TL;DR: It is shown how to speed up two string-matching algorithms: the Boyer-Moore algorithm (BM algorithm), and its version called here the reverse factor algorithm (RF algorithm), based on factor graphs for the reverse of the pattern.
Abstract: We show how to speed up two string-matching algorithms: the Boyer-Moore algorithm (BM algorithm), and its version called here the reverse factor algorithm (RF algorithm). The RF algorithm is based on factor graphs for the reverse of the pattern. The main feature of both algorithms is that they scan the text right-to-left from the supposed right position of the pattern. The BM algorithm scans as long as the scanned segment (factor) is a suffix of the pattern. The RF algorithm scans while the segment is a factor of the pattern. Both algorithms make a shift of the pattern, forget the history, and start again. The RF algorithm usually makes bigger shifts than BM, but is quadratic in the worst case. We show that it is enough to remember the last matched segment (represented by two pointers to the text) to speed up the RF algorithm considerably (to make a linear number of inspections of text symbols, with a small coefficient), and to speed up the BM algorithm (to make at most 2·n comparisons). Only constant additional memory is needed for the search phase. We give alternative versions of an accelerated RF algorithm: the first one is based on combinatorial properties of primitive words, and the other two use the power of suffix trees extensively. The paper demonstrates the techniques to transform algorithms, and also shows interesting new applications of data structures representing all subwords of the pattern in compact form.
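As background to the scan-and-shift behaviour described above, here is a minimal sketch of the classical right-to-left Boyer-Moore scan with only the bad-character shift; the paper's accelerated variants, which additionally remember the last matched segment, are not reproduced here.

    def boyer_moore_bad_char(text, pattern):
        # Right-to-left window scan with the bad-character shift only
        # (the simplest Boyer-Moore variant).
        m, n = len(pattern), len(text)
        if m == 0 or m > n:
            return []
        last = {c: i for i, c in enumerate(pattern)}   # rightmost position of each character
        hits, s = [], 0                                # s = current alignment of the pattern
        while s <= n - m:
            j = m - 1
            while j >= 0 and pattern[j] == text[s + j]:
                j -= 1                                 # scan the window right to left
            if j < 0:
                hits.append(s)
                s += 1
            else:
                # shift so the rightmost occurrence of the mismatched text character
                # lines up with it (or skip past it entirely)
                s += max(1, j - last.get(text[s + j], -1))
        return hits

    print(boyer_moore_bad_char("abracadabra", "abra"))   # -> [0, 7]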

190 citations


Journal ArticleDOI
TL;DR: This work gives an algorithm that runs in sublinear time O((n/m) k log_b m) when the text is random and k is bounded by the threshold m/(log_b m + O(1)).
Abstract: Given a text string of length n and a pattern string of length m over a b-letter alphabet, the k differences approximate string matching problem asks for all locations in the text where the pattern occurs with at most k differences (substitutions, insertions, deletions). We treat k not as a constant but as a fraction of m (not necessarily a constant fraction). Previous algorithms require at least O(kn) time (or exponential space). We give an algorithm that runs in sublinear time O((n/m) k log_b m) when the text is random and k is bounded by the threshold m/(log_b m + O(1)). In particular, when k = o(m/log_b m) the expected running time is o(n). In the worst case our algorithm is O(kn), but it is still an improvement in that it is practical and uses O(m) space compared with O(n) or O(m^2). We define three problems motivated by molecular biology and describe efficient algorithms based on our techniques: (1) approximate substring matching, (2) approximate-overlap detection, and (3) approximate codon matching. Respectively, the applications to biology are local similarity search, sequence assembly, and DNA-protein matching.
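The O(kn) dynamic program that the paper improves upon can be sketched as follows; this is a textbook Sellers-style column update, not the sublinear-expected-time algorithm itself, and the names are illustrative.

    def k_differences(text, pattern, k):
        # Report every end position in `text` where `pattern` matches
        # with at most k differences.
        m = len(pattern)
        col = list(range(m + 1))              # column for the empty text prefix
        ends = []
        for j, c in enumerate(text, 1):
            prev_diag = col[0]
            col[0] = 0                        # an occurrence may start at any text position
            for i in range(1, m + 1):
                cur = min(col[i] + 1,                          # extra text character
                          col[i - 1] + 1,                      # unmatched pattern character
                          prev_diag + (pattern[i - 1] != c))   # match or substitution
                prev_diag, col[i] = col[i], cur
            if col[m] <= k:
                ends.append(j)                # pattern ends here with <= k differences
        return ends

    print(k_differences("surgery", "survey", 2))   # -> [5, 6, 7]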

183 citations


Book ChapterDOI
01 Oct 1994

111 citations


Journal ArticleDOI
TL;DR: A new data structure is presented that allows such queries to be answered very quickly even for huge sets, provided the words are not too long and the query word is close to a stored word.

98 citations


Book ChapterDOI
05 Jun 1994
TL;DR: This paper describes how the distance-based sublinear expected time algorithm of Chang and Lawler can be extended to solve the local similarity problem efficiently, and presents a new theoretical result: polynomial-space, constant-fraction-error matching that is provably optimal.
Abstract: The best known rigorous method for biological sequence comparison has been the algorithm of Smith and Waterman. It computes in quadratic time the highest scoring local alignment of two sequences given a (nonmetric) similarity measure and gap penalty. In this paper, we describe how the distance-based sublinear expected time algorithm of Chang and Lawler can be extended to solve the local similarity problem efficiently. We present both a new theoretical result, polynomial-space, constant-fraction-error matching that is provably optimal, and a practical adaptation of it that produces nearly identical results to Smith-Waterman, at speedups of 2X (PAM 120, roughly corresponding to 33% identity) to 10X (PAM 90, 50% identity) or better. Further improvements are anticipated. What makes this possible is the addition of a new constraint on unit score (average score per residue), which filters out both very short alignments and very long alignments with an unacceptably low average. This program is part of a package called Genome Analyst that is being developed at CSHL.
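For context, the quadratic-time Smith-Waterman recurrence that the paper's filter approximates looks roughly like the sketch below; the match/mismatch/gap scores are placeholders rather than the PAM matrices mentioned in the abstract.

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
        # Textbook Smith-Waterman: highest-scoring local alignment score
        # under a simple linear gap penalty.
        rows, cols = len(a) + 1, len(b) + 1
        prev = [0] * cols
        best = 0
        for i in range(1, rows):
            cur = [0] * cols
            for j in range(1, cols):
                score = match if a[i - 1] == b[j - 1] else mismatch
                cur[j] = max(0,                      # a local alignment may restart anywhere
                             prev[j - 1] + score,    # align a[i-1] with b[j-1]
                             prev[j] + gap,          # gap in b
                             cur[j - 1] + gap)       # gap in a
                best = max(best, cur[j])
            prev = cur
        return best

    print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))   # best local alignment score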

89 citations


Journal ArticleDOI
TL;DR: This paper presents algorithms for three problems having to do with approximate matching for such trees with variable length don't cares (VLDCs), with time complexity O(|P| × |D| × min(depth(P), leaves(P)) × min(depth(D), leaves(D))).

89 citations


Patent
Andreas Arning1
11 Jul 1994
TL;DR: In this paper, a system for checking the spelling of words and character strings without the need for a stored dictionary of words was proposed, where the system selects an error-free string and modifies it according to one or more rules which change the error-free string to a possible error string.
Abstract: A system for checking the spelling of words and character strings without the need for a stored dictionary of words and the memory required thereby. The system selects an error-free string and modifies it according to one or more rules which change the error-free string to a possible error string. The rules creating the possible error string can modify the error-free string by predictable character manipulation to yield usual and common errors of the character string. The frequencies of occurrence of both the error and error-free strings within the text are determined. These frequencies are compared to each other and, based upon the comparison, the system decides whether the possible error string is an actual error string. The system can use modifying rules which are psychologically or technically related to the computer system or operator, and rules which correspond to errors common with specialized input methods, such as character and speech recognition.
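A minimal sketch of the idea, assuming two invented error-generation rules and an arbitrary frequency ratio of 10; the patent's actual rules and decision criterion are not reproduced.

    from collections import Counter
    import re

    # Hypothetical error-generation rules (not the patent's rule set):
    # each turns an error-free string into one plausible misspelling.
    RULES = [
        lambda w: w.replace("ei", "ie", 1),                          # "ei" mistyped as "ie"
        lambda w: w[:1] + w[2] + w[1] + w[3:] if len(w) > 3 else w,  # transposed 2nd/3rd letters
    ]

    def probable_errors(text, ratio=10):
        # Flag rule-generated variants that occur in the text but are far rarer
        # than the error-free string they were derived from.
        freq = Counter(re.findall(r"[a-z]+", text.lower()))
        findings = []
        for word, count in freq.items():
            for rule in RULES:
                variant = rule(word)
                if variant != word and variant in freq and count >= ratio * freq[variant]:
                    findings.append((variant, word))   # (probable error, probable correction)
        return findings

    sample = "receive " * 20 + "recieve"
    print(probable_errors(sample))        # -> [('recieve', 'receive')]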

64 citations


Book ChapterDOI
10 Jun 1994
TL;DR: The structure of finite automata recognizing sets of the form A*p, for some word p, is studied, and the results obtained are used to improve the Knuth-Morris-Pratt string searching algorithm.
Abstract: In this paper we study the structure of finite automata recognizing sets of the form A*p, for some word p, and use the results obtained to improve the Knuth-Morris-Pratt string searching algorithm. We also determine the average number of nontrivial edges of the above automata.
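The automaton in question is, in essence, the string-matching DFA for A*p; a textbook construction (not the paper's refinement) might look like this.

    def matcher_automaton(pattern):
        # Transition table of the DFA recognizing A*p: state q is the length of
        # the longest prefix of p that is a suffix of the input read so far.
        m = len(pattern)
        fail = [0] * (m + 1)                    # classic KMP border table
        k = 0
        for q in range(2, m + 1):
            while k and pattern[k] != pattern[q - 1]:
                k = fail[k]
            if pattern[k] == pattern[q - 1]:
                k += 1
            fail[q] = k
        delta = [dict() for _ in range(m + 1)]
        for q in range(m + 1):
            for c in set(pattern):
                k = fail[q] if q == m else q
                while k and pattern[k] != c:
                    k = fail[k]
                delta[q][c] = k + 1 if pattern[k] == c else 0
        return delta

    def find_all(text, pattern):
        delta, q, m, hits = matcher_automaton(pattern), 0, len(pattern), []
        for i, c in enumerate(text):
            q = delta[q].get(c, 0)              # characters outside the pattern reset to state 0
            if q == m:
                hits.append(i - m + 1)
        return hits

    print(find_all("ababcabab", "abab"))        # -> [0, 5]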

56 citations


Book
30 May 1994
TL;DR: The text describes and evaluates the BF, KMP, BM, and KR algorithms, discusses improvements for string pattern matching machines, and details a technique for detecting and removing the redundant operation of the AC machine.
Abstract: From the Publisher: Introduces the basic concepts and characteristics of string pattern matching strategies and provides numerous references for further reading. The text describes and evaluates the BF, KMP, BM, and KR algorithms, discusses improvements for string pattern matching machines, and details a technique for detecting and removing the redundant operation of the AC machine. Also explored are typical problems in approximate string matching. In addition, the reader will find a description for applying string pattern matching algorithms to multidimensional matching problems, an investigation of numerous hardware-based solutions for pattern matching, and an examination of hardware approaches for full text search. The first chapter's survey paper describes the basic concepts of algorithm classifications. The five chapters that follow include 15 papers further illustrating these classifications: single keyword matching, matching sets of keywords, approximate string matching, multidimensional matching, and hardware matching.

Patent
Richard Hull1
28 Oct 1994
TL;DR: In this paper, an improved method of matching a query string against a plurality of candidate strings replaces a highly computationally intensive string edit distance calculation with a less computationally intensive lower bound estimate.
Abstract: An improved method of matching a query string against a plurality of candidate strings replaces a highly computationally intensive string edit distance calculation with a less computationally intensive lower bound estimate. The lower bound estimate of the string edit distance between the two strings is calculated by equalising the lengths of the two strings by adding padding elements to the shorter one. The elements of the strings are then sorted and the substitution costs between corresponding elements are summed.
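The following is a direct transcription of the estimate as described above; the element type, padding element, and substitution-cost function are placeholders, and the lower-bound guarantee depends on the cost model assumed in the patent.

    def lower_bound_estimate(query, candidate, sub_cost, pad):
        # Equalise lengths with padding, sort both element sequences,
        # and sum the pairwise substitution costs.
        n = max(len(query), len(candidate))
        a = sorted(list(query) + [pad] * (n - len(query)))
        b = sorted(list(candidate) + [pad] * (n - len(candidate)))
        return sum(sub_cost(x, y) for x, y in zip(a, b))

    # Toy usage with numeric elements and an absolute-difference cost;
    # real substitution costs and the padding element come from the application.
    cost = lambda x, y: abs(x - y)
    print(lower_bound_estimate([3, 1, 2], [1, 2, 5, 3], cost, pad=0))   # -> 5

A candidate whose estimate already exceeds the best edit distance found so far can then be discarded without running the full dynamic program.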

Journal ArticleDOI
TL;DR: The experiments show that performing approximate string matching for a large dictionary in real-time on an ordinary sequential computer under the multiple fault model is feasible.
Abstract: An approach to designing very fast algorithms for approximate string matching in a dictionary is proposed. Multiple spelling errors corresponding to insert, delete, change, and transpose operations on character strings are considered in the fault model. The design of very fast approximate string matching algorithms through a four-step reduction procedure is described. The final and most effective step uses hashing techniques to avoid comparing the given word with words at large distances. The technique has been applied to a library book catalog textbase. The experiments show that performing approximate string matching for a large dictionary in real-time on an ordinary sequential computer under our multiple fault model is feasible.
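The four-step reduction itself is not reproduced here; the sketch below only illustrates the general flavor of hashing words into buckets so that a query never reaches the full distance computation for words that are obviously too far away. Bucketing by length is an assumption made for the example.

    from collections import defaultdict

    def levenshtein(a, b):
        # Plain quadratic edit distance, used only on the surviving candidates.
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
            prev = cur
        return prev[-1]

    class ApproxDictionary:
        # Words are bucketed in a hash table keyed by length, so a query only
        # reaches the dynamic program for words whose length is within k.
        def __init__(self, words):
            self.by_len = defaultdict(list)
            for w in words:
                self.by_len[len(w)].append(w)

        def lookup(self, query, k):
            out = []
            for L in range(len(query) - k, len(query) + k + 1):
                for w in self.by_len.get(L, ()):
                    if levenshtein(query, w) <= k:
                        out.append(w)
            return out

    d = ApproxDictionary(["catalog", "catalyst", "dialog", "epilogue"])
    print(d.lookup("catalogg", 1))    # -> ['catalog']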

Book ChapterDOI
Tadao Takaoka1
25 Aug 1994
TL;DR: A more general analysis of expected time is given for the simplified algorithm in the one-dimensional case under a non-uniform probability distribution, and it is shown that the method can easily be generalized to the two-dimensional approximate pattern matching problem with sublinear expected time.
Abstract: We simplify in this paper the algorithm by Chang and Lawler for the approximate string matching problem, by adopting the concept of sampling. We have a more general analysis of expected time with the simplified algorithm for the one-dimensional case under a non-uniform probability distribution, and we show that our method can easily be generalized to the two-dimensional approximate pattern matching problem with sublinear expected time.

Journal ArticleDOI
TL;DR: An implementation of the dynamic programming algorithm for this problem is given that packs several characters and mod-4 integers into a computer word, and a 21-fold parallelism over the conventional algorithm can be obtained.
Abstract: Given a text string, a pattern string, and an integer k, the problem of approximate string matching with k differences is to find all substrings of the text string whose edit distance from the pattern string is less than k. The edit distance between two strings is defined as the minimum number of differences, where a difference can be a substitution, insertion, or deletion of a single character. An implementation of the dynamic programming algorithm for this problem is given that packs several characters and mod-4 integers into a computer word. Thus, it is a parallelization of the conventional implementation that runs on ordinary processors. Since a small alphabet means that characters have short binary codes, the degree of parallelism is greatest for small alphabets and for processors with long words. For an alphabet of size 8 or smaller and a 64 bit processor, a 21-fold parallelism over the conventional algorithm can be obtained. Empirical comparisons to the basic dynamic programming algorithm, to a version of Ukkonen's algorithm, to the algorithm of Galil and Park, and to a limited implementation of the Wu-Manber algorithm are given.
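The packing rests on the fact that vertically adjacent cells of the dynamic-programming table differ by at most one, so each cell can be encoded in two bits (mod 4). The sketch below only demonstrates that property on an ordinary column update; the paper's actual word-packing arithmetic is not reproduced.

    def column_deltas(pattern, text_char, prev_col):
        # One column update of the k-differences DP, returning the new column and
        # the vertical differences D[i][j] - D[i-1][j], which always lie in
        # {-1, 0, 1} and can therefore be stored in 2 bits (mod 4) and packed.
        col = [0]                       # D[0][j] = 0: a match may start anywhere
        diag = prev_col[0]
        for i in range(1, len(pattern) + 1):
            cur = min(prev_col[i] + 1, col[-1] + 1,
                      diag + (pattern[i - 1] != text_char))
            diag = prev_col[i]
            col.append(cur)
        deltas = [col[i] - col[i - 1] for i in range(1, len(col))]
        assert all(d in (-1, 0, 1) for d in deltas)
        return col, deltas

    pattern, text = "survey", "surgery"
    col = list(range(len(pattern) + 1))
    for c in text:
        col, deltas = column_deltas(pattern, c, col)
    print(col[-1], deltas)              # final distance and its 2-bit-encodable deltas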

Patent
30 Sep 1994
TL;DR: In this paper, a VLSI circuit structure for computing the edit distance between two strings over a given alphabet is presented, which can perform approximate string matching for variable edit costs, and does not place any constraint on the lengths of the strings that can be compared.
Abstract: The edit distance between two strings a_1, ..., a_m and b_1, ..., b_n is the minimum cost of a sequence of editing operations (insertions, deletions and substitutions) that convert one string into the other. This invention provides a VLSI circuit structure for computing the edit distance between two strings over a given alphabet. The circuit structure can perform approximate string matching for variable edit costs. More importantly, the circuit structure does not place any constraint on the lengths of the strings that can be compared. It makes use of simple basic cells and requires regular nearest-neighbor communication, which makes it suitable for VLSI implementation.
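Below is a plain sequential rendering of the recurrence with variable operation costs that such a circuit evaluates cell by cell; the cost functions are illustrative, and the systolic, nearest-neighbor evaluation order of the invention is not modeled.

    def weighted_edit_distance(a, b, ins, dele, sub):
        # Edit distance with arbitrary per-character insertion, deletion
        # and substitution costs.
        m, n = len(a), len(b)
        D = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            D[i][0] = D[i - 1][0] + dele(a[i - 1])
        for j in range(1, n + 1):
            D[0][j] = D[0][j - 1] + ins(b[j - 1])
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i][j] = min(D[i - 1][j] + dele(a[i - 1]),          # delete a_i
                              D[i][j - 1] + ins(b[j - 1]),           # insert b_j
                              D[i - 1][j - 1] + sub(a[i - 1], b[j - 1]))
        return D[m][n]

    # Toy cost model: substitutions between vowels are cheap.
    vowels = set("aeiou")
    sub = lambda x, y: 0 if x == y else (0.5 if x in vowels and y in vowels else 1)
    print(weighted_edit_distance("color", "colour", lambda c: 1, lambda c: 1, sub))
    # -> 1.0 (a single insertion of 'u')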


Book ChapterDOI
05 Jun 1994
TL;DR: It can be decided in O(||T|| + l^2·|S − T| + l·δ·||S − T||) time whether or not there exists a δ-characteristic string of T under S, where l denotes the length of a shortest string in T, |S − T| the cardinality of S − T, and ||T|| the size of T.
Abstract: The difference between two strings is the minimum number of editing steps (insertions, deletions, changes) that convert one string into the other. Let S be a finite set of strings, let T be a subset of S, and let δ be a positive integer. A δ-characteristic string of T under S is a string that is a common substring of T and that has at least δ differences from any substring of any string in S − T. In this paper, the following result is presented. It can be decided in O(||T|| + l^2·|S − T| + l·δ·||S − T||) time whether or not there exists a δ-characteristic string of T under S, where l denotes the length of a shortest string in T, |S − T| the cardinality of S − T, and ||T|| the size of T. If such a string exists, then all the shortest δ-characteristic strings of T under S can also be obtained in that time.

Proceedings Article
01 Jan 1994
TL;DR: In this article, the authors show how to break symmetries that occur in the process of assigning labels using the Deterministic Coin Tossing (DCT) technique, and thereby reduce the number of labeled substrings to linear.
Abstract: Suffix trees are the main data structure in string matching algorithms. There are several serial algorithms for suffix tree construction which run in linear time, but the number of operations in the only parallel algorithm available, due to Apostolico, Iliopoulos, Landau, Schieber and Vishkin, is proportional to n log n. The algorithm is based on labeling substrings, similar to a classical serial algorithm, with the same operations bound, by Karp, Miller and Rosenberg. We show how to break symmetries that occur in the process of assigning labels using the Deterministic Coin Tossing (DCT) technique, and thereby reduce the number of labeled substrings to linear.

Book ChapterDOI
05 Jun 1994
TL;DR: In this conference version, only the Bernoulli model (i.e., a memoryless channel) is considered, but the results hold under much weaker probabilistic assumptions.
Abstract: A practical suboptimal algorithm (source coding) for lossy (non-faithful) data compression is discussed. This scheme is based on approximate string matching, and it naturally extends the lossless (faithful) Lempel-Ziv data compression scheme. The construction of the algorithm is based on a careful probabilistic analysis of an approximate string matching problem that is of interest in its own right. This extends the Wyner-Ziv model to a lossy environment. In this conference version, we consider only the Bernoulli model (i.e., a memoryless channel), but our results hold under much weaker probabilistic assumptions.

Book ChapterDOI
25 Aug 1994
TL;DR: This work presents a linear-time algorithm for deciding whether or not there exists a characteristic string of T under S; if such a string exists, the algorithm returns all the shortest characteristic strings of T under S in that time.
Abstract: Let S be a finite set of strings and let T be a subset of S. A characteristic string of T under S is a string that is a common substring of T and that is not a substring of any string in S-T. We present a linear-time algorithm for deciding whether or not there exists a characteristic string of T under S. If such a string exists, then the algorithm returns all the shortest characteristic strings of T under S in that time.
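For small inputs the definition can be checked by brute force, as in the sketch below; the paper's contribution is doing this in linear time (with suffix-tree machinery), which this illustration does not attempt.

    def shortest_characteristic_strings(S, T):
        # Brute force: enumerate substrings of the shortest member of T,
        # shortest candidates first.
        members = list(T)
        others = [s for s in S if s not in members]
        base = min(members, key=len)          # a characteristic string must occur in every member
        n = len(base)
        for length in range(1, n + 1):
            found = set()
            for i in range(n - length + 1):
                c = base[i:i + length]
                if all(c in t for t in members) and not any(c in s for s in others):
                    found.add(c)
            if found:
                return sorted(found)
        return []                             # no characteristic string exists

    S = ["GATTACA", "GATTAGA", "CATTAGA"]
    print(shortest_characteristic_strings(S, ["GATTACA", "GATTAGA"]))   # -> ['GAT']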

Book ChapterDOI
Tatsuya Akutsu1
05 Jun 1994
TL;DR: This paper presents parallel and serial approximate matching algorithms for strings with don't care characters, based on Landau and Vishkin's approximate string matching algorithm and Fischer and Paterson's exact string matching algorithm with don't care characters.
Abstract: This paper presents parallel and serial approximate matching algorithms for strings with don't care characters. They are based on Landau and Vishkin's approximate string matching algorithm and Fischer and Paterson's exact string matching algorithm with don't care characters. The serial algorithm works in O(√(km) · n · log|Σ| · log^2(m/k) · log log(m/k)) time, and the parallel algorithm works in O(k log m) time using O(√(m/k) · n · log|Σ| · log(m/k) · log log(m/k)) processors on a CRCW-PRAM, where n denotes the length of a text string, m denotes the length of a pattern string, k denotes the maximum number of differences, and Σ denotes the alphabet (i.e., the set of characters). Several extensions are also described.

Journal ArticleDOI
01 Jan 1994
TL;DR: Infrared spectra are identified by matching to a standard database using a fuzzy peak matching algorithm, and the results obtained can be compared to those obtained from conventional X-ray diffraction analysis.
Abstract: Infrared spectra are identified by matching to a standard database using a fuzzy peak matching algorithm.

Book ChapterDOI
26 Sep 1994
TL;DR: In this article, the exact comparison complexity of the string prefix-matching problem in the deterministic sequential comparison model with equality tests was studied, and almost tight lower and upper bounds on the number of comparisons required in the worst case by on-line prefix matching algorithms for any fixed pattern and variable text were derived.
Abstract: In this paper we study the exact comparison complexity of the string prefix-matching problem in the deterministic sequential comparison model with equality tests. We derive almost tight lower and upper bounds on the number of comparisons required in the worst case by on-line prefix-matching algorithms for any fixed pattern and variable text. Unlike previous results on the comparison complexity of string-matching and prefix-matching algorithms, our bounds are almost tight for any particular pattern.

Book
01 May 1994
TL;DR: A space efficient algorithm for finding the best non-overlapping alignment score and a lossy data compression scheme that allows fast searching directly in the compressed file.
Abstract: Contents: A space efficient algorithm for finding the best non-overlapping alignment score; The parameterized complexity of sequence alignment and consensus; Computing all suboptimal alignments in linear space; Approximation algorithms for multiple sequence alignment; A context dependent method for comparing sequences; Fast identification of approximately matching substrings; Alignment of trees - An alternative to tree edit; Parametric recomputing in alignment graphs; A lossy data compression based on string matching: Preliminary analysis and suboptimal algorithms; A text compression scheme that allows fast searching directly in the compressed file; An alphabet-independent optimal parallel search for three dimensional pattern; Unit route upper bound for string-matching on hypercube; Computation of squares in a string; Minimization of sequential transducers; Shortest common superstrings for strings of random letters; Maximal common subsequences and minimal common supersequences; Dictionary-matching on unbounded alphabets: Uniform length dictionaries; Proximity matching using fixed-queries trees; Query primitives for tree-structured data; Multiple matching of parameterized patterns; Approximate string matching with don't care characters; Matching with matrix norm minimization; Approximate string matching and local similarity; Polynomial-time algorithms for computing characteristic strings; Recent methods for RNA modeling using stochastic context-free grammars; Efficient bounds for oriented chromosome inversion distance.

Book ChapterDOI
01 Jan 1994
TL;DR: A method for search in a dictionary is proposed, based on knowledge about error statistics at the output of an HR classifier, and its applicability to HR is assessed in terms of the Damerau-Levenshtein metric that is frequently used to define similarity of HR strings.
Abstract: A brief analysis of existing methods for performance improvement of handwriting recognition (HR) that are based on text-to-lexicon matching postprocessing is provided in the paper. A method for search in a dictionary is proposed, based on knowledge about error statistics at the output of an HR classifier. The method is developed for the most probable misspellings, namely character substitution, omission, insertion, and neighboring character reversal. The method's applicability to HR is assessed in terms of the Damerau-Levenshtein metric, which is frequently used to define similarity of HR strings. The proposed error modelling is distributed between the dictionary structure and the processing algorithm. Character deletion is the only operation utilized; thus, a substantial number of errors is implicitly represented, resulting in comparatively low processing time and space complexity. The method is implemented in a software subsystem for fault-tolerant keyboard input processing of natural-language names. The experimental results are briefly reported.
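One common way to realize a deletion-only error model is a deletion-variant index, sketched below under the assumption of at most one deletion per side; this is an illustration of the idea, not the authors' exact dictionary structure.

    from collections import defaultdict
    from itertools import combinations

    def deletes(word, d):
        # All strings obtainable from `word` by deleting up to d characters.
        out = {word}
        for k in range(1, min(d, len(word)) + 1):
            for idx in combinations(range(len(word)), k):
                out.add("".join(c for i, c in enumerate(word) if i not in idx))
        return out

    class DeletionDictionary:
        # Index every dictionary word under its deletion variants; a recognized
        # string is corrected by intersecting its own deletion variants with the
        # index. One deletion on each side covers a substitution, omission,
        # insertion, or neighboring-character reversal.
        def __init__(self, words, d=1):
            self.d = d
            self.index = defaultdict(set)
            for w in words:
                for v in deletes(w, d):
                    self.index[v].add(w)

        def candidates(self, recognized):
            hits = set()
            for v in deletes(recognized, self.d):
                hits |= self.index.get(v, set())
            return hits

    lexicon = DeletionDictionary(["hamburg", "homburg", "hamberg"])
    print(sorted(lexicon.candidates("hanburg")))   # -> ['hamburg'] ('m' read as 'n')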

Proceedings ArticleDOI
01 Jan 1994
TL;DR: The authors provide a data structure for approximate string searching and discuss the searching algorithm.
Abstract: Summary form only given. The problem of searching for approximate occurrences of a pattern in a set of strings is called the approximate string searching problem. The recent interest in this problem comes from DNA sequence analysis: whenever a sequence investigator determines a new sequence, one of the first things he must do is to compare it with all available sequences to see if it resembles something already known. The authors provide a data structure for approximate string searching and discuss the searching algorithm.


Journal Article
TL;DR: A string check function is proposed that removes extraneous neighboring characters from recognized character strings based on notation rules and determines whether error correction is required; experiments show the function to be effective.
Abstract: An algorithm for recognizing numeric strings with notation rules by using string checking has been developed. The proposed string check function removes extraneous characters from recognized character strings by using the notation rules. This function also determines whether to carry out recognition error correction. In this correction process, recognized characters are compared with strings in a dictionary. Errors in character strings are automatically corrected to meaningful letters by using the notation rules and dictionary. The string space of the dictionary to be compared is restricted based on the notation rules; this reduces processing time. Systems that recognize strings such as an I.D. code simply as a numeric string are thus unsuitable; these strings should be read as a word. This task is addressed by utilizing notation rules and a dictionary. We propose a string check function that removes extraneous neighboring characters from recognized character strings based on notation rules. It also determines whether error correction is required. Correction is done by comparing recognized character strings with strings in a dictionary. This function was determined to be effective by conducting experiments using a sample set of 3,983 input pages: the check function improved the string recognition rate from 98.5% to 99.7% and decreased the error rate by 98%.

Journal ArticleDOI
TL;DR: A data type called padded string is presented: a string type whose operations run faster than those of traditional strings such as char* in the C language.
Abstract: A string is a sequence of characters. Operations such as copy and comparison on strings are usually performed character by character. This note presents a data type called padded string, a string type with faster operations. A padded string is a sequence of machine words. For 32-bit machines, four characters can be operated on in one machine instruction. Operations on padded strings can then run faster than on traditional strings such as char* in the C language. An experiment sorting an array of strings shows a 24% speedup using padded strings.
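A rough Python rendering of the idea (the note itself targets char* strings and real 32-bit instructions in C): characters are packed four to a word and compared word by word; the ASCII and big-endian packing assumptions are mine.

    import struct

    def pad_to_words(s, word_bytes=4):
        # Pack a byte string into fixed-size machine words, padding with NULs.
        data = s.encode("ascii")
        data += b"\0" * ((-len(data)) % word_bytes)
        n = len(data) // word_bytes
        return struct.unpack(f">{n}I", data)   # big-endian words preserve lexicographic order

    def padded_compare(a, b):
        # Compare two padded strings word by word instead of character by character.
        wa, wb = pad_to_words(a), pad_to_words(b)
        for x, y in zip(wa, wb):
            if x != y:
                return -1 if x < y else 1
        return (len(wa) > len(wb)) - (len(wa) < len(wb))

    print(padded_compare("approximate", "approximation"))   # -> -1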