
Showing papers on "Approximate string matching published in 1991"


Book ChapterDOI
Alfred V. Aho1
02 Jan 1991
TL;DR: This chapter discusses the algorithms for solving string-matching problems that have proven useful for text-editing and text-processing applications; several innovative, theoretically interesting algorithms have been devised that run significantly faster than the obvious brute-force method.
Abstract: This chapter discusses the algorithms for solving string-matching problems that have proven useful for text-editing and text-processing applications. String pattern matching is an important problem that occurs in many areas of science and information processing. In computing, it occurs naturally as part of data processing, text editing, term rewriting, lexical analysis, and information retrieval. Many text editors and programming languages have facilities for matching strings. In biology, string-matching problems arise in the analysis of nucleic acids and protein sequences, and in the investigation of molecular phylogeny. String matching is also one of the central and most widely studied problems in theoretical computer science. The simplest form of the problem is to locate an occurrence of a keyword as a substring in a sequence of characters, which is called the input string. For example, the input string queueing contains the keyword ueuei as a substring. Even for this problem, several innovative, theoretically interesting algorithms have been devised that run significantly faster than the obvious brute-force method.
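The brute-force method the abstract alludes to can be sketched in a few lines (a generic illustration, not code from the chapter):

```python
def brute_force_find(text, keyword):
    """Naive string matching: try every starting position in the text.

    Worst-case O(|text| * |keyword|) time; the faster algorithms the
    chapter surveys (e.g. Knuth-Morris-Pratt) avoid re-examining
    characters that have already been compared.
    """
    n, m = len(text), len(keyword)
    hits = []
    for i in range(n - m + 1):
        if text[i:i + m] == keyword:
            hits.append(i)
    return hits

# The chapter's own example: "queueing" contains "ueuei" as a substring.
print(brute_force_find("queueing", "ueuei"))  # [1]
```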

413 citations


Book ChapterDOI
09 Sep 1991
TL;DR: A scheme is described in which T is first preprocessed to make subsequent searches with different patterns P fast, finding all approximate occurrences P′ of a pattern string P in a text string T such that the edit distance between P and P′ is ≤ k.
Abstract: The problem of finding all approximate occurrences P′ of a pattern string P in a text string T such that the edit distance between P and P′ is ≤ k is considered. We concentrate on a scheme in which T is first preprocessed to make the subsequent searches with different P fast. Two preprocessing methods and the corresponding search algorithms are described. The first is based on suffix automata and is applicable for edit distances with general edit operation costs. The second is a special design for unit-cost edit distance and is based on q-gram lists. The preprocessing needs in both cases time and space O(|T|). The search algorithms run in the worst case in time O(|P||T|) or O(k|T|), and in the best case in time O(|P|).
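The O(|P||T|) worst-case bound quoted above matches the classic dynamic-programming scan (Sellers' algorithm); a minimal sketch of that baseline, not of the paper's preprocessing-based methods:

```python
def approx_occurrences(P, T, k):
    """Report end positions j in T where some substring of T ending at j
    is within unit-cost edit distance k of P.  O(|P||T|) time."""
    m = len(P)
    # col[i] = edit distance between P[:i] and the best-matching suffix
    # of the text read so far; a match may start anywhere, so col[0] = 0.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(T, 1):
        prev_diag, col[0] = col[0], 0
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                  # insertion into P
                      col[i - 1] + 1,              # deletion from P
                      prev_diag + (P[i - 1] != c)) # substitution / match
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends
```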

175 citations


Patent
15 Mar 1991
TL;DR: In this paper, a lossless data compression circuit compares a new data string with a set of comparison data, and produces a sequence of codewords representing the sequence of successive, non-overlapping substrings of the new string.
Abstract: A lossless data compression circuit compares a new data string with a set of comparison data, and produces a sequence of codewords representing a sequence of successive, non-overlapping substrings of the new data string. A shift register stores and shifts the comparison data until all of the characters in the comparison data have been compared with the new data string. A composite reproduction length circuit finds the maximum length string within the set of comparison characters matching substrings of characters in the new data string beginning at each position in the new data string. The composite reproduction length circuit produces a multiplicity of data pairs, one for each position of the new data string. Each data pair comprises a maximum length value, corresponding to the maximum length matching comparison string found for the new data substring starting at the corresponding position, and a pointer value denoting where the maximum length matching comparison string is located in the comparison data. A codeword generator then generates a sequence of codewords representing the new data string, each codeword including one of these data pairs and representing a substring of the new data string. By using N such data compression units in parallel, with each storing an identical new data string, and each unit's shift register storing and shifting a different subset of a specified comparison string, processing time for generating codewords is reduced by a factor of approximately (N-1)/N.
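The (length, pointer) codeword scheme the patent implements in hardware can be illustrated roughly in software (the function name and the greedy parse are illustrative, not from the patent):

```python
def greedy_parse(new_data, comparison):
    """For each position, find the longest substring of new_data that
    matches somewhere in the comparison data, and emit one codeword
    (length, pointer) per matched substring -- or a literal character
    when nothing matches.  The patent performs the length/pointer search
    in parallel with shift registers; here it is a simple nested scan."""
    codewords = []
    i = 0
    while i < len(new_data):
        best_len, best_ptr = 0, -1
        for p in range(len(comparison)):
            l = 0
            while (p + l < len(comparison) and i + l < len(new_data)
                   and comparison[p + l] == new_data[i + l]):
                l += 1
            if l > best_len:
                best_len, best_ptr = l, p
        if best_len > 0:
            codewords.append((best_len, best_ptr))
            i += best_len
        else:
            codewords.append(new_data[i])  # literal: no match anywhere
            i += 1
    return codewords
```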

105 citations


Proceedings ArticleDOI
01 Mar 1991
TL;DR: In this paper, the Smaller Matching Problem and the k-Aligned Ones with Location Problem are solved, yielding an algorithm that runs in O(kn²√(m log m)·√(k log k) + k²n²) time.
Abstract: Finding all occurrences of a non-rectangular pattern of height m and area a in an n×n text with no more than k mismatch, insertion, and deletion errors is an important problem in computer vision. It can be solved using a dynamic programming approach in time O(an²). We show an O(kn²√(m log m)·√(k log k) + k²n²) algorithm which combines convolutions with dynamic programming. At the heart of the algorithm are the Smaller Matching Problem and the k-Aligned Ones with Location Problem. Efficient algorithms to solve both these problems are presented. The results presented in this paper appeared in the proceedings of the Second Symposium on Discrete Algorithms [AF91]. (Authors' affiliations: College of Computing, Georgia Institute of Technology, Atlanta, GA; DIMACS, Rutgers University, Piscataway, NJ. Partially supported by NSF grant IRI-9013055.)
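For reference, the Smaller Matching Problem, in its usual statement, asks for the pattern locations where every pattern element is less than or equal to the aligned text element; the obvious check is quadratic, and the paper's contribution is solving it faster via convolutions. A naive sketch under that standard definition (an assumption, since the abstract does not restate it):

```python
def smaller_matches(text, pattern):
    """Naive check for the Smaller Matching Problem: location i matches
    when pattern[j] <= text[i + j] for every j.  O(|text| * |pattern|)."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1)
            if all(pattern[j] <= text[i + j] for j in range(m))]
```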

95 citations


Journal ArticleDOI
01 Nov 1991
TL;DR: The novel approach formalizes an entropy-weighted activation of prototypes for fuzzy partial matching and brings another dimension to the fuzzy matching criterion: a measure of uncertainty in the partial matching to each prototype.
Abstract: A characteristic approach of approximate reasoning is the partial matching of observations to prototypes. This analysis is cast in the framework of fuzzy set theory and brings another dimension to the fuzzy matching criterion; this dimension is the measure of uncertainty through the concept of subjective entropy. While a similarity measure, in the matching process, activates relevant prototypes, the entropy formalism derived provides a measure of uncertainty in the partial matching to each prototype. The novel approach formalizes an entropy-weighted activation of prototypes for fuzzy partial matching. A methodology is developed for matching an observation to a set of prototypes, making use of a suitable aggregation within a framework of fuzzy integrals. A method of dealing with compound hypotheses is also developed.
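A rough numerical illustration of the idea (not the paper's exact formalism; the similarity measure and the normalization below are assumptions):

```python
import math

def similarity(a, b):
    # a simple fuzzy similarity: 1 minus the mean absolute
    # difference of membership grades (both vectors in [0, 1])
    return 1 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def match_with_entropy(observation, prototypes):
    """Activate each prototype by its similarity to the observation,
    then report the normalized Shannon entropy of the activations as a
    global uncertainty measure: near 0 when one prototype clearly
    dominates, near 1 when the partial matches are ambiguous.
    Assumes at least two prototypes and a nonzero total activation."""
    acts = [similarity(observation, p) for p in prototypes]
    total = sum(acts)
    probs = [a / total for a in acts]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return acts, entropy / math.log(len(prototypes))  # normalized to [0, 1]
```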

24 citations


Book ChapterDOI
Moni Naor1
08 Jul 1991
TL;DR: This work considers the well-known string matching problem, where a text and a pattern are given and the problem is to determine whether the pattern appears as a substring in the text. It provides preprocessing and on-line algorithms such that the preprocessing algorithm runs in linear time and requires linear storage, while the on-line complexity is logarithmic in the text.
Abstract: We consider the well known string matching problem where a text and pattern are given and the problem is to determine if the pattern appears as a substring in the text. The setting we investigate is where the pattern and the text are preprocessed separately at two different sites. At some point in time the two sites wish to determine if the pattern appears in the text (this is the on-line stage). We provide preprocessing and on-line algorithms such that the preprocessing algorithm runs in linear time and requires linear storage and the on-line complexity is logarithmic in the text. We also give an application of the algorithm to parallel data compression, and show how to implement the Lempel Ziv algorithm in logarithmic time with linear number of processors.
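The abstract does not give the construction, but the separate-preprocessing idea can be loosely illustrated with a sorted-suffix index: the text site builds its index once, and the on-line stage needs only a logarithmic number of comparisons. (Note the naive build below is not linear-time, unlike the paper's preprocessing.)

```python
import bisect

def preprocess_text(text):
    """Text-site preprocessing: a sorted list of the text's suffixes.
    Illustrative only -- this naive sort is not the paper's linear-time
    construction."""
    return sorted(text[i:] for i in range(len(text)))

def online_query(suffixes, pattern):
    """On-line stage: O(log n) suffix comparisons via binary search."""
    i = bisect.bisect_left(suffixes, pattern)
    return i < len(suffixes) and suffixes[i].startswith(pattern)
```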

21 citations


01 Jan 1991
TL;DR: An algorithm is presented that runs in sublinear time O((n/m)·k·log_b m) on the average; when k is bounded by the threshold m/log_b m, the expected running time is o(n), and in the worst case the algorithm is O(kn).
Abstract: The k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We treat k not as a constant but as a fraction of m (not necessarily a constant fraction). Previous algorithms require at least O(kn) time (or else exponential space). We are interested in much faster algorithms for restricted cases of the problem, such as when the text string is random and the allowable error rate is not too high (a logarithmic fraction). We have devised an algorithm that runs in sublinear time, O((n/m)·k·log_b m), on the average; when k is bounded by the threshold m/log_b m, the expected running time is o(n). In the worst case, our algorithm is O(kn), but still an improvement in that it is practical and uses O(m) space compared to O(n) or O(m²). We define three problems inspired by molecular biology and describe efficient algorithms based on them.
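The average-case speedup comes from spending little time on most of the text. A related filtering idea, pigeonhole filtration over k+1 pattern pieces, is not this paper's algorithm but is in the same spirit and can be sketched as:

```python
def edit_distance(a, b):
    """Standard unit-cost dynamic programming, O(|a| * |b|)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def filtered_search(pattern, text, k):
    """Pigeonhole filter: split the pattern into k+1 disjoint pieces; any
    occurrence with <= k differences must contain at least one piece
    exactly, so only neighborhoods of exact piece hits need the (costly)
    dynamic-programming verification.  Returns matching (start, end)
    substring spans of the text."""
    m = len(pattern)
    step = m // (k + 1)
    pieces = [pattern[i * step:(i + 1) * step] for i in range(k + 1)]
    hits = set()
    for piece in pieces:
        start = text.find(piece)
        while start != -1:
            # verify a window around the exact hit with full DP
            lo, hi = max(0, start - m - k), min(len(text), start + m + k)
            for a in range(lo, hi):
                for b in range(a + max(1, m - k), min(hi, a + m + k) + 1):
                    if edit_distance(pattern, text[a:b]) <= k:
                        hits.add((a, b))
            start = text.find(piece, start + 1)
    return hits
```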

19 citations


Patent
30 Dec 1991
TL;DR: In this paper, a four-step reduction procedure was proposed to improve the efficiency of an approximate string matching algorithm, using the upper bound, the string length partition criterion and the cut-off criterion.
Abstract: A data string processing system uses fast algorithms for approximate string matching in a dictionary (23). Multiple spelling errors of insert, delete, change and transpose operations on character strings are considered in the disclosed fault model, S-trace. The fault model is used in formulating the algorithms, and a four-step reduction procedure improves the efficiency of an approximate string matching algorithm. These approaches to spelling correction (i.e., using the upper bound, the string length partition criterion and the cut-off criterion) represent three improvements over the basic exhaustive comparison approach; each can be naturally incorporated into the next step. In the fourth step, a hashing scheme avoids comparing the given string with words at large distances when searching in the neighborhood of a small distance. The result is an algorithm that is sub-linear in the number of words in the dictionary (23). An application of the algorithms to a library information system uses original text files (21), information description files (22) and a negative dictionary (23) stored on disks (12).
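Two of the patent's filtering ideas, the string-length partition criterion and the cut-off criterion, can be sketched as follows (unit-cost Levenshtein distance only; the patent's fault model also covers transposition):

```python
def edit_distance_bounded(a, b, k):
    """Dynamic programming with a cut-off: abandon the computation as
    soon as every entry of the current row exceeds k, since the final
    distance can then never come back under the bound."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > k:
            return k + 1  # cut-off criterion: no completion within k
        prev = cur
    return prev[-1]

def approximate_lookup(word, dictionary, k):
    """String-length partition criterion: |len(w) - len(word)| is a lower
    bound on the edit distance, so words failing it are skipped before
    any distance computation."""
    return [w for w in dictionary
            if abs(len(w) - len(word)) <= k
            and edit_distance_bounded(word, w, k) <= k]
```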

3 citations


01 Jan 1991
TL;DR: Issues and examples related to the techniques and difficulties in implementing a fuzzy matching algorithm using SAS® software are presented, including data integrity issues, processing efficiency, blocking, and distance measures.
Abstract: As the sheer volume of data increases, an increase in the need to link people or places or products from one data source to another has followed, particularly in the healthcare field. Unfortunately, as the number of available data sources has sharply increased, the use of “unique” identifiers has not. Further, what constitutes a unique identifier in one data set may have no relationship with a unique identifier in another. This paper introduces the concept of fuzzy matching, which is a technique to link people (or other entities) across data sources when there are no unique identifiers available. Fuzzy matching typically utilizes numerous fields which combine to create approximate matches which then must be evaluated. Issues and examples related to the techniques and difficulties in implementing a fuzzy matching algorithm using SAS® software are presented, including data integrity issues, processing efficiency, blocking, and distance measures.
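A minimal sketch of blocking plus a distance measure, the two techniques the paper highlights (field names, the blocking key, and the threshold are illustrative, not from the paper):

```python
from difflib import SequenceMatcher

def block_key(record):
    # illustrative blocking key: first letter of surname + birth year;
    # only records sharing a key are ever compared
    return (record["last"][0].upper(), record["year"])

def fuzzy_match(left, right, threshold=0.8):
    """Link records across two sources without unique identifiers:
    group the right-hand records into blocks, then score candidate
    pairs within each block by string similarity."""
    blocks = {}
    for r in right:
        blocks.setdefault(block_key(r), []).append(r)
    pairs = []
    for l in left:
        for r in blocks.get(block_key(l), []):
            score = SequenceMatcher(None, l["last"] + l["first"],
                                    r["last"] + r["first"]).ratio()
            if score >= threshold:
                pairs.append((l["id"], r["id"], round(score, 2)))
    return pairs
```

Blocking keeps the pairwise comparison count manageable at the cost of missing matches whose blocking fields themselves contain errors, which is why real linkage pipelines often run several passes with different keys.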

2 citations