
Showing papers on "Approximate string matching published in 1997"


Book
01 Jan 1997
TL;DR: In this book, the authors introduce suffix trees and their uses, cover core string edits, alignments and dynamic programming for sequence comparison, and show how these core problems can be extended.
Abstract: Part I. Exact String Matching: The Fundamental String Problem: 1. Exact matching: fundamental preprocessing and first algorithms 2. Exact matching: classical comparison-based methods 3. Exact matching: a deeper look at classical methods 4. Semi-numerical string matching Part II. Suffix Trees and their Uses: 5. Introduction to suffix trees 6. Linear time construction of suffix trees 7. First applications of suffix trees 8. Constant time lowest common ancestor retrieval 9. More applications of suffix trees Part III. Inexact Matching, Sequence Alignment and Dynamic Programming: 10. The importance of (sub)sequence comparison in molecular biology 11. Core string edits, alignments and dynamic programming 12. Refining core string edits and alignments 13. Extending the core problems 14. Multiple string comparison: the Holy Grail 15. Sequence database and their uses: the motherlode Part IV. Currents, Cousins and Cameos: 16. Maps, mapping, sequencing and superstrings 17. Strings and evolutionary trees 18. Three short topics 19. Models of genome-level mutations.

3,904 citations


Book
29 May 1997
TL;DR: This tutorial jumps right into the meat of the book without dragging you through the basic concepts of programming.
Abstract: 1. Off-Line Serial Exact String Searching 2. Off-Line Parallel Exact String Searching 3. On-Line String Searching 4. Serial Computations of Levenshtein Distances 5. Parallel Computations of Levenshtein Distances 6. Approximate String Searching 7. Dynamic Programming: Special Cases 8. Shortest Common Superstrings 9. Two Dimensional Matching 10. Suffix Tree Data Structures for Matrices 11. Tree Pattern Matching

307 citations


Patent
25 Jan 1997
TL;DR: In this paper, a document browser for electronic filing systems, which supports pen-based markup and annotation, is described, where the user may electronically write notes (60-64) anywhere on a page (32, 38) and then later search for those notes using the approximate ink matching (AIM) technique.
Abstract: In summary there is disclosed a document browser for electronic filing systems, which supports pen-based markup and annotation. The user may electronically write notes (60-64) anywhere on a page (32, 38) and then later search for those notes using the approximate ink matching (AIM) technique. The technique segments (104) the user-drawn strokes, extracts (108) and vector quantizes (112) features contained in those strokes. An edit distance comparison technique (118) is used to query the database (120), rendering the system capable of performing approximate or partial matches to achieve fuzzy search capability.

299 citations
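As a rough illustration of the comparison step in the AIM approach above (the stroke segmentation, feature extraction and vector quantization are not shown, and the data below is hypothetical), an edit-distance query over sequences of codebook indices might look like this sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of VQ codebook indices."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert from b
                           prev[j - 1] + cost))  # substitute / match
        prev = cur
    return prev[-1]

# Rank stored ink annotations by distance to the query strokes' code sequence.
query_codes = [3, 17, 17, 5]                                 # hypothetical VQ indices
database = {"note-1": [3, 17, 5], "note-2": [9, 2, 2, 11]}   # hypothetical entries
ranked = sorted(database, key=lambda k: edit_distance(query_codes, database[k]))
print(ranked)   # ['note-1', 'note-2']
```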



Journal ArticleDOI
TL;DR: This paper examines string block edit distance, in which two strings A and B are compared by extracting collections of substrings and placing them into correspondence, and shows that several variants are NP-complete while giving polynomial-time algorithms for solving the remainder.

118 citations



Journal Article
TL;DR: An O(n^4 log n) time algorithm is shown for the pattern matching problem for strings which are succinctly described in terms of straight-line programs, in which the constants are symbols and the only operation is the concatenation.
Abstract: We investigate the time complexity of the pattern matching problem for strings which are succinctly described in terms of straight-line programs, in which the constants are symbols and the only operation is the concatenation. Most strings of descriptive size n are of exponential length with respect to n. We show an O(n^4 log n) time algorithm for this problem. The crucial point in our algorithm is the succinct representation of all periods of a (possibly long) string described in this manner. We also show a (rather straightforward) result that a very simple extension of the pattern-matching problem for shortly described strings is NP-complete.

91 citations
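The straight-line programs in the entry above describe a string by rules that are either single characters or concatenations of earlier rules, so a program of size n can denote a string of length exponential in n. A minimal sketch of the representation, computing the described length without expanding the string (illustrative only, not the paper's O(n^4 log n) matching algorithm):

```python
# A straight-line program: each rule is a single character or the
# concatenation of two previously defined rules.
slp = [
    ("char", "a"),   # X0 = "a"
    ("char", "b"),   # X1 = "b"
    ("cat", 0, 1),   # X2 = X0 X1 = "ab"
    ("cat", 2, 2),   # X3 = X2 X2 = "abab"
    ("cat", 3, 3),   # X4 = X3 X3 = "abababab"  (length doubles with each rule)
]

def described_length(slp):
    """Length of the string denoted by the last rule, without expanding it."""
    lengths = []
    for rule in slp:
        if rule[0] == "char":
            lengths.append(1)
        else:
            _, left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths[-1]

print(described_length(slp))   # 8, from a program with only 5 rules
```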


Patent
Lauri Karttunen
16 May 1997
TL;DR: In this paper, a processor-implemented method of modifying a string of a regular language, which includes at least two symbols and two predetermined substrings, is described; the processor replaces the matching substring with the string of the lower language associated with the selected preselected substring and outputs the modified string.
Abstract: A processor-implemented method of modifying a string of a regular language that includes at least two symbols and at least two predetermined substrings is disclosed. Upon receipt of the string, the processor determines an initial position within the string of a substring matching one of the preselected substrings. To make this determination, the processor matches symbols of the string either starting from the left and proceeding to the right or starting from the right and proceeding to the left. After identifying the initial position, the processor then selects either the longest or the shortest of the preselected substrings. The processor then replaces the matching substring with the string of the lower language associated with the selected preselected substring and outputs the modified string.

86 citations
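As a loose illustration of the directional, longest-match replacement described above (the patent realizes this with finite-state transducers over regular languages; the direct scan below is only a sketch, and the mapping is hypothetical):

```python
def replace_leftmost_longest(s, mapping):
    """Scan left to right; at each position apply the longest matching
    replacement from `mapping` (a dict of substring -> replacement)."""
    out, i = [], 0
    while i < len(s):
        best = None
        for key in mapping:
            if s.startswith(key, i) and (best is None or len(key) > len(best)):
                best = key
        if best is None:
            out.append(s[i])      # no preselected substring starts here
            i += 1
        else:
            out.append(mapping[best])
            i += len(best)
    return "".join(out)

print(replace_leftmost_longest("abcd", {"ab": "X", "abc": "Y"}))   # "Yd" (longest match wins)
```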


Patent
23 Jul 1997
TL;DR: In this paper, a dictionary based data compression and decompression system is proposed, where, in the compressor, when a partial string W and a character C are matched in the dictionary, a new string is entered into the dictionary with C as an extension character on the string PW where P is the string corresponding to the last output compressed code signal.
Abstract: A dictionary based data compression and decompression system where, in the compressor, when a partial string W and a character C are matched in the dictionary, a new string is entered into the dictionary with C as an extension character on the string PW where P is the string corresponding to the last output compressed code signal. An update string is entered into the compression dictionary for each input character that is read and matched. The updating is immediate and interleaved with the character-by-character matching of the current string. The update process continues until the longest match is found in the dictionary. The code of the longest matched string is output in a string matching cycle. If a single character or multi-character string "A" exists in the dictionary, the string AAA . . . A is encoded in two compressed code signals regardless of the string length. This encoding results in an unrecognized code signal at the decompressor. The decompressor, in response to an unrecognized code signal, enters update strings into the decompressor dictionary in accordance with the recovered string corresponding to the previously received code signal, the unrecognized code signal, the extant code of the decompressor and the number of characters in the previously recovered string.

85 citations
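The patent above is a variant of dictionary compression in the LZ78/LZW family; its distinctive points are the immediate, interleaved dictionary updates and the two-code encoding of runs, both of which can make the decompressor see a code it has not yet defined. The sketch below is the classic LZW scheme, including the standard handling of such an unrecognized code, and is not the patent's exact update rule:

```python
def lzw_compress(data):
    """Classic LZW; the patented scheme differs in when dictionary updates occur."""
    dictionary = {chr(c): c for c in range(256)}
    next_code, w, out = 256, "", []
    for ch in data:
        wc = w + ch
        if wc in dictionary:
            w = wc
        else:
            out.append(dictionary[w])
            dictionary[wc] = next_code
            next_code += 1
            w = ch
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes):
    dictionary = {c: chr(c) for c in range(256)}
    next_code = 256
    w = chr(codes[0])
    out = [w]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                    # unrecognized code: it can only denote
            entry = w + w[0]     # the previous string plus its first character
        out.append(entry)
        dictionary[next_code] = w + entry[0]
        next_code += 1
        w = entry
    return "".join(out)

codes = lzw_compress("AAAAAAAA")           # a run compresses to a handful of codes
assert lzw_decompress(codes) == "AAAAAAAA"
```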


Proceedings ArticleDOI
01 Jan 1997
TL;DR: The notion of approximate word matching is introduced, and it is shown how it can be used to improve detection and categorization of variant forms in bibliographic entries.
Abstract: As more online databases are integrated into digital libraries, the issue of quality control of the data becomes increasingly important, especially as it relates to the effective retrieval of information. The need to discover and reconcile variant forms of strings in bibliographic entries, i.e., authority work, will become more difficult. Spelling variants, misspellings, and transliteration differences will all increase the difficulty of retrieving information. Approximate string matching has traditionally been used to help with this problem. In this paper we introduce the notion of approximate word matching and show how it can be used to improve detection and categorization of variant forms.

63 citations


Patent
30 Jul 1997
TL;DR: In this article, the authors proposed a data matching mechanism for string replication compression using a dictionary of data, which can be used to find a sequence of data in a data buffer (e.g. looking for a particular series of words, letters, or numbers in an online document).
Abstract: Efficiencies in searching and matching information in a computer system are achieved using embodiments of the invention. The invention can be used, for example, to build and utilize a dictionary of data for string replication compression. The data matching mechanism can also be applied to other situations where it is necessary to find a sequence of data in a data buffer (e.g. looking for a particular series of words, letters, or numbers in an online document). As a result of processing a current string using the data dictionary, it is possible to find a previously-processed dictionary string that has the greatest number of initial characters in common with the current string, and a location at which the current string can be inserted into the dictionary tree. A count field is used to improve the speed of searching for matched strings.
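As a loose illustration of the dictionary-tree idea above (the patent's count field and node layout are not reproduced; this is a generic trie sketch), finding the stored string with the greatest number of initial characters in common with the current string is a longest-prefix walk:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.count = 0       # how many stored strings pass through this node

def insert(root, s):
    node = root
    for ch in s:
        node.count += 1
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def longest_prefix_match(root, s):
    """Length of the longest prefix of s shared with any stored string."""
    node, depth = root, 0
    for ch in s:
        if ch not in node.children:
            break                # s could be inserted below the node reached here
        node = node.children[ch]
        depth += 1
    return depth

root = TrieNode()
for word in ["stringency", "stride", "strum"]:
    insert(root, word)
print(longest_prefix_match(root, "strict"))   # 4 ("stri" is shared with "stride")
```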

01 Jan 1997
TL;DR: This paper presents an analysis of the performance of the system using different search criteria involving melodic contour, musical intervals and rhythm; tests were carried out using both exact and approximate string matching.
Abstract: This paper describes a system designed to retrieve melodies from a database on the basis of a few notes sung into a microphone. The system first accepts acoustic input from the user, transcribes it into common music notation, then searches a database of 9400 folk tunes for those containing the sung pattern, or patterns similar to the sung pattern; retrieval is ranked according to the closeness of the match. The paper presents an analysis of the performance of the system using different search criteria involving melodic contour, musical intervals and rhythm; tests were carried out using both exact and approximate string matching. Approximate matching used a dynamic programming algorithm designed for comparing musical sequences. Current work focuses on developing a faster algorithm.
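One of the search criteria mentioned above is melodic contour. As a toy illustration (not the system's transcription or matching code), a pitch sequence can be reduced to a contour string of U/D/S symbols and then handed to any exact or approximate string matcher:

```python
def contour(pitches):
    """Reduce a sequence of MIDI pitch numbers to a U/D/S contour string."""
    out = []
    for prev, cur in zip(pitches, pitches[1:]):
        out.append("U" if cur > prev else "D" if cur < prev else "S")
    return "".join(out)

# A sung query and a database tune can then be compared by exact or
# approximate matching of their contour strings.
print(contour([60, 62, 62, 59, 64]))   # "USDU"
```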

Book ChapterDOI
06 Aug 1997
TL;DR: Two new algorithms for on-line multiple approximate string matching are presented, both extensions of previous algorithms that search for a single pattern: the first superimposes bit-parallel simulations of non-deterministic automata built from the patterns and uses the result as a filter, while the second partitions each pattern into sub-patterns that are searched with no errors using a fast exact multipattern search algorithm.
Abstract: We present two new algorithms for on-line multiple approximate string matching. These are extensions of previous algorithms that search for a single pattern. The single-pattern version of the first one is based on the simulation with bits of a non-deterministic finite automaton built from the pattern and using the text as input. To search for multiple patterns, we superimpose their automata, using the result as a filter. The second algorithm partitions the pattern in sub-patterns that are searched with no errors, with a fast exact multipattern search algorithm. To handle multiple patterns, we search the sub-patterns of all of them together. The average running time achieved is in both cases O(n) for moderate error level, pattern length and number of patterns. They adapt (with higher costs) to the other cases. However, the algorithms differ in speed and thresholds of usefulness. We analyze theoretically when each algorithm should be used, and show experimentally that they are faster than previous solutions in a wide range of cases.
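The second algorithm relies on the standard filtering observation that if a pattern is split into k+1 pieces, any occurrence with at most k errors must contain one piece verbatim. A minimal single-pattern sketch of that filter (the multipattern variant and the bit-parallel NFA superimposition are not shown):

```python
def partition_filter_candidates(text, pattern, k):
    """Candidate windows for occurrences of `pattern` with at most k errors.

    Splits the pattern into k+1 pieces; an occurrence with <= k differences
    must contain at least one piece exactly, so exact hits of the pieces act
    as a filter and only the surrounding windows need DP verification.
    Assumes 0 < k < len(pattern).
    """
    m = len(pattern)
    step = m // (k + 1)
    windows = []
    for i in range(k + 1):
        off = i * step
        piece = pattern[off: off + step] if i < k else pattern[off:]
        pos = text.find(piece)
        while pos != -1:
            lo = max(0, pos - off - k)
            hi = min(len(text), pos - off + m + k)
            windows.append((lo, hi))   # verify text[lo:hi] with an edit-distance routine
            pos = text.find(piece, pos + 1)
    return windows

print(partition_filter_candidates("approximate stting matching", "string", 1))
```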

Book
29 May 1997
TL;DR: This chapter focuses on the problem of evaluating a longest common subsequence, which is expressively equivalent to the simple form of the Levenshtein distance.
Abstract: In the previous chapters, we discussed problems involving an exact match of string patterns. We now turn to problems involving similar but not necessarily exact pattern matches. There are a number of similarity or distance measures, and many of them are special cases or generalizations of the Levenshtein metric. The problem of evaluating the measure of string similarity has numerous applications, including one arising in the study of the evolution of long molecules such as proteins. In this chapter, we focus on the problem of evaluating a longest common subsequence, which is expressively equivalent to the simple form of the Levenshtein distance.
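The equivalence mentioned above is the identity d(A, B) = |A| + |B| - 2·|LCS(A, B)| for the insert/delete-only form of the Levenshtein distance. A sketch of the standard dynamic program for the LCS length:

```python
def lcs_length(a, b):
    """Length of a longest common subsequence, by the standard DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

a, b = "surgery", "survey"
L = lcs_length(a, b)                       # 5 ("surey")
indel_distance = len(a) + len(b) - 2 * L   # insert/delete-only edit distance = 3
```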

Journal ArticleDOI
01 Apr 1997
TL;DR: This paper presents a learning-automaton-based solution to string taxonomy that utilizes the Object Migrating Automaton, whose power in clustering objects and images has previously been reported.
Abstract: A typical syntactic pattern recognition (PR) problem involves comparing a noisy string with every element of a dictionary, X. The problem of classification can be greatly simplified if the dictionary is partitioned into a set of subdictionaries. In this case, the classification can be hierarchical: the noisy string is first compared to a representative element of each subdictionary, and the closest match within that subdictionary is subsequently located. Indeed, the entire problem of subdividing a set of strings into subsets where each subset contains "similar" strings has been referred to as the "String Taxonomy Problem". To our knowledge there is no reported solution to this problem. In this paper we present a learning-automaton-based solution to string taxonomy. The solution utilizes the Object Migrating Automaton, whose power in clustering objects and images has been reported. The power of the scheme for string taxonomy has been demonstrated using random strings and garbled versions of string representations of fragments of macromolecules.

Proceedings Article
08 Jul 1997
TL;DR: In this paper, a stochastic model for string-edit distance is proposed, which is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
Abstract: In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string-edit distance. Our stochastic model allows us to learn a string-edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string-edit distance with nearly one-fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
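The report learns the operation costs from data; as a generic illustration of where such learned costs plug in (the cost values below are invented, and this is not the paper's stochastic transduction model), a weighted edit distance looks like this:

```python
def weighted_edit_distance(a, b, sub_cost, ins_cost, del_cost):
    """Edit distance with arbitrary per-symbol operation costs."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + del_cost(a[i - 1]),
                           dp[i][j - 1] + ins_cost(b[j - 1]),
                           dp[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return dp[m][n]

# Illustrative costs only: confusable phones are cheaper to substitute.
confusable = {("m", "n"), ("n", "m"), ("t", "d"), ("d", "t")}
dist = weighted_edit_distance(
    "tog", "dog",
    sub_cost=lambda x, y: 0.0 if x == y else (0.3 if (x, y) in confusable else 1.0),
    ins_cost=lambda y: 1.0,
    del_cost=lambda x: 1.0,
)
print(dist)   # 0.3, versus 1.0 under the unweighted Levenshtein distance
```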

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work considers the problem of finding the longest common subsequence of two strings, and develops significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems.
Abstract: Measuring the similarity between two strings, through such standard measures as Hamming distance, edit distance, and longest common subsequence, is one of the fundamental problems in pattern matching. We consider the problem of finding the longest common subsequence of two strings. A well-known dynamic programming algorithm computes the longest common subsequence of strings X and Y in O(|X|·|Y|) time. We develop significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems. A string S is run-length encoded if it is described as an ordered sequence of pairs (σ, i), each consisting of an alphabet symbol σ and an integer i. Each pair corresponds to a run in S consisting of i consecutive occurrences of σ. For example, the string aaaabbbbcccabbbbcc can be encoded as a⁴b⁴c³a¹b⁴c². Such a run-length encoded string can be significantly shorter than the expanded string representation. Indeed, run-length coding serves as a popular image compression technique, since many classes of images, such as binary images in facsimile transmission, typically contain large patches of identically-valued pixels.
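For reference, the run-length encoding that the faster algorithms above operate on, reproducing the example from the abstract:

```python
from itertools import groupby

def run_length_encode(s):
    """Encode a string as (symbol, run-length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]

print(run_length_encode("aaaabbbbcccabbbbcc"))
# [('a', 4), ('b', 4), ('c', 3), ('a', 1), ('b', 4), ('c', 2)]  i.e. a⁴b⁴c³a¹b⁴c²
```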

Patent
Robert Walter Schreiber
10 Nov 1997
TL;DR: In this paper, a system and method for computer-aided heuristic adaptive attribute matching is described, which comprises a server for receiving a status message and for further processing of the status message according to the following steps: (i) preparing a candidate list of candidates; (ii) preparing a search list of search attributes; (iii) eliminating non-matching candidates; and (iv) selecting a matching candidate.
Abstract: A system and method for computer-aided heuristic adaptive attribute matching are disclosed. A system for computer-aided heuristic adaptive attribute matching comprises a server for receiving a status message and for further processing of the status message according to the following steps: (i) preparing a candidate list of the candidates; (ii) preparing a search list of search attributes; (iii) eliminating non-matching candidates; and, (iv) selecting a matching candidate. A method for computer-aided heuristic adaptive attribute matching in accordance with the invention comprises four steps. Those steps are: (1) preparing a candidate list comprising a plurality of candidates; (2) preparing a search list comprising at least one search attribute; (3) fuzzy matching at least one known attribute to the search attribute responsive to more than one candidate existing; and (4) returning a result of the fuzzy matching.

Proceedings ArticleDOI
18 Dec 1997
TL;DR: Three algorithms for string matching on reconfigurable mesh architectures are presented, and the first algorithm finds the exact matching between T and P in O(1) time on a 2-dimensional RMESH of size (n-m+1)×m.
Abstract: The string matching problem has received much attention over the years due to its importance in various applications such as text/file comparison, DNA sequencing, search engines, and spelling correction. Especially with the introduction of search engines dealing with the tremendous amount of textual information presented on the world wide web and the research on DNA sequencing, this problem deserves special attention, and any algorithmic or hardware improvements to speed up the process will benefit these important applications. In this paper, we present three algorithms for string matching on reconfigurable mesh architectures. Given a text T of length n and a pattern P of length m, the first algorithm finds the exact matching between T and P in O(1) time on a 2-dimensional RMESH of size (n-m+1)×m. The second algorithm finds the approximate matching between T and P in O(k) time on a 2D RMESH, where k is the maximum edit distance between T and P. The third algorithm allows only the replacement operation in the calculation of the edit distance and finds an approximate matching between T and P in constant time on a 3D RMESH.

Proceedings ArticleDOI
01 Jul 1997
TL;DR: A new approach to measuring the similarity of 3D curves is presented, based on an extension of the classical string edit distance that allows the possibility to use strings, where each element can be a vector rather than a single symbol.
Abstract: In this paper a new approach to measuring the similarity of 3D curves is presented. This approach is based on an extension of the classical string edit distance in two ways. The first extension is the possibility to use strings, where each element can be a vector rather than a single symbol, while the second extension is the use of fuzzy set based cost functions in the edit distance computation. These two extensions allow us to tackle various problems, that can't be solved by means of "classical" string edit distance.
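A small illustration of the first extension above, strings whose elements are vectors: only the substitution cost changes, since it must compare two feature vectors rather than two symbols. The particular cost shape below is illustrative only and is not the paper's fuzzy-set formulation:

```python
import math

def vector_sub_cost(u, v, scale=1.0):
    """Substitution cost between two feature vectors (e.g. local curve descriptors).

    A smooth, bounded cost in [0, 1): small geometric differences are cheap,
    large ones approach the cost of a full mismatch.
    """
    d = math.dist(u, v)
    return 1.0 - math.exp(-d / scale)

# Plugged into any weighted edit-distance routine (such as the one sketched
# earlier), this lets two "strings" of 3D curve descriptors be aligned just
# like ordinary symbol strings.
print(vector_sub_cost((0.0, 0.0, 1.0), (0.1, 0.0, 1.0)))   # small cost for a small difference
```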

Journal ArticleDOI
TL;DR: A simple recursive, memoized version of the Knuth-Morris-Pratt string matching algorithm is given, along with a proof of correctness and worst-case analysis.
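For reference, the conventional iterative form of Knuth-Morris-Pratt (failure function plus a single scan); the paper's contribution is a recursive, memoized derivation of this behaviour, which is not reproduced here:

```python
def kmp_search(text, pattern):
    """All start positions of `pattern` in `text` (conventional iterative KMP)."""
    # failure[i]: length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = failure[k - 1]
    return hits

print(kmp_search("abababca", "abab"))   # [0, 2]
```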

Proceedings Article
01 Jan 1997
TL;DR: In this article, a detailed description of the simulation of nondeterministic finite automata (NFA) for approximate string matching using bit parallelism is presented, and modifications of the Shift-Or algorithm are designed for approximate string matching under the generalized Levenshtein distance and for exact and approximate sequence matching.
Abstract: We present a detailed description of the simulation of nondeterministic finite automata (NFA) for approximate string matching. This simulation uses bit parallelism; the algorithm used is called the Shift-Or algorithm. Using this simulation of the NFA by the Shift-Or algorithm, we design a modification of the Shift-Or algorithm for approximate string matching using the generalized Levenshtein distance, and a modification for exact and approximate sequence matching.
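A minimal sketch of the exact-matching Shift-Or kernel that the described extensions build on (the generalized-Levenshtein and sequence-matching modifications are not shown):

```python
def shift_or_search(text, pattern):
    """Exact matching with the bit-parallel Shift-Or algorithm."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # mask[c] has a 0 bit at position i iff pattern[i] == c
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, all_ones) & ~(1 << i)

    state = all_ones   # bit i is 0 iff pattern[:i+1] matches a suffix of the text read so far
    hits = []
    for pos, c in enumerate(text):
        state = ((state << 1) | mask.get(c, all_ones)) & all_ones
        if not state & (1 << (m - 1)):
            hits.append(pos - m + 1)
    return hits

print(shift_or_search("annual banana", "ana"))   # [8, 10]
```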

Book ChapterDOI
30 Jun 1997
TL;DR: This paper includes the swap operation, which interchanges two adjacent characters, in the set of allowable edit operations, and presents an O(t min(m,n))-time algorithm for the extended edit distance problem, where t is the edit distance between the given strings.
Abstract: Most research on the edit distance problem and the k-differences problem considered the set of edit operations consisting of changes, deletions, and insertions. In this paper we include the swap operation that interchanges two adjacent characters into the set of allowable edit operations, and we present an O(t min(m,n))-time algorithm for the extended edit distance problem, where t is the edit distance between the given strings, and an O(kn)-time algorithm for the extended k-differences problem. That is, we add swaps into the set of edit operations without increasing the time complexities of previous algorithms that consider only changes, deletions, and insertions for the edit distance and k-differences problems.
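For comparison, the plain O(mn) dynamic program extended with adjacent swaps (the restricted Damerau-Levenshtein recurrence); this is not the paper's O(t min(m,n)) or O(kn) algorithm:

```python
def edit_distance_with_swaps(a, b):
    """Edit distance with changes, deletions, insertions and adjacent swaps."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # change / match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)   # adjacent swap
    return dp[m][n]

print(edit_distance_with_swaps("acb", "abc"))   # 1 (one swap), versus 2 without swaps
```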

Proceedings ArticleDOI
18 Aug 1997
TL;DR: A fast approximate string matching method that uses a portion of the characters of a word together with a distance pattern, so that existing index techniques can be applied, and that achieves high recall even for poorly recognized texts.
Abstract: This paper presents a fast approximate string matching method. In constructing information spaces such as digital libraries, we have to collect a vast amount of information and convert it into uniformly organized data. Since much of the information must be converted from various media automatically, the space contains garbled text of varying accuracy. To utilize these texts, we need to satisfy three requirements: high recall, high precision and a fast matching process. In order to satisfy these requirements, we have been developing a two-phase matching system. The presented method is used for fast, high-recall candidate word selection in the first phase. The key idea of the method is to use a portion of the characters of a word and a distance pattern, so that existing index techniques can be applied. By experiments, we confirm that the presented method achieves high recall even for poorly recognized texts.


Proceedings ArticleDOI
20 Oct 1997
TL;DR: The systolic solution for approximate string matching is modified and extended for the OCS problem in this paper, and the architecture presented here can also be used to determine the minimum edit distance, the Longest Common Subsequence (LCS) and its length.
Abstract: The string matching problem arises in many fields of text analysis, image analysis and speech recognition. The computationally intensive nature of string matching makes it a candidate for VLSI implementation. Most of the existing algorithms and architectures for string matching consider strings that are from a finite alphabet set. The Optimal Correspondence of String Subsequence (OCS) problem, on the other hand, considers strings from an infinite alphabet set. This paper describes the design of a linear systolic array VLSI architecture for the OCS problem. The systolic solution for approximate string matching is modified and extended for the OCS problem in this paper. The architecture presented here can also be used to determine the minimum edit distance, the Longest Common Subsequence (LCS) and its length. The systolic architecture was simulated and verified using the Cadence design tools.

Journal ArticleDOI
TL;DR: An effective string matching algorithm has been developed which can tolerate common types of distortions, e.g. connected strokes, missing/extra strokes, and variations in writing sequence.


Proceedings Article
01 Jan 1997
TL;DR: In this paper, a new family of single keyword pattern matching algorithms is presented, which can be used to do a minimal number of match attempts within the input string (by maintaining as much information as possible from each match attempt).
Abstract: Even though the field of pattern matching has been well studied, there are still many interesting algorithms to be discovered. In this paper, we present a new family of single keyword pattern matching algorithms. We begin by deriving a common ancestor algorithm, which naïvely solves the problem. Through a series of correctness preserving predicate strengthenings, and implementation choices, we derive efficient variants of this algorithm. This paper also presents one of the first algorithms which could be used to do a minimal number of match attempts within the input string (by maintaining as much information as possible from each match attempt). Keywords: Single keyword pattern matching, Shift distances, Match attempts, Reusing match information, Predicate strengthening and weakening, D.1.4, E.1, F.2.2, G.2.2
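The "common ancestor" in such derivations is essentially the naive matcher that tries every alignment of the keyword; the derived variants strengthen its invariants so that match information can be reused to skip alignments. A sketch of that starting point:

```python
def naive_match(text, keyword):
    """The naive 'common ancestor': try every alignment of the keyword."""
    hits = []
    for i in range(len(text) - len(keyword) + 1):
        # Compare character by character; smarter variants reuse the
        # information gathered here to obtain larger shift distances.
        if all(text[i + j] == keyword[j] for j in range(len(keyword))):
            hits.append(i)
    return hits

print(naive_match("hishershey", "she"))   # [2, 6]
```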

Journal ArticleDOI
Xindong Wu
TL;DR: This paper describes the fuzzy matching techniques implemented in the HCV (Version 2.0) software, and presents a hybrid interpretation mechanism which combines fuzzy matching with probability estimation.
Abstract: When applying rules produced by induction from training examples to a test example, there are three possible cases that demand different actions: (i) no match; (ii) single match; and (iii) multiple match. Existing techniques for dealing with the first and third cases are exclusively based on probability estimation. However, when there are continuous attributes in the example space, and if these attributes have been discretized into intervals before induction, fuzzy interpretation of the discretized intervals at deduction time could be very valuable. This paper describes the fuzzy matching techniques implemented in the HCV (Version 2.0) software, and presents a hybrid interpretation mechanism which combines fuzzy matching with probability estimation. Experimental results of the HCV (Version 2.0) software with different interpretation techniques are provided on a number of data sets from the University of California at Irvine Repository of Machine Learning Databases.
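As a generic illustration of fuzzy interpretation of a discretized interval (the membership functions actually used in HCV (Version 2.0) are not reproduced here), a value just outside an interval can still match a rule condition to a degree:

```python
def interval_membership(x, low, high, slope=0.5):
    """Fuzzy degree to which value x matches the interval [low, high].

    Inside the interval the degree is 1; outside, it falls off linearly over
    a margin of width slope * (high - low).  Purely illustrative; assumes
    high > low.
    """
    margin = slope * (high - low)
    if low <= x <= high:
        return 1.0
    gap = (low - x) if x < low else (x - high)
    return max(0.0, 1.0 - gap / margin)

# A test example with value 31 still partially matches a rule condition
# whose continuous attribute was discretized into the interval [20, 30].
print(interval_membership(31, 20, 30))   # 0.8
```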