Showing papers on "Approximate string matching" published in 1998


Journal ArticleDOI
TL;DR: The stochastic model allows us to learn a string-edit distance function from a corpus of examples and is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
Abstract: In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string-edit distance. Our stochastic model allows us to learn a string-edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string-edit distance with nearly one-fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
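For reference, the unweighted Levenshtein distance referred to above can be computed with the classic dynamic program sketched below (an illustrative Python sketch of the standard recurrence, not the paper's method; the paper's contribution is learning a weighted, stochastic version of this distance).

    def edit_distance(a, b):
        # Classic dynamic-programming (Levenshtein) edit distance: the minimum
        # number of insertions, deletions, and substitutions turning a into b.
        # Works on any two sequences (strings, lists of words, ...).
        n, m = len(a), len(b)
        prev = list(range(m + 1))             # row for the empty prefix of a
        for i in range(1, n + 1):
            cur = [i] + [0] * m
            for j in range(1, m + 1):
                cur[j] = min(prev[j] + 1,                           # delete a[i-1]
                             cur[j - 1] + 1,                        # insert b[j-1]
                             prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitute / match
            prev = cur
        return prev[m]

    # edit_distance("kitten", "sitting") -> 3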

897 citations


Journal ArticleDOI
TL;DR: This paper considers the following incremental version of comparing two sequences A and B to determine their longest common subsequence (LCS) or the edit distance between them, and obtains O(nk) algorithms for the longest prefix approximate match problem, the approximate overlap problem, and cyclic string comparison.
Abstract: The problem of comparing two sequences A and B to determine their longest common subsequence (LCS) or the edit distance between them has been much studied. In this paper we consider the following incremental version of these problems: given an appropriate encoding of a comparison between A and B, can one incrementally compute the answer for A and bB, and the answer for A and Bb, with equal efficiency, where b is an additional symbol? Our main result is a theorem exposing a surprising relationship between the dynamic programming solutions for two such "adjacent" problems. Given a threshold k on the number of differences to be permitted in an alignment, the theorem leads directly to an O(k) algorithm for incrementally computing a new solution from an old one, in contrast to the O(k²) time required to compute a solution from scratch. We further show, with a series of applications, that this algorithm is indeed more powerful than its nonincremental counterpart, by solving the applications with greater asymptotic efficiency than heretofore possible. For example, we obtain O(nk) algorithms for the longest prefix approximate match problem, the approximate overlap problem, and cyclic string comparison.
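As a point of comparison for the thresholded setting, the sketch below is the standard from-scratch way to decide whether two strings are within k differences, filling only a diagonal band of the dynamic-programming matrix; it is not the paper's incremental algorithm, and the function name is ours.

    def within_k_edits(a, b, k):
        # Decide whether the edit distance of a and b is at most k by filling
        # only the diagonal band |i - j| <= k of the DP matrix (O(k * len(a)) work).
        n, m = len(a), len(b)
        if abs(n - m) > k:
            return False
        big = k + 1                           # stands in for "certainly more than k"
        prev = [j if j <= k else big for j in range(m + 1)]
        for i in range(1, n + 1):
            cur = [big] * (m + 1)
            if i <= k:
                cur[0] = i
            for j in range(max(1, i - k), min(m, i + k) + 1):
                cur[j] = min(prev[j] + 1,                           # delete a[i-1]
                             cur[j - 1] + 1,                        # insert b[j-1]
                             prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitute / match
            prev = cur
        return prev[m] <= k

    # within_k_edits("survey", "surgery", 2) -> True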

216 citations


Journal ArticleDOI
TL;DR: This paper gives the first nontrivial compressed matching algorithm for the classic adaptive compression scheme, the LZ77 algorithm, which is known to compress more than other dictionary compression schemes, such as LZ78 and LZW, though for strings with constant per bit entropy, all these schemes compress optimally in the limit.
Abstract: String matching and compression are two widely studied areas of computer science. The theory of string matching has a long association with compression algorithms: data structures from string matching can be used to derive fast implementations of many important compression schemes, most notably the Lempel-Ziv (LZ77) algorithm. Intuitively, once a string has been compressed, and its repetitive nature thereby elucidated, one might be tempted to exploit this knowledge to speed up string matching. The Compressed Matching Problem is that of performing string matching in a compressed text without uncompressing it. More formally, let T be a text, let Z be the compressed string representing T, and let P be a pattern. The Compressed Matching Problem is that of deciding whether P occurs in T, given only P and Z. Compressed matching algorithms have been given for several compression schemes such as LZW. In this paper we give the first nontrivial compressed matching algorithm for the classic adaptive compression scheme, the LZ77 algorithm. In practice, the LZ77 algorithm is known to compress more than other dictionary compression schemes, such as LZ78 and LZW, though for strings with constant per-bit entropy all these schemes compress optimally in the limit. However, for strings with o(1) per-bit entropy, while it was recently shown that LZ77 gives compression to within a constant factor of optimal, schemes such as LZ78 and LZW may deviate from optimality by an exponential factor. Asymptotically, compressed matching is only relevant if |Z| = o(|T|), i.e., if the compression ratio |T|/|Z| is more than a constant. These results show that LZ77 is the appropriate compression method in such settings. We present an LZ77 compressed matching algorithm which runs in time O(n log²(u/n) + p), where n = |Z|, u = |T|, and p = |P|. Compare this with the naive decompression algorithm, which takes Θ(u + p) time to decide whether P occurs in T. Writing u + p as (nu)/n + p, we see that we have improved the complexity, replacing the compression factor u/n by the factor log²(u/n). Our algorithm is competitive in the sense that O(n log²(u/n) + p) = O(u + p), and opportunistic in the sense that O(n log²(u/n) + p) = o(u + p) if n = o(u) and p = o(u).
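To fix the notation, the toy factorization below shows the relationship between a text T of length u and an LZ77-style parse Z of length n. It is a deliberately naive sketch (unbounded window, quadratic match search, no output encoding) meant only to illustrate the objects involved, not the paper's compressed matching algorithm.

    def lz77_factorize(text):
        # Greedy LZ77-style factorization (toy sketch).  Each factor is either a
        # literal character or an (offset, length) copy of earlier text, where
        # the copy may overlap the position being written.
        factors, i = [], 0
        while i < len(text):
            best_len, best_off = 0, 0
            for j in range(i):                # naive O(u^2) search, for clarity only
                l = 0
                while i + l < len(text) and text[j + l] == text[i + l]:
                    l += 1
                if l > best_len:
                    best_len, best_off = l, i - j
            if best_len >= 1:
                factors.append(('copy', best_off, best_len))
                i += best_len
            else:
                factors.append(('literal', text[i]))
                i += 1
        return factors

    # lz77_factorize("abcabcabcd")
    #   -> [('literal','a'), ('literal','b'), ('literal','c'), ('copy', 3, 6), ('literal','d')]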

179 citations


Book ChapterDOI
Gene Myers1
20 Jul 1998
TL;DR: This work presents an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the approximate string matching problem.
Abstract: The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in O(nmk/w) time where w is the word size of the machine (e.g. 32 or 64 in practice). Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m.
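The flavor of the bit-vector computation can be conveyed by the following sketch, which follows the commonly published formulation of the recurrence; Python integers stand in for the machine word, so the O(nm/w) speedup is only notional here, and the variable names are ours.

    def bitvector_search(pattern, text, k):
        # Bit-parallel k-differences search: returns the 0-based end positions j
        # such that some substring of text ending at j matches pattern with at
        # most k differences.
        m = len(pattern)
        assert m > 0
        mask = (1 << m) - 1          # keep all bit-vectors to m bits
        last = 1 << (m - 1)          # bit of the last pattern row
        peq = {}                     # peq[c]: bit i set iff pattern[i] == c
        for i, c in enumerate(pattern):
            peq[c] = peq.get(c, 0) | (1 << i)
        pv, mv, score = mask, 0, m   # column 0 of the DP matrix: values 0..m
        hits = []
        for j, c in enumerate(text):
            eq = peq.get(c, 0)
            xv = eq | mv
            xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
            ph = mv | (~(xh | pv) & mask)
            mh = pv & xh
            if ph & last:            # bottom-row value changes by at most 1 per column
                score += 1
            elif mh & last:
                score -= 1
            ph = (ph << 1) & mask    # row-0 horizontal delta is 0 when searching
            mh = (mh << 1) & mask
            pv = mh | (~(xv | ph) & mask)
            mv = ph & xv
            if score <= k:
                hits.append(j)
        return hits

    # bitvector_search("abc", "xxabcxx", 0) -> [4]  (exact occurrence ends at index 4)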

124 citations


Proceedings ArticleDOI
Piotr Indyk1
08 Nov 1998
TL;DR: This paper gives a randomized O(n log n)-time algorithm for the string matching with don't cares problem, which improves the Fischer-Paterson bound from 1974 and answers the open problem posed by Weiner and Galil.
Abstract: In this paper we give a randomized O(n log n)-time algorithm for the string matching with don't cares problem. This improves the Fischer-Paterson bound from 1974 and answers the open problem posed (among others) by Weiner and Galil. Using the same technique, we give an O(n log n)-time algorithm for other problems, including subset matching, tree pattern matching, (general) approximate threshold matching, and point set matching. As this bound essentially matches the complexity of computing the fast Fourier transform, which is the only known technique for solving problems of this type, it is likely that the algorithms are in fact optimal. Additionally, the technique used for the threshold matching problem can be applied to the online version of this problem, in which we are allowed to preprocess the text and are required to process the pattern in time sublinear in the text length. This result involves an interesting variant of the Karp-Rabin fingerprint method in which the hash functions are locality-sensitive, i.e., the probability of collision of two words depends on the distance between them.
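To pin down the problem statement (and only that; this is not the randomized O(n log n) method), a naive quadratic check of string matching with don't cares might look as follows, with '?' chosen here as the wildcard symbol.

    def match_with_dont_cares(text, pattern, wildcard='?'):
        # Naive O(n * m) check: report every text position where the pattern
        # matches, with `wildcard` (in either string) allowed to match anything.
        n, m = len(text), len(pattern)
        hits = []
        for i in range(n - m + 1):
            if all(pattern[j] == wildcard or text[i + j] == wildcard
                   or pattern[j] == text[i + j] for j in range(m)):
                hits.append(i)
        return hits

    # match_with_dont_cares("abcabd", "ab?") -> [0, 3]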

116 citations


Proceedings ArticleDOI
09 Sep 1998
TL;DR: It is shown that with reasonable space overhead the authors can improve by a factor of two over the fastest online algorithms, when the tolerated error level is low (which is reasonable in text searching).
Abstract: A successful technique to search large textual databases allowing errors relies on an online search in the vocabulary of the text. To reduce the time of that online search, we index the vocabulary as a metric space. We show that with reasonable space overhead we can improve by a factor of two over the fastest online algorithms, when the tolerated error level is low (which is reasonable in text searching).
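One classic way to index a vocabulary as a metric space under edit distance is a Burkhard-Keller (BK) tree, sketched below purely to make the idea concrete; it is not necessarily the index structure used in the paper.

    class BKTree:
        # Burkhard-Keller tree: a metric-space index over a vocabulary.
        def __init__(self, dist):
            self.dist = dist                  # any metric, e.g. an edit distance
            self.root = None                  # node = [word, {distance: child}]

        def add(self, word):
            if self.root is None:
                self.root = [word, {}]
                return
            node = self.root
            while True:
                d = self.dist(word, node[0])
                child = node[1].get(d)
                if child is None:
                    node[1][d] = [word, {}]
                    return
                node = child

        def query(self, word, k):
            # All indexed words within distance k of `word`; the triangle
            # inequality lets us skip children whose edge label lies outside
            # [d - k, d + k].
            out, stack = [], ([self.root] if self.root else [])
            while stack:
                node = stack.pop()
                d = self.dist(word, node[0])
                if d <= k:
                    out.append(node[0])
                for edge, child in node[1].items():
                    if d - k <= edge <= d + k:
                        stack.append(child)
            return out

    # tree = BKTree(edit_distance)            # e.g. the edit_distance sketch above
    # for w in vocabulary: tree.add(w)
    # tree.query("color", 1) -> vocabulary words within one edit of "color"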

82 citations


Proceedings ArticleDOI
23 Feb 1998
TL;DR: This work proposes techniques for retrieving songs by rhythm from music databases by defining similarity measures on rhythm strings and proposing an index structure, called L-tree, to support efficient sub-string matching.
Abstract: We propose techniques for retrieving songs by rhythm from music databases. The rhythm of songs is modeled by rhythm strings. The song retrieval problem is then transformed to the string matching problem. In order to allow approximate string matching, we define similarity measures on rhythm strings. An index structure, called L-tree, is proposed to support efficient sub-string matching. Retrieval algorithms based on L-tree are then designed to provide approximate and sub-song retrieval. Experimental results show that this approach is effective and efficient.

77 citations


Proceedings ArticleDOI
01 Jan 1998
TL;DR: This article gave two algorithms for finding all approximate matches of a pattern in a text, where the edit distance between the pattern and the matching text substring is at most k. The first algorithm, which is quite simple, runs in time O( nk 3 m + n + m) on all patterns except k-break periodic strings.
Abstract: We give two algorithms for finding all approximate matches of a pattern in a text, where the edit distance between the pattern and the matching text substring is at most k The first algorithm, which is quite simple, runs in time O( nk 3 m + n + m) on all patterns except k-break periodic strings (defined later) The second algorithm runs in time O( nk 4 m + n + m )o nk-break periodic patterns The two classes of patterns are easily distinguished in O(m) time

55 citations


Proceedings Article
01 Jul 1998
TL;DR: A model for strings of characters is described that is loosely based on the Lempel-Ziv model, with the addition that a repeated substring can be an approximate match to the original substring; this is close to the situation of DNA, for example.
Abstract: We describe a model for strings of characters that is loosely based on the Lempel-Ziv model, with the addition that a repeated substring can be an approximate match to the original substring; this is close to the situation of DNA, for example. Typically there are many explanations for a given string under the model, some optimal and many suboptimal. Rather than commit to one optimal explanation, we sum the probabilities over all explanations under the model, because this gives the probability of the data under the model. The model has a small number of parameters, and these can be estimated from the given string by an expectation-maximization (EM) algorithm. Each iteration of the EM algorithm takes O(n²) time, and a few iterations are typically sufficient. O(n²) complexity is impractical for strings of more than a few tens of thousands of characters, and a faster approximation algorithm is also given. The model is further extended to include approximate reverse complementary repeats when analyzing DNA strings. Tests include the recovery of parameter estimates from known sources and applications to real DNA strings.

50 citations


Patent
Jean-Pierre Chanod1
29 Dec 1998
TL;DR: In this paper, selected character strings are automatically found by performing an automatic search of a text for character strings that match any of a list of selected strings and that end at a probable string ending.
Abstract: Selected character strings are automatically found by performing an automatic search of a text to find character strings that match any of a list of selected strings. The automatic search includes a series of iterations, each with a starting point in the text. Each iteration determines whether its starting point is followed by a character string that matches any of the list of selected strings and that ends at a probable string ending. Each iteration also finds a starting point for the next iteration that is a probable string beginning. The selected strings can be words and multiple word expressions, in which case probable string endings and beginnings are word boundaries. A finite state lexicon, such as a finite state transducer or a finite state automaton, can be used to determine whether character strings match the list of selected strings. A tokenizing automaton can be used to find starting points.

50 citations


Journal ArticleDOI
TL;DR: Some interesting special cases of patterns are considered, namely, patterns where there is no length-one run, i.e., there are no a, b, c ∈ Σ where b ≠ a and b ≠ c and where the substring abc appears in the pattern.

Journal ArticleDOI
TL;DR: Experimental results show that this approach can effectively discover the hidden costs of elementary operations in a set of string classes.

Book ChapterDOI
20 Jul 1998
TL;DR: For approximate regular expression matching, this work develops notions of what constitutes a significant match and gives algorithms for them; for exact regular expression pattern matching, it gives algorithms for finding a longest match and all symbols involved in some match.
Abstract: While much work has been done on determining if a document or a line of a document contains an exact or approximate match to a regular expression, less effort has been expended in formulating and determining what to report as “the match” once such a “hit” is detected. For exact regular expression pattern matching, we give algorithms for finding a longest match, all symbols involved in some match, and finding optimal submatches to tagged parts of a pattern. For approximate regular expression matching, we develop notions of what constitutes a significant match, give algorithms for them, and also for finding a longest match and all symbols in a match.
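As a baseline definition of "a longest match" (ignoring empty matches), the brute-force sketch below tries every substring and keeps the longest one the whole regular expression matches; the paper's algorithms achieve this far more efficiently, and the helper name is ours.

    import re

    def longest_regex_match(regex, text):
        # Try every substring, longest first at each start, and keep the longest
        # one that the whole regular expression matches (O(n^2) fullmatch calls).
        pat = re.compile(regex)
        best = None
        for i in range(len(text)):
            for j in range(len(text), i, -1):
                if pat.fullmatch(text, i, j):
                    if best is None or j - i > len(best):
                        best = text[i:j]
                    break                     # longest match starting at i found
        return best

    # longest_regex_match("ab*|ba", "xabbbay") -> "abbb"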

Journal ArticleDOI
TL;DR: This paper considers a class of opposite problems connected with string noninclusion relations: find a shortest string included in no string of a given finite language, and find a longest string including no string of a given finite language.
Abstract: For every string inclusion relation there are two optimization problems: find a longest string included in every string of a given finite language, and find a shortest string including every string of a given finite language. As an example, the two well-known pairs of problems, the longest common substring (or subsequence) problem and the shortest common superstring (or supersequence) problem, are interpretations of these two problems. In this paper we consider a class of opposite problems connected with string noninclusion relations: find a shortest string included in no string of a given finite language and find a longest string including no string of a given finite language. The predicate "string $\alpha$ is not included in string $\beta$" is interpreted as either "$\alpha$ is not a substring of $\beta$" or "$\alpha$ is not a subsequence of $\beta$". The main purpose is to determine the complexity status of the string noninclusion optimization problems. Using graph approaches we present polynomial-time algorithms for the first interpretation and NP-hardness proofs for the second. We also discuss restricted versions of the problems, correlations between the string inclusion and noninclusion problems, and generalized problems which are the string inclusion problems for one language and the string noninclusion problems for another. In applications the string inclusion problems are used to find a similarity between any structures which can be represented by strings. Respectively, the noninclusion problems can be used to find a nonsimilarity. Such problems occur in computational molecular biology, data compression, pattern recognition, and flexible manufacturing. The above generalized problems arise naturally in all of these applied areas. Apart from this practical reason, we hope that studying the string noninclusion problems will yield deeper understanding of the string inclusion problems.
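Under the substring interpretation, the first of these problems, finding a shortest string included in no string of a given finite language, can at least be stated executably by brute force, as below; this exponential enumeration is for illustration only, since the paper gives polynomial-time graph-based algorithms for this case.

    from itertools import product

    def shortest_absent_string(language, alphabet):
        # Shortest string over `alphabet` that is a substring of no string in
        # `language` (brute-force enumeration by increasing length).
        length = 1
        while True:
            for cand in map(''.join, product(alphabet, repeat=length)):
                if not any(cand in s for s in language):
                    return cand
            length += 1

    # shortest_absent_string(["abab", "baba"], "ab") -> "aa"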

Journal ArticleDOI
TL;DR: An efficient multi-attribute pattern matching machine is described that locates all occurrences of any of a finite number of sequences of rule structures (called matching rules) in a sequence of input structures.

Journal ArticleDOI
TL;DR: Experimental results where symbols are taken among potentially infinite sets such as integers, reals or composed structures show that, in most cases, it is better to decompose each symbol into a sequence of bytes and use algorithms which assume that the alphabet is bounded.
Abstract: Various string matching algorithms have been designed and some experimental work on string matching over bounded alphabets has been performed, but string matching over unbounded alphabets has been little investigated. We present here experimental results where symbols are taken among potentially infinite sets such as integers, reals or composed structures. These results show that, in most cases, it is better to decompose each symbol into a sequence of bytes and use algorithms which assume that the alphabet is bounded, and use heuristics on symbols. © 1998 John Wiley & Sons, Ltd.
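The byte-decomposition idea can be illustrated as follows, under the assumption that symbols are unsigned 32-bit integers; each symbol becomes four fixed-width bytes, a byte-level matcher is reused, and hits are checked for symbol alignment. The function name and encoding choice are ours.

    import struct

    def find_int_pattern(haystack, needle):
        # Encode every symbol as four big-endian bytes and reuse a
        # bounded-alphabet (byte-level) matcher; a byte-level hit must then be
        # checked for alignment, since it may start in the middle of a symbol.
        hb = b''.join(struct.pack('>I', x) for x in haystack)
        nb = b''.join(struct.pack('>I', x) for x in needle)
        pos = hb.find(nb)
        while pos != -1:
            if pos % 4 == 0:                  # symbol-aligned hit -> real occurrence
                return pos // 4
            pos = hb.find(nb, pos + 1)
        return -1

    # find_int_pattern([7, 300, 300, 42], [300, 42]) -> 2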

Book ChapterDOI
12 Aug 1998
TL;DR: New similarity measures are presented and they can be used to perform more general two-dimensional approximate pattern matching and to compute the edit distance between two images.
Abstract: In this paper we discuss how to compute the edit distance (or similarity) between two images. We present new similarity measures and show how to compute them. They can be used to perform more general two-dimensional approximate pattern matching. Previous work on two-dimensional approximate string matching either allows only substitutions or uses a restricted edit distance that permits only some types of errors.

Book
08 Jul 1998
TL;DR: A fast bit-vector algorithm for approximate string matching based on dynamic programming and a bit-parallel approach to suffix automata: Fast extended string matching.
Abstract: Contents: A fast bit-vector algorithm for approximate string matching based on dynamic programming; A bit-parallel approach to suffix automata: Fast extended string matching; A dictionary matching algorithm fast on the average for terms of varying length; A very fast string matching algorithm for small alphabets and long patterns; Approximate word sequence matching over Sparse Suffix Trees; Efficient parallel algorithm for the editing distance between ordered trees; Reporting exact and approximate regular expression matches; An approximate oracle for distance in metric spaces; A rotation invariant filter for two-dimensional string matching; Constructing suffix arrays for multi-dimensional matrices; Simple and flexible detection of contiguous repeats using a suffix tree (Preliminary Version); Comparison of coding DNA; Fixed topology alignment with recombination; Aligning alignments; Efficient special cases of pattern matching with swaps; Aligning DNA sequences to minimize the change in protein; Genome halving.

Proceedings Article
01 Jan 1998
TL;DR: In this article, reduced nondeterministic finite automata (NFAs) for approximate string matching are presented, where the pattern may occur with a limited number of errors measured by edit distance.
Abstract: Approximate string and sequence matching is the problem of searching for all occurrences of a pattern (a string or a sequence) in a text, where the pattern may occur with a limited number of errors measured by edit distance. Several methods have been designed for approximate string matching that simulate a nondeterministic finite automaton (NFA) constructed for this problem. This paper presents reduced NFAs for approximate string matching, usable when we are interested only in occurrences whose edit distance is at most a given integer but not in the exact edit distance of each found occurrence. An algorithm based on dynamic programming that simulates these reduced NFAs is then presented, and it is shown how to use this algorithm for approximate sequence matching.

Proceedings Article
01 Jan 1998
TL;DR: In this paper, a fuzzy automaton-based approximate string matching algorithm is presented, which can be used for approximate searching in special cases when some pairs of symbols are more similar to each other than the others.
Abstract: We explain new ways of constructing search algorithms using fuzzy sets and fuzzy automata. This technique can be used to search or match strings in special cases when some pairs of symbols are more similar to each other than the others. This kind of similarity cannot be handled by usual searching algorithms.We present sample situations, which would use this kind of searching. Then we define a fuzzy automaton, and some basic constructions we need for our purposes. We continue with definition of our fuzzy automaton based approximate string matching algorithm, and add some notes to fuzzy-trellis construction which can be used for approximate searching.

Book ChapterDOI
20 Jul 1998
TL;DR: In this paper, word sequence matching is discussed, and the common edit distance metric for approximate string matching is adapted to searching for words and sequences of words.
Abstract: In this paper, we discuss word sequence matching, and we adapt the common edit distance metric for approximate string matching to searching for words and sequences of words. We furthermore create a variant of the Sparse Suffix Tree ([3]) and adapt algorithms for approximate word and word sequence matching over the Sparse Suffix Tree variant. The algorithms have been implemented and tested in a WWW information retrieval environment, and performance data is presented.
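Adapting edit distance from characters to words is mechanically simple: the same dynamic program is run with whole words as the symbols, as in the short sketch below (our illustration, not the paper's Sparse Suffix Tree machinery).

    def word_edit_distance(a_words, b_words):
        # Same dynamic program as character-level Levenshtein distance, with
        # whole words as the symbols.
        n, m = len(a_words), len(b_words)
        prev = list(range(m + 1))
        for i in range(1, n + 1):
            cur = [i] + [0] * m
            for j in range(1, m + 1):
                cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                             prev[j - 1] + (a_words[i - 1] != b_words[j - 1]))
            prev = cur
        return prev[m]

    # word_edit_distance("approximate string matching over sparse suffix trees".split(),
    #                    "approximate word matching over suffix trees".split()) -> 2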

Proceedings ArticleDOI
16 Aug 1998
TL;DR: This paper describes an effort to mark and annotate read Cantonese speech, covering both citation pronunciations and sentences/phrases read aloud, using dynamic programming as in approximate string matching.
Abstract: This paper describes our effort to mark and annotate read Cantonese speech, covering both citation pronunciations and sentences/phrases read aloud. Four signals are recorded simultaneously to assist marking and annotation: acoustic, laryngograph, nasal, and air-burst signals. A coarse match between voiced segments of the speech and voiced segments of the phonetic spelling of the utterance is computed by dynamic programming, as in approximate string matching. Finally, we discuss general issues in the design of our annotation software.

Proceedings ArticleDOI
16 Aug 1998
TL;DR: This work considers Gaussian stationary sources and studies the problem of string matching with distortion, and proves theorems concerning the asymptotic behavior of the probability of string match with distortion and the waiting time for the string matching.
Abstract: Wyner and Ziv (1989) studied the asymptotic properties of recurrence times of stationary processes, and applied the results to obtain optimal data compression schemes in information transmission. Since then many data compression algorithms based upon string matching have been proposed and studied. We consider Gaussian stationary sources and study the problem of string matching with distortion. We prove theorems concerning the asymptotic behavior of the probability of string matching with distortion and the waiting time for the string matching.