
Showing papers on "Approximate string matching published in 2000"


Proceedings ArticleDOI
01 Feb 2000
TL;DR: This work presents an algorithm that is faster than both the Galil-Giancarlo and Abrahamson algorithms, finding all locations where the pattern has at most k errors in time O(n√(k log k)).
Abstract: The string matching with mismatches problem is that of finding the number of mismatches between a pattern P of length m and every length-m substring of the text T. Currently, the fastest algorithms for this problem are the following. The Galil-Giancarlo algorithm finds all locations where the pattern has at most k errors (where k is part of the input) in time O(nk). The Abrahamson algorithm finds the number of mismatches at every location in time O(n√(m log m)). We present an algorithm that is faster than both. Our algorithm finds all locations where the pattern has at most k errors in time O(n√(k log k)). We also show an algorithm that solves the above problem in time O((n + nk^3/m) log k).
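For orientation, here is a minimal brute-force sketch of the problem statement in Python; it simply counts mismatches at every alignment in O(nm) time and is not the paper's O(n√(k log k)) algorithm, which relies on more sophisticated techniques.

```python
def mismatch_counts(text: str, pattern: str) -> list[int]:
    """Number of mismatches between pattern and each length-m window of text.

    Brute force: O(nm) time. The paper improves this to O(n*sqrt(m log m))
    overall, or O(n*sqrt(k log k)) when only locations with <= k errors matter.
    """
    n, m = len(text), len(pattern)
    return [sum(text[i + j] != pattern[j] for j in range(m))
            for i in range(n - m + 1)]


def positions_with_at_most_k_errors(text: str, pattern: str, k: int) -> list[int]:
    return [i for i, c in enumerate(mismatch_counts(text, pattern)) if c <= k]


print(positions_with_at_most_k_errors("abracadabra", "abr", k=1))   # [0, 7]
```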

221 citations


Proceedings Article
01 Jan 2000
TL;DR: In this article, the authors propose a space-efficient text index based on compressed representations of suffix arrays and suffix trees, which achieves O(m / lg_{|Σ|} n + lg^ε_{|Σ|} n) search time using at most (ε^{-1} + O(1)) n lg |Σ| bits of space.
Abstract: The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text $T$ consisting of $n$ symbols drawn from a fixed alphabet $\Sigma$. The text $T$ can be represented in $n \lg |\Sigma|$ bits by encoding each symbol with $\lg |\Sigma|$ bits. The goal is to support fast online queries for searching any string pattern $P$ of $m$ symbols, with $T$ being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require $\Omega(n \lg n)$ additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need $\Omega(n)$ memory words, each of $\Omega(\lg n)$ bits. These indexes are larger than the text itself by a multiplicative factor of $\Omega(\smash{\lg_{|\Sigma|} n})$, which is significant when $\Sigma$ is of constant size, such as in \textsc{ascii} or \textsc{unicode}. On the other hand, these indexes support fast searching, either in $O(m \lg |\Sigma|)$ time or in $O(m + \lg n)$ time, plus an output-sensitive cost $O(\mathit{occ})$ for listing the $\mathit{occ}$ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast $\smash{O(m /\lg_{|\Sigma|} n + \lg_{|\Sigma|}^\epsilon n)}$ search time in the worst case, for any constant $0 < \epsilon \leq 1$, using at most $\smash{\bigl(\epsilon^{-1} + O(1)\bigr) \, n \lg |\Sigma|}$ bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed suffix array for a typical 100 MB \textsc{ascii} file can require 30--40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve \emph{both} time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving $O(\mathit{occ} \, \smash{\lg_{|\Sigma|}^\epsilon n})$ time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in $O(n \lg |\Sigma|)$ bits to obtain a total search bound of $O(m /\lg_{|\Sigma|} n + \mathit{occ})$ time, which is optimal.
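For context, the sketch below shows the classical, uncompressed suffix-array search that underlies the O(m lg n)-style bounds cited above; the paper's contribution is to represent such an index in roughly n lg |Σ| bits instead of Θ(n lg n). The naive construction here is for illustration only.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction by sorting suffixes; O(n^2 log n), illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_search(text: str, sa: list[int], pattern: str) -> list[int]:
    """All occurrences of pattern, found by binary search over the sorted suffixes."""
    m = len(pattern)
    # first suffix that is >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # first suffix whose length-m prefix is > pattern
    lo, hi = start, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "abracadabra"
sa = build_suffix_array(text)
print(sa_search(text, sa, "abra"))   # [0, 7]
```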

205 citations


Book ChapterDOI
21 Jun 2000
TL;DR: A new index for approximate string matching is presented, and it is shown experimentally that the parameterization mechanism of the related filtration scheme provides a compromise between the space requirement of the index and the error level for which the filtration is still efficient.

Abstract: We present a new index for approximate string matching. The index collects text q-samples, i.e. disjoint text substrings of length q, at fixed intervals and stores their positions. At search time, part of the text is filtered out by noticing that any occurrence of the pattern must be reflected in the presence of some text q-samples that match approximately inside the pattern. We show experimentally that the parameterization mechanism of the related filtration scheme provides a compromise between the space requirement of the index and the error level for which the filtration is still efficient.
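The sketch below illustrates the q-sample filtering idea under two simplifications that are not in the paper: samples are matched exactly (rather than approximately) inside the pattern, and verification of the candidate regions is left out. All function names and parameters here are illustrative.

```python
from collections import defaultdict

def build_qsample_index(text: str, q: int, h: int) -> dict[str, list[int]]:
    """Record the position of each q-sample taken every h characters (h >= q)."""
    index = defaultdict(list)
    for i in range(0, len(text) - q + 1, h):
        index[text[i:i + q]].append(i)
    return dict(index)

def candidate_regions(index: dict[str, list[int]], pattern: str, q: int, slack: int):
    """Filter: an occurrence long enough to cover a whole q-sample forces that
    sample to appear inside the pattern, so only the text around such samples
    needs to be verified. `slack` should cover the pattern length plus errors."""
    pat_grams = {pattern[j:j + q] for j in range(len(pattern) - q + 1)}
    regions = []
    for sample, positions in index.items():
        if sample in pat_grams:
            regions.extend((max(0, p - slack), p + q + slack) for p in positions)
    return sorted(regions)

text = "the quick brown fox jumps over the lazy dog"
idx = build_qsample_index(text, q=3, h=5)
print(candidate_regions(idx, "brown", q=3, slack=6))   # region around position 10
```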

79 citations


Journal ArticleDOI
TL;DR: The developed computer-assisted system can help marine mammalogists in their identification of dolphins, since it allows them to examine only a handful of candidate images instead of the currently used manual searching of the entire database.
Abstract: This paper presents a syntactic/semantic string representation scheme as well as a string matching method as part of a computer-assisted system to identify dolphins from photographs of their dorsal fins. A low-level string representation is constructed from the curvature function of a dolphin's fin trailing edge, consisting of positive and negative curvature primitives. A high-level string representation is then built over the low-level string by merging appropriate groupings of primitives, in order to obtain a representation that is less sensitive to curvature fluctuations or noise. A family of syntactic/semantic distance measures between two strings is introduced. A composite distance measure is then defined and used as a dissimilarity measure for database search, highlighting both the syntax (structure or sequence) and semantic (attribute or feature) differences. The syntax consists of an ordered sequence of significant protrusions and intrusions on the edge, while the semantics consist of seven attributes extracted from the edge and its curvature function. The matching results are reported for a database of 624 images corresponding to 164 individual dolphins. The identification results indicate that the developed string matching method performs better than the previous matching methods, including dorsal ratio, curvature, and curve matching. The developed computer-assisted system can help marine mammalogists in their identification of dolphins, since it allows them to examine only a handful of candidate images instead of the currently used manual searching of the entire database.

55 citations


Journal ArticleDOI
16 May 2000
TL;DR: This paper describes a general technique for adapting a multi-dimensional index structure for wild-card indexing of unbounded length string data, and instantiates its generic techniques by adapting the 2-dimensional R-tree to string data.
Abstract: As databases have expanded in scope from storing purely business data to include XML documents, product catalogs, e-mail messages, and directory data, it has become increasingly important to search databases based on wild-card string matching: prefix matching, for example, is more common (and useful) than exact matching, for such data. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the dimensions. Traditional multi-dimensional index structures, designed with (fixed length) numeric data in mind, are not suitable for matching unbounded length string data. In this paper, we describe a general technique for adapting a multi-dimensional index structure for wild-card indexing of unbounded length string data. The key ideas are (a) a carefully developed mapping function from strings to rational numbers, (b) representing an unbounded length string in an index leaf page by a fixed length offset to an external key, and (c) storing multiple elided tries, one per dimension, in an index page to prune search during traversal of index pages. These basic ideas affect all index algorithms. In this paper, we present efficient algorithms for different types of string matching. While our technique is applicable to a wide range of multi-dimensional index structures, we instantiate our generic techniques by adapting the 2-dimensional R-tree to string data. We demonstrate the space effectiveness and time benefits of using the string R-tree both analytically and experimentally.
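A minimal sketch of the first key idea, an order-preserving map from strings to rational numbers so that a prefix query becomes a one-dimensional range query; the base-(|Σ|+1) fractional encoding below is an illustrative choice, not necessarily the paper's exact mapping function.

```python
from fractions import Fraction

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET) + 1          # digit 0 is reserved for "end of string"
RANK = {c: i + 1 for i, c in enumerate(ALPHABET)}

def string_to_number(s: str) -> Fraction:
    """Order-preserving map from strings to rationals in [0, 1)."""
    value = Fraction(0)
    for i, c in enumerate(s):
        value += Fraction(RANK[c], BASE ** (i + 1))
    return value

def prefix_range(prefix: str) -> tuple[Fraction, Fraction]:
    """Every string that starts with `prefix` maps into [lo, hi)."""
    lo = string_to_number(prefix)
    hi = lo + Fraction(1, BASE ** len(prefix))
    return lo, hi

# Prefix matching becomes a 1-D range query on the mapped values.
lo, hi = prefix_range("ab")
for w in ["ab", "abacus", "abzzz", "acorn", "aa"]:
    print(w, lo <= string_to_number(w) < hi)   # True for the first three only
```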

50 citations


Book ChapterDOI
21 Jun 2000
TL;DR: The algorithm can be adapted to run in O(k^2 n + min(mkn, m^2 (mσ)^k) + R) average time, where σ is the alphabet size, and results show a speedup over the basic approach for moderate m and small k.

Abstract: We present a solution to the problem of performing approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family, specifically the LZ78 and LZW variants. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions, in O(mkn + R) time. The existence problem needs O(mkn) time. We also show that the algorithm can be adapted to run in O(k^2 n + min(mkn, m^2 (mσ)^k) + R) average time, where σ is the alphabet size. The experimental results show a speedup over the basic approach for moderate m and small k.
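For reference, the sketch below shows the standard k-differences dynamic programming over plain, uncompressed text (the baseline that would follow decompression); the paper's contribution is to obtain the occurrences directly from the LZ78/LZW parse without expanding the text, which this sketch does not attempt.

```python
def approx_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """End positions (1-based) of matches with at most k insertions, deletions
    or substitutions, via the classic O(mn) column-by-column DP."""
    m = len(pattern)
    col = list(range(m + 1))            # distances against the empty text prefix
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag, col[0] = col[0], 0   # an occurrence may start anywhere: row 0 stays 0
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                         # text char c is spurious
                      col[i - 1] + 1,                     # pattern char missing in text
                      prev_diag + (pattern[i - 1] != c))  # match or substitution
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends

print(approx_occurrences("surgery", "survey", k=2))   # [5, 6, 7]
```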

46 citations


Proceedings ArticleDOI
01 Sep 2000
TL;DR: In this work an algorithm is proposed that iteratively improves the approximate median string; experiments show that the proposed median string is a better representation of a given set than the corresponding set median.

Abstract: A string that minimizes the sum of distances to the strings of a given set is known as a (generalized) median string of the set. This concept is important in pattern recognition for modelling a (large) set of garbled strings or patterns. The search for such a string is an NP-hard problem and, therefore, no efficient exact algorithm for computing median strings is known. A greedy approach has been proposed to compute an approximate median string of a set of strings. In this work an algorithm is proposed that iteratively improves this approximate solution. Experiments have been carried out on synthetic and real data to compare the performance of the approximate median string with the conventional set median. These experiments showed that the proposed median string is a better representation of a given set than the corresponding set median.
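A minimal sketch of this style of iterative refinement: starting from an initial candidate (for example, the greedy approximate median or the set median), repeatedly try single-character insertions, deletions and substitutions and accept any edit that lowers the summed edit distance. This is an illustrative reconstruction, not the authors' exact procedure.

```python
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def total_cost(candidate: str, strings: list[str]) -> int:
    return sum(edit_distance(candidate, s) for s in strings)

def refine_median(candidate: str, strings: list[str], alphabet: str) -> str:
    """Greedy hill-climbing: keep applying single edits while the sum decreases."""
    best, best_cost = candidate, total_cost(candidate, strings)
    improved = True
    while improved:
        improved = False
        neighbours = []
        for i in range(len(best) + 1):
            neighbours.extend(best[:i] + c + best[i:] for c in alphabet)      # insertions
        for i in range(len(best)):
            neighbours.append(best[:i] + best[i + 1:])                        # deletions
            neighbours.extend(best[:i] + c + best[i + 1:] for c in alphabet)  # substitutions
        for cand in neighbours:
            cost = total_cost(cand, strings)
            if cost < best_cost:
                best, best_cost, improved = cand, cost, True
    return best

strings = ["karolin", "kathrin", "carolin", "kerstin"]
print(refine_median("karolin", strings, "abcdehiklnorst"))
```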

41 citations


Journal ArticleDOI
TL;DR: The notion of approximate word matching is introduced and it is shown how it can be used to improve detection and categorization of variant forms in bibliographic entries and reduce the human effort involved in the creation of authority files.
Abstract: As more online databases are integrated into digital libraries, the issue of quality control of the data becomes increasingly important, especially as it relates to the effective retrieval of information. Authority work, the need to discover and reconcile variant forms of strings in bibliographic entries, will become more critical in the future. Spelling variants, misspellings, and transliteration differences will all increase the difficulty of retrieving information. We investigate a number of approximate string matching techniques that have traditionally been used to help with this problem. We then introduce the notion of approximate word matching and show how it can be used to improve detection and categorization of variant forms. We demonstrate the utility of these approaches using data from the Astrophysics Data System and show how we can reduce the human effort involved in the creation of authority files.

35 citations


Proceedings ArticleDOI
21 Dec 2000
TL;DR: Of the five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary, the Bayesian algorithm produced the most correct matches (87%), and had the advantage of producing scores that have a useful and practical interpretation.
Abstract: Five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The methods, including an adaptation of the cross correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary were implemented and their accuracy rates compared. Of the five, the Bayesian algorithm produced the most correct matches (87%), and had the advantage of producing scores that have a useful and practical interpretation.
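A minimal sketch of one of the compared methods, the edit distance with a probabilistic substitution matrix: character pairs that OCR frequently confuses receive a reduced substitution cost, so confusable words rank as closer matches. The confusion costs below are invented for illustration, not estimated from the National Library of Medicine data.

```python
# Illustrative OCR confusion costs (lower = more easily confused); made-up values.
CONFUSION_COST = {("l", "1"): 0.2, ("1", "l"): 0.2,
                  ("O", "0"): 0.2, ("0", "O"): 0.2}

def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return CONFUSION_COST.get((a, b), 1.0)

def weighted_edit_distance(a: str, b: str, indel: float = 1.0) -> float:
    """Wagner-Fischer DP with a substitution cost matrix."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + indel,
                           cur[j - 1] + indel,
                           prev[j - 1] + sub_cost(ca, cb)))
        prev = cur
    return prev[-1]

def best_match(word: str, dictionary: list[str]) -> str:
    return min(dictionary, key=lambda w: weighted_edit_distance(word, w))

print(best_match("ce11", ["cell", "call", "tell"]))   # "cell": '1'->'l' is cheap
```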

30 citations


Proceedings ArticleDOI
10 Jul 2000
TL;DR: The algorithm and architecture of a processor for approximate string matching with high throughput rate is presented, dedicated for multimedia and information retrieval applications working on huge amounts of mass data where short response times are necessary.
Abstract: In this paper we present the algorithm and architecture of a processor for approximate string matching with high throughput rate. The processor is dedicated for multimedia and information retrieval applications working on huge amounts of mass data where short response times are necessary. The algorithm used for the approximate string matching is based on a dynamic programming procedure known as the string-to-string correction problem. It has been extended to fulfil the requirements of full text search in a database system, including string matching with wildcards and handling of idiomatic turns of some languages. The processor has been fabricated in a 0.6 μm CMOS technology. It performs a maximum of 8.5 billion character comparisons per second when operating at the specified clock frequency of 132 MHz.
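A minimal software sketch of the recurrence such a processor evaluates in hardware: the string-to-string correction (edit distance) dynamic program, here extended with a single-character wildcard as one example of the full-text-search extensions mentioned above. The '?' wildcard convention is an assumption made for illustration.

```python
def edit_distance_wildcard(pattern: str, word: str, wildcard: str = "?") -> int:
    """String-to-string correction (edit) distance; `wildcard` in the pattern
    matches any single character at zero cost."""
    m, n = len(pattern), len(word)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0 if pattern[i - 1] in (wildcard, word[j - 1]) else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete pattern character
                          d[i][j - 1] + 1,        # insert word character
                          d[i - 1][j - 1] + match)
    return d[m][n]

print(edit_distance_wildcard("colo?r", "colour"))  # 0
print(edit_distance_wildcard("colo?r", "color"))   # 1
```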

27 citations


Journal Article
TL;DR: A new algorithm for fast template matching based on projection is proposed: it projects the image to obtain 1-D data and converts the data into a 0-1 string using a difference operator.

Abstract: A new algorithm for fast template matching based on projection is proposed. It projects the image to obtain 1-D data and converts the data into a 0-1 string using a difference operator. Coarse matching is then obtained using fast string matching algorithms, and finer matching is achieved using the NC (normalized correlation) method. Computer experiments show the algorithm to be robust.
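A minimal sketch of the described pipeline on toy data: project each column of the image to one value, turn the 1-D profile into a 0-1 string with a difference operator, find coarse candidates by exact string matching, and rank them by normalized correlation. The names and the toy image are illustrative, and the fine stage correlates the projections rather than the full 2-D windows for brevity.

```python
def column_projection(image: list[list[int]]) -> list[int]:
    """Project a 2-D image to 1-D by summing each column."""
    return [sum(row[x] for row in image) for x in range(len(image[0]))]

def to_bit_string(profile: list[int]) -> str:
    """Difference operator: 1 where the profile increases, 0 otherwise."""
    return "".join("1" if b > a else "0" for a, b in zip(profile, profile[1:]))

def ncc(a: list[int], b: list[int]) -> float:
    """Normalized correlation between two equal-length vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0

def match_template(image, template) -> int:
    img_p, tpl_p = column_projection(image), column_projection(template)
    img_s, tpl_s = to_bit_string(img_p), to_bit_string(tpl_p)
    # coarse stage: candidate offsets from exact string matching on the bit strings
    candidates = [i for i in range(len(img_s) - len(tpl_s) + 1)
                  if img_s[i:i + len(tpl_s)] == tpl_s]
    if not candidates:
        return -1
    # fine stage: rank candidates by normalized correlation of the projections
    return max(candidates, key=lambda i: ncc(img_p[i:i + len(tpl_p)], tpl_p))

image = [[0, 0, 1, 3, 1, 0, 0, 2],
         [0, 1, 2, 4, 2, 1, 0, 2]]
template = [[1, 3, 1],
            [2, 4, 2]]
print(match_template(image, template))   # 2
```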

Journal ArticleDOI
TL;DR: A framework for clarifying and formalizing the duplicate detection problem is introduced, and four distinct models are presented, each with a corresponding algorithm for its solution adapted from the realm of approximate string matching.
Abstract: Detecting duplicates in document image databases is a problem of growing importance. The task is made difficult by the various degradations suffered by printed documents, and by conflicting notions of what it means to be a “duplicate”. To address these issues, this paper introduces a framework for clarifying and formalizing the duplicate detection problem. Four distinct models are presented, each with a corresponding algorithm for its solution adapted from the realm of approximate string matching. The robustness of these techniques is demonstrated through a set of experiments using data derived from real-world noise sources. Also described are several heuristics that have the potential to speed up the computation by several orders of magnitude.

Journal ArticleDOI
TL;DR: An algorithm is discussed that answers the query of whether a pattern P occurs in a text T with k differences, with time complexity independent of the length of the text T.

Patent
Akagi Takuma
31 Jul 2000
TL;DR: In this article, the authors compare each character of a first character string with each character of a second character string, vote in a matrix whose two sides correspond to the characters of the first character string and the characters of the second character string, and calculate values of the voting result for the components arranged in an oblique direction of the matrix.

Abstract: This invention compares each character of a first character string with each character of a second character string, votes in a matrix whose two sides correspond to the characters of the first character string and the characters of the second character string, and calculates values of the voting result for the respective components arranged in an oblique direction of the matrix. The matching result is determined based on the calculated values of the voting result. As a result, a high-speed and highly precise matching process which is noise-resistant and takes the character arrangement into consideration can be attained.
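A minimal sketch of the voting scheme, under the reading that votes accumulate in a character-agreement matrix and matches show up as large sums along diagonals (the components in an oblique direction); this is an illustrative interpretation of the patent text, not its reference implementation.

```python
def diagonal_vote_score(a: str, b: str) -> int:
    """Vote 1 wherever characters agree, then take the best diagonal sum.
    A large score means a long run of characters lining up at one offset, which
    tolerates noise: an isolated mismatch only loses a single vote."""
    votes = [[int(ca == cb) for cb in b] for ca in a]
    best = 0
    # each diagonal corresponds to one alignment offset between the two strings
    for offset in range(-(len(a) - 1), len(b)):
        s = sum(votes[i][i + offset]
                for i in range(len(a))
                if 0 <= i + offset < len(b))
        best = max(best, s)
    return best

print(diagonal_vote_score("stringmatching", "strxngmatching"))   # 13
print(diagonal_vote_score("stringmatching", "matching"))         # 8
```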

Book ChapterDOI
TL;DR: Two efficient approximate techniques for measuring dissimilarities between cyclic patterns are presented, inspired by the quadratic-time algorithm proposed by Bunke and Buhler, achieving even more accurate solutions.

Abstract: Two efficient approximate techniques for measuring dissimilarities between cyclic patterns are presented. They are inspired by the quadratic-time algorithm proposed by Bunke and Buhler. The first technique completes pseudoalignments built by the Bunke and Buhler algorithm (BBA), obtaining full alignments between cyclic patterns. The edit cost of the minimum-cost alignment is given as an upper-bound estimation of the exact cyclic edit distance, which results in a more accurate bound than the lower one obtained by BBA. The second technique uses both bounds to compute a weighted average, achieving even more accurate solutions. Weights come from minimizing the sum of squared relative errors with respect to exact distance values on a training set of string pairs. Experiments were conducted on both artificial and real data to demonstrate the capabilities of the new techniques in both accuracy and quadratic computing time.

Proceedings ArticleDOI
28 Jun 2000
TL;DR: A fast approximate Chinese word-matching algorithm that can deal with not only character substitution errors but also insertion, deletion and string substitution errors and can handle Chinese "non-word" error, making it possible and easy to establish a two-level structure in Chinese spelling correction.
Abstract: A fast approximate Chinese word-matching algorithm is presented. The algorithm can be used to implement the Chinese fuzzy-matching concept. Based on the algorithm, an automatic Chinese text error correction approach using confusing-word substitution and language model evaluation is designed. Compared with Zhang's (1994) confusing-character substitution method, this new approach can deal with not only character substitution errors but also insertion, deletion and string substitution errors. In addition, the algorithm can handle Chinese "non-word" errors, making it possible and easy to establish a two-level structure in Chinese spelling correction.

Journal Article
TL;DR: An efficient and scalable distributed string matching algorithm is presented by parallelizing the improved KMP (Knuth Morris Pratt) algorithm and making use of the pattern period.
Abstract: Parallel string matching algorithms are mainly based on the PRAM (parallel random access machine) computation model, while research on parallel string matching algorithms for other, more realistic models is very limited. In this paper, the authors present an efficient and scalable distributed string matching algorithm, obtained by parallelizing the improved KMP (Knuth-Morris-Pratt) algorithm and making use of the pattern period. Its computation complexity is O(n/p + m) and its communication time is O(u log p), where n is the length of the text, m the length of the pattern, p the number of processors, and u the period length of the pattern.
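A minimal sketch of the block-splitting idea: the text is divided into p blocks of roughly n/p characters, each extended by m-1 characters of overlap so that no occurrence is lost at a block boundary, and each block is searched independently (one block per processor in the paper; a plain loop and Python's built-in substring search stand in for the parallel KMP here). The function names are illustrative.

```python
def find_all(chunk: str, pattern: str) -> list[int]:
    """All occurrences inside one block (built-in search standing in for KMP)."""
    out, pos = [], chunk.find(pattern)
    while pos != -1:
        out.append(pos)
        pos = chunk.find(pattern, pos + 1)
    return out

def distributed_search(text: str, pattern: str, p: int) -> list[int]:
    """Split the text into p blocks of about n/p characters, each extended by
    m-1 characters of overlap; each block would be searched by its own processor."""
    n, m = len(text), len(pattern)
    block = -(-n // p)                        # ceil(n / p)
    hits = set()                              # a set removes boundary duplicates
    for b in range(p):
        start = b * block
        chunk = text[start:start + block + m - 1]
        hits.update(start + pos for pos in find_all(chunk, pattern))
    return sorted(hits)

print(distributed_search("abababcababc", "ababc", p=3))   # [2, 7]
```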

Journal Article
01 Jan 2000
TL;DR: This study considers stroke direction and pressure sequence strings of a character as character level image signatures for writer identification and presents the newly defined and modified edit distances depending upon their measurement types.
Abstract: The problem of writer identification based on similarity is formalized by defining a distance between character- or word-level features and finding the most similar writings, or all writings which are within a certain threshold distance. Among many features, we consider the stroke direction and pressure sequence strings of a character as character-level image signatures for writer identification. As the conventional definition of edit distance is not directly applicable, we present newly defined and modified edit distances depending upon their measurement types. Finally, we present a prototype stroke direction and pressure sequence string extractor used for writer identification. The importance of this study is the attempt to define a distance between two characters based on the two types of strings.

Book
07 Jun 2000
TL;DR: Contributed papers include Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts as well as Periods and Quasiperiods Characterization.
Abstract: Invited Lectures.- Identifying and Filtering Near-Duplicate Documents.- Machine Learning for Efficient Natural-Language Processing.- Browsing around a Digital Library: Today and Tomorrow.- Summer School Lectures.- Algorithmic Aspects of Speech Recognition: A Synopsis.- Some Results on Flexible-Pattern Discovery.- Contributed Papers.- Explaining and Controlling Ambiguity in Dynamic Programming.- A Dynamic Edit Distance Table.- Parametric Multiple Sequence Alignment and Phylogeny Construction.- Tsukuba BB: A Branch and Bound Algorithm for Local Multiple Sequence Alignment.- A Polynomial Time Approximation Scheme for the Closest Substring Problem.- Approximation Algorithms for Hamming Clustering Problems.- Approximating the Maximum Isomorphic Agreement Subtree Is Hard.- A Faster and Unifying Algorithm for Comparing Trees.- Incomplete Directed Perfect Phylogeny.- The Longest Common Subsequence Problem for Arc-Annotated Sequences.- Boyer-Moore String Matching over Ziv-Lempel Compressed Text.- A Boyer-Moore Type Algorithm for Compressed Pattern Matching.- Approximate String Matching over Ziv-Lempel Compressed Text.- Improving Static Compression Schemes by Alphabet Extension.- Genome Rearrangement by Reversals and Insertions/Deletions of Contiguous Segments.- A Lower Bound for the Breakpoint Phylogeny Problem.- Structural Properties and Tractability Results for Linear Synteny.- Shift Error Detection in Standardized Exams.- An Upper Bound for Number of Contacts in the HP-Model on the Face-Centered-Cubic Lattice (FCC).- The Combinatorial Partitioning Method.- Compact Suffix Array.- Linear Bidirectional On-Line Construction of Affix Trees.- Using Suffix Trees for Gapped Motif Discovery.- Indexing Text with Approximate q-Grams.- Simple Optimal String Matching Algorithm.- Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts.- Periods and Quasiperiods Characterization.- Finding Maximal Quasiperiodicities in Strings.- On the Complexity of Determining the Period of a String.

Journal ArticleDOI
TL;DR: It is shown that specialization with respect to a pattern yields a matcher with code size linear in the length of the pattern and a running time independent of the length of the pattern and linear in the length of the data string.
Abstract: Specialization of a string matcher is a canonical example of partial evaluation. A naive implementation of a string matcher repeatedly matches a pattern against every substring of the data string; this operation should intuitively benefit from specializing the matcher with respect to the pattern. In practice, however, producing an efficient implementation by performing this specialization using standard partial-evaluation techniques requires non-trivial binding-time improvements. Starting with a naive matcher, we thus present a derivation of such a binding-time improved string matcher. We show that specialization with respect to a pattern yields a matcher with code size linear in the length of the pattern and a running time independent of the length of the pattern and linear in the length of the data string. We then consider several variants of matchers that specialize well, amongst them the first such matcher presented in the literature, and we demonstrate how variants can be derived from each other systematically.
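The sketch below illustrates the target of such a specialization in plain Python rather than with a partial evaluator: fixing the pattern produces a residual matcher whose precomputed table (here a KMP-style failure function) has size linear in the pattern, and whose search runs in time linear in the data string and independent of the pattern length per character. This shows the intended result of the derivation, not the binding-time-improved derivation itself.

```python
def specialize_matcher(pattern: str):
    """'Residual program' for a fixed pattern: the failure table plays the role
    of the specialized code, with size linear in len(pattern)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    def match(data: str) -> int:
        """Return the first occurrence of the fixed pattern in data, or -1.
        Runs in time linear in len(data)."""
        k = 0
        for i, c in enumerate(data):
            while k and c != pattern[k]:
                k = fail[k - 1]
            if c == pattern[k]:
                k += 1
            if k == len(pattern):
                return i - k + 1
        return -1

    return match

find_aab = specialize_matcher("aab")   # "specialization time" depends only on the pattern
print(find_aab("aaaaab"))              # 3
print(find_aab("ababab"))              # -1
```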