
Showing papers on "Approximate string matching published in 2000"


Proceedings ArticleDOI
01 Feb 2000
TL;DR: This work presents an algorithm that is faster than both the Galil-Giancarlo and Abrahamson algorithms, finding all locations where the pattern has at most k errors in time O(n√(k log k)).
Abstract: The string matching with mismatches problem is that of finding the number of mismatches between a pattern P of length m and every length-m substring of the text T. Currently, the fastest algorithms for this problem are the following. The Galil-Giancarlo algorithm finds all locations where the pattern has at most k errors (where k is part of the input) in time O(nk). The Abrahamson algorithm finds the number of mismatches at every location in time O(n√(m log m)). We present an algorithm that is faster than both. Our algorithm finds all locations where the pattern has at most k errors in time O(n√(k log k)). We also show an algorithm that solves the above problem in time O((n + nk^3/m) log k).
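For orientation, here is a minimal brute-force sketch of the problem statement in Python; it simply counts mismatches at every alignment in O(nm) time and is not the paper's O(n√(k log k)) algorithm, which relies on more sophisticated techniques.

```python
def mismatch_counts(text: str, pattern: str) -> list[int]:
    """Number of mismatches between pattern and each length-m window of text.

    Brute force: O(nm) time. The paper improves this to O(n*sqrt(m log m))
    overall, or O(n*sqrt(k log k)) when only locations with <= k errors matter.
    """
    n, m = len(text), len(pattern)
    return [sum(text[i + j] != pattern[j] for j in range(m))
            for i in range(n - m + 1)]


def positions_with_at_most_k_errors(text: str, pattern: str, k: int) -> list[int]:
    return [i for i, c in enumerate(mismatch_counts(text, pattern)) if c <= k]


print(positions_with_at_most_k_errors("abracadabra", "abr", k=1))   # [0, 7]
```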

221 citations


Proceedings Article
01 Jan 2000
TL;DR: In this article, the authors propose a space-efficient text index based on compressed representations of suffix arrays and suffix trees, which achieves O(m / lg_{|Σ|} n + lg^ε_{|Σ|} n) search time using at most (ε^{-1} + O(1)) n lg |Σ| bits of space.
Abstract: The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text $T$ consisting of $n$ symbols drawn from a fixed alphabet $\Sigma$. The text $T$ can be represented in $n \lg |\Sigma|$ bits by encoding each symbol with $\lg |\Sigma|$ bits. The goal is to support fast online queries for searching any string pattern $P$ of $m$ symbols, with $T$ being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require $\Omega(n \lg n)$ additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need $\Omega(n)$ memory words, each of $\Omega(\lg n)$ bits. These indexes are larger than the text itself by a multiplicative factor of $\Omega(\smash{\lg_{|\Sigma|} n})$, which is significant when $\Sigma$ is of constant size, such as in \textsc{ascii} or \textsc{unicode}. On the other hand, these indexes support fast searching, either in $O(m \lg |\Sigma|)$ time or in $O(m + \lg n)$ time, plus an output-sensitive cost $O(\mathit{occ})$ for listing the $\mathit{occ}$ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast $\smash{O(m /\lg_{|\Sigma|} n + \lg_{|\Sigma|}^\epsilon n)}$ search time in the worst case, for any constant $0 < \epsilon \leq 1$, using at most $\smash{\bigl(\epsilon^{-1} + O(1)\bigr) \, n \lg |\Sigma|}$ bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed suffix array for a typical 100 MB \textsc{ascii} file can require 30--40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve \emph{both} time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving $O(\mathit{occ} \, \smash{\lg_{|\Sigma|}^\epsilon n})$ time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in $O(n \lg |\Sigma|)$ bits to obtain a total search bound of $O(m /\lg_{|\Sigma|} n + \mathit{occ})$ time, which is optimal.
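For context, the sketch below shows the classical, uncompressed suffix-array search that underlies the O(m lg n)-style bounds cited above; the paper's contribution is to represent such an index in roughly n lg |Σ| bits instead of Θ(n lg n). The naive construction here is for illustration only.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction by sorting suffixes; O(n^2 log n), illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_search(text: str, sa: list[int], pattern: str) -> list[int]:
    """All occurrences of pattern, found by binary search over the sorted suffixes."""
    m = len(pattern)
    # first suffix that is >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # first suffix whose length-m prefix is > pattern
    lo, hi = start, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "abracadabra"
sa = build_suffix_array(text)
print(sa_search(text, sa, "abra"))   # [0, 7]
```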

205 citations


Book ChapterDOI
21 Jun 2000
TL;DR: A new index for approximate string matching is presented, and it is shown experimentally that the parameterization mechanism of the related filtration scheme provides a compromise between the space requirement of the index and the error level for which the filtration is still efficient.

Abstract: We present a new index for approximate string matching. The index collects text q-samples, i.e. disjoint text substrings of length q, at fixed intervals and stores their positions. At search time, part of the text is filtered out by noticing that any occurrence of the pattern must be reflected in the presence of some text q-samples that match approximately inside the pattern. We show experimentally that the parameterization mechanism of the related filtration scheme provides a compromise between the space requirement of the index and the error level for which the filtration is still efficient.
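The sketch below illustrates the q-sample filtering idea under two simplifications that are not in the paper: samples are matched exactly (rather than approximately) inside the pattern, and verification of the candidate regions is left out. All function names and parameters here are illustrative.

```python
from collections import defaultdict

def build_qsample_index(text: str, q: int, h: int) -> dict[str, list[int]]:
    """Record the position of each q-sample taken every h characters (h >= q)."""
    index = defaultdict(list)
    for i in range(0, len(text) - q + 1, h):
        index[text[i:i + q]].append(i)
    return dict(index)

def candidate_regions(index: dict[str, list[int]], pattern: str, q: int, slack: int):
    """Filter: an occurrence long enough to cover a whole q-sample forces that
    sample to appear inside the pattern, so only the text around such samples
    needs to be verified. `slack` should cover the pattern length plus errors."""
    pat_grams = {pattern[j:j + q] for j in range(len(pattern) - q + 1)}
    regions = []
    for sample, positions in index.items():
        if sample in pat_grams:
            regions.extend((max(0, p - slack), p + q + slack) for p in positions)
    return sorted(regions)

text = "the quick brown fox jumps over the lazy dog"
idx = build_qsample_index(text, q=3, h=5)
print(candidate_regions(idx, "brown", q=3, slack=6))   # region around position 10
```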

79 citations


Journal ArticleDOI
TL;DR: The developed computer-assisted system can help marine mammalogists in their identification of dolphins, since it allows them to examine only a handful of candidate images instead of the currently used manual searching of the entire database.
Abstract: This paper presents a syntactic/semantic string representation scheme as well as a string matching method as part of a computer-assisted system to identify dolphins from photographs of their dorsal fins. A low-level string representation is constructed from the curvature function of a dolphin's fin trailing edge, consisting of positive and negative curvature primitives. A high-level string representation is then built over the low-level string by merging appropriate groupings of primitives, in order to obtain a representation that is less sensitive to curvature fluctuations or noise. A family of syntactic/semantic distance measures between two strings is introduced. A composite distance measure is then defined and used as a dissimilarity measure for database search, highlighting both the syntax (structure or sequence) and semantic (attribute or feature) differences. The syntax consists of an ordered sequence of significant protrusions and intrusions on the edge, while the semantics consist of seven attributes extracted from the edge and its curvature function. The matching results are reported for a database of 624 images corresponding to 164 individual dolphins. The identification results indicate that the developed string matching method performs better than the previous matching methods, including dorsal ratio, curvature, and curve matching. The developed computer-assisted system can help marine mammalogists in their identification of dolphins, since it allows them to examine only a handful of candidate images instead of the currently used manual searching of the entire database.

55 citations


Journal ArticleDOI
16 May 2000
TL;DR: This paper describes a general technique for adapting a multi-dimensional index structure for wild-card indexing of unbounded length string data, and instantiates its generic techniques by adapting the 2-dimensional R-tree to string data.
Abstract: As databases have expanded in scope from storing purely business data to include XML documents, product catalogs, e-mail messages, and directory data, it has become increasingly important to search databases based on wild-card string matching: prefix matching, for example, is more common (and useful) than exact matching, for such data. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the dimensions. Traditional multi-dimensional index structures, designed with (fixed length) numeric data in mind, are not suitable for matching unbounded length string data. In this paper, we describe a general technique for adapting a multi-dimensional index structure for wild-card indexing of unbounded length string data. The key ideas are (a) a carefully developed mapping function from strings to rational numbers, (b) representing an unbounded length string in an index leaf page by a fixed length offset to an external key, and (c) storing multiple elided tries, one per dimension, in an index page to prune search during traversal of index pages. These basic ideas affect all index algorithms. In this paper, we present efficient algorithms for different types of string matching. While our technique is applicable to a wide range of multi-dimensional index structures, we instantiate our generic techniques by adapting the 2-dimensional R-tree to string data. We demonstrate the space effectiveness and time benefits of using the string R-tree both analytically and experimentally.
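A minimal sketch of the first key idea, an order-preserving map from strings to rational numbers so that a prefix query becomes a one-dimensional range query; the base-(|Σ|+1) fractional encoding below is an illustrative choice, not necessarily the paper's exact mapping function.

```python
from fractions import Fraction

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET) + 1          # digit 0 is reserved for "end of string"
RANK = {c: i + 1 for i, c in enumerate(ALPHABET)}

def string_to_number(s: str) -> Fraction:
    """Order-preserving map from strings to rationals in [0, 1)."""
    value = Fraction(0)
    for i, c in enumerate(s):
        value += Fraction(RANK[c], BASE ** (i + 1))
    return value

def prefix_range(prefix: str) -> tuple[Fraction, Fraction]:
    """Every string that starts with `prefix` maps into [lo, hi)."""
    lo = string_to_number(prefix)
    hi = lo + Fraction(1, BASE ** len(prefix))
    return lo, hi

# Prefix matching becomes a 1-D range query on the mapped values.
lo, hi = prefix_range("ab")
for w in ["ab", "abacus", "abzzz", "acorn", "aa"]:
    print(w, lo <= string_to_number(w) < hi)   # True for the first three only
```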

50 citations


Book ChapterDOI
21 Jun 2000
TL;DR: The algorithm can be adapted to run in O(k^2 n + min(mkn, m^2 (mσ)^k) + R) average time, where σ is the alphabet size, and results show a speedup over the basic approach for moderate m and small k.

Abstract: We present a solution to the problem of performing approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family, specifically the LZ78 and LZW variants. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions, in O(mkn + R) time. The existence problem needs O(mkn) time. We also show that the algorithm can be adapted to run in O(k^2 n + min(mkn, m^2 (mσ)^k) + R) average time, where σ is the alphabet size. The experimental results show a speedup over the basic approach for moderate m and small k.
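For reference, the sketch below shows the standard k-differences dynamic programming over plain, uncompressed text (the baseline that would follow decompression); the paper's contribution is to obtain the occurrences directly from the LZ78/LZW parse without expanding the text, which this sketch does not attempt.

```python
def approx_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """End positions (1-based) of matches with at most k insertions, deletions
    or substitutions, via the classic O(mn) column-by-column DP."""
    m = len(pattern)
    col = list(range(m + 1))            # distances against the empty text prefix
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag, col[0] = col[0], 0   # an occurrence may start anywhere: row 0 stays 0
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                         # text char c is spurious
                      col[i - 1] + 1,                     # pattern char missing in text
                      prev_diag + (pattern[i - 1] != c))  # match or substitution
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends

print(approx_occurrences("surgery", "survey", k=2))   # [5, 6, 7]
```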

46 citations


Proceedings ArticleDOI
01 Sep 2000
TL;DR: In this work an algorithm is proposed that iteratively improves the approximate median string; experiments show that the proposed median string is a better representation of a given set than the corresponding set median.

Abstract: A string that minimizes the sum of distances to the strings of a given set is known as a (generalized) median string of the set. This concept is important in pattern recognition for modelling a (large) set of garbled strings or patterns. The search for such a string is an NP-hard problem and, therefore, no efficient exact algorithm for computing median strings is known. A greedy approach has been proposed to compute an approximate median string of a set of strings. In this work an algorithm is proposed that iteratively improves this approximate solution. Experiments have been carried out on synthetic and real data to compare the performance of the approximate median string with the conventional set median. These experiments showed that the proposed median string is a better representation of a given set than the corresponding set median.
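A minimal sketch of this style of iterative refinement: starting from an initial candidate (for example, the greedy approximate median or the set median), repeatedly try single-character insertions, deletions and substitutions and accept any edit that lowers the summed edit distance. This is an illustrative reconstruction, not the authors' exact procedure.

```python
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def total_cost(candidate: str, strings: list[str]) -> int:
    return sum(edit_distance(candidate, s) for s in strings)

def refine_median(candidate: str, strings: list[str], alphabet: str) -> str:
    """Greedy hill-climbing: keep applying single edits while the sum decreases."""
    best, best_cost = candidate, total_cost(candidate, strings)
    improved = True
    while improved:
        improved = False
        neighbours = []
        for i in range(len(best) + 1):
            neighbours.extend(best[:i] + c + best[i:] for c in alphabet)      # insertions
        for i in range(len(best)):
            neighbours.append(best[:i] + best[i + 1:])                        # deletions
            neighbours.extend(best[:i] + c + best[i + 1:] for c in alphabet)  # substitutions
        for cand in neighbours:
            cost = total_cost(cand, strings)
            if cost < best_cost:
                best, best_cost, improved = cand, cost, True
    return best

strings = ["karolin", "kathrin", "carolin", "kerstin"]
print(refine_median("karolin", strings, "abcdehiklnorst"))
```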

41 citations


Journal ArticleDOI
TL;DR: The notion of approximate word matching is introduced and it is shown how it can be used to improve detection and categorization of variant forms in bibliographic entries and reduce the human effort involved in the creation of authority files.
Abstract: As more online databases are integrated into digital libraries, the issue of quality control of the data becomes increasingly important, especially as it relates to the effective retrieval of information. Authority work, the need to discover and reconcile variant forms of strings in bibliographic entries, will become more critical in the future. Spelling variants, misspellings, and transliteration differences will all increase the difficulty of retrieving information. We investigate a number of approximate string matching techniques that have traditionally been used to help with this problem. We then introduce the notion of approximate word matching and show how it can be used to improve detection and categorization of variant forms. We demonstrate the utility of these approaches using data from the Astrophysics Data System and show how we can reduce the human effort involved in the creation of authority files.

35 citations


Proceedings ArticleDOI
21 Dec 2000
TL;DR: Of the five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary, the Bayesian algorithm produced the most correct matches (87%), and had the advantage of producing scores that have a useful and practical interpretation.
Abstract: Five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The methods, including an adaptation of the cross correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary were implemented and their accuracy rates compared. Of the five, the Bayesian algorithm produced the most correct matches (87%), and had the advantage of producing scores that have a useful and practical interpretation.
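A minimal sketch of one of the compared methods, the edit distance with a probabilistic substitution matrix: character pairs that OCR frequently confuses receive a reduced substitution cost, so confusable words rank as closer matches. The confusion costs below are invented for illustration, not estimated from the National Library of Medicine data.

```python
# Illustrative OCR confusion costs (lower = more easily confused); made-up values.
CONFUSION_COST = {("l", "1"): 0.2, ("1", "l"): 0.2,
                  ("O", "0"): 0.2, ("0", "O"): 0.2}

def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return CONFUSION_COST.get((a, b), 1.0)

def weighted_edit_distance(a: str, b: str, indel: float = 1.0) -> float:
    """Wagner-Fischer DP with a substitution cost matrix."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + indel,
                           cur[j - 1] + indel,
                           prev[j - 1] + sub_cost(ca, cb)))
        prev = cur
    return prev[-1]

def best_match(word: str, dictionary: list[str]) -> str:
    return min(dictionary, key=lambda w: weighted_edit_distance(word, w))

print(best_match("ce11", ["cell", "call", "tell"]))   # "cell": '1'->'l' is cheap
```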

30 citations


Proceedings ArticleDOI
10 Jul 2000
TL;DR: The algorithm and architecture of a processor for approximate string matching with high throughput rate is presented, dedicated for multimedia and information retrieval applications working on huge amounts of mass data where short response times are necessary.
Abstract: In this paper we present the algorithm and architecture of a processor for approximate string matching with high throughput rate. The processor is dedicated for multimedia and information retrieval applications working on huge amounts of mass data where short response times are necessary. The algorithm used for the approximate string matching is based on a dynamic programming procedure known as the string-to-string correction problem. It has been extended to fulfil the requirements of full text search in a database system, including string matching with wildcards and handling of idiomatic turns of some languages. The processor has been fabricated in a 0.6 μm CMOS technology. It performs a maximum of 8.5 billion character comparisons per second when operating at the specified clock frequency of 132 MHz.
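A minimal software sketch of the recurrence such a processor evaluates in hardware: the string-to-string correction (edit distance) dynamic program, here extended with a single-character wildcard as one example of the full-text-search extensions mentioned above. The '?' wildcard convention is an assumption made for illustration.

```python
def edit_distance_wildcard(pattern: str, word: str, wildcard: str = "?") -> int:
    """String-to-string correction (edit) distance; `wildcard` in the pattern
    matches any single character at zero cost."""
    m, n = len(pattern), len(word)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0 if pattern[i - 1] in (wildcard, word[j - 1]) else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete pattern character
                          d[i][j - 1] + 1,        # insert word character
                          d[i - 1][j - 1] + match)
    return d[m][n]

print(edit_distance_wildcard("colo?r", "colour"))  # 0
print(edit_distance_wildcard("colo?r", "color"))   # 1
```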

27 citations


Journal Article
TL;DR: A new algorithm for fast template matching based on projection is proposed: it projects the image to obtain 1-D data and converts the data into a 0-1 string using a difference operator.

Abstract: A new algorithm for fast template matching based on projection is proposed. It projects the image to obtain 1-D data and converts the data into a 0-1 string using a difference operator. Coarse matching is then obtained using fast string matching algorithms, and finer matching is achieved using the NC (normalized correlation) method. Computer experiments show the algorithm to be robust.
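A minimal sketch of the described pipeline on toy data: project each column of the image to one value, turn the 1-D profile into a 0-1 string with a difference operator, find coarse candidates by exact string matching, and rank them by normalized correlation. The names and the toy image are illustrative, and the fine stage correlates the projections rather than the full 2-D windows for brevity.

```python
def column_projection(image: list[list[int]]) -> list[int]:
    """Project a 2-D image to 1-D by summing each column."""
    return [sum(row[x] for row in image) for x in range(len(image[0]))]

def to_bit_string(profile: list[int]) -> str:
    """Difference operator: 1 where the profile increases, 0 otherwise."""
    return "".join("1" if b > a else "0" for a, b in zip(profile, profile[1:]))

def ncc(a: list[int], b: list[int]) -> float:
    """Normalized correlation between two equal-length vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0

def match_template(image, template) -> int:
    img_p, tpl_p = column_projection(image), column_projection(template)
    img_s, tpl_s = to_bit_string(img_p), to_bit_string(tpl_p)
    # coarse stage: candidate offsets from exact string matching on the bit strings
    candidates = [i for i in range(len(img_s) - len(tpl_s) + 1)
                  if img_s[i:i + len(tpl_s)] == tpl_s]
    if not candidates:
        return -1
    # fine stage: rank candidates by normalized correlation of the projections
    return max(candidates, key=lambda i: ncc(img_p[i:i + len(tpl_p)], tpl_p))

image = [[0, 0, 1, 3, 1, 0, 0, 2],
         [0, 1, 2, 4, 2, 1, 0, 2]]
template = [[1, 3, 1],
            [2, 4, 2]]
print(match_template(image, template))   # 2
```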

Journal ArticleDOI
TL;DR: A framework for clarifying and formalizing the duplicate detection problem is introduced, and four distinct models are presented, each with a corresponding algorithm for its solution adapted from the realm of approximate string matching.
Abstract: Detecting duplicates in document image databases is a problem of growing importance. The task is made difficult by the various degradations suffered by printed documents, and by conflicting notions of what it means to be a “duplicate”. To address these issues, this paper introduces a framework for clarifying and formalizing the duplicate detection problem. Four distinct models are presented, each with a corresponding algorithm for its solution adapted from the realm of approximate string matching. The robustness of these techniques is demonstrated through a set of experiments using data derived from real-world noise sources. Also described are several heuristics that have the potential to speed up the computation by several orders of magnitude.

Journal ArticleDOI
TL;DR: An algorithm is discussed that answers the query of whether a pattern P occurs in a text T with k differences, with time complexity independent of the length of the text T.

Patent
Akagi Takuma
31 Jul 2000
TL;DR: In this article, the authors compare each character of a first character string with each character of a second character string, vote in a matrix whose two sides correspond to the characters of the first character string and the characters of the second character string, and calculate values of the voting result for the components arranged in an oblique direction of the matrix.

Abstract: This invention compares each character of a first character string with each character of a second character string, votes in a matrix whose two sides correspond to the characters of the first character string and the characters of the second character string, and calculates values of the voting result for the respective components arranged in an oblique direction of the matrix. The matching result is determined based on the calculated values of the voting result. As a result, a high-speed and highly precise matching process which is noise-resistant and takes the character arrangement into consideration can be attained.
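A minimal sketch of the voting scheme, under the reading that votes accumulate in a character-agreement matrix and matches show up as large sums along diagonals (the components in an oblique direction); this is an illustrative interpretation of the patent text, not its reference implementation.

```python
def diagonal_vote_score(a: str, b: str) -> int:
    """Vote 1 wherever characters agree, then take the best diagonal sum.
    A large score means a long run of characters lining up at one offset, which
    tolerates noise: an isolated mismatch only loses a single vote."""
    votes = [[int(ca == cb) for cb in b] for ca in a]
    best = 0
    # each diagonal corresponds to one alignment offset between the two strings
    for offset in range(-(len(a) - 1), len(b)):
        s = sum(votes[i][i + offset]
                for i in range(len(a))
                if 0 <= i + offset < len(b))
        best = max(best, s)
    return best

print(diagonal_vote_score("stringmatching", "strxngmatching"))   # 13
print(diagonal_vote_score("stringmatching", "matching"))         # 8
```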

Book ChapterDOI
TL;DR: Two efficient approximate techniques for measuring dissimilarities between cyclic patterns are presented, inspired by the quadratic-time algorithm proposed by Bunke and Buhler, achieving even more accurate solutions.

Abstract: Two efficient approximate techniques for measuring dissimilarities between cyclic patterns are presented. They are inspired by the quadratic-time algorithm proposed by Bunke and Buhler. The first technique completes pseudoalignments built by the Bunke and Buhler algorithm (BBA), obtaining full alignments between cyclic patterns. The edit cost of the minimum-cost alignment is given as an upper-bound estimation of the exact cyclic edit distance, which results in a more accurate bound than the lower one obtained by BBA. The second technique uses both bounds to compute a weighted average, achieving even more accurate solutions. Weights come from minimizing the sum of squared relative errors with respect to exact distance values on a training set of string pairs. Experiments were conducted on both artificial and real data to demonstrate the capabilities of the new techniques in both accuracy and quadratic computing time.

Proceedings ArticleDOI
28 Jun 2000
TL;DR: A fast approximate Chinese word-matching algorithm that can deal with not only character substitution errors but also insertion, deletion and string substitution errors and can handle Chinese "non-word" error, making it possible and easy to establish a two-level structure in Chinese spelling correction.
Abstract: A fast approximate Chinese word-matching algorithm is presented. The algorithm can be used to implement the Chinese fuzzy-matching concept. Based on the algorithm, an automatic Chinese text error correction approach using confusing-word substitution and language model evaluation is designed. Compared with Zhang's (1994) confusing-character substitution method, this new approach can deal with not only character substitution errors but also insertion, deletion and string substitution errors. In addition, the algorithm can handle Chinese "non-word" errors, making it possible and easy to establish a two-level structure in Chinese spelling correction.

Journal Article
TL;DR: An efficient and scalable distributed string matching algorithm is presented by parallelizing the improved KMP (Knuth Morris Pratt) algorithm and making use of the pattern period.
Abstract: Parallel string matching algorithms are mainly based on the PRAM (parallel random access machine) computation model, while research on parallel string matching algorithms for other, more realistic models is very limited. In this paper, the authors present an efficient and scalable distributed string matching algorithm, obtained by parallelizing the improved KMP (Knuth-Morris-Pratt) algorithm and making use of the pattern period. Its computation complexity is O(n/p + m) and its communication time is O(u log p), where n is the length of the text, m the length of the pattern, p the number of processors, and u the period length of the pattern.
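A minimal sketch of the block-splitting idea: the text is divided into p blocks of roughly n/p characters, each extended by m-1 characters of overlap so that no occurrence is lost at a block boundary, and each block is searched independently (one block per processor in the paper; a plain loop and Python's built-in substring search stand in for the parallel KMP here). The function names are illustrative.

```python
def find_all(chunk: str, pattern: str) -> list[int]:
    """All occurrences inside one block (built-in search standing in for KMP)."""
    out, pos = [], chunk.find(pattern)
    while pos != -1:
        out.append(pos)
        pos = chunk.find(pattern, pos + 1)
    return out

def distributed_search(text: str, pattern: str, p: int) -> list[int]:
    """Split the text into p blocks of about n/p characters, each extended by
    m-1 characters of overlap; each block would be searched by its own processor."""
    n, m = len(text), len(pattern)
    block = -(-n // p)                        # ceil(n / p)
    hits = set()                              # a set removes boundary duplicates
    for b in range(p):
        start = b * block
        chunk = text[start:start + block + m - 1]
        hits.update(start + pos for pos in find_all(chunk, pattern))
    return sorted(hits)

print(distributed_search("abababcababc", "ababc", p=3))   # [2, 7]
```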

Journal Article
01 Jan 2000
TL;DR: This study considers stroke direction and pressure sequence strings of a character as character level image signatures for writer identification and presents the newly defined and modified edit distances depending upon their measurement types.
Abstract: The problem of writer identification based on similarity is formalized by defining a distance between character- or word-level features and finding the most similar writings, or all writings which are within a certain threshold distance. Among many features, we consider the stroke direction and pressure sequence strings of a character as character-level image signatures for writer identification. As the conventional definition of edit distance is not directly applicable, we present newly defined and modified edit distances depending upon their measurement types. Finally, we present a prototype stroke direction and pressure sequence string extractor used for writer identification. The importance of this study is the attempt to define a distance between two characters based on the two types of strings.

Book
07 Jun 2000
TL;DR: Contributed papers include Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts as well as Periods and Quasiperiods Characterization.
Abstract: Invited Lectures.- Identifying and Filtering Near-Duplicate Documents.- Machine Learning for Efficient Natural-Language Processing.- Browsing around a Digital Library: Today and Tomorrow.- Summer School Lectures.- Algorithmic Aspects of Speech Recognition: A Synopsis.- Some Results on Flexible-Pattern Discovery.- Contributed Papers.- Explaining and Controlling Ambiguity in Dynamic Programming.- A Dynamic Edit Distance Table.- Parametric Multiple Sequence Alignment and Phylogeny Construction.- Tsukuba BB: A Branch and Bound Algorithm for Local Multiple Sequence Alignment.- A Polynomial Time Approximation Scheme for the Closest Substring Problem.- Approximation Algorithms for Hamming Clustering Problems.- Approximating the Maximum Isomorphic Agreement Subtree Is Hard.- A Faster and Unifying Algorithm for Comparing Trees.- Incomplete Directed Perfect Phylogeny.- The Longest Common Subsequence Problem for Arc-Annotated Sequences.- Boyer-Moore String Matching over Ziv-Lempel Compressed Text.- A Boyer-Moore Type Algorithm for Compressed Pattern Matching.- Approximate String Matching over Ziv-Lempel Compressed Text.- Improving Static Compression Schemes by Alphabet Extension.- Genome Rearrangement by Reversals and Insertions/Deletions of Contiguous Segments.- A Lower Bound for the Breakpoint Phylogeny Problem.- Structural Properties and Tractability Results for Linear Synteny.- Shift Error Detection in Standardized Exams.- An Upper Bound for Number of Contacts in the HP-Model on the Face-Centered-Cubic Lattice (FCC).- The Combinatorial Partitioning Method.- Compact Suffix Array.- Linear Bidirectional On-Line Construction of Affix Trees.- Using Suffix Trees for Gapped Motif Discovery.- Indexing Text with Approximate q-Grams.- Simple Optimal String Matching Algorithm.- Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts.- Periods and Quasiperiods Characterization.- Finding Maximal Quasiperiodicities in Strings.- On the Complexity of Determining the Period of a String.

Journal ArticleDOI
TL;DR: It is shown that specialization with respect to a pattern yields a matcher with code size linear in the length of the pattern and a running time independent of the length of the pattern and linear in the length of the data string.
Abstract: Specialization of a string matcher is a canonical example of partial evaluation. A naive implementation of a string matcher repeatedly matches a pattern against every substring of the data string; this operation should intuitively benefit from specializing the matcher with respect to the pattern. In practice, however, producing an efficient implementation by performing this specialization using standard partial-evaluation techniques requires non-trivial binding-time improvements. Starting with a naive matcher, we thus present a derivation of such a binding-time improved string matcher. We show that specialization with respect to a pattern yields a matcher with code size linear in the length of the pattern and a running time independent of the length of the pattern and linear in the length of the data string. We then consider several variants of matchers that specialize well, amongst them the first such matcher presented in the literature, and we demonstrate how variants can be derived from each other systematically.
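The sketch below illustrates the target of such a specialization in plain Python rather than with a partial evaluator: fixing the pattern produces a residual matcher whose precomputed table (here a KMP-style failure function) has size linear in the pattern, and whose search runs in time linear in the data string and independent of the pattern length per character. This shows the intended result of the derivation, not the binding-time-improved derivation itself.

```python
def specialize_matcher(pattern: str):
    """'Residual program' for a fixed pattern: the failure table plays the role
    of the specialized code, with size linear in len(pattern)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    def match(data: str) -> int:
        """Return the first occurrence of the fixed pattern in data, or -1.
        Runs in time linear in len(data)."""
        k = 0
        for i, c in enumerate(data):
            while k and c != pattern[k]:
                k = fail[k - 1]
            if c == pattern[k]:
                k += 1
            if k == len(pattern):
                return i - k + 1
        return -1

    return match

find_aab = specialize_matcher("aab")   # "specialization time" depends only on the pattern
print(find_aab("aaaaab"))              # 3
print(find_aab("ababab"))              # -1
```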