
Showing papers on "Approximate string matching published in 2003"


01 Jan 2003
TL;DR: An open-source Java toolkit of methods for matching names and records is described and results obtained from using various string distance metrics on the task of matching entity names are summarized.
Abstract: We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We then describe an extension to the toolkit which allows records to be compared. We discuss some issues involved in performing a similar comparison for record-matching techniques, and finally present results for some baseline record-matching algorithms that aggregate string comparisons between fields.
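Two of the metric families compared above can be illustrated with minimal sketches: a character-level edit distance and a token-based set similarity. The function names here are illustrative, not the toolkit's actual API.

```python
# Minimal sketches of two metric families: character-level edit distance
# and token-based Jaccard similarity.  Names are illustrative only.

def levenshtein(a: str, b: str) -> int:
    """Edit-distance metric: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[-1] + 1,                # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def token_jaccard(a: str, b: str) -> float:
    """Token-based metric: overlap of word sets, |A & B| / |A | B|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 1.0
```

Edit distance is robust to typos within words, while token-based measures are robust to word reordering, which is why hybrid methods combine both.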

552 citations


Proceedings ArticleDOI
09 Jun 2003
TL;DR: A new similarity function is proposed which overcomes limitations of commonly used similarity functions, and an efficient fuzzy match algorithm is developed which can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation.
Abstract: To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.

548 citations


Journal ArticleDOI
TL;DR: This paper presents a collection of string algorithms that are at the core of several biological problems such as discovering potential drug targets, creating diagnostic probes, universal primers or unbiased consensus sequences, and proves that the Closest Substring Problem is NP-Hard.
Abstract: This paper presents a collection of string algorithms that are at the core of several biological problems such as discovering potential drug targets, creating diagnostic probes, universal primers or unbiased consensus sequences. All these problems reduce to the task of finding a pattern that, with some error, occurs in one set of strings (Closest Substring Problem) and does not occur in another set (Farthest String Problem). In this paper, we break down the problem into several subproblems and prove the following results. 1. The following are all NP-Hard: the Farthest String Problem, the Closest Substring Problem, and the Closest String Problem of finding a string that is close to each string in a set. 2. There is a PTAS for the Farthest String Problem based on a linear programming relaxation technique. 3. There is a polynomial-time (4/3 + ε)-approximation algorithm for the Closest String Problem for any small constant ε > 0. Using this algorithm, we also provide an efficient heuristic algorithm for the Closest Substring Problem. 4. The problem of finding a string that is at least Hamming distance d from as many strings in a set as possible cannot be approximated within n^ε in polynomial time for some fixed constant ε unless NP = P, where n is the number of strings in the set. 5. There is a polynomial-time 2-approximation for finding a string that is both the Closest Substring to one set, and the Farthest String from another set.
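A tiny brute-force solver makes the Closest String Problem concrete: it enumerates every candidate string, so its running time is exponential in the string length, consistent with the NP-hardness result above. All names are illustrative.

```python
# Exhaustive Closest String solver over a binary alphabet: find a string
# minimizing the maximum Hamming distance to every string in the set.
# Exponential in the string length -- only viable for tiny instances,
# which is the point: the problem is NP-hard in general.
from itertools import product

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def closest_string(strings, alphabet="01"):
    n = len(strings[0])
    best, best_radius = None, n + 1
    for cand in product(alphabet, repeat=n):   # |alphabet|^n candidates
        s = "".join(cand)
        radius = max(hamming(s, t) for t in strings)
        if radius < best_radius:
            best, best_radius = s, radius
    return best, best_radius
```

The PTAS and (4/3 + ε)-approximation in the paper exist precisely because this exhaustive search is infeasible for realistic string lengths.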

264 citations


Journal ArticleDOI
TL;DR: An efficient image retrieval system with high retrieval accuracy based on two novel features, the composite sub-band gradient vector and the energy distribution pattern string, which are generated from the sub-images of a wavelet decomposition of the original image.

195 citations


Patent
20 Jun 2003
TL;DR: A similarity function that utilizes token substrings referred to as q-grams is disclosed, overcoming limitations of prior art similarity functions while efficiently performing a fuzzy match process.
Abstract: To help ensure high data quality, data warehouses validate and clean, if needed, incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.
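A minimal sketch of a q-gram-based similarity (illustrative only; the disclosed function's exact definition is not reproduced here): strings that differ by a small typo still share most of their q-grams, which is what makes the measure tolerant of data-entry errors.

```python
# Illustrative q-gram similarity: compare strings by their sets of
# length-q substrings.  Padding with '#' lets the edge characters
# contribute full grams too.

def qgrams(s: str, q: int = 3) -> set:
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)
```

A dropped character changes only the q grams that overlap it, so a near-duplicate scores high while unrelated strings score near zero.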

120 citations


Proceedings ArticleDOI
09 Jun 2003
TL;DR: The algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance, and shows a lower bound of Ω(n^(α/2)) on the query complexity of every algorithm that distinguishes pairs of strings with edit distance at most n^α from those with edit distance at least n/6.
Abstract: We show how to determine whether the edit distance between two given strings is small in sublinear time. Specifically, we present a test which, given two n-character strings A and B, runs in time o(n) and with high probability returns "CLOSE" if their edit distance is O(n^α), and "FAR" if their edit distance is Ω(n), where α is a fixed parameter less than 1. Our algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance. To do this, we query both strings at random places using a special technique for economizing on the samples which does not pick the samples independently and provides better query and overall complexity. As a result, our test runs in time O(n^max(α/2, 2α - 1)) for any fixed α.

116 citations


Journal ArticleDOI
TL;DR: The problem of finding a consensus string based on consensus error is NP-complete when the penalty matrix is a metric.

63 citations


Proceedings ArticleDOI
01 Mar 2003
TL;DR: It is shown that several distance measures, such as the compression distance and the weighted character edit distance, are almost metrics, and how to modify VP trees to improve their performance on string data, providing tradeoffs between search time and space.
Abstract: In many database applications involving string data, it is common to have near neighbor queries (asking for strings that are similar to a query string) or nearest neighbor queries (asking for strings that are most similar to a query string). The similarity between strings is defined in terms of a distance function determined by the application domain. The most popular string distance measures are based on (a weighted) count of (i) character edit or (ii) block edit operations to transform one string into the other. Examples include the Levenshtein edit distance and the recently introduced compression distance. The main goal is to develop efficient near(est) neighbor search tools that work for both character and block edit distances. Our premise is that distance-based indexing methods, which are originally designed for metric distances can be modified for string distance measures, provided that they form almost metrics. We show that several distance measures, such as the compression distance and weighted character edit distance are almost metrics. In order to analyze the performance of distance based indexing methods (in particular VP trees) for strings, we then develop a model based on distribution of pairwise distances. Based on this model we show how to modify VP trees to improve their performance on string data, providing tradeoffs between search time and space. We test our theoretical results on synthetic data sets and protein strings.
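A minimal vantage-point tree over strings, using the (truly metric) Levenshtein distance, sketches how distance-based indexing answers near-neighbor queries by triangle-inequality pruning; the paper's contribution is the analysis and the modifications needed for almost-metrics, which this toy omits. All names are illustrative.

```python
# Toy VP tree for near-neighbor string queries under Levenshtein distance.
# Pruning relies only on the triangle inequality, so results do not
# depend on the random vantage-point choice.
import random

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class VPTree:
    def __init__(self, items, dist):
        self.dist = dist
        self.root = self._build(list(items))

    def _build(self, items):
        if not items:
            return None
        vp = items.pop(random.randrange(len(items)))   # vantage point
        if not items:
            return (vp, 0, None, None)
        ds = sorted((self.dist(vp, x), x) for x in items)
        mu = ds[len(ds) // 2][0]                       # median splits children
        inner = [x for d, x in ds if d < mu]
        outer = [x for d, x in ds if d >= mu]
        return (vp, mu, self._build(inner), self._build(outer))

    def range_query(self, q, r):
        out, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            vp, mu, inner, outer = node
            d = self.dist(q, vp)
            if d <= r:
                out.append(vp)
            if d - r < mu:        # triangle inequality: inner may hold matches
                stack.append(inner)
            if d + r >= mu:       # outer may hold matches
                stack.append(outer)
        return out
```

For an almost-metric the pruning bounds must be relaxed by the constant by which the triangle inequality can fail, which is exactly the tradeoff the paper models.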

59 citations


Journal ArticleDOI
TL;DR: This paper uses techniques from parameterized complexity to assess non-polynomial time algorithmic options and complexity for the COMMON APPROXIMATE SUBSTRING (CAS) problem, and indicates under which parameter restrictions useful algorithms are possible.

58 citations


Book ChapterDOI
08 Oct 2003
TL;DR: This paper establishes the best method among six baseline matching methods for each language pair, and tests novel matching methods based on binary digrams formed of both adjacent and non-adjacent characters of words, which consistently outperformed all baseline methods.
Abstract: Untranslatable query keys pose a problem in dictionary-based cross-language information retrieval (CLIR). One solution consists of using approximate string matching methods for finding the spelling variants of the source key among the target database index. In such a setting, it is important to select a matching method suited especially for CLIR. This paper focuses on comparing the effectiveness of several matching methods in a cross-lingual setting. Search words from five domains were expressed in six languages (French, Spanish, Italian, German, Swedish, and Finnish). The target data consisted of the index of an English full-text database. In this setting, we first established the best method among six baseline matching methods for each language pair. Secondly, we tested novel matching methods based on binary digrams formed of both adjacent and non-adjacent characters of words. The latter methods consistently outperformed all baseline methods.
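The idea of digrams formed of both adjacent and non-adjacent characters can be sketched as follows. This is an illustrative reconstruction; the paper's exact gram classes and combination scheme may differ.

```python
# Sketch of skip-digram matching: character pairs taken at several gaps
# (gap 0 = adjacent), compared with a Jaccard-style overlap.  Gap values
# and the tuple encoding are illustrative assumptions.

def skip_digrams(word: str, gaps=(0, 1, 2)) -> set:
    grams = set()
    for g in gaps:
        for i in range(len(word) - g - 1):
            grams.add((g, word[i], word[i + g + 1]))   # pair g+1 apart
    return grams

def sgram_similarity(a: str, b: str, gaps=(0, 1, 2)) -> float:
    ga, gb = skip_digrams(a.lower(), gaps), skip_digrams(b.lower(), gaps)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

Non-adjacent pairs survive the character insertions and substitutions that cross-lingual spelling variants typically exhibit, which is why they help in this setting.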

55 citations


Proceedings Article
01 Mar 2003
TL;DR: In this paper, the authors present an algorithm that runs in time O(⌈d/w⌉m + ⌈n/w⌉σ) or O(⌈d/w⌉n + ⌈m/w⌉σ), where d is the edit distance, m and n are the lengths of the two strings, w is the computer word size and σ is the size of the alphabet.
Abstract: The edit distance between strings A and B is defined as the minimum number of edit operations needed in converting A into B or vice versa. The Levenshtein edit distance allows three types of operations: an insertion, a deletion or a substitution of a character. The Damerau edit distance allows the previous three plus in addition a transposition between two adjacent characters. To the best of our knowledge, the best current practical algorithms for computing these edit distances run in time O(dm) and O(⌈m/w⌉(n + σ)), where d is the edit distance between the two strings, m and n are their lengths (m ≤ n), w is the computer word size and σ is the size of the alphabet. In this paper we present an algorithm that runs in time O(⌈d/w⌉m + ⌈n/w⌉σ) or O(⌈d/w⌉n + ⌈m/w⌉σ). The structure of the algorithm is such that in practice it is mostly suitable for testing whether the edit distance between two strings is within some pre-determined error threshold. We also present some initial test results with thresholded edit distance computation. In these tests our algorithm works faster than the original algorithm of Myers.
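Thresholded edit distance testing can be illustrated with a simple scalar DP that abandons the computation once an entire row exceeds the threshold. The bit-parallel algorithm above packs these cells into machine words for a further factor-w speedup; this sketch omits that.

```python
# Scalar thresholded edit distance test: DP values never decrease along
# an alignment path, so once every cell in a row exceeds k the answer
# must be "no" and we can stop early.

def within_edit_distance(a: str, b: str, k: int) -> bool:
    if abs(len(a) - len(b)) > k:
        return False                  # length gap alone forces > k edits
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        if min(curr) > k:
            return False              # no path can come back under k
        prev = curr
    return prev[-1] <= k
```

This is the common use case the abstract describes: deciding whether two strings are within a pre-determined error threshold rather than computing the exact distance.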

Journal ArticleDOI
TL;DR: This work extends an existing algorithm for the LCS to the Levenshtein distance achieving O(m'n+n'm) complexity, and extends this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily.
Abstract: We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n, compressed to m' and n' runs. We extend an existing algorithm for the LCS to the Levenshtein distance achieving O(m'n+n'm) complexity. Furthermore, we extend this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily. This approach also gives an algorithm for approximate searching of a pattern of m letters (m' runs) in a text of n letters (n' runs) in O(mm'n') time. Then we propose improvements for a greedy algorithm for the LCS, and conjecture that the improved algorithm has O(m'n') expected case complexity. Experimental results are provided to support the conjecture.

Proceedings ArticleDOI
26 Mar 2003
TL;DR: This paper proposes a novel auxiliary data structure which greatly improves the efficiency of the suffix array in the approximate string matching problem in the external memory model, and proposes two novel parallel algorithms for this problem, implemented on a PC cluster.
Abstract: Approximate string matching on large DNA sequence data is very important in bioinformatics. Some studies have shown that the suffix tree is an efficient data structure for approximate string matching. It performs better than the suffix array if the data structure can be stored entirely in memory. However, our study finds that the suffix array is much better than the suffix tree for indexing DNA sequences, since the data structure has to be created and stored on disk due to its size. We propose a novel auxiliary data structure which greatly improves the efficiency of the suffix array in the approximate string matching problem in the external memory model. The second problem we have tackled is parallel approximate matching in DNA sequences. We propose two novel parallel algorithms for this problem and implement them on a PC cluster. The results show that when the error allowed is small, a direct partitioning of the array over the machines in the cluster is a more efficient approach. On the other hand, when the error allowed is large, partitioning the data over the machines is a better approach.
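A toy suffix array with binary-search lookup shows the index structure being discussed. The quadratic construction here is purely illustrative; the paper's focus is external-memory layout and parallelism, not construction.

```python
# Toy suffix array: sorted suffix start positions, plus binary search
# for all occurrences of a pattern.  O(n^2 log n) construction is fine
# for a demonstration but not for genome-scale data.

def suffix_array(text: str) -> list:
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text: str, sa: list, pattern: str) -> list:
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                    # leftmost suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    out = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        out.append(sa[lo])            # matches form a contiguous block
        lo += 1
    return sorted(out)
```

Because all suffixes sharing a prefix are contiguous in the array, exact lookup is two binary searches; approximate matching typically generates pattern pieces and runs many such lookups.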


Journal ArticleDOI
TL;DR: A technique for two-dimensional substring indexing based on a reduction to the geometric problem of identifying common colors in two ranges containing colored points is presented and results in a family of secondary memory index structures that trade space for time, with no loss of accuracy.

Journal ArticleDOI
TL;DR: The first nontrivial algorithm for approximate pattern matching on compressed text in the Ziv-Lempel family is presented and a practical speedup over the basic approach of up to 2X for moderate m and small k is shown.

Book ChapterDOI
25 Jun 2003
TL;DR: This work provides an answer to the question of whether the MEDIAN STRING problem is NP-complete for finite and even binary alphabets, and gives the complexity of the related CENTRE STRING problem.
Abstract: Given a finite set of strings, the MEDIAN STRING problem consists in finding a string that minimizes the sum of the distances to the strings in the set. Approximations of the median string are used in a very broad range of applications where one needs a representative string that summarizes common information to the strings of the set. It is the case in Classification, in Speech and Pattern Recognition, and in Computational Biology. In the latter, MEDIAN STRING is related to the key problem of Multiple Alignment. In the recent literature, one finds a theorem stating the NP-completeness of the MEDIAN STRING for unbounded alphabets. However, in the above mentioned areas, the alphabet is often finite. Thus, it remains a crucial question whether the MEDIAN STRING problem is NP-complete for finite and even binary alphabets. In this work, we provide an answer to this question and also give the complexity of the related CENTRE STRING problem. Moreover, we study the parametrized complexity of both problems with respect to the number of input strings.
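A common polynomial-time stand-in for the (NP-complete) median string is the set median, which restricts candidates to the input set itself. A sketch, with illustrative names:

```python
# Set median under Levenshtein distance: the input string minimizing the
# sum of distances to all strings in the set.  Polynomial-time, unlike
# the generalised median, which searches over all possible strings.

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def set_median(strings):
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))
```

The generalised median can achieve a strictly smaller distance sum than any set member, which is why approximating it is worthwhile despite the hardness result.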

Book ChapterDOI
TL;DR: It turns out that this new variant of the Boyer-Moore string matching algorithm achieves very good results in terms of both time efficiency and number of character inspections, especially in the cases in which the patterns are very short.
Abstract: We present a new variant of the Boyer-Moore string matching algorithm which, though not linear, is very fast in practice. We compare our algorithm with the Horspool, Quick Search, Tuned Boyer-Moore, and Reverse Factor algorithms, which are among the fastest string matching algorithms for practical uses. It turns out that our algorithm achieves very good results in terms of both time efficiency and number of character inspections, especially in the cases in which the patterns are very short.
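One of the baselines in the comparison above, Horspool's simplification of Boyer-Moore, can be sketched as follows; the paper's new variant itself is not reproduced here.

```python
# Boyer-Moore-Horspool baseline: on each attempt, shift by the
# bad-character rule keyed on the text character aligned with the
# pattern's last position, skipping windows that cannot match.

def horspool_search(text: str, pattern: str) -> list:
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # last occurrence of each char in pattern[:-1] decides the shift
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    out, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            out.append(pos)
        pos += shift.get(text[pos + m - 1], m)   # unseen char: full shift
    return out
```

For very short patterns the maximum shift m is small, which is exactly the regime where the paper reports its variant doing comparatively well.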

Journal ArticleDOI
TL;DR: New and efficient algorithms for approximate point pattern matching in two and three dimensions are presented, based on approximate combinatorial distance bounds on sets of points, and via the use of methods from combinatorial pattern matching.
Abstract: Point pattern matching is an important problem in computational geometry, with applications in areas like computer vision, object recognition, molecular modeling, and image registration. Traditionally, it has been studied in an exact formulation, where the input point sets are given with arbitrary precision. This leads to algorithms that typically have running times of the order of high-degree polynomials, and require robust calculations of intersection points of high-degree surfaces. We study approximate point pattern matching, with the goal of developing algorithms that are more efficient and more practical than exact algorithms. Our work is motivated by the observation that in practice, data sets that form instances of pattern matching problems are noisy, and so approximate formulations are more appropriate. We present new and efficient algorithms for approximate point pattern matching in two and three dimensions, based on approximate combinatorial distance bounds on sets of points, and via the use of methods from combinatorial pattern matching. We also present an average-case analysis and a detailed empirical study of our methods.

Journal ArticleDOI
TL;DR: Exhaustive experiments showed that the proposed approximations to the median string are a better representation of a given set than the corresponding set median.

Journal ArticleDOI
TL;DR: The architecture and algorithms behind Matchsimile, an approximate string matching lookup tool especially designed for extracting person and company names from large texts, are presented, which allows searching for lawyer names in official law publications.
Abstract: We present the architecture and algorithms behind Matchsimile, an approximate string matching lookup tool especially designed for extracting person and company names from large texts. Part of a larger information extraction environment, this specific engine receives a large set of proper names to search for, a text to search, and search options; and outputs all the occurrences of the names found in the text. Beyond the similarity search capabilities applied at the intraword level, the tool considers a set of specific person name formation rules at the word level, such as combination, abbreviation, duplicity detections, ordering, word omission and insertion, among others. This engine is used in a successful commercial application (also named Matchsimile), which allows searching for lawyer names in official law publications.

Journal ArticleDOI
TL;DR: This paper presents a novel approach that is able to operate in a dynamic environment, where there is a steady arrival of new strings belonging to the considered set and needs only the median of the set computed before together with the new string to compute an updated median string of the new set.
Abstract: The generalised median string is defined as a string that has the smallest sum of distances to the elements of a given set of strings. It is a valuable tool in representing a whole set of objects by a single prototype, and has interesting applications in pattern recognition. All algorithms for computing generalised median strings known from the literature are of static nature. That is, they require all elements of the underlying set of strings to be given when the algorithm is started. In this paper, we present a novel approach that is able to operate in a dynamic environment, where there is a steady arrival of new strings belonging to the considered set. Rather than computing the median from scratch upon arrival of each new string, the proposed algorithm needs only the median of the set computed before together with the new string to compute an updated median string of the new set. Our approach is experimentally compared to a greedy algorithm and the set median using both synthetic and real data.

Patent
19 Dec 2003
TL;DR: A method, computer program and system for optimizing similarity string filtering are disclosed, in which a first data string comprising one or more data characters and a second data string comprising one or more data characters are selected.
Abstract: A method, computer program and system for optimizing similarity string filtering are disclosed. A first data string comprising one or more data characters and a second data string comprising one or more data characters are selected. At least one of a defined set of shapes is applied to the first data string to generate one or more patterns associated with the first data string. At least one of the defined set of shapes is applied to the second data string to generate one or more patterns associated with the second data string. The one or more patterns associated with the first data string are compared with the one or more patterns associated with the second data string to determine if one or more matching patterns exist. The first data string and the second data string are linked if one or more matching patterns exist.

Proceedings ArticleDOI
Jun Sun, Zhulong Wang, Hao Yu, Fumihito Nishino, Yukata Katsuyama, Satoshi Naoi
20 Nov 2003
TL;DR: A novel stroke verification algorithm is used to effectively remove non-character strokes and build the binary text line image, which is segmented and recognized by dynamic programming.
Abstract: Images play a very important role in web content delivery. Many WWW images contain text information that can be used for web indexing and searching. A new text extraction and recognition algorithm is proposed in this paper. The character strokes in the image are first extracted by color clustering and connected component analysis. A novel stroke verification algorithm is used to effectively remove non-character strokes. The verified strokes are then used to build the binary text line image, which is segmented and recognized by dynamic programming. Since text in WWW images usually has a close relationship with webpage content, approximate string matching is used to revise the recognition result by matching the content in the webpage with the content in the image. This effective post-processing not only improves the recognition performance, but can also be used in other applications such as image-webpage paragraph correspondence.

01 Jan 2003
TL;DR: A construction algorithm is presented which is currently the fastest practical construction method for large suffix trees and a clustered storage scheme for the suffix tree is proposed that takes into account the locality behavior of typical query types, which leads to a significant speed-up particularly for the exact string matching problem.
Abstract: Suffix trees have been established as one of the most versatile index structures for unstructured string data like genomic sequences and other strings. In this work, our goal is the development of algorithms for the efficient construction of suffix trees for very large strings and their convenient storage regarding fast access when main memory is limited. We present a construction algorithm which, to the best of our knowledge, is currently the fastest practical construction method for large suffix trees. Further we propose a clustered storage scheme for the suffix tree that takes into account the locality behavior of typical query types, which leads to a significant speed-up particularly for the exact string matching problem. For very large strings the query time is faster than that of other recent index structures like the enhanced suffix array.

Book ChapterDOI
28 May 2003
TL;DR: This paper gives the first known structures and corresponding algorithms for approximate indexing under the Hamming distance having the following properties: their size is linear times a polylog of the size of the text on average.
Abstract: In this paper we give the first, to our knowledge, structures and corresponding algorithms for approximate indexing, by considering the Hamming distance, having the following properties. i) Their size is linear times a polylog of the size of the text on average. ii) For each pattern x, the time spent by our algorithms for finding the list occ(x) of all occurrences of a pattern x in the text, up to a certain distance, is proportional on average to |x| + |occ(x)|, under an additional but realistic hypothesis.

01 Jan 2003
TL;DR: This thesis focuses on unit-cost edit distance that defines the distance between two strings as the minimum number of edit operations that are needed in transforming one of the strings into the other.
Abstract: Given a pattern string and a text, the task of approximate string matching is to find all locations in the text that are similar to the pattern. This type of search may be done for example in applications of spelling error correction or bioinformatics. Typically edit distance is used as the measure of similarity (or distance) between two strings. In this thesis we concentrate on unit-cost edit distance that defines the distance between two strings as the minimum number of edit operations that are needed in transforming one of the strings into the other. More specifically, we discuss the Levenshtein and the Damerau edit distances. Approximate string matching algorithms can be divided into off-line and on-line algorithms depending on whether they may or may not, respectively, preprocess the text. In this thesis we propose practical algorithms for both types of approximate string matching as well as for computing edit distance. Our main contributions are a new variant of the bit-parallel approximate string matching algorithm of Myers, a method that makes it easy to modify many existing Levenshtein edit distance algorithms into using the Damerau edit distance, a bit-parallel algorithm for computing edit distance, a more error tolerant version of the ABNDM algorithm, a two-phase filtering scheme, a tuned indexed approximate string matching method for genome searching, and an improved and extended version of the hybrid index of Navarro and Baeza-Yates. To evaluate their practicality, we compare most of the proposed methods with previously existing algorithms. The test results support the claim of the title of this thesis that our proposed algorithms work well in practice.
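The Damerau distance discussed above extends Levenshtein with adjacent transpositions. A sketch of the restricted (optimal-string-alignment) DP form, which counts a swap of two adjacent characters as one operation:

```python
# Restricted Damerau (optimal string alignment) distance: the three
# Levenshtein operations plus a transposition of two adjacent characters,
# added as one extra case in the standard DP recurrence.

def damerau_osa(a: str, b: str) -> int:
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[m][n]
```

The single extra case is why many Levenshtein algorithms can be adapted to the Damerau distance with modest changes, which is one of the contributions the abstract mentions.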

Patent
17 Jan 2003
TL;DR: In this article, an Arabic handwriting recognition system takes an input from a stylus in the form of an ordered sequence of data, and subsequently strokes (or directed line segments) are extracted from the sequence.
Abstract: An Arabic handwriting recognition system takes an input from a stylus in the form of an ordered sequence of data. The sequence of data is then processed to eliminate any noise associated with data, and subsequently strokes (or directed line segments) are extracted from the sequence of data. More analysis of the strokes is performed to transform the input data into a features vector. Next, the features vector is matched against the features of all Arabic letters using fuzzy matching and dynamic programming techniques. During this matching process, the input word is segmented into the sequence of characters that maximized the matching score. In addition, external objects (such as: single dots, double dots, triple dots, hamzas, or maddas) that are above and below Arabic letters are detected.

Proceedings ArticleDOI
02 Nov 2003
TL;DR: From the comparison, for string matching on Chinese text and URL strings the AQR algorithm is rather efficient, while on Email address matching SBOM does better; the skipping matching algorithms (such as Mgrep) are much more efficient for small pattern sets.
Abstract: We analyzed the core ideas of three basic string matching algorithms (KMP, BM, DFA), described the principles of five advanced online multi-pattern matching algorithms (AC, RAC, AQR, SBOM, Mgrep) and compared the matching efficiencies of the five algorithms by searching speed, preprocessing time and memory used on three web information string sets (Chinese phrases, URL strings, Email address strings), especially focusing on the influence of pattern set size and minimum pattern length on the efficiency. From the comparison, we find that for string matching on Chinese text and URL strings, the AQR algorithm is rather efficient, while on Email address matching, SBOM does better. The skipping matching algorithms (such as Mgrep) are much more efficient for small pattern sets. So a combined algorithm of efficient matching algorithms seems to improve the performance and efficiency of information content security systems.
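For contrast with the algorithms compared above, the classic Aho-Corasick automaton (the AC baseline) can be sketched as follows: a goto trie plus failure links lets all patterns be matched in a single pass over the text.

```python
# Minimal Aho-Corasick: build a trie of the patterns, add failure links
# by BFS, then scan the text once, reporting (start, pattern) matches.
from collections import deque

def build_aho_corasick(patterns):
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        node = 0
        for c in pat:
            if c not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][c] = len(goto) - 1
            node = goto[node][c]
        out[node].add(pat)
    queue = deque(goto[0].values())          # depth-1 nodes fail to the root
    while queue:
        u = queue.popleft()
        for c, v in goto[u].items():
            queue.append(v)
            f = fail[u]
            while f and c not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(c, 0)
            out[v] |= out[fail[v]]           # inherit shorter matches ending here
    return goto, fail, out

def search_all(text, patterns):
    goto, fail, out = build_aho_corasick(patterns)
    node, hits = 0, []
    for i, c in enumerate(text):
        while node and c not in goto[node]:
            node = fail[node]
        node = goto[node].get(c, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return sorted(hits)
```

Because it never skips text characters, AC-style automata are insensitive to the minimum pattern length, whereas skipping algorithms like Mgrep gain their advantage from long shifts, which matches the tradeoff the comparison reports.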

Proceedings ArticleDOI
07 Jul 2003
TL;DR: This paper proposes a method of dynamic programming matching for information retrieval that is as effective as a conventional information retrieval system, even though it is capable of approximate matching.
Abstract: Though dynamic programming matching can carry out approximate string matching when there may be deletions or insertions in a document, its effectiveness and efficiency are usually too poor to use it for large-scale information retrieval. In this paper, we propose a method of dynamic programming matching for information retrieval. This method is as effective as a conventional information retrieval system, even though it is capable of approximate matching. It is also as efficient as a conventional system.
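Dynamic programming matching of a pattern against every position of a document can be sketched with semi-global alignment, where row 0 of the DP table is all zeros so a match may begin anywhere in the text. This is an illustrative baseline, not the paper's retrieval method.

```python
# Semi-global DP (approximate substring search): report all text
# positions where the pattern matches with at most k edits.  Column 0
# of each text position is 0, so matches may start anywhere.

def approximate_search(pattern: str, text: str, k: int) -> list:
    m = len(pattern)
    col = list(range(m + 1))          # pattern prefix vs. empty text
    ends = []
    for j, tc in enumerate(text, 1):
        new = [0]                     # a match may start at any position
        for i, pc in enumerate(pattern, 1):
            new.append(min(col[i] + 1,                 # delete from text
                           new[-1] + 1,                # delete from pattern
                           col[i - 1] + (pc != tc)))   # match/substitute
        col = new
        if col[m] <= k:
            ends.append(j)            # pattern matches ending at position j
    return ends
```

The O(mn) cost per query is what makes naive DP matching too slow for large-scale retrieval, motivating the efficiency techniques the paper proposes.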