
Showing papers on "Approximate string matching published in 2003"


01 Jan 2003
TL;DR: An open-source Java toolkit of methods for matching names and records is described and results obtained from using various string distance metrics on the task of matching entity names are summarized.
Abstract: We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We then describe an extension to the toolkit which allows records to be compared. We discuss some issues involved in performing a similar comparison for record-matching techniques, and finally present results for some baseline record-matching algorithms that aggregate string comparisons between fields.
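Two of the metric families compared above can be illustrated with minimal sketches: a character-level edit distance and a token-based set similarity. The function names here are illustrative, not the toolkit's actual API.

```python
# Minimal sketches of two metric families: character-level edit distance
# and token-based Jaccard similarity.  Names are illustrative only.

def levenshtein(a: str, b: str) -> int:
    """Edit-distance metric: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[-1] + 1,                # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def token_jaccard(a: str, b: str) -> float:
    """Token-based metric: overlap of word sets, |A & B| / |A | B|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 1.0
```

Edit distance is robust to typos within words, while token-based measures are robust to word reordering, which is why hybrid methods combine both.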

552 citations


Proceedings ArticleDOI
09 Jun 2003
TL;DR: A new similarity function is proposed which overcomes limitations of commonly used similarity functions, and an efficient fuzzy match algorithm is developed which can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation.
Abstract: To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.

548 citations


Journal ArticleDOI
TL;DR: This paper presents a collection of string algorithms that are at the core of several biological problems such as discovering potential drug targets, creating diagnostic probes, universal primers or unbiased consensus sequences, and proves that the Closest Substring Problem is NP-Hard.
Abstract: This paper presents a collection of string algorithms that are at the core of several biological problems such as discovering potential drug targets, creating diagnostic probes, universal primers or unbiased consensus sequences. All these problems reduce to the task of finding a pattern that, with some error, occurs in one set of strings (Closest Substring Problem) and does not occur in another set (Farthest String Problem). In this paper, we break down the problem into several subproblems and prove the following results. 1. The following are all NP-Hard: the Farthest String Problem, the Closest Substring Problem, and the Closest String Problem of finding a string that is close to each string in a set. 2. There is a PTAS for the Farthest String Problem based on a linear programming relaxation technique. 3. There is a polynomial-time (4/3 + ε)-approximation algorithm for the Closest String Problem for any small constant ε > 0. Using this algorithm, we also provide an efficient heuristic algorithm for the Closest Substring Problem. 4. The problem of finding a string that is at least Hamming distance d from as many strings in a set as possible cannot be approximated within n^ε in polynomial time for some fixed constant ε unless NP = P, where n is the number of strings in the set. 5. There is a polynomial-time 2-approximation for finding a string that is both the Closest Substring to one set, and the Farthest String from another set.
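A tiny brute-force solver makes the Closest String Problem concrete: it enumerates every candidate string, so its running time is exponential in the string length, consistent with the NP-hardness result above. All names are illustrative.

```python
# Exhaustive Closest String solver over a binary alphabet: find a string
# minimizing the maximum Hamming distance to every string in the set.
# Exponential in the string length -- only viable for tiny instances,
# which is the point: the problem is NP-hard in general.
from itertools import product

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def closest_string(strings, alphabet="01"):
    n = len(strings[0])
    best, best_radius = None, n + 1
    for cand in product(alphabet, repeat=n):   # |alphabet|^n candidates
        s = "".join(cand)
        radius = max(hamming(s, t) for t in strings)
        if radius < best_radius:
            best, best_radius = s, radius
    return best, best_radius
```

The PTAS and (4/3 + ε)-approximation in the paper exist precisely because this exhaustive search is infeasible for realistic string lengths.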

264 citations


Journal ArticleDOI
TL;DR: An efficient image retrieval system with high retrieval accuracy based on two novel features, the composite sub-band gradient vector and the energy distribution pattern string, which are generated from the sub-images of a wavelet decomposition of the original image.

195 citations


Patent
20 Jun 2003
TL;DR: A similarity function that utilizes token substrings referred to as q-grams is disclosed, overcoming limitations of prior art similarity functions while efficiently performing a fuzzy match process.
Abstract: To help ensure high data quality, data warehouses validate and clean, if needed, incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.
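A minimal sketch of a q-gram-based similarity (illustrative only; the disclosed function's exact definition is not reproduced here): strings that differ by a small typo still share most of their q-grams, which is what makes the measure tolerant of data-entry errors.

```python
# Illustrative q-gram similarity: compare strings by their sets of
# length-q substrings.  Padding with '#' lets the edge characters
# contribute full grams too.

def qgrams(s: str, q: int = 3) -> set:
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)
```

A dropped character changes only the q grams that overlap it, so a near-duplicate scores high while unrelated strings score near zero.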

120 citations


Proceedings ArticleDOI
09 Jun 2003
TL;DR: The algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance, and shows a lower bound of Ω(n^(α/2)) on the query complexity of every algorithm that distinguishes pairs of strings with edit distance at most n^α from those with edit distance at least n/6.
Abstract: We show how to determine whether the edit distance between two given strings is small in sublinear time. Specifically, we present a test which, given two n-character strings A and B, runs in time o(n) and with high probability returns "CLOSE" if their edit distance is O(n^α), and "FAR" if their edit distance is Ω(n), where α is a fixed parameter less than 1. Our algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance. To do this, we query both strings at random places using a special technique for economizing on the samples which does not pick the samples independently and provides better query and overall complexity. As a result, our test runs in time O(n^max(α/2, 2α - 1)) for any fixed α.

116 citations


Journal ArticleDOI
TL;DR: The problem of finding a consensus string based on consensus error is NP-complete when the penalty matrix is a metric.

63 citations


Proceedings ArticleDOI
01 Mar 2003
TL;DR: It is shown that several distance measures, such as the compression distance and the weighted character edit distance, are almost metrics, and how to modify VP trees to improve their performance on string data, providing tradeoffs between search time and space.
Abstract: In many database applications involving string data, it is common to have near neighbor queries (asking for strings that are similar to a query string) or nearest neighbor queries (asking for strings that are most similar to a query string). The similarity between strings is defined in terms of a distance function determined by the application domain. The most popular string distance measures are based on (a weighted) count of (i) character edit or (ii) block edit operations to transform one string into the other. Examples include the Levenshtein edit distance and the recently introduced compression distance. The main goal is to develop efficient near(est) neighbor search tools that work for both character and block edit distances. Our premise is that distance-based indexing methods, which are originally designed for metric distances can be modified for string distance measures, provided that they form almost metrics. We show that several distance measures, such as the compression distance and weighted character edit distance are almost metrics. In order to analyze the performance of distance based indexing methods (in particular VP trees) for strings, we then develop a model based on distribution of pairwise distances. Based on this model we show how to modify VP trees to improve their performance on string data, providing tradeoffs between search time and space. We test our theoretical results on synthetic data sets and protein strings.
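A minimal vantage-point tree over strings, using the (truly metric) Levenshtein distance, sketches how distance-based indexing answers near-neighbor queries by triangle-inequality pruning; the paper's contribution is the analysis and the modifications needed for almost-metrics, which this toy omits. All names are illustrative.

```python
# Toy VP tree for near-neighbor string queries under Levenshtein distance.
# Pruning relies only on the triangle inequality, so results do not
# depend on the random vantage-point choice.
import random

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class VPTree:
    def __init__(self, items, dist):
        self.dist = dist
        self.root = self._build(list(items))

    def _build(self, items):
        if not items:
            return None
        vp = items.pop(random.randrange(len(items)))   # vantage point
        if not items:
            return (vp, 0, None, None)
        ds = sorted((self.dist(vp, x), x) for x in items)
        mu = ds[len(ds) // 2][0]                       # median splits children
        inner = [x for d, x in ds if d < mu]
        outer = [x for d, x in ds if d >= mu]
        return (vp, mu, self._build(inner), self._build(outer))

    def range_query(self, q, r):
        out, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            vp, mu, inner, outer = node
            d = self.dist(q, vp)
            if d <= r:
                out.append(vp)
            if d - r < mu:        # triangle inequality: inner may hold matches
                stack.append(inner)
            if d + r >= mu:       # outer may hold matches
                stack.append(outer)
        return out
```

For an almost-metric the pruning bounds must be relaxed by the constant by which the triangle inequality can fail, which is exactly the tradeoff the paper models.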

59 citations


Journal ArticleDOI
TL;DR: This paper uses techniques from parameterized complexity to assess non-polynomial time algorithmic options and complexity for the COMMON APPROXIMATE SUBSTRING (CAS) problem, and indicates under which parameter restrictions useful algorithms are possible.

58 citations


Book ChapterDOI
08 Oct 2003
TL;DR: This paper establishes the best method among six baseline matching methods for each language pair, and tests novel matching methods based on binary digrams formed of both adjacent and non-adjacent characters of words, which consistently outperformed all baseline methods.
Abstract: Untranslatable query keys pose a problem in dictionary-based cross-language information retrieval (CLIR). One solution consists of using approximate string matching methods for finding the spelling variants of the source key among the target database index. In such a setting, it is important to select a matching method suited especially for CLIR. This paper focuses on comparing the effectiveness of several matching methods in a cross-lingual setting. Search words from five domains were expressed in six languages (French, Spanish, Italian, German, Swedish, and Finnish). The target data consisted of the index of an English full-text database. In this setting, we first established the best method among six baseline matching methods for each language pair. Secondly, we tested novel matching methods based on binary digrams formed of both adjacent and non-adjacent characters of words. The latter methods consistently outperformed all baseline methods.
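The idea of digrams formed of both adjacent and non-adjacent characters can be sketched as follows. This is an illustrative reconstruction; the paper's exact gram classes and combination scheme may differ.

```python
# Sketch of skip-digram matching: character pairs taken at several gaps
# (gap 0 = adjacent), compared with a Jaccard-style overlap.  Gap values
# and the tuple encoding are illustrative assumptions.

def skip_digrams(word: str, gaps=(0, 1, 2)) -> set:
    grams = set()
    for g in gaps:
        for i in range(len(word) - g - 1):
            grams.add((g, word[i], word[i + g + 1]))   # pair g+1 apart
    return grams

def sgram_similarity(a: str, b: str, gaps=(0, 1, 2)) -> float:
    ga, gb = skip_digrams(a.lower(), gaps), skip_digrams(b.lower(), gaps)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

Non-adjacent pairs survive the character insertions and substitutions that cross-lingual spelling variants typically exhibit, which is why they help in this setting.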

55 citations


Proceedings Article
01 Mar 2003
TL;DR: In this paper, the authors present an algorithm that runs in time O(⌈d/w⌉m + ⌈n/w⌉σ) or O(⌈d/w⌉n + ⌈m/w⌉σ), where d is the edit distance, m and n are the lengths of the two strings, w is the computer word size and σ is the size of the alphabet.
Abstract: The edit distance between strings A and B is defined as the minimum number of edit operations needed in converting A into B or vice versa. The Levenshtein edit distance allows three types of operations: an insertion, a deletion or a substitution of a character. The Damerau edit distance allows the previous three plus in addition a transposition between two adjacent characters. To the best of our knowledge, the best current practical algorithms for computing these edit distances run in time O(dm) and O(⌈m/w⌉(n + σ)), where d is the edit distance between the two strings, m and n are their lengths (m ≤ n), w is the computer word size and σ is the size of the alphabet. In this paper we present an algorithm that runs in time O(⌈d/w⌉m + ⌈n/w⌉σ) or O(⌈d/w⌉n + ⌈m/w⌉σ). The structure of the algorithm is such that in practice it is mostly suitable for testing whether the edit distance between two strings is within some pre-determined error threshold. We also present some initial test results with thresholded edit distance computation. In these tests our algorithm works faster than the original algorithm of Myers.
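Thresholded edit distance testing can be illustrated with a simple scalar DP that abandons the computation once an entire row exceeds the threshold. The bit-parallel algorithm above packs these cells into machine words for a further factor-w speedup; this sketch omits that.

```python
# Scalar thresholded edit distance test: DP values never decrease along
# an alignment path, so once every cell in a row exceeds k the answer
# must be "no" and we can stop early.

def within_edit_distance(a: str, b: str, k: int) -> bool:
    if abs(len(a) - len(b)) > k:
        return False                  # length gap alone forces > k edits
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        if min(curr) > k:
            return False              # no path can come back under k
        prev = curr
    return prev[-1] <= k
```

This is the common use case the abstract describes: deciding whether two strings are within a pre-determined error threshold rather than computing the exact distance.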

Journal ArticleDOI
TL;DR: This work extends an existing algorithm for the LCS to the Levenshtein distance achieving O(m'n+n'm) complexity, and extends this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily.
Abstract: We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n, compressed to m' and n' runs. We extend an existing algorithm for the LCS to the Levenshtein distance achieving O(m'n+n'm) complexity. Furthermore, we extend this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily. This approach also gives an algorithm for approximate searching of a pattern of m letters (m' runs) in a text of n letters (n' runs) in O(mm'n') time. Then we propose improvements for a greedy algorithm for the LCS, and conjecture that the improved algorithm has O(m'n') expected case complexity. Experimental results are provided to support the conjecture.

Proceedings ArticleDOI
26 Mar 2003
TL;DR: This paper proposes a novel auxiliary data structure which greatly improves the efficiency of the suffix array in the approximate string matching problem in the external memory model, and proposes two novel parallel algorithms for this problem, implemented on a PC cluster.
Abstract: Approximate string matching on large DNA sequence data is very important in bioinformatics. Some studies have shown that the suffix tree is an efficient data structure for approximate string matching. It performs better than the suffix array if the data structure can be stored entirely in memory. However, our study finds that the suffix array is much better than the suffix tree for indexing DNA sequences, since the data structure has to be created and stored on disk due to its size. We propose a novel auxiliary data structure which greatly improves the efficiency of the suffix array in the approximate string matching problem in the external memory model. The second problem we have tackled is parallel approximate matching in DNA sequences. We propose two novel parallel algorithms for this problem and implement them on a PC cluster. The results show that when the error allowed is small, a direct partitioning of the array over the machines in the cluster is a more efficient approach. On the other hand, when the error allowed is large, partitioning the data over the machines is a better approach.
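A toy suffix array with binary-search lookup shows the index structure being discussed. The quadratic construction here is purely illustrative; the paper's focus is external-memory layout and parallelism, not construction.

```python
# Toy suffix array: sorted suffix start positions, plus binary search
# for all occurrences of a pattern.  O(n^2 log n) construction is fine
# for a demonstration but not for genome-scale data.

def suffix_array(text: str) -> list:
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text: str, sa: list, pattern: str) -> list:
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                    # leftmost suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    out = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        out.append(sa[lo])            # matches form a contiguous block
        lo += 1
    return sorted(out)
```

Because all suffixes sharing a prefix are contiguous in the array, exact lookup is two binary searches; approximate matching typically generates pattern pieces and runs many such lookups.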


Journal ArticleDOI
TL;DR: A technique for two-dimensional substring indexing based on a reduction to the geometric problem of identifying common colors in two ranges containing colored points is presented and results in a family of secondary memory index structures that trade space for time, with no loss of accuracy.

Journal ArticleDOI
TL;DR: The first nontrivial algorithm for approximate pattern matching on compressed text in the Ziv-Lempel family is presented and a practical speedup over the basic approach of up to 2X for moderate m and small k is shown.

Book ChapterDOI
25 Jun 2003
TL;DR: This work provides an answer to the question of whether the MEDIAN STRING problem is NP-complete for finite and even binary alphabets, and gives the complexity of the related CENTRE STRING problem.
Abstract: Given a finite set of strings, the MEDIAN STRING problem consists in finding a string that minimizes the sum of the distances to the strings in the set. Approximations of the median string are used in a very broad range of applications where one needs a representative string that summarizes common information to the strings of the set. It is the case in Classification, in Speech and Pattern Recognition, and in Computational Biology. In the latter, MEDIAN STRING is related to the key problem of Multiple Alignment. In the recent literature, one finds a theorem stating the NP-completeness of the MEDIAN STRING for unbounded alphabets. However, in the above mentioned areas, the alphabet is often finite. Thus, it remains a crucial question whether the MEDIAN STRING problem is NP-complete for finite and even binary alphabets. In this work, we provide an answer to this question and also give the complexity of the related CENTRE STRING problem. Moreover, we study the parametrized complexity of both problems with respect to the number of input strings.
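A common polynomial-time stand-in for the (NP-complete) median string is the set median, which restricts candidates to the input set itself. A sketch, with illustrative names:

```python
# Set median under Levenshtein distance: the input string minimizing the
# sum of distances to all strings in the set.  Polynomial-time, unlike
# the generalised median, which searches over all possible strings.

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def set_median(strings):
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))
```

The generalised median can achieve a strictly smaller distance sum than any set member, which is why approximating it is worthwhile despite the hardness result.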

Book ChapterDOI
TL;DR: It turns out that this new variant of the Boyer-Moore string matching algorithm achieves very good results in terms of both time efficiency and number of character inspections, especially in the cases in which the patterns are very short.
Abstract: We present a new variant of the Boyer-Moore string matching algorithm which, though not linear, is very fast in practice. We compare our algorithm with the Horspool, Quick Search, Tuned Boyer-Moore, and Reverse Factor algorithms, which are among the fastest string matching algorithms for practical uses. It turns out that our algorithm achieves very good results in terms of both time efficiency and number of character inspections, especially in the cases in which the patterns are very short.
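One of the baselines in the comparison above, Horspool's simplification of Boyer-Moore, can be sketched as follows; the paper's new variant itself is not reproduced here.

```python
# Boyer-Moore-Horspool baseline: on each attempt, shift by the
# bad-character rule keyed on the text character aligned with the
# pattern's last position, skipping windows that cannot match.

def horspool_search(text: str, pattern: str) -> list:
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # last occurrence of each char in pattern[:-1] decides the shift
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    out, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            out.append(pos)
        pos += shift.get(text[pos + m - 1], m)   # unseen char: full shift
    return out
```

For very short patterns the maximum shift m is small, which is exactly the regime where the paper reports its variant doing comparatively well.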

Journal ArticleDOI
TL;DR: New and efficient algorithms for approximate point pattern matching in two and three dimensions are presented, based on approximate combinatorial distance bounds on sets of points, and via the use of methods from combinatorial pattern matching.
Abstract: Point pattern matching is an important problem in computational geometry, with applications in areas like computer vision, object recognition, molecular modeling, and image registration. Traditionally, it has been studied in an exact formulation, where the input point sets are given with arbitrary precision. This leads to algorithms that typically have running times of the order of high-degree polynomials, and require robust calculations of intersection points of high-degree surfaces. We study approximate point pattern matching, with the goal of developing algorithms that are more efficient and more practical than exact algorithms. Our work is motivated by the observation that in practice, data sets that form instances of pattern matching problems are noisy, and so approximate formulations are more appropriate. We present new and efficient algorithms for approximate point pattern matching in two and three dimensions, based on approximate combinatorial distance bounds on sets of points, and via the use of methods from combinatorial pattern matching. We also present an average-case analysis and a detailed empirical study of our methods.

Journal ArticleDOI
TL;DR: Exhaustive experiments showed that the proposed approximations to the median string are a better representation of a given set than the corresponding set median.

Journal ArticleDOI
TL;DR: The architecture and algorithms behind Matchsimile, an approximate string matching lookup tool especially designed for extracting person and company names from large texts, are presented, which allows searching for lawyer names in official law publications.
Abstract: We present the architecture and algorithms behind Matchsimile, an approximate string matching lookup tool especially designed for extracting person and company names from large texts. Part of a larger information extraction environment, this specific engine receives a large set of proper names to search for, a text to search, and search options; and outputs all the occurrences of the names found in the text. Beyond the similarity search capabilities applied at the intraword level, the tool considers a set of specific person name formation rules at the word level, such as combination, abbreviation, duplicity detections, ordering, word omission and insertion, among others. This engine is used in a successful commercial application (also named Matchsimile), which allows searching for lawyer names in official law publications.

Journal ArticleDOI
TL;DR: This paper presents a novel approach that is able to operate in a dynamic environment, where there is a steady arrival of new strings belonging to the considered set and needs only the median of the set computed before together with the new string to compute an updated median string of the new set.
Abstract: The generalised median string is defined as a string that has the smallest sum of distances to the elements of a given set of strings. It is a valuable tool in representing a whole set of objects by a single prototype, and has interesting applications in pattern recognition. All algorithms for computing generalised median strings known from the literature are of static nature. That is, they require all elements of the underlying set of strings to be given when the algorithm is started. In this paper, we present a novel approach that is able to operate in a dynamic environment, where there is a steady arrival of new strings belonging to the considered set. Rather than computing the median from scratch upon arrival of each new string, the proposed algorithm needs only the median of the set computed before together with the new string to compute an updated median string of the new set. Our approach is experimentally compared to a greedy algorithm and the set median using both synthetic and real data.

Patent
19 Dec 2003
TL;DR: A method, computer program and system for optimizing similarity string filtering are disclosed, in which a first data string comprising one or more data characters and a second data string comprising one or more data characters are selected.
Abstract: A method, computer program and system for optimizing similarity string filtering are disclosed. A first data string comprising one or more data characters and a second data string comprising one or more data characters are selected. At least one of a defined set of shapes is applied to the first data string to generate one or more patterns associated with the first data string. At least one of the defined set of shapes is applied to the second data string to generate one or more patterns associated with the second data string. The one or more patterns associated with the first data string are compared with the one or more patterns associated with the second data string to determine if one or more matching patterns exist. The first data string and the second data string are linked if one or more matching patterns exist.

Proceedings ArticleDOI
Jun Sun, Zhulong Wang, Hao Yu, Fumihito Nishino, Yukata Katsuyama, Satoshi Naoi
20 Nov 2003
TL;DR: A novel stroke verification algorithm is used to effectively remove non-character strokes and build the binary text line image, which is segmented and recognized by dynamic programming.
Abstract: Images play a very important role in web content delivery. Many WWW images contain text information that can be used for web indexing and searching. A new text extraction and recognition algorithm is proposed in this paper. The character strokes in the image are first extracted by color clustering and connected component analysis. A novel stroke verification algorithm is used to effectively remove non-character strokes. The verified strokes are then used to build the binary text line image, which is segmented and recognized by dynamic programming. Since text in WWW images usually has a close relationship with webpage content, approximate string matching is used to revise the recognition result by matching the content in the webpage with the content in the image. This effective post-processing not only improves the recognition performance, but can also be used in other applications such as image-webpage paragraph correspondence.

01 Jan 2003
TL;DR: A construction algorithm is presented which is currently the fastest practical construction method for large suffix trees and a clustered storage scheme for the suffix tree is proposed that takes into account the locality behavior of typical query types, which leads to a significant speed-up particularly for the exact string matching problem.
Abstract: Suffix trees have been established as one of the most versatile index structures for unstructured string data like genomic sequences and other strings. In this work, our goal is the development of algorithms for the efficient construction of suffix trees for very large strings and their convenient storage regarding fast access when main memory is limited. We present a construction algorithm which, to the best of our knowledge, is currently the fastest practical construction method for large suffix trees. Further we propose a clustered storage scheme for the suffix tree that takes into account the locality behavior of typical query types, which leads to a significant speed-up particularly for the exact string matching problem. For very large strings the query time is faster than that of other recent index structures like the enhanced suffix array.

Book ChapterDOI
28 May 2003
TL;DR: This paper gives the first known structures and corresponding algorithms for approximate indexing under the Hamming distance having the following properties: their size is linear times a polylog of the size of the text on average.
Abstract: In this paper we give the first, to our knowledge, structures and corresponding algorithms for approximate indexing, by considering the Hamming distance, having the following properties. i) Their size is linear times a polylog of the size of the text on average. ii) For each pattern x, the time spent by our algorithms for finding the list occ(x) of all occurrences of a pattern x in the text, up to a certain distance, is proportional on average to |x| + |occ(x)|, under an additional but realistic hypothesis.

01 Jan 2003
TL;DR: This thesis focuses on unit-cost edit distance that defines the distance between two strings as the minimum number of edit operations that are needed in transforming one of the strings into the other.
Abstract: Given a pattern string and a text, the task of approximate string matching is to find all locations in the text that are similar to the pattern. This type of search may be done for example in applications of spelling error correction or bioinformatics. Typically edit distance is used as the measure of similarity (or distance) between two strings. In this thesis we concentrate on unit-cost edit distance that defines the distance between two strings as the minimum number of edit operations that are needed in transforming one of the strings into the other. More specifically, we discuss the Levenshtein and the Damerau edit distances. Approximate string matching algorithms can be divided into off-line and on-line algorithms depending on whether they may or may not, respectively, preprocess the text. In this thesis we propose practical algorithms for both types of approximate string matching as well as for computing edit distance. Our main contributions are a new variant of the bit-parallel approximate string matching algorithm of Myers, a method that makes it easy to modify many existing Levenshtein edit distance algorithms into using the Damerau edit distance, a bit-parallel algorithm for computing edit distance, a more error tolerant version of the ABNDM algorithm, a two-phase filtering scheme, a tuned indexed approximate string matching method for genome searching, and an improved and extended version of the hybrid index of Navarro and Baeza-Yates. To evaluate their practicality, we compare most of the proposed methods with previously existing algorithms. The test results support the claim of the title of this thesis that our proposed algorithms work well in practice.
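The Damerau distance discussed above extends Levenshtein with adjacent transpositions. A sketch of the restricted (optimal-string-alignment) DP form, which counts a swap of two adjacent characters as one operation:

```python
# Restricted Damerau (optimal string alignment) distance: the three
# Levenshtein operations plus a transposition of two adjacent characters,
# added as one extra case in the standard DP recurrence.

def damerau_osa(a: str, b: str) -> int:
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[m][n]
```

The single extra case is why many Levenshtein algorithms can be adapted to the Damerau distance with modest changes, which is one of the contributions the abstract mentions.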

Patent
17 Jan 2003
TL;DR: In this article, an Arabic handwriting recognition system takes an input from a stylus in the form of an ordered sequence of data, and subsequently strokes (or directed line segments) are extracted from the sequence.
Abstract: An Arabic handwriting recognition system takes an input from a stylus in the form of an ordered sequence of data. The sequence of data is then processed to eliminate any noise associated with data, and subsequently strokes (or directed line segments) are extracted from the sequence of data. More analysis of the strokes is performed to transform the input data into a features vector. Next, the features vector is matched against the features of all Arabic letters using fuzzy matching and dynamic programming techniques. During this matching process, the input word is segmented into the sequence of characters that maximized the matching score. In addition, external objects (such as: single dots, double dots, triple dots, hamzas, or maddas) that are above and below Arabic letters are detected.

Proceedings ArticleDOI
02 Nov 2003
TL;DR: From the comparison, for string matching on Chinese text and URL strings the AQR algorithm is rather efficient, while on Email address matching SBOM does better; the skipping matching algorithms (such as Mgrep) are much more efficient for small pattern sets.
Abstract: We analyzed the core ideas of three basic string matching algorithms (KMP, BM, DFA), described the principles of five advanced online multi-pattern matching algorithms (AC, RAC, AQR, SBOM, Mgrep) and compared the matching efficiencies of the five algorithms by searching speed, preprocessing time and memory used on three web information string sets (Chinese phrases, URL strings, Email address strings), especially focusing on the influence of pattern set size and minimum pattern length on the efficiency. From the comparison, we find that for string matching on Chinese text and URL strings, the AQR algorithm is rather efficient, while on Email address matching, SBOM does better. The skipping matching algorithms (such as Mgrep) are much more efficient for small pattern sets. So a combined algorithm of efficient matching algorithms seems to improve the performance and efficiency of information content security systems.
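For contrast with the algorithms compared above, the classic Aho-Corasick automaton (the AC baseline) can be sketched as follows: a goto trie plus failure links lets all patterns be matched in a single pass over the text.

```python
# Minimal Aho-Corasick: build a trie of the patterns, add failure links
# by BFS, then scan the text once, reporting (start, pattern) matches.
from collections import deque

def build_aho_corasick(patterns):
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        node = 0
        for c in pat:
            if c not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][c] = len(goto) - 1
            node = goto[node][c]
        out[node].add(pat)
    queue = deque(goto[0].values())          # depth-1 nodes fail to the root
    while queue:
        u = queue.popleft()
        for c, v in goto[u].items():
            queue.append(v)
            f = fail[u]
            while f and c not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(c, 0)
            out[v] |= out[fail[v]]           # inherit shorter matches ending here
    return goto, fail, out

def search_all(text, patterns):
    goto, fail, out = build_aho_corasick(patterns)
    node, hits = 0, []
    for i, c in enumerate(text):
        while node and c not in goto[node]:
            node = fail[node]
        node = goto[node].get(c, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return sorted(hits)
```

Because it never skips text characters, AC-style automata are insensitive to the minimum pattern length, whereas skipping algorithms like Mgrep gain their advantage from long shifts, which matches the tradeoff the comparison reports.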

Proceedings ArticleDOI
07 Jul 2003
TL;DR: This paper proposes a method of dynamic programming matching for information retrieval that is as effective as a conventional information retrieval system, even though it is capable of approximate matching.
Abstract: Though dynamic programming matching can carry out approximate string matching when there may be deletions or insertions in a document, its effectiveness and efficiency are usually too poor to use it for large-scale information retrieval. In this paper, we propose a method of dynamic programming matching for information retrieval. This method is as effective as a conventional information retrieval system, even though it is capable of approximate matching. It is also as efficient as a conventional system.
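Dynamic programming matching of a pattern against every position of a document can be sketched with semi-global alignment, where row 0 of the DP table is all zeros so a match may begin anywhere in the text. This is an illustrative baseline, not the paper's retrieval method.

```python
# Semi-global DP (approximate substring search): report all text
# positions where the pattern matches with at most k edits.  Column 0
# of each text position is 0, so matches may start anywhere.

def approximate_search(pattern: str, text: str, k: int) -> list:
    m = len(pattern)
    col = list(range(m + 1))          # pattern prefix vs. empty text
    ends = []
    for j, tc in enumerate(text, 1):
        new = [0]                     # a match may start at any position
        for i, pc in enumerate(pattern, 1):
            new.append(min(col[i] + 1,                 # delete from text
                           new[-1] + 1,                # delete from pattern
                           col[i - 1] + (pc != tc)))   # match/substitute
        col = new
        if col[m] <= k:
            ends.append(j)            # pattern matches ending at position j
    return ends
```

The O(mn) cost per query is what makes naive DP matching too slow for large-scale retrieval, motivating the efficiency techniques the paper proposes.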