
Showing papers on "Approximate string matching" published in 2009


Book ChapterDOI
30 Aug 2009
TL;DR: This paper proposes to apply a Collaborative Work approach that leverages former explorations of the cube to recommend OLAP queries, and adapts Approximate String Matching, a technique popular in Information Retrieval, to match the current analysis with the former explorations and help suggest a query to the user.
Abstract: Interactive analysis of a datacube, in which a user navigates the cube by launching a sequence of queries, is often tedious, since the user may have no idea what the forthcoming query should be in the current analysis. To better support this process, we propose in this paper to apply a Collaborative Work approach that leverages former explorations of the cube to recommend OLAP queries. The system we have developed adapts Approximate String Matching, a technique popular in Information Retrieval, to match the current analysis with former explorations and help suggest a query to the user. Our approach has been implemented with the open source Mondrian OLAP server to recommend MDX queries, and we have carried out preliminary experiments that show its efficiency at generating effective query recommendations.
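
As a rough, hypothetical illustration of the idea (not the authors' Mondrian/MDX implementation): model each former exploration as a sequence of query identifiers, find the past session prefix most similar to the current session, and suggest the query that followed it. Here difflib's similarity ratio stands in for the paper's approximate-string-matching machinery, and recommend_next is our invented helper name.

    import difflib

    def recommend_next(current, former_sessions):
        """Suggest the query that followed the most similar session prefix."""
        best_score, suggestion = -1.0, None
        for session in former_sessions:
            for cut in range(1, len(session)):
                score = difflib.SequenceMatcher(None, current,
                                                session[:cut]).ratio()
                if score > best_score:
                    best_score, suggestion = score, session[cut]
        return suggestion

    # Former explorations and the current analysis as sequences of query ids.
    history = [["q1", "q2", "q3", "q4"], ["q1", "q5", "q6"]]
    print(recommend_next(["q1", "q2"], history))  # -> 'q3'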

74 citations


Book
24 Nov 2009
TL;DR: The Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so-called edit distance), often used in applications that need to determine how similar, or different, two strings are, such as spell checkers.
Abstract: In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so-called edit distance). The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. A generalization of the Levenshtein distance (the Damerau-Levenshtein distance) also allows the transposition of two adjacent characters as an operation. Some Translation Environment Tools, such as translation memory leveraging applications, use the Levenshtein algorithm to measure the edit distance between two fuzzy-matching content segments. The metric is named after Vladimir Levenshtein, who considered this distance in 1965. It is often used in applications that need to determine how similar, or different, two strings are, such as spell checkers.
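
The definition above translates directly into the textbook dynamic-programming recurrence; a minimal sketch (the helper name levenshtein is ours):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions,
        and substitutions needed to transform a into b."""
        prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
        for i, ca in enumerate(a, 1):
            curr = [i]                  # distance from a[:i] to ""
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,           # deletion
                                curr[-1] + 1,          # insertion
                                prev[j - 1] + cost))   # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # -> 3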

66 citations


Journal ArticleDOI
TL;DR: It is proved that a restricted version of the closest string problem has the same parameterized complexity as the closest substring, answering an open question in the literature.
Abstract: The closest string problem and the closest substring problem are both natural theoretical computer science problems and find important applications in computational biology. Given $n$ input strings, the closest string (substring) problem finds a new string within distance $d$ of (a substring of) each input string such that $d$ is minimized. Both problems are NP-complete. In this paper we propose new algorithms for these two problems. For the closest string problem, we developed an exact algorithm with time complexity $O(n|\Sigma|^{O(d)})$, where $\Sigma$ is the alphabet. This improves the previously best known result of $O(nd^{O(d)})$ and yields a polynomial-time algorithm when $d=O(\log n)$. Using this algorithm, a polynomial-time approximation scheme (PTAS) for the closest string problem is also given, with time complexity $O(n^{O(\epsilon^{-2})})$, improving the previously best known $O(n^{O(\epsilon^{-2}\log\frac{1}{\epsilon})})$ PTAS. A new algorithm for the closest substring problem is also proposed. Finally, we prove that a restricted version of the closest substring problem has the same parameterized complexity as the closest substring problem, answering an open question in the literature.
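
To make the objective concrete, here is a brute-force sketch; it enumerates every candidate string over the alphabet, so it is exponential and only usable on toy instances, in contrast to the paper's $O(n|\Sigma|^{O(d)})$ algorithm:

    from itertools import product

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def closest_string(strings, alphabet):
        """Exhaustive search for a string minimizing the maximum Hamming
        distance to the inputs; cost is |alphabet| ** length."""
        length = len(strings[0])
        best, best_d = None, length + 1
        for cand in product(alphabet, repeat=length):
            cand = "".join(cand)
            d = max(hamming(cand, s) for s in strings)
            if d < best_d:
                best, best_d = cand, d
        return best, best_d

    print(closest_string(["acgt", "acga", "tcga"], "acgt"))  # -> ('acga', 1)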

57 citations


Journal ArticleDOI
TL;DR: This paper focuses on indexed approximate string matching (ASM), which is of great interest, say, in bioinformatics, and studies ASM algorithms for Lempel-Ziv compressed indexes and for compressed suffix trees/arrays, which are competitive and provide useful space-time tradeoffs compared to classical indexes.
Abstract: A compressed full-text self-index for a text T is a data structure requiring reduced space and able to search for patterns P in T. It can also reproduce any substring of T, thus actually replacing T. Despite the recent explosion of interest in compressed indexes, there has not been much progress on functionalities beyond basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, say, in bioinformatics. We study ASM algorithms for Lempel-Ziv compressed indexes and for compressed suffix trees/arrays. Most compressed self-indexes belong to one of these classes. We start by adapting the classical method of partitioning into exact search to self-indexes, and optimize it over a representative of either class of self-index. Then, we show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to a Lempel-Ziv index. Finally, we improve hierarchical verification, a successful technique for sequential searching, so as to extend the matches of pattern pieces to the left or right. Most compressed suffix trees/arrays support the required bidirectionality, thus enabling the implementation of the improved technique. In turn, the improved verification largely reduces the accesses to the text, which are expensive in self-indexes. We show experimentally that our algorithms are competitive and provide useful space-time tradeoffs compared to classical indexes.
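
The "partitioning into exact search" method the authors adapt rests on a pigeonhole argument: if P occurs with at most k errors, then splitting P into k+1 pieces leaves at least one piece that occurs exactly. A hedged, index-free sketch of that filter (the paper runs it on top of compressed self-indexes rather than plain text scanning):

    def approx_occurrences(pattern, text, k):
        """Pigeonhole filter: exact hits of the k+1 pieces select candidate
        windows, which are verified by semi-global edit distance."""
        def min_edit_in(p, window):
            # min edit distance between p and any substring of window
            prev = [0] * (len(window) + 1)          # free start in window
            for i, ca in enumerate(p, 1):
                curr = [i]
                for j, cb in enumerate(window, 1):
                    curr.append(min(prev[j] + 1, curr[-1] + 1,
                                    prev[j - 1] + (ca != cb)))
                prev = curr
            return min(prev)                        # free end in window

        m, step = len(pattern), len(pattern) // (k + 1)
        pieces = [(i * step, pattern[i * step:(i + 1) * step if i < k else m])
                  for i in range(k + 1)]
        windows = set()
        for offset, piece in pieces:
            pos = text.find(piece)
            while pos != -1:
                lo = max(0, pos - offset - k)       # candidate window start
                hi = min(len(text), pos - offset + m + k)
                if min_edit_in(pattern, text[lo:hi]) <= k:
                    windows.add(lo)
                pos = text.find(piece, pos + 1)
        return sorted(windows)

    # windows at 0 ("anneal", one edit away) and 13 ("annual", exact)
    print(approx_occurrences("annual", "annealing_and_annuals", 1))  # -> [0, 13]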

53 citations


Proceedings ArticleDOI
02 Nov 2009
TL;DR: Experimental results on a real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach.
Abstract: Entity resolution, also known as data matching or record linkage, is the task of identifying and matching records from several databases that refer to the same entities. Traditionally, entity resolution has been applied in batch-mode and on static databases. However, many organisations are increasingly faced with the challenge of having large databases containing entities that need to be matched in real-time with a stream of query records also containing entities, such that the best matching records are retrieved. Example applications include online law enforcement and national security databases, public health surveillance and emergency response systems, financial verification systems, online retail stores, eGovernment services, and digital libraries. A novel inverted index based approach for real-time entity resolution is presented in this paper. At build time, similarities between attribute values are computed and stored to support the fast matching of records at query time. The presented approach differs from other approaches to approximate query matching in that it allows any similarity comparison function, and any 'blocking' (encoding) function, both possibly domain specific, to be incorporated. Experimental results on a real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach. The interested reader is referred to the longer version of this paper [5].
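
A hedged sketch of the build-time/query-time split described above, with a first-two-letters blocking function and q-gram Jaccard similarity standing in for the arbitrary, possibly domain-specific functions the approach allows; all names here are our own:

    def qgrams(s, q=2):
        s = f"#{s}#"  # pad so word boundaries form grams too
        return {s[i:i + q] for i in range(len(s) - q + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    class RealTimeERIndex:
        """Build time: block records and pre-compute their gram sets.
        Query time: compare only against records sharing a block key."""
        def __init__(self, block=lambda v: v[:2].lower()):
            self.block = block
            self.index = {}   # block key -> list of record ids
            self.grams = {}   # record id -> pre-computed q-gram set

        def add(self, rid, value):
            self.index.setdefault(self.block(value), []).append(rid)
            self.grams[rid] = qgrams(value)

        def query(self, value, threshold=0.5):
            g = qgrams(value)
            cands = self.index.get(self.block(value), [])
            scored = [(jaccard(g, self.grams[r]), r) for r in cands]
            return sorted([s for s in scored if s[0] >= threshold],
                          reverse=True)

    idx = RealTimeERIndex()
    for rid, name in enumerate(["christen", "christensen", "smith"]):
        idx.add(rid, name)
    print(idx.query("christian"))  # -> [(0.58..., 0), (0.5, 1)]; 'smith' skipped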

48 citations


Proceedings ArticleDOI
29 Jun 2009
TL;DR: This paper presents a framework that advocates lazy update propagation with the following key feature: Efficient, incremental updates that immediately reflect the new data in the indexes in a way that gives strict guarantees on the quality of subsequent query answers.
Abstract: Approximate string matching is a problem that has received a lot of attention recently. Existing work on information retrieval has concentrated on a variety of similarity measures (TF/IDF, BM25, HMM, etc.) specifically tailored for document retrieval purposes. As new applications that depend on retrieving short strings become popular (e.g., local search engines like YellowPages.com, Yahoo! Local, and Google Maps), new indexing methods are needed, tailored for short strings. For that purpose, a number of indexing techniques and related algorithms have been proposed based on length-normalized similarity measures. A common denominator of indexes for length-normalized measures is that maintaining the underlying structures in the presence of incremental updates is inefficient, mainly due to data-dependent, precomputed weights associated with each distinct token and string. Incorporating updates is usually accomplished by rebuilding the indexes at regular time intervals. In this paper we present a framework that advocates lazy update propagation with the following key feature: efficient, incremental updates that immediately reflect the new data in the indexes in a way that gives strict guarantees on the quality of subsequent query answers. More specifically, our techniques guarantee against false negatives and limit the number of false positives produced. We implement a fully working prototype and illustrate that the proposed ideas work very well in practice on real datasets.
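
The maintenance problem stems from those data-dependent weights: a typical length-normalized measure scores token overlap with idf-style weights, and every insertion shifts document frequencies and therefore, in principle, every precomputed weight. A hedged sketch of such a measure (the weighting and normalization below are illustrative, not the paper's exact formulas):

    import math

    def idf_weights(collection):
        """Data-dependent token weights: rarer tokens count for more.
        Adding one string changes df counts and hence all weights --
        which is why naive index maintenance is expensive."""
        n, df = len(collection), {}
        for s in collection:
            for tok in set(s.split()):
                df[tok] = df.get(tok, 0) + 1
        return {tok: math.log(1 + n / d) for tok, d in df.items()}

    def normalized_sim(a, b, w):
        """Weighted token overlap, normalized by the strings' weight mass."""
        ta, tb = set(a.split()), set(b.split())
        shared = sum(w.get(t, 0.0) for t in ta & tb)
        norm = math.sqrt(sum(w.get(t, 0.0) for t in ta) *
                         sum(w.get(t, 0.0) for t in tb))
        return shared / norm if norm else 0.0

    data = ["main st", "main street", "elm street"]
    w = idf_weights(data)
    print(round(normalized_sim("main st", "main street", w), 3))  # -> 0.446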

45 citations


Proceedings ArticleDOI
26 Feb 2009
TL;DR: The classical Four-Russians technique can be incorporated into the SLP edit-distance scheme, giving a simple $\Omega(\lg N)$ speed-up in the case of arbitrary scoring functions, for any pair of strings.
Abstract: We present a unified framework for accelerating edit-distance computation between two compressible strings using straight-line programs. For two strings of total length $N$ having straight-line program representations of total size $n$, we provide an algorithm running in $O(n^{1.4}N^{1.2})$ time for computing the edit-distance of these two strings under any rational scoring function, and an $O(n^{1.34}N^{1.34})$-time algorithm for arbitrary scoring functions. This improves on a recent algorithm of Tiskin that runs in $O(nN^{1.5})$ time and works only for rational scoring functions. Also, in the last part of the paper, we show how the classical Four-Russians technique can be incorporated into our SLP edit-distance scheme, giving us a simple $\Omega(\lg N)$ speed-up in the case of arbitrary scoring functions, for any pair of strings.

44 citations


Journal ArticleDOI
TL;DR: This work develops a novel and unorthodox filtering technique based on transforming the problem into multiple matching of carefully chosen pattern subsequences, which leads to very simple algorithms that are optimal on average.

40 citations


Proceedings ArticleDOI
04 Jan 2009
TL;DR: This work considers the classic problem of pattern matching with few mismatches in the presence of promiscuously matching wildcard symbols and develops a new framework in which to tackle approximate pattern matching problems.
Abstract: We consider the classic problem of pattern matching with few mismatches in the presence of promiscuously matching wildcard symbols. Given a text t of length n and a pattern p of length m with optional wildcard symbols and a bound k, our algorithm finds all the alignments for which the pattern matches the text with Hamming distance at most k and also returns the location and identity of each mismatch. The algorithm we present is deterministic and runs in O(kn) time, matching the best known randomised time complexity to within logarithmic factors. The solutions we develop borrow from the tool set of algebraic coding theory and provide a new framework in which to tackle approximate pattern matching problems.
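
As a hedged reference point for the problem statement (the paper's O(kn) algorithm itself relies on algebraic coding-theory machinery), the brute-force O(nm) solution, with '?' as the wildcard, is:

    def k_mismatch_with_wildcards(text, pattern, k, wild="?"):
        """Naive baseline: report (alignment, mismatch list) for every
        alignment with Hamming distance <= k, wildcards matching freely."""
        n, m = len(text), len(pattern)
        out = []
        for i in range(n - m + 1):
            mism = [(i + j, text[i + j], pattern[j])
                    for j in range(m)
                    if pattern[j] != wild and text[i + j] != wild
                    and text[i + j] != pattern[j]]
            if len(mism) <= k:
                out.append((i, mism))
        return out

    # -> [(0, [(2, 'c', 'd')]), (3, [])]
    print(k_mismatch_with_wildcards("abcabd", "a?d", 1))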

35 citations


Patent
10 Jun 2009
TL;DR: Indexes, such as inverted indexes, are updated only as necessary to guarantee answer precision within predefined thresholds, which are determined at little cost in comparison to updating the indexes themselves.
Abstract: In embodiments of the disclosed technology, indexes, such as inverted indexes, are updated only as necessary to guarantee answer precision within predefined thresholds which are determined with little cost in comparison to the updates of the indexes themselves. With the present technology, a batch of daily updates can be processed in a matter of minutes, rather than a few hours for rebuilding an index, and a query may be answered with assurances that the results are accurate or within a threshold of accuracy.

34 citations


Book ChapterDOI
05 Dec 2009
TL;DR: It turns out that the core technique actually solves a more general incremental string comparison problem that allows the insertion, deletion, and substitution of multiple symbols.
Abstract: We study the problem of finding all maximal approximate gapped palindromes in a string. More specifically, given a string S of length n, a parameter q ≥ 0 and a threshold k > 0, the problem is to identify all substrings in S of the form uvw such that (1) the Levenshtein distance between u and w^r is at most k, where w^r is the reverse of w, and (2) v is a string of length q. The best previous work requires O(k^2 n) time. In this paper, we propose an O(kn)-time algorithm for this problem by utilizing an incremental string comparison technique. It turns out that the core technique actually solves a more general incremental string comparison problem that allows the insertion, deletion, and substitution of multiple symbols.
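
A hedged brute-force checker for this definition (far slower than the paper's O(kn) algorithm, and reporting all rather than only maximal palindromes) makes the u-v-w structure concrete:

    def gapped_palindromes(s, q, k):
        """Report (start, |u|, |w|) for substrings uvw of s with |v| == q
        and Levenshtein(u, reverse(w)) <= k. Brute force over all splits."""
        def edit(a, b):
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                curr = [i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(prev[j] + 1, curr[-1] + 1,
                                    prev[j - 1] + (ca != cb)))
                prev = curr
            return prev[-1]

        n, out = len(s), []
        for i in range(n):                            # start of u
            for lu in range(1, n - i - q + 1):        # length of u
                for lw in range(1, n - i - lu - q + 1):  # length of w
                    u = s[i:i + lu]
                    w = s[i + lu + q:i + lu + q + lw]
                    if edit(u, w[::-1]) <= k:
                        out.append((i, lu, lw))
        return out

    # -> [(0, 2, 2), (1, 1, 1)]: 'ab|x|ba' and 'b|x|b'
    print(gapped_palindromes("abxba", q=1, k=0))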

Journal ArticleDOI
TL;DR: This paper considers a class of pattern matching problems where the content is assumed to be correct, while the locations may have shifted/changed, and formally defines a broad class of problems, capturing situations in which the pattern is obtained from the text by a sequence of rearrangements.

Patent
25 Mar 2009
TL;DR: In this article, a computer implemented method and system that progressively relaxes search terms provided by a user is described, where the accepted search terms are modified uniquely based on the predefined types to structure first alternative queries and the second alternative queries are compared with the stored data to find approximate matches.
Abstract: Disclosed herein is a computer implemented method and system that progressively relaxes search terms provided by a user. Data of predefined types is stored in a database. The data is obtained by uniquely modifying data previously stored in the database, based on the predefined types. Search terms of predefined types are accepted from the user. The search terms are compared with the stored data to find exact matches, if length of the search terms exceeds a predefined value. On not finding exact matches, the accepted search terms are modified uniquely based on the predefined types to structure first alternative queries. The first alternative queries are compared with the stored data to find exact matches. On not finding exact matches, the first alternative queries are modified based on the predefined types to structure second alternative queries. The second alternative queries are compared with the stored data to find approximate matches.
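
A hedged sketch of the progressive flow described in the claims; the concrete normalization and relaxation steps below are invented for illustration, and difflib stands in for the approximate-matching step:

    import difflib

    def progressive_search(term, records, min_len=3):
        """Exact match first; then normalized exact; then approximate."""
        def normalize(s):                 # first relaxation (illustrative)
            return "".join(ch for ch in s.lower() if ch.isalnum())

        if len(term) < min_len:           # too short to match reliably
            return []
        hits = [r for r in records if r == term]                     # exact
        if not hits:                      # first alternative queries
            hits = [r for r in records if normalize(r) == normalize(term)]
        if not hits:                      # second alternative: approximate
            by_norm = {normalize(r): r for r in records}
            close = difflib.get_close_matches(normalize(term),
                                              list(by_norm), cutoff=0.8)
            hits = [by_norm[c] for c in close]
        return hits

    db = ["O'Brien", "Smith", "Smyth"]
    print(progressive_search("obrien", db))  # -> ["O'Brien"] (normalized)
    print(progressive_search("smithe", db))  # -> ['Smith'] (approximate)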

Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper develops a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques, and develops efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists.
Abstract: An approximate search query on a collection of strings finds those strings in the collection that are similar to a given query string, where similarity is defined using a given similarity function such as Jaccard, cosine, and edit distance. Answering approximate queries efficiently is important in many applications such as search engines, data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. In this paper, we study the problem of efficiently computing the best answers to an approximate string query, where the quality of a string is based on both its importance and its similarity to the query string. We first develop a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques. We then develop efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists. We answer a ranking query by traversing the inverted lists, pruning and skipping irrelevant string ids, iteratively increasing the pruning and skipping power, and doing early termination. We have conducted extensive experiments on real datasets to evaluate the proposed algorithms and report our findings.
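
The gram-based inverted lists at the core of these algorithms can be sketched as follows: a minimal version with set semantics and none of the paper's pruning, skipping, or early-termination machinery:

    def grams(s, q=2):
        return [s[i:i + q] for i in range(len(s) - q + 1)]

    def build_gram_index(strings, q=2):
        """Inverted lists: q-gram -> ids of strings containing it."""
        inv = {}
        for sid, s in enumerate(strings):
            for g in set(grams(s, q)):
                inv.setdefault(g, []).append(sid)
        return inv

    def candidates(query, strings, inv, k, q=2):
        """Count filter: a string within edit distance k of the query must
        share at least len(grams(query)) - k*q of its q-grams (assumes the
        query is long enough that this bound stays positive)."""
        need = len(grams(query, q)) - k * q
        counts = {}
        for g in set(grams(query, q)):
            for sid in inv.get(g, []):
                counts[sid] = counts.get(sid, 0) + 1
        return [strings[sid] for sid, c in counts.items() if c >= need]

    data = ["seattle", "seatle", "boston"]
    inv = build_gram_index(data)
    print(candidates("seattle", data, inv, k=1))  # -> ['seattle', 'seatle']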

Journal ArticleDOI
TL;DR: This work presents a time-space trade-off for approximate string matching and regular expression matching on texts compressed with the Ziv-Lempel adaptive dictionary compression schemes, improving the previously known complexities for both problems and significantly improving the space bounds.
Abstract: We study the approximate string matching and regular expression matching problem for the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We present a time-space trade-off that leads to algorithms improving the previously known complexities for both problems. In particular, we significantly improve the space bounds, which in practical applications are likely to be a bottleneck.

Journal Article
TL;DR: A fast algorithm for finding approximate matches of a string in a finite-state automaton, given some metric of similarity, which can be adapted to use a variety of metrics for determining the distance between two words.
Abstract: We present a fast algorithm for finding approximate matches of a string in a finite-state automaton, given some metric of similarity. The algorithm can be adapted to use a variety of metrics for determining the distance between two words.

Book ChapterDOI
18 Jun 2009
TL;DR: This paper proposes the first data structure for approximate dictionary search that occupies optimal space (up to a constant factor) and is able to answer an approximate query for edit distance 1 (report all strings of the dictionary at edit distance at most 1 from the query string) in time linear in the length of the query string.
Abstract: In the approximate dictionary search problem we have to construct a data structure on a set of strings so that we can answer queries of the kind: find all strings of the set that are similar (according to some string distance) to a given string. In this paper we propose the first data structure for approximate dictionary search that occupies optimal space (up to a constant factor) and is able to answer an approximate query for edit distance 1 (report all strings of the dictionary that are at edit distance at most 1 from the query string) in time linear in the length of the query string. Based on our new dictionary, we propose a full-text index for approximate queries with edit distance 1 (report all positions of all substrings of the text that are at edit distance at most 1 from the query string) answering a query in time linear in the length of the query string and using space $O(n(\lg(n)\lg\lg(n))^2)$ in the worst case on a text of length n. Our index is the first to answer queries in time linear in the length of the query string while using space $O(n \cdot \mathrm{poly}(\log n))$ in the worst case and for any alphabet size.
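
One classical way to get fast edit-distance-1 dictionary lookups, though far from the paper's space-optimal structure, is to index single-character deletions:

    def deletions(word):
        """The word itself plus every string with one character deleted."""
        return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

    def build(dictionary):
        idx = {}
        for w in dictionary:
            for d in deletions(w):
                idx.setdefault(d, set()).add(w)
        return idx

    def within_one(a, b):
        """Exact check: edit distance between a and b is at most 1."""
        if a == b:
            return True
        if abs(len(a) - len(b)) > 1:
            return False
        for i in range(min(len(a), len(b))):
            if a[i] != b[i]:   # one substitution, insertion, or deletion
                return (a[i + 1:] == b[i + 1:] or a[i + 1:] == b[i:]
                        or a[i:] == b[i + 1:])
        return True

    def query(idx, s):
        """Candidate pairs share a deletion variant; the exact check removes
        false positives such as 'ab' vs 'ba'."""
        cands = set()
        for d in deletions(s):
            cands |= idx.get(d, set())
        return {w for w in cands if within_one(w, s)}

    idx = build(["cat", "cart", "dog"])
    print(sorted(query(idx, "car")))  # -> ['cart', 'cat']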

Journal ArticleDOI
TL;DR: An Aho-Corasick-based parallel string matching scheme is proposed that outperforms existing bit-split string matching in evaluations on Snort rules.
Abstract: As the variety of hazardous packet payload contents increases, intrusion detection systems (IDS) must be able to detect numerous patterns in real time. For this reason, this paper proposes a parallel string matching scheme based on the Aho-Corasick algorithm. In order to balance memory usage between the homogeneous finite-state machine (FSM) tiles of each string matcher, an optimal set of bit-position groups is determined. Target patterns are sorted by binary-reflected Gray code (BRGC), which reduces bit transitions in patterns mapped onto a string matcher. In evaluations on Snort rules, the proposed string matching outperforms existing bit-split string matching.
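
For reference, a hedged software version of the Aho-Corasick automaton underlying the architecture (the paper's contribution, bit-splitting the FSM across tiles and BRGC pattern ordering, sits on top of this):

    from collections import deque

    def build_ac(patterns):
        """Aho-Corasick: trie of patterns plus BFS-computed failure links."""
        goto, fail, out = [{}], [0], [set()]
        for p in patterns:                        # build the trie
            s = 0
            for ch in p:
                if ch not in goto[s]:
                    goto.append({})
                    fail.append(0)
                    out.append(set())
                    goto[s][ch] = len(goto) - 1
                s = goto[s][ch]
            out[s].add(p)
        q = deque(goto[0].values())               # depth-1 states keep fail=0
        while q:
            s = q.popleft()
            for ch, t in goto[s].items():
                q.append(t)
                f = fail[s]
                while f and ch not in goto[f]:
                    f = fail[f]
                fail[t] = goto[f].get(ch, 0)
                out[t] |= out[fail[t]]            # inherit suffix matches
        return goto, fail, out

    def search(text, goto, fail, out):
        """One pass over the text reports all pattern occurrences."""
        s, hits = 0, []
        for i, ch in enumerate(text):
            while s and ch not in goto[s]:
                s = fail[s]
            s = goto[s].get(ch, 0)
            for p in out[s]:
                hits.append((i - len(p) + 1, p))
        return hits

    g, f, o = build_ac(["he", "she", "his", "hers"])
    print(search("ushers", g, f, o))  # 'she'@1, 'he'@2, 'hers'@2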

Patent
23 Jun 2009
TL;DR: In this paper, the authors describe techniques for error-tolerant auto-completion, where characters of an input string are displayed as they are inputted by a user, and when a character is added to the input string by the user, matching strings may be selected from among a set of candidate strings by determining which of the candidate strings have a prefix whose characters match the characters of the input text within a given edit distance of input text.
Abstract: Techniques for error-tolerant autocompletion are described. While displaying characters of an input string as they are inputted by a user, when a character is added to the input string by the user, matching strings may be selected from among a set of candidate strings by determining which of the candidate strings have a prefix whose characters match the characters of the input string within a given edit distance of the input string.
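
The core test, whether some prefix of a candidate lies within a given edit distance of what has been typed so far, is one dynamic-programming table per candidate; a hedged sketch (the patent additionally organizes candidates for efficient selection):

    def min_prefix_edit(typed, candidate):
        """Minimum edit distance between `typed` and any prefix of
        `candidate`: run the standard DP and take the minimum of the final
        row, since column j scores the prefix candidate[:j]."""
        prev = list(range(len(candidate) + 1))
        for i, ct in enumerate(typed, 1):
            curr = [i]
            for j, cc in enumerate(candidate, 1):
                curr.append(min(prev[j] + 1, curr[-1] + 1,
                                prev[j - 1] + (ct != cc)))
            prev = curr
        return min(prev)

    def autocomplete(typed, candidates, d=1):
        """Candidates having some prefix within edit distance d of `typed`."""
        return [c for c in candidates if min_prefix_edit(typed, c) <= d]

    words = ["binary", "banana", "bind", "care"]
    print(autocomplete("bnar", words))  # -> ['binary'] (prefix 'binar', 1 edit)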

Book ChapterDOI
02 Dec 2009
TL;DR: A protocol based on homomorphic encryption, combined with the novel notion of a share-hiding error-correcting secret sharing scheme, which is provably secure against passive adversaries, and has better efficiency than previous protocols for certain parameter values is presented.
Abstract: At Eurocrypt '04, Freedman, Nissim and Pinkas introduced a fuzzy private matching problem. The problem is defined as follows. Given two parties, each of them having a set of vectors where each vector has T integer components, the fuzzy private matching is to securely test if each vector of one set matches any vector of another set for at least t components, where t < T. In the conclusion of their paper, they asked whether it was possible to design a fuzzy private matching protocol without incurring a communication complexity with the factor $\binom{T}{t}$. We answer their question in the affirmative by presenting a protocol based on homomorphic encryption, combined with the novel notion of a share-hiding error-correcting secret sharing scheme, which we show how to implement with efficient decoding using interleaved Reed-Solomon codes. This scheme may be of independent interest. Our protocol is provably secure against passive adversaries, and has better efficiency than previous protocols for certain parameter values.

Proceedings ArticleDOI
02 Nov 2009
TL;DR: An incremental algorithm using signature-based inverted lists is developed to minimize duplicate list-scan operations across overlapping windows in the text; it significantly outperforms existing methods in the literature.
Abstract: We study the problem of approximate membership extraction (AME), i.e., how to efficiently extract substrings in a text document that approximately match some strings in a given dictionary. This problem is important in a variety of applications such as named entity recognition and data cleaning. We solve this problem in two steps. In the first step, for each substring in the text, we filter away the strings in the dictionary that are very different from the substring. In the second step, each candidate string is verified to decide whether the substring should be extracted. We develop an incremental algorithm using signature-based inverted lists to minimize the duplicate list-scan operations of overlapping windows in the text. Our experimental study of the proposed algorithms on real and synthetic datasets showed that our solutions significantly outperform existing methods in the literature.

Journal ArticleDOI
TL;DR: An FFT-based algorithm is presented that uses a novel prime-numbers encoding scheme, is $\log n/\log m$ times faster than the fastest extant approaches, which are based on Boolean convolutions, and speeds up solutions to approximate matching problems with character classes.

Patent
William T. Laaser
29 Jan 2009
TL;DR: In this paper, a comparison technique for efficiently comparing an input string to a set of strings is described, where the input string is represented in a tree structure as paths from a root to leaves of the tree structure, and strings in the set of string that share common substrings share nodes in the tree.
Abstract: A comparison technique for efficiently comparing an input string to a set of strings is described. This set of strings may be represented in a tree structure as paths from a root of the tree structure to leaves of the tree structure, and strings in the set of strings that share common substrings share nodes in the tree structure. During the comparison technique, labels may be assigned to a given node in the tree structure based at least in part on comparisons between a given character in the input string and a character associated with the given node. These labels may include a position of the given character in the input string, and a cumulative error between the characters in a string that are associated with a branch in the tree structure and the characters in the input string that have been processed. Based at least in part on these labels, an actual string, which corresponds to the input string, may be identified.
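
A hedged sketch of the idea: store the set of strings in a trie, so shared prefixes share nodes, and while walking it carry the running edit-distance DP row as the per-node labels, pruning a branch once its cumulative error exceeds the bound. The class and function names below are ours:

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.word = None        # set at the node ending a stored string

    def build_trie(words):
        root = TrieNode()
        for w in words:
            node = root
            for ch in w:
                node = node.children.setdefault(ch, TrieNode())
            node.word = w
        return root

    def fuzzy_lookup(root, query, max_err):
        """DFS through the trie, extending one Levenshtein DP row per node;
        a branch is pruned when its whole row exceeds max_err."""
        results = []

        def walk(node, row):
            if node.word is not None and row[-1] <= max_err:
                results.append((node.word, row[-1]))
            if min(row) > max_err:   # cumulative error too large: prune
                return
            for ch, child in node.children.items():
                new_row = [row[0] + 1]
                for j in range(1, len(query) + 1):
                    new_row.append(min(row[j] + 1, new_row[-1] + 1,
                                       row[j - 1] + (query[j - 1] != ch)))
                walk(child, new_row)

        walk(root, list(range(len(query) + 1)))
        return results

    trie = build_trie(["string", "strong", "sting", "trunk"])
    print(fuzzy_lookup(trie, "strng", 1))  # string, strong, sting at dist 1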

Proceedings ArticleDOI
08 Mar 2009
TL;DR: A fuzzy-matching clustering algorithm is introduced to group subjects found in malware-generated spam emails, detecting similar patterns even when the spammer creates a variation of the original pattern.
Abstract: In this paper, a fuzzy-matching clustering algorithm is introduced to group subjects found in spam emails which are generated by malware. A modified scoring strategy is applied in dynamic programming to find subjects that are similar to each other. A recursive seed selection strategy allows the algorithm to detect similar patterns even when the spammer creates a variation of the original pattern. A sliding threshold based on string length helps to minimize false positives. The algorithm proves to be effective in detecting and grouping spam emails that use templates. It also helps spam investigators to collect and sort large amounts of malware-generated spam more efficiently without looking at the email content.

Proceedings ArticleDOI
08 Mar 2009
TL;DR: This work combines various ideas from the large body of literature on approximate string searching and spelling correction techniques to a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space.
Abstract: We consider the following spelling variants clustering problem: given a list of distinct words, called a lexicon, compute (possibly overlapping) clusters of words which are spelling variants of each other. This problem naturally arises in the context of error-tolerant full-text search of the following kind: for a given query, return not only documents matching the query words exactly but also those matching their spelling variants. This is the inverse of the well-known "Did you mean: … ?" web search engine feature, where the error tolerance is on the side of the query, and not on the side of the documents. We combine various ideas from the large body of literature on approximate string searching and spelling correction techniques into a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space. Our largest lexicon, containing roughly 10 million words, can be processed in about 16 minutes on a standard PC using 10 MB of additional space. This beats the previously best scheme by a factor of two in running time and by a factor of more than ten in space usage. We have integrated our algorithms into the CompleteSearch engine in a way that achieves error-tolerant search without significant blowup in either index size or query processing time.

Journal ArticleDOI
TL;DR: A special device that can do string matching by performing n−m+1 text-to-pattern comparisons is described; it uses light and optical filters for performing computations.
Abstract: String matching is a very important problem in computer science. The problem consists in finding all the occurrences of a pattern P of length m in a text T of length n. We describe a special device which can do string matching by performing n−m+1 text-to-pattern comparisons. The proposed device uses light and optical filters for performing computations. Two physical implementations are proposed: one uses colored glass and the other uses polarizing filters. The strengths and weaknesses of each method are discussed in depth.

Proceedings ArticleDOI
01 Nov 2009
TL;DR: A hardware-efficient string matching architecture using the brute-force algorithm is proposed, and the process element that makes up the architecture is optimized by reducing the number of comparators.
Abstract: Due to the growing complexity of network environments, the need for packet payload inspection at the application layer has increased. String matching, which is critical to network intrusion detection systems, inspects packet payloads and detects malicious network attacks using a set of rules. Because string matching is a computationally intensive task, hardware-based string matching is required. In this paper, we propose a hardware-efficient string matching architecture based on the brute-force algorithm. The process element that makes up the proposed architecture is optimized by reducing the number of comparators. The performance of the proposed architecture is nearly equal to that of previous work, while the experimental results show that, for any process width, it reduces the comparator requirements compared with the previous work.

Book ChapterDOI
31 Mar 2009
TL;DR: This work presents several non-trivial applications of Matryoshka counters in string matching algorithms, improving their worst- or average-case time complexities.
Abstract: Many algorithms, e.g. in the field of string matching, are based on handling many counters, which can be performed in parallel, even on a sequential machine, using bit-parallelism. The recently presented technique of nested counters (Matryoshka counters) [1] is to handle small counters most of the time, and refer to larger counters periodically, when the small counters may get full, to prevent overflow. In this work, we present several non-trivial applications of Matryoshka counters in string matching algorithms, improving their worst- or average-case time complexities. The set of problems comprises (Δ,α)-matching, matching with k insertions, episode matching, and matching under Levenshtein distance.
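
For context, the flavor of bit-parallelism such counter techniques build on: a hedged Shift-And exact matcher in which one machine word tracks all pattern prefixes at once (nested counters pack many small counters into words in a similar spirit):

    def shift_and(text, pattern):
        """Shift-And: bit i of `state` is set iff pattern[:i+1] ends at the
        current text position; all prefixes update in O(1) word operations."""
        masks = {}
        for i, ch in enumerate(pattern):
            masks[ch] = masks.get(ch, 0) | (1 << i)
        goal, state, hits = 1 << (len(pattern) - 1), 0, []
        for pos, ch in enumerate(text):
            state = ((state << 1) | 1) & masks.get(ch, 0)
            if state & goal:
                hits.append(pos - len(pattern) + 1)
        return hits

    print(shift_and("abracadabra", "abra"))  # -> [0, 7]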

Journal ArticleDOI
TL;DR: The proposed algorithm builds a lexicon enriched with topic information in three steps: transcription of an audio stream into phone sequences with a speaker- and task-independent phone recogniser, automatic lexical acquisition based on approximate string matching, and hierarchical topic clustering of the lexical entries based on a knowledge-poor co-occurrence approach.

Proceedings ArticleDOI
08 Jul 2009
TL;DR: In trials on twenty-five instances of the closest string problem with alphabets ranging in size from 2 to 30, the algorithm that used the data-based representation of candidate strings consistently returned the best results, and its advantage increased with the sizes of the test instances' alphabets.
Abstract: Given a set of strings S of equal lengths over an alphabet σ, the closest string problem seeks a string over σ whose maximum Hamming distance to any of the given strings is as small as possible. A data-based coding of strings for evolutionary search represents candidate closest strings as sequences of indexes of the given strings. The string such a chromosome represents consists of the symbols in the corresponding positions of the indexed strings. A genetic algorithm using this coding was compared with two GAs that encoded candidate strings directly as strings over σ. In trials on twenty-five instances of the closest string problem with alphabets ranging in size from 2 to 30, the algorithm that used the data-based representation of candidate strings consistently returned the best results, and its advantage increased with the sizes of the test instances' alphabets.
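
The data-based coding is easy to state concretely; a hedged sketch of the decoding and fitness steps (the GA loop itself, selection, crossover, and mutation, is omitted):

    def decode(chromosome, strings):
        """Data-based coding: position p of the candidate takes the symbol
        at position p of the input string indexed by chromosome[p]."""
        return "".join(strings[idx][p] for p, idx in enumerate(chromosome))

    def fitness(chromosome, strings):
        """Objective to minimize: maximum Hamming distance to any input."""
        cand = decode(chromosome, strings)
        return max(sum(a != b for a, b in zip(cand, s)) for s in strings)

    strings = ["acgt", "acga", "tcga"]
    print(decode([1, 0, 0, 1], strings))   # -> 'acga'
    print(fitness([1, 0, 0, 1], strings))  # -> 1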