Showing papers on "Approximate string matching" published in 2006


01 Jan 2006
TL;DR: Agrep is a tool for approximate pattern matching based on a new efficient and flexible algorithm for approximate string matching; it is also competitive with other tools for exact string matching and includes many options that make searching more powerful and convenient.
Abstract: Searching for a pattern in a text file is a very common operation in many applications ranging from text editors and databases to applications in molecular biology. In many instances the pattern does not appear in the text exactly. Errors in the text or in the query can result from misspelling or from experimental errors (e.g., when the text is a DNA sequence). The use of such approximate pattern matching has been limited until now to specific applications. Most text editors and searching programs do not support searching with errors because of the complexity involved in implementing it. In this paper we describe a new tool, called agrep, for approximate pattern matching. Agrep is based on a new efficient and flexible algorithm for approximate string matching. Agrep is also competitive with other tools for exact string matching; it includes many options that make searching more powerful and convenient.

162 citations
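
To make "searching with errors" concrete, here is a minimal sketch of the textbook dynamic-programming formulation of approximate search: report every text position where the pattern ends with at most k insertions, deletions, or substitutions. It is not the algorithm introduced in the agrep paper; the function name `approx_find` and its return convention are assumptions made for illustration.

```python
def approx_find(pattern: str, text: str, k: int) -> list[int]:
    """Return end positions (exclusive) where pattern matches text with <= k edits."""
    m = len(pattern)
    # col[i] = fewest edits needed to match pattern[:i] against some
    # substring of the text that ends at the current position.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]  # col[0] stays 0 because a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra character in the text
                      col[i - 1] + 1)    # pattern character missing from the text
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


# "survey" occurs in "surgery" with at most two errors, ending at 5, 6 and 7.
print(approx_find("survey", "surgery", 2))
```

The sketch runs in O(mn) time, which is the baseline that faster approximate matchers are designed to beat.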


Patent
Kyung-eun Lee
16 Jun 2006
TL;DR: A string matching method, system, and a computer-readable medium storing instructions for determining and obtaining a representative string for a plurality of strings that are written in various manners but share the same meaning is described in this article.
Abstract: A string matching method, system, and a computer-readable medium storing instructions for determining and obtaining a representative string for a plurality of strings that are written in various manners but share the same meaning. The string matching method includes: converting the input string into one or more second-language strings with reference to a language mapping table, which stores a plurality of pieces of mapping information for mapping a first-language string to a second-language string, and generating a conversion list; searching a representative list database, which stores a plurality of records each with a representative string and a corresponding second-language representative string, for records including the same second-language representative strings as the respective second-language strings in the conversion list and generating a candidate list; and determining a representative string from the candidate list to be an output representative string. Therefore, the string matching can provide string-based multimedia data classification scenarios.

130 citations


Proceedings ArticleDOI
J.C. Herbordt, J. Model, Yongfeng Gu, Bharat Sukhwani, T. VanCourt
24 Apr 2006
TL;DR: Two new algorithms for emulating the seeding and extension phases of BLAST are contributed, which operate in a single pass through a database at streaming rate, and with no preprocessing other than loading the query string.
Abstract: Approximate string matching is fundamental to bioinformatics, and has been the subject of numerous FPGA acceleration studies. We address issues with respect to FPGA implementations of both BLAST- and dynamic-programming- (DP) based methods. Our primary contributions are two new algorithms for emulating the seeding and extension phases of BLAST. These operate in a single pass through a database at streaming rate (110 Maa/sec on a VP70 for query sizes up to 600 and 170 Maa/sec on a Virtex4 for query sizes up to 1024), and with no preprocessing other than loading the query string. Further, they use very high sensitivity with no slowdown. While current DP-based methods also operate at streaming rate, generating results can be cumbersome. We address this with a new structure for data extraction. We present results from several implementations.

95 citations


Journal ArticleDOI
TL;DR: Simple and practical algorithms for finding all pattern occurrences in sublinear time on average for parameterized string matching, in which the pattern P matches a substring t of the text T if there exists a bijective mapping from the symbols of P to the symbols of t.

57 citations
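
The matching condition in the summary above can be illustrated with a naive check, assuming every symbol may be renamed. This is only a sketch of the definition, not the sublinear-time algorithms the paper describes; `p_match` and `p_occurrences` are hypothetical names.

```python
def p_match(p: str, t: str) -> bool:
    """True if some bijection on symbols maps p onto t position by position."""
    if len(p) != len(t):
        return False
    fwd, bwd = {}, {}
    for a, b in zip(p, t):
        # setdefault records the first pairing; any later conflict breaks bijectivity
        if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return False
    return True


def p_occurrences(pattern: str, text: str) -> list[int]:
    """Naive O(nm) scan: test every window of the text against the pattern."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if p_match(pattern, text[i:i + m])]


print(p_occurrences("aba", "xyxzx"))  # windows "xyx" and "xzx" both match
```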


Proceedings Article
01 Jan 2006
TL;DR: This work presents an efficient method for combating obfuscation through the use of inexact string matching kernels, which were first developed to measure similarity among mutating genes in computational biology, and employs the Perceptron Algorithm using Margins for fast on-line training.
Abstract: Contemporary spammers commonly seek to defeat statistical spam filters through the use of word obfuscation. Such methods include character level substitutions, repetitions, and insertions to reduce the effectiveness of word-based features. We present an efficient method for combating obfuscation through the use of inexact string matching kernels, which were first developed to measure similarity among mutating genes in computational biology. Our system avoids the high classification costs associated with these kernel methods by working in an explicit feature space, and employs the Perceptron Algorithm using Margins for fast on-line training. No prior domain knowledge was incorporated into this system. We report strong experimental results on the TREC 2006 spam data sets and on other publicly available spam data, including near-perfect performance on the TREC 2006 Chinese spam data set. These results invite further exploration of the use of inexact string matching for spam filtering.

48 citations


Journal ArticleDOI
TL;DR: A well-studied case in which T is fixed and preprocessed into an indexing data structure so that any pattern query can be answered faster is investigated, which allows us to exploit compressed suffix arrays to reduce the indexing space to O(n) bits, while increasing the query time by an O(log n) factor only.

42 citations


Patent
17 Oct 2006
TL;DR: A first string is matched against the strings stored in a string dictionary by k-way hashing the first string and locating the corresponding k hash locations in a first memory.
Abstract: String matching a first string to a string stored in a string dictionary is performed by k-way hashing the first string and locating corresponding k hash locations in a first memory. When any of the k hash locations has a zero Bloom bit, the first string is deemed to not match any of the strings in the string dictionary. Otherwise, a sub-set of the k hash locations identified as those k hash locations having non-zero Bloom bits and a unique bit set to 1 each include a pointer that points to a string in the string dictionary that is fetched and compared to the first string wherein the fetches from the string dictionary are interleaved over the addresses from the first memory. A match signal is issued when the first string matches at least one of the strings stored in the dictionary.

39 citations
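
The lookup described in this abstract follows the general Bloom-filter pattern: a zero bit at any of the k hash locations rules out a match, and only otherwise is a dictionary string fetched and compared. Below is a simplified software sketch of that idea. The salted SHA-256 hashes, the table size, and the use of a Python set in place of pointer-addressed dictionary memory are all assumptions, and the patent's unique-bit and interleaved-fetch details are not modeled.

```python
import hashlib


class BloomDictionary:
    def __init__(self, words, k=3, size=1024):
        self.k, self.size = k, size
        self.bits = [0] * size
        self.words = set(words)  # stands in for the string dictionary memory
        for w in self.words:
            for loc in self._locations(w):
                self.bits[loc] = 1

    def _locations(self, s):
        # k hash locations from salted SHA-256 digests (an illustrative choice)
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{s}".encode()).digest()[:8], "big")
            % self.size
            for i in range(self.k)
        ]

    def matches(self, s):
        if any(self.bits[loc] == 0 for loc in self._locations(s)):
            return False  # a zero Bloom bit means s cannot be in the dictionary
        return s in self.words  # full comparison removes Bloom false positives


d = BloomDictionary(["alpha", "beta"])
print(d.matches("alpha"), d.matches("gamma"))
```

The benefit is that most non-matching strings are rejected by the bit test alone, without touching the dictionary.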


Proceedings ArticleDOI
08 Jun 2006
TL;DR: An automated process that takes the output of named entity recognition systems designed to identify genes and normalizes them to standard referents and identifies human gene synonyms from online databases to generate an extensive synonym lexicon.
Abstract: The identification of genes in biomedical text typically consists of two stages: identifying gene mentions and normalization of gene names. We have created an automated process that takes the output of named entity recognition (NER) systems designed to identify genes and normalizes them to standard referents. The system identifies human gene synonyms from online databases to generate an extensive synonym lexicon. The lexicon is then compared to a list of candidate gene mentions using various string transformations that can be applied and chained in a flexible order, followed by exact string matching or approximate string matching. Using a gold standard of MEDLINE abstracts manually tagged and normalized for mentions of human genes, a combined tagging and normalization system achieved 0.669 F-measure (0.718 precision and 0.626 recall) at the mention level, and 0.901 F-measure (0.957 precision and 0.857 recall) at the document level for documents used for tagger training.

39 citations


Book ChapterDOI
14 Sep 2006
TL;DR: A Θ(k)-approximation algorithm for k-SBR, a version of SBR in which each symbol is allowed to appear up to k times in each string, for some k≥1 is considered.
Abstract: In the last decade there has been an ongoing interest in string comparison problems; to a large extent the interest was stimulated by genome rearrangement problems in computational biology but related problems appear in many other areas of computer science. Particular attention has been given to the problem of sorting by reversals (SBR): given two strings, A and B, find the minimum number of reversals that transform the string A into the string B (a reversal ρ(i,j), i

37 citations


Journal ArticleDOI
TL;DR: It is shown that the approximate matching problem with swap and mismatch as the edit operations can be computed in time O(n √m log m).
Abstract: There is no known algorithm that solves the general case of the approximate string matching problem with the extended edit distance, where the edit operations are insertion, deletion, mismatch and swap, in time o(nm), where n is the length of the text and m is the length of the pattern. In an effort to study this problem, the edit operations were analysed independently. It turns out that the approximate matching problem with only the mismatch operation can be solved in time O(n √m log m). If the only edit operation allowed is swap, then the problem can be solved in time O(n log m log σ), where σ = min(m, |Σ|). In this paper we show that the approximate string matching problem with swap and mismatch as the edit operations can be computed in time O(n √m log m).

30 citations


Journal Article
TL;DR: An algorithm to approximate edit distance between two ordered and rooted trees of bounded degree is presented, where each input tree is transformed into a string by computing the Euler string, where labels of some edges in the input trees are modified so that structures of small subtrees are reflected to the labels.
Abstract: This paper presents an O(n^2) time algorithm for approximating the unit cost edit distance for ordered and rooted trees of bounded degree within a factor of O(n^(3/4)), where n is the maximum size of two input trees, and the algorithm is based on transformation of an ordered and rooted tree into a string.

Proceedings ArticleDOI
06 Nov 2006
TL;DR: A dictionary data structure for string search with errors where the query string may differ from the expected matching string by a few edits is proposed and a simple reduction can be used to obtain similar results for approximate longest prefix search.
Abstract: In this paper we propose a dictionary data structure for string search with errors where the query string may differ from the expected matching string by a few edits. This data structure can also be used to find the database string with the longest common prefix with few errors. Specifically, with a database of n random strings, each of length O(m), we show how to perform string search on a query string that differs from its closest match by k edits using a data structure of linear size and query time equal to O(log n2 log nklog a 2m over 2m). This means that if k

Journal ArticleDOI
TL;DR: This is the first index achieving average search time polynomial in m and independent of n, for r = O(m / log_σ m), and a simpler scheme needing O(n) space is presented.

Journal ArticleDOI
TL;DR: In this paper, a technique recently developed for multipattern approximate string matching is successfully extended to solve many different music retrieval problems, as well as combinations thereof not addressed before, and the resulting algorithms are average-optimal in many cases and close to average-optimal otherwise.
Abstract: Music sequences can be treated as texts in order to perform music retrieval tasks on them. However, the text search problems that result from this modeling are unique to music retrieval. To date, several approaches derived from classical string matching have been proposed to cope with the new search problems, yet each problem had its own algorithms. In this paper we show that a technique recently developed for multipattern approximate string matching is flexible enough to be successfully extended to solve many different music retrieval problems, as well as combinations thereof not addressed before. We show that the resulting algorithms are average-optimal in many cases and close to average-optimal otherwise. Empirically, they are much better than existing approaches in many practical cases.

Journal ArticleDOI
TL;DR: The multimedia retrieval problem can be transformed into the q-attribute string matching problem if q features are considered in a query; a general solution is proposed that includes an index structure and matching methodologies, which can be applied on different values of q.
Abstract: Multimedia data can be represented as a multiple-attribute string of feature values corresponding to multiple features of the data. Therefore, the retrieval problem can be transformed into the q-attribute string matching problem if q features are considered in a query. A general solution is proposed in this paper. It includes an index structure and the matching methodologies, which can be applied on different values of q. The experimental results show the efficiency of the proposed approach.

Patent
Kyung-eun Lee, Kang Seok Joong
11 Dec 2006
TL;DR: A string matching method and system are provided for searching for a representative string for a plurality of strings which are written in different languages and/or in different ways but share substantially the same meaning.
Abstract: A string matching method and system for searching for a representative string for a plurality of strings which are written in different languages and/or in different ways but share substantially the same meaning, and a computer-readable recording medium storing a computer program for executing the string matching method are provided.

Book ChapterDOI
13 Nov 2006
TL;DR: An entropy based Audio-Fingerprint delivering a framed, small footprint AFP is used which reduces the problem to a string matching problem and is able to correctly identify different renditions of masterpieces as well as pop music in less than a second per comparison.
Abstract: In this paper we address the problem of matching musical renditions of the same piece of music, also known as performances. We use an entropy-based Audio-Fingerprint delivering a framed, small footprint AFP which reduces the problem to a string matching problem. The Entropy AFP has very low resolution (750 ms per symbol), making it suitable for flexible string matching. We show experimental results using dynamic time warping (DTW), Levenshtein or edit distance and the Longest Common Subsequence (LCS) distance. We are able to correctly (100%) identify different renditions of masterpieces as well as pop music in less than a second per comparison. The three approaches are 100% effective, but LCS and Levenshtein can be computed online, making them suitable for monitoring applications (unlike DTW), and since they are distances a metric index could be used to speed up the recognition process.
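
For readers unfamiliar with two of the measures named in this abstract, the sketch below computes the Levenshtein distance and an LCS-based distance between two symbol strings. The LCS distance is taken here as len(a) + len(b) - 2 * LCS(a, b), which is a common definition but an assumption about what the paper uses; the strings stand in for fingerprint symbol sequences.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_distance(a: str, b: str) -> int:
    """Number of insertions and deletions needed to turn a into b (assumed definition)."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(levenshtein("kitten", "sitting"), lcs_distance("kitten", "sitting"))
```

Both functions process one row at a time, so each new symbol of a stream can be folded in as it arrives.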

01 Jan 2006
TL;DR: This paper presents a lightweight program similarity detection model based on the XPDec model and is capable of distinguishing a flat structure from a nested structure of control sequences that avoids globally involved string comparisons.
Abstract: Program plagiarism is one of the most significant problems in Computer Science education. The most common forms of plagiarism include modifying comments, reordering statements, and changing variable names. Such simple changes, however, require excessive string comparisons. This paper presents a lightweight program similarity detection model. Unlike other detection models, our model avoids globally involved string comparisons. String matching is only involved locally when comparing control sequences. To this end we use XML and the Levenshtein distance algorithm. The XML's tree-like representation reduces intensive string comparisons for the simple modifications. The Levenshtein distance algorithm makes our model reliable for logic changes. Our approach is based on the XPDec model and is capable of distinguishing a flat structure from a nested structure of control sequences. Such improvement will lead to simple and reliable implementation of program similarity detection systems.

Proceedings ArticleDOI
13 Sep 2006
TL;DR: The main use of this method is to reduce the time spent on comparisons of string matching by distributing the data among processors which achieves a linear speedup and requires layered architecture and additionally p*# processors.
Abstract: In this paper we present a new method for exact string matching based on a layered architecture and a two-dimensional array. This has applications such as string databases and computational biology. The main use of this method is to reduce the time spent on comparisons of string matching by distributing the data among processors, which achieves a linear speedup and requires a layered architecture and additionally p * £ processors. We propose a generalized mapping scheme for distributed computing environments and introduce efficient dataflow schemes for exact string matching problems.

Journal ArticleDOI
TL;DR: This paper addresses a modified version of Horspool's string matching algorithm using the probabilities of the different symbols to speed up the search, and shows that the distribution of the symbols can be approximated to a high precision using a random sample of sublinear size.
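
For context, a minimal sketch of the unmodified Horspool algorithm is shown below. The probability-based variant this paper proposes is not reproduced; the function name `horspool` is chosen for illustration.

```python
def horspool(pattern: str, text: str) -> list[int]:
    """Return all start positions of pattern in text using Horspool's bad-character shift."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each symbol (excluding the
    # pattern's last position) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the text symbol aligned with the pattern's last position;
        # symbols absent from the table allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits


print(horspool("abra", "abracadabra"))
```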

Book ChapterDOI
05 Jul 2006
TL;DR: A new suffix tree layout scheme for secondary storage is presented and construction, substring search, insertion and deletion algorithms that are competitive with the string B-tree are presented.
Abstract: Designing external memory data structures for string databases is of significant recent interest due to the proliferation of biological sequence data. The suffix tree is an important indexing structure that provides optimal algorithms for memory bound data. However, string B-trees provide the best known asymptotic performance in external memory for substring search and update operations. Work on external memory variants of suffix trees has largely focused on constructing suffix trees in external memory or layout schemes for suffix trees that preserve link locality. In this paper, we present a new suffix tree layout scheme for secondary storage and present construction, substring search, insertion and deletion algorithms that are competitive with the string B-tree. For a set of strings of total length n, a pattern p and disk blocks of size B, we provide a substring search algorithm that uses O(|p|/B + log_B n) disk accesses. We present algorithms for insertion and deletion of all suffixes of a string of length m that take O(m log_B (n + m)) and O(m log_B n) disk accesses, respectively. Our results demonstrate that suffix trees can be directly used as efficient secondary storage data structures for string and sequence data.

Journal Article
TL;DR: In this article, an entropy-based Audio-Fingerprint (AFP) was used to match musical renditions of the same piece of music also known as performances, which reduced the problem to a string matching problem.
Abstract: In this paper we address the problem of matching musical renditions of the same piece of music, also known as performances. We use an entropy-based Audio-Fingerprint delivering a framed, small footprint AFP which reduces the problem to a string matching problem. The Entropy AFP has very low resolution (750 ms per symbol), making it suitable for flexible string matching. We show experimental results using dynamic time warping (DTW), Levenshtein or edit distance and the Longest Common Subsequence (LCS) distance. We are able to correctly (100%) identify different renditions of masterpieces as well as pop music in less than a second per comparison. The three approaches are 100% effective, but LCS and Levenshtein can be computed online, making them suitable for monitoring applications (unlike DTW), and since they are distances a metric index could be used to speed up the recognition process.

Journal ArticleDOI
TL;DR: A way of measuring each pianist's habit of playing similar phrases in similar ways is presented and a ranking of the performers based on that is proposed.
Abstract: We propose novel machine learning methods for exploring the domain of music performance praxis. Based on simple measurements of timing and intensity in 12 recordings of a Schubert piano piece, short performance sequences are fed into a SOM algorithm in order to calculate 'performance archetypes'. The archetypes are labeled with letters and approximate string matching done by an evolutionary algorithm is applied to find similarities in the performances represented by these letters. We present a way of measuring each pianist's habit of playing similar phrases in similar ways and propose a ranking of the performers based on that. Finally, an experiment revealing common expression patterns is briefly described.

Patent
06 Jul 2006
TL;DR: In this article, a method, computer program product, apparatus, and system that detects a substring in an input data string by producing a fingerprint of a portion of the data string and comparing the fingerprint of the portion of a data string to at least one predefined fingerprint is presented.
Abstract: A method, computer program product, apparatus, and system that detects a substring in an input data string by producing a fingerprint of a portion of the data string and comparing the fingerprint of the portion of the data string to at least one predefined fingerprint. The predefined fingerprint may be a fingerprint of a portion of a predefined pattern of interest. If the fingerprints match, further pattern recognition processing may be performed on the input string.
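
The fingerprint-then-verify idea in this abstract resembles Rabin-Karp style filtering, sketched below under stated assumptions: a polynomial rolling hash serves as the fingerprint, the whole pattern (rather than a portion of it) is fingerprinted, and a direct string comparison stands in for the "further pattern recognition processing". None of these choices come from the patent; `BASE`, `MOD`, `fingerprint`, and `find_with_fingerprint` are illustrative names.

```python
BASE, MOD = 257, (1 << 61) - 1  # illustrative hash parameters


def fingerprint(s: str) -> int:
    """Polynomial hash of s, used here as the fingerprint."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def find_with_fingerprint(data: str, pattern: str) -> list[int]:
    """Slide a window over data; verify only where fingerprints agree."""
    m = len(pattern)
    if m == 0 or m > len(data):
        return []
    target = fingerprint(pattern)   # the predefined fingerprint
    top = pow(BASE, m - 1, MOD)     # weight of the window's first symbol
    h = fingerprint(data[:m])
    hits = []
    for i in range(len(data) - m + 1):
        # Cheap fingerprint test first, exact comparison only on a match.
        if h == target and data[i:i + m] == pattern:
            hits.append(i)
        if i + m < len(data):
            # Roll the window: drop data[i], append data[i + m].
            h = ((h - ord(data[i]) * top) * BASE + ord(data[i + m])) % MOD
    return hits


print(find_with_fingerprint("xxabcxxabc", "abc"))
```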

Journal Article
TL;DR: In this article, a new suffix tree layout scheme for secondary storage is presented, along with algorithms for insertion and deletion of all suffixes of a string of length m that take O(m log_B (n + m)) and O(m log_B n) disk accesses, respectively.
Abstract: Designing external memory data structures for string databases is of significant recent interest due to the proliferation of biological sequence data. The suffix tree is an important indexing structure that provides optimal algorithms for memory bound data. However, string B-trees provide the best known asymptotic performance in external memory for substring search and update operations. Work on external memory variants of suffix trees has largely focused on constructing suffix trees in external memory or layout schemes for suffix trees that preserve link locality. In this paper, we present a new suffix tree layout scheme for secondary storage and present construction, substring search, insertion and deletion algorithms that are competitive with the string B-tree. For a set of strings of total length n, a pattern p and disk blocks of size B, we provide a substring search algorithm that uses O(|p|/B + log_B n) disk accesses. We present algorithms for insertion and deletion of all suffixes of a string of length m that take O(m log_B (n + m)) and O(m log_B n) disk accesses, respectively. Our results demonstrate that suffix trees can be directly used as efficient secondary storage data structures for string and sequence data.

Proceedings ArticleDOI
01 Sep 2006
TL;DR: This work shows that pessimistic error pruning method gives better generalization in a coreference resolution task than that reported in W.M. Soon et al. (2001) when weights of positive and negative examples are properly chosen.
Abstract: Coreference resolution is the process of determining whether two expressions in natural language refer to the same entity in the world. We adopt a machine learning approach using decision trees for coreference resolution of general noun phrases in unrestricted text based on well-defined features. We also use approximate matching algorithms for a string match feature and databases of American last names and male and female first names for gender agreement and alias features. For the evaluation we use the MUC-6 coreference corpora. We show that the pessimistic error pruning method gives better generalization in a coreference resolution task than that reported in W.M. Soon et al. (2001) when weights of positive and negative examples are properly chosen.

Proceedings Article
01 Jan 2006
TL;DR: A new algorithm for pattern matching when both a text T and a pattern P are presented by SLPs, and it is shown how to count all occurrences and check whether any given position is an occurrence or not in time O(n^2 m).
Abstract: Here we study the complexity of string problems as a function of the size of a program that generates input. We consider straight-line programs (SLP), since all algorithms on SLP-generated strings could be applied to processing LZ-compressed texts. The main result is a new algorithm for pattern matching when both a text T and a pattern P are presented by SLPs (so-called fully compressed pattern matching problem). We show how to find a first occurrence, count all occurrences, check whether any given position is an occurrence or not in time O(n^2 m). Here m, n are the sizes of straight-line programs generating correspondingly P and T. Then we present polynomial algorithms for computing fingerprint table and compressed representation of all covers (for the first time) and for finding periods of a given compressed string (our algorithm is faster than previously known). On the other hand, we show that computing the Hamming distance between two SLP-generated strings is NP- and coNP-hard.

Proceedings Article
01 May 2006
TL;DR: In this article, the authors explore the feasibility of using only unsupervised means to identify typos in a frequency list derived from a large corpus of Dutch and to distinguish between these non-words and real-words in the language.
Abstract: We explore the feasibility of using only unsupervised means to identify non-words, i.e. typos, in a frequency list derived from a large corpus of Dutch and to distinguish between these non-words and real-words in the language. We call the system we built and evaluate in this paper ciccl, which stands for "Corpus-Induced Corpus Clean-up". The algorithm on which ciccl is primarily based is the anagram-key hashing algorithm introduced by Reynaert (2004). The core correction mechanism is a simple and effective method which translates the actual characters which make up a word into a large natural number in such a way that all the anagrams, i.e. all the words composed of precisely the same subset of characters, are allocated the same natural number. In effect, this constitutes a novel approximate string matching algorithm for indexed text search. This is because by simple addition, subtraction or a combination of both, all variants within reach of the range of numerical values defined in the alphabet are retrieved by iterating over the alphabet. ciccl's input consists primarily of corpus derived frequency lists, from which it derives valuable morphological information by performing frequency counts over the substrings of the words, which are then used to perform decompounding, as well as for distinguishing between most likely correctly spelled words and typos.
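
The anagram-key idea can be sketched as follows: map each word to a sum of per-character values so that all anagrams collide, then reach single-substitution variants by subtracting one character's value and adding another's. The per-character value used here (code point raised to the fifth power) is an assumption for illustration, as are the function names; the abstract does not specify the exact key function.

```python
from collections import defaultdict


def anagram_key(word: str, power: int = 5) -> int:
    """Order-independent key: every anagram of word gets the same number."""
    return sum(ord(c) ** power for c in word)


def build_index(words):
    """Group a word list by anagram key."""
    index = defaultdict(set)
    for w in words:
        index[anagram_key(w)].add(w)
    return index


def substitution_variants(word: str, index, alphabet: str) -> set[str]:
    """Indexed words whose character multiset differs from word by one substitution."""
    key = anagram_key(word)
    found = set()
    for old in set(word):
        for new in alphabet:
            if new != old:
                # Subtract the removed character's value, add the new one's.
                candidate = key - anagram_key(old) + anagram_key(new)
                found |= index.get(candidate, set())
    return found


idx = build_index(["huis", "hius", "hond"])
print(sorted(substitution_variants("hvis", idx, "abcdefghijklmnopqrstuvwxyz")))
```

Because the key ignores character order, a lookup also returns transposed forms of each variant, so candidates would normally be ranked or filtered afterwards.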

Book ChapterDOI
18 Dec 2006
TL;DR: An O(n^2) time algorithm for approximating the unit cost edit distance for ordered and rooted trees of bounded degree within a factor of O(n^(3/4)), where n is the maximum size of two input trees, and the algorithm is based on transformation of an ordered and rooted tree into a string.
Abstract: This paper presents an O(n^2) time algorithm for approximating the unit cost edit distance for ordered and rooted trees of bounded degree within a factor of O(n^(3/4)), where n is the maximum size of two input trees, and the algorithm is based on transformation of an ordered and rooted tree into a string.

Book ChapterDOI
11 Sep 2006
TL;DR: This work revisits the problem of indexing a string S to support searching all substrings in S that match a given pattern P[1..m] with at most k errors and gives an index to support matching in O(m + occ + log n log log n) time.
Abstract: We revisit the problem of indexing a string S[1..n] to support searching all substrings in S that match a given pattern P[1..m] with at most k errors. Previous solutions either require an index of size exponential in k or need Ω(m^k) time for searching. Motivated by the indexing of DNA sequences, we investigate space efficient indexes that occupy only O(n) space. For k = 1, we give an index to support matching in O(m + occ + log n log log n) time. The previously best solution achieving this time complexity requires an index of size O(n log n). This new index can be used to improve existing indexes for k ≥ 2 errors. Among others, it can support matching with k = 2 errors in O(m log n log log n + occ) time.