Topic
Approximate string matching
About: Approximate string matching is a research topic. Over its lifetime, 1903 publications have appeared within this topic, receiving 62352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.
Papers
•
TL;DR: The authors propose an inverted index structure, which they call the n-gram/2L-Approximation index, and an approximate string matching algorithm based on it. The index mitigates the drawbacks of n-gram Matching and reduces false positives compared with the n-gram inverted index if a large number of errors are allowed.
Abstract: Approximate string matching is the problem of finding all occurrences of a query string in a text database while allowing a specified number of errors. Approximate string matching based on the n-gram inverted index (simply, n-gram Matching) has been widely used. A major reason is that it scales to large databases, since it is not a main-memory algorithm. Nevertheless, n-gram Matching also has drawbacks: query performance tends to be poor, and many false positives occur if a large number of errors are allowed. In this paper, we propose an inverted index structure, which we call the n-gram/2L-Approximation index, that mitigates these drawbacks, and an approximate string matching algorithm based on it. The n-gram/2L-Approximation index is an adaptation of the n-gram/2L index [4], which the authors proposed earlier for exact matching. Inheriting the advantages of the n-gram/2L index, the n-gram/2L-Approximation index reduces the size of the index and improves the query performance compared with the n-gram inverted index. In addition, the n-gram/2L-Approximation index reduces false positives compared with the n-gram inverted index if a large number of errors are allowed. We perform extensive experiments using text and protein databases. Experimental results using databases of 1 GByte show that the n-gram/2L-Approximation index reduces the index size by up to 1.8 times and, at the same time, improves the query performance by up to 4.2 times compared with those of the n-gram inverted index.
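To make the baseline concrete, here is a minimal Python sketch of plain n-gram Matching: a count filter over an n-gram inverted index, followed by edit-distance verification. It is not the n-gram/2L-Approximation index from the abstract. It also simplifies the problem by comparing the query against whole records in a small in-memory list rather than finding occurrences inside a large text. The function names (`build_index`, `search`, and so on) and the sample records are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def ngrams(s, n):
    """Return all overlapping n-grams of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_index(records, n):
    """Map each n-gram to {record id: number of occurrences in that record}."""
    index = defaultdict(dict)
    for rid, text in enumerate(records):
        for gram, count in Counter(ngrams(text, n)).items():
            index[gram][rid] = count
    return index


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(records, index, query, n, k):
    """Return (candidate ids, sorted ids of records within edit distance k)."""
    # Count filter: one edit destroys at most n n-grams, so a true match must
    # share at least this many n-grams with the query.
    threshold = len(query) - n + 1 - k * n
    if threshold <= 0:
        # The filter cannot prune anything: every record is a candidate.
        candidates = set(range(len(records)))
    else:
        shared = Counter()
        for gram, q_count in Counter(ngrams(query, n)).items():
            for rid, r_count in index.get(gram, {}).items():
                shared[rid] += min(q_count, r_count)
        candidates = {rid for rid, c in shared.items() if c >= threshold}
    # Verification: only candidates pay for the quadratic edit-distance check.
    matches = sorted(rid for rid in candidates
                     if edit_distance(records[rid], query) <= k)
    return candidates, matches


records = ["approximate", "appropriate", "proximity", "string"]
index = build_index(records, 2)
print(search(records, index, "aproximate", 2, 1))
```

With a small error bound the filter leaves few candidates to verify. As k grows the threshold drops to zero or below and every record must be verified, which illustrates the false-positive drawback the abstract attributes to n-gram Matching when many errors are allowed.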
16 citations
••
TL;DR: This work considers the problem of finding the longest common subsequence of two strings, and develops significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems.
Abstract: Measuring the similarity between two strings, through such standard measures as Hamming distance, edit distance, and longest common subsequence, is one of the fundamental problems in pattern matching. We consider the problem of finding the longest common subsequence of two strings. A well-known dynamic programming algorithm computes the longest common subsequence of strings X and Y in O(|X|·|Y|) time. We develop significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems. A string S is run-length encoded if it is described as an ordered sequence of pairs (σ, i), each consisting of an alphabet symbol σ and an integer i. Each pair corresponds to a run in S consisting of i consecutive occurrences of σ. For example, the string aaaabbbbcccabbbbcc can be encoded as a^4 b^4 c^3 a^1 b^4 c^2. Such a run-length encoded string can be significantly shorter than the expanded string representation. Indeed, run-length coding serves as a popular image compression technique, since many classes of images, such as binary images in facsimile transmission, typically contain large patches of identically-valued pixels.
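The two ingredients named in the abstract can be sketched in a few lines of Python: run-length encoding into (symbol, count) pairs, and the well-known O(|X|·|Y|) dynamic program for the length of the longest common subsequence. This is only the quadratic baseline on expanded strings, not the faster algorithm for run-length encoded inputs that the paper develops. The function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(s):
    """Encode s as a list of (symbol, run length) pairs."""
    return [(sym, len(list(run))) for sym, run in groupby(s)]


def run_length_decode(pairs):
    """Expand (symbol, run length) pairs back into a plain string."""
    return "".join(sym * count for sym, count in pairs)


def lcs_length(x, y):
    """Length of the longest common subsequence, in O(|x|*|y|) time."""
    prev = [0] * (len(y) + 1)          # DP row for the previous prefix of x
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, 1):
            if cx == cy:
                cur.append(prev[j - 1] + 1)   # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


encoded = run_length_encode("aaaabbbbcccabbbbcc")
print(encoded)  # [('a', 4), ('b', 4), ('c', 3), ('a', 1), ('b', 4), ('c', 2)]
print(lcs_length("ABCBDAB", "BDCABA"))  # 4
```

The example string from the abstract has 18 symbols but only 6 runs, which is the gap an algorithm working directly on the encoded form can exploit.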
16 citations
••
TL;DR: This work uses LARPBS to reduce the time spent on comparisons in string matching, and obtains an O(n) bus-cycle algorithm and a constant bus-cycle algorithm for the exact and approximate string matching problems.
Abstract: We consider string matching on LARPBS and 2D LARPBS. This has applications in areas such as string databases, cellular automata and computational biology. The main use of this method is to reduce the time spent on comparisons in string matching by using LARPBS. We investigate the exact string matching and approximate string matching problems. For these two subproblems, we obtain an O(n) bus-cycle algorithm and a constant bus-cycle algorithm. These algorithms have some disadvantages: reconnecting the sub-buses and shuffling the contents. These problems can be solved by 2D LARPBS.
16 citations
••
01 Jun 1993
TL;DR: This work describes the first efficient algorithm for simultaneously matching multiple rectangular patterns of varying sizes and aspect ratios in a rectangular text, and extends the algorithm to a dynamic setting where the set of patterns can change over time.
Abstract: We describe the first worst-case efficient algorithm for simultaneously matching multiple rectangular patterns of varying sizes and aspect ratios in a rectangular text. Efficient means significantly more efficient asymptotically than applying known algorithms that handle one height (or width or aspect ratio) at a time for each height. Our algorithm features an interesting use of multidimensional range searching, as well as new adaptations of several known techniques for two-dimensional string matching. We also extend our algorithm to a dynamic setting where the set of patterns can change over time.
16 citations
•
17 Jan 2003
TL;DR: In this article, an Arabic handwriting recognition system takes an input from a stylus in the form of an ordered sequence of data, and subsequently strokes (or directed line segments) are extracted from the sequence.
Abstract: An Arabic handwriting recognition system takes an input from a stylus in the form of an ordered sequence of data. The sequence of data is then processed to eliminate any noise associated with the data, and subsequently strokes (or directed line segments) are extracted from the sequence of data. Further analysis of the strokes is performed to transform the input data into a feature vector. Next, the feature vector is matched against the features of all Arabic letters using fuzzy matching and dynamic programming techniques. During this matching process, the input word is segmented into the sequence of characters that maximizes the matching score. In addition, external objects (such as single dots, double dots, triple dots, hamzas, or maddas) that are above and below Arabic letters are detected.
16 citations