Topic
Approximate string matching
About: Approximate string matching is a research topic. Over the lifetime, 1903 publications have been published within this topic receiving 62352 citations. The topic is also known as: fuzzy string-searching algorithm & fuzzy string-matching algorithm.
Papers published on a yearly basis
Papers
More filters
••
TL;DR: A very fast new family of string matching algorithms based on hashing q-grams are proposed, which are the fastest on many cases, in particular, on small size alphabets.
122 citations
•
20 Jun 2003TL;DR: In this article, a disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process is proposed.
Abstract: To help ensure high data quality, data warehouses validate and clean, if needed incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.
120 citations
••
14 Oct 1996TL;DR: The authors show that this general method based on assigning labels to some of the substrings of a given string is also useful for several central problems in the area of string processing: approximate string matching, dynamic dictionary matching, and dynamic text indexing.
Abstract: A key approach in string processing algorithmics has been the labeling paradigm which is based on assigning labels to some of the substrings of a given string. If these labels are chosen consistently, they can enable fast comparisons of substrings. Until the first optimal parallel algorithm for suffix tree construction was given by the authors in 1994 the labeling paradigm was considered not to be competitive with other approaches. They show that this general method is also useful for several central problems in the area of string processing: approximate string matching, dynamic dictionary matching, and dynamic text indexing. The approximate string matching problem deals with finding all substrings of a text which match a pattern "approximately", i.e., with at most m differences. The differences can be in the form of inserted, deleted, or replaced characters. The text indexing problem deals with finding all occurrences of a pattern in a text, after the text is preprocessed. In the dynamic text indexing problem, updates to the text in the form of insertions and deletions of substrings are permitted. The dictionary matching problem deals with finding all occurrences of each pattern set of a set of patterns in a text, after the pattern set is preprocessed. In the dynamic dictionary matching problem, insertions and deletions of patterns to the pattern set are permitted.
119 citations
••
29 Apr 1992TL;DR: This work presents an algorithm for string matching with mismatches based in arithmetical operations that runs in linear worst case time for most practical cases and presents a new approach to string searching.
Abstract: We present new algorithms for approximate string matching based in simple, but efficient, ideas. First, we present an algorithm for string matching with mismatches based in arithmetical operations that runs in linear worst case time for most practical cases. This is a new approach to string searching. Second, we present an algorithm for string matching with errors based on partitioning the pattern that requires linear expected time for typical inputs.
118 citations
••
TL;DR: This paper examines string block edit distance, in which two strings A and B are compared by extracting collections of substrings and placing them into correspondence, and shows that several variants are NPcomplete and give polynomial-time algorithms for solving the remainder.
118 citations