scispace - formally typeset
Search or ask a question
Topic

Approximate string matching

About: Approximate string matching is a research topic. Over the lifetime, 1903 publications have been published within this topic receiving 62352 citations. The topic is also known as: fuzzy string-searching algorithm & fuzzy string-matching algorithm.


Papers
More filters
Journal ArticleDOI
TL;DR: A very fast new family of string matching algorithms based on hashing q-grams are proposed, which are the fastest on many cases, in particular, on small size alphabets.

122 citations

Patent
20 Jun 2003
TL;DR: In this article, a disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process is proposed.
Abstract: To help ensure high data quality, data warehouses validate and clean, if needed incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.

120 citations

Proceedings ArticleDOI
14 Oct 1996
TL;DR: The authors show that this general method based on assigning labels to some of the substrings of a given string is also useful for several central problems in the area of string processing: approximate string matching, dynamic dictionary matching, and dynamic text indexing.
Abstract: A key approach in string processing algorithmics has been the labeling paradigm which is based on assigning labels to some of the substrings of a given string. If these labels are chosen consistently, they can enable fast comparisons of substrings. Until the first optimal parallel algorithm for suffix tree construction was given by the authors in 1994 the labeling paradigm was considered not to be competitive with other approaches. They show that this general method is also useful for several central problems in the area of string processing: approximate string matching, dynamic dictionary matching, and dynamic text indexing. The approximate string matching problem deals with finding all substrings of a text which match a pattern "approximately", i.e., with at most m differences. The differences can be in the form of inserted, deleted, or replaced characters. The text indexing problem deals with finding all occurrences of a pattern in a text, after the text is preprocessed. In the dynamic text indexing problem, updates to the text in the form of insertions and deletions of substrings are permitted. The dictionary matching problem deals with finding all occurrences of each pattern set of a set of patterns in a text, after the pattern set is preprocessed. In the dynamic dictionary matching problem, insertions and deletions of patterns to the pattern set are permitted.

119 citations

Book ChapterDOI
29 Apr 1992
TL;DR: This work presents an algorithm for string matching with mismatches based in arithmetical operations that runs in linear worst case time for most practical cases and presents a new approach to string searching.
Abstract: We present new algorithms for approximate string matching based in simple, but efficient, ideas. First, we present an algorithm for string matching with mismatches based in arithmetical operations that runs in linear worst case time for most practical cases. This is a new approach to string searching. Second, we present an algorithm for string matching with errors based on partitioning the pattern that requires linear expected time for typical inputs.

118 citations

Journal ArticleDOI
TL;DR: This paper examines string block edit distance, in which two strings A and B are compared by extracting collections of substrings and placing them into correspondence, and shows that several variants are NPcomplete and give polynomial-time algorithms for solving the remainder.

118 citations


Network Information
Related Topics (5)
Server
79.5K papers, 1.4M citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
80% related
Scheduling (computing)
78.6K papers, 1.3M citations
79% related
Network packet
159.7K papers, 2.2M citations
78% related
Optimization problem
96.4K papers, 2.1M citations
78% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
20238
202230
202132
202030
201948
201839