Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have appeared on this topic, receiving 62,352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.

Papers

Book Chapter•DOI•

Approximate String Matching Techniques for Effective CLIR Among Indian Languages

Ranbeer Makin¹, Nikita Pandey¹, Prasad Pingali¹, Vasudeva Varma¹•Institutions (1)

International Institute of Information Technology, Hyderabad¹

07 Jul 2007

TL;DR: This article explores approximate string matching techniques that exploit the relatively large number of cognates shared among Indian languages, a number higher than that shared between an Indian language and a non-Indian language.

Abstract: Commonly used vocabulary in Indian language documents found on the web contains a number of words of Sanskrit, Persian or English origin. However, such words may be written in different scripts with slight variations in spelling and morphology. In this paper we explore approximate string matching techniques to exploit the relatively large number of cognates among Indian languages, which is higher than the number shared between an Indian language and a non-Indian language. We present an approach to identify cognates and use them to improve dictionary-based CLIR when the query and the documents belong to two different Indian languages. We conduct experiments using a Hindi document collection and a set of Telugu queries and report the improvement due to cognate recognition and translation.

19 citations
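
The abstract above does not say which approximate matching measures the authors use, so the following is only a minimal sketch of one common choice for cognate detection: the longest common subsequence ratio (LCSR) over romanized word forms. The function names, the 0.8 threshold and the romanized example words are illustrative assumptions, not details from the paper.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the longer length; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def is_cognate_candidate(a: str, b: str, threshold: float = 0.8) -> bool:
    # The threshold is an assumption for illustration, not a value from the paper.
    return lcsr(a, b) >= threshold


# Illustrative romanized forms (assumed transliterations).
print(lcsr("vidyalaya", "vidyalayam"))
print(is_cognate_candidate("vidyalaya", "vidyalayam"))
```

A real cross-script system would first need a transliteration or script-normalization step, which this sketch leaves out.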

Journal Article•DOI•

Strict approximate pattern matching with general gaps

Youxi Wu¹, Shuai Fu¹, He Jiang², Xindong Wu³•Institutions (3)

Hebei University of Technology¹, Dalian University of Technology², University of Vermont³

01 Apr 2015 - Applied Intelligence

TL;DR: This article proposes an effective online algorithm, named SETA (SubnETtree for sAp), based on the subnettree structure (a Nettree is an extension of a tree with multi-parents and multi-roots) and shows the completeness of the algorithm.

Abstract: Pattern matching with gap constraints is one of the essential problems in computer science such as music information retrieval and sequential pattern mining. One of the cases is called loose matching, which only considers the matching position of the last pattern substring in the sequence. One more challenging problem is considering the matching positions of each character in the sequence, called strict pattern matching which is one of the essential tasks of sequential pattern mining with gap constraints. Some strict pattern matching algorithms were designed to handle pattern mining tasks, since strict pattern matching can be used to compute the frequency of some patterns occurring in the given sequence and then the frequent patterns can be derived. In this article, we address a more general strict approximate pattern matching with Hamming distance, named SAP (Strict Approximate Pattern matching with general gaps and length constraints), which means that the gap constraints can be negative. We show that a SAP instance can be transformed into an exponential amount of the exact pattern matching with general gaps instances. Hence, we propose an effective online algorithm, named SETA (SubnETtree for sAp), based on the subnettree structure (a Nettree is an extension of a tree with multi-parents and multi-roots) and show the completeness of the algorithm. The space and time complexities of the algorithm are O(m × Maxlen × W × d) and O(Maxlen × W × m² × n × d), respectively, where m, Maxlen, W, and d are the length of pattern P, the maximal length constraint, the maximal gap length of pattern P and the approximate threshold. Extensive experimental results validate the correctness and effectiveness of SETA.

19 citations
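
To make the notion of strict approximate matching with gaps concrete, here is a small dynamic-programming sketch that counts occurrences. It is not the SETA algorithm and does not use a Nettree. It assumes non-negative gaps and ignores the length constraint, and the function name and argument layout are hypothetical.

```python
def count_strict_approx(seq, pattern, gaps, d):
    """Count position tuples (i_1 < ... < i_m) in seq such that
    gaps[j][0] <= i_{j+2} - i_{j+1} - 1 <= gaps[j][1] for every adjacent pair
    of pattern characters and at most d positions mismatch (Hamming distance).

    Simplified sketch: gaps must be non-negative, no length constraint.
    """
    m, n = len(pattern), len(seq)
    if m == 0 or len(gaps) != m - 1:
        raise ValueError("need a non-empty pattern and one gap range per adjacent pair")
    if any(lo < 0 or hi < lo for lo, hi in gaps):
        raise ValueError("this sketch only supports gap ranges with 0 <= lo <= hi")

    # ways[i][e]: number of partial occurrences whose latest character sits at
    # position i of seq and that have used e mismatches so far.
    ways = [[0] * (d + 1) for _ in range(n)]
    for i in range(n):
        e = 0 if seq[i] == pattern[0] else 1
        if e <= d:
            ways[i][e] = 1

    for j in range(1, m):
        lo, hi = gaps[j - 1]
        nxt = [[0] * (d + 1) for _ in range(n)]
        for i in range(n):
            for e in range(d + 1):
                w = ways[i][e]
                if w == 0:
                    continue
                for gap in range(lo, hi + 1):
                    k = i + gap + 1  # position of the next pattern character
                    if k >= n:
                        break
                    e2 = e + (0 if seq[k] == pattern[j] else 1)
                    if e2 <= d:
                        nxt[k][e2] += w
        ways = nxt

    return sum(sum(row) for row in ways)


# Pattern "a[0,1]b": 'a', then 0 or 1 skipped characters, then 'b'.
print(count_strict_approx("aab", "ab", [(0, 1)], d=0))
print(count_strict_approx("aab", "ab", [(0, 1)], d=1))
```

Every distinct tuple of positions is counted separately, which is what distinguishes strict matching from loose matching that only tracks where the last character lands.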

Book•

Approximate String Processing

Marios Hadjieleftheriou¹, Divesh Srivastava¹•Institutions (1)

AT&T Labs¹

08 Feb 2011

TL;DR: This work presents a survey of indexing techniques and algorithms specifically designed for approximate string matching, focusing on inverted indexes, filtering techniques, and tree data structures that can be used to evaluate a variety of set based and edit based similarity functions.

Abstract: One of the most important primitive data types in modern data processing is text. Text data are known to have a variety of inconsistencies (e.g., spelling mistakes and representational variations). For that reason, there exists a large body of literature related to approximate processing of text. This monograph focuses specifically on the problem of approximate string matching, where, given a set of strings S and a query string v, the goal is to find all strings s ∈ S that have a user specified degree of similarity to v. Set S could be, for example, a corpus of documents, a set of web pages, or an attribute of a relational table. The similarity between strings is always defined with respect to a similarity function that is chosen based on the characteristics of the data and application at hand. This work presents a survey of indexing techniques and algorithms specifically designed for approximate string matching. We concentrate on inverted indexes, filtering techniques, and tree data structures that can be used to evaluate a variety of set based and edit based similarity functions. We focus on all-match and top-k flavors of selection and join queries, and discuss the applicability, advantages and disadvantages of each technique for every query type.

19 citations
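
The selection query described in the abstract (find all s ∈ S sufficiently similar to v) can be sketched with a q-gram inverted index and a set-based similarity. This is an illustrative example rather than any specific technique from the monograph. The helper names, the padding characters and the choice of Jaccard similarity over q-gram sets are assumptions.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Set of q-grams of s, padded so that prefixes and suffixes count."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_index(strings, q=2):
    """Inverted index: q-gram -> ids of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for gram in qgrams(s, q):
            index[gram].add(sid)
    return index


def select(strings, index, v, threshold, q=2):
    """All-match selection: strings whose q-gram Jaccard similarity to v is
    at least threshold (threshold > 0, so sharing one q-gram is necessary)."""
    query_grams = qgrams(v, q)
    # Filtering step: only strings sharing at least one q-gram can qualify.
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())
    # Verification step: compute the exact similarity for each candidate.
    results = []
    for sid in sorted(candidates):
        grams = qgrams(strings[sid], q)
        sim = len(query_grams & grams) / len(query_grams | grams)
        if sim >= threshold:
            results.append((strings[sid], sim))
    return results


S = ["smith", "smyth", "jones"]
idx = build_index(S)
print(select(S, idx, "smith", threshold=0.5))
```

The index prunes "jones" without ever computing its similarity, which is the basic filter-then-verify pattern that more refined filtering techniques build on.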

Book Chapter•DOI•

Approximate regular expression searching with arbitrary integer weights

Gonzalo Navarro¹•Institutions (1)

University of Chile¹

01 Dec 2004

TL;DR: This chapter presents a bit-parallel technique to search a text of length n for a regular expression of m symbols permitting k differences in worst case time O(mn/log_k s), where s is the amount of main memory that can be allocated.

Abstract: We present a bit-parallel technique to search a text of length n for a regular expression of m symbols permitting k differences in worst case time O(mn/log_k s), where s is the amount of main memory that can be allocated. The algorithm permits arbitrary integer weights and matches the complexity of the best previous techniques, but it is simpler and faster in practice. Along the way, we define a new recurrence for approximate searching where the current values depend only on previous values. Interestingly, our algorithm turns out to be a relevant option also for simple approximate string matching with arbitrary integer weights.

19 citations
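
For the simpler case the abstract mentions last, approximate string matching with integer weights, the classical column-wise dynamic program gives a useful baseline. The sketch below is that textbook recurrence, not the bit-parallel algorithm or the new recurrence of the chapter, and it handles plain patterns rather than regular expressions. The function and parameter names are assumptions.

```python
def approx_search(text, pattern, k, sub=1, ins=1, dele=1):
    """Return (end_index, cost) for every text position where some substring
    ending there matches pattern with total weighted cost <= k.

    sub:  cost of substituting one character
    ins:  cost of an extra character in the text
    dele: cost of a pattern character missing from the text
    """
    m = len(pattern)
    # col[i] = minimum cost of matching pattern[:i] against a text substring
    # ending at the current position; col[0] = 0 lets a match start anywhere.
    col = [i * dele for i in range(m + 1)]
    hits = []
    for j, tc in enumerate(text):
        new = [0]
        for i in range(1, m + 1):
            diag = col[i - 1] + (0 if pattern[i - 1] == tc else sub)
            extra_text = col[i] + ins
            missing_pat = new[i - 1] + dele
            new.append(min(diag, extra_text, missing_pat))
        col = new
        if col[m] <= k:
            hits.append((j, col[m]))
    return hits


print(approx_search("abcdefg", "cxe", k=1))
print(approx_search("hello world", "world", k=0))
```

This runs in O(mn) time. Note that each cell here depends on a value from the current column (`new[i - 1]`), which is the kind of dependency a bit-parallel formulation has to work around.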

Patent•

Systems and methods for building an electronic dictionary of multi-word names and for performing fuzzy searches in the dictionary

Hisham El-Shishiny¹, Pavel Volkov¹•Institutions (1)

IBM¹

24 Sep 2007

TL;DR: This patent describes building a contracted, trie-based dictionary from a given list of multi-word proper names and performing fuzzy searches in that dictionary.

Abstract: The present invention automatically builds a contracted dictionary from a given list of multi-word proper names and performs fuzzy searches in the contracted dictionary. The contracted dictionary of proper names includes two linked trie-based dictionaries: a first dictionary is used to store single word names, each word name having an ID number; and a second dictionary is used to store multi-word names encoded with ID numbers. Information related to the multi-word names is also stored as a gloss to the terminal node of the multi-word entry of the trie-based dictionary. An approximate lookup for a multi-word name is conducted first for each word of the multi-word name using an approximate matching technique such as phonetic proximity or a simple edit distance. Accordingly, N suggestions are determined for each word of the multi-word name under consideration. Then, multi-word candidates are assembled in ID notation. Finally, an approximate search for each assembled candidate is performed based on edit distance or n-gram approximate string matching. Edit distances and n-grams are used to measure how similar two strings are. The result is a set of multi-word suggestions in ID notation. This ID notation is decoded back to the original form using the first trie-based dictionary.

19 citations
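
The lookup pipeline in the abstract (per-word suggestions, candidates assembled in ID notation, then a dictionary search) can be sketched roughly as follows. This is a simplified illustration under several assumptions: plain Python dicts stand in for the two trie-based dictionaries, only edit distance is used for the per-word step, and the final step is an exact lookup of the assembled ID tuple rather than a second approximate search. All names and parameters are hypothetical.

```python
from itertools import product


def edit_distance(a, b):
    """Unit-cost Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_dictionaries(names):
    """First dictionary: word -> ID. Second: tuple of word IDs -> full name."""
    word_ids = {}
    name_ids = {}
    for name in names:
        ids = tuple(word_ids.setdefault(w, len(word_ids))
                    for w in name.lower().split())
        name_ids[ids] = name
    return word_ids, name_ids


def fuzzy_lookup(query, word_ids, name_ids, max_dist=1, n_suggestions=3):
    """Suggest known multi-word names for a possibly misspelled query."""
    per_word = []
    for w in query.lower().split():
        # Rank known words by distance and keep up to N close suggestions.
        scored = sorted((edit_distance(w, known), wid)
                        for known, wid in word_ids.items())
        per_word.append([wid for dist, wid in scored
                         if dist <= max_dist][:n_suggestions])
    # Assemble candidates in ID notation and keep those that are real names.
    return [name_ids[c] for c in product(*per_word) if c in name_ids]


words, names = build_dictionaries(["John Smith", "Jon Snow", "Joan Smith"])
print(fuzzy_lookup("Jon Smyth", words, names))
```

Encoding multi-word names as ID tuples keeps the second dictionary compact and lets the expensive string comparison run once per word instead of once per full name.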

Performance

Metrics

Papers: 1,942

Citations: 64,998

No. of papers in the topic in previous years
Year	Papers
2023	8
2022	30
2021	32
2020	30
2019	48
2018	39
