Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1903 publications have been published within this topic, receiving 62352 citations. The topic is also known as: fuzzy string-searching algorithm & fuzzy string-matching algorithm.


Papers
Book Chapter
07 Jul 2007
TL;DR: This paper explores approximate string matching techniques that exploit the relatively large number of cognates among Indian languages, a number higher than that between an Indian language and a non-Indian language.
Abstract: Commonly used vocabulary in Indian language documents found on the web contains a number of words that have Sanskrit, Persian or English origin. However, such words may be written in different scripts with slight variations in spelling and morphology. In this paper we explore approximate string matching techniques to exploit this relatively large number of cognates among Indian languages, which is higher than that between an Indian language and a non-Indian language. We present an approach to identify cognates and make use of them for improving dictionary-based CLIR when the query and documents belong to two different Indian languages. We conduct experiments using a Hindi document collection and a set of Telugu queries and report the improvement due to cognate recognition and translation.

19 citations

Journal Article
TL;DR: This article proposes an effective online algorithm, named SETA (SubnETtree for sAp), based on the subnettree structure (a Nettree is an extension of a tree with multi-parents and multi-roots) and shows the completeness of the algorithm.
Abstract: Pattern matching with gap constraints is one of the essential problems in computer science such as music information retrieval and sequential pattern mining. One of the cases is called loose matching, which only considers the matching position of the last pattern substring in the sequence. One more challenging problem is considering the matching positions of each character in the sequence, called strict pattern matching which is one of the essential tasks of sequential pattern mining with gap constraints. Some strict pattern matching algorithms were designed to handle pattern mining tasks, since strict pattern matching can be used to compute the frequency of some patterns occurring in the given sequence and then the frequent patterns can be derived. In this article, we address a more general strict approximate pattern matching with Hamming distance, named SAP (Strict Approximate Pattern matching with general gaps and length constraints), which means that the gap constraints can be negative. We show that a SAP instance can be transformed into an exponential amount of the exact pattern matching with general gaps instances. Hence, we propose an effective online algorithm, named SETA (SubnETtree for sAp), based on the subnettree structure (a Nettree is an extension of a tree with multi-parents and multi-roots) and show the completeness of the algorithm. The space and time complexities of the algorithm are O(m × Maxlen × W × d) and O(Maxlen × W × m² × n × d), respectively, where m, Maxlen, W, and d are the length of pattern P, the maximal length constraint, the maximal gap length of pattern P and the approximate threshold. Extensive experimental results validate the correctness and effectiveness of SETA.

19 citations
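
To make the idea of strict matching under gap constraints and a Hamming threshold concrete, here is a minimal Python sketch. It is not SETA and does not use a subnettree: it is a plain dynamic program, it assumes non-negative gaps only, and it ignores the length constraints mentioned in the abstract. The function name `count_strict_approx` and its parameters are illustrative assumptions.

```python
def count_strict_approx(seq, pat, min_gap, max_gap, d):
    """Count position tuples i_0 < i_1 < ... < i_{m-1} in seq such that
    min_gap <= i_j - i_{j-1} - 1 <= max_gap for every j, and the characters
    at those positions differ from pat in at most d places (Hamming distance).
    Simplified sketch: gaps must be non-negative, no length constraint."""
    n, m = len(seq), len(pat)
    if m == 0 or n == 0:
        return 0
    # ways[i][e]: number of partial occurrences whose latest pattern
    # character sits at seq[i] and that have used e mismatches so far.
    ways = [[0] * (d + 1) for _ in range(n)]
    for i in range(n):
        e = 0 if seq[i] == pat[0] else 1
        if e <= d:
            ways[i][e] = 1
    for j in range(1, m):
        nxt = [[0] * (d + 1) for _ in range(n)]
        for i in range(n):
            cost = 0 if seq[i] == pat[j] else 1
            for gap in range(min_gap, max_gap + 1):
                prev = i - gap - 1  # position of the previous pattern character
                if prev < 0:
                    break  # larger gaps only move further left
                for e in range(d + 1 - cost):
                    nxt[i][e + cost] += ways[prev][e]
        ways = nxt
    return sum(sum(row) for row in ways)
```

For example, with `seq = "abc"`, `pat = "ac"` and gaps in [0, 1], only positions (0, 2) match exactly, while allowing one mismatch also admits (0, 1) and (1, 2).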

Book
08 Feb 2011
TL;DR: This work presents a survey of indexing techniques and algorithms specifically designed for approximate string matching, focusing on inverted indexes, filtering techniques, and tree data structures that can be used to evaluate a variety of set based and edit based similarity functions.
Abstract: One of the most important primitive data types in modern data processing is text. Text data are known to have a variety of inconsistencies (e.g., spelling mistakes and representational variations). For that reason, there exists a large body of literature related to approximate processing of text. This monograph focuses specifically on the problem of approximate string matching, where, given a set of strings S and a query string v, the goal is to find all strings s ∈ S that have a user specified degree of similarity to v. Set S could be, for example, a corpus of documents, a set of web pages, or an attribute of a relational table. The similarity between strings is always defined with respect to a similarity function that is chosen based on the characteristics of the data and application at hand. This work presents a survey of indexing techniques and algorithms specifically designed for approximate string matching. We concentrate on inverted indexes, filtering techniques, and tree data structures that can be used to evaluate a variety of set based and edit based similarity functions. We focus on all-match and top-k flavors of selection and join queries, and discuss the applicability, advantages and disadvantages of each technique for every query type.

19 citations
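
The selection query described in the abstract above (find all s ∈ S within a given similarity of a query v) can be illustrated with a small q-gram inverted index plus a count filter. This is a hedged sketch of one well-known filtering scheme, not code from the monograph; the class name `QGramIndex`, the padding characters, and the choice of edit distance as the similarity function are assumptions.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Return the padded q-grams of s (a string of length L yields L + q - 1)."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def edit_distance(a, b):
    """Unit-cost Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class QGramIndex:
    def __init__(self, strings, q=2):
        self.q = q
        self.strings = list(strings)
        # postings[gram][sid] = number of times gram occurs in string sid
        self.postings = defaultdict(Counter)
        for sid, s in enumerate(self.strings):
            for g in qgrams(s, q):
                self.postings[g][sid] += 1

    def search(self, query, k):
        """Return all indexed strings within edit distance k of query."""
        q = self.q
        if len(query) + q - 1 - k * q <= 0:
            # The count filter is vacuous here, so every string is a candidate.
            candidates = range(len(self.strings))
        else:
            shared = Counter()
            for g, c in Counter(qgrams(query, q)).items():
                for sid, sc in self.postings.get(g, {}).items():
                    shared[sid] += min(c, sc)
            # Count filter: strings within distance k share at least
            # max(|s|, |v|) + q - 1 - k*q padded q-grams.
            candidates = [
                sid for sid, n in shared.items()
                if n >= max(len(query), len(self.strings[sid])) + q - 1 - k * q
            ]
        return sorted(
            s for s in (self.strings[sid] for sid in candidates)
            if abs(len(s) - len(query)) <= k and edit_distance(s, query) <= k
        )
```

Usage: `QGramIndex(["smith", "smyth", "smithe", "jones", "smit"]).search("smith", 1)` keeps the four near-matches and drops "jones" before any edit distance is computed for it.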

Book Chapter
01 Dec 2004
TL;DR: A bit-parallel technique to search a text of length n for a regular expression of m symbols permitting k differences in worst case time O(mn / log_k s), where s is the amount of main memory that can be allocated.
Abstract: We present a bit-parallel technique to search a text of length n for a regular expression of m symbols permitting k differences in worst case time O(mn / log_k s), where s is the amount of main memory that can be allocated. The algorithm permits arbitrary integer weights and matches the complexity of the best previous techniques, but it is simpler and faster in practice. In our way, we define a new recurrence for approximate searching where the current values depend only on previous values. Interestingly, our algorithm turns out to be a relevant option also for simple approximate string matching with arbitrary integer weights.

19 citations
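
For the simpler case the abstract mentions last, approximate string matching with integer weights, the classical column-wise dynamic program gives a useful baseline. The sketch below is that textbook recurrence, not the bit-parallel technique or the new recurrence from the paper, and it handles plain patterns rather than regular expressions. The function name `approx_search` and the three weight parameters are illustrative assumptions.

```python
def approx_search(text, pattern, k, sub_cost=1, ins_cost=1, del_cost=1):
    """Report (end_index, cost) for every text position where some substring
    ending there matches pattern with weighted edit cost <= k.

    Classical column-wise dynamic programming in O(len(pattern) * len(text))
    time, keeping one column of len(pattern) + 1 integers."""
    m = len(pattern)
    # Column before reading any text: pattern[:i] matched against nothing.
    col = [i * del_cost for i in range(m + 1)]
    hits = []
    for j, tc in enumerate(text):
        new = [0]  # an occurrence may start at any text position
        for i in range(1, m + 1):
            diag = col[i - 1] + (0 if pattern[i - 1] == tc else sub_cost)
            new.append(min(diag,                  # match or substitution
                           col[i] + ins_cost,     # extra character in the text
                           new[i - 1] + del_cost))  # pattern character skipped
        col = new
        if col[m] <= k:
            hits.append((j, col[m]))
    return hits
```

Note that `new[i - 1]` is a value from the current column, which is exactly the kind of dependency the paper's recurrence is said to avoid. Searching for "survey" in "surgery" with k = 2 and unit weights reports matches ending at indices 4, 5 and 6.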

Patent
Hisham El-Shishiny1, Pavel Volkov1
24 Sep 2007
TL;DR: This patent describes automatically building a contracted, trie-based dictionary from a given list of multi-word proper names and performing fuzzy searches in that dictionary.
Abstract: The present invention automatically builds a contracted dictionary from a given list of multi-word proper names and performs fuzzy searches in the contracted dictionary. The contracted dictionary of proper names includes two linked trie-based dictionaries: a first dictionary is used to store single word names, each word name having an ID number; and a second dictionary is used to store multi-word names encoded with ID numbers. Information related to the multi-word names is also stored as a gloss to the terminal node of the multi-word entry of the trie-based dictionary. An approximate lookup for a multi-word name is conducted first for each word of the multi-word name using an approximate matching technique such as a phonetic proximity or a simple edit distance. Accordingly, N suggestions are determined for each word of the multi-word name under consideration. Then, multi-word candidates are assembled in ID notation. Finally, an approximate search for each assembled candidate is performed based on an edit distance or an n-gram approximate string matching. Edit distances and n-grams are used to measure how similar two strings are. The result is a set of multi-word suggestions in an ID notation. This ID notation is encoded back to the original form using the first trie-based dictionary.

19 citations
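
The lookup pipeline in the patent abstract above (per-word suggestions, candidates assembled in ID notation, a second approximate search, then decoding) can be sketched in a few lines of Python. This is a loose illustration under stated assumptions: plain dicts and sets stand in for the two tries, edit distance is the only similarity measure, glosses are omitted, and the names `build_dictionaries` and `fuzzy_lookup` and all parameters are hypothetical.

```python
from itertools import product


def edit_distance(a, b):
    """Unit-cost Levenshtein distance; works on strings and on ID tuples."""
    prev = list(range(len(b) + 1))
    for i, xa in enumerate(a, 1):
        cur = [i]
        for j, xb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (xa != xb)))
        prev = cur
    return prev[-1]


def build_dictionaries(names):
    """First dictionary: word -> ID. Second: set of names as ID tuples."""
    word_ids, multi = {}, set()
    for name in names:
        ids = tuple(word_ids.setdefault(w, len(word_ids))
                    for w in name.lower().split())
        multi.add(ids)
    return word_ids, multi


def fuzzy_lookup(query, word_ids, multi, n=2, max_word_dist=1, max_seq_dist=0):
    id_to_word = {i: w for w, i in word_ids.items()}
    per_word = []
    for w in query.lower().split():
        # Step 1: up to n suggestions per query word, ranked by edit distance.
        ranked = sorted(word_ids, key=lambda v: (edit_distance(w, v), v))
        close = [word_ids[v] for v in ranked[:n]
                 if edit_distance(w, v) <= max_word_dist]
        if not close:
            return []
        per_word.append(close)
    results = set()
    # Step 2: assemble candidates in ID notation.
    for cand in product(*per_word):
        # Step 3: approximate search over ID sequences.
        for ids in multi:
            if edit_distance(cand, ids) <= max_seq_dist:
                # Step 4: decode the ID notation back to words.
                results.add(" ".join(id_to_word[i] for i in ids))
    return sorted(results)
```

Usage: after `word_ids, multi = build_dictionaries(["john smith", "jon smyth", "mary jones"])`, the query "jon smith" returns both "john smith" and "jon smyth", because each query word has two suggestions within distance 1. Raising `max_seq_dist` would additionally tolerate a missing or extra word.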


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  8
2022  30
2021  32
2020  30
2019  48
2018  39