Topic
Approximate string matching
About: Approximate string matching is a research topic. Over the lifetime, 1903 publications have been published within this topic receiving 62352 citations. The topic is also known as: fuzzy string-searching algorithm & fuzzy string-matching algorithm.
Papers published on a yearly basis
Papers
More filters
••
25 Jun 2003TL;DR: This work provides an answer to the question whether the MEDIAN STRING problem is NP-complete for finite and even binary alphabets and gives the complexity of the related CENTRESTRING problem.
Abstract: Given a finite set of strings, the MEDIAN STRING problem consists in finding a string that minimizes the sum of the distances to the strings in the set. Approximations of the median string are used in a very broad range of applications where one needs a representative string that summarizes common information to the strings of the set. It is the case in Classification, in Speech and Pattern Recognition, and in Computational Biology. In the latter, MEDIAN STRING is related to the key problem of Multiple Alignment. In the recent literature, one finds a theorem stating the NP-completeness of the MEDIAN STRING for unbounded alphabets. However, in the above mentioned areas, the alphabet is often finite. Thus, it remains a crucial question whether the MEDIAN STRING problem is NP-complete for finite and even binary alphabets. In this work, we provide an answer to this question and also give the complexity of the related CENTRE STRING problem. Moreover, we study the parametrized complexity of both problems with respect to the number of input strings.
35 citations
••
29 Apr 1992TL;DR: The standard string matching problem involves finding all occurrences of a single pattern in a single text, while there are some domains in which it is more appropriate to deal with dictionaries of patterns.
Abstract: The standard string matching problem involves finding all occurrences of a single pattern in a single text. While this approach works well in many application areas, there are some domains in which it is more appropriate to deal with dictionaries of patterns. A dictionary is a set of patterns; the goal of dictionary matching is to find all dictionary patterns in a given text, simultaneously.
34 citations
•
AT&T1
TL;DR: In this paper, the inverted indexes are updated only as necessary to guarantee answer precision within predefined thresholds which are determined with little cost in comparison to the updates of the indexes themselves.
Abstract: In embodiments of the disclosed technology, indexes, such as inverted indexes, are updated only as necessary to guarantee answer precision within predefined thresholds which are determined with little cost in comparison to the updates of the indexes themselves. With the present technology, a batch of daily updates can be processed in a matter of minutes, rather than a few hours for rebuilding an index, and a query may be answered with assurances that the results are accurate or within a threshold of accuracy.
34 citations
••
TL;DR: It turns out that this new variant of the Boyer-Moore string matching algorithm achieve very good results in terms of both time efficiency and number of character inspections, especially in the cases in which the patterns are very short.
Abstract: We present a new variant of the Boyer-Moore string matching algorithm which, though not linear, is very fast in practice.
We compare our algorithm with the Horspool, Quick Search, Tuned Boyer-Moore, and Reverse Factor algorithms, which are among the fastest string matching algorithms for practical uses. It turns out that our algorithm achieve very good results in terms of both time efficiency and number of character inspections, especially in the cases in which the patterns are very short.
34 citations
••
08 Apr 2019TL;DR: M-Join is proposed, a multi-level filtering approach for fuzzy string similarity join that provides a flexible framework that can support multiple similarity functions at both levels and clearly outperforms state-of-the-art methods.
Abstract: As an essential operation in data integration and data cleaning, similarity join has attracted considerable attention from the database community. In many application scenarios, it is essential to support fuzzy matching, which allows approximate matching between elements that improves the effectiveness of string similarity join. To describe the fuzzy matching between strings, we consider two levels of similarity, i.e., element-level and record-level similarity. Then the problem of calculating fuzzy matching similarity can be transformed into finding the weighted maximal matching in a bipartite graph. In this paper, we propose MF-Join, a multi-level filtering approach for fuzzy string similarity join. MF-Join provides a flexible framework that can support multiple similarity functions at both levels. To improve performance, we devise and implement several techniques to enhance the filter power. Specifically, we utilize a partition-based signature at the element-level and propose a frequency-aware partition strategy to improve the quality of signatures. We also devise a count filter at the record level to further prune dissimilar pairs. Moreover, we deduce an effective upper bound for the record-level similarity to reduce the computational overhead of verification. Experimental results on two popular datasets shows that our proposed method clearly outperforms state-of-the-art methods.
34 citations