Topic

Approximate string matching

About: Approximate string matching is a research topic. To date, 1903 publications have appeared within this topic, receiving 62352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.


Papers
Proceedings Article
29 Jun 2010
TL;DR: Proposes a new method that detects all variations of a vulgar word with phoneme modification by applying a phoneme-based string alignment, and empirically determines the number of pivot words that creates a near-optimal search space.
Abstract: Verbal abuse is becoming a serious social problem in online communication, because anonymity makes it easier to use profanities. Detecting and removing words that have been registered in a forbidden list is a straightforward filtering method. This is simple, but preparing the forbidden word list is difficult because newly coined words have to be added to the lexicon. Korean in particular is an agglutinative language, so new variations of a vulgar word are easy to construct without hindering textual communication in an online environment. In this paper we propose a new method to detect all variations of a vulgar word with phoneme modification by applying a phoneme-based string alignment. However, aligning a query word against all vulgar words registered in a database is computationally expensive. We propose an R*-tree based searching algorithm to overcome this expensive computation. The method applies the metric space property of string edit distance. We prepared a word database with more than 9300 prototype vulgar words for the experiment. For a given query word, our algorithm quickly finds the best-aligned candidate word (0.006 sec. with 1000 words) that is within an edit distance of one unit. Our contribution is that we empirically found the number of pivot words needed to create a near-optimal search space.
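The abstract relies on edit distance being a metric, which lets precomputed distances to a few pivot words rule out candidates without aligning them. The sketch below illustrates only that pruning idea. It is not the paper's method: it uses character-level Levenshtein distance rather than phoneme-based alignment, and a linear scan over the pivot table rather than an R*-tree. The names `levenshtein`, `build_index`, and `search` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def build_index(words, pivots):
    """Precompute the distance from every word to every pivot (assumed layout)."""
    return {w: [levenshtein(w, p) for p in pivots] for w in words}


def search(query, index, pivots, radius):
    """Return (words within `radius` of query, number of full alignments done)."""
    q_to_pivots = [levenshtein(query, p) for p in pivots]
    hits, computed = [], 0
    for word, w_to_pivots in index.items():
        # Triangle inequality: |d(q, p) - d(w, p)| <= d(q, w) for every pivot p,
        # so a gap larger than `radius` proves the word cannot match.
        if any(abs(dq - dw) > radius
               for dq, dw in zip(q_to_pivots, w_to_pivots)):
            continue
        computed += 1
        if levenshtein(query, word) <= radius:
            hits.append(word)
    return hits, computed
```

With more pivots, more candidates are pruned but each query pays for more query-to-pivot distances, which is the trade-off the abstract says was tuned empirically.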

17 citations

01 Oct 2008
TL;DR: Applies an approximate string matching technique based on Levenshtein distance to index and search degraded document images without recognizing the textual patterns, thereby circumventing the cost and laborious effort of OCR technology.
Abstract: This paper is an attempt at indexing and searching degraded document images without recognizing the textual patterns, thereby circumventing the cost and laborious effort of OCR technology. The proposed approach deals with text-dominant documents, either handwritten or printed. From the preprocessing and segmentation stages, all the connected components (CC) of the text are extracted using a bottom-up approach. Each CC is then represented with global indices such as loops, ascenders, etc. Each document is associated with an ASCII file of the codes from the extracted features. Since there is no feature extraction technique reliable enough to locate all the discriminant global indices modelling handwriting or degraded prints, we apply an approximate string matching technique based on Levenshtein distance. As a result, the search module can efficiently cope with imprecise and incomplete pattern descriptions. The test was performed on some Arabic historical documents and showed good performance.
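Searching a document's feature-code file for a query that may be imprecisely coded amounts to approximate substring matching. A minimal sketch of the standard dynamic-programming variant (often attributed to Sellers) follows, assuming the codes are plain characters and unit edit costs. The abstract does not specify the paper's actual encoding or cost model, and the function name `approx_find` is hypothetical.

```python
def approx_find(pattern: str, text: str, k: int):
    """Return (end_position, distance) for every text prefix whose best-matching
    suffix is within Levenshtein distance k of the pattern.
    end_position counts characters of text consumed (1-based)."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix: D[i][0] = i
    ends = []
    for j, tc in enumerate(text, start=1):
        new = [0]  # row 0 stays 0: a match may start at any text position
        for i in range(1, m + 1):
            new.append(min(col[i] + 1,                          # skip text char
                           new[i - 1] + 1,                      # skip pattern char
                           col[i - 1] + (pattern[i - 1] != tc)))  # substitute/match
        col = new
        if col[m] <= k:
            ends.append((j, col[m]))
    return ends
```

The only change from whole-string edit distance is the zero in row 0, which makes the start of the match free; the running time is proportional to the product of the pattern and text lengths.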

17 citations

Book Chapter
02 Dec 2009
TL;DR: A protocol based on homomorphic encryption, combined with the novel notion of a share-hiding error-correcting secret sharing scheme, which is provably secure against passive adversaries, and has better efficiency than previous protocols for certain parameter values is presented.
Abstract: At Eurocrypt'04, Freedman, Nissim and Pinkas introduced a fuzzy private matching problem. The problem is defined as follows. Given two parties, each of them having a set of vectors where each vector has $T$ integer components, the fuzzy private matching is to securely test if each vector of one set matches any vector of another set for at least $t$ components where $t < T$. In the conclusion of their paper, they asked whether it was possible to design a fuzzy private matching protocol without incurring a communication complexity with the factor $\binom{T}{t}$. We answer their question in the affirmative by presenting a protocol based on homomorphic encryption, combined with the novel notion of a share-hiding error-correcting secret sharing scheme, which we show how to implement with efficient decoding using interleaved Reed-Solomon codes. This scheme may be of independent interest. Our protocol is provably secure against passive adversaries, and has better efficiency than previous protocols for certain parameter values.

17 citations

Journal Article
TL;DR: Presents a new algorithm for computing the edit distance of an uncompressed string against a run-length-encoded string; the result directly implies an O(min{mN, Mn}) time algorithm for strings of lengths m and n with M and N runs, respectively.

17 citations

Journal Article
TL;DR: Generalizes the problem of string matching with mismatches to weighted mismatches and presents an $O(n\log^{4} m)$ algorithm that approximates the results of this problem up to a factor of $O(\log m)$ in the case that the weight function is a metric.
Abstract: Given an alphabet $\Sigma=\{1,2,\ldots,|\Sigma|\}$, a text string $T\in\Sigma^{n}$ and a pattern string $P\in\Sigma^{m}$, for each $i=1,2,\ldots,n-m+1$ define $L_p(i)$ as the p-norm distance when the pattern is aligned below the text and starts at position $i$ of the text. The problem of pattern matching with $L_p$ distance is to compute $L_p(i)$ for every $i=1,2,\ldots,n-m+1$. We discuss the problem for $p=1,2,\infty$. First, in the case of $L_1$ matching (pattern matching with an $L_1$ distance) we show a reduction of the string matching with mismatches problem to the $L_1$ matching problem and we present an algorithm that approximates the $L_1$ matching up to a factor of $1+\varepsilon$, which has an $O(\frac{1}{\varepsilon^{2}}n\log m\log|\Sigma|)$ run time. Then, the $L_2$ matching problem (pattern matching with an $L_2$ distance) is solved with a simple $O(n\log m)$ time algorithm. Finally, we provide an algorithm that approximates the $L_\infty$ matching up to a factor of $1+\varepsilon$ with a run time of $O(\frac{1}{\varepsilon}n\log m\log|\Sigma|)$. We also generalize the problem of string matching with mismatches to have weighted mismatches and present an $O(n\log^{4} m)$ algorithm that approximates the results of this problem up to a factor of $O(\log m)$ in the case that the weight function is a metric.
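To make the definition of $L_p(i)$ concrete, here is a brute-force sketch that evaluates it directly at every alignment. It takes time proportional to $nm$ and is only a reference for the definition, not any of the faster algorithms the abstract describes. The function name `lp_distances` is hypothetical, positions are 0-based rather than the 1-based indexing used above, and the pattern is assumed to be non-empty.

```python
def lp_distances(text, pattern, p):
    """Return [L_p(0), ..., L_p(n - m)] for integer sequences text and pattern.
    Pass p = float('inf') for the L-infinity (maximum) distance."""
    n, m = len(text), len(pattern)
    out = []
    for i in range(n - m + 1):
        # Absolute differences with the pattern aligned at text position i.
        diffs = [abs(text[i + j] - pattern[j]) for j in range(m)]
        if p == float("inf"):
            out.append(max(diffs))
        else:
            out.append(sum(d ** p for d in diffs) ** (1.0 / p))
    return out
```

For example, with text `[1, 2, 3, 4]` and pattern `[1, 3]`, the $L_1$ distances at the three alignments are 1, 1 and 3, and the $L_\infty$ distances are 1, 1 and 2.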

17 citations


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    8
2022    30
2021    32
2020    30
2019    48
2018    39