Topic
Approximate string matching
About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have been published on this topic, receiving 62,352 citations. The topic is also known as: fuzzy string-searching algorithm and fuzzy string-matching algorithm.
Papers
•
TL;DR: In this article, the authors consider the problem of reconstructing a string from the multiset of its substring compositions and derive lower and upper bounds on the largest number of strings with given substring composition.
Abstract: Motivated by mass-spectrometry protein sequencing, we consider a simply-stated problem of reconstructing a string from the multiset of its substring compositions. We show that all strings of length 7, one less than a prime, or one less than twice a prime, can be reconstructed uniquely up to reversal. For all other lengths we show that reconstruction is not always possible and provide sometimes-tight bounds on the largest number of strings with given substring compositions. The lower bounds are derived by combinatorial arguments and the upper bounds by algebraic considerations that precisely characterize the set of strings with the same substring compositions in terms of the factorization of bivariate polynomials. The problem can be viewed as a combinatorial simplification of the turnpike problem, and its solution may shed light on this long-standing problem as well. Using well known results on transience of multi-dimensional random walks, we also provide a reconstruction algorithm that reconstructs random strings over alphabets of size $\ge4$ in optimal near-quadratic time.
8 citations
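To make the notion of substring compositions in the abstract above concrete, here is a minimal Python sketch. It assumes that a composition is the multiset of letters in a contiguous substring, with order discarded. The function names `substring_compositions` and `equicomposable_classes` are illustrative and do not come from the paper. The brute-force grouping is only practical for very short strings and is not the reconstruction algorithm the authors describe.

```python
from collections import Counter, defaultdict
from itertools import product


def substring_compositions(s):
    """Multiset of compositions of all contiguous substrings of s.

    A composition is represented as the sorted tuple of the substring's
    letters, so order inside the substring is forgotten.
    """
    comps = Counter()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            comps[tuple(sorted(s[i:j]))] += 1
    return comps


def equicomposable_classes(n, alphabet="ab"):
    """Group all strings of length n by their substring-composition multiset."""
    classes = defaultdict(list)
    for letters in product(alphabet, repeat=n):
        s = "".join(letters)
        key = frozenset(substring_compositions(s).items())
        classes[key].append(s)
    return list(classes.values())


# A string and its reversal always share the same compositions.
print(substring_compositions("aab") == substring_compositions("baa"))  # True
# A different arrangement of the same letters generally does not.
print(substring_compositions("aab") == substring_compositions("aba"))  # False
```

A string of length n has n(n+1)/2 contiguous substrings, so the multiset for "aab" holds six compositions. Any class returned by `equicomposable_classes` that contains more than a string and its reversal would be a case where unique reconstruction fails.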
•
10 Mar 2010
TL;DR: A method and a system for counting machine translation based on phrases are presented; the method comprises a step of fuzzy-matching the phrases of an input sentence against a preset phrase list.
Abstract: The invention provides a method and a system for counting machine translation based on phrases. The method comprises a step of fuzzy-matching the phrases of an input sentence against a preset phrase list. By fuzzy-matching the phrases, the method and the system can generate high-quality translations for longer phrases in the input sentence and can effectively improve translation quality compared with a phrase-based machine translation system that relies on exact matching.
8 citations
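The fuzzy lookup against a preset phrase list described above can be sketched roughly as follows. This is an illustration under assumptions, not the patented method: it assumes word-level Levenshtein distance as the similarity measure, and the names `edit_distance`, `fuzzy_lookup`, `max_ratio`, and the placeholder phrase table are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here, lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]


def fuzzy_lookup(phrase, phrase_table, max_ratio=0.34):
    """Return (ratio, source, target) for the closest table entry, or None.

    ratio is the word-level edit distance divided by the longer phrase
    length; entries with ratio above max_ratio (an assumed threshold)
    are rejected.
    """
    words = phrase.split()
    best = None
    for source, target in phrase_table.items():
        source_words = source.split()
        ratio = edit_distance(words, source_words) / max(len(words), len(source_words))
        if ratio <= max_ratio and (best is None or ratio < best[0]):
            best = (ratio, source, target)
    return best


# Hypothetical phrase table; the targets are placeholders, not real translations.
table = {
    "the cat sat on the mat": "<target phrase 1>",
    "a dog barked": "<target phrase 2>",
}
match = fuzzy_lookup("the cat sat on a mat", table)
print(match[1])  # the cat sat on the mat
```

An exact-match lookup would return nothing for the query above, since it differs from the stored phrase by one word. The fuzzy lookup still retrieves the long stored phrase, which is the benefit the abstract claims for longer input phrases.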
••
TL;DR: It is shown that the problem can be approximated in linear time for general patterns, and efficient exact solutions for different variants of the problem are provided, as well as a faster approximation.
8 citations
••
28 Jun 2017
TL;DR: A plagiarism detection algorithm based on approximate string matching, specialized for “copy and paste”-type plagiarism, is proposed together with a speed improvement to an implementation of the algorithm.
Abstract: Plagiarism detection in a large number of documents requires efficient methods. This paper proposes a plagiarism detection algorithm based on approximate string matching, specialized for “copy and paste”-type plagiarism, and a speed improvement to an implementation of the algorithm. The improvement omits most of the computations required by the algorithm by applying two kinds of approximations to the output used for plagiarism detection, while the resulting decrease in accuracy is acceptable. The effect of the improvement on the processing time and accuracy of the algorithm is evaluated experimentally on a data set. The experimental results show that the improvement reduces the processing time to approximately one-twentieth, at the cost of a 6.4% decrease in accuracy relative to the normal implementation of the algorithm.
8 citations
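The abstract above does not spell out its matching algorithm, so the following is only a generic sketch of approximate string matching as it might be applied to copy-and-paste detection. It uses the standard dynamic-programming approach often attributed to Sellers, in which a match may start at any text position. The function name `approximate_occurrences` and the threshold `k` are assumptions made for illustration, and none of the paper's speed-up approximations are reproduced.

```python
def approximate_occurrences(pattern, text, k):
    """Find places where pattern occurs in text with at most k edits.

    Returns a list of (end, distance) pairs: some substring of text ending
    at index end (exclusive) is within edit distance `distance` <= k of
    pattern. Runs in O(len(pattern) * len(text)) time.
    """
    m = len(pattern)
    # col[i] = best edit distance between pattern[:i] and any substring
    # of the text ending at the current position.
    col = list(range(m + 1))
    hits = []
    for end, c in enumerate(text, 1):
        new = [0]  # an occurrence may start anywhere, so the empty prefix costs 0
        for i in range(1, m + 1):
            new.append(min(col[i] + 1,                          # extra text character
                           new[i - 1] + 1,                      # missing pattern character
                           col[i - 1] + (pattern[i - 1] != c))) # substitute or match
        col = new
        if col[m] <= k:
            hits.append((end, col[m]))
    return hits


# An exact copy is reported with distance 0.
print(approximate_occurrences("abc", "xxabcxx", 0))  # [(5, 0)]
# A lightly edited copy ("helo" for "hello") is found once one edit is allowed.
print((8, 1) in approximate_occurrences("hello", "say helo world", 1))  # True
```

In a plagiarism setting, `pattern` would be a passage from a source document and `text` the document under suspicion. The quadratic cost of this baseline is the kind of expense that motivates approximations such as those the paper evaluates.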
••
01 Dec 2019
TL;DR: This work extends existing filtering-based subgraph matching algorithms and proposes a new set of filters leveraging the monotone function properties in the multiplex setting that enables effective pruning of irrelevant subgraph regions and expedites the overall matching process.
Abstract: We study the problem of detecting matching subgraphs in a large multiplex background network based on predefined subgraph templates. Our approach extends existing filtering-based subgraph matching algorithms and proposes a new set of filters leveraging the monotone function properties in the multiplex setting. This enables effective pruning of irrelevant subgraph regions and expedites the overall matching process. In addition, our approach proposes a new strategy based on maximum likelihood estimate to identify “closely matched” subgraphs that are not isomorphic to the given templates from a noisy background network. This allows us to generalize this approach to real-world networks, which are often noisy, incomplete and ambiguous. We demonstrate the effectiveness of the proposed method on a real-world multiplex network provided by the DARPA Modeling Adversarial Activity (MAA) program. Our approach obtains highly accurate subgraph matching results for both the clean and noisy versions of the network, which significantly outperforms the baseline filtering methods. Furthermore, our proposed approach is parallelizable such that it can scale up to handle large input networks.
8 citations