Showing papers on "Approximate string matching" published in 2016


Journal ArticleDOI
TL;DR: A stacked ensemble approach is implemented for biomedical named entity recognition of disease names, combined with fuzzy string matching to tag rare disease names from the authors' in-house disease dictionary.

46 citations


Proceedings Article
01 Jan 2016
TL;DR: Smart is presented, an efficient and flexible tool designed for developing, testing, comparing and evaluating string matching algorithms, which provides the most comprehensive survey of online exact single string matching algorithms together with a set of corpora available for testing purposes.
Abstract: String matching is the problem of finding all occurrences of a given pattern in a given text. It is an extensively studied problem in computer science because of its direct application to several areas such as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Since 1970 more than 85 string matching algorithms have been proposed, and more than 50% of them in the last ten years. In this paper we present Smart, an efficient and flexible tool designed for developing, testing, comparing and evaluating string matching algorithms. It also provides the most comprehensive survey of online exact single string matching algorithms together with a set of corpora available for testing purposes.

28 citations


Proceedings ArticleDOI
01 Dec 2016
TL;DR: This work proposes an AP-accelerated ER solution, which accelerates the performance bottleneck of fuzzy matching for similar but potentially inexactly-matched names; compared with several conventional methods, it achieves both promising speedups and better accuracy.
Abstract: Entity Resolution (ER), the process of finding identical entities across different databases, is critical to many information-integration applications. As sizes of databases explode in the big-data era, it becomes computationally expensive to recognize identical entities among all records with variations allowed across multiple databases. Profiling results show that approximate matching is the primary bottleneck. The Automata Processor (AP), an efficient and scalable semiconductor architecture for parallel automata processing, provides a new opportunity for hardware acceleration for ER. We propose an AP-accelerated ER solution, which accelerates the performance bottleneck of fuzzy matching for similar but potentially inexactly-matched names, and use several different real-world applications to illustrate its effectiveness. We compared the proposed method with several conventional methods and achieved both promising speedups and better accuracy (more correct pairs and less generalized merge distance cost) for different datasets.

28 citations


Journal ArticleDOI
TL;DR: ALFRED is presented, an alignment-free distance computation method, which solves the generalized common substring search problem via exact computation and exactly reconstructs the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed.
Abstract: Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged as one of the well-known alignment-free approaches. Two recent works further generalize this ACS approach by allowing a bounded number k of mismatches in the common substrings, relying on approximation (linear time) and exact computation, respectively. Albeit having a good worst-case time complexity [Formula: see text], the exact approach is complex and unlikely to be efficient in practice. Herein, we present ALFRED, an alignment-free distance computation method, which solves the generalized common substring search problem via exact computation. Compared to the theoretical approach, our algorithm is easier to implement and more practical to use, while still providing highly competitive theoretical performance with an expected run-time of [Formula: see text]. By applying our program to phylogenetic inference as a case study, we find that it exactly reconstructs the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed. ALFRED is implemented in C++ and the source code is freely available online.

27 citations


Journal ArticleDOI
TL;DR: This paper introduces a novel approach to the biometric score fusion problem, viewed as a fuzzy pattern recognition problem; the approach significantly improves on the performance of the single best biometric matcher and reaches results comparable to several relevant methods.

22 citations


Journal ArticleDOI
TL;DR: The creation of a single string-matching measure that can perform toponym matching regardless of the language was attempted; the resulting ASM measure, called DAS, comprises name similarity, word similarity and sentence similarity phases.
Abstract: Approximate string matching (ASM) is a challenging problem, which aims to match different string expressions representing the same object. In this paper, detailed experimental studies were conducted on toponym matching, a new domain where ASM can be performed, and the creation of a single string-matching measure that can perform toponym matching regardless of the language was attempted. For this purpose, an ASM measure called DAS, which comprises name similarity, word similarity and sentence similarity phases, was created. Considering the experimental results, the retrieval performance and system accuracy of DAS were much better than those of five other well-known measures that were compared on toponym test datasets. In addition, DAS had the best mean average precision values in six languages, and precision/recall graphs confirm this result.

22 citations


Journal ArticleDOI
01 Dec 2016
TL;DR: Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.
Abstract: The aim of the study was to test whether query expansion by approximate string matching methods is beneficial in retrieval from historical newspaper collections in a language rich with compounds and inflectional forms (Finnish). First, approximate string matching methods were used to generate lists of index words most similar to contemporary query terms in a digitized newspaper collection from the 1800s. Top index word variants were categorized to estimate the appropriate query expansion ranges in the retrieval test. Second, the effectiveness of approximate string matching methods, automatically generated inflectional forms, and their combinations were measured in a Cranfield-style test. Finally, a detailed topic-level analysis of test results was conducted. In the index of a historical newspaper collection, the occurrences of a word typically spread to many linguistic and historical variants along with optical character recognition (OCR) errors. All query expansion methods improved the baseline results. Extensive expansion of around 30 variants for each query word was required to achieve the highest performance improvement. Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.

22 citations


Journal ArticleDOI
TL;DR: A formalism, called search schemes, is introduced to specify search strategies of this type, a probabilistic measure for the efficiency of a search scheme is developed, several combinatorial results on efficient search schemes are proved, and experimental computations supporting the superiority of these strategies are provided.

19 citations


Journal ArticleDOI
TL;DR: This study introduces an approximate pattern matching problem with Hamming distance and proposes an efficient algorithm named Single-rOot Nettree for approximate pattern matchinG with gap constraints (SONG), based on a new non-linear data structure, the Single-root Nettree, to effectively solve the problem.
Abstract: Pattern matching is a key issue in sequential pattern mining. Many researchers now focus on pattern matching with gap constraints. However, most of these studies involve exact pattern matching, which is a special case of the more challenging approximate pattern matching problem. In this study, we introduce an approximate pattern matching problem with Hamming distance. Its objective is to compute the number of approximate occurrences of pattern P with gap constraints in sequence S under similarity constraint d. We propose an efficient algorithm named Single-rOot Nettree for approximate pattern matchinG with gap constraints (SONG), based on a new non-linear data structure, the Single-root Nettree, to effectively solve the problem. Theoretical analysis and experiments demonstrate an interesting law: the ratio M(P,S,d)/N(P,S,m) approximately follows a binomial distribution, where M(P,S,d) and N(P,S,m) are the numbers of approximate occurrences whose distances to pattern P are d (0 ≤ d ≤ m) and no more than m (the length of pattern P), respectively. Experimental results on real biological data validate the efficiency and effectiveness of SONG.
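
The counting problem stated in this abstract can be illustrated with a plain dynamic program. This is a minimal sketch and not the SONG algorithm or its Nettree structure: the function name, the use of a single uniform gap range `[min_gap, max_gap]` between consecutive pattern characters, and the reading of the similarity constraint as "at most d mismatches" are all assumptions made for illustration.

```python
def count_gapped_occurrences(pattern, seq, min_gap, max_gap, d):
    """Count position tuples i_1 < ... < i_m in seq such that each gap
    i_{j+1} - i_j - 1 lies in [min_gap, max_gap] and the Hamming distance
    between pattern and the selected characters is at most d.
    (Hypothetical helper; a naive DP, not the Nettree-based method.)"""
    m, n = len(pattern), len(seq)
    if m == 0 or n == 0:
        return 0
    # prev[i][e]: number of partial occurrences whose latest pattern
    # character is placed at seq[i] with exactly e mismatches so far.
    prev = [[0] * (d + 1) for _ in range(n)]
    for i in range(n):
        e = 0 if seq[i] == pattern[0] else 1
        if e <= d:
            prev[i][e] = 1
    for j in range(1, m):
        cur = [[0] * (d + 1) for _ in range(n)]
        for i in range(n):
            miss = 0 if seq[i] == pattern[j] else 1
            # Previous position q must satisfy min_gap <= i - q - 1 <= max_gap.
            lo = max(0, i - max_gap - 1)
            hi = i - min_gap - 1
            for q in range(lo, hi + 1):
                for e in range(d + 1 - miss):
                    cur[i][e + miss] += prev[q][e]
        prev = cur
    return sum(sum(row) for row in prev)


if __name__ == "__main__":
    # Pattern "ab" in "aab" with gaps of 0 or 1 characters.
    print(count_gapped_occurrences("ab", "aab", 0, 1, 0))
    print(count_gapped_occurrences("ab", "aab", 0, 1, 1))
```

The table has n × (d + 1) entries per pattern position, so this sketch is only practical for short sequences; it is meant to make the problem definition concrete.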

19 citations


Journal ArticleDOI
TL;DR: Harry is a small tool specifically designed for measuring the similarity of strings and implements over 20 similarity measures, including common string distances and string kernels, such as the Levenshtein distance and the Subsequence kernel.
Abstract: Comparing strings and assessing their similarity is a basic operation in many application domains of machine learning, such as in information retrieval, natural language processing and bioinformatics. The practitioner can choose from a large variety of available similarity measures for this task, each emphasizing different aspects of the string data. In this article, we present Harry, a small tool specifically designed for measuring the similarity of strings. Harry implements over 20 similarity measures, including common string distances and string kernels, such as the Levenshtein distance and the Subsequence kernel. The tool has been designed with efficiency in mind and allows for multi-threaded as well as distributed computing, enabling the analysis of large data sets of strings. Harry supports common data formats and thus can interface with analysis environments, such as Matlab, Pylab and Weka.

14 citations


Journal ArticleDOI
TL;DR: This article presents an O(n)-time algorithm for computing the prefix table of x, and outlines a number of applications of this result for solving various problems on non-standard strings, and presents some preliminary experimental results.

Book ChapterDOI
01 Jan 2016
TL;DR: A proof-of-concept implementation of Myers bit-vector algorithm for approximate string matching in hardware is presented, accelerated by using the massive parallel computing capabilities of a field programmable gate array (FPGA).
Abstract: We present a proof-of-concept implementation of Myers bit-vector algorithm for approximate string matching in hardware. In terms of bit-vector operations, the algorithm is accelerated by using the massive parallel computing capabilities of a field programmable gate array (FPGA). The system is realized on an embedded platform with a high computational and energy efficiency. Compared to the fastest software implementation running on the embedded processor, the hardware achieves an overall speed-up of approximately 2 and a speed-up of approximately 8 considering the computation only.
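
For readers unfamiliar with the algorithm being accelerated, the following is a minimal software sketch of Myers' bit-vector algorithm in its text-search form, using Python integers as bit-vectors. It is not the FPGA design from this entry; the function name and the choice to report 0-based end positions are assumptions made for illustration.

```python
def myers_search(pattern, text, k):
    """Return the 0-based end positions in text where some substring
    matches pattern with edit distance <= k (Myers' bit-vector method)."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1      # keeps every vector to m bits
    high = 1 << (m - 1)      # bit for the last pattern row
    # peq[c] has bit i set when pattern[i] == c.
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv = mask, 0         # vertical deltas: +1 everywhere at the start
    score = m                # edit distance in the last row of the column
    ends = []
    for pos, ch in enumerate(text):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal delta +1
        mh = pv & xh                    # horizontal delta -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Search variant: row 0 stays at 0, so no bit is shifted in.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            ends.append(pos)
    return ends


if __name__ == "__main__":
    print(myers_search("abc", "xxabcxx", 0))
    print(myers_search("abc", "xxabcxx", 1))
```

Each text character costs a constant number of bit-vector operations, which is the property that makes the algorithm attractive for hardware; Python's arbitrary-precision integers stand in here for the machine words or hardware registers.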

Proceedings ArticleDOI
05 Apr 2016
TL;DR: This paper uses the parallelism capabilities of the Graphics Processing Unit (GPU) to accelerate one of the most common algorithms to compute the edit distance between two strings, which is known as the Levenshtein distance, and employs a diagonal-based tracing technique which results in even greater improvements in terms of the running time.
Abstract: Sequence comparison problems such as sequence alignment and approximate string matching are part of the fundamental problems in many fields such as natural language processing, data mining and bioinformatics. However, the algorithms proposed to address these problems suffer from high computational complexities prohibiting them from being widely used in practical large-scale settings. Many researchers used parallel programming to reduce the execution time of these algorithms. In this paper, we follow this approach and use the parallelism capabilities of the Graphics Processing Unit (GPU) to accelerate one of the most common algorithms to compute the edit distance between two strings, which is known as the Levenshtein distance. To take full advantage of the large number of cores in a GPU, we employ a diagonal-based tracing technique which results in even greater improvements in terms of the running time. In fact, our CUDA implementation of the Levenshtein algorithm is about 11X faster than the sequential implementation. This is achieved without affecting the accuracy.
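
The diagonal-based idea can be sketched sequentially: every cell on the same anti-diagonal of the Levenshtein table depends only on the two previous anti-diagonals, so those cells could be computed in parallel. The code below is a plain Python illustration of that traversal order under this assumption; it is not the authors' CUDA kernel, and the function name is hypothetical.

```python
def levenshtein_antidiagonal(a, b):
    """Levenshtein distance computed anti-diagonal by anti-diagonal."""
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i            # delete i characters of a
    for j in range(m + 1):
        dist[0][j] = j            # insert j characters of b
    # Anti-diagonal d holds the interior cells (i, j) with i + j == d.
    for d in range(2, n + m + 1):
        i_lo = max(1, d - m)
        i_hi = min(n, d - 1)
        # The iterations of this inner loop are mutually independent;
        # on a GPU each one could be handled by a separate thread.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[n][m]


if __name__ == "__main__":
    print(levenshtein_antidiagonal("kitten", "sitting"))
```

The result is identical to the usual row-by-row computation; only the order in which cells are visited changes.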

Journal ArticleDOI
TL;DR: The results obtained not only confirm the consistency across languages of this kind of character n-gram based approaches, but also constitute a further proof of their validity and applicability, these not being tied to a given implementation.

Journal ArticleDOI
TL;DR: The siEDM algorithm builds an index structure by leveraging the idea behind the edit sensitive parsing (ESP), an efficient algorithm enabling approximately computing EDM with guarantees of upper and lower bounds for the exact EDM.
Abstract: Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm with editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has a wide range of potential applications, especially in approximate string retrieval. Despite the importance of computing EDM, there has been no efficient method for indexing and searching large text collections based on the EDM measure. We propose the first algorithm, named string index for edit distance with moves (siEDM), for indexing and searching strings with EDM. The siEDM algorithm builds an index structure by leveraging the idea behind the edit sensitive parsing (ESP), an efficient algorithm enabling approximately computing EDM with guarantees of upper and lower bounds for the exact EDM. siEDM efficiently prunes the space for searching query strings by the proposed method, which enables fast query searches with the same guarantee as ESP. We experimentally tested the ability of siEDM to index and search strings on benchmark datasets, and we showed siEDM’s efficiency.

Proceedings ArticleDOI
01 Jan 2016
TL;DR: This paper presents a simple 7/2-approximation algorithm for the Maximum Duo-Preservation String Mapping (MPSM) problem, which improves on the previously best-known 4-approximation algorithm by computing a simple local optimum.
Abstract: This paper presents a simple 7/2-approximation algorithm for the Maximum Duo-Preservation String Mapping (MPSM) problem. This problem is complementary to the classical and well studied min common string partition problem (MCSP), that computes the minimal edit distance between two strings when the only operation allowed is to shift blocks of characters. The algorithm improves on the previously best-known 4-approximation algorithm by computing a simple local optimum.

Journal ArticleDOI
01 Apr 2016
TL;DR: A combined algorithm is offered, which has been developed on the basis of well-known Knuth-Morris-Pratt and Boyer-Moore string searching algorithms, and allows acquiring the larger shift in case of pattern and string characters' mismatch.
Abstract: The string searching task can be classified as a classic information processing task. Users either encounter the solution of this task while working with text processors or browsers, employing standard built-in tools, or the task is solved unseen by the users while they are working with various computer programs. Nowadays there are many algorithms for solving the string searching problem. The main criterion of these algorithms' effectiveness is searching speed. The larger the shift of the pattern relative to the string when pattern and string characters mismatch, the higher the algorithm's running speed. This article offers a combined algorithm, which has been developed on the basis of the well-known Knuth-Morris-Pratt and Boyer-Moore string searching algorithms. These algorithms are based on two different basic principles of pattern matching: the Knuth-Morris-Pratt algorithm is based upon forward pattern matching and the Boyer-Moore algorithm is based upon backward pattern matching. By uniting these two algorithms, the combined algorithm obtains a larger shift when pattern and string characters mismatch. The article provides an example that illustrates the results of the Boyer-Moore, Knuth-Morris-Pratt and combined algorithms and shows the advantage of the latter in solving the string searching problem.

Proceedings ArticleDOI
17 Mar 2016
TL;DR: A novel pattern grouping algorithm for heterogeneous bit-split string matching architectures that achieves an average of 41% reduction in memory consumption compared to the best existing approach found in the literature, while offering orders of magnitude faster execution time compared to an exhaustive search.
Abstract: The increasing complexity of cyber-attacks necessitates the design of more efficient hardware architectures for real-time Intrusion Detection Systems (IDSs). String matching is the main performance-demanding component of an IDS. An effective technique to design high-performance string matching engines is to partition the target set of strings into multiple subgroups and to use a parallel string matching hardware unit for each subgroup. This paper introduces a novel pattern grouping algorithm for heterogeneous bit-split string matching architectures. The proposed algorithm presents a reliable method to estimate the correlation between strings. The correlation factors are then used to find a preferred group for each string in a seed growing approach. Experimental results demonstrate that the proposed algorithm achieves an average of 41% reduction in memory consumption compared to the best existing approach found in the literature, while offering orders of magnitude faster execution time compared to an exhaustive search.

Journal ArticleDOI
TL;DR: This work introduces a new type of seeds: the 010 seeds, made of two exact parts separated by parts with exactly one error, and shows that those seeds are lossless, and applies them to two filtration algorithms for two popular applications.

Journal ArticleDOI
TL;DR: libFLASM can be used to improve the accuracy of multiple circular sequence alignment in terms of the inferred likelihood-based phylogenies, and to efficiently find motifs in molecular sequences representing regulatory or functional regions.
Abstract: Approximate string matching is the problem of finding all factors of a given text that are at a distance at most k from a given pattern. Fixed-length approximate string matching is the problem of finding all factors of a text of length n that are at a distance at most k from any factor of length $\ell$ of a pattern of length m. There exist bit-vector techniques to solve the fixed-length approximate string matching problem in time $\mathcal{O}(m\lceil \ell/w \rceil n)$ and space $\mathcal{O}(m\lceil \ell/w \rceil)$ under the edit and Hamming distance models, where w is the size of the computer word; as such these techniques are independent of the distance threshold k or the alphabet size. Fixed-length approximate string matching is a generalisation of approximate string matching and, hence, has numerous direct applications in computational molecular biology and elsewhere. We present and make available libFLASM, a free open-source C++ software library for solving fixed-length approximate string matching under both the edit and the Hamming distance models. Moreover we describe how fixed-length approximate string matching is applied to solve real problems by incorporating libFLASM into established applications for multiple circular sequence alignment as well as single and structured motif extraction. Specifically, we describe how it can be used to improve the accuracy of multiple circular sequence alignment in terms of the inferred likelihood-based phylogenies; and we also describe how it is used to efficiently find motifs in molecular sequences representing regulatory or functional regions. The comparison of the performance of the library to other algorithms shows that it is competitive, especially with increasing distance thresholds. The extensive experimental results presented here suggest that other applications could benefit from using libFLASM, and thus further maintenance and development of libFLASM is desirable.
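
The problem definition in this abstract can be made concrete with a brute-force sketch under the Hamming distance model. This is not libFLASM's API and it does not use the bit-vector techniques the abstract refers to; the function name and the `(text_start, pattern_start, distance)` output format are assumptions made for illustration.

```python
def flasm_hamming_bruteforce(text, pattern, ell, k):
    """Report every pair of length-ell factors, one from text and one from
    pattern, whose Hamming distance is at most k.
    Returns a list of (text_start, pattern_start, distance) tuples."""
    hits = []
    for i in range(len(text) - ell + 1):
        t_factor = text[i:i + ell]
        for j in range(len(pattern) - ell + 1):
            p_factor = pattern[j:j + ell]
            # Hamming distance: count mismatching aligned positions.
            dist = sum(x != y for x, y in zip(t_factor, p_factor))
            if dist <= k:
                hits.append((i, j, dist))
    return hits


if __name__ == "__main__":
    print(flasm_hamming_bruteforce("GATTACA", "ATTAG", 4, 1))
```

This naive version compares every pair of factors character by character, so it takes time proportional to n · m · ℓ; it is useful mainly as a reference against which a faster implementation could be checked.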

Posted Content
TL;DR: This work calculates the distance between two string variables using the Jaro-Winkler distance metric, used in record linkage to compare first or last names in different sources.
Abstract: jarowinkler calculates the distance between two string variables using the Jaro-Winkler distance metric. The distance metric is often used in record linkage to compare first or last names in different sources.
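
As an illustration of the metric itself, here is a minimal Python sketch of the standard Jaro and Jaro-Winkler similarity. It is not the `jarowinkler` command described in this entry; the function names, the prefix scaling factor `p=0.1`, the prefix cap of 4 characters, and the conversion to a distance as 1 minus the similarity are assumptions based on the commonly cited definition.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0
    # Characters match only if they are within this many positions.
    window = max(max(n1, n2) // 2 - 1, 0)
    matched1 = [False] * n1
    matched2 = [False] * n2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(n2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    j = 0
    for i in range(n1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    t = out_of_order // 2
    return (matches / n1 + matches / n2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro-Winkler similarity: Jaro boosted by the common prefix length."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def jaro_winkler_distance(s1, s2):
    """Distance form, assumed here to be 1 - similarity."""
    return 1.0 - jaro_winkler(s1, s2)


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The prefix boost is what makes the measure popular for personal names in record linkage: two names that agree on their first few letters score higher than two names with the same number of matching characters elsewhere.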

Posted Content
TL;DR: A comprehensive bibliography for the online exact string matching problem is presented, containing a comprehensive list of (almost) all string matching algorithms proposed since 1970.
Abstract: In this short note we present a comprehensive bibliography for the online exact string matching problem. The problem consists in finding all occurrences of a given pattern in a text. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Since 1970 more than 120 string matching algorithms have been proposed. In this note we present a comprehensive list of (almost) all string matching algorithms. The list is updated to May 2016.

Proceedings ArticleDOI
21 Feb 2016
TL;DR: This paper proposes novel integer linear programming-based methods for finding median and center strings for a probability distribution on a set of strings under Levenshtein distance, and restricts several variables to a region near the diagonal in the formulation.
Abstract: We address problems of finding median and center strings for a probability distribution on a set of strings under Levenshtein distance, which are known to be NP-hard in a special case. There are many applications in various research fields, for instance, to find functional motifs in protein amino acid sequences, and to recognize shapes and characters in image processing. In this paper, we propose novel integer linear programming-based methods for finding median and center strings for a probability distribution on a set of strings under Levenshtein distance. Furthermore, we restrict several variables to a region near the diagonal in the formulation, and propose novel integer linear programming-based methods also for finding approximate median and center strings for a probability distribution on a set of strings. For evaluation of our proposed methods, we perform several computational experiments, and show that the restricted formulation reduced the execution time.

Journal ArticleDOI
TL;DR: An efficient alphabet-independent Four-Russians' lookup table is presented that requires $O(3^{2t}(2t)!\,t)$ space and can be constructed and used irrespective of the alphabet size.

Proceedings ArticleDOI
01 Nov 2016
TL;DR: The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU that relies on warp shuffle operations, which are used to reduce the communication overhead between threads.
Abstract: The task of finding strings having a partial match to a given pattern is of interest to a number of practical applications, including DNA sequencing and text searching. Owing to its importance, alternatives to accelerate the Approximate String Matching (ASM) have been widely investigated in the literature. The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU. The key idea of our implementation relies on warp shuffle operations, which are used to reduce the communication overhead between threads. Experimental results, carried out on a GeForce GTX 960 GPU, show that the proposed implementation provides acceleration between 1.31 and 1.84 times when compared to another noteworthy alternative.

Book ChapterDOI
21 Feb 2016
TL;DR: This study generalizes the definitions of median and center strings of string data into those of a probability distribution on a set of all strings composed of letters in a given alphabet, which corresponds to generalizing a mean of numerical data into an expected value of a probability distribution on a set of numbers or numerical vectors.
Abstract: For a data set composed of numbers or numerical vectors, a mean is the most fundamental measure for capturing the center of the data. However, for a data set of strings, a mean of the data cannot be defined, and therefore, median and center strings are frequently used as a measure of the center of the data. In contrast to calculating a mean of numerical data, constructing median and center strings of string data is not easy, and no algorithm is found that is guaranteed to construct exact solutions of center strings. In this study, we first generalize the definitions of median and center strings of string data into those of a probability distribution on a set of all strings composed of letters in a given alphabet. This generalization corresponds to that of a mean of numerical data into an expected value of a probability distribution on a set of numbers or numerical vectors. Next, we develop methods for constructing exact solutions of median and center strings for a probability distribution on a set of strings, applying integer linear programming. These methods are improved into faster ones by using the triangle inequality on the Levenshtein distance in the case where a set of strings is a metric space with the Levenshtein distance. Furthermore, we also develop methods for constructing approximate solutions of median and center strings very rapidly if the probability of a subset composed of similar strings is close to one. Lastly, we perform simulation experiments to examine the usefulness of our proposed methods in practical applications.
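
The two notions can be illustrated with a small sketch that restricts the candidates to the strings in the distribution's support (often called the set median and set center). This is not the integer linear programming method of this entry, which searches over all strings. The function names are hypothetical, and the definitions used here are assumptions: the median minimizes the expected Levenshtein distance, and the center minimizes the maximum distance to any string with nonzero probability.

```python
def levenshtein(a, b):
    """Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def set_median_and_center(distribution):
    """distribution maps each string to its probability.
    Returns (median, center), both chosen from the support only.
    Ties are broken alphabetically."""
    support = [s for s, p in distribution.items() if p > 0]

    def expected_distance(x):
        return sum(distribution[s] * levenshtein(x, s) for s in support)

    def radius(x):
        return max(levenshtein(x, s) for s in support)

    median = min(support, key=lambda x: (expected_distance(x), x))
    center = min(support, key=lambda x: (radius(x), x))
    return median, center


if __name__ == "__main__":
    dist = {"aaaa": 0.9, "aabb": 0.05, "bbbb": 0.05}
    print(set_median_and_center(dist))
```

In the example the median follows the probability mass toward the dominant string, while the center ignores the probabilities and picks the string that sits in the middle of the support.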

Journal ArticleDOI
01 Mar 2016
TL;DR: A fuzzy string matching algorithm is applied for self‐citation detection and near full recall can be achieved with the proposed method while incurring only negligible precision loss.
Abstract: In this article I investigate the shortcomings of exact string match-based author self-citation detection methods. The contributions of this study are twofold. First, I apply a fuzzy string matching algorithm for self-citation detection and benchmark this approach and other common methods of exclusively author name-based self-citation detection against a manually curated ground truth sample. Near full recall can be achieved with the proposed method while incurring only negligible precision loss. Second, I report some important observations from the results about the extent of latent self-citations and their characteristics and give an example of the effect of improved self-citation detection on the document level self-citation rate of real data.

Journal ArticleDOI
TL;DR: This is the first paper that tries to solve the backtracking problem of ASM_ST_DFS in both theory and practice and proves its correctness and efficiency in theory.
Abstract: Approximate string matching over suffix tree with depth-first search (ASM_ST_DFS), a classical algorithm in the field of approximate string matching, was originally proposed by Ricardo A. Baeza-Yates and Gaston H. Gonnet in 1990. The algorithm is among the best algorithms for approximate string matching if combined with other indexing techniques. However, its time complexity is sensitive to the length of the pattern string because it searches $m+k$ characters on each path from the root before backtracking. In this paper, we propose an efficient pruning strategy to solve this problem. We prove its correctness and efficiency in theory. In particular, we prove that if the pruning strategy is adopted, the algorithm searches on average O(k) characters on each path before backtracking instead of O(m). Considering that each internal node of a suffix tree has multiple branches, the pruning strategy should work very well. We also show experimentally that when k is much smaller than m, the efficiency improves hundreds of times, and when k is not much smaller than m, it is still several times faster. This is the first paper that tries to solve the backtracking problem of ASM_ST_DFS in both theory and practice.

Patent
12 Oct 2016
TL;DR: A character string fuzzy matching method is proposed, which obtains the number of matched characters between a source text and each target text, calculates the source matching degree of each target text from the number of matched characters and the number of characters of the source text, and obtains a first preset threshold value for the source text according to its number of fields.
Abstract: The present invention discloses a character string fuzzy matching method. The character string fuzzy matching method comprises the following steps: obtaining the number of matched characters of a source text and each target text; calculating the source matching degree of each target text according to the number of matched characters and the number of characters of the source text; obtaining a first preset threshold value corresponding to the source text according to the number of fields of the source text; and obtaining the target texts whose source matching degree is greater than or equal to the first preset threshold value, and using the obtained target texts as the matched target texts. The present invention also discloses a character string fuzzy matching apparatus. The problem that the precision of matched target character strings searched by using an exact search method is low is solved, and the recognition rate of the character strings is increased.