Showing papers on "Approximate string matching" published in 2016

Open Access

Journal Article•DOI•

Stacked ensemble combined with fuzzy matching for biomedical named entity recognition of diseases.

Balu Bhasuran¹, Gurusamy Murugesan¹, Sabenabanu Abdulkadhar¹, Jeyakumar Natarajan¹•Institutions (1)

01 Dec 2016-Journal of Biomedical Informatics

TL;DR: A stacked ensemble approach is implemented for biomedical named entity recognition of disease names, combined with fuzzy string matching to tag rare disease names from the authors' in-house disease dictionary.

46 citations

Proceedings Article•

The String Matching Algorithms Research Tool.

Simone Faro¹, Thierry Lecroq², Stefano Borzi, Simone Di Mauro, Alessandro Maggio•Institutions (2)

University of Catania¹, University of Rouen²

01 Jan 2016

TL;DR: This paper presents Smart, an efficient and flexible tool designed for developing, testing, comparing and evaluating string matching algorithms; it also provides the most comprehensive survey of online exact single string matching algorithms together with a set of corpora available for testing purposes.

Abstract: String matching is the problem of finding all occurrences of a given pattern in a given text. It is an extensively studied problem in computer science because of its direct application to several areas such as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Since 1970 more than 85 string matching algorithms have been proposed, and more than 50% of them in the last ten years. In this paper we present Smart, an efficient and flexible tool designed for developing, testing, comparing and evaluating string matching algorithms. It also provides the most comprehensive survey of online exact single string matching algorithms together with a set of corpora available for testing purposes.

28 citations

Proceedings Article•DOI•

Entity resolution acceleration using the automata processor

Chunkun Bo¹, Ke Wang¹, Jeffrey J. Fox¹, Kevin Skadron¹•Institutions (1)

University of Virginia¹

01 Dec 2016

TL;DR: This work proposes an AP-accelerated ER solution, which accelerates the performance bottleneck of fuzzy matching for similar but potentially inexactly-matched names, and compares the proposed method with several conventional methods, achieving both promising speedups and better accuracy.

Abstract: Entity Resolution (ER), the process of finding identical entities across different databases, is critical to many information-integration applications. As sizes of databases explode in the big-data era, it becomes computationally expensive to recognize identical entities among all records with variations allowed across multiple databases. Profiling results show that approximate matching is the primary bottleneck. The Automata Processor (AP), an efficient and scalable semiconductor architecture for parallel automata processing, provides a new opportunity for hardware acceleration for ER. We propose an AP-accelerated ER solution, which accelerates the performance bottleneck of fuzzy matching for similar but potentially inexactly-matched names, and use several different real-world applications to illustrate its effectiveness. We compared the proposed method with several conventional methods and achieved both promising speedups and better accuracy (more correct pairs and less generalized merge distance cost) for different datasets.

28 citations

Journal Article•DOI•

ALFRED: A Practical Method for Alignment-Free Distance Computation

Sharma V. Thankachan¹, Sriram P. Chockalingam², Yongchao Liu¹, Alberto Apostolico¹, Srinivas Aluru¹•Institutions (2)

Georgia Institute of Technology¹, Indian Institute of Technology Bombay²

07 Jun 2016-Journal of Computational Biology

TL;DR: ALFRED is presented, an alignment-free distance computation method that solves the generalized common substring search problem via exact computation and makes it possible to exactly reconstruct the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes at reasonably acceptable speed.

Abstract: Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged to be one of the well-known alignment-free approaches. Two recent works further generalize this ACS approach by allowing a bounded number k of mismatches in the common substrings, relying on approximation (linear time) and exact computation, respectively. Albeit having a good worst-case time complexity [Formula: see text], the exact approach is complex and unlikely to be efficient in practice. Herein, we present ALFRED, an alignment-free distance computation method, which solves the generalized common substring search problem via exact computation. Compared to the theoretical approach, our algorithm is easier to implement and more practical to use, while still providing highly competitive theoretical performances with an expected run-time of [Formula: see text]. By applying our program to phylogenetic inference as a case study, we find that our program facilitates to exactly reconstruct the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed. ALFRED is implemented in C++ programming language and the source code is freely available online.

27 citations

Journal Article•DOI•

Fuzzy pattern recognition-based approach to biometric score fusion problem

Khalid Fakhar¹, Mohamed El Aroussi¹, Mohamed Nabil Saidi¹, Driss Aboutajdine¹•Institutions (1)

Mohammed V University¹

15 Dec 2016-Fuzzy Sets and Systems

TL;DR: This paper introduces a novel approach to the biometric score fusion problem, treating it as a fuzzy pattern recognition problem; the approach significantly improves on the performance of the single best biometric matcher and reaches results comparable to several relevant methods.

22 citations

Journal Article•DOI•

An accurate toponym-matching measure based on approximate string matching

Deniz Kilinç¹•Institutions (1)

Celal Bayar University¹

01 Apr 2016-Journal of Information Science

TL;DR: This paper attempts to create a single string-matching measure that can perform toponym matching regardless of the language, and proposes an ASM measure called DAS, which comprises name similarity, word similarity and sentence similarity phases.

Abstract: Approximate string matching (ASM) is a challenging problem, which aims to match different string expressions representing the same object. In this paper, detailed experimental studies were conducted on the subject of toponym matching, which is a new domain where ASM can be performed, and the creation of a single string-matching measure that can perform the toponym matching process regardless of the language was attempted. For this purpose, an ASM measure called DAS, which comprises name similarity, word similarity and sentence similarity phases, was created. Considering the experimental results, the retrieval performance and system accuracy of DAS were much better than those of five other well-known measures that were compared on toponym test datasets. In addition, DAS had the best metric values of mean average precision in six languages, and precision/recall graphs confirm this result.

22 citations

Journal Article•DOI•

Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach

Anni Järvelin¹, Heikki Keskustalo¹, Eero Sormunen¹, Miamaria Saastamoinen¹, Kimmo Kettunen•Institutions (1)

University UCINF¹

01 Dec 2016

TL;DR: Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.

Abstract: The aim of the study was to test whether query expansion by approximate string matching methods is beneficial in retrieval from historical newspaper collections in a language rich with compounds and inflectional forms (Finnish). First, approximate string matching methods were used to generate lists of index words most similar to contemporary query terms in a digitized newspaper collection from the 1800s. Top index word variants were categorized to estimate the appropriate query expansion ranges in the retrieval test. Second, the effectiveness of approximate string matching methods, automatically generated inflectional forms, and their combinations were measured in a Cranfield-style test. Finally, a detailed topic-level analysis of test results was conducted. In the index of the historical newspaper collection, the occurrences of a word typically spread to many linguistic and historical variants along with optical character recognition (OCR) errors. All query expansion methods improved the baseline results. Extensive expansion of around 30 variants for each query word was required to achieve the highest performance improvement. Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.
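
As a rough illustration of query expansion by approximate string matching, the sketch below picks the index words most similar to a query term. It is not the method evaluated in the study: it assumes Python's standard `difflib.get_close_matches`, whose similarity ratio is based on matching blocks rather than on edit distance, and the function name `expand_query`, the cutoff value and the toy index words (including the made-up OCR-style variants) are hypothetical.

```python
from difflib import get_close_matches


def expand_query(term, index_words, max_variants=30, cutoff=0.75):
    """Return up to max_variants index words that resemble term.

    The default of 30 variants mirrors the expansion range mentioned in the
    abstract; the cutoff is an arbitrary illustrative threshold.
    """
    return get_close_matches(term, index_words, n=max_variants, cutoff=cutoff)


# Toy index: an inflected form, two invented OCR-style misspellings, and an
# unrelated word.
index_words = ["sanomalehti", "sanomalehden", "sanomalehtj", "sanomalehtl", "kirja"]
variants = expand_query("sanomalehti", index_words)
print(variants)
```

The returned list is ordered from most to least similar, so an exact match in the index comes first and the unrelated word is filtered out by the cutoff.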

22 citations

Journal Article•DOI•

Approximate string matching using a bidirectional index

Gregory Kucherov¹, Kamil Salikhov², Dekel Tsur³•Institutions (3)

University of Paris¹, Moscow State University², Ben-Gurion University of the Negev³

25 Jul 2016-Theoretical Computer Science

TL;DR: A formalism, called search schemes, is introduced to specify search strategies of this type, a probabilistic measure for the efficiency of a search scheme is developed, several combinatorial results on efficient search schemes are proved, and experimental computations supporting the superiority of these strategies are provided.

19 citations

Journal Article•DOI•

Approximate pattern matching with gap constraints

Youxi Wu¹, Zhiqiang Tang¹, He Jiang², Xindong Wu³,⁴•Institutions (4)

Hebei University of Technology¹, Dalian University of Technology², Hefei University of Technology³, University of Vermont⁴

01 Oct 2016-Journal of Information Science

TL;DR: This study introduces an approximate pattern matching problem with Hamming distance and proposes an efficient algorithm named Single-rOot Nettree for approximate pattern matchinG with gap constraints (SONG), based on a new non-linear data structure, the Single-root Nettree, to effectively solve the problem.

Abstract: Pattern matching is a key issue in sequential pattern mining. Many researchers now focus on pattern matching with gap constraints. However, most of these studies involve exact pattern matching problems, a special case of approximate pattern matching, which is a more challenging task. In this study, we introduce an approximate pattern matching problem with Hamming distance. Its objective is to compute the number of approximate occurrences of pattern P with gap constraints in sequence S under similarity constraint d. We propose an efficient algorithm named Single-rOot Nettree for approximate pattern matchinG with gap constraints (SONG), based on a new non-linear data structure, the Single-root Nettree, to effectively solve the problem. Theoretical analysis and experiments demonstrate an interesting law that the ratio M(P,S,d)/N(P,S,m) approximately follows a binomial distribution, where M(P,S,d) and N(P,S,m) are the numbers of the approximate occurrences whose distances to pattern P are d (0 ≤ d ≤ m) and no more than m (the length of pattern P), respectively. Experimental results for real biological data validate the efficiency and effectiveness of SONG.

19 citations

Journal Article•DOI•

Harry: a tool for measuring string similarity

Konrad Rieck¹, Christian Wressnegger¹•Institutions (1)

University of Göttingen¹

01 Jan 2016-Journal of Machine Learning Research

TL;DR: Harry is a small tool specifically designed for measuring the similarity of strings and implements over 20 similarity measures, including common string distances and string kernels, such as the Levenshtein distance and the Subsequence kernel.

Abstract: Comparing strings and assessing their similarity is a basic operation in many application domains of machine learning, such as in information retrieval, natural language processing and bioinformatics. The practitioner can choose from a large variety of available similarity measures for this task, each emphasizing different aspects of the string data. In this article, we present Harry, a small tool specifically designed for measuring the similarity of strings. Harry implements over 20 similarity measures, including common string distances and string kernels, such as the Levenshtein distance and the Subsequence kernel. The tool has been designed with efficiency in mind and allows for multi-threaded as well as distributed computing, enabling the analysis of large data sets of strings. Harry supports common data formats and thus can interface with analysis environments, such as Matlab, Pylab and Weka.
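
For readers unfamiliar with the Levenshtein distance named above, the following is a minimal sketch of the textbook dynamic-programming computation. It is not Harry's implementation or interface; the function name `levenshtein` is an illustrative assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use is proportional to the shorter string while the running time remains proportional to the product of the two lengths.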

14 citations

Journal Article•DOI•

Linear-time computation of prefix table for weighted strings & applications

Carl Barton¹, Chang Liu², Solon P. Pissis²•Institutions (2)

Queen Mary University of London¹, King's College London²

20 Dec 2016-Theoretical Computer Science

TL;DR: This article presents an O(n)-time algorithm for computing the prefix table of x, and outlines a number of applications of this result for solving various problems on non-standard strings, and presents some preliminary experimental results.

Book Chapter•DOI•

Using FPGAs to Accelerate Myers Bit-Vector Algorithm

Jörn Hoffmann¹, Dirk Zeckzer¹, Martin Bogdan¹•Institutions (1)

Leipzig University¹

01 Jan 2016

TL;DR: A proof-of-concept implementation of Myers bit-vector algorithm for approximate string matching in hardware is presented, accelerated by using the massive parallel computing capabilities of a field programmable gate array (FPGA).

Abstract: We present a proof-of-concept implementation of Myers bit-vector algorithm for approximate string matching in hardware. In terms of bit-vector operations, the algorithm is accelerated by using the massive parallel computing capabilities of a field programmable gate array (FPGA). The system is realized on an embedded platform with a high computational and energy efficiency. Compared to the fastest software implementation running on the embedded processor, the hardware achieves an overall speed-up of approximately 2 and a speed-up of approximately 8 considering the computation only.
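
To make the bit-vector operations mentioned above concrete, here is a software sketch of Myers' bit-parallel algorithm in its commonly cited search formulation. It is not the FPGA design from the chapter; the function name `myers_search` is an assumption, and Python's unbounded integers stand in for a machine word, so results are masked to the pattern length explicitly.

```python
def myers_search(pattern: str, text: str, k: int):
    """Report (end_index, distance) for every text position where some
    substring ending there is within edit distance k of pattern."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1   # keeps values within m bits
    top = 1 << (m - 1)    # bit of the last pattern position
    peq = {}              # per-character bitmask of pattern positions
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    pv, mv, score = mask, 0, m  # vertical +1 / -1 deltas and bottom-row score
    hits = []
    for j, ch in enumerate(text):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask  # horizontal +1 deltas
        mh = pv & xh                   # horizontal -1 deltas
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            hits.append((j, score))
    return hits


print(myers_search("ab", "ab", 0))  # [(1, 0)]
```

Each text character is processed with a constant number of word-wide logical and arithmetic operations, which is the property that makes the algorithm a natural candidate for hardware parallelism.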

Proceedings Article•DOI•

Using GPUs to speed-up Levenshtein edit distance computation

Khaled Balhaf¹, Mohammed A. Shehab¹, Walaa Al-Sarayrah¹, Mahmoud Al-Ayyoub¹, Mohammed I. Al-Saleh¹, Yaser Jararweh¹•Institutions (1)

Jordan University of Science and Technology¹

05 Apr 2016

TL;DR: This paper uses the parallelism capabilities of the Graphics Processing Unit (GPU) to accelerate one of the most common algorithms to compute the edit distance between two strings, which is known as the Levenshtein distance, and employs a diagonal-based tracing technique which results in even greater improvements in terms of the running time.

Abstract: Sequence comparison problems such as sequence alignment and approximate string matching are part of the fundamental problems in many fields such as natural language processing, data mining and bioinformatics. However, the algorithms proposed to address these problems suffer from high computational complexities prohibiting them from being widely used in practical large-scale settings. Many researchers used parallel programming to reduce the execution time of these algorithms. In this paper, we follow this approach and use the parallelism capabilities of the Graphics Processing Unit (GPU) to accelerate one of the most common algorithms to compute the edit distance between two strings, which is known as the Levenshtein distance. To take full advantage of the large number of cores in a GPU, we employ a diagonal-based tracing technique which results in even greater improvements in terms of the running time. In fact, our CUDA implementation of the Levenshtein algorithm is about 11X faster than the sequential implementation. This is achieved without affecting the accuracy.
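
The idea behind a diagonal-based traversal can be sketched in plain Python: every cell on an anti-diagonal of the Levenshtein table depends only on the two preceding anti-diagonals, so the cells of one anti-diagonal are mutually independent. The sketch below is sequential and is not the authors' CUDA implementation; the function name `levenshtein_antidiagonal` is an assumption, and the inner loop marks the work that a parallel version could distribute across threads.

```python
def levenshtein_antidiagonal(a: str, b: str) -> int:
    """Levenshtein distance computed anti-diagonal by anti-diagonal."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][0] = i  # delete all of a[:i]
    for j in range(m + 1):
        table[0][j] = j  # insert all of b[:j]

    # Anti-diagonal d holds the interior cells (i, j) with i + j == d.
    for d in range(2, n + m + 1):
        lo = max(1, d - m)
        hi = min(n, d - 1)
        for i in range(lo, hi + 1):  # these cells do not depend on each other
            j = d - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,
                              table[i][j - 1] + 1,
                              table[i - 1][j - 1] + cost)
    return table[n][m]


print(levenshtein_antidiagonal("kitten", "sitting"))  # 3
```

The result is identical to the row-by-row computation; only the order in which cells are filled changes.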

Journal Article•DOI•

On the feasibility of character n-grams pseudo-translation for Cross-Language Information Retrieval tasks

Jesús Vilares, Manuel Vilares¹, Miguel A. Alonso, Michael Oakes²•Institutions (2)

University of Vigo¹, University of Wolverhampton²

01 Mar 2016-Computer Speech & Language

TL;DR: The results obtained not only confirm the consistency across languages of this kind of character n-gram based approaches, but also constitute a further proof of their validity and applicability, these not being tied to a given implementation.

Journal Article•DOI•

siEDM: An Efficient String Index and Search Algorithm for Edit Distance with Moves

Yoshimasa Takabatake¹, Kenta Nakashima¹, Tetsuji Kuboyama², Yasuo Tabei, Hiroshi Sakamoto¹•Institutions (2)

Kyushu Institute of Technology¹, Gakushuin University²

15 Apr 2016-Algorithms

TL;DR: The siEDM algorithm builds an index structure by leveraging the idea behind the edit sensitive parsing (ESP), an efficient algorithm enabling approximately computing EDM with guarantees of upper and lower bounds for the exact EDM.

Abstract: Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm with editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has a wide range of potential applications, especially in approximate string retrieval. Despite the importance of computing EDM, there has been no efficient method for indexing and searching large text collections based on the EDM measure. We propose the first algorithm, named string index for edit distance with moves (siEDM), for indexing and searching strings with EDM. The siEDM algorithm builds an index structure by leveraging the idea behind the edit sensitive parsing (ESP), an efficient algorithm enabling approximately computing EDM with guarantees of upper and lower bounds for the exact EDM. siEDM efficiently prunes the space for searching query strings by the proposed method, which enables fast query searches with the same guarantee as ESP. We experimentally tested the ability of siEDM to index and search strings on benchmark datasets, and we showed siEDM’s efficiency.

Proceedings Article•DOI•

A 7/2-approximation algorithm for the Maximum Duo-Preservation String Mapping Problem

Nicolas Boria¹, Gianpiero Cabodi², Paolo Camurati², Marco Palena², Paolo Pasini², Stefano Quer²•Institutions (2)

Dalle Molle Institute for Artificial Intelligence Research¹, Polytechnic University of Turin²

01 Jan 2016

TL;DR: This paper presents a simple 7/2-approximation algorithm for the Maximum Duo-Preservation String Mapping (MPSM) problem, which improves on the previously best-known 4-approximation algorithm by computing a simple local optimum.

Abstract: This paper presents a simple 7/2-approximation algorithm for the Maximum Duo-Preservation String Mapping (MPSM) problem. This problem is complementary to the classical and well studied min common string partition problem (MCSP), that computes the minimal edit distance between two strings when the only operation allowed is to shift blocks of characters. The algorithm improves on the previously best-known 4-approximation algorithm by computing a simple local optimum.

Journal Article•DOI•

Combined string searching algorithm based on Knuth-Morris-Pratt and Boyer-Moore algorithms

R Yu Tsarev¹, A S Chernigovskiy¹, E A Tsareva², V V Brezitskaya², A Yu Nikiforov¹, N A Smirnov²•Institutions (2)

Siberian Federal University¹, Siberian State Aerospace University²

01 Apr 2016

TL;DR: A combined algorithm is offered, which has been developed on the basis of well-known Knuth-Morris-Pratt and Boyer-Moore string searching algorithms, and allows acquiring the larger shift in case of pattern and string characters' mismatch.

Abstract: The string searching task can be classified as a classic information processing task. Users either encounter the solution of this task while working with text processors or browsers, employing standard built-in tools, or this task is solved unseen by the users, while they are working with various computer programmes. Nowadays there are many algorithms for solving the string searching problem. The main criterion of these algorithms' effectiveness is searching speed. The larger the shift of the pattern relative to the string in case of a mismatch between pattern and string characters, the higher the algorithm's running speed. This article offers a combined algorithm, which has been developed on the basis of the well-known Knuth-Morris-Pratt and Boyer-Moore string searching algorithms. These algorithms are based on two different basic principles of pattern matching. The Knuth-Morris-Pratt algorithm is based upon forward pattern matching and Boyer-Moore is based upon backward pattern matching. Having united these two algorithms, the combined algorithm allows acquiring the larger shift in case of a mismatch between pattern and string characters. The article provides an example, which illustrates the results of the Boyer-Moore and Knuth-Morris-Pratt algorithms and the combined algorithm's work and shows the advantage of the latter in solving the string searching problem.
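
The abstract does not spell out how the two shifts are combined, so the sketch below covers only the forward-matching half: a standard Knuth-Morris-Pratt search in which the failure table determines how far the pattern can shift after a mismatch. The function names `failure_table` and `kmp_search` are illustrative assumptions, and this is not the article's combined algorithm.

```python
def failure_table(pattern: str) -> list:
    """fail[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text: str, pattern: str) -> list:
    """Return the start indices of all occurrences of pattern in text."""
    if not pattern:
        return list(range(len(text) + 1))
    fail = failure_table(pattern)
    hits, k = [], 0  # k is the number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]  # shift the pattern, keeping the longest border
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # continue, allowing overlapping occurrences
    return hits


print(kmp_search("abababa", "aba"))  # [0, 2, 4]
```

The text pointer never moves backwards, which is what bounds the running time by the combined length of text and pattern.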

Proceedings Article•DOI•

Memory-Efficient String Matching for Intrusion Detection Systems using a High-Precision Pattern Grouping Algorithm

Shervin Vakili¹, J. M. Pierre Langlois¹, Bochra Boughzala², Yvon Savaria¹•Institutions (2)

École Polytechnique de Montréal¹, Ericsson²

17 Mar 2016

TL;DR: A novel pattern grouping algorithm for heterogeneous bit-split string matching architectures that achieves an average of 41% reduction in memory consumption compared to the best existing approach found in the literature, while offering orders of magnitude faster execution time compared to an exhaustive search.

Abstract: The increasing complexity of cyber-attacks necessitates the design of more efficient hardware architectures for real-time Intrusion Detection Systems (IDSs). String matching is the main performance-demanding component of an IDS. An effective technique to design high-performance string matching engines is to partition the target set of strings into multiple subgroups and to use a parallel string matching hardware unit for each subgroup. This paper introduces a novel pattern grouping algorithm for heterogeneous bit-split string matching architectures. The proposed algorithm presents a reliable method to estimate the correlation between strings. The correlation factors are then used to find a preferred group for each string in a seed growing approach. Experimental results demonstrate that the proposed algorithm achieves an average of 41% reduction in memory consumption compared to the best existing approach found in the literature, while offering orders of magnitude faster execution time compared to an exhaustive search.

Journal Article•DOI•

Approximate search of short patterns with high error rates using the 010 lossless seeds

Christophe Vroland¹, Mikaël Salson¹, Sébastien Bini¹, Hélène Touzet¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

01 Mar 2016-Journal of Discrete Algorithms

TL;DR: This work introduces a new type of seeds: the 010 seeds, made of two exact parts separated by parts with exactly one error, and shows that those seeds are lossless, and applies them to two filtration algorithms for two popular applications.

Journal Article•DOI•

libFLASM: a software library for fixed-length approximate string matching

Lorraine A.K. Ayad¹, Solon P. Pissis¹, Ahmad Retha¹•Institutions (1)

King's College London¹

10 Nov 2016-BMC Bioinformatics

TL;DR: libFLASM can be used to improve the accuracy of multiple circular sequence alignment in terms of the inferred likelihood-based phylogenies, and to efficiently find motifs in molecular sequences representing regulatory or functional regions.

Abstract: Approximate string matching is the problem of finding all factors of a given text that are at a distance at most k from a given pattern. Fixed-length approximate string matching is the problem of finding all factors of a text of length n that are at a distance at most k from any factor of length $\ell$ of a pattern of length m. There exist bit-vector techniques to solve the fixed-length approximate string matching problem in time $\mathcal{O}(m\lceil \ell/w \rceil n)$ and space $\mathcal{O}(m\lceil \ell/w \rceil)$ under the edit and Hamming distance models, where w is the size of the computer word; as such these techniques are independent of the distance threshold k or the alphabet size. Fixed-length approximate string matching is a generalisation of approximate string matching and, hence, has numerous direct applications in computational molecular biology and elsewhere. We present and make available libFLASM, a free open-source C++ software library for solving fixed-length approximate string matching under both the edit and the Hamming distance models. Moreover we describe how fixed-length approximate string matching is applied to solve real problems by incorporating libFLASM into established applications for multiple circular sequence alignment as well as single and structured motif extraction. Specifically, we describe how it can be used to improve the accuracy of multiple circular sequence alignment in terms of the inferred likelihood-based phylogenies; and we also describe how it is used to efficiently find motifs in molecular sequences representing regulatory or functional regions. The comparison of the performance of the library to other algorithms shows how it is competitive, especially with increasing distance thresholds. Fixed-length approximate string matching is a generalisation of the classic approximate string matching problem. We present libFLASM, a free open-source C++ software library for solving fixed-length approximate string matching. The extensive experimental results presented here suggest that other applications could benefit from using libFLASM, and thus further maintenance and development of libFLASM is desirable.
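
The first definition in the abstract (all factors of the text within distance k of the pattern) can be illustrated under the Hamming model with a brute-force sketch. This is neither libFLASM's API nor its bit-vector technique, and it does not cover the fixed-length generalisation; the function names `hamming` and `k_mismatch_positions` are assumptions made for the example.

```python
def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(u, v))


def k_mismatch_positions(text: str, pattern: str, k: int) -> list:
    """Start indices of text factors within Hamming distance k of pattern.

    Brute force: every window of length len(pattern) is compared in full,
    so the running time is proportional to len(text) * len(pattern).
    """
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming(text[i:i + m], pattern) <= k]


print(k_mismatch_positions("GATTACA", "ATA", 1))  # [1, 2, 4]
```

Unlike the bit-vector techniques described above, the cost of this naive scan does not benefit from word-level parallelism, which is why it serves only as a reference for small inputs.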

Posted Content•

JAROWINKLER: Stata module to calculate the Jaro-Winkler distance between strings

James Feigenbaum

13 Oct 2016-Research Papers in Economics

TL;DR: The jarowinkler module calculates the distance between two string variables using the Jaro-Winkler distance metric, which is often used in record linkage to compare first or last names in different sources.

Abstract: jarowinkler calculates the distance between two string variables using the Jaro-Winkler distance metric. The distance metric is often used in record linkage to compare first or last names in different sources.
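
As a sketch of what the metric computes, the code below implements the usual textbook definition of Jaro similarity with Winkler's common-prefix adjustment. It is not the Stata module's code; the function names, the prefix weight of 0.1 and the four-character prefix cap are assumptions taken from the standard formulation, and whether the module reports the similarity or one minus it is not stated in the abstract.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_used = [False] * len(s)
    t_used = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        # Look for an unused equal character within the match window.
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_used[j] and t[j] == ch:
                s_used[i] = t_used[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, used in zip(s, s_used) if used]
    t_seq = [c for c, used in zip(t, t_used) if used]
    transpositions = sum(x != y for x, y in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    sim = jaro(s, t)
    prefix = 0
    for x, y in zip(s[:max_prefix], t[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix boost is the reason the metric suits name comparison in record linkage: typing and transcription errors tend to occur later in a name, so agreement on the first few characters is rewarded.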

Posted Content•

Exact Online String Matching Bibliography.

Simone Faro

17 May 2016-arXiv: Data Structures and Algorithms

TL;DR: A comprehensive bibliography for the online exact string matching problem is presented, containing a comprehensive list of (almost) all string matching algorithms proposed since 1970.

Abstract: In this short note we present a comprehensive bibliography for the online exact string matching problem. The problem consists in finding all occurrences of a given pattern in a text. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Since 1970 more than 120 string matching algorithms have been proposed. In this note we present a comprehensive list of (almost) all string matching algorithms. The list is updated to May 2016.

Proceedings Article•DOI•

Integer Linear Programming Approach to Median and Center Strings for a Probability Distribution on a Set of Strings

Morihiro Hayashida¹, Hitoshi Koyano¹•Institutions (1)

Kyoto University¹

21 Feb 2016

TL;DR: This paper proposes novel integer linear programming-based methods for finding median and center strings for a probability distribution on a set of strings under Levenshtein distance, and restricts several variables to a region near the diagonal in the formulation.

Abstract: We address problems of finding median and center strings for a probability distribution on a set of strings under Levenshtein distance, which are known to be NP-hard in a special case. There are many applications in various research fields, for instance, to find functional motifs in protein amino acid sequences, and to recognize shapes and characters in image processing. In this paper, we propose novel integer linear programming-based methods for finding median and center strings for a probability distribution on a set of strings under Levenshtein distance. Furthermore, we restrict several variables to a region near the diagonal in the formulation, and propose novel integer linear programming-based methods also for finding approximate median and center strings for a probability distribution on a set of strings. For evaluation of our proposed methods, we perform several computational experiments, and show that the restricted formulation reduced the execution time.

Journal Article•DOI•

A space-efficient alphabet-independent Four-Russians' lookup table and a multithreaded Four-Russians' edit distance algorithm

Youngho Kim¹, Joong Chae Na², Heejin Park³, Jeong Seop Sim¹•Institutions (3)

Inha University¹, Sejong University², Hanyang University³

20 Dec 2016-Theoretical Computer Science

TL;DR: An efficient alphabet-independent Four-Russians' lookup table that requires $\mathcal{O}(3^{2t}(2t)!\,t)$ space and can be constructed and used irrespective of the alphabet size is presented.

Proceedings Article•DOI•

A Memory-Access-Efficient Implementation of the Approximate String Matching Algorithm on GPU

Lucas S. N. Nunes¹, Jacir Luiz Bordim¹, Koji Nakano², Yasuaki Ito²•Institutions (2)

University of Brasília¹, Hiroshima University²

01 Nov 2016

TL;DR: The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU that relies on warp shuffle operations, which are used to reduce the communication overhead between threads.

Abstract: The task of finding strings having a partial match to a given pattern is of interest to a number of practical applications, including DNA sequencing and text searching. Owing to its importance, alternatives to accelerate the Approximate String Matching (ASM) have been widely investigated in the literature. The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU. The key idea of our implementation relies on warp shuffle operations, which are used to reduce the communication overhead between threads. Experimental results, carried out on a GeForce GTX 960 GPU, show that the proposed implementation provides acceleration between 1.31 and 1.84 times when compared to another noteworthy alternative.
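For orientation, the ASM problem described in this abstract can be sketched on a CPU with the classic column-wise dynamic programming formulation (often attributed to Sellers). This is not the GPU implementation from the paper, and it uses no warp shuffle operations. The function name `approx_find` and the error bound `k` are illustrative assumptions.

```python
def approx_find(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Return (end_index, distance) for each text position where some
    substring ending there is within edit distance k of the pattern."""
    m = len(pattern)
    # col[i] holds the smallest edit distance between pattern[:i] and any
    # text substring ending at the current text position.
    col = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text):
        prev_diag = col[0]  # value of row i-1 in the previous column
        col[0] = 0          # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in the text
                         col[i - 1] + 1)    # pattern character skipped
            prev_diag = old
        if col[m] <= k:
            hits.append((j, col[m]))
    return hits


print(approx_find("abc", "xxabcxx", 1))
```

Each cell depends on three neighbouring cells, which is the data dependency that a parallel implementation has to communicate between threads.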

Book Chapter•DOI•

Finding Median and Center Strings for a Probability Distribution on a Set of Strings Under Levenshtein Distance Based on Integer Linear Programming

Morihiro Hayashida¹, Hitoshi Koyano¹•Institutions (1)

Kyoto University¹

21 Feb 2016

TL;DR: This study generalizes the definitions of median and center strings of string data to those of a probability distribution on a set of all strings composed of letters in a given alphabet, which corresponds to generalizing the mean of numerical data to the expected value of a probability distribution on a set of numbers or numerical vectors.

Abstract: For a data set composed of numbers or numerical vectors, a mean is the most fundamental measure for capturing the center of the data. However, for a data set of strings, a mean of the data cannot be defined, and therefore, median and center strings are frequently used as a measure of the center of the data. In contrast to calculating a mean of numerical data, constructing median and center strings of string data is not easy, and no algorithm is found that is guaranteed to construct exact solutions of center strings. In this study, we first generalize the definitions of median and center strings of string data into those of a probability distribution on a set of all strings composed of letters in a given alphabet. This generalization corresponds to that of a mean of numerical data into an expected value of a probability distribution on a set of numbers or numerical vectors. Next, we develop methods for constructing exact solutions of median and center strings for a probability distribution on a set of strings, applying integer linear programming. These methods are improved into faster ones by using the triangle inequality on the Levenshtein distance in the case where a set of strings is a metric space with the Levenshtein distance. Furthermore, we also develop methods for constructing approximate solutions of median and center strings very rapidly if the probability of a subset composed of similar strings is close to one. Lastly, we perform simulation experiments to examine the usefulness of our proposed methods in practical applications.
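To make the two notions concrete, here is a brute-force sketch in Python. It assumes the usual definitions: a median string minimizes the expected Levenshtein distance under the distribution, and a center string minimizes the maximum distance to any string with non-zero probability. It does not use integer linear programming or the triangle inequality as the paper does, and it only searches candidates up to a chosen length, so it is practical only for tiny instances. The names `median_and_center`, `alphabet`, and `max_len` are illustrative.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Standard row-by-row Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def median_and_center(dist: dict, alphabet: str, max_len: int):
    """Search all strings over `alphabet` of length <= max_len.

    `dist` maps strings to probabilities (assumed to sum to one).
    """
    candidates = ["".join(t)
                  for n in range(max_len + 1)
                  for t in product(alphabet, repeat=n)]
    support = [s for s, p in dist.items() if p > 0]
    median = min(candidates,
                 key=lambda x: sum(p * levenshtein(x, s)
                                   for s, p in dist.items()))
    center = min(candidates,
                 key=lambda x: max(levenshtein(x, s) for s in support))
    return median, center


dist = {"aa": 0.6, "ab": 0.2, "ba": 0.2}
median, center = median_and_center(dist, "ab", 3)
print(median, center)
```

In this toy distribution the median is "aa", because every other candidate is at distance at least 1 from the string that carries probability 0.6. The center is not unique here, so the function returns the first minimizer in enumeration order.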

Journal Article•DOI•

Enhanced self-citation detection by fuzzy author name matching and complementary error estimates

Paul Donner

01 Mar 2016

TL;DR: A fuzzy string matching algorithm is applied for self‐citation detection and near full recall can be achieved with the proposed method while incurring only negligible precision loss.

Abstract: In this article I investigate the shortcomings of exact string match-based author self-citation detection methods. The contributions of this study are twofold. First, I apply a fuzzy string matching algorithm for self-citation detection and benchmark this approach and other common methods of exclusively author name-based self-citation detection against a manually curated ground truth sample. Near full recall can be achieved with the proposed method while incurring only negligible precision loss. Second, I report some important observations from the results about the extent of latent self-citations and their characteristics and give an example of the effect of improved self-citation detection on the document level self-citation rate of real data.
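The abstract does not name the fuzzy matching algorithm used, so the following Python sketch substitutes the standard library's `difflib.SequenceMatcher` purely as a stand-in. It shows how an exact comparison misses a spelling variant that a similarity threshold catches. The helper names and the threshold of 0.85 are assumptions, not values from the article.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized author names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_self_citation(citing_authors, cited_authors, threshold=0.85) -> bool:
    """Flag a citation if any author pair is similar enough.

    The threshold is a hypothetical value chosen for illustration.
    """
    return any(name_similarity(a, b) >= threshold
               for a in citing_authors
               for b in cited_authors)


citing = ["Müller, Hans"]
cited = ["Muller, Hans", "Garcia, Maria"]
print(set(citing) & set(cited))         # exact matching finds no overlap
print(is_self_citation(citing, cited))  # fuzzy matching flags the pair
```

A lower threshold raises recall at the cost of precision, which is the trade-off the article benchmarks against a manually curated sample.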

Journal Article•DOI•

An efficient pruning strategy for approximate string matching over suffix tree

Huan Hu¹, Hongzhi Wang¹, Jianzhong Li¹, Hong Gao¹•Institutions (1)

Harbin Institute of Technology¹

01 Oct 2016-Knowledge and Information Systems

TL;DR: This is the first paper that tries to solve the backtracking problem of ASM_ST_DFS in both theory and practice and proves its correctness and efficiency in theory.

Abstract: Approximate string matching over suffix tree with depth-first search (ASM_ST_DFS), a classical algorithm in the field of approximate string matching, was originally proposed by Ricardo A. Baeza-Yates and Gaston H. Gonnet in 1990. The algorithm is among the most effective for approximate string matching if combined with other indexing techniques. However, its time complexity is sensitive to the length of the pattern string because it searches $m+k$ characters on each path from the root before backtracking. In this paper, we propose an efficient pruning strategy to solve this problem. We prove its correctness and efficiency in theory. In particular, we prove that if the pruning strategy is adopted, the algorithm searches O(k) characters on average on each path before backtracking instead of O(m). Considering that each internal node of a suffix tree has multiple branches, the pruning strategy should work very well. We also show experimentally that when k is much smaller than m, the efficiency improves hundreds of times, and when k is not much smaller than m, it is still several times faster. This is the first paper that tries to solve the backtracking problem of ASM_ST_DFS in both theory and practice.
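The baseline idea can be sketched in Python as a depth-first search that carries one edit-distance column down each path. Several simplifications are assumed: an uncompressed suffix trie (quadratic in the text length) stands in for a suffix tree, and the cutoff is the standard test "every entry of the column exceeds k" plus the depth bound m+k. This is not the pruning strategy proposed in the paper. All names are illustrative.

```python
def build_suffix_trie(text: str) -> dict:
    """Uncompressed suffix trie; each node records the start positions
    of the suffixes that pass through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root


def asm_suffix_trie_dfs(text: str, pattern: str, k: int) -> set:
    """Start positions where some text substring beginning there is
    within edit distance k of the pattern."""
    m = len(pattern)
    found = set()

    def dfs(node, col, depth):
        # col[i] = edit distance between pattern[:i] and the path label.
        if col[m] <= k:
            found.update(node["starts"])
        if depth == m + k or min(col) > k:
            return  # backtrack: depth bound reached or no entry can recover
        for ch, child in node["children"].items():
            new = [col[0] + 1]
            for i in range(1, m + 1):
                new.append(min(col[i - 1] + (pattern[i - 1] != ch),
                               col[i] + 1,
                               new[i - 1] + 1))
            dfs(child, new, depth + 1)

    dfs(build_suffix_trie(text), list(range(m + 1)), 0)
    return found


print(sorted(asm_suffix_trie_dfs("xxabcxx", "abc", 1)))
```

The depth bound m+k in `dfs` is the quantity the abstract identifies as the source of sensitivity to pattern length.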

Patent•

Character string fuzzy matching method and apparatus

Zeng Hong

12 Oct 2016

TL;DR: A character string fuzzy matching method is proposed, which consists of obtaining the number of matched characters of a source text and each target text, calculating the source matching degree of each target text according to the number of matched characters and the number of characters of the source text, and obtaining a first preset threshold value corresponding to the source text according to its number of fields.

Abstract: The present invention discloses a character string fuzzy matching method. The character string fuzzy matching method comprises the following steps of obtaining the number of matched characters of a source text and each target text; calculating source matching degree of each target text according to the number of the matched characters and the number of characters of the source text; obtaining a first preset threshold value corresponding to the source text according to the number of fields of the source text; and obtaining the target text having the source matching degree of each target text greater than or equal to the first preset threshold value, and using the obtained target text as the matched target text. The present invention also discloses a character string fuzzy matching apparatus. The problem that precision of matched target character strings searched by using an exact search method is low is solved, and the recognition rate of the character strings is increased.
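The steps listed in the abstract can be sketched in Python under several assumptions that the abstract leaves open: matched characters are counted as the multiset overlap of the two texts, fields are whitespace-separated tokens, and the threshold table is invented for illustration. None of these details or values come from the patent.

```python
from collections import Counter

# Hypothetical thresholds keyed by the number of fields in the source text.
THRESHOLDS = {1: 0.9, 2: 0.8}
DEFAULT_THRESHOLD = 0.7


def matched_chars(source: str, target: str) -> int:
    """Count characters shared by both texts (multiset intersection)."""
    return sum((Counter(source) & Counter(target)).values())


def fuzzy_match(source: str, targets: list) -> list:
    """Return targets whose source matching degree meets the threshold."""
    if not source:
        return []
    # Threshold depends on the number of fields in the source text.
    threshold = THRESHOLDS.get(len(source.split()), DEFAULT_THRESHOLD)
    # Source matching degree = matched characters / source length.
    return [t for t in targets
            if matched_chars(source, t) / len(source) >= threshold]


print(fuzzy_match("new york", ["new yrok", "boston"]))
```

Counting characters without regard to order makes the measure tolerant of transpositions, although it also accepts anagrams.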

Bit-parallel approximate string matching under Hamming distance

Tommi Hirvola

24 Aug 2016