
Showing papers on "Approximate string matching" published in 2015


Proceedings ArticleDOI
14 Jun 2015
TL;DR: This paper shows that, if the edit distance can be computed in time O(n^{2-δ}) for some constant δ>0, then the satisfiability of conjunctive normal form formulas with N variables and M clauses can be solved in time M^{O(1)} 2^{(1-ε)N} for a constant ε>0.
Abstract: The edit distance (a.k.a. the Levenshtein distance) between two strings is defined as the minimum number of insertions, deletions or substitutions of symbols needed to transform one string into another. The problem of computing the edit distance between two strings is a classical computational task, with a well-known algorithm based on dynamic programming. Unfortunately, all known algorithms for this problem run in nearly quadratic time. In this paper we provide evidence that the near-quadratic running time bounds known for the problem of computing edit distance might be tight. Specifically, we show that, if the edit distance can be computed in time O(n^{2-δ}) for some constant δ>0, then the satisfiability of conjunctive normal form formulas with N variables and M clauses can be solved in time M^{O(1)} 2^{(1-ε)N} for a constant ε>0. The latter result would violate the Strong Exponential Time Hypothesis, which postulates that such algorithms do not exist.

264 citations
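For reference, the dynamic program the abstract calls classical fits in a few lines; the paper's result says that, under SETH, nothing beats its quadratic behaviour by a polynomial factor. A minimal sketch:

```python
# Classic O(m*n) edit-distance dynamic program, kept to two rows of the table.

def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))            # distances from the empty prefix of a
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a[i-1]
                          curr[j - 1] + 1,     # insert b[j-1]
                          prev[j - 1] + cost)  # substitute (or match)
        prev = curr
    return prev[n]

assert edit_distance("kitten", "sitting") == 3
```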


Proceedings ArticleDOI
17 Oct 2015
TL;DR: In this article, it was shown that these measures do not have strongly subquadratic time algorithms, i.e., no algorithms with running time O(n^{2-ε}) for any ε > 0, unless the Strong Exponential Time Hypothesis fails.
Abstract: Classic similarity measures of strings are longest common subsequence and Levenshtein distance (i.e., the classic edit distance). A classic similarity measure of curves is dynamic time warping. These measures can be computed by simple O(n^2) dynamic programming algorithms, and despite much effort no algorithms with significantly better running time are known. We prove that, even restricted to binary strings or one-dimensional curves, respectively, these measures do not have strongly subquadratic time algorithms, i.e., no algorithms with running time O(n^{2-ε}) for any ε > 0, unless the Strong Exponential Time Hypothesis fails. We generalize the result to edit distance for arbitrary fixed costs of the four operations (deletion in one of the two strings, matching, substitution), by identifying trivial cases that can be solved in constant time, and proving quadratic-time hardness on binary strings for all other cost choices. This improves and generalizes the known hardness result for Levenshtein distance [Backurs, Indyk STOC'15] by the restriction to binary strings and the generalization to arbitrary costs, and adds important problems to a recent line of research showing conditional lower bounds for a growing number of quadratic time problems. As our main technical contribution, we introduce a framework for proving quadratic-time hardness of similarity measures. To apply the framework it suffices to construct a single gadget, which encapsulates all the expressive power necessary to emulate a reduction from satisfiability. Finally, we prove quadratic-time hardness for longest palindromic subsequence and longest tandem subsequence via reductions from longest common subsequence, showing that conditional lower bounds based on the Strong Exponential Time Hypothesis also apply to string problems that are not necessarily similarity measures.

195 citations
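The quadratic dynamic programs covered by this hardness result all share the same shape; here is the one for longest common subsequence, and the lower bound applies even when both inputs are binary:

```python
# Classic O(m*n) dynamic program for longest common subsequence length.

def lcs_length(a: str, b: str) -> int:
    m, n = len(a), len(b)
    prev = [0] * (n + 1)
    for i in range(1, m + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1      # extend a common subsequence
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[n]

assert lcs_length("ABCBDAB", "BDCABA") == 4    # e.g. "BCBA"
```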


Journal ArticleDOI
TL;DR: A novel grammar representation that allows efficient random access to any character or substring without decompressing the string is presented.
Abstract: Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures (sometimes with slight reduction in efficiency) many of the popular compression schemes, including the Lempel--Ziv family, run-length encoding, byte-pair encoding, Sequitur, and Re-Pair. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string. Let $S$ be a string of length $N$ compressed into a context-free grammar $\mathcal{S}$ of size $n$. We present two representations of $\mathcal{S}$ achieving $O(\log N)$ random access time, and either $O(n\cdot\alpha_k(n))$ construction time and space on the pointer machine model, or $O(n)$ construction time and space on the RAM. Here, $\alpha_k(n)$ is the inverse of the $k$th row of Ackermann's function. Our representations also efficiently support decompression of any substring in $S$: we can decompress…

114 citations
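A hedged sketch of why random access needs no decompression: store the expansion length of every nonterminal and descend. This toy gives O(grammar depth) access rather than the paper's O(log N) (which needs the balancing machinery above); the rule format below is an assumption made for the example, with ints naming nonterminals and strs holding terminals.

```python
# Random access into a grammar-compressed string via precomputed
# expansion lengths. O(depth) per access; the grammar must be acyclic.

def expansion_lengths(rules):
    memo = {}
    def length(sym):
        if isinstance(sym, str):
            return len(sym)
        if sym not in memo:
            memo[sym] = sum(length(s) for s in rules[sym])
        return memo[sym]
    for nt in rules:
        length(nt)
    return memo

def access(rules, lengths, root, i):
    """Return character i of the expansion of `root` without decompressing."""
    sym = root
    while not isinstance(sym, str):
        for child in rules[sym]:
            n = len(child) if isinstance(child, str) else lengths[child]
            if i < n:          # character i falls inside this child
                sym = child
                break
            i -= n             # otherwise skip the child's whole expansion
    return sym[i]

# Rule 0 -> 1 1, rule 1 -> "ab"; the expansion is "abab".
rules = {0: [1, 1], 1: ["ab"]}
lengths = expansion_lengths(rules)
assert "".join(access(rules, lengths, 0, i) for i in range(4)) == "abab"
```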


Book ChapterDOI
14 Sep 2015
TL;DR: This work explores the fittest feature set from a wide range of features and a method that refines a machine learning approach using gazetteers with approximate string matching, for robust handling of out-of-vocabulary words.
Abstract: This paper presents a pioneering work on building a Named Entity Recognition system for the Mongolian language, which has an agglutinative morphology and a subject-object-verb word order. Our work explores the fittest feature set from a wide range of features and a method that refines a machine learning approach using gazetteers with approximate string matching, for robust handling of out-of-vocabulary words. We also applied various existing machine learning methods and found an optimal ensemble of classifiers based on a genetic algorithm; the classifiers use different feature representations. The resulting system constitutes the first-ever usable software package for Mongolian NER, while our experimental evaluation will also serve as a much-needed basis of comparison for further research.

75 citations
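The gazetteer-with-approximate-matching idea can be sketched with the standard library; the gazetteer entries, the cutoff, and the function name below are illustrative assumptions, not the paper's resources:

```python
# An out-of-vocabulary token still fires a gazetteer feature if it is
# close enough to a known entry (catching inflected or misspelled forms).

import difflib

GAZETTEER = ["ulaanbaatar", "erdenet", "darkhan"]   # placeholder entries

def gazetteer_feature(token: str, cutoff: float = 0.85) -> bool:
    """True if token exactly or approximately matches a gazetteer entry."""
    if token.lower() in GAZETTEER:
        return True
    return bool(difflib.get_close_matches(token.lower(), GAZETTEER,
                                          n=1, cutoff=cutoff))

assert gazetteer_feature("Ulaanbaatar")     # exact hit
assert gazetteer_feature("Ulaanbaatart")    # suffixed OOV form still matches
assert not gazetteer_feature("algorithm")
```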


Proceedings ArticleDOI
24 Aug 2015
TL;DR: This paper proposes a scheme for Generalized Pattern-matching String-search on Encrypted data (GPSE) in cloud systems and implements the two most commonly used pattern matching search functions on encrypted data: substring matching and longest-prefix-first matching.
Abstract: Searchable encryption is an important and challenging issue. It allows people to search on encrypted data. This is a very useful function as more and more people choose to host their data in the cloud, where the cloud server is not fully trusted. Existing solutions for searchable encryption are limited to simple search functions, such as boolean search or similarity search. In this paper, we propose a scheme for Generalized Pattern-matching String-search on Encrypted data (GPSE) in cloud systems. GPSE allows users to specify their search queries using generalized wildcard-based string patterns (such as SQL-like patterns). It gives users great expressive power in specifying highly targeted search queries. Within the GPSE framework, we implemented the two most commonly used pattern matching search functions on encrypted data: substring matching and longest-prefix-first matching. We also prove that GPSE is secure under the known-plaintext model. Experiments over real data sets show that GPSE achieves high search accuracy.

31 citations
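The query language GPSE supports is easiest to picture on plaintext: a SQL-like wildcard pattern is just a restricted regular expression. This sketch only fixes that (unencrypted) semantics; evaluating such patterns over ciphertext is the paper's actual contribution and is not reproduced here.

```python
# SQL-LIKE wildcard patterns over plaintext: '%' matches any sequence,
# '_' matches any single character.

import re

def like_to_regex(pattern: str) -> "re.Pattern[str]":
    body = "".join(".*" if c == "%" else "." if c == "_" else re.escape(c)
                   for c in pattern)
    return re.compile("^" + body + "$")

assert like_to_regex("Sam%").match("Samuel")
assert like_to_regex("S_m").match("Sam")
assert not like_to_regex("S_m").match("Seam")
```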


Journal ArticleDOI
TL;DR: Taking into account the fuzzy information involved in one-shot multi-attribute exchanges, a new fuzzy matching model is proposed for the trade determination problem and a novel calculation method of the matching degree based on the improved fuzzy information axiom is presented.
Abstract: The trade determination problem is an important decision problem for multi-attribute exchanges in E-brokerages. To date, several studies have focused on this issue; however, theories and guidelines for the trade determination problem under fuzzy environments are still sparse. In this paper, taking into account the fuzzy information involved in one-shot multi-attribute exchanges, a new fuzzy matching model is proposed for the trade determination problem. In the model, we present a novel calculation method of the matching degree based on the improved fuzzy information axiom as a baseline study. Also, the credibility measure and Hurwicz criterion are introduced to convert the model into a crisp one. Since the crisp model is a 0-1 integer programming problem, the commonly used branch and bound algorithm and related optimization techniques become applicable. Finally, an example is employed to illustrate the application and sensitivity analysis of the proposed model.

28 citations


Proceedings ArticleDOI
27 May 2015
TL;DR: A new filtering method, called local filtering, is proposed, based on the idea that two strings exhibiting substantial local dissimilarities must be globally dissimilar; it achieves substantial speedup over state-of-the-art methods and is robust against factors such as dataset characteristics and large edit distance thresholds.
Abstract: We study efficient query processing for approximate string queries, which find strings within a string collection whose edit distances to the query strings are within the given thresholds. Existing methods typically hinge on the property that globally similar strings must share at least a certain number of identical substrings or subsequences. They become ineffective when there are burst errors or when the number of errors is large. In this paper, we explore the opposite paradigm, focusing on finding the differences between database strings and the query string. We propose a new filtering method, called local filtering, based on the idea that two strings exhibiting substantial local dissimilarities must be globally dissimilar. We propose the concept of (positional) local distance to quantify the minimum amount of errors a query fragment contributes to the edit distance between the query and a data string. It also leads to effective pruning rules and can speed up verification via early termination. We devise a family of indexing methods based on the idea of precomputing (positional) local distances for all possible combinations of query fragments and edit distance thresholds. Based on careful analyses of subtle relationships among local distances, novel techniques are proposed to drastically reduce the amount of enumeration with no or little impact on the pruning power. Efficient query processing methods exploiting the new index and bit-parallelism are also proposed. Experimental results on real datasets show that our local filtering-based methods achieve substantial speedup compared with state-of-the-art methods, and they are robust against factors such as dataset characteristics and large edit distance thresholds.

24 citations
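The pruning principle can be sketched directly: the best semi-global alignment cost of each disjoint query fragment against the data string lower-bounds that fragment's contribution, so the summed costs lower-bound the full edit distance. This hedged sketch shows only the principle; precomputing such local distances into an index is the paper's contribution, and the fragmenting scheme below is an assumption.

```python
# Sum of per-fragment minimal alignment costs <= true edit distance,
# so a data string whose sum already exceeds k can be pruned safely.

def local_distance(fragment: str, text: str) -> int:
    """Minimum edit distance between fragment and ANY substring of text
    (semi-global DP: free start and free end inside the text)."""
    prev = [0] * (len(text) + 1)                   # free start
    for i, c in enumerate(fragment, 1):
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, 1):
            cost = 0 if c == t else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return min(prev)                               # free end

def can_prune(query: str, data: str, k: int, pieces: int = 3) -> bool:
    """True if summed local distances of disjoint query fragments exceed k."""
    step = max(1, len(query) // pieces)
    frags = [query[i:i + step] for i in range(0, len(query), step)]
    return sum(local_distance(f, data) for f in frags) > k
```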


Posted Content
TL;DR: A tool to join observations from two datasets based on string variables which do not necessarily need to be exactly the same, allowing for a fuzzy similarity between the two different text variables.
Abstract: matchit is a tool to join observations from two datasets based on string variables which do not necessarily need to be exactly the same. It performs many different string-based matching techniques, allowing for a fuzzy similarity between the two different text variables.

22 citations
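The concept behind such fuzzy joins is compact enough to sketch (matchit itself is a Stata command offering many matching techniques); the data, threshold, and scorer below are illustrative assumptions:

```python
# Join two record lists on the best similarity score above a threshold.

import difflib

left  = ["ACME Corp.", "Widget Industries"]
right = ["Acme Corporation", "Widgets Industries Ltd", "Foo LLC"]

def fuzzy_join(left, right, threshold=0.6):
    for a in left:
        scored = [(difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio(), b)
                  for b in right]
        score, best = max(scored)          # best-scoring partner for a
        if score >= threshold:
            yield a, best, round(score, 2)

for pair in fuzzy_join(left, right):
    print(pair)   # e.g. ('ACME Corp.', 'Acme Corporation', 0.69)
```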


Journal ArticleDOI
TL;DR: The proposed algorithm is a hybrid that combines a modification of Horspool's algorithm with two observations on string matching; it scans the text from left to right and matches the pattern from right to left.
Abstract: Pattern matching is important in text processing, molecular biology, operating systems and web search engines. Many algorithms have been developed to search for a specific pattern in a text, but the need for an efficient algorithm is an outstanding issue. In this paper, we present a simple and practical string matching algorithm. The proposed algorithm is a hybrid that combines our modification of Horspool's algorithm with two observations on string matching. The algorithm scans the text from left to right and matches the pattern from right to left. Experimental results on natural language texts, genomes and human proteins demonstrate that the new algorithm is competitive with practical algorithms.

22 citations
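For context, the Horspool baseline being modified scans each window right to left and shifts by the bad-character rule keyed on the text character under the pattern's last position; a sketch:

```python
# Horspool's algorithm: right-to-left comparison inside a window,
# shift taken from the text character aligned with the pattern's end.

def horspool(text: str, pattern: str):
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return
    # bad-character table: distance from a character's last occurrence
    # (excluding the final position) to the end of the pattern
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        i = m - 1
        while i >= 0 and text[pos + i] == pattern[i]:
            i -= 1
        if i < 0:
            yield pos                       # full match at pos
        pos += shift.get(text[pos + m - 1], m)

assert list(horspool("abracadabra", "abra")) == [0, 7]
```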


DissertationDOI
01 Jan 2015
TL;DR: This thesis presents novel methods for the mapping of high-throughput sequencing DNA reads, based on state of the art approximate string matching algorithms and data structures, and provides all implementations within SeqAn, the generic C++ template library for sequence analysis, which is freely available under http://www.seqan.de.
Abstract: Over the past years, high-throughput sequencing (HTS) has become an invaluable method of investigation in molecular and medical biology. HTS technologies allow an individual's DNA sample to be sequenced cheaply and rapidly in the form of billions of short DNA reads. The ability to assess the content of a DNA sample at base-level resolution opens the way to a myriad of applications, including individual genotyping and assessment of large structural variations, measurement of gene expression levels and characterization of epigenetic features. Nonetheless, the quantity and quality of data produced by HTS instruments call for computationally efficient and accurate analysis methods. In this thesis, I present novel methods for the mapping of high-throughput sequencing DNA reads, based on state of the art approximate string matching algorithms and data structures. Read mapping is a fundamental step of any HTS data analysis pipeline in resequencing projects, where DNA reads are reassembled by aligning them back to a previously known reference genome. The ingenuity of approximate string matching methods is crucial to design efficient and accurate read mapping tools. In the first part of this thesis, I cover practical indexing and filtering methods for exact and approximate string matching. I present state of the art algorithms and data structures, give their pseudocode and discuss their implementation. Furthermore, I provide all implementations within SeqAn, the generic C++ template library for sequence analysis, which is freely available under http://www.seqan.de/. Subsequently, I experimentally evaluate all implemented methods, with the aim of guiding the engineering of new sequence alignment software. To the best of my knowledge, this is the first study providing a comprehensive exposition, implementation and evaluation of such methods. In the second part of this thesis, I turn to the engineering and evaluation of read mapping tools. First, I present a novel method to find all mapping locations per read within a user-defined error rate; this method is published in the peer-reviewed journal Nucleic Acids Research and packaged in an open source tool nicknamed Masai. Afterwards, I generalize this method to quickly report all co-optimal or suboptimal mapping locations per read within a user-defined error rate; this method, packaged in a tool called Yara, provides a more practical, yet sound solution to the read mapping problem. Extensive evaluations, both on simulated and real datasets, show that Yara has better speed and accuracy than de-facto standard read mapping tools.

22 citations


Book ChapterDOI
02 Mar 2015
TL;DR: A new algorithm for approximate circular string matching under the edit distance model with optimal average-case search time \(\mathcal {O}(n(k + \log m) /m)\).
Abstract: Approximate string matching is the problem of finding all factors of a text \(t\) of length \(n\) that are at a distance at most \(k\) from a pattern \(x\) of length \(m\). Approximate circular string matching is the problem of finding all factors of \(t\) that are at a distance at most \(k\) from \(x\) or from any of its rotations. In this article, we present a new algorithm for approximate circular string matching under the edit distance model with optimal average-case search time \(\mathcal {O}(n(k + \log m) /m)\). Optimal average-case search time can also be achieved by the algorithms for multiple approximate string matching (Fredriksson and Navarro, 2004) using \(x\) and its rotations as the set of multiple patterns. Here we reduce the preprocessing time and space requirements compared to that approach.
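A brute-force baseline, useful for checking an implementation and nowhere near the optimal average-case bound above: run the classic Sellers-style dynamic program once per rotation of the pattern and collect end positions of matching factors.

```python
# Naive reference for approximate circular matching: one semi-global DP
# per rotation, O(m^2 * n) overall, versus the paper's optimal
# average-case O(n(k + log m)/m) search time.

def approx_end_positions(pattern: str, text: str, k: int) -> set:
    """End positions in text of factors with edit distance <= k to pattern."""
    prev = [0] * (len(text) + 1)                 # factors may start anywhere
    for i, c in enumerate(pattern, 1):
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, 1):
            cost = 0 if c == t else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return {j for j, d in enumerate(prev) if d <= k}

def circular_end_positions(pattern: str, text: str, k: int) -> set:
    rotations = {pattern[r:] + pattern[:r] for r in range(len(pattern))}
    return set().union(*(approx_end_positions(r, text, k) for r in rotations))

# "cba" is a rotation of "acb" and is one substitution away from "cda".
assert 6 in circular_end_positions("acb", "xxxcda", 1)
```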

Journal ArticleDOI
TL;DR: This article proposes an effective online algorithm, named SETA (SubnETtree for sAp), based on the subnettree structure (a Nettree is an extension of a tree with multi-parents and multi-roots) and shows the completeness of the algorithm.
Abstract: Pattern matching with gap constraints is an essential problem in computer science, with applications such as music information retrieval and sequential pattern mining. One case is called loose matching, which only considers the matching position of the last pattern substring in the sequence. A more challenging problem considers the matching positions of each character in the sequence; this is called strict pattern matching, one of the essential tasks of sequential pattern mining with gap constraints. Some strict pattern matching algorithms were designed to handle pattern mining tasks, since strict pattern matching can be used to compute the frequency of patterns occurring in a given sequence, from which the frequent patterns can be derived. In this article, we address a more general strict approximate pattern matching with Hamming distance, named SAP (Strict Approximate Pattern matching with general gaps and length constraints), where the gap constraints can be negative. We show that a SAP instance can be transformed into an exponential number of instances of exact pattern matching with general gaps. Hence, we propose an effective online algorithm, named SETA (SubnETtree for sAp), based on the subnettree structure (a Nettree is an extension of a tree with multiple parents and multiple roots) and show the completeness of the algorithm. The space and time complexities of the algorithm are O(m × Maxlen × W × d) and O(Maxlen × W × m^2 × n × d), respectively, where m, Maxlen, W, and d are the length of pattern P, the maximal length constraint, the maximal gap length of pattern P and the approximate threshold. Extensive experimental results validate the correctness and effectiveness of SETA.
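The frequency-counting core of strict pattern matching is small enough to sketch. This hedged version handles nonnegative gaps only; SETA's general (possibly negative) gaps, length constraints, and Nettree machinery are the paper's contribution and are not reproduced.

```python
# occ[i][j] counts the strict matchings of pattern[:i+1] whose character
# pattern[i] lands exactly on s[j], respecting per-step gap bounds.

def count_occurrences(s, pattern, gaps):
    """gaps[i] = (lo, hi): allowed gap between pattern[i] and pattern[i+1]."""
    n, m = len(s), len(pattern)
    occ = [[0] * n for _ in range(m)]
    for j in range(n):
        occ[0][j] = 1 if s[j] == pattern[0] else 0
    for i in range(1, m):
        lo, hi = gaps[i - 1]
        for j in range(n):
            if s[j] != pattern[i]:
                continue
            # previous character at jp, with gap j - jp - 1 in [lo, hi]
            occ[i][j] = sum(occ[i - 1][jp]
                            for jp in range(max(0, j - hi - 1), j - lo))
    return sum(occ[m - 1])

# 'a' then 'b' with 0 or 1 characters between them in "aabb": 3 matchings.
assert count_occurrences("aabb", "ab", [(0, 1)]) == 3
```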

Journal ArticleDOI
TL;DR: The theoretical results are validated by an empirical study with real-world data, showing that the proposed optimal O(n) time and space algorithm, which can find an SUS for every location of a string of size n, is at least 8 times faster and uses at least 20 times less memory.

Book ChapterDOI
01 Sep 2015
TL;DR: Practical solutions for the exact order-preserving matching problem to find all the substrings of a text T which have the same length and relative order as a pattern P are presented.
Abstract: The exact order-preserving matching problem is to find all the substrings of a text T which have the same length and relative order as a pattern P. Like string matching, order-preserving matching can be generalized by allowing the match to be approximate. In approximate order-preserving matching, two strings match if they have the same relative order after removing up to k elements in the same positions in both strings. In this paper we present practical solutions for this problem. The methods are based on filtration, and one of them is the first sublinear solution on average. We show by practical experiments that the new solutions are fast and efficient.
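The underlying exact predicate — a window matches if it induces the pattern's ranking — can be written naively in a few lines (the filtration methods in the paper exist precisely to avoid recomputing ranks for every window):

```python
# Exact order-preserving matching by comparing rank fingerprints.

def rank_pattern(seq):
    """Relative-order fingerprint: rank of each element, ties by position."""
    order = sorted(range(len(seq)), key=lambda i: (seq[i], i))
    ranks = [0] * len(seq)
    for r, i in enumerate(order):
        ranks[i] = r
    return tuple(ranks)

def op_match(text, pattern):
    """All start positions whose window has the pattern's relative order."""
    m, fp = len(pattern), rank_pattern(pattern)
    return [i for i in range(len(text) - m + 1)
            if rank_pattern(text[i:i + m]) == fp]

# 10 < 30 with 20 in between has the same shape (low, high, mid) as 1, 5, 3.
assert op_match([10, 30, 20, 40, 25, 35], [1, 5, 3]) == [0, 2]
```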

Journal ArticleDOI
TL;DR: This paper presents a novel approach to word spotting using text line decomposition into character primitives and string matching, and shows that the method is robust for searching text in noisy documents.

01 Jan 2015
TL;DR: It is found that combinations of fuzzy matching metrics outperform single metrics and that the best-scoring combination is a non-linear combination of the different metrics the authors have tested.
Abstract: The concept of fuzzy matching in translation memories can take place using linguistically aware or unaware methods, or a combination of both. We designed a flexible and time-efficient framework which applies and combines linguistically unaware or aware metrics in the source and target language. We measure the correlation of fuzzy matching metric scores with the evaluation score of the suggested translation to find out how well the usefulness of a suggestion can be predicted, and we measure the difference in recall between fuzzy matching metrics by looking at the improvements in mean TER as the match score decreases. We found that combinations of fuzzy matching metrics outperform single metrics and that the best-scoring combination is a non-linear combination of the different metrics we have tested.

Book ChapterDOI
09 Dec 2015
TL;DR: In this article, a generic in-place framework was proposed to solve both the exact and approximate k-mismatch SUS finding, using the minimum 2n memory words plus n bytes space, where n is the input string size.
Abstract: We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology. We design a generic in-place framework that solves both the exact and approximate k-mismatch SUS finding, using the minimum 2n memory words plus n bytes of space, where n is the input string size. Using the in-place framework, we can find the exact and approximate k-mismatch SUS for every string position in a total of O(n) and \(O(n^2)\) time, respectively, regardless of the value of k. Our framework does not involve any compressed or succinct data structures and thus is practical and easy to implement.
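A naive reference implementation of the exact per-position SUS makes the problem statement concrete; it is roughly cubic, in stark contrast to the paper's O(n) in-place framework, and exists only as a correctness baseline.

```python
# For each position i, a shortest substring covering i occurring exactly
# once in s. Naive: count all substrings, then scan lengths outward.

from collections import Counter

def sus_per_position(s):
    n = len(s)
    counts = Counter(s[i:j] for i in range(n) for j in range(i + 1, n + 1))
    out = []
    for i in range(n):
        best = None
        for length in range(1, n + 1):
            # every window of this length that covers position i
            for start in range(max(0, i - length + 1), min(i, n - length) + 1):
                if counts[s[start:start + length]] == 1:
                    best = s[start:start + length]
                    break
            if best:
                break
        out.append(best)
    return out

assert sus_per_position("abab") == ["aba", "ba", "ba", "bab"]
```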

Proceedings ArticleDOI
01 Dec 2015
TL;DR: A new modified Cross-Language Levenshtein Distance (CLLD) algorithm that supports matching names across different writing scripts with many-to-many character mapping, and a hybrid cross-language name matching technique that mixes a phonetic matching technique with the proposed CLLD algorithm to improve the overall f-measure and speed up the matching process.
Abstract: Name matching is a key component in various applications such as record linkage and data mining. This process suffers from multiple complexities, such as matching data from different languages or data written by people from different cultures. In this paper, we present a new modified Cross-Language Levenshtein Distance (CLLD) algorithm that supports matching names across different writing scripts and with many-to-many character mapping. In addition, we present a hybrid cross-language name matching technique that uses a phonetic matching technique mixed with our proposed CLLD algorithm to improve the overall f-measure and speed up the matching process. Our experiments demonstrate that this method substantially outperforms a number of well-known standard phonetic and approximate string similarity methods in terms of precision, recall, and f-measure.
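A hedged sketch of the underlying idea only (not the authors' exact CLLD algorithm): a Levenshtein-style dynamic program whose transitions can also consume multi-character segments on either side at zero cost, driven by a cross-script mapping table. The mapping pairs below are illustrative assumptions.

```python
# Edit distance with many-to-many mapped segments: a mapping entry says a
# source segment may be transliterated as a target segment for free.

MAPPING = {("kh", "x"), ("ph", "f"), ("ts", "c")}   # illustrative pairs

def cross_script_distance(a: str, b: str) -> int:
    la, lb = len(a), len(b)
    INF = la + lb + 1
    d = [[INF] * (lb + 1) for _ in range(la + 1)]
    d[0][0] = 0
    for i in range(la + 1):          # forward DP: push costs to successors
        for j in range(lb + 1):
            if d[i][j] == INF:
                continue
            if i < la:                               # delete a[i]
                d[i + 1][j] = min(d[i + 1][j], d[i][j] + 1)
            if j < lb:                               # insert b[j]
                d[i][j + 1] = min(d[i][j + 1], d[i][j] + 1)
            if i < la and j < lb:                    # match / substitute
                cost = 0 if a[i] == b[j] else 1
                d[i + 1][j + 1] = min(d[i + 1][j + 1], d[i][j] + cost)
            for s, t in MAPPING:                     # mapped segment, free
                if a.startswith(s, i) and b.startswith(t, j):
                    ni, nj = i + len(s), j + len(t)
                    d[ni][nj] = min(d[ni][nj], d[i][j])
    return d[la][lb]

assert cross_script_distance("khalid", "xalid") == 0   # "kh" maps to "x"
assert cross_script_distance("khalid", "yalid") == 2   # no mapping applies
```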

Journal ArticleDOI
TL;DR: A model that represents a handwritten character as a string graph, improving recognition accuracy without relying on a normalisation technique; the similarity distance between graphs is measured using approximate subgraph matching and a string edit distance method.

Proceedings ArticleDOI
08 Dec 2015
TL;DR: The main contribution of this work is a memory-access-efficient GPU implementation for computing the ASM, called w-SCAN, which relies on warp shuffle for communication between threads without resorting to shared memory access.
Abstract: The approximate string matching (ASM) problem asks to find a substring of string Y of length n that is most similar to string X of length m. The ASM can be solved by the dynamic programming technique, which computes a table of size m × n. The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU. The key idea of our implementation relies on warp shuffle for communication between threads without resorting to shared memory access. Surprisingly, our implementation performs only O(mn/w) memory access operations, where w is the warp size, although O(mn) memory access operations are necessary to access all elements in the table of size m × n. Experimental results, carried out on a GeForce GTX 980 GPU, show that the proposed implementation, called w-SCAN, provides a speed-up factor of over 200 compared to a single-CPU implementation. Also, w-SCAN computes the ASM in less than 40% of the time required by another prominent alternative.
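The structure the warp exploits is visible even in scalar code: every cell on anti-diagonal d of the DP table depends only on diagonals d-1 and d-2, so a warp can compute a whole diagonal in lockstep, exchanging boundary cells through shuffles. A hedged Python rendering of just that dependency pattern (sequential, of course; the warp-shuffle part has no Python analogue):

```python
# Edit distance computed by anti-diagonals: diagonal d reads only
# diagonals d-1 and d-2, the property that enables lockstep GPU threads.

def edit_distance_antidiagonal(x: str, y: str) -> int:
    m, n = len(x), len(y)
    d2, d1 = None, None            # diagonals d-2 and d-1, keyed by row i
    for d in range(m + n + 1):
        curr = {}
        for i in range(max(0, d - n), min(m, d) + 1):
            j = d - i
            if i == 0:
                curr[i] = j                       # first row of the table
            elif j == 0:
                curr[i] = i                       # first column
            else:
                cost = 0 if x[i - 1] == y[j - 1] else 1
                curr[i] = min(d1[i] + 1,          # cell (i, j-1): insert
                              d1[i - 1] + 1,      # cell (i-1, j): delete
                              d2[i - 1] + cost)   # cell (i-1, j-1)
        d2, d1 = d1, curr
    return d1[m]

assert edit_distance_antidiagonal("kitten", "sitting") == 3
```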

Journal ArticleDOI
TL;DR: A new variant of Closest String where the input strings can contain wildcards that can match any letter in the alphabet, and the goal is to find a solution string without wildcards.

Journal ArticleDOI
04 May 2015-PLOS ONE
TL;DR: The experimental results show that the proposed string matching scheme can reduce the storage cost significantly compared to the previous bit-split string matching methods.
Abstract: This paper proposes a memory-efficient bit-split string matching scheme for deep packet inspection (DPI). When the number of target patterns becomes large, the memory requirements of the string matching engine become a critical issue. The proposed string matching scheme reduces the memory requirements using the uniqueness of the target patterns in the deterministic finite automaton (DFA)-based bit-split string matching. The pattern grouping extracts a set of unique patterns from the target patterns. In the set of unique patterns, a pattern is not the suffix of any other pattern. Therefore, in the DFA constructed with the set of unique patterns, only one pattern can be matched in an output state. In bit-split string matching, multiple finite-state machine (FSM) tiles with several input bit groups are adopted in order to reduce the number of stored state transitions. However, the memory requirements for storing the matching vectors can be large because each bit in the matching vector is used to identify whether its own pattern is matched or not. In our research, the proposed pattern grouping is applied to the multiple FSM tiles in the bit-split string matching. For the set of unique patterns, the memory-based bit-split string matching engine stores only the pattern match index for each state to indicate the match with its own unique pattern. Therefore, the memory requirements are significantly decreased by not storing the matching vectors in the string matchers for the set of unique patterns. The experimental results show that the proposed string matching scheme can reduce the storage cost significantly compared to the previous bit-split string matching methods.
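The pattern-grouping step is simple to sketch: processing patterns longest-first, keep a pattern in the current group unless it is a suffix of one already kept; the leftover patterns would seed further groups, which the full engine also builds. This is only the grouping predicate, not the FSM-tile construction.

```python
# Extract one group of "unique" patterns: within the group, no pattern is
# a proper suffix of another, so a DFA output state matches one pattern.

def split_unique_group(patterns):
    pats = sorted(set(patterns), key=len, reverse=True)
    group, rest = [], []
    for p in pats:
        if any(q.endswith(p) for q in group):
            rest.append(p)      # suffix of a kept pattern: defer to next group
        else:
            group.append(p)
    return group, rest

# "evil" is a suffix of "devil", so it cannot share a group with it.
group, rest = split_unique_group(["devil", "evil", "virus"])
assert set(group) == {"devil", "virus"} and rest == ["evil"]
```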

Journal ArticleDOI
TL;DR: This paper proposes INSPIRE, a general framework that adopts a unifying strategy for processing different variants of spatial keyword queries, using the autocompletion paradigm to generate an initial query as a prefix matching query.
Abstract: Geo-textual data are generated in abundance. Recent studies have focused on the processing of spatial keyword queries, which retrieve objects that match certain keywords within a spatial region. To ensure effective retrieval, various extensions have been made, including allowing errors in keyword matching and autocompletion using prefix matching. In this paper, we propose INSPIRE, a general framework which adopts a unifying strategy for processing different variants of spatial keyword queries. We adopt the autocompletion paradigm that generates an initial query as a prefix matching query. If there are few matching results, other variants are performed as a form of relaxation that reuses the processing done in the earlier phase. The types of relaxation allowed include spatial region expansion and exact/approximate prefix/substring matching. Moreover, since the autocompletion paradigm allows appending characters after the initial query, we look at how query processing done for the initial query and relaxation can be reused in such instances. Compared to existing works which process variants of spatial keyword query as new queries over different indexes, our approach offers a more compelling way to efficient and effective spatial keyword search. Extensive experiments substantiate our claims.

Patent
21 Oct 2015
TL;DR: A fuzzy word segmentation based method for the automatic proofreading of Chinese non-multi-character word errors is presented.
Abstract: The invention discloses a fuzzy word segmentation based non-multi-character word error automatic proofreading method. According to the method, accurate segmentation is carried out based on a correct word dictionary and a wrong character word dictionary to generate a word graph; then the similarity of Chinese word strings is calculated by utilizing a fuzzy matching algorithm, accurately segmented disperse strings are subjected to fuzzy matching, and a fuzzy matching result is added into the word graph to form a fuzzy word graph; and finally a shortest path of the fuzzy word graph is calculated by utilizing a binary model of words in combination with similarity, so that automatic proofreading of Chinese non-multi-character word errors is realized. According to the fuzzy word segmentation based non-multi-character word error automatic proofreading method provided by the invention, the system response is quick, the precision meets actual application demands, and the effectiveness and the accuracy are high.

Patent
28 Oct 2015
TL;DR: A step-by-step progressive address matching method is adopted, comprising four steps: fast matching, longitude and latitude matching, fuzzy matching, and manual judgment.
Abstract: The invention discloses an address matching method that adopts a step-by-step progressive matching approach. The method comprises four steps: fast matching, longitude and latitude matching, fuzzy matching, and manual judgment. In the fast matching step, high-quality target addresses are subjected to precise matching, and a chain-type complementary mechanism is used for proper complementary matching. In the longitude and latitude matching step, the target addresses and adjacent cells are matched according to longitude and latitude information provided by a map service provider. In the fuzzy matching step, a fuzzy index is used for matching the target addresses with similar cells. A manual judgment mechanism is used for checking and controlling the matching result. The address matching method also comprises an address word segmentation technique and a confidence index mechanism for address matching accuracy. The method has the advantages that matching efficiency is improved while a high matching success rate is ensured; the problem of combining multiple address matching techniques in one application is solved; the success rate and fault tolerance of address matching are improved to a great degree; and a series of optimization mechanisms ensure program running efficiency.

Journal ArticleDOI
TL;DR: An expansion-based framework to measure string similarities efficiently while considering synonyms is presented, and an estimator of candidate-set size is developed to enable an online selection of signature filters, providing strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs.
Abstract: A string-similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered to be similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, for example, the number of common words or q-grams. While this is indeed an indicator of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, "Bill" is a short form of "William," and "Database Management Systems" can be abbreviated as "DBMS." Given a collection of predefined synonyms, the purpose of this article is to explore such existing knowledge to effectively evaluate the similarity between two strings and efficiently perform similarity searches and joins, thereby boosting the quality of approximate string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. We then study efficient algorithms for similarity searches and joins by proposing two novel indexes, called SI-trees and QP-trees, which combine signature-filtering and length-filtering strategies. In order to improve the efficiency of our algorithms, we develop an estimator of candidate-set size to enable an online selection of signature filters. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the experimental results from a comprehensive study of the algorithms with three real datasets verify the effectiveness and efficiency of our approaches.
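A hedged sketch of the expansion idea only (the SI-tree/QP-tree indexes and the estimator are the paper's contributions): rewrite each string under the applicable synonym rules and score the best resulting pair. The synonym table and the scorer are illustrative assumptions.

```python
# Expansion-based similarity: generate all synonym rewritings of each
# string (exponential in applicable rules; fine for a sketch) and take
# the best pairwise score.

import difflib
from itertools import product

SYNONYMS = {"bill": "william", "dbms": "database management systems"}

def expansions(s: str):
    tokens = s.lower().split()
    options = [(t, SYNONYMS[t]) if t in SYNONYMS else (t,) for t in tokens]
    return {" ".join(choice) for choice in product(*options)}

def expanded_similarity(a: str, b: str) -> float:
    return max(difflib.SequenceMatcher(None, x, y).ratio()
               for x in expansions(a) for y in expansions(b))

assert expanded_similarity("Bill Gates", "William Gates") == 1.0
```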

Journal ArticleDOI
TL;DR: This work introduces a new data structure called dB-hash, which maintains the high sensitivity and speed of its (hash-based) predecessor ERNE, while drastically reducing space consumption and can attain good performances and accuracy with a memory footprint comparable to that of the most popular compressed indexes.
Abstract: The high throughput of modern NGS sequencers, coupled with the huge sizes of the genomes currently analysed, poses ever higher algorithmic challenges to aligning short reads quickly and accurately against a reference sequence. A crucial additional requirement is that the data structures used should be light. The available modern solutions are usually a compromise between these constraints: in particular, indexes based on the Burrows-Wheeler transform offer reduced memory requirements at the price of lower sensitivity, while hash-based text indexes guarantee high sensitivity at the price of significant memory consumption. In this work we describe a technique that attains the advantages granted by both classes of indexes. This is achieved using Hamming-aware hash functions (hash functions designed to search the entire Hamming sphere in reduced time) which are also homomorphisms on de Bruijn graphs. We show that, using this particular class of hash functions, the corresponding hash index can be represented in linear space, introducing only a logarithmic slowdown (in the query length) for the lookup operation. We point out that our data structure reaches its goals without compressing its input: another positive feature, as in biological applications data is often very close to incompressible. The new data structure introduced in this work is called dB-hash, and we show how its implementation, BW-ERNE, maintains the high sensitivity and speed of its (hash-based) predecessor ERNE while drastically reducing space consumption. Extensive comparison experiments conducted with several popular alignment tools on both simulated and real NGS data show, finally, that BW-ERNE is able to attain both the positive features of succinct data structures (that is, small space) and hash indexes (that is, sensitivity). In applications where space and speed are both a concern, standard methods often sacrifice accuracy to obtain competitive throughputs and memory footprints. In this work we show that, by combining hashing and succinct indexing techniques, we can attain good performance and accuracy with a memory footprint comparable to that of the most popular compressed indexes.
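A toy illustration of what "Hamming-aware" buys (this is not the paper's dB-hash, which is additionally a de Bruijn-graph homomorphism): if the hash is GF(2)-linear in the encoded string, the hash of any string at Hamming distance 1 from the query equals the query's hash XORed with a precomputable error term, so the Hamming sphere can be probed through a few lookups instead of being enumerated. The encoding and hash width are assumptions of this sketch.

```python
# A GF(2)-linear toy hash: fold-XOR of the 2-bit DNA encoding. Linearity
# means h(x with one substitution) = h(x) ^ (code_old ^ code_new) << shift.

BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
B = 8  # hash width in bits

def h(seq: str) -> int:
    v = 0
    for i, c in enumerate(seq):
        v ^= BITS[c] << (2 * i % B)   # fold each 2-bit code into B bits
    return v

def hamming1_hashes(q: str):
    """Hashes of all strings at Hamming distance exactly 1 from q."""
    out = set()
    for i, c in enumerate(q):
        for d in "ACGT":
            if d != c:
                out.add(h(q) ^ ((BITS[c] ^ BITS[d]) << (2 * i % B)))
    return out

q = "ACGTACGT"
assert h("CCGTACGT") in hamming1_hashes(q)   # one substitution at position 0
```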

Book ChapterDOI
01 Jan 2015
TL;DR: A hybrid text censoring method based on Bayesian Filtering and Approximate String Matching techniques is introduced; the results show that the Bayesian filtering technique can be used to filter profane words.
Abstract: Information obtained nowadays often contains malicious content. Malicious content such as profane words has to be censored, as it can influence the minds of the young and create hate among people. To censor profane words, this paper introduces a hybrid text censoring method based on Bayesian Filtering and Approximate String Matching techniques. The Bayesian filtering technique is used to detect the malicious content (profane words), while the Approximate String Matching technique is used to enhance the effectiveness of detecting profane words. The performance of the proposed system was evaluated using the metrics of Precision, Recall, F-measure and MAE. The results show that the Bayesian filtering technique can be used to filter profane words.
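The role approximate matching plays in such a hybrid can be sketched with the standard library: normalize obfuscated tokens back onto the word list before filtering, so deliberate misspellings cannot slip past. The word list, cutoff, and function names are placeholders, and the Bayesian scoring half is omitted here.

```python
# Map obfuscated/misspelled tokens to the closest known profane word,
# then censor; a plain exact-token filter would miss "badw0rd".

import difflib

PROFANE = ["badword", "curse"]          # placeholder word list

def normalize_token(token: str, cutoff: float = 0.8) -> str:
    """Replace a token with the closest profane word if similar enough."""
    hit = difflib.get_close_matches(token.lower(), PROFANE, n=1, cutoff=cutoff)
    return hit[0] if hit else token

def censor(text: str) -> str:
    return " ".join("****" if normalize_token(tok) in PROFANE else tok
                    for tok in text.split())

assert censor("you badw0rd here") == "you **** here"
```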

Proceedings ArticleDOI
27 Mar 2015
TL;DR: In this paper, a new string matching algorithm is proposed that matches the pattern starting from neither the left nor the right end, but from a special position, making it more flexible in picking the position at which comparisons start.
Abstract: String matching is of great importance in pattern recognition. We put forth a new string matching algorithm which matches the pattern starting from neither the left nor the right end, but from a special position. Compared with the Knuth-Morris-Pratt algorithm and the Boyer-Moore algorithm, the new algorithm is more flexible in picking the position at which comparisons start, and this flexibility brings a saving in cost. The method requires a statistical probability table for the alphabet, which can be maintained using evolution strategies under dynamic conditions. If the chosen lowlight character in a given pattern has probability λ, the length of the text is n and the length of the pattern is m, then we conjecture that the complexity of the new algorithm is Θ(n/λm).
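A hedged sketch of the idea (not the authors' exact algorithm): anchor the search at the pattern character that is rarest under the text's character distribution and verify the full pattern around each hit; with anchor probability λ, roughly λn verifications of cost at most m are expected. The probability table here is an illustrative assumption.

```python
# Anchor the scan on the pattern's rarest character, then verify around it.

def rare_anchor_search(text: str, pattern: str, char_prob):
    """char_prob maps a character to its assumed frequency in the text."""
    anchor = min(range(len(pattern)), key=lambda i: char_prob(pattern[i]))
    c = pattern[anchor]
    start = text.find(c)
    while start != -1:
        pos = start - anchor                   # implied pattern start
        if pos >= 0 and text.startswith(pattern, pos):
            yield pos
        start = text.find(c, start + 1)

freq = {"z": 0.001, "e": 0.12}.get             # toy probability table
hits = list(rare_anchor_search("we zebra here zebra", "zebra",
                               lambda ch: freq(ch, 0.05)))
assert hits == [3, 14]
```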

01 Sep 2015
TL;DR: An innovative approach to match sentences having different words but the same meaning is presented, using NooJ to create paraphrases of Support Verb Constructions of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM).
Abstract: Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. Those kind of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing retrieving of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems, across different text domains.