
Showing papers on "Approximate string matching" published in 2015


Proceedings ArticleDOI
14 Jun 2015
TL;DR: This paper shows that, if the edit distance can be computed in time O(n^{2-δ}) for some constant δ>0, then the satisfiability of conjunctive normal form formulas with N variables and M clauses can be solved in time M^{O(1)} 2^{(1-ε)N} for a constant ε>0.
Abstract: The edit distance (a.k.a. the Levenshtein distance) between two strings is defined as the minimum number of insertions, deletions or substitutions of symbols needed to transform one string into another. The problem of computing the edit distance between two strings is a classical computational task, with a well-known algorithm based on dynamic programming. Unfortunately, all known algorithms for this problem run in nearly quadratic time. In this paper we provide evidence that the near-quadratic running time bounds known for the problem of computing edit distance might be tight. Specifically, we show that, if the edit distance can be computed in time O(n^{2-δ}) for some constant δ>0, then the satisfiability of conjunctive normal form formulas with N variables and M clauses can be solved in time M^{O(1)} 2^{(1-ε)N} for a constant ε>0. The latter result would violate the Strong Exponential Time Hypothesis, which postulates that such algorithms do not exist.

264 citations
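For reference, the dynamic program the abstract calls classical fits in a few lines; the paper's result says that, under SETH, nothing beats its quadratic behaviour by a polynomial factor. A minimal sketch:

```python
# Classic O(m*n) edit-distance dynamic program, kept to two rows of the table.

def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))            # distances from the empty prefix of a
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a[i-1]
                          curr[j - 1] + 1,     # insert b[j-1]
                          prev[j - 1] + cost)  # substitute (or match)
        prev = curr
    return prev[n]

assert edit_distance("kitten", "sitting") == 3
```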


Proceedings ArticleDOI
17 Oct 2015
TL;DR: In this article, it was shown that these measures do not have strongly subquadratic time algorithms, i.e., no algorithms with running time O(n^{2-ε}) for any ε > 0, unless the Strong Exponential Time Hypothesis fails.
Abstract: Classic similarity measures of strings are longest common subsequence and Levenshtein distance (i.e., the classic edit distance). A classic similarity measure of curves is dynamic time warping. These measures can be computed by simple O(n^2) dynamic programming algorithms, and despite much effort no algorithms with significantly better running time are known. We prove that, even restricted to binary strings or one-dimensional curves, respectively, these measures do not have strongly subquadratic time algorithms, i.e., no algorithms with running time O(n^{2-ε}) for any ε > 0, unless the Strong Exponential Time Hypothesis fails. We generalize the result to edit distance for arbitrary fixed costs of the four operations (deletion in one of the two strings, matching, substitution), by identifying trivial cases that can be solved in constant time, and proving quadratic-time hardness on binary strings for all other cost choices. This improves and generalizes the known hardness result for Levenshtein distance [Backurs, Indyk STOC'15] by the restriction to binary strings and the generalization to arbitrary costs, and adds important problems to a recent line of research showing conditional lower bounds for a growing number of quadratic time problems. As our main technical contribution, we introduce a framework for proving quadratic-time hardness of similarity measures. To apply the framework it suffices to construct a single gadget, which encapsulates all the expressive power necessary to emulate a reduction from satisfiability. Finally, we prove quadratic-time hardness for longest palindromic subsequence and longest tandem subsequence via reductions from longest common subsequence, showing that conditional lower bounds based on the Strong Exponential Time Hypothesis also apply to string problems that are not necessarily similarity measures.

195 citations
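The quadratic dynamic programs covered by this hardness result all share the same shape; here is the one for longest common subsequence, and the lower bound applies even when both inputs are binary:

```python
# Classic O(m*n) dynamic program for longest common subsequence length.

def lcs_length(a: str, b: str) -> int:
    m, n = len(a), len(b)
    prev = [0] * (n + 1)
    for i in range(1, m + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1      # extend a common subsequence
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[n]

assert lcs_length("ABCBDAB", "BDCABA") == 4    # e.g. "BCBA"
```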


Journal ArticleDOI
TL;DR: A novel grammar representation that allows efficient random access to any character or substring without decompressing the string is presented.
Abstract: Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures (sometimes with slight reduction in efficiency) many of the popular compression schemes, including the Lempel--Ziv family, run-length encoding, byte-pair encoding, Sequitur, and Re-Pair. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string. Let $S$ be a string of length $N$ compressed into a context-free grammar $\mathcal{S}$ of size $n$. We present two representations of $\mathcal{S}$ achieving $O(\log N)$ random access time, and either $O(n\cdot\alpha_k(n))$ construction time and space on the pointer machine model, or $O(n)$ construction time and space on the RAM. Here, $\alpha_k(n)$ is the inverse of the $k$th row of Ackermann's function. Our representations also efficiently support decompression of any substring in $S$: we can decompress…

114 citations
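A hedged sketch of why random access needs no decompression: store the expansion length of every nonterminal and descend. This toy gives O(grammar depth) access rather than the paper's O(log N) (which needs the balancing machinery above); the rule format below is an assumption made for the example, with ints naming nonterminals and strs holding terminals.

```python
# Random access into a grammar-compressed string via precomputed
# expansion lengths. O(depth) per access; the grammar must be acyclic.

def expansion_lengths(rules):
    memo = {}
    def length(sym):
        if isinstance(sym, str):
            return len(sym)
        if sym not in memo:
            memo[sym] = sum(length(s) for s in rules[sym])
        return memo[sym]
    for nt in rules:
        length(nt)
    return memo

def access(rules, lengths, root, i):
    """Return character i of the expansion of `root` without decompressing."""
    sym = root
    while not isinstance(sym, str):
        for child in rules[sym]:
            n = len(child) if isinstance(child, str) else lengths[child]
            if i < n:          # character i falls inside this child
                sym = child
                break
            i -= n             # otherwise skip the child's whole expansion
    return sym[i]

# Rule 0 -> 1 1, rule 1 -> "ab"; the expansion is "abab".
rules = {0: [1, 1], 1: ["ab"]}
lengths = expansion_lengths(rules)
assert "".join(access(rules, lengths, 0, i) for i in range(4)) == "abab"
```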


Book ChapterDOI
14 Sep 2015
TL;DR: This work explores the fittest feature set from a wide range of features and a method that refines a machine learning approach using gazetteers with approximate string matching, for robust handling of out-of-vocabulary words.
Abstract: This paper presents a pioneering work on building a Named Entity Recognition system for the Mongolian language, which has an agglutinative morphology and a subject-object-verb word order. Our work explores the fittest feature set from a wide range of features and a method that refines a machine learning approach using gazetteers with approximate string matching, for robust handling of out-of-vocabulary words. We also applied various existing machine learning methods and found an optimal ensemble of classifiers based on a genetic algorithm; the classifiers use different feature representations. The resulting system constitutes the first-ever usable software package for Mongolian NER, while our experimental evaluation will also serve as a much-needed basis of comparison for further research.

75 citations
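The gazetteer-with-approximate-matching idea can be sketched with the standard library; the gazetteer entries, the cutoff, and the function name below are illustrative assumptions, not the paper's resources:

```python
# An out-of-vocabulary token still fires a gazetteer feature if it is
# close enough to a known entry (catching inflected or misspelled forms).

import difflib

GAZETTEER = ["ulaanbaatar", "erdenet", "darkhan"]   # placeholder entries

def gazetteer_feature(token: str, cutoff: float = 0.85) -> bool:
    """True if token exactly or approximately matches a gazetteer entry."""
    if token.lower() in GAZETTEER:
        return True
    return bool(difflib.get_close_matches(token.lower(), GAZETTEER,
                                          n=1, cutoff=cutoff))

assert gazetteer_feature("Ulaanbaatar")     # exact hit
assert gazetteer_feature("Ulaanbaatart")    # suffixed OOV form still matches
assert not gazetteer_feature("algorithm")
```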


Proceedings ArticleDOI
24 Aug 2015
TL;DR: This paper proposes a scheme for Generalized Pattern-matching String-search on Encrypted data (GPSE) in cloud systems and implements the two most commonly used pattern matching search functions on encrypted data: substring matching and longest-prefix-first matching.
Abstract: Searchable encryption is an important and challenging issue. It allows people to search on encrypted data. This is a very useful function as more and more people choose to host their data in the cloud, where the cloud server is not fully trusted. Existing solutions for searchable encryption are limited to simple search functions, such as boolean search or similarity search. In this paper, we propose a scheme for Generalized Pattern-matching String-search on Encrypted data (GPSE) in cloud systems. GPSE allows users to specify their search queries using generalized wildcard-based string patterns (such as SQL-like patterns). It gives users great expressive power in specifying highly targeted search queries. Within the GPSE framework, we implemented the two most commonly used pattern matching search functions on encrypted data: substring matching and longest-prefix-first matching. We also prove that GPSE is secure under the known-plaintext model. Experiments over real data sets show that GPSE achieves high search accuracy.

31 citations
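The query language GPSE supports is easiest to picture on plaintext: a SQL-like wildcard pattern is just a restricted regular expression. This sketch only fixes that (unencrypted) semantics; evaluating such patterns over ciphertext is the paper's actual contribution and is not reproduced here.

```python
# SQL-LIKE wildcard patterns over plaintext: '%' matches any sequence,
# '_' matches any single character.

import re

def like_to_regex(pattern: str) -> "re.Pattern[str]":
    body = "".join(".*" if c == "%" else "." if c == "_" else re.escape(c)
                   for c in pattern)
    return re.compile("^" + body + "$")

assert like_to_regex("Sam%").match("Samuel")
assert like_to_regex("S_m").match("Sam")
assert not like_to_regex("S_m").match("Seam")
```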


Journal ArticleDOI
TL;DR: Taking into account the fuzzy information involved in one-shot multi-attribute exchanges, a new fuzzy matching model is proposed for the trade determination problem and a novel calculation method of the matching degree based on the improved fuzzy information axiom is presented.
Abstract: The trade determination problem is an important decision problem for multi-attribute exchanges in E-brokerages. To date, several studies have focused on this issue; however, theories and guidelines for the trade determination problem under fuzzy environments are still sparse. In this paper, taking into account the fuzzy information involved in one-shot multi-attribute exchanges, a new fuzzy matching model is proposed for the trade determination problem. In the model, we present a novel calculation method of the matching degree based on the improved fuzzy information axiom as a baseline study. Also, the credibility measure and Hurwicz criterion are introduced to convert the model into a crisp one. Since the crisp model is a 0-1 integer programming problem, the commonly used branch and bound algorithm and related optimization techniques become applicable. Finally, an example is employed to illustrate the application and sensitivity analysis of the proposed model.

28 citations


Proceedings ArticleDOI
27 May 2015
TL;DR: A new filtering method, called local filtering, is proposed, based on the idea that two strings exhibiting substantial local dissimilarities must be globally dissimilar; it achieves substantial speedup over state-of-the-art methods and is robust against factors such as dataset characteristics and large edit distance thresholds.
Abstract: We study efficient query processing for approximate string queries, which find strings within a string collection whose edit distances to the query strings are within the given thresholds. Existing methods typically hinge on the property that globally similar strings must share at least a certain number of identical substrings or subsequences. They become ineffective when there are burst errors or when the number of errors is large. In this paper, we explore the opposite paradigm, focusing on finding the differences between database strings and the query string. We propose a new filtering method, called local filtering, based on the idea that two strings exhibiting substantial local dissimilarities must be globally dissimilar. We propose the concept of (positional) local distance to quantify the minimum amount of errors a query fragment contributes to the edit distance between the query and a data string. It also leads to effective pruning rules and can speed up verification via early termination. We devise a family of indexing methods based on the idea of precomputing (positional) local distances for all possible combinations of query fragments and edit distance thresholds. Based on careful analyses of subtle relationships among local distances, novel techniques are proposed to drastically reduce the amount of enumeration with no or little impact on the pruning power. Efficient query processing methods exploiting the new index and bit-parallelism are also proposed. Experimental results on real datasets show that our local filtering-based methods achieve substantial speedup compared with state-of-the-art methods, and they are robust against factors such as dataset characteristics and large edit distance thresholds.

24 citations
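The pruning principle can be sketched directly: the best semi-global alignment cost of each disjoint query fragment against the data string lower-bounds that fragment's contribution, so the summed costs lower-bound the full edit distance. This hedged sketch shows only the principle; precomputing such local distances into an index is the paper's contribution, and the fragmenting scheme below is an assumption.

```python
# Sum of per-fragment minimal alignment costs <= true edit distance,
# so a data string whose sum already exceeds k can be pruned safely.

def local_distance(fragment: str, text: str) -> int:
    """Minimum edit distance between fragment and ANY substring of text
    (semi-global DP: free start and free end inside the text)."""
    prev = [0] * (len(text) + 1)                   # free start
    for i, c in enumerate(fragment, 1):
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, 1):
            cost = 0 if c == t else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return min(prev)                               # free end

def can_prune(query: str, data: str, k: int, pieces: int = 3) -> bool:
    """True if summed local distances of disjoint query fragments exceed k."""
    step = max(1, len(query) // pieces)
    frags = [query[i:i + step] for i in range(0, len(query), step)]
    return sum(local_distance(f, data) for f in frags) > k
```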


Posted Content
TL;DR: A tool to join observations from two datasets based on string variables which do not necessarily need to be exactly the same, allowing for a fuzzy similarity between the two different text variables.
Abstract: matchit is a tool to join observations from two datasets based on string variables which do not necessarily need to be exactly the same. It performs many different string-based matching techniques, allowing for a fuzzy similarity between the two different text variables.

22 citations
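The concept behind such fuzzy joins is compact enough to sketch (matchit itself is a Stata command offering many matching techniques); the data, threshold, and scorer below are illustrative assumptions:

```python
# Join two record lists on the best similarity score above a threshold.

import difflib

left  = ["ACME Corp.", "Widget Industries"]
right = ["Acme Corporation", "Widgets Industries Ltd", "Foo LLC"]

def fuzzy_join(left, right, threshold=0.6):
    for a in left:
        scored = [(difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio(), b)
                  for b in right]
        score, best = max(scored)          # best-scoring partner for a
        if score >= threshold:
            yield a, best, round(score, 2)

for pair in fuzzy_join(left, right):
    print(pair)   # e.g. ('ACME Corp.', 'Acme Corporation', 0.69)
```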


Journal ArticleDOI
TL;DR: The proposed algorithm is a hybrid that combines a modification of Horspool's algorithm with two observations on string matching; it scans the text from left to right and matches the pattern from right to left.
Abstract: Pattern matching is important in text processing, molecular biology, operating systems and web search engines. Many algorithms have been developed to search for a specific pattern in a text, but the need for an efficient algorithm is an outstanding issue. In this paper, we present a simple and practical string matching algorithm. The proposed algorithm is a hybrid that combines our modification of Horspool's algorithm with two observations on string matching. The algorithm scans the text from left to right and matches the pattern from right to left. Experimental results on natural language texts, genomes and human proteins demonstrate that the new algorithm is competitive with practical algorithms.

22 citations
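For context, the Horspool baseline being modified scans each window right to left and shifts by the bad-character rule keyed on the text character under the pattern's last position; a sketch:

```python
# Horspool's algorithm: right-to-left comparison inside a window,
# shift taken from the text character aligned with the pattern's end.

def horspool(text: str, pattern: str):
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return
    # bad-character table: distance from a character's last occurrence
    # (excluding the final position) to the end of the pattern
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        i = m - 1
        while i >= 0 and text[pos + i] == pattern[i]:
            i -= 1
        if i < 0:
            yield pos                       # full match at pos
        pos += shift.get(text[pos + m - 1], m)

assert list(horspool("abracadabra", "abra")) == [0, 7]
```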


DissertationDOI
01 Jan 2015
TL;DR: This thesis presents novel methods for the mapping of high-throughput sequencing DNA reads, based on state of the art approximate string matching algorithms and data structures, and provides all implementations within SeqAn, the generic C++ template library for sequence analysis, which is freely available under http://www.seqan.de.
Abstract: Over the past years, high-throughput sequencing (HTS) has become an invaluable method of investigation in molecular and medical biology. HTS technologies allow an individual's DNA sample to be sequenced cheaply and rapidly in the form of billions of short DNA reads. The ability to assess the content of a DNA sample at base-level resolution opens the way to a myriad of applications, including individual genotyping and assessment of large structural variations, measurement of gene expression levels and characterization of epigenetic features. Nonetheless, the quantity and quality of data produced by HTS instruments call for computationally efficient and accurate analysis methods. In this thesis, I present novel methods for the mapping of high-throughput sequencing DNA reads, based on state of the art approximate string matching algorithms and data structures. Read mapping is a fundamental step of any HTS data analysis pipeline in resequencing projects, where DNA reads are reassembled by aligning them back to a previously known reference genome. The ingenuity of approximate string matching methods is crucial to design efficient and accurate read mapping tools. In the first part of this thesis, I cover practical indexing and filtering methods for exact and approximate string matching. I present state of the art algorithms and data structures, give their pseudocode and discuss their implementation. Furthermore, I provide all implementations within SeqAn, the generic C++ template library for sequence analysis, which is freely available under http://www.seqan.de/. Subsequently, I experimentally evaluate all implemented methods, with the aim of guiding the engineering of new sequence alignment software. To the best of my knowledge, this is the first study providing a comprehensive exposition, implementation and evaluation of such methods. In the second part of this thesis, I turn to the engineering and evaluation of read mapping tools. First, I present a novel method to find all mapping locations per read within a user-defined error rate; this method is published in the peer-reviewed journal Nucleic Acids Research and packaged in an open source tool nicknamed Masai. Afterwards, I generalize this method to quickly report all co-optimal or suboptimal mapping locations per read within a user-defined error rate; this method, packaged in a tool called Yara, provides a more practical, yet sound solution to the read mapping problem. Extensive evaluations, both on simulated and real datasets, show that Yara has better speed and accuracy than de-facto standard read mapping tools.

22 citations


Book ChapterDOI
02 Mar 2015
TL;DR: A new algorithm for approximate circular string matching under the edit distance model with optimal average-case search time \(\mathcal {O}(n(k + \log m) /m)\).
Abstract: Approximate string matching is the problem of finding all factors of a text \(t\) of length \(n\) that are at a distance at most \(k\) from a pattern \(x\) of length \(m\). Approximate circular string matching is the problem of finding all factors of \(t\) that are at a distance at most \(k\) from \(x\) or from any of its rotations. In this article, we present a new algorithm for approximate circular string matching under the edit distance model with optimal average-case search time \(\mathcal {O}(n(k + \log m) /m)\). Optimal average-case search time can also be achieved by the algorithms for multiple approximate string matching (Fredriksson and Navarro, 2004) using \(x\) and its rotations as the set of multiple patterns. Here we reduce the preprocessing time and space requirements compared to that approach.
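A brute-force baseline, useful for checking an implementation and nowhere near the optimal average-case bound above: run the classic Sellers-style dynamic program once per rotation of the pattern and collect end positions of matching factors.

```python
# Naive reference for approximate circular matching: one semi-global DP
# per rotation, O(m^2 * n) overall, versus the paper's optimal
# average-case O(n(k + log m)/m) search time.

def approx_end_positions(pattern: str, text: str, k: int) -> set:
    """End positions in text of factors with edit distance <= k to pattern."""
    prev = [0] * (len(text) + 1)                 # factors may start anywhere
    for i, c in enumerate(pattern, 1):
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, 1):
            cost = 0 if c == t else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return {j for j, d in enumerate(prev) if d <= k}

def circular_end_positions(pattern: str, text: str, k: int) -> set:
    rotations = {pattern[r:] + pattern[:r] for r in range(len(pattern))}
    return set().union(*(approx_end_positions(r, text, k) for r in rotations))

# "cba" is a rotation of "acb" and is one substitution away from "cda".
assert 6 in circular_end_positions("acb", "xxxcda", 1)
```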

Journal ArticleDOI
TL;DR: This article proposes an effective online algorithm, named SETA (SubnETtree for sAp), based on the subnettree structure (a Nettree is an extension of a tree with multi-parents and multi-roots) and shows the completeness of the algorithm.
Abstract: Pattern matching with gap constraints is an essential problem in computer science, with applications such as music information retrieval and sequential pattern mining. One case is called loose matching, which only considers the matching position of the last pattern substring in the sequence. A more challenging problem considers the matching positions of each character in the sequence; this is called strict pattern matching, one of the essential tasks of sequential pattern mining with gap constraints. Some strict pattern matching algorithms were designed to handle pattern mining tasks, since strict pattern matching can be used to compute the frequency of patterns occurring in a given sequence, from which the frequent patterns can be derived. In this article, we address a more general strict approximate pattern matching with Hamming distance, named SAP (Strict Approximate Pattern matching with general gaps and length constraints), where the gap constraints can be negative. We show that a SAP instance can be transformed into an exponential number of instances of exact pattern matching with general gaps. Hence, we propose an effective online algorithm, named SETA (SubnETtree for sAp), based on the subnettree structure (a Nettree is an extension of a tree with multiple parents and multiple roots) and show the completeness of the algorithm. The space and time complexities of the algorithm are O(m × Maxlen × W × d) and O(Maxlen × W × m^2 × n × d), respectively, where m, Maxlen, W, and d are the length of pattern P, the maximal length constraint, the maximal gap length of pattern P and the approximate threshold. Extensive experimental results validate the correctness and effectiveness of SETA.
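The frequency-counting core of strict pattern matching is small enough to sketch. This hedged version handles nonnegative gaps only; SETA's general (possibly negative) gaps, length constraints, and Nettree machinery are the paper's contribution and are not reproduced.

```python
# occ[i][j] counts the strict matchings of pattern[:i+1] whose character
# pattern[i] lands exactly on s[j], respecting per-step gap bounds.

def count_occurrences(s, pattern, gaps):
    """gaps[i] = (lo, hi): allowed gap between pattern[i] and pattern[i+1]."""
    n, m = len(s), len(pattern)
    occ = [[0] * n for _ in range(m)]
    for j in range(n):
        occ[0][j] = 1 if s[j] == pattern[0] else 0
    for i in range(1, m):
        lo, hi = gaps[i - 1]
        for j in range(n):
            if s[j] != pattern[i]:
                continue
            # previous character at jp, with gap j - jp - 1 in [lo, hi]
            occ[i][j] = sum(occ[i - 1][jp]
                            for jp in range(max(0, j - hi - 1), j - lo))
    return sum(occ[m - 1])

# 'a' then 'b' with 0 or 1 characters between them in "aabb": 3 matchings.
assert count_occurrences("aabb", "ab", [(0, 1)]) == 3
```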

Journal ArticleDOI
TL;DR: The theoretical results are validated by an empirical study with real-world data, showing that the proposed optimal O(n) time and space algorithm, which can find an SUS for every location of a string of size n, is at least 8 times faster and uses at least 20 times less memory.

Book ChapterDOI
01 Sep 2015
TL;DR: Practical solutions for the exact order-preserving matching problem to find all the substrings of a text T which have the same length and relative order as a pattern P are presented.
Abstract: The exact order-preserving matching problem is to find all the substrings of a text T which have the same length and relative order as a pattern P. Like string matching, order-preserving matching can be generalized by allowing the match to be approximate. In approximate order-preserving matching, two strings match if they have the same relative order after removing up to k elements in the same positions in both strings. In this paper we present practical solutions for this problem. The methods are based on filtration, and one of them is the first sublinear solution on average. We show by practical experiments that the new solutions are fast and efficient.
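The underlying exact predicate — a window matches if it induces the pattern's ranking — can be written naively in a few lines (the filtration methods in the paper exist precisely to avoid recomputing ranks for every window):

```python
# Exact order-preserving matching by comparing rank fingerprints.

def rank_pattern(seq):
    """Relative-order fingerprint: rank of each element, ties by position."""
    order = sorted(range(len(seq)), key=lambda i: (seq[i], i))
    ranks = [0] * len(seq)
    for r, i in enumerate(order):
        ranks[i] = r
    return tuple(ranks)

def op_match(text, pattern):
    """All start positions whose window has the pattern's relative order."""
    m, fp = len(pattern), rank_pattern(pattern)
    return [i for i in range(len(text) - m + 1)
            if rank_pattern(text[i:i + m]) == fp]

# 10 < 30 with 20 in between has the same shape (low, high, mid) as 1, 5, 3.
assert op_match([10, 30, 20, 40, 25, 35], [1, 5, 3]) == [0, 2]
```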

Journal ArticleDOI
TL;DR: This paper presents a novel approach to word spotting using text line decomposition into character primitives and string matching, and shows that the method is robust for searching text in noisy documents.

01 Jan 2015
TL;DR: It is found that combinations of fuzzy matching metrics outperform single metrics and that the best-scoring combination is a non-linear combination of the different metrics the authors have tested.
Abstract: The concept of fuzzy matching in translation memories can take place using linguistically aware or unaware methods, or a combination of both. We designed a flexible and time-efficient framework which applies and combines linguistically unaware or aware metrics in the source and target language. We measure the correlation of fuzzy matching metric scores with the evaluation score of the suggested translation to find out how well the usefulness of a suggestion can be predicted, and we measure the difference in recall between fuzzy matching metrics by looking at the improvements in mean TER as the match score decreases. We found that combinations of fuzzy matching metrics outperform single metrics and that the best-scoring combination is a non-linear combination of the different metrics we have tested.

Book ChapterDOI
09 Dec 2015
TL;DR: In this article, a generic in-place framework was proposed to solve both the exact and approximate k-mismatch SUS finding, using the minimum 2n memory words plus n bytes space, where n is the input string size.
Abstract: We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology. We design a generic in-place framework that solves both the exact and approximate k-mismatch SUS finding, using the minimum 2n memory words plus n bytes of space, where n is the input string size. Using the in-place framework, we can find the exact and approximate k-mismatch SUS for every string position in a total of O(n) and \(O(n^2)\) time, respectively, regardless of the value of k. Our framework does not involve any compressed or succinct data structures and thus is practical and easy to implement.
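A naive reference implementation of the exact per-position SUS makes the problem statement concrete; it is roughly cubic, in stark contrast to the paper's O(n) in-place framework, and exists only as a correctness baseline.

```python
# For each position i, a shortest substring covering i occurring exactly
# once in s. Naive: count all substrings, then scan lengths outward.

from collections import Counter

def sus_per_position(s):
    n = len(s)
    counts = Counter(s[i:j] for i in range(n) for j in range(i + 1, n + 1))
    out = []
    for i in range(n):
        best = None
        for length in range(1, n + 1):
            # every window of this length that covers position i
            for start in range(max(0, i - length + 1), min(i, n - length) + 1):
                if counts[s[start:start + length]] == 1:
                    best = s[start:start + length]
                    break
            if best:
                break
        out.append(best)
    return out

assert sus_per_position("abab") == ["aba", "ba", "ba", "bab"]
```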

Proceedings ArticleDOI
01 Dec 2015
TL;DR: A new modified Cross-Language Levenshtein Distance (CLLD) algorithm that supports matching names across different writing scripts with many-to-many character mapping, and a hybrid cross-language name matching technique that mixes a phonetic matching technique with the proposed CLLD algorithm to improve the overall f-measure and speed up the matching process.
Abstract: Name matching is a key component in various applications such as record linkage and data mining. This process suffers from multiple complexities, such as matching data from different languages or data written by people from different cultures. In this paper, we present a new modified Cross-Language Levenshtein Distance (CLLD) algorithm that supports matching names across different writing scripts and with many-to-many character mapping. In addition, we present a hybrid cross-language name matching technique that uses a phonetic matching technique mixed with our proposed CLLD algorithm to improve the overall f-measure and speed up the matching process. Our experiments demonstrate that this method substantially outperforms a number of well-known standard phonetic and approximate string similarity methods in terms of precision, recall, and f-measure.
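A hedged sketch of the underlying idea only (not the authors' exact CLLD algorithm): a Levenshtein-style dynamic program whose transitions can also consume multi-character segments on either side at zero cost, driven by a cross-script mapping table. The mapping pairs below are illustrative assumptions.

```python
# Edit distance with many-to-many mapped segments: a mapping entry says a
# source segment may be transliterated as a target segment for free.

MAPPING = {("kh", "x"), ("ph", "f"), ("ts", "c")}   # illustrative pairs

def cross_script_distance(a: str, b: str) -> int:
    la, lb = len(a), len(b)
    INF = la + lb + 1
    d = [[INF] * (lb + 1) for _ in range(la + 1)]
    d[0][0] = 0
    for i in range(la + 1):          # forward DP: push costs to successors
        for j in range(lb + 1):
            if d[i][j] == INF:
                continue
            if i < la:                               # delete a[i]
                d[i + 1][j] = min(d[i + 1][j], d[i][j] + 1)
            if j < lb:                               # insert b[j]
                d[i][j + 1] = min(d[i][j + 1], d[i][j] + 1)
            if i < la and j < lb:                    # match / substitute
                cost = 0 if a[i] == b[j] else 1
                d[i + 1][j + 1] = min(d[i + 1][j + 1], d[i][j] + cost)
            for s, t in MAPPING:                     # mapped segment, free
                if a.startswith(s, i) and b.startswith(t, j):
                    ni, nj = i + len(s), j + len(t)
                    d[ni][nj] = min(d[ni][nj], d[i][j])
    return d[la][lb]

assert cross_script_distance("khalid", "xalid") == 0   # "kh" maps to "x"
assert cross_script_distance("khalid", "yalid") == 2   # no mapping applies
```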

Journal ArticleDOI
TL;DR: A model that represents a handwritten character as a string graph, improving recognition accuracy without relying on a normalisation technique; the similarity distance between graphs is measured using approximate subgraph matching and a string edit distance method.

Proceedings ArticleDOI
08 Dec 2015
TL;DR: The main contribution of this work is a memory-access-efficient GPU implementation for computing the ASM, called w-SCAN, which relies on warp shuffle for communication between threads without resorting to shared memory access.
Abstract: The approximate string matching (ASM) problem asks to find a substring of string Y of length n that is most similar to string X of length m. The ASM can be solved by the dynamic programming technique, which computes a table of size m × n. The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU. The key idea of our implementation relies on warp shuffle for communication between threads without resorting to shared memory access. Surprisingly, our implementation performs only O(mn/w) memory access operations, where w is the warp size, although O(mn) memory access operations are necessary to access all elements in the table of size m × n. Experimental results, carried out on a GeForce GTX 980 GPU, show that the proposed implementation, called w-SCAN, provides a speed-up factor of over 200 compared to a single-CPU implementation. Also, w-SCAN computes the ASM in less than 40% of the time required by another prominent alternative.
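The structure the warp exploits is visible even in scalar code: every cell on anti-diagonal d of the DP table depends only on diagonals d-1 and d-2, so a warp can compute a whole diagonal in lockstep, exchanging boundary cells through shuffles. A hedged Python rendering of just that dependency pattern (sequential, of course; the warp-shuffle part has no Python analogue):

```python
# Edit distance computed by anti-diagonals: diagonal d reads only
# diagonals d-1 and d-2, the property that enables lockstep GPU threads.

def edit_distance_antidiagonal(x: str, y: str) -> int:
    m, n = len(x), len(y)
    d2, d1 = None, None            # diagonals d-2 and d-1, keyed by row i
    for d in range(m + n + 1):
        curr = {}
        for i in range(max(0, d - n), min(m, d) + 1):
            j = d - i
            if i == 0:
                curr[i] = j                       # first row of the table
            elif j == 0:
                curr[i] = i                       # first column
            else:
                cost = 0 if x[i - 1] == y[j - 1] else 1
                curr[i] = min(d1[i] + 1,          # cell (i, j-1): insert
                              d1[i - 1] + 1,      # cell (i-1, j): delete
                              d2[i - 1] + cost)   # cell (i-1, j-1)
        d2, d1 = d1, curr
    return d1[m]

assert edit_distance_antidiagonal("kitten", "sitting") == 3
```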

Journal ArticleDOI
TL;DR: A new variant of Closest String where the input strings can contain wildcards that can match any letter in the alphabet, and the goal is to find a solution string without wildcards.

Journal ArticleDOI
04 May 2015-PLOS ONE
TL;DR: The experimental results show that the proposed string matching scheme can reduce the storage cost significantly compared to the previous bit-split string matching methods.
Abstract: This paper proposes a memory-efficient bit-split string matching scheme for deep packet inspection (DPI). When the number of target patterns becomes large, the memory requirements of the string matching engine become a critical issue. The proposed string matching scheme reduces the memory requirements using the uniqueness of the target patterns in the deterministic finite automaton (DFA)-based bit-split string matching. The pattern grouping extracts a set of unique patterns from the target patterns. In the set of unique patterns, a pattern is not the suffix of any other pattern. Therefore, in the DFA constructed with the set of unique patterns, only one pattern can be matched in an output state. In bit-split string matching, multiple finite-state machine (FSM) tiles with several input bit groups are adopted in order to reduce the number of stored state transitions. However, the memory requirements for storing the matching vectors can be large because each bit in the matching vector is used to identify whether its own pattern is matched or not. In our research, the proposed pattern grouping is applied to the multiple FSM tiles in the bit-split string matching. For the set of unique patterns, the memory-based bit-split string matching engine stores only the pattern match index for each state to indicate the match with its own unique pattern. Therefore, the memory requirements are significantly decreased by not storing the matching vectors in the string matchers for the set of unique patterns. The experimental results show that the proposed string matching scheme can reduce the storage cost significantly compared to the previous bit-split string matching methods.
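The pattern-grouping step is simple to sketch: processing patterns longest-first, keep a pattern in the current group unless it is a suffix of one already kept; the leftover patterns would seed further groups, which the full engine also builds. This is only the grouping predicate, not the FSM-tile construction.

```python
# Extract one group of "unique" patterns: within the group, no pattern is
# a proper suffix of another, so a DFA output state matches one pattern.

def split_unique_group(patterns):
    pats = sorted(set(patterns), key=len, reverse=True)
    group, rest = [], []
    for p in pats:
        if any(q.endswith(p) for q in group):
            rest.append(p)      # suffix of a kept pattern: defer to next group
        else:
            group.append(p)
    return group, rest

# "evil" is a suffix of "devil", so it cannot share a group with it.
group, rest = split_unique_group(["devil", "evil", "virus"])
assert set(group) == {"devil", "virus"} and rest == ["evil"]
```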

Journal ArticleDOI
TL;DR: This paper proposes INSPIRE, a general framework that adopts a unifying strategy for processing different variants of spatial keyword queries, using the autocompletion paradigm to generate an initial query as a prefix matching query.
Abstract: Geo-textual data are generated in abundance. Recent studies have focused on the processing of spatial keyword queries, which retrieve objects that match certain keywords within a spatial region. To ensure effective retrieval, various extensions have been made, including allowing errors in keyword matching and autocompletion using prefix matching. In this paper, we propose INSPIRE, a general framework which adopts a unifying strategy for processing different variants of spatial keyword queries. We adopt the autocompletion paradigm that generates an initial query as a prefix matching query. If there are few matching results, other variants are performed as a form of relaxation that reuses the processing done in the earlier phase. The types of relaxation allowed include spatial region expansion and exact/approximate prefix/substring matching. Moreover, since the autocompletion paradigm allows appending characters after the initial query, we look at how query processing done for the initial query and relaxation can be reused in such instances. Compared to existing works which process variants of spatial keyword query as new queries over different indexes, our approach offers a more compelling way to efficient and effective spatial keyword search. Extensive experiments substantiate our claims.

Patent
21 Oct 2015
TL;DR: A fuzzy word segmentation based method for the automatic proofreading of Chinese non-multi-character word errors is presented.
Abstract: The invention discloses a fuzzy word segmentation based non-multi-character word error automatic proofreading method. According to the method, accurate segmentation is carried out based on a correct word dictionary and a wrong character word dictionary to generate a word graph; then the similarity of Chinese word strings is calculated by utilizing a fuzzy matching algorithm, accurately segmented disperse strings are subjected to fuzzy matching, and a fuzzy matching result is added into the word graph to form a fuzzy word graph; and finally a shortest path of the fuzzy word graph is calculated by utilizing a binary model of words in combination with similarity, so that automatic proofreading of Chinese non-multi-character word errors is realized. According to the fuzzy word segmentation based non-multi-character word error automatic proofreading method provided by the invention, the system response is quick, the precision meets actual application demands, and the effectiveness and the accuracy are high.

Patent
28 Oct 2015
TL;DR: A step-by-step progressive address matching method is adopted, comprising four steps: fast matching, longitude and latitude matching, fuzzy matching, and manual judgment.
Abstract: The invention discloses an address matching method that adopts a step-by-step progressive matching approach. The method comprises four steps: fast matching, longitude and latitude matching, fuzzy matching, and manual judgment. In the fast matching step, high-quality target addresses are subjected to precise matching, and a chain-type complementary mechanism is used for proper complementary matching. In the longitude and latitude matching step, the target addresses and adjacent cells are matched according to longitude and latitude information provided by a map service provider. In the fuzzy matching step, a fuzzy index is used for matching the target addresses with similar cells. A manual judgment mechanism is used for checking and controlling the matching result. The address matching method also comprises an address word segmentation technique and a confidence index mechanism for address matching accuracy. The method has the advantages that matching efficiency is improved while a high matching success rate is ensured; the problem of combining multiple address matching techniques in one application is solved; the success rate and fault tolerance of address matching are improved to a great degree; and a series of optimization mechanisms ensure program running efficiency.

Journal ArticleDOI
TL;DR: An expansion-based framework to measure string similarities efficiently while considering synonyms is presented, and an estimator of candidate-set size is developed to enable an online selection of signature filters, providing strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs.
Abstract: A string-similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered to be similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, for example, the number of common words or q-grams. While this is indeed an indicator of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, "Bill" is a short form of "William," and "Database Management Systems" can be abbreviated as "DBMS." Given a collection of predefined synonyms, the purpose of this article is to explore such existing knowledge to effectively evaluate the similarity between two strings and efficiently perform similarity searches and joins, thereby boosting the quality of approximate string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. We then study efficient algorithms for similarity searches and joins by proposing two novel indexes, called SI-trees and QP-trees, which combine signature-filtering and length-filtering strategies. In order to improve the efficiency of our algorithms, we develop an estimator of candidate-set size to enable an online selection of signature filters. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the experimental results from a comprehensive study of the algorithms with three real datasets verify the effectiveness and efficiency of our approaches.
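A hedged sketch of the expansion idea only (the SI-tree/QP-tree indexes and the estimator are the paper's contributions): rewrite each string under the applicable synonym rules and score the best resulting pair. The synonym table and the scorer are illustrative assumptions.

```python
# Expansion-based similarity: generate all synonym rewritings of each
# string (exponential in applicable rules; fine for a sketch) and take
# the best pairwise score.

import difflib
from itertools import product

SYNONYMS = {"bill": "william", "dbms": "database management systems"}

def expansions(s: str):
    tokens = s.lower().split()
    options = [(t, SYNONYMS[t]) if t in SYNONYMS else (t,) for t in tokens]
    return {" ".join(choice) for choice in product(*options)}

def expanded_similarity(a: str, b: str) -> float:
    return max(difflib.SequenceMatcher(None, x, y).ratio()
               for x in expansions(a) for y in expansions(b))

assert expanded_similarity("Bill Gates", "William Gates") == 1.0
```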

Journal ArticleDOI
TL;DR: This work introduces a new data structure called dB-hash, which maintains the high sensitivity and speed of its (hash-based) predecessor ERNE, while drastically reducing space consumption and can attain good performances and accuracy with a memory footprint comparable to that of the most popular compressed indexes.
Abstract: The high throughput of modern NGS sequencers, coupled with the huge sizes of the genomes currently analysed, poses ever higher algorithmic challenges to aligning short reads quickly and accurately against a reference sequence. A crucial additional requirement is that the data structures used should be light. The available modern solutions are usually a compromise between these constraints: in particular, indexes based on the Burrows-Wheeler transform offer reduced memory requirements at the price of lower sensitivity, while hash-based text indexes guarantee high sensitivity at the price of significant memory consumption. In this work we describe a technique that attains the advantages granted by both classes of indexes. This is achieved using Hamming-aware hash functions (hash functions designed to search the entire Hamming sphere in reduced time) which are also homomorphisms on de Bruijn graphs. We show that, using this particular class of hash functions, the corresponding hash index can be represented in linear space, introducing only a logarithmic slowdown (in the query length) for the lookup operation. We point out that our data structure reaches its goals without compressing its input: another positive feature, as in biological applications data is often very close to incompressible. The new data structure introduced in this work is called dB-hash, and we show how its implementation, BW-ERNE, maintains the high sensitivity and speed of its (hash-based) predecessor ERNE while drastically reducing space consumption. Extensive comparison experiments conducted with several popular alignment tools on both simulated and real NGS data show, finally, that BW-ERNE is able to attain both the positive features of succinct data structures (that is, small space) and hash indexes (that is, sensitivity). In applications where space and speed are both a concern, standard methods often sacrifice accuracy to obtain competitive throughputs and memory footprints. In this work we show that, by combining hashing and succinct indexing techniques, we can attain good performance and accuracy with a memory footprint comparable to that of the most popular compressed indexes.
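A toy illustration of what "Hamming-aware" buys (this is not the paper's dB-hash, which is additionally a de Bruijn-graph homomorphism): if the hash is GF(2)-linear in the encoded string, the hash of any string at Hamming distance 1 from the query equals the query's hash XORed with a precomputable error term, so the Hamming sphere can be probed through a few lookups instead of being enumerated. The encoding and hash width are assumptions of this sketch.

```python
# A GF(2)-linear toy hash: fold-XOR of the 2-bit DNA encoding. Linearity
# means h(x with one substitution) = h(x) ^ (code_old ^ code_new) << shift.

BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
B = 8  # hash width in bits

def h(seq: str) -> int:
    v = 0
    for i, c in enumerate(seq):
        v ^= BITS[c] << (2 * i % B)   # fold each 2-bit code into B bits
    return v

def hamming1_hashes(q: str):
    """Hashes of all strings at Hamming distance exactly 1 from q."""
    out = set()
    for i, c in enumerate(q):
        for d in "ACGT":
            if d != c:
                out.add(h(q) ^ ((BITS[c] ^ BITS[d]) << (2 * i % B)))
    return out

q = "ACGTACGT"
assert h("CCGTACGT") in hamming1_hashes(q)   # one substitution at position 0
```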

Book ChapterDOI
01 Jan 2015
TL;DR: A hybrid text censoring method based on Bayesian Filtering and Approximate String Matching techniques is introduced; the results show that the Bayesian filtering technique can be used to filter profane words.
Abstract: Information obtained nowadays often contains malicious content. Malicious content such as profane words has to be censored, as it can influence the minds of the young and create hate among people. To censor profane words, this paper introduces a hybrid text censoring method based on Bayesian Filtering and Approximate String Matching techniques. The Bayesian filtering technique is used to detect the malicious content (profane words), while the Approximate String Matching technique is used to enhance the effectiveness of detecting profane words. The performance of the proposed system was evaluated using the metrics of Precision, Recall, F-measure and MAE. The results show that the Bayesian filtering technique can be used to filter profane words.
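The role approximate matching plays in such a hybrid can be sketched with the standard library: normalize obfuscated tokens back onto the word list before filtering, so deliberate misspellings cannot slip past. The word list, cutoff, and function names are placeholders, and the Bayesian scoring half is omitted here.

```python
# Map obfuscated/misspelled tokens to the closest known profane word,
# then censor; a plain exact-token filter would miss "badw0rd".

import difflib

PROFANE = ["badword", "curse"]          # placeholder word list

def normalize_token(token: str, cutoff: float = 0.8) -> str:
    """Replace a token with the closest profane word if similar enough."""
    hit = difflib.get_close_matches(token.lower(), PROFANE, n=1, cutoff=cutoff)
    return hit[0] if hit else token

def censor(text: str) -> str:
    return " ".join("****" if normalize_token(tok) in PROFANE else tok
                    for tok in text.split())

assert censor("you badw0rd here") == "you **** here"
```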

Proceedings ArticleDOI
27 Mar 2015
TL;DR: In this paper, a new string matching algorithm is proposed that matches the pattern starting from neither the left nor the right end, but from a special position, making it more flexible in picking the position at which comparisons start.
Abstract: String matching is of great importance in pattern recognition. We put forth a new string matching algorithm which matches the pattern starting from neither the left nor the right end, but from a special position. Compared with the Knuth-Morris-Pratt algorithm and the Boyer-Moore algorithm, the new algorithm is more flexible in picking the position at which comparisons start, and this flexibility brings a saving in cost. The method requires a statistical probability table for the alphabet, which can be maintained using evolution strategies under dynamic conditions. If the chosen lowlight character in a given pattern has probability λ, the length of the text is n and the length of the pattern is m, then we conjecture that the complexity of the new algorithm is Θ(n/λm).
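A hedged sketch of the idea (not the authors' exact algorithm): anchor the search at the pattern character that is rarest under the text's character distribution and verify the full pattern around each hit; with anchor probability λ, roughly λn verifications of cost at most m are expected. The probability table here is an illustrative assumption.

```python
# Anchor the scan on the pattern's rarest character, then verify around it.

def rare_anchor_search(text: str, pattern: str, char_prob):
    """char_prob maps a character to its assumed frequency in the text."""
    anchor = min(range(len(pattern)), key=lambda i: char_prob(pattern[i]))
    c = pattern[anchor]
    start = text.find(c)
    while start != -1:
        pos = start - anchor                   # implied pattern start
        if pos >= 0 and text.startswith(pattern, pos):
            yield pos
        start = text.find(c, start + 1)

freq = {"z": 0.001, "e": 0.12}.get             # toy probability table
hits = list(rare_anchor_search("we zebra here zebra", "zebra",
                               lambda ch: freq(ch, 0.05)))
assert hits == [3, 14]
```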

01 Sep 2015
TL;DR: An innovative approach to match sentences having different words but the same meaning is presented, using NooJ to create paraphrases of Support Verb Constructions of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM).
Abstract: Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. Those kind of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing retrieving of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems, across different text domains.