
Showing papers on "Approximate string matching published in 2013"


Journal ArticleDOI
TL;DR: This article addresses the online exact string matching problem which consists in finding all occurrences of a given pattern p in a text t and presents experimental results in order to bring order among the dozens of articles published in this area.
Abstract: This article addresses the online exact string matching problem which consists in finding all occurrences of a given pattern p in a text t. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, information retrieval, data compression, computational biology and chemistry. In the last decade more than 50 new algorithms have been proposed for the problem, which add up to a wide set of (almost 40) algorithms presented before 2000. In this article we review the string matching algorithms presented in the last decade and present experimental results in order to bring order among the dozens of articles published in this area.
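
As an illustration of the problem the survey covers, a minimal naive matcher that reports every occurrence of a pattern p in a text t is sketched below. It is only a baseline statement of the task; the function name and structure are illustrative and do not correspond to any specific algorithm reviewed in the article.

```python
def find_all_occurrences(pattern: str, text: str) -> list[int]:
    """Naive online exact matching: report every position where pattern occurs in text.

    Baseline illustration of the problem only; the surveyed algorithms
    (Boyer-Moore variants, bit-parallel methods, etc.) are far faster in practice.
    """
    m, n = len(pattern), len(text)
    occurrences = []
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            occurrences.append(i)
    return occurrences

# Example: occurrences of "ana" in "bananas" -> [1, 3]
print(find_all_occurrences("ana", "bananas"))
```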

167 citations


Proceedings ArticleDOI
08 Apr 2013
TL;DR: This paper proposes a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance, and develops a range-based method by grouping the pivotal entries to avoid duplicated computations.
Abstract: String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper we study the problem of top-k string similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find top-k answers. However, it is rather expensive to select an appropriate threshold. To address this problem, we propose a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance. We prune unnecessary entries in the dynamic-programming matrix and only compute those pivotal entries. We extend our techniques to support top-k similarity search. We develop a range-based method by grouping the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance, and significantly outperforms state-of-the-art approaches on real-world datasets.
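
The paper's progressive, pivotal-entry technique is not reproduced here, but its underlying building block, a dynamic-programming edit-distance computation that gives up early once a threshold must be exceeded, can be sketched as follows. The function name and the simple row-minimum early exit are illustrative assumptions, not the authors' exact algorithm.

```python
def edit_distance_within(s: str, t: str, tau: int) -> int | None:
    """Standard DP for edit distance, abandoning early once the distance must exceed tau.

    Returns the edit distance if it is <= tau, otherwise None. This is a plain
    thresholded DP, not the progressive pivotal-entry method of the paper.
    """
    if abs(len(s) - len(t)) > tau:
        return None
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i] + [0] * len(t)
        for j, ct in enumerate(t, 1):
            curr[j] = min(prev[j] + 1,                 # deletion
                          curr[j - 1] + 1,             # insertion
                          prev[j - 1] + (cs != ct))    # substitution / match
        if min(curr) > tau:   # row minima never decrease, so the result must exceed tau
            return None
        prev = curr
    return prev[-1] if prev[-1] <= tau else None

print(edit_distance_within("kitten", "sitting", 2))  # None, distance is 3 > 2
print(edit_distance_within("kitten", "sitting", 3))  # 3
```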

74 citations


Proceedings ArticleDOI
22 Jun 2013
TL;DR: An expansion-based framework to measure string similarities efficiently while considering synonyms is presented, and an estimator to approximate the size of candidates to enable an online selection of signature filters to further improve the efficiency.
Abstract: A string similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, e.g., number of common words or q-grams. While these are indeed indicators of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, "Bill" is a short form of "William". Given a collection of predefined synonyms, the purpose of the paper is to explore such existing knowledge to evaluate string similarity measures more effectively and efficiently, thereby boosting the quality of string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. Because using synonyms in similarity measures is, while expressive, computationally expensive (NP-hard), we propose an efficient algorithm, called selective-expansion, which guarantees the optimality in many real scenarios. We then study a novel indexing structure called SI-tree, which combines both signature and length filtering strategies, for efficient string similarity joins with synonyms. We develop an estimator to approximate the size of candidates to enable an online selection of signature filters to further improve the efficiency. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the results from an empirical study of the algorithms verify the effectiveness and efficiency of our approach.
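
To make the idea of expansion-based measurement concrete, the sketch below computes a Jaccard-style token similarity after applying a small synonym dictionary. It is a simplification for illustration only: the dictionary entries are hypothetical, and the code does not implement the paper's selective-expansion algorithm or the SI-tree index.

```python
# Illustrative synonym dictionary (hypothetical entries, not from the paper).
SYNONYMS = {"bill": {"william"}, "sam": {"samuel"}}

def expand(tokens: set[str]) -> set[str]:
    """Add all known synonyms of each token to the token set."""
    expanded = set(tokens)
    for tok in tokens:
        expanded |= SYNONYMS.get(tok, set())
        # also map back: add every key whose synonym set contains tok
        expanded |= {k for k, syns in SYNONYMS.items() if tok in syns}
    return expanded

def synonym_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the synonym-expanded token sets of two strings."""
    ta, tb = expand(set(a.lower().split())), expand(set(b.lower().split()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# "Bill Gates" and "William Gates" become identical after expansion.
print(synonym_jaccard("Bill Gates", "William Gates"))  # 1.0
```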

65 citations


Journal ArticleDOI
TL;DR: A novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings, and a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary are proposed.
Abstract: Similarity joins play an important role in many application areas, such as data integration and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering only strings that share a certain number of grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm, VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to our chunk-based method. We also design a greedy algorithm to automatically select a good chunking scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than alternative methods yet occupies less space.

50 citations


Journal ArticleDOI
TL;DR: This work investigates range queries augmented with a string similarity search predicate in both Euclidean space and road networks and proposes a novel exact method, RSASSOL, which significantly outperforms the baseline algorithm in practice.
Abstract: This work deals with approximate string search in large spatial databases. Specifically, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the subtree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the subtrees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to find balanced partitions using both the spatial and string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which significantly outperforms the baseline algorithm in practice. RSASSOL combines q-gram-based inverted lists with reference-node-based pruning. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approaches.
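
The min-wise signature idea can be illustrated with a small sketch: hash the q-gram set of a string under several seeded hash functions and keep each minimum, so that the fraction of agreeing minima estimates the set resemblance used for pruning. The hashing scheme below is a generic stand-in, not the exact signatures stored in the MHR-tree.

```python
import hashlib

def qgrams(s: str, q: int = 2) -> set[str]:
    """Set of overlapping q-grams of a string."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def minwise_signature(grams: set[str], num_hashes: int = 32) -> list[int]:
    """Min-wise signature: for each of num_hashes seeded hashes, keep the minimum value."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16)
                       for g in grams))
    return sig

def resemblance_estimate(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of positions where two signatures agree, estimating Jaccard resemblance."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sa = minwise_signature(qgrams("approximate"))
sb = minwise_signature(qgrams("approximately"))
print(resemblance_estimate(sa, sb))
```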

45 citations


Journal ArticleDOI
22 Oct 2013
TL;DR: The string analysis algorithm is implemented, and used to augment an industrial security analysis for Web applications by automatically detecting and verifying sanitizers—methods that eliminate malicious patterns from untrusted strings, making these strings safe to use in security-sensitive operations.
Abstract: We propose a novel technique for statically verifying the strings generated by a program. The verification is conducted by encoding the program in Monadic Second-order Logic (M2L). We use M2L to describe constraints among program variables and to abstract built-in string operations. Once we encode a program in M2L, a theorem prover for M2L, such as MONA, can automatically check if a string generated by the program satisfies a given specification, and if not, exhibit a counterexample. With this approach, we can naturally encode relationships among strings, accounting also for cases in which a program manipulates strings using indices. In addition, our string analysis is path sensitive in that it accounts for the effects of string and Boolean comparisons, as well as regular-expression matches. We have implemented our string analysis algorithm, and used it to augment an industrial security analysis for Web applications by automatically detecting and verifying sanitizers—methods that eliminate malicious patterns from untrusted strings, making these strings safe to use in security-sensitive operations. On the 8 benchmarks we analyzed, our string analyzer discovered 128 previously unknown sanitizers, compared to 71 sanitizers detected by a previously presented string analysis.

44 citations


Journal ArticleDOI
TL;DR: This article studies string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold, and proposes a new filter, called the segment filter.
Abstract: As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this article, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a new filter, called the segment filter. We partition a string into a set of segments and use the segments as a filter to find similar string pairs. We first create inverted indices for the segments. Then for each string, we select some of its substrings, identify the selected substrings from the inverted indices, and take strings on the inverted lists of the found substrings as candidates of this string. Finally, we verify the candidates to generate the final answer. We devise efficient techniques to select substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidates. We also extend our techniques to support normalized edit distance. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real-world datasets.
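
The segment-filter intuition rests on a pigeonhole argument: if one string is split into tau + 1 non-overlapping segments and another string is within edit distance tau of it, at least one segment must appear in the other string unchanged. The sketch below shows that filtering idea in its simplest form; it omits the paper's inverted indexes, substring-selection optimization, and verification step, and the function names are illustrative.

```python
def segments(s: str, tau: int) -> list[str]:
    """Split s into tau + 1 nearly equal, non-overlapping segments."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    segs, pos = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        segs.append(s[pos:pos + length])
        pos += length
    return segs

def passes_segment_filter(s: str, t: str, tau: int) -> bool:
    """Candidate test: t can be within edit distance tau of s only if at least one
    of the tau + 1 segments of s occurs verbatim as a substring of t."""
    return any(seg and seg in t for seg in segments(s, tau))

# "similarity" vs "simlarity" with tau = 1: the segment "arity" survives the edit.
print(passes_segment_filter("similarity", "simlarity", 1))  # True
```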

38 citations


Proceedings ArticleDOI
05 Nov 2013
TL;DR: A fuzzy match algorithm using machine learning (SVM) that checks both for approximate spelling and approximate geocoding in order to find duplicates between the crowd-sourced tags and the gazetteer, in an effort to absorb those tags that are novel.
Abstract: Geographical knowledge resources or gazetteers that are enriched with local information have the potential to add geographic precision to information retrieval. We have identified sources of novel local gazetteer entries in crowd-sourced OpenStreetMap and Wikimapia geotags that include geo-coordinates. We created a fuzzy match algorithm using machine learning (SVM) that checks both for approximate spelling and approximate geocoding in order to find duplicates between the crowd-sourced tags and the gazetteer, in an effort to absorb those tags that are novel. For each crowd-sourced tag, our algorithm generates candidate matches from the gazetteer and then ranks those candidates based on word form or geographical relations between each tag and gazetteer candidate. We compared a baseline of edit distance for candidate ranking to an SVM-trained candidate ranking model on a city-level location tag match task. Experiment results show that the SVM greatly outperforms the baseline.

32 citations


Journal ArticleDOI
TL;DR: This work develops two novel cache-conscious predicate evaluation techniques, namely, lazy and bitmap evaluations, that also exploit the underlying discrete and finite space to substantially reduce BE-Tree's matching time by up to 75%.
Abstract: BE-Tree is a novel dynamic data structure designed to efficiently index Boolean expressions over a high-dimensional discrete space. BE-Tree copes with both the high dimensionality and the expressiveness of Boolean expressions by introducing an effective two-phase space-cutting technique that specifically utilizes the discrete and finite domain properties of the space. Furthermore, BE-Tree employs self-adjustment policies to dynamically adapt the tree as the workload changes. Moreover, in BE-Tree, we develop two novel cache-conscious predicate evaluation techniques, namely, lazy and bitmap evaluations, that also exploit the underlying discrete and finite space to substantially reduce BE-Tree's matching time by up to 75%. BE-Tree is a general index structure for matching Boolean expressions which has a wide range of applications including (complex) event processing, publish/subscribe matching, emerging applications in cospaces, profile matching for targeted web advertising, and approximate string matching. Finally, the superiority of BE-Tree is proven through a comprehensive evaluation with state-of-the-art index structures designed for matching Boolean expressions.

31 citations


Journal ArticleDOI
TL;DR: The proposed syntactic string matching approach not only achieved significantly better performance in recognizing partially occluded faces, but also showed its ability to perform direct matching between sketch faces and photo faces, breaking the barrier that prevents string matching techniques from being used for addressing complex image recognition problems.
Abstract: In this paper, we present a syntactic string matching approach to solve the frontal face recognition problem. String matching is a powerful partial matching technique, but is not suitable for frontal face recognition due to its requirement of globally sequential representation and the complex nature of human faces, containing discontinuous and non-sequential features. Here, we build a compact syntactic Stringface representation, which is an ensemble of strings. A novel ensemble string matching approach that can perform non-sequential string matching between two Stringfaces is proposed. It is invariant to the sequential order of strings and the direction of each string. The embedded partial matching mechanism enables our method to automatically use every piece of non-occluded region, regardless of shape, in the recognition process. The encouraging results demonstrate the feasibility and effectiveness of using syntactic methods for face recognition from a single exemplar image per person, breaking the barrier that prevents string matching techniques from being used for addressing complex image recognition problems. The proposed method not only achieved significantly better performance in recognizing partially occluded faces, but also showed its ability to perform direct matching between sketch faces and photo faces.

29 citations


Proceedings ArticleDOI
17 Jun 2013
TL;DR: A systematic literature survey of 35 service matching approaches which consider fuzzy matching is performed, a classification is proposed, how different matching approaches can be combined into a comprehensive matching method is discussed, and future research challenges are identified.
Abstract: In the last decades, development turned from monolithic software products towards more flexible software components that can be provided on world-wide markets in form of services. Customers request such services or compositions of several services. However, in many cases, discovering the best services to address a given request is a tough challenge and requires expressive, gradual matching results, considering different aspects of a service description, e.g., inputs/outputs, protocols, or quality properties. Furthermore, in situations in which no service exactly satisfies the request, approximate matching which can deal with a certain amount of fuzziness becomes necessary. There is a wealth of service matching approaches, but it is not clear whether there is a comprehensive, fuzzy matching approach which addresses all these challenges. Although there are a few service matching surveys, none of them is able to answer this question. In this paper, we perform a systematic literature survey of 35 (out of 504) service matching approaches which consider fuzzy matching. Based on this survey, we propose a classification, discuss how different matching approaches can be combined into a comprehensive matching method, and identify future research challenges.

Journal ArticleDOI
TL;DR: A real-time variation of the elegant Crochemore-Perrin constant-space string matching algorithm that has a simple and efficient control structure that searches for complementary parts of the pattern whose simultaneous occurrence indicates an occurrence of the complete pattern.

Proceedings ArticleDOI
22 Jun 2013
TL;DR: Efficient algorithms are proposed for finding the top-k approximate substring matches with a given query string in a set of data strings, utilizing novel filtering techniques that take advantage of q-grams and the inverted q-gram indexes available.
Abstract: There is a wide range of applications that require querying a large database of texts to search for similar strings or substrings. Traditional approximate substring matching requests a user to specify a similarity threshold. Without top-k approximate substring matching, users have to repeatedly try different maximum distance thresholds when the proper threshold is unknown in advance. In our paper, we first propose efficient algorithms for finding the top-k approximate substring matches with a given query string in a set of data strings. To reduce the number of expensive distance computations, the proposed algorithms utilize our novel filtering techniques, which take advantage of q-grams and the inverted q-gram indexes available. We conduct extensive experiments with real-life data sets. Our experimental results confirm the effectiveness and scalability of our proposed algorithms.
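
A common building block behind such q-gram filters is the classical count filter: an occurrence of the query within k edit errors forces a minimum number of the query's q-grams to appear in the data string. The sketch below shows that candidate test in isolation; it is a generic Jokinen-Ukkonen-style filter, not the paper's specific top-k algorithm.

```python
from collections import Counter

def qgram_counter(s: str, q: int = 3) -> Counter:
    """Multiset of overlapping q-grams of a string."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def count_filter(query: str, data: str, k: int, q: int = 3) -> bool:
    """Candidate test: an occurrence of query within k edit errors inside data
    requires at least len(query) - q + 1 - k*q of the query's q-grams to appear
    in data (each edit destroys at most q q-grams)."""
    needed = len(query) - q + 1 - k * q
    if needed <= 0:
        return True  # the filter gives no pruning power for this k and q
    shared = sum((qgram_counter(query, q) & qgram_counter(data, q)).values())
    return shared >= needed

print(count_filter("approximate", "an approximately long text", k=1))  # True
```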

Journal ArticleDOI
TL;DR: This paper presents a new algorithm for the approximate string matching problem allowing for non-overlapping inversions which runs in O(nm) worst-case time and O(m^2) space, for a character sequence of size n and pattern of size m.

Journal ArticleDOI
TL;DR: This paper proposed a bit-parallel multiple approximate string matching algorithm and developed a GPU implementation which achieved a speedup of about 28 relative to single-threaded CPU code.

Journal ArticleDOI
TL;DR: This paper presents an algorithm running in O(nN lg(N/n)) time for computing the edit distance of two such strings under any rational scoring function, and an O(n^(2/3) N^(4/3)) time algorithm for arbitrary scoring functions.
Abstract: The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit distance between a pair of strings of total length O(N) in O(N^2) time. To this date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N^2) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nN lg(N/n)) time for computing the edit distance of these two strings under any rational scoring function, and an O(n^(2/3) N^(4/3)) time algorithm for arbitrary scoring functions. Our new result, while providing a speedup for compressible strings, does not surpass the quadratic time bound even in the worst-case scenario.
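
Since straight-line programs are the representation the paper builds on, a tiny illustration may help: an SLP is a grammar in which each nonterminal has exactly one non-recursive rule, so it derives exactly one string. The grammar below is a hypothetical example (a Fibonacci-like word), and the expander is only meant to show the representation, not the paper's edit-distance algorithms.

```python
def expand_slp(rules: dict[str, tuple], start: str) -> str:
    """Expand a straight-line program (each nonterminal has exactly one,
    non-recursive rule) into the single string it represents."""
    cache: dict[str, str] = {}

    def eval_symbol(sym: str) -> str:
        if sym not in rules:          # terminal character
            return sym
        if sym not in cache:
            cache[sym] = "".join(eval_symbol(s) for s in rules[sym])
        return cache[sym]

    return eval_symbol(start)

# Hypothetical SLP deriving the 13-character Fibonacci-like word "abaababaabaab":
rules = {
    "X1": ("a",),
    "X2": ("X1", "b"),    # "ab"
    "X3": ("X2", "X1"),   # "aba"
    "X4": ("X3", "X2"),   # "abaab"
    "X5": ("X4", "X3"),   # "abaababa"
    "X6": ("X5", "X4"),   # "abaababaabaab"
}
print(expand_slp(rules, "X6"))
```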

Posted Content
TL;DR: In this paper, the authors study strategies of approximate pattern matching that exploit bidirectional text indexes, extending and generalizing ideas of Lam et al. They introduce a formalism called search schemes to specify search strategies of this type, then develop a probabilistic measure for the efficiency of a search scheme, prove several combinatorial results on efficient search schemes, and finally, provide experimental computations supporting the superiority of their strategies.
Abstract: We study strategies of approximate pattern matching that exploit bidirectional text indexes, extending and generalizing ideas of Lam et al. We introduce a formalism, called search schemes, to specify search strategies of this type, then develop a probabilistic measure for the efficiency of a search scheme, prove several combinatorial results on efficient search schemes, and finally, provide experimental computations supporting the superiority of our strategies.

Proceedings ArticleDOI
26 Sep 2013
TL;DR: This paper shows an optimal parallel algorithm for approximate string matching on the HMM, implements it on a CUDA-enabled GPU, and shows that the GPU implementation attains a speedup factor of 66.1 over the single-CPU implementation.
Abstract: The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The approximate string matching (ASM) problem for two strings X and Y of length m and n is the task of finding a substring of Y most similar to X. The main contribution of this paper is to show an optimal parallel algorithm for the approximate string matching on the HMM and to implement it on a CUDA-enabled GPU. Our algorithm runs in O(n/w + mn/(dw) + nL/p + mnl/p) time on the HMM with d streaming processors, memory bandwidth w, global memory access latency L, and shared memory access latency l. Further, we implement our algorithm on a GeForce GTX 580 GPU and evaluate the performance. The experimental results show that the ASM of two strings of 1024 and 4M (=2^22) characters can be computed in 419.6 ms, while the sequential algorithm computes it in 27720 ms. Thus, our implementation on the GPU attains a speedup factor of 66.1 over the single-CPU implementation.
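
The sequential baseline for this flavor of approximate substring matching (find the substring of Y closest to X) is the classic dynamic program in which the first row is initialized to zero, so a match may begin anywhere in Y. A plain CPU sketch is shown below; the paper's parallel HMM/GPU algorithm is not reproduced.

```python
def best_approximate_match(x: str, y: str) -> int:
    """Smallest edit distance between x and any substring of y (Sellers-style DP).

    Row 0 is all zeros, so a candidate match may start at any position of y;
    the answer is the minimum of the last row. Sequential reference only.
    """
    prev = [0] * (len(y) + 1)          # matching may start anywhere in y
    for i in range(1, len(x) + 1):
        curr = [i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = min(prev[j] + 1,
                          curr[j - 1] + 1,
                          prev[j - 1] + (x[i - 1] != y[j - 1]))
        prev = curr
    return min(prev)

print(best_approximate_match("string", "a strange text"))  # 1 (e.g., "strang")
```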

Patent
03 Apr 2013
TL;DR: In this paper, a semantic fuzzy matching method consisting of extracting characteristics of the text identified by voice to obtain the characteristic data, carrying out named entity identification on characteristic data by a conditional random field (CRF) to find the key semantic categories of sentences, accurately matching the key semantics categories, performing fuzzy matching when the accurate match is failed, calculating the similarity of the keysemantic categories and the key words in the dictionary, selecting the words with largest similarity to replace the wrong semantic categories, and marking the categories.
Abstract: The embodiment of the invention provides a semantic fuzzy matching method. The method comprises the following steps of: extracting characteristics of the text identified by voice to obtain the characteristic data; carrying out named entity identification on the characteristic data by a conditional random field (CRF) to find the key semantic categories of sentences; and accurately matching the key semantic categories, performing fuzzy matching when the accurate match fails, calculating the similarity of the key semantic categories and the key words in the dictionary, selecting the key words with largest similarity to replace the key semantic categories, and marking the categories. By the method of the embodiment, the CRF is used for marking the sequence, and the key semantic categories in the query statement are initially marked and located; the fuzzy matching range is shortened; the similarity is calculated according to the domain dictionary; the dictionary entries with the largest similarity are used for replacing the wrong key semantic categories in the user query; the calculation amount is reduced; and the identifying speed is improved.

Proceedings ArticleDOI
27 Jun 2013
TL;DR: A system for privacy-preserving matching of strings is presented, which differs from existing systems by providing a deterministic approximation instead of an exact distance, which is efficient, non-interactive and does not involve a third party which makes it particularly suitable for cloud computing.
Abstract: Consider two parties who want to compare their strings, e.g., genomes, but do not want to reveal them to each other. We present a system for privacy-preserving matching of strings, which differs from existing systems by providing a deterministic approximation instead of an exact distance. It is efficient (linear complexity), non-interactive and does not involve a third party, which makes it particularly suitable for cloud computing. We extend our protocol such that it only reveals whether there is a match and not the exact distance. Further, an implementation of the system is evaluated and compared against current privacy-preserving string matching algorithms.

Journal ArticleDOI
TL;DR: The edit-distance between two strings is the smallest number of operations required to transform one string into the other.
Abstract: The edit-distance between two strings is the smallest number of operations required to transform one string into the other. The distance between languages L1 and L2 is the smallest edit-distance between a string in L1 and a string in L2.

Patent
15 Apr 2013
TL;DR: In this article, an optimized pattern matching rule for one or more respective pattern matching rules is derived from an original matching rule, which includes an extracted text string from the respective pattern-matching rule or a less complex pattern match than the original rule.
Abstract: Exemplary methods, apparatuses, and systems for parsing unstructured data with a plurality of pattern matching rules are disclosed. An optimized pattern matching rule for one or more respective pattern matching rules is derived from an original pattern matching rule. The optimized pattern matching rule includes an extracted text string from the respective pattern matching rule or a less complex pattern match than the respective pattern matching rule. If the extracted text string or pattern is determined not to match any of the data to be parsed, application of the original pattern matching rule is bypassed. The original pattern matching rule is applied when the one or more optimized pattern matching rules match the data.
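
The optimization described here, checking a cheap extracted literal before running the full pattern, can be illustrated with a hedged sketch. The rule names, literals, and regular expressions below are made up for illustration and are not taken from the patent.

```python
import re

# Hypothetical parsing rules: each has a full regex plus a literal string extracted from it.
RULES = [
    {"name": "apache_error", "literal": "[error]",
     "pattern": re.compile(r"\[error\]\s+\[client (\S+)\]")},
    {"name": "login_event", "literal": "Accepted password",
     "pattern": re.compile(r"Accepted password for (\w+)")},
]

def parse_line(line: str):
    """Apply each rule, but only run the (expensive) regex when the cheap
    extracted literal is present; otherwise the original rule is bypassed."""
    for rule in RULES:
        if rule["literal"] not in line:       # optimized pre-check
            continue
        match = rule["pattern"].search(line)  # original pattern matching rule
        if match:
            return rule["name"], match.group(1)
    return None

print(parse_line("sshd[1023]: Accepted password for alice from 10.0.0.2"))
# ('login_event', 'alice')
```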

Book ChapterDOI
02 Sep 2013
TL;DR: In this paper, an extension of this problem to unbalanced strings, so that some elements may not be covered by any block, was considered, and an efficient fixed-parameter algorithm for the parameters number k of blocks and maximum occurrence d of a letter in either string was presented.
Abstract: Motivated by the study of genome rearrangements, the NP-hard Minimum Common String Partition problem asks, given two strings, to split both strings into an identical set of blocks. We consider an extension of this problem to unbalanced strings, so that some elements may not be covered by any block. We present an efficient fixed-parameter algorithm for the parameters number k of blocks and maximum occurrence d of a letter in either string. We then evaluate this algorithm on bacterial genomes and synthetic data.

Patent
13 Feb 2013
TL;DR: A system and a method for carrying out search and matching of automobile component products by using a VIN (Vehicle Identification Number) are described.
Abstract: The invention discloses a system and a method for carrying out search and matching of automobile component products by using a VIN (Vehicle Identification Number). The system comprises a product definition configuration module, a VIN character string division module, a character string analysis module, an information matching module and a fuzzy search module. The product definition configuration module is used for predefining classes and adaptable automobile models of products in a database; the VIN character string division module is used for intercepting a plurality of character strings from the VIN according to a set rule; the character string analysis module is used for analyzing information corresponding to the vehicle represented by the VIN from the character strings obtained through division; the information matching module is used for carrying out analysis to obtain vehicle information and find out a product matched with the vehicle from a datasheet to be retrieved by virtue of a circular traverse function; and the fuzzy search module is used for finding out the final product needed by a user in combination with keyword fuzzy search specific to the name of the product itself. According to the invention, the specific product needed by the user can be searched quickly and precisely according to search words and the VIN of the user's vehicle.

Patent
11 Sep 2013
TL;DR: In this article, a smartphone address book fuzzy search method including preprocessing data of contacts recorded in a smartphone Address Book, acquiring spelling of names of the contacts and corresponding digital sequences of a thumb keyboard according to a spelling code table, writing the key information including the spellings, the digital sequences, phone numbers and the like into a memory, respectively matching different fields of contacts in the memory on the basis of three classifications.
Abstract: The invention discloses a smartphone address book fuzzy search method including preprocessing data of contacts recorded in a smartphone address book, acquiring spelling of names of the contacts and corresponding digital sequences of a thumb keyboard according to a spelling code table, writing the key information including the spellings, the digital sequences, phone numbers and the like into a memory while backing up the key information to a designed buffer, respectively matching different fields of the contacts in the memory on the basis of three classifications by judging whether including Chinese characters, letters or numbers according to the given search keywords, and finally realizing fuzzy matching of the keywords and the spellings of the names by a modified character string sequence matching algorithm. By the smartphone address book fuzzy search method, the address book can be searched globally according to optional keywords, the digital search of the thumb keyboard is supported, names, full spellings, shorthand spellings, part spellings, phone numbers, mailboxes of the contacts can be in fuzzy search, the search results can be output in a weighted ranking manner according to specific matching reasons, and accordingly, the smartphone address book fuzzy search method is a convenient and efficient diversified operating method for users.

Patent
18 Dec 2013
TL;DR: In this paper, a method and a device for matching based on voice recognition is described, which mainly includes determining character information, in an alphabetic form, converted from voice information; on the basis of a fuzzy matching strategy and according to alphabet, performing fuzzy matching to the converted character information of the character information stored in the alphabetic and Chinese character forms from a local database.
Abstract: The invention discloses a method and a device for matching based on voice recognition. The method mainly includes determining character information, in an alphabetic form, converted from voice information; on the basis of a fuzzy matching strategy and according to alphabet, performing fuzzy matching to the converted character information of the character information stored in the alphabetic and Chinese character forms from a local database. As the single complete matching strategy applied in the prior art is expanded to the strategy of performing the fuzzy matching to the converted character information in the alphabetic form according to alphabet, voice recognition rate of the character information acquired by conversion is increased effectively, and further, the efficiency of the voice recognition technology is improved.

Journal ArticleDOI
TL;DR: This paper provides an algorithm that also runs in deterministic time O(kN log M) but achieves a lower variance of min(M/k, M-c)(M-c)/k, which is essentially a factor of k smaller than in previous work.

Proceedings ArticleDOI
01 Jan 2013
TL;DR: This paper contributes to the recently started investigation of the computational complexity of the string morphism problem by studying it in the framework of parameterised complexity.
Abstract: Given a source string u and a target string w, to decide whether w can be obtained by applying a string morphism on u (i. e., uniformly replacing the symbols in u by strings) constitutes an NP-complete problem. For example, the target string w := baaba can be obtained from the source string u := aba, by replacing a and b in u by the strings ba and a, respectively. In this paper, we contribute to the recently started investigation of the computational complexity of the string morphism problem by studying it in the framework of parameterised complexity.
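
The decision problem can be made concrete with a small brute-force checker that tries every way of splitting the target string into images of the source symbols. It is exponential in the worst case (consistent with NP-completeness) and is shown only to illustrate the problem, not any algorithm from the paper; whether empty images are allowed is an assumption noted in the code.

```python
def is_morphic_image(u: str, w: str) -> bool:
    """Decide whether w can be obtained from u by a string morphism, i.e. by
    uniformly replacing each symbol of u with some (here: possibly empty) string."""
    def backtrack(i: int, j: int, mapping: dict[str, str]) -> bool:
        if i == len(u):
            return j == len(w)
        sym = u[i]
        if sym in mapping:                         # symbol already has an image
            img = mapping[sym]
            return w.startswith(img, j) and backtrack(i + 1, j + len(img), mapping)
        for end in range(j, len(w) + 1):           # try every candidate image w[j:end]
            mapping[sym] = w[j:end]
            if backtrack(i + 1, end, mapping):
                return True
            del mapping[sym]
        return False

    return backtrack(0, 0, {})

# The example from the abstract: "baaba" is obtained from "aba" via a -> "ba", b -> "a".
print(is_morphic_image("aba", "baaba"))  # True
```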

Journal ArticleDOI
TL;DR: A method for performing convolutions efficiently in a word RAM model of computation, having a word size of w = Ω(log n) bits (where n is the input size), is developed and applied to approximate string matching under Hamming distance.
Abstract: We develop a method for performing convolutions efficiently in a word RAM model of computation, having a word size of w = Ω(log n) bits, where n is the input size. The basic idea is to pack several elements of the input vector into a single computer word, effectively enabling parallel computation of convolutions. The technique is applied to approximate string matching under Hamming distance. The obtained algorithms are the fastest known. In particular, we reduce the complexity of the Amir et al. (2000) algorithm for k-mismatches from O(n√(k log k)) to O(n + n√(k/w) log k). Those algorithms impose some (not severe) limitation on the pattern length, m. We present another, albeit less efficient, technique based on word-level parallelism, which works without the pattern length limitation.
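
The word-packing idea can be illustrated in its simplest form: fit one flag per position into a machine word so that a few bitwise operations process many positions at once. The toy below counts Hamming mismatches between two equal-length strings this way; it only illustrates word-level parallelism and is not the convolution technique of the paper.

```python
def hamming_mismatches_packed(a: str, b: str) -> int:
    """Count positions where a and b differ, packing one flag bit per position
    into Python integers so comparisons reduce to a few bitwise operations."""
    assert len(a) == len(b)
    mismatch_bits = 0
    # For each character c, mark the positions where a has c but b does not.
    for c in set(a):
        mask_a = sum(1 << i for i, ch in enumerate(a) if ch == c)
        mask_b = sum(1 << i for i, ch in enumerate(b) if ch == c)
        mismatch_bits |= mask_a & ~mask_b
    return bin(mismatch_bits).count("1")

print(hamming_mismatches_packed("karolin", "kathrin"))  # 3
```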

Posted Content
TL;DR: A simple algorithm that computes the approximate entropy of a finite binary string of arbitrary length is designed, implemented, and successfully tested in the fields of Prime Number Theory, Human Vision, Cryptography, Random Number Generation and Quantitative Finance.
Abstract: We design, implement and test a simple algorithm which computes the approximate entropy of a finite binary string of arbitrary length. The algorithm uses a weighted average of the Shannon Entropies of the string and all but the last binary derivative of the string. We successfully test the algorithm in the fields of Prime Number Theory (where we prove explicitly that the sequence of prime numbers is not periodic), Human Vision, Cryptography, Random Number Generation and Quantitative Finance.
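
A hedged sketch of the described computation: take the Shannon entropy of the string and of each successive binary derivative (the XOR of adjacent bits), then combine them with a weighted average. The uniform weighting below is an assumption made for illustration; the paper's exact weights are not reproduced here.

```python
import math

def shannon_entropy(bits: str) -> float:
    """Shannon entropy (bits per symbol) of a binary string."""
    if not bits:
        return 0.0
    p1 = bits.count("1") / len(bits)
    return sum(-p * math.log2(p) for p in (p1, 1 - p1) if p > 0)

def binary_derivative(bits: str) -> str:
    """XOR of each pair of adjacent bits; one bit shorter than the input."""
    return "".join(str(int(a) ^ int(b)) for a, b in zip(bits, bits[1:]))

def approximate_entropy(bits: str) -> float:
    """Average (uniform weights: an illustrative assumption) of the Shannon entropies
    of the string and of all but the last of its binary derivatives."""
    entropies = []
    s = bits
    while len(s) > 1:                 # stop before the last (length-1) derivative
        entropies.append(shannon_entropy(s))
        s = binary_derivative(s)
    return sum(entropies) / len(entropies) if entropies else 0.0

print(approximate_entropy("0110100110010110"))  # Thue-Morse prefix as a test input
```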