
Showing papers on "Approximate string matching published in 2013"


Journal ArticleDOI
TL;DR: This article addresses the online exact string matching problem which consists in finding all occurrences of a given pattern p in a text t and presents experimental results in order to bring order among the dozens of articles published in this area.
Abstract: This article addresses the online exact string matching problem which consists in finding all occurrences of a given pattern p in a text t. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, information retrieval, data compression, computational biology and chemistry. In the last decade more than 50 new algorithms have been proposed for the problem, which add up to a wide set of (almost 40) algorithms presented before 2000. In this article we review the string matching algorithms presented in the last decade and present experimental results in order to bring order among the dozens of articles published in this area.
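
As an illustration of the problem the survey covers, a minimal naive matcher that reports every occurrence of a pattern p in a text t is sketched below. It is only a baseline statement of the task; the function name and structure are illustrative and do not correspond to any specific algorithm reviewed in the article.

```python
def find_all_occurrences(pattern: str, text: str) -> list[int]:
    """Naive online exact matching: report every position where pattern occurs in text.

    Baseline illustration of the problem only; the surveyed algorithms
    (Boyer-Moore variants, bit-parallel methods, etc.) are far faster in practice.
    """
    m, n = len(pattern), len(text)
    occurrences = []
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            occurrences.append(i)
    return occurrences

# Example: occurrences of "ana" in "bananas" -> [1, 3]
print(find_all_occurrences("ana", "bananas"))
```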

167 citations


Proceedings ArticleDOI
08 Apr 2013
TL;DR: This paper proposes a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance, and develops a range-based method by grouping the pivotal entries to avoid duplicated computations.
Abstract: String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper we study the problem of top-k string similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find top-k answers. However, it is rather expensive to select an appropriate threshold. To address this problem, we propose a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance. We prune unnecessary entries in the dynamic-programming matrix and only compute those pivotal entries. We extend our techniques to support top-k similarity search. We develop a range-based method by grouping the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance, and significantly outperforms state-of-the-art approaches on real-world datasets.
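
The paper's progressive, pivotal-entry technique is not reproduced here, but its underlying building block, a dynamic-programming edit-distance computation that gives up early once a threshold must be exceeded, can be sketched as follows. The function name and the simple row-minimum early exit are illustrative assumptions, not the authors' exact algorithm.

```python
def edit_distance_within(s: str, t: str, tau: int) -> int | None:
    """Standard DP for edit distance, abandoning early once the distance must exceed tau.

    Returns the edit distance if it is <= tau, otherwise None. This is a plain
    thresholded DP, not the progressive pivotal-entry method of the paper.
    """
    if abs(len(s) - len(t)) > tau:
        return None
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i] + [0] * len(t)
        for j, ct in enumerate(t, 1):
            curr[j] = min(prev[j] + 1,                 # deletion
                          curr[j - 1] + 1,             # insertion
                          prev[j - 1] + (cs != ct))    # substitution / match
        if min(curr) > tau:   # row minima never decrease, so the result must exceed tau
            return None
        prev = curr
    return prev[-1] if prev[-1] <= tau else None

print(edit_distance_within("kitten", "sitting", 2))  # None, distance is 3 > 2
print(edit_distance_within("kitten", "sitting", 3))  # 3
```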

74 citations


Proceedings ArticleDOI
22 Jun 2013
TL;DR: An expansion-based framework to measure string similarities efficiently while considering synonyms is presented, and an estimator to approximate the size of candidates to enable an online selection of signature filters to further improve the efficiency.
Abstract: A string similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, e.g., number of common words or q-grams. While these are indeed indicators of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, "Bill" is a short form of "William". Given a collection of predefined synonyms, the purpose of the paper is to explore such existing knowledge to evaluate string similarity measures more effectively and efficiently, thereby boosting the quality of string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. Because using synonyms in similarity measures is, while expressive, computationally expensive (NP-hard), we propose an efficient algorithm, called selective-expansion, which guarantees the optimality in many real scenarios. We then study a novel indexing structure called SI-tree, which combines both signature and length filtering strategies, for efficient string similarity joins with synonyms. We develop an estimator to approximate the size of candidates to enable an online selection of signature filters to further improve the efficiency. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the results from an empirical study of the algorithms verify the effectiveness and efficiency of our approach.
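
To make the idea of expansion-based measurement concrete, the sketch below computes a Jaccard-style token similarity after applying a small synonym dictionary. It is a simplification for illustration only: the dictionary entries are hypothetical, and the code does not implement the paper's selective-expansion algorithm or the SI-tree index.

```python
# Illustrative synonym dictionary (hypothetical entries, not from the paper).
SYNONYMS = {"bill": {"william"}, "sam": {"samuel"}}

def expand(tokens: set[str]) -> set[str]:
    """Add all known synonyms of each token to the token set."""
    expanded = set(tokens)
    for tok in tokens:
        expanded |= SYNONYMS.get(tok, set())
        # also map back: add every key whose synonym set contains tok
        expanded |= {k for k, syns in SYNONYMS.items() if tok in syns}
    return expanded

def synonym_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the synonym-expanded token sets of two strings."""
    ta, tb = expand(set(a.lower().split())), expand(set(b.lower().split()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# "Bill Gates" and "William Gates" become identical after expansion.
print(synonym_jaccard("Bill Gates", "William Gates"))  # 1.0
```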

65 citations


Journal ArticleDOI
TL;DR: A novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings, and a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary are proposed.
Abstract: Similarity joins play an important role in many application areas, such as data integration and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering only strings that share a certain number of grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm, VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to our chunk-based method. We also design a greedy algorithm to automatically select a good chunking scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than alternative methods yet occupies less space.

50 citations


Journal ArticleDOI
TL;DR: This work investigates range queries augmented with a string similarity search predicate in both Euclidean space and road networks and proposes a novel exact method, RSASSOL, which significantly outperforms the baseline algorithm in practice.
Abstract: This work deals with approximate string search in large spatial databases. Specifically, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the subtree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the subtrees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to find balanced partitions using both the spatial and string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which significantly outperforms the baseline algorithm in practice. RSASSOL combines q-gram-based inverted lists with reference-node-based pruning. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approaches.
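
The min-wise signature idea can be illustrated with a small sketch: hash the q-gram set of a string under several seeded hash functions and keep each minimum, so that the fraction of agreeing minima estimates the set resemblance used for pruning. The hashing scheme below is a generic stand-in, not the exact signatures stored in the MHR-tree.

```python
import hashlib

def qgrams(s: str, q: int = 2) -> set[str]:
    """Set of overlapping q-grams of a string."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def minwise_signature(grams: set[str], num_hashes: int = 32) -> list[int]:
    """Min-wise signature: for each of num_hashes seeded hashes, keep the minimum value."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16)
                       for g in grams))
    return sig

def resemblance_estimate(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of positions where two signatures agree, estimating Jaccard resemblance."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sa = minwise_signature(qgrams("approximate"))
sb = minwise_signature(qgrams("approximately"))
print(resemblance_estimate(sa, sb))
```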

45 citations


Journal ArticleDOI
22 Oct 2013
TL;DR: The string analysis algorithm is implemented, and used to augment an industrial security analysis for Web applications by automatically detecting and verifying sanitizers—methods that eliminate malicious patterns from untrusted strings, making these strings safe to use in security-sensitive operations.
Abstract: We propose a novel technique for statically verifying the strings generated by a program. The verification is conducted by encoding the program in Monadic Second-order Logic (M2L). We use M2L to describe constraints among program variables and to abstract built-in string operations. Once we encode a program in M2L, a theorem prover for M2L, such as MONA, can automatically check if a string generated by the program satisfies a given specification, and if not, exhibit a counterexample. With this approach, we can naturally encode relationships among strings, accounting also for cases in which a program manipulates strings using indices. In addition, our string analysis is path sensitive in that it accounts for the effects of string and Boolean comparisons, as well as regular-expression matches. We have implemented our string analysis algorithm, and used it to augment an industrial security analysis for Web applications by automatically detecting and verifying sanitizers—methods that eliminate malicious patterns from untrusted strings, making these strings safe to use in security-sensitive operations. On the 8 benchmarks we analyzed, our string analyzer discovered 128 previously unknown sanitizers, compared to 71 sanitizers detected by a previously presented string analysis.

44 citations


Journal ArticleDOI
TL;DR: This article studies string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold, and proposes a new filter, called the segment filter.
Abstract: As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this article, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a new filter, called the segment filter. We partition a string into a set of segments and use the segments as a filter to find similar string pairs. We first create inverted indices for the segments. Then for each string, we select some of its substrings, identify the selected substrings from the inverted indices, and take strings on the inverted lists of the found substrings as candidates of this string. Finally, we verify the candidates to generate the final answer. We devise efficient techniques to select substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidates. We also extend our techniques to support normalized edit distance. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real-world datasets.
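
The segment-filter intuition rests on a pigeonhole argument: if one string is split into tau + 1 non-overlapping segments and another string is within edit distance tau of it, at least one segment must appear in the other string unchanged. The sketch below shows that filtering idea in its simplest form; it omits the paper's inverted indexes, substring-selection optimization, and verification step, and the function names are illustrative.

```python
def segments(s: str, tau: int) -> list[str]:
    """Split s into tau + 1 nearly equal, non-overlapping segments."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    segs, pos = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        segs.append(s[pos:pos + length])
        pos += length
    return segs

def passes_segment_filter(s: str, t: str, tau: int) -> bool:
    """Candidate test: t can be within edit distance tau of s only if at least one
    of the tau + 1 segments of s occurs verbatim as a substring of t."""
    return any(seg and seg in t for seg in segments(s, tau))

# "similarity" vs "simlarity" with tau = 1: the segment "arity" survives the edit.
print(passes_segment_filter("similarity", "simlarity", 1))  # True
```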

38 citations


Proceedings ArticleDOI
05 Nov 2013
TL;DR: A fuzzy match algorithm using machine learning (SVM) that checks both for approximate spelling and approximate geocoding in order to find duplicates between the crowd-sourced tags and the gazetteer, in an effort to absorb those tags that are novel.
Abstract: Geographical knowledge resources or gazetteers that are enriched with local information have the potential to add geographic precision to information retrieval. We have identified sources of novel local gazetteer entries in crowd-sourced OpenStreetMap and Wikimapia geotags that include geo-coordinates. We created a fuzzy match algorithm using machine learning (SVM) that checks both for approximate spelling and approximate geocoding in order to find duplicates between the crowd-sourced tags and the gazetteer, in an effort to absorb those tags that are novel. For each crowd-sourced tag, our algorithm generates candidate matches from the gazetteer and then ranks those candidates based on word form or geographical relations between each tag and gazetteer candidate. We compared a baseline of edit distance for candidate ranking to an SVM-trained candidate ranking model on a city-level location tag match task. Experiment results show that the SVM greatly outperforms the baseline.

32 citations


Journal ArticleDOI
TL;DR: This work develops two novel cache-conscious predicate evaluation techniques, namely, lazy and bitmap evaluations, that also exploit the underlying discrete and finite space to substantially reduce BE-Tree's matching time by up to 75%.
Abstract: BE-Tree is a novel dynamic data structure designed to efficiently index Boolean expressions over a high-dimensional discrete space. BE-Tree copes with both the high dimensionality and the expressiveness of Boolean expressions by introducing an effective two-phase space-cutting technique that specifically utilizes the discrete and finite domain properties of the space. Furthermore, BE-Tree employs self-adjustment policies to dynamically adapt the tree as the workload changes. Moreover, in BE-Tree, we develop two novel cache-conscious predicate evaluation techniques, namely, lazy and bitmap evaluations, that also exploit the underlying discrete and finite space to substantially reduce BE-Tree's matching time by up to 75%. BE-Tree is a general index structure for matching Boolean expressions which has a wide range of applications including (complex) event processing, publish/subscribe matching, emerging applications in cospaces, profile matching for targeted web advertising, and approximate string matching. Finally, the superiority of BE-Tree is proven through a comprehensive evaluation with state-of-the-art index structures designed for matching Boolean expressions.

31 citations


Journal ArticleDOI
TL;DR: The proposed syntactic string matching approach not only achieved significantly better performance in recognizing partially occluded faces, but also showed its ability to perform direct matching between sketch faces and photo faces, breaking the barrier that prevents string matching techniques from being used for addressing complex image recognition problems.
Abstract: In this paper, we present a syntactic string matching approach to solve the frontal face recognition problem. String matching is a powerful partial matching technique, but is not suitable for frontal face recognition due to its requirement of globally sequential representation and the complex nature of human faces, containing discontinuous and non-sequential features. Here, we build a compact syntactic Stringface representation, which is an ensemble of strings. A novel ensemble string matching approach that can perform non-sequential string matching between two Stringfaces is proposed. It is invariant to the sequential order of strings and the direction of each string. The embedded partial matching mechanism enables our method to automatically use every piece of non-occluded region, regardless of shape, in the recognition process. The encouraging results demonstrate the feasibility and effectiveness of using syntactic methods for face recognition from a single exemplar image per person, breaking the barrier that prevents string matching techniques from being used for addressing complex image recognition problems. The proposed method not only achieved significantly better performance in recognizing partially occluded faces, but also showed its ability to perform direct matching between sketch faces and photo faces.

29 citations


Proceedings ArticleDOI
17 Jun 2013
TL;DR: A systematic literature survey of 35 service matching approaches which consider fuzzy matching is performed, a classification is proposed, how different matching approaches can be combined into a comprehensive matching method is discussed, and future research challenges are identified.
Abstract: In the last decades, development turned from monolithic software products towards more flexible software components that can be provided on world-wide markets in form of services. Customers request such services or compositions of several services. However, in many cases, discovering the best services to address a given request is a tough challenge and requires expressive, gradual matching results, considering different aspects of a service description, e.g., inputs/outputs, protocols, or quality properties. Furthermore, in situations in which no service exactly satisfies the request, approximate matching which can deal with a certain amount of fuzziness becomes necessary. There is a wealth of service matching approaches, but it is not clear whether there is a comprehensive, fuzzy matching approach which addresses all these challenges. Although there are a few service matching surveys, none of them is able to answer this question. In this paper, we perform a systematic literature survey of 35 (out of 504) service matching approaches which consider fuzzy matching. Based on this survey, we propose a classification, discuss how different matching approaches can be combined into a comprehensive matching method, and identify future research challenges.

Journal ArticleDOI
TL;DR: A real-time variation of the elegant Crochemore-Perrin constant-space string matching algorithm that has a simple and efficient control structure that searches for complementary parts of the pattern whose simultaneous occurrence indicates an occurrence of the complete pattern.

Proceedings ArticleDOI
22 Jun 2013
TL;DR: Efficient algorithms are proposed for finding the top-k approximate substring matches with a given query string in a set of data strings, utilizing novel filtering techniques that take advantage of q-grams and the inverted q-gram indexes available.
Abstract: There is a wide range of applications that require querying a large database of texts to search for similar strings or substrings. Traditional approximate substring matching requests a user to specify a similarity threshold. Without top-k approximate substring matching, users have to repeatedly try different maximum distance thresholds when the proper threshold is unknown in advance. In our paper, we first propose efficient algorithms for finding the top-k approximate substring matches with a given query string in a set of data strings. To reduce the number of expensive distance computations, the proposed algorithms utilize our novel filtering techniques, which take advantage of q-grams and the inverted q-gram indexes available. We conduct extensive experiments with real-life data sets. Our experimental results confirm the effectiveness and scalability of our proposed algorithms.
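
A common building block behind such q-gram filters is the classical count filter: an occurrence of the query within k edit errors forces a minimum number of the query's q-grams to appear in the data string. The sketch below shows that candidate test in isolation; it is a generic Jokinen-Ukkonen-style filter, not the paper's specific top-k algorithm.

```python
from collections import Counter

def qgram_counter(s: str, q: int = 3) -> Counter:
    """Multiset of overlapping q-grams of a string."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def count_filter(query: str, data: str, k: int, q: int = 3) -> bool:
    """Candidate test: an occurrence of query within k edit errors inside data
    requires at least len(query) - q + 1 - k*q of the query's q-grams to appear
    in data (each edit destroys at most q q-grams)."""
    needed = len(query) - q + 1 - k * q
    if needed <= 0:
        return True  # the filter gives no pruning power for this k and q
    shared = sum((qgram_counter(query, q) & qgram_counter(data, q)).values())
    return shared >= needed

print(count_filter("approximate", "an approximately long text", k=1))  # True
```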

Journal ArticleDOI
TL;DR: This paper presents a new algorithm for the approximate string matching problem allowing for non-overlapping inversions which runs in O(nm) worst-case time and O(m^2) space, for a character sequence of size n and pattern of size m.

Journal ArticleDOI
TL;DR: This paper proposed a bit-parallel multiple approximate string matching algorithm and developed a GPU implementation which achieved a speedup of about 28 relative to single-threaded CPU code.

Journal ArticleDOI
TL;DR: This paper presents an algorithm running in O(nN lg(N/n)) time for computing the edit distance of two such strings under any rational scoring function, and an O(n^(2/3) N^(4/3)) time algorithm for arbitrary scoring functions.
Abstract: The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit distance between a pair of strings of total length O(N) in O(N^2) time. To this date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N^2) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nN lg(N/n)) time for computing the edit distance of these two strings under any rational scoring function, and an O(n^(2/3) N^(4/3)) time algorithm for arbitrary scoring functions. Our new result, while providing a speedup for compressible strings, does not surpass the quadratic time bound even in the worst-case scenario.
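
Since straight-line programs are the representation the paper builds on, a tiny illustration may help: an SLP is a grammar in which each nonterminal has exactly one non-recursive rule, so it derives exactly one string. The grammar below is a hypothetical example (a Fibonacci-like word), and the expander is only meant to show the representation, not the paper's edit-distance algorithms.

```python
def expand_slp(rules: dict[str, tuple], start: str) -> str:
    """Expand a straight-line program (each nonterminal has exactly one,
    non-recursive rule) into the single string it represents."""
    cache: dict[str, str] = {}

    def eval_symbol(sym: str) -> str:
        if sym not in rules:          # terminal character
            return sym
        if sym not in cache:
            cache[sym] = "".join(eval_symbol(s) for s in rules[sym])
        return cache[sym]

    return eval_symbol(start)

# Hypothetical SLP deriving the 13-character Fibonacci-like word "abaababaabaab":
rules = {
    "X1": ("a",),
    "X2": ("X1", "b"),    # "ab"
    "X3": ("X2", "X1"),   # "aba"
    "X4": ("X3", "X2"),   # "abaab"
    "X5": ("X4", "X3"),   # "abaababa"
    "X6": ("X5", "X4"),   # "abaababaabaab"
}
print(expand_slp(rules, "X6"))
```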

Posted Content
TL;DR: In this paper, the authors study strategies of approximate pattern matching that exploit bidirectional text indexes, extending and generalizing ideas of Lam et al. They introduce a formalism called search schemes to specify search strategies of this type, then develop a probabilistic measure for the efficiency of a search scheme, prove several combinatorial results on efficient search schemes, and finally, provide experimental computations supporting the superiority of their strategies.
Abstract: We study strategies of approximate pattern matching that exploit bidirectional text indexes, extending and generalizing ideas of Lam et al. We introduce a formalism, called search schemes, to specify search strategies of this type, then develop a probabilistic measure for the efficiency of a search scheme, prove several combinatorial results on efficient search schemes, and finally, provide experimental computations supporting the superiority of our strategies.

Proceedings ArticleDOI
26 Sep 2013
TL;DR: This paper shows an optimal parallel algorithm for approximate string matching on the HMM, implements it on a CUDA-enabled GPU, and shows that the GPU implementation attains a speedup factor of 66.1 over the single-CPU implementation.
Abstract: The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The approximate string matching (ASM) problem for two strings X and Y of length m and n is the task of finding a substring of Y most similar to X. The main contribution of this paper is to show an optimal parallel algorithm for the approximate string matching on the HMM and to implement it on a CUDA-enabled GPU. Our algorithm runs in O(n/w + mn/(dw) + nL/p + mnl/p) time on the HMM with d streaming processors, memory bandwidth w, global memory access latency L, and shared memory access latency l. Further, we implement our algorithm on a GeForce GTX 580 GPU and evaluate the performance. The experimental results show that the ASM of two strings of 1024 and 4M (=2^22) characters can be computed in 419.6 ms, while the sequential algorithm computes it in 27720 ms. Thus, our implementation on the GPU attains a speedup factor of 66.1 over the single-CPU implementation.
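
The sequential baseline for this flavor of approximate substring matching (find the substring of Y closest to X) is the classic dynamic program in which the first row is initialized to zero, so a match may begin anywhere in Y. A plain CPU sketch is shown below; the paper's parallel HMM/GPU algorithm is not reproduced.

```python
def best_approximate_match(x: str, y: str) -> int:
    """Smallest edit distance between x and any substring of y (Sellers-style DP).

    Row 0 is all zeros, so a candidate match may start at any position of y;
    the answer is the minimum of the last row. Sequential reference only.
    """
    prev = [0] * (len(y) + 1)          # matching may start anywhere in y
    for i in range(1, len(x) + 1):
        curr = [i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = min(prev[j] + 1,
                          curr[j - 1] + 1,
                          prev[j - 1] + (x[i - 1] != y[j - 1]))
        prev = curr
    return min(prev)

print(best_approximate_match("string", "a strange text"))  # 1 (e.g., "strang")
```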

Patent
03 Apr 2013
TL;DR: In this paper, a semantic fuzzy matching method consisting of extracting characteristics of the text identified by voice to obtain the characteristic data, carrying out named entity identification on characteristic data by a conditional random field (CRF) to find the key semantic categories of sentences, accurately matching the key semantics categories, performing fuzzy matching when the accurate match is failed, calculating the similarity of the keysemantic categories and the key words in the dictionary, selecting the words with largest similarity to replace the wrong semantic categories, and marking the categories.
Abstract: The embodiment of the invention provides a semantic fuzzy matching method. The method comprises the following steps of: extracting characteristics of the text identified by voice to obtain the characteristic data; carrying out named entity identification on the characteristic data by a conditional random field (CRF) to find the key semantic categories of sentences; and accurately matching the key semantic categories, performing fuzzy matching when the accurate match fails, calculating the similarity of the key semantic categories and the key words in the dictionary, selecting the key words with largest similarity to replace the key semantic categories, and marking the categories. By the method of the embodiment, the CRF is used for marking the sequence, and the key semantic categories in the query statement are initially marked and located; the fuzzy matching range is shortened; the similarity is calculated according to the domain dictionary; the dictionary entries with the largest similarity are used for replacing the wrong key semantic categories in the user query; the calculation amount is reduced; and the identifying speed is improved.

Proceedings ArticleDOI
27 Jun 2013
TL;DR: A system for privacy-preserving matching of strings is presented, which differs from existing systems by providing a deterministic approximation instead of an exact distance, which is efficient, non-interactive and does not involve a third party which makes it particularly suitable for cloud computing.
Abstract: Consider two parties who want to compare their strings, e.g., genomes, but do not want to reveal them to each other. We present a system for privacy-preserving matching of strings, which differs from existing systems by providing a deterministic approximation instead of an exact distance. It is efficient (linear complexity), non-interactive and does not involve a third party, which makes it particularly suitable for cloud computing. We extend our protocol such that it only reveals whether there is a match and not the exact distance. Further, an implementation of the system is evaluated and compared against current privacy-preserving string matching algorithms.

Journal ArticleDOI
TL;DR: The edit-distance between two strings is the smallest number of operations required to transform one string into the other.
Abstract: The edit-distance between two strings is the smallest number of operations required to transform one string into the other. The distance between languages L1 and L2 is the smallest edit-distance between a string in L1 and a string in L2.

Patent
15 Apr 2013
TL;DR: In this article, an optimized pattern matching rule for one or more respective pattern matching rules is derived from an original matching rule, which includes an extracted text string from the respective pattern-matching rule or a less complex pattern match than the original rule.
Abstract: Exemplary methods, apparatuses, and systems for parsing unstructured data with a plurality of pattern matching rules are disclosed. An optimized pattern matching rule for one or more respective pattern matching rules is derived from an original pattern matching rule. The optimized pattern matching rule includes an extracted text string from the respective pattern matching rule or a less complex pattern match than the respective pattern matching rule. If the extracted text string or pattern is determined not to match any of the data to be parsed, application of the original pattern matching rule is bypassed. The original pattern matching rule is applied when the one or more optimized pattern matching rules match the data.
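
The optimization described here, checking a cheap extracted literal before running the full pattern, can be illustrated with a hedged sketch. The rule names, literals, and regular expressions below are made up for illustration and are not taken from the patent.

```python
import re

# Hypothetical parsing rules: each has a full regex plus a literal string extracted from it.
RULES = [
    {"name": "apache_error", "literal": "[error]",
     "pattern": re.compile(r"\[error\]\s+\[client (\S+)\]")},
    {"name": "login_event", "literal": "Accepted password",
     "pattern": re.compile(r"Accepted password for (\w+)")},
]

def parse_line(line: str):
    """Apply each rule, but only run the (expensive) regex when the cheap
    extracted literal is present; otherwise the original rule is bypassed."""
    for rule in RULES:
        if rule["literal"] not in line:       # optimized pre-check
            continue
        match = rule["pattern"].search(line)  # original pattern matching rule
        if match:
            return rule["name"], match.group(1)
    return None

print(parse_line("sshd[1023]: Accepted password for alice from 10.0.0.2"))
# ('login_event', 'alice')
```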

Book ChapterDOI
02 Sep 2013
TL;DR: In this paper, an extension of this problem to unbalanced strings, so that some elements may not be covered by any block, was considered, and an efficient fixed-parameter algorithm for the parameters number k of blocks and maximum occurrence d of a letter in either string was presented.
Abstract: Motivated by the study of genome rearrangements, the NP-hard Minimum Common String Partition problem asks, given two strings, to split both strings into an identical set of blocks. We consider an extension of this problem to unbalanced strings, so that some elements may not be covered by any block. We present an efficient fixed-parameter algorithm for the parameters number k of blocks and maximum occurrence d of a letter in either string. We then evaluate this algorithm on bacterial genomes and synthetic data.

Patent
13 Feb 2013
TL;DR: A system and a method for carrying out search and matching of automobile component products by using a VIN (Vehicle Identification Number) are described.
Abstract: The invention discloses a system and a method for carrying out search and matching of automobile component products by using a VIN (Vehicle Identification Number). The system comprises a product definition configuration module, a VIN character string division module, a character string analysis module, an information matching module and a fuzzy search module. The product definition configuration module is used for predefining classes and adaptable automobile models of products in a database; the VIN character string division module is used for intercepting a plurality of character strings from the VIN according to a set rule; the character string analysis module is used for analyzing information corresponding to the vehicle represented by the VIN from the character strings obtained through division; the information matching module is used for carrying out analysis to obtain vehicle information and find out a product matched with the vehicle from a datasheet to be retrieved by virtue of a circular traverse function; and the fuzzy search module is used for finding out the final product needed by a user in combination with keyword fuzzy search specific to the name of the product itself. According to the invention, the specific product needed by the user can be searched quickly and precisely according to search words and the VIN of the user's vehicle.

Patent
11 Sep 2013
TL;DR: In this article, a smartphone address book fuzzy search method including preprocessing data of contacts recorded in a smartphone Address Book, acquiring spelling of names of the contacts and corresponding digital sequences of a thumb keyboard according to a spelling code table, writing the key information including the spellings, the digital sequences, phone numbers and the like into a memory, respectively matching different fields of contacts in the memory on the basis of three classifications.
Abstract: The invention discloses a smartphone address book fuzzy search method including preprocessing data of contacts recorded in a smartphone address book, acquiring spelling of names of the contacts and corresponding digital sequences of a thumb keyboard according to a spelling code table, writing the key information including the spellings, the digital sequences, phone numbers and the like into a memory while backing up the key information to a designed buffer, respectively matching different fields of the contacts in the memory on the basis of three classifications by judging whether including Chinese characters, letters or numbers according to the given search keywords, and finally realizing fuzzy matching of the keywords and the spellings of the names by a modified character string sequence matching algorithm. By the smartphone address book fuzzy search method, the address book can be searched globally according to optional keywords, the digital search of the thumb keyboard is supported, names, full spellings, shorthand spellings, part spellings, phone numbers, mailboxes of the contacts can be in fuzzy search, the search results can be output in a weighted ranking manner according to specific matching reasons, and accordingly, the smartphone address book fuzzy search method is a convenient and efficient diversified operating method for users.

Patent
18 Dec 2013
TL;DR: In this paper, a method and a device for matching based on voice recognition is described, which mainly includes determining character information, in an alphabetic form, converted from voice information; on the basis of a fuzzy matching strategy and according to alphabet, performing fuzzy matching to the converted character information of the character information stored in the alphabetic and Chinese character forms from a local database.
Abstract: The invention discloses a method and a device for matching based on voice recognition. The method mainly includes determining character information, in an alphabetic form, converted from voice information; on the basis of a fuzzy matching strategy and according to alphabet, performing fuzzy matching to the converted character information of the character information stored in the alphabetic and Chinese character forms from a local database. As the single complete matching strategy applied in the prior art is expanded to the strategy of performing the fuzzy matching to the converted character information in the alphabetic form according to alphabet, voice recognition rate of the character information acquired by conversion is increased effectively, and further, the efficiency of the voice recognition technology is improved.

Journal ArticleDOI
TL;DR: This paper provides an algorithm that also runs in deterministic time O(kN log M) but achieves a lower variance of min(M/k, M-c)(M-c)/k, which is essentially a factor of k smaller than in previous work.

Proceedings ArticleDOI
01 Jan 2013
TL;DR: This paper contributes to the recently started investigation of the computational complexity of the string morphism problem by studying it in the framework of parameterised complexity.
Abstract: Given a source string u and a target string w, to decide whether w can be obtained by applying a string morphism on u (i. e., uniformly replacing the symbols in u by strings) constitutes an NP-complete problem. For example, the target string w := baaba can be obtained from the source string u := aba, by replacing a and b in u by the strings ba and a, respectively. In this paper, we contribute to the recently started investigation of the computational complexity of the string morphism problem by studying it in the framework of parameterised complexity.
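
The decision problem can be made concrete with a small brute-force checker that tries every way of splitting the target string into images of the source symbols. It is exponential in the worst case (consistent with NP-completeness) and is shown only to illustrate the problem, not any algorithm from the paper; whether empty images are allowed is an assumption noted in the code.

```python
def is_morphic_image(u: str, w: str) -> bool:
    """Decide whether w can be obtained from u by a string morphism, i.e. by
    uniformly replacing each symbol of u with some (here: possibly empty) string."""
    def backtrack(i: int, j: int, mapping: dict[str, str]) -> bool:
        if i == len(u):
            return j == len(w)
        sym = u[i]
        if sym in mapping:                         # symbol already has an image
            img = mapping[sym]
            return w.startswith(img, j) and backtrack(i + 1, j + len(img), mapping)
        for end in range(j, len(w) + 1):           # try every candidate image w[j:end]
            mapping[sym] = w[j:end]
            if backtrack(i + 1, end, mapping):
                return True
            del mapping[sym]
        return False

    return backtrack(0, 0, {})

# The example from the abstract: "baaba" is obtained from "aba" via a -> "ba", b -> "a".
print(is_morphic_image("aba", "baaba"))  # True
```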

Journal ArticleDOI
TL;DR: A method for performing convolutions efficiently in a word RAM model of computation, having a word size of w = Ω(log n) bits (where n is the input size), is developed and applied to approximate string matching under Hamming distance.
Abstract: We develop a method for performing convolutions efficiently in a word RAM model of computation, having a word size of w = Ω(log n) bits, where n is the input size. The basic idea is to pack several elements of the input vector into a single computer word, effectively enabling parallel computation of convolutions. The technique is applied to approximate string matching under Hamming distance. The obtained algorithms are the fastest known. In particular, we reduce the complexity of the Amir et al. (2000) algorithm for k-mismatches from O(n√(k log k)) to O(n + n√(k/w) log k). Those algorithms impose some (not severe) limitation on the pattern length, m. We present another, albeit less efficient, technique based on word-level parallelism, which works without the pattern length limitation.
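
The word-packing idea can be illustrated in its simplest form: fit one flag per position into a machine word so that a few bitwise operations process many positions at once. The toy below counts Hamming mismatches between two equal-length strings this way; it only illustrates word-level parallelism and is not the convolution technique of the paper.

```python
def hamming_mismatches_packed(a: str, b: str) -> int:
    """Count positions where a and b differ, packing one flag bit per position
    into Python integers so comparisons reduce to a few bitwise operations."""
    assert len(a) == len(b)
    mismatch_bits = 0
    # For each character c, mark the positions where a has c but b does not.
    for c in set(a):
        mask_a = sum(1 << i for i, ch in enumerate(a) if ch == c)
        mask_b = sum(1 << i for i, ch in enumerate(b) if ch == c)
        mismatch_bits |= mask_a & ~mask_b
    return bin(mismatch_bits).count("1")

print(hamming_mismatches_packed("karolin", "kathrin"))  # 3
```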

Posted Content
TL;DR: A simple algorithm that computes the approximate entropy of a finite binary string of arbitrary length is designed, implemented, and successfully tested in the fields of Prime Number Theory, Human Vision, Cryptography, Random Number Generation and Quantitative Finance.
Abstract: We design, implement and test a simple algorithm which computes the approximate entropy of a finite binary string of arbitrary length. The algorithm uses a weighted average of the Shannon Entropies of the string and all but the last binary derivative of the string. We successfully test the algorithm in the fields of Prime Number Theory (where we prove explicitly that the sequence of prime numbers is not periodic), Human Vision, Cryptography, Random Number Generation and Quantitative Finance.
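
A hedged sketch of the described computation: take the Shannon entropy of the string and of each successive binary derivative (the XOR of adjacent bits), then combine them with a weighted average. The uniform weighting below is an assumption made for illustration; the paper's exact weights are not reproduced here.

```python
import math

def shannon_entropy(bits: str) -> float:
    """Shannon entropy (bits per symbol) of a binary string."""
    if not bits:
        return 0.0
    p1 = bits.count("1") / len(bits)
    return sum(-p * math.log2(p) for p in (p1, 1 - p1) if p > 0)

def binary_derivative(bits: str) -> str:
    """XOR of each pair of adjacent bits; one bit shorter than the input."""
    return "".join(str(int(a) ^ int(b)) for a, b in zip(bits, bits[1:]))

def approximate_entropy(bits: str) -> float:
    """Average (uniform weights: an illustrative assumption) of the Shannon entropies
    of the string and of all but the last of its binary derivatives."""
    entropies = []
    s = bits
    while len(s) > 1:                 # stop before the last (length-1) derivative
        entropies.append(shannon_entropy(s))
        s = binary_derivative(s)
    return sum(entropies) / len(entropies) if entropies else 0.0

print(approximate_entropy("0110100110010110"))  # Thue-Morse prefix as a test input
```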