
Showing papers on "Approximate string matching published in 2011"


Proceedings ArticleDOI
11 Apr 2011
TL;DR: This paper proposes a new similarity metric, called “fuzzy token matching based similarity”, which extends token-based similarity functions by allowing fuzzy matches between two tokens; the resulting method achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
Abstract: String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications, and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation to find all similar string pairs which may not match exactly. In this paper, we propose a new similarity metric, called “fuzzy token matching based similarity”, which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy matches between two tokens. We study the problem of similarity join using this new metric and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
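The fuzzy-token idea can be sketched in a few lines. The snippet below is a simplified illustration, not the paper's algorithm: it greedily pairs tokens whose normalized edit similarity reaches a threshold delta (the paper instead uses signature schemes and a matching between token sets), then plugs the number of matched tokens into the usual Jaccard formula.

```python
def edit_distance(a, b):
    # classic single-row dynamic-programming Levenshtein distance
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (a[i - 1] != b[j - 1]))
    return dp[n]

def fuzzy_jaccard(s1, s2, delta=0.8):
    """Greedy sketch of fuzzy-token Jaccard: two tokens 'match' when
    their normalized edit similarity is at least delta."""
    t1, t2 = s1.split(), s2.split()
    n1, n2 = len(t1), len(t2)
    pool = list(t2)
    matched = 0
    for tok in t1:
        best_sim, best_j = 0.0, -1
        for j, cand in enumerate(pool):
            sim = 1 - edit_distance(tok, cand) / max(len(tok), len(cand))
            if sim >= delta and sim > best_sim:
                best_sim, best_j = sim, j
        if best_j >= 0:
            matched += 1
            pool.pop(best_j)     # each token may be matched only once
    return matched / (n1 + n2 - matched)
```

With delta = 0.8, "mcgrady" and "macgrady" count as a match (similarity 0.875), so token order and small misspellings no longer break the Jaccard score.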

137 citations


Proceedings ArticleDOI
23 Jan 2011
TL;DR: In this paper, the authors present two representations of a string of length N compressed into a context-free grammar of size n, achieving O(log N) random access time with either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM.
Abstract: Let S be a string of length N compressed into a context-free grammar of size n. We present two representations of S achieving O(log N) random access time, and either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the kth row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k^4 + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.

77 citations


Patent
07 Jun 2011
TL;DR: In this article, a method for detecting and locating occurrence in a data stream of any complex string belonging to a predefined complex dictionary is disclosed, where a complex string may comprise an arbitrary number of interleaving coherent strings and ambiguous strings.
Abstract: A method for detecting and locating occurrence in a data stream of any complex string belonging to a predefined complex dictionary is disclosed. A complex string may comprise an arbitrary number of interleaving coherent strings and ambiguous strings. The method comprises a first process for transforming the complex dictionary into a simple structure to enable continuously conducting computationally efficient search, and a second process for examining received data in real time using the simple structure. The method may be implemented as an article of manufacture having a processor-readable storage medium having instructions stored thereon for execution by a processor, causing the processor to match examined data to an object complex string belonging to the complex dictionary, where the matching process is based on equality to constituent coherent strings, and congruence to ambiguous strings, of the object complex string.

74 citations


Patent
18 Jan 2011
TL;DR: In this paper, a system and method of matching and merging records is described: a processor executes fuzzy matching logic to determine whether one or more records in a plurality of records from a feed match an existing record, then merges the matching records with the existing record to form a merged composite record, which is stored.
Abstract: A system and method of matching and merging records is disclosed herein. Embodiments comprise receiving a plurality of records from a feed, wherein a record in the plurality of records from the feed may be either partial or complete. A processor executes fuzzy matching logic to determine whether one or more records in the plurality of records from the feed match an existing record. The processor then executes a merge of the one or more matching records with the existing record to form a merged composite record. Finally, the merged composite record is stored.

64 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: A robust method to map detected facial Action Units (AUs) to six basic emotions using a learned statistical relationship and a suitable matching technique to reduce false predictions and improve performance with rule based techniques is presented.
Abstract: We present a robust method to map detected facial Action Units (AUs) to six basic emotions. Automatic AU recognition is prone to errors due to illumination, tracking failures and occlusions. Hence, traditional rule based methods to map AUs to emotions are very sensitive to false positives and misses among the AUs. In our method, a set of chosen AUs are mapped to the six basic emotions using a learned statistical relationship and a suitable matching technique. Relationships between the AUs and emotions are captured as template strings comprising the most discriminative AUs for each emotion. The template strings are computed using a concept called discriminative power. The Longest Common Subsequence (LCS) distance, an approach for approximate string matching, is applied to calculate the closeness of a test string of AUs with the template strings, and hence infer the underlying emotions. LCS is found to be efficient in handling practical issues like erroneous AU detection and helps to reduce false predictions. The proposed method is tested with various databases like CK+, ISL, FACS, JAFFE, MindReading and many real-world video frames. We compare our performance with rule based techniques, and show clear improvement on both benchmark databases and real-world datasets.
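The LCS-based matching step can be illustrated with a short sketch: score a detected AU sequence against per-emotion template strings by normalized LCS length and return the best-scoring emotion. The template strings below are hypothetical stand-ins for the learned, discriminative-power-based templates described in the abstract.

```python
def lcs_length(a, b):
    # longest common subsequence length via dynamic programming
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def closest_emotion(test_aus, templates):
    """Pick the emotion whose template AU string has the largest
    normalized LCS overlap with the detected AU string."""
    def score(tpl):
        return lcs_length(test_aus, tpl) / max(len(test_aus), len(tpl))
    return max(templates, key=lambda e: score(templates[e]))

# Hypothetical templates; the paper learns these from data.
templates = {"happiness": ["AU6", "AU12"],
             "surprise": ["AU1", "AU2", "AU5", "AU26"]}
```

Because LCS tolerates spurious or missing AUs in the test string, a detection like ["AU6", "AU12", "AU4"] still maps cleanly to "happiness".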

56 citations


Proceedings ArticleDOI
17 Jul 2011
TL;DR: The string-analysis algorithm is implemented, and used to augment an industrial security analysis for Web applications by automatically detecting and verifying sanitizers---methods that eliminate malicious patterns from untrusted strings, making those strings safe to use in security-sensitive operations.
Abstract: We propose a novel technique for statically verifying the strings generated by a program. The verification is conducted by encoding the program in Monadic Second-Order Logic (M2L). We use M2L to describe constraints among program variables and to abstract built-in string operations. Once we encode a program in M2L, a theorem prover for M2L, such as MONA, can automatically check if a string generated by the program satisfies a given specification, and if not, exhibit a counterexample. With this approach, we can naturally encode relationships among strings, accounting also for cases in which a program manipulates strings using indices. In addition, our string analysis is path sensitive in that it accounts for the effects of string and Boolean comparisons, as well as regular-expression matches. We have implemented our string-analysis algorithm, and used it to augment an industrial security analysis for Web applications by automatically detecting and verifying sanitizers---methods that eliminate malicious patterns from untrusted strings, making those strings safe to use in security-sensitive operations. On the 8 benchmarks we analyzed, our string analyzer discovered 128 previously unknown sanitizers, compared to 71 sanitizers detected by a previously presented string analysis.

52 citations


Proceedings Article
22 Jan 2011
TL;DR: For the CSSP, a new formulation is given that is polytope-wise stronger than a straightforward extension of the CSP formulation, and a strengthening constraint class is proposed that reduces the running time.
Abstract: Let S be a set of k strings over an alphabet Σ; each string has a length between ℓ and n. The Closest Substring Problem (CSSP) is to find a minimal integer d (and a corresponding string t of length ℓ) such that each string s ∈ S has a substring of length ℓ with Hamming distance at most d to t. We say t is the closest substring to S. For ℓ = n, this problem is known as the Closest String Problem (CSP). Particularly in computational biology, the CSP and CSSP have found numerous practical applications such as identifying regulatory motifs and approximate gene clusters, and in degenerate primer design. We study ILP formulations for both problems. Our experiments show that a position-based formulation for the CSP performs very well on real-world instances emerging from biology. Even on randomly generated instances that are hard to solve to optimality, solving the root relaxation leads to solutions very close to the optimum. For the CSSP we give a new formulation that is polytope-wise stronger than a straightforward extension of the CSP formulation. Furthermore we propose a strengthening constraint class that reduces the running time.

33 citations


Book ChapterDOI
27 Jun 2011
TL;DR: A simple observation about the locations of critical factorizations is used to derive a real-time variation of the Crochemore-Perrin constant-space string matching algorithm that has a simple and efficient control structure.
Abstract: We use a simple observation about the locations of critical factorizations to derive a real-time variation of the Crochemore-Perrin constant-space string matching algorithm. The real-time variation has a simple and efficient control structure.

27 citations


Proceedings ArticleDOI
05 Dec 2011
TL;DR: This work presents an approach that exploits text, a major source of information for humans during orientation and navigation, without the need for error-prone optical character recognition: detected characters are quantized into several hundred visual words, which provide significantly improved distinctiveness compared to individual features.
Abstract: Distinctive visual cues are of central importance for image retrieval applications, in particular, in the context of visual location recognition. While in indoor environments typically only few distinctive features can be found, outdoors dynamic objects and clutter significantly impair the retrieval performance. We present an approach which exploits text, a major source of information for humans during orientation and navigation, without the need for error-prone optical character recognition. To this end, characters are detected and described using robust feature descriptors like SURF. By quantizing them into several hundred visual words we consider the distinctive appearance of the characters rather than reducing the set of possible features to an alphabet. Writings in images are transformed to strings of visual words termed visual phrases, which provide significantly improved distinctiveness when compared to individual features. An approximate string matching is performed using N-grams, which can be efficiently combined with an inverted file structure to cope with large datasets. An experimental evaluation on three different datasets shows significant improvement of the retrieval performance while reducing the size of the database by two orders of magnitude compared to state-of-the-art. Its low computational complexity makes the approach particularly suited for mobile image retrieval applications.
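The N-gram step can be illustrated with a small sketch. Visual phrases are taken here as plain lists of visual-word IDs; the Dice-style overlap of bigram multisets below is one simple way to realize approximate matching on such sequences (the paper combines N-grams with an inverted file to scale to large datasets).

```python
from collections import Counter

def ngrams(seq, n=2):
    # all contiguous n-grams of a sequence of visual-word IDs
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def ngram_similarity(p, q, n=2):
    """Dice-style overlap of n-gram multisets, a cheap proxy for
    approximate string matching between two visual phrases."""
    a, b = Counter(ngrams(p, n)), Counter(ngrams(q, n))
    inter = sum((a & b).values())          # multiset intersection size
    return 2 * inter / (sum(a.values()) + sum(b.values()))
```

Because matching works on n-gram overlap rather than exact sequence equality, a phrase with one substituted visual word still scores high against its reference.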

26 citations


Book
23 Sep 2011
TL;DR: This book collects contributions spanning model-based and model-independent approaches to pattern recognition, including several chapters on string matching and string-based classification.
Abstract: Foreword. Correcting the Training Data R. Barandela, et al. Context Free Grammars and Semantic Networks for Flexible Assembly Recognition C. Bauckhage, G. Sagerer. Stochastic Recognition of Occluded Objects B. Bhanu, et al. Approximate String Matching for Angular String Elements with Applications to On-Line and Off-line Handwriting Recognition S.-H. Cha, S.N. Srihari. Uniform, Fast Convergence of Arbitrarily Tight Upper and Lower Bounds on the Bayes Error D. Chen, et al. Building RBF Networks for Time Series Classification by Boosting J.R. Diez, C.J.A. Gonzalez. Similarity Measures and Clustering of String Patterns A. Fred. Pattern Recognition for Intrusion Detection in Computer Networks G. Giacinto, F. Roli. Model-Based Pattern Recognition M. Haindl. Structural Pattern Recognition in Graphs L. Holder, et al. Deriving Pseudo-Probabilities of Correctness Given Scores (DPPS) K. Ianakiev, V. Govindaraju. Weighted Mean and Generalized Median of Strings Y. Jiang, H. Bunke. A Region-Based Algorithm for Classifier-Independent Feature Selection M. Kudo. Inference of K-Piecewise Testable Tree Languages D. Lopez, et al. Mining Partially Periodic Patterns With Unknown Periods From Event Stream S. Ma, J.L. Hellerstein. Combination of Classifiers for Supervised Learning: A Survey S. Ma, C. Ji. Image Segmentation and Pattern Recognition: A Novel Concept, the Histogram of Connected Elements D. Maravall, M.A. Patricio. Prototype Extraction for k-NN Classifiers using Median Strings C.D. Martinez-Hinarejos, et al. Cyclic String Matching: Efficient Exact and Approximate Algorithms A. Marzal, et al. Homogeneity, Autocorrelation and Anisotropy in Patterns A. Molina. Robust Structural Indexing through Quasi-Invariant Shape Signatures and Feature Generation H. Nishida. Energy Minimisation Methods for Static and Dynamic Curve Matching E. Nyssen, et al. Recent Feature Selection Methods in Statistical Pattern Recognition P. Pudil, et al. Fast Image Segmentation under Noise R.M. Romano, D. Vitulano. Set Analysis of Coincident Errors and Its Applications for Combining Classifiers D. Ruta, B. Gabrys. Enhanced Neighbourhood Specifications for Pattern Classification J.S. Sánchez, A.I. Marques. Algorithmic Synthesis in Neural Network Training for Pattern Recognition K. Sirlantzis. Binary Strings and Multi-Class Learning Problems T. Windeatt, R. Ghaderi.

26 citations


Journal ArticleDOI
TL;DR: A refined model of the CLOSEST STRING WITH OUTLIERS (CSWO) problem, which abstractly models finding common patterns in several but not all input strings, is proposed.
Abstract: Background Given n strings s1, …, sn each of length l and a nonnegative integer d, the CLOSEST STRING problem asks to find a center string s such that none of the input strings has Hamming distance greater than d from s. Finding a common pattern in many – but not necessarily all – input strings is an important task that plays a role in many applications in bioinformatics.
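A minimal sketch of the underlying check: a column-wise majority vote is a common starting heuristic for a center string, and verifying a candidate against the distance bound d is a one-liner. This is illustrative only; it does not handle the outlier aspect of CSWO.

```python
from collections import Counter

def hamming(s, t):
    # Hamming distance between two equal-length strings
    return sum(a != b for a, b in zip(s, t))

def majority_center(strings):
    # column-wise majority vote: a cheap candidate center string
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))

def is_center(candidate, strings, d):
    """True iff every input string is within Hamming distance d of candidate."""
    return all(hamming(candidate, s) <= d for s in strings)
```

The majority string is not always a valid center (the problem is NP-hard in general), but it is a natural starting point for the search-tree and ILP methods discussed elsewhere on this page.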

Proceedings ArticleDOI
01 Jan 2011
TL;DR: The Crochemore-Perrin constant-space O(n)-time string matching algorithm is extended to run in optimal O( n/alpha) time and even in real-time, achieving a factor alpha speedup over traditional algorithms that examine each character individually.
Abstract: In the packed string matching problem, each machine word accommodates alpha characters, thus an n-character text occupies n/alpha memory words. We extend the Crochemore-Perrin constant-space O(n)-time string matching algorithm to run in optimal O(n/alpha) time and even in real-time, achieving a factor alpha speedup over traditional algorithms that examine each character individually. Our solution can be efficiently implemented, unlike prior theoretical packed string matching work. We adapt the standard RAM model and only use its AC0 instructions (i.e. no multiplication) plus two specialized AC0 packed string instructions. The main string-matching instruction is available in commodity processors (i.e. Intel's SSE4.2 and AVX Advanced String Operations); the other maximal-suffix instruction is only required during pattern preprocessing. In the absence of these two specialized instructions, we propose theoretically-efficient emulation using integer multiplication (not AC0) and table lookup.

Proceedings ArticleDOI
18 Sep 2011
TL;DR: This paper presents a novel approach to word spotting using string matching of character primitives, tested on historical books in French with encouraging results.
Abstract: Word searching and indexing in historical document collections is a challenging problem because characters in these documents are often touching or broken due to degradation/ageing effects. For efficient searching in such historical documents, this paper presents a novel approach to word spotting using string matching of character primitives. We describe the text string as a sequence of primitives, each consisting of a single character or a part of a character. Primitive segmentation is performed by analyzing text background information obtained by the water reservoir technique. Next, the primitives are clustered using template matching and a codebook of representative primitives is built. Using this primitive codebook, the text information in the document images is encoded and stored. For a query word, we segment it into primitives and encode the word as a string of representative primitives from the codebook. Finally, approximate string matching is applied to find similar words, and the matching similarity is used to rank the retrieved words. The proposed method is tested on historical books in French, and we have obtained encouraging results from the experiment.

Journal ArticleDOI
TL;DR: In this paper, the worst-case complexity of string matching on strings given in packed representation is studied, where multiple characters are packed into a single machine word.

Book
08 Feb 2011
TL;DR: This work presents a survey of indexing techniques and algorithms specifically designed for approximate string matching, focusing on inverted indexes, filtering techniques, and tree data structures that can be used to evaluate a variety of set based and edit based similarity functions.
Abstract: One of the most important primitive data types in modern data processing is text. Text data are known to have a variety of inconsistencies (e.g., spelling mistakes and representational variations). For that reason, there exists a large body of literature related to approximate processing of text. This monograph focuses specifically on the problem of approximate string matching, where, given a set of strings S and a query string v, the goal is to find all strings s ∈ S that have a user specified degree of similarity to v. Set S could be, for example, a corpus of documents, a set of web pages, or an attribute of a relational table. The similarity between strings is always defined with respect to a similarity function that is chosen based on the characteristics of the data and application at hand. This work presents a survey of indexing techniques and algorithms specifically designed for approximate string matching. We concentrate on inverted indexes, filtering techniques, and tree data structures that can be used to evaluate a variety of set based and edit based similarity functions. We focus on all-match and top-k flavors of selection and join queries, and discuss the applicability, advantages and disadvantages of each technique for every query type.
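The filter-then-verify pattern the monograph surveys can be sketched in a few lines. Below, a simple length filter prunes candidates before an edit-distance verification; real systems would layer inverted indexes and prefix or positional filters on top of this.

```python
def levenshtein(a, b):
    # single-row dynamic-programming edit distance
    if len(a) < len(b):
        a, b = b, a
    row = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, y in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1,
                                       prev + (x != y))
    return row[-1]

def approximate_select(S, v, k):
    """All-match selection: strings in S within edit distance k of query v.
    The length filter |len(s) - len(v)| <= k cheaply discards candidates
    before the expensive verification step."""
    return [s for s in S
            if abs(len(s) - len(v)) <= k and levenshtein(s, v) <= k]
```

The filter is safe because an edit operation changes the string length by at most one, so any string whose length differs from the query's by more than k cannot be within distance k.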

Journal ArticleDOI
TL;DR: This approach unifies visual appearance and the ordering information in a holistic manner with joint consideration of visual-order consistency between the query and the reference instances, and can be used for automatically identifying local alignments between two pieces of visual data.
Abstract: We present an approach to represent, match, and index various types of visual data, with the primary goal of enabling effective and computationally efficient searches. In this approach, an image/video is represented by an ordered list of feature descriptors. Similarities between such representations are then measured by the approximate string matching technique. This approach unifies visual appearance and the ordering information in a holistic manner with joint consideration of visual-order consistency between the query and the reference instances, and can be used for automatically identifying local alignments between two pieces of visual data. This capability is essential for tasks such as video copy detection where only small portions of the query and the reference videos are similar. To deal with large volumes of data, we further show that this approach can be significantly accelerated along with a dedicated indexing structure. Extensive experiments on various visual retrieval and classification tasks demonstrate the superior performance of the proposed techniques compared to existing solutions.

Journal ArticleDOI
TL;DR: The problem of string matching with mismatches is generalized to weighted mismatches, and an O(n log^4 m) algorithm is presented that approximates the results of this problem up to a factor of O(log m) in the case that the weight function is a metric.
Abstract: Given an alphabet Σ = {1, 2, …, |Σ|}, a text string T ∈ Σ^n and a pattern string P ∈ Σ^m, for each i = 1, 2, …, n−m+1 define L_p(i) as the p-norm distance when the pattern is aligned below the text and starts at position i of the text. The problem of pattern matching with L_p distance is to compute L_p(i) for every i = 1, 2, …, n−m+1. We discuss the problem for p = 1, 2, ∞. First, in the case of L_1 matching (pattern matching with an L_1 distance) we show a reduction of the string matching with mismatches problem to the L_1 matching problem and we present an algorithm that approximates the L_1 matching up to a factor of 1+ε, which has an O((1/ε²) n log m log|Σ|) run time. Then, the L_2 matching problem (pattern matching with an L_2 distance) is solved with a simple O(n log m) time algorithm. Finally, we provide an algorithm that approximates the L_∞ matching up to a factor of 1+ε with a run time of O((1/ε) n log m log|Σ|). We also generalize the problem of string matching with mismatches to have weighted mismatches and present an O(n log^4 m) algorithm that approximates the results of this problem up to a factor of O(log m) in the case that the weight function is a metric.
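For reference, the quantity L_p(i) being approximated can be computed exactly by the naive O(nm) baseline below (characters taken as integers from Σ); the paper's contribution is beating this bound with approximation algorithms.

```python
def lp_distances(text, pattern, p=1):
    """Naive O(nm) baseline: the L_p distance of the pattern against
    every alignment i = 0 .. n-m in the text (integer sequences)."""
    n, m = len(text), len(pattern)
    out = []
    for i in range(n - m + 1):
        if p == float("inf"):
            # L_inf: largest per-position difference
            out.append(max(abs(text[i + j] - pattern[j]) for j in range(m)))
        else:
            out.append(sum(abs(text[i + j] - pattern[j]) ** p
                           for j in range(m)) ** (1 / p))
    return out
```

Note that p = 1 with a 0/1 alphabet difference reduces to counting mismatches, which is exactly the reduction the abstract mentions.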

Proceedings ArticleDOI
16 May 2011
TL;DR: An automatic application protocol signature generating framework for Deep Packet Inspection (DPI) techniques with performance evaluation and developed several postprocessing techniques to refine the accuracy of the results.
Abstract: We present an automatic application protocol signature generating framework for Deep Packet Inspection (DPI) techniques with performance evaluation. We propose to utilize algorithms from the field of bioinformatics. We also present preprocessing methods to accelerate our system. Moreover, we developed several postprocessing techniques to refine the accuracy of the results. Finally, we propose a DPI system, based on approximate string matching, and find it a viable, novel alternative for the refinement of exact string matching algorithm's results.

Proceedings Article
01 Jan 2011
TL;DR: An adaptive way of using the ZM cross parsing is introduced, along with Lloyd-Max quantization, to improve the results of the string matching approach for ECG-based biometrics.
Abstract: Conventional access control systems are typically based on a single time-instant authentication. However, for high-security environments, continuous user verification is needed in order to robustly prevent fraudulent or unauthorized access. The electrocardiogram (ECG) is an emerging biometric modality with the following characteristics: (i) it does not require liveliness verification; (ii) there is strong evidence that it contains sufficient discriminative information to allow the identification of individuals from a large population; (iii) it allows continuous user verification. Recently, a string matching approach for ECG-based biometrics, using the Ziv-Merhav (ZM) cross parsing, was proposed. Building on previous work, and exploiting tools from data compression, this paper goes one step further, proposing a method for ECG-based continuous authentication. An adaptive way of using the ZM cross parsing is introduced. The use of the Lloyd-Max quantization is also introduced to improve the results with the string matching approach for ECG-based biometrics. Results on one-lead ECG real data are presented, acquired during a concentration task, from 19 healthy individuals.

Journal ArticleDOI
TL;DR: This work shows how to tackle online approximate matching when the distance function is non-local, and gives new solutions applicable to a wide variety of matching problems including function and parameterised matching, swap matching, swap-mismatch, k-difference with transpositions, overlap matching, edit distance/LCS, and L_1 and L_2 rearrangement distances.

Proceedings ArticleDOI
10 Jul 2011
TL;DR: An extension to widely used ASM algorithms is proposed to detect the name aliases generated as a result of transliteration, and the experimental evaluation shows that the proposed extension increases the accuracy of the basic algorithms to a considerable level.
Abstract: This paper focuses on the problem of alias detection based on orthographic variations of Arabic names. Alias detection is the process of identifying different variants of the same name. To detect aliases based on orthographic variations, approximate string matching (ASM) algorithms are widely used to measure the similarity between two strings (i.e., the name and alias). ASM algorithms work well to detect various types of orthographic variations, but there is still a need for techniques that detect correct aliases of Arabic names arising from the transliteration of Arabic names into English. An extension to widely used ASM algorithms is proposed to detect the name aliases generated as a result of transliteration. This paper aims to improve the accuracy of the basic ASM algorithms in order to detect correct aliases. The experimental evaluation shows that the proposed extension increases the accuracy of the basic algorithms to a considerable level.

Journal ArticleDOI
TL;DR: This paper introduces data reduction techniques that allow us to infer that certain instances have no solution, or that a center string must satisfy certain conditions, and describes a novel iterative search strategy that is efficient in practice, where some of the reduction techniques can be applied.
Abstract: The center string (or closest string) problem is a classic computer science problem with important applications in computational biology. Given k input strings and a distance threshold d, we search for a string within Hamming distance at most d to each input string. This problem is NP-complete. In this paper, we focus on exact methods for the problem that are also swift in application. We first introduce data reduction techniques that allow us to infer that certain instances have no solution, or that a center string must satisfy certain conditions. We describe how to use this information to speed up two previously published search tree algorithms. Then, we describe a novel iterative search strategy that is efficient in practice, where some of our reduction techniques can also be applied. Finally, we present results of an evaluation study for two different data sets from a biological application. We find that the running time for computing the optimal center string is dominated by the subroutine calls for d = d_opt − 1 and d = d_opt. Our data reduction is very effective for both, either rejecting unsolvable instances or solving trivial positions. We find that this speeds up computations considerably.

Journal ArticleDOI
TL;DR: Two new bit-parallel algorithms are proposed for the MASM problem; they require no verification, can handle patterns of length > w, and use the same BPA of approximate matching together with concatenation of the r patterns into a single pattern.
Abstract: The multi-pattern approximate string matching (MASM) problem is to find all the occurrences of a set of patterns P0, P1, P2, …, Pr−1, r ≥ 1, in a given text T[0…n−1], allowing a limited number of errors in the matches. This problem has many applications in computational biology, e.g., finding DNA subsequences after possible mutations and locating the positions of a disease in a genome. The MASM problem was previously solved by Baeza-Yates and Navarro by extending the bit-parallel automaton (BPA) of approximate matching and using the concept of classes of characters. The drawbacks of this approach are: (a) it requires verification of the potential matches, and (b) it can only handle patterns of length less than or equal to the word length (w) of the computer used. In this paper, we propose two new bit-parallel algorithms to solve the same problem. The new algorithms require no verification and can handle patterns of length > w. These two techniques also use the same BPA of approximate matching and concatenation to form a single pattern from the set of r patterns. We compare the performance of the new algorithms with existing algorithms and find that our algorithms have better running times than the previous algorithms. Key words: algorithm, finite automata, bit-parallelism, approximate matching, multiple patterns.
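The single-pattern BPA referred to here is the classic Wu-Manber bit-parallel automaton; a sketch in Shift-And form (for a single pattern that fits in a machine word) is below. Extending it to multiple patterns via concatenation, as the paper does, would need extra masking at pattern boundaries, which this sketch omits.

```python
def bitap_approx(text, pattern, k):
    """Bit-parallel approximate matching (Wu-Manber style Shift-And):
    return end positions of occurrences of `pattern` in `text` with at
    most k errors (substitutions, insertions, deletions)."""
    m = len(pattern)
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)
    # R[d]: bit j set means pattern[0..j] matches a text suffix with d errors;
    # initially, d leading deletions are allowed at error level d
    R = [(1 << d) - 1 for d in range(k + 1)]
    hits = []
    for pos, c in enumerate(text):
        cm = mask.get(c, 0)
        prev_old = R[0]
        R[0] = ((R[0] << 1) | 1) & cm
        for d in range(1, k + 1):
            old = R[d]
            R[d] = ((((old << 1) | 1) & cm)   # match on current character
                    | prev_old                 # insertion (extra text char)
                    | (prev_old << 1)          # substitution
                    | (R[d - 1] << 1) | 1)     # deletion (skip pattern char)
            prev_old = old
        if (R[k] >> (m - 1)) & 1:
            hits.append(pos)
    return hits
```

Each error level is one word-sized register, so the cost per text character is O(k) word operations, which is the property the paper's multi-pattern extensions build on.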

Journal ArticleDOI
TL;DR: It is proved that String-to-String Correction is fixed-parameter tractable for parameter k, and a simple fixed-parameter algorithm is presented that solves the problem in O(2^k n) time.

Proceedings ArticleDOI
10 Apr 2011
TL;DR: This work proposes a novel method to extract the partial strings from each pattern so as to maximize search speed: the search-time cost of every candidate location can be computed by theoretical derivation, and the location yielding an approximately minimal search time is chosen.
Abstract: String matching plays a key role in web content monitoring systems. Suffix matching algorithms have good time efficiency, and thus are widely used. These algorithms require that all patterns in a set have the same length. When the patterns cannot satisfy this requirement, the leftmost m characters, m being the length of the shortest pattern, are extracted to construct the data structure. We call such m-character strings partial strings. However, a simple extraction from the left does not address the impact of partial string locations on search speed. We propose a novel method to extract the partial strings from each pattern which maximizes search speed. More specifically, with this method we can compute all the corresponding search time costs by theoretical derivation, and choose the location which yields an approximately minimal search time. We evaluate our method on two rule sets: Snort and ClamAV. Experiments show that in most cases, our method achieves the fastest search speed among all possible locations of partial string extraction, and is about 5%–20% faster than the alternative methods.

Journal ArticleDOI
TL;DR: This paper revisits the problem of indexing a text for approximate string matching and constructs the first external-memory data structure that does not require Ω(|P| + occ + poly(log n)) I/Os.

01 Jan 2011
TL;DR: A new filter, the TDF (token distribution filter), is proposed that performs better than previously proposed filters for a wide class of problems, and is conducted on both synthetic and real data sets.
Abstract: A common application over web data is to find all the strings in a collection of pages that match strings in a given dictionary. We consider the problem of extracting all the strings or substrings in a document (or a page) that approximately match some string in a given dictionary. The current state-of-the-art approach for this problem involves first applying an approximate, fast filter, then applying a more expensive exact verification algorithm to the strings that survive the filter. Many string filters, such as the length filter and prefix filter, have been proposed. However, we find many string filters are ineffective or inefficient in some problem scenarios. In this paper, we propose a new filter, the TDF (token distribution filter). We conduct experiments on both synthetic and real data sets, and show that for a wide class of problems it performs better than previously proposed filters.
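The filter-then-verify pipeline the abstract describes can be illustrated with the simplest filter it mentions, the length filter (edit distance is at least the length difference); the TDF would replace the filtering predicate with a token-distribution test. A hedged sketch, not the paper's implementation:

```python
def edit_distance(a, b):
    # standard O(|a||b|) Levenshtein DP, two rows at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution/match
        prev = cur
    return prev[-1]

def dictionary_match(query, dictionary, k):
    # filter: |len(w) - len(query)| <= k is necessary for ed <= k
    candidates = (w for w in dictionary if abs(len(w) - len(query)) <= k)
    # verify: exact edit-distance computation on the survivors only
    return [w for w in candidates if edit_distance(query, w) <= k]
```

For query "color" with k = 1, the filter discards "colorful" on length alone, and verification keeps only "color" and "colour".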

Proceedings ArticleDOI
01 Aug 2011
TL;DR: This paper considers an extension of the approximate string-matching problem with Hamming distance, by also allowing the existence of a single gap, either in the text, or in the pattern, which requires O (nm) time but can be reduced to O (mβ) time, if the maximum length β of the gap is given.
Abstract: This paper deals with the approximate string-matching problem with Hamming distance and a single gap for sequence alignment. We consider an extension of the approximate string-matching problem with Hamming distance, by also allowing the existence of a single gap, either in the text or in the pattern. This problem is strongly and directly motivated by the next-generation re-sequencing procedure. We present a general algorithm that requires O(nm) time, where n is the length of the text and m is the length of the pattern, but this can be reduced to O(mβ) time if the maximum length β of the gap is given.
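The problem statement can be made concrete with a brute-force baseline: try every split point of the pattern and every gap length in the text, counting mismatches on either side of the gap. This naive sketch is far slower than the paper's O(nm) and O(mβ) algorithms, and covers only the gap-in-the-text case (the gap-in-the-pattern case is symmetric):

```python
def hamming(a, b):
    # mismatch count over aligned positions
    return sum(x != y for x, y in zip(a, b))

def match_with_gap(text, pat, k, beta):
    """Start positions i where `pat` matches `text` at i with <= k
    mismatches, allowing at most one gap of length 1..beta in the text."""
    n, m = len(text), len(pat)
    hits = []
    for i in range(n):
        # gapless alignment, if it fits
        best = hamming(text[i:i + m], pat) if i + m <= n else m + 1
        for s in range(1, m):             # gap opens after pat[:s]
            for g in range(1, beta + 1):  # gap length in the text
                j = i + s + g
                if j + (m - s) > n:
                    break
                d = (hamming(text[i:i + s], pat[:s])
                     + hamming(text[j:j + m - s], pat[s:]))
                best = min(best, d)
        if best <= k:
            hits.append(i)
    return hits
```

For instance, "abcd" matches "abXXcd" at position 0 with zero mismatches once a text gap of length 2 (≤ β) is allowed after "ab".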

Proceedings Article
23 Jun 2011
TL;DR: This study investigates the merits of fast approximate string matching to address challenges relating to spelling variants and to utilise large-scale lexical resources for semantic class disambiguation and integrates string matching results into machine learning-based disambIGuation through the use of a novel set of features.
Abstract: In this study we investigate the merits of fast approximate string matching to address challenges relating to spelling variants and to utilise large-scale lexical resources for semantic class disambiguation. We integrate string matching results into machine learning-based disambiguation through the use of a novel set of features that represent the distance of a given textual span to the closest match in each of a collection of lexical resources. We collect lexical resources for a multitude of semantic categories from a variety of biomedical domain sources. The combined resources, containing more than twenty million lexical items, are queried using a recently proposed fast and efficient approximate string matching algorithm that allows us to query large resources without severely impacting system performance. We evaluate our results on six corpora representing a variety of disambiguation tasks. While the integration of approximate string matching features is shown to substantially improve performance on one corpus, results are modest or negative for others. We suggest possible explanations and future research directions. Our lexical resources and implementation are made freely available for research purposes at: http://github.com/ninjin/simsem
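The distance features described above reduce to, per lexical resource, the edit distance from the span to its closest entry. A hypothetical sketch (the resource contents and the plain-Levenshtein lookup are illustrative; the paper relies on a fast approximate-matching algorithm precisely because a linear scan like this cannot scale to millions of entries):

```python
def edit_distance(a, b):
    # standard O(|a||b|) Levenshtein DP
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distance_features(span, resources):
    """One feature per lexical resource: the distance from `span`
    to the closest entry in that resource."""
    return [min(edit_distance(span, entry) for entry in entries)
            for entries in resources]
```

A misspelled span such as "protien" then gets a small distance to a resource containing "protein" and a large distance to an unrelated resource, which is the signal the classifier consumes.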

Patent
Enyuan Wu
09 Sep 2011
TL;DR: In this article, a target string is broken into one or more target terms, and the target terms are matched to known terms in an index tree, where the terms in the index tree are associated with known string IDs.
Abstract: One or more techniques and/or systems are disclosed for matching a target string to a known string. A target string is broken into one or more target terms, and those target terms are matched to known terms in an index tree. The index tree comprises known terms drawn from a plurality of known strings, where each known term in the index tree is associated with one or more known string IDs. A known term in the index tree that is associated with a known string ID, and to which a target term is matched, is comprised in the known string corresponding to that ID. The target string is then matched to the known string whose ID accumulates the desired number of occurrences across the matching of the one or more target terms.
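The matching flow in the abstract (terms → index → string IDs → occurrence counts) can be sketched with a flat inverted index standing in for the patent's index tree; the names and the whitespace tokenizer here are illustrative, not the patent's:

```python
from collections import Counter

def build_term_index(known_strings):
    """Map each known term to the IDs of the known strings containing it."""
    index = {}
    for sid, s in known_strings.items():
        for term in s.split():
            index.setdefault(term, set()).add(sid)
    return index

def match_target(target, index, known_strings):
    """Count, per known string ID, how many target terms hit it,
    and return the known string with the most term occurrences."""
    counts = Counter()
    for term in target.split():
        for sid in index.get(term, ()):
            counts[sid] += 1
    if not counts:
        return None
    best = max(counts, key=counts.get)
    return best, known_strings[best]
```

For example, the target "fast string matching" shares two terms with the known string "approximate string matching", so that string's ID wins the occurrence count.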