Showing papers on "Approximate string matching published in 2008"

PDF

Open Access

Proceedings Article•DOI•

Cost-based variable-length-gram selection for string collections to support approximate queries efficiently

[...]

Xiaochun Yang¹, Bin Wang¹, Chen Li²•Institutions (2)

Northeastern University (China)¹, University of California, Irvine²

09 Jun 2008

TL;DR: This study proposes a dynamic programming algorithm for computing a tight lower bound on the number of common grams shared by two similar strings in order to improve query performance and proposes an algorithm for automatically computing a dictionary of high-quality grams for a workload of queries.

...read moreread less

Abstract: Approximate queries on a collection of strings are important in many applications such as record linkage, spell checking, and Web search, where inconsistencies and errors exist in data as well as queries. Several existing algorithms use the concept of "grams," which are substrings of strings used as signatures for the strings to build index structures. A recently proposed technique, called VGRAM, improves the performance of these algorithms by using a carefully chosen dictionary of variable-length grams based on their requencies in the string collection. Since an index structure using fixed-length grams can be viewed as a special case of VGRAM, a fundamental problem arises naturally: what is the relationship between the gram dictionary and the performance of queries? We study this problem in this paper. We propose a dynamic programming algorithm for computing a tight lower bound on the number of common grams shared by two similar strings in order to improve query performance. We analyze how a gram dictionary affects the index structure of the string collection and ultimately the performance of queries. We also propose an algorithm for automatically computing a dictionary of high-quality grams for a workload of queries. Our experiments on real data sets show the improvement on query performance achieved by these techniques. To our best knowledge, this study is the first cost-based quantitative approach to deciding good grams for approximate string queries.

...read moreread less

96 citations

Proceedings Article•DOI•

An efficient filter for approximate membership checking

[...]

Kaushik Chakrabarti¹, Surajit Chaudhuri¹, Venkatesh Ganti¹, Dong Xin¹•Institutions (1)

Microsoft¹

09 Jun 2008

TL;DR: This paper develops a filter-verification framework, and proposes a novel in-memory filter structure that significantly outperforms the previous best-known methods in terms of both filtering power and computation time.

...read moreread less

Abstract: We consider the problem of identifying sub-strings of input text strings that approximately match with some member of a potentially large dictionary. This problem arises in several important applications such as extracting named entities from text documents and identifying biological concepts from biomedical literature. In this paper, we develop a filter-verification framework, and propose a novel in-memory filter structure. That is, we first quickly filter out sub-strings that cannot match with any dictionary member, and then verify the remaining sub-strings against the dictionary. Our method does not produce false negatives. We demonstrate the efficiency and effectiveness of our filter over real datasets, and show that it significantly outperforms the previous best-known methods in terms of both filtering power and computation time.

...read moreread less

92 citations

Journal Article•DOI•

Fast and compact regular expression matching

[...]

Philip Bille¹, Martin Farach-Colton²•Institutions (2)

IT University of Copenhagen¹, Rutgers University²

20 Dec 2008-Theoretical Computer Science

TL;DR: In this paper, the authors studied four problems in string matching, namely, regular expression matching, approximate regular expressions matching, string edit distance, and subsequence indexing, on a standard word RAM model of computation that allows logarithmic-sized words to be manipulated in constant time.

...read moreread less

73 citations

Patent•

Methods and systems for implementing approximate string matching within a database

[...]

Christopher J. Merz, Thomas Mcgeehan

04 Dec 2008

TL;DR: In this paper, a computer-based method for character string matching of a candidate character string with a plurality of character string records stored in a database is described, which includes a set of reference character strings in the database.

...read moreread less

Abstract: A computer-based method for character string matching of a candidate character string with a plurality of character string records stored in a database is described. The method includes a) identifying a set of reference character strings in the database, the reference character strings identified utilizing an optimization search for a set of dissimilar character strings, b) generating an n-gram representation for one of the reference character strings in the set of reference character strings, c) generating an n-gram representation for the candidate character string, d) determining a similarity between the n-gram representations, e) repeating steps b) and d) for the remaining reference character strings in the set of identified reference character strings, and f) indexing the candidate character string within the database based on the determined similarities between the n-gram representation of the candidate character string and the reference character strings in the identified set.

...read moreread less

63 citations

Patent•

Managing an archive for approximate string matching

[...]

Arlen Anderson

30 Dec 2008

TL;DR: In this article, a method for managing an archive for determining approximate matches associated with strings occurring in records is described, which includes processing records to determine a set of string representations that correspond to string occurring in the records; generating, for each of at least some of the string representations in the set, a plurality of close representations that are each generated from at leastsome of the same characters in the string; and storing entries in the archive that each represent a potential approximate match between at least two strings based on their respective close representations.

...read moreread less

Abstract: In one aspect, in general, a method is described for managing an archive for determining approximate matches associated with strings occurring in records. The method includes: processing records to determine a set of string representations that correspond to strings occurring in the records; generating, for each of at least some of the string representations in the set, a plurality of close representations that are each generated from at least some of the same characters in the string; and storing entries in the archive that each represent a potential approximate match between at least two strings based on their respective close representations.

...read moreread less

54 citations

Proceedings Article•DOI•

A Discriminative Candidate Generator for String Transformations

[...]

Naoaki Okazaki¹, Yoshimasa Tsuruoka², Sophia Ananiadou², Jun'ichi Tsujii²•Institutions (2)

University of Tokyo¹, University of Manchester²

25 Oct 2008

TL;DR: A discriminative approach for generating candidate strings that uses substring substitution rules as features and scores them using an L1-regularized logistic regression model and demonstrates the remarkable performance of the proposed method in normalizing inflected words and spelling variations.

...read moreread less

Abstract: String transformation, which maps a source string s into its desirable form t*, is related to various applications including stemming, lemmatization, and spelling correction. The essential and important step for string transformation is to generate candidates to which the given string s is likely to be transformed. This paper presents a discriminative approach for generating candidate strings. We use substring substitution rules as features and score them using an L1-regularized logistic regression model. We also propose a procedure to generate negative instances that affect the decision boundary of the model. The advantage of this approach is that candidate strings can be enumerated by an efficient algorithm because the processes of string transformation are tractable in the model. We demonstrate the remarkable performance of the proposed method in normalizing inflected words and spelling variations.

...read moreread less

35 citations

Patent•

Method and system for approximate string matching

[...]

Branimir Z. Lambov¹•Institutions (1)

IBM¹

03 Apr 2008

TL;DR: In this paper, a method and system for approximate string matching is presented for generating approximate matches whilst supporting compounding and correction rules, which includes traversing a trie data structure (211) to find approximate partial and full character string matches (203) of the input pattern (201).

...read moreread less

Abstract: A method and system for approximate string matching are provided for generating approximate matches whilst supporting compounding and correction rules. The method for approximate string matching of an input pattern to a trie data structure, includes traversing a trie data structure (211) to find approximate partial and full character string matches (203) of the input pattern (201). Traversing a node of the trie data structure (211) to process a character of the string applies any applicable correction rules (213) to the character, wherein each correction rule (213) has an associated cost, adjusted after each character processed. The method includes accumulating costs as a string of characters is gathered, and restricting the traverse through the trie data structure (211) according to the accumulated cost of a gathered string and the potential costs of applicable correction rules.

...read moreread less

27 citations

Book•DOI•

New Developments in Formal Languages and Applications

[...]

Gemma Bel-Enguix, M. Dolores Jimnez-Lpez, Carlos Martin-Vide

10 Apr 2008-Springer US

TL;DR: An Introductory Course on Communication Complexity and Formal Languages and Concurrent Behaviours and Probabilistic Parsing.

...read moreread less

Abstract: Basic Notation and Terminology.- Open Problems on Partial Words.- Alignments and Approximate String Matching.- An Introductory Course on Communication Complexity.- Formal Languages and Concurrent Behaviours.- Cellular Automata - A Computational Point of View.- Probabilistic Parsing.- DNA-Based Memories: A Survey.

...read moreread less

25 citations

Proceedings Article•DOI•

A string matching approach for visual retrieval and classification

[...]

Mei-Chen Yeh¹, Kwang-Ting Cheng¹•Institutions (1)

University of California, Santa Barbara¹

30 Oct 2008

TL;DR: This work presents an approach to measuring similarities between visual data based on approximate string matching, and shows that such a globally ordered and locally unordered representation is more discriminative than a bag-of-features representation and the similarity measure based on string matching is effective.

...read moreread less

Abstract: We present an approach to measuring similarities between visual data based on approximate string matching. In this approach, an image is represented by an ordered list of feature descriptors. We show the extraction of local features sequences from two types of 2-D signals - scene and shape images. The similarity of these two images is then measured by 1) solving a correspondence problem between two ordered sets of features and 2) calculating similarities between matched features and dissimilarities between unmatched features. Our experimental study shows that such a globally ordered and locally unordered representation is more discriminative than a bag-of-features representation and the similarity measure based on string matching is effective. We illustrate the application of the proposed approach to scene classification and shape retrieval, and demonstrate superior performance to existing solutions.

...read moreread less

25 citations

Book Chapter•DOI•

Approximate String Matching with Address Bit Errors

[...]

Amihood Amir¹, Yonatan Aumann², Oren Kapah², Avivit Levy³, Ely Porat² - Show less +1 more•Institutions (3)

Johns Hopkins University¹, Bar-Ilan University², University of Haifa³

18 Jun 2008

TL;DR: In this paper, the case where bits of imay be erroneously flipped, either in a consistent or transient manner is considered, and the corresponding approximate pattern matching problems are formally defined and efficient algorithms for their resolution are provided.

...read moreread less

Abstract: A string Si¾? Σmcan be viewed as a set of pairs S= { (i¾? i , i) : ii¾? { 0,..., mi¾? 1} }. We consider approximate pattern matching problems arising from the setting where errors are introduced to the location component (i), rather than the more traditional setting, where errors are introduced to the content itself (i¾? i ). In this paper, we consider the case where bits of imay be erroneously flipped, either in a consistent or transient manner. We formally define the corresponding approximate pattern matching problems, and provide efficient algorithms for their resolution, while introducing some novel techniques.

...read moreread less

22 citations

Proceedings Article•DOI•

Combination of rule and pattern based lithography unfriendly pattern detection in OPC flow

[...]

Jae-Hyun Kang, Jae-Young Choi, Yeon-Ah Shim, Hyesung Lee, Bo Su, Walter Chan, Ping Zhang, Joanne Wu, Keun-Young Kim - Show less +5 more

17 Oct 2008

TL;DR: This paper has developed a methodology to detect those frequently appeared hot-spots in pre-OPC design, as well as post OPC designs to separate them from the rest of designs, which provide the opportunity to treat them differently in early OPC flow.

...read moreread less

Abstract: Foundry companies encounter again and again the same or similar lithography unfriendly patterns (Hot-spots) in different designs within the same technology node and across different technology nodes, which eluded design rule check (DRC), but detected again and again in OPC verification step. Since Model-based OPC tool applies OPC on whole-chip design basis, individual hot-spot patterns are treated same as the rest of design patterns, regardless of its severity. We have developed a methodology to detect those frequently appeared hot-spots in pre-OPC design, as well as post OPC designs to separate them from the rest of designs, which provide the opportunity to treat them differently in early OPC flow. The methodology utilizes the combination of rule based and pattern based detection algorithms. Some hotspot patterns can be detected using rule-based algorithm, which offer the flexibility of detecting similar patterns within pre-defined ranges. However, not all patterns can be detected (or defined) by rules. Thus, a pattern-based approach is developed using defect pattern library concept. The GDS/OASIS format hot-spot patterns can be saved into a defect pattern library. Fast pattern matching algorithm is used to detect hot-spot patterns in a design using the library as a pattern template database. Even though the pattern matching approach lacks the flexibility to detect patterns’ similarity, but it has the capability to detect any patterns as long as a template exists. The pattern-matching algorithm can be either exact match or a fuzzy match. The rule based and pattern based hot-spot pattern detection algorithms complement each other and offer both speed and flexibility in hot spot pattern detection in pre-OPC and post-OPC designs. In this paper, we will demonstrate the methodology in our OPC flow and the benefits of such methodology application in production environment for 90nm designs. After the hot spot pattern detection, examples of special treatment to selected hot spot patterns will be shown.

...read moreread less

Proceedings Article•DOI•

Fuzzy Private Matching (Extended Abstract)

[...]

Lukasz Chmielewski¹, Jaap-Henk Hoepman¹•Institutions (1)

Radboud University Nijmegen¹

04 Mar 2008

TL;DR: This work shows that the original solution proposed by Freedman et al. is incorrect, and presents two fuzzy private matching protocols, one of which has a large bit message complexity and the other improves this, but here the client incurs a 0(n) factor time complexity.

...read moreread less

Abstract: In the private matching problem, a client and a server each hold a set of n input elements. The client wants to privately compute the intersection of these two sets: he learns which elements he has in common with the server (and nothing more), while the server gains no information at all. In certain applications it would be useful to have a fuzzy private matching protocol that reports a match even if two elements are only similar instead of equal. We consider this fuzzy private matching problem, in a semi-honest environment. First we show that the original solution proposed by Freedman et al. [9] is incorrect. Subsequently we present two fuzzy private matching protocols. The first, simple, protocol has a large bit message complexity. The second protocol improves this, but here the client incurs a 0(n) factor time complexity.

...read moreread less

A search engine for Arabic documents

[...]

Toufik Sari, Abderrahmane Kefali

01 Oct 2008

TL;DR: An approximate string matching technique based on Levenshtein distance is applied to indexing and searching degraded document images without recognizing the textual patterns and so to circumvent the cost and the laborious effort of OCR technology.

...read moreread less

Abstract: This paper is an attempt for indexing and searching degraded document images without recognizing the textual patterns and so to circumvent the cost and the laborious effort of OCR technology. The proposed approach deal with textual-dominant documents either handwritten or printed. From preprocessing and segmentation stages, all the connected components (CC) of the text are extracted applying a bottom-up approach. Each CC is then represented with global indices such as loops, ascenders, etc. Each document will be associated an ASCII file of the codes from the extracted features. Since there is no feature extraction technique reliable enough to locate all the discriminant global indices modelling handwriting or degraded prints, we apply an approximate string matching technique based on Levenshtein distance. As a result, the search module can efficiently cope with imprecise and incomplete pattern descriptions. The test was performed on some Arabic historical documents and shown good performances.

...read moreread less

Journal Article•DOI•

Edit distance for a run-length-encoded string and an uncompressed string

[...]

J. J. Liu¹, Guan-Shieng Huang², Yue-Li Wang¹, Richard C. T. Lee²•Institutions (2)

National Taiwan University of Science and Technology¹, National Chi Nan University²

01 Jan 2008-Information Processing Letters

TL;DR: A new algorithm for computing the edit distance of an uncompressed string against a run-length-encoded string and its result directly implies an O(min{mN,Mn}) time algorithm for strings of lengths m and n with M and N runs, respectively.

...read moreread less

Journal Article•DOI•

Fast parameterized matching with q-grams

[...]

Leena Salmela¹, Jorma Tarhio¹•Institutions (1)

Helsinki University of Technology¹

01 Sep 2008-Journal of Discrete Algorithms

TL;DR: Algorithms that solve the problem of finding all parameterized matches of a pattern in a text in sublinear time on average for moderately repetitive patterns are presented.

...read moreread less

Journal Article•DOI•

SEPIA: estimating selectivities of approximate string predicates in large Databases

[...]

Liang Jin¹, Chen Li¹, Rares Vernica¹•Institutions (1)

University of California, Irvine¹

01 Aug 2008

TL;DR: This paper develops a novel technique, called Sepia, based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, one can obtain a probability distribution from a global histogram about the similarity between q and s.

...read moreread less

Abstract: Many database applications have the emerging need to support approximate queries that ask for strings that are similar to a given string, such as "name similar to smith" and "telephone number similar to 412-0964". Query optimization needs the selectivity of such an approximate predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of approximate string predicates. We develop a novel technique, called Sepia, to solve the problem. Given a bag of strings, our technique groups the strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the technique using the edit distance metric. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of approximate string predicates.

...read moreread less

Proceedings Article•DOI•

Incorporating string transformations in record matching

[...]

Arvind Arasu¹, Surajit Chaudhuri¹, Kris Ganjam¹, Raghav Kaushik¹•Institutions (1)

Microsoft¹

09 Jun 2008

TL;DR: This work expands the problem of record matching to take such user-defined string transformations as input, and demonstrates an improvement in record matching quality and efficient retrieval based on the index structure that is cognizant of transformations.

...read moreread less

Abstract: Today's record matching infrastructure does not allow a flexible way to account for synonyms such as "Robert" and "Bob" which refer to the same name, and more general forms of string transformations such as abbreviations. We expand the problem of record matching to take such user-defined string transformations as input. These transformations coupled with an underlying similarity function are used to define the similarity between two strings. We demonstrate the effectiveness of this approach via a fuzzy match operation that is used to lookup an input record against a table of records, where we have an additional table of transformations as input. We demonstrate an improvement in record matching quality and efficient retrieval based on our index structure that is cognizant of transformations.

...read moreread less

Posted Content•

Efficient Pattern Matching on Binary Strings

[...]

Simone Faro¹, Thierry Lecroq²•Institutions (2)

University of Catania¹, University of Rouen²

14 Oct 2008-arXiv: Data Structures and Algorithms

TL;DR: This paper presents two efficient algorithms for the binary string matching problem adapted to completely avoid any reference to bits allowing to process pattern and text byte by byte.

...read moreread less

Abstract: The binary string matching problem consists in finding all the occurrences of a pattern in a text where both strings are built on a binary alphabet. This is an interesting problem in computer science, since binary data are omnipresent in telecom and computer network applications. Moreover the problem finds applications also in the field of image processing and in pattern matching on compressed texts. Recently it has been shown that adaptations of classical exact string matching algorithms are not very efficient on binary data. In this paper we present two efficient algorithms for the problem adapted to completely avoid any reference to bits allowing to process pattern and text byte by byte. Experimental results show that the new algorithms outperform existing solutions in most cases.

...read moreread less

Efficient FPGA-based Hardware Algorithms for Approximate String Matching

[...]

Sadatoshi Mikami, Yosuke Kawanaka, Shin'ichi Wakabayashi, Shinobu Nagayama

01 Jul 2008

TL;DR: An efficient FPGA-based hardware algorithm and its extensions are proposed for calculating the edit distance as a degree of similarity between two strings and results show the effectiveness of the proposed algorithms.

...read moreread less

Abstract: In this paper, an efficient FPGA-based hardware algorithm and its extensions are proposed for calculating the edit distance as a degree of similarity between two strings. The proposed algorithms are implemented on FPGA and compared to software programs. Experimental results show the effectiveness of the proposed algorithms.

...read moreread less

Posted Content•

Improved Algorithms for Approximate String Matching (Extended Abstract)

[...]

Dimitris Papamichail¹, Georgios Papamichail•Institutions (1)

University of Miami¹

28 Jul 2008-arXiv: Data Structures and Algorithms

TL;DR: An output sensitive algorithm solving the edit distance problem between two strings of lengths n and m respectively in time O((s - |n - m|)·min(m, n, s) + m + n) and linear space, where s is the edit Distance between the two strings.

...read moreread less

Abstract: The problem of approximate string matching is important in many different areas such as computational biology, text processing and pattern recognition. A great effort has been made to design efficient algorithms addressing several variants of the problem, including comparison of two strings, approximate pattern identification in a string or calculation of the longest common subsequence that two strings share. We designed an output sensitive algorithm solving the edit distance problem between two strings of lengths n and m respectively in time O((s-|n-m|)min(m,n,s)+m+n) and linear space, where s is the edit distance between the two strings. This worst-case time bound sets the quadratic factor of the algorithm independent of the longest string length and improves existing theoretical bounds for this problem. The implementation of our algorithm excels also in practice, especially in cases where the two strings compared differ significantly in length. Source code of our algorithm is available at this http URL

...read moreread less

Proceedings Article•DOI•

A Byte-Filtered String Matching Algorithm for Fast Deep Packet Inspection

[...]

Kun Huang¹, Dafang Zhang¹•Institutions (1)

Hunan University¹

18 Nov 2008

TL;DR: A byte-filtered string matching algorithm, where Bloom filters are used to preprocess each byte of every incoming packet payload to check whether the input byte belongs to the original alphabet or not, before performing bit-split string matching.

...read moreread less

Abstract: As link rates and traffic volumes of Internet are constantly growing, string matching using the Deterministic Finite Automaton (DFA) will be the performance bottleneck of Deep Packet Inspection (DPI). The recently proposed bit-split string matching algorithm suffers from the unnecessary state transitions problem, limiting the efficiency of DPI. The root cause lies in the fact that each tiny DFA of the bit-split algorithm only processes a k-bit substring of each input character, but can't check whether the entire character belongs to the original alphabet for a set of signature rules or no. This paper proposes a byte-filtered string matching algorithm, where Bloom filters are used to preprocess each byte of every incoming packet payload to check whether the input byte belongs to the original alphabet or not, before performing bit-split string matching. Our experimental results show that compared to the bit-split algorithm, our byte-filtered algorithm enormously decreases the time of string matching as well as the number of state transitions of tiny DFAs on both synthetic and real signature rule sets.

...read moreread less

Book Chapter•DOI•

A Stochastic Approach to Median String Computation

[...]

Cristian Olivares-Rodríguez¹, Jose Oncina¹•Institutions (1)

University of Alicante¹

04 Dec 2008

TL;DR: The algorithm is based on the extension of the string structure to multistrings (strings of stochastic vectors where each element represents the probability of each symbol) to allow the use of the Expectation Maximization technique.

...read moreread less

Abstract: Due to its robustness to outliers, many Pattern Recognition algorithms use the median as a representative of a set of points. A special case arises in Syntactical Pattern Recognition when the points (prototypes) are represented by strings. However, when the edit distance is used, finding the median becomes a NP-Hard problem. Then, either the search is restricted to strings in the data (set-median ) or some heuristic approach is applied. In this work we use the (conditional) stochastic edit distance instead of the plain edit distance. It is not yet known if in this case the problem is also NP-Hard so an approximation algorithm is described. The algorithm is based on the extension of the string structure to multistrings (strings of stochastic vectors where each element represents the probability of each symbol) to allow the use of the Expectation Maximization technique. We carry out some experiments over a chromosomes corpus to check the efficiency of the algorithm.

...read moreread less

Posted Content•

A Fast Generic Sequence Matching Algorithm

[...]

David R. Musser¹, Gor Nishanov¹•Institutions (1)

Rensselaer Polytechnic Institute¹

01 Oct 2008-arXiv: Data Structures and Algorithms

TL;DR: A string matching -- and more generally, sequence matching -- algorithm is presented that has a linear worst-case computing time bound, a low worst- case bound on the number of comparisons, and sublinear average-case behavior that is better than that of the fastest versions of the Boyer-Moore algorithm.

...read moreread less

Abstract: A string matching -- and more generally, sequence matching -- algorithm is presented that has a linear worst-case computing time bound, a low worst-case bound on the number of comparisons (2n), and sublinear average-case behavior that is better than that of the fastest versions of the Boyer-Moore algorithm. The algorithm retains its efficiency advantages in a wide variety of sequence matching problems of practical interest, including traditional string matching; large-alphabet problems (as in Unicode strings); and small-alphabet, long-pattern problems (as in DNA searches). Since it is expressed as a generic algorithm for searching in sequences over an arbitrary type T, it is well suited for use in generic software libraries such as the C++ Standard Template Library. The algorithm was obtained by adding to the Knuth-Morris-Pratt algorithm one of the pattern-shifting techniques from the Boyer-Moore algorithm, with provision for use of hashing in this technique. In situations in which a hash function or random access to the sequences is not available, the algorithm falls back to an optimized version of the Knuth-Morris-Pratt algorithm.

...read moreread less

Proceedings Article•DOI•

Measuring the impact of character recognition errors on downstream text analysis

[...]

Daniel P. Lopresti¹•Institutions (1)

Lehigh University¹

27 Jan 2008

TL;DR: This paper describes a paradigm for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging, employing a hierarchical methodology based on approximate string matching for classifying errors.

...read moreread less

Abstract: Noise presents a serious challenge in optical character recognition, as well as in the downstream applications that make use of its outputs as inputs. In this paper, we describe a paradigm for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Employing a hierarchical methodology based on approximate string matching for classifying errors, their cascading effects as they travel through the pipeline are isolated and analyzed. We present experimental results based on injecting single errors into a large corpus of test documents to study their varying impacts depending on the nature of the error and the character(s) involved. While most such errors are found to be localized, in the worst case some can have an amplifying effect that extends well beyond the site of the original error, thereby degrading the performance of the end-to-end system.© (2008) COPYRIGHT SPIE--The International Society for Optical Engineering. Downloading of the abstract is permitted for personal use only.

...read moreread less

Indexed Approximate String Matching.

[...]

Wing-Kin Sung¹•Institutions (1)

National University of Singapore¹

01 Jan 2008

TL;DR: The problem is to build an index for S such that for any query pattern P Œ1 : : :m and any integer k 0, all locations in S that match P with at most k errors can be reported efficiently.

...read moreread less

Abstract: Consider a text SŒ1 : : : n over a finite alphabet . The problem is to build an index for S such that for any query pattern P Œ1 : : :m and any integer k 0, all locations in S that match P with at most k errors can be reported efficiently. If the error is measured in terms of the Hamming distance (number of character substitutions), the problem is called k-mismatch problem. If the error is measured in terms of the edit distance (number of character substitutions, insertions, or deletions), the problem is called k-difference problem. The two problems are formally defined as follows.

...read moreread less

Book Chapter•DOI•

On the Longest Common Parameterized Subsequence

[...]

Orgad Keller¹, Tsvi Kopelowitz¹, Moshe Lewenstein¹•Institutions (1)

Bar-Ilan University¹

18 Jun 2008

TL;DR: The longest common parameterized subsequence problem which combines the LCS measure with parameterized matching is considered, and it is proved that the problem is NP-hard, and a couple of approximation algorithms for the problem are shown.

...read moreread less

Abstract: The well-known problem of the longest common subsequence (LCS), of two strings of lengths nand mrespectively, is O(nm)-time solvable and is a classical distance measure for strings. Another well-studied string comparison measure is that of parameterized matching, where two equal-length strings are a parameterized-match if there exists a bijection on the alphabets such that one string matches the other under the bijection. All works associated with parameterized pattern matching present polynomial time algorithms. There have been several attempts to accommodate parameterized matching along with other distance measures, as these turn out to be natural problems, e.g., Hamming distance, and a bounded version of edit-distance. Several algorithms have been proposed for these problems. In this paper we consider the longest common parameterized subsequence problem which combines the LCS measure with parameterized matching. We prove that the problem is NP-hard, and then show a couple of approximation algorithms for the problem.

...read moreread less

Journal Article•DOI•

Improving the bit-parallel NFA of Baeza-Yates and Navarro for approximate string matching

[...]

Heikki Hyyrö¹•Institutions (1)

University of Tampere¹

01 Nov 2008-Information Processing Letters

TL;DR: This paper proposes a new variant of the bit-parallel NFA of Baeza-Yates and Navarro (BPD) for approximate string matching that is more efficient than the original BPD, and takes over/extends the role of theOriginal BPD as one of the most practical approximate string Matching algorithms under moderate values of k and m.

...read moreread less

Book Chapter•DOI•

Comparison of s-gram Proximity Measures in Out-of-Vocabulary Word Translation

[...]

Anni Järvelin¹, Antti Järvelin¹•Institutions (1)

University of Tampere¹

10 Nov 2008

TL;DR: In the current study the performance of seven proximity measures for classified s -grams in CLIR context was evaluated using eleven language pairs and the binary and non-binary proximity measures were nearly equal, though the performance at large deteriorated.

...read moreread less

Abstract: Classified s -grams have been successfully used in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. For example, s -grams have consistently outperformed other approximate string matching techniques, like edit distance or n -grams. The Jaccard coefficient has traditionally been used as an s -gram based string proximity measure. However, other proximity measures for s -gram matching have not been tested. In the current study the performance of seven proximity measures for classified s -grams in CLIR context was evaluated using eleven language pairs. The binary proximity measures performed generally better than their non-binary counterparts, but the difference depended mainly on the padding used with s -grams. When no padding was used, the binary and non-binary proximity measures were nearly equal, though the performance at large deteriorated.

...read moreread less

Patent•

Large scale rapid matching method of sentence surface

[...]

Chen Zhigang, Guoping Hu, Yu Hu, Qingfeng Liu, Renhua Wang - Show less +1 more

24 Dec 2008

TL;DR: In this paper, the authors proposed a fast matching method in sentence level, which comprises three stages which are index establishment, fuzzy matching and exact matching, and the final matched sentences are then obtained by arranging the candidate sentences according to the similarity of the exact matching.

...read moreread less

Abstract: The invention relates to a large-scale fast matching method in sentence level. The method of the invention comprises three stages which are index establishment, fuzzy matching and exact matching. The state of index establishment is in charge of carrying out the standardization of sentence content and conversion of code; the fuzzy matching stage is for picking up candidate sentences possible to match with new sentences from numerous sentences, and the number of the candidate sentences is controlled in a practicable range; the exact matching stage adopts a similarity measure algorithm based on edit distance; the final matched sentences are then obtained by arranging the candidate sentences according to the similarity of the exact matching. The method of the invention has the advantages of excellent performance of actual test, high efficiency of search, low undetected rate and being capable of meeting practical requirements.

...read moreread less

Book Chapter•DOI•

Indexed Hierarchical Approximate String Matching

[...]

Luís M. S. Russo¹, Gonzalo Navarro², Arlindo L. Oliveira³•Institutions (3)

Universidade Nova de Lisboa¹, University of Chile², Technical University of Lisbon³

10 Nov 2008

TL;DR: This work presents a new search procedure for approximate string matching over suffix trees, and shows that hierarchical verification, which is a well-established technique for on-line searching, can also be used with an indexed approach.

...read moreread less

Abstract: We present a new search procedure for approximate string matching over suffix trees. We show that hierarchical verification, which is a well-established technique for on-line searching, can also be used with an indexed approach. For this, we need that the index supports bidirectionality, meaning that the search for a pattern can be updated by adding a letter at the right or at the left. This turns out to be easily supported by most compressed text self-indexes, which represent the index and the text essentially in the same space of the compressed text alone. To complete the symbiotic exchange, our hierarchical verification largely reduces the need to access the text, which is expensive in compressed text self-indexes. The resulting algorithm can, in particular, run over an existing fully compressed suffix tree, which makes it very appealing for applications in computational biology. We compare our algorithm with related approaches, showing that our method offers an interesting space/time tradeoff, and in particular does not need of any parameterization, which is necessary in the most successful competing approaches.

...read moreread less