
Showing papers on "Edit distance published in 2006"


Proceedings ArticleDOI
03 Apr 2006
TL;DR: The concept of a graph closure, a generalized graph that represents a number of graphs, is introduced, and the indexing technique, called Closure-tree, organizes graphs hierarchically, where each node summarizes its descendants by a graph closure.
Abstract: Graphs have become popular for modeling structured data. As a result, graph queries are becoming common and graph indexing has come to play an essential role in query processing. We introduce the concept of a graph closure, a generalized graph that represents a number of graphs. Our indexing technique, called Closure-tree, organizes graphs hierarchically where each node summarizes its descendants by a graph closure. Closure-tree can efficiently support both subgraph queries and similarity queries. Subgraph queries find graphs that contain a specific subgraph, whereas similarity queries find graphs that are similar to a query graph. For subgraph queries, we propose a technique called pseudo subgraph isomorphism which approximates subgraph isomorphism with high accuracy. For similarity queries, we measure graph similarity through edit distance using heuristic graph mapping methods. We implement two kinds of similarity queries: K-NN query and range query. Our experiments on chemical compounds and synthetic graphs show that for subgraph queries, Closure-tree outperforms existing techniques by up to two orders of magnitude in terms of candidate answer set size and index size. For similarity queries, our experiments validate the quality and efficiency of the presented algorithms.

332 citations


Book ChapterDOI
TL;DR: This paper proposes two simple, but effective modifications of a standard edit distance algorithm that allow us to suboptimally compute edit distance in a faster way and demonstrates the resulting speedup and shows that classification accuracy is mostly not affected.
Abstract: Graph edit distance is one of the most flexible mechanisms for error-tolerant graph matching. Its key advantage is that edit distance is applicable to unconstrained attributed graphs and can be tailored to a wide variety of applications by means of specific edit cost functions. Its computational complexity, however, is exponential in the number of vertices, which means that edit distance is feasible for small graphs only. In this paper, we propose two simple, but effective modifications of a standard edit distance algorithm that allow us to suboptimally compute edit distance in a faster way. In experiments on real data, we demonstrate the resulting speedup and show that classification accuracy is mostly not affected. The suboptimality of our methods mainly results in larger inter-class distances, while intra-class distances remain low, which makes the proposed methods very well applicable to distance-based graph classification.

219 citations


Journal ArticleDOI
TL;DR: A binary linear programming formulation of the graph edit distance for unweighted, undirected graphs with vertex attributes is derived and applied to a graph recognition problem, and the new metric is shown to perform quite well in comparison to existing metrics when applied to a database of chemical graphs.
Abstract: A binary linear programming formulation of the graph edit distance for unweighted, undirected graphs with vertex attributes is derived and applied to a graph recognition problem. A general formulation for editing graphs is used to derive a graph edit distance that is proven to be a metric, provided the cost function for individual edit operations is a metric. Then, a binary linear program is developed for computing this graph edit distance, and polynomial time methods for determining upper and lower bounds on the solution of the binary program are derived by applying solution methods for standard linear programming and the assignment problem. A recognition problem of comparing a sample input graph to a database of known prototype graphs in the context of a chemical information system is presented as an application of the new method. The costs associated with various edit operations are chosen by using a minimum normalized variance criterion applied to pairwise distances between nearest neighbors in the database of prototypes. The new metric is shown to perform quite well in comparison to existing metrics when applied to a database of chemical graphs

195 citations


Journal ArticleDOI
TL;DR: A noniterative, polynomial-time algorithm is presented that is guaranteed to find an optimal solution in the noiseless case and shows improvements in accuracy over current methods, particularly when matching patterns of different sizes.
Abstract: This paper describes a novel solution to the rigid point pattern matching problem in Euclidean spaces of any dimension. Although we assume rigid motion, jitter is allowed. We present a noniterative, polynomial time algorithm that is guaranteed to find an optimal solution for the noiseless case. First, we model point pattern matching as a weighted graph matching problem, where weights correspond to Euclidean distances between nodes. We then formulate graph matching as a problem of finding a maximum probability configuration in a graphical model. By using graph rigidity arguments, we prove that a sparse graphical model yields equivalent results to the fully connected model in the noiseless case. This allows us to obtain an algorithm that runs in polynomial time and is provably optimal for exact matching between noiseless point sets. For inexact matching, we can still apply the same algorithm to find approximately optimal solutions. Experimental results obtained by our approach show improvements in accuracy over current methods, particularly when matching patterns of different sizes

144 citations


Posted Content
TL;DR: It is shown that a dictionary can be used with the Damerau-Levenshtein string-edit distance metric to construct a case-insensitive pass-phrase system that can tolerate zero, one, or two spelling errors per word, with no loss in security.
Abstract: It is well understood that passwords must be very long and complex to have sufficient entropy for security purposes. Unfortunately, these passwords tend to be hard to memorize, and so alternatives are sought. Smart Cards, Biometrics, and Reverse Turing Tests (human-only solvable puzzles) are options, but another option is to use pass-phrases. This paper explores methods for making pass-phrases suitable for use with password-based authentication and key-exchange (PAKE) protocols, and in particular, with schemes resilient to server-file compromise. In particular, the Ω-method of Gentry, MacKenzie and Ramzan is combined with the Bellovin-Merritt protocol to provide mutual authentication in the random oracle model (Canetti, Goldreich & Halevi 2004; Bellare, Boldyreva & Palacio 2004; Maurer, Renner & Holenstein 2004). Furthermore, since common password-related problems are typographical errors and the CAPS LOCK key, we show how a dictionary can be used with the Damerau-Levenshtein string-edit distance metric to construct a case-insensitive pass-phrase system that can tolerate zero, one, or two spelling errors per word, with no loss in security. Furthermore, we show that the system can be made to accept pass-phrases that have been arbitrarily reordered, with a security cost that can be calculated. While a pass-phrase space of 2 is not achieved by this scheme, sizes in the range of 2 to 2 result from various selections of parameter sizes. An attacker who has acquired the server-file must exhaust over this space, while an attacker without the server-file cannot succeed with non-negligible probability.
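As a rough illustration of the metric the scheme builds on (not of the PAKE construction itself), the restricted Damerau-Levenshtein distance can be computed with a standard dynamic program; the function names and the per-word tolerance check below are illustrative assumptions, not details from the paper.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance: insertion, deletion,
    substitution, and transposition of adjacent characters."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)  # transposition
    return dp[m][n]

def word_matches(typed: str, dictionary_word: str, max_errors: int = 2) -> bool:
    """Case-insensitive per-word check tolerating up to two errors (assumed policy)."""
    return damerau_levenshtein(typed.lower(), dictionary_word.lower()) <= max_errors
```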

94 citations


Proceedings ArticleDOI
22 Jan 2006
TL;DR: An oblivious embedding is introduced that maps strings of length n under edit distance to strings of length at most n/r under edit distance for any value of parameter r, providing a distortion of O(r^(1+μ)) for some μ = o(1), which is almost optimal.
Abstract: We introduce an oblivious embedding that maps strings of length n under edit distance to strings of length at most n/r under edit distance for any value of parameter r. For any given r, our embedding provides a distortion of O(r^(1+μ)) for some μ = o(1), which we prove to be (almost) optimal. The embedding can be computed in O(2^(1/μ) n) time. We also show how to use the main ideas behind the construction of our embedding to obtain an efficient algorithm for approximating the edit distance between two strings. More specifically, for any 1 > ε ≥ 0, we describe an algorithm to compute the edit distance D(S, R) between two strings S and R of length n in time O(n^(1+ε)), within an approximation factor of min{n^((1-ε)/3+o(1)), (D(S, R)/n^ε)^(1/2+o(1))}. For the case of ε = 0, we get an O(n)-time algorithm that approximates the edit distance within a factor of min{n^(1/3+o(1)), D(S, R)^(1/2+o(1))}, improving the recent result of Bar-Yossef et al. [2].

93 citations


Journal ArticleDOI
TL;DR: This paper adopts a minimum description length approach to the problem of fitting a mixture of tree unions to a set of sample trees, and makes maximum-likelihood estimates of the Bernoulli parameters.
Abstract: This paper poses the problem of tree-clustering as that of fitting a mixture of tree unions to a set of sample trees. The tree-unions are structures from which the individual data samples belonging to a cluster can be obtained by edit operations. The distribution of observed tree nodes in each cluster sample is assumed to be governed by a Bernoulli distribution. The clustering method is designed to operate when the correspondences between nodes are unknown and must be inferred as part of the learning process. We adopt a minimum description length approach to the problem of fitting the mixture model to data. We make maximum-likelihood estimates of the Bernoulli parameters. The tree-unions and the mixing proportions are sought so as to minimize the description length criterion. This is the sum of the negative logarithm of the Bernoulli distribution, and a message-length criterion that encodes both the complexity of the union-trees and the number of mixture components. We locate node correspondences by minimizing the edit distance with the current tree unions, and show that the edit distance is linked to the description length criterion. The method can be applied to both unweighted and weighted trees. We illustrate the utility of the resulting algorithm on the problem of classifying 20 shapes using a shock graph representation.

88 citations


Patent
20 Jul 2006
TL;DR: A spelling change analyzer is described that identifies useful alternative spellings of search strings submitted to a search engine, based on spelling changes made by users as detected by programmatically analyzing the search histories of a population of search engine users.
Abstract: A spelling change analyzer (160) identifies useful alternative spellings of search strings submitted to a search engine (100). The spelling change analyzer (160) takes into consideration spelling changes made by users, as detected by programmatically analyzing search histories (150) of a population of search engine users. In one embodiment, an assessment of whether a second search string represents a useful alternative spelling of a first search string takes into consideration (1) an edit distance between the first and second search strings, and (2) a likelihood that a user who submits the first search string will thereafter submit the second search string, as determined by monitoring and analyzing actions of users.

83 citations


Book ChapterDOI
TL;DR: This paper investigates the relation between non-Euclidean aspects of dissimilarity data and the classification performance of the direct NN rule and some classifiers trained in representation spaces, and concludes that statistical classifiers perform well and that the optimal values of the parameters characterize a non-Euclidean and somewhat non-metric measure.
Abstract: Statistical learning algorithms often rely on the Euclidean distance. In practice, non-Euclidean or non-metric dissimilarity measures may arise when contours, spectra or shapes are compared by edit distances or as a consequence of robust object matching [1,2]. It is an open issue whether such measures are advantageous for statistical learning or whether they should be constrained to obey the metric axioms. The k-nearest neighbor (NN) rule is widely applied to general dissimilarity data as the most natural approach. Alternative methods exist that embed such data into suitable representation spaces in which statistical classifiers are constructed [3]. In this paper, we investigate the relation between non-Euclidean aspects of dissimilarity data and the classification performance of the direct NN rule and some classifiers trained in representation spaces. This is evaluated on a parameterized family of edit distances, in which parameter values control the strength of non-Euclidean behavior. Our finding is that the discriminative power of this measure increases with increasing non-Euclidean and non-metric aspects until a certain optimum is reached. The conclusion is that statistical classifiers perform well and the optimal values of the parameters characterize a non-Euclidean and somewhat non-metric measure.

80 citations


Journal ArticleDOI
TL;DR: This article aims at learning an unbiased stochastic edit distance in the form of a finite-state transducer from a corpus of (input, output) pairs of strings, and shows that the new model always outperforms the standard edit distance.

76 citations


Book ChapterDOI
TL;DR: A method for transforming strings into n-dimensional real vector spaces based on prototype selection is proposed, allowing the transformed strings to be subsequently classified with more sophisticated classifiers, such as support vector machines and other kernel-based methods.
Abstract: A common way of expressing string similarity in structural pattern recognition is the edit distance. It allows one to apply the kNN rule in order to classify a set of strings. However, compared to the wide range of elaborated classifiers known from statistical pattern recognition, this is only a very basic method. In the present paper we propose a method for transforming strings into n-dimensional real vector spaces based on prototype selection. This allows us to subsequently classify the transformed strings with more sophisticated classifiers, such as support vector machine and other kernel based methods. In a number of experiments, we show that the recognition rate can be significantly improved by means of this procedure.
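A minimal sketch of the general transformation (prototype selection strategies and the exact distance function are not reproduced from the paper): each string is represented by the vector of its edit distances to n chosen prototype strings, and the resulting vectors can then be fed to an SVM or any other vector-space classifier. Any Levenshtein-style function can be plugged in for the dist parameter.

```python
from typing import Callable, List, Sequence

def embed_strings(strings: Sequence[str],
                  prototypes: Sequence[str],
                  dist: Callable[[str, str], float]) -> List[List[float]]:
    """Map each string to the n-dimensional vector of its distances
    to the n selected prototype strings."""
    return [[dist(s, p) for p in prototypes] for s in strings]

# Hypothetical usage: vectors = embed_strings(words, prototypes, levenshtein),
# followed by training any standard vector-space classifier on `vectors`.
```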

Proceedings ArticleDOI
22 Jan 2006
TL;DR: A new cache-oblivious framework called the Gaussian Elimination Paradigm (GEP) for Gaussian elimination without pivoting is presented, which also gives cache-oblivious algorithms for Floyd-Warshall all-pairs shortest paths in graphs and 'simple DP', among other problems.
Abstract: We present efficient cache-oblivious algorithms for several fundamental dynamic programs. These include new algorithms with improved cache performance for longest common subsequence (LCS), edit distance, gap (i.e., edit distance with gaps), and least weight subsequence. We present a new cache-oblivious framework called the Gaussian Elimination Paradigm (GEP) for Gaussian elimination without pivoting that also gives cache-oblivious algorithms for Floyd-Warshall all-pairs shortest paths in graphs and 'simple DP', among other problems.

Book ChapterDOI
13 Jul 2006
TL;DR: This work presents a method to obtain a video representation suitable for this task, and shows how to use this representation in a matching scheme, entirely based on features and descriptors taken from the well established MPEG-7 standard.
Abstract: Video databases require that clips are represented in a compact and discriminative way, in order to perform efficient matching and retrieval of documents of interest. We present a method to obtain a video representation suitable for this task, and show how to use this representation in a matching scheme. In contrast with existing works, the proposed approach is entirely based on features and descriptors taken from the well established MPEG-7 standard. Different clips are compared using an edit distance, in order to obtain high similarity between videos that differ in some subsequences but are essentially related to the same content. Experimental validation is performed using a prototype application that retrieves TV commercials recorded from different TV sources in real time. Results show excellent performance both in terms of accuracy and in terms of computational performance.

Proceedings Article
01 May 2006
TL;DR: A multi-dimensional extension of the Dynamic Programming solution to Levenshtein Edit Distance calculations capable of evaluating STT systems during periods of overlapping, simultaneous speech.
Abstract: Since 1987, the National Institute of Standards and Technology has been providing evaluation infrastructure for the Automatic Speech Recognition (ASR), more recently referred to as Speech-To-Text (STT), research community. From the first efforts in the Resource Management domain to the present research, the NIST SCoring ToolKit (SCTK) has formed the tool set for system developers to make continued progress in many domains: Wall Street Journal, Conversational Telephone Speech (CTS), Broadcast News (BN), and Meetings (MTG), to name a few. For these domains, the community agreed to declare sections of simultaneous speech as "not scoreable". While this had minor impact on most of these domains, the highly interactive nature of Meeting speech rendered a very large fraction of the test material not scoreable. This paper documents a multi-dimensional extension of the Dynamic Programming solution to Levenshtein Edit Distance calculations capable of evaluating STT systems during periods of overlapping, simultaneous speech.
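The multi-dimensional extension itself is specific to the NIST tooling, but the one-dimensional base it generalizes is the textbook dynamic-programming recurrence for Levenshtein edit distance over word sequences, sketched here with unit costs (an assumption; SCTK uses its own word-level cost scheme).

```python
def levenshtein(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance via the standard DP recurrence
    d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + sub)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # i deletions
    for j in range(n + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

# One insertion error between reference and hypothesis word sequences:
assert levenshtein("the cat sat".split(), "the cat sat down".split()) == 1
```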

Proceedings ArticleDOI
04 Jun 2006
TL;DR: A solution to the problem of matching personal names in English to the same names represented in Arabic script is presented by augmenting the classic Levenshtein edit-distance algorithm with character equivalency classes.
Abstract: This paper presents a solution to the problem of matching personal names in English to the same names represented in Arabic script. Standard string comparison measures perform poorly on this task due to varying transliteration conventions in both languages and the fact that Arabic script does not usually represent short vowels. Significant improvement is achieved by augmenting the classic Levenshtein edit-distance algorithm with character equivalency classes.
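A hedged sketch of the technique described (the classes and weights below are invented for illustration; the paper's actual English/Arabic equivalency classes are not reproduced): substitutions between characters in the same equivalency class are given a reduced cost inside the usual Levenshtein recurrence.

```python
# Illustrative equivalency classes only; the real classes pair English letters
# with transliteration variants and are not taken from the paper.
EQUIV_CLASSES = [{"k", "q", "c"}, {"i", "e", "y"}, {"o", "u"}]

def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    if any(a in cls and b in cls for cls in EQUIV_CLASSES):
        return 0.25   # cheap substitution within a class (assumed weight)
    return 1.0

def class_aware_edit_distance(s: str, t: str) -> float:
    """Levenshtein distance whose substitution cost honours equivalency classes."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,
                          d[i][j - 1] + 1.0,
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[m][n]
```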

Proceedings ArticleDOI
17 Jun 2006
TL;DR: A novel boosted distance metric is proposed that not only finds the best distance metric that fits the distribution of the underlying elements but also selects the most important feature elements with respect to similarity.
Abstract: In this paper, we present a general guideline to establish the relation between a distribution model and its corresponding similarity estimation. A rich set of distance metrics, such as harmonic distance and geometric distance, is derived according to Maximum Likelihood theory. These metrics can provide a more accurate feature model than the conventional Euclidean distance (SSD) and Manhattan distance (SAD). Because the feature elements are from heterogeneous sources and may have different influence on similarity estimation, the assumption of single isotropic distribution model is often inappropriate. We propose a novel boosted distance metric that not only finds the best distance metric that fits the distribution of the underlying elements but also selects the most important feature elements with respect to similarity. We experiment with different distance metrics for similarity estimation and compute the accuracy of different methods in two applications: stereo matching and motion tracking in video sequences. The boosted distance metric is tested on fifteen benchmark data sets from the UCI repository and two image retrieval applications. In all the experiments, robust results are obtained based on the proposed methods.

Proceedings ArticleDOI
01 Sep 2006
TL;DR: This work considers the problem of similarity search in a very large sequence database with edit distance as the similarity measure, and proposes two novel strategies for selecting references as well as a new strategy for assigning references to database sequences.
Abstract: We consider the problem of similarity search in a very large sequence database with edit distance as the similarity measure. Given limited main memory, our goal is to develop a reference-based index that reduces the number of costly edit distance computations in order to answer a query. The idea in reference-based indexing is to select a small set of reference sequences that serve as a surrogate for the other sequences in the database. We consider two novel strategies for selecting references as well as a new strategy for assigning references to database sequences. Our experimental results show that our selection and assignment methods far outperform competitive methods. For example, our methods prune up to 20 times as many sequences as the Omni method, and as many as 30 times as many sequences as frequency vectors. Our methods also scale nicely for databases containing many and/or very long sequences.
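A minimal sketch of the reference-based filtering idea (the paper's reference selection and assignment strategies are not reproduced): since edit distance is a metric, |d(q, r) - d(s, r)| lower-bounds d(q, s) for any reference r, so a database sequence can be pruned without an edit-distance computation when that bound already exceeds the query radius.

```python
from typing import Callable, Dict, List, Sequence

def range_query(query: str,
                radius: int,
                database: Sequence[str],
                references: Sequence[str],
                ref_dists: Dict[str, List[int]],   # precomputed d(s, r) per sequence, per reference
                dist: Callable[[str, str], int]) -> List[str]:
    """Prune s when max_r |d(q, r) - d(s, r)| > radius; verify survivors exactly."""
    q_to_ref = [dist(query, r) for r in references]
    results = []
    for s in database:
        lower_bound = max(abs(qr - sr) for qr, sr in zip(q_to_ref, ref_dists[s]))
        if lower_bound > radius:
            continue                     # discarded without an edit-distance call
        if dist(query, s) <= radius:
            results.append(s)
    return results
```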

Proceedings ArticleDOI
13 Dec 2006
TL;DR: A recursive plagiarism evaluation function, based on the Levenshtein edit distance, is proposed to be evaluated at each level of the document structure, together with a method that eliminates unnecessary chunk comparisons by avoiding similarity calculation for chunks which do not share enough 4-grams.
Abstract: The paper presents the implementation of a tool for plagiarism detection developed within the AXMEDIS project. The algorithm leverages the plagiarist's behaviour, which is modeled as a combination of three basic actions: insertion, deletion, and substitution. We recognize that this behaviour may occur at various levels of the document structure: the plagiarist may insert, delete or substitute a word, period or a paragraph. The procedure consists of two main steps: document structure extraction and plagiarism function calculation. We propose a recursive plagiarism evaluation function, based on the Levenshtein edit distance, to be evaluated at each level of the document structure. We also propose a method that eliminates unnecessary chunk comparisons, avoiding similarity calculation for chunks which do not share enough 4-grams. We describe the similarity algorithm and discuss some implementation issues and future work.
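The 4-gram pre-filter can be sketched as follows (the Jaccard-overlap threshold is an assumed parameter, not taken from the paper): only chunk pairs sharing enough 4-grams are passed on to the Levenshtein-based evaluation.

```python
def ngrams(text: str, n: int = 4) -> set:
    """Set of character n-grams of a text chunk."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def worth_comparing(chunk_a: str, chunk_b: str, min_overlap: float = 0.2) -> bool:
    """Skip the costly edit-distance step when the chunks share too few 4-grams."""
    a, b = ngrams(chunk_a), ngrams(chunk_b)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= min_overlap
```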

Journal Article
TL;DR: A low-complexity but non-trivial distance between strings to be used in biology is exhibited; the experimental results, even if preliminary, are quite encouraging.
Abstract: We exhibit a low-complexity but non-trivial distance between strings to be used in biology. The experimental results we provide were obtained on a standard laptop and, even if preliminary, are quite encouraging.

Proceedings ArticleDOI
12 Jul 2006
TL;DR: A prototype Unicode attack detection tool, IDN-SecuChecker, is developed, which detects phishing weblinks and fake user name (account) attacks, and possible practical uses of Unicode attack detectors are introduced.
Abstract: Unicode is becoming a dominant character representation format for information processing. This presents a very dangerous usability and security problem for many applications. The problem arises because many characters in the UCS (Universal Character Set) are visually and/or semantically similar to each other. This presents a mechanism for malicious people to carry out Unicode Attacks, which include spam attacks, phishing attacks, and web identity attacks. In this paper, we address the potential attacks, and propose a methodology for countering them. To evaluate the feasibility of our methodology, we construct a Unicode Character Similarity List (UC-SimList). We then implement a visual and semantic based edit distance (VSED), as well as a visual and semantic based Knuth-Morris-Pratt algorithm (VSKMP), to detect Unicode attacks. We develop a prototype Unicode attack detection tool, IDN-SecuChecker, which detects phishing weblinks and fake user name (account) attacks. We also introduce the possible practical use of Unicode attack detectors.

Journal ArticleDOI
TL;DR: The well-studied case in which the text T is fixed and preprocessed into an indexing data structure, so that any pattern query can be answered faster, is investigated; compressed suffix arrays are exploited to reduce the indexing space to O(n) bits, while increasing the query time by only an O(log n) factor.

Journal ArticleDOI
TL;DR: The main technical result is that the Ulam metric, namely, the edit distance on permutations of length at most n, embeds into ℓ1 with distortion O(log n), which immediately leads to sketching algorithms with constant-size sketches and to efficient approximate nearest neighbor search algorithms with approximation factor O(log n).
Abstract: Edit distance is a fundamental measure of distance between strings, the extensive study of which has recently focused on computational problems such as nearest neighbor search, sketching and fast approximation. A very powerful paradigm is to map the metric space induced by the edit distance into a normed space (e.g., ℓ1) with small distortion, and then use the rich algorithmic toolkit known for normed spaces. Although the minimum distortion required to embed edit distance into ℓ1 has received a lot of attention lately, there is a large gap between known upper and lower bounds. We make progress on this question by considering large, well-structured submetrics of the edit distance metric space. Our main technical result is that the Ulam metric, namely, the edit distance on permutations of length at most n, embeds into ℓ1 with distortion O(log n). This immediately leads to sketching algorithms with constant size sketches, and to efficient approximate nearest neighbor search algorithms, with approximation factor O(log n). The embedding and its algorithmic consequences present a big improvement over those previously known for the Ulam metric, and they are significantly better than the state of the art for edit distance in general. Further, we extend these results for the Ulam metric to edit distance on strings that are (locally) non-repetitive, i.e., strings where (close by) substrings are distinct.

Proceedings Article
01 May 2006
TL;DR: A methodology for the automatic detection of cognates between two languages based solely on the orthography of words is proposed, which achieves an improvement in the F-measure in comparison with detecting cognates based only on the edit distance between them.
Abstract: Present-day machine translation technologies crucially depend on the size and quality of lexical resources. Much of recent research in the area has been concerned with methods to build bilingual dictionaries automatically. In this paper we propose a methodology for the automatic detection of cognates between two languages based solely on the orthography of words. From a set of known cognates, the method induces rules capturing regularities of orthographic mutations that a word undergoes when migrating from one language into the other. The rules are then applied as a preprocessing step before measuring the orthographic similarity between putative cognates. As a result, the method achieves an improvement of 11.86% in the F-measure in comparison with detecting cognates based only on the edit distance between them.

Journal ArticleDOI
TL;DR: The methodology developed is based on the noisy channel model for spelling correction and makes use of statistics harvested from user logs to estimate the probabilities of different types of edits that lead to misspellings.
Abstract: It is known that users of internet search engines often enter queries with misspellings in one or more search terms. Several web search engines make suggestions for correcting misspelled words, but the methods used are proprietary and unpublished to our knowledge. Here we describe the methodology we have developed to perform spelling correction for the PubMed search engine. Our approach is based on the noisy channel model for spelling correction and makes use of statistics harvested from user logs to estimate the probabilities of different types of edits that lead to misspellings. The unique problems encountered in correcting search engine queries are discussed and our solutions are outlined.
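A minimal sketch of noisy-channel ranking under the stated model (candidate generation and the probability tables are assumptions; the PubMed system's own estimates come from user logs): a candidate correction c for a query q is scored by log P(c) + log P(q | c).

```python
import math
from typing import Callable, Dict, Iterable

def best_correction(query: str,
                    candidates: Iterable[str],
                    log_prior: Dict[str, float],                  # log P(c), e.g. from query logs
                    log_channel: Callable[[str, str], float]) -> str:
    """Noisy-channel ranking: argmax over c of log P(c) + log P(query | c)."""
    return max(candidates,
               key=lambda c: log_prior.get(c, -math.inf) + log_channel(query, c))
```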

Dissertation
01 Jan 2006
TL;DR: A framework for integrating learnable similarity functions within a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) is described, which accommodates learning various distance measures, including those based on Bregman divergences and parameterized KL-divergence.
Abstract: Many machine learning and data mining tasks depend on functions that estimate similarity between instances. Similarity computations are particularly important in clustering and information integration applications, where pairwise distances play a central role in many algorithms. Typically, algorithms for these tasks rely on pre-defined similarity measures, such as edit distance or cosine similarity for strings, or Euclidean distance for vector-space data. However, standard distance functions are frequently suboptimal as they do not capture the appropriate notion of similarity for a particular domain, dataset, or application. In this thesis, we present several approaches for addressing this problem by employing learnable similarity functions. Given supervision in the form of similar or dis-similar pairs of instances, learnable similarity functions can be trained to provide accurate estimates for the domain and task at hand. We study the problem of adapting similarity functions in the context of several tasks: record linkage, clustering, and blocking. For each of these tasks, we present learnable similarity functions and training algorithms that lead to improved performance. In record linkage, also known as duplicate detection and entity matching, the goal is to identify database records referring to the same underlying entity. This requires estimating similarity between corresponding field values of records, as well as overall similarity between records. For computing field-level similarity between strings, we describe two learnable variants of edit distance that lead to improvements in linkage accuracy. For learning record-level similarity functions, we employ Support Vector Machines to combine similarities of individual record fields in proportion to their relative importance, yielding a high-accuracy linkage system. We also investigate strategies for efficient collection of training data which can be scarce due to the pairwise nature of the record linkage task. In clustering, similarity functions are essential as they determine the grouping of instances that is the goal of clustering. We describe a framework for integrating learnable similarity functions within a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs). The framework accommodates learning various distance measures, including those based on Bregman divergences (e.g., parameterized Mahalanobis distance and parameterized KL-divergence), as well as directional measures (e.g., cosine similarity). Thus, it is applicable to a wide range of domains and data representations. Similarity functions are learned within the HMRF-KMEANS algorithm derived from the framework, leading to significant improvements in clustering accuracy. The third application we consider, blocking, is critical in making record linkage and clustering algorithms scalable to large datasets, as it facilitates efficient selection of approximately similar instance pairs without explicitly considering all possible pairs. Previously proposed blocking methods require manually constructing a similarity function or a set of similarity predicates, followed by hand-tuning of parameters. We propose learning blocking functions automatically from linkage and semi-supervised clustering supervision, which allows automatic construction of blocking methods that are efficient and accurate. 
This approach yields computationally cheap learnable similarity functions that can be used for scaling up in a variety of tasks that rely on pairwise distance computations, including record linkage and clustering.

Patent
02 Aug 2006
TL;DR: A system for processing tax forms, including forms with handwritten material, is described; its OCR engine is trained with a variety of Roman text fonts and has a back-end dictionary that can be customized to account for the fact that the system knows which field it is recognizing.
Abstract: A proprietary suite of underlying document image analysis capabilities, including a novel forms enhancement, segmentation and modeling component, forms recognition and optical character recognition. A future version of the system will include form reasoning to detect and classify fields on forms with varying layout. The product provides acquisition, modeling, recognition and processing components, and has the ability to verify recognized data on the image with a line by line comparison. The key enabling technologies center around the recognition and processing of the scanned forms. The system learns the positions of lines and the location of text on the pre-printed form, and associates various regions of the form with specific required fields in the electronic version. Once the form is recognized, the preprinted material is removed and individual regions are passed to an optical character recognition component. The current proprietary OCR engine is trained with a variety of Roman text fonts and has a back-end dictionary that can be customized to account for the fact that the system knows which field it is recognizing. The engine performs segmentation to obtain isolated characters and computes a structure-based feature vector. The characters are normalized and classified using a cluster-centric classifier, which responds well to variations in the symbol's contour. An efficient dictionary lookup scheme provides exact and edit-distance lookup using a TRIE structure. An edit distance is computed and a collection of near misses can be output in a lattice to enhance the final recognition result. The current classification rate can exceed 99% with context. The ultimate goal of this system is to enable the processing of all tax forms, including forms with handwritten material.

Journal ArticleDOI
TL;DR: It is shown that the approximate matching problem with swap and mismatch as the edit operations can be computed in time O(n √m log m).
Abstract: There is no known algorithm that solves the general case of the approximate string matching problem with the extended edit distance, where the edit operations are insertion, deletion, mismatch and swap, in time o(nm), where n is the length of the text and m is the length of the pattern. In an effort to study this problem, the edit operations were analysed independently. It turns out that the approximate matching problem with only the mismatch operation can be solved in time O(n √m log m). If the only edit operation allowed is swap, then the problem can be solved in time O(n log m log σ), where σ = min(m, |Σ|). In this paper we show that the approximate string matching problem with swap and mismatch as the edit operations can be computed in time O(n √m log m).

Journal Article
TL;DR: An algorithm to approximate the edit distance between two ordered and rooted trees of bounded degree is presented, in which each input tree is transformed into a string by computing its Euler string, with the labels of some edges modified so that the structures of small subtrees are reflected in the labels.
Abstract: This paper presents an O(n^2) time algorithm for approximating the unit cost edit distance for ordered and rooted trees of bounded degree within a factor of O(n^(3/4)), where n is the maximum size of the two input trees, and the algorithm is based on a transformation of an ordered and rooted tree into a string.
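A hedged sketch of the Euler-string flattening the algorithm builds on (node representation and markers are simplified assumptions; the paper additionally modifies labels to reflect small-subtree structure): a depth-first traversal emits a symbol on entering and on leaving each node, and an ordinary sequence edit distance is then applied to the two resulting strings.

```python
from typing import List

# A node is a pair (label, children), e.g. ("a", [("b", []), ("c", [])]).
def euler_string(tree) -> List[str]:
    """Emit '+label' when a node is entered and '-label' when it is left,
    turning the ordered rooted tree into a balanced symbol sequence."""
    label, children = tree
    out = ["+" + label]
    for child in children:
        out.extend(euler_string(child))
    out.append("-" + label)
    return out

# The two trees are then compared with a standard sequence edit distance,
# e.g. levenshtein(euler_string(t1), euler_string(t2)).
```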

Journal Article
TL;DR: This paper defines and theoretically studies geometric crossover for sequences under edit distance and shows its intimate connection with the biological notion of sequence homology.
Abstract: This paper extends a geometric framework for interpreting crossover and mutation [4] to the case of sequences. This representation is important because it is the link between artificial evolution and biological evolution. We define and theoretically study geometric crossover for sequences under edit distance and show its intimate connection with the biological notion of sequence homology.